Experiments with Program Parallelization Using
Archetypes and Stepwise Refinement*
Berna L. Massingill†
January 23, 1998
UF CISE Technical Report 98012
Abstract
Parallel programming continues to be difficult and error-prone, whether starting from specifications or from an existing sequential program. This paper presents (1) a methodology for parallelizing sequential applications and (2) experiments in applying the methodology to application programs. The methodology is based on the use of stepwise refinement together with what we call parallel programming archetypes (briefly, abstractions that capture common features of classes of programs), in which most of the work of parallelization is done using familiar sequential tools and techniques, and those parts of the process that cannot be addressed with sequential tools and techniques are addressed with formally justified transformations. The experiments consist of applying the methodology to sequential application programs, and they provide evidence that the methodology produces correct and reasonably efficient programs at reasonable human-effort cost. Of particular interest is the fact that the aspect of the methodology that is most completely formally justified is the aspect that in practice was the most trouble-free.
1 Introduction
Much work has been done to make parallel programming tractable, but the development of parallel applications continues to be difficult and error-prone, whether the starting point is a specification or an existing sequential program. In this paper, we describe a methodology for parallelizing sequential applications based on the use of stepwise refinement together with what we call parallel programming archetypes (such an archetype is an abstraction that captures the commonality of a class of programs with similar structure), in which most of the work of parallelization is done using familiar sequential tools and techniques, and those parts of the process that cannot be addressed with sequential tools and techniques are addressed with formally justified transformations. We then present the results of applying this methodology to a moderate-length application program, providing evidence that the methodology produces correct and reasonably efficient programs at reasonable human-effort cost. We focus attention on transforming programs for execution on distributed-memory message-passing architectures, but we note that such programs may also be executed on architectures that support a shared-memory model.

*This work was supported by the AFOSR via an Air Force Laboratory Graduate Fellowship.
†University of Florida, P.O. Box 116120, Gainesville, FL 32611; blm@cise.ufl.edu.
2 The methodology
Our methodology is based on the idea of transforming a sequential program into a parallel program by applying a sequence of small semantics-preserving transformations, guided by the pattern provided by a parallel programming archetype.
2.1 Parallel programming archetypes
By parallel programming archetype we mean an abstraction, similar to a design pattern, that captures the commonality of a class of programs. (An example of a sequential programming archetype is the familiar divide-and-conquer paradigm.)
Methods of exploiting design patterns in program development begin by identifying classes of problems with similar computational structures and creating abstractions that capture the commonality. Combining a problem class's computational structure with a parallelization strategy gives rise to a dataflow pattern and hence a communication structure. It is this combination of computational structure, parallelization strategy, and the implied pattern of dataflow and communication that we capture as a parallel programming archetype.
The commonality captured by the archetype abstraction makes it possible to develop a small collection of semantics-preserving transformations that together allow parallelization of any program that fits the archetype. Also, the common dataflow pattern makes it possible to encapsulate those parts of the computation that involve interprocess communication and transform them only once, with the results of the transformation made available as a communication operation usable in any program that fits the archetype.
2.2 Stepwise refinement and the sequential simulated-parallel version
A key feature of our methodology for parallelizing programs via a sequence of transformations is that almost all of the transformations are performed in the sequential domain, as illustrated by Figure 1. Ideally all of the transformations would be formally stated and proved, with all but the final transformation proved using the techniques of sequential stepwise refinement. In the experiments described in this paper, however, we chose to focus our proof efforts on the final transformation (the one that takes the program into the parallel domain), since the sequential-to-sequential transformations are more amenable to checking by testing and debugging than the sequential-to-parallel transformation, and hence the case for a formal justification is more compelling for the former than for the latter. Thus, in this paper we give a formal proof only for this last transformation; the proof is applicable to all programs meeting certain stated criteria, and guarantees the correctness of the parallel program, provided the intermediate stages are correct. The key intermediate stage in this transformation process is the last stage before transformation into the parallel domain; we call it the sequential simulated-parallel version of the program. This version essentially simulates the operation of a program consisting of N processes executing on a distributed-memory message-passing architecture. We define it as follows:

[Figure 1 shows the original sequential program transformed through a sequence of intermediate sequential programs, with only the final transformation crossing from the sequential domain into the parallel domain.]
Figure 1: Parallelization via a sequence of transformations.
Definition 1 (Sequential simulated-parallel program).
A sequential simulated-parallel program is one with the following characteristics:

1. The atomic data objects¹ of the program are partitioned into N groups, one for each simulated process; the ith group simulates the local data for the ith process. These data objects may include duplicated variables (e.g., loop counters and program constants).
2. The computation consists of an alternating sequence of local-computation blocks and data-exchange operations, in which:

(a) Each local-computation block is a composition of N program blocks, in which the ith block accesses only local data for the ith simulated process. Such blocks correspond to sections of the parallel program in which processes execute independently and without interaction.

(b) Each data-exchange operation consists of a set of assignment statements that satisfy the following restrictions:

i. If an atomic data object is the target of an assignment, it is not referenced in any other assignment.
ii. No left-hand or right-hand side may reference atomic data objects belonging to more than one of the N simulated-local-data partitions. The left-hand and right-hand sides of an assignment may, however, reference data from different partitions.
iii. For each simulated process i, at least one assignment statement must assign a value to a variable in i's local data.

Such blocks correspond to sections of the parallel program in which processes exchange messages: Each assignment statement can be implemented as a single point-to-point message-passing operation, and a group of message-passing operations with a common sender and a common receiver can be combined for efficiency.

¹An atomic data object is as defined in HPF [16]: one that contains no subobjects, e.g., a scalar data object or a scalar element of an array.
Definition 2 (Sequential simulated-parallel version).
A sequential simulated-parallel version of program P is a sequential simulated-parallel program P' that refines P (i.e., meets the same or a stronger specification).
As we demonstrate in the next section, sequential execution of such a program truly simulates execution of the corresponding parallel program, in that all executions of the parallel program give the same results as an execution of the sequential program.
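To make Definition 1 concrete, the following toy sketch (ours, in Python; not taken from the application or the archetype library) shows the shape of a sequential simulated-parallel program with N = 2 simulated processes: local data is partitioned by a simulated-process index, each local-computation sub-block touches only its own partition, and the data-exchange operation is a set of assignments obeying restrictions (i)-(iii).

    N = 2
    # Simulated local data: local[i] holds the data "owned" by simulated
    # process i. The constant c is a duplicated variable (one copy each).
    local = [{"x": float(i + 1), "ghost": 0.0, "c": 2.0} for i in range(N)]

    for step in range(3):
        # Local-computation block: the ith sub-block reads and writes only
        # the local data of simulated process i.
        for i in range(N):
            local[i]["x"] = local[i]["c"] * local[i]["x"] + local[i]["ghost"]

        # Data-exchange operation: each target appears in no other assignment
        # (restriction i), each side refers to a single partition (ii), and
        # every process is assigned to (iii). Each assignment corresponds to
        # one point-to-point message in the parallel version.
        local[0]["ghost"] = local[1]["x"]
        local[1]["ghost"] = local[0]["x"]

    print(local)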
In general, producing such a simulated-parallel program could be tedious, time-consuming, and error-prone. However, in our methodology the transformation process is guided by an archetype, with the archetype providing, for programs in the class for which it is an abstraction, class-specific guidelines for program parallelization as well as class-specific code libraries encapsulating the communication and other operations required for parallelization. For a program that fits an archetype for which guidelines and a code library have been developed, the task of producing the simulated-parallel version is much more manageable, since the archetype guides the transformation process, and the program's data-exchange operations correspond to the archetype's communication operations, simplifying their transformation.
3 Supporting theory
In this section we present a general theorem allowing us to transform sequential simulated-parallel programs into equivalent parallel programs.
3.1 The parallel program and its simulated-parallel version
The target parallel program. The goal of the transformation process is a
parallel program with the following characteristics:
1. The program is a collection of N sequential, deterministic processes.
2. Processes do not share variables; each has a distinct address space.
3. Processes interact only through sends and blocking receives on single-reader-single-writer channels with infinite slack (i.e., infinite capacity).
4. An execution is a fair interleaving of actions from processes.
The simulated-parallel version. We can simulate execution of such a parallel program as follows:
1. Simulate concurrent execution by interleaving actions from (simulated)
processes.
2. Simulate separate address spaces by defining a set of distinct address-space data structures.
3. Simulate communication over channels by representing channels as queues, taking care that no attempt is made to read from a channel unless it is known not to be empty.
Figure 2 illustrates the relationship between the simulated-parallel and parallel versions of a program.

[Figure 2 shows the two versions side by side: in the real parallel version, processes P0 and P1 each compute, send, receive, and compute concurrently; in the simulated-parallel version, the same actions appear interleaved in a single sequence (P0 compute, P1 compute, P0 send, P1 send, P0 receive, P1 receive, P0 compute, P1 compute).]
Figure 2: Correspondence between parallel and simulated-parallel program versions.
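As an illustration of this simulation scheme (a minimal Python harness of our own devising; all names here are ours), each process can be represented as a generator yielding send and receive requests, with channels represented as FIFO queues and a scheduler that interleaves the processes, never executing a receive on an empty channel:

    from collections import deque

    def producer(out):
        # Simulated process: sends the values 0, 1, 2 on channel `out`.
        for v in range(3):
            yield ("send", out, v)

    def consumer(inp, result):
        # Simulated process: performs three blocking receives on `inp`.
        for _ in range(3):
            v = yield ("recv", inp)
            result.append(v)

    def simulate(procs):
        # Round-robin interleaving; a receive is deferred until its queue
        # is non-empty, so we never read from an empty channel. (Assumes
        # the simulated program is deadlock-free, as ours are.)
        channels, blocked, live = {}, {}, dict(procs)
        while live:
            for pid in list(live):
                proc = live[pid]
                if pid in blocked:
                    q = channels.setdefault(blocked[pid], deque())
                    if not q:
                        continue            # channel empty: stay blocked
                    del blocked[pid]
                    resume = q.popleft()    # deliver the received value
                else:
                    resume = None
                try:
                    req = proc.send(resume)  # send(None) == next(proc)
                except StopIteration:
                    del live[pid]
                    continue
                if req[0] == "send":
                    channels.setdefault(req[1], deque()).append(req[2])
                else:                        # ("recv", channel)
                    blocked[pid] = req[1]

    received = []
    simulate({"P0": producer("c0"), "P1": consumer("c0", received)})
    print(received)                          # -> [0, 1, 2]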
3.2 The theorem
Theorem 3.
Given deterministic processes P0, ..., PN-1 with no shared variables except single-reader-single-writer channels with infinite slack, if I and I' are two maximal interleavings of the actions of the Pj's that begin in the same initial state, then I and I' both terminate, and in the same final state.
Proof of Theorem 3.
Given interleavings I and I' beginning in the same state, we show that I' can be
permuted to match I without changing its final state. The proof is by induction
on the length of I.
Notation: We write bi to denote the ith action of interleaving I and aj,k to
denote the kth action taken by process Pj.
Base case: The length of I is 1. Trivial.
Inductive step: Suppose we can permute I' such that the first n steps of
the permuted I' match the first n steps of I. Then we must show that we can
further permute I' so that the first n + 1 steps match I. That is, suppose we
have the following situation:
I : b0, b1, ..., bn-1, aj,k, ...
permuted I' : b0, b1, ..., bn-1, aj',k', ...

Then we want to show that we can further permute I' so that its (n + 1)th action is also aj,k.
Observe first that the first action taken in Pj after bn-1 in the permuted I' must be aj,k: All processes are deterministic, the state after n actions is the same in I and the permuted I', and channels are single-reader-single-writer. Observe analogously that aj',k' is the first action taken in Pj' after bn-1 in I. Thus, if j = j', then aj,k = aj',k', and we are done. So suppose j ≠ j'.
Lemma: Observe that for any two consecutive actions am,n and am',n', if m ≠ m' and it is not the case that am,n and am',n' are both actions on the same channel c, then because the program contains no shared variables except the channels, these actions can be performed in either order with the same results.
We now demonstrate via a case-by-case analysis that we can permute I' by repeatedly exchanging aj,k with its immediate predecessor until it follows bn-1 (as it does in I).
1. If aj,k is the mth receive on some channel c and aj',k' also affects c, then: aj',k' is the m'th send on c, with m' > m. (The action is a send because channels are single-reader, and m' > m since aj,k precedes aj',k' in I.) Further, no action between aj',k' and aj,k can affect c (since channels are single-reader-single-writer). Thus, using the lemma, we can repeatedly exchange aj,k with its predecessors in permuted I', up to and including aj',k', as desired.
2. If aj,k is the mth send on some channel c and aj',k' also affects c, then: aj',k' is the m'th receive on c, with m' < m. (The action is a receive because channels are single-writer, and m' < m since aj',k' precedes aj,k in permuted I'.) Further, no action between aj',k' and aj,k can affect c (since channels are single-reader-single-writer). Thus, we can repeatedly exchange aj,k with its predecessors in permuted I', up to and including aj',k', as desired.
3. If aj,k is the mth receive on some channel c and aj',k' does not affect c, then: If no action between aj',k' and aj,k in permuted I' affects c, then we can perform repeated exchanges as before. If some action bi does affect c, then it must be the m'th send, with m' > m. (The action is a send because channels are single-reader, and m' > m because the placement of aj,k in I guarantees that actions b0, ..., bn-1 contain at least m sends on c.) We can thus exchange bi with aj,k, giving the desired result.
4. If aj,k is the mth send on some channel c and aj',k' does not affect c, then: If no action between aj',k' and aj,k in permuted I' affects c, then we can perform repeated exchanges as before. If some action bi does affect c, then it must be a receive (since channels are single-writer), so it also can be exchanged with aj,k, giving the desired result.

5. If aj,k is not an action on a channel, then from the lemma we can exchange it with its predecessors as desired.
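The theorem is easy to check empirically on small examples. The following sketch (ours; a toy pair of processes, not a proof artifact) runs many randomly chosen maximal interleavings of two deterministic processes communicating over a single-reader-single-writer queue and confirms that every interleaving reaches the same final state:

    import random
    from collections import deque

    def run(seed):
        # One maximal interleaving, chosen by `seed`, of two deterministic
        # processes sharing only a single-reader-single-writer channel.
        rng = random.Random(seed)
        chan, state = deque(), {"p0": 0, "p1": 0}

        def p0():                      # writer: sends 1..4 on chan
            for v in range(1, 5):
                state["p0"] += v
                chan.append(v)         # send (infinite slack: always enabled)
                yield

        def p1():                      # reader: receives four values
            for _ in range(4):
                while not chan:        # blocking receive: wait for data
                    yield
                state["p1"] += chan.popleft()
                yield

        procs = [p0(), p1()]
        while procs:                   # run until both processes terminate
            p = rng.choice(procs)
            try:
                next(p)
            except StopIteration:
                procs.remove(p)
        return state

    finals = {tuple(sorted(run(s).items())) for s in range(200)}
    assert len(finals) == 1            # every interleaving: same final state
    print(finals.pop())                # -> (('p0', 10), ('p1', 10))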
3.3 Implications and application of the theorem
Theorem 3 implies that if we can produce a sequential simulated-parallel program that meets the same specification as a sequential program, then we can mechanically convert it into a parallel program that meets that same specification, by transforming the simulated processes into real processes, the simulated multiple address spaces into real multiple address spaces, and the simulated communication actions into real communication actions.² As noted earlier, in general producing such a simulated-parallel program could be quite difficult, but if we start with a program that fits an archetype, and produce a sequential simulated-parallel version of the form described in Definition 1, where the data-exchange operations correspond to the communication operations of the archetype, then the task becomes manageable. Figure 3 illustrates the relationship between the simulated-parallel and parallel versions of such a program.
Each collection of assignments constituting a data-exchange operation can be replaced, as described earlier, with a collection of sends and receives: Because of the restrictions described in Definition 1, the set of assignments can be implemented as a set of send-receive pairs over single-reader-single-writer channels, where each assignment generates one send-receive pair, or, for efficiency, all assignment statements with left-hand-side variables in process Pi's local data and right-hand-side variables in process Pj's local data are combined into one send-receive pair from process Pj to process Pi. Further, it is straightforward to choose an ordering for the simulated-parallel version that does not violate the restriction that we may not read from an empty channel: first perform all sends; then perform all receives. Finally, if the data-exchange operation corresponds to an archetype communication routine, it can be encapsulated and implemented as part of the archetype library of routines, which can be made available in both parallel and simulated-parallel versions. The application developer thus need not write out and transform the parts of the application that correspond to data-exchange operations.

²Observe that a sequence of messages over a single-reader-single-writer channel from process Pi to process Pj can be implemented as a sequence of point-to-point messages from Pi to Pj by giving each message a tag corresponding to a channel ID and receiving selectively based on these tags.

[Figure 3 shows the same correspondence as Figure 2, with the send and receive sections of both versions now supplied by the archetype library.]
Figure 3: Correspondence between parallel and simulated-parallel program versions of an archetype-based program.
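For instance (a hypothetical fragment of ours, not the archetype library's actual interface), the two-assignment exchange sketched after Definition 1 becomes, in each real process, one send followed by one receive; if the message-passing layer were MPI (here via mpi4py), the message tag could play the role of the channel ID of footnote 2:

    # Hypothetical parallel counterpart of a two-assignment data exchange;
    # run with: mpiexec -n 2 python exchange.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()           # this process's ID: 0 or 1
    other = 1 - rank

    x = float(rank + 1)              # the locally owned value
    # All sends are performed before all receives, so no process ever
    # blocks waiting on a message that has not been sent (the analogue of
    # never reading from an empty channel).
    comm.send(x, dest=other, tag=0)  # tag plays the role of a channel ID
    ghost = comm.recv(source=other, tag=0)
    print("process", rank, "ghost copy =", ghost)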
4 Application experiments
The experiments described in this section consist of applying our methodology independently to two sequential implementations of an electromagnetics application. We describe the application, the archetype used to parallelize it, and the experiments.
4.1 The application
The application parallelized in this experiment is an electromagnetics code that uses the finite-difference time-domain (FDTD) technique to model transient electromagnetic scattering and interactions with objects of arbitrary shape and composition. With this technique, the object and surrounding space are represented by a 3-dimensional grid of computational cells. An initial excitation is specified, after which electric and magnetic fields are alternately updated throughout the grid. By applying a near-field to far-field transformation, these fields can also be used to derive far fields, e.g., for radar cross section computations. Thus, the application performs two kinds of calculations:
Near-field calculations. This part of the computation consists of a time-stepped simulation of the electric and magnetic fields over the 3-dimensional grid. At each time step, we first calculate the electric field at each point based on the magnetic fields at the point and neighboring points, and then we similarly calculate the magnetic fields based on the electric fields.
Far-field calculations. This part of the computation uses the above-calculated electric and magnetic fields to compute radiation vector potentials at each time step by integrating over a closed surface near the boundary of the 3-dimensional grid. The electric and magnetic fields at a particular point on the integration surface at a particular time step affect the radiation vector potential at some future time step (depending on the point's position); thus, each calculated vector potential (one per time step) is a double sum, over time steps and over points on the integration surface.
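As a rough illustration of the near-field update pattern (a one-dimensional toy of our own, with made-up coefficients; the application itself works on a 3-dimensional grid), each time step updates the electric field from neighboring magnetic-field values and then the magnetic field from neighboring electric-field values:

    import numpy as np

    nx, nsteps, c = 200, 300, 0.5      # grid size, time steps, toy coefficient
    ez = np.zeros(nx)                  # electric field
    hy = np.zeros(nx)                  # magnetic field

    ez[nx // 2] = 1.0                  # toy initial excitation
    for t in range(nsteps):
        # electric field at each point from neighboring magnetic fields ...
        ez[1:] += c * (hy[1:] - hy[:-1])
        # ... then magnetic field from neighboring electric fields
        hy[:-1] += c * (ez[1:] - ez[:-1])

    print(ez[:5])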
Two versions of this code were available to us: a public-domain version ("Version A", described in [19]) that performs only the near-field calculations, and an export-controlled version ("Version C", described in [4]) that performs both near-field and far-field calculations. The two versions were sufficiently different that we parallelized them separately, producing two parallelization experiments.
4.2 The archetype
In many respects this application is a very good fit for our mesh archetype, as described in this section.
4.2.1 Computational pattern
The pattern captured by the mesh archetype is one in which the overall computation is based on N-dimensional grids (where N is 1, 2, or 3) and structured as a sequence of the following operations on those grids:

Grid operations. Grid operations apply the same operation to each point in the grid, using data for that point and possibly neighboring points. If the operation uses data from neighboring points, the set of variables modified in the operation must be disjoint from the set of variables used as input. Input variables may also include global variables (variables common to all points in the grid, e.g., constants).
Reduction operations. Reduction operations combine all values in a grid into
a single value (e.g., finding the maximum element).
File input/output operations. File input/output operations read or write values for a grid.
Data may also include global variables common to all points in the grid (constants, for example, or the results of reduction operations), and the computation may include simple control structures based on these global variables (for example, looping based on a variable whose value is the result of a reduction).
4.2.2 Parallelization strategy and dataflow
Devising a parallelization strategy for a particular archetype begins by considering how its dataflow pattern can be used to determine how to distribute data among processes in such a way that communication requirements are minimized.
For the mesh archetype, the dataflow patterns of the archetype's characteristic operations lend themselves to a data-distribution scheme based on partitioning the data grid into regular contiguous subgrids (local sections) and distributing them among processes.
Grid operations. Provided that the previously specified restriction is met, points can be operated on in any order or simultaneously. Thus, each process can compute (sequentially) values for the points in its local section of the grid, and all processes can operate concurrently.
Reduction operations. Provided that the operation used to perform the reduction is associative (e.g., maximum) or can be so treated (e.g., floating-point addition, for appropriate data and if some degree of nondeterminism is acceptable), reductions can be computed concurrently by allowing each process to compute a local reduction result and then combining them, for example via recursive doubling. After completion of a reduction operation, all processes have access to its result.
File input/output operations. Exploitable concurrency and appropriate data distribution depend on considerations of file structure and (perhaps) platform-dependent I/O considerations. One possibility is to define a separate host process responsible for file I/O. A read operation then requires that the host process read the data from the file and then redistribute it to the other (grid) processes, while a write operation requires that the data first be redistributed from the grid processes to the host process and then written to the file. Another possibility is to perform I/O concurrently in all processes (actual concurrency may be limited by system or file constraints).
Each of these operations may incorporate or necessitate communication operations, as discussed below. Further, distributed memory introduces the requirement that each process have a duplicate copy of any global variables, with their values kept consistent; that is, any change to such a variable must be duplicated in each process before the value of the variable is used again. The guidelines and transformations provided by the archetype ensure that these requirements are met.
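A minimal sketch of the resulting data distribution (our own illustration, one-dimensional for brevity): the grid is partitioned into contiguous local sections, and each process additionally keeps a duplicated copy of the global variables.

    def local_bounds(n, P, p):
        # Contiguous block [lo, hi) of a length-n grid axis owned by
        # process p of P (remainder points go to the first few processes).
        base, extra = divmod(n, P)
        lo = p * base + min(p, extra)
        hi = lo + base + (1 if p < extra else 0)
        return lo, hi

    n, P = 10, 3
    print([local_bounds(n, P, p) for p in range(P)])  # [(0, 4), (4, 7), (7, 10)]

    # Duplicated global variables: one copy per process; any change must be
    # propagated (e.g., broadcast) before the value is used again.
    global_copies = [{"tolerance": 1e-6} for _ in range(P)]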
4.2.3 Communication patterns
This combination of data-distribution scheme and computational operations gives rise to the need for a small set of communication operations:
Exchange of boundary values. If a grid operation uses values from neighboring points, computing new values for points on the boundary of each local section requires data from neighboring processes' local sections. This dataflow requirement can be met by surrounding each local section with a ghost boundary containing shadow copies of boundary values from neighboring processes, and using a "boundary exchange" operation (in which neighboring processes exchange boundary values) to refresh these shadow copies, as illustrated in Figure 4; a code sketch appears at the end of this section.

Figure 4: Boundary exchange (ghost boundaries hold shadow copies of neighboring sections' boundary values).
Broadcast of global data. When global data is computed or changed in one process only (for example, if it is read from a file), a broadcast operation is required to reestablish copy consistency.

Support for reduction operations. Reduction operations can be supported by several communication patterns depending on their implementation, for example all-to-one/one-to-all or recursive doubling. Figure 5 illustrates recursive doubling by means of an example (computing the sum of elements of an array).
Support for file input/output operations. Support for file input/output operations consists of redistribution operations that copy data from a single host process to the remaining (grid) processes and vice versa, as illustrated in Figure 6.
All of the required operations can be supported by a communication library containing a "boundary exchange" operation, host-to-grid and grid-to-host data-redistribution operations, and a general reduction operation.
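The following sketch (ours; a one-dimensional simulated-parallel rendering of Figure 4, not the library's Fortran code) shows the boundary-exchange operation as plain assignments into ghost cells; in the parallel version each assignment becomes a send-receive pair:

    import numpy as np

    P, local_n = 4, 5
    # Each local section carries one ghost cell at each end (indices 0, -1).
    sections = [np.zeros(local_n + 2) for _ in range(P)]
    for p in range(P):
        sections[p][1:-1] = p + 1          # interior values owned by process p

    def exchange_boundaries(sections):
        # Refresh ghost cells with shadow copies of neighbors' boundary
        # values. Targets are ghost cells only, so the assignments satisfy
        # the data-exchange restrictions of Definition 1.
        for p in range(len(sections) - 1):
            sections[p][-1] = sections[p + 1][1]   # from right neighbor
            sections[p + 1][0] = sections[p][-2]   # from left neighbor

    exchange_boundaries(sections)
    print(sections[1])   # ghost cells now hold the values 1.0 and 3.0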
4.2.4 Implementation
We have developed for this archetype an implementation consisting of program-transformation guidelines, together with a code skeleton and an archetype-specific library of communication routines. The code skeleton and library are Fortran-based, with versions in Fortran M [12], Fortran with p4 [6], and Fortran with NX [26]. The implementation is described in detail in [24].

[Figure 5 illustrates the recursive-doubling sum of a(1), ..., a(4): first the pairwise partial sums sum(a(1:2)) and sum(a(3:4)), then sum(a(1:4)) in every process.]
Figure 5: Recursive doubling to compute a reduction (sum).

[Figure 6 shows data "distributed" over the host process being redistributed to data distributed over the grid processes, and vice versa.]
Figure 6: Host-to-grid and grid-to-host redistribution.
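The recursive-doubling reduction of Figure 5 can be sketched as follows (an illustrative simulated-parallel version of ours; the number of processes is assumed here to be a power of two):

    def recursive_doubling_sum(values):
        # After log2(P) exchange rounds, every simulated process holds the
        # global sum; acc[p] plays the role of process p's partial result.
        P = len(values)                  # assumed to be a power of two
        acc = list(values)
        step = 1
        while step < P:
            # one data-exchange round: process p pairs with process p XOR step
            acc = [acc[p] + acc[p ^ step] for p in range(P)]
            step *= 2
        return acc

    print(recursive_doubling_sum([1.0, 2.0, 3.0, 4.0]))  # [10.0, 10.0, 10.0, 10.0]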
4.3 Parallelization strategy
As noted, in most respects our target application fits the pattern of the mesh archetype. The near-field calculations are a perfect example of this archetype and thus can be readily parallelized; all that is required is to partition the data and insert calls to nearest-neighbor communication routines.
The far-field calculations fit the archetype less well and are thus more difficult to parallelize. The simplest approach to parallelization involves reordering the sums being computed: Each process computes local double sums (over all time steps and over points in its subgrid); at the end of the computation, these local sums are combined. The effect is to reorder, but not otherwise change, the summation. This method has the advantages of being simple and readily implemented using the mesh archetype (since it consists mostly of local computation, with one final global-reduction operation). It has the disadvantage of being nondeterministic (that is, not guaranteed to give the same results when executed with different numbers of processes), since floating-point arithmetic is not associative. Nonetheless, because of its simplicity, we chose this method for an initial parallelization.
4.4 Applying our methodology
Determining how to apply the strategy. First, we determined how to apply the parallelization strategy, guided by documentation [24] for the mesh archetype, as follows:

1. Identify which variables should be distributed (among grid processes) and which duplicated (across all processes). For those variables that are to be distributed, determine which ones should be surrounded by a ghost boundary. Conceptually partition the data to be distributed into local sections, one for each grid process.

2. Identify which parts of the computation should be performed in the host process and which in the grid processes, and also which parts of the grid-process computation should be distributed and which duplicated. Also identify any parts of the computation that should be performed differently in the individual grid processes (e.g., calculations performed on the boundaries of the grid).
Generating the sequential simulated-parallel version. We then applied the following transformations to the original sequential code to obtain a sequential simulated-parallel version, operating separately on the two versions of the application described in Section 4.1:

1. In effect partition the data into distinct address spaces by adding an index to each variable. The value of this index constitutes a simulated process ID. At this point all data (even variables that are eventually to be distributed) is duplicated across all processes.

2. Align the program to fit the archetype pattern of blocks of local computation alternating with data-exchange operations.

3. Separate each local-computation block into a simulated-host-process block and a simulated-grid-process block.

4. Separate each simulated-grid-process block into the desired N simulated-grid-process blocks. This implies the following changes:

(a) Modify loop bounds so that each simulated grid process modifies only data corresponding to its local section. This step was complicated by the fact that loop counters in the original code were used both as indices into arrays that were to be distributed and to indicate a grid point's global position, and although the former usage must be changed in this step, the latter must not.

(b) If there are calculations that must be done differently in different grid processes (e.g., boundary calculations), ensure that each process performs the appropriate calculations.

(c) Insert data-exchange operations (calls to appropriate archetype library routines).

The result of these transformations was a sequential simulated-parallel version of the original program.
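To illustrate transformations 1 and 4(a) on a toy fragment (ours, not the application's code): adding a process index duplicates the data, and restricting loop bounds distributes the work, while the point's global position, also derived from the original loop counter, must be preserved.

    import numpy as np

    n, N = 12, 3                       # grid points, simulated processes
    u_seq = np.arange(n, dtype=float)

    def f(x, gpos):                    # update using the point's global position
        return x + 0.1 * gpos

    # Transformation 1: add a process index (axis 0); all data duplicated.
    u = np.tile(u_seq, (N, 1))         # u[p] simulates process p's address space

    # Transformation 4(a): restrict loop bounds to each local section, but
    # keep computing the *global* position for uses that need it.
    chunk = n // N
    for p in range(N):
        for i_local in range(chunk):
            i_global = p * chunk + i_local
            u[p, i_global] = f(u[p, i_global], i_global)

    # The distributed entries agree with the original sequential loop.
    gathered = np.concatenate([u[p, p * chunk:(p + 1) * chunk] for p in range(N)])
    assert np.allclose(gathered, [f(x, i) for i, x in enumerate(u_seq)])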
Generating the parallel program. Finally, we transformed this sequential simulated-parallel version into a program for message-passing architectures, as described in Section 3.3.
4.5 Results
Correctness. For those parts of the computation that fit the mesh archetype (the near-field calculations), the sequential simulated-parallel version produced results identical to those of the original sequential code. For those parts of the computation that did not fit well (the far-field calculations), the sequential simulated-parallel version produced results markedly different from those of the original sequential code. Our original assumption (that we could regard floating-point addition as associative and thus reorder the required summations without markedly changing their results) proved to be unjustified.³ Correct parallelization of these calculations would thus require a more sophisticated strategy than that suggested by the mesh archetype, which we did not pursue due to time constraints. While disappointing, this result does not invalidate our methodology; what it invalidates is our assumption that floating-point addition in the context of the far-field calculations could be regarded as associative and hence these calculations could be treated as a reduction operation as defined by the mesh archetype.
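The failure is easy to reproduce in miniature (a toy demonstration of ours): when, as footnote 3 notes, the summands span many orders of magnitude, merely reordering a floating-point sum changes its value.

    import random

    random.seed(0)
    # Summands spanning many orders of magnitude, as in the far-field sums.
    xs = [random.uniform(1.0, 10.0) * 10.0 ** random.randint(-12, 12)
          for _ in range(10000)]

    s_original = sum(xs)
    s_reordered = sum(sorted(xs))      # same values, different order
    print(s_original, s_reordered, s_original == s_reordered)
    # On IEEE-754 doubles these two sums typically differ in the low-order
    # digits, so the reordered "reduction" is not exactly reproducible.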
For all parts of the computation, however, the message-passing programs produced results identical to those of the corresponding sequential simulated-parallel versions, on the first and every execution.
Performance. We quote performance figures to demonstrate that the parallelization produced by our methodology is acceptably efficient, an important consideration for application developers. Both versions of the application were parallelized using our Fortran M implementation of the mesh archetype. Because of export-control constraints, we were able to obtain performance data for Version C only on a network of workstations. Table 1, Table 2, Table 3, and Table 4 show execution times and speedups⁴ for Version C, executing on a network of Sun workstations. Figure 7 and Figure 8 show execution times and speedups for Version A, executing on an IBM SP. The falloff of performance for more than 4 processors in Figure 7 is probably due to the ratio of computation to communication falling below that required to give good performance. Unsurprisingly, performance for the larger problem size (Figure 8) scales acceptably for a larger number of processors than performance for the smaller problem size (Figure 7), but also falls off when the ratio of computation to communication decreases below that required to give good performance.
Table 1: Execution times and speedups for electromagnetics code (version C), for 33 by 33 by 33 grid, 128 steps, using Fortran M on a network of Suns.

                   Execution time (seconds)   Speedup
    Sequential               78.6              1.00
    Parallel, P=1           189.0              0.41
    Parallel, P=2            51.4              1.52
    Parallel, P=4            25.3              3.10

Table 2: Execution times and speedups for electromagnetics code (version C), for 65 by 65 by 65 grid, 1024 steps, using Fortran M on a network of Suns.

                   Execution time (seconds)   Speedup
    Sequential             4309.5              1.00
    Parallel, P=4          1189.8              3.62

Table 3: Execution times and speedups for electromagnetics code (version C), for 46 by 36 by 36 grid, 128 steps, using Fortran M on a network of Suns.

                   Execution time (seconds)   Speedup
    Sequential              123.1              1.00
    Parallel, P=1           258.5              0.47
    Parallel, P=2            65.4              1.88
    Parallel, P=4            32.5              3.78

Table 4: Execution times and speedups for electromagnetics code (version C), for 91 by 71 by 71 grid, 2048 steps, using Fortran M on a network of Suns.

³Analysis of the values involved showed that they ranged over many orders of magnitude, so it is not surprising that the result of the summation was markedly affected by the order of summation.
⁴We define speedup as execution time for the original sequential code divided by execution time for the parallel code.

[Figure 7 plots, in two panels, execution time (sequential, actual, ideal; log-log) and speedup (actual, perfect) against number of processors, 1 through 8.]
Figure 7: Execution times and speedups for electromagnetics code (version A) for 34 by 34 by 34 grid, 256 steps, using Fortran M on the IBM SP.

[Figure 8 plots, in two panels, execution time (sequential, actual, ideal; log-log) and speedup (actual, perfect) against number of processors, up to 18.]
Figure 8: Execution times and speedups for electromagnetics code (version A) for 66 by 66 by 66 grid, 512 steps, using Fortran M on the IBM SP.
Ease of use. It is difficult to define objective measures of ease of use, but our experiences in the experiments described in this paper suggest that the parallelization methodology described herein produces results at a reasonable human-effort cost.
Starting in both cases with unfamiliar code (about 2400 lines for Version C and 1400 lines for Version A, including comments and whitespace), we were able to perform the transformations described in Section 4.4 relatively quickly: For version C of the code, one person spent 2 days determining how to apply the mesh-archetype parallelization strategy, 8 days converting the sequential code into the sequential simulated-parallel version, and less than a day converting the sequential simulated-parallel version into a message-passing version. For version A of the code, one person spent less than a day determining how to apply the parallelization strategy, 5 days converting the sequential code into the sequential simulated-parallel version, and less than a day converting the sequential simulated-parallel version into a message-passing version.
5 Related work
Stepwise refinement. Program development via stepwise refinement is described by many researchers, for example Back [2], Gries [14], and Hoare [17] for sequential programs, and Back [1], Martin [23], and Van de Velde [29] for parallel programs. The transformation of assignment statements into message-passing operations, which is central to Theorem 3, is treated by Hoare [18] and Martin [23].

Operational models of parallel programming. Our proof of Theorem 3 is loosely based on regarding programs as action or state-transition systems, as described in Chandy and Misra [10], Lynch and Tuttle [21], Lamport [20], Manna and Pnueli [22], and Pnueli [27]. We present a more complete model of parallel programming that supports this and other experimental work in [25].
Automatic parallelizing compilers. Much effort has gone into development of compilers that automatically recognize potential concurrency and emit parallel code, for example Fortran D [11] and HPF [16]. We regard this work as complementary to our methodology and postulate that some of our transformations could be automated by such compilers.
Design patterns. Many researchers have investigated the use of patterns in developing algorithms and applications. Our previous work [7, 8, 9] explores a more general notion of archetypes and their role in developing both sequential and parallel programs. Gamma et al. [13] address primarily the issue of patterns of computation, in the context of object-oriented design. Our work, in contrast, also examines patterns of dataflow and communication. Schmidt [28] focuses more on parallel structure, but in a different context from our work and with less emphasis on code reuse. Brinch Hansen's work on parallel structures [5] is similar in motivation to our work, but his model programs are typically more narrowly defined than our archetypes. Other work addresses lower-level patterns, as for example the use of templates to develop algorithms for linear algebra in [3] and the use of templates in developing formally verified software in [15].
6 Conclusions
This paper describes experiments with a methodology for parallelizing sequential programs based on applying the techniques of stepwise refinement under the guidance of a parallel programming archetype. The experiments demonstrate that the methodology produces programs that are correct (when the transformations are applied in the proper context) and reasonably efficient at what seems to be reasonable human-effort cost. It is particularly heartening to note that the transformation for which we provide formal support (the transformation from sequential simulated-parallel to parallel) produced correct results in practice as well as in theory, with the resulting parallel programs producing results identical to those of their simulated-parallel predecessors on the first execution (and all subsequent executions).

Much work remains to be done: identifying and developing additional archetypes, formally stating and proving additional transformations ([25] presents additional examples but is not exhaustive), and providing automatic support for transformations where feasible. We are encouraged, however, by our results so far.
Acknowledgments
Special thanks go to Eric Van de Velde, whose book [29] inspired this work, and to John Beggs, who provided and explained the electromagnetics application that was the subject of our experiments.
References
[1] R. J. R. Back. Refinement calculus, part II: Parallel and reactive programs. In Stepwise Refinement of Distributed Systems: Models, Formalisms, Correctness, volume 430 of Lecture Notes in Computer Science, pages 67-93. Springer-Verlag, 1990.

[2] R. J. R. Back and J. von Wright. Refinement calculus, part I: Sequential nondeterministic programs. In Stepwise Refinement of Distributed Systems: Models, Formalisms, Correctness, volume 430 of Lecture Notes in Computer Science, pages 42-66. Springer-Verlag, 1990.

[3] R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, 1993.
[4] J. H. Beggs, R. J. Luebbers, D. Steich, H. S. Langdon, and K. S. Kunz. User's manual for three-dimensional FDTD version C code for scattering from frequency-dependent dielectric and magnetic materials. Technical report, The Pennsylvania State University, July 1992.
[5] P. Brinch Hansen. Model programs for computational science: A programming methodology for multicomputers. Concurrency: Practice and Experience, 5(5):407-423, 1993.

[6] R. M. Butler and E. L. Lusk. Monitors, messages, and clusters: the p4 parallel programming system. Parallel Computing, 20(4):547-564, 1994.

[7] K. M. Chandy. Concurrent program archetypes. In Proceedings of the Scalable Parallel Libraries Conference, 1994.

[8] K. M. Chandy, R. Manohar, B. L. Massingill, and D. I. Meiron. Integrating task and data parallelism with the group communication archetype. In Proceedings of the 9th International Parallel Processing Symposium, 1995.

[9] K. M. Chandy and B. L. Massingill. Parallel program archetypes. Technical Report CS-TR-96-28, California Institute of Technology, 1997.

[10] K. M. Chandy and J. Misra. Parallel Program Design: A Foundation. Addison-Wesley, 1989.

[11] K. D. Cooper, M. W. Hall, R. T. Hood, K. Kennedy, K. S. McKinley, J. M. Mellor-Crummey, L. Torczon, and S. K. Warren. The ParaScope parallel programming environment. Proceedings of the IEEE, 82(2):244-263, 1993.

[12] I. T. Foster and K. M. Chandy. FORTRAN M: A language for modular parallel programming. Journal of Parallel and Distributed Computing, 26(1):24-35, 1995.

[13] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.

[14] D. Gries. The Science of Programming. Springer-Verlag, 1981.

[15] D. Hemer and P. Lindsay. Reuse of verified design templates through extended pattern matching. Technical Report 97-03, Software Verification Research Centre, School of Information Technology, The University of Queensland, 1997. To appear in Proceedings of Formal Methods Europe (FME '97).

[16] High Performance Fortran Forum. High Performance Fortran language specification, version 1.0. Scientific Programming, 2(1-2):1-170, 1993.

[17] C. A. R. Hoare. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576-583, 1969.

[18] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, 1985.

[19] K. S. Kunz and R. J. Luebbers. The Finite Difference Time Domain Method for Electromagnetics. CRC Press, 1993.

[20] L. Lamport. A temporal logic of actions. ACM Transactions on Programming Languages and Systems, 16(3):872-923, 1994.

[21] N. A. Lynch and M. R. Tuttle. Hierarchical correctness proofs for distributed algorithms. In Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, 1987.

[22] Z. Manna and A. Pnueli. Completing the temporal picture. Theoretical Computer Science, 83(1):97-130, 1991.

[23] A. J. Martin. Compiling communicating processes into delay-insensitive VLSI circuits. Distributed Computing, 1(4):226-234, 1986.

[24] B. Massingill. The mesh archetype. Technical Report CS-TR-96-25, California Institute of Technology, 1997. Also available via http://www.etext.caltech.edu/Implementations/.

[25] B. Massingill. A structured approach to parallel programming (Ph.D. thesis). Technical Report CS-TR-98-04, California Institute of Technology, 1998.

[26] P. Pierce. The NX message-passing interface. Parallel Computing, 20(4):463-480, 1994.

[27] A. Pnueli. The temporal semantics of concurrent programs. Theoretical Computer Science, 13:45-60, 1981.

[28] D. C. Schmidt. Using design patterns to develop reusable object-oriented communication software. Communications of the ACM, 38(10):65-74, 1995.

[29] E. F. Van de Velde. Concurrent Scientific Computing. Springer-Verlag, 1994.
