Experiments with Program Parallelization Using
Archetypes and Stepwise Refinement*
Berna L. Massingill†
January 23, 1998
UF CISE Technical Report 98-012
Abstract

Parallel programming continues to be difficult and error-prone, whether
starting from specifications or from an existing sequential program. This
paper presents (1) a methodology for parallelizing sequential applica-
tions and (2) experiments in applying the methodology to application
programs. The methodology is based on the use of stepwise refinement
together with what we call parallel programming archetypes (briefly, ab-
stractions that capture common features of classes of programs), in which
most of the work of parallelization is done using familiar sequential tools
and techniques, and those parts of the process that cannot be addressed
with sequential tools and techniques are addressed with formally justified
transformations. The experiments consist of applying the methodology
to sequential application programs, and they provide evidence that the
methodology produces correct and reasonably efficient programs at rea-
sonable human-effort cost. Of particular interest is the fact that the aspect
of the methodology that is most completely formally justified is the aspect
that in practice was the most trouble-free.
1 Introduction

Much work has been done to make parallel programming tractable, but the
development of parallel applications continues to be difficult and error-prone,
whether the starting point is a specification or an existing sequential program.
In this paper, we describe a methodology for parallelizing sequential applications based on the use of stepwise refinement together with what we call parallel programming archetypes (such an archetype is an abstraction that captures
*This work was supported by the AFOSR via an Air Force Laboratory Graduate Fellowship.
†University of Florida, P.O. Box 116120, Gainesville, FL 32611; blm@cise.ufl.edu.
the commonality of a class of programs with similar structure), in which most of
the work of parallelization is done using familiar sequential tools and techniques,
and those parts of the process that cannot be addressed with sequential tools
and techniques are addressed with formally justified transformations. We then
present the results of applying this methodology to a moderate-length application program, providing evidence that the methodology produces correct and reasonably efficient programs at reasonable human-effort cost. We focus attention on transforming programs for execution on distributed-memory message-passing architectures, but we note that such programs may also be executed on
architectures that support a shared-memory model.
2 The methodology
Our methodology is based on the idea of transforming a sequential program into a parallel program by applying a sequence of small semantics-preserving transformations, guided by the pattern provided by a parallel programming archetype.
2.1 Parallel programming archetypes
By parallel programming archetype we mean an abstraction, similar to a design pattern, that captures the commonality of a class of programs. (An example of a sequential programming archetype is the familiar divide-and-conquer strategy.)
Methods of exploiting design patterns in program development begin by
identifying classes of problems with similar computational structures and creating abstractions that capture the commonality. Combining a problem class's
computational structure with a parallelization strategy gives rise to a dataflow
pattern and hence a communication structure. It is this combination of compu-
tational structure, parallelization strategy, and the implied pattern of dataflow and communication that we capture as a parallel programming archetype.
The commonality captured by the archetype abstraction makes it possible to
develop a small collection of semantics-preserving transformations that together
allow parallelization of any program that fits the archetype. Also, the common
dataflow pattern makes it possible to encapsulate those parts of the computation
that involve interprocess communication and transform them only once, with
the results of the transformation made available as a communication operation
usable in any program that fits the archetype.
2.2 Stepwise refinement and the sequential simulated-parallel version
A key feature of our methodology for parallelizing programs via a sequence of
transformations is that almost all of the transformations are performed in the
sequential domain, as illustrated by Figure 1.

[Figure 1: Parallelization via a sequence of transformations.]

Ideally all of the transformations would be formally stated and proved, with all but the final transformation
proved using the techniques of sequential stepwise refinement. In the experi-
ments described in this paper, however, we chose to focus our proof efforts on
the final transformation (the one that takes the program into the parallel domain), since the sequential-to-sequential transformations are more amenable to checking by testing and debugging than the sequential-to-parallel transformation, and hence the case for a formal justification is more compelling for the
former than for the latter. Thus, in this paper we give a formal proof only for
this last transformation; the proof is applicable to all programs meeting certain
stated criteria, and guarantees the correctness of the parallel program, provided
the intermediate stages are correct. The key intermediate stage in this transfor-
mation process is the last stage before transformation into the parallel domain;
we call it the sequential simulated-parallel version of the program. This version
essentially simulates the operation of a program consisting of N processes executing on a distributed-memory message-passing architecture. We define it as follows.
Definition 1 (Sequential simulated-parallel program).
A sequential simulated-parallel program is one with the following characteristics:
1. The atomic data objects(1) of the program are partitioned into N groups, one for each simulated process; the i-th group simulates the local data for the i-th process. These data objects may include duplicated variables (e.g., loop counters and program constants). ((1) An atomic data object is as defined in HPF: one that contains no subobjects, e.g., a scalar data object or a scalar element of an array.)
2. The computation consists of an alternating sequence of local-computation
blocks and data-exchange operations, in which:
(a) Each local-computation block is a composition of N program blocks,
in which the i-th block accesses only local data for the i-th simulated
process. Such blocks correspond to sections of the parallel program
in which processes execute independently and without interaction.
(b) Each data-exchange operation consists of a set of assignment state-
ments that satisfy the following restrictions:
i. If an atomic data object is the target of an assignment, it is not
referenced in any other assignment.
ii. No left-hand or right-hand side may reference atomic data objects belonging to more than one of the N simulated-local-data
partitions. The left-hand and right-hand sides of an assignment
may, however, reference data from different partitions.
iii. For each simulated process i, at least one assignment statement
must assign a value to a variable in i's local data.
Such blocks correspond to sections of the parallel program in which
processes exchange messages: Each assignment statement can be im-
plemented as a single point-to-point message-passing operation, and
a group of message-passing operations with a common sender and a
common receiver can be combined for efficiency.
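Definition 1 can be made concrete with a small sketch. The fragment below is Python rather than the paper's Fortran, and every name in it is invented for illustration: each simulated process owns one partition of the data, local-computation blocks touch only one partition at a time, and the data-exchange operation is a set of assignments satisfying restrictions (i)-(iii).

```python
N = 2  # number of simulated processes

# Partitioned atomic data: local[i] is the simulated local data of process i.
local = [{"x": float(i + 1), "ghost": 0.0} for i in range(N)]

# Local-computation block: a composition of N blocks, the i-th accessing
# only process i's partition.
for i in range(N):
    local[i]["x"] = local[i]["x"] * 2.0

# Data-exchange operation: assignment targets appear in no other assignment,
# each side references a single partition, and every process is assigned at
# least one value (restrictions i-iii of Definition 1).
local[0]["ghost"] = local[1]["x"]   # will become a message from process 1 to 0
local[1]["ghost"] = local[0]["x"]   # will become a message from process 0 to 1

# Another local-computation block, again purely local per process.
for i in range(N):
    local[i]["x"] = local[i]["x"] + local[i]["ghost"]

print([p["x"] for p in local])  # [2+4, 4+2] -> [6.0, 6.0]
```

Because the assignments in the data-exchange block are order-independent (restriction i), each can later be replaced by one send-receive pair.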
Definition 2 (Sequential simulated-parallel version).
A sequential simulated-parallel version of program P is a sequential simulated-parallel program P' that refines P (i.e., meets the same or a stronger specification).
As we demonstrate in the next section, sequential execution of such a pro-
gram truly simulates execution of the corresponding parallel program, in that
all executions of the parallel program give the same results as an execution of
the sequential program.
In general, producing such a simulated-parallel program could be tedious,
time-consuming, and error-prone. However, in our methodology the transformation process is guided by an archetype, with the archetype providing, for
programs in the class for which it is an abstraction, class-specific guidelines for
program parallelization as well as class-specific code libraries encapsulating the
communication and other operations required for parallelization. For a pro-
gram that fits an archetype for which guidelines and a code library have been
developed, the task of producing the simulated-parallel version is much more
manageable, since the archetype guides the transformation process, and the pro-
gram's data-exchange operations correspond to the archetype's communication
operations, simplifying their transformation.
3 Supporting theory
In this section we present a general theorem allowing us to transform sequential
simulated-parallel programs into equivalent parallel programs.
3.1 The parallel program and its simulated-parallel version
The target parallel program. The goal of the transformation process is a
parallel program with the following characteristics:
1. The program is a collection of N sequential, deterministic processes.
2. Processes do not share variables; each has a distinct address space.
3. Processes interact only through sends and blocking receives on single-
reader-single-writer channels with infinite slack (i.e., infinite capacity).
4. An execution is a fair interleaving of actions from processes.
The simulated-parallel version. We can simulate execution of such a par-
allel program as follows:
1. Simulate concurrent execution by interleaving actions from (simulated) processes.
2. Simulate separate address spaces by defining a set of distinct address-space partitions.
3. Simulate communication over channels by representing channels as queues,
taking care that no attempt is made to read from a channel unless it is
known not to be empty.
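The three simulation steps above can be sketched in a few lines. This is an illustrative Python fragment (not the paper's Fortran library); the function names and message values are invented, and the "never read an empty channel" restriction from step 3 is enforced with an assertion.

```python
from collections import deque

# Channels are single-reader-single-writer queues with (conceptually)
# infinite slack; chan[(i, j)] carries messages from process i to process j.
N = 2
chan = {(i, j): deque() for i in range(N) for j in range(N) if i != j}

def send(i, j, value):
    chan[(i, j)].append(value)

def receive(i, j):
    # The sequential simulation must never read an empty channel: the
    # chosen interleaving ensures every receive follows its matching send.
    assert chan[(i, j)], "simulated receive on an empty channel"
    return chan[(i, j)].popleft()

# One interleaving of a two-process exchange: all sends, then all receives.
send(0, 1, "hello from 0")
send(1, 0, "hello from 1")
got_by_1 = receive(0, 1)
got_by_0 = receive(1, 0)
print(got_by_0, "|", got_by_1)
```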
Figure 2 illustrates the relationship between the simulated-parallel and parallel
versions of a program.
[Figure 2: Correspondence between parallel and simulated-parallel program versions.]
3.2 The theorem
Theorem 3. Given deterministic processes P0, ..., P(N-1) with no shared variables except single-reader-single-writer channels with infinite slack, if I and I' are two maximal interleavings of the actions of the Pj's that begin in the same initial state, then I and I' both terminate, and in the same final state.
Proof of Theorem 3.
Given interleavings I and I' beginning in the same state, we show that I' can be
permuted to match I without changing its final state. The proof is by induction
on the length of I.
Notation: We write bi to denote the i-th action of interleaving I and aj,k to
denote the k-th action taken by process Pj.
Base case: The length of I is 1. Trivial.
Inductive step: Suppose we can permute I' such that the first n steps of
the permuted I' match the first n steps of I. Then we must show that we can
further permute I' so that the first n + 1 steps match I. That is, suppose we
have the following situation:
I : b0, b1, ..., b(n-1), aj,k, ...
permuted I' : b0, b1, ..., b(n-1), aj',k', ...
Then we want to show that we can further permute I' so that its (n + 1)-th
action is also aj,k.
Observe first that the first action taken in Pj after b(n-1) in the permuted I' must be aj,k: All processes are deterministic, the state after n actions is the same in I and the permuted I', and channels are single-reader-single-writer. Observe analogously that aj',k' is the first action taken in Pj' after b(n-1) in I. Thus, if j = j', then aj,k = aj',k', and we are done. So suppose j ≠ j'.
Lemma: Observe that for any two consecutive actions a(m,n) and a(m',n'), if m ≠ m' and it is not the case that a(m,n) and a(m',n') are both actions on the same channel c, then, because the program contains no shared variables except the channels, these actions can be performed in either order with the same results.
We now demonstrate via a case-by-case analysis that we can permute I' by repeatedly exchanging aj,k with its immediate predecessor until it follows b(n-1) (as it does in I).
1. If aj,k is the m-th receive on some channel c and aj',k' also affects c, then: aj',k' is the m'-th send on c, with m' > m. (The action is a send because channels are single-reader, and m' > m since aj,k precedes aj',k' in I.) Further, no action between aj',k' and aj,k can affect c (since channels are single-reader-single-writer). Thus, using the lemma, we can repeatedly exchange aj,k with its predecessors in permuted I', up to and including aj',k', as desired.
2. If aj,k is the m-th send on some channel c and aj',k' also affects c, then: aj',k' is the m'-th receive on c, with m' < m. (The action is a receive because channels are single-writer, and m' < m since aj',k' precedes aj,k in permuted I'.) Further, no action between aj',k' and aj,k can affect c (since channels are single-reader-single-writer). Thus, we can repeatedly exchange aj,k with its predecessors in permuted I', up to and including aj',k', as desired.
3. If aj,k is the m-th receive on some channel c and aj',k' does not affect c, then: If no action between aj',k' and aj,k in permuted I' affects c, then we can perform repeated exchanges as before. If some action bi does affect c, then it must be the m'-th send, with m' > m. (The action is a send because channels are single-reader, and m' > m because the placement of aj,k in I guarantees that actions b0, ..., b(n-1) contain at least m sends on c.) We can thus exchange bi with aj,k, giving the desired result.
4. If aj,k is the m-th send on some channel c and aj',k' does not affect c, then: If no action between aj',k' and aj,k in permuted I' affects c, then we can perform repeated exchanges as before. If some action bi does affect c, then it must be a receive (since channels are single-writer), so it also can be exchanged with aj,k, giving the desired result.
5. If aj,k is not an action on a channel, then from the lemma we can exchange
it with its predecessors as desired.
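The theorem's claim (any two maximal interleavings end in the same final state) can be exercised with a toy simulation. The sketch below is illustrative Python with invented names, not part of the paper's framework; a blocking receive is simulated by re-scheduling the waiting process until its channel is non-empty.

```python
from collections import deque

def make_processes():
    """Two deterministic processes sharing nothing but one channel 0 -> 1."""
    chan = deque()
    state = {"p0_x": 0, "p1_y": 0}

    def p0():                          # actions of P0, one per yield
        state["p0_x"] = 41
        yield
        chan.append(state["p0_x"] + 1)  # send on the channel
        yield

    def p1():                          # actions of P1
        while not chan:                 # blocking receive, simulated by
            yield                       # re-scheduling until non-empty
        state["p1_y"] = chan.popleft()
        yield

    return [p0(), p1()], state

def run(schedule):
    """Drive the processes under a given (fair, maximal) interleaving."""
    procs, state = make_processes()
    alive = set(range(len(procs)))
    for i in schedule:
        if i in alive:
            try:
                next(procs[i])
            except StopIteration:
                alive.discard(i)
    return state

# Two different maximal interleavings reach the same final state.
a = run([0, 1, 0, 1, 0, 1, 1, 0, 1])
b = run([1, 1, 0, 0, 1, 1, 0, 1, 0])
print(a == b, a)
```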
3.3 Implications and application of the theorem
Theorem 3 implies that if we can produce a sequential simulated-parallel pro-
gram that meets the same specification as a sequential program, then we can
mechanically convert it into a parallel program that meets that same specifica-
tion, by transforming the simulated processes into real processes, the simulated
multiple address spaces into real multiple address spaces, and the simulated
communication actions into real communication actions.(2) As noted earlier, in general producing such a simulated-parallel program could be quite difficult,
but if we start with a program that fits an archetype, and produce a sequen-
tial simulated-parallel version of the form described in Definition 1, where the
data-exchange operations correspond to the communication operations of the
archetype, then the task becomes manageable. Figure 3 illustrates the relationship between the simulated-parallel and parallel versions of such a program.
Each collection of assignments constituting a data-exchange operation can
be replaced, as described earlier, with a collection of sends and receives: Be-
cause of the restrictions described in Definition 1, the set of assignments can be
implemented as a set of send-receive pairs over single-reader-single-writer chan-
nels, where each assignment generates one send-receive pair, or, for efficiency, all
assignment statements with left-hand-side variables in process Pi's local data
and right-hand-side variables in process Pj's local data are combined into one
send-receive pair from process Pj to process Pi. Further, it is straightforward
to choose an ordering for the simulated-parallel version that does not violate
the restriction that we may not read from an empty channel: First perform all sends; then perform all receives. Finally, if the data-exchange operation corresponds to an archetype communication routine, it can be encapsulated and
implemented as part of the archetype library of routines, which can be made available in both parallel and simulated-parallel versions. The application developer thus need not write out and transform the parts of the application that correspond to data-exchange operations.

(2) Observe that a sequence of messages over a single-reader-single-writer channel from process Pi to process Pj can be implemented as a sequence of point-to-point messages from Pi to Pj by giving each message a tag corresponding to a channel ID and receiving selectively based on these tags.

[Figure 3: Correspondence between parallel and simulated-parallel program versions of an archetype-based program.]
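The mechanical conversion described in this section can be sketched as follows (illustrative Python, with invented variable names): each cross-partition assignment (left-hand side in Pi's data, right-hand side in Pj's data) becomes one send by Pj and one receive by Pi, and the sends-first-then-receives ordering guarantees no channel is read while empty.

```python
from collections import deque

N = 2
local = [{"x": 10.0 * (i + 1), "ghost": 0.0} for i in range(N)]
chan = {(j, i): deque() for j in range(N) for i in range(N) if i != j}

# The data-exchange operation, originally written as assignments:
#   local[0]["ghost"] = local[1]["x"]
#   local[1]["ghost"] = local[0]["x"]
# Each entry records (dest process, dest var, source process, source var).
exchange = [(0, "ghost", 1, "x"),
            (1, "ghost", 0, "x")]

# Safe ordering for the simulated-parallel version:
# first perform all sends ...
for dest, _, src, src_var in exchange:
    chan[(src, dest)].append(local[src][src_var])

# ... then perform all receives, so no channel is read while empty.
for dest, dest_var, src, _ in exchange:
    local[dest][dest_var] = chan[(src, dest)].popleft()

print(local[0]["ghost"], local[1]["ghost"])  # 20.0 10.0
```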
4 Application experiments
The experiments described in this section consist of applying our methodology
independently to two sequential implementations of an electromagnetics appli-
cation. We describe the application, the archetype used to parallelize it, and the results.
4.1 The application
The application parallelized in this experiment is an electromagnetics code that
uses the finite-difference time-domain (FDTD) technique to model transient
electromagnetic scattering and interactions with objects of arbitrary shape and
composition. With this technique, the object and surrounding space are rep-
resented by a 3-dimensional grid of computational cells. An initial excitation
is specified, after which electric and magnetic fields are alternately updated
throughout the grid. By applying a near-field to far-field transformation, these
fields can also be used to derive far fields, e.g., for radar cross section computa-
tions. Thus, the application performs two kinds of calculations:
Near-field calculations. This part of the computation consists of a time-
stepped simulation of the electric and magnetic fields over the 3-dimensional
grid. At each time step, we first calculate the electric field at each point
based on the magnetic fields at the point and neighboring points, and then
we similarly calculate the magnetic fields based on the electric fields.
Far-field calculations. This part of the computation uses the above-calculated
electric and magnetic fields to compute radiation vector potentials at each
time step by integrating over a closed surface near the boundary of the 3-
dimensional grid. The electric and magnetic fields at a particular point on
the integration surface at a particular time step affect the radiation vector
potential at some future time step (depending on the point's position);
thus, each calculated vector potential (one per time step) is a double sum,
over time steps and over points on the integration surface.
Two versions of this code were available to us: a public-domain version ("Version A", described in ) that performs only the near-field calculations, and an export-controlled version ("Version C", described in ) that performs both near-field and far-field calculations. The two versions were sufficiently different
that we parallelized them separately, producing two parallelization experiments.
4.2 The archetype
In many respects this application is a very good fit for our mesh archetype, as
described in this section.
4.2.1 Computational pattern
The pattern captured by the mesh archetype is one in which the overall compu-
tation is based on N-dimensional grids (where N is 1, 2, or 3) and structured
as a sequence of the following operations on those grids:
Grid operations. Grid operations apply the same operation to each point in
the grid, using data for that point and possibly neighboring points. If the
operation uses data from neighboring points, the set of variables modified
in the operation must be disjoint from the set of variables used as input. Input variables may also include global variables (variables common to all
points in the grid, e.g., constants).
Reduction operations. Reduction operations combine all values in a grid into
a single value (e.g., finding the maximum element).
File input/output operations. File input/output operations read or write
values for a grid.
Data may also include global variables common to all points in the grid (con-
stants, for example, or the results of reduction operations), and the computation
may include simple control structures based on these global variables (for ex-
ample, looping based on a variable whose value is the result of a reduction).
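The mesh-archetype operations can be illustrated on a 1-D grid (an illustrative Python sketch with invented data, not the archetype library itself): a grid operation that reads neighboring points and therefore writes into a disjoint output array, followed by a reduction.

```python
# A 1-D instance of the mesh archetype's operations (illustrative only).
n = 8
u = [float(i * i) for i in range(n)]

# Grid operation: the same update at every interior point, reading
# neighboring points; the modified variables (u_new) are disjoint from the
# input variables (u), as the archetype requires when neighbors are used.
u_new = u[:]
for p in range(1, n - 1):
    u_new[p] = 0.5 * (u[p - 1] + u[p + 1])   # interior becomes p*p + 1
u = u_new

# Reduction operation: combine all grid values into one (here, a maximum).
u_max = max(u)
print(u_max)  # boundary value 49.0 is unchanged and remains the maximum
```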
4.2.2 Parallelization strategy and dataflow
Devising a parallelization strategy for a particular archetype begins by consid-
ering how its dataflow pattern can be used to determine how to distribute data
among processes in such a way that communication requirements are minimized.
For the mesh archetype, the dataflow patterns of the archetype's charac-
teristic operations lend themselves to a data-distribution scheme based on par-
titioning the data grid into regular contiguous subgrids (local sections) and
distributing them among processes.
Grid operations. Provided that the previously specified restriction is met, points can be operated on in any order or simultaneously. Thus, each
process can compute (sequentially) values for the points in its local section
of the grid, and all processes can operate concurrently.
Reduction operations. Provided that the operation used to perform the re-
duction is associative (e.g., maximum) or can be so treated (e.g., floating-
point addition, for appropriate data and if some degree of nondeterminism
is acceptable), reductions can be computed concurrently by allowing each
process to compute a local reduction result and then combining them, for
example via recursive doubling. After completion of a reduction operation,
all processes have access to its result.
File input/output operations. Exploitable concurrency and appropriate data distribution depend on considerations of file structure and (perhaps) system-dependent I/O considerations. One possibility is to define a separate host process responsible for file I/O. A read operation then requires that the host process read the data from the file and then redistribute it to the other (grid) processes, while a write operation requires that the data first be redistributed from the grid processes to the host process and then written to the file. Another possibility is to perform I/O concurrently in all processes (actual concurrency may be limited by system or file constraints).
Each of these operations may incorporate or necessitate communication operations, as discussed below. Further, distributed memory introduces the requirement that each process have a duplicate copy of any global variables, with their values kept copy-consistent; that is, any change to such a variable must be
duplicated in each process before the value of the variable is used again. The
guidelines and transformations provided by the archetype ensure that these re-
quirements are met.
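The data-distribution scheme just described can be sketched sequentially (illustrative Python with invented names): a 1-D grid is partitioned into contiguous local sections, a global constant is duplicated per process, and each process applies the grid operation only to its own section.

```python
# Simulated distribution of a 1-D grid among N processes: contiguous
# local sections, plus a global constant duplicated in every process and
# kept copy-consistent (all names here are invented for the sketch).
N = 4
n = 16
grid = [1.0] * n

# Duplicated global variable: one copy per simulated process; any change
# would have to be applied to every copy before the next use.
alpha = [2.0] * N

def section(i):
    """Bounds of process i's contiguous local section."""
    size = n // N
    return i * size, (i + 1) * size

# Each process computes the grid operation on its own local section; this
# point-wise update needs no neighbor data, so no communication is implied.
out = [0.0] * n
for i in range(N):
    lo, hi = section(i)
    for p in range(lo, hi):
        out[p] = alpha[i] * grid[p]

print(sum(out))  # 16 points * 2.0 = 32.0
```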
4.2.3 Communication patterns
This combination of data-distribution scheme and computational operations
gives rise to the need for a small set of communication operations:
Exchange of boundary values. If a grid operation uses values from neigh-
boring points, computing new values for points on the boundary of each
local section requires data from neighboring processes' local sections. This
dataflow requirement can be met by surrounding each local section with a
ghost boundary containing shadow copies of boundary values from neigh-
boring processes, and using a "boundary-exchange" operation (in which
neighboring processes exchange boundary values) to refresh these shadow
copies, as illustrated in Figure 4.
Figure 4: Boundary exchange.
Broadcast of global data. When global data is computed or changed in one
process only (for example, if it is read from a file), a broadcast operation
is required to re-establish copy consistency.
Support for reduction operations. Reduction operations can be supported
by several communication patterns, depending on their implementation; for example, all-to-one/one-to-all or recursive doubling. Figure 5 illustrates recursive doubling by means of an example (computing the sum of the elements of an array).
Support for file input/output operations. Support for file input/output
operations consists of redistribution operations that copy data from a
single host process to the remaining (grid) processes and vice versa, as
illustrated in Figure 6.
All of the required operations can be supported by a communication library
containing a "boundary-exchange" operation, host-to-grid and grid-to-host data-
redistribution operations, and a general reduction operation.
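The recursive-doubling pattern used by the reduction operation can be sketched as follows (illustrative Python, simulated sequentially; the paper's library is Fortran-based): in round r, process i exchanges its partial result with process i XOR 2^r, so after log2(N) rounds every process holds the full result.

```python
import math

# Recursive-doubling reduction (cf. Figure 5), simulated sequentially.
N = 4
partial = [1.0, 2.0, 3.0, 4.0]   # each process's local reduction result

rounds = int(math.log2(N))
for r in range(rounds):
    step = 1 << r
    # Simulate the simultaneous pairwise exchange of this round: process i
    # receives the partial result of its partner i XOR step.
    incoming = [partial[i ^ step] for i in range(N)]
    partial = [partial[i] + incoming[i] for i in range(N)]

print(partial)  # every process ends with the full sum: [10.0, 10.0, 10.0, 10.0]
```

After the final round, all processes have access to the result, matching the requirement stated for reduction operations above.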
[Figure 5: Recursive doubling to compute a reduction (sum).]

[Figure 6: Host-to-grid and grid-to-host redistribution.]

We have developed for this archetype an implementation consisting of program-transformation guidelines, together with a code skeleton and an archetype-specific library of communication routines. The code skeleton and library are Fortran-based, with versions in Fortran M, Fortran with p4, and Fortran with NX. The implementation is described in detail in .
4.3 Parallelization strategy
As noted, in most respects our target application fits the pattern of the mesh
archetype. The near-field calculations are a perfect example of this archetype
and thus can be readily parallelized; all that is required is to partition the data
and insert calls to nearest-neighbor communication routines.
The far-field calculations fit the archetype less well and are thus more difficult to parallelize. The simplest approach to parallelization involves reordering
the sums being computed: Each process computes local double sums (over all
time steps and over points in its subgrid); at the end of the computation, these
local sums are combined. The effect is to re-order, but not otherwise change,
the summation. This method has the advantages of being simple and readily
implemented using the mesh archetype (since it consists mostly of local com-
putation, with one final global-reduction operation). It has the disadvantage of
being nondeterministic (that is, not guaranteed to give the same results when executed with different numbers of processes), since floating-point arithmetic
is not associative. Nonetheless, because of its simplicity, we chose this method
for an initial parallelization.
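The nondeterminism mentioned above comes entirely from re-ordering floating-point sums. A minimal demonstration (illustrative values chosen to span many orders of magnitude, as in the application's far-field data):

```python
# Floating-point addition is not associative, so re-ordering a summation
# (as the reduction-based parallelization does) can change the result,
# especially when the summands span many orders of magnitude.
values = [1.0e16, 1.0, -1.0e16, 1.0]

left_to_right = ((values[0] + values[1]) + values[2]) + values[3]
reordered = (values[0] + values[2]) + (values[1] + values[3])

print(left_to_right, reordered)  # 1.0 vs 2.0: the small term is absorbed
```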
4.4 Applying our methodology
Determining how to apply the strategy. First, we determined how to apply the parallelization strategy, guided by documentation for the mesh
archetype, as follows:
1. Identify which variables should be distributed (among grid processes) and
which duplicated (across all processes). For those variables that are to
be distributed, determine which ones should be surrounded by a ghost
boundary. Conceptually partition the data to be distributed into local
sections, one for each grid process.
2. Identify which parts of the computation should be performed in the host
process and which in the grid processes, and also which parts of the grid-
process computation should be distributed and which duplicated. Also
identify any parts of the computation that should be performed differently in the individual grid processes (e.g., calculations performed on the
boundaries of the grid).
Generating the simulated sequential-parallel version. We then applied
the following transformations to the original sequential code to obtain a simu-
lated sequential-parallel version, operating separately on the two versions of the
application described in Section 4.1:
1. In effect partition the data into distinct address spaces by adding an index
to each variable. The value of this index constitutes a simulated process
ID. At this point all data (even variables that are eventually to be dis-
tributed) is duplicated across all processes.
2. Align the program to fit the archetype pattern of blocks of local compu-
tation alternating with data-exchange operations.
3. Separate each local-computation block into a simulated-host-process block
and a simulated-grid-process block.
4. Separate each simulated-grid-process block into the desired N simulated-
grid-process blocks. This implies the following changes:
(a) Modify loop bounds so that each simulated grid process modifies only
data corresponding to its local section. This step was complicated
by the fact that loop counters in the original code were used both
as indices into arrays that were to be distributed and to indicate a
grid point's global position, and although the former usage must be
changed in this step, the latter must not.
(b) If there are calculations that must be done differently in different
grid processes (e.g., boundary calculations), ensure that each process
performs the appropriate calculations.
(c) Insert data-exchange operations (calls to appropriate archetype library routines).
The result of these transformations was a sequential simulated-parallel version
of the original program.
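Step 4(a) above, the complication with dual-purpose loop counters, can be shown in miniature (illustrative Python; `weight` and all names are invented stand-ins, not the application's code): the loop index is localized per process, but the global grid position must be reconstructed wherever the original code used the counter to mean position.

```python
# Original sequential loop (for reference):
#   for g in range(n):
#       a[g] = weight(g)     # g used both as an array index AND as a
#                            # grid point's global position
n, N = 12, 3

def weight(g):               # invented stand-in for a position-dependent term
    return float(g)

# Transformed loop nest: each simulated process i runs over its own local
# section; the array index is localized, but the global position g must be
# recovered, not replaced, where the original code meant position.
size = n // N
local = [[0.0] * size for _ in range(N)]
for i in range(N):
    for p in range(size):
        g = i * size + p           # recover the global position
        local[i][p] = weight(g)    # index is (i, p); position is still g

flat = [x for sec in local for x in sec]
print(flat == [weight(g) for g in range(n)])  # True: results unchanged
```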
Generating the parallel program. Finally, we transformed this sequential
simulated-parallel version into a program for message-passing architectures, as
described in Section 3.3.
Correctness. For those parts of the computation that fit the mesh archetype
(the near-field calculations) the sequential simulated-parallel version produced results identical to those of the original sequential code. For those parts of the computation that did not fit well (the far-field calculations) the sequential simulated-parallel version produced results markedly different from those of
the original sequential code. Our original assumption (that we could regard floating-point addition as associative and thus reorder the required summations without markedly changing their results) proved to be incorrect.(3) Correct
parallelization of these calculations would thus require a more sophisticated
strategy than that suggested by the mesh archetype, which we did not pursue
due to time constraints. While disappointing, this result does not invalidate our
methodology; what it invalidates is our assumption that floating-point addition
in the context of the far-field calculations could be regarded as associative and
hence these calculations could be treated as a reduction operation as defined by
the mesh archetype.
For all parts of the computation, however, the message-passing programs
produced results identical to those of the corresponding sequential simulated-
parallel versions, on the first and every execution.
Performance. We quote performance figures to demonstrate that the par-
allelization produced by our methodology is acceptably efficient, an important
consideration for application developers. Both versions of the application were
parallelized using our Fortran M implementation of the mesh archetype. Be-
cause of export-control constraints, we were able to obtain performance data
for Version C only on a network of workstations. Table 1, Table 2, Table 3,
and Table 4 show execution times and speedups(4) for Version C, executing on a network of Sun workstations. Figures 7 and 8 show execution times and
speedups for Version A, executing on an IBM SP. The fall-off of performance for
more than 4 processors in Figure 7 is probably due to the ratio of computation
to communication falling below that required to give good performance. Unsur-
prisingly, performance for the larger problem size (Figure 8) scales acceptably
for a larger number of processors than performance for the smaller problem size
(I i, i!.- 7), but also falls off when the ratio of computation to communication
decreases below that required to give good performance.
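The computation-to-communication argument can be made concrete with a back-of-the-envelope model. The sketch below is our own illustration, not an analysis from the paper; it assumes a one-dimensional block decomposition of an n by n by n grid over P processes, so per-step computation scales with the local volume n^3/P while the per-step boundary exchange scales with a face of n^2 cells, and the ratio therefore shrinks linearly as P grows:

```python
def comp_to_comm_ratio(n, p):
    """Cells updated per process per step, divided by cells exchanged per
    shared face per step, for a 1-D slab decomposition of an n^3 grid."""
    local_volume = n ** 3 / p   # computation per process per step
    face_area = n ** 2          # communication per shared face per step
    return local_volume / face_area   # simplifies to n / p

# Ratios for the two Version A grid sizes as P grows: the smaller grid
# crosses any fixed break-even threshold at a smaller processor count.
for n in (34, 66):
    print(n, [comp_to_comm_ratio(n, p) for p in (1, 2, 4, 8, 16)])
```

Under this (admittedly crude) model, the 34-cubed problem reaches a given ratio at roughly half the processor count at which the 66-cubed problem does, consistent with the earlier fall-off observed for the smaller problem size.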
Table 1: Execution times and speedups for electromagnetics code (version C), for 33 by 33 by 33 grid, 128 steps, using Fortran M on a network of Suns.

                 Execution time   Speedup
  Sequential           78.6         1.00
  Parallel, P=1       189.0         0.41
  Parallel, P=2        51.4         1.52
  Parallel, P=4        25.3         3.10

Table 2: Execution times and speedups for electromagnetics code (version C), for 65 by 65 by 65 grid, 1024 steps, using Fortran M on a network of Suns.

                 Execution time   Speedup
  Sequential         4309.5         1.00
  Parallel, P=4      1189.8         3.62

Table 3: Execution times and speedups for electromagnetics code (version C), for 46 by 36 by 36 grid, 128 steps, using Fortran M on a network of Suns.

                 Execution time   Speedup
  Sequential          123.1         1.00
  Parallel, P=1       258.5         0.47
  Parallel, P=2        65.4         1.88
  Parallel, P=4        32.5         3.78

Table 4: Execution times and speedups for electromagnetics code (version C), for 91 by 71 by 71 grid, 2048 steps, using Fortran M on a network of Suns.

3 Analysis of the values involved showed that they ranged over many orders of magnitude, so it is not surprising that the result of the summation was markedly affected by the order of summation.

4 We define speedup as execution time for the original sequential code divided by execution time for the parallelized code.

[Figure 7 plot: execution times and speedups versus number of processors (1 through 8).]

Figure 7: Execution times and speedups for electromagnetics code (version A) for 34 by 34 by 34 grid, 256 steps, using Fortran M on the IBM SP.

[Figure 8 plot: execution times (sequential, actual, ideal) and speedups (actual, perfect) versus number of processors.]

Figure 8: Execution times and speedups for electromagnetics code (version A) for 66 by 66 by 66 grid, 512 steps, using Fortran M on the IBM SP.
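As a check on the speedup definition in the footnote, the figures in Table 1 can be reproduced (up to rounding in the printed table) directly from the execution times:

```python
def speedup(t_sequential, t_parallel):
    """Speedup as defined above: original sequential execution time
    divided by parallel execution time."""
    return t_sequential / t_parallel

# Execution times in seconds from Table 1 (33x33x33 grid, 128 steps).
t_seq = 78.6
parallel_times = {1: 189.0, 2: 51.4, 4: 25.3}

for p, t in parallel_times.items():
    print(f"P={p}: speedup = {speedup(t_seq, t)}")
```

Note that the P=1 entry is below 1.0: the archetype-based parallel code on a single processor carries communication and bookkeeping overhead absent from the original sequential code.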
Ease of use. It is difficult to define objective measures of ease of use, but our experiences in the experiments described in this chapter suggest that the parallelization methodology described herein produces results at a reasonable cost in human effort: In both cases, with unfamiliar code (about 2400 lines for Version C and 1400 lines for Version A, including comments and whitespace), we were able to perform the transformations described in Section 4.4 relatively quickly. For version C of the code, one person spent 2 days determining how to apply the mesh-archetype parallelization strategy, 8 days converting the sequential code into the sequential simulated-parallel version, and less than a day converting the sequential simulated-parallel version into a message-passing version. For version A of the code, one person spent less than a day determining how to apply the parallelization strategy, 5 days converting the sequential code into the sequential simulated-parallel version, and less than a day converting the sequential simulated-parallel version into a message-passing version.
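A key step in going from the sequential simulated-parallel version to the message-passing version is replacing assignments that cross process boundaries with matched send/receive pairs. The following Python sketch is our own illustration (the actual transformation is performed on Fortran M code); it mimics the refinement of a distributed assignment b := a, where a is owned by one process and b by another, using a queue in place of a channel:

```python
import queue
import threading

channel = queue.Queue()   # stands in for a message-passing channel
a = 42                    # variable owned by "process 0"
owned_by_p1 = {}          # holds b, the variable owned by "process 1"

def process0():
    # The right-hand side of "b := a" becomes a send on the channel.
    channel.put(a)

def process1():
    # The assignment itself becomes a receive into b.
    owned_by_p1["b"] = channel.get()

threads = [threading.Thread(target=process0),
           threading.Thread(target=process1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(owned_by_p1["b"])   # b now holds the value of a: 42
```

Because the send and receive are matched one-to-one, the net effect on the program state is exactly that of the original assignment, which is why this step of the process is mechanical once ownership of the variables has been decided.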
5 Related work
Stepwise refinement. Program development via stepwise refinement is described by many researchers, for example Back [2], Gries [14], and Hoare [17] for sequential programs, and Back [1], Martin [23], and Van de Velde [29] for parallel programs. The transformation of assignment statements into message-passing operations, which is central to Theorem 3, is treated by Hoare [18], among others.
Operational models of parallel programming. Our proof of Theorem 3 is loosely based on regarding programs as action or state-transition systems, as described in Chandy and Misra [10], Lynch and Tuttle [21], Lamport [20], Manna and Pnueli [22], and Pnueli [27]. We present a more complete model of parallel programming that supports this and other experimental work in [25].
Automatic parallelizing compilers. Much effort has gone into the development of compilers that automatically recognize potential concurrency and emit parallel code, for example Fortran D [11] and HPF [16]. We regard this work as complementary to our methodology and postulate that some of our transformations could be automated by such compilers.
Design patterns. Many researchers have investigated the use of patterns in developing algorithms and applications. Our previous work [7, 8, 9] explores a more general notion of archetypes and their role in developing both sequential and parallel programs. Gamma et al. [13] address primarily the issue of patterns of computation, in the context of object-oriented design. Our work, in contrast, also examines patterns of dataflow and communication. Schmidt [28] focuses more on parallel structure, but in a different context from our work and with less emphasis on code reuse. Brinch Hansen's work on parallel structures [5] is similar in motivation to our work, but his model programs are typically more narrowly defined than our archetypes. Other work addresses lower-level patterns, for example the use of templates to develop algorithms for linear algebra [3] and the use of templates in developing formally verified software [15].
6 Conclusions

This paper describes experiments with a methodology for parallelizing sequential programs based on applying the techniques of stepwise refinement under the guidance of a parallel programming archetype. The experiments demonstrate that the methodology produces programs that are correct (when the transformations are applied in the proper context) and reasonably efficient, at what seems to be reasonable human-effort cost. It is particularly heartening to note that the transformation for which we provide formal support (the transformation from sequential simulated-parallel to parallel) produced correct results in practice as well as in theory, with the resulting parallel programs producing results identical to those of their simulated-parallel predecessors on the first execution (and all subsequent executions).
Much work remains to be done in identifying and developing additional archetypes, formally stating and proving additional transformations ([25] presents additional examples but is not exhaustive), and providing automatic support for transformations where feasible, but we are encouraged by our results so far.
Acknowledgments

Special thanks go to Eric Van de Velde, whose book [29] inspired this work, and to John Beggs, who provided and explained the electromagnetics application that was the subject of our experiments.
References

[1] R. J. R. Back. Refinement calculus, part II: Parallel and reactive programs. In Stepwise Refinement of Distributed Systems: Models, Formalisms, Correctness, volume 430 of Lecture Notes in Computer Science, pages 67-93. Springer-Verlag, 1990.

[2] R. J. R. Back and J. von Wright. Refinement calculus, part I: Sequential nondeterministic programs. In Stepwise Refinement of Distributed Systems: Models, Formalisms, Correctness, volume 430 of Lecture Notes in Computer Science, pages 42-66. Springer-Verlag, 1990.

[3] R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, 1994.

[4] J. H. Beggs, R. J. Luebbers, D. Steich, H. S. Langdon, and K. S. Kunz. User's manual for three-dimensional FDTD version C code for scattering from frequency-dependent dielectric and magnetic materials. Technical report, The Pennsylvania State University, July 1992.

[5] P. Brinch Hansen. Model programs for computational science: A programming methodology for multicomputers. Concurrency: Practice and Experience, 5(5):407-423, 1993.

[6] R. M. Butler and E. L. Lusk. Monitors, messages, and clusters: the p4 parallel programming system. Parallel Computing, 20(4):547-564, 1994.

[7] K. M. Chandy. Concurrent program archetypes. In Proceedings of the Scalable Parallel Library Conference, 1994.

[8] K. M. Chandy, R. Manohar, B. L. Massingill, and D. I. Meiron. Integrating task and data parallelism with the group communication archetype. In Proceedings of the 9th International Parallel Processing Symposium, 1995.

[9] K. M. Chandy and B. L. Massingill. Parallel program archetypes. Technical Report CS-TR-96-28, California Institute of Technology, 1997.

[10] K. M. Chandy and J. Misra. Parallel Program Design: A Foundation. Addison-Wesley, 1989.

[11] K. D. Cooper, M. W. Hall, R. T. Hood, K. Kennedy, K. S. McKinley, J. M. Mellor-Crummey, L. Torczon, and S. K. Warren. The ParaScope parallel programming environment. Proceedings of the IEEE, 82(2):244-263, 1993.

[12] I. T. Foster and K. M. Chandy. Fortran M: A language for modular parallel programming. Journal of Parallel and Distributed Computing, 26(1):24-35, 1995.

[13] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.

[14] D. Gries. The Science of Programming. Springer-Verlag, 1981.

[15] D. Hemer and P. Lindsay. Reuse of verified design templates through extended pattern matching. Technical Report 97-03, Software Verification Research Centre, School of Information Technology, The University of Queensland, 1997. To appear in Proceedings of Formal Methods Europe.

[16] High Performance Fortran Forum. High Performance Fortran language specification, version 1.0. Scientific Programming, 2(1-2):1-170, 1993.

[17] C. A. R. Hoare. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576-583, 1969.

[18] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, 1985.

[19] K. S. Kunz and R. J. Luebbers. The Finite Difference Time Domain Method for Electromagnetics. CRC Press, 1993.

[20] L. Lamport. A temporal logic of actions. ACM Transactions on Programming Languages and Systems, 16(3):872-923, 1994.

[21] N. A. Lynch and M. R. Tuttle. Hierarchical correctness proofs for distributed algorithms. In Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing, 1987.

[22] Z. Manna and A. Pnueli. Completing the temporal picture. Theoretical Computer Science, 83(1):97-130, 1991.

[23] A. J. Martin. Compiling communicating processes into delay-insensitive VLSI circuits. Distributed Computing, 1(4):226-234, 1986.

[24] B. Massingill. The mesh archetype. Technical Report CS-TR-96-25, California Institute of Technology, 1997. Also available via http://

[25] B. Massingill. A structured approach to parallel programming (Ph.D. thesis). Technical Report CS-TR-98-04, California Institute of Technology, 1998.

[26] P. Pierce. The NX message-passing interface. Parallel Computing, 20(4):463-480, 1994.

[27] A. Pnueli. The temporal semantics of concurrent programs. Theoretical Computer Science, 13:45-60, 1981.

[28] D. C. Schmidt. Using design patterns to develop reusable object-oriented communication software. Communications of the ACM, 38(10):65-74, 1995.

[29] E. F. Van de Velde. Concurrent Scientific Computing. Springer-Verlag, 1994.