Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: A Pattern language for parallel application programs
 Material Information
Title: A Pattern language for parallel application programs
Series Title: Department of Computer and Information Science and Engineering Technical Report 99-022
Physical Description: Book
Language: English
Creator: Massingill, Berna L.
Mattson, Timothy G.
Sanders, Beverly
Affiliation: University of Florida -- Department of Computer and Information Science and Engineering
Parallel Algorithms Laboratory
Publisher: Department of Computer and Information Science and Engineering, University of Florida
Place of Publication: Gainesville, Fla.
Copyright Date: 1999
 Record Information
Bibliographic ID: UF00095459
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.



A Pattern Language for Parallel Application Programs*

UF CISE Technical Report 99-022

Berna L. Massingill†   Timothy G. Mattson‡   Beverly A. Sanders§

A design pattern is a description of a high-quality solution to a fre-
quently occurring problem in some domain. A pattern language is a col-
lection of design patterns that are carefully organized to embody a design
methodology. A designer is led through the pattern language, at each step
choosing an appropriate pattern, until the final design is obtained in terms
of a web of patterns. This paper describes a pattern language for parallel
application programs. The current version of the pattern language can be
viewed online. The
goal of our pattern language is to lower the barrier to parallel program-
ming by guiding a programmer through the entire process of developing a
parallel program. The main target audience is experienced programmers
who may lack experience with parallel programming. The programmer
brings to the process a good understanding of the actual problem to be
solved, then works through the pattern language to obtain a detailed
parallel design or possibly working code. In this paper, we describe the
pattern language, present an example program, and sketch a case study
illustrating the design process using the pattern language.

*We acknowledge the support of Intel Corporation, the National Science Foundation, and
the Air Force Office of Scientific Research.
†Department of Computer and Information Science and Engineering, University of Florida,
Gainesville, FL (current address: Department of Computer Science, Trinity
University, San Antonio, TX).
‡Parallel Algorithms Laboratory, Intel Corporation.
§Department of Computer and Information Science and Engineering, University of Florida,
Gainesville, FL.

1 Introduction

Parallel hardware has been available for decades, and is becoming increasingly
mainstream. Parallel software that fully exploits the hardware is much rarer,
however, and mostly limited to the specialized area of supercomputing. Part
of the reason for this state of affairs could be that most parallel programming
environments, which focus on the implementation of concurrency rather than
higher-level design issues, are simply too difficult for most programmers to risk
using them.
A design pattern describes, in a prescribed format, a high-quality solution to
a frequently occurring problem in some domain. The format is chosen to make
it easy for the reader to quickly understand both the problem and the proposed
solution. Because the pattern has a name, a collection of patterns provides a
vocabulary with which to talk about these solutions.
A structured collection of design patterns is called a pattern language. A
pattern language is more than just a catalog of patterns: The structure of the
pattern language is chosen to lead the user through the collection of patterns in
such a way that complex systems can be designed using the patterns. At each
decision point, the designer selects an appropriate pattern. Each pattern leads
to other patterns, resulting in a final design in terms of a web of patterns. Thus a
pattern language embodies a design methodology and provides domain-specific
advice to the application designer. (In spite of the overlapping terminology, a
pattern language is not a programming language.)
The first pattern language was elaborated by Alexander [1] and addressed
problems from architecture, landscaping, and city planning. More recently,
design patterns were introduced to the software engineering community by
Gamma, Helm, Johnson, and Vlissides [8] via a collection of design patterns
for object-oriented programming.
This paper describes a pattern language for parallel application programs.
The current state of the pattern language can be viewed at http://www.cise. The pattern language is extensively
hyperlinked, allowing the programmer to work through it by following links. The
goal is to lower the barrier to parallel programming by guiding a programmer
through the entire process of developing a parallel program. The main target
audience is experienced programmers who may lack experience with parallel
programming. The programmer brings to the process a good understanding
of the actual problem to be solved, then works through the pattern language,
eventually obtaining a detailed parallel design or possibly working code.
In the next section, we give a brief overview of the organization of the pattern
language. In subsequent sections we give the text of one of the patterns and
then sketch a simple case study to illustrate the design process using the pattern
language. We close with brief descriptions of related approaches.

2 Organization of the pattern language

The pattern language is organized into four design spaces (FindingConcurrency,
AlgorithmStructure, SupportingStructures, and ImplementationMechanisms),
which form a linear hierarchy, with FindingConcurrency at the top
and ImplementationMechanisms at the bottom.

2.1 FindingConcurrency

This design space is concerned with structuring the problem to expose exploitable
concurrency. The designer working at this level focuses on high-level algorithmic
issues and reasons about the problem to expose potential concurrency. The
major patterns in this design space are described in this section; additional
patterns in the space and the relationships among them are illustrated in Figure 1.



Figure 1: Organization of patterns in the FindingConcurrency design space.

GettingStarted. This pattern helps the programmer answer the question
of where to start when setting out to solve a problem in parallel.

DecompositionStrategy. This pattern addresses the question of whether
it is best to decompose the problem based on a data decomposition, a task
decomposition, or a combination of the two.

DependencyAnalysis. Once the programmer has identified the entities
into which the problem is to be decomposed, this pattern helps him or her
understand how they depend on each other.

Recap. This pattern is a consolidation pattern; it is used to bring together
the results from the other patterns in this design space and prepare
the programmer for the next design space, the AlgorithmStructure design
space.
2.2 AlgorithmStructure

This design space is concerned with structuring the algorithm to take advantage
of potential concurrency. That is, the designer working at this level reasons
about how to use the concurrency exposed in the previous level. Patterns in this
space describe overall strategies for exploiting concurrency. Selected patterns
in this space have been described in [14].
Patterns in this design space can be divided into the following three groups,
plus the ChooseStructure pattern, which addresses the question of how to select
an appropriate pattern from those in this space. Figure 2 illustrates how the
programmer would navigate through the space.


Figure 2: Organization of the AlgorithmStructure design space.

2.2.1 "Organize by ordering" patterns

These patterns are used when the ordering of groups of tasks is the major
organizing principle for the parallel algorithm. This group has two members:
one represents "regular" orderings that do not change during the algorithm;
the other represents "irregular" orderings that are more dynamic and unpredictable.

PipelineProcessing. The problem is decomposed into ordered groups
of tasks connected by data dependencies.

AsynchronousComposition. The problem is decomposed into groups
of tasks that interact through asynchronous events.

2.2.2 "Organize by tasks" patterns

These patterns are those for which the tasks themselves are the best organizing
principle. There are many ways to work with such "task-parallel" problems,
making this the largest pattern group.

EmbarrassinglyParallel. The problem is decomposed into a set of in-
dependent tasks. Most algorithms based on task queues and random sam-
pling are instances of this pattern.

SeparableDependencies. The parallelism is expressed by splitting up
tasks among units of execution (threads or processes). Any dependencies
between tasks can be pulled outside the concurrent execution by replicat-
ing the data prior to the concurrent execution and then combining the
replicated data after the concurrent execution. This pattern works when
variables involved in data dependencies are written but not subsequently
read during the concurrent execution.

ProtectedDependencies. The parallelism is expressed by splitting up
tasks among units of execution. In this case, however, variables involved
in data dependencies are both read and written during the concurrent
execution and thus cannot be pulled outside the concurrent execution but
must be managed during the concurrent execution of the tasks.

DivideAndConquer. The problem is solved by recursively dividing it
into subproblems, solving each subproblem independently, and then re-
combining the subsolutions into a solution to the original problem.
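
As a rough illustration of the replicate-then-combine idea behind SeparableDependencies, the following Python sketch gives each unit of execution its own private accumulator and merges the private copies only after the concurrent phase. The histogram problem, the function names, and the use of a thread pool are assumptions made for this example; they are not taken from the pattern language itself.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(chunk):
    """Task body: writes only to its own private (replicated) counter."""
    local = Counter()
    for item in chunk:
        local[item] += 1
    return local

def parallel_histogram(items, num_ues=4):
    """Replicate the accumulator per chunk, then combine after the parallel phase."""
    chunks = [items[i::num_ues] for i in range(num_ues)]
    with ThreadPoolExecutor(max_workers=num_ues) as pool:
        partials = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partials:      # sequential combine step
        total += partial
    return total
```

Because no task reads the shared result during the concurrent phase, the dependency on the final histogram is handled entirely by the combine step.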

2.2.3 "Organize by data" patterns

These patterns are those for which the decomposition of the data is the major
organizing principle in understanding the concurrency. There are two patterns
in this group, differing in how the decomposition is structured (linearly in each
dimension or recursively).

GeometricDecomposition. The problem space is decomposed into dis-
crete subspaces; the problem is then solved by computing solutions for
the subspaces, with the solution for each subspace typically requiring data
from a small number of other subspaces. Many instances of this pattern
can be found in scientific computing, where it is useful in parallelizing
grid-based computations, for example.

RecursiveData. The problem is defined in terms of following links
through a recursive data structure.

2.3 SupportingStructures
This design space represents an intermediate stage between the AlgorithmStructure
and ImplementationMechanisms design spaces. While it is sometimes possible
to go directly from the AlgorithmStructure space to an implementation for
a target programming environment, it often makes sense to construct the implementation
in terms of other patterns that constitute an intermediate stage between
the problem-oriented patterns of the AlgorithmStructure design space and
the machine-oriented "patterns" of the ImplementationMechanisms space. Two
important groups of patterns in this space are those that represent program-structuring
constructs (such as SPMD and ForkJoin) and those that represent
commonly used shared data structures (such as SharedQueue). In some cases,
a library may be available that contains a ready-to-use implementation of the
data structures. If so, the pattern describes how to use the data structure;
otherwise, the pattern describes how to implement it.

2.4 ImplementationMechanisms

This design space is concerned with how the patterns of the higher-level spaces
are mapped into particular programming environments. We use it to provide
pattern-based descriptions of common mechanisms for process/thread management
(e.g., creating or destroying processes/threads) and process/thread interaction
(e.g., monitors, semaphores, barriers, or message passing). Patterns in
this design space, like those in the SupportingStructures space, describe entities
that strictly speaking are not patterns at all. We include them in our pattern
language anyway, however, to provide a complete path from problem description
to code, and we document them using our pattern notation for the sake of
consistency.
3 An example pattern

This section contains the text of the EmbarrassinglyParallel pattern (from the
AlgorithmStructure design space) to illustrate the format of a design pattern.
Although not every pattern will contain exactly the sections shown in this example,
it is a representative example of the pattern format we use, adapted from
[8]. Comments indicating the general purpose of each section are given in
italics at the beginning of the section. These comments have been included
for the benefit of the reader of this paper; they are not part of the actual pattern.
Underlined words represent hyperlinks in the actual pattern. These could
be links to a definition, another pattern, example code, a document containing
more detailed analysis, etc. The pattern is split into two parts, a main pattern
(presented as Section 3.1) and a supporting document
containing longer and more detailed examples (presented as Section 3.2).

3.1 The main pattern

Pattern Name: EmbarrassinglyParallel

Intent:

(Briefly states the problem solved by this pattern.)
This pattern is used to describe concurrent execution by a collection of independent
tasks. Parallel algorithms that use this pattern are called embarrassingly
parallel because once the tasks have been defined the potential concurrency is
obvious.
Also Known As:

(Lists other names by which this pattern is commonly known.)

Master-Worker, Task Queue.


Motivation:

(Gives the context for the pattern, i.e., why a designer would use the pattern
and what background information should be kept in mind when using it.)
Consider an algorithm that can be decomposed into many independent tasks.
Such an algorithm, often called an "embarrassingly parallel" algorithm, contains
obvious concurrency that is trivial to exploit once these independent tasks have
been defined, because of the independence of the tasks. Nevertheless, while the
source of the concurrency is often obvious, taking advantage of it in a way that
makes for efficient execution can be difficult.
The EmbarrassinglyParallel pattern shows how to organize such a collection
of tasks so they execute efficiently. The challenge is to organize the computation
so that all units of execution (UEs) finish their work at about the same time,
that is, so that the computational load is balanced among processors. Figure 3
illustrates the problem.
This pattern automatically and dynamically balances the load as necessary.
With this pattern, faster or less-loaded UEs automatically do more work. When
the amount of work required for each task cannot be predicted ahead of time,
this pattern produces a statistically optimal solution.
Examples of this pattern include the following:

Vector addition (considering the addition of each pair of elements as a
separate task).

Ray-tracing codes such as the medical-imaging example described in the
DecompositionStrategy pattern. Here the computation associated with
each "ray" becomes a separate task.


Figure 3: Load balance with the EmbarrassinglyParallel pattern (panels: tasks assigned to 4 UEs with poor load balance; tasks assigned to 4 UEs with good load balance).

Database searches in which the problem is to search for an item meeting
specified criteria in a database that can be partitioned into subspaces that
can be searched concurrently. Here the searches of the subspaces are the
independent tasks.

Branch-and-bound computations, in which the problem is solved by re-
peatedly removing a solution space from a list of such spaces, examining
it, and either declaring it a solution, discarding it, or dividing it into
smaller solution spaces that are then added to the list of spaces to examine.
Such computations can be parallelized using this pattern by making
each "remove and process a solution space" step a separate task.

As these examples illustrate, this pattern allows for a fair amount of vari-
ation: The tasks can all be roughly equal in size, or they can vary in size.
Also, for some problems (the database search, for example), it may be possible
to solve the problem without executing all the tasks. Finally, for some problems
(branch-and-bound computations, for example), new tasks may be created
during execution of other tasks.
Observe that although frequently the source of the concurrency is obvious
(hence the name of the pattern), this pattern also applies when the source of the
concurrency requires some insight to discover; the distinguishing characteristic
of problems using this pattern is the complete independence of the tasks.
More formally, the EmbarrassinglyParallel pattern is applicable when what
we want to compute is a solution(P) such that

solution(P) =
f(subsolution(P, 0),
subsolution(P, 1), ...,
subsolution(P, N-1))

such that for i and j different, subsolution(P, i) does not depend on subsolu-
tion(P, j). That is, the original problem can be decomposed into a number of
independent subproblems such that we can solve the whole problem by solv-
ing all of the subproblems and then combining the results. We could code a
sequential solution thus:

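    Problem P;
    Solution subsolutions[N];
    Solution solution;
    /* each iteration computes one independent subsolution */
    for (i = 0; i < N; i++) {
        subsolutions[i] =
            compute_subsolution(P, i);
    }
    /* combine the subsolutions, as in the definition of solution(P) */
    solution =
        f(subsolutions);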

If function compute_subsolution modifies only local variables, it is straightforward
to show that the sequential composition implied by the for loop in the
preceding program can be replaced by any combination of sequential and par-
allel composition without affecting the result. That is, we can partition the
iterations of this loop among available UEs in whatever way we choose, so long
as each is executed exactly once.
This is the EmbarrassinglyParallel pattern in its simplest form: all the subproblems
are defined before computation begins, and each subsolution is saved
in a distinct variable (array element), so the computation of the subsolutions is
completely independent. These computations of subsolutions then become the
independent tasks of the pattern as described earlier.
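
A minimal Python sketch of this simplest form follows. It assumes a hypothetical compute_subsolution that squares one input element and uses sum as a stand-in for the combining function f; the thread pool is just one convenient way to spread the loop iterations over UEs, and in CPython it illustrates the structure rather than guaranteeing a speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_subsolution(problem, i):
    """Hypothetical task: depends only on its own index, touches no shared state."""
    return problem[i] ** 2

def solve(problem, num_ues=4):
    n = len(problem)
    with ThreadPoolExecutor(max_workers=num_ues) as pool:
        # map() returns results in index order, one slot per subsolution,
        # regardless of which UE ran which iteration.
        subsolutions = list(
            pool.map(lambda i: compute_subsolution(problem, i), range(n))
        )
    return sum(subsolutions)      # stands in for the combining function f
```

Each iteration runs exactly once and writes to its own result slot, which is the condition the preceding paragraphs rely on.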
There are also some variations on this basic theme:

Subsolutions accumulated in a shared data structure. One such variation
differs from the simple form in that it accumulates subsolutions in a shared
data structure (a set, for example, or a running sum). Computation of sub-
solutions is no longer completely independent (since access to the shared
data structure must be synchronized), but concurrency is still possible if
the order in which subsolutions are added to the shared data structure
does not affect the result.

Termination condition other than "all tasks complete". In the simple form
of the pattern, all tasks must be completed before the problem can be regarded
as solved, so we can think of the parallel algorithm as having
the termination condition "all tasks complete". For some problems, however,
it may be possible to obtain an overall solution without solving all
the subproblems. For example, if the whole problem consists of determining
whether a large search space contains at least one item meeting
given search criteria, and each subproblem consists of searching a subspace
(where the union of the subspaces is the whole space), then the computation
can stop as soon as any subspace is found to contain an item meeting
the search criteria. As in the simple form of the pattern, each computation
of a subsolution becomes a task, but now the termination condition
is something other than "all tasks completed". This can also be made
to work, although care must be taken either to ensure that the desired
termination condition will actually occur or to make provision for the case
in which all tasks are completed without reaching the desired condition.

Not all subproblems known initially. A final and more complicated variation
differs in that not all subproblems are known initially; that is, some
subproblems are generated during solution of other subproblems. Again,
each computation of a subsolution becomes a task, but now new tasks
can be created "on the fly". This imposes additional requirements on the
part of the program that keeps track of the subproblems and which of
them have been solved, but these requirements can be met without too
much trouble, for example by using a thread-safe shared task queue. The
trickier problem is ensuring that the desired termination condition ("all
tasks completed" or something else) will eventually be met.

What all of these variations have in common, however, is that they meet
the pattern's key restriction: It must be possible to solve the subproblems into
which we partition the original problem independently. Also, if the subsolution
results are to be collected into a shared data structure, it must be the case that
the order in which subsolutions are placed in this data structure does not affect
the result of the computation.
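
The first variation, subsolutions accumulated in a shared data structure, can be sketched in Python as follows. The running sum of squares, the function name, and the choice of a lock are assumptions made for this example; the point is only that access to the shared accumulator is synchronized and that addition gives the same result in any order.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def sum_of_squares(values, num_ues=4):
    total = 0
    lock = threading.Lock()

    def task(v):
        nonlocal total
        sub = v * v               # independent part of the task
        with lock:                # synchronized access to the shared running sum
            total += sub

    with ThreadPoolExecutor(max_workers=num_ues) as pool:
        list(pool.map(task, values))   # consume the iterator so errors surface
    return total
```

An accumulation whose result depended on arrival order (appending to a list that is later read positionally, for instance) would not satisfy the restriction stated above.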


Applicability:

(Gives a high-level discussion of when this pattern can be used.)
Use the EmbarrassinglyParallel pattern when:

The problem consists of tasks that are known to be independent; that is,
there are no data dependencies between tasks (aside from those described
in "Subsolutions accumulated in a shared data structure" above).

This pattern can be particularly effective when:

The startup cost for initiating a task is much less than the cost of the task
itself.

The number of tasks is much greater than the number of processors to be
used in the parallel computation.

The effort required for each task or the processing performance of the processors
varies unpredictably. This unpredictability makes it very difficult
to produce an optimal static work distribution.

Structure:

(Describes how the participants interact to implement this pattern.)
Implementations of this pattern include the following key elements:

A mechanism to define a set of tasks and schedule their execution onto a
set of UEs.

A mechanism to detect completion of the tasks and terminate the computation.

Usage:
(Describes how the pattern is used. Other patterns are referenced as applicable
to explain how it can be used to solve larger problems.)
This pattern is typically used to provide high-level structure for an application;
that is, the application is typically structured as an instance of this pattern. It
can also be used in the context of a simple sequential control structure such as
sequential composition, if-then-else, or a loop construct. An example is given in
our overview paper [13], where the program as a whole is a simple loop whose
body contains an instance of this pattern.


Consequences:

(Gives the designer the information needed to make intelligent design tradeoffs.)
The EmbarrassinglyParallel pattern has some powerful benefits, but also a significant
restriction.

Parallel programs that use this pattern are among the simplest of all parallel
programs. If the independent tasks correspond to individual loop iterations
and these iterations do not share data dependencies, parallelization
can be easily implemented with a parallel loop directive.

With some care on the part of the programmer, it is possible to implement
programs with this pattern that automatically and dynamically balance the
load between units of execution. This makes the EmbarrassinglyParallel
pattern popular for programs designed to run on parallel computers built
from networks of workstations.

This pattern is particularly valuable when the effort required for each task
varies significantly and unpredictably. It also works particularly well on
heterogeneous networks, since faster or less-loaded processors naturally
take on more of the work.

The downside, of course, is that the whole pattern breaks down when the
tasks need to interact during their computation. This limits the number
of applications where this pattern can be used.


Implementation:

(Explains how to implement the pattern, usually in terms of patterns from lower-level
design spaces.)
There are many ways to implement this pattern. If all the tasks are of the
same size, all are known a priori, and all must be completed (the simplest
form of the pattern), the pattern can be implemented by simply dividing the
tasks among units of execution using a parallel loop directive. Otherwise, it is
common to collect the tasks into a queue (the task queue) shared among UEs.
This task queue can then be implemented using the SharedQueue pattern. The
task queue, however, can also be represented by a simpler structure such as a
shared counter.

Key elements.

(Describes elements defined in the Structure section.)

A set of tasks is represented and scheduled for execution on multiple units of
execution (UEs). Frequently the tasks correspond to iterations of a loop. In
this case we implement this pattern by splitting the loop between multiple UEs.
The key to making algorithms based on this pattern run well is to schedule their
execution so the load is balanced between the UEs. The schedule can be:

Static. In this case the distribution of iterations among the UEs is determined
once, at the start of the computation. This might be an effective
strategy when the tasks have a known amount of computation and the
UEs are running on systems with a well-known and stable load. In other
words, a static schedule works when you can statically determine how
many iterations to assign to each UE in order to achieve a balanced load.
Common options are to use a fixed interleaving of tasks between UEs, or a
blocked distribution in which blocks of tasks are defined and distributed,
one to each UE.

Dynamic. Here the distribution of iterations varies between UEs as the
computation proceeds. This strategy is used when the effort associated
with each task is unpredictable or when the available load that can be
supported by each UE is unknown and potentially changing. The most
common approach used for dynamic load balancing is to define a task
queue to be used by all the UEs; when a UE completes its current task
and is therefore ready to process more work, it removes a task from the
task queue. Faster UEs or those receiving lighter-weight tasks will go to
the queue more often and automatically grab more tasks.
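
The two static options mentioned above, fixed interleaving and blocked distribution, come down to simple index arithmetic. The following Python sketch shows one way to compute them; the function names and the rule that block sizes differ by at most one are assumptions for this illustration.

```python
def interleaved_schedule(num_tasks, num_ues):
    """Fixed interleaving: task i goes to UE (i mod num_ues)."""
    return [list(range(ue, num_tasks, num_ues)) for ue in range(num_ues)]

def blocked_schedule(num_tasks, num_ues):
    """Blocked distribution: one contiguous block per UE, sizes differ by at most one."""
    base, extra = divmod(num_tasks, num_ues)
    schedule, start = [], 0
    for ue in range(num_ues):
        size = base + (1 if ue < extra else 0)
        schedule.append(list(range(start, start + size)))
        start += size
    return schedule
```

With 10 tasks and 4 UEs, the interleaved schedule gives UE 0 the tasks 0, 4, and 8, whereas the blocked schedule gives it tasks 0, 1, and 2. A dynamic schedule has no such table; the assignment emerges at run time from the task queue.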

Implementation techniques include parallel loops and master-worker and SPMD
versions of a task-queue approach.

PARALLEL LOOP. If the computation fits the simplest form of the pattern
(all tasks the same size, all known a priori, and all required to be completed),
they can be scheduled by simply setting up a parallel loop that divides them
equally (or as equally as possible) among the available units of execution.

MASTER-WORKER OR SPMD. If the computation does not fit the simplest
form of the pattern, the most common implementation involves some form of a
task queue. Frequently this is done using two types of processes, master and
worker. There is only one master process; it manages the computation by:

Setting up or otherwise managing the workers.

Creating and managing a collection of tasks (the task queue).

Consuming results.

There can be many worker processes; each contains some type of loop that

Removes the task at the head of the queue.

Carries out the indicated computation.

Returns the result to the master.

Frequently the master and worker processes form an instance of the ForkJoin
pattern, with the master process forking off a number of workers and waiting
for them to complete.
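
The master and worker roles listed above can be sketched with Python threads and the standard queue module. This is a sketch under assumptions: the function name master_worker is hypothetical, threads stand in for processes, and all tasks are known before the workers start, so an empty queue reliably means there is no work left.

```python
import queue
import threading

def master_worker(tasks, compute, num_workers=4):
    task_queue = queue.Queue()
    results = queue.Queue()
    for task_id, task in enumerate(tasks):         # master creates the task queue
        task_queue.put((task_id, task))

    def worker():
        while True:
            try:
                task_id, task = task_queue.get_nowait()   # remove task at the head
            except queue.Empty:
                return                                     # no work left: terminate
            results.put((task_id, compute(task)))          # return result to master

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()                                          # fork
    for w in workers:
        w.join()                                           # join
    collected = {}
    while not results.empty():                             # master consumes results
        task_id, value = results.get()
        collected[task_id] = value
    return [collected[i] for i in range(len(tasks))]
```

Faster workers simply return to the queue more often, which is how the dynamic schedule described earlier arises without any explicit assignment table.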
A common variation is to use an SPMD program with a global counter to
implement the task queue. This form of the pattern does not require an explicit
master.
Termination can be implemented in a number of ways.
If the program is structured using the ForkJoin pattern, the workers can
continue until the termination condition is reached, checking for an empty task
queue (if the termination condition is "all tasks completed") or for some other
desired condition. As each worker detects the appropriate condition, it terminates;
when all have terminated, the master continues with any final combining
of results generated by the individual tasks.
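
For a termination condition other than "all tasks completed", workers can check a shared flag in addition to the empty queue. The Python sketch below applies this to a search over subspaces; the function name, the use of threading.Event as the flag, and the reliance on CPython's thread-safe list.append are assumptions of this example.

```python
import queue
import threading

def parallel_search(subspaces, predicate, num_workers=4):
    """Return some item satisfying predicate, or None if no subspace has one."""
    tasks = queue.Queue()
    for sub in subspaces:
        tasks.put(sub)
    found = threading.Event()
    answer = []                       # relies on list.append being thread-safe in CPython

    def worker():
        while not found.is_set():     # the "other desired condition"
            try:
                sub = tasks.get_nowait()
            except queue.Empty:
                return                # all subspaces taken and nothing found here
            for item in sub:
                if predicate(item):
                    answer.append(item)
                    found.set()
                    return

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return answer[0] if answer else None
```

The None result covers the case in which all tasks complete without the desired condition ever being reached.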
Another approach is for the master or a worker to check for the desired
termination condition and, when it is detected, create a "poison pill", a special
task that tells all the other workers to terminate.
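
One possible realization of the poison pill in Python is a sentinel object placed on the task queue after the real tasks. In this sketch (names hypothetical) a worker that receives the pill puts it back before exiting, so a single pill eventually reaches every worker; it assumes a FIFO queue, so the pill is seen only after all real tasks have been taken.

```python
import queue
import threading

POISON_PILL = object()    # sentinel task that tells a worker to stop

def run_with_poison_pill(tasks, compute, num_workers=3):
    task_queue = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            task = task_queue.get()           # blocks until a task or the pill arrives
            if task is POISON_PILL:
                task_queue.put(POISON_PILL)   # pass the pill on to the other workers
                return
            results.put(compute(task))

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for task in tasks:
        task_queue.put(task)
    task_queue.put(POISON_PILL)               # master signals termination
    for w in workers:
        w.join()
    return sorted(results.get() for _ in range(len(tasks)))
```

A common alternative is to enqueue one pill per worker instead of re-queuing a single pill.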

Correctness considerations.

(Summarizes key results from the supporting theory, ideally in the form of guide-
lines, that if followed will result in a correct algorithm.)
The keys to exploiting available concurrency while maintaining program cor-
rectness (for the problem in its simplest form) are as follows.

Solve subproblems independently. Computing the solution to one subproblem
must not interfere with computing the solution to another subproblem.
This can be guaranteed if the code that solves each subproblem does not
modify any variables shared between units of execution (UEs).

Solve each subproblem exactly once. This is almost trivially guaranteed
if static scheduling is used (i.e., if the tasks are scheduled via a parallel
loop). It is also easily guaranteed if the parallel algorithm is structured
as follows:

A task queue is created as an instance of a thread-safe shared data
structure such as SharedQueue, with one entry representing each task.
A collection of UEs execute concurrently; each repeatedly removes a
task from the queue and solves the corresponding subproblem.
When the queue is empty and each UE finishes the task it is currently
working on, all the subsolutions have been computed, and the
algorithm can proceed to the next step, combining them. (This also
means that if a UE finishes a task and finds the task queue empty,
it knows that there is no more work for it to do, and it can take
appropriate action: terminating if there is a master UE that will
take care of any combining of subsolutions, for example.)

Correctly save subsolutions. This is trivial if each subsolution is saved in
a distinct variable, since there is then no possibility that the saving of one
subsolution will affect subsolutions computed and saved by other tasks.

Correctly combine subsolutions. This can be guaranteed by ensuring that
the code to combine subsolutions does not begin execution until all sub-
solutions have been computed as discussed above.

The variations mentioned earlier impose additional requirements:

Subsolutions accumulated in a shared data structure. If the subsolutions
are to be collected into a shared data structure, then the implementation
must guarantee that concurrent access does not damage the shared data
structure. This can be ensured by implementing the shared data structure
as an instance of a "thread-safe" pattern.

Termination condition other than "all tasks complete". Then the implementation
must guarantee that each subsolution is computed at most once
(easily done by using a task queue as described earlier) and that the computation
detects the desired termination condition and terminates when
it is found. This is more difficult but still possible.

Not all subproblems known initially. Then the implementation must guarantee
that each subsolution is computed exactly once, or at most once
(depending on the desired termination condition). Also, the program designer
must ensure that the desired termination condition will eventually
be reached. For example, if the termination condition is "all tasks completed",
then the pool generated must be finite, and each individual task
must terminate. Again, a task queue as described earlier solves some of the
problems; it will be safe for worker UEs to add as well as remove elements.
Detecting termination of the computation is more difficult, however. It
is not necessarily the case that when a worker finishes a task and finds
the task queue empty there is no more work to do: another worker
could generate a new task. One must therefore ensure that the task queue
is empty and all workers are finished. Further, in systems based on asynchronous
message passing, one must also ensure that there are no messages
in transit that could, on their arrival, create a new task. There are many
known algorithms that solve this problem. One that is useful in this context
is described in [Dijkstra80] [7]. Here tasks conceptually form a tree,
where the root is the master task, and the children of a task are the tasks
it generated. When a task and all its children have terminated, it noti-
fies its parent that it has terminated. When all the children of the root
have terminated, the computation has terminated. This of course requires
children to keep track of their parents and to notify them when they are
finished. Parents must also keep track of the number of active children
(the number created minus the number that have terminated). Additional
algorithms for termination detection are described in [Bertsekas89] [2].
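
The tree-based bookkeeping described above (parents counting active children, children notifying parents) can be sketched for a shared-memory setting as follows. This is not the message-passing algorithm of [7] itself, only a simplified illustration of the same idea. The names `TaskNode`, `spawn`, and `run` are hypothetical, and the task generator is invented for the example.

```python
import queue
import threading

class TaskNode:
    """Bookkeeping for one task in the conceptual task tree."""
    def __init__(self, parent, depth):
        self.parent = parent          # None for the root (master) task
        self.depth = depth            # hypothetical payload
        self.active_children = 0      # number created minus number terminated
        self.body_done = False        # True once the task's own work is done

def spawn(depth):
    # Hypothetical generator: a task of depth d > 0 creates two tasks of depth d - 1.
    return [depth - 1, depth - 1] if depth > 0 else []

def run(root_depth, num_workers=4):
    lock = threading.Lock()
    tasks = queue.Queue()
    all_done = threading.Event()
    completed = []

    def report_termination(node):
        # Caller holds `lock`. Walk up while whole subtrees have terminated.
        while node is not None and node.body_done and node.active_children == 0:
            if node.parent is None:
                all_done.set()        # root and all descendants have terminated
            else:
                node.parent.active_children -= 1
            node = node.parent

    def worker():
        while not all_done.is_set():
            try:
                node = tasks.get(timeout=0.01)
            except queue.Empty:
                continue              # an empty queue alone does not mean "finished"
            children = [TaskNode(node, d) for d in spawn(node.depth)]
            with lock:
                node.active_children += len(children)
            for child in children:
                tasks.put(child)      # workers add tasks as well as remove them
            with lock:
                completed.append(node.depth)
                node.body_done = True
                report_termination(node)

    tasks.put(TaskNode(None, root_depth))
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    all_done.wait()
    for t in threads:
        t.join()
    return completed

completed = run(root_depth=4)
```

Note that a worker that finds the queue empty keeps polling rather than exiting; only the notification that reaches the root signals that the computation has terminated.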

Efficiency considerations.
(Discusses implementation issues that affect efficiency.)

If all tasks are roughly the same length and their number is known a
priori, static scheduling (usually performed using a parallel loop directive)
is likely to be more efficient than dynamic scheduling.

If a task queue is used, put the longer tasks at the beginning of the queue
if possible. This ensures that there will be work to overlap with their
computation.


Examples:

(Provides implementations of the pattern in particular programming
environments. Higher-level patterns may use pseudocode, while lower-level
patterns should use code from one or more popular programming environments
such as MPI, OpenMP, or Java.)

Vector addition.

Consider a simple vector addition, C = A + B. As discussed earlier, we can
consider each element addition C_i = A_i + B_i as a separate task and parallelize
this computation in the form of a parallel loop:

See the section "Vector Addition" in the examples document (presented
in this paper as Section 3.2).
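
As a rough illustration of treating parts of the vector addition as independent tasks, the following Python sketch partitions the index range into contiguous blocks and submits each block as one task. The names `add_chunk` and `chunked_vector_add` are hypothetical, and the thread pool from `concurrent.futures` is an assumed stand-in for a parallel loop directive. In CPython, threads will not speed up this arithmetic; the sketch shows structure only.

```python
from concurrent.futures import ThreadPoolExecutor

def add_chunk(a, b, c, lo, hi):
    # One independent task: add the elements with indices lo .. hi-1.
    for i in range(lo, hi):
        c[i] = a[i] + b[i]

def chunked_vector_add(a, b, num_chunks):
    n = len(a)
    c = [0] * n
    # Static partition of the index range into num_chunks contiguous blocks.
    bounds = [(k * n // num_chunks, (k + 1) * n // num_chunks)
              for k in range(num_chunks)]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        futures = [pool.submit(add_chunk, a, b, c, lo, hi) for lo, hi in bounds]
        for fut in futures:
            fut.result()              # wait for the task; re-raise any error
    return c

a = list(range(10))
b = [10 * x for x in range(10)]
c = chunked_vector_add(a, b, num_chunks=3)
```

Each task writes to a disjoint range of `c`, so no synchronization is needed beyond waiting for all tasks to finish.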

Varying-length tasks.
Consider a problem consisting of N independent tasks. Assume we can map each
task onto a sequence of simple integers ranging from 0 to N - 1. Further assume
that the effort required by each task varies considerably and is unpredictable.
Several implementations are possible, including:

A master-worker implementation using a task queue. See the section
"Varying-Length Tasks, Master-Worker Implementation" in the examples
document (presented in this paper as Section 3.2).

An SPMD implementation using a task queue. See the section "Varying-Length
Tasks, SPMD Implementation" in the examples document (presented in
this paper as Section 3.2).

See our overview paper [13] for an extended example using this pattern.

Known Uses:

(Describes contexts in which the pattern has been used, including real programs,
where possible in the form of literature references.)

There are many application areas in which this pattern is useful. Many
ray-tracing codes use some form of partitioning with individual tasks
corresponding to scan lines in the final image [Bjornson91a] [4]. Applications
coded with the Linda coordination language are another rich source of examples
of this pattern [Bjornson91b] [3].

Parallel computational chemistry applications also make heavy use of this
pattern. In the quantum chemistry code GAMESS, the loops over two electron
integrals are parallelized with the TCGMSG task queue mechanism mentioned
earlier. An early version of the Distance Geometry code, DGEOM, was
parallelized with the Master-Worker form of the EmbarrassinglyParallel pattern.
These examples are discussed in [Mattson95] [15].

Related Patterns:

(Lists patterns related to this pattern. In some cases, a small change in the
parameters of the problem can mean that a different pattern is indicated; this
section notes such cases, with circumstances in which designers should use a
different pattern spelled out.)

The SeparableDependencies pattern is closely related to the
EmbarrassinglyParallel pattern. To see this relation, think of the
SeparableDependencies pattern in terms of a three-phase approach to the
parallel algorithm. In the first phase, dependencies are pulled outside a set
of tasks, usually by replicating shared data and converting it into task-local
data. In the second phase, the tasks are run concurrently as completely
independent tasks. In the final phase, the task-local data is recombined
(reduced) back into the original shared data structure.

The middle phase of the SeparableDependencies pattern is an instance of
the EmbarrassinglyParallel pattern. That is, you can think of the
SeparableDependencies pattern as a technique for converting problems into
embarrassingly parallel problems. This technique can be used in certain cases
with most of the other patterns in our pattern language. The key is that the
dependencies can be pulled outside of the concurrent execution of tasks. If
this isolation can be done, then the execution of the tasks can be handled
with the EmbarrassinglyParallel pattern.

Many instances of the GeometricDecomposition pattern (for example, "mesh
computations" in which new values are computed for each point in a grid based
on data from nearby points) can be similarly viewed as two-phase computations,
where the first phase consists of exchanging boundary information among UEs
and the second phase is an instance of the EmbarrassinglyParallel pattern in
which each UE computes new values for the points it owns.

It is also worthwhile to note that some problems in which the concurrency is
based on a geometric data decomposition are, despite the name, not instances
of the GeometricDecomposition pattern but instances of EmbarrassinglyParallel.
An example is a variant of the vector addition example presented earlier,
in which the vector is partitioned into "chunks", with computation for each
"chunk" treated as a separate task.

3.2 The supporting "examples document"

Pattern Name: EmbarrassinglyParallel
Supporting Document: Examples

Vector Addition:

The following code uses an OpenMP parallel loop directive to perform vector
addition.

!$OMP PARALLEL DO
      DO I = 1, N
         C(I) = A(I) + B(I)
      END DO


Varying-Length Tasks, Master-Worker Implementation:

The following code uses a task queue and master-worker approach to solving
the stated problem. We implement the task queue as an instance of the
SharedQueue pattern.

The master process, shown below, initializes the task queue, representing
each task by an integer. It then uses the ForkJoin pattern to create the worker
processes or threads and wait for them to complete. When they have completed,
it consumes the results.

#define Ntasks   500   /* Number of tasks */
#define Nworkers 5     /* Number of workers */

SharedQueue task_queue;         /* task queue */
Result Globalresults[Ntasks];   /* array to hold results */

void master()
{
    void Worker();

    // Create and initialize shared data structures
    task_queue = new SharedQueue();
    for (int i = 0; i < Ntasks; i++)
        enqueue(&task_queue, i);

    // Create Nworkers threads executing function Worker()
    ForkJoin(Nworkers, Worker);

    Consumetheresults(Ntasks);
}

The worker process, shown below, loops until the task queue is empty. Every
time through the loop, it grabs the next task and does the indicated work
(storing the results into a global results array). When the task queue is
empty, the worker terminates.

void Worker()
{
    int i;
    Result res;

    While (!empty(task_queue)) {
        i = dequeue(task_queue);
        res = do_lots_of_work(i);
        Globalresults[i] = res;
    }
}

Note that we ensure safe access to the key shared variable (the task queue)
by implementing it using patterns from the SupportingStructures design space.
Note also that the overall organization of the master process is an instance
of the ForkJoin pattern.

Varying-Length Tasks, SPMD Implementation:
As an example of implementing this pattern without a master process,
consider the following sample code using the TCGMSG message-passing library
(described in [Harrison91] [9]). The library has a function called NEXTVAL that
implements a global counter. An SPMD program could use this construct to
create a task-queue program as shown below.

While ((itask = NEXTVAL()) < Number_of_tasks)
    do_lots_of_work(itask);
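
The global-counter idea can be sketched in Python with threads standing in for SPMD processes. This is a minimal sketch under assumptions: `SharedCounter.next_val` is a hypothetical lock-protected counter playing the role of NEXTVAL, not the TCGMSG API, and `do_lots_of_work` is a placeholder.

```python
import threading

class SharedCounter:
    """A global counter; next_val() plays the role of NEXTVAL() in the text."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def next_val(self):
        with self._lock:              # make read-and-increment atomic
            value = self._value
            self._value += 1
            return value

def do_lots_of_work(itask):
    # Hypothetical stand-in for the real work of task itask.
    return 2 * itask

def spmd_run(number_of_tasks, num_ues):
    counter = SharedCounter()
    results = [None] * number_of_tasks
    claims = [0] * number_of_tasks    # how many times each task was claimed

    def ue():
        # Every UE runs the same loop; the counter hands out task indices.
        while True:
            itask = counter.next_val()
            if itask >= number_of_tasks:
                return
            claims[itask] += 1        # only the claiming UE touches this slot
            results[itask] = do_lots_of_work(itask)

    threads = [threading.Thread(target=ue) for _ in range(num_ues)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, claims

results, claims = spmd_run(number_of_tasks=50, num_ues=4)
```

Since each call to the counter returns a distinct value, every task index is claimed by exactly one UE, which gives the "computed exactly once" property without an explicit queue.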

4 Using the pattern language

In this section, we illustrate the process of program design using the pattern
language via a simple example. In the example, we develop a parallel version
of a global optimization algorithm using interval arithmetic, focusing on the
parallel aspects and omitting other details.

4.1 Problem description
As an example of our pattern language in action, we will look at an
algorithm that uses the properties of interval arithmetic [16] to perform
reliable global optimization.

Intervals provide an alternative representation of floating-point numbers.
Instead of a single floating-point number, a real number is represented by a
pair of numbers that bound the real number. The arithmetic operations produce
interval results that are guaranteed to bound the mathematically "real" result.
This arithmetic system is robust and safe from numerical errors associated with
the inevitable rounding that occurs with floating-point arithmetic. One can
express most functions in terms of intervals to produce interval extensions of
the functions. Values of the interval extension are guaranteed to bound the
mathematically rigorous values of the function. This fact can be used to define
a class of global optimization algorithms [17] that find rigorous global
optima. The details go well beyond the scope of this paper. The structure of
the algorithm, however, can be fully appreciated without understanding the
details.
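
To make the notion of an interval extension concrete, here is a minimal Python sketch. The `Interval` class and the function `f` are hypothetical illustrations, not taken from [16] or [17]. One important simplification: a rigorous implementation rounds lower bounds down and upper bounds up, which this sketch does not do.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # The product range is spanned by the four endpoint products.
        p = (self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi)
        return Interval(min(p), max(p))

    def contains(self, x):
        return self.lo <= x <= self.hi

def f(x):
    # Uses only * and -, so it accepts floats and Intervals alike;
    # called with an Interval it acts as an interval extension of f.
    return x * x - x

box = Interval(0.0, 1.0)
bound = f(box)                        # encloses every value f takes on [0, 1]
samples = [k / 8 for k in range(9)]   # sample points inside the box
```

Here `bound` is the interval [-1, 1], while the true range of `f` on [0, 1] is [-0.25, 0]. The enclosure is valid but not tight, which is why the optimization algorithm keeps splitting boxes.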
Consider the minimization of an objective function which contains a number
of parameters that we want to investigate to find the values that yield a
minimum value for the function. This problem is complicated by the fact that
there may be 0 or many sets of such parameter values. A value that is a
minimum over some neighborhood may in fact be larger than the values in a
nearby neighborhood.

We can visualize the problem by associating an axis in a multidimensional
plot with each variable parameter. A candidate set of parameters defines a
box in this multidimensional space. We start with a single box covering the
domain of the function. The box is tested to see if it can contain one or more
minima. If the box cannot contain a minimum value, we reject the box. If it
can contain a minimum value, we split the box into smaller sub-boxes and put
them on a list of candidate boxes. We then continue for each box on the list
until either there are no remaining boxes or the remaining boxes are
sufficiently small. Pseudocode for this (sequential) algorithm is shown in
Figure 4.

Interval_box B;
List_of_boxes L;
L = Initialize();
While (!done) {
    B = get_next_box(L);
    if (no_minima(B))
        reject(B, L);
    else
        split_and_put(B, L);
    done = termination(L);
}

Figure 4: Pseudocode for sequential optimization algorithm.
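
A one-dimensional Python sketch of this sequential loop follows. Everything specific in it is an assumption made for illustration: the objective `f`, its hand-written lower bound `f_lower_bound`, and the rejection test (a box is rejected when its lower bound exceeds the best function value seen so far) are not specified in the text, and the floating-point arithmetic is not outwardly rounded, so the result is not rigorous.

```python
def f(x):
    # Hypothetical objective with its global minimum f(1) = -1.
    return (x - 1.0) ** 2 - 1.0

def f_lower_bound(lo, hi):
    # Lower end of an interval extension of f over the box [lo, hi].
    a, b = lo - 1.0, hi - 1.0
    sq_min = 0.0 if a <= 0.0 <= b else min(a * a, b * b)
    return sq_min - 1.0

def minimize(lo, hi, tol=1e-3):
    boxes = [(lo, hi)]                    # L = Initialize()
    best = f((lo + hi) / 2.0)             # best function value seen so far
    small = []                            # boxes that are sufficiently small
    while boxes:                          # stop when no boxes remain to split
        b_lo, b_hi = boxes.pop()          # B = get_next_box(L)
        mid = (b_lo + b_hi) / 2.0
        best = min(best, f(mid))
        if f_lower_bound(b_lo, b_hi) > best:   # no_minima(B)
            continue                      # reject(B, L)
        if b_hi - b_lo <= tol:
            small.append((b_lo, b_hi))
        else:                             # split_and_put(B, L)
            boxes.extend([(b_lo, mid), (mid, b_hi)])
    # Re-apply the rejection test using the final value of `best`.
    return [b for b in small if f_lower_bound(*b) <= best], best

candidates, best = minimize(-2.0, 2.0)
```

The surviving `candidates` are small boxes that may contain the global minimizer, and `best` is an upper bound on the global minimum value.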

4.2 Parallelization using our pattern language

An experienced parallel programmer would immediately see this algorithm as
an instance of our EmbarrassinglyParallel pattern. For such a programmer,
entering our language at this pattern might be the right thing to do. A
programmer less experienced with parallel programming might need guidance from
the pattern language to arrive at this pattern.

4.2.1 Using the FindingConcurrency design space

The first step for such a programmer is to find the concurrency in the
algorithm, which is the domain of our FindingConcurrency design space. Entering
our pattern language at that level, the programmer would use the
DecompositionStrategy pattern. This pattern guides the programmer to the
conclusion that a task-based decomposition is appropriate for this problem and
that the natural unit of concurrency is the test of whether a box can contain
a minimum. This is the most computationally intensive part of the problem, and
the computation for any given box can be carried out independently of the
other boxes.

Next, the programmer needs to understand the dependencies between
concurrent tasks. Moving into the DependencyAnalysis pattern, he or she would
see that interactions between tasks occur through the list of boxes. To prevent
tasks from interfering with each other, the access to the list must be protected
so that only one task at a time can read or write to the list.

A more subtle issue that would be exposed while working with the
DependencyAnalysis pattern is the nature of the termination test. A test for
termination of the algorithm cannot take place while any of the tasks are
processing a box. This implies a partial order in the algorithm, which can be
addressed by breaking it up into two phases: a box processing phase and a
termination test phase.

With the fundamental decomposition in hand and the dependencies
identified, the programmer can decide how to structure the concurrency so it can
be exploited. Our pattern language guides this process via the Recap pattern,
which helps the programmer combine the decomposition strategy and the
dependency analysis to choose an algorithm structure. Here, this pattern would
guide the programmer to the EmbarrassinglyParallel pattern.
Most of the literature concerned with the use of patterns in software engi-
neering associates code, or constructs that will directly map onto code, with each
pattern. It is important to appreciate, however, that patterns solve problems
and that these problems are not always directly associated with code. In this
case, for example, the first three patterns have not led to any code; rather they
have helped the programmer reach an understanding of the best alternatives
for structuring a solution to the problem. It is this guidance for the algorithm
designer that is missing in most parallel programming environments.

4.2.2 Using the AlgorithmStructure design space

Having selected the EmbarrassinglyParallel pattern, the programmer next uses
it to help identify the key issues in designing the optimization algorithm. The
computation defining a task is the test on whether a box can contain minima
(i.e., the function no_minima()). To support the dual-phase structure and to
ensure that subproblem results are correctly combined, two lists are needed: a
task list and a result list.

As indicated earlier, the calculation proceeds in two phases. In the first
phase, a set of tasks executes concurrently to test whether each box in the task
list can contain any minima. If a box can, it is split into sub-boxes, and these
sub-boxes are placed on the result list. In the second phase, the result list is
checked to see if the termination condition has been satisfied. If so, the results
are printed and the computation is finished. If not, the process is repeated with
the result list becoming the task list for the next pass.
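
The two-phase structure can be sketched in Python as follows, with threads as workers and `queue.Queue` as the thread-safe lists. As before, the objective `f`, the bound `f_lower_bound`, the rejection test, and the tolerance `TOL` are assumptions introduced for the example, and boxes that are already small are simply carried over to the result list unchanged.

```python
import queue
import threading

TOL = 1e-3   # assumed size below which a box is "sufficiently small"

def f(x):
    # Hypothetical objective with its global minimum f(1) = -1.
    return (x - 1.0) ** 2 - 1.0

def f_lower_bound(lo, hi):
    # Lower end of an interval extension of f over the box [lo, hi].
    a, b = lo - 1.0, hi - 1.0
    return (0.0 if a <= 0.0 <= b else min(a * a, b * b)) - 1.0

def box_phase(in_list, best, num_workers):
    """Phase 1: workers test boxes concurrently; survivors go on the result list."""
    in_q, res_q = queue.Queue(), queue.Queue()
    for box in in_list:
        in_q.put(box)
    lock = threading.Lock()               # protects the shared value best[0]

    def worker():
        while True:
            try:
                lo, hi = in_q.get_nowait()
            except queue.Empty:
                return                    # task list exhausted
            mid = (lo + hi) / 2.0
            with lock:
                best[0] = min(best[0], f(mid))
                bound = best[0]
            if f_lower_bound(lo, hi) > bound:
                continue                  # box cannot contain a minimum
            if hi - lo <= TOL:
                res_q.put((lo, hi))       # small enough: keep unchanged
            else:
                res_q.put((lo, mid))      # split into sub-boxes
                res_q.put((mid, hi))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    res_list = []
    while not res_q.empty():              # safe: all workers have finished
        res_list.append(res_q.get())
    return res_list

def minimize(lo, hi, num_workers=4):
    best = [f((lo + hi) / 2.0)]
    in_list = [(lo, hi)]
    while True:
        res_list = box_phase(in_list, best, num_workers)
        # Phase 2: sequential termination test on the result list.
        if not res_list or all(h - l <= TOL for l, h in res_list):
            return res_list, best[0]
        in_list = res_list                # result list becomes the task list

boxes, best = minimize(-2.0, 2.0)
```

The termination test runs only after the join, so it never overlaps with box processing, which is the partial order the two-phase design is meant to enforce.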
Reviewing the "Correctness considerations" section of the
EmbarrassinglyParallel pattern, we see that both lists need to be implemented
as "thread-safe" shared data structures, since the task list will be used as a
task queue and the result list will be used to collect subsolutions. Once
again, an experienced parallel programmer would know this right away, but
someone with less experience with parallel programming would need to be guided
to this conclusion.

At this point the programmer has used our pattern language to decide the
following:

The program will be structured in two phases: box tests and termination
tests.

These two phases imply two box lists: an input list and a results list.

The major source of productive concurrency is in the box tests.

To ensure correctness, some data structures must be implemented using
thread-safe shared-data-structure patterns.

He or she is now ready to start designing code.

4.2.3 Using the SupportingStructures design space

As a first step in designing code, the programmer would probably consult the
SharedQueue pattern (in the SupportingStructures design space) to see how to use
the shared queue library component. Ideally, the library would provide exactly
what is needed, a shared-queue component that can be instantiated to provide
a queue with elements of a user-provided abstract data type (in this case a box).
If it did not, the pattern would indicate how the needed shared data structure
could be implemented.

The programmer would next consider program structure, again guided by
the EmbarrassinglyParallel pattern. Given the two-phase structure in which the
second phase (the termination test) is not computationally intensive, it makes
sense to use a fork-join structure (described in the ForkJoin pattern in the
SupportingStructures design space), as indicated in the EmbarrassinglyParallel
pattern's Implementation section, in which a master process sets up the problem,
initializes the queue, and then forks a number of processes or threads to test the
boxes in the box list. Following the join, the master carries out the termination
test sequentially and then as needed returns to process the list concurrently in
the next cycle. The cycles continue until the termination conditions are met.
Observe that the overall structure of the program is a composition of a sequential
program construct (a "while" loop) with an instance of the ForkJoin pattern.
Figure 5 gives pseudocode for the resulting design. Note that although this
pseudocode omits many portions of the program (for example, the details of
interval global optimization algorithms), it includes everything relevant to
the parallel structure of the program.

4.2.4 Using the ImplementationMechanisms design space

Once the programmer has arrived at a design at the level of the pseudocode
above, he or she must then implement it in a particular programming
environment, addressing whatever additional issues are relevant in that
environment. (For example, implementing the ForkJoin pattern for a
shared-memory multiprocessor differs substantially from implementing the same
pattern for a cluster of workstations.) For this problem, all such issues are
encapsulated in the ForkJoin and SharedQueue patterns, so the programmer can
consult the Implementation section of these patterns for guidance on how to
implement them for the desired environment (and recall, the SharedQueue pattern
would ideally already be implemented and available as a library component).
These patterns in turn guide the programmer into the ImplementationMechanisms
design space, which provides lower-level and more environment-specific help,
such that after
#define Nworkers N
SharedQueue InList;
SharedQueue ResList;
SharedCounter Workers_done;

void main()
{
    int done = FALSE;
    InList = Initialize();
    While (!done) {
        Workers_done = new SharedCounter(0);
        // Create workers to test boxes on InList and write
        // boxes that may have global minima to ResList
        Fork(Nworkers, worker);
        // Wait for the join (i.e., until all workers are done)
        Wait_on_counter_value(Workers_done, Nworkers);
        // Test for completion and copy ResList to InList
        done = termination(ResList, InList);
    }
}

void function worker()
{
    Interval_box B;
    // Loop until the input list is exhausted (the flag "done"
    // is local to main and is not visible here)
    While (!empty(InList)) {
        B = dequeue(InList);
        // Use tests from interval arithmetic to see
        // if the box can contain minima. If so, split
        // into sub-boxes and put them on the result list.
        if (HasMinima(B))
            split_and_put(B, ResList);
    }
    // Tell the master this worker is done (operation name assumed)
    increment(Workers_done);
}


Figure 5: Pseudocode for parallel optimization algorithm.

review of the relevant patterns the programmer can finish the process of turning
a problem description into finished code for the target environment.

5 Related work

Design patterns and pattern languages. Design patterns and pattern
languages were first introduced in software engineering in [8]. Since then,
considerable work has been done in identifying and exploiting patterns to
facilitate software development, on levels ranging from overall program
structure to detailed design. The common theme of all of this work is that of
identifying a pattern that captures some aspect of effective program design
and/or implementation and then reusing this design or implementation in many
applications. Early work on patterns dealt mostly with object-oriented
sequential programming, but more recent work [19, 10] addresses concurrent
programming as well, though mostly at a fairly low level. Ortega-Arjona and
Roberts [18] have given what they call architectural patterns for parallel
programming that are similar to the patterns in our AlgorithmStructure design
space. Their patterns do not, however, belong to a pattern language.

Program skeletons and frameworks. Algorithmic skeletons and frameworks
capture very high-level patterns that provide the overall program
organization with the user providing lower-level code specific to an application.
Skeletons, as in [6], are typically envisioned as higher-order functions, while
frameworks often use object-oriented technology. Particularly interesting
is the work of MacDonald, et al., [11], which uses design patterns to generate
frameworks for parallel programming from pattern template specifications.

Programming archetypes. Programming archetypes [5, 12] combine
elements of all of the above categories: They capture common computational and
structural elements at a high level, but they also provide a basis for
implementations that include both high-level frameworks and low-level code
libraries. A parallel programming archetype combines a computational pattern
with a parallelization strategy; this combined pattern can serve as a basis
both for designing and reasoning about programs (as a design pattern does) and
for code skeletons and libraries (as a framework does). Archetypes do not,
however, directly address the question of how to choose an appropriate
archetype for a particular problem.
6 Conclusions

We have described a pattern language for parallel programming. Currently,
the top two design spaces (FindingConcurrency and AlgorithmStructure) are
relatively mature, with several of the patterns having undergone scrutiny at a
writers' workshop for design patterns [14]. Although the lower two design spaces
are still under construction, the pattern language is now usable and several case
studies are in progress. Preliminary results of the case studies and feedback
from the writers' workshop leave us optimistic that our pattern language can
indeed achieve our goal of lowering the barriers to parallel programming.


References

[1] C. Alexander, S. Ishikawa, and M. Silverstein. A Pattern Language: Towns,
Buildings, Construction. Oxford University Press, 1977.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation:
Numerical Methods. Prentice-Hall, 1989.
[3] R. Bjornson, N. Carriero, T. G. Mattson, D. Kaminsky, and A. Sherman.
Experience with Linda. Technical Report RR-866, Yale University Computer Science
Department, August 1991.
[4] R. Bjornson, C. Kolb, and A. Sherman. Ray tracing with network Linda. SIAM
News, 24(1), January 1991.
[5] K. M. Chandy. Concurrent program archetypes. In Proceedings of the Scalable
Parallel Library Conference, 1994.
[6] M. I. Cole. Algorithmic Skeletons: Structured Management of Parallel
Computation. MIT Press, 1989.
[7] E. W. Dijkstra and C. S. Scholten. Termination detection for diffusing
computations. Information Processing Letters, 11(1), August 1980.
[8] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements
of Reusable Object-Oriented Software. Addison-Wesley, 1995.
[9] R. J. Harrison. Portable tools and applications for parallel computers. Int. J.
Quantum Chem., 40:847-863, 1991.
[10] D. Lea. Concurrent Programming in Java: Design Principles and Patterns.
Addison-Wesley, 1997.
[11] S. MacDonald, D. Szafron, J. Schaeffer, and S. Bromling. From patterns to
frameworks to parallel programs, 1999. Submitted to IEEE Concurrency, August
1999; see also
[12] B. L. Massingill and K. M. Chandy. Parallel program archetypes. In Proceedings
of the 13th International Parallel Processing Symposium (IPPS'99), 1999.
Extended version available as Caltech CS-TR-96-28.
[13] B. L. Massingill, T. G. Mattson, and B. A. Sanders. A pattern language for
parallel application programming. Technical Report CISE TR 99-009, University
of Florida, 1999. Available via
[14] B. L. Massingill, T. G. Mattson, and B. A. Sanders. Patterns for parallel
application programs. In Proceedings of the Sixth Pattern Languages of Programs
Workshop (PLoP99), 1999. See also our Web site at
[15] T. G. Mattson, editor. Parallel Computing in Computational Chemistry, volume
592 of ACS Symposium Series. American Chemical Society, 1995.
[16] R. E. Moore. Methods and Applications of Interval Analysis. SIAM, 1979.
[17] R. E. Moore, E. Hansen, and A. Leclerc. Rigorous methods for global
optimization. In C. Floudas and P. Pardalos, editors, Recent Advances in Global
Optimization. 1992.
[18] J. Ortega-Arjona and G. Roberts. Architectural patterns for parallel
programming. In Proceedings of the 3rd European Conference on Pattern Languages of
Programming and Computing, 1998. See also uk/staff/
[19] D. C. Schmidt. The ADAPTIVE Communication Environment: An object-
oriented network programming toolkit for developing communication software., 1993.
