Citation
Implementation patterns for parallel program and a case study

Material Information

Title:
Implementation patterns for parallel program and a case study
Creator:
Kim, Eunkee ( Author, Primary )
Place of Publication:
Gainesville, Fla.
Publisher:
University of Florida
Publication Date:
Copyright Date:
2002
Language:
English

Subjects

Subjects / Keywords:
Algorithms ( jstor )
Buffer storage ( jstor )
Buffer zones ( jstor )
Code switching ( jstor )
Data models ( jstor )
Data types ( jstor )
Design evaluation ( jstor )
Integers ( jstor )
Mathematical independent variables ( jstor )
Pipelines ( jstor )

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Kim, Eunkee. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
12/27/2005
Resource Identifier:
53309251 ( OCLC )


Full Text














IMPLEMENTATION PATTERNS FOR PARALLEL PROGRAM AND A CASE STUDY













By

EUNKEE KIM


A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE

UNIVERSITY OF FLORIDA


2002































Copyright 2002

by

Eunkee Kim






























Dedicated to Grace Kim, Jineun Song and parents


















ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my adviser, Dr. Beverly A. Sanders, for providing me with an opportunity to work in this exciting area and for providing feedback, guidance, support, and encouragement during the course of this research and my graduate academic career.

I wish to thank Dr. Joseph N. Wilson and Dr. Stephen M. Thebaut for serving on my supervisory committee.

Finally, I thank Dr. Berna L. Massingill for allowing me to use machines at Trinity University and for technical advice.


















TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION
   1.1 Parallel Computing
   1.2 Parallel Design Patterns
   1.3 Implementation Patterns for Design Patterns
   1.4 Implementation of Kernel IS of NAS Parallel Benchmark Set Using Parallel Design Patterns
   1.5 Organization of the Thesis

2 OVERVIEW OF PARALLEL PATTERN LANGUAGE
   2.1 Finding Concurrency Design Space
      2.1.1 Getting Started
      2.1.2 Decomposition Strategy
      2.1.3 Task Decomposition
      2.1.4 Data Decomposition
      2.1.5 Dependency Analysis
      2.1.6 Group Tasks
      2.1.7 Order Tasks
      2.1.8 Data Sharing
      2.1.9 Design Evaluation
   2.2 Algorithm Structure Design Space
      2.2.1 Choose Structure
      2.2.2 Asynchronous Composition
      2.2.3 Divide and Conquer
      2.2.4 Embarrassingly Parallel
      2.2.5 Geometric Decomposition
      2.2.6 Pipeline Processing
      2.2.7 Protected Dependencies
      2.2.8 Recursive Data
      2.2.9 Separable Dependency
   2.3 Supporting Structures Design Space
      2.3.1 Program-Structuring Group
         2.3.1.1 Single program and multiple data
         2.3.1.2 Fork join
         2.3.1.3 Master worker
         2.3.1.4 Spawn
      2.3.2 Shared Data Structures Group
         2.3.2.1 Shared queue
         2.3.2.2 Shared counter
         2.3.2.3 Distributed array
   2.4 Chapter Summary

3 PATTERNS FOR IMPLEMENTATION
   3.1 Using Message Passing Interface (MPI)
      3.1.1 Intent
      3.1.2 Applicability
      3.1.3 Implementation
         3.1.3.1 Simple message passing
         3.1.3.2 Group and communicator creation
         3.1.3.3 Data distribution and reduction
   3.2 Simplest Form of Embarrassingly Parallel
      3.2.1 Intent
      3.2.2 Applicability
      3.2.3 Implementation
      3.2.4 Implementation Example
      3.2.5 Example Usage
   3.3 Implementation of Embarrassingly Parallel
      3.3.1 Intent
      3.3.2 Applicability
      3.3.3 Implementation
      3.3.4 Implementation Example
      3.3.5 Usage Example
   3.4 Implementation of Pipeline Processing
      3.4.1 Intent
      3.4.2 Applicability
      3.4.3 Implementation
      3.4.4 Implementation Example
      3.4.5 Usage Example
   3.5 Implementation of Asynchronous-Composition
      3.5.1 Intent
      3.5.2 Applicability
      3.5.3 Implementation
      3.5.4 Implementation Example
   3.6 Implementation of Divide and Conquer
      3.6.1 Intent
      3.6.2 Motivation
      3.6.3 Applicability
      3.6.4 Implementation
      3.6.5 Implementation Example
      3.6.6 Usage Example

4 KERNEL IS OF NAS BENCHMARK
   4.1 Brief Statement of Problem
   4.2 Key Generation and Memory Mapping
   4.3 Procedure and Timing

5 PARALLEL PATTERNS USED TO IMPLEMENT KERNEL IS
   5.1 Finding Concurrency
      5.1.1 Getting Started
      5.1.2 Decomposition Strategy
      5.1.3 Task Decomposition
      5.1.4 Dependency Analysis
      5.1.5 Data Sharing Pattern
      5.1.6 Design Evaluation
   5.2 Algorithm Structure Design Space
      5.2.1 Choose Structure
      5.2.2 Separable Dependencies
      5.2.3 Embarrassingly Parallel
   5.3 Using Implementation Example
   5.4 Algorithm for Parallel Implementation of Kernel IS

6 PERFORMANCE RESULTS AND DISCUSSIONS
   6.1 Performance Expectation
   6.2 Performance Results
   6.3 Discussions

7 RELATED WORK AND CONCLUSIONS AND FUTURE WORK
   7.1 Related Work
      7.1.1 Aids for Parallel Programming
      7.1.2 Parallel Sorting
   7.2 Conclusions
   7.3 Future Work

APPENDIX

A KERNEL IS OF THE NAS PARALLEL BENCHMARK

B PSEUDORANDOM NUMBER GENERATOR

C SOURCE CODE OF THE KERNEL IS IMPLEMENTATION

D SOURCE CODE OF PIPELINE EXAMPLE

E SOURCE CODE OF DIVIDE AND CONQUER

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

















LIST OF TABLES

4-1 Parameter Values to be used for Benchmark
6-1 Performance Results for Class A and B
A-1 Values to be used for Partial Verification
A-2 Parameter Values to be used for Benchmark


















LIST OF FIGURES

3-1 The Algorithm Structure Decision Tree
3-2 Usage of Message Passing
3-3 Irregular Message Handling
3-4 Message Passing for Invocation of the Solve and Merge Functions
4-1 Counting Sort
6-1 Execution Time Comparison for Class A
6-2 Execution Time Comparison for Class B


















Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science

IMPLEMENTATION PATTERNS FOR PARALLEL PROGRAM AND A CASE STUDY

By

Eunkee Kim

December 2002

Chairman: Beverly A. Sanders
Major Department: Computer and Information Science and Engineering

Design patterns for parallel programming guide the programmer through the entire process of developing a parallel program. In this thesis, implementation patterns for parallel programs are presented. These patterns help programmers implement a parallel program after designing it with the parallel design patterns.

Parallel integer sorting (Kernel IS), one of the kernels of the Parallel Benchmark set of the Numerical Aerodynamic Simulation (NAS) Program, has been designed and implemented using the parallel design patterns as a case study. How the parallel design patterns were used in implementing Kernel IS, and the resulting performance, are discussed.



















CHAPTER 1
INTRODUCTION

1.1 Parallel Computing

Parallel computing is the use of many processors to carry out more than one computation at a time. An example of parallel computing (processing) in daily life is an automobile assembly line: at every station somebody is doing part of the work needed to complete the product. The purpose of parallel computing is to overcome the performance limit of a single processor. We can increase the performance of a sequential program by parallelizing its exploitable concurrency and using many processors at once.

Parallel programming has been considered more difficult than sequential programming. To make it easier for programmers to write parallel programs, parallel design patterns have been developed by B.L. Massingill, T.G. Mattson, and B.A. Sanders [1-4].

1.2 Parallel Design Patterns

The parallel pattern language is a collection of design patterns for parallel programming [5-6]. Design patterns are high-level descriptions of solutions to recurring problems [7]. The parallel pattern language is written down in a systematic way, so that a final design for a parallel program results from working through a sequence of appropriate patterns from the pattern catalog. The structure of the patterns is designed to parallelize even complex problems. The top-level patterns help to find the concurrency in a problem and decompose it into a collection of tasks. The second-level patterns help to find an appropriate algorithm structure to exploit the concurrency that has been identified. The parallel design patterns are described in more detail in Chapter 2.

1.3 Implementation Patterns for Design Patterns

Several implementation patterns for the design patterns in the algorithm structure design space were developed as part of the parallel pattern language and are presented in this thesis. These implementation patterns are solutions to the problem of mapping high-level parallel algorithms into programs using the Message Passing Interface (MPI) and the C programming language [8]. The patterns of the algorithm structure design space capture recurring solutions to the problem of turning problems into parallel algorithms. The implementation patterns can be reused and provide guidance for programmers who need to create their own implementations after using the parallel design patterns. The implementation patterns for the design patterns in the algorithm structure design space are given in Chapter 3.

1.4 Implementation of Kernel IS of NAS Parallel Benchmark Set Using Parallel Design Patterns

The Numerical Aerodynamic Simulation (NAS) Program, based at NASA Ames Research Center, developed a parallel benchmark set for the performance evaluation of highly parallel supercomputers. The NAS Parallel Benchmark set is a "paper and pencil" benchmark [9]: all details of the benchmark set are specified only algorithmically. Kernel IS of the NAS Parallel Benchmark set is a parallel sort over a large number of integers. As a case study, a solution to Kernel IS was designed and implemented using the parallel pattern language. The patterns used are the following.

The Getting Started, Decomposition Strategy, Task Decomposition, Dependency Analysis, Data Sharing, and Design Evaluation patterns of the Finding Concurrency Design Space are used. The Choose Structure, Separable Dependency, and Embarrassingly Parallel patterns of the algorithm structure design space are also used. The details of how these patterns are used and the final design of the parallel program for Kernel IS are given in Chapter 5.

1.5 Organization of the Thesis

The research that has been conducted as part of this thesis and its organization are as follows:

" Analysis of the parallel design patterns (Chapter 2) " Implementation patterns for patterns in Algorithm structure design space (Chapter 3) " A description of Kernel IS of the NAS Parallel Benchmark set (Chapter 4) " Design and implementation of Kernel IS through the parallel design patterns
(Chapter 5)

" Performance results and discussions (Chapter 6). " Conclusions and future work (Chapter 7).



















CHAPTER 2
OVERVIEW OF PARALLEL PATTERN LANGUAGE

The parallel pattern language is a set of design patterns that guide the programmer through the entire process of developing a parallel program [10]. The patterns of the parallel pattern language, as developed by Massingill et al., are organized into four design spaces: the Finding Concurrency Design Space, the Algorithm Structure Design Space, the Supporting Structures Design Space, and the Implementation Design Space.

2.1 Finding Concurrency Design Space

The finding concurrency design space includes high-level patterns that help to find the concurrency in a problem and decompose it into a collection of tasks.

2.1.1 Getting Started

The Getting Started pattern helps to start designing a parallel program. Before using these patterns, the user needs to be sure that the problem is large enough (or needs to be sped up) and needs to understand the problem. The user of this pattern needs to do the following tasks:

* Decide which parts of the problem require the most intensive computation.
* Understand the tasks that need to be carried out and the data structures that are to be manipulated.

2.1.2 Decomposition Strategy

This pattern helps to decompose the problem into relatively independent entities that can execute concurrently. To expose the concurrency of the problem, the problem can be decomposed along two different dimensions:












" Task Decomposition: Break the stream of instructions into multiple chunks called
tasks that can execute simultaneously.

" Data Decomposition: Decompose the problem's data into chunks that can be operated

relatively independently.

2.1.3 Task Decomposition

The Task Decomposition pattern addresses the issues raised during a primarily task-based decomposition. To do task decomposition, the user should try to look at the problem as a collection of distinct tasks. These tasks can be found in the following places:

* Function calls may correspond to tasks.
* Each iteration of a loop, if the iterations are independent, can be a task.
* Updates on individual data chunks decomposed from a large data structure can be tasks.

The number of tasks generated should be flexible and large enough, and each task should have enough computation.

2.1.4 Data Decomposition

This pattern looks at the issues involved in decomposing data into units that can be updated concurrently. The first point to be considered is whether the data structure can be broken down into chunks that can be operated on concurrently. Array-based computations and recursive data structures are examples of this approach. The points to be considered in decomposing data are as follows:

* Flexibility in the size and number of data chunks, to support the widest range of parallel systems
* Data chunks large enough to offset the overhead of managing dependencies
* Simplicity of the data decomposition












2.1.5 Dependency Analysis

This pattern is applicable when the problem has been decomposed into tasks that can be executed concurrently. The goal of a dependency analysis is to understand the dependencies among the tasks in detail. Data-sharing dependencies and ordering constraints are the two kinds of dependencies that should be analyzed. The dependencies should require little time to manage relative to the computation time, and errors in handling them should be easy to detect and fix. One effective way of analyzing dependencies is the following approach:

* Identify how the tasks should be grouped together.
* Identify any ordering constraints between groups of tasks.
* Analyze how tasks share data, both within and among groups.

These steps lead to the Group Tasks, Order Tasks, and Data Sharing patterns.

2.1.6 Group Tasks

This pattern constitutes the first step in analyzing dependencies among the tasks of the problem decomposition. The goal of this pattern is to group tasks based on their constraints. The three major categories of constraints among tasks are as follows:

* A temporal dependency: a constraint placed on the order in which a collection of tasks executes.
* A requirement that a collection of tasks must run at the same time.
* Tasks in a group are truly independent of one another.

The three approaches to grouping tasks are as follows:

* Look at how the problem is decomposed. Tasks that correspond to a high-level operation naturally group together. If the tasks share constraints, keep them as a distinct group.
* If any other task groups share the same constraints, merge the groups together.
* Look at constraints between groups of tasks.












2.1.7 Order Tasks

This pattern constitutes the second step in analyzing dependencies among the tasks of the problem decomposition. The goal of this pattern is to identify ordering constraints among task groups. Two goals are to be met in defining this ordering:

* It must be restrictive enough to satisfy all the constraints, so that the resulting design is correct.
* It should not be more restrictive than it needs to be.

To identify ordering constraints, consider the following ways tasks can depend on each other:

* First look at the data required by a group of tasks before they can execute. Once these data have been identified, find the task group that creates them, and you have an ordering constraint.
* Consider whether external services can impose ordering constraints.
* It is equally important to note when an ordering constraint does not exist.

2.1.8 Data Sharing

This pattern constitutes the third step in analyzing dependencies among the tasks of the problem decomposition. The goal of this pattern is to identify what data are shared among groups of tasks and how to manage access to the shared data in a way that is both correct and efficient. The following approach can be used to determine what data are shared and how to manage them:

* Identify data that are shared between tasks.
* Understand how the data will be used.

2.1.9 Design Evaluation

The goals of this pattern are to evaluate the design so far and to prepare for the next phase of the design space. This pattern therefore describes how to evaluate the design from three perspectives: suitability for the target platform, design quality, and preparation for the next phase of the design.

For suitability for the target platform, the following issues are included:

* How many processing elements are available?
* How are data structures shared among processing elements?
* What does the target architecture imply about the number of units of execution and how structures are shared among them?
* On the target platform, will the time spent doing useful work in a task be significantly greater than the time taken to deal with dependencies?

For design quality, simplicity, flexibility, and efficiency should be considered. To prepare for the next phase, the key issues are as follows:

* How regular are the tasks and their data dependencies?
* Are interactions between tasks (or task groups) synchronous or asynchronous?
* Are the tasks grouped in the best way?

2.2 Algorithm Structure Design Space

The algorithm structure design space contains patterns that help to find an appropriate algorithm structure to exploit the concurrency that has been identified.

2.2.1 Choose Structure

This pattern guides the algorithm designer to the most appropriate Algorithm Structure patterns for the problem. Consideration of the target platform, the major organizing principle, and the Algorithm Structure decision tree are the main topics of this pattern.

The two primary issues in considering the target platform are how many units of execution (threads or processes) the target system will effectively support and the way information is shared between units of execution. The three major organizing principles are "organization by ordering," "organization by tasks," and "organization by data." The Algorithm Structure decision tree is shown in Figure 3-1; an algorithm structure can be selected by following this decision tree.





Figure 3-1. The Algorithm Structure Decision Tree (diagram not reproduced). Starting from the major organizing principle (organization by ordering, by tasks, or by data), the tree leads to the Pipeline Processing, Asynchronous Composition, Divide and Conquer, Embarrassingly Parallel, Separable Dependencies, Protected Dependencies, Geometric Decomposition, and Recursive Data patterns.

2.2.2 Asynchronous Composition

This pattern describes what may be the most loosely structured type of concurrent program in which semi-independent tasks interact through asynchronous events. Two examples are the Discrete-Event Simulation and Event-Driven program. The key issues in this pattern are how to define the tasks/entities, how to represent their interaction, and how to schedule the tasks.

2.2.3 Divide and Conquer

This pattern can be used for parallel application programs based on the well-known divide-and-conquer strategy. This pattern is particularly effective when the amount of work required to solve the base case is large compared to the amount of work required for the recursive splits and merges. The key elements of this pattern are as follows (a minimal sketch follows the list):

* Definitions of the functions "solve," "split," "merge," "baseCase," and "baseSolve"
* A way of scheduling the tasks that exploits the available concurrency

This pattern also includes correctness issues and efficiency issues.
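As a minimal sketch of how these functions fit together (the array-summing problem and the function bodies are only illustrative), a sequential divide-and-conquer skeleton in C could look like the following; a parallel implementation, such as the one in Section 3.6, schedules the two recursive solve calls on different units of execution.

#include <stdio.h>

static int baseCase(int len)              { return len == 1; }
static long baseSolve(const int *a)       { return a[0]; }
static long merge(long left, long right)  { return left + right; }

/* solve() splits the problem in half, solves each half (these two calls
   are the ones a parallel version runs concurrently), and merges the
   two partial results. */
static long solve(const int *a, int len)
{
    if (baseCase(len))
        return baseSolve(a);
    int half = len / 2;                          /* split */
    long s1 = solve(a, half);
    long s2 = solve(a + half, len - half);
    return merge(s1, s2);
}

int main(void)
{
    int data[] = { 3, 1, 4, 1, 5, 9, 2, 6 };
    printf("%ld\n", solve(data, 8));             /* prints 31 */
    return 0;
}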

2.2.4 Embarrassingly Parallel

This pattern can be used to describe concurrent execution by a collection of independent tasks and to show how to organize a collection of independent tasks so that they execute efficiently.

The key elements of this pattern are a mechanism to define a set of tasks and schedule their execution, and a mechanism to detect completion of the tasks and terminate the computation. This pattern also includes correctness and efficiency issues.

2.2.5 Geometric Decomposition

This pattern can be used when the concurrency is based on parallel updates of chunks of a decomposed data structure, and the update of each chunk requires data from other chunks. Implementations of this pattern include the following key elements:

* A way of partitioning the global data structure into substructures or "chunks"
* A way of ensuring that each task has access to all the data needed to perform the update operation for its chunk, including data in chunks corresponding to other tasks
* A definition of the update operation, whether by points or by chunks
* A way of assigning the chunks among units of execution (distribution), that is, a way of scheduling the corresponding tasks












2.2.6 Pipeline Processing

This pattern is for algorithms in which data flow through a sequence of tasks or stages. It is applicable when the problem consists of performing a sequence of calculations, each of which can be broken down into distinct stages, on a sequence of inputs. For each input the calculations must be done in order, but it is possible to overlap the computation of different stages for different inputs. The key elements of this pattern are as follows (a minimal sketch follows the list):

* A way of defining the elements of the pipeline, where each element corresponds to one of the functions that make up the computation
* A way of representing the dataflow among pipeline elements, i.e., how the functions are composed
* A way of scheduling the tasks

2.2.7 Protected Dependencies

This pattern can be used for task-based decompositions in which the dependencies between tasks cannot be separated from the execution of the tasks, and the dependencies must be dealt with in the body of the task. The issues of this pattern are as follows:

* A mechanism to define a set of tasks and schedule their execution onto a set of units of execution
* Safe access to shared data
* Shared memory available when it is needed

2.2.8 Recursive Data

This pattern can be used for parallel applications in which an apparently sequential operation on a recursive data structure is reworked to make it possible to operate on all elements of the data structure concurrently. The issues of this pattern are as follows:

* A definition of the recursive data structure, plus what data are needed for each element of the structure
* A definition of the concurrent operations to be performed
* A definition of how these concurrent operations are composed to solve the entire problem
* A way of managing shared data
* A way of scheduling the tasks onto units of execution
* A way of testing the termination condition, if the top-level structure involves a loop

2.2.9 Separable Dependency

This pattern is used for task-based decompositions in which the dependencies between tasks can be eliminated as follows: necessary global data are replicated, and (partial) results are stored in local data structures. Global results are then obtained by reducing (combining) the results from the individual tasks. The key elements of this pattern are as follows:

* Defining the tasks and scheduling their execution
* Defining and updating a local data structure
* Combining (reducing) local objects into a single object

2.3 Supporting Structures Design Space

The patterns at this level represent an intermediate stage between the problem-oriented patterns of the algorithm structure design space and the machine-oriented patterns of the Implementation Mechanisms Design Space.

Patterns in this space fall into two main groups: the Program-Structuring Group and the Shared Data Structures Group.

2.3.1 Program-Structuring Group

Patterns in this group deal with constructs for structuring the source code.

2.3.1.1 Single program and multiple data

The computation consists of N units of execution (UEs) running in parallel. All N UEs (a UE is a generic term for a concurrently executing entity, usually either a process or a thread) execute the same program code, but each operates on its own set of data. A key feature of the program code is a parameter that differentiates among the copies.
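As a minimal sketch (the per-process work is only a placeholder), an SPMD program in MPI uses the process rank as the parameter that differentiates the copies:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which copy am I?          */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many copies in total? */

    /* every process runs this same code, but the rank selects which
       portion of the overall work (or data) this process handles */
    printf("process %d of %d working on its own part of the data\n",
           rank, size);

    MPI_Finalize();
    return 0;
}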

2.3.1.2 Fork join

A main process or thread forks off some number of other processes or threads that then continue in parallel to accomplish some portion of the overall work before rejoining the main process or thread.

2.3.1.3 Master worker

A master process or thread sets up a pool of worker processes or threads and a task queue. The workers execute concurrently with each worker repeatedly removing a task from the task queue and processing it until all tasks have been processed or some other termination condition has been reached. In some implementations, no explicit master is present.

2.3.1.4 Spawn

A new process or thread is created, which then executes independently of its creator. This pattern bears somewhat the same relation to the others as GOTO bears to the constructs of structured programming.

2.3.2 Shared Data Structures Group

Patterns in this group describe commonly used data structures.

2.3.2.1 Shared queue

This pattern represents a "thread-safe" implementation of the familiar queue abstract data type (ADT), that is, an implementation of the queue ADT that maintains the correct semantics even when used by concurrently executing units of execution.












2.3.2.2 Shared counter

This pattern, as with the Shared Queue pattern, represents a "thread-safe" implementation of a familiar abstract data type, in this case a counter with an integer value and increment and decrement operations.

2.3.2.3 Distributed array

This pattern represents a class of data structures often found in parallel scientific computing, namely, arrays of one or more dimensions that have been decomposed into subarrays and distributed among processes or threads.

2.4 Chapter Summary

This chapter has given an overview of the parallel pattern language, which guides programmers from the design of an application program through to its implementation. Chapter 5 illustrates how these design patterns have been used to design and implement Kernel IS of the NAS Parallel Benchmark.


















CHAPTER 3
PATTERNS FOR IMPLEMENTATION

In this chapter, we introduce patterns for the implementation of the patterns in the algorithm structure design space. Each implementation pattern contains an Implementation Example for a pattern in the algorithm structure design space. The Implementation Examples are implemented in an MPI and C environment; MPI (Message Passing Interface) is a standard for message passing in distributed-memory environments. The Implementation Examples may occasionally need modification, but they are reusable and helpful for implementing a parallel program designed using the parallel pattern language.

3.1 Using Message Passing Interface (MPI)

3.1.1 Intent

This pattern is an introduction to the Message Passing Interface (MPI).

3.1.2 Applicability

This pattern is applicable when the user of the parallel pattern language has finished the design of a parallel program and is considering implementing it using MPI.

3.1.3 Implementation

MPI is a library of functions (in C) or subroutines (in Fortran) that the user can insert into source code to perform data communication between processes. The primary goals of MPI are to provide source code portability and to allow efficient implementations across a range of architectures.












3.1.3.1 Simple message passing

Message passing programs consist of multiple instances of a serial program that communicate through library calls. The elementary communication operation in MPI is "point-to-point" communication, that is, direct communication between two processes, one of which sends and the other receives. Message passing is the basic mechanism for exchanging data among the processes of a parallel program that uses MPI. It is also a good solution for ordering the execution of tasks belonging to a group and for ordering the execution of groups of tasks.

To communicate among processes, communicators are needed. The default communicator of MPI is MPI_COMM_WORLD, which includes all processes. The common implementation of MPI executes the same program concurrently at each processing element (CPU or workstation). The following program is a very simple example of using the MPI_Send (send) and MPI_Recv (receive) functions.

/* simple send and receive */
/* Each process has the same copy of the following program and executes it */
#include <stdio.h>
#include <mpi.h>

#define BUF_SIZE 100
int tag = 1;

int main(int argc, char **argv)
{
    int myrank;              /* rank (identifier) of each process */
    MPI_Status status;       /* status object */
    double a[BUF_SIZE];      /* send (receive) buffer */

    MPI_Init(&argc, &argv);                      /* initialize MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);      /* get rank of process */

    if (myrank == 0) {
        /* code for process 0: send message */
        MPI_Send(
            a,               /* initial address of send buffer */
            BUF_SIZE,        /* number of elements in send buffer */
            MPI_DOUBLE,      /* datatype of each send buffer element */
            1,               /* rank of destination */
            tag,             /* message tag */
            MPI_COMM_WORLD); /* communicator */
    }
    else if (myrank == 1) {
        /* code for process 1: receive message */
        MPI_Recv(
            a,               /* initial address of receive buffer */
            BUF_SIZE,        /* number of elements in receive buffer */
            MPI_DOUBLE,      /* datatype of each receive buffer element */
            0,               /* rank of source */
            tag,             /* message tag */
            MPI_COMM_WORLD,  /* communicator */
            &status);        /* status object */
    }
    /* more else-if branches can be added for more processes */
    /* a switch statement can be used instead of the if-else statements */
    MPI_Finalize();          /* terminate MPI */
    return 0;
}

3.1.3.2 Group and communicator creation

A communicator in MPI specifies the communication context for communication operations and the set of processes that share this communication context. To communicate among processes in a program using MPI, we must have communicators.

There are two kinds of communicators in MPI: intra-communicators and inter-communicators. Intra-communicators are for message passing among processes belonging to the same group. Inter-communicators are for message passing between two groups of processes. As mentioned previously, the execution order of tasks belonging to one group can be implemented using an intra-communicator and message passing. The execution order of groups of tasks can be implemented using an inter-communicator and message passing.












To create a communicator, a group of processes is needed. MPI does not provide a mechanism to build a group from scratch, but only from other, previously defined groups. The base group, upon which all other groups can be created, is the group associated with the initial communicator MPI_COMM_WORLD. A group of processes can be put to use through the following steps:

1. Decide which processing units will be included in the group.

2. Decide which base groups to use in creating the new group.

3. Create the group.

4. Create the communicator.

5. Do the work.

6. Free the communicator and the group.

The first step is to decide how many and which processing units will be included in the group. The points to be considered are the number of available processing units, the number of tasks in one group, the number of groups that can be executed concurrently, and so on, according to the design of the program.

If only one group of tasks can execute at a time because of dependencies among groups, then that group can use all available processing units. If there are several groups that can be executed concurrently, then the available processing units may be divided according to the number of groups that can be executed concurrently and the number of tasks in each group.

Since MPI does not provide a mechanism to build a group from scratch, but only from other, previously defined groups, the group constructors are used to take subsets and supersets of existing groups. The base group, upon which all other groups can be defined, is the group associated with the initial communicator MPI_COMM_WORLD; the processes in this group are all the processes available when MPI is initialized.

There are seven useful group constructors in MPI. Using these constructors, groups can be constructed as designed. The first three constructors correspond to the union, intersection, and difference operations on sets in mathematics. The seven group constructors are as follows: MPI_GROUP_UNION creates a group that contains all the elements of the two input groups. MPI_GROUP_INTERSECTION creates a group that contains the elements that belong to both input groups. MPI_GROUP_DIFFERENCE creates a group that contains the elements of the first group that do not belong to the second group. The MPI_GROUP_INCL routine creates a new group consisting of the processes of the old group whose ranks are listed in an array. The MPI_GROUP_EXCL routine creates a new group consisting of the processes of the old group whose ranks are not listed in the array. The MPI_GROUP_RANGE_INCL routine creates a new group from specified ranges of ranks, including the endpoints of each range; the MPI_GROUP_RANGE_EXCL routine creates a new group containing all processes of the old group except those in the specified ranges.
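As a minimal sketch (the choice of even ranks is only illustrative), the following fragment uses MPI_GROUP_INCL to derive a group containing the even-ranked processes of MPI_COMM_WORLD and then builds a communicator for that group; processes outside the group receive MPI_COMM_NULL and simply skip the work on the new communicator.

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Group worldGroup, evenGroup;
    MPI_Comm evenComm;
    int worldSize, i, nEven = 0;
    int *evenRanks;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
    MPI_Comm_group(MPI_COMM_WORLD, &worldGroup);

    evenRanks = (int *) malloc(((worldSize + 1) / 2) * sizeof(int));
    for (i = 0; i < worldSize; i += 2)        /* collect the even ranks */
        evenRanks[nEven++] = i;

    /* new group made of the listed ranks of the base group */
    MPI_Group_incl(worldGroup, nEven, evenRanks, &evenGroup);

    /* communicator for that group */
    MPI_Comm_create(MPI_COMM_WORLD, evenGroup, &evenComm);

    if (evenComm != MPI_COMM_NULL) {
        /* ... work (e.g., collective operations) among the even-ranked processes ... */
        MPI_Comm_free(&evenComm);
    }
    MPI_Group_free(&evenGroup);
    MPI_Group_free(&worldGroup);
    free(evenRanks);
    MPI_Finalize();
    return 0;
}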

For the creation of communicators, recall that there are two types of communicators in MPI: intra-communicators and inter-communicators. The important intra-communicator constructors in MPI are as follows: MPI_COMM_DUP duplicates an existing communicator, and MPI_COMM_CREATE creates a new communicator for a group.

For communication between two groups, the inter-communicator for the two identified groups can be created by using the communicator of each group and a peer communicator. The routine for this purpose is MPI_INTERCOMM_CREATE. The peer communicator must contain at least one selected member process from each group. Using a duplicate of MPI_COMM_WORLD as a dedicated peer communicator is recommended.

The Implementation Example Code is as follows:


/* An example of creating an intra-communicator */

#include <mpi.h>

int main(int argc, char **argv)
{
    int myRank, count, count2;
    int *sendBuffer, *receiveBuffer, *sendBuffer2, *receiveBuffer2;
    MPI_Group MPI_GROUP_WORLD, grprem;
    MPI_Comm commSlave;
    static int ranks[] = { 0 };
    /* ... */
    MPI_Init(&argc, &argv);
    MPI_Comm_group(MPI_COMM_WORLD, &MPI_GROUP_WORLD);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    /* Build group for slave processes */
    MPI_Group_excl(
        MPI_GROUP_WORLD,  /* group */
        1,                /* number of elements in array ranks */
        ranks,            /* array of integer ranks in group not to appear in new group */
        &grprem);         /* new group derived from above */

    /* Build communicator for slave processes */
    MPI_Comm_create(
        MPI_COMM_WORLD,   /* communicator */
        grprem,           /* group, a subset of the group of the above communicator */
        &commSlave);      /* new communicator */

    if (myRank != 0)
    {
        /* compute on processes other than the root process */
        /* ... */
        MPI_Reduce(sendBuffer, receiveBuffer, count, MPI_INT, MPI_SUM, 1,
                   commSlave);
        /* ... */
    }
    /* The rank-zero process falls through immediately to this reduce; the others do it later */
    MPI_Reduce(sendBuffer2, receiveBuffer2, count2, MPI_INT, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (commSlave != MPI_COMM_NULL)   /* rank 0 is not a member of commSlave */
        MPI_Comm_free(&commSlave);
    MPI_Group_free(&MPI_GROUP_WORLD);
    MPI_Group_free(&grprem);
    MPI_Finalize();
    return 0;
}


/* An example of creating an inter-communicator */

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm myComm;        /* intra-communicator of local sub-group */
    MPI_Comm myInterComm;   /* inter-communicator between the two groups */
    int membershipKey;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* User-generated membership key for each group (two sub-groups here) */
    membershipKey = rank % 2;

    /* Build intra-communicator for the local sub-group */
    MPI_Comm_split(
        MPI_COMM_WORLD,     /* communicator */
        membershipKey,      /* each sub-group contains all processes with the same membershipKey */
        rank,               /* within each sub-group, processes are ranked in the order defined by this value */
        &myComm);           /* new communicator */

    /* Build the inter-communicator. Tags are hard-coded. */
    if (membershipKey == 0)
    {
        MPI_Intercomm_create(
            myComm,          /* local intra-communicator */
            0,               /* rank of local group leader */
            MPI_COMM_WORLD,  /* "peer" intra-communicator */
            1,               /* rank of remote group leader in the "peer" communicator */
            1,               /* tag */
            &myInterComm);   /* new inter-communicator */
    }
    else
    {
        MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 0, 1, &myInterComm);
    }

    /* do work in parallel */

    MPI_Comm_free(&myInterComm);
    MPI_Comm_free(&myComm);
    MPI_Finalize();
    return 0;
}


3.1.3.3 Data distribution and reduction

There are several primitive functions for distribution of data in MPI.

The three basic functions are as follows: The MPI_BCAST function replicates data from one root process to all other processes, so that every process has the same data. This function can be used by a program that needs to replicate data in each process to improve performance. The MPI_SCATTER function distributes data from one processing unit to every other processing unit, with the same amount of data for each (or varying amounts, using the MPI_SCATTERV variant), so that each processing unit receives a different portion of the original data. The MPI_GATHER function is the inverse of MPI_SCATTER.

For the reduction of data distributed among processes, MPI provides the two primitive functions MPI_REDUCE and MPI_ALLREDUCE. These reduce functions are useful when it is necessary to combine locally computed subsolutions into one global solution. The MPI_REDUCE function combines the elements provided in the input buffer of each process in the group, using the operation specified as a parameter, and returns the combined value in the output buffer of the process with the root rank. The MPI_ALLREDUCE function combines the data in the same way as MPI_REDUCE, but every process receives the combined value in its output buffer. This function is beneficial when the combined result is useful information for the next phase of the computation.

Separate Implementation Example code is not provided in this section; the Implementation Example code of Section 3.2.4 can be used, and the source code of the Kernel IS implementation is also a good example.
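As a minimal additional sketch (the array size, and the assumption that it divides evenly among the processes, are only illustrative), the typical combination is to scatter the data from the root, compute a partial result in each process, and combine the partial results with MPI_ALLREDUCE so that every process sees the global result:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 1024                             /* total number of elements (illustrative) */

int main(int argc, char **argv)
{
    int rank, size, i;
    int data[N];                           /* significant only at the root */
    long localSum = 0, globalSum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;                  /* assumes N is divisible by size */
    int *local = (int *) malloc(chunk * sizeof(int));

    if (rank == 0)
        for (i = 0; i < N; i++) data[i] = 1;   /* root fills the data */

    /* distribute equal-sized chunks to all processes */
    MPI_Scatter(data, chunk, MPI_INT, local, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    for (i = 0; i < chunk; i++)            /* each process computes its partial sum */
        localSum += local[i];

    /* combine the partial sums; every process receives the global result */
    MPI_Allreduce(&localSum, &globalSum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    printf("process %d sees global sum %ld\n", rank, globalSum);

    free(local);
    MPI_Finalize();
    return 0;
}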

3.2 Simplest Form of Embarrassingly Parallel

3.2.1 Intent

This is a solution to the problem of how to implement the simplest form of the Embarrassingly Parallel pattern in the MPI and C environment.

3.2.2 Applicability

This pattern can be used after the program has been designed using patterns of the finding concurrency design space and patterns of the algorithm structure design space, and the resulting design is the simplest form of the Embarrassingly Parallel pattern. The simplest form of the Embarrassingly Parallel pattern satisfies the following conditions: all the tasks are independent; all the tasks must be completed; and each task executes the same algorithm on a distinct section of the data.













3.2.3 Implementation

Common MPI implementations allocate processes statically at program initialization time and execute the same program at each processing unit (or workstation). Since all the tasks execute the same algorithm on different data, the tasks can be implemented as one high-level function that is executed in each process.

The simplest form of the Embarrassingly Parallel pattern can be implemented by the following five steps.

1. Find out the number of available processing units.

2. Distribute or replicate data to each process.

3. Execute the tasks.

4. Synchronize all processes.

5. Combine (reduce) local results into a single result.

In MPI, the number of available processing units can be found by calling the MPI_COMM_SIZE function and passing the MPI_COMM_WORLD communicator as a parameter.

In most cases, after finding out the number of available processing units, the data can be divided by that number and distributed to each processing unit. There are several primitive functions for the distribution of data in MPI. The three basic functions are Broadcast, Scatter, and Gather; they are introduced in Section 3.1.3.3.

After computing the local result at each process, synchronize all processes to check that all local results have been computed. This can be implemented by calling the MPI_BARRIER function after each local result is computed at each processing unit, because MPI_BARRIER blocks the caller until all group members have called it.












The local results produced can then be combined to get the final solution of the problem. The Reduce functions of MPI are useful for combining subsolutions. The Reduce function of MPI combines the elements provided in the input buffer of each process in the group, using the operation specified as a parameter, and returns the combined value in the output buffer of the process with the root rank. In many MPI implementations the reduction operations are distributed across the processes, which can improve the overall performance of the program if combining the subsolutions in one processing unit would take considerable time.

3.2.4 Implementation Example

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define SIZE_OF_DATA 2000   /* size of the data to be distributed among the processes */

int root = 0;               /* the rank (ID) of the root process */
int myRank;                 /* process ID */

/*= MODIFICATION1 ==========================================*/
/*= The data type of each variable should be modified.     =*/
int* data;                  /* the data to be distributed among the processes */
int* localData;             /* distributed data that each process will have */
int* subsolution;           /* subsolution data */
int* solution;              /* solution data */
/*==========================================================*/

int sizeOfLocalData;        /* number of elements in the local data array */
int numOfProc;              /* number of available processes */
int numOfElement;           /* number of elements in the subsolution */

/* Highest-level function which solves the subproblem */
void solve()
{
    /*= IMPLEMENTATION1 ====================================*/
    /*= The code for solving the subproblem should be      =*/
    /*= implemented here.                                  =*/
    /*======================================================*/
}

/* Main function which starts the program */
int main(int argc, char **argv)
{
    MPI_Status status;      /* status object (unused in this template) */

    /* MPI initialization */
    MPI_Init(&argc, &argv);

    /* Find out the rank (ID) of each process */
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    /* Find out the number of available processes */
    MPI_Comm_size(MPI_COMM_WORLD, &numOfProc);

    /* Divide the data by the number of processes */
    sizeOfLocalData = SIZE_OF_DATA / numOfProc;
    /* Dynamic memory allocation for the local data */
    localData = (int *) malloc(sizeOfLocalData * sizeof(int));

    /* Distribute the data into the local data of each process */
    MPI_Scatter(
        data,              /* address of send buffer (data) to distribute */
        sizeOfLocalData,   /* number of elements sent to each process */
        MPI_INT,           /*= MODIFICATION2: data type of the send buffer =*/
        localData,         /* address of receive buffer (local data) */
        sizeOfLocalData,   /* number of elements in the receive buffer */
        MPI_INT,           /*= MODIFICATION2: data type of the receive buffer =*/
        root,              /* rank of the sending process */
        MPI_COMM_WORLD     /* communicator */
    );

    /* Solve the subproblem in each process */
    solve();

    /* Synchronize all processes to check that all subproblems are solved */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Combine the subsolutions to get the solution */
    MPI_Reduce(
        subsolution,       /* address of send buffer (subsolution) */
        solution,          /* address of receive buffer (solution) */
        numOfElement,      /* number of elements in the send buffer (subsolution) */
        MPI_INT,           /*= MODIFICATION3: data type of the send buffer =*/
        MPI_PROD,          /*= MODIFICATION4: reduce operation =*/
        root,              /* rank of the root process */
        MPI_COMM_WORLD     /* communicator */
    );

    MPI_Finalize();
    return 0;
}

3.2.5 Example Usage

The steps to reuse the Implementation Example are as follows:

* Implement the block labeled IMPLEMENTATION1. This block should contain the code for the "solve" function.

* Modify the data types of the variables in the block labeled MODIFICATION1.

* Modify the data type parameters in the block labeled MODIFICATION2. Each data type parameter must be the MPI data type that matches the type of the data to send and receive.

* Modify the data type parameter in the block labeled MODIFICATION3. The data type parameter must be the MPI data type that matches the type of the data to send and receive.

* Modify the reduce operation parameter in the block labeled MODIFICATION4. This parameter must be one of the MPI reduction operators.

Consider a simple addition of the elements of an integer array of size 1024. What should be modified are the solve function and the fifth (reduce operation) parameter of the MPI_Reduce call, as follows:


void solve()
{
    int i;
    /* for this problem the subsolution is a single int: the local partial sum */
    for (i = 0; i < sizeOfLocalData; i++) {
        subsolution = subsolution + localData[i];
    }
}

MPI_Reduce(
    &subsolution,
    &solution,
    numOfElement,      /* 1, since the subsolution is a single int */
    MPI_INT,
    MPI_SUM,           /* modified */
    root,
    MPI_COMM_WORLD
);


3.3 Implementation of Embarrassingly Parallel

3.3.1 Intent

This pattern shows how to implement the Embarrassingly Parallel pattern in the MPI and C environment.

3.3.2 Applicability

This pattern can be used after the program has been designed using patterns of the finding concurrency design space and patterns of the algorithm structure design space. A simple form of the Embarrassingly Parallel algorithm structure pattern is one in which the order of priority of all the tasks is known, all tasks must be completed, and all the tasks are independent. The Implementation Example is for the case in which the resulting design is such a simple form of the Embarrassingly Parallel algorithm structure pattern.

3.3.3 Implementation

The common implementation of MPI executes the same program concurrently at each processing element and performs message passing on an as-needed basis. Since the design of the parallel program has independent tasks, those tasks can be executed without restriction on the order of task execution.













If each task has a known amount of computation, the implementation of a parallel program is straightforward in the MPI and C environment. Each task can be implemented as a function and can be executed at each process using the process ID (rank) and an if-else or switch statement. Since the amount of computation of each task is known, load balance should be achieved by distributing the tasks among processes when the program is implemented.

If the amount of computation is not known, there should be some dynamic task scheduling mechanism for load balance. MPI does not provide a primitive process scheduling mechanism. One way of simulating a dynamic task scheduling mechanism is to use a task-name queue and primitive MPI message passing. We implemented this mechanism as follows: Each task is implemented as a function. The root process (with rank 0) has a task-name queue which contains the names of the tasks (functions). Each process sends a message to the root process to get a task name whenever it is idle or has finished its task. The root process waits for messages from the other processes. If the root process receives a message and the task queue is not empty, it sends back the name of a task from the task queue to the sender of that message. When the process that sent the message receives the task name, it executes the corresponding task (function). The Implementation Example is provided in Section 3.3.4.

The MPI routines that are useful for combining the locally computed subsolutions into a global solution are described in Section 3.1.3.3.

3.3.4 Implementation Example

#include <stdlib.h>
#include <mpi.h>

#define EXIT -1              /* exit message */
#define TRUE 1
#define FALSE 0
#define ROOT 0
#define NUM_OF_ALGO 9

int oneElm = 1;              /* one element */
int tag = 0;
int algoNameSend = -1;
int idleProc = -1;
int algoNameRecv = -1;

/* size of the data to send and receive */
int sizeOfData;

/* number of tasks for each algorithm */
int numOfTask[NUM_OF_ALGO];

int isAlgorithmName = TRUE;
int moreExitMessage = TRUE;
MPI_Status status;



/* Simple queue implementation (the task-name queue used by the root process). */

int SIZE_OF_Q = 20;
int* queue;
int qTail = 0;    /* index of the next free slot */
int qHead = 0;    /* index of the next element to remove */

int isEmpty();
int removeTask();

/* insert */
void insert(int task)
{
    if ((qTail + 1) % SIZE_OF_Q == qHead) {
        /* when the queue is full, grow the array */
        int* temp = (int *) malloc(2 * SIZE_OF_Q * sizeof(int));
        int count = 0;
        while (!isEmpty()) {
            temp[count++] = removeTask();   /* copy the existing elements in order */
        }
        free(queue);
        queue = temp;
        SIZE_OF_Q = SIZE_OF_Q * 2;
        qHead = 0;
        qTail = count;
    }
    queue[qTail] = task;
    qTail = (qTail + 1) % SIZE_OF_Q;
}

/* check whether or not the queue is empty */
int isEmpty()
{
    if (qTail == qHead) {   /* queue is empty */
        return 1;
    }
    else {
        return 0;
    }
}

/* remove (named removeTask to avoid clashing with remove() in the C library) */
int removeTask()
{
    int task = queue[qHead];
    qHead = (qHead + 1) % SIZE_OF_Q;
    return task;
}


/* Each function should correspond to an algorithm for tasks.            */
/* If several tasks use the same algorithm, one function serves them all. */
/* More algorithms for tasks can be added as functions.                   */

void algorithm1()
{
    /*= DATA TRANSFER1 ========================================*/
    /*= The data which will be used by this algorithm in the  =*/
    /*= local process. More of this block can be added as     =*/
    /*= needed.                                               =*/
    int* data;   /* modify the data type */

    /* receive the size of the data to receive */
    MPI_Recv(&sizeOfData, oneElm, MPI_INT, ROOT,
             tag, MPI_COMM_WORLD, &status);
    /* dynamic memory allocation */
    data = (int*) malloc(sizeOfData * sizeof(int));    /* modify the data type */
    /* receive the input data */
    MPI_Recv(data, sizeOfData,
             MPI_INT,                                  /* <- modify the data type */
             ROOT, tag, MPI_COMM_WORLD, &status);
    /*=========================================================*/

    /*= IMPLEMENTATION1 =======================================*/
    /*= The code for this algorithm should be implemented.    =*/
    /*=========================================================*/

    /* memory deallocation, e.g., free(data); */
}

void algorithm2()
{
    /*= DATA TRANSFER1 and IMPLEMENTATION1 blocks, as in algorithm1 =*/
    int* data;   /* modify the data type */

    MPI_Recv(&sizeOfData, oneElm, MPI_INT, ROOT,
             tag, MPI_COMM_WORLD, &status);
    data = (int*) malloc(sizeOfData * sizeof(int));
    MPI_Recv(data, sizeOfData, MPI_INT,                /* <- modify the data type */
             ROOT, tag, MPI_COMM_WORLD, &status);

    /* code for this algorithm should be implemented */

    /* memory deallocation, e.g., free(data); */
}

void algorithm3()
{
    /*= DATA TRANSFER1 and IMPLEMENTATION1 blocks, as in algorithm1 =*/
    int* data;   /* modify the data type */

    MPI_Recv(&sizeOfData, oneElm, MPI_INT, ROOT,
             tag, MPI_COMM_WORLD, &status);
    data = (int*) malloc(sizeOfData * sizeof(int));
    MPI_Recv(data, sizeOfData, MPI_INT,                /* <- modify the data type */
             ROOT, tag, MPI_COMM_WORLD, &status);

    /* code for this algorithm should be implemented */

    /* memory deallocation, e.g., free(data); */
}


/* main function */

int main(int argc, char **argv)
{
    int myRank;
    int mySize;
    int i;
    int j;

    MPI_Init(&argc, &argv);
    /* find the local rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    /* find out the number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &mySize);

    /* The root process distributes tasks to the other processing elements. */
    if (myRank == ROOT)
    {
        /* Code for the root process.
           The root process receives messages from the other processes when they
           are idle. It then removes an algorithm name for a task from the
           task-name queue and sends it back to the sender of the message.
           If the task queue is empty, it sends back an exit message. */

        /*= MODIFICATION1 =====================================*/
        /*= The data for each algorithm; the data types       =*/
        /*= should be modified.                               =*/
        int* algo1Data;
        int* algo2Data;
        int* algo3Data;
        int* algo4Data;
        int* algo5Data;
        /*=====================================================*/

        int numOfSentExit = 0;

        /* This array is the task queue. */
        queue = (int *) malloc(SIZE_OF_Q * sizeof(int));

        for (i = 0; i < NUM_OF_ALGO; i++) {
            numOfTask[i] = mySize;
        }
        for (i = 0; i < NUM_OF_ALGO; i++) {
            for (j = 0; j < numOfTask[i]; j++) {
                insert(i);
            }
        }

        /* Receive messages from the other processes */
        while (moreExitMessage)
        {
            int destination = ROOT;
            /* Receive a message from any process */
            MPI_Recv(&algoNameRecv, oneElm, MPI_INT, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            destination = algoNameRecv;

            if (!isEmpty())
            {
                /* If the task queue is not empty, send back an algorithm name
                   for a task to the sender of the received message */
                algoNameSend = removeTask();
                MPI_Send(&algoNameSend, oneElm, MPI_INT,
                         destination, tag, MPI_COMM_WORLD);

                switch (algoNameSend)
                {
                case 1:
                    /*= IMPLEMENTATION2 ===============================*/
                    /*= Code for calculating the size of the data to  =*/
                    /*= send and finding the starting address of the  =*/
                    /*= data to send should be implemented here.      =*/
                    /*=================================================*/

                    /*= DATA TRANSFER2 ================================*/
                    /*= More of this block can be added as needed.    =*/
                    MPI_Send(&sizeOfData, oneElm, MPI_INT,
                             destination, tag, MPI_COMM_WORLD);
                    MPI_Send(algo1Data,   /* <- modify the initial address of the data */
                             sizeOfData,
                             MPI_INT,     /* <- modify the data type */
                             destination, tag, MPI_COMM_WORLD);
                    /*=================================================*/
                    break;
                case 2:
                    /*= IMPLEMENTATION2 and DATA TRANSFER2 blocks, as in case 1,
                        using the data for algorithm 2 =*/
                    MPI_Send(&sizeOfData, oneElm, MPI_INT,
                             destination, tag, MPI_COMM_WORLD);
                    MPI_Send(algo2Data, sizeOfData, MPI_INT,   /* <- modify address and type */
                             destination, tag, MPI_COMM_WORLD);
                    break;
                case 3:
                    /*= IMPLEMENTATION2 and DATA TRANSFER2 blocks, as in case 1,
                        using the data for algorithm 3 =*/
                    MPI_Send(&sizeOfData, oneElm, MPI_INT,
                             destination, tag, MPI_COMM_WORLD);
                    MPI_Send(algo3Data, sizeOfData, MPI_INT,   /* <- modify address and type */
                             destination, tag, MPI_COMM_WORLD);
                    break;
                default:
                    break;
                }
            }
            else
            {
                /* If the task queue is empty, send an exit message */
                algoNameSend = EXIT;
                MPI_Send(&algoNameSend, oneElm, MPI_INT,
                         destination, tag, MPI_COMM_WORLD);

                /* Keep track of the number of exit messages sent.
                   When the number of exit messages equals the number of other
                   processes, the root process will not receive any more
                   messages requesting tasks. */
                numOfSentExit++;
                if (numOfSentExit == mySize - 1)
                    moreExitMessage = FALSE;
            }
        }

        /* Computation and/or message passing for combining the subsolutions
           can be added here. */
    }
    /* Code for the other (non-root) processes. These processes send a message
       requesting the next task to execute to the root process when they are
       idle, and execute the tasks received from the root. */
    else
    {
        idleProc = myRank;
        /* Send a message to the root process; the send buffer contains the
           rank of the sender process, sent whenever it is idle */
        MPI_Send(&idleProc, oneElm, MPI_INT, ROOT, tag, MPI_COMM_WORLD);
        while (isAlgorithmName)   /* while the message contains an algorithm name */
        {
            /* Receive a message from the root containing the name of the task
               to execute in this process */
            MPI_Recv(&algoNameRecv, oneElm, MPI_INT, ROOT, tag,
                     MPI_COMM_WORLD, &status);

            /* Each process executes tasks using the algorithm name and the
               data received from the root. More case statements should be
               added or removed according to the number of tasks. */
            switch (algoNameRecv)
            {
            case EXIT:
                isAlgorithmName = FALSE;
                break;
            case 1:
                algorithm1();
                break;
            case 2:
                algorithm2();
                break;
            case 3:
                algorithm3();
                break;
            default:
                break;
            }
            if (algoNameRecv != EXIT)
            {
                /* Send a message to the root process; the send buffer contains
                   the rank of the sender process, sent whenever it is idle */
                idleProc = myRank;
                MPI_Send(&idleProc, oneElm, MPI_INT, ROOT, tag, MPI_COMM_WORLD);
            }
        }
    }

    /* Code for collecting the subsolutions and computing the final solution
       can be added here. */
    MPI_Finalize();
    return 0;
}


3.3.5 Usage Example

The steps to reuse this Implementation Example are as follows:

* Implement the blocks labeled as IMPLEMENTATION1. Each block should contain
the code for one algorithm.

* Implement the block labeled as IMPLEMENTATION2. What should be implemented
is the code for calculating the initial address and the number of elements of the send
buffer for each algorithm.












" Modify the data type of the variables and data type parameters which are in the
blocks labeled as DATA TRANSFER 1. The data type of each variable should match
with the data type of the input data for each algorithm. The purpose of this "send"
function is to send input data to the destination process. More of this block can be
added on needed basis.

" Modify the data type and data type parameters which are in the block labeled as
DATA TRANSFER2. Each data type parameter must be one of MPI data type which matches with the type of the data to receive. The purpose of these "receive" function
is to receive input data for the algorithm execution on local process. More of this
block can be added on needed basis

Assume that there is a parallel program design such that the amount of computation for each task is unknown. If some of the tasks share same algorithm, these tasks should be implemented as one algorithm function in one of the block labeled as IMPLMENTATIONI. To execute the tasks, the same number of algorithm name with the number of tasks that share same algorithm should be put in the queue.
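For illustration, here is a minimal, self-contained sketch of loading such a task-name queue. The queue implementation, counts, and algorithm names below are hypothetical stand-ins for the globals and the insert() routine of the template above.

#include <stdio.h>

#define QUEUE_SIZE 16

static int queue[QUEUE_SIZE];
static int tail = 0;

static void insert(int algoName) { queue[tail++] = algoName; }

int main(void)
{
    /* Hypothetical counts: three tasks use algorithm 1, two tasks use algorithm 2. */
    int numOfTask[2] = { 3, 2 };
    int algoName[2]  = { 1, 2 };
    int i, j;

    /* One queue entry per task; tasks sharing an algorithm reuse its name. */
    for (i = 0; i < 2; i++)
        for (j = 0; j < numOfTask[i]; j++)
            insert(algoName[i]);

    for (i = 0; i < tail; i++)
        printf("task %d runs algorithm %d\n", i, queue[i]);
    return 0;
}

In the template, the root process then pops these names one by one and dispatches them to whichever worker reports itself idle.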

3.4 Implementation of Pipeline Processing

3.4.1 Intent

This is a solution to the problem of how to implement the Pipeline Processing pattern in the MPI and C environment.

3.4.2 Applicability

This pattern can be used after the program has been designed using the patterns of the Finding Concurrency design space and the Algorithm Structure design space, and the resulting design has the form of the Pipeline Processing pattern.

3.4.3 Implementation

Figure 3-2 illustrates the usage of message passing to schedule tasks in the Implementation Example. The arrows represent the blocking synchronous send and receive in MPI. The squares labeled C1, C2, etc., in Figure 3-2 represent the elements of the calculation to be performed, which can overlap each other. The calculation elements are













implemented as functions in the Implementation Example. The calculation elements (functions) should be filled in with the actual computation by the user of this Implementation Example. Adding more functions for more calculations is trivial because the message passing calls inside the functions are very similar from function to function.

Each stage (Stage 1, Stage 2, etc.) in Figure 3-2 corresponds to one process. In the Implementation Example, each stage calls its single function (calculation element) sequentially.

(The figure shows processes 1 through 4, each executing calculation elements C1, C2, C3, and C4 over time, with the stages of the different processes overlapping.)

Figure 3-2 Usage of Message Passing

One point to note is that the first pipeline stage does not receive a message and the last stage does not send a message for the scheduling of tasks.

Scheduling of tasks is achieved by blocking, synchronous-mode point-to-point communication (MPI_Ssend, MPI_Recv).












In MPI, if the send is blocking and synchronous and the receive is blocking, then the send can complete only if the matching receive has started. The statements after the matching receive will not be executed until the receive operation is finished: MPI provides synchronous communication semantics.
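As a minimal illustration of this semantics (separate from the template below), the following two-process program pairs an MPI_Ssend with a blocking MPI_Recv; the print on rank 0 cannot appear before rank 1 has reached its receive. It assumes the program is started with at least two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, token = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Completes only once rank 1 has started the matching receive. */
        MPI_Ssend(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("rank 0: send completed, so rank 1 has reached its receive\n");
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1: received %d\n", token);
    }

    MPI_Finalize();
    return 0;
}

This pairing is exactly what serializes adjacent pipeline stages in the Implementation Example.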

A duplicate of MPI_COMM_WORLD (the initial intra-communicator of all processes) is used to provide a separate communication space and safe communication.

If we change the duplicate of MPI_COMM_WORLD into an appropriate communicator for a group of processes, this structure can also be used for just the portion of a program where the Pipeline Processing pattern applies.

3.4.4 Implementation Example

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int numOfCalElem = 4;
char startMsg[7];
int myRank;
int mySize;
int tag = 1;

char *sendBuf, *recvBuf;
int sendBufSize, recvBufSize;
MPI_Status status;

/* Only four pipeline stages are implemented in this Implementation Example.
   More elements can be added or removed, according to the design of the
   parallel program. */

/* First Pipeline Stage */
void firstPipelineStage(MPI_Comm myComm)
{
    /*= IMPLEMENTATION1 =*/
    /*= code for this Pipeline Stage should be implemented here =*/
    /*= =*/

    /* send message to the next Pipeline Stage, the process with myRank+1 */
    MPI_Ssend(startMsg, strlen(startMsg), MPI_CHAR, myRank + 1, tag, myComm);

    /*= DATA TRANSFER 1 =*/
    /*= More send functions, which have the same structure, can be added =*/
    /*= to transfer data =*/
    MPI_Ssend(           /*= Modify the following parameters =*/
        sendBuf,         /*= <- starting address of buffer =*/
        sendBufSize,     /*= <- the number of elements in buffer =*/
        MPI_CHAR,        /*= <- the data type =*/
        myRank + 1, tag, myComm);
    /*= =*/
}



/* Pipeline Stage 2 */
void pipelineStage2(MPI_Comm myComm)
{
    /* Receive message from the previous Pipeline Stage, the process with myRank-1 */
    MPI_Recv(startMsg, strlen(startMsg), MPI_CHAR, myRank - 1, tag, myComm, &status);

    /*= DATA TRANSFER 2 =*/
    /*= More receive functions, which have the same structure, can be added =*/
    /*= to transfer data =*/
    MPI_Recv(            /*= Modify the following parameters =*/
        recvBuf,         /*= <- starting address of the buffer =*/
        recvBufSize,     /*= <- number of elements to receive =*/
        MPI_CHAR,        /*= <- data type =*/
        myRank - 1, tag, myComm, &status);
    /*= =*/

    /*= IMPLEMENTATION1 =*/
    /*= code for this Pipeline Stage should be implemented here =*/
    /*= =*/

    /* send message to the next Pipeline Stage, the process with myRank+1 */
    MPI_Ssend(startMsg, strlen(startMsg), MPI_CHAR, myRank + 1, tag, myComm);

    /*= DATA TRANSFER 1 =*/
    /*= More send functions, which have the same structure, can be added =*/
    /*= to transfer data =*/
    MPI_Ssend(           /*= Modify the following parameters =*/
        sendBuf,         /*= <- starting address of buffer =*/
        sendBufSize,     /*= <- the number of elements in buffer =*/
        MPI_CHAR,        /*= <- the data type =*/
        myRank + 1, tag, myComm);
    /*= =*/
}



/* Pipeline Stage 3 */
void pipelineStage3(MPI_Comm myComm)
{
    /* Receive message from the previous Pipeline Stage, the process with myRank-1 */
    MPI_Recv(startMsg, strlen(startMsg), MPI_CHAR, myRank - 1, tag, myComm, &status);

    /*= DATA TRANSFER 2 =*/
    /*= More receive functions, which have the same structure, can be added =*/
    /*= to transfer data =*/
    MPI_Recv(            /*= Modify the following parameters =*/
        recvBuf,         /*= <- starting address of the buffer =*/
        recvBufSize,     /*= <- number of elements to receive =*/
        MPI_CHAR,        /*= <- data type =*/
        myRank - 1, tag, myComm, &status);
    /*= =*/

    /*= IMPLEMENTATION1 =*/
    /*= code for this Pipeline Stage should be implemented here =*/
    /*= =*/

    /* send message to the next Pipeline Stage, the process with myRank+1 */
    MPI_Ssend(startMsg, strlen(startMsg), MPI_CHAR, myRank + 1, tag, myComm);

    /*= DATA TRANSFER 1 =*/
    /*= More send functions, which have the same structure, can be added =*/
    /*= to transfer data =*/
    MPI_Ssend(           /*= Modify the following parameters =*/
        sendBuf,         /*= <- starting address of buffer =*/
        sendBufSize,     /*= <- the number of elements in buffer =*/
        MPI_CHAR,        /*= <- the data type =*/
        myRank + 1, tag, myComm);
    /*= =*/
}



/* Pipeline Stage 4 */
void pipelineStage4(MPI_Comm myComm)
{
    /* Receive message from the previous Pipeline Stage, the process with myRank-1 */
    MPI_Recv(startMsg, strlen(startMsg), MPI_CHAR, myRank - 1, tag, myComm, &status);

    /*= DATA TRANSFER 2 =*/
    /*= More receive functions, which have the same structure, can be added =*/
    /*= to transfer data =*/
    MPI_Recv(            /*= Modify the following parameters =*/
        recvBuf,         /*= <- starting address of the buffer =*/
        recvBufSize,     /*= <- number of elements to receive =*/
        MPI_CHAR,        /*= <- data type =*/
        myRank - 1, tag, myComm, &status);
    /*= =*/

    /*= IMPLEMENTATION1 =*/
    /*= code for this Pipeline Stage should be implemented here =*/
    /*= =*/

    /* send message to the next Pipeline Stage, the process with myRank+1 */
    MPI_Ssend(startMsg, strlen(startMsg), MPI_CHAR, myRank + 1, tag, myComm);

    /*= DATA TRANSFER 1 =*/
    /*= More send functions, which have the same structure, can be added =*/
    /*= to transfer data =*/
    MPI_Ssend(           /*= Modify the following parameters =*/
        sendBuf,         /*= <- starting address of buffer =*/
        sendBufSize,     /*= <- the number of elements in buffer =*/
        MPI_CHAR,        /*= <- the data type =*/
        myRank + 1, tag, myComm);
    /*= =*/
}



/* Pipeline Stage 5 */
void pipelineStage5(MPI_Comm myComm)
{
    /* Receive message from the previous Pipeline Stage, the process with myRank-1 */
    MPI_Recv(startMsg, strlen(startMsg), MPI_CHAR, myRank - 1, tag, myComm, &status);

    /*= DATA TRANSFER 2 =*/
    /*= More receive functions, which have the same structure, can be added =*/
    /*= to transfer data =*/
    MPI_Recv(            /*= Modify the following parameters =*/
        recvBuf,         /*= <- starting address of the buffer =*/
        recvBufSize,     /*= <- number of elements to receive =*/
        MPI_CHAR,        /*= <- data type =*/
        myRank - 1, tag, myComm, &status);
    /*= =*/

    /*= IMPLEMENTATION1 =*/
    /*= code for this Pipeline Stage should be implemented here =*/
    /*= =*/

    /* send message to the next Pipeline Stage, the process with myRank+1 */
    MPI_Ssend(startMsg, strlen(startMsg), MPI_CHAR, myRank + 1, tag, myComm);

    /*= DATA TRANSFER 1 =*/
    /*= More send functions, which have the same structure, can be added =*/
    /*= to transfer data =*/
    MPI_Ssend(           /*= Modify the following parameters =*/
        sendBuf,         /*= <- starting address of buffer =*/
        sendBufSize,     /*= <- the number of elements in buffer =*/
        MPI_CHAR,        /*= <- the data type =*/
        myRank + 1, tag, myComm);
    /*= =*/
}













/* Last Pipeline Stage */
void lastPipelineStage(MPI_Comm myComm)
{
    /* Receive message from the previous Pipeline Stage, the process with myRank-1 */
    MPI_Recv(startMsg, strlen(startMsg), MPI_CHAR, myRank - 1, tag, myComm, &status);

    /*= DATA TRANSFER 2 =*/
    /*= More receive functions, which have the same structure, can be added =*/
    /*= to transfer data =*/
    MPI_Recv(            /*= Modify the following parameters =*/
        recvBuf,         /*= <- starting address of the buffer =*/
        recvBufSize,     /*= <- number of elements to receive =*/
        MPI_CHAR,        /*= <- data type =*/
        myRank - 1, tag, myComm, &status);
    /*= =*/

    /*= IMPLEMENTATION1 =*/
    /*= code for this Pipeline Stage should be implemented here =*/
    /*= =*/
}



void main(argc, argv)
int argc;
char **argv;
{
    int i = 0;
    MPI_Comm myComm;

    MPI_Init(&argc, &argv);

    /* duplicate MPI_COMM_WORLD to get a separate, safe communication space */
    MPI_Comm_dup(MPI_COMM_WORLD, &myComm);
    /* find the rank of this process */
    MPI_Comm_rank(myComm, &myRank);
    /* find out the rank of the last process by using the size of the communicator */
    MPI_Comm_size(myComm, &mySize);

    strcpy(startMsg, " start");

    /* each stage runs once per calculation element
       (loop bounds reconstructed as numOfCalElem) */
    switch (myRank)
    {
    case 0:
        for (i = 0; i < numOfCalElem; i++) firstPipelineStage(myComm);
        break;
    case 1:
        for (i = 0; i < numOfCalElem; i++) pipelineStage2(myComm);
        break;
    case 2:
        for (i = 0; i < numOfCalElem; i++) pipelineStage3(myComm);
        break;
    case 3:
        for (i = 0; i < numOfCalElem; i++) pipelineStage4(myComm);
        break;
    case 4:
        for (i = 0; i < numOfCalElem; i++) pipelineStage5(myComm);
        break;
    /*= MODIFICATION2 =*/
    /*= More case statements can be added or removed on a needed basis =*/
    case 5:
        for (i = 0; i < numOfCalElem; i++) lastPipelineStage(myComm);
        break;
    default:
        break;
    }

    MPI_Finalize();
}

3.4.5 Usage Example

The steps to use this framework are as follows:

* Add or remove pipeline stage functions according to the number of pipeline stages in
the program design. Each pipeline stage function has the same message passing
structure, so it can be copied and pasted to implement more pipeline stages.
However, the first and last pipeline stages should not be removed, because the first
pipeline stage does not receive messages and the last pipeline stage does not send
messages.












" Add or remove the case statements, in the block labeled as MODIFICATION2,
according to the number of pipeline stages of the program design.

" Implement the blocks labeled as IMPLEMENTATIONI. Each block should contain
codes for each calculation element.

" Modify the initial address of send buffer, the number of elements in send buffer, and
the data type parameters which are in the block labeled as DATA TRANSFERI. The
data to be sent is the input for the next calculation element. More send functions,
which have same structure, can be added according to the need.

" Modify the initial address of receive buffer, the number of elements in receive buffer,
and the data type parameters which are in the block labeled as DATA TRANSFER2.
The data to be received is the input for this calculation element. More receive
functions, which have same structure, can be added according to the need.

Consider the following problem as an example to parallelize: there are four sets of SAT scores, one from each of four schools, and we want to find the standard deviation for each school, but the main memory of one processor barely holds the scores of one school.

To solve this problem, the pipeline stages can be defined as follows: the first pipeline stage computes the sum and average of the SAT scores of a school; the second pipeline stage computes the differences between each individual score and the average; the third pipeline stage computes the square of each difference; the fourth pipeline stage computes the sum of the squares.

This problem can easily be implemented by following the above steps; the implementation is provided in the appendix, and a sequential sketch of the per-stage computations follows.
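The sketch below runs the four calculation steps sequentially for a single school; the scores are made-up sample data, and the population standard deviation is used for simplicity. In the pipeline version, each numbered step would run in its own stage.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Illustrative scores for one school. */
    double scores[] = { 1200, 1350, 1100, 1420, 1280 };
    int n = sizeof(scores) / sizeof(scores[0]);
    double sum = 0.0, avg, diff, sumOfSquares = 0.0;
    int i;

    for (i = 0; i < n; i++)            /* stage 1: sum, then average        */
        sum += scores[i];
    avg = sum / n;

    for (i = 0; i < n; i++) {
        diff = scores[i] - avg;        /* stage 2: difference from average  */
        sumOfSquares += diff * diff;   /* stages 3 and 4: square and sum    */
    }

    printf("standard deviation = %f\n", sqrt(sumOfSquares / n));
    return 0;
}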

3.5 Implementation of Asynchronous-Composition

3.5.1 Intent

This is a solution to the problem of how to implement, in MPI (Message Passing Interface) and C, a parallel algorithm resulting from the Asynchronous-Composition pattern of the algorithm structure design space.












3.5.2 Applicability

Problems are represented as a collection of semi-independent entities interacting in an irregular way.

3.5.3 Implementation

Three points that need to be defined to implement an algorithm resulting from the Asynchronous-Composition pattern using MPI are tasks/entities, events, and task scheduling.

A task/entity, which generates events and processes them, can be represented as a process in an implementation of the algorithm using MPI. An event corresponds to a message sent from the event-generating task to the event-processing task. All tasks can be executed concurrently.

In this Implementation Example, each case block in a switch statement should contain code for each semi-independent entity (process) which will be executed concurrently.

An event corresponds to a message sent from the event-generating task (process) to the event-processing task (process) in the MPI and C environment. Therefore, event handling is message handling. For safe message passing among processes, it is necessary to create groups of the processes that need to communicate, and to create communicators for those groups in MPI. Because of that, group and communicator creation is added in this Implementation Example.

This Implementation Example is also an example of event handling in an MPI environment. In a situation where an entity (process) receives irregular events (messages) from other known entities (processes), we can implement it using the MPI-defined constants MPI_ANY_SOURCE and MPI_ANY_TAG.












Assume that processes A, B, and C send messages in an irregular way to process D, and process D handles (processes) the events (messages). This is the same situation as the car/driver example of the Asynchronous-Composition pattern. To use MPI_ANY_SOURCE and MPI_ANY_TAG in the MPI receive routine, we need to create a dedicated group and communicator for these entities (processes), to prevent entity (process) D from receiving messages from other entities (processes) that are not intended to send messages to it.





(The figure shows processes A, B, and C sending irregular event messages to process D, which handles them.)

Figure 3-3 Irregular Message Handling
3.5.4 Implementation Example

#include <stdio.h>
#include <string.h>
#include <mpi.h>

main(argc, argv)
int argc;
char **argv;
{
    /* More group variables can be added or removed on a needed basis */
    MPI_Group MPI_GROUP_WORLD, group_A, group_B;

    /* More communicator variables can be added or removed on a needed basis */
    MPI_Comm comm_A, comm_B;

    /* This variable will hold the rank of each process in the MPI default
       communicator MPI_COMM_WORLD */
    int rank_in_world;

    /* These variables will hold the rank of each process in group A and group B.
       Variables can be added or removed on a needed basis. */
    int rank_in_group_A;
    int rank_in_group_B;

    /* ranks of the processes which will be used to create subgroups.
       More arrays of ranks can be added or removed on a needed basis. */
    int ranks_a[] = { 1, 2, 3, 4 };
    int ranks_b[] = { 1, 2, 3, 4 };

    MPI_Init(&argc, &argv);

    /* Create a group of processes and a communicator for the group.
       More groups and communicators can be created or removed on a needed basis. */
    MPI_Comm_group(MPI_COMM_WORLD, &MPI_GROUP_WORLD);

    /* Create group A and its communicator */
    MPI_Group_incl(MPI_GROUP_WORLD, 4, ranks_a, &group_A);
    MPI_Comm_create(MPI_COMM_WORLD, group_A, &comm_A);

    /* Create group B and its communicator */
    MPI_Group_incl(MPI_GROUP_WORLD, 4, ranks_b, &group_B);
    MPI_Comm_create(MPI_COMM_WORLD, group_B, &comm_B);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank_in_world);

    switch (rank_in_world)
    {

    /* This case contains code to be executed in process 0 */
    case 0:

        /* events can be generated or processed */

        /* work */
        break;














    /* This case contains code to be executed in process 1 */
    case 1:
    {
        char sendBuffer[20];
        int isOn = 1;
        int i = 0;

        while (isOn)
        {
            /* work that needs to be done before generating an event */
            strcpy(sendBuffer, " event");   /* an example */

            /* Generate Event (message passing) */
            MPI_Send(sendBuffer, strlen(sendBuffer), MPI_CHAR, 3, 1, comm_B);
            printf(" sent message");
            i++;

            /* break loop */
            if (i == 10)   /* this should be changed according to the problem */
            {
                isOn = 0;
            }
        }
        break;
    }



    /* This case contains code to be executed in process 2 */
    case 2:
    {
        char sendBuffer[20];
        int isOn = 1;
        int i = 0;

        while (isOn)
        {
            /* work that needs to be done before generating an event */
            strcpy(sendBuffer, " event");   /* an example */

            /* Generate Event (message) */
            MPI_Send(sendBuffer, strlen(sendBuffer), MPI_CHAR, 3, 1, comm_B);
            i++;

            /* break loop */
            if (i == 10)   /* this should be changed according to the problem */
            {
                isOn = 0;
            }
        }
        break;
    }
    /* This case contains code to be executed in process 3 */
    case 3:
    {
        char sendBuffer[20];
        int isOn = 1;
        int i = 0;

        while (isOn)
        {
            /* work that needs to be done before generating an event */
            strcpy(sendBuffer, " event");   /* an example */

            /* Generate Event (message) */
            MPI_Send(sendBuffer, strlen(sendBuffer), MPI_CHAR, 3, 1, comm_B);
            i++;

            /* break loop */
            if (i == 10)   /* this should be changed according to the problem */
            {
                isOn = 0;
            }
        }
        break;
    }
    /* This case contains code to be executed in process 4 */
    case 4:
    {
        char receiveBuffer[20];
        int isOn = 1;
        int messageCount = 0;
        MPI_Status status;

        while (isOn)
        {
            MPI_Recv(receiveBuffer, 20, MPI_CHAR, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, comm_B, &status);
            messageCount++;

            if (0 == strncmp(receiveBuffer, " event", 3))
            {
                /* work to process the event (message) */
                printf("\nreceived an event at process 4");
            }

            if (messageCount == 30)
            {
                isOn = 0;
            }
        }
        break;
    }

    /* more cases (processes) can be added or removed. */
    default:
        break;
    }

    MPI_Finalize();
}

3.6 Implementation of Divide and Conquer

3.6.1 Intent

For the completeness of the parallel pattern language, it is beneficial to show a programmer an implementation example of a top-level program structure for the Divide-and-Conquer pattern in the Algorithm Structure design space.

3.6.2 Motivation

The top-level program structure of the divide-and-conquer strategy that is stated in the Divide-and-Conquer pattern is as follows:

Solution solve(Problem P) {
    if (baseCase(P))
        return baseSolve(P);
    else {
        Problem subProblems[N];
        Solution subSolutions[N];
        subProblems = split(P);
        for (int i = 0; i < N; i++)
            subSolutions[i] = solve(subProblems[i]);
        return merge(subSolutions);
    }
}














This structure can be mapped onto a design in terms of tasks by defining one task for each invocation of the solve function.


Common MPI implementations allocate processes statically at program initialization. We therefore need a mechanism to invoke the solve function on another processor whenever the solve function calls the split function and the split function divides the problem into subproblems, so that the subproblems can be solved concurrently.

3.6.3 Applicability

This pattern can be used after the program has been designed using the patterns of the parallel pattern language, when the resulting design has the form of the Divide-and-Conquer pattern and we want to implement the design in an MPI and C environment. This structure can be used as an example of how to implement a top-level program structure in an MPI and C environment, or it can be adopted directly as an implementation of the program by adjusting the control parameters and adding the computational parts of the program.

3.6.4 Implementation

We are trying to implement a top-level program structure of the divide-and-conquer strategy in MPI and C so that a parallel program design resulting from the Divide-and-Conquer pattern can be implemented by filling in each function and/or adjusting the structure. The basic idea is to use message passing to invoke the solve functions at other processing elements when needed. In a standard MPI and C environment, each processing element has the same copy of the program as the other processing elements, and the same program executes at each processing element, communicating with the others on a needed basis. When the problem splits into subproblems, we need to call the solve












function at other processes so that the subproblems can be solved concurrently. To call the solve functions, the split function sends messages to other processing elements, and a blocking MPI receive accepts the message before the first solve call at each process. This message can carry the data/tasks divided by the split function. In this structure, every solve function calls the merge function to merge subsolutions.







(The figure shows a binary tree of split and solve calls across the processes; an arrow from each split to the solve it invokes represents the message sent from split to solve.)

Figure 3-4 Message Passing for Invocation of the Solve and Merge Functions

For simplicity, we split a problem into two subproblems and used the problem size to determine whether or not a subproblem is a base case. Figure 3-4 shows the sequence of












function calls in each process and the message passing used to invoke the solve function at a remote process for a new subproblem and to merge the subsolutions.

3.6.5 Implementation Example

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define DATA_TYPE int

int numOfProc;        /* number of available processes */
int my_rank;
int ctrlMsgSend, ctrlMsgRecv;
int* localData;
int dataSizeSend, dataSizeRecv;
int dataSizeLeft;     /* size of the data kept locally after a split */
int* count;           /* work array allocated in main (problem specific) */
int maxInt;           /* key bound used for that work array (problem specific) */

int split(int numOfProcLeft);      /* forward declarations */
void merge(int numOfProcLeft);
int baseCase(int numOfProcLeft);
int baseSolve(int numOfProcLeft);

/* Solve a problem */
void solve(int numOfProcLeft)
{
    if (baseCase(numOfProcLeft)) {
        baseSolve(numOfProcLeft);
        merge(numOfProcLeft);
    }
    else {
        split(numOfProcLeft);
        if (numOfProcLeft != numOfProc)
            merge(numOfProcLeft);
    }
}

/* split a problem into two subproblems */
int split(int numOfProcLeft)
{
    /*= IMPLEMENTATION2 =*/
    /*= Code for splitting a problem into two subproblems =*/
    /*= should be implemented =*/
    /*= =*/

    ctrlMsgSend = numOfProcLeft / 2;

    /* invoke a solve function at the remote process */
    MPI_Send(&ctrlMsgSend, 1, MPI_INT, my_rank + numOfProc / numOfProcLeft,
             0, MPI_COMM_WORLD);

    /*= DATA TRANSFER 1 =*/
    /*= More of this block can be added on a needed basis =*/
    MPI_Send(&dataSizeSend, 1, MPI_INT, my_rank + numOfProc / numOfProcLeft,
             0, MPI_COMM_WORLD);
    MPI_Send(
        &localData[dataSizeLeft],   /* <- modify address of data */
        dataSizeSend,
        MPI_INT,                    /* <- modify data type */
        my_rank + numOfProc / numOfProcLeft,
        0, MPI_COMM_WORLD);
    /*= =*/

    /* invoke a solve function at the local process */
    solve(numOfProcLeft / 2);
    return 0;
}


/* Merge two subsolutions into a solution */
void merge(int numOfProcLeft)
{
    if (my_rank >= numOfProc / (numOfProcLeft * 2))
    {
        ctrlMsgSend = 1;

        /* Send a subsolution to the process from which this process got the subproblem */
        MPI_Send(&ctrlMsgSend, 1, MPI_INT,
                 my_rank - numOfProc / (numOfProcLeft * 2),
                 0, MPI_COMM_WORLD);

        /*= DATA TRANSFER 3 =*/
        /*= More of this block can be added on a needed basis =*/
        MPI_Send(&dataSizeSend, 1, MPI_INT,
                 my_rank - numOfProc / (numOfProcLeft * 2),
                 0, MPI_COMM_WORLD);
        MPI_Send(
            localData,              /* <- modify address of data */
            dataSizeSend,
            MPI_INT,                /* <- modify data type */
            my_rank - numOfProc / (numOfProcLeft * 2),
            0, MPI_COMM_WORLD);
        /*= =*/
    }
    else
    {
        MPI_Status status;

        /* Receive a subsolution from the process which was invoked by this process */
        MPI_Recv(&ctrlMsgRecv, 1, MPI_INT,
                 my_rank + numOfProc / (numOfProcLeft * 2),
                 0, MPI_COMM_WORLD, &status);

        /*= DATA TRANSFER 3 =*/
        /*= More of this block can be added on a needed basis =*/
        MPI_Recv(&dataSizeRecv, 1, MPI_INT,
                 my_rank + numOfProc / (numOfProcLeft * 2),
                 0, MPI_COMM_WORLD, &status);
        MPI_Recv(
            &localData[dataSizeLeft],   /* <- modify address of data */
            dataSizeRecv,
            MPI_INT,                    /* <- modify data type */
            my_rank + numOfProc / (numOfProcLeft * 2),
            0, MPI_COMM_WORLD, &status);
        /*= =*/

        /*= IMPLEMENTATION3 =*/
        /*= Code for merging two subsolutions into one solution =*/
        /*= should be implemented =*/
        /*= =*/
    }
}













/* Decide whether a problem is a "base case" */
/* that can be solved without further splitting */
int baseCase(int numOfProcLeft)
{
    if (numOfProcLeft == 1)
        return 1;
    else
        return 0;
}

/* Solve a base-case problem */
int baseSolve(int numOfProcLeft)
{
    /*= IMPLEMENTATION1 =*/
    /*= Code for solving the base-case problem should be implemented =*/
    /*= =*/
    return 0;
}


main(argc, argv)
int argc;
char **argv;
{
    int i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numOfProc);

    count = (int *) malloc((maxInt + 1) * sizeof(int));

    if (my_rank == 0) {
        ctrlMsgSend = numOfProc;   /* the number of processes must be a power of 2 */
        ctrlMsgRecv = numOfProc;
        solve(ctrlMsgRecv);
    }
    else {
        /* Every process waits for a message before calling the solve function */
        MPI_Recv(&ctrlMsgRecv, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);

        /*= DATA TRANSFER 2 =*/
        /*= More of this block can be added on a needed basis =*/
        MPI_Recv(&dataSizeRecv, 1, MPI_INT, MPI_ANY_SOURCE,
                 MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        MPI_Recv(
            localData,              /* <- modify address of data */
            dataSizeRecv,
            MPI_INT,                /* <- modify data type */
            MPI_ANY_SOURCE, MPI_ANY_TAG,
            MPI_COMM_WORLD, &status);
        /*= =*/

        dataSizeLeft = dataSizeRecv;
        solve(ctrlMsgRecv);
    }

    if (my_rank == 0) {
        /* print the final solution (loop bound reconstructed; adjust to the
           size of the merged result) */
        for (i = 0; i < dataSizeLeft; i++)
            printf(" %d", localData[i]);
    }

    MPI_Finalize();
}


3.6.6 Usage Example

To use the previous framework, the number of available processes must be a power of 2. The steps to use this Implementation Example code are as follows:

* Implement the block labeled as IMPLEMENTATION1. This block should contain
the code for the "baseSolve" algorithm.

* Implement the block labeled as IMPLEMENTATION2. This block should contain
the code for the "split" algorithm.

* Implement the block labeled as IMPLEMENTATION3. This block should contain
the code for the "merge" algorithm.

* Find out and modify the initial address of send buffer, the number of elements in send
buffer, and the data type parameters which are in the block labeled as DATA












TRANSFER1. The data to be sent is the input for the solve function of remote
process. More of this block can be added on needed basis.

* Find out and modify the initial address of receive buffer, the number of elements to
receive, and the data type parameters which are in the block labeled as DATA
TRANSFER2. The data to receive is the input for the solve function of local process.
More of this block can be added on needed basis.

* Find out and modify the initial address of receive buffer, the number of elements to
receive, and the data type parameters which are in the block labeled as DATA
TRANSFER3. The data to receive and send are the input for the merge function of
local process. More of this block can be added on needed basis.

One point to note about using the Implementation Example is that the "baseCase" of the Implementation Example is when there are no more available processes.

Merge sort is a well-known algorithm that uses the divide-and-conquer strategy. If we consider a merge sort over N integers as the problem to parallelize, it can easily be implemented using the Implementation Example. To do so, the merge sort algorithm should be adapted as follows: the "baseSolve" is a counting sort over the local data (an array of integers); the "split" algorithm divides the received data (an array of integers) into two contiguous subarrays; the "merge" algorithm merges two sorted arrays into one sorted array (see the sketch below); and the "baseCase" of this problem is reached when there are no more available processes. The problem can be implemented by following the implementation steps above. The implementation of this problem is provided in the appendix.
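As an illustration of the "merge" step named above, here is a minimal sequential sketch that merges two sorted integer arrays; the array contents, sizes, and names are illustrative only.

#include <stdio.h>

/* Merge two sorted integer arrays a (length na) and b (length nb) into out. */
static void mergeSorted(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

int main(void)
{
    int left[]  = { 1, 4, 7 };
    int right[] = { 2, 3, 9, 10 };
    int merged[7];
    int k;

    mergeSorted(left, 3, right, 4, merged);
    for (k = 0; k < 7; k++) printf("%d ", merged[k]);
    printf("\n");
    return 0;
}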

















CHAPTER 4
KERNEL IS OF NAS BENCHMARK

The Numerical Aerodynamic Simulation (NAS) Program, which is based at NASA Ames Research Center, developed a set of benchmarks for the performance evaluation of highly parallel supercomputers. These benchmarks, NPB 1 (NAS Parallel Benchmarks 1), consist of five parallel kernels and three simulated application benchmarks. The principal distinguishing feature of these benchmarks is their "pencil and paper" specification: all details of the benchmarks are specified only algorithmically. Kernel IS (parallel sort over small integers) is one of the kernel benchmarks in the set.

4.1 Brief Statement of Problem

Sorting N integer keys in parallel is the problem. The number of keys is 2^23 for class A and 2^25 for class B. The range of keys is [0, Bmax), where Bmax is 2^19 for class A and 2^21 for class B. The keys must be generated by the key generation algorithm of the NAS benchmark set, and the initial distribution of the keys must follow the specification of the memory mapping of keys. Even though the problem is sorting, what is timed is the time needed to rank every key; the permutation of keys is not timed.

4.2 Key Generation and Memory Mapping

The keys must be generated using the pseudorandom number generator of the NAS Parallel Benchmarks. The numbers generated by this pseudorandom number generator have range (0, 1) and a very nearly uniform distribution over the unit interval.11-12 The keys are generated from these numbers in the following way. Let r_j













be the j-th random fraction, uniformly distributed in the range (0, 1), and let K_i be the i-th key. The value of K_i is determined as

    K_i <- floor( Bmax * (r_{4i} + r_{4i+1} + r_{4i+2} + r_{4i+3}) / 4 )    for i = 0, 1, ..., N-1.

K_i must be an integer, and floor() indicates truncation. Bmax is 2^19 for class A and 2^21 for class B. For the distribution of keys among memory, all keys initially must be stored in the main memory units, and each memory unit must hold the same number of keys in a contiguous address space. If the keys are not evenly divisible, the last memory unit may hold a different number of keys, but it must follow the specification of the NAS Benchmark. See Appendix A for details.
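The following sketch shows the key-generation loop implied by this formula. The function next_random_fraction() is only a stand-in for the NAS pseudorandom number generator required by the benchmark (a plain rand() is used here for illustration and would not satisfy the specification), and N is kept tiny.

#include <stdio.h>
#include <stdlib.h>

/* Placeholder for the NAS generator: returns a fraction in (0, 1). */
static double next_random_fraction(void)
{
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
}

/* K_i = floor(Bmax * (r_{4i} + r_{4i+1} + r_{4i+2} + r_{4i+3}) / 4) */
static void generate_keys(int *key, long N, long Bmax)
{
    long i;
    for (i = 0; i < N; i++) {
        double r = next_random_fraction() + next_random_fraction()
                 + next_random_fraction() + next_random_fraction();
        key[i] = (int)(Bmax * (r / 4.0));   /* cast truncates, as required */
    }
}

int main(void)
{
    enum { N = 8, BMAX = 1 << 19 };   /* tiny N for illustration; class A uses Bmax = 2^19 */
    int key[N];
    int i;

    generate_keys(key, N, BMAX);
    for (i = 0; i < N; i++) printf("%d\n", key[i]);
    return 0;
}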

4.3 Procedure and Timing

The implementation of Kernel IS must follow this procedure. The partial verification of this procedure tests the ranks of five unique keys, each of which has a unique rank. The full verification rearranges the sequence of keys using the computed rank of each key and tests that the keys are sorted.

1. In a scalar sequential manner and using the key generation algorithm described above, generate the sequence of N keys.

2. Using the appropriate memory mapping described above, load the N keys into the memory system.

3. Begin timing.

4. Do, for i = 1 to Imax:

(a) Modify the sequence of keys by making the following two changes:

    K_i <- i

    K_{i+Imax} <- (Bmax - i)












(b) Compute the rank of each key.

(c) Perform the partial verification test described above.

5. End timing.

6. Perform the full verification test described previously.

Table 4-1 Parameter Values to Be Used for the Benchmark

Parameter    Class A       Class B
N            2^23          2^25
Bmax         2^19          2^21
seed         314159265     314159265
Imax         10            10
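A sketch of the timed portion of this procedure (steps 3-5) is shown below. The routines rank_keys() and partial_verify() are empty placeholders for the real ranking and verification code, and the key array is far smaller than the benchmark requires.

#include <stdio.h>

#define IMAX 10                /* iteration count from Table 4-1           */
#define BMAX (1 << 19)         /* class A key range; class B uses 1 << 21  */
#define N    64                /* tiny key count for illustration only     */

static int key[N];

/* Placeholder: the real routine computes the rank of every key (Chapter 5). */
static void rank_keys(void) { }

/* Placeholder: the real routine checks the ranks of five chosen keys. */
static void partial_verify(int iteration) { (void)iteration; }

int main(void)
{
    int i;
    /* ... generate and distribute keys, then begin timing (steps 1-3) ... */
    for (i = 1; i <= IMAX; i++) {
        key[i] = i;                 /* step 4(a): K_i <- i                  */
        key[i + IMAX] = BMAX - i;   /* step 4(a): K_{i+Imax} <- Bmax - i    */
        rank_keys();                /* step 4(b)                            */
        partial_verify(i);          /* step 4(c)                            */
    }
    /* ... end timing, then full verification (steps 5-6) ... */
    printf("done\n");
    return 0;
}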

















CHAPTER 5
PARALLEL PATTERNS USED TO IMPLEMENT KERNEL IS

In this chapter, we will explain how we used the parallel pattern language to design the parallel program and implement it in an MPI and C environment.

5.1 Finding Concurrency

5.1.1 Getting Started

As advised by this pattern, we have to understand the problem we are trying to solve. According to the specification of Kernel IS of the NAS Parallel Benchmarks, the range of the keys to be sorted is [0, 2^19) for class A and [0, 2^21) for class B. Because of this limited range, bucket sort, radix sort, and counting sort can sort the keys in O(n) time. These sequential sorting algorithms are good candidates to parallelize because of their speed. Counting sort was selected as the target algorithm to parallelize for the following reasons. According to the specification of the benchmark, what is timed is the time needed to find the rank of each key; the actual permutation of keys occurs after the timing. This means that it is important to minimize the time needed to find the rank of each key. One interesting characteristic of the three candidate sorting algorithms is the following: radix sort and bucket sort permute the keys to find the rank of each key (in other words, the keys must be sorted to know the rank of every key), whereas counting sort does not need to sort the keys to find the rank of every key. This means that counting sort takes less time to rank every key than the others. Another reason is the simplicity of the counting sort algorithm.













The basic idea of the counting sort algorithm is to determine, for each element x, the number of elements less than x. This information can be used to place element x directly into its position in the output array; the number of elements less than x is rank(x) - 1.

The counting sort algorithm is as follows:

Counting-Sort(A, B, k)
  // A is the input array, B is the output array, C is an intermediate array,
  // and k is the largest key in the input array.
  for i <- 1 to k
      do C[i] <- 0
  for j <- 1 to length[A]
      do C[A[j]] <- C[A[j]] + 1
  // C[i] now contains the number of elements equal to i
  for i <- 2 to k
      do C[i] <- C[i] + C[i-1]
  // C[i] now contains the number of elements less than or equal to i
  for j <- length[A] down to 1
      do B[C[A[j]]] <- A[j]
         C[A[j]] <- C[A[j]] - 1

Algorithm 4-1 Counting Sort
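The ranking step the thesis relies on can be isolated from the permutation, as in this small C sketch (the sample keys are illustrative; ranks are 1-based, as in the text above).

#include <stdio.h>
#include <string.h>

#define MAXKEY 9

int main(void)
{
    int A[] = { 3, 1, 4, 1, 5, 9, 2, 6 };
    int n = sizeof(A) / sizeof(A[0]);
    int C[MAXKEY + 1];
    int rank[8];
    int i, j;

    memset(C, 0, sizeof(C));
    for (j = 0; j < n; j++) C[A[j]]++;              /* count occurrences          */
    for (i = 1; i <= MAXKEY; i++) C[i] += C[i-1];   /* prefix sums: "<=" counts   */

    for (j = n - 1; j >= 0; j--)                    /* rank = position in output  */
        rank[j] = C[A[j]]--;

    for (j = 0; j < n; j++)
        printf("key %d has rank %d\n", A[j], rank[j]);
    return 0;
}

No element of A is moved; only the rank array is produced, which is exactly what the benchmark times.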
5.1.2 Decomposition Strategy

The decision to be made in this pattern is whether to follow the data decomposition or the task decomposition pattern. The key data structure of the counting sort algorithm is an array of integer keys, and the most compute-intensive part of the algorithm is counting the occurrences of each key in the key array and summing up the counts to determine the rank of each key. Task-based decomposition is a good starting point if it is natural to think about the problem in terms of a collection of independent (or nearly independent) tasks. We can think of the problem in terms of tasks that count the occurrences of keys in the key array, so we followed the task decomposition pattern.












5.1.3 Task Decomposition

The Task Decomposition pattern suggests finding tasks in functions, loops, and updates on data. We tried to find tasks in one of the four loops of the counting sort algorithm. Using this pattern, we found that there are enough independent iterations in the second for loop of the counting sort algorithm. These iterations can be divided into enough relatively independent tasks, which together find the rank of each key. We can think of each task as concurrently counting the occurrences of keys in array A, as sketched below. Array A is shared among the tasks at this point; it has not yet been divided.
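A minimal sketch of this loop splitting, simulated sequentially with two tasks (the key values, task count, and names are illustrative; in the parallel program each block is handled by its own process):

#include <stdio.h>
#include <string.h>

#define MAXKEY 9
#define NUMTASKS 2

int main(void)
{
    int A[] = { 3, 1, 4, 1, 5, 9, 2, 6 };
    int n = sizeof(A) / sizeof(A[0]);
    int C_local[NUMTASKS][MAXKEY + 1];
    int t, j;

    memset(C_local, 0, sizeof(C_local));
    for (t = 0; t < NUMTASKS; t++) {
        int chunk = (n + NUMTASKS - 1) / NUMTASKS;   /* block size per task     */
        int lo = t * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;  /* last block may be short */
        for (j = lo; j < hi; j++)
            C_local[t][A[j]]++;                      /* each task counts its block */
    }

    for (t = 0; t < NUMTASKS; t++)
        printf("task %d counted %d keys equal to 1\n", t, C_local[t][1]);
    return 0;
}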

Because of the dependencies among tasks, this pattern recommends using the Dependency Analysis pattern.

5.1.4 Dependency Analysis

Using this pattern, data-sharing dependencies are found on array A. Each task shares array A because every task accesses the array to read the number in each cell. If one task has read and counted the integer in a cell, that number should not be read and counted again.

Array C is shared, too, because each task accesses this array to accumulate the count for each number, and a cell must not be accessed or updated concurrently by two or more tasks at the same time. We follow the Data Sharing pattern since data-sharing dependencies are found.

5.1.5 Data Sharing Pattern

Array A can be considered read-only data among the three categories of shared data described by this pattern, because it is not updated. Array A can be partitioned into subsets, each of which is accessed by only one of the tasks. Therefore array A falls into the effectively-local category.












The shared data, array C, is one of the special cases of read/write data. It is an accumulate case, because this array is used to accumulate the counts of the occurrences of each key. For this case, the pattern suggests that each task have a separate copy so that the accumulations occur in these local copies. The local copies are then accumulated into a single global copy as a final step at the end of the computation, as sketched below.
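A minimal MPI sketch of this accumulate case, assuming each process has already counted its own block of keys (the sample keys and array size are illustrative):

#include <stdio.h>
#include <mpi.h>

#define MAXKEY 9

int main(int argc, char **argv)
{
    int localC[MAXKEY + 1] = { 0 };
    int globalC[MAXKEY + 1];
    int rank, j;
    int myKeys[] = { 1, 3, 3, 7 };   /* stands in for this process's block */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (j = 0; j < 4; j++)
        localC[myKeys[j]]++;         /* accumulate into the local copy      */

    /* Combine all local copies into the single global copy. */
    MPI_Allreduce(localC, globalC, MAXKEY + 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global count of key 3: %d\n", globalC[3]);

    MPI_Finalize();
    return 0;
}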

What has been done up to this point is a parallelization of the second loop of the counting sort algorithm. The fourth for loop can be parallelized as follows: each task has the ranks of the keys in its own data set, so the keys can be sorted within each process. To complete the sort over these locally sorted keys, the keys are redistributed after finding the range of keys for each task, so that each task has approximately the same number of keys and the key ranges are in ascending order across tasks.

The tasks of sorting the redistributed keys then have a dependency on the locally sorted keys. But this dependency is read-only and effectively local, so the locally sorted keys can be redistributed among the processes without complication, according to the ranges, because they are already sorted. Each task can then merge the sorted keys in its own range without dependency, using the global counts.

Therefore, the final design of the parallel sort (the implementation of Kernel IS) is as follows. First, each task counts the occurrences of each key in its own subset of the data, accumulates the results in its own output array, and the local arrays are then reduced into one output array. Second, each task redistributes its keys to the processes according to the key range of each process and merges the redistributed keys.

5.1.6 Design Evaluation

Our target platform is Ethernet-connected Unix workstations. It is a distributed

memory system. This Design Evaluation pattern says that it usually makes it easier to












keep all the processing elements busy by having many more tasks than processing elements. But using more than one UE (unit of execution) per PE is not recommended by this pattern when the target system does not provide efficient support for multiple UEs per PE (Processing Element: a generic term for a hardware element in a parallel computer that executes a stream of instructions), or when the design cannot make good use of multiple UEs per PE. The latter is our case because of the reduction of the output arrays (if we have more local output arrays than the number of PEs, the time needed to reduce them to one output array becomes much longer). So our program sets the number of tasks to the number of workstations.

This pattern questions the simplicity, flexibility, and efficiency of the design. Our design is simple because each generated task finds the rank of each key in its own subset of the data (keys), and the local results are then reduced into global ranks. The next steps of sorting locally, redistributing keys, and merging the redistributed local key arrays are also simple because we already know the global ranks. The design is also flexible and efficient because the number of tasks is adjustable and the computational load is evenly balanced.

5.2 Algorithm Structure Design Space

5.2.1 Choose Structure

The Algorithm Structure decision tree of this pattern is used to arrive at an appropriate algorithm structure for our problem.

The major organizing principle of our design is organization by tasks, because we used the loop-splitting technique of the Task Decomposition pattern, so we arrived at the Organized-by-Tasks branch point. At that branch point, we take the partitioning branch because our ranking tasks can be gathered into a set linear in any number of dimensions.












There are dependencies between tasks, and they take the form of an associative accumulation into shared data structures. Therefore we arrived at the Separable Dependencies pattern.

5.2.2 Separable Dependencies

The set of tasks that will be executed concurrently in each processing unit corresponds to the iterations of the second and third for loops of Algorithm 4-1. The tasks are independent of each other because each task uses its own data. This pattern recommends balancing the load across the PEs. The size of each data set is equal because of the data distribution specification of Kernel IS, so the load is balanced at each PE, and the key distribution specification of Kernel IS is also satisfied. The fact that all the tasks are independent leads to the Embarrassingly Parallel pattern.

5.2.3 Embarrassingly Parallel

Each task of finding the ranks within a distinct section of the key array can be represented as a process. Because each task has almost the same number of keys, each task has the same amount of computation, so static scheduling of the tasks is effective, as advised by this pattern. For the correctness considerations of this pattern, the tasks solve their subproblems independently, solve each subproblem exactly once, correctly save the subsolutions, and correctly combine the subsolutions.

5.3 Using Implementation Example

The design of the parallel integer sort satisfies the conditions of the simplest form of parallel program as follows: all the tasks are independent; all the tasks must be completed; and each task executes the same algorithm on a distinct section of the data. Therefore the resulting design is the simplest form of the Embarrassingly Parallel pattern. The Implementation Example is provided in Chapter 3. Using the Implementation Example of the Embarrassingly Parallel pattern, the Kernel IS of the NAS Parallel Benchmark set can be













implemented. The basic idea of the Implementation Example of the Embarrassingly Parallel pattern is to scatter data to each process, compute subsolutions, and then combine the subsolutions into a solution for the problem. If we apply the Implementation Example twice, once for finding the rank of each key and once for sorting the keys in the local processes, Kernel IS can be implemented.

Another method of implementation is to use the Implementation Example of the Divide-and-Conquer pattern. That Implementation Example was not chosen, however, because it makes it hard to measure the elapsed time for ranking all the keys.

5.4 Algorithm for Parallel Implementation of Kernel IS

The following algorithm illustrates the parallel design for the implementation of Kernel IS (parallel sort over small integers) that has been obtained through the parallel design patterns.

1. Generate the sequence of N keys using key generation algorithm of NAS Parallel
Benchmarks.

2. Divide the keys by the number of PEs and distribute to each memory of PEs.

3. Begin timing.

4. Do, for i = 1 to Imax:

(a) Modify the sequence of keys by making the following two changes:

    K_i <- i

    K_{i+Imax} <- (Bmax - i)

(b) Each task (process) finds the rank of each key in its own data set (keys)
using the ranking algorithm of counting sort.

(c) Reduce the arrays of ranks over the local data into an array of ranks over the global
data.

(d) Do partial verification.











5. End timing.

6. Perform permutation of keys according to the ranks.

(a) Each task sorts local keys according to the ranks among its local keys.

(b) Compute the range of keys each process will hold, so that each process has nearly the same number of keys and the key ranges are in ascending order.

(c) Redistribute keys among processes according to the range of each process.

(d) Each task (process) merges its redistributed key arrays.

7. Perform full verification.



















CHAPTER 6
PERFORMANCE RESULTS AND DISCUSSIONS

6.1 Performance Expectation

An ideal execution time for a parallel program can be expressed as T/N, where T is the total time taken with one processing element and N is the number of processing elements used. Our implementation of Kernel IS of the NAS Parallel Benchmarks will not reach the ideal execution time because of overhead from several sources. One source of overhead is the computation needed to reduce the local arrays of ranks into one array of global ranks: the more local arrays of ranks and processing elements we use, the more computation time is needed, and the gap between the ideal execution time and the actual execution time increases. Another source of overhead is communication, because MPI uses message transfer, which typically involves both overhead due to kernel calls and latency due to the time it takes a message to travel over the network.
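For example, using the measured one-workstation time for class A from Table 6-1, T = 30940 ms, the ideal execution time on N = 4 processing elements is T/N = 30940/4 = 7735 ms, whereas the measured time is 10600 ms; the difference is the overhead described above.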

6.2 Performance Results

The Kernel IS implementation was executed on Ethernet-connected workstations using LAM 6.3.2, which is an implementation of MPI.14 These workstations are Sun Blade 100s with a 500-MHz UltraSPARC-IIe CPU, 256-KB L2 external cache, 256 MB of memory, and Ethernet/Fast Ethernet twisted-pair (10BASE-T and 100BASE-T) self-sensing networking. Table 6-1 shows the performance results for classes A and B. The rows show the total number of processing elements (workstations) used to











execute the Kernel IS implementation. The columns show the execution time of the Kernel IS implementation and the ideal execution time, in milliseconds, for classes A and B. The performance results of NPB2.2 (NAS Parallel Benchmarks 2.2; these are MPI-based source-code implementations written and distributed by NAS. They are intended to be run with little or no tuning, and they approximate the performance a typical user can expect to obtain for a portable parallel program. They supplement, rather than replace, NPB 1. The NAS solicits performance results from all sources) for classes A and B are shown for comparison purposes.

Table 6-1 Performance Results for Classes A and B (times in milliseconds)

Number of     Actual       Ideal        NPB2.2       Actual       Ideal        NPB2.2
Processing    Execution    Execution    Execution    Execution    Execution    Execution
Elements      Time,        Time,        Time,        Time,        Time,        Time,
              Class A      Class A      Class A      Class B      Class B      Class B
1             30940        30940        20830        147220       147220       N/A
2             16500        15470        25720        76100        73610        1446270
3             12160        10313        N/A          55640        49073        N/A
4             10600        7735         16870        46500        36805        103110
5             9200         6188         N/A          41150        29444        N/A
6             8150         5156         N/A          36550        24536        N/A
7             7600         4420         N/A          33350        21031        N/A
8             7080         3867         9720         30200        18402        42760
9             7470         3437         N/A          31790        16357        N/A
10            7040         3094         N/A          30860        14722        N/A
11            6820         2812         N/A          28740        13383        N/A
12            6750         2578         N/A          27900        12268        N/A
13            6690         2380         N/A          27580        11324        N/A
14            6320         2210         N/A          26630        10515        N/A
15            6000         2062         N/A          25560        9814         N/A
16            5780         1933         16680        25230        9201         62080


The execution times for class B when the number of processing elements is 1 and 2 are not shown in Figure 6-2, because the NPB2.2 execution time for class B is too large to fit in the figure. The reason for that long execution time is that NPB2.2 consumes much more












memory than the physical memory, which leads to many I/O operations between the hard disk and the main memory of the workstations.


(The figure plots execution time in milliseconds against the number of processing elements (1-16) for class A, comparing the actual execution time, the ideal execution time, and the execution time of NPB2.2.)

Figure 6-1 Execution Time Comparison for Class A

(The figure plots execution time in milliseconds against the number of processing elements (1-16) for class B, comparing the actual execution time, the ideal execution time, and the execution time of NPB2.2.)

Figure 6-2 Execution Time Comparison for Class B

6.3 Discussions

As seen from the previous graphs, the following conclusions can be made about the performance of the parallel implementation of Kernel IS of the NAS Parallel Benchmarks. The largest increase in performance is achieved by using one or two additional processing











elements, that is, when the total number of processing elements is 2 or 3, respectively; there are smaller improvements from using more processing elements. The more processing elements we used for the computation, the more overhead was created, and the gap between the ideal execution time and the actual execution time increased, as we expected. The overall performance is acceptable because it is better than the performance of NPB2.2.

















CHAPTER 7
RELATED WORK, CONCLUSIONS, AND FUTURE WORK

7.1 Related work

7.1.1 Aids for Parallel Programming

Considerable research has been done to ease the tasks of designing and implementing efficient parallel programs. The related work can be categorized into program skeletons, program frameworks, and design patterns.

The algorithmic skeleton was introduced by M. Cole as a part of his proposed parallel programming system (language) for parallel machines.16 He presented four independent algorithmic skeletons: "fixed degree divide and conquer," "iterative combination," "clustering," and "task queues." Each of these skeletons describes the structure of a particular style of algorithm in terms of an abstraction. These skeletons capture very high-level patterns and can be used as an overall program structure. The user of this proposed parallel programming system must choose one of these skeletons and describe the solution to a problem as an instance of the appropriate skeleton. Because of this restriction, these skeletons cannot be applied to every problem. These skeletons are similar to the patterns in the algorithm structure design space of the parallel pattern language, in the sense that both provide an overall program structure and algorithmic frameworks. But, in comparison with the parallel pattern language, these skeletons provide less guidance for an inexperienced parallel programmer on how to arrive at one of them.

Program frameworks, which can address overall program structure but are more detailed and domain specific, are widely used in many areas of computing.17 In the parallel












computing area, Parallel Object Oriented Methods and Applications (POOMA) and Parallel Object-oriented Environment and Toolkit (POET) are examples of frameworks from Los Alamos and Sandia, respectively. 18-19

The POOMA is an object-oriented framework for a data-parallel programming of scientific applications. It is a library of C++ classes designed to represent common abstractions in these applications. Application programmers can use and derive from these classes to express the fundamental scientific content and/or numerical methods of their problem. The objects are layered to represent a data-parallel programming interface at the highest abstraction layer whereas lower, implementation layers encapsulate distribution and communication of data among processors.

The POET is a framework for encapsulating a set of user-written components. The

POET framework is an object model, implemented in C++. Each user-written component must follow the POET template interface. The POET provides services, such as starting the parallel application, running components in a specified order, and distributing data among processors.

In recent years, Launay and Pazat developed a framework for parallel programming using Java.20 This framework provides a parallel programming model embedded in Java and is intended to separate computations from the control and synchronization between parallel tasks. Doug Lea provided design principles for building concurrent applications using the Java parallel programming facilities.21

Design patterns that can address many levels of design problems have been widely used with object-oriented sequential programming. Recent work by Douglas C. Schmidt et al.












addresses issues associated with concurrency and networking, but mostly at a fairly low level.22

Design patterns, frameworks, and skeletons share the same intent of easing parallel programming by providing libraries, classes, or design patterns. In comparison with the parallel pattern language, which provides systematic design patterns and implementation patterns for parallel program design and implementation, frameworks and skeletons might have more efficient implementations or better performance in their specialized problem domains. But the parallel pattern language could be more helpful to inexperienced parallel programmers in designing and solving more general application problems, because of its systematic structure for exploiting concurrency and because it provides frameworks and libraries down to the implementation level.

7.1.2 Parallel Sorting

Parallel sorting is one of the most widely studied problems because of its importance in a wide variety of practical applications, and various parallel sorting algorithms have been proposed in the literature. Guy E. Blelloch et al. analyzed and evaluated many of the parallel sorting algorithms proposed in the literature in order to implement as fast a general-purpose sorting algorithm as possible on the Connection Machine supercomputer model CM-2.23 After the evaluation and analysis, the researchers selected the three most promising alternatives for implementation: bitonic sort, which is a parallel merge sort; a parallel version of counting-based radix sort; and a theoretically efficient randomized algorithm, sample sort. According to their experiments, sample sort was the fastest on large data sets.

Andrea C. Dusseau et al. analyzed these three parallel sorting algorithms, plus column sort, using the LogP model, which characterizes the performance of modern parallel machines with a small set of parameters: communication latency, overhead, bandwidth, and the number of processors.24 They also compared the performance of Split-C implementations of the four sorting algorithms on the CM-5, a message-passing, distributed-memory, massively parallel machine. In their comparison, radix sort and sample sort were faster than the others on large data sets.

To understand the performance of parallel sorting on hardware cache-coherent shared address space (CC-SAS) multiprocessors, H. Shan et al. investigated the performance of two parallel sorting algorithms (radix sort and sample sort) under three major programming models (load-store CC-SAS, message passing, and the segmented SHMEM model) on a 64-processor SGI Origin2000, a scalable, hardware-supported, cache-coherent, nonuniform memory access machine.25 In their investigation, the researchers found that sample sort is generally better than radix sort up to 64K integers per processor, and that radix sort is better beyond that point. According to their investigation, the best combinations of algorithm and programming model are sample sort under CC-SAS for smaller data sets and radix sort under SHMEM for larger data sets.

A parallel sorting problem fundamentally requires communication as well as computation. Because of this characteristic and the importance of sorting in applications, parallel sorting has been selected as one of the kernel benchmarks for the performance evaluation of various parallel computing environments.26 Kernel IS of the NAS Parallel Benchmark set has been implemented on various parallel supercomputers by their vendors, and its performance has been reported.27-21

7.2 Conclusions

This thesis showed, as a case study, how the parallel design patterns were used to develop and implement a parallel algorithm for Kernel IS (parallel sort over large integers) of the NAS Parallel Benchmarks. It also presented reusable frameworks and examples for implementing the patterns in the algorithm structure design space of parallel pattern language.

Chapter 6 presented the performance results for Kernel IS. As the results show, the parallel design patterns help in developing a parallel program with relative ease and are also helpful in achieving reasonable performance improvements.

7.3 Future Work

Parallel pattern language is an ongoing project. Mapping the design patterns to various parallel computing environments and developing frameworks for object-oriented programming systems can be considered as future research. Future work can also include testing the resulting implementation of Kernel IS, built using parallel pattern language, on various supercomputers; extending and comparing this algorithm as a full sort rather than just a ranking; and conducting more case studies of parallel application programs developed with parallel pattern language.



















APPENDIX A
KERNEL IS OF THE NAS PARALLEL BENCHMARK

A.1 Brief Statement of Problem

Sort N keys in parallel. The keys are generated by the sequential key generation algorithm given below and initially must be uniformly distributed in memory. The initial distribution of the keys can have a great impact on the performance of this benchmark, and the required distribution is discussed in detail below.

A.2 Definitions

A sequence of keys, {K_i | i = 0, 1, ..., N-1}, will be said to be sorted if it is arranged in non-descending order, i.e., K_i <= K_{i+1} for i = 0, 1, ..., N-2. The rank of a particular key is the index value i that the key would have if the sequence of keys were sorted.
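As a small illustration of this definition (not part of the benchmark text), the Java sketch below computes one valid rank for every key directly: the rank of K_i is the position it would occupy if the sequence were sorted, with ties broken by original position.

// Illustrative only: compute ranks by sorting the index set by key value.
// Ties are broken by original position, which yields one valid (stable) ranking.
static int[] ranks(int[] keys) {
    Integer[] order = new Integer[keys.length];
    for (int i = 0; i < keys.length; i++) order[i] = i;
    java.util.Arrays.sort(order, (a, b) ->
            keys[a] != keys[b] ? Integer.compare(keys[a], keys[b]) : Integer.compare(a, b));
    int[] rank = new int[keys.length];
    for (int pos = 0; pos < order.length; pos++) {
        rank[order[pos]] = pos;            // key at index order[pos] sorts to position pos
    }
    return rank;
}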
A.3 Memory Mapping

The benchmark requires ranking an unsorted sequence of N keys. The initial sequence of keys will be generated in an unambiguous sequential manner as described below. This sequence must be mapped into the memory of the parallel processor in one of the following ways, depending on the type of memory system. In all cases, one key will map to one word of memory. Word size must be no less than 32 bits. Once the keys are loaded into the memory system, they are not to be removed or modified except as required by the procedure described in the Procedure subsection.

A.4 Shared Global Memory

All N keys initially must be stored in a contiguous address space. If A_i is used to denote the address of the ith word of memory, then the address space must be [A_i, A_{i+N-1}]. The sequence of keys, K_0, K_1, ..., K_{N-1}, initially must map to this address space as

A_{i+j} <- MEM(K_j)   for j = 0, 1, ..., N-1        (A.1)

where MEM(K_j) refers to the address of K_j.

A.5 Distributed Memory

In a distributed memory system with p distinct memory units, each memory unit initially must store N_p keys in a contiguous address space, where

N_p = N / p        (A.2)

If A_i is used to denote the address of the ith word in a memory unit, and if P_j is used to denote the jth memory unit, then P_j A_i will denote the address of the ith word in the jth memory unit. Some initial addressing (or "ordering") of the memory units must be assumed and adhered to throughout the benchmark. Note that the addressing of the memory units is left completely arbitrary. If N is not evenly divisible by p, then












memory units {P_j | j = 0, 1, ..., p-2} will store N_p keys, and memory unit P_{p-1} will store N_pp keys, where now

N_p = floor(N/p + 0.5)
N_pp = N - (p - 1) N_p

In some cases (in particular if p is large) this mapping may result in a poor initial load balance with N_pp >> N_p. In such cases it may be desirable to use p' memory units to store the keys, where p' < p. This is allowed, but the storage of the keys still must follow either equation 2.2 or equation 2.3 with p' replacing p. In the following we will assume N is evenly divisible by p. The address space in an individual memory unit must be [A_i, A_{i+N_p-1}]. If memory units are individually hierarchical, then N_p keys must be stored in a contiguous address space belonging to a single memory hierarchy, and A_i then denotes the address of the ith word in that hierarchy. The keys cannot be distributed among different memory hierarchies until after timing begins. The sequence of keys, K_0, K_1, ..., K_{N-1}, initially must map to this distributed memory as

P_k A_{i+j} <- MEM(K_{kN_p + j})   for j = 0, 1, ..., N_p - 1 and k = 0, 1, ..., p-1

where MEM(K_{kN_p + j}) refers to the address of K_{kN_p + j}. If N is not evenly divisible by p, then the mapping given above must be modified for the case where k = p-1 as

P_{p-1} A_{i+j} <- MEM(K_{(p-1)N_p + j})   for j = 0, 1, ..., N_pp - 1.        (A.3)
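To illustrate the initial mapping above, the following Java fragment is a minimal sketch (not part of the benchmark text) that places key K_{kN_p + j} at local offset j of memory unit P_k, assuming N is evenly divisible by p as in the equations above.

// Sketch of the initial distributed-memory mapping: key K[k*Np + j] is stored
// at local offset j of memory unit P_k. Assumes N is evenly divisible by p.
static int[][] distributeKeys(int[] keys, int p) {
    int np = keys.length / p;              // N_p = N / p  (equation A.2)
    int[][] unit = new int[p][np];         // unit[k][j] models P_k A_{i+j}
    for (int g = 0; g < keys.length; g++) {
        int k = g / np;                    // memory unit index
        int j = g % np;                    // offset within the unit
        unit[k][j] = keys[g];
    }
    return unit;
}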

A.6 Hierarchical Memory
All N keys initially must be stored in an address space belonging to a single memory hierarchy, which will here be referred to as the main memory. Note that any memory in the hierarchy which can store all N keys may be used for the initial storage of the keys, and the use of the term "main memory" in the description of this benchmark should not be confused with the more general definition of this term in section 2.2.1. The keys cannot be distributed among different memory hierarchies until after timing begins. The mapping of the keys to the main memory must follow either the shared global memory or the distributed memory mapping described above.

The benchmark requires computing the rank of each key in the sequence. The mappings described above define the initial ordering of the keys. For shared global and hierarchical memory systems, the same mapping must be applied to determine the correct ranking. For the case of a distributed memory system, it is permissible for the mapping of keys to memory at the end of the ranking to differ from the initial mapping only in the following manner: the number of keys mapped to a memory unit at the end of the ranking may differ from the initial value, N_p. It is expected, in a distributed memory machine, that good load balancing of the problem will require changing the initial mapping of the keys, and for this reason a different mapping may be used at the end of the ranking. If N_{p,k} is the number of keys in memory unit P_k at the end of the ranking, then the mapping which must be used to determine the correct ranking is given by

P_k A_{i+j} <- MEM(r(kN_{p,k} + j))   for j = 0, 1, ..., N_{p,k} - 1 and k = 0, 1, ..., p-1

where r(kN_{p,k} + j) refers to the rank of key K_{kN_{p,k} + j}. Note, however, that this does not imply that the keys, once loaded into memory, may be moved. Copies of the keys may be made and moved, but the original sequence must remain intact, such that each time the ranking process is repeated (Step 4 of the Procedure) the original sequence of keys exists (except for the two modifications of Step 4a) and the same ranking algorithm is applied.












Specifically, knowledge obtainable from the communications pattern carried out in the first ranking cannot be used to speed up subsequent rankings and each iteration of Step 4 should be completely independent of the previous iteration.

A.7 Key Generation Algorithm

The algorithm for generating the keys makes use of the pseudorandom number generator described in section 2.2. The keys will be in the range [0, B_max). Let r_f be a random fraction uniformly distributed in the range [0, 1], and let K_i be the ith key. The value of K_i is determined as

K_i <- floor( B_max (r_{4i} + r_{4i+1} + r_{4i+2} + r_{4i+3}) / 4 )   for i = 0, 1, ..., N-1.        (A.4)

Note that K_i must be an integer and floor(.) indicates truncation. Four consecutive pseudorandom numbers from the pseudorandom number generator must be used for generating each key. All operations before the truncation must be performed in 64-bit double precision. The random number generator must be initialized with s = 314159265 as a starting seed.
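A minimal sketch of the key generation step in Java is given below. It assumes a source of uniform random fractions r[0..4N-1] produced by the pseudorandom generator of section 2.2 (the generator itself is not reproduced here); each key averages four consecutive fractions, scales by B_max, and truncates, as in equation (A.4).

// Sketch of equation (A.4): K_i = floor( Bmax * (r_{4i}+r_{4i+1}+r_{4i+2}+r_{4i+3}) / 4 ).
// 'fractions' must hold 4*N doubles from the benchmark's pseudorandom generator
// (seed s = 314159265), computed in 64-bit double precision.
static int[] generateKeys(double[] fractions, int n, long bMax) {
    int[] keys = new int[n];
    for (int i = 0; i < n; i++) {
        double sum = fractions[4 * i] + fractions[4 * i + 1]
                   + fractions[4 * i + 2] + fractions[4 * i + 3];
        keys[i] = (int) (bMax * sum / 4.0);   // cast truncates, as the equation requires
    }
    return keys;
}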

A.8 Partial Verification Test

Partial verification is conducted for each ranking performed. Partial verification consists of comparing a particular subset of ranks with reference values. The subset of ranks and the reference values are given in Table A-1.

Note that the subset of ranks is selected to be invariant to the ranking algorithm (recall that stability is not required in the benchmark). This is accomplished by selecting for verification only the ranks of unique keys. If a key is unique in the sequence (i.e., there is no other equal key), then it will have a unique rank despite an unstable ranking algorithm. The memory mapping described in the Memory Mapping subsection must be applied.













Table A-1: Values to be used for partial verification

Rank (full)      Full scale       Rank (sample)    Sample code
r(2112377)       104 + i          r(48427)         0 + i
r(662041)        17523 + i        r(17148)         18 + i
r(5336171)       123928 + i       r(23627)         346 + i
r(3642833)       8288932 - i      r(62548)         64917 - i
r(4250760)       8388264 - i      r(4431)          65463 - i
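A partial-verification check can be sketched as below; the index/value pairs are those of the full-scale column of Table A-1, where each reference rank depends on the iteration number i of Step 4 (e.g., r(2112377) must equal 104 + i). The method name and structure are illustrative, not part of the benchmark text.

// Sketch of partial verification for the full-scale run.
// 'rank' is the computed rank array and 'iter' is the iteration number i of Step 4.
static boolean partialVerify(int[] rank, int iter) {
    int[] indices   = {2112377, 662041, 5336171, 3642833, 4250760};   // Table A-1
    int[] reference = {104 + iter, 17523 + iter, 123928 + iter,
                       8288932 - iter, 8388264 - iter};
    for (int t = 0; t < indices.length; t++) {
        if (rank[indices[t]] != reference[t]) {
            return false;                   // a reference rank does not match
        }
    }
    return true;
}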

A.9 Full Verification Test

Full verification is conducted after the last ranking is performed. Full verification requires the following:

1. Rearrange the sequence of keys, {K_i | i = 0, 1, ..., N-1}, in the order {K_j | j = r(0), r(1), ..., r(N-1)}, where r(0), r(1), ..., r(N-1) is the last computed sequence of ranks.

2. For every K_i, from i = 0, ..., N-2, test that K_i <= K_{i+1}.

If the result of this test is true, then the keys are in sorted order. The memory mapping described in the Memory Mapping subsection must be applied.
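The full-verification test can be sketched directly from the two steps above. This is an illustrative Java fragment, interpreting r(i) as the position key K_i would occupy in sorted order (the rank definition of section A.2), with 'keys' the unmodified key array and 'rank' the last computed rank array.

// Sketch of full verification: place each key at its computed rank,
// then check that the rearranged sequence is non-descending.
static boolean fullVerify(int[] keys, int[] rank) {
    int[] sorted = new int[keys.length];
    for (int i = 0; i < keys.length; i++) {
        sorted[rank[i]] = keys[i];          // step 1: K_i goes to position r(i)
    }
    for (int i = 0; i < sorted.length - 1; i++) {
        if (sorted[i] > sorted[i + 1]) {    // step 2: K_i <= K_{i+1} must hold
            return false;
        }
    }
    return true;
}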

A.10 Procedure

1. In a scalar sequential manner and using the key generation algorithm described above, generate the sequence of N keys.

2. Using the appropriate memory mapping described above, load the N keys into the memory system.

3. Begin timing.

4. Do, for i = 1 to I_max:

(a) Modify the sequence of keys by making the following two changes:

K_i <- i
K_{i + I_max} <- (B_max - i)












(b) Compute the rank of each key.
(c) Perform the partial verification test described above.
5. End timing.

6. Perform the full verification test described above.

Table A-2: Parameter values to be used for benchmark

Parameter    Class A        Class B
N            2^23           2^25
B_max        2^19           2^21
seed         314159265      314159265
I_max        10             10
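Putting the procedure together, a sketch of the timed part (Steps 3-5) might look as follows in Java. It reuses the counting-based ranking sketch from section 7.1.2 and the partial-verification sketch above; only the key modifications of Step 4a are spelled out, and the method itself is an illustrative assumption, not part of the benchmark text.

// Sketch of Steps 3-5 of the procedure, using the earlier sketches
// rankByCounting (ranking) and partialVerify (partial verification).
static void runBenchmark(int[] keys, int bMax, int iMax) {
    long start = System.currentTimeMillis();           // Step 3: begin timing
    for (int i = 1; i <= iMax; i++) {                   // Step 4
        keys[i] = i;                                    // Step 4a: K_i <- i
        keys[i + iMax] = bMax - i;                      //          K_{i+Imax} <- (Bmax - i)
        int[] rank = rankByCounting(keys, bMax);        // Step 4b: compute ranks
        if (!partialVerify(rank, i)) {                  // Step 4c: partial verification
            System.out.println("partial verification failed at iteration " + i);
        }
    }
    long elapsed = System.currentTimeMillis() - start;  // Step 5: end timing
    System.out.println("ranking time: " + elapsed + " ms");
}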

A.11 Specifications

The specifications given in Table A-2 shall be used in the benchmark. Two sets of values are given, one for Class A and one for Class B.




kim_e_Page_119thm.jpg
THUMB120 c75bf7c0917b213f481afdc02f5a7a73 25066
kim_e_Page_120thm.jpg
THUMB121 91b4043957022d0991aec85601bcb059 25346
kim_e_Page_121thm.jpg
THUMB122 ba50fb0c4322814aae7429168add7911 24585
kim_e_Page_122thm.jpg
THUMB123 84f8872c58fddd5967810767703e7980 22164
kim_e_Page_123thm.jpg
THUMB124 0619fdb1155c4d6c3a99775f4387ac43 18750
kim_e_Page_124thm.jpg
THUMB125 33a0560ca91408a4c774580f0a6335f6 21566
kim_e_Page_125thm.jpg
THUMB126 f480c21cc8eb2cdd94ab5d3a52535e9f 23378
kim_e_Page_126thm.jpg
THUMB127 f86f463f1356722172525d01b36ac2ed 24179
kim_e_Page_127thm.jpg
THUMB128 6b7d53e3d2d79da7d8a364fad4e5a173 23794
kim_e_Page_128thm.jpg
THUMB129 615b3bfd16474658adc1a6c5a2d23efc 22948
kim_e_Page_129thm.jpg
THUMB130 bf28c4ba45e9ee3aa5c44114eb236f4e 22847
kim_e_Page_130thm.jpg
THUMB131 01e6096ac217b24ef962b7ac0c8415b1 23128
kim_e_Page_131thm.jpg
THUMB132 64bfe814f0ecc8738fd6254b31c3c113 23957
kim_e_Page_132thm.jpg
THUMB133 5a2e5b89811dee9803cb398515432e69 18344
kim_e_Page_133thm.jpg
THUMB134 9530d8971f922b5266ee2587c3ef003e 27375
kim_e_Page_134thm.jpg
THUMB135 5135d017d2dd25a100a732e91bfab35a 28397
kim_e_Page_135thm.jpg
THUMB136 0ec70be2616b578becaeec83d2d4350d 24336
kim_e_Page_136thm.jpg
THUMB137 3e57fbdd50e3a2ca2df4aa85e0f2cae6 20229
kim_e_Page_137thm.jpg
TXT1 textplain 8f5c4a4f2f5788c62a7ecfd9a120e6e5 418
kim_e_Page_001.txt
TXT2 f89f6f7d73fb5a034290b0315701a8de 100
kim_e_Page_002.txt
TXT3 7a1b9d4fb2d4d1794dc2033fbcf68d85 105
kim_e_Page_003.txt
TXT4 937c2507cf9606f9953bb0f518e6af90 605
kim_e_Page_004.txt
TXT5 f9bcb71c2a7be748d40834458fa5417a 1972
kim_e_Page_005.txt
TXT6 c5797a294ce5a95fb7e27f0d6abbba90 3295
kim_e_Page_006.txt
TXT7 97b727bbf6990c5202c29a6ada7ab645 2696
kim_e_Page_007.txt
TXT8 1283939fa5ff01c35f0b86047790dbd7 320
kim_e_Page_008.txt
TXT9 ed92cf8915e42db54c5a91e9aa19a5b2 405
kim_e_Page_009.txt
TXT10 c5694e080bbea6e4f12d63f55148edf2 721
kim_e_Page_010.txt
TXT11 65ad39566690d07abdd64e11c36fc863 1200
kim_e_Page_011.txt
TXT12 d58734de909a917c267caf6c658a0fbc 1662
kim_e_Page_012.txt
TXT13 3b1d8be6e9a7431a0d4b35f09abec1a1 1872
kim_e_Page_013.txt
TXT14 13e2af3a7cb34243b26e1542e1fb4279 869
kim_e_Page_014.txt
TXT15 e607c8a2cfc3e54a8f94e751af7bd7bd 1515
kim_e_Page_015.txt
TXT16 bf49728da87a38ea7a46c155e7b65ef2 1533
kim_e_Page_016.txt
TXT17 7f4ff59c781a3bd0c30058b2c9795d90 1702
kim_e_Page_017.txt
TXT18 732318194a9aa0d6a7b302c25a1c96f6 1606
kim_e_Page_018.txt
TXT19 7de843b18e22811ea14128077dfe561c 1807
kim_e_Page_019.txt
TXT20 1d164506487ac972e09280457b0361c6 1738
kim_e_Page_020.txt
TXT21 9c1877e53c4ed8a77b15adcb42567804 1673
kim_e_Page_021.txt
TXT22 53c22ceb2be2ecb507c3c61ebff9a520 1687
kim_e_Page_022.txt
TXT23 4be78aa3787e92f5b0130d58598adacd 1645
kim_e_Page_023.txt
TXT24 544a27d7c658a7d13a56ef43a5303387 1387
kim_e_Page_024.txt
TXT25 b11ae57a5499a7386fc6cd868d03cd81 910
kim_e_Page_025.txt
TXT26 c722414bd39c8e1bc73dc985fbd0f107 1371
kim_e_Page_026.txt
TXT27 077564a6d8cd69c7363d9a4c87cdf658 1525
kim_e_Page_027.txt
TXT28 5bb0988043b42a98ea58199e87b6c2c3 1784
kim_e_Page_028.txt
TXT29 723bd652c844956964ba422f3266fd2a 1591
kim_e_Page_029.txt
TXT30 ad80025811d388441d9b10c569a86db2 1780
kim_e_Page_030.txt
TXT31 7aa13f384592d8fc302c1be160efc5a2 1542
kim_e_Page_031.txt
TXT32 0e791a5ce0bffc087c20e938e20324b3 1148
kim_e_Page_032.txt
TXT33 dfd9529212b05228ddd683d8212b8dd0 1470
kim_e_Page_033.txt
TXT34 cfec6e91e347f9c9ffd1a91480921057 1594
kim_e_Page_034.txt
TXT35 c4ed17efe07009fd5d216dd102b57e52 1553
kim_e_Page_035.txt
TXT36 1fc48fdfbcf5ad6a4159a3f2612d84d9 1633
kim_e_Page_036.txt
TXT37 9b25cf99645ff861911e766d47db9f12 1350
kim_e_Page_037.txt
TXT38 e4e79450cbaec1de224cc5b9a94e6c3e 1979
kim_e_Page_038.txt
TXT39 d449c2951467eb1762624441f060140e 1392
kim_e_Page_039.txt
TXT40 8c8a45a03aa59121962286da18302116 1729
kim_e_Page_040.txt
TXT41 699c026a3648d7c3f546ac3882f04eb0 883
kim_e_Page_041.txt
TXT42 3115509892a7ee033b473d9aafbb335d 827
kim_e_Page_042.txt
TXT43 8e7bcee68779fd06a67ace9a713465e2 1474
kim_e_Page_043.txt
TXT44 f354ad6d4961c04de9fc2b4b6a1c2620 1153
kim_e_Page_044.txt
TXT45 cf251c32a153939d1eafdad28fb707b4 1140
kim_e_Page_045.txt
TXT46 c69d61afd18beb66b556dba194b63733 1566
kim_e_Page_046.txt
TXT47 7f99b46d9e46c47c00c1bf0e9c60b7a9 1693
kim_e_Page_047.txt
TXT48 c2e36dc8eb33e683b656c2521d418fdf 1578
kim_e_Page_048.txt
TXT49 24af840462437eae80fdf131f39c21ec 1007
kim_e_Page_049.txt
TXT50 e6d209671674e3f4cff4fc8a3c7d76bb 2002
kim_e_Page_050.txt
TXT51 a2ed38c7519cbc224b50ee4b08551851 1181
kim_e_Page_051.txt
TXT52 9547a5af7fe74a0671b4882150e5855b 1342
kim_e_Page_052.txt
TXT53 c6afdf41496a88425f46ed03728a88b8 1709
kim_e_Page_053.txt
TXT54 8cd10699a2cc1cb3b3997ea2debfd955 1752
kim_e_Page_054.txt
TXT55 d4e79df316192689428d01a01e1bd857 1838
kim_e_Page_055.txt
TXT56 2ff5fac026c96e03944bc4e76a0415fe 1694
kim_e_Page_056.txt
TXT57 9083137b43d54b24ac83d82e41d0d1bf 1274
kim_e_Page_057.txt
TXT58 0871c5480dda4d640569b6fa8beee5af 1256
kim_e_Page_058.txt
TXT59 09dbc2c84ce82295b066006c68afffa3 2003
kim_e_Page_059.txt
TXT60 f24b63bc0256f3b5065267145198815c 1590
kim_e_Page_060.txt
TXT61 0c7d456b0184f69f34922abf92efe38c 926
kim_e_Page_061.txt
TXT62 5392190b6a8114caf57f839bdda49d68 1331
kim_e_Page_062.txt
TXT63 20fb9169fbb3bafe525d9a004c85dd65 1043
kim_e_Page_063.txt
TXT64 2bfe46afa3a84618bd3a9bb7bc42429a 990
kim_e_Page_064.txt
TXT65 b6908b006be699aa3d0b2f036f9ad006 1157
kim_e_Page_065.txt
TXT66 37bb9a05398f080e6a98be67c1ac8e14
kim_e_Page_066.txt
TXT67 bbd064bba9907693f42d741c056d5ca3 1004
kim_e_Page_067.txt
TXT68 64f9f73bfefc945437ba7c1099b9e9be 786
kim_e_Page_068.txt
TXT69 35040972332369eaffc59c515c70c8d6 1273
kim_e_Page_069.txt
TXT70 ab8091bacae6079550b3f81f5b44bccd 1268
kim_e_Page_070.txt
TXT71 d33c110bae3c8496ac7e8c76d0097ace 908
kim_e_Page_071.txt
TXT72 48df060342f38372b501db24e8fb5e18 1419
kim_e_Page_072.txt
TXT73 1e97e2467f779a48755e7b27dcbb5a06 1756
kim_e_Page_073.txt
TXT74 2405d96296db90ead139687ac253034a 1682
kim_e_Page_074.txt
TXT75 11d9cd2a2fc0bc8b47f2eb5ef5b1a326 1440
kim_e_Page_075.txt
TXT76 b3d42da88910f53f0c4fbe83a459c20a 486
kim_e_Page_076.txt
TXT77 99a8024f2585a9fad6149384f5f43906 1742
kim_e_Page_077.txt
TXT78 8e5157a77f7cbbd1f48f1dc45ea17187 1609
kim_e_Page_078.txt
TXT79 b7228215072bbb9d6e6f23d92c1b25cd 1679
kim_e_Page_079.txt
TXT80 6fa082b0492de31ccc33b930ad9609e9 1940
kim_e_Page_080.txt
TXT81 4ed78a21b4b8ad3c4c1e6a50425dfc77 1825
kim_e_Page_081.txt
TXT82 41120c24cea621d4a88366128fc39d04 1899
kim_e_Page_082.txt
TXT83 2867fde6cf8655a0344a07d93794d14b 1548
kim_e_Page_083.txt
TXT84 ea5629e6a2f1c101acc2ee4af1aeed4c 582
kim_e_Page_084.txt
TXT85 e19af8cc48b3812d3ef1d1602f3615ed 1621
kim_e_Page_085.txt
TXT86 dcf7cfb3eed66df076eff304a9dbc8ba 2522
kim_e_Page_086.txt
TXT87 859c01569b1e6da6b17e07d2d564739e 1499
kim_e_Page_087.txt
TXT88 92cc6e88bb0bfd7191b1d708358cfef8 513
kim_e_Page_088.txt
TXT89 011dd0e843fa3658a1b16d70292a0687
kim_e_Page_089.txt
TXT90 b68dfaecab6806628844755b4e1d794f 1744
kim_e_Page_090.txt
TXT91 8531c8f8c1ab5ec63e887a6c64e2543f 1865
kim_e_Page_091.txt
TXT92 986d7d4f7a16a35ed5c92b89450df9d6 2043
kim_e_Page_092.txt
TXT93 089cd59d47acf50253d857f0fbd5c272 1027
kim_e_Page_093.txt
TXT94 87dc69a8ab8e164ba13f03a27db96b66 1598
kim_e_Page_094.txt
TXT95 4de8164fb22c41289c463c9080f44ca5 1496
kim_e_Page_095.txt
TXT96 6519292e64af515efc3ca1420e6ea589 1554
kim_e_Page_096.txt
TXT97 626bb5e1c0faebd7d273594cba2b6ea6 1953
kim_e_Page_097.txt
TXT98 94442a99603f582b936ecbc93c76c9c1 1796
kim_e_Page_098.txt
TXT99 1c8ae79e4ec8d4d5c3df6a3e83fac695 1471
kim_e_Page_099.txt
TXT100 27cddea827598f89a8a6bd8581b295a0 678
kim_e_Page_100.txt
TXT101 f7a3d923ec640e1f0231bef62f3368fe 1174
kim_e_Page_101.txt
TXT102 3c04b1e28599935343b19f1064cc30d9 1097
kim_e_Page_102.txt
TXT103 1600fa242dbca06270159e9a960e2bfc
kim_e_Page_103.txt
TXT104 fc61a637f5a427f4472eef82e1cb75c7 246
kim_e_Page_104.txt
TXT105 53a96212a43f2e8acd25f03b59a7dcc7 933
kim_e_Page_105.txt
TXT106 55b3908d2273cdfd9fdb6222f6fdb45b 748
kim_e_Page_106.txt
TXT107 0c41640f0c79fcf47f3f01e7264317f1 1057
kim_e_Page_107.txt
TXT108 8ab6b37dccb2e769c5847fbca3581570 1046
kim_e_Page_108.txt
TXT109 63f50765fd0e5d79b27cdd1a73319558 1053
kim_e_Page_109.txt
TXT110 1875df64ec088ae01fbae0e76765a047 967
kim_e_Page_110.txt
TXT111 aa6392bfbe3a4b2e1c61917d5b88391c 976
kim_e_Page_111.txt
TXT112 9f6f33dd04feffe7510d1f88ec41c8f5 481
kim_e_Page_112.txt
TXT113 c493731956217f897ebf8400bae3fbc1 621
kim_e_Page_113.txt
TXT114 d6b1d09a906030737dd1096656ad97bb 1114
kim_e_Page_114.txt
TXT115 60d923e80c5c0bccc80d92bb6f131d02 895
kim_e_Page_115.txt
TXT116 6f9dbe6ecf10e417821514de7f8ba3dc 452
kim_e_Page_116.txt
TXT117 a03e1898b00533398abfd732c3c6dc72 740
kim_e_Page_117.txt
TXT118 2e4940e5482b1bcda2b7906d44fb9b77 1429
kim_e_Page_118.txt
TXT119 5d337af73cd15324bea1981cbdd88f43 1323
kim_e_Page_119.txt
TXT120 e49c2a4395bcd1dbd93265790309ef33 1016
kim_e_Page_120.txt
TXT121 5ad5ca9f7dc9b1708425c8200cad0bb9 1442
kim_e_Page_121.txt
TXT122 92cd700f226f055616a150b77b453607 1346
kim_e_Page_122.txt
TXT123 41e636bae98b9b7276cbf3d58efc125b 652
kim_e_Page_123.txt
TXT124 d3623c29ffa3631e37cfd7d4d7f92e29 185
kim_e_Page_124.txt
TXT125 2004363eeca4ad39783c546c0c2153da 561
kim_e_Page_125.txt
TXT126 0e6fdd14f3402f244e012f3eae0c4a6b 876
kim_e_Page_126.txt
TXT127 a8848b58055734b0c8344ab6bbf579c3
kim_e_Page_127.txt
TXT128 8dd358ecb9c9780f8fb6b6b9190bda1e
kim_e_Page_128.txt
TXT129 a1ab8e4124ffa25e870973347cc7e039 1014
kim_e_Page_129.txt
TXT130 fba9c957e2f26b2ac8528a9067fa897c 750
kim_e_Page_130.txt
TXT131 54faf7dbcde23e082f7b5a618e303a5d 859
kim_e_Page_131.txt
TXT132 538be67e4bcd1ecd17e9f0409521b15a 1039
kim_e_Page_132.txt
TXT133 86b3dd4756f00c15509d031b83422d08 82
kim_e_Page_133.txt
TXT134 02fee62f9aa7fcd604be604f0215ca8f 2085
kim_e_Page_134.txt
TXT135 6b098c11b4e8e208c57ea55825c139d7 2064
kim_e_Page_135.txt
TXT136 878f710680be80c382792aeeba029ec8 1240
kim_e_Page_136.txt
TXT137 ca153faf03a6592e9c5a23a1694f8b57 350
kim_e_Page_137.txt
PDF1 applicationpdf c9c312170a3b474f1656608e7329b65d 1928918
kim_e.pdf
METS2 unknownx-mets f59d1a339cc847508b5b82ac2c5ea70c 141457
UFE0000552_00001.mets
METS:structMap STRUCT1 physical
METS:div DMDID ADMID ORDER 0 main
PDIV1 1 Main
PAGE1 Page i
METS:fptr FILEID
PAGE2 ii 2
PAGE3 iii 3
PAGE4 iv 4
PAGE5 v 5
PAGE6 vi 6
PAGE7 vii 7
PAGE8 viii 8
PAGE9 ix 9
PAGE10 x 10
PAGE11 xi 11
PAGE12 12
PAGE13 13
PAGE14 14
PAGE15 15
PAGE16 16
PAGE17 17
PAGE18 18
PAGE19 19
PAGE20 20
PAGE21 21
PAGE22 22
PAGE23 23
PAGE24 24
PAGE25 25
PAGE26 26
PAGE27 27
PAGE28 28
PAGE29 29
PAGE30 30
PAGE31 31
PAGE32 32
PAGE33 33
PAGE34 34
PAGE35 35
PAGE36 36
PAGE37 37
PAGE38 38
PAGE39 39
PAGE40 40
PAGE41 41
PAGE42 42
PAGE43 43
PAGE44 44
PAGE45 45
PAGE46 46
PAGE47 47
PAGE48 48
PAGE49 49
PAGE50 50
PAGE51 51
PAGE52 52
PAGE53 53
PAGE54 54
PAGE55 55
PAGE56 56
PAGE57 57
PAGE58 58
PAGE59 59
PAGE60 60
PAGE61 61
PAGE62 62
PAGE63 63
PAGE64 64
PAGE65 65
PAGE66 66
PAGE67 67
PAGE68 68
PAGE69 69
PAGE70 70
PAGE71 71
PAGE72 72
PAGE73 73
PAGE74 74
PAGE75 75
PAGE76 76
PAGE77 77
PAGE78 78
PAGE79 79
PAGE80 80
PAGE81 81
PAGE82
PAGE83 83
PAGE84 84
PAGE85 85
PAGE86 86
PAGE87 87
PAGE88 88
PAGE89 89
PAGE90 90
PAGE91 91
PAGE92 92
PAGE93 93
PAGE94 94
PAGE95 95
PAGE96 96
PAGE97 97
PAGE98 98
PAGE99 99
PAGE100
PAGE101 101
PAGE102 102
PAGE103 103
PAGE104 104
PAGE105
PAGE106 106
PAGE107 107
PAGE108 108
PAGE109 109
PAGE110 110
PAGE111 111
PAGE112 112
PAGE113 113
PAGE114 114
PAGE115 115
PAGE116 116
PAGE117 117
PAGE118 118
PAGE119 119
PAGE120 120
PAGE121 121
PAGE122 122
PAGE123 123
PAGE124 124
PAGE125 125
PAGE126 126
PAGE127 127
PAGE128 128
PAGE129 129
PAGE130 130
PAGE131 131
PAGE132 132
PAGE133 133
PAGE134 134
PAGE135 135
PAGE136 136
PAGE137 137
STRUCT2 other
ODIV1
FILES1
FILES2



PAGE 1

IMPLEMENTATION PATTERNS FOR PARALLEL PROGRAM AND A CASE STUDY

By

EUNKEE KIM

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2002

Copyright 2002 by Eunkee Kim

Dedicated to Grace Kim, Jineun Song and parents

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my adviser, Dr. Beverly A. Sanders, for providing me with an opportunity to work in this exciting area and for providing feedback, guidance, support, and encouragement during the course of this research and my graduate academic career. I wish to thank Dr. Joseph N. Wilson and Dr. Stephen M. Thebaut for serving on my supervisory committee. Finally, I thank Dr. Berna L. Massingill for allowing me to use machines at Trinity University and for technical advice.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Parallel Computing
  1.2 Parallel Design Patterns
  1.3 Implementation Patterns for Design Patterns
  1.4 Implementation of Kernel IS of NAS Parallel Benchmark Set Using Parallel Design Patterns
  1.5 Organization of the Thesis

2 OVERVIEW OF PARALLEL PATTERN LANGUAGE
  2.1 Finding Concurrency Design Space
    2.1.1 Getting Started
    2.1.2 Decomposition Strategy
    2.1.3 Task Decomposition
    2.1.4 Data Decomposition
    2.1.5 Dependency Analysis
    2.1.6 Group Tasks
    2.1.7 Order Tasks
    2.1.8 Data Sharing
    2.1.9 Design Evaluation
  2.2 Algorithm Structure Design Space
    2.2.1 Choose Structure
    2.2.2 Asynchronous Composition
    2.2.3 Divide and Conquer
    2.2.4 Embarrassingly Parallel
    2.2.5 Geometric Decomposition
    2.2.6 Pipeline Processing
    2.2.7 Protected Dependencies
    2.2.8 Recursive Data
    2.2.9 Separable Dependency
  2.3 Supporting Structures Design Space
    2.3.1 Program-Structuring Group
      2.3.1.1 Single program and multiple data
      2.3.1.2 Fork join
      2.3.1.3 Master worker
      2.3.1.4 Spawn
    2.3.2 Shared Data Structures Group
      2.3.2.1 Shared queue
      2.3.2.2 Shared counter
      2.3.2.3 Distributed array
  2.4 Chapter Summary

3 PATTERNS FOR IMPLEMENTATION
  3.1 Using Message Passing Interface (MPI)
    3.1.1 Intent
    3.1.2 Applicability
    3.1.3 Implementation
      3.1.3.1 Simple message passing
      3.1.3.2 Group and communicator creation
      3.1.3.3 Data distribution and reduction
  3.2 Simplest Form of Embarrassingly Parallel
    3.2.1 Intent
    3.2.2 Applicability
    3.2.3 Implementation
    3.2.4 Implementation Example
    3.2.5 Example Usage
  3.3 Implementation of Embarrassingly Parallel
    3.3.1 Intent
    3.3.2 Applicability
    3.3.3 Implementation
    3.3.4 Implementation Example
    3.3.5 Usage Example
  3.4 Implementation of Pipeline Processing
    3.4.1 Intent
    3.4.2 Applicability
    3.4.3 Implementation
    3.4.4 Implementation Example
    3.4.5 Usage Example
  3.5 Implementation of Asynchronous-Composition
    3.5.1 Intent
    3.5.2 Applicability
    3.5.3 Implementation
    3.5.4 Implementation Example
  3.6 Implementation of Divide and Conquer
    3.6.1 Intent
    3.6.2 Motivation
    3.6.3 Applicability
    3.6.4 Implementation
    3.6.5 Implementation Example
    3.6.6 Usage Example

4 KERNEL IS OF NAS BENCHMARK
  4.1 Brief Statement of Problem
  4.2 Key Generation and Memory Mapping
  4.3 Procedure and Timing

5 PARALLEL PATTERNS USED TO IMPLEMENT KERNEL IS
  5.1 Finding Concurrency
    5.1.1 Getting Started
    5.1.2 Decomposition Strategy
    5.1.3 Task Decomposition
    5.1.4 Dependency Analysis
    5.1.5 Data Sharing Pattern
    5.1.6 Design Evaluation
  5.2 Algorithm Structure Design Space
    5.2.1 Choose Structure
    5.2.2 Separable Dependencies
    5.2.3 Embarrassingly Parallel
  5.3 Using Implementation Example
  5.4 Algorithm for Parallel Implementation of Kernel IS

6 PERFORMANCE RESULTS AND DISCUSSIONS
  6.1 Performance Expectation
  6.2 Performance Results
  6.3 Discussions

7 RELATED WORK AND CONCLUSIONS AND FUTURE WORK
  7.1 Related Work
    7.1.1 Aids for Parallel Programming
    7.1.2 Parallel Sorting
  7.2 Conclusions
  7.3 Future Work

APPENDIX

A KERNEL IS OF THE NAS PARALLEL BENCHMARK
B PSEUDORANDOM NUMBER GENERATOR
C SOURCE CODE OF THE KERNEL IS IMPLEMENTATION
D SOURCE CODE OF PIPELINE EXAMPLE
E SOURCE CODE OF DIVIDE AND CONQUER

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

4-1 Parameter Values to be used for Benchmark
6-1 Performance Results for Class A and B
A-1 Values to be used for Partial Verification
A-2 Parameter Values to be used for Benchmark

LIST OF FIGURES

3-1 The Algorithm Structure Decision Tree
3-2 Usage of Message Passing
3-3 Irregular Message Handling
3-4 Message Passing for Invocation of the Solve and Merge Functions
4-1 Counting Sort
6-1 Execution Time Comparison for Class A
6-2 Execution Time Comparison for Class B

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

IMPLEMENTATION PATTERNS FOR PARALLEL PROGRAM AND A CASE STUDY

By

Eunkee Kim

December 2002

Chairman: Beverly A. Sanders
Major Department: Computer and Information Science and Engineering

Design patterns for parallel programming guide the programmer through the entire process of developing a parallel program. In this thesis, implementation patterns for parallel programs are presented. These patterns help programmers implement a parallel program after designing it with the parallel design patterns. Parallel integer sorting, one of the kernels of the Parallel Benchmark set of the Numerical Aerodynamic Simulation (NAS) Program, has been designed and implemented using the parallel design patterns as a case study. How the parallel design patterns were used in implementing Kernel IS, and the resulting performance, are discussed.

CHAPTER 1
INTRODUCTION

1.1 Parallel Computing

Parallel computing is what a computer does when it carries out more than one computation at a time using many processors. An example of parallel computing (processing) in daily life is an automobile assembly line: at every station somebody is doing part of the work needed to complete the product. The purpose of parallel computing is to overcome the performance limit of a single processor. We can increase the performance of a sequential program by parallelizing the exploitable concurrency in the sequential program and using many processors at once. Parallel programming has been considered more difficult than sequential programming. To make it easier for programmers to write a parallel program, parallel design patterns have been developed by B. L. Massingill, T. G. Mattson, and B. A. Sanders [1-4].

1.2 Parallel Design Patterns

The parallel pattern language is a collection of design patterns for parallel programming [5, 6]. Design patterns are a high-level description of a solution to a problem [7]. The parallel pattern language is written down in a systematic way so that a final design for a parallel program can result from going through a sequence of appropriate patterns from a pattern catalog. The structure of the patterns is designed to parallelize even complex problems. The top-level patterns help to find the concurrency in a problem and decompose it into a collection of tasks. The second-level patterns help to find an appropriate algorithm structure to exploit the concurrency that has been identified. The parallel design patterns are described in more detail in Chapter 2.

1.3 Implementation Patterns for Design Patterns

Several implementation patterns for the design patterns in the algorithm structure design space were developed as a part of the parallel pattern language and are presented in this thesis. These implementation patterns are solutions to the problem of mapping high-level parallel algorithms into programs using the Message Passing Interface (MPI) and the programming language C [8]. The patterns of the algorithm structure design space capture recurring solutions to the problem of turning problems into parallel algorithms. These implementation patterns can be reused and provide guidance for programmers who might need to create their own implementations after using the parallel design patterns. The implementation patterns for the design patterns in the algorithm structure design space are in Chapter 3.

1.4 Implementation of Kernel IS of NAS Parallel Benchmark Set Using Parallel Design Patterns

The Numerical Aerodynamic Simulation (NAS) Program, which is based at NASA Ames Research Center, developed a parallel benchmark set for the performance evaluation of highly parallel supercomputers. The NAS Parallel Benchmark set is a paper-and-pencil benchmark [9]; all details of this benchmark set are specified only algorithmically. Kernel IS of the NAS Parallel Benchmark set is a parallel sort over large numbers of integers. A solution to Kernel IS is designed and implemented using the parallel pattern language as a case study. The patterns used are as follows. The Getting Started, Decomposition Strategy, Task Decomposition, Dependency Analysis, Data Sharing, and Design Evaluation patterns of the Finding Concurrency Design Space are used. The Choose Structure, Separable Dependency, and Embarrassingly Parallel patterns of the algorithm structure design space are also used. The details of how these patterns are used and the final design of the parallel program for Kernel IS are in Chapter 5.

1.5 Organization of the Thesis

The research that has been conducted as part of this thesis and its organization are as follows:

- Analysis of the parallel design patterns (Chapter 2)
- Implementation patterns for the patterns in the algorithm structure design space (Chapter 3)
- A description of Kernel IS of the NAS Parallel Benchmark set (Chapter 4)
- Design and implementation of Kernel IS through the parallel design patterns (Chapter 5)
- Performance results and discussions (Chapter 6)
- Conclusions and future work (Chapter 7)

CHAPTER 2
OVERVIEW OF PARALLEL PATTERN LANGUAGE

The parallel pattern language is a set of design patterns that guide the programmer through the entire process of developing a parallel program [10]. The patterns of the parallel pattern language, as developed by Massingill et al., are organized into four design spaces: the Finding Concurrency Design Space, the Algorithm Structure Design Space, the Supporting Structures Design Space, and the Implementation Design Space.

2.1 Finding Concurrency Design Space

The finding concurrency design space includes high-level patterns that help to find the concurrency in a problem and decompose it into a collection of tasks.

2.1.1 Getting Started

The Getting Started pattern helps to start designing a parallel program. Before using these patterns, the user needs to be sure that the problem is large enough (or needs to be sped up) and needs to understand the problem. The user of this pattern needs to do the following tasks:

- Decide which parts of the problem require the most intensive computation.
- Understand the tasks that need to be carried out and the data structures that are to be manipulated.

2.1.2 Decomposition Strategy

This pattern helps to decompose the problem into relatively independent entities that can execute concurrently. To expose the concurrency of the problem, the problem can be decomposed along two different dimensions:

- Task Decomposition: Break the stream of instructions into multiple chunks called tasks that can execute simultaneously.
- Data Decomposition: Decompose the problem's data into chunks that can be operated on relatively independently.

2.1.3 Task Decomposition

The Task Decomposition pattern addresses the issues raised during a primarily task-based decomposition. To do task decomposition, the user should try to look at the problem as a collection of distinct tasks. These tasks can be found in the following places:

- Function calls may correspond to tasks.
- Each iteration of a loop, if the iterations are independent, can be a task.
- Updates on individual data chunks decomposed from a large data structure.

The number of tasks generated should be flexible and large enough, and the tasks should have enough computation.

2.1.4 Data Decomposition

This pattern looks at the issues involved in decomposing data into units that can be updated concurrently. The first point to be considered is whether the data structure can be broken down into chunks that can be operated on concurrently. Array-based computations and recursive data structures are examples of this approach. The points to be considered in decomposing data are as follows (see the sketch after this list):

- Flexibility in the size and number of data chunks, to support the widest range of parallel systems
- Data chunks large enough to offset the overhead of managing dependencies
- Simplicity in the data decomposition
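As an illustration of these points (an addition to the original pattern text, not taken from it), the following minimal C/MPI sketch shows one common way to realize a block data decomposition: each process derives, from its rank, the bounds of the contiguous chunk of a global array that it owns. The size GLOBAL_N and all names are assumptions made only for this sketch.

  /* a minimal sketch of a block data decomposition (illustrative only) */
  #include <stdio.h>
  #include <mpi.h>

  #define GLOBAL_N 1000        /* size of the global data set (assumed) */

  int main(int argc, char **argv)
  {
      int rank, size, chunk, lo, hi;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      chunk = (GLOBAL_N + size - 1) / size;   /* ceiling division */
      lo = rank * chunk;                      /* first index owned by this process */
      hi = lo + chunk;                        /* one past the last owned index */
      if (lo > GLOBAL_N) lo = GLOBAL_N;       /* clamp when there are more processes than data */
      if (hi > GLOBAL_N) hi = GLOBAL_N;

      printf("process %d owns indices [%d, %d)\n", rank, lo, hi);
      /* the concurrent update of each chunk would be performed here */

      MPI_Finalize();
      return 0;
  }

A cyclic distribution (assigning index i to process i % size) is an equally valid choice when the work per element is irregular.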

2.1.5 Dependency Analysis

This pattern is applicable when the problem is decomposed into tasks that can be executed concurrently. The goal of a dependency analysis is to understand the dependencies among the tasks in detail. Data-sharing dependencies and ordering constraints are the two kinds of dependencies that should be analyzed. The dependencies should require little time to manage relative to the computation time, and errors in handling them should be easy to detect and fix. One effective way of analyzing dependencies is the following approach:

- Identify how the tasks should be grouped together.
- Identify any ordering constraints between groups of tasks.
- Analyze how tasks share data, both within and among groups.

These steps lead to the Group Tasks, Order Tasks, and Data Sharing patterns.

2.1.6 Group Tasks

This pattern constitutes the first step in analyzing dependencies among the tasks of a problem decomposition. The goal of this pattern is to group tasks based on their constraints. The three major categories of constraints among tasks are as follows:

- A temporal dependency: a constraint placed on the order in which a collection of tasks executes.
- A requirement that a collection of tasks must run at the same time.
- Tasks in a group are truly independent.

The three approaches to grouping tasks are as follows:

- Look at how the problem is decomposed. The tasks that correspond to a high-level operation naturally group together. If the tasks share constraints, keep them as a distinct group.
- If any other task groups share the same constraints, merge the groups together.
- Look at constraints between groups of tasks.

2.1.7 Order Tasks

This pattern constitutes the second step in analyzing dependencies among the tasks of a problem decomposition. The goal of this pattern is to identify ordering constraints among task groups. Two goals are to be met in defining this ordering:

- It must be restrictive enough to satisfy all the constraints, so that the resulting design is correct.
- It should not be more restrictive than it needs to be.

To identify ordering constraints, consider the following ways tasks can depend on each other:

- First look at the data required by a group of tasks before they can execute. Once these data have been identified, find the task group that created them, and you will have an ordering constraint.
- Consider whether external services can impose ordering constraints.
- It is equally important to note when an ordering constraint does not exist.

2.1.8 Data Sharing

This pattern constitutes the third step in analyzing dependencies among the tasks of a problem decomposition. The goal of this pattern is to identify what data are shared among groups of tasks and how to manage access to shared data in a way that is both correct and efficient. The following approach can be used to determine what data are shared and how to manage them:

- Identify data that are shared between tasks.
- Understand how the data will be used.

2.1.9 Design Evaluation

The goals of this pattern are to evaluate the design so far and to prepare for the next phase of the design space. This pattern therefore describes how to evaluate the design from three perspectives: suitability for the target platform, design quality, and preparation for the next phase of the design.

For suitability for the target platform, the following issues are included:

- How many processing elements are available?
- How are data structures shared among processing elements?
- What does the target architecture imply about the number of units of execution and how structures are shared among them?
- On the target platform, will the time spent doing useful work in a task be significantly greater than the time taken to deal with dependencies?

For design quality, simplicity, flexibility, and efficiency should be considered. To prepare for the next phase, the key issues are as follows:

- How regular are the tasks and their data dependencies?
- Are interactions between tasks (or task groups) synchronous or asynchronous?
- Are the tasks grouped in the best way?

2.2 Algorithm Structure Design Space

The algorithm structure design space contains patterns that help to find an appropriate algorithm structure to exploit the concurrency that has been identified.

2.2.1 Choose Structure

This pattern guides the algorithm designer to the most appropriate Algorithm-Structure patterns for the problem. Consideration of the Target Platform, the Major Organizing Principle, and the Algorithm-Structure Decision Tree are the main topics of this pattern. The two primary issues in considering the target platform are how many units of execution (threads or processes) the target system will effectively support and the way information is shared between units of execution. The three major organizing principles are organization by ordering, organization by tasks, and organization by data. The Algorithm-Structure Decision Tree is shown in Figure 3-1. We can select an algorithm structure using this decision tree.

Figure 3-1. The Algorithm Structure Decision Tree

2.2.2 Asynchronous Composition

This pattern describes what may be the most loosely structured type of concurrent program, in which semi-independent tasks interact through asynchronous events. Two examples are discrete-event simulation and event-driven programs. The key issues in this pattern are how to define the tasks/entities, how to represent their interaction, and how to schedule the tasks.

2.2.3 Divide and Conquer

This pattern can be used for parallel application programs based on the well-known divide-and-conquer strategy. It is particularly effective when the amount of work required to solve the base case is large compared to the amount of work required for the recursive splits and merges. The key elements of this pattern are as follows (a sketch of these elements follows below):

- Definitions of the functions such as "solve," "split," "merge," "baseCase," and "baseSolve"
- A way of scheduling the tasks that exploits the available concurrency

This pattern also includes correctness issues and efficiency issues.
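To make the roles of these functions concrete, here is a minimal sequential skeleton of the pattern (an illustration added here, not taken from the original text); in a parallel version, the two recursive calls would be scheduled on different units of execution. The threshold value and the summation problem are assumptions chosen only for the sketch.

  /* sequential skeleton of divide and conquer: summing an array */
  #include <stdio.h>

  #define THRESHOLD 4                     /* base-case size (assumed) */

  long baseSolve(int *a, int n)           /* solve the base case directly */
  {
      long s = 0;
      int i;
      for (i = 0; i < n; i++)
          s += a[i];
      return s;
  }

  long solve(int *a, int n)
  {
      long left, right;
      int half;

      if (n <= THRESHOLD)                 /* baseCase test */
          return baseSolve(a, n);

      half = n / 2;                       /* split into two subproblems */
      left = solve(a, half);              /* these two calls are the candidates */
      right = solve(a + half, n - half);  /* for concurrent execution */
      return left + right;                /* merge the partial solutions */
  }

  int main(void)
  {
      int a[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
      printf("%ld\n", solve(a, 10));
      return 0;
  }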

2.2.4 Embarrassingly Parallel

This pattern can be used to describe concurrent execution by a collection of independent tasks and to show how to organize such a collection so that it executes efficiently. The key element of this pattern is a mechanism to define a set of tasks and schedule their execution, together with a mechanism to detect completion of the tasks and terminate the computation. This pattern also includes correctness and efficiency issues.

2.2.5 Geometric Decomposition

This pattern can be used when the concurrency is based on parallel updates of chunks of a decomposed data structure, and the update of each chunk requires data from other chunks. Implementations of this pattern include the following key elements (see the sketch after this list):

- A way of partitioning the global data structure into substructures or "chunks"
- A way of ensuring that each task has access to all the data needed to perform the update operation for its chunk, including data in chunks corresponding to other tasks
- A definition of the update operation, whether by points or by chunks
- A way of assigning the chunks among units of execution (distribution), that is, a way of scheduling the corresponding tasks
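The following hedged sketch (added for illustration; the array layout, sizes, and names are assumptions) shows these elements for a one-dimensional block decomposition in MPI: each process keeps one ghost cell on each side of its chunk and fills the ghost cells by exchanging boundary values with its neighbors before performing the update.

  /* one-dimensional geometric decomposition with ghost-cell exchange (sketch) */
  #include <mpi.h>

  #define LOCAL_N 4                        /* interior points owned by each process (assumed) */

  int main(int argc, char **argv)
  {
      int rank, size, left, right, i;
      double u[LOCAL_N + 2];               /* u[0] and u[LOCAL_N + 1] are ghost cells */

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;   /* MPI_PROC_NULL turns  */
      right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;   /* edge exchanges into no-ops */

      for (i = 1; i <= LOCAL_N; i++)
          u[i] = rank;                     /* fill the interior with some data */

      /* send the rightmost interior point to the right neighbor,
         receive the left ghost cell from the left neighbor */
      MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 0,
                   &u[0], 1, MPI_DOUBLE, left, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      /* and the same in the other direction */
      MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 1,
                   &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 1,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      /* each process can now update its interior points using the ghost cells */

      MPI_Finalize();
      return 0;
  }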

2.2.6 Pipeline Processing

This pattern is for algorithms in which data flow through a sequence of tasks or stages. It is applicable when the problem consists of performing a sequence of calculations, each of which can be broken down into distinct stages, on a sequence of inputs. For each input the calculations must be done in order, but it is possible to overlap the computation of different stages for different inputs. The key elements of this pattern are as follows (a sketch follows this list):

- A way of defining the elements of the pipeline, where each element corresponds to one of the functions that makes up the computation
- A way of representing the dataflow among pipeline elements, i.e., how the functions are composed
- A way of scheduling the tasks
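As an added illustration (not from the original text), the sketch below maps each pipeline stage to one MPI process: every stage receives an item from the previous rank, applies a placeholder computation, and forwards the result to the next rank. The number of items and the trivial stage function are assumptions.

  /* a minimal MPI pipeline: one process per stage (sketch) */
  #include <stdio.h>
  #include <mpi.h>

  #define N_ITEMS 8                       /* number of inputs flowing through the pipeline (assumed) */

  int main(int argc, char **argv)
  {
      int rank, size, i, item;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      for (i = 0; i < N_ITEMS; i++) {
          if (rank == 0)
              item = i;                   /* the first stage generates the inputs */
          else
              MPI_Recv(&item, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);

          item = item + 1;                /* placeholder for this stage's computation */

          if (rank < size - 1)
              MPI_Send(&item, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
          else
              printf("result %d\n", item);  /* the last stage consumes the output */
      }

      MPI_Finalize();
      return 0;
  }

Once the pipeline is full, the stages work on different inputs at the same time, which is exactly the overlap described above.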

2.2.7 Protected Dependencies

This pattern can be used for task-based decompositions in which the dependencies between tasks cannot be separated from the execution of the tasks, and the dependencies must be dealt with in the body of the task. The issues of this pattern are as follows:

- A mechanism to define a set of tasks and schedule their execution onto a set of units of execution
- Safe access to shared data
- Shared memory available when it is needed

2.2.8 Recursive Data

This pattern can be used for parallel applications in which an apparently sequential operation on a recursive data structure is reworked to make it possible to operate on all elements of the data structure concurrently. The issues of this pattern are as follows:

- A definition of the recursive data structure, plus what data are needed for each element of the structure
- A definition of the concurrent operations to be performed
- A definition of how these concurrent operations are composed to solve the entire problem
- A way of managing shared data
- A way of scheduling the tasks onto units of execution
- A way of testing the termination condition, if the top-level structure involves a loop

2.2.9 Separable Dependency

This pattern is used for task-based decompositions in which the dependencies between tasks can be eliminated as follows: necessary global data are replicated, and (partial) results are stored in local data structures. Global results are then obtained by reducing the results from the individual tasks. The key elements of this pattern are as follows (a sketch follows this list):

- Defining the tasks and scheduling their execution
- Defining and updating a local data structure
- Combining (reducing) local objects into a single object
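A hedged sketch of these elements in C/MPI follows (an addition for illustration; the summation task and all names are assumptions): each process updates a local partial result independently, and the global result is then obtained with a reduction.

  /* separable dependency: local partial results combined by a reduction (sketch) */
  #include <stdio.h>
  #include <mpi.h>

  #define N 1000                          /* number of independent tasks (assumed) */

  int main(int argc, char **argv)
  {
      int rank, size, i;
      long localSum = 0, globalSum = 0;   /* local data structure and reduced result */

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* each process executes the tasks assigned to it (cyclic schedule) */
      for (i = rank; i < N; i += size)
          localSum += i;

      /* combine (reduce) the local objects into a single object on process 0 */
      MPI_Reduce(&localSum, &globalSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0)
          printf("global result = %ld\n", globalSum);

      MPI_Finalize();
      return 0;
  }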

2.3 Supporting Structures Design Space

The patterns at this level represent an intermediate stage between the problem-oriented patterns of the algorithm structure design space and the machine-oriented patterns of the Implementation-Mechanism Design Space. Patterns in this space fall into two main groups: the Program-Structuring Group and the Shared Data Structures Group.

2.3.1 Program-Structuring Group

Patterns in this group deal with constructs for structuring the source code.

2.3.1.1 Single program and multiple data

The computation consists of N units of execution (UEs) in parallel. All N UEs (a generic term for a concurrently executable entity, usually either a process or a thread) execute the same program code, but each operates on its own set of data. A key feature of the program code is a parameter that differentiates among the copies.

2.3.1.2 Fork join

A main process or thread forks off some number of other processes or threads that then continue in parallel to accomplish some portion of the overall work before rejoining the main process or thread.

2.3.1.3 Master worker

A master process or thread sets up a pool of worker processes or threads and a task queue. The workers execute concurrently, with each worker repeatedly removing a task from the task queue and processing it, until all tasks have been processed or some other termination condition has been reached. In some implementations, no explicit master is present.

2.3.1.4 Spawn

A new process or thread is created, which then executes independently of its creator. This pattern bears somewhat the same relation to the others as GOTO bears to the constructs of structured programming.

2.3.2 Shared Data Structures Group

Patterns in this group describe commonly used data structures.

2.3.2.1 Shared queue

This pattern represents a "thread-safe" implementation of the familiar queue abstract data type (ADT), that is, an implementation of the queue ADT that maintains the correct semantics even when used by concurrently executing units of execution. A minimal sketch follows.
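For illustration only (not part of the original text), a minimal mutex-protected queue in C might look like the sketch below; it assumes POSIX threads are available, and error handling is omitted.

  /* a minimal "thread-safe" shared queue protected by one mutex (sketch) */
  #include <pthread.h>
  #include <stdlib.h>

  typedef struct node {
      int value;
      struct node *next;
  } node_t;

  typedef struct {
      node_t *head, *tail;
      pthread_mutex_t lock;
  } shared_queue_t;

  void queue_init(shared_queue_t *q)
  {
      q->head = q->tail = NULL;
      pthread_mutex_init(&q->lock, NULL);
  }

  void queue_put(shared_queue_t *q, int value)
  {
      node_t *n = malloc(sizeof(node_t));
      n->value = value;
      n->next = NULL;
      pthread_mutex_lock(&q->lock);       /* only one thread updates the links at a time */
      if (q->tail != NULL)
          q->tail->next = n;
      else
          q->head = n;
      q->tail = n;
      pthread_mutex_unlock(&q->lock);
  }

  /* returns 1 and stores a value if the queue was non-empty, 0 otherwise */
  int queue_get(shared_queue_t *q, int *value)
  {
      int ok = 0;
      node_t *n;
      pthread_mutex_lock(&q->lock);
      if (q->head != NULL) {
          n = q->head;
          *value = n->value;
          q->head = n->next;
          if (q->head == NULL)
              q->tail = NULL;
          free(n);
          ok = 1;
      }
      pthread_mutex_unlock(&q->lock);
      return ok;
  }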

2.3.2.2 Shared counter

This pattern, as with the Shared Queue pattern, represents a "thread-safe" implementation of a familiar abstract data type, in this case a counter with an integer value and increment and decrement operations.

2.3.2.3 Distributed array

This pattern represents a class of data structures often found in parallel scientific computing, namely, arrays of one or more dimensions that have been decomposed into subarrays and distributed among processes or threads.

2.4 Chapter Summary

This chapter gave an overview of the parallel pattern language, which guides programmers from the design process of an application program to the implementation point. The following chapters illustrate how these design patterns have been used to design and implement Kernel IS of the NAS Parallel Benchmark.

CHAPTER 3
PATTERNS FOR IMPLEMENTATION

In this chapter, we introduce patterns for the implementation of the patterns in the algorithm structure design space. Each pattern contains an Implementation Example for the corresponding pattern in the algorithm structure design space. The Implementation Examples are implemented in an MPI and C environment. MPI is the "Message Passing Interface" standard for distributed-memory environments. The Implementation Examples may occasionally need modification, but they are reusable and helpful for implementing a parallel program designed using the parallel pattern language.

3.1 Using Message Passing Interface (MPI)

3.1.1 Intent

This pattern is an introduction to the Message Passing Interface (MPI).

3.1.2 Applicability

This pattern is applicable when the user of the parallel pattern language has finished the design of a parallel program and considers implementing it using MPI.

3.1.3 Implementation

MPI is a library of functions (in C) or subroutines (in Fortran) that the user can insert into source code to perform data communication between processes. The primary goals of MPI are to provide source code portability and to allow efficient implementations across a range of architectures.
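Before the fuller examples that follow, a hedged sketch of the smallest possible MPI program may help show what inserting these library calls into source code looks like; the printed message is, of course, only an illustration.

  /* the minimal structure shared by every MPI program (sketch) */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);                  /* start MPI */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I? */
      MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes are there? */

      printf("process %d of %d\n", rank, size);

      MPI_Finalize();                          /* shut MPI down */
      return 0;
  }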

3.1.3.1 Simple message passing

Message passing programs consist of multiple instances of a serial program that communicate by library calls. The elementary communication operation in MPI is "point-to-point" communication, that is, direct communication between two processes, one of which sends and the other receives. This message passing is the basic mechanism for exchanging data among the processes of a parallel program using MPI. It is also a good way to order the execution of the tasks belonging to a group and to order the execution of groups of tasks. To communicate among processes, communicators are needed. The default communicator of MPI is MPI_COMM_WORLD, which includes all processes. The common implementation of MPI executes the same program concurrently on each processing element (CPU or workstation). The following program is a very simple example of using the MPI_Send (send) and MPI_Recv (receive) functions.

  /* simple send and receive */
  /* Each process has the same copy of the following program and executes it */
  #include <stdio.h>
  #include <mpi.h>

  #define BUF_SIZE 100

  int tag = 1;

  int main(int argc, char **argv)
  {
      int myrank;              /* rank (identifier) of each process */
      MPI_Status status;       /* status object */
      double a[BUF_SIZE];      /* send (receive) buffer */

      MPI_Init(&argc, &argv);                  /* Initialize MPI */
      MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* Get rank of process */

      if (myrank == 0) {            /* code for process 0 */
          /* send message */
          MPI_Send(
              a,                    /* initial address of send buffer */
              BUF_SIZE,             /* number of elements in send buffer */
              MPI_DOUBLE,           /* datatype of each send buffer element */
              1,                    /* rank of destination */
              tag,                  /* message tag */
              MPI_COMM_WORLD        /* communicator */
          );
      } else if (myrank == 1) {     /* code for process 1 */
          /* receive message */
          MPI_Recv(
              a,                    /* initial address of receive buffer */
              BUF_SIZE,             /* number of elements in receive buffer */
              MPI_DOUBLE,           /* datatype of each receive buffer element */
              0,                    /* rank of source */
              tag,                  /* message tag */
              MPI_COMM_WORLD,       /* communicator */
              &status               /* status object */
          );
      }
      /* more else-if statements can be added for more processes */
      /* a switch statement can be used instead of if-else statements */

      MPI_Finalize();               /* Terminate MPI */
      return 0;
  }
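As a usage note added here (the file name and command names are assumptions that depend on the MPI installation, e.g., MPICH or a similar implementation), such a program is typically compiled with the MPI compiler wrapper and started on two processes with a launcher:

  mpicc -o send_recv send_recv.c
  mpirun -np 2 ./send_recv

With two processes, process 0 sends the buffer and process 1 receives it; running with only one process would leave the send unmatched.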

PAGE 29

18 To create a communicator, a group of processes is needed. MPI does not provide mechanism to build a group from scratch, but only from other, previously defined groups. The base group, upon which all other groups can be created, is the group associated with the initial communicator MPI_COMM_WORLD. A group of processes can be created by the five steps as follows: 1. Decide the processing units that will be included in the groups 2. Decide the base groups to use in creating new group. 3. Create group. 4. Create communicator. 5. Do the work. 6. Free communicator and group. The first step is decide how many and which processing units will be included in this group. The points to be considered are the number of available processing units, the number of tasks in one group, the number of groups that can be executed concurrently, etc., according to the design of the program. If only one group of tasks can executed because of dependency among groups than the group can use all available processing units and may include them in that group. If there are several groups that can be executed concurrently, then the available processing units may be divided according to the number of groups that can be executed concurrently and the number of tasks in each group. Since MPI does not provide a mechanism to build a group from scratch, but only from other, previously defined groups, the group constructors are used to subset and superset existing groups. The base group, upon which all other groups can be defined, is the group

PAGE 30

19 associated with the initial communicator MPI_COMM_WORLD, and processes in this group are all the processes available when MPI is initialized. There are seven useful group constructors in MPI. Using these constructors, groups can be constructed as designed. The first three constructors are similar to the union and intersection of set operation in mathematics. The seven group constructors are as follows: The MPI_GROUP_UNION creates a group which contains all the elements of the two groups used. The MPI_GROUP_INTERSECTION creates a group which contains all the elements that belong in both groups at the same time. The MPI_GROUP_DIFFERENCE creates a group which has all the elements that do not belong in both groups at the same time. The MPI_GROUP_INCL routine creates a new group that consists of the specified processes in the array from the old group. The MPI_GROUP_EXCL routine creates a new group that consists of the processes not specified in the array from the old group. The MPI_GROUP_RANGE_INCL routine includes all processes between one specified process to another process, and the specified processes themselves. The MPI_GROUP_RANGE_EXCL routine excludes all processes not between the specified processes, and also excludes the specified processes themselves. For the creation of communicators, there are two types of communicators in MPI: intra-communicator and inter-communicator. The important intra-communicator constructors in MPI are as follows: The MPI_COMM_DUP function will duplicate the existing communicator. The MPI_COMM_CREATE function creates a new communicator for a group. For the communication between two groups, the inter-communicator for the identified two groups can be created by using the communicator of each group and peer-


20 communicator. The routine for this purpose is MPI_INTERCOMM_CREATE. The peercommunicator must have at least one selected member process from each group. Using duplicated MPI_COMM_WORLD as a dedicated peer communicator is recommended. The Implementation Example Code is as follows: /******************************************/ /* An Example of creating intra-communicator */ /******************************************/ #include main(int argc, char **argv) { int myRank, count, count2; int *sendBuffer, *receiveBuffer, *sendBuffer2, *receiveBuffer2; MPI_Group MPI_GROUP_WORLD, grprem; MPI_Comm commSlave; static int ranks[] = { 0 }; /* ... */ MPI_Init(&argc, &argv); MPI_Comm_group(MPI_COMM_WORLD, &MPI_GROUP_WORLD); MPI_Comm_rank(MPI_COMM_WORLD, &myRank); /* Build group for slave processes */ MPI_Group_excl( MPI_GROUP_WORLD, /* group */ 1, /* number of elements in array ranks */ ranks, /* array of integer ranks in group not to appear in new group */ &grprem); /* new group derived from above */ /* Build communicator for slave processes */ MPI_Comm_create( MPI_COMM_WORLD, /* communicator */ grprem, /* Group, which is a subset of the group of above communicator */ &commSlave); /* new communicator */ if(myRank !=0) { /* compute on processes other than root process */ /* ... */ MPI_Reduce(sendBuffer, receive Buffer, count, MPI_INT, MPI_SUM, 1, commSlave);


21 /* ... */ } /* Rank zero process falls through immediately to this reduce, others do later */ MPI_Reduce(sendBuffer2, receiveBuffer2, count2, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); MPI_Comm_free(&commSlave); MPI_Group_free(&MPI_GROUP_WORLD); MPI_Group_free(&grprem); MPI_Finalize(); } /******************************************/ /* An example of creating inter-communicator */ /******************************************/ #include main(int argc, char **argv) { MPI_Comm myComm; /* intra-communicator of local sub-group */ MPI_Comm myInterComm; /* inter-communicator between two group */ int membershipKey; int rank; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* User generate membership Key for each group */ membershipKey = rank % 3; /* Build intra-communicato r for local sub-group */ MPI_Comm_split( MPI_COMM_WORLD, /* communicator */ membershipKey, /* Each subgroup contains all processes of the same membershipKey */ rank, /* Within each subgroup, the processes are ranked in the order defined by this rank value*/ &myComm /* new communicator */ ); /* Build inter-communicators. Tags are hard-coded */ if (membershipKey == 0) {


22 MPI_Intercomm_create( myComm, /* local intra-communicator */ 0, /* rank of local group leader */ MPI_COMM_WORLD, /* "peer" intra-communicator */ 1, /* rank of remote group leader in the "peer" communicator */ 1, /* tag */ &myInterComm /* new inter-communicator */ ); } else { MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD,0,1, &myInterComm); } /* do work in parallel */ if(membershipKey == 0) MPI_Comm_free(&myInterComm); MPI_Finalize(); } 3.1.3.3 Data distribution and reduction There are several primitive functions for distribution of data in MPI. The three basic functions are as follows: The MPI_BROADCAST function replicates data in one root process to all other processes so that every process has the same data. This function can be used for a program which needs replication of data in each process to improve performance. The MPI_SCATTER function of MPI can scatter data from one processing unit to every other processing unit with the same amount (or variant amounts) of data, and each processing unit will have a diffe rent portion of the original data. The Gather function of MPI is the inverse of Scatter. For the reduction of data distributed among processes, the MPI provides two primitive MPI_REDUCE and MPI_ALL_REDUCE functions. These reduce functions are useful


when it is necessary to combine locally computed subsolutions into one global solution. The MPI_REDUCE function combines the elements provided in the input buffer of each process in the group, using the operation specified as a parameter, and returns the combined value in the output buffer of the process with the rank of root. The MPI_ALL_REDUCE function combines the data in the same way as MPI_REDUCE, but every process receives the combined value in its output buffer. This function is beneficial when the combined solution can be used as input for the next phase of the computation. No separate Implementation Example Code is provided in this section; the Implementation Example Code of 3.2.4 can be used instead, and the source code of the Kernel IS implementation is also a good example.
3.2 Simplest Form of Embarrassingly Parallel
3.2.1 Intent
This is a solution to the problem of how to implement the simplest form of the Embarrassingly Parallel pattern in the MPI and C environment.
3.2.2 Applicability
This pattern can be used after the program has been designed using patterns of the finding concurrency design space and patterns of the algorithm structure design space, and the resulting design is the simplest form of the Embarrassingly Parallel pattern. The simplest form of the Embarrassingly Parallel pattern satisfies the following conditions:
All the tasks are independent.
All the tasks must be completed.
Each task executes the same algorithm on a distinct section of data.


3.2.3 Implementation
The common MPI implementations have a static process allocation at initialization time of the program, and the same program is executed at each processing unit (or workstation). Since all the tasks execute the same algorithm on different data, the tasks can be implemented as one high-level function that is executed in each process. The simplest form of the Embarrassingly Parallel pattern can be implemented by the following five steps:
1. Find out the number of available processing units.
2. Distribute or replicate data to each process.
3. Execute the tasks.
4. Synchronize all processes.
5. Combine (reduce) local results into a single result.
In MPI, the number of available processing units can be found by calling the MPI_COMM_SIZE function and passing the MPI_COMM_WORLD communicator as a parameter. In most cases, after finding out the number of available processing units, the data can be divided by that number and distributed to each processing unit. There are several primitive functions for distribution of data in MPI; the three basic ones are Broadcast, Scatter, and Gather, which are introduced in 3.1.3.3. After computing the local result at each process, synchronize all processes to check that all local results have been computed. This can be implemented by calling the MPI_BARRIER function after each local result is computed at each processing unit, because MPI_BARRIER blocks the caller until all group members have called it.
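The five steps above can be summarized in the compact sketch below (an added illustration, separate from the full Implementation Example in 3.2.4). The array length, the use of MPI_INT, and the MPI_SUM reduction are assumptions chosen only to make the sketch concrete.

/* Minimal sketch of the five steps: size, distribute, execute,
   synchronize, reduce.  Assumes the global data size is divisible by
   the number of processes.                                            */
#include <mpi.h>
#include <stdlib.h>

#define SIZE_OF_DATA 2048

int main(int argc, char **argv)
{
    int numOfProc, myRank, i, localSum = 0, globalSum = 0;
    int *data = NULL, *localData;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numOfProc);   /* step 1 */
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    if (myRank == 0) {                           /* root owns the data */
        data = (int *)malloc(SIZE_OF_DATA * sizeof(int));
        for (i = 0; i < SIZE_OF_DATA; i++) data[i] = i;
    }
    localData = (int *)malloc(SIZE_OF_DATA / numOfProc * sizeof(int));

    /* step 2: distribute a distinct section of data to each process */
    MPI_Scatter(data, SIZE_OF_DATA / numOfProc, MPI_INT,
                localData, SIZE_OF_DATA / numOfProc, MPI_INT,
                0, MPI_COMM_WORLD);

    /* step 3: every process executes the same task on its own data */
    for (i = 0; i < SIZE_OF_DATA / numOfProc; i++)
        localSum += localData[i];

    /* step 4: synchronize all processes */
    MPI_Barrier(MPI_COMM_WORLD);

    /* step 5: combine the local results into a single result at root */
    MPI_Reduce(&localSum, &globalSum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}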


25 The produced local results can be combined to get the fina l solution of the problem. The Reduce functions of MPI are useful in combining subsolutions. The Reduce function of MPI combines the elements, which are provided in the input buffer of each process in the group, using the specified operation as a parameter, and returns the combined value in the output buffer of the process with rank of root. The operations are distributed to each process in many MPI implementations so they can improve overall performance of the program if it takes considerable time to co mbine subsolutions in one processing unit. 3.2.4 Implementation Example #include #define SIZE_OF_DATA 2000 /* size of data to be distributed into each process */ int root = 0; /* The rank(ID) of root process */ int myRank; /* process ID */ /*= Modification1========================================*/ /*= The data type of each variable should be modified =*/ int* data; /*= The data to be distributed into each process =*/ int* localData; /*= distributed data that each process will have =*/ int* subsolution; /*= subsolutaion data =*/ int* solution; /*= solution data =*/ /*===================================================*/ int sizeOfLocalData; /* number of elements in local data array */ int numOfProc; /* number of avaiable process */ int numOfElement; /* number of element in subsolution */ /***************************************************/ /** Highest level function which solves subproblem */ /***************************************************/ void solve() { /*= Implementation1==============================*/ /*=The code for solving subproblem should be implemented =*/ /*============================================*/ } /****************************************************/ /* Main Function which starts program */ /****************************************************/


26 main(argc, argv) int argc; char **argv; { MPI_Status status; /* MPI initialization */ MPI_Init(&argc, &argv); /* Finding out the rank (ID) of each process */ MPI_Comm_rank(MPI_COMM_WORLD,&myRank); /*Finding out the number of available processes */ MPI_Comm_size(MPI_COMM_WORLD,&numOfProc); /* Dividing the data by the number of processes */ sizeOfLocalData = SIZE_OF_DATA/numOfProc; /* Dynamic memory allocation for local data */ localData=(int *)mall oc(sizeOfLocalData*sizeof(int)); /*****************************************************************/ /* Distribute data into local data of each process */ /*****************************************************************/ MPI_Scatter( data, /* Address of send buffer (data) to distribute */ sizeOfLocalData, /* Number of element sent to each process */ MPI_INT, /*MODIFICATION2=================*/ /*= data type of the send buffer =*/ /*===============================*/ localData, /* Address of receive buffer (local data) */ sizeOfLocalData, /* Number of element in receive buffer (local */ /* data) */ MPI_INT, /*= MODIFICATION2===============*/ /*= Data type of receive buffer =*/ /*===============================*/ root, /* Rank of sending process */ MPI_COMM_WORLD /* Communicator */ ); /*solve sub-problem in each process */ solve(); /*Synchronize all processes to check all subproblems are solved */ MPI_Barrier(MPI_COMM_WORLD); /******************************************************** *****/


27 /* Combine sub-solutions to get the solution */ /*************************************************************/ MPI_Reduce( subsolution, /* address of send buffer (subsolution) */ solution, /* address of receive buffer (solution) */ numOfElement, /* number of elements in send buffer (subsolution) */ MPI_INT, /*= MODIFICATION3============*/ /*= data type of the send buffer =*/ /*=============================*/ MPI_MULT, /*= MODIFICATION4 ============*/ /*= reduce operation =*/ /*============================*/ root, /* rank of root process */ MPI_COMM_WORLD /* communicator */ ); MPI_Finalize(); 3.2.5 Example Usage The steps to reuse the Implementation Example are as follows: Implement the blocks labeled as IMPLEMENTATION1. This block should contain code for the “solve” function. Modify the data type of the variables which are in the block labeled as MODIFICATION1. Modify the data type parameters which are in the block labeled as MODIFICATION2. Each data type parameter must be one of MPI data type which matches with the type of the data to receive and send. Modify the data type parameters which ar e in the block labeled as MODIFICATION3. Each data type parameter must be one of MPI data type which matches with the type of the data to receive and send Modify the reduce operator parameter which are in the block labeled as MODIFICATION4. This parameter should be one of MPI operators. Consider a simple addition of each element in an integer array with size 1024. What should be modified is the solve function and fifth parameter of MPI_Reduce function as follows: void solve()


{
    int i;
    for (i = 0; i < sizeOfLocalData; i++) {
        subsolution[0] += localData[i];  /* accumulate the local partial sum */
    }
}
and the reduce operation parameter in the block labeled MODIFICATION4 becomes MPI_SUM, so that the partial sums produced by the processes are added into the final result.

If each task has a known amount of computation, the implementation of a parallel program is straightforward in the MPI and C environment. Each task can be implemented as a function and can be executed at each process using the process ID (rank) and an if-else statement or a switch statement. Since the amount of computation of each task is known, load balance should be achieved by distributing the tasks among processes when the program is implemented. If the amount of computation is not known, there should be some dynamic task scheduling mechanism for load balance. MPI does not provide a primitive process scheduling mechanism. One way of simulating a dynamic task scheduling mechanism is to use a task-name queue and the primitive message passing of MPI. We implemented this mechanism as follows:
Each task is implemented as a function.
The root process (with rank 0) has a task-name queue which contains the names of the tasks (functions).
Each process sends a message to the root process to get a task name whenever it is idle or has finished its task.
The root process with rank 0 waits for messages from the other processes. If the root process receives a message and the task queue is not empty, it sends back the name of a task from the task queue to the sender of that message.
When the process that sent the message for a task name receives the task name, it executes that task (function).
The Implementation Example is provided in 3.3.4. The MPI routines that are useful for combining the locally computed subsolutions into a global solution are described in 3.1.3.3.
3.3.4 Implementation Example
#include <mpi.h>
#include <stdlib.h>
#define EXIT -1 /* exit message */
#define TRUE 1


30 #define FALSE 0 #define ROOT 0 #define NUM_OF_ALGO 9 int oneElm =1; /* one element */ int tag =0 ; int algoNameSend= -1; int idleProc = -1; int algoNameRecv= -1; /* Size of Data to send and Receive */ int sizeOfData; /* number of tasks for each algorithm */ int numOfTask[NUM_OF_ALGO] ; int isAlgorithmName = TRUE; int moreExitMessage = TRUE; MPI_Status status; /*******************************************************/ /* Simple Queue implementation. */ /*******************************************************/ int SIZE_OF_Q = 20; int* queue; int qTail = 0; int qHead = 0; /******************************/ /* insert */ /******************************/ void insert(int task){ if(qHead-qTail==1) { /* when queue is full, increase array */ int* temp = (int *) malloc(2*SIZE_OF_Q*sizeof(int)); int i; for(i=0;i2){


31 /* queue is not full and there is more than one space */ qTail++; queue[qTail]= task; } else{ /* there is just one more space in queue */ qTail=0; queue[qTail]= task; } } /***************************************/ /* check whether or not queue is empty */ /***************************************/ int isEmpty() { if(qTail==qHead){ /* queue is empty */ return 1; } else{ return 0; } } /*************************/ /* remove */ /*************************/ int remove() { if(qHead

32 /*= the data which will be used by this algorithm in local process =*/ /*= More of this block can be added on needed basis =*/ int* data; /* Modify the data type*/ /* receive the size of data to receive */ MPI_Recv(&sizeOfData, oneElm,MPI_INT,ROOT, tag,MPI_COMM_WORLD,&status); /* dynamic memory allocation */ data = (int* /* Modify the data type*/ ) malloc(sizeOfData*sizeof(int /* Modify the data type*/ )); /* Receive the input data */ MPI_Recv(&data,sizeOfData, MPI_INT, /*=

33 /*=IMPLEMENTATION1 ==============================*/ /*= code for this algorithm should be implemented =*/ /*=================================================*/ /* Memory deallocation ex) free(data);*/ } void algorithm3(){ /*= DATA TRANSFER 1 =================================*/ /*= the data which will be used by this algorithm in local process =*/ /*= More of this block can be added on needed basis =*/ int* data; /* Modify the data type*/ /* receive the size of data to receive */ MPI_Recv(&sizeOfData, oneElm,MPI_INT,ROOT, tag,MPI_COMM_WORLD,&status); /* dynamic memory allocation */ data = (int* /* Modify the data type*/ ) malloc(sizeOfData*sizeof(int /* Modify the data type*/ )); /* Receive the input data */ MPI_Recv(&data,sizeOfData, MPI_INT, /*=

34 MPI_Init(&argc, &argv); /*find local rank*/ MPI_Comm_rank(MPI_COMM_WORLD, &myRank); /*find out last rank by using size of rank*/ MPI_Comm_size(MPI_COMM_WORLD,&mySize); /*****************************************************************/ /* This process distributes tasks to other processing elements */ /*****************************************************************/ if(myRank == 0) { /* code for root process. This root process receives messages from other processes when they are idle. Then this process removes an algorithm name for a task from task name queue and send it back to sender of the message. If the task queue is empty, this process sends back an exit message */ /*=MODIFICATION1=================================*/ /*= the data for each algorithm =*/ /*= the data type should be modified =*/ int* algo1Data; int* algo2Data; int* algo3Data; int* algo4Data; int* algo5Data; /*= =*/ /*=================================================*/ int numOfSentExit = 0; /* This array is a task queue. */ queue = (int *) malloc(SIZE_OF_Q*sizeof(int)); /* */ for(i=0;i

35 insert(i); } } /* Receive message from other processes */ while(moreExitMessage) { int destination= ROOT; /* Receive message from any process */ MPI_Recv(&algoNameRecv,oneElm,MPI_INT,MPI_ANY_SOURCE, MPI_ANY_TAG,MPI_COMM_WORLD,&status); destination = algoNameRecv; if(!isEmpty()) { /* If the task queue is not empty, send back an algorihtm name for a task to the sender of received message */ algoNameSend = queue[remove()]; MPI_Send(&algoNameSend,oneElm,MPI_INT, destination,tag,MPI_COMM_WORLD); switch(algoNameSend) { case 1: /*=IMPLEMENTATION2===========================*/ /*= code for calculating the size of data to send and =*/ /*= finding the starting address of data to send =*/ /*= should be implemented =*/ /*=============================================*/ /*=DATA TRANSFER2=============================*/ /*= More of this block can be added on needed basis =*/ /*= =*/ MPI_Send(&sizeOfData,oneElm,MPI_INT, destination,tag,MPI_COMM_WORLD); MPI_Send( &algo1Data, /*=

36 /*= code for calculating the size of data to send and =*/ /*= finding the starting address of data to send =*/ /*= should be implemented =*/ /*===========================================*/ /*=DATA TRANSFER2=============================*/ /*= More of this block can be added on needed basis =*/ /*= =*/ MPI_Send(&sizeOfData,oneElm,MPI_INT, destination,tag,MPI_COMM_WORLD); MPI_Send( &algo1Data, /*=

37 algoNameSend = EXIT; MPI_Send(&algoNameSend,oneElm,MPI_INT, destination,tag,MPI_COMM_WORLD); /* keep tracking the number of exit message sent. If the number of exit messages is same with the number of processes, root process will not receive any more message requesting tasks.*/ numOfSentExit++; if(numOfSentExit==mySize-1) { moreExitMessage=FALSE; } } /* computation and/or message passing for combining subsolution can be added here */ } } /***************************************************************/ /* Code for other processes, not root. These pr ocesses send message */ /* requesting next task to execute to root process when they are idle. */ /* These processes execute tasks received from root */ /***************************************************************/ else { idleProc = myRank; /*Send message to root process, send buffer contains the rank of sender process whenever it is idle*/ MPI_Send(&idleProc,oneElm,MPI_INT,ROOT,tag,MPI_COMM_WORLD); while(isAlgorithmName) /*while message contains an algorihtm name */ { /* Receive message from root which contains the name of task to execute in this process */ MPI_Recv(&algoNameRecv,oneElm,MPI_INT,ROOT,tag, MPI_COMM_WORLD,&status); /* each process executes tasks using the algorihtm name and data received from root */ switch(algoNameRecv) { /* More case statements should be added or removed, according to the number of tasks */ case EXIT : isAlgorithmName = FALSE;


38 break; case 1: algorithm1(); break; case 2: algorithm2(); break; case 3: algorithm3(); break; case 4: algorithm4(); break; case 5: algorithm5(); break; default: break; } if(algoNameRecv != EXIT) { /*Send message to root process, send buffer contains the rank of sender process whenever it is idle*/ idleProc = myRank; MPI_Send(&idleProc,oneElm,MPI_INT,ROOT,tag,MPI_COMM_WORLD); } } } /*============================================*/ /*= Codes for collecting subsolution and =*/ /*= computing final solution can be added here. =*/ /*============================================*/ MPI_Finalize(); } 3.3.5 Usage Example The steps to reuse this Implementation Example are as follows: Implement the blocks labeled as IMPLEMENTATION1. Each block should contain codes for each algorithm. Implement the block labeled as IMPLEMENTATION2. What should be implemented are codes for calculating the initial address and the number of elements of the send buffer for each algorithm.


Modify the data type of the variables and the data type parameters which are in the blocks labeled as DATA TRANSFER 1. The data type of each variable should match the data type of the input data for each algorithm. The purpose of these send functions is to send input data to the destination process. More of this block can be added on a needed basis.
Modify the data type and data type parameters which are in the block labeled as DATA TRANSFER2. Each data type parameter must be one of the MPI data types and match the type of the data to receive. The purpose of these receive functions is to receive input data for the algorithm execution on the local process. More of this block can be added on a needed basis.
Assume that there is a parallel program design in which the amount of computation for each task is unknown. If some of the tasks share the same algorithm, these tasks should be implemented as one algorithm function in one of the blocks labeled as IMPLEMENTATION1. To execute the tasks, the algorithm name should be put in the queue as many times as there are tasks that share that algorithm.
3.4 Implementation of Pipeline Processing
3.4.1 Intent
This is a solution to the problem of how to implement the Pipeline Processing pattern in the MPI and C environment.
3.4.2 Applicability
This pattern can be used after the program is designed using patterns of the finding concurrency design space and patterns of the algorithm structure design space, and the resulting design is a form of the Pipeline Processing pattern.
3.4.3 Implementation
Figure 3-2 illustrates the usage of message passing to schedule tasks in the Implementation Example. The arrows represent the blocking synchronous send and receive in MPI. The squares labeled C1, C2, etc., in Figure 3-2 represent the elements of calculation to be performed, which can overlap each other. The calculation elements are implemented as functions in the Implementation Example.


The calculation elements (functions) should be filled in with the computation by the user of this Implementation Example. Adding more functions for more calculations is trivial because the message passing calls inside the functions are very similar from function to function. Each stage (Stage 1, Stage 2, etc.) in Figure 3-2 corresponds to one process. In the Implementation Example, each stage calls a single function (calculation element) at a time, sequentially.
Figure 3-2. Usage of message passing: calculation elements C1-C4 are executed by Stage (Process) 1-4 and overlap in time as the pipeline fills.
One point to note is that the first pipeline stage does not receive a message and the last stage does not send a message for the scheduling of tasks. Scheduling of tasks is achieved by blocking, synchronous-mode point-to-point communications (MPI_SSEND, MPI_RECEIVE).
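The following small sketch (added for illustration; the stage computations are left as placeholders) shows the scheduling handshake just described: every stage except the first blocks in MPI_Recv, and every stage except the last signals the next stage with a synchronous MPI_Ssend. Sending a single integer token and using tag 1 are simplifications; the Implementation Example below passes a start message plus the actual data.

/* Sketch of the pipeline handshake: each process is one stage and is
   started by a synchronous message from the previous stage.           */
#include <mpi.h>

int main(int argc, char **argv)
{
    int myRank, mySize, i, token = 0, numOfCalElem = 4;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    MPI_Comm_size(MPI_COMM_WORLD, &mySize);

    for (i = 0; i < numOfCalElem; i++) {
        if (myRank > 0)              /* not the first stage: wait     */
            MPI_Recv(&token, 1, MPI_INT, myRank - 1, 1,
                     MPI_COMM_WORLD, &status);

        /* ... calculation element i of this stage goes here ... */

        if (myRank < mySize - 1)     /* not the last stage: hand off  */
            MPI_Ssend(&token, 1, MPI_INT, myRank + 1, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}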


41 In MPI, if the Send mode is blocking and synchronous and the Receive mode is blocking, then the Send can complete only if the matching Receive has started. The statements after the matching receive will not be executed until the receive operation is finished: MPI provides synchronous communication semantics. A duplicate of MPI_COMM_WORLD (initial intra-communicator of all processes) is used to provide separate communication space and safe communication. If we change the duplicate of MPI_COMM_WORLD into appropriate communicator for a group of processes then, this structure can be used for a portion of a program where the Pipeline Processing pattern can be used. 3.4.4 Implementation Example #include #include #include int numOfCalElem = 4; char startMsg[7]; int myRank; int mySize; int tag = 1; char* sendBuf, recvBuf; int sendBufSize, recvBufSize; MPI_Status status; /* Only four pipeline stages are implem ented in this Implementation Example. More elements can be added or removed, according to the design of the parallel program */ /******************/ /*First Pipeline Stage */ /******************/ void firstPipelineSta ge(MPI_Comm myComm) { /*=IMPLEMENTATION1========================================*/ /*= =*/


42 /*= code for this Pipeline Stage should be implemented here =*/ /*= =*/ /*==========================================================*/ /* send message to the next Pipeline Stage of process with myRank+1 */ MPI_Ssend(startMsg,strlen(startMsg),MPI_CHAR ,myRank+1,tag,MPI_COMM_WORLD); /*=DATA TRANSFER 1======================================*/ /*= More send function, which has same structure, can be added =*/ /*= to transfer data =*/ MPI_Ssend( /*= Modify the following parameters =*/ sendBuf, /*

43 /* send message to the next Pipeline Stage of process with myRank+1 */ MPI_Ssend(startMsg,strlen(startMsg),MPI_CHAR,myRank+1,tag, MPI_COMM_WORLD); /*=DATA TRANSFER 1======================================*/ /*= More send function, which has same structure, can be added =*/ /*= to transfer data =*/ MPI_Ssend( /*= Modify the following parameters =*/ sendBuf, /*

44 /*= More send function, which has same structure, can be added =*/ /*= to transfer data =*/ MPI_Ssend( /*= Modify the following parameters =*/ sendBuf, /*

45 MPI_CHAR, /*

46 /********************/ /* Last Pipeline Stage */ /********************/ void lastPipelineStage(MPI_Comm myComm) { /* Receive message from previous Pipeline Stage of process with myRank-1 */ MPI_Recv(startMsg,strlen(s tartMsg),MPI_CHAR,myRank-1, tag, MPI_COMM_WORLD,&status); /*=DATA TRANSFER 2====================================*/ /*= More receive f unctions, which has same structure, can be added =*/ /*= to transfer data =*/ MPI_Recv( /*= Modify the following parameters =*/ recvBuf, /*=

47 switch(myRank) { case 0 : for(i=0;i

Add or remove the case statements in the block labeled as MODIFICATION2, according to the number of pipeline stages of the program design.
Implement the blocks labeled as IMPLEMENTATION1. Each block should contain the code for one calculation element.
Modify the initial address of the send buffer, the number of elements in the send buffer, and the data type parameters which are in the block labeled as DATA TRANSFER1. The data to be sent is the input for the next calculation element. More send functions, which have the same structure, can be added according to need.
Modify the initial address of the receive buffer, the number of elements in the receive buffer, and the data type parameters which are in the block labeled as DATA TRANSFER2. The data to be received is the input for this calculation element. More receive functions, which have the same structure, can be added according to need.
Consider the following problem as an example to parallelize: there are the SAT scores of four schools, and the standard deviation for each school must be found, but the main memory of one processor barely holds the scores of one school. To solve this problem, the pipeline stages can be defined as follows: the first pipeline stage computes the sum and average of the SAT scores of each school; the second pipeline stage computes the difference between each individual score and the average; the third pipeline stage computes the square of each difference; and the fourth pipeline stage computes the sum of the computed squares. This problem can easily be implemented by following the above steps. The implementation is provided in the appendix.
3.5 Implementation of Asynchronous-Composition
3.5.1 Intent
This is a solution to the problem of how to implement a parallel algorithm resulting from the Asynchronous-Composition pattern of the algorithm structure design space in the MPI (Message Passing Interface) and C environment.


3.5.2 Applicability
Problems are represented as a collection of semi-independent entities interacting in an irregular way.
3.5.3 Implementation
Three things need to be defined to implement an algorithm resulting from the Asynchronous-Composition pattern using MPI: tasks/entities, events, and task scheduling. A task/entity, which generates events and processes them, can be represented as a process in an MPI implementation of the algorithm. An event corresponds to a message sent from the event-generating task (process) to the event-processing task (process); therefore, in the MPI and C environment, event handling is message handling. All tasks can be executed concurrently. In this Implementation Example, each case block in a switch statement should contain the code for one semi-independent entity (process), and these entities execute concurrently.
For safe message passing among processes, it is necessary to create groups of the processes that need to communicate and to create communicators for those groups in MPI. Because of that, we added group and communicator creation to this Implementation Example. This Implementation Example is also an example of event handling in the MPI environment. In a situation where an entity (process) receives irregular events (messages) from other known entities (processes), we can implement it using the MPI-defined constants MPI_ANY_SOURCE and MPI_ANY_TAG.
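As a small added illustration of this point, the sketch below has one event-processing entity receive events from any other process and uses the MPI_Status fields to find out which entity generated each event. Duplicating MPI_COMM_WORLD as the dedicated communication space, sending a single integer as the event payload, and handling exactly one event per sender are assumptions made only for the sketch; the fuller Implementation Example follows.

/* Sketch: one event-processing entity (rank 0) handling irregular
   events from all other processes via MPI_ANY_SOURCE / MPI_ANY_TAG.   */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, event, i;
    MPI_Comm eventComm;                 /* dedicated communication space */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &eventComm);
    MPI_Comm_rank(eventComm, &rank);
    MPI_Comm_size(eventComm, &size);

    if (rank == 0) {
        /* the event-processing entity: handle one event from every other
           entity, in whatever order the events happen to arrive         */
        for (i = 0; i < size - 1; i++) {
            MPI_Recv(&event, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     eventComm, &status);
            printf("event %d generated by process %d (tag %d)\n",
                   event, status.MPI_SOURCE, status.MPI_TAG);
        }
    } else {
        /* event-generating entities: generate one event each            */
        event = rank;
        MPI_Send(&event, 1, MPI_INT, 0, rank, eventComm);
    }

    MPI_Comm_free(&eventComm);
    MPI_Finalize();
    return 0;
}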


Assume that processes A, B, and C send messages in an irregular way to process D, and process D handles (processes) the events (messages). This is the same situation as the car/driver example of the Asynchronous-Composition pattern. To use MPI_ANY_SOURCE and MPI_ANY_TAG in the MPI receive routine, we need to create a dedicated group and communicator for these entities (processes), to prevent entity (process) D from receiving messages from other entities (processes) that are not intended to send messages to it.
Figure 3-3. Irregular message handling: processes A, B, and C send events to process D.
3.5.4 Implementation Example
#include <mpi.h>
#include <stdio.h>
#include <string.h>
main(argc, argv)
int argc;
char **argv;
{
/* More group variables can be added or removed on need basis*/


51 MPI_Group MPI_GROUP_WORLD,group_A,group_B; /*More communicator variables can be added or removed on need basis */ MPI_Comm comm_A,comm_B; /*This varilable will hold the rank for each process in MPI default communicator MPI_COMM_WORLD */ int rank_in_world; /*This variable will hold the rank for each process in group A. Variables can be added or removed on need basis.*/ int rank_in_group_A; int rank_in_group_B; /* ranks of processes which will be used to create subgroups. More array of ranks can be added or removed on need basis.*/ int ranks_a[]={1,2,3,4}; int ranks_b[]={1,2,3,4}; MPI_Init(&argc, &argv); /*Create a group of processes and communicator for a group More groups and communicator can be created or removed on need basis*/ MPI_Comm_group(MPI_COMM_WORLD,&MPI_GROUP_WORLD); /*Create group and communicator */ MPI_Group_incl(MPI_GROUP_WORLD,4,ranks_a,&group_A); MPI_Comm_create(MPI_COMM_WORLD,group_A,&comm_A); /*Create group and communicator */ MPI_Group_incl(MPI_GROUP_WORLD,4,ranks_b,&group_B); MPI_Comm_create(MPI_COMM_WORLD,group_B,&comm_B); MPI_Comm_rank(MPI_COMM_WORLD, &rank_in_world); switch(rank_in_world) { /*This case 1 contains codes to be executed in process 0 */ case 0: { /* events can be generated or processed */ /* work */ break;


52 } /*This case contains codes to be executed in process 1 */ case 1: { char sendBuffer[20]; int isOn=1; int i=0; while(isOn) { /* works that need to be done before generating event */ strcpy(sendBuffer,"event"); /* an example */ /* Generate Event (message passing) */ MPI_Send(sendBuffer,strlen(sendBuffer),MPI_CHAR,3,1,comm_B); printf("sent message"); i++; /* break loop */ if(i==10) /* this should be changed according to the problem */ { isOn = 0; } } break; } /*This case contains codes to be executed in process 2 */ case 2: { char sendBuffer[20]; int isOn=1; int i=0; while(isOn) { /* works that need to be done before generating event */ strcpy(sendBuffer,"event");/* an example */ /* Generate Event (message) */ MPI_Send(sendBuffer,strlen (sendBuffer),MPI_CHAR,3,1,comm_B); i++;


53 /* break loop */ if(i==10) /* this should be changed according to the problem */ { isOn = 0; } } break; } /*This case contains codes to be executed in process 3 */ case 3: { char sendBuffer[20]; int isOn=1; int i=0; while(isOn) { /* works that need to be done before generating event */ strcpy(sendBuffer,"event"); /* an example */ /* Generate Event (message) */ MPI_Send(sendBuffer,strlen (sendBuffer),MPI_CHAR,3,1,comm_B); i++; /* break loop */ if(i==10) /* this should be changed according to the problem */ { isOn = 0; } } break; } /*This case contains codes to be executed in process 4 */ case 4: { char receiveBuffer[20]; int isOn = 1; int messageCount=0; MPI_Status status; while(isOn) { MPI_Recv(receiveBuffer,20 ,MPI_CHAR,MPI_ANY_SOURCE, MPI_ANY_TAG,comm_B,&status);


messageCount++;
if(0==strncmp(receiveBuffer,"event",3))
{
    /* work to process the event (message) */
    printf("\nreceived an event at process 4");
    if(messageCount==30)
    {
        isOn = 0;
    }
}
}
break;
}
/* more cases(processes) can be added or removed. */
default:
break;
}
MPI_Finalize();
}
3.6 Implementation of Divide and Conquer
3.6.1 Intent
For the completeness of the parallel pattern language, it is beneficial to show a programmer an implementation example of a top-level program structure for the Divide-and-Conquer pattern in the algorithm structure design space.
3.6.2 Motivation
The top-level program structure of the divide-and-conquer strategy that is stated in the Divide-and-Conquer pattern is as follows:
Solution solve(Problem P){
    if (baseCase(P))
        return baseSolve(P);
    else {
        Problem subProblems[N];
        Solution subSolutions[N];
        subProblems = split(P);
        for (int i=0; i<N; i++)
            subSolutions[i] = solve(subProblems[i]);
        return merge(subSolutions);
    }

55 } This structure can be mapped onto a design in terms of tasks by defining one task for each invocation of the solve function. The common MPI implementations have a st atic process allocation at initialization time of the program. We need a mechanism to invoke the solve function at another processor whenever the solve function calls the split function, and the split function splits the problem into sub-problems so that subproblems can be solved concurrently. 3.6.3 Applicability This pattern can be used after the program is designed using patterns of the parallel pattern language and the resulting design is a form of the Divide-and-Conquer pattern and if we want to implement the design in an MPI and C environment. This structure can be used as an example of how to implement a top-level program structure in an MPI and C environment or directly adopting this structure as an implementation of the program by adjusting control parameters and adding a computational part of the program. 3.6.4 Implementation We are trying to implement a top-level program structure of the divide-and-conquer strategy in MPI and C so that parallel program design resulting from the Divide-andConquer pattern can be implemented by filling up each function or/and adjusting the structure. The basic idea is using message passing to invoke the solve functions at other processing elements when needed. In a standard MPI and C environment, each processing element has the same copy of the program with other processing elements, and same program executes at each processing element communicating with each other on a needed basis. When the problem splits into subproblems, we need to call the solve


function at other processes so that the subproblems can be solved concurrently. To call the solve functions, the split function sends messages to other processing elements, and a blocking MPI_Receive receives the message before the first solve function call at each process. This message passing can carry the data/tasks divided by the split function. In this structure, every solve function calls the merge function to merge subsolutions.
Figure 3-4. Message passing for invocation of the solve and merge functions: eight CPUs are shown; each split sends a message that starts a solve on another CPU, and the subsolutions are combined by the merge calls.
For simplicity, we split a problem into two subproblems and used the problem size to determine whether or not a subproblem is a base case. Figure 3-4 shows the sequence of


57 function calls in each process and message passing to invoke the solve function at remote process for new sub-problem and to merge sub-solutions. 3.6.5 Implementation Example #include #include #include #define DATA_TYPE int int numOfProc; /*number of available processes*/ int my_rank; int ctrlMsgSend; int ctrlMsgRecv; int* localData; int dataSizeSend; int dataSizeRecv; /********************/ /* Solve a problem */ /********************/ void solve(int numOfProcLeft) { if(baseCase(numOfProcLeft)) { baseSolve(numOfProcLeft); merge(numOfProcLeft); } else { split(numOfProcLeft); if(numOfProcLeft!=numOfProc) { merge(numOfProcLeft); } } } /*****************************************************/ /* split a problem into two subproblems */ /*****************************************************/ int split(int numOfProcLeft)


58 { /*=IMPLEMENTATION2 ==============================*/ /*= Code for splitting a problem into two su bproblems =*/ /*= should be implemented =*/ /*= =*/ /*================================================*/ ctrlMsgSend = numOfProcLeft/2; /* invoke a solve function at the remote process */ MPI_Send(&ctrlMsgSend,1,MPI_INT, my_rank+numOfProc/numOfProcLeft, 0,MPI_COMM_WORLD); /*=DATA TRANSFER 1 ==============================*/ /*= More of this block can be added on needed basis =*/ MPI_Send(&dataSizeSend,1,MPI_INT, my_rank+numOfProc/numOfProcLeft, 0, MPI_COMM_WORLD); MPI_Send( &localData[dataSizeLeft], /*= numOfProc/(numOfProcLeft*2)) { ctrlMsgSend = 1; /* Send a subsolution to the process from which this process got the subproblem */ MPI_Send(&ctrlMsgSend,1,MPI_INT, my_rank numOfProc/(numOfProcLeft*2), 0,MPI_COMM_WORLD); /*=DATA TRANSFER 3=============================*/


59 /*= More of this block can be added on needed basis =*/ MPI_Send(&dataSizeSend,1,MPI_INT, my_rank numOfProc/(numOfProcLeft*2), 0,MPI_COMM_WORLD); MPI_Send( localData, /*<-modify address of data */ dataSizeSend, MPI_INT, /*<-modify data type */ my_rank numOfProc/(numOfProcLeft*2), 0,MPI_COMM_WORLD); /*= =*/ /*=============================================*/ } else { MPI_Status status; /* Receive a subsolution from the process which was invoked by this process */ MPI_Recv(&ctrlMsgRecv,1,MPI_INT, my_rank+numOfProc/(numOfProcLeft*2), 0,MPI_COMM_WORLD,&status); /*=DATA TRANSFER 3===============================*/ /*= More of this block can be added on needed basis =*/ MPI_Recv(&dataSizeRecv,1,MPI_INT, my_rank+numOfProc/(numOfProcLeft*2), 0,MPI_COMM_WORLD,&status); MPI_Recv( &localData[dataSizeLeft], /*

60 /***********************************************/ /* Decide whether a problem is a "base case" */ /*that can be solved without further splitting */ /***********************************************/ int baseCase(int numOfProcLeft) { if(numOfProcLeft == 1) return 1; else return 0; } /*****************************/ /* Solve a base-case problem */ /*****************************/ int baseSolve(int numOfProcLeft) { /*=IMPLEMENTATION1=================================*/ /*== Code for solving base case problem should be implemented ==*/ /*== ==*/ /*===================================================*/ return 0; } main(argc, argv) int argc; char **argv; { int i; MPI_Status status; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); MPI_Comm_size(MPI_COMM_WORLD, &numOfProc); count = (int *)malloc((maxInt+1)*sizeof(int)); if(my_rank == 0){ ctrlMsgSend=numOfProc; /* number of processes must be power of 2 */ } ctrlMsgRecv = numOfProc; solve(ctrlMsgRecv); } else{ /* Every process waits a message before calling solve function */


61 MPI_Recv(&ctrlMsgRecv,1,MPI_INT,MPI_ANY_SOURCE,MPI_ANY_TAG, MPI_CO MM_WORLD,&status); /*=DATA TRANSFER 2===============================*/ /*= More of this block can be added on needed basis =*/ MPI_Recv(&dataSizeR ecv,1,MPI_INT,MPI_ANY_SOURCE, MPI_ANY_TAG,MPI_COMM_WORLD,&status); MPI_Recv( localData, /*

TRANSFER1. The data to be sent is the input for the solve function of the remote process. More of this block can be added on a needed basis.
Find out and modify the initial address of the receive buffer, the number of elements to receive, and the data type parameters which are in the block labeled as DATA TRANSFER2. The data to receive is the input for the solve function of the local process. More of this block can be added on a needed basis.
Find out and modify the initial address of the receive buffer, the number of elements to receive, and the data type parameters which are in the block labeled as DATA TRANSFER3. The data to receive and send are the input for the merge function of the local process. More of this block can be added on a needed basis.
One point to note about using the Implementation Example is that its "baseCase" is reached when there are no more available processes.
Merge sort is a well-known algorithm that uses the divide-and-conquer strategy. If we consider the merge sort of N integers as the problem to parallelize, it can easily be implemented using the Implementation Example. To use the Implementation Example, the merge sort algorithm should be adapted as follows:
The "baseSolve" is a counting sort over the local data (an array of integers).
The "split" algorithm divides the received data (an array of integers) into two contiguous subarrays.
The "merge" algorithm merges two sorted arrays into one sorted array.
The "baseCase" of this problem is reached when there are no more available processes.
The problem can be implemented by following the implementation steps above. The implementation of this problem is provided in the appendix.
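As an added sketch of the merge step just mentioned, merging two sorted integer subarrays could look like the following; the function and variable names are illustrative, and the thesis's full merge sort implementation is the one in the appendix.

/* Sketch: merge two sorted integer arrays a[0..na-1] and b[0..nb-1]
   into the caller-supplied array out[0..na+nb-1].                     */
void merge_sorted(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, k = 0;

    while (i < na && j < nb)            /* take the smaller head element */
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na)                      /* copy whatever is left in a    */
        out[k++] = a[i++];
    while (j < nb)                      /* copy whatever is left in b    */
        out[k++] = b[j++];
}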


CHAPTER 4
KERNEL IS OF NAS BENCHMARK
The Numerical Aerodynamic Simulation (NAS) Program, which is based at NASA Ames Research Center, developed a set of benchmarks for the performance evaluation of highly parallel supercomputers. These benchmarks, NPB 1 (NAS Parallel Benchmarks 1), consist of five parallel kernels and three simulated application benchmarks. The principal distinguishing feature of these benchmarks is their "pencil and paper" specification: all details of these benchmarks are specified only algorithmically. The Kernel IS (parallel sort over small integers) is one of the kernels in the benchmark set.
4.1 Brief Statement of Problem
The problem is to sort N integer keys in parallel. The number of keys is 2^23 for class A and 2^25 for class B. The range of the keys is [0, B_max), where B_max is 2^19 for class A and 2^21 for class B. The keys must be generated by the key generation algorithm of the NAS benchmark set, and the initial distribution of the keys must follow the specification of the memory mapping of keys. Even though the problem is sorting, what will be timed is the time needed to rank every key; the permutation of the keys will not be timed.
4.2 Key Generation and Memory Mapping
The keys must be generated using the pseudorandom number generator of the NAS Parallel Benchmarks. The numbers generated by this pseudorandom number generator lie in the range (0, 1) and have a very nearly uniform distribution on the unit interval.11-12 The keys are generated from these numbers in the following way.


Let r_f be a random fraction uniformly distributed in the range (0, 1), and let K_i be the i-th key. The value of K_i is determined as

K_i = ⌊ B_max (r_{4i} + r_{4i+1} + r_{4i+2} + r_{4i+3}) / 4 ⌋   for i = 0, 1, ..., N-1.

K_i must be an integer, and ⌊ ⌋ indicates truncation. B_max is 2^19 for class A and 2^21 for class B. For the distribution of keys among memory, all keys initially must be stored in the main memory units, and each memory unit must have the same amount of keys in a contiguous address space. If the keys cannot be evenly divided, the last memory unit can have a different amount of keys, but it must follow the specification of the NAS Benchmark. See Appendix A for details.
4.3 Procedure and Timing
The implementation of Kernel IS must follow this procedure. The partial verification of this procedure tests the ranks of five unique keys, where each key has a unique rank. The full verification rearranges the sequence of keys using the computed rank of each key and tests that the keys are sorted.
1. In a scalar sequential manner and using the key generation algorithm described above, generate the sequence of N keys.
2. Using the appropriate memory mapping described above, load the N keys into the memory system.
3. Begin timing.
4. Do, for i = 1 to I_max:
(a) Modify the sequence of keys by making the following two changes:
K_i ← i
K_{i+I_max} ← (B_max - i)


(b) Compute the rank of each key.
(c) Perform the partial verification test described above.
5. End timing.
6. Perform the full verification test described previously.

Table 4-1: Parameter Values to Be Used for Benchmark

Parameter   Class A       Class B
N           2^23          2^25
B_max       2^19          2^21
seed        314159265     314159265
I_max       10            10
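A sketch of the timed portion of this procedure is given below. The functions rank_keys() and partial_verify() are placeholders for the ranking and verification code (they are not routines defined by the benchmark), and declaring the key array and B_max as externals is only a convenience for the sketch; the key modifications follow step 4(a) above, and MPI_Wtime is used for timing.

/* Sketch of the timed benchmark loop (steps 3-5).  K is the key array
   and B_MAX follows Table 4-1; rank_keys() and partial_verify() are
   placeholders for the real ranking and verification code.            */
#include <mpi.h>
#include <stdio.h>

#define I_MAX 10

extern int  K[];                 /* generated keys, already distributed */
extern int  B_MAX;
extern void rank_keys(void);
extern void partial_verify(int iteration);

void timed_section(void)
{
    int i;
    double start, elapsed;

    start = MPI_Wtime();                         /* 3. begin timing     */
    for (i = 1; i <= I_MAX; i++) {
        K[i] = i;                                /* 4(a) modify keys    */
        K[i + I_MAX] = B_MAX - i;
        rank_keys();                             /* 4(b) compute ranks  */
        partial_verify(i);                       /* 4(c) partial verify */
    }
    elapsed = MPI_Wtime() - start;               /* 5. end timing       */
    printf("timed section: %f seconds\n", elapsed);
}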


CHAPTER 5
PARALLEL PATTERNS USED TO IMPLEMENT KERNEL IS
In this chapter, we explain how we used the parallel pattern language to design the parallel program and implement it in an MPI and C environment.
5.1 Finding Concurrency
5.1.1 Getting Started
As advised by this pattern, we have to understand the problem we are trying to solve. According to the specification of Kernel IS of the NAS Parallel Benchmarks, the range of the keys to be sorted is [0, 2^19) for class A and [0, 2^21) for class B. Because of this range, bucket sort, radix sort, and counting sort can sort the keys in O(n) time.13 These sequential sorting algorithms are good candidates to parallelize because of their speed. The Counting Sort was selected as the target algorithm to parallelize for the following reasons. According to the specification of the benchmark, what is timed is the time needed to find the rank of each key, and the actual permutation of keys occurs after the timing, so it is important to minimize the time needed to find the rank of each key. One interesting characteristic distinguishes the three candidate sorting algorithms: the Radix Sort and the Bucket Sort permute the keys to find the rank of each key; in other words, the keys must be sorted before the rank of every key is known. The Counting Sort, in contrast, does not need to sort the keys to find the rank of every key, which means that the Counting Sort takes less time to rank every key than the others. For this reason, the Counting Sort was selected. Another reason is the simplicity of the Counting Sort algorithm.


The basic idea of the Counting Sort algorithm is to determine, for each element x, the number of elements less than x. This information can be used to place element x directly into its position in the output array; the number of elements less than x is rank(x) - 1. The Counting Sort algorithm is as follows:

Counting-Sort(A, B, k)
// A is an input array. B is an output array. C is an intermediate array.
// k is the biggest key in the input array.
for i ← 1 to k
    do C[i] ← 0
for j ← 1 to length[A]
    do C[A[j]] ← C[A[j]] + 1
// C[i] now contains the number of elements equal to i
for i ← 2 to k
    do C[i] ← C[i] + C[i-1]
// C[i] now contains the number of elements less than or equal to i
for j ← length[A] downto 1
    do B[C[A[j]]] ← A[j]
       C[A[j]] ← C[A[j]] - 1

Algorithm 4-1 Counting Sort

5.1.2 Decomposition Strategy
The decision to be made in this pattern is whether to follow the data decomposition or the task decomposition pattern. The key data structure of the Counting Sort algorithm is an array of integer keys, and the most compute-intensive part of the algorithm is counting each occurrence of the same keys in the integer key array and summing up the counts to determine the rank of each key. The task-based decomposition is a good starting point if it is natural to think about the problem in terms of a collection of independent (or nearly independent) tasks. We can think of the problem in terms of tasks that count each occurrence of the same keys in the key array. So we followed the task decomposition pattern.
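Because only the ranks are timed in Kernel IS, the part of Algorithm 4-1 that matters most is the counting and prefix-sum phase. The added C sketch below computes rank(x) = (number of keys less than x) + 1 for every key without permuting the array; the function name and the caller-supplied key range are illustrative assumptions.

/* Sketch: compute the rank of every key (1 + number of smaller keys)
   using the counting phase of counting sort, without sorting.         */
#include <stdlib.h>

void compute_ranks(const int *key, int n, int maxKey, int *rank)
{
    int i;
    int *count = (int *)calloc(maxKey + 1, sizeof(int));

    for (i = 0; i < n; i++)             /* count occurrences of each key */
        count[key[i]]++;
    for (i = 1; i <= maxKey; i++)       /* prefix sums: keys <= i        */
        count[i] += count[i - 1];
    for (i = 0; i < n; i++)             /* keys strictly less than key[i],
                                           plus one, is its rank         */
        rank[i] = (key[i] > 0 ? count[key[i] - 1] : 0) + 1;

    free(count);
}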


68 5.1.3 Task Decomposition The Task Decomposition pattern suggests findin g tasks in functions, loop, and updates on data. We tried to find tasks in one of th e four loops in the Counting Sort algorithm. We found, using this pattern, that there ar e large enough independent iterations in the second for loop of the counting sort algorithm. These iterations can be divided into many enough relatively independent tasks which find out rank of each key. We can think that each task concurrently counts each occurrence of the same keys on array A. The array A is shared among tasks right now and it is not divided yet. Because of dependency among tasks, this pattern recommends using Dependency Analysis pattern. 5.1.4 Dependency Analysis Using this pattern, data sharing dependencies are found on array A. Each task is sharing the array A because the array A is accessed by each task to read a number in each cell of the array. If one task has read and counted the integer number of a cell, that number should not be read and counted again. Array C is shared, too, because each task access this array to accumulate the count for each number, and one cell must not be accessed or updated concurrently by two or more tasks at the same time. We follow Data-Sharing Pattern since data sharing dependencies are found. 5.1.5 Data Sharing Pattern Array A can be considered read only data among the three categories of shared data according to this pattern, because it is not updated. Array A can be partitioned into subsets, each of which is accessed by only one of the tasks. Therefore Array A falls into effectively-local category.


69 The shared data, array C, is one of the special cases of read/write data. It is an accumulate case because this array is used to accumulate the counts of each occurrence of the same keys. For this case, this pattern su ggests that each task has a separate copy so that the accumulations occur in these local copies. The local copies are then accumulated into a single global copy as a final step at the end of the computation. What has been done until this point is a parallelization of second loop of Counting Sort algorithm. The fourth for loop can be parallelized as follows: Each task has its own ranks of keys in its own data set so the keys can be sorted in each process. To sort these locally sorted keys, redistribute keys after finding out the range of keys for each task so that each task has approximately same amount of keys and the ranges of keys in ascending order among tasks. Then the tasks of sorting redistributed keys have a dependency on locally sorted keys. But this dependency is read only and effectively local so the locally sorted keys can be redistributed among processes without complication, according to the ranges, because those are already sorted. Then each task can merge the sorted keys in its own range with out dependency using the global count. Therefore, the final design of parallel sorting (the implementation of Kernel IS) follows. First, each task counts each occurrence of the same keys on its own subset data, accumulates the results on its own output array, and then reduces them into one output array. Second, each task redistributes the keys into processes, according to the range of keys of each process, and merges the redistributed keys. 5.1.6 Design Evaluation Our target platform is Ethernet-connected Unix workstations. It is a distributed memory system. This Design Evaluation pattern says that it usually makes it easier to


70 keep all the processing elements busy by having many more tasks than processing elements. But using more than one UE per PE is not recommended in this pattern when the target system does not provide efficient support for multiple UEs per PEs (Processing Element. Generic term used to reference a hardware element in a parallel computer that executes a stream of instructions), and the Design can not make good use of multiple UEs per PE, which is our case because of reduction of output array (if we have more local output array than the number of PEs, then th e time needed to reduce to one output will take much longer). So our program adjusts the number of tasks into the same number of workstations. This pattern questions simplicity, flexibility, and efficiency of the design. Our design is simple because each generated task will find out rank of each key on its own subset data (keys) and then reduce them into global ranks. The next steps of sorting locally and redistribution of keys and merging the redistributed local key arrays are also simple because we already know the global ranks. It is also flexible and efficient because the number of tasks is adjustable and the computational load is evenly balanced. 5.2 Algorithm Structure Design Space 5.2.1 Choose Structure The Algorithm-Structure decision tree of this pattern is used to arrive at an appropriate Algorithm-Structure for our problem. The major organizing principle of our design is organization by tasks because we used the loop-splitting technique of the Task-D ecomposition pattern, so we arrived at the Organized-by-Tasks branch point. At that branch point, we take a partitioning branch because our ranking tasks can be gathered into a set linear in any number of dimensions.


71 There are dependencies between tasks and it is an associative accumulation into shared data structures. Therefore we arrived at the Separable-Dependencies pattern. 5.2.2 Separable Dependencies A set of tasks that will be executed concurrently in each processing unit corresponds to iterations of a second for loop and third for loop of Algorithm 4.1. Each task will be independent of each other because each task will use its own data. This pattern recommends balancing the load at each PE. The size of each data is equal because of data distribution specification of Kernel IS so the load is balanced at each PE. The specification of keys distribution of Kernel IS is also satisfied. The fact that all the tasks are independent leads to the Embarrassingly Parallel Pattern. 5.2.3 Embarrassingly Parallel Each task of finding all the ranks of distinct key array can be represented as a process. Because each task will have almost same amount of keys, each task will have same amount of computation. So the static schedu ling of the tasks will be effective as advised by this pattern. For the correctness considera tions of this pattern, each tasks solve the subproblem independently, solve each subproblem exactly once, correctly save subsolutions, and correctly combine subsolutions. 5.3 Using Implementation Example The design of the parallel in teger sort problem satisfies the condition of simplest form of parallel program as follows: All the tasks are independent. All the tasks must be completed. Each task executes same algorithm on a distinct section of data. Therefore the resulted design is a simplest form of Embarrassingly Parallel Pattern. The Implementation example is provided in Chapter 3. Using the Implementation Example of the Embarrassingly Parallel pattern, the Kern el IS of NAS Parallel Benchmark set can be


72 implemented. The basic idea of Implementation Example of Embarrassingly Parallel pattern is that of scattering data to each process, compute subsolutions, and then combine subsolutions into a solution for the problem. If we apply the Implementation Example two times, one time for finding rank of each key and one for sorting the keys in local processes, the Kernel IS can be impelemented. Another method of implementation is to use Implementation Example of Divide Conquer pattern. But this Implementation Example is not chosen because it is hard to measure the elapsed time for ranking all the keys. 5.4 Algorithm for Parallel Implementation of Kernel IS The following algorithm illustrates the parallel design for implementation of Kernel IS (parallel sort over small integer) that has been obtained through the parallel design patterns. 1. Generate the sequence of N keys using key generation algorithm of NAS Parallel Benchmarks. 2. Divide the keys by the number of PEs and distribute to each memory of PEs. 3. Begin timing. 4. Do, for 1 i to maxI (a) Modify the sequence of keys by making the following two changes: i Ki ) (maxmaxi B KI i (b) Each task (process) finds out the rank of each key in its own data set (keys) using ranking algorithm of counting sort. (c) Reduce the arrays of ranks in its own local data into an array of ranks in global data. (d) Do partial verification.

5. End timing.
6. Perform the permutation of the keys according to the ranks:
(a) Each task sorts its local keys according to the ranks among its local keys.
(b) Compute the ranges of keys each process will hold, so that each process holds nearly the same number of keys and the ranges of keys are in ascending order.
(c) Redistribute the keys among the processes according to the range of each process.
(d) Each task (process) merges its redistributed key arrays.
7. Perform the full verification.
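The ranking step referred to in step 4(b) is the counting phase of counting sort. The sketch below is illustrative (the array names are not those of the actual implementation in Appendix C) and assumes the keys lie in the range 0 … maxKey.

/* Counting-sort ranking sketch: after the prefix sum, count[k] holds the
   number of keys whose value is <= k, i.e. the rank information used later
   for the permutation step. */
void rank_keys(const int *keys, int numKeys, int maxKey, int *count)
{
    for (int k = 0; k <= maxKey; k++)
        count[k] = 0;
    for (int i = 0; i < numKeys; i++)    /* histogram of key values */
        count[keys[i]]++;
    for (int k = 1; k <= maxKey; k++)    /* running (prefix) sum */
        count[k] += count[k - 1];
}

In the parallel design each process applies this to its own subset, and the per-process count arrays are then summed element-wise in step 4(c); this gives the same result as ranking the combined data, because the histogram is additive across subsets and the prefix sum is a linear operation.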

CHAPTER 6 PERFORMANCE RESULTS AND DISCUSSIONS
6.1 Performance Expectation
An ideal execution time for a parallel program can be expressed as T/N, where T is the total time taken with one processing element and N is the number of processing elements used. Our implementation of Kernel IS of the NAS Parallel Benchmarks will not achieve this ideal execution time because of overhead from several sources. One source is the computation needed to reduce the local arrays of ranks into one array of global ranks: the more local rank arrays and processing elements we use, the more computation time this reduction needs, and the wider the gap between the ideal and the actual execution time becomes. Another source of overhead is communication, because MPI uses message transfer, which typically involves both overhead due to kernel calls and latency due to the time it takes a message to travel over the network.
6.2 Performance Results
The Kernel IS implementation was executed on Ethernet-connected workstations under LAM 6.3.2, an implementation of MPI.14 The workstations are Sun Blade 100 machines with a 500-MHz UltraSPARC-IIe CPU, 256-KB L2 external cache, 256-MB DIMM memory, and self-sensing twisted-pair Ethernet/Fast Ethernet (10BASE-T and 100BASE-T). Table 6-1 shows the performance results for Class A and Class B. The rows show the total number of processing elements (workstations) used to execute the Kernel IS implementation.

The columns show the actual execution time of the Kernel IS implementation and the ideal execution time, in milliseconds, for Class A and Class B. The performance results of NPB2.2 (NAS Parallel Benchmarks 2.2: MPI-based source-code implementations written and distributed by NAS, intended to be run with little or no tuning, which approximate the performance a typical user can expect to obtain from a portable parallel program; they supplement, rather than replace, NPB 1, and NAS solicits performance results from all sources) for Classes A and B are shown for comparison.15
Table 6-1 Performance results for Classes A and B (times in milliseconds)
PEs | Actual, Class A | Ideal, Class A | NPB2.2, Class A | Actual, Class B | Ideal, Class B | NPB2.2, Class B
1  | 30940 | 30940 | 20830 | 147220 | 147220 | N/A
2  | 16500 | 15470 | 25720 |  76100 |  73610 | 1446270
3  | 12160 | 10313 |   N/A |  55640 |  49073 | N/A
4  | 10600 |  7735 | 16870 |  46500 |  36805 | 103110
5  |  9200 |  6188 |   N/A |  41150 |  29444 | N/A
6  |  8150 |  5156 |   N/A |  36550 |  24536 | N/A
7  |  7600 |  4420 |   N/A |  33350 |  21031 | N/A
8  |  7080 |  3867 |  9720 |  30200 |  18402 | 42760
9  |  7470 |  3437 |   N/A |  31790 |  16357 | N/A
10 |  7040 |  3094 |   N/A |  30860 |  14722 | N/A
11 |  6820 |  2812 |   N/A |  28740 |  13383 | N/A
12 |  6750 |  2578 |   N/A |  27900 |  12268 | N/A
13 |  6690 |  2380 |   N/A |  27580 |  11324 | N/A
14 |  6320 |  2210 |   N/A |  26630 |  10515 | N/A
15 |  6000 |  2062 |   N/A |  25560 |   9814 | N/A
16 |  5780 |  1933 | 16680 |  25230 |   9201 | 62080
The NPB2.2 execution times for Class B with one and two processing elements are not shown in Figure 6-2 because they are too large for the scale of the figure. The reason for those long execution times is that NPB2.2 consumes much more memory than is physically available, which leads to heavy I/O traffic between the hard disk and the main memory of the workstations.
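To make the gap between ideal and actual times concrete (a derived illustration using only the numbers in Table 6-1): for Class A on 8 processing elements the ideal time is 30940/8 ≈ 3867 ms while the measured time is 7080 ms, giving a speedup of 30940/7080 ≈ 4.4 and a parallel efficiency of roughly 4.4/8 ≈ 55%; for Class B on 8 processing elements the speedup is 147220/30200 ≈ 4.9 (efficiency ≈ 61%). The larger problem amortizes the reduction and communication overhead somewhat better, which is consistent with the discussion in Section 6.1.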

Figure 6-1 Execution Time Comparison for Class A
Figure 6-2 Execution Time Comparison for Class B
6.3 Discussions
As seen in the preceding graphs, the following conclusions can be drawn about the performance of the parallel implementation of Kernel IS of the NAS Parallel Benchmarks. The largest performance gain comes from the first one or two additional processing

elements, that is, when the total number of processing elements is 2 or 3; using more processing elements brings smaller improvements. The more processing elements we used for the computation, the more overhead was created, and the gap between the ideal and the actual execution time increased, as we expected. The overall performance is acceptable because it is better than the performance of NPB2.2.

CHAPTER 7 RELATED WORK, CONCLUSIONS, AND FUTURE WORK
7.1 Related Work
7.1.1 Aids for Parallel Programming
Considerable research has been done to ease the tasks of designing and implementing efficient parallel programs. The related work can be categorized into program skeletons, program frameworks, and design patterns. The algorithmic skeleton was introduced by M. Cole as part of his proposed parallel programming system (language) for parallel machines.16 He presented four independent algorithmic skeletons: fixed-degree divide and conquer, iterative combination, clustering, and task queues. Each skeleton describes the structure of a particular style of algorithm as an abstraction. These skeletons capture very high-level patterns and can be used as an overall program structure. The user of the proposed system must choose one of these skeletons to describe a solution to a problem as an instance of the appropriate skeleton; because of this restriction, the skeletons cannot be applied to every problem. They are similar to the patterns in the algorithm structure design space of the parallel pattern language in the sense that both define an overall program structure and provide algorithmic frameworks, but they give an inexperienced parallel programmer less guidance about how to arrive at one of them than the parallel pattern language does. Program frameworks, which also address overall program structure but are more detailed and domain specific, are widely used in many areas of computing.17

In the area of parallel computing, Parallel Object-Oriented Methods and Applications (POOMA) and the Parallel Object-oriented Environment and Toolkit (POET) are examples of frameworks from Los Alamos and Sandia, respectively.18-19 POOMA is an object-oriented framework for data-parallel programming of scientific applications. It is a library of C++ classes designed to represent common abstractions in these applications; application programmers can use and derive from these classes to express the fundamental scientific content and/or numerical methods of their problem. The objects are layered to present a data-parallel programming interface at the highest abstraction layer, while the lower, implementation layers encapsulate the distribution and communication of data among processors. POET is a framework for encapsulating a set of user-written components. The POET framework is an object model implemented in C++; each user-written component must follow the POET template interface. POET provides services such as starting the parallel application, running components in a specified order, and distributing data among processors. More recently, Launay and Pazat developed a framework for parallel programming in Java.20 It provides a parallel programming model embedded in Java and is intended to separate computations from the control of, and synchronization between, parallel tasks. Doug Lea provided design principles for building concurrent applications using the Java concurrency facilities.21 Design patterns that can address many levels of design problems have been widely used in object-oriented sequential programming.

Recent work by Douglas C. Schmidt et al. addresses issues associated with concurrency and networking, but mostly at a fairly low level.22 Design patterns, frameworks, and skeletons share the same intent of easing parallel programming by providing libraries, classes, or design patterns. In comparison with the parallel pattern language, which provides a systematic set of design and implementation patterns for parallel program design and implementation, frameworks and skeletons may have more efficient implementations or better performance in their specialized problem domains. The parallel pattern language, however, can be more helpful to an inexperienced parallel programmer in designing and solving more general application problems, because of its systematic structure for exploiting concurrency and because it provides frameworks and libraries down to the implementation level.
7.1.2 Parallel Sorting
Parallel sorting is one of the most widely studied problems because of its importance in a wide variety of practical applications, and various parallel sorting algorithms have been proposed in the literature. Guy E. Blelloch et al. analyzed and evaluated many of them in order to implement as fast a general-purpose sorting algorithm as possible on the Connection Machine supercomputer model CM-2.23 After the evaluation and analysis, the researchers selected the three most promising alternatives for implementation: bitonic sort (a parallel merge sort), a parallel version of counting-based radix sort, and a theoretically efficient randomized algorithm, sample sort. In their experiments, sample sort was the fastest on large data sets. Andrea C. Dusseau et al. analyzed these three parallel sorting algorithms and column sort using the LogP model, which characterizes the performance of modern parallel machines with a small set of parameters: the communication latency, overhead, bandwidth, and the number of processors.24

They also compared the performance of Split-C implementations of the four sorting algorithms on the CM-5, a message-passing, distributed-memory, massively parallel machine. In their comparison, radix sort and sample sort were faster than the others on large data sets. To understand the performance of parallel sorting on hardware cache-coherent shared-address-space (CC-SAS) multiprocessors, H. Shan et al. investigated the performance of two parallel sorting algorithms (radix and sample sort) under three major programming models (load-store CC-SAS, message passing, and the segmented SHMEM model) on a 64-processor SGI Origin2000, a scalable, hardware-supported, cache-coherent, non-uniform-memory-access machine.25 They found that sample sort is generally better than radix sort up to 64K integers per processor, and that radix sort is better beyond that point. According to their investigation, the best combinations of algorithm and programming model are sample sort under CC-SAS for smaller data sets and radix sort under SHMEM for larger data sets. Communication is as fundamental to a parallel sorting problem as computation. Because of this characteristic and the importance of sorting in applications, parallel sorting has been selected as one of the kernel benchmarks for the performance evaluation of various parallel computing environments.26 The Kernel IS of the NAS Parallel Benchmark set has been implemented, and its performance reported, on various parallel supercomputers by their vendors.27-28
7.2 Conclusions
This thesis showed how the parallel design patterns were used to develop and implement a parallel algorithm for Kernel IS (parallel sort over small integers) of the NAS Parallel Benchmarks as a case study.

It also presented reusable frameworks and examples for implementing the patterns in the algorithm structure design space of the parallel pattern language. Chapter 6 showed the performance results for Kernel IS. As the results show, the parallel design patterns help in developing a parallel program with relative ease, and they are also helpful in achieving reasonable performance.
7.3 Future Work
The parallel pattern language is an ongoing project. Mapping the design patterns to various parallel computing environments and developing frameworks for object-oriented programming systems can be considered future research. Testing the resulting implementation of Kernel IS on various supercomputers, extending the algorithm to a full sort rather than ranking only, and carrying out more case studies of parallel application programs using the parallel pattern language can also be included in future work.

APPENDIX A KERNEL IS OF THE NAS PARALLEL BENCHMARK
A.1 Brief Statement of Problem
Sort N keys in parallel. The keys are generated by the sequential key generation algorithm given below and initially must be uniformly distributed in memory. The initial distribution of the keys can have a great impact on the performance of this benchmark, and the required distribution is discussed in detail below.
A.2 Definitions
A sequence of keys, {K_i | i = 0, 1, …, N−1}, will be said to be sorted if it is arranged in non-descending order, i.e., K_i ≤ K_{i+1} ≤ K_{i+2} ≤ …. The rank of a particular key in a sequence is the index value i that the key would have if the sequence of keys were sorted. Ranking, then, is the process of arriving at a rank for all the keys in a sequence, and sorting is the process of permuting the keys in a sequence to produce a sorted sequence. If an initially unsorted sequence K_0, K_1, …, K_{N−1} has ranks r(0), r(1), …, r(N−1), the sequence becomes sorted when it is rearranged in the order K_{r(0)}, K_{r(1)}, …, K_{r(N−1)}. Sorting is said to be stable if equal keys retain their original relative order; in other words, a sort is stable only if r(i) < r(j) whenever K_{r(i)} = K_{r(j)} and i < j. Stable sorting is not required for this benchmark.
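As a small illustration of these definitions (an example added for clarity, not part of the benchmark text): for the five-key sequence K_0, …, K_4 = 15, 3, 27, 3, 8, the sorted order is 3, 3, 8, 15, 27, so the ranks are r(0) = 3, r(1) = 0, r(2) = 4, r(3) = 1, r(4) = 2. Because the two keys equal to 3 may legitimately receive ranks 0 and 1 in either order under an unstable ranking algorithm, the partial verification test in section A.8 checks only the ranks of keys that are unique in the sequence.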

A.3 Memory Mapping
The benchmark requires ranking an unsorted sequence of N keys. The initial sequence of keys will be generated in an unambiguous sequential manner as described below. This sequence must be mapped into the memory of the parallel processor in one of the following ways, depending on the type of memory system. In all cases, one key will map to one word of memory, and the word size must be no less than 32 bits. Once the keys are loaded into the memory system, they are not to be removed or modified except as required by the procedure described in the Procedure subsection.
A.4 Shared Global Memory
All N keys initially must be stored in a contiguous address space. If A_i is used to denote the address of the i-th word of memory, then the address space must be [A_i, A_{i+N−1}]. The sequence of keys K_0, K_1, …, K_{N−1} initially must map to this address space as
MEM(K_j) = A_{i+j} for j = 0, 1, …, N−1,  (A.1)
where MEM(K_j) refers to the address of K_j.
A.5 Distributed Memory
In a distributed memory system with p distinct memory units, each memory unit initially must store N_p keys in a contiguous address space, where
N_p = N / p.  (A.2)
If A_i is used to denote the address of the i-th word in a memory unit, and if P_j is used to denote the j-th memory unit, then P_j A_i denotes the address of the i-th word in the j-th memory unit. Some initial addressing (or ordering) of the memory units must be assumed and adhered to throughout the benchmark; note that the addressing of the memory units is left completely arbitrary.

If N is not evenly divisible by p, then memory units {P_j | j = 0, 1, …, p−2} will store N_p keys and memory unit P_{p−1} will store N_{pp} keys, where now
N_p = ⌊N/p + 0.5⌋ and N_{pp} = N − (p−1)·N_p.
In some cases (in particular if p is large) this mapping may result in a poor initial load balance, with N_{pp} differing greatly from N_p. In such cases it may be desirable to use p′ memory units to store the keys, where p′ < p. This is allowed, but the storage of the keys must still follow either equation A.2 or the unequal distribution given above, with p′ replacing p. In the following we assume that N is evenly divisible by p. The address space in an individual memory unit must be [A_i, A_{i+N_p−1}]. If the memory units are individually hierarchical, then the N_p keys must be stored in a contiguous address space belonging to a single memory hierarchy, and A_i then denotes the address of the i-th word in that hierarchy. The keys cannot be distributed among different memory hierarchies until after timing begins. The sequence of keys K_0, K_1, …, K_{N−1} initially must map to this distributed memory as
P_k A_{i+j} = MEM(K_{k·N_p + j}) for j = 0, 1, …, N_p − 1 and k = 0, 1, …, p − 1,
where MEM(K_{k·N_p + j}) refers to the address of K_{k·N_p + j}. If N is not evenly divisible by p, then the mapping given above must be modified for the case k = p − 1 as
P_{p−1} A_{i+j} = MEM(K_{(p−1)·N_p + j}) for j = 0, 1, …, N_{pp} − 1.  (A.3)
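As a concrete check of these formulas (an illustration, not part of the benchmark text): for Class A, N = 2^23 = 8,388,608. With p = 3 memory units, N/p ≈ 2,796,202.67, so N_p = ⌊N/p + 0.5⌋ = 2,796,203 and N_{pp} = N − 2·N_p = 2,796,202, which is nearly balanced. For much larger p the accumulated rounding can leave memory unit P_{p−1} with a key count far from N_p, which is why the specification allows storing the keys in only p′ < p memory units.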

A.6 Hierarchical Memory
All N keys initially must be stored in an address space belonging to a single memory hierarchy, which will here be referred to as the main memory. Note that any memory in the hierarchy which can store all N keys may be used for the initial storage of the keys, and the use of the term main memory in the description of this benchmark should not be confused with the more general definition of this term in section 2.2.1 of the NAS Parallel Benchmarks document. The keys cannot be distributed among different memory hierarchies until after timing begins. The mapping of the keys to the main memory must follow either the shared global memory or the distributed memory mapping described above. The benchmark requires computing the rank of each key in the sequence. The mappings described above define the initial ordering of the keys. For shared global and hierarchical memory systems, the same mapping must be applied to determine the correct ranking. For a distributed memory system, the mapping of keys to memory at the end of the ranking may differ from the initial mapping only in the following manner: the number of keys mapped to a memory unit at the end of the ranking may differ from the initial value N_p. It is expected, in a distributed memory machine, that good load balancing of the problem will require changing the initial mapping of the keys, and for this reason a different mapping may be used at the end of the ranking. If N_{p_k} is the number of keys in memory unit P_k at the end of the ranking, then the mapping used to determine the correct ranking is given by
P_k A_{i+j} = MEM(r(k·N_{p_k} + j)) for j = 0, 1, …, N_{p_k} − 1 and k = 0, 1, …, p − 1,
where r(k·N_{p_k} + j) refers to the rank of key K_{k·N_{p_k} + j}. Note, however, that this does not imply that the keys, once loaded into memory, may be moved: copies of the keys may be made and moved, but the original sequence must remain intact, so that each time the ranking process is repeated (step 4 of the Procedure) the original sequence of keys exists (except for the two modifications of step 4a) and the same ranking algorithm is applied.

Specifically, knowledge obtainable from the communication pattern carried out in the first ranking cannot be used to speed up subsequent rankings, and each iteration of step 4 should be completely independent of the previous iteration.
A.7 Key Generation Algorithm
The algorithm for generating the keys makes use of the pseudorandom number generator described in Appendix B. The keys will be in the range [0, B_max). Let r_f be a random fraction uniformly distributed in the range 0 < r_f < 1, and let K_i be the i-th key. The value of K_i is determined as
K_i = ⌊B_max (r_{4i} + r_{4i+1} + r_{4i+2} + r_{4i+3}) / 4⌋ for i = 0, 1, …, N−1.  (A.4)
Note that K_i must be an integer and ⌊ ⌋ indicates truncation. Four consecutive pseudorandom numbers from the pseudorandom number generator must be used for generating each key. All operations before the truncation must be performed in 64-bit double precision. The random number generator must be initialized with s = 314159265 as a starting seed.
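As a quick numeric illustration with made-up fractions (not values taken from the generator): with B_max = 2^19 = 524,288 and the four consecutive fractions 0.12, 0.47, 0.58, 0.83, the sum is 2.00 and K_i = ⌊524288 × 2.00 / 4⌋ = 262,144. Because each key is the truncated average of four uniform fractions scaled by B_max, the keys cluster around B_max/2 rather than being uniformly spread over [0, B_max).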

A.8 Partial Verification Test
Partial verification is conducted for each ranking performed. It consists of comparing a particular subset of ranks with reference values; the subset of ranks and the reference values are given in Table A-1. Note that the subset of ranks is selected to be invariant to the ranking algorithm (recall that stability is not required in the benchmark). This is accomplished by selecting for verification only the ranks of unique keys: if a key is unique in the sequence (i.e., there is no other equal key), then it has a unique rank despite an unstable ranking algorithm. The memory mapping described in the Memory Mapping subsection must be applied.
Table A-1: Values to be used for partial verification
Full scale: r(2112377) = 104, r(662041) = 17523, r(5336171) = 123928, r(3642833) = 8288932, r(4250760) = 8388264
Sample code: r(48427) = 0, r(17148) = 18, r(23627) = 346, r(62548) = 64917, r(4431) = 65463
A.9 Full Verification Test
Full verification is conducted after the last ranking is performed. Full verification requires the following:
1. Rearrange the sequence of keys, {K_i | i = 0, 1, …, N−1}, in the order {K_j | j = r(0), r(1), …, r(N−1)}, where r(0), r(1), …, r(N−1) is the last computed sequence of ranks.
2. For every K_i, for i = 0, …, N−2, test that K_i ≤ K_{i+1}.
If the result of this test is true, then the keys are in sorted order. The memory mapping described in the Memory Mapping subsection must be applied.
A.10 Procedure
1. In a scalar sequential manner, and using the key generation algorithm described above, generate the sequence of N keys.
2. Using the appropriate memory mapping described above, load the N keys into the memory system.
3. Begin timing.
4. Do, for i = 1 to I_max:
(a) Modify the sequence of keys by making the following two changes: K_i ← i and K_{i+I_max} ← (B_max − i).

(b) Compute the rank of each key.
(c) Perform the partial verification test described above.
5. End timing.
6. Perform the full verification test described above.
Table A-2: Parameter values to be used for the benchmark
Parameter | Class A | Class B
N | 2^23 | 2^25
B_max | 2^19 | 2^21
seed | 314159265 | 314159265
I_max | 10 | 10
A.11 Specifications
The specifications given in Table A-2 shall be used in the benchmark. Two sets of values are given, one for Class A and one for Class B.
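Relating these parameters to the implementation in Appendix C: for Class A, N = 2^23 = 8,388,608 keys, which is the value of the NUM_KEYS constant in the source code, and B_max = 2^19 = 524,288, which is the value of MAX_KEY, so each key fits comfortably in a 32-bit word as the memory mapping requires.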

APPENDIX B A PSEUDORANDOM NUMBER GENERATOR FOR THE PARALLEL NAS KERNELS
B.1 Pseudorandom Number Generator
Suppose that n uniform pseudorandom numbers are to be generated. Set a = 5^13 and let x_0 = s be a specified initial "seed," i.e., an integer in the range 0 < s < 2^46. Generate the integers x_k for 1 ≤ k ≤ n using the linear congruential recursion
x_{k+1} = a·x_k (mod 2^46)
and return r_k = 2^-46·x_k as the result. Thus 0 < r_k < 1, and the r_k are very nearly uniformly distributed on the unit interval. See [2], beginning on page 9, for further discussion of this type of pseudorandom number generator. Note that any particular value x_k of the sequence can be computed directly from the initial seed s by using the binary algorithm for exponentiation, taking remainders modulo 2^46 after each multiplication. To be specific, let m be the smallest integer such that 2^m > k, set b = s and t = a, and repeat the following for i from 1 to m:
j ← ⌊k/2⌋
if 2j ≠ k then b ← b·t (mod 2^46)
t ← t^2 (mod 2^46)
k ← j
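As a short worked trace of this loop (added for illustration): for k = 13 the smallest m with 2^m > k is 4, and starting from b = s, t = a the four passes give: k = 13 (odd), so b = s·a, t = a^2, k = 6; k = 6 (even), so b is unchanged, t = a^4, k = 3; k = 3 (odd), so b = s·a^5, t = a^8, k = 1; k = 1 (odd), so b = s·a^13, t = a^16, k = 0. The loop therefore ends with b = a^13·s (mod 2^46) = x_13, as stated next.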

The final value of b is x_k = a^k·s (mod 2^46). See [2] for further discussion of the binary algorithm for exponentiation. The operation of multiplying two large integers modulo 2^46 can be implemented using 64-bit floating point arithmetic by splitting the arguments into two words with 23 bits each. To be specific, suppose one wishes to compute c = a·b (mod 2^46). Then perform the following steps, where int denotes the greatest-integer (truncation) function:
a1 = int(2^-23·a)
a2 = a − 2^23·a1
b1 = int(2^-23·b)
b2 = b − 2^23·b1
t1 = a1·b2 + a2·b1
t2 = int(2^-23·t1)
t3 = t1 − 2^23·t2
t4 = 2^23·t3 + a2·b2
t5 = int(2^-46·t4)
c = t4 − 2^46·t5
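The splitting scheme above translates directly into 64-bit floating point code. The sketch below is an illustration written for this summary, not the benchmark's reference implementation; the function names mul46 and seed_at, and the use of x_1000 as an example, are our own choices.

#include <math.h>
#include <stdio.h>

#define R23 (1.0 / 8388608.0)          /* 2^-23 */
#define T23 (8388608.0)                /* 2^23  */
#define R46 (R23 * R23)                /* 2^-46 */
#define T46 (T23 * T23)                /* 2^46  */

/* c = a*b (mod 2^46), for a, b nonnegative whole numbers below 2^46,
   using the 23-bit splitting scheme described above. */
static double mul46(double a, double b)
{
    double a1 = floor(R23 * a), a2 = a - T23 * a1;
    double b1 = floor(R23 * b), b2 = b - T23 * b1;
    double t1 = a1 * b2 + a2 * b1;
    double t2 = floor(R23 * t1);
    double t3 = t1 - T23 * t2;
    double t4 = T23 * t3 + a2 * b2;
    double t5 = floor(R46 * t4);
    return t4 - T46 * t5;
}

/* x_k = a^k * s (mod 2^46) via the binary algorithm for exponentiation. */
static double seed_at(double s, double a, long k)
{
    double b = s, t = a;
    while (k > 0) {
        long j = k / 2;
        if (2 * j != k)            /* k is odd: multiply the result in */
            b = mul46(b, t);
        t = mul46(t, t);           /* square the base */
        k = j;
    }
    return b;
}

int main(void)
{
    double a = 1220703125.0;       /* 5^13, the NAS multiplier */
    double s = 314159265.0;        /* the required starting seed */
    /* the 1000th element of the sequence, usable as the starting seed of an
       independent segment on another processor */
    printf("x_1000 = %.0f\n", seed_at(s, a, 1000));
    return 0;
}

All intermediate quantities in mul46 are nonnegative whole numbers below 2^47 < 2^53 (each 23-bit piece is below 2^23, so each product is below 2^46 and each sum below 2^47), so they are held exactly in the 53-bit significand of an IEEE-754 double, and multiplying by the exact power-of-two constants 2^-23 and 2^-46 only changes the exponent; this is precisely what the portability requirements listed next rely on.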

An implementation of the complete pseudorandom number generator algorithm using this scheme produces the same sequence of results on any system that satisfies the following requirements:
The input multiplier a and the initial seed s, as well as the constants 2^23, 2^-23, 2^46 and 2^-46, can be represented exactly as 64-bit floating point constants.
The truncation of a nonnegative 64-bit floating point value less than 2^24 is exact.
The addition, subtraction and multiplication of 64-bit floating point values, where the arguments and results are nonnegative whole numbers less than 2^47, produce exact results.
The multiplication of a 64-bit floating point value that is a nonnegative whole number less than 2^47 by the 64-bit floating point value 2^-m, 0 ≤ m ≤ 46, produces an exact result.
These requirements are met by virtually all scientific computers in use today. Any system based on the IEEE-754 floating point arithmetic standard [1] easily meets them using double precision. It should be noted, however, that obtaining an exact power-of-two constant on some systems may require a loop rather than merely an assignment statement.
B.2 Other Features
The period of this pseudorandom number generator is 2^44 ≈ 1.76 × 10^13, and it passes all reasonable statistical tests. The calculation can be vectorized on vector computers by generating results in batches of size equal to the hardware vector length. By using the scheme described above for computing x_k directly, the starting seed of a particular segment of the sequence can be quickly and independently determined, so that numerous separate segments can be generated on separate processors of a multiprocessor system.

Once the IEEE-754 floating point arithmetic standard gains universal acceptance among scientific computers, the radix 2^46 can be safely increased to 2^52, although the scheme described above for multiplying two such numbers must be changed correspondingly. This will increase the period of the pseudorandom sequence by a factor of 64, to approximately 1.1 × 10^15.

APPENDIX C SOURCE CODE OF THE KERNEL IS IMPLEMENTATION

/* This program is an implementation of Kernel IS (sorting over small integers)
   of the NAS Parallel Benchmark set. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>

#define MAX_KEY 524288          /* maximum value of a key */
#define NUM_KEYS 8388608        /* number of keys to be sorted */
#define TEST_ARRAY_SIZE 5

int partialVerifyVals[TEST_ARRAY_SIZE];
int testIndexArray[TEST_ARRAY_SIZE] = {2112377, 662041, 5336171, 3642833, 4250760};
int testRankArray[TEST_ARRAY_SIZE] = {104, 17523, 123928, 8288932, 8388264};
int passedVerification = 0;
/* the number of keys each processing element has */
int numOfLocalKeys;
/* rank of each key */
int* countGlobal;
/* rank of each key in the local key subset */
int count[MAX_KEY+1];
/* the subset of keys that each process has */
int* localKeyArray;

/* Sorted subset of keys before redistribution */
int* tempResultArray;
/* Final sorted keys */
int* finalResultArray;
/* rank of this process */
int myRank;
/* total number of processes */
int mySize;

/*************************************/
/*              RANKING              */
/*************************************/
void ranking(){
  long i;
  long j;
  for(i=0;i
   2. Test whether the keys are in sorted order.
 *******************************/
void full_verify(){
  int i, j;
  int temp;
  /* The number of keys that each process will have at the end */
  int lastNumOfKeysLocal;
  /* max key value in each process before key redistribution */
  int maxValueInPE[mySize];
  /* number of keys after key redistribution */
  int newNumOfKeysLocal[mySize];
  /* number of keys that will be sent to each process */
  int numOfKeysRedis[mySize];
  /* number of keys that each process will receive from other processes and itself */
  int numOfKeysToRecv[mySize];
  /* Entry i specifies the displacement relative to the send buffer from which
     to take the outgoing data destined for process i */
  int senddispls[mySize];
  /* Entry j specifies the displacement relative to the receive buffer at which
     to place the incoming data from process j */
  int recvdispls[mySize];
  int arrayTemp[mySize];
  int *lastArray;
  int accum[mySize];

  MPI_Bcast(countGlobal,MAX_KEY+1,MPI_INT,0,MPI_COMM_WORLD);
  /* Sort keys locally in each process */
  for(j=numOfLocalKeys-1;j>=0;j--){

    tempResultArray[count[localKeyArray[j]]-1]=localKeyArray[j];
    count[localKeyArray[j]]=count[localKeyArray[j]]-1;
  }

  /* KEY REDISTRIBUTION */
  /* find the maximum key value and the number of keys that each process
     will have after key redistribution */
  if(myRank == 0){
    i=0;
    int tempN = 0;
    for(j=1;j= numOfLocalKeys+tempN){
        maxValueInPE[j-1]=i;  /* maxValueInPE */
        newNumOfKeysLocal[j-1] = countGlobal[i-1]-tempN;
        tempN = tempN + newNumOfKeysLocal[j-1];
        break;
      }
    }
  }
  maxValueInPE[j-1]= MAX_KEY;
  newNumOfKeysLocal[j-1]=NUM_KEYS - tempN;
  tempN=0;
  for(i=0;i
  lastNumOfKeysLocal= newNumOfKeysLocal[myRank];
  lastArray = (int *) malloc(lastNumOfKeysLocal*sizeof(int));
  /* compute the number of keys to send to each process and the displacement
     (starting point in the send buffer for each process) */
  i=0;
  temp=0;
  for(i;i
  for(i=0;i=0;j--){
    lastArray[countGlobal[finalResultArray[j]] - accum[myRank]] = finalResultArray[j];
    countGlobal[finalResultArray[j]] = countGlobal[finalResultArray[j]]-1;
  }
  /* Test whether the keys are in sorted order */
  j = 0;
  for( i=1; i lastArray[i] ){
      j++;
  }
  if( j != 0 ){
    printf( "Full_verify: number of keys out of sort: %d\n", j );
  }
  else
    printf("Full verification was successful\n");
}

/*************************************/
/*     Partial Verification Test     */
/*************************************/
partialVerify(int iteration,int* countGlobal){
  int i=0;
  int k=0;
  int j=0;
  passedVerification=0;
  for( i=0; i
101 printf( "Failed partial verification: "iteration %d, test key %d ,\n", iteration, i ); printf("%d %d\n", countGlobal[k-1],testRankA rray[i]+iteration-1); } else passedVerification++; } else{ if( countGlobal[k-1] != te stRankArray[i] -iteration+1){ printf( "Failed partial verification: "iteration %d, test key %d\n", iteration, i ); printf("%d %d\n", countGlobal[k-1],testRankA rray[i]-iteration+1); } else passedVerification++; } if(passe dVerification == 5){ printf("Partial verifica tion was successful\n"); } } } } /**************************************************/ /* Genterate uniform ly distributed random number */ /**************************************************/ double randlc(X, A) double *X; double *A; { static int KS=0; static double R23, R46, T23, T46; double T1, T2, T3, T4; double A1;

  double A2;
  double X1;
  double X2;
  double Z;
  int i, j;

  if (KS == 0){
    R23 = 1.0;
    R46 = 1.0;
    T23 = 1.0;
    T46 = 1.0;
    for (i=1; i<=23; i++){
      R23 = 0.50 * R23;
      T23 = 2.0 * T23;
    }
    for (i=1; i<=46; i++){
      R46 = 0.50 * R46;
      T46 = 2.0 * T46;
    }
    KS = 1;
  }
  T1 = R23 * *A;
  j = T1;
  A1 = j;
  A2 = *A - T23 * A1;
  T1 = R23 * *X;
  j = T1;
  X1 = j;
  X2 = *X - T23 * X1;
  T1 = A1 * X2 + A2 * X1;
  j = R23 * T1;
  T2 = j;

  Z = T1 - T23 * T2;
  T3 = T23 * Z + A2 * X2;
  j = R46 * T3;
  T4 = j;
  *X = T3 - T46 * T4;
  return(R46 * *X);
}

/****************************************/
/* Generate a sequence of integer keys  */
/****************************************/
void create_seq( double seed, double a, int* key_array ){
  double x;
  int i, j, k;
  k = MAX_KEY/4;
  for (i=0; i< NUM_KEYS; i++){
    x = randlc(&seed, &a);
    x += randlc(&seed, &a);
    x += randlc(&seed, &a);
    x += randlc(&seed, &a);
    key_array[i] = k*x;
  }
}

/*************************************/
/*               MAIN                */
/*************************************/
void main(int argc, char* argv[]){
  int i = 0;
  int j = 0;
  int* key_array;
  int* resultArray;   /* sorted integers */
  clock_t start;
  clock_t end;

  double timecounter;
  double maxtime;
  int numOfKeyLastNode;
  MPI_Status status;
  MPI_Request request;

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&myRank);
  MPI_Comm_size(MPI_COMM_WORLD,&mySize);

  /* compute the number of keys that each process will have
     according to the memory mapping of Kernel IS */
  numOfLocalKeys=(int)((double)(NUM_KEYS/mySize) + 0.5);
  numOfKeyLastNode = NUM_KEYS-(numOfLocalKeys*(mySize-1));
  if(myRank == mySize-1){
    numOfLocalKeys = numOfKeyLastNode;
  }
  countGlobal= (int *) malloc((MAX_KEY+1)*sizeof(int));
  localKeyArray =(int *) malloc(numOfLocalKeys*sizeof(int));
  tempResultArray = (int *) malloc(numOfLocalKeys*sizeof(int));

  /* Process 0 (the root process) will distribute keys to each process */
  if(myRank==0){
    int i = 0;
    int j = 0;
    key_array = (int *) malloc(NUM_KEYS*sizeof(int));
    resultArray= (int *) malloc(NUM_KEYS*sizeof(int));
    /**** Generate the keys to be sorted ****/
    printf("Generating Random Numbers...\n");
    create_seq( 314159265.00,      /* Random number gen seed */
                1220703125.00, key_array );
    printf("Created Sequence\n");
  }

  /*** SCATTER *******************************************/
  MPI_Scatter(key_array,numOfLocalKeys,MPI_INT,localKeyArray,
              numOfLocalKeys,MPI_INT,0,MPI_COMM_WORLD);
  /*********************************************************/

  /* Synchronize all processes before timing */
  MPI_Barrier(MPI_COMM_WORLD);
  start=clock();

  for(i=1;i<11;i++){
    /* Modify the sequence of keys */
    if(myRank==0){
      localKeyArray[i] = i;
      localKeyArray[i+10] = MAX_KEY - i;
      for( j=0; j< TEST_ARRAY_SIZE; j++ )
        partialVerifyVals[j] = key_array[testIndexArray[j]];
    }
    /* Compute the rank of each key */
    ranking();
    /*** REDUCE **************************************************/
    MPI_Reduce(count,countGlobal,MAX_KEY+1,MPI_INT,MPI_SUM,
               0,MPI_COMM_WORLD);
    /*** REDUCE **************************************************/
    /* Partial Verification */
    if(myRank==0)
      partialVerify(i,countGlobal);
  }
  MPI_Barrier(MPI_COMM_WORLD);

  /* End timing */
  end = clock();
  timecounter = ((double)(end - start)) * 1000/CLOCKS_PER_SEC;
  MPI_Reduce( &timecounter, &maxtime, 1, MPI_DOUBLE, MPI_MAX, 0,
              MPI_COMM_WORLD );
  if(myRank==0){
    printf("start:%ld\n",start);
    printf("end:%ld\n",end);
    printf("Sorting process took %f milliseconds to execute\n",maxtime);
  }
  /* Full verification */
  full_verify();
  MPI_Finalize();
}

APPENDIX D SOURCE CODE OF PIPELINE EXAMPLE PROBLEM

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <mpi.h>

int numOfCalElem = 20;
char startMsg[7];
int myRank;
int mySize;
int tag = 1;
char* sendBuf, recvBuf;
double sendBufSize, recvBufSize;
int numOfStudent = 1000;
MPI_Status status;

/* Only four pipeline stages are implemented in this Example Implementation
   Code. More stages can be added or removed, according to the design of
   the parallel program */

/***********************/
/* First Pipeline Stage */
/***********************/
void firstPipelineStage(MPI_Comm myComm)
{
  /*=IMPLEMENTATION1==============================*/
  /*=                                            =*/
  /*= code for this Pipeline Stage should be implemented here =*/

108 double scores[numOfStudent]; int i=0; double sum =0; double av erage = 0; for(i=0;i
109 /*=========================================*/ myRank+1,tag ,MPI_COMM_WORLD); } /********************/ /* Pipeline Stage 2 */ /********************/ void pipelineStage2(MPI_Comm myComm) { double scores[numOfStudent]; double average=0; /* Receive message from the previous Pipeline Stage of process with myRank-1 */ MPI_Recv(startMsg,strlen(startMsg),MPI_CHAR,myRank-1, tag, MPI_COMM_WORLD,&status); /*=DATA TRANSFER 2================================*/ /*= More receive functions, whic h has same structure, can be add ed =*/ /*= to transfer data =*/ MPI_Recv( /*= Modify the following paramete rs =*/ scores, /*=
110 /*= =*/ /*= code for this Pipeli ne Stage should be implemented here =*/ int i=0; for(i=0;i
111 /*= More receive functions which has same structure, can be added =*/ /*= to transfer data =*/ MPI_Recv( /*= Modify the following para meters =*/ scores, /*=
112 /* Last Pipeline Stage */ /***********************/ void lastPipelineStage(MPI_Comm myComm) { double scores[numOfStudent]; double stdDeviation; /* Receive messag e from previous Pipeline Stage of process with myRank-1 */ MPI_Recv(startMsg,strlen(startMsg),MPI_CHAR,myRank-1, tag, MPI_COMM_WORLD,&status); /*=DATA TRANSFER 2==================================*/ /*= More receive functions, which ha s same structure, can be added =*/ /*= to transfer data =*/ MPI_Recv( /*= Modify the following pa rameters =*/ scores, /*=
113 } void main(argc,argv ) int argc; char **argv; { int i = 0; MPI_Comm myComm; MPI_Init(&argc, &argv); MPI_Comm_dup(MPI_COMM_WORLD, &myComm); /*find rank of this process*/ MPI_Comm_rank(myComm, &myRank); /*find out rank of last process by usin g size of rank*/ MPI_Comm_size(myComm,&mySize); strcpy(startMsg,"start"); switch(myRank) { case 0 : for(i=0 ;i
114 for(i=0 ;i
APPENDIX E SOURCE CODE OF DIVIDE AND CONQUER EXAMPLE PROBLEM

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define DATA_TYPE int

int numOfProc;        /* number of available processes */
int my_rank;
int ctrlMsgSend;
int ctrlMsgRecv;
int* localData;
int mergedInt[200];
int dataSizeSend;
int dataSizeRecv;
int maxInt = 200;
int dataSizeLeft;
int numOfIntToSort = 200;
int* integer;
int* count;
int* temp;

/********************/
/*  Solve a problem */
/********************/
void solve(int numOfProcLeft)
{

116 if(baseCase(numOfProcLeft)) { baseSolve(numOfProcLeft); merge(numOfProcLeft); } else { split(numOfProcLeft); if(numOfProcLeft!=numOfProc) { merge(numOfProcLeft); } } } /*****************************************************/ /* split a problem into two subproblems */ /*****************************************************/ int split(int numOfProcLeft) { /*=IMPLEMENTATION2 ============================*/ /*= Code for splitting a problem into two subproblem s =*/ /*= should be implemente d =*/ dataSizeSend = dataSizeRecv/2; dataSizeLeft = dataSi zeRecv-dataSizeSend; dataSizeRecv = dataSizeLeft; ctrlMsgSend = numOfProcLeft/2; /*= =*/ /*==========================================*/ /* invoke a solve f unction at the remo te process */ MPI_Send(&ctrlM sgSend,1,MPI_INT, my_rank+numOfP roc/numOfProcLeft, 0,MPI_COMM_WORLD); /*=DATA TRANSFER 1 ==========================*/

117 /*= More of this bloc k can be added on needed ba sis =*/ MPI_Send(&dataSizeSend,1,MPI_INT, my_rank+numOfP roc/numOfProcLeft, 0, MPI_COMM_WORLD); MPI_Send( &localData[dataSizeLeft], /*= numOfP roc/(numOfProcLeft*2)) { ctrlMsgSend = 1; dataSizeSend = dataSizeLeft; /* Send a subsolution to the process from which this pro cess got the subproblem*/ MPI_Send(& ctrlMsgSend,1,MPI_INT, my_rank numOfProc/(numOfProcLeft*2), 0,MPI_COMM_WORLD); /*=DATA TRANSFER 3=========================*/ /*= More of th is block can be added on needed basis =*/

118 MPI_Send(&da taSizeSend,1,MPI_INT, my_rank nu mOfProc/(numOfProcLeft*2), 0,MPI_COMM_WORLD); MPI_Send( localData, /* <-modify address of data */ dataSizeSend, MPI_INT, /*<-modify data type */ my_rank numOfProc/(numOfProcLeft*2), 0,MPI_COMM_WORLD); /*= =*/ /*=======================================*/ } else { MPI_Status status; int i1; int i2; int iResult; int t; /* Receive a subsolution from the process which was invoked by this process */ MPI_Recv(&ctrlMsgRecv,1,MPI_INT, my_rank+nu mOfProc/(numOfProcLeft*2), 0,MPI_COMM_WORLD,&status); /*=DATA TRANSFER 3===========================*/ /*= More of this block can be added on needed basis =*/ MPI_Recv(&dataSizeRecv,1,MPI_INT, my_rank+nu mOfProc/(numOfProcLeft*2), 0,MPI_COMM_WORLD,&status); MPI_Recv( &localData[dataSizeLeft], /*
119 dataSizeRecv, MPI_INT, /*dataSizeLeft-1){ for(t=i2;t<=dataSizeLeft+dataSizeRecv-1;t++){ mergedInt[iResult+t-i2] = localData[t]; } } else{ for(t=i1;t
120 for(t=0;t
121 count[localData[j]] = count[localData[j]]+1; } for(i=1;i=0;j--){ temp[count[localData[j]]]=localData[j]; count[localData[j]] =count[localData[j]]-1; } for(i=0;i
122 localData[i]= ((int)rand())%maxInt; } /* dataSizeS end = numOfIntToSort; */ dataSizeRecv = numOfIntToSort; ctrlMsgRecv = numOfProc; dataSizeLeft = dataSizeRecv; solve(ctrlMsgRecv); } else{ /* Every process waits a message before calling so lve function */ MPI_Recv(&ctrl MsgRecv,1,MPI_INT,MPI_A NY_SOURCE,MPI_ANY_TAG, MPI_COMM_WORLD,&status); /*=DATA TRANSFER 2=============================*/ /*= More of this bloc k can be added on needed basis =*/ MPI_Recv(&dataSizeRe cv,1,MPI_INT,MPI_ANY_SOURCE, MPI_ANY_TAG,MPI_COMM_WORLD,&status); localData = (int *)mallo c(dataSizeRecv*sizeof(int)); MPI_Recv( localData, /*
123 printf(" %d",localData[i]); } } MPI_Finalize(); }

LIST OF REFERENCES
1. Massingill BL, Mattson TG, Sanders BA. Patterns for Parallel Application Programs. Proceedings of the Sixth Pattern Languages of Programs Workshop (PLoP 1999). http://jerry.cs.uiuc.edu/~plop/plop99/proceedings/massingill/massingill.pdf Accessed August 2002.
2. Massingill BL, Mattson TG, Sanders BA. A Pattern Language for Parallel Application Programs. Proceedings of the Sixth International Euro-Par Conference (Euro-Par 2000). Springer, Heidelberg Germany 2000; pages 678-681.
3. Massingill BL, Mattson TG, Sanders BA. Patterns for Finding Concurrency for Parallel Application Programs. Proceedings of the Eighth Pattern Languages of Programs Workshop (PLoP 2000). Washington University Technical Report Number WUCS-00-29.
4. Massingill BL, Mattson TG, Sanders BA. More Patterns for Parallel Application Programs. Proceedings of the Eighth Pattern Languages of Programs Workshop (PLoP 2001). http://jerry.cs.uiuc.edu/~plop/plop2001/accepted_submissions/PLoP2001/bmassingill0/PLoP2001_bmassingill0_1.pdf Accessed August 2002.
5. Massingill BL, Mattson TG, Sanders BA. Parallel Programming with a Pattern Language. International Journal on Software Tools for Technology Transfer 2001; volume 3 issue 2 pages 217-234.
6. Coplien JO, Schmidt DC, editors. Pattern Languages of Program Design. Addison-Wesley, Reading MA 1995.
7. Gamma E, Helm R, Johnson R, Vlissides J. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Reading MA 1995.
8. Message Passing Interface Forum. A Message Passing Interface Standard. http://www.mpi-forum.org/docs/docs.html Accessed August 2002.
9. Bailey D, Barszcz E, Barton J, Browning D, Carter R, Dagum L, Fatoohi R, Fineberg S, Frederickson P, Lasinski T, Schreiber R, Simon H, Venkatakrishnan V, Weeratunga S. The NAS Parallel Benchmarks. RNR Technical Report RNR-94-007, 1994.

10. Massingill BL, Mattson TG, Sanders BA. A Pattern Language for Parallel Application Programs. http://www.cise.ufl.edu/research/ParallelPatterns/PatternLanguage/Background/PDSE99_long.htm Accessed August 2002.
11. Knuth DE. The Art of Computer Programming: volume 2. Addison-Wesley, Reading MA 1981.
12. IEEE Standard for Binary Floating Point Numbers. ANSI/IEEE Standard 754-1985. IEEE, New York 1985.
13. Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms. The MIT Press, Cambridge MA 1998.
14. Squyer M, Lumsdaine A, George WL, Hagedorn JG, Devaney JE. The Interoperable Message Passing Interface (IMPI) Extensions to LAM/MPI. MPI Developer's Conference (MPIDC), Ithaca NY 2000.
15. Saphir W, Wijngaart RV, Woo A, Yarrow M. New Implementations and Results for the NAS Benchmark 2. 8th SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis MN 1997.
16. Cole M. Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press, Cambridge MA 1989.
17. Coplien JO, Schmidt DC, editors. Pattern Languages of Program Design. Addison-Wesley, Reading MA 1995.
18. Reynders JV, Hinker PJ, Cummings JC, Atlas SU, Banerjee S, Humphrey W, Karmesin SR, Keahey K, Srikant M, Tholburn M. POOMA: A Framework for Scientific Simulation on Parallel Architectures. SuperComputing, San Diego CA 1995.
19. Armstrong R. An Illustrative Example of Frameworks for Parallel Computing. Proceedings of Parallel Object Oriented Methods and Applications (POOMA). http://www.acl.lanl.gov/Pooma96/abstracts/rob-armstrong/pooma96.htm Accessed August 2002.
20. Launay P, Pazat JL. A Framework for Parallel Programming in Java. Proceedings of High-Performance Computing and Networking (HPCN Europe). Springer, Heidelberg Germany 1998; pages 628-637.
21. Lea D. Concurrent Programming in Java. Addison-Wesley, Reading MA 1996.

22. Schmidt DC, Stal M, Rohnert H, Buschmann F. Pattern-Oriented Software Architecture: Patterns for Concurrent and Networked Objects. Wiley & Sons, New York NY 2000.
23. Blelloch GE, Plaxton CG, Leiserson CE, Smith SJ, Zagha M. A Comparison of Sorting Algorithms for the Connection Machine CM-2. ACM Symposium on Parallel Algorithms and Architectures, Hilton Head SC 1991.
24. Dusseau AC, Culler DE, Schauser KE, Martin RP. Fast Parallel Sorting under LogP: Experience with the CM-5. IEEE Transactions on Parallel and Distributed Systems, August 1996; pages 791-805.
25. Shan H, Singh JP. Parallel Sorting on Cache-Coherent DSM Multiprocessors. Proceedings of the 1999 Conference on Supercomputing. http://www.supercomp.org/sc99/proceedings/papers/shan.pdf Accessed August 2002.
26. Al AE. A Measure of Transaction Processing Power. Datamation 1985; volume 32 issue 7 pages 112-118.
27. Saini S, Bailey DH. NAS Parallel Benchmark (Version 1.0) Results 11-96. Report NAS-96-018, December 1996.
28. Saini S, Bailey DH. NAS Parallel Benchmark Results 12-95. Report NAS-95-021, December 1995.

BIOGRAPHICAL SKETCH
Eunkee Kim was born in 1970 in Dea-Gu, Republic of Korea. He received a B.S. in physics from Yeung-Nam University in Keung-San, Republic of Korea, in 1997. He joined the graduate program of the Computer and Information Science and Engineering Department in 1999 to pursue his master's degree.