
The structure of parallel computation and computers for graphics procedures
Full Citation
Permanent Link: http://ufdc.ufl.edu/UF00082199/00001
 Material Information
Title: The structure of parallel computation and computers for graphics procedures
Physical Description: Book
Language: English
Creator: Zhou, Xiaofeng.
Publisher: Xiaofeng Zhou
Publication Date: 1990
 Record Information
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: oclc - 23011920
alephbibnum - 001583933
System ID: UF00082199:00001

Table of Contents
    Title Page
    Table of Contents
    Theories for parallel computation
    Structure programming model
    Structure programming constructs
    The wavefront computer
    What parallel processing may offer in the future
    Biographical sketch
Full Text








I would like to express my thanks to my advisor,

supervisory committee chairman, and friend, Dr. John

Staudhammer, for his guidance and encouragement. Without his

support, this work would never have been accomplished. During

these years of working together, he enlightened me not only with his

knowledge but also with his way of exploring truth. That will

be a valuable inspiration for my future research.

I also would like to express my appreciation to the other

members of my supervisory committee, Dr. Joseph Duffy, Dr.

Fredrick Taylor, Dr. G. Ronald Dalton, and Dr. William

Eisenstadt, for their commitment and service on this committee.


I wish to thank all the members of the Computer Graphics

Research Group at the University of Florida for their helpful

discussions and suggestions.

Special thanks go to my wife, Jiming Yang, for her

understanding and support for this work.

This dissertation is dedicated to my parents who could

not have the opportunity of a university education, but

encouraged me to study science and engineering since my




ACKNOWLEDGMENTS

ABSTRACT

1 INTRODUCTION
    Trend Toward Parallel Computation
    Limitation of Parallel Computers
    Understanding Parallel Computation
        Partitioning Computational Problems
        Parallel Algorithms
        Parallel Programming Languages
        Operating Systems
        Parallel Architectures
    Overview of Dissertation

2 THEORIES FOR PARALLEL COMPUTATION
    Speedup Functions
    Scalability and Efficiency
    Granularity

3 STRUCTURE PROGRAMMING MODEL
    Difficulties of Parallel Programming
    What Is Good for Parallel Programming
    Conceptual Model

4 STRUCTURE PROGRAMMING CONSTRUCTS
    Processes
    Communication
    Control Flow
    System Control
        System Control Table
        Compiler
        Invoking and Terminating Processes
    Simulation

5 THE WAVEFRONT COMPUTER
    The Wavefront Computer Concept
    Architecture
        Process Distribution and Communication
        Memory Addressing

    Design Considerations
    Potential Performance

6 WHAT PARALLEL PROCESSING MAY OFFER IN THE FUTURE

BIBLIOGRAPHY

BIOGRAPHICAL SKETCH

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



Xiaofeng Zhou

May 1990

Chairman: Dr. John Staudhammer
Major Department: Electrical Engineering

Parallel computation can be conceived as a structure with

its nodes representing computation and branches representing

cooperation. Parallel computers can also be modeled by

networks, where nodes are processors and links are memory and

buses. The computational structure is "soft" because its

pattern can be changed according to different applications.

However, the computer architecture is "hard" because the

interconnection network is usually fixed. A fundamental

problem for parallel computation is how to map versatile

computational structures onto a rigid architecture so that the

maximum speedup and the minimum communication overhead are

achieved.

The difficulty of solving this problem comes from our

computational model that allows mutual communication among

processes. If two processes talk to each other, they should


be assigned to two processors which in some sense are 'near'.

If a large number of processes exist, the assignment task is

NP-hard. In this dissertation, I investigate a new

computational model called the Structure Programming Model.

The model combines data flow concepts with conventional

control flow strategy. A process is invoked only when its

input data and control token are ready, and the process does

not communicate with other processes during its execution.

After the execution, the result is sent out to invoke the

next processes. Process executions are concurrent and

independent. The process communication is carried out by

unidirectional message passing that occurs only when a process

is invoked or is finished.

If we consider the computational structure to be a graph

and color its nodes on execution, then the execution of a

program will appear as a wave propagating on the graph. The

wave will move forward and change its shape when computing is

going on. A parallel computer architecture, called the

WAVEFRONT COMPUTER, is a partial shared memory machine with a

multiple bus system,

tailored for the above computational model. It has a large

communication bandwidth like a local memory machine. It also

has a simple control scheme like a shared memory machine. The

architecture has a very flexible and fast communication



The WAVEFRONT COMPUTER is a MIMD general purpose machine.

The major application for this computer is the large-grain

parallel computing that is widely used in computer graphics

and other scientific computations.


Parallel processing has the potential to break through

the performance limitation of today's sequential computers.

Powerful multiprocessor computers connected by fiber-optic

networks are the prospects for computing in the next century.

The challenge, however, is that we have to leave the well-

understood Von Neumann model and adopt parallel computer

models which we have little knowledge about. This

dissertation studies a new parallel computational model that

considers parallel computation as a structure of processes

connected by data flow and control flow branches. Based on

this model, I describe a parallel system called the WAVEFRONT

COMPUTER. The WAVEFRONT COMPUTER is a Multiple Instruction

Multiple Data (MIMD) computer that hybridizes data driven and

control driven strategies and achieves an efficient process

communication by a partial shared memory organization and a

multiple bus system. Because of that, the WAVEFRONT COMPUTER

is more efficient and much faster than current MIMD



Trend Toward Parallel Computation

The demand for computing power seems endless in our

society. Ten years ago, no one wanted to generate a photo

quality image by a computer because we knew it would be

impossible. Today, we produce many such images. We still

do not want to fire billions of rays when a ray-tracing program

[WHI80] is run. But tomorrow, with much more powerful

computing engines, firing billions of rays may become a standard

for High Definition Television (HDTV). Our desire for

computing power rises along with the advance of technology.

Today's luxury will become tomorrow's necessity. When a wish

is impossible, we do not think about it. But, when a wish is

reachable, we need it badly.

To satisfy the need for computing power, our society keeps

on making faster and faster computers. Within the last 50

years, we successfully increased computer speed by more than

10,000,000 times. However, while the computer speed is

increasing, we have experienced more and more resistance to

pushing it further.

It is fair to say that the speed gained in the last half

century mainly came from the revolution of computer devices.

When a new device was invented, a new generation of computers

was created. There have been four generations of computers,

marked by device changes from vacuum tubes to transistors,

to integrated circuits, and to Very Large Scale Integrated (VLSI)

circuits. Switching devices became smaller and smaller; their


speed became faster and faster. The VLSI technology

developed in the 1980s has been very powerful. It has reduced

the device size to only a few microns and has made the system

clock run at the nanosecond level.

When the VLSI industry tried to shrink the device size

further, the situation changed. Much as physics had to

move from Newton's theory to Einstein's theory, many

second-order effects that had been ignored in the "large

world" started to show up in the "small world." We find that

a wire is no longer an ideal connector. It takes time to

transfer signals, and it also occupies space. Since a VLSI

chip has thousands, or even a million, transistors connected

by wires, the wiring is so long that signal propagation time may

have a dominant effect. That slows down our systems.

Moreover, fast VLSI chips use relatively strong currents

to activate their circuits. The heat generated by very high

density circuits can easily melt wires that have a width of

only a few microns. Therefore, a powerful cooling system

must be used to prevent burning the circuits. Today, a large

fraction of the expense of a supercomputer is not its transistors

but its wire connections, cooling system, power system, etc.

These costs will increase if VLSI circuits reach the submicron

level. The result will be that we spend a tremendous amount

of money to squeeze out the last few nanoseconds of the

system clock.


Then, what will characterize the next generation of

computers? Will we see another device revolution to boost

computer speed? It is not very likely in the near future.

We have seen the saturation of device speed. The saturation

does not result from any specific technology, but results

from the fact that we are approaching a universal limitation:

nothing can travel faster than light, so there is a limit to

the speed that any device technology can provide.

An alternative way to break through the device limitation is

to change computer architectures. Instead of making a large

computing engine, we can make many smaller engines that work

cooperatively on a single problem. In consequence, the

concurrent system may deliver more computing power in total.

The concept is simple and attractive. Ideally, we can always

add more processing elements to a parallel computer to

increase its computing power. In practice, the VLSI

technology does provide many cheap processing elements. It

once seemed that we could build a parallel computer with

unlimited computing power. Very soon, we found this was not

the case.

Limitation of Parallel Computers

By definition, a parallel computer is composed of

multiple processors that compute concurrently on a single

problem. Therefore, a computational problem can take

advantage of a parallel computer only if it is parallelizable.


In other words, the computation can be partitioned into

multiple processes and can be executed on multiple processors


Unfortunately, not every computational problem is

parallelizable. Some are inherently sequential. For

example, tracing a light ray in a two-dimensional space

with mirrors is sequential processing.

Figure 1-1 shows the ray-tracing procedure. The computation

of the reflection angle on mirror 2 will not be able to start

without first computing the reflection angle on mirror 1.

Only one processor is needed at one computing step.

Therefore, this problem will not gain any speedup even if it is

run on a parallel computer.

In practice, most computational problems are neither

completely sequential nor completely parallel. They are

hybrids of sequential and parallel computation. It is

interesting to study how these problems can take advantage of

parallel processing.

Let us first consider an ideal parallel computer model

that has N identical processors connected by an ideal

communication network that has no bandwidth limitation.

Such a computer model is not practical but it is a good tool

to study because it avoids processor to processor

communication, which is the most difficult part of parallel


[Figure 1-1. Ray tracing for two mirrors. Labels: light source, mirror 1, mirror 2, angle 1, angle 2.]


Assume a computational problem requires Q steps of

operations by a uniprocessor computer. Also assume that this

problem can be partitioned to a sequential part s and a

parallel part p. Then, Q=s+p. Under the best condition, the

problem requires s+p/N steps on the parallel computer with

N identical processors. The speedup is defined as the

sequential execution time divided by the parallel execution

time, i.e.

$$S = \frac{s+p}{s+p/N} = \frac{(s+p)N}{sN+p} \tag{1.1}$$

Equation 1.1 does not promise a great speedup for general

concurrent computations. The maximum speedup we can get is

$$S_{N \to \infty} = \frac{s+p}{s} \tag{1.2}$$

If a computational problem contains 1% sequential operations,

i.e., if s=0.01Q, the maximum speedup cannot exceed 100 even

on an ideal computer that has an infinite number of processors.

This is known as Amdahl's law [AMD67], which has had a great

impact on the parallel processing field. Figure 1-2 depicts

the function 1/s, where Q is normalized to 1 and s represents

the percentage of sequential operations. The steep curve

gives the impression that only very few computational problems

can achieve large speedup.
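
The numbers quoted above can be checked with a few lines of Python. This is only a sketch of Equations 1.1 and 1.2; the function names `speedup` and `max_speedup` and the sample processor counts are illustrative choices, not part of the original text.

```python
def speedup(s, p, n):
    """Equation 1.1: s sequential steps, p parallel steps, n processors."""
    return (s + p) / (s + p / n)


def max_speedup(s, p):
    """Equation 1.2: the limit of Equation 1.1 as n grows without bound."""
    return (s + p) / s


# Normalize Q = s + p to 1 and take 1% sequential operations.
q = 1.0
s = 0.01 * q
p = q - s

for n in (1, 16, 64, 1024, 65536):
    print(f"N = {n:6d}  speedup = {speedup(s, p, n):7.2f}")
print(f"limit = {max_speedup(s, p):.2f}")
```

Under these assumptions the speedup climbs quickly for small N and then flattens as it approaches the bound Q/s, which is the behavior the steep curve of Figure 1-2 expresses.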



[Figure 1-2. Amdahl's speedup function: speedup versus percentage of serial operations.]


If Amdahl's law predicted a theoretically low speedup, the

first commercial multiprocessor computer, the Illiac-IV, provided

practical evidence that parallel computation was not easy.

The Illiac-IV system was developed at the University of

Illinois in the 1960s [BAR68]. The system was fabricated by

the Burroughs Corporation in 1972 [BOU72]. Illiac-IV had 64

processors connected as an 8x8 processor array, as shown in

figure 1-3.

The system was a Single Instruction Multiple Data (SIMD)

computer. The 8x8 processor array was controlled by a

central control unit. The control unit fetched instructions

and broadcast them to the processor array. All processors

executed the same instruction stream concurrently but worked

on different data. From the commercial point of view, the

Illiac-IV computer was a failure because the company could not

really sell it. However, much experience with parallel

computation was gained by programming the Illiac-IV system

[HWA84]. It was found that parallel computation might have

many different data communication patterns but all of them

had to be mapped to the rigid array pattern of Illiac-IV.

Programming the parallel computer was very difficult because

a programmer had to take care of data distribution in

different processors, had to consider the architecture and

had to worry that massive communication might completely

destroy the benefit of parallel processing. Indeed, many

performance measurements showed that the speed gained from

concurrent computation was lost because of

communication [GAN84]. In parallel processing, a new idea

emerges: computing is cheap, moving data is expensive.

[Figure 1-3. Illiac-IV architecture.]

Amdahl's law and the experience with the Illiac-IV

system fostered a pessimistic view toward developing large

scale parallel computers. In the 1970s and early 1980s, most

commercial multiprocessor systems used only a few processors.

While the industrial community made cautious progress, the

academic community was very active in parallel computer


In the late 1980s the situation began to change. Sequent

Computer Inc. produced the Sequent Balance 2100 system with 30

processors. Ncube Corporation manufactured its Ncube

multiprocessor computer with 1024 processors. Thinking

Machines Corporation built the Connection Machine, which has up to

64,000 processors. More and more positive reports about

parallel computation appeared in the literature [BEN88]. A report

about a 1000-fold speedup on a 1024-processor Ncube computer

[GUS88] surprised many people. Is Amdahl's law still valid?

Will large scale parallel systems have a bright future? The

answer is not yet clear, and the confusion comes from a lack

of understanding of the underlying principles of parallel



Understanding Parallel Computation

Parallel computation is not a simple extension of serial

computation. In both hardware and software areas, there exist

many design choices for which there are no direct

counterparts in conventional serial computation. In hardware,

examples are the number of processors, MIMD or SIMD

architectures, shared or local memory, and interconnection

topology. In software, such choices are fine or coarse

parallelism, data distribution pattern, whether synchronization

is beneficial or necessary, and interprocess communication

methods. Moving to parallelism is like adding an independent

axis to Euclidean space. Many new problems emerge.

This section briefly discusses these problems.

An essential issue for studying computation and computer

architecture is to build a computer model that is abstract

enough for theoretical analysis but also practical enough for

implementation. The Random Access Machine (RAM) [AHO74] is

a model that represents the sequential computer very well.

The parallel computer model is still in its developing

stage. The most popular model is the Parallel Random Access

Machine (PRAM) [FOR78]. The PRAM model represents shared

memory architectures. Each node of a PRAM is a random access

machine, and the nodes communicate via a common memory bank.

Although the PRAM model has been widely used because of its

simplicity, it does have serious drawbacks. The PRAM

model assumes a constant communication time among an arbitrarily


large number of processors, which is not realizable. A real

shared memory, multiprocessor computer can only have a

limited bandwidth. When more than one processor competes for

a bus or tries to access the same memory bank, conflicts

arise. If this happens, requests for a physical device have

to be queued up, waiting until the device is available. That

degrades performance. As a matter of fact, memory and bus

contention can be so severe that it completely

overrides the benefit of parallel processing [PFI85] [YEW87].

Nevertheless, the PRAM model does not reflect that at all.

A more realistic model is to describe a parallel computer

as a processor network. Each processor in this network has

its own local memory, and processors communicate by sending

messages over links to neighboring processors [FIS88]. The

network model reflects communication overheads by restricting

the number of links that one processor can possess. If a

processor needs to send a message to another processor that

is not its direct neighbor, the message must find a path that

goes through relaying processors before it reaches its

destination. This represents a real communication scheme used

by local memory parallel computers. The network model is

good for studying parallel computing problems such as

algorithm mapping, scheduling, and data distribution [JAM87]

[BER84] [HIL86]. The shortcoming of the model, however, is

that the network can have many different topologies, and each

one denotes a certain type of communication pattern. This

is reflected in the variety of parallel computer

architectures, such as mesh connection, hypercube connection,

butterfly connection, etc. It is unlikely that a single

topology is good for all applications [LEV85].

Therefore, the network model suffers from a lack of

uniformity for theoretical analysis.
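
To make the relaying cost of the network model concrete, the sketch below counts the links a message crosses between two processors in two of the topologies named above. The 8x8 mesh without wraparound links, the 6-dimensional hypercube, and all function names are assumptions chosen for illustration; they do not describe any particular machine.

```python
from collections import deque


def mesh_neighbors(node, side):
    """Neighbors of (row, col) in a side x side mesh without wraparound links."""
    r, c = node
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return [(r + dr, c + dc) for dr, dc in steps
            if 0 <= r + dr < side and 0 <= c + dc < side]


def hypercube_neighbors(node, dim):
    """Neighbors of an integer-labelled node in a dim-dimensional hypercube."""
    return [node ^ (1 << bit) for bit in range(dim)]


def hops(src, dst, neighbors):
    """Breadth-first search: number of links a message crosses from src to dst."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in neighbors(node):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("destination not reachable")


# Both networks have 64 processors; we measure the two most distant nodes.
mesh_hops = hops((0, 0), (7, 7), lambda n: mesh_neighbors(n, 8))
cube_hops = hops(0, 63, lambda n: hypercube_neighbors(n, 6))
print(mesh_hops, cube_hops)
```

With the same number of processors, the worst-case path differs between the two topologies, which is one reason an algorithm tuned for one network does not carry over unchanged to another.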

The difficulty of finding an adequate parallel computer

model shows the complex nature of parallel computation. Major

problems associated with parallel computing are

1. how to partition computational problems;

2. how to design parallel algorithms;

3. how to design programming languages;

4. how to design operating systems; and

5. how to design parallel architectures.

We briefly discuss each of these questions.

Partitioning Computational Problems

Problem partition is, of course, a necessary step in

order to take advantage of parallel computing. Two

questions must be answered in the problem partition

procedure. First, how should a problem be partitioned? Second,

how fine should the partition be? The first

problem is associated with parallel algorithm design and we

will discuss it later. The second problem is interesting

because it shows the shortcoming of the PRAM model and the

limitation of parallel computation.


The PRAM model encourages very fine partitions since the

model places no limit on the number of processors or on communication

capability. Under the model, the finer the partition is, the faster

the computing will be. Obviously, this is not the case in

practice. A finely partitioned problem tends to have more

communication requests that may easily override the speedup

gained by concurrent computing. Moving data is more expensive

than number crunching.
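
A toy cost model shows why the finest partition is not the fastest once communication is charged. The model below is purely an assumption made for illustration: compute time falls as work/pieces, while communication time grows linearly with the number of pieces. The names `run_time` and `best_partition` and the numeric values are hypothetical.

```python
def run_time(work, pieces, comm_cost):
    """Toy model (an assumption, not from the text): compute time shrinks with
    the number of pieces while communication time grows linearly with it."""
    return work / pieces + comm_cost * pieces


def best_partition(work, comm_cost, max_pieces):
    """Exhaustively pick the piece count with the smallest modelled run time."""
    return min(range(1, max_pieces + 1),
               key=lambda k: run_time(work, k, comm_cost))


work, comm_cost = 10_000.0, 4.0
k = best_partition(work, comm_cost, 1024)
print(k, run_time(work, k, comm_cost), run_time(work, 1024, comm_cost))
```

In this model the best piece count sits well below the finest available partition, and the finest partition is slower than a moderate one. Real machines need not follow so simple a law; the sketch only illustrates the trade-off.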

The optimal granularity seems to depend on applications

and system size (number of processors). No theory has emerged

to guide a systematic search. This uncertainty leads to

large differences in granularity among commercial parallel machines.

The spectrum ranges from coarse parallelism such as the

Intel iPSC [IPS86], to fine parallelism such as the Connection

Machine [HIL85]. They both have merits and shortcomings,

favoring different applications. More research work is

needed to understand the underlying principle of granularity


Parallel Algorithms

Algorithm design is another issue that reflects the

complexity of parallel computation. Sequential algorithm

design has a simple goal: removing redundant operations as

much as possible. Since this is the only concern of design,

we can push it to an extreme. In contrast, parallel algorithm

design has several parameters that can be adjusted. Among

those are communication pattern, synchronization, granularity,

and the number of sequential steps. Pushing any single parameter

to its extreme usually will not yield a good design. The art

of parallel algorithm design is to find a balance point among

those parameters.

Again, the lack of a good computer model becomes a problem

for developing a general algorithm design theory. Algorithms

based on the PRAM model cannot be run on many parallel systems

without emulation software [RAN87], which degrades the

performance. Well designed parallel algorithms are usually

tailored to a certain type of architecture. For example, the

sorting problem has many different algorithm versions to match

the diversity of interprocessor communication networks. There

are perfect shuffle sorting [STO71], cube sorting [SIE77],

mesh sorting [THO77], shared memory sorting [BOR82] [BIT84],

etc. An algorithm designed for a specific machine cannot be

directly run on other machines; even worse, the algorithm may

need to be modified when the number of processors is changed.

So, the computer architecture is no longer hidden from the

algorithm designers as it used to be. In consequence, the

algorithm design becomes more complex and expensive, and the

algorithm utilization gets lower because of the porting

restriction [LUS87].

Sequential algorithms care little about computer

architecture. The Quicksort algorithm [HOA62] runs very well

on a sophisticated mainframe computer, and it also works fine

on a simple PC. Although the system configuration between the


mainframe computer and the PC can be completely different,

they both are Von Neumann machines and both are described well

by the RAM model.

Unfortunately, the situation becomes different when more

processors are added to one computer. We have to consider how

many processors are available before a decision on granularity

can be made. We have to consider the topology of the

interconnection network for efficient algorithms. Control

flow is no longer the only scheme that needs to be taken care

of; data distribution and data flow are of the same

importance. The algorithm design problem goes from a

one-dimensional flow chart design to a two-dimensional structure

design. The measurement of the "goodness" of an algorithm

requires not only the shortest sequential steps, but also the

match of a computational structure (parallel algorithm) to a

parallel computer architecture. This requirement leads to

many different parallel architectures since the computational

structure can have a variety of patterns. As a matter of

fact, many commercial multiprocessor computers have different

architectures characterized by their interprocessor

communication schemes [INT86] [ALL88] [DAP88]. As

mentioned earlier, program porting could be a serious problem

for different parallel computers. Without a large amount of

software support, which heavily depends on software migration

and compatibility, parallel computers will not achieve the

popularity that uniprocessor computers have enjoyed.


Parallel Programming Languages

The exposure of the architecture also causes difficulty

for parallel programming. A programmer has to decide variable

and data allocation among processors, has to find routes for

communications and has to define synchronization operations

in order to get a correct result. Programming parallel

computers is much like programming uniprocessor computers

in assembly language, thinking about how to use

registers and memory efficiently.

Even worse, debugging parallel programs can be very

frustrating and sometimes almost impossible [BAR87]. A common

error of parallel programming is in the use of

synchronization. A programmer often finds that some variables

are modified unexpectedly because of improper use of

synchronization. However, when he wants to locate the error

by inserting some break points in his program, the error may

simply disappear and some new errors may show up. This is

known as the nondeterministic feature of parallel computing,

caused by an asynchronous execution of operations in a

parallel program [MCG84].

All sequential programs are deterministic. Multiple

operations on the same variable must be executed in a fixed order so that

the variable value in memory is definite, predictable and

repeatable. Concurrent execution of multiple operations can

be simulated by sequential executions. The programmer can

control the order by using synchronizing primitives [STO80]

[DIJ68]. If he misuses synchronizing control, the

order will be set arbitrarily by the operating system, based

on run time work load. That makes the result unpredictable.

The programming difficulty and the lack of sophisticated

system software have been the major obstacles for the

development of parallel computers.
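
The effect of an uncontrolled execution order can be illustrated by simulating concurrent execution with sequential executions, as described above. In this sketch two hypothetical processes, P1 and P2, each read a shared variable and then write back the value plus one, with no synchronization. The helper names `interleavings` and `run` are illustrative assumptions.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one sequential ordering of read/write steps on a shared x."""
    x = 0
    local = {}
    for proc, op in schedule:
        if op == "read":
            local[proc] = x          # copy the shared value into the process
        else:
            x = local[proc] + 1      # write back the incremented private copy
    return x


p1 = [("P1", "read"), ("P1", "write")]
p2 = [("P2", "read"), ("P2", "write")]

# Every ordering the system could choose, and the final value each one leaves.
outcomes = sorted({run(s) for s in interleavings(p1, p2)})
print(outcomes)
```

The same two processes leave different final values depending only on the order in which their steps happen to be executed. Forcing each read and write pair to run as a unit, which is what a synchronizing primitive does, would leave a single outcome.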

In the late 1950s and early 1960s, high level programming languages such as

FORTRAN were developed to shift the work load from programmers

to systems. A programmer does not need to be concerned about

the usage of registers and memory; the compiler and the

operating system will take care of that. Usually,

FORTRAN programs are not as efficient as

assembly programs. However, the sacrifice of some efficiency

returned tremendous rewards in software productivity. As a

result, many people can program, not only a small group of

computer experts. The software development becomes more

efficient by marshalling more human talent in this area.

It is of great interest to pursue the same progress for

parallel programming. Nevertheless, research on automatic

parallel compilers and operating systems has not yet produced

satisfactory solutions. An initial goal for parallel

compilers was to automatically detect parallelism in a

sequential program and convert it to multiple processes that

can be run concurrently [BAE73] [GIR88]. The research work

did not yield a fruitful outcome. Detecting parallel code

in a sequential program is a very complex task [KUC86], and


current compilers for commercial systems usually convert only

iterative operations to concurrent processes [ALL88] [ARD88].

Moreover, a compiler can only detect concurrency in programs

but not in algorithms where more parallelism may exist.

Operating Systems

A more promising area is to design a better operating

system that can take care of those operations such as the

allocation of processes to different processors, process to

process communication, invoking processes or cancelling

processes, etc. The problems associated with parallel

operating systems can be generally reduced to a scheduling

problem [CAS88]. Any parallel algorithm can be perceived as

a structure with nodes and branches. Nodes represent

computational tasks that must be performed by processors and

branches represent communication that must be carried out by

physical devices such as memory or buses. The scheduling

problem is how to assign tasks to processors so that some

objective functions can be achieved.

Three objective functions are used to develop algorithms

for parallel operating systems [STO77] [LO88]. They are

maximum concurrency, minimum communication, and minimum

execution time. The optimal assignment has been proved to be

an NP complete problem [GAR79]. Thus, the search for

suboptimal solutions is necessary. Although theoretical

investigations of heuristic algorithms have appeared in the

literature [LO88] [GAL82], most assume that global

knowledge of the characteristics of the task force is available

in advance. Algorithms of this type are of interest for

establishing the best performance bounds, but they are not

implementable because the task force is indeterminate before

its execution. The parallel operating system is still an

active research field with many unknown questions.
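To make the scheduling problem concrete, the following is a minimal sketch of one simple heuristic, longest-processing-time-first list scheduling. It is an illustration only, not one of the algorithms from the cited literature: the function name `greedy_assign` and its parameters are hypothetical, communication cost is ignored, and the task costs are assumed to be known in advance, which is exactly the global-knowledge assumption criticized above.

```python
import heapq


def greedy_assign(task_costs, num_processors):
    """Assign tasks to processors, most expensive task first.

    Returns (assignment, makespan), where assignment[p] lists the indices
    of the tasks given to processor p. Communication cost is ignored.
    """
    # Min-heap of (current load, processor id): least loaded processor on top.
    heap = [(0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_processors)]
    # Visit tasks from the most to the least expensive.
    order = sorted(range(len(task_costs)),
                   key=lambda i: task_costs[i], reverse=True)
    for i in order:
        load, p = heapq.heappop(heap)
        assignment[p].append(i)
        heapq.heappush(heap, (load + task_costs[i], p))
    # The elapsed time is set by the most heavily loaded processor.
    makespan = max(load for load, _ in heap)
    return assignment, makespan


if __name__ == "__main__":
    print(greedy_assign([7, 5, 4, 3, 1], 2))
```

The heuristic gives a suboptimal answer in general, and it cannot be applied directly when task costs only become known at run time.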

Parallel Architectures

Contrary to slow software progress, the research and

development of parallel architectures are more advanced.

Characterized by different architectures, a number of

multiprocessor systems are commercially available now.

Alliant FX/8 with 8 processors [ALL88], Encore Multimax with

20 processors [ENC88], and Sequent Balance 2100 with 30

processors [SEQ88] are examples of shared memory machines.

There are also local memory parallel computers such as 32x32

processor, mesh connected DAP 500 [DAP88], and hypercube

connected Intel iPSC [IPS86]. In addition, massively parallel

computers such as MPP system with 16,000 processors [BAT80]

and the Connection Machine with 64,000 processors [HIL85] are

also available.

Shared memory systems usually use a small number of

processors to avoid memory contention. Local memory systems,

on the other hand, can use more processors to enhance their

computing speed. However, high performance cannot be

sustained for all computational problems. The peak speed is


usually achieved when an algorithm structure matches the

pattern of the interconnection network.

Research in parallel computer architectures has been an

active field for the last two decades. Nevertheless, the

hardware research is somewhat independent of software

development. The design of parallel architectures has its own

criteria such as efficient memory structures, fast

interconnection networks, etc. When such research is

conducted, we are little concerned with what the programming

language or operating system will be. We build a body first

and put a soul in it later. Unfortunately, training souls is

more difficult than creating bodies. The lagging progress of

parallel processing software has barricaded the further

development of parallel computers. The issue of combining

design considerations for both hardware and software has

become critically important. This is the crisis that we have

to face and have to solve.

Overview of Dissertation

Parallel computation can be conceived as a structure with

its nodes representing computation and branches representing

cooperation. Parallel computers can also be modeled by

networks, where nodes are processors and links are memory

modules and buses. The computational structure is "soft"

because its pattern can be changed according to different

applications. However, the computer architecture is "hard"


because the interconnection network is usually fixed. A

fundamental problem for parallel computation is how to map

versatile computational structures to a rigid architecture so

that the maximum speedup and minimum communication overhead

will be achieved.

The difficulty of solving this problem comes from our

computational model that allows mutual communications among

active processes. If two active processes talk to each other,

they must be assigned to two close processors. If numerous

such processes exist, the assignment work is NP hard and a

large communication overhead is inevitable.

In this dissertation, I investigate a new computational

model that combines data flow concept with conventional

control flow strategy. Mutual communications among processes

are not allowed. Instead, process cooperation is carried out

by unidirectional message passing as shown in figure 1-4. The

global communication among all processes is not required. At

any particular time step, active processes only need to send

their outputs to certain memory modules (figure 1-5). Next

step processes can then be attached to these modules after

current processes are finished. This adaptive communication

is more efficient than the static algorithm mapping method.

A higher level parallel programming construct is a direct

result of this model. Figure 1-4 is actually an example of

such a program. Unlike FORTRAN or C, which use statements as

program commands, the new construct uses processes as basic

building blocks. The synchronization is expressed explicitly

by the data driven method. Typical control driven operations

such as looping and branching are also employed. A graphical

tool can assist programming and run-time debugging.

Figure 1-4 A sample structure program

Figure 1-5 Execution wave

If the computational structure is viewed as a graph and

the nodes under execution are colored, they look like a wave, as

shown in figure 1-5. The wave moves forward and changes

its shape as computing proceeds. A parallel computer

architecture, called the WAVEFRONT COMPUTER, is also

investigated. The WAVEFRONT COMPUTER is a MIMD general

purpose machine tailored for the above computational model.

The major application for this computer is large-grain

parallel computing that is widely used in computer graphics

and other scientific computation.


Parallel computation is still an immature discipline that

poses a number of open problems. Some concepts such

as speedup, scalability, and granularity are unique features

that have no counterpart in sequential computation. System

efficiency is another problem that is difficult to manage for

multiprocessor systems, but is self-guaranteed by uniprocessor

computers. This chapter is devoted to the study of these

theoretical issues.

Speedup Functions

The purpose of developing parallel computers is to

accelerate computing. A parallel computer with N identical

processors working concurrently on a single problem can be

at most N times faster than a single processor. In practice,

this linear speedup is seldom obtained. Many effects reduce

the speedup. They fall into two categories: those caused

by limited hardware resources and those caused by the

computational problem itself [EAG89]. In this chapter, we concentrate on

theoretical upper bounds for speedup, i.e., we will use the



model to study how the nature of computational problems may

affect the speedup.

A computational problem can gain speedup by using a

parallel computer only if the problem is parallelizable.

Nevertheless, a practical problem almost always has a fraction

of serial operations. As mentioned in Chapter 1, the outlook

for large scale parallel systems (hundreds to thousands of

processors) is pessimistic. The distrust of an achievable

large speedup arises mainly from Amdahl's law [AMD67].

Amdahl's law indicates that the maximum speedup, even on a

parallel system with an infinite number of processors, cannot

exceed 1/s, where s is the fraction of operations that cannot

be executed in parallel. The steep curve of 1/s gives the

impression that developing large scale parallel systems may

not be a feasible choice because a linear speedup in terms of

number of processors is virtually impossible.
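The 1/s ceiling can be checked numerically. The sketch below uses the standard finite-N form of Amdahl's law, S = 1/(s + (1-s)/N), which approaches 1/s as N grows. The function name `amdahl_speedup` and the sample serial fraction of 0.05 are illustrative choices, not values from the text.

```python
def amdahl_speedup(s, n):
    """Speedup predicted by Amdahl's law for serial fraction s on n processors."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("serial fraction s must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (s + (1.0 - s) / n)


if __name__ == "__main__":
    # With s = 0.05 the speedup saturates below 1/s = 20 however large n is.
    for n in (1, 16, 1024, 10**6):
        print(n, round(amdahl_speedup(0.05, n), 2))
```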

However, Amdahl's law is less forbidding when

dealing with applications in computer graphics that often

repeat the same computation for many pixels. Consider the

ray-tracing algorithm [LEV86] as an example. The algorithm

for a 1024x1024 display has three nested iterative loops.

for y=1 to y=1024 do
    for x=1 to x=1024 do
        send a light ray from observer through pixel (x,y)
        for object=1 to number of objects do
            determine closest object to origin along ray
        determine lighting of closest object
        color pixel (x,y)

It is obvious that the two outer do loops can be executed

in parallel. If the algorithm is performed by the PRAM

computer, one million (1024x1024) processors can be used, each

one computing the inner loop for one pixel. Then the speedup

should be linear. The ray-tracing algorithm is not the

only one that has this feature. Recently, Sandia National

Laboratories studied beam stress problems, baffled surface

wave simulation problems and fluid flow problems. They

achieved a 1000-fold speedup for these three problems on their

1024-processor hypercube system [GUS88]. These results

clearly show the reality of linear speedup for certain

applications. Therefore, we have reasons to suspect that

something may be amiss with Amdahl's law.

An argument raised by Sandia National Laboratories is that the

problem size is not a constant as assumed by Amdahl's law;

instead, the problem size will increase when a more powerful

computing resource is given. For their particular

applications, the researchers at Sandia find that the parallel

part of a problem scales with the problem size, but the serial


part remains constant. Based on this application model, E.

Barsis at Sandia suggested a new formula [GUS88]:

$$ Q = s + pN \qquad (2.1) $$

$$ S = N + (1 - N)\,s \qquad (2.2) $$

where Q is the problem size, s is the serial part of the

problem, p is the parallel part of the problem before scaling,

N is the number of processors and S is the speedup.

Contradicting Amdahl's law, equation 2.2 indicates a linear

speedup.


The different speedup prediction between Amdahl and

Barsis is a result of different assumptions for the problem

size. Amdahl assumed a fixed sized problem to derive his

formula, while Barsis, on the other hand, used a scaled sized

problem. Since the predicted speedup is completely different

simply by changing the problem size, it suggests that the

problem size should be considered to be an independent

variable in the speedup function. As a consequence, we can

derive a new speedup function that takes the problem size Q

as an independent variable and consider the serial part of the

problem to be f(Q), a function of the problem size. By

defining different f(Q), it is interesting to find that

Amdahl's law is the low-extremum of the new function and

Barsis's is the high-extremum [ZHO89a].


To derive our speedup function, we define the following

terms:

Q--Problem size, defined as the number of unit operations

required to solve the problem.

f(Q)--Serial part of the problem that cannot be executed in

parallel. f(Q) is a function of Q, and is determined by the amount

of parallelism in the problem and its parallel algorithm.

N--The number of processors.

c(N)--The average number of processors that do not contribute

to computing because of communications and communication

conflicts. The communication cost is usually a function of

the system size.

If the execution time for a problem on a uniprocessor is

Q, then the same problem, under the best conditions, will use

f(Q)+(Q-f(Q))/(N-c(N)) time on a parallel system with N

processors which have the same speed as the uniprocessor.

Therefore, the speedup is

$$ S = \frac{Q}{f(Q) + \dfrac{Q - f(Q)}{N - c(N)}} \qquad (2.3) $$

$$ S = \frac{Q\,(N - c(N))}{(N - c(N) - 1)\,f(Q) + Q} \qquad (2.4) $$

If we assume that the PRAM model is used, i.e. the

communication cost c(N)=0, equation 2.4 becomes

$$ S = \frac{QN}{(N - 1)\,f(Q) + Q} \qquad (2.5) $$

Equation 2.5 will be used to study how the problem size Q and

its serial part f(Q) can have an effect on the speedup

function. The c(N) impact will be discussed later.
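The speedup function can be explored numerically with a small sketch. It assumes only what the text states: uniprocessor time Q and best-case parallel time f(Q)+(Q-f(Q))/(N-c(N)). The function name `speedup` and the sample values are hypothetical.

```python
def speedup(q, f_q, n, c_n=0.0):
    """Upper-bound speedup: uniprocessor time q divided by parallel time.

    q    -- problem size Q (number of unit operations)
    f_q  -- serial part f(Q)
    n    -- number of processors N
    c_n  -- average number of processors lost to communication, c(N)
    """
    effective = n - c_n
    if effective <= 0:
        raise ValueError("n - c_n must be positive")
    parallel_time = f_q + (q - f_q) / effective
    return q / parallel_time


if __name__ == "__main__":
    n = 64
    # Constant serial part: the speedup approaches n as the problem grows.
    for q in (10**3, 10**5, 10**7):
        print("f(Q)=10 ", q, round(speedup(q, 10, n), 2))
    # Proportional serial part f(Q)=0.1*Q: the speedup is independent of Q.
    for q in (10**3, 10**5, 10**7):
        print("f(Q)=Q/10", q, round(speedup(q, 0.1 * q, n), 2))
```

Setting `c_n` to zero corresponds to the PRAM assumption, and a non-zero `c_n` shows how communication loss reduces the number of effective processors.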

First of all, it is worth noting that f(Q) is not an

arbitrary function; instead, the growth rate of f(Q) satisfies

the conditions f(Q)=O(Q) and f(Q)=Ω(constant). If, for a

certain type of problem, we can find an algorithm that fixes

f(Q)=const., then we have

$$ \lim_{Q \to \infty} S(Q) = N \qquad (2.6) $$

Equation 2.6 means that, under the condition that f(Q) is a

constant, we can always get a linear speedup for an arbitrarily

large system simply by increasing the problem size. An

interesting question is how fast the problem size should be increased

in order to maintain a linear speedup, i.e., linearly,

polynomially, or exponentially?

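Let us use Barsis's assumption Q=s+pN, a problem size that

increases linearly with the number of processors.

Substituting Q=s+pN and f(Q)=s into equation 2.5,

$$ S = \frac{(s + pN)\,N}{(N - 1)\,s + s + pN} = \frac{s + pN}{s + p} \qquad (2.7) $$

Assuming s+p=1 as Barsis did, we have

$$ S = s + pN = s + (1 - s)\,N = N + (1 - N)\,s \qquad (2.8) $$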

This is exactly Barsis's formula that shows the linear

speedup can be achieved if f(Q) is a constant and the problem

size is linearly increased.

However, not every computational problem admits

a parallel algorithm that can fix f(Q) as a constant. Since

f(Q) must be bounded by Q, the worst case for a real parallel

computation is f(Q)=kQ, where 0<k<1 is a constant

coefficient. For such a case,

$$ \lim_{Q \to \infty} S(Q) = \frac{N}{(N - 1)\,k + 1} = \frac{1}{k + (1 - k)/N} \qquad (2.9) $$

which is exactly Amdahl's formula [AMD67]. Equation 2.9

indicates that no matter how fast one increases the problem

size, the maximum speedup will not exceed 1/k. Equation 2.9

also shows that Amdahl's law is still true for certain

kinds of problems even though the problem size is increased.

The above discussion shows that equation 2.5 is a more

general formula to predict the upper bound of the speedup

that can be achieved by a multiprocessor system. Amdahl's

and Barsis's formulas happen to be two extrema of equation

2.5. The new speedup function resolves the argument about the

validity of Amdahl's law, challenged by Sandia's computing

results [HEA89] [GUS88].

Before leaving the topic of speedup, I would like to

point out that the term c(N) also plays an important role in

the speedup function. To appreciate this, let us assume the

best condition: f(Q)=constant and Q goes to infinity.

Substituting these into equation 2.4, we get

$$ S = N - c(N) \qquad (2.10) $$

Equation 2.10 shows that c(N) must be a constant in order to

obtain a near N time speedup. In practice, it is quite

difficult to keep c(N) constant. However, as long as

the growth rate of c(N) is under aN (0<a<1), a linear speedup

is still achievable.

Scalability and Efficiency

The speedup function derived in the previous section

explains why some problems can gain speedup but some cannot

by increasing their problem sizes. That indicates an

interesting research topic, the scalability and system

efficiency of parallel computing, which is not well studied,

but is of fundamental importance for large scale parallel

computing [ZHO88].

The feasibility of developing large scale parallel

computers depends strongly on the system efficiency, and the

goal is to keep it constantly high. It is certainly correct

to increase the problem size to maintain high system

efficiency, because this is what the large scale parallel

computer is designed for. Nevertheless, not every

computational problem is scalable to get a constant system

efficiency or, equivalently, linear speedup. The scalability

is based on the problem itself and its parallel algorithm.

Fortunately, f(Q) can be used to find the scalability.

The previous discussion has shown that there is no way to

keep a linear speedup if f(Q)=kQ. On the other hand, a

linear problem size scaling is enough to have constant

efficiency if f(Q) is a constant. There are many other

possible f(Q) functions between f(Q)=constant and f(Q)=kQ as

depicted in Figure 2-1.

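We can generally prove that the problem size must

satisfy equation 2.11 in order to maintain the linear speedup

aN, where 0<a<1:

$$ \frac{Q}{f(Q)} = B\,(N - 1), \quad B > 0 \qquad (2.11) $$

Proof: Substituting aN for S in equation 2.5, we have

$$ \frac{QN}{(N - 1)\,f(Q) + Q} = aN, \quad 0 < a < 1 $$

$$ Q\,(1 - a) = a\,(N - 1)\,f(Q) $$

$$ \frac{Q}{f(Q)} = B\,(N - 1), \quad B = \frac{a}{1 - a} $$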




Figure 2-1 Possible f(Q) range [curve label: f(Q)=kQ]

If a parallel algorithm can fix f(Q) to be a constant,

it only requires a linear scaling of Q to satisfy the

condition, and the algorithm is well designed for large scale

parallel systems. The ray-tracing algorithm is one such

"good" algorithm. Certainly, there are algorithms that

cannot fix f(Q) as a constant. If f(Q) = sqrt(Q), the problem

size must be increased at the rate of N^2 in order to satisfy

equation 2.11.

For some algorithms, to satisfy equation 2.11 is

virtually impossible. An example is to run the Quicksort

algorithm in parallel. The algorithm cannot start parallel

sorting until its sequential testing detects where to

separate a sorting list. If a sorting problem has L sorting

keys, it requires L sequential tests, i.e. f(Q)=L. Sorting

problems have a well established problem size Q=LlogL. Then

we must have

$$ \frac{L \log L}{L} = B\,(N - 1) $$

$$ L = 2^{B(N - 1)} \qquad (2.12) $$

This is an unacceptable scaling rate, indicating a "bad"

parallel algorithm.
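The three scaling rates discussed above can be compared with a short sketch. It assumes the scaling condition Q/f(Q) = B(N-1) with B = a/(1-a) for a target speedup aN, and a base-2 logarithm in the sorting case. The function names are hypothetical and introduced only for illustration.

```python
def beta(a):
    """B = a / (1 - a) for a target efficiency a with 0 < a < 1."""
    if not 0.0 < a < 1.0:
        raise ValueError("efficiency a must satisfy 0 < a < 1")
    return a / (1.0 - a)


def size_constant_serial(n, a, c):
    """Required Q when f(Q) = c (constant): Q = B (N - 1) c, linear in N."""
    return beta(a) * (n - 1) * c


def size_sqrt_serial(n, a):
    """Required Q when f(Q) = sqrt(Q): Q = (B (N - 1))**2, quadratic in N."""
    return (beta(a) * (n - 1)) ** 2


def keys_quicksort(n, a):
    """Required key count L when f(Q) = L and Q = L log2 L: L = 2**(B (N - 1))."""
    return 2.0 ** (beta(a) * (n - 1))


if __name__ == "__main__":
    # Target 50 percent efficiency, so B = 1.
    for n in (2, 11, 33):
        print(n,
              size_constant_serial(n, 0.5, 10),
              size_sqrt_serial(n, 0.5),
              keys_quicksort(n, 0.5))
```

Even for a few dozen processors the exponential requirement of the sorting case outgrows any practical problem size, while the constant and square-root cases remain manageable.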

The traditional parallel complexity theory defines that

"good" parallel algorithms are those that can solve

computational problems in O(log^k L) time, where L is the

input size [COO83]. These are known as NC class algorithms,

which may use O(L^k) processors. Although the NC

class is a good metric to classify if a problem is

parallelizable or not, using NC class alone is inadequate to

judge the "goodness" of parallel algorithms. The NC theory

only provides the measure of execution time, but not the

measure of system efficiency. Very often a

problem has thousands of inputs, and the NC theory then allows

millions or even billions of processors to be used to achieve a

logarithmic time without considering efficiency. This method

is not realistic for today's parallel computers that use

hundreds or thousands of processors to solve problems with

thousands or millions of inputs.

The speedup function derived in this chapter can be used

as a new criterion to judge the "goodness" of parallel

algorithms. To use the speedup function, the problem size Q

must be defined. Q should be the minimum number of

operations needed to solve a problem. Unfortunately, we do

not know the minimum number of required operations

for many computational problems. The attempt to find the

minimum number of operations for an arbitrarily given problem

has not succeeded yet. Solving this problem in general is

extremely difficult because it is equivalent to resolving the

NP-complete problem. If one could find the required minimum

number of operations for the traveling salesman problem, one

would solve the NP problem.


An alternative is to accept the complexity of the best

sequential algorithm for a given problem as its problem

size [VIT86] [KRU88]. For instance, we used LlogL as the

problem size for sorting problems. The shortcoming is that

the speedup function needs to be updated when a better

sequential algorithm is invented. Fortunately, the

complexity theory of sequential algorithms is quite mature

today and we do not expect dramatic changes in the near

future.



The goal for parallel algorithm design is to maximize

concurrency and minimize communication overhead. These two

goals usually are in conflict with each other and the art of

algorithm design is to find the balance point. Granularity

is a parameter that can be manipulated for the trade-off. In

general, fine granularity exploits more parallelism but tends

to have a large communication overhead. Conversely, coarse

granularity saves communication time but uses less

parallelism. A general, systematic theory of

granularity has not emerged yet. We will use matrix

multiplication as an example for discussing granularity.

Although the result cannot be used as a general theory, it

provides interesting insight for this topic.

Consider computing C=A*B, where A and B are two nxn

matrices. C is also an nxn matrix and each of its elements is

$$ c_{ij} = \sum_{k=1}^{n} a_{ik}\,b_{kj} \qquad (2.13) $$

The computation of each c_ij is independent. So, all elements

of C can be calculated in parallel. Three algorithms are

chosen to study the performances of different granularity.

Algorithm 1: fine grain partition, using as many

processors as possible. Equation 2.13 is computed by 2n-1

processors connected as a binary tree. Each processor has

two input numbers and performs summation or multiplication.

Figure 2-2 shows n^2 such trees.

Algorithm 2: medium grain partition, using n^2

processors. Each processor is fed by a row of A and a column

of B and the summation of equation 2.13 is calculated by one

processor. Figure 2-3 depicts the algorithm.

Algorithm 3: coarse grain partition, using n processors.

Each processor computes a column of C. The algorithm is

shown in figure 2-4.
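As an illustration of the coarse grain partition, the sketch below gives each of n workers one column of C, as Algorithm 3 describes. The names `compute_column` and `matmul_by_columns` are hypothetical, matrices are plain nested lists with n of at least 1, and the Python threads only stand in for processors: they show the task decomposition, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_column(a, b, j):
    """Work of one 'processor' in Algorithm 3: column j of C = A * B."""
    n = len(a)
    return [sum(a[i][k] * b[k][j] for k in range(n)) for i in range(n)]


def matmul_by_columns(a, b):
    """Coarse grain partition: one task per column of C, n tasks in total."""
    n = len(a)
    with ThreadPoolExecutor(max_workers=n) as pool:
        columns = list(pool.map(lambda j: compute_column(a, b, j), range(n)))
    # columns[j][i] holds c_ij, so transpose back to row-major order.
    return [[columns[j][i] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    print(matmul_by_columns([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Algorithm 2 would correspond to one task per element c_ij instead of one task per column, and Algorithm 1 to splitting each inner sum across a tree of workers.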

Three assumed parallel systems with different hardware

resources are used to analyze the performance. To simplify

the discussion, some assumptions are made.

1. Assume the communication channel set up time t_c is a constant

between any pair of processors. Although this is not true in

a real system, t_c can be considered as the average channel

set up time. Also assume that concurrent channel set up is

possible if processors are available.

2. Assume an arithmetic operation takes t_o time, whether

it is a summation or a multiplication.

3. Let t_p denote the time of passing a number in an

established communication channel.

We use T1, T2, and T3 to denote the elapsed times for

algorithms 1, 2, and 3.

Figure 2-2 Algorithm 1

Figure 2-3 Algorithm 2 (each processor receives a_ik and b_kj, k=1,...,n, for one element c_ij)

Figure 2-4 Algorithm 3 (each processor runs a Do loop over one column of C)

Case 1: Using an ideal system that has unlimited

processors and communication channels.

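$$ T_1 = \log_2 n \cdot t_o + \log_2 n \cdot t_p + t_c \qquad (2.14) $$

$$ T_2 = (2n - 1)\,t_o + 2n\,t_p + t_c \qquad (2.15) $$

$$ T_3 = n(2n - 1)\,t_o + (n^2 + n)\,t_p + t_c \qquad (2.16) $$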

It is obvious that algorithm 1 is the best solution under

the condition of case 1. The algorithm needs only a

logarithmic number of sequential steps for arithmetic

computing and data communication. Algorithm 2 has O(n)

complexity. It is slower than algorithm 1 but faster than

algorithm 3, which has O(n^2) complexity.

Case 2: Using n^2 processors with limited communication

channels. For algorithm 1, n processors are assigned to

compute each binary tree and n trees are computed

simultaneously. Algorithm 2 uses one processor for each element

of C. Algorithm 3 uses only n processors and wastes the

rest of them.

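$$ T_1 = n \log_2 n \cdot t_o + n \log_2 n \cdot t_p + n \log_2 n \cdot t_c \qquad (2.17) $$

$$ T_2 = (2n - 1)\,t_o + 2n\,t_p + t_c \qquad (2.18) $$

$$ T_3 = n(2n - 1)\,t_o + (n^2 + n)\,t_p + t_c \qquad (2.19) $$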

Under the condition of case 2, algorithm 2 is the best. It

has the fewest sequential computing steps and the least data

communication cost. Its channel set up time is constant,

which is the same as algorithm 3 but is much less than

algorithm 1. Algorithm 3 has O(n^2) complexity for arithmetic

operations and data passing. These two parts are slower than

algorithm 1, which possesses O(nlogn) complexity. However,

algorithm 3 has a constant channel set up time that is much

shorter than that of algorithm 1. Since t_c is much longer than t_o or

t_p in general, algorithm 1 is not much superior to

algorithm 3.

Case 3: Using n processors with limited communication

channels. Algorithm 1 will still use n processors to compute

each binary tree and the computation will repeat n^2 times.

Algorithm 2 uses n processors for n elements of the C matrix and

the computation repeats n times. Algorithm 3 uses all n

processors once.

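$$ T_1 = n^2 \log_2 n \cdot t_o + n^2 \log_2 n \cdot t_p + n^2 \log_2 n \cdot t_c \qquad (2.20) $$

$$ T_2 = n(2n - 1)\,t_o + 2n^2\,t_p + n\,t_c \qquad (2.21) $$

$$ T_3 = n(2n - 1)\,t_o + (n^2 + n)\,t_p + t_c \qquad (2.22) $$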


Algorithm 3 is the best choice for case 3 although the

performances of algorithm 2 and 3 are close. They have the

same complexity O(n^2) for computing and data passing, but

algorithm 3 has less channel set up time. Algorithm 1 turns

out to be the worst choice with the longest computing, data

passing and channel setting time.
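The ranking of the three algorithms in each case can be reproduced numerically. The sketch below is a reading of the elapsed-time expressions 2.14 to 2.22 under the stated assumptions (operation time t_o, data passing time t_p, channel set up time t_c). The function name `elapsed_times` and the sample values n=64, t_o=t_p=1, t_c=100 are hypothetical.

```python
import math


def elapsed_times(n, t_o, t_p, t_c):
    """Return {case: (T1, T2, T3)} for the three hardware cases."""
    lg = math.log2(n)
    unit = t_o + t_p  # one arithmetic step plus one data passing step
    t2_one_round = (2 * n - 1) * t_o + 2 * n * t_p + t_c
    t3 = n * (2 * n - 1) * t_o + (n * n + n) * t_p + t_c
    return {
        # Case 1: unlimited processors and channels.
        1: (lg * unit + t_c, t2_one_round, t3),
        # Case 2: n^2 processors, algorithm 1 repeats n times.
        2: (n * lg * (unit + t_c), t2_one_round, t3),
        # Case 3: n processors, algorithm 1 repeats n^2 times, algorithm 2 n times.
        3: (n * n * lg * (unit + t_c),
            n * (2 * n - 1) * t_o + 2 * n * n * t_p + n * t_c,
            t3),
    }


if __name__ == "__main__":
    for case, times in elapsed_times(64, 1.0, 1.0, 100.0).items():
        best = times.index(min(times)) + 1
        print("case", case, times, "best: algorithm", best)
```

With a set up time much larger than the operation and passing times, the fastest algorithm moves from 1 to 2 to 3 as the hardware shrinks, which matches the discussion of the three cases.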

Summarizing the analysis for three cases, we find that

very fine grain partition is good only for the PRAM model.

It gains speed from exploiting maximum parallelism. For real

systems, the maximum parallel algorithm is not a good choice

because it imposes more communication overhead.

Furthermore, it is surprising to find that algorithm 1

also uses more sequential steps for arithmetic computing in

case 2 and case 3. The reason is that algorithm 1 cannot

fully use n processors because of its binary tree structure.

Therefore, if a parallel computer does not have enough

hardware resources to accommodate fine grain concurrency, the

algorithm becomes slow in both communication and computation.

Algorithm 3 has a fixed performance in all three cases.

It is a conservative algorithm if the processor number is

larger than n. Algorithm 2 is probably the most favorable

one in this study. It has the optimal performance when the

number of processors matches its partition. It is slightly

slower than algorithm 3 in case 3, where the number of

processors does not match the partition.


It is quite clear that the number of processors in a

parallel system is an important parameter for choosing

granularity. If hardware can only provide N processors, the

best partition is to divide a problem to N independent tasks

with equal work load. If this partition is possible, further

partition on individual tasks will not increase computing

speed. Using one processor for each task is always more

efficient than using multiple processors for one task and then

repeating the same procedure. Although such a partition may not

always be practical, the concept that task number should be

comparable to processor number is important and fundamental.

Very fine grain partition such as parallelism at the

instruction level may not be good for general purpose MIMD

systems. Instead, employing concurrency on the process level

is a more feasible choice.


The structure programming (SP) model proposed in this chapter

provides a base to develop a higher level parallel programming

construct. The new construct frees programmers from the

concern of many implementation details such as hardware

architectures, data allocations, process assignments and

communication routines. Instead, it lets programmers

concentrate on more important issues such as problem partition

and algorithm design. The SP model supports deterministic

computation and provides powerful visual programming and

debugging tools.

The motivation for developing the SP language is the same

as that for developing FORTRAN and C. A higher level

language independent of any specific machine is essential for

software productivity. The current difficulty in achieving

reliable, efficient and portable parallel programs argues

strongly for such higher level models.

Difficulties of Parallel Programming

A sequential program is a list of instructions which can

be conceived as a single control flow or thread. A von Neumann

computer performs the program by following the flow and


executing instructions one by one. When an instruction

operates, it always assumes that data are ready in memory.

The data dependence of the program is implicit. The

correctness of the program is ensured not by preparing data

in time for fetching, but by arranging instructions in the

right order. This methodology works fine in sequential

programming because data in memory are deterministic and

traceable at any time.

Parallel programs for shared memory machines still keep

the same concept [AND83]. However, the concurrent access and

modification of the same variables by multiple processes cause

indeterministic data in memory. For example, suppose the

initial value of variable x is x=0. A parallel program

updates x to x=a+b, where a is in process P1 and b is in

process P2. If P1 and P2 are executed concurrently:

P1: x=x+a        P2: x=x+b

the computing may yield three different results: x=a, x=b, and

x=a+b.


If P1 and P2 both get x before either one finishes its

computing, both of them get x=0. Then the computed result

depends on which process writes back to memory last. If P1 writes

back last, it overwrites whatever was stored before and sets

x=a. Conversely, if P2 writes back last, x=b. The correct

result is obtained only when one process is executed after


another. In other words, P1 and P2 must be synchronized to

guarantee the correctness of the program.
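The three outcomes can be reproduced deterministically by enumerating every interleaving of the read and write steps of the two processes. This is a simulation sketch, not real concurrent code: the names `run` and `possible_results` are hypothetical, and each process is modeled as one read of x followed by one write of its private sum.

```python
from itertools import permutations


def run(schedule, a, b):
    """Execute one interleaving of the read and write steps of P1 and P2."""
    x = 0                 # shared variable, initially 0
    local = {}            # private copy of x held by each process
    inc = {"P1": a, "P2": b}
    for proc, step in schedule:
        if step == "read":
            local[proc] = x
        else:             # "write": store the private sum back to x
            x = local[proc] + inc[proc]
    return x


def possible_results(a, b):
    """Collect the final x over all valid interleavings."""
    steps = [("P1", "read"), ("P1", "write"), ("P2", "read"), ("P2", "write")]
    results = set()
    for order in permutations(steps):
        # Keep only interleavings where each process reads before it writes.
        if all(order.index((p, "read")) < order.index((p, "write"))
               for p in ("P1", "P2")):
            results.add(run(order, a, b))
    return results


if __name__ == "__main__":
    print(sorted(possible_results(2, 5)))
```

Only the interleavings in which one process finishes before the other starts yield a+b, which is the ordering that synchronization must enforce.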

Although synchronizing controls such as the (P,V) semaphore

[DIJ68] were invented a long time ago, using these

primitives throughout a large parallel program correctly and

consistently is not a simple task. As a matter of fact, one

of the most common errors in parallel programming is the

misuse of synchronization. A programmer often finds that some

variables are changed unexpectedly.

Tracing errors of parallel programs can be very

frustrating. Recall the previous example about updating

variable x by two processes P1 and P2. It is quite common

that a programmer overlooks the necessity of imposing

synchronization when updating x. After running the program,

the output result is not correct. The programmer suspects

that the error may be caused by a wrong x value. The

debugging step he takes is to set a break point after P1 and

P2, because he knows that the variable should be x=a+b.

When he reruns the program, however, the operating system

happens to assign P2 to a processor after P1 has modified x. The

change of the processor assignment could result from the

change of the program, or simply from the change of overall

work load in the system. So, the programmer cannot find what

is wrong in his program. Instead, he finds that the error

sometimes shows up and sometimes disappears. This non-repeatable

computation is known as the indeterministic feature in

parallel computation.

Programming local memory systems also leads to

synchronization problems. Moreover, the architecture of a

local memory machine is not hidden from users. Therefore, a

user must design his program for a specific parallel computer

characterized by its interconnection network. It is the

programmer's responsibility to assign processes to different

processors, and it is also the programmer's responsibility to

design a communication routine algorithm for passing data

among processors. Since programs are tightly associated with

specific hardware, any hardware changes, even adding more

processors to the same architecture, may call for a complete

redesign of the programs. At this stage, parallel programming

is much like assembly programming for sequential

computers. Yes, programs are more efficient because of

special designs. However, the software productivity is low

and programs are not portable.

What Is Good for Parallel Programming

FORTRAN and C are two popular high level languages in

scientific and engineering computing. They are called high

level languages because one statement of FORTRAN or C includes

several low level instructions. For example, the statement

z=x+y may include four assembly level instructions:

fetch x to register A

fetch y to register B

add the content of register B to register A

write the content of register A to z in memory

In a high level language, these four instructions are combined into

an atomic operation z=x+y.

FORTRAN and C programs may not be as efficient as

assembly programs. But in application programs these two

languages are much more popular than assembly. The success

of FORTRAN and C lies in the fact that they match the level

of mathematical expressions. The use of library functions

such as sin(x) and exp(x) is another example that high level

operations are useful if they are at the equivalent level of

our thinking. High level operations do impose restrictions

for some low level controls. Nevertheless, when trees are

counted, one does not want to see every leaf. When forests

are counted, one does not want to see every tree. The

principle is that the building blocks used by our thinking

should be building blocks used in programming.

Then, what should be the building blocks for parallel

programs? The answer is related to different architectures.

In this research, we only consider MIMD computers.

Since concurrent programs are usually designed for large

computational problems, it is natural to consider processes

(which are one step up in abstraction from statements) to be

basic building blocks. Consider the development of parallel


programs. It starts from problem partition, then defines

independent processes that can be executed in parallel, and

also designs the interprocess communication scheme. Compared

against sequential program design, these three steps are new,

and they are process based designs. It is argued in this

research that the design emphasis for parallel programs should

be focused on how to partition a problem, how to decide

granularity and how to design an efficient interprocess

communication pattern. In other words, the programmer should

pay more attention to process based structure design. The

detailed implementation of each process, by employing

conventional FORTRAN or C, is straightforward.

Another important issue for parallel programs is

synchronization. A good parallel programming language must

provide an explicit synchronizing control scheme. If

synchronization is misused, it should be detected by a

compiler or a run-time error message. Programs must be

traceable and computing results must be repeatable. This is

the most desired feature for parallel programs, and also is

a difficult one to achieve.

Let us go back to the previous example of computing x=a+b

by two processes P1 and P2. The indeterministic calculation

results from the fact that two processes can access and modify

a shared variable concurrently. The computing result depends

upon the execution order of P1 and P2, which is an

indeterministic factor in parallel computation.


The problem raised here is actually a historical mistake.

We adopted the sequential execution paradigm for parallel

computation. As mentioned in an earlier section, the

sequential execution model always assumes that data are ready

when an instruction is performed. The datum dependency is

implicit, and it is controlled by arranging the order of instructions.


Parallel programs for shared memory machines still keep

the "fetch-compute-write" paradigm. They still assume that

data are ready when instructions are executed concurrently.

However, data may not be ready in concurrent execution model

as illustrated in the above example. Theoretically, a

programmer can make the data ready by imposing

synchronization. In practice, there is no rule that can guide

a systematic decision for imposing synchronization, because

the datum dependency is implicit.

A more natural parallel execution model is data flow

computing. Data flow is a completely different computational

model for which a computing engine is driven by data instead

of by a list of control commands [TRE82]. In many aspects,

the data flow model appears more natural than the control flow

model for parallel computation. The idea of data flow

computing is to execute operations whenever data for these

operations are ready. Figure 3-1 shows the data flow graph

for the execution of the following program [HWA84]


Figure 3-1 Data flow graph


Notice that the data dependence is the only restriction for

concurrent execution and instructions are performed when their

input data are ready. No designed synchronization is

necessary and the datum dependency is self guaranteed.
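
The firing rule can be made concrete with a small C sketch. It is an illustration only: the node layout, the function names, and the sample expression (a + b) * (c - d) are assumptions chosen here, and the expression is not the program of figure 3-1. A node fires as soon as both of its operand slots are filled, whatever its position in the node list.

```c
#include <stdbool.h>

enum op { OP_ADD, OP_SUB, OP_MUL };

/* One instruction node of a data flow graph (hypothetical layout). */
struct node {
    enum op op;
    double  in[2];     /* operand slots */
    bool    ready[2];  /* set when the operand has arrived */
    bool    fired;     /* a node fires at most once */
    int     dest;      /* consuming node, or -1 for the program output */
    int     dest_slot; /* operand slot of the consuming node */
};

/* Deliver an operand token to a node. */
static void deliver(struct node *g, int n, int slot, double v)
{
    g[n].in[slot] = v;
    g[n].ready[slot] = true;
}

/* Fire every node whose operands are ready until nothing can fire.
   The scan order does not matter: only data availability enables a node. */
double dataflow_run(struct node *g, int count)
{
    double out = 0.0;
    bool progress = true;
    while (progress) {
        progress = false;
        for (int i = 0; i < count; i++) {
            if (g[i].fired || !g[i].ready[0] || !g[i].ready[1])
                continue;
            double r;
            switch (g[i].op) {
            case OP_ADD: r = g[i].in[0] + g[i].in[1]; break;
            case OP_SUB: r = g[i].in[0] - g[i].in[1]; break;
            default:     r = g[i].in[0] * g[i].in[1]; break;
            }
            g[i].fired = true;
            progress = true;
            if (g[i].dest < 0)
                out = r;
            else
                deliver(g, g[i].dest, g[i].dest_slot, r);
        }
    }
    return out;
}

/* Example graph: (a + b) * (c - d). The multiply node is listed first
   on purpose; it still waits for both of its operands. */
double demo_expression(double a, double b, double c, double d)
{
    struct node g[3] = {
        { OP_MUL, {0.0, 0.0}, {false, false}, false, -1, 0 },
        { OP_ADD, {0.0, 0.0}, {false, false}, false,  0, 0 },
        { OP_SUB, {0.0, 0.0}, {false, false}, false,  0, 1 },
    };
    deliver(g, 1, 0, a);
    deliver(g, 1, 1, b);
    deliver(g, 2, 0, c);
    deliver(g, 2, 1, d);
    return dataflow_run(g, 3);
}
```

The sequential scan only stands in for a real data flow engine; on a parallel machine the add and subtract nodes could fire at the same time.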

In theory, the data flow model is more attractive. It

possesses a natural way to employ maximum parallelism and it

is free from side effects. However, the performance of real

data flow computers is not as good as computer scientists

expected [GAJ82]. The problem comes from communication

overhead that is required in data flow computing.

When operands are ready, a data flow computer must find

proper instructions to operate on them. The matching is

performed by an interconnection network. If too many operands

arrive together, network contention is built up. After

getting operands, all ready instructions are queued and go

through another interconnection network to find available

processors. With a limited network bandwidth, a large number

of ready instructions causes an excessive pipeline overhead,

which may destroy the benefits of parallelism.

The performance degradation is caused by pursuing very

fine parallelism that exceeds the communication capability of

interconnection networks. To overcome this problem, the


large-grain data flow model has been proposed [BAB84]. The

model keeps the data flow concept, but it exploits concurrency

at the process level. The drawback, however, is the

difficulty of programming. This work has attracted more attention

recently [DON87] [XU89].

Data flow programming is based on data flow analysis.

Necessary instructions or processes are designed to modify

data in the data flow. When data reach the end of the

program, they become computing results. This programming

style runs counter to our way of thinking, which is closer

to control flow. Some reports show that it may take

two weeks for a computer expert to develop a simple matrix

multiplication program by using the large-grain data flow

model [KAP87].

Although data flow programming has not become popular

yet, it does possess advantages such as deterministic

computation and easy scheduling for operating systems. The

question is: can we combine control flow and data flow

concepts to form a new programming model that acquires

advantages from both of its parents? The structure

programming (SP) model proposed in this dissertation is such

an effort.

In the SP model, program design still follows control

flow. We construct processes for what needs to be done first,

then prepare data for them. The innovation of the SP model

is the way for a process to get data. Data are no longer


considered to be stored statically in memory and to be fetched

by processes. Instead, data are sent to processes and flow

dynamically in a structure constructed by processes and

communication links. For example, the SP model does not allow

two processes P1 and P2 to modify the shared variable x as

illustrated before. It lets P1 and P2 send variables a and

b to another process P3, and lets P3 perform the addition as

shown in figure 3-2. Synchronization is carried out by

assigning a logical function to P3: P3=P1P2, which means that

P3 will not be invoked until it receives data from both P1 and

P2. Since synchronization controls are explicit now,

algorithms can be developed to check the semantic correctness.
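
A minimal C sketch of this x=a+b example is given here under stated assumptions: the structure p3, the function send_to_p3 and the slot numbering are hypothetical names introduced for illustration and are not part of the SP construct itself. The sketch shows P3 staying idle until both P1 and P2 have delivered their data, whatever the order of arrival.

```c
#include <stdbool.h>

/* Hypothetical model of process P3 with logical function P3 = P1P2. */
struct p3 {
    double buf[2];      /* input buffer: slot 0 is fed by P1, slot 1 by P2 */
    bool   arrived[2];  /* logical variables P1 and P2 */
    bool   done;        /* true once the computing body has run */
    double x;           /* result x = a + b */
};

/* Model of a send command executed by P1 (src = 0) or P2 (src = 1).
   Returns false if the slot was already filled in this computing cycle. */
bool send_to_p3(struct p3 *p, int src, double v)
{
    if (p->arrived[src])
        return false;               /* a variable may be filled only once */
    p->buf[src] = v;
    p->arrived[src] = true;
    if (p->arrived[0] && p->arrived[1]) {   /* logical function is TRUE */
        p->x = p->buf[0] + p->buf[1];       /* computing body of P3 */
        p->done = true;
    }
    return true;
}
```

Because P3 is the only process that writes x, the result does not depend on whether P1 or P2 finishes first.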

The SP model is a process based, concurrent programming

construct. It keeps the control flow idea for process design,

but uses the data flow concept to explicitly feed data to

processes. Although race conditions may still exist because

of programming mistakes, they are traceable by means of

the program graph and logical functions. Debugging becomes much

easier because program execution can be viewed by a graphics

monitor, displaying which processes are finished and where

data/controls are going. The SP model is abstracted from any

specific machine and is implementable by different MIMD

computers. Therefore, we can expect portable software to be

developed by using this construct.

Figure 3-2 SP program for x=a+b


Conceptual Model

The structure programming model is characterized by a

directed graph with nodes and arrows. Nodes represent

processes, arrows represent data and control flows. A process

is defined as an indivisible, atomic computing unit that has

an input buffer and a computing body. Figure 3-3 illustrates

a process and its graphics symbol. The process receives data

from other processes and stores data in its input buffer.

When the computing body starts computing, the process will

seal its input buffer and the computing body can only use data

currently stored in the input buffer. The computing body is

a conventional FORTRAN or C code that must be executed in

sequence. Send commands are used to write outputs from this

process to other processes. Arrows in the directed graph

denote where to send data.

Associated with each process is a logical function that

controls the activation of the process. If the logical

function is FALSE, the process will remain inactive. If the

logical function is TRUE, the process is ready for execution.

The function variables are the input branches to the process.

If a process has no input branch, it is a leaf process or

start process. Its logical function is set to TRUE, and it

will be invoked automatically when a program starts. Figure

3-4 shows an example. P1 and P2 are two starting processes.

P1 sends outputs either to P3 and P4, or to P5. P5 is invoked



Figure 3-3 A process in SP model (label: input buffer)

Figure 3-4 Sample program of the SP model (logical functions: P6 = P3P4 + P5, P3 = P1)


only under the condition that P1 and P2 both request it. P6

can be invoked by either P3 and P4, or by P5 alone.
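
One possible encoding of such a logical function is a sum of products over bit masks. The sketch below is an assumption made for illustration (the macro BR and the function logical_function are not defined in the text): each product term is a mask of input branches, and the function is TRUE when every branch of at least one term has delivered its token.

```c
#include <stdbool.h>

/* Bit i stands for the input branch coming from process P(i). */
#define BR(i) (1u << (i))

/* Evaluate a sum-of-products logical function.
   terms[t] is the mask of branches ANDed together in product term t;
   arrived is the mask of branches that have delivered a token so far.
   A process with no input branch (nterms == 0) is a start process. */
bool logical_function(const unsigned *terms, int nterms, unsigned arrived)
{
    if (nterms == 0)
        return true;
    for (int t = 0; t < nterms; t++)
        if ((arrived & terms[t]) == terms[t])
            return true;
    return false;
}
```

For P6 = P3P4 + P5 the term list would be { BR(3) | BR(4), BR(5) }: a token from P3 alone leaves P6 inactive, while tokens from both P3 and P4, or from P5 alone, make it ready.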

The effect of logical functions associated with send

commands is analogous to that of branch commands in sequential

programs. They control the program execution based on run-time

conditions. In sequential programs, the branch command must

be used with care, as it has to be in structure programs. If

P1, by mistake, sends output to all of P3, P4, and P5, a race

condition is established. The execution of P6 will depend

on which branch comes first, P3P4 or P5. This is the

indeterministic feature that we want to avoid. Fortunately,

this type of error can be detected by the compiler or the

operating system in the SP model. The implementation details

are discussed in the next chapter.

The SP model allows a hierarchical design. A subgraph

with a group of processes and links can be defined as a macro

process. In a later chapter, we will find that the concept

of macro process is useful for defining iterative processes

and beneficial for large program designs.


The structure programming construct is the language

implementation of the SP model. Design issues about the SP

construct are: how to define variables for a process, how to

send data to processes, and the rules for using control

branches. Design considerations are also needed for system

controls that include detecting race condition and deadlock

in SP programs, rules for invoking processes, and rules for

terminating processes. At the end of this chapter, an SP

program is designed to display a 3D synthesized image for a

pump room of a nuclear power plant. The program is simulated

on a Silicon Graphics IRIS4D workstation and yields correct results.



As defined previously, a process is an atomic computing

unit that receives data from other processes, performs

operations, and sends data out. A process receives all data

during its waiting period. When a process is awakened, it no

longer accepts any data. All data the process needs from

the outside world must be in its input buffer. After starting,

the computing process cannot be interrupted by other



processes, and it will continue till its end. Conventional

languages such as FORTRAN and C can be used to write

processes. In this chapter, C type expressions are used for


The principle of structure programming is that data

are not hidden in memory. They are not fetched by processes,

but are sent to processes. Therefore, the data flow must be

explicitly expressed in program structure. A process must

declare all its variables at the beginning of the process,

just like a C program. There are three types of variables:

global, local, and input.

Global variables are accessible by all processes. But,

they are read only variables. No process can modify a global

variable, or send it to other processes. The initial values

of global variables are assigned in a special starting process

P_start(), which is analogous to the main routine in a C program.


Local variables are created in a process. These

variables can be modified and also can be sent out to other

processes as their inputs.

Input variables are those that a process gets from its

input buffer. They are fed by other processes, and they are

updatable and transmittable.

When programming, a programmer does not need to be

concerned with where variables are physically located in


before a process is invoked. It is the programmer's

responsibility to ensure data correctness when the process

starts execution. Input and local data are kept in memory

during the process's life time. When the process is

terminated, these data are also discarded.

Each process has a name and a logical function. The

process name is the process identification and the logical

function sets the condition for awakening the process.

Variables of a logical function denote if there are data

coming through input branches. If data or control tokens are

sent in through a branch, its logical status is TRUE. If the

logical function for a process is TRUE (all data are ready),

the process is awakened. Otherwise, it is sleeping.

Computation may not be the only purpose to use a process.

A process can be designed for branching data or just to ensure

logical correctness for the whole program.

Checking the intersection for two triangles is given as

an example to illustrate how to define variables for different

processes. Figure 4-1 shows two triangles A and B in the x-

y plane. To decide if A intersects B, pair by pair line

intersection checking should be performed. Assume three

processes are used for testing the intersection. P1, P2, and

P3 each test one edge of A against the three edges of B. If

an intersection is detected, a logical variable Li is set

TRUE. If no intersection is detected, Li is set FALSE. P1,

P2, and P3 send L1, L2, and L3 to process P4. P4 outputs a


Figure 4-1 Intersection computation for two triangles

no intersection signal if L1=L2=L3=FALSE. Otherwise, P4

outputs an intersection signal.

The structure program is depicted in figure 4-2. Ps is

the starting process that defines b1, b2, b3 as global

variables and a1, a2, a3 as local variables. Ps reads in b1,

b2, b3, a1, a2, and a3 from a disk file, and sends a1, a2, and a3

to P1, P2, and P3 respectively. P1, P2, and P3 also declare b1,

b2, and b3 as global variables, but define ai as an input

variable and Li as a local integer variable. When P1 is

invoked, it gets a1 from Ps and can read b1, b2, and b3 as

global data. After computing, it sets L1=1 if an intersection

is detected. Otherwise, it sets L1=0. The processes P2 and

P3 work the same way as P1. P4 declares L1, L2, and L3 to be

input variables, and L to be a local variable. P4 is invoked

only when all of P1, P2, and P3 have sent their outputs to it. If

L1=L2=L3=0, L='no intersection'. Otherwise, L='intersection'.

A sample declaration for P1, P2, or P3 is as follows.

global float b1[4], b2[4], b3[4];

input float a[4];

local int L;

Four element arrays are used because each edge of a triangle

is specified by four floating point numbers, (x1,y1) and (x2,y2).

P1, P2, or P3 can start executing whenever it receives data

from Ps. But P4 will not start executing until it gets data

Figure 4-2 Structure program for the intersection computation (logical functions: P2 = Ps, P3 = Ps; branch labels: a1, a2, a3)

from all three previous processes. Synchronization is self

guaranteed by data dependence.
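
The computing bodies of P1 to P4 are not listed in the text. As a hedged sketch, the edge test could be written in plain C with the usual orientation (cross product) method; the function names edges_intersect, edge_process and p4_body are assumptions made here. Each edge is the four-element array {x1, y1, x2, y2} described above.

```c
/* Sign of the cross product (q - p) x (r - p):
   +1 for a left turn, -1 for a right turn, 0 if collinear. */
static int orient(double px, double py, double qx, double qy,
                  double rx, double ry)
{
    double v = (qx - px) * (ry - py) - (qy - py) * (rx - px);
    return (v > 0.0) - (v < 0.0);
}

/* For a point r collinear with segment pq: does r lie on the segment? */
static int within(double px, double py, double qx, double qy,
                  double rx, double ry)
{
    double xlo = px < qx ? px : qx, xhi = px < qx ? qx : px;
    double ylo = py < qy ? py : qy, yhi = py < qy ? qy : py;
    return rx >= xlo && rx <= xhi && ry >= ylo && ry <= yhi;
}

/* Returns 1 if edges a and b share at least one point, else 0. */
int edges_intersect(const float a[4], const float b[4])
{
    int o1 = orient(a[0], a[1], a[2], a[3], b[0], b[1]);
    int o2 = orient(a[0], a[1], a[2], a[3], b[2], b[3]);
    int o3 = orient(b[0], b[1], b[2], b[3], a[0], a[1]);
    int o4 = orient(b[0], b[1], b[2], b[3], a[2], a[3]);

    if (o1 * o2 < 0 && o3 * o4 < 0)
        return 1;                                   /* proper crossing */
    if (o1 == 0 && within(a[0], a[1], a[2], a[3], b[0], b[1])) return 1;
    if (o2 == 0 && within(a[0], a[1], a[2], a[3], b[2], b[3])) return 1;
    if (o3 == 0 && within(b[0], b[1], b[2], b[3], a[0], a[1])) return 1;
    if (o4 == 0 && within(b[0], b[1], b[2], b[3], a[2], a[3])) return 1;
    return 0;
}

/* Body of P1, P2, or P3: one edge of A against the three edges of B. */
int edge_process(const float a[4], const float b1[4],
                 const float b2[4], const float b3[4])
{
    return edges_intersect(a, b1) || edges_intersect(a, b2) ||
           edges_intersect(a, b3);
}

/* Body of P4: 1 means 'intersection', 0 means 'no intersection'. */
int p4_body(int L1, int L2, int L3)
{
    return L1 || L2 || L3;
}
```

Like the scheme in the text, this sketch tests edges only, so a triangle lying entirely inside the other would not be reported.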


In structure programming, interprocess communication is

carried out by one directional message passing, which is

realized by a send command. The graphical symbol of send

command is an arrow. In a process, the text of send command

has the format:

send (Pi, Ad, V1, V2,...);

where Pi is the destination process identification and Ad is

the beginning address of Pi's input buffer for data V1, V2, ....

When the send command is used, the input buffer of Pi can be

viewed as a continuous memory space. Send commands from

multiple processes must fill this space correctly. The

beginning address Ad indicates where this process starts

filling and the variable length decides where to end the


For example, process Pc performs matrix multiplication

C=A*B, where matrices A and B come from processes Pa and Pb

as shown in figure 4-3. A pseudo code of Pc is

Pc: matrix multiplication process

input float A[10][10], B[10][10];

local float C[10][10];


Pc =Pa Pb

Figure 4-3 Matrix multiplication C=A*B

Assume Pa has matrix R[10][10] and Pb has matrix T[10][10],

then the send commands for Pa and Pb should be

Pa: send (Pc, A[1][1], R);

Pb: send (Pc, B[1][1], T);

The actual output from Pc will be C=R*T.

Data mapping can be in a more flexible format. Matrices

R and T are not necessarily the same size as A and B. Figure

4-4 is another example where A and B are filled by four

arrays. Suppose

Pa has U[10][5], Pa: send (Pc, A[1][1], U);

Pb has V[10][5], Pb: send (Pc, A[1][6], V);

Pd has W[10][5], Pd: send (Pc, B[1][1], W);

Pe has X[10][5], Pe: send (Pc, B[1][6], X);

Pc still performs C=A*B. But this time, A and B are filled

by four arrays, with U and V mapped to A and W and X to B.

To keep the input data deterministic, each variable in an

input buffer can only be filled once for one computing cycle.

So, send commands used in different processes must take care

not to have datum assignment overlap in the input buffer. If

overlap exists because of programming mistakes, it is not

difficult to detect such errors by a compiler.
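
The once-only filling rule can be checked mechanically. The following C sketch is illustrative only: the type input_buffer, the function send_fill and the use of flat element offsets are assumptions, since the text does not specify how an address such as A[1][6] maps onto the buffer. It rejects a send whose target range overlaps elements that were already filled in the current computing cycle.

```c
#include <stdbool.h>
#include <string.h>

#define BUF_LEN 200   /* e.g. A[10][10] followed by B[10][10] */

/* Hypothetical input buffer of a process, viewed as continuous memory. */
struct input_buffer {
    float data[BUF_LEN];
    bool  filled[BUF_LEN];   /* true once an element has been written */
};

/* Model of send (Pi, Ad, V): copy n values into the buffer at offset.
   Returns 0 on success, -1 if the range is outside the buffer,
   -2 if it overlaps data already sent in this computing cycle. */
int send_fill(struct input_buffer *b, int offset, const float *v, int n)
{
    if (offset < 0 || n < 0 || offset + n > BUF_LEN)
        return -1;
    for (int i = 0; i < n; i++)
        if (b->filled[offset + i])
            return -2;
    memcpy(&b->data[offset], v, (size_t)n * sizeof(float));
    for (int i = 0; i < n; i++)
        b->filled[offset + i] = true;
    return 0;
}

/* True when every element of the buffer has been filled exactly once. */
bool buffer_complete(const struct input_buffer *b)
{
    for (int i = 0; i < BUF_LEN; i++)
        if (!b->filled[i])
            return false;
    return true;
}
```

When the offsets and lengths are constants, the same interval test could in principle be applied by a compiler before the program runs.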

Pc = Pa Pb Pd Pe

Figure 4-4 Matrix multiplication C=A*B
A and B are filled by four processes

When a send command is executed, a control token is sent

to the operating system. The operating system uses this token

to evaluate the associated logical function, checking if the

receiving process is ready for execution or not. Therefore,

the send command can also be used for the purpose of awakening

a process. The format is

send (Pi);

If the current process is Pj, the execution of send (Pi) will

set Pj=TRUE in the logical function of Pi. The requirement

of deterministic computing allows a logical variable to be

changed only once in one computing cycle. As a consequence,

the send (Pi) or send (Pi, Ad, V1, ...) command can only be

executed once in process Pj.

Control Flow

When a program is designed, it has to be designed in

general to accommodate all possible processing situations.

Only a part of the program is executed on each run. Control

operations must be embedded in the program to guide the

execution. In sequential programs, the control flow jumps

among different code sections. In structure programs, the

control flow forms different subgraphs of the program for


The send command and the logical function for a process

are basic control operations in structure programming. It is

worth noting that a send command can access any process


(except the starting process) in a program. Its power is

equivalent to the "go to" command in a sequential program.

Therefore, the send command must be used with care. The

underlying principle of employing control flow in structure

programming is to avoid race and deadlock situations.

A race condition arises when multiple terms in a logical

function are not mutually exclusive. In other words, there

exist more than one branch which compete for awakening a

process. Figure 4-5 shows three different cases. Case (a)

is not a valid graph because Pc can be invoked by either Pa

or Pb. Pa and Pb are two leaf processes that are not mutually

exclusive from each other. Then the computing result of Pc

will depend upon which one, Pa or Pb, reaches Pc first. This is

the indeterministic computation that we want to avoid in the

SP model.

The case (a) program must be modified to the case (b) or

the case (c). If both Pa and Pb need Pc, an extra copy of Pc

should be created. It is named Pd in the case (b). If either

Pa or Pb needs to access Pc, a previous process Ps can be used

to control the exclusive access. In Ps, a variable x may be

tested for choosing the path. If x>=0, call process Pa. If

x<0, call process Pb. Ps may contain the following code:

if (x>=0)

send (Pa, Ad, V1,...);

else

send (Pb, Ad, V1,...);

Figure 4-5 Eliminating race conditions: (a) Pc = Pa + Pb; (b) Pc = Pa, Pd = Pb; (c) Pc = Pa + Pb

There are also other alternatives. For example, we can change

the logical function Pc=Pa+Pb to Pc=PaPb, then decide what

needs to be done inside the Pc process.

Consider the intersection problem again. Suppose that

the starting process Ps in figure 4-2 tests the maximum x of

edge a1 and minimum x of edge b1. If the maximum x of a1 is

less than the minimum x of b1, the intersection calculations

in P1, P2, and P3 are not necessary. Ps can directly send

results to P4. The code in Ps should have the following test.


if (max x of a1 < min x of b1)

{

L1 = L2 = L3 = 0;

send (P4, L1, L1, L2, L3);

}

where the first L1 in the send command is the beginning

address of the input buffer of P4. As mentioned before, it

is the programmer's responsibility to ensure the data

correctness for the execution of a process. P4 has three

input variables L1, L2, and L3. So, Ps must prepare these

data and explicitly send them to P4.

The program graph is depicted in figure 4-6. Notice that

the logical function of P4 has two terms: P4=P1P2P3+Ps. If,

by mistake, the "send (P4, L1, L1, L2, L3)" statement is put

Figure 4-6 Intersection computation with a jump branch (logical functions: P1 = Ps, P2 = Ps, P4 = P1P2P3 + Ps)

outside the if block, the output of Ps and the outputs

of P1, P2, and P3 will compete for P4. A race condition

arises, and the program is wrong. Send commands and logical

functions do provide flexible control in programs; however,

they also provide error sources. Automatic detection of race

conditions by compiler and operating system is possible. This

problem will be addressed in a later section.

Besides the branch operation, another control primitive

is the feedback, or loop operation. Consider the intersection

testing problem to be that of a mobile robot in a hazardous

working environment. Let the triangle A represent a surface

of the mobile robot, and assume the triangle B is the

dangerous area that the robot should not go into. When the

robot is moving, the program in figure 4-6 must be executed

repeatedly. Whenever an intersection is detected, the output

signal from P4 should stop the robot. For this application,

a loop operation is needed.

If feedback branches exist in a program, its structure

changes from a combinational logic to a finite state machine.

Synchronization and deadlock problems may arise in program

design. The program in figure 4-7 is given as an example.

If process Pb keeps on sending data out to Pd for every

computing loop, Pd has synchronization problems. It has to

decide which Pb output should go with Pc's output. Moreover,

the program will also run into a deadlock status, since Pc

only sends data to Pd one time. After using up this control

Figure 4-7 An illegal loop graph


token, the logical function of Pd cannot be set to TRUE again

no matter how many Pb control tokens are fed to Pd.

To eliminate synchronization and deadlock problems, a

loop operation should be defined as a macro process MP. An

MP is a subgraph constructed according to certain rules. Each

MP has a starting process MPs and an ending process MPe. All

inputs for the MP must go through MPs and all outputs of MP

must come out from MPe. The rest of MP processes are isolated

from the outside world. They only communicate to those

processes inside the MP. Figure 4-8 shows three different

MPs. Case (a) is a valid MP. The MPs is initially invoked

by the input branch. Then it is repeatedly awakened by the

loop branch. When loop computing finishes, the MPe cuts the

feedback branch and sends data out through the output branch.

Cases (b) and (c) are not valid MPs because they have branches

arriving from or departing to outside processes.

Testing the validity of an MP is simple. Every process

defined as an inner process in the MP cannot have any branch that

comes in from or goes out to a process outside the MP. Table 4-1 is the connection

table that contains all link information for a program. All

1s shown in a column denote where a process gets inputs from.

All 1s shown in a row indicate where the outputs of this

process go. The link information about the program in figure

4-8 (a) is shown in the square area of Table 4-1. A valid MP

must have all its 1s kept inside the square area, bounded by

processes MPs and MPe. If a row search or a column search


Figure 4-8 Defining loop process macros: (a) valid MP; (b) not valid MP; (c) not valid MP

Table 4-1 Connection table (rows: sending processes; columns: receiving processes; both ordered MPs, Pa, Pb, Pc, Pd, Pe, Pf, Pg)


finds one or more 1s outside of this square, it indicates an

invalid MP. The two marked 1s in table 4-1 correspond to two

invalid MP graphs in figure 4-8.
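
The row and column search can be sketched in C as follows. The table size, the process numbering and the function name mp_is_valid are assumptions made for illustration. The check follows the rule stated above: a link that crosses the boundary of the MP is legal only if it enters at MPs or leaves from MPe.

```c
#define NPROC 6   /* number of processes in this hypothetical program */

/* conn[i][j] is 1 if process i sends to process j.
   in_mp[i] is 1 if process i belongs to the macro process.
   mps and mpe are the indices of the starting and ending processes.
   Returns 1 for a valid MP, 0 otherwise. */
int mp_is_valid(int conn[NPROC][NPROC], const int in_mp[NPROC],
                int mps, int mpe)
{
    for (int i = 0; i < NPROC; i++) {
        for (int j = 0; j < NPROC; j++) {
            if (!conn[i][j])
                continue;
            if (!in_mp[i] && in_mp[j] && j != mps)
                return 0;   /* input that bypasses MPs */
            if (in_mp[i] && !in_mp[j] && i != mpe)
                return 0;   /* output that bypasses MPe */
        }
    }
    return 1;
}
```

In the test program used for this sketch, process 0 feeds MPs (index 1), processes 2 and 3 are inner processes, MPe (index 4) loops back to MPs and sends the final output to process 5. Adding a link from inner process 2 to process 5 makes the MP invalid.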

In program structure, a macro process is equivalent to

a single process. Therefore, the life time of data must be

decided for the processes in an MP. For a process not

associated with any MP, its input and local data are no longer

kept in memory after its termination. However, all MP

processes can keep their data during the life time of the MP.

If the MP conducts a loop computation, a new iterative step

uses whatever data are in the memory modules reserved for inner

processes. A practical example will be discussed in the

simulation section.

The loop operation is not the only usage of process

macros. A process macro can be formed according to other

functional features, for example to design programs in a

functional hierarchy. Nevertheless, process macros should be

used with restraint because MPs tend to use more memory space

during execution.

System Control

The SP model is a high level construct that shifts

implementation detail from users to systems. To run SP

programs in a real parallel computer, the system must have a

proper compiler and an operating system that supports the SP


Since all SP processes are written in conventional

languages such as FORTRAN and C, there is little new

about process compiling. What is new in the SP model is

the structure, characterized by process connection and logical

functions. An SP compiler must build a connection table that

holds all link information for a program. It also needs to

set a logical function table for the operating system for run-

time usage. Actually, the connection table and the logical

function table can be combined into a system control

table. After building this table, the compiler can use it to

detect programming errors such as race and deadlock. The

operating system can also use it to start and to terminate

processes. The control table serves as a run time monitor

that stores the execution status and reports errors.

System control table

The system control table is constructed based on the

program graph and logical functions. Figure 4-9 shows a

program graph and its logical functions. Its control table

is table 4-2. To construct this table, we start from Ps and

fill the table row by row according to the connection

information. To fill a particular column, a search of the

column is conducted first to see if there are any pre-filled

entries. If there are none, fill 1 at the current position.

If there are any, the logical function for this process should

be checked to see if the current process has an AND or OR


Figure 4-9 Sample SP program graph (logical functions include Pg = PePf + Pb)

Table 4-2 System control table for the sample SP program (columns: receiving processes Ps, Pa, Pb, Pc, Pd, Pe, Pf, Pg, Ph; entries in column Pg: row Pb = 1, row Pe = 2, row Pf = 2)


relation with the previous one. If it is AND, fill the same

number as the pre-filled entries into current position. If

it is OR, fill the current position by the number of a pre-

filled entry plus 1. As an example, let us consider the row

of Pb in table 4-2. When Pb first fills the

column Pg, the Pg column is empty. So, the position is filled

by 1. When Pe starts to fill the Pg column, the column is not

empty. So, the logical function is checked. Since Pe has OR

relation with Pb, 2 is filled into this position. Next, Pf

fills the Pg. It is not empty as before, but, this time Pf

has an AND relation with Pe. Therefore, 2 is filled in at the

(Pf,Pg) position.
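
The filling rule for one column can be sketched in C. The representation is an assumption made here: term_of[row] names the product term of the column's logical function that contains the sending process of that row, or -1 if there is no link, and each sender is taken to appear in at most one term. Rows are visited from top to bottom, senders in the same term (AND relation) share a number, and each newly met term (OR relation) receives the next number.

```c
#define NPROC 9
enum { PS, PA, PB, PC, PD, PE, PF, PG, PH };   /* row and column order */

/* Fill column col of the control table from the logical function of
   the receiving process. term_of[row] is the index (0 .. NPROC-1) of
   the product term containing the sender of that row, or -1. */
void fill_column(signed char table[NPROC][NPROC], int col,
                 const int term_of[NPROC])
{
    int number_of_term[NPROC] = {0};   /* 0 means not numbered yet */
    int next = 1;

    for (int row = 0; row < NPROC; row++) {
        int t = term_of[row];
        if (t < 0)
            continue;                        /* no link from this row */
        if (number_of_term[t] == 0)
            number_of_term[t] = next++;      /* OR: start a new group */
        table[row][col] = (signed char)number_of_term[t];   /* AND: reuse */
    }
}
```

For Pg = PePf + Pb, with Pe and Pf in one term and Pb in another, the column Pg receives 1 in row Pb and 2 in rows Pe and Pf, as in the walk-through above.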

Table entries can be coded to denote different status.

Assume 8 bits are used for each entry, ranging from -128 to

+127. 1 to 126 can represent 126 input branches with OR

relation for one process. Suppose a process Pi has a send

command that sends data to process Pj. When the command is

executed, the operating system searches the Pi row and sets

the number at (Pi,Pj) to negative. If a group of numbers, all

1s or all 2s etc., is set to negative in one column, the

logical function for the process of this particular column is

TRUE. This process is then ready for execution.

The program in figure 4-9 and table 4-2 can be used as

an example. After executing "send (Pg,..)" command in process

Pe, the entry (Pe,Pg) is set to -2. Similarly, after

executing send(Pg,...) in Pf, the entry (Pf,Pg) is also set


to -2. Since all 2s are set to their negative, the term PePf

in the logical function Pg=PePf+Pb becomes TRUE and Pg is

ready for execution.
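
The run-time bookkeeping can be sketched in C under the same assumed table layout. The function name on_send and its return codes are hypothetical. Executing a send negates the entry at (from, to); the receiving process becomes ready when no entry of the same group number remains positive in its column.

```c
#define NPROC 9
enum { PS, PA, PB, PC, PD, PE, PF, PG, PH };   /* row and column order */

/* Record that process `from` executed a send to process `to`.
   Returns  1 if the receiving process is now ready for execution,
            0 if other members of the same AND group are still pending,
           -1 if there is no such link or it was already used. */
int on_send(signed char table[NPROC][NPROC], int from, int to)
{
    int group = table[from][to];
    if (group <= 0)
        return -1;
    table[from][to] = (signed char)(-group);
    for (int row = 0; row < NPROC; row++)
        if (table[row][to] == group)
            return 0;        /* a positive entry of this group remains */
    return 1;
}
```

The -1 result is where a run-time system could report a second send on the same branch within one computing cycle.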

Since the control table stores information of all

executed paths for a program, it can be used to display the

execution graph at run time. Before starting the program, all

process graphics symbols (circles) can be colored white.

When a process is ready for execution, its color can be set

to green. The color changes to red when a process is

assigned to a real processor. After execution, the color

becomes black. The control table provides enough information

for coloring the processes. The display of execution

procedures by graphics is a very useful feature for monitoring

and debugging.


After constructing the control table, the compiler should

check the correctness of the structure, i.e. test the validity

of a program graph. A valid graph must have no obvious race

conditions and no feedback branches outside the process

macros. In graph theory, detecting race conditions is

equivalent to finding a common destination for multiple paths.

Detecting an illegal feedback is finding a cyclic graph. Both

of them are well studied graph problems and standard

algorithms are available [SED83].
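
As one example of such a standard algorithm, a feedback branch can be found by depth-first search with three node states. This C sketch is a generic textbook technique and not necessarily the method of [SED83]. The table size and function names are assumptions, and each valid macro process is assumed to have been collapsed into a single node beforehand so that its internal loop is not reported.

```c
#define NPROC 4   /* number of nodes in this hypothetical program graph */

/* Depth-first visit. state: 0 = unvisited, 1 = on the current path,
   2 = fully explored. Returns 1 if a cycle is reachable from u. */
static int visit(int conn[NPROC][NPROC], int u, int state[NPROC])
{
    state[u] = 1;
    for (int v = 0; v < NPROC; v++) {
        if (!conn[u][v])
            continue;
        if (state[v] == 1)
            return 1;                    /* back edge: feedback branch */
        if (state[v] == 0 && visit(conn, v, state))
            return 1;
    }
    state[u] = 2;
    return 0;
}

/* Returns 1 if the connection table contains a feedback branch. */
int has_feedback(int conn[NPROC][NPROC])
{
    int state[NPROC] = {0};
    for (int u = 0; u < NPROC; u++)
        if (state[u] == 0 && visit(conn, u, state))
            return 1;
    return 0;
}
```

The search visits every entry of the connection table at most once per row, so its cost grows with the square of the number of processes for this dense table representation.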

However, a valid program is only a necessary but not

sufficient condition for correct computation. A number of


race and deadlock errors are caused by program bugs that

cannot be detected directly from the graph. For example, a

send command that should be inside an "if...end if" block

is put outside the block by mistake. That may cause a race

at program execution time.

A practical method of finding all race and deadlock errors is to
simulate the execution logic of the program. The logical function
defines the input logic of a process, and the send commands embedded
in the process code define its output logic. Starting from the leaf
processes, the simulation program goes through all possible input
and output combinations of the program. If any process that is not
in a loop process macro is invoked more than once, a race is
detected. If the simulation stops before the end of the program, a
deadlock is detected. All paths traversed by the simulation program
are stored and can be displayed when an error is found.
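The simulation described above can be sketched as a brute-force enumeration. This is a simplified illustration under several assumptions, not the dissertation's simulator: each logical function is written as a list of product terms (sets of sender names), each process lists its alternative send decisions explicitly, loop process macros are not modelled, and the names run_once and simulate are hypothetical.

```python
from itertools import product


def run_once(inputs, choice, final):
    """Simulate one fixed combination of send decisions.

    Returns "race", "deadlock", or None when the run is clean.
    """
    received = {name: set() for name in inputs}   # senders seen so far
    fired = set()
    # Leaf processes have no input terms and are ready at program start.
    queue = [name for name, terms in inputs.items() if not terms]
    while queue:
        proc = queue.pop(0)
        if proc in fired:
            return "race"                  # invoked more than once
        fired.add(proc)
        for dest in choice[proc]:
            received[dest].add(proc)
            # Sum of products: any satisfied term invokes the destination.
            if any(term <= received[dest] for term in inputs[dest]):
                queue.append(dest)
    # Stopping before the final process has run counts as a deadlock.
    return None if final in fired else "deadlock"


def simulate(inputs, outputs, final):
    """Try every combination of send alternatives; collect the failing ones."""
    names = sorted(outputs)
    errors = []
    for combo in product(*(outputs[name] for name in names)):
        choice = dict(zip(names, combo))
        kind = run_once(inputs, choice, final)
        if kind is not None:
            errors.append((kind, choice))  # keep the path for display
    return errors


# Hypothetical four-process program: P4 = P2 + P3 (either input invokes it).
example_inputs = {"P1": [], "P2": [{"P1"}], "P3": [{"P1"}],
                  "P4": [{"P2"}, {"P3"}]}
# P3 either sends to P4 or sends nothing, depending on a run-time condition.
example_outputs = {"P1": [("P2", "P3")], "P2": [("P4",)],
                   "P3": [("P4",), ()], "P4": [()]}
```

Calling simulate(example_inputs, example_outputs, "P4") flags the combination in which both P2 and P3 send to P4 as a race. Changing the logical function of P4 to the single product term {"P2", "P3"} instead flags the combination in which P3 sends nothing as a deadlock.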

The simulation tests only the logical correctness of SP programs,
so it executes quickly. The complexity of the simulation is
proportional to the number of executable subgraphs in a program.
For practical applications, this number is not very large.

Invoking and terminating processes

To run an SP program, the operating system must be able to decide
when to invoke and when to terminate a process. The control table is
a useful tool for making such decisions. All leaf processes have
empty columns in the control table. When a program starts, the
operating system checks the control table and puts the processes
with an empty column into a queue to wait for available processors.
When the execution of a leaf process reaches a send command, a
control token is sent to the operating system, and the corresponding
entry in the control table is set to its negative value. Once all
entries with the same number in a column have been set negative, the
corresponding process is ready for execution. This procedure
continues until the end of the program.
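The invocation rule can be sketched with a small dictionary-based control table. This is a hedged illustration, not the operating system's actual data structure: each column is stored as a mapping from sender to term number, the helper names leaf_processes and receive_token are hypothetical, and the assignment of the number 1 to the Pb term of Pg=PePf+Pb is an assumption (the text only states that the PePf term uses 2).

```python
def leaf_processes(table):
    """Processes whose column is empty; they are queued at program start."""
    return [name for name, column in table.items() if not column]


def receive_token(table, sender, dest):
    """Negate the entry for a control token and report readiness.

    Returns True when every entry in dest's column that carries the same
    term number as the sender's entry is now negative.
    """
    column = table[dest]
    term = abs(column[sender])
    column[sender] = -term                 # mark this input as arrived
    return all(value < 0 for value in column.values() if abs(value) == term)


# Column for Pg = PePf + Pb: Pe and Pf share term 2, Pb has term 1 (assumed).
table = {
    "Ps": {},                              # leaf process: empty column
    "Pg": {"Pe": 2, "Pf": 2, "Pb": 1},
}
```

With this table, a token from Pe alone leaves Pg waiting, and the following token from Pf makes the PePf term TRUE, so Pg becomes ready.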

Table 4-3 shows a sample control table after executing the program
in figure 4-9. Processes Pb and Pd chose, based on run-time
conditions, to send data to Pe and Pf instead of to Pg and Ph. The
execution subgraph reconstructed from the control table is depicted
in figure 4-10.
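Reconstructing the executed subgraph reduces to collecting the negative entries of the table. The sketch below assumes the same hypothetical layout as a mapping from each receiving process to its column of sender entries; the function name executed_edges is not from the source, and the sample column is only the Pg column implied by Pg=PePf+Pb, not a full copy of Table 4-3.

```python
def executed_edges(table):
    """Return (sender, receiver) pairs whose control-table entry is negative."""
    return sorted(
        (sender, receiver)
        for receiver, column in table.items()
        for sender, value in column.items()
        if value < 0                       # negative means a token was sent
    )


# Pg column after a run in which Pe and Pf sent to Pg and Pb did not
# (the value 1 for Pb is an assumed term number).
after_run = {"Pg": {"Pe": -2, "Pf": -2, "Pb": 1}}
```

Here executed_edges(after_run) yields the edges from Pe to Pg and from Pf to Pg, while the unused edge from Pb to Pg is omitted.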

A process terminates normally when it completes its computation. If
the process is not in a loop process macro, the operating system
marks it as finished. Any attempt to reawaken such a process is
reported as a run-time race error.
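The termination rule can be sketched as a small bookkeeping class. This is an assumption-laden illustration, not the operating system described in the text: the names RaceError and ProcessRegistry are hypothetical, and membership in a loop process macro is passed in as a plain set of names.

```python
class RaceError(RuntimeError):
    """Raised when a finished process is invoked again."""


class ProcessRegistry:
    """Tracks finished processes and rejects attempts to reawaken them."""

    def __init__(self, loop_members=()):
        self.finished = set()
        self.loop_members = set(loop_members)   # processes inside loop macros

    def invoke(self, name):
        """Check that invoking the process is legal."""
        if name in self.finished:
            raise RaceError(f"run-time race: {name} reawakened after finishing")

    def terminate(self, name):
        """Record normal termination; loop-macro processes may run again."""
        if name not in self.loop_members:
            self.finished.add(name)
```

A process outside any loop macro can pass invoke once; after terminate, a second invoke raises RaceError, whereas a process listed in loop_members can be invoked repeatedly.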

Sometimes certain running processes become useless when one of them
branches out. For example, the process Pb in figure 4-9 may detect
some condition and jump to Pg. The logical function of Pg is
Pg=PePf+Pb. Once Pb has triggered Pg, the outputs of Pe and Pf must
not arrive if the race-free agreement is to be maintained. Then, the
computing of Pe and Pf become



Ps Pa Pb Pc Pd Pe Pf Pg Ph
Ps -1 -1 -1 -1
Pa -1
Pb -1 1
Pd -1
Pe -2
Pf -2



Table 4-3 System control table after execution

Figure 4-10 Reconstructed program graph after execution
