A PARALLEL ALGORITHM FOR
DISTRIBUTED MINIMUM VARIANCE DISTORTIONLESS RESPONSE
BEAMFORMING
By
JESUS LUIS GARCIA
A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
UNIVERSITY OF FLORIDA
1999
For my family
ACKNOWLEDGMENTS
I would like to thank Dr. Alan George for allowing me to pursue my research
interests in the HCS Lab and for encouraging me and believing in me during unfortunate
circumstances. I especially want to thank Ken Kim, without whose guidance and advice
this work would not have been possible.
I owe much to my family at Z-Systems, which has helped me grow intellectually
and musically. I specifically want to thank Glenn Zelniker for showing me the wonders
of DSP, Rainier De Varona for giving me the opportunity to hear him vent, and Dale
Ramcunis for enriching my palate. I must also thank Z-Systems for those extra lines on
my resume.
My fellow lab members deserve many thanks for hours of wasted time and
entertainment. I thank the Trilogy for many dinners and classic rock quizzes, the CGs for
a great team effort, the Beefcakes for letting me train their muscles and their minds, and
the ITMSBs for letting me teach them how to speak proper English. I especially want to
thank BBB, Mr. Bigglesworth, Gomez Christopher, and Jefferson for allowing me to call
them by any ridiculous name I could think of, and many other things.
Last but definitely not least, I want to thank my family for their infinite love and
support. I would also like to thank my many friends for years of good memories and
unremembered nights.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT

CHAPTERS

1 INTRODUCTION
2 BACKGROUND
   2.1 MPI
   2.2 Threads
   2.3 Performance measurements
3 OVERVIEW OF MVDR ALGORITHM
4 SEQUENTIAL MVDR ALGORITHM AND PERFORMANCE
   4.1 Sequential MVDR tasks
   4.2 Performance results for sequential MVDR algorithm
5 PARALLEL MVDR ALGORITHM
   5.1 Computation component of DP-MVDR
   5.2 Communication component of DP-MVDR
6 PARALLEL MVDR PERFORMANCE
7 CONCLUSIONS

APPENDICES

A DATA MODEL
B VERIFICATION OF RESULTS
C PARALLEL AND SEQUENTIAL C++ CODE
D MATLAB SOURCE CODE

REFERENCES
BIOGRAPHICAL SKETCH
Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
A PARALLEL ALGORITHM FOR
DISTRIBUTED MINIMUM VARIANCE DISTORTIONLESS RESPONSE
BEAMFORMING
By
Jesus Luis Garcia
December 1999
Chairman: Alan D. George
Major Department: Electrical and Computer Engineering
Quiet submarine threats and high clutter in the littoral environment increase
computation and communication demands on beamforming arrays, particularly for
applications that require in-array autonomous operation. By coupling each transducer
node in a distributed array with a microprocessor, and networking the nodes together,
embedded parallel processing for adaptive beamformers can glean advantages in
execution speed, fault tolerance, scalability, power, and cost. In this thesis, a new parallel
algorithm for Minimum Variance Distortionless Response (MVDR) adaptive
beamforming on distributed systems is introduced for in-array sonar signal processing.
Performance results are also included, among them execution times, parallel efficiencies,
and memory requirements, obtained on a distributed system testbed composed of a
cluster of workstations connected by a high-speed network.
CHAPTER 1
INTRODUCTION
Beamforming is a method of spatial filtering and is the basis for all array signal
processing. Beamforming attempts to extract a desired signal from a signal space
cluttered by interference and noise. Specifically, sonar beamforming uses an array of
spatially separated sensors, or hydrophones, to acquire multichannel data from an
undersea environment. The data is filtered spatially to find the direction of arrival of
target signals and to improve the signal-to-noise ratio (SNR). An array can be configured
arbitrarily in three dimensions, but this thesis assumes a linear array with equispaced
nodes designed to process narrowband signals.
Adaptive beamforming (ABF) optimizes a collection of weight vectors to localize
targets, via correlation with the data, in a noisy environment. These weight vectors
generate a beampattern that places nulls in the direction of unwanted noise (i.e., signals,
called interferers, from directions other than the direction of interest). As opposed to
conventional beamforming (CBF) techniques that calculate these weight vectors
independently of the data, an adaptive beamformer uses the crossspectral density matrix
(CSDM) to extract the desired signal without distortion and a minimum of variance at the
output. Numerous ABF algorithms have been proposed in the literature14, but this thesis
focuses on the Minimum Variance Distortionless Response (MVDR) algorithm as
presented by Cox et al.5 MVDR is an optimum beamformer that chooses weights to
minimize output power subject to a unity gain constraint in the steering direction. The
steering direction is the bearing that the array is "steered" toward to look for a particular
incoming signal.
This thesis introduces a novel parallel algorithm for MVDR beamforming in a
distributed fashion that scales with problem and system size on a linear array. This new
algorithm is designed for in-array processing, where the nodes are hydrophones coupled
with low-power DSP microprocessors connected via a point-to-point communications
network. The parallel nature of such a sonar array eliminates the single point of failure
inherent in arrays served by one processor, making the system more reliable. By
harnessing the aggregate computational performance, parallel processing on a distributed
sonar array holds the potential to realize a variety of conventional, adaptive, and
matched-field algorithms for real-time processing.
Several studies in the literature on computational algorithms for sonar signal
processing have exploited the increased computational power attained by a parallel
system that scales with the problem size. George et al. [6] developed three parallel
algorithms for frequency-domain CBF. George and Kim [7] developed a coarse-grained
and a medium-grained parallel algorithm for split-aperture CBF (SA-CBF). Both of
these sets of algorithms were designed for distributed systems composed of networked
processors, each with local memory. Banerjee and Chau [8] proposed a fine-grained MVDR
algorithm for a network of general-purpose DSP processors using a QR-based matrix
inversion algorithm. Their results, obtained through the use of an Interprocessor
Communication (IPC) cost function, show that when the number of processors matches
the problem size, the interprocessor communication costs exceed the computational
savings.
Most of the work conducted to date on parallel adaptive algorithms has been
targeted for systolic array implementations [9-13]. These parallel algorithms exhibit real-
time processing capabilities but are designed for implementation on a single VLSI
processor and therefore cannot easily scale with varying problem and array size.
Moreover, these systolic methods are inherently fine-grained algorithms that are not
generally appropriate for implementation on distributed systems because of the
communication latencies inherent in such architectures. One of the major contributors to
execution time in MVDR is a matrix inversion that must be performed on the CSDM
with each iteration of beamforming. Many fine-grained algorithms exist for parallel
matrix inversion [14-19]. However, like systolic array algorithms, these algorithms generally
require too much communication to be feasible on a distributed system.
The parallel MVDR algorithm for adaptive beamforming presented in this thesis
addresses both the computation and communication issues that are critical for achieving
scalable performance. Beamforming iterations are decomposed into stages and executed
across the distributed system via a loosely coupled pipeline. Within each pipelined
iteration, the tasks that exhibit the highest degree of computational complexity are further
pipelined. Moreover, data packing is employed to amortize the overhead and thereby
reduce the latency of communication, and multithreading is used to hide much of the
remaining latency.
Background information necessary to understand some of the implementation
details of the parallel algorithm is given in Chapter 2. A theoretical overview of MVDR
is given in Chapter 3. In Chapter 4, the sequential algorithm is examined in terms of its
individual tasks and their performance. In Chapter 5, a new distributed-parallel MVDR
algorithm is presented. A performance analysis comparing the parallel algorithm and the
sequential algorithm in terms of execution time, speedup, and efficiency is given in
Chapter 6. Finally, Chapter 7 concludes with a summary of the advantages and
disadvantages of the parallel algorithm as well as a synopsis of the results and directions
for future research.
CHAPTER 2
BACKGROUND
The parallel programs employed in this thesis use a paradigm for communication
on distributed-memory multicomputers known as message passing. Message passing
allows multiple processes in a system to communicate through messages. The Message
Passing Interface (MPI) is a standard developed by the Message Passing Interface
Forum [20] that defines the syntax and semantics of a core set of library functions for
communication on distributed-memory multicomputers, designed to be highly portable
and easy to use. The MPI standard is implementation independent, meaning that it can be
implemented differently depending on the target system. Thus, there are several
proprietary as well as public-domain implementations of MPI.
For this research, a public-domain implementation called MPICH was used.
Since MPICH is freely distributed and is meant for high portability, it makes no
assumptions about the underlying network and therefore does not take advantage of
features such as physical broadcast capabilities. Instead, MPICH uses point-to-point
communications running over a robust TCP/IP protocol stack to simulate broadcast
communications. Communicating over TCP/IP increases the communication overhead
substantially, especially for short messages. In the parallel MVDR algorithm presented
in this thesis, one technique used to hide the communication latency is to use a separate
thread to handle communications. Even though MPI does not offer explicit support for
threads, it is nevertheless thread-safe: only the calling thread is affected by calls to MPI
functions.
The parallel programs in this thesis use threads and MPI to achieve
communication and synchronization among the nodes. The following two sections in this
chapter will explain the basics of programming with MPI and threads in order to provide
a clearer understanding of the mechanisms used to hide the communication latency.
Section 2.3 will give an explanation of the performance measurements used to evaluate
the performance of the parallel algorithm.
2.1 MPI
Included in the MPI standard is a set of point-to-point and collective
communication operations. MPI provides 125 functions, but most parallel programs
require only six basic functions, which are:
1. MPI_Init - initializes MPI,
2. MPI_Comm_rank - gets the process ID,
3. MPI_Comm_size - gets the total number of processes,
4. MPI_Send - sends a message,
5. MPI_Recv - receives a message,
6. MPI_Finalize - cleans up all MPI state and exits MPI.
These are the six MPI functions used in the code for the parallel MVDR algorithm. The
prototypes of the MPI_Send and MPI_Recv functions are as follows:
* int MPI_Send(&message, count, datatype, dest, tag, communicator)
* int MPI_Recv(&message, count, datatype, source, tag, communicator, &status).
The message parameter is the location of the data that is to be sent or received. The count
parameter specifies how many data elements of type datatype are to be sent or received.
MPI defines a set of data types that are analogous to the C data types (i.e., MPI_INT,
MPI_DOUBLE, MPI_BYTE, etc.). MPI also defines the concept of a communicator, which
is the collection of processes that can communicate with one another. Most MPI function
calls must specify the communicator. The rank of a process is a unique integer
assigned to each process in a communicator to distinguish between them. It is this rank
that serves as the source and destination in point-to-point functions. The tag parameter
serves to distinguish messages from a single process. The status parameter in the
MPI_Recv function reports the tag and source of the received message.
MPI also provides a set of collective communication primitives. There are three
types of collective operations: (1) synchronization, (2) data movement, and (3)
collective computation. The synchronization primitive is MPI_Barrier(comm), which
blocks until all processes in the communicator comm call it. Data movement functions
include MPI_Bcast, MPI_Gather, MPI_Allgather, MPI_Scatter, MPI_Alltoall, and many
others. The collective computation functions (e.g., MPI_Reduce, MPI_Scan, etc.) allow
computations to be performed on a set of collective data and distribute the results in a
specified fashion among the nodes.
The parallel code uses multiple point-to-point communication primitives,
MPI_Send and MPI_Recv, to perform all-to-all communication. It was found empirically
that using multiple MPI_Send and MPI_Recv function calls is actually faster than using
MPI collective communication primitives such as MPI_Bcast and MPI_Allgather.
However, results may differ for a different testbed and MPI implementation. The high-
level pseudocode for how MPI_Send and MPI_Recv were used in the parallel code to
perform all-to-all communication is given in Figure 2.1.
MPI_Barrier(MPI_COMM_WORLD);  // synchronize nodes
/* Begin communication */
FOR i = 0 TO number of nodes - 1  // send to all nodes
    IF (i != my_rank)  // do not send to yourself
        MPI_Send(&my_data, num_of_data_elements, MPI_DOUBLE, i, tag(i),
                 MPI_COMM_WORLD);
    ENDIF
ENDFOR
FOR j = 0 TO number of nodes - 1  // receive from all nodes
    IF (j != my_rank)  // do not receive from yourself
        MPI_Recv(&data_buffer, num_of_data_elements, MPI_DOUBLE, j, tag(j),
                 MPI_COMM_WORLD, &status);
    ENDIF
ENDFOR
Figure 2.1. Pseudocode of all-to-all communication using point-to-point primitives.
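The pseudocode above maps directly onto the C bindings of MPI. The following
self-contained sketch is illustrative rather than the thesis code; the NUM_SAMPLES
constant, the function name, and the use of a single common tag are assumptions. Note
that because every node sends before it receives, this sketch relies on MPI buffering the
small messages; larger payloads would call for a safer ordering or nonblocking sends.

#include <mpi.h>

#define NUM_SAMPLES 64  /* data elements exchanged per cycle (assumed) */

/* All-to-all exchange built from point-to-point primitives, as in Figure 2.1. */
void all_to_all(double *my_data, double *data_buffer, int my_rank, int num_nodes)
{
    MPI_Status status;
    int i, j;

    MPI_Barrier(MPI_COMM_WORLD);               /* synchronize nodes */
    for (i = 0; i < num_nodes; i++)            /* send to all other nodes */
        if (i != my_rank)
            MPI_Send(my_data, NUM_SAMPLES, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
    for (j = 0; j < num_nodes; j++)            /* receive from all other nodes */
        if (j != my_rank)
            MPI_Recv(&data_buffer[j * NUM_SAMPLES], NUM_SAMPLES, MPI_DOUBLE,
                     j, 0, MPI_COMM_WORLD, &status);
}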
2.2 Threads
A thread of execution is the stream of instructions in a program executed by the
CPU. Each thread of execution is associated with a thread, an abstract data type
representing the flow of control within a process. A thread has its own execution stack,
program counter value, register set, and state. Threads give the illusion of concurrent
execution by context switching, in which the processor dedicates a certain amount of time
to the execution of one thread, then switches context and executes another thread. The
thread package used to code the parallel MVDR algorithm is the POSIX.1c [21] standard.
This standard defines function calls for thread management, mutual exclusion, and
condition variables.
In the parallel code, six functions are used for thread handling; they cover thread
creation and destruction as well as thread synchronization:
1. pthread_create - creates a new thread to execute a specified function,
2. pthread_join - blocks the calling thread until the specified thread exits,
3. pthread_mutex_lock - acquires a lock to protect a critical section,
4. pthread_mutex_unlock - releases the lock,
5. pthread_cond_wait - blocks the calling thread until a signal is received,
6. pthread_cond_signal - sends a signal to awaken a blocked thread.
The remainder of this section will describe how thread synchronization is achieved by the
use of functions 3, 4, 5, and 6.
Multiple threads within a process must share resources. Thread synchronization
is used to ensure that threads get consistent results when they share resources. The
POSIX.1c standard provides mutexes (i.e., mutual exclusion variables) and condition
variables to support the two types of synchronization: locking and waiting. A thread
locks a resource when it has exclusive access to it. Locking is typically of short duration.
A thread is waiting when it blocks until the occurrence of some event.
The parallel code uses a combination of mutexes and condition variables to
achieve synchronization among the threads. Mutexes are used to protect critical sections
and obtain exclusive access to resources. A thread calls the function
pthread_mutex_lock(&my_lock) to acquire a lock at the beginning of the critical section.
The same thread calls pthread_mutex_unlock(&my_lock) to release the lock at the end of
the critical section. Condition variables allow a thread to block until a certain event or
combination of events occurs. A thread tests a predicate (i.e., a condition) and calls
pthread_cond_wait if the predicate is false. When another thread changes variables that
might make the predicate true, it awakens the waiting thread by executing
pthread_cond_signal. To ensure that the test of the predicate and the wait are atomic, the
calling thread must obtain a mutex before it tests the predicate. If the thread blocks on a
condition variable, pthread_cond_wait atomically releases the mutex and blocks. After
returning from pthread_cond_wait, the predicate must be tested again, since the return
may have been caused by a pthread_cond_signal that does not guarantee the predicate
has become true.
The parallel MVDR algorithm was coded using one thread for communication
and another thread for computation. The threads must synchronize at the beginning and
end of each communication cycle. The thread that finishes its cycle first calls
pthread_cond_wait and blocks until the other thread calls pthread_cond_signal to
awaken the blocking thread. Figure 2.2 gives the high-level pseudocode for the thread
synchronization sections of the parallel code.
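As a self-contained C rendering of this handshake, consider the following minimal
sketch; the flag and variable names follow Figure 2.2, but the wrapper functions and
static initializers are assumptions rather than the thesis code.

#include <pthread.h>

static pthread_mutex_t my_lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t start_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond       = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  start_cond = PTHREAD_COND_INITIALIZER;
static int done = 0, startnow = 0;

/* Called by the computation thread between pipeline stages: wait until the
   communication thread reports completion, then release it for a new cycle. */
void computation_sync(void)
{
    pthread_mutex_lock(&my_lock);
    while (!done)                          /* re-test predicate on every wakeup */
        pthread_cond_wait(&cond, &my_lock);
    done = 0;
    pthread_mutex_unlock(&my_lock);

    pthread_mutex_lock(&start_lock);
    startnow = 1;
    pthread_cond_signal(&start_cond);      /* wake the communication thread */
    pthread_mutex_unlock(&start_lock);
}

/* Called by the communication thread around each all-to-all exchange. */
void communication_cycle(void)
{
    pthread_mutex_lock(&start_lock);
    while (!startnow)
        pthread_cond_wait(&start_cond, &start_lock);
    startnow = 0;
    pthread_mutex_unlock(&start_lock);

    /* ... all-to-all communication of the next d samples (see Figure 2.1) ... */

    pthread_mutex_lock(&my_lock);
    done = 1;
    pthread_cond_signal(&cond);            /* wake the computation thread */
    pthread_mutex_unlock(&my_lock);
}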
2.3 Performance measurements
The performance of the parallel MVDR algorithm was gauged by measuring its
execution time, scaled speedup, and parallel efficiency, and comparing them with those
of the sequential algorithm.
Execution time is a measure of the amount of time for an application to complete
execution. The execution time was measured by calling a high-resolution timing function
in UNIX (i.e., gethrtime) that returns the time, expressed in nanoseconds, since some
arbitrary time in the past. The function was called at the beginning of the section of
interest and the return value saved. It was called again at the end of the section and the
two times subtracted to get the total execution time of the section of interest.
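As a minimal sketch of this measurement approach (Solaris-specific, since
gethrtime is declared in <sys/time.h> on that platform; the wrapper function is an
illustrative assumption):

#include <sys/time.h>   /* Solaris: hrtime_t gethrtime(void) */
#include <stdio.h>

/* Time one section of interest by differencing two gethrtime() readings. */
void time_section(void (*section)(void))
{
    hrtime_t start = gethrtime();   /* ns since an arbitrary time in the past */
    section();                      /* the section of interest */
    hrtime_t stop = gethrtime();
    printf("elapsed: %lld ns\n", (long long)(stop - start));
}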
Speedup is the ratio of sequential execution time to parallel execution time.
Scaled speedup is attained when the problem size scales with the number of processors.
pthread_mutex_lock(&start_lock);              // acquire mutex start_lock
WHILE (!startnow)                             // wait for startnow to be true
    pthread_cond_wait(&start_cond, &start_lock);
startnow = 0;                                 // set startnow back to false
pthread_mutex_unlock(&start_lock);            // release mutex start_lock
/* Communication section (see Figure 2.1) */
/* Tell computation thread you are done communicating */
pthread_mutex_lock(&my_lock);                 // acquire mutex my_lock
done = 1;                                     // set done flag to true
pthread_cond_signal(&cond);                   // signal computation thread
pthread_mutex_unlock(&my_lock);               // release mutex my_lock
(a)

// Wait for communication thread to tell you it is done communicating
pthread_mutex_lock(&my_lock);                 // acquire mutex my_lock
WHILE (!done)                                 // wait for done to be true
    pthread_cond_wait(&cond, &my_lock);
done = 0;                                     // set done flag to false
pthread_mutex_unlock(&my_lock);               // release mutex my_lock
/* Tell communication thread to begin new communication cycle */
pthread_mutex_lock(&start_lock);              // acquire mutex start_lock
startnow = 1;                                 // set startnow flag to true
pthread_cond_signal(&start_cond);             // signal communication thread
pthread_mutex_unlock(&start_lock);            // release mutex start_lock
(b)

Figure 2.2. Pseudocode for thread synchronization.
(a) Code executed by communication thread; (b) Code executed by computation thread.

In the parallel MVDR algorithm, the number of processors is tied to the problem
size (i.e., the number of sensors in the sonar array), so the performance results in Chapter
6 are in terms of scaled speedup. The ideal speedup, or the degree of parallelism (DOP),
is the number of processes executing in parallel. If speedup is less than 1, then the
sequential program executes faster than the parallel program and the parallel program is
futile.
Parallel efficiency is the ratio of the attained speedup to the ideal speedup,
expressed as a percentage. The ideal parallel efficiency is 100%.
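In symbols, restating these definitions (with $T_s$ the sequential execution time,
$T_p$ the parallel execution time, and $p$ the degree of parallelism):

$$ S = \frac{T_s}{T_p}, \qquad E = \frac{S}{p} \times 100\% $$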
CHAPTER 3
OVERVIEW OF MVDR ALGORITHM
The objective of beamforming is to resolve the direction of arrival of spatially
separated signals within the same frequency band. Multiple sensors in an array are used
to capture multichannel data that is spatially filtered by weights assigned to each sensor
output. The weights, also called the aperture function, form a beampattern response,
which is a function of the number of sensors, sensor spacing, and wave number, given by
k = ω/c, where ω is the center frequency of the signal and c is the propagation velocity
for the signal. The sensor outputs are multiplied by the weights and summed to produce
the beamformer output. The beamformer output is a spatially filtered signal with an
improved SNR over that acquired from a single sensor.
Frequency-domain beamforming offers finer beamsteering resolution and is more
memory-efficient than time-domain beamforming. George et al. [8] showed that the
execution time of the Fast Fourier Transform (FFT) is negligible compared to the
execution times of other stages in the beamforming process. Hence, for this research, we
assume that all time-domain data has been Fourier transformed, and we only process the
data in a single frequency bin since, as will become clear later, the parallel algorithm
presented in this thesis is easily extensible to multiple frequency bins.
Figure 3.1 shows the structure of a basic adaptive beamformer with a signal
arriving from angle θ. There are n sensors with frequency-domain outputs X_0, ..., X_{n-1}.
The weights, w_i, in the figure are intersected by an arrow to illustrate their adaptive
nature. Denoting the column vector of frequency-domain sensor outputs as x and the
column vector of weights as w with entries w_i, the scalar output, y, can be expressed as

y = w*x    (3.1)

where * denotes complex-conjugate transposition.
Figure 3.1. Narrowband ABF structure.
CBF uses a delay-and-sum technique to steer the array in a desired direction
independently of the data. The array is steered by properly phasing the incoming signal
and summing the delayed outputs to produce the result. The steering vector, s, is
calculated from the frequency and steering direction, which when multiplied by the
incoming signal will properly phase the array. The equation for s is given by
s = [1, e^{jkd sin(θ)}, e^{j2kd sin(θ)}, ..., e^{j(n-1)kd sin(θ)}]    (3.2)
where k is the aforementioned wave number, n is the number of nodes, and d is the
distance between the nodes in the array. For a detailed discussion of CBF and the
formation of the steering vectors, the reader is referred to Clarkson [22].
A crossspectral density matrix, R, is estimated as the autocorrelation of the vector
of frequencydomain sensor outputs
R = E{xx*}.    (3.3)
The matrix R is also referred to as the covariance or correlation matrix in time-domain
algorithms. The output power per steering direction is defined as the expected value of
the squared magnitude of the beamformer output:
P = E{|y|^2} = w*E{xx*}w = w*Rw.    (3.4)
In CBF, the weight vector w is equal to s, the steering vector.
MVDR falls under the class of linearly constrained beamformers where the goal is
to choose a set of weights, w, that satisfy
w*s = g    (3.5)
which passes signals from the steering direction with gain g while minimizing the output
power contributed by signals, or interferers, from other directions. In MVDR, the gain
constant g equals one. Thus, by combining Eqs. (3.4) and (3.5) and assuming that s is
normalized (i.e., s*s = 1), we solve for the weights from

min_w P = w*Rw  constrained to  w*s = 1.    (3.6)
Using the method of Lagrange multipliers, it is found that the solution for the weight
vector in Eq. (3.6) is
w = R^{-1}s / (s*R^{-1}s).    (3.7)
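The intermediate step, omitted above, is a standard constrained-optimization
argument; the following sketch (not reproduced from the thesis) assumes R is Hermitian
and invertible:

$$ \mathcal{L}(\mathbf{w},\lambda) = \mathbf{w}^{*}R\,\mathbf{w} + \lambda(1 - \mathbf{w}^{*}\mathbf{s}) + \lambda^{*}(1 - \mathbf{s}^{*}\mathbf{w}) $$
$$ \frac{\partial\mathcal{L}}{\partial\mathbf{w}^{*}} = R\,\mathbf{w} - \lambda\,\mathbf{s} = 0 \;\Rightarrow\; \mathbf{w} = \lambda\,R^{-1}\mathbf{s} $$
$$ \mathbf{s}^{*}\mathbf{w} = 1 \;\Rightarrow\; \lambda = \frac{1}{\mathbf{s}^{*}R^{-1}\mathbf{s}} \;\Rightarrow\; \mathbf{w} = \frac{R^{-1}\mathbf{s}}{\mathbf{s}^{*}R^{-1}\mathbf{s}} $$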
By substituting Eq. (3.7) into Eq. (3.6), we obtain the scalar output power for a single
steering direction as:
P = 1 / (s*R^{-1}s).    (3.8)
Thus, the MVDR algorithm optimally computes the weight vectors depending on the
sampled data. The result is a beampattern that places nulls in the direction of strong
interferers. As an example, Figure 3.2 shows the beampatterns and beamforming power
plots from a conventional beamformer and an MVDR beamformer. The data in this
example was created for one frequency bin, f=30 Hz, where the output of each sensor has
a signal plus noise component. The amplitudes of both the signal and noise are normally
distributed with zero mean and unit variance, with an SNR of 25 dB.
Figure 3.2. Array response and power plots for 4 nodes, 91 steering directions, and
incoming signal at θ = 45°. (a) CBF plot of beampattern for the array when steered
toward broadside; (b) MVDR plot of beampattern for the array when steered toward
broadside (arrow points to null steered at interference arriving at 45°); (c) resulting
log-scale power plot for CBF; (d) resulting log-scale power plot for MVDR.
Isotropic noise is modeled as a large number of statistically independent stochastic
signals arriving at the array from all directions. Since the weight vectors for CBF are
calculated independently from the sampled data and the only a priori knowledge of the
desired signal is the bearing, CBF ignores possible interferers from other directions.
Figure 3.2(a) shows the beampattern created by a CBF algorithm looking for signals at
broadside (i.e., θ = 0°) with an interferer at θ = 45°. The result is a beampattern that
passes signals at broadside with unity gain while attenuating signals from other angles of
incidence. At θ = 45°, the beampattern exhibits a side lobe that does not completely
attenuate the interfering signal, which will result in a "leakage" of energy from the signal
at 45° into broadside. Conversely, the weight vectors in MVDR depend explicitly on the
sampled data and can therefore steer nulls in the direction of interferers, as indicated by
the arrow in Figure 3.2(b). In this figure, the main lobe is at broadside (i.e., the steering
direction), and at θ = 45° there is a null that minimizes the energy contributed from the
interferer in that direction, thereby preventing leakage. Changing the number of nodes
and the distance between the nodes can change the beampattern of the array. Clarkson [22]
gives a detailed analysis of how changing these array parameters affects the spatial
bandwidth of the main lobe, the side lobe level, and the roll-off of the side lobes. Figures
3.2(c) and 3.2(d) show the final beamforming results, for CBF and MVDR respectively,
where the power found in each steering direction is plotted on a logarithmic scale to
emphasize the artificial energy found in directions where there was no signal source. The
relatively poor results in Figure 3.2(c) are evidence of the energy contributed by the
interferer to other steering directions as this energy was passed, although attenuated, by
the CBF beampattern. By contrast, the MVDR results in Figure 3.2(d) exhibit a
dramatic improvement due to the ability of MVDR to place nulls in the direction of
strong signal interference when not steering toward the direction of the interference.
CHAPTER 4
SEQUENTIAL MVDR ALGORITHM AND PERFORMANCE
Adaptive beamforming with MVDR is divided into three tasks. The first task is
the Ensemble Averaging of the CSDM. The second task is Matrix Inversion, which
inverts the ensemble average of the CSDM. The third task is Steering, which steers the
array and calculates the power for each steering direction. Figure 4.1 is a block diagram
of the sequential algorithm, where n is the number of sensor nodes in the array.
[Block diagram: Create CSDM -> Invert Matrix -> Compute Steering, P(θ);
Steering Vectors feed the steering block.]
Figure 4.1. Block diagram of the sequential MVDR algorithm.
Section 4.1 presents an overview of each task in the sequential algorithm,
followed by performance results given in Section 4.2. The performance results will
identify the computational bottlenecks of the sequential MVDR algorithm.
4.1 Sequential MVDR tasks
As defined in Eq. (3.3), the formation of the CSDM requires an expectation
operation. The expectation is computed by multiplying the column vector of sensor data,
x, by its complexconjugate transpose, resulting in an instantaneous estimate of the
CSDM. This instantaneous estimate is then added to the ensemble, and the result is
divided by the number of estimates accumulated to compute the average. This ensemble
averaging task consists of n^2 complex multiplications and additions to compute the
CSDM and update the ensemble, followed by n^2 divisions to perform the averaging,
which results in a computational complexity of O(n^2).
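A minimal sketch of one ensemble-averaging step, using C99 complex types (an
assumption for brevity; the thesis code predates C99 and carries its own complex
arithmetic):

#include <complex.h>

/* Add the instantaneous estimate x x* to the running sum, then divide by the
   number of estimates accumulated so far: O(n^2) operations. */
void ensemble_average(const double complex *x, double complex *sum,
                      double complex *csdm, int n, int num_estimates)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            sum[i*n + j] += x[i] * conj(x[j]);            /* update ensemble */
            csdm[i*n + j] = sum[i*n + j] / num_estimates; /* average */
        }
}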
The inversion algorithm we use for the matrix inversion task is the Gauss-Jordan
elimination algorithm with full pivoting, adapted from Press et al. [23]. Gauss-Jordan
elimination for matrix inversion assumes that we are trying to solve the equation

A X = I    (4.1)

where A is the matrix to be inverted, X is the matrix of unknown coefficients which form
the inverse of A, and I is the identity matrix. This inversion technique uses elementary
row operations to reduce the matrix A to the identity matrix, thereby transforming the
identity matrix on the right-hand side into the inverse of A. The method used here
employs full pivoting, which involves doing row and column operations in order to place
the most desirable (i.e., usually the largest available element) pivot element on the
diagonal. This algorithm has a main loop over the columns to be reduced. The main
loop houses the search for a pivot element, the necessary row and column interchanges to
put the pivot element on the diagonal, division of the pivot row by the pivot element, and
the reduction of the rows. The columns are not physically interchanged but are relabeled
to reflect the interchange. This operation scrambles the resultant inverse matrix so a
small loop at the end of the algorithm is needed to unscramble the solution.
Gauss-Jordan elimination with full pivoting was chosen because it is numerically
stable and efficient in number of computations and memory usage. For details on the
algorithm structure and numerical stability of Gauss-Jordan elimination, the reader is
referred to Golub and Van Loan [24]. The complexity of the algorithm is O(n^3), which is as
efficient as any other method for matrix inversion. Gauss-Jordan elimination is
particularly attractive for parallel systems because the outer loop that executes n times is
easily decomposed over n stages and is therefore ideal for pipelining in a coarse-grained
algorithm where the iterations of the loop are decomposed into separate stages. The
memory requirements are minimized by building the inverse matrix in the place of A as it
is being destroyed.
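The following is a condensed C sketch of the inversion routine, adapted in spirit
from the reference above; it is shown for a real-valued matrix, whereas the thesis code
operates on complex CSDMs, and the function name and in-place row-major layout are
assumptions.

#include <math.h>
#include <stdlib.h>

/* Gauss-Jordan inversion with full pivoting; a (n x n, row-major) is replaced
   by its inverse in place. Returns 0 on success, -1 if a is singular. */
int gauss_jordan_invert(double *a, int n)
{
    int *indxr = malloc(n * sizeof(int));
    int *indxc = malloc(n * sizeof(int));
    int *ipiv  = calloc(n, sizeof(int));
    int i, j, k, row = 0, col = 0, ret = 0;

    for (i = 0; i < n; i++) {
        double big = 0.0;                 /* search for the largest pivot */
        for (j = 0; j < n; j++)
            if (!ipiv[j])
                for (k = 0; k < n; k++)
                    if (!ipiv[k] && fabs(a[j*n + k]) >= big) {
                        big = fabs(a[j*n + k]);
                        row = j; col = k;
                    }
        ipiv[col] = 1;
        if (row != col)                   /* put the pivot on the diagonal;  */
            for (k = 0; k < n; k++) {     /* columns are only relabeled      */
                double t = a[row*n + k];
                a[row*n + k] = a[col*n + k];
                a[col*n + k] = t;
            }
        indxr[i] = row; indxc[i] = col;
        if (a[col*n + col] == 0.0) { ret = -1; break; }
        double pivinv = 1.0 / a[col*n + col];
        a[col*n + col] = 1.0;
        for (k = 0; k < n; k++)           /* divide the pivot row */
            a[col*n + k] *= pivinv;
        for (j = 0; j < n; j++)           /* reduce the remaining rows */
            if (j != col) {
                double d = a[j*n + col];
                a[j*n + col] = 0.0;
                for (k = 0; k < n; k++)
                    a[j*n + k] -= a[col*n + k] * d;
            }
    }
    if (ret == 0)                         /* unscramble column interchanges */
        for (k = n - 1; k >= 0; k--)
            if (indxr[k] != indxc[k])
                for (j = 0; j < n; j++) {
                    double t = a[j*n + indxr[k]];
                    a[j*n + indxr[k]] = a[j*n + indxc[k]];
                    a[j*n + indxc[k]] = t;
                }
    free(indxr); free(indxc); free(ipiv);
    return ret;
}

Each pass of the outer loop reduces one column; it is exactly this loop that the
parallel algorithm of Chapter 5 executes one pass at a time per pipeline stage.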
The steering task is responsible for steering the array and finding the output
power for each of the steering directions. The main operation in the steering task is the
product of a row vector by a matrix and then by a column vector, which results in O(n^2)
operations. This operation must be performed once for every steering direction, which
increases the execution time linearly as the number of steering directions increases.
Although, for a fixed number of steering angles, the computational complexity of the
steering task is lower than that of the matrix inversion task as n → ∞, it will dominate the
execution time of the algorithm for sonar arrays where the number of nodes is less than
the number of steering angles. The performance analysis in the next section will
illustrate the impact of each of the tasks in the sequential algorithm.
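A sketch of the steering computation, again using C99 complex types as an
assumption; rinv is the n x n CSDM inverse (row-major) and steer[m] the precalculated
steering vector for direction m:

#include <complex.h>

/* For each steering direction, form the quadratic product s* R^{-1} s and
   invert it to obtain the output power of Eq. (3.8): O(n^2) per direction. */
void steering(const double complex *rinv, double complex **steer,
              double *power, int n, int num_dirs)
{
    for (int m = 0; m < num_dirs; m++) {
        double complex denom = 0.0;
        for (int i = 0; i < n; i++) {
            double complex row = 0.0;
            for (int j = 0; j < n; j++)
                row += rinv[i*n + j] * steer[m][j];   /* (R^{-1} s)_i */
            denom += conj(steer[m][i]) * row;         /* s* (R^{-1} s) */
        }
        power[m] = 1.0 / creal(denom);  /* s* R^{-1} s is real for Hermitian R */
    }
}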
4.2 Performance results for sequential MVDR algorithm
Experiments were conducted to examine the execution time of the sequential
model and its three inherent tasks. The sequential algorithm was coded in C, compiled
under Solaris 2.5, and then executed on an Ultra-1 workstation with a 167-MHz
UltraSPARC-I processor and 128 MB of memory. To study the effects of problem size,
the program was executed for 4, 6, and 8 sensor nodes, each using 45 steering directions.
The execution time for each task was measured separately and averaged over many
beamforming iterations. The results are shown in Figure 4.2. Each stacked bar is
partitioned by the tasks of MVDR beamforming discussed above. The height of each
stacked bar represents the average amount of time to execute one complete iteration of
beamforming. As predicted, the steering task is the most computationally intensive
since, in this case, the number of nodes is always less than the number of steering
directions. The matrix inversion task follows while ensemble averaging, seen at the
bottom of each stacked bar, is almost negligible.
The array architecture assumed for the sequential algorithm is one in which the
sensor nodes collect data and send it to a front-end processor for beamforming.
Communication latency is ignored in the performance analysis for the sequential
algorithm, and the total execution times are based completely on the sum of the execution
times of the three computational tasks.
The sequential algorithm was optimized for computational speed by precalculating
the steering vectors and storing them in memory, to be retrieved only when
needed in the steering task. George and Kim [7] analyze the tradeoffs between a
minimum-calculation model and a minimum-memory model for a sequential split-
aperture CBF algorithm. In their minimum-calculation model, the steering vectors and
the inverse-FFT basis are calculated in an initial phase of execution and saved in memory
to access when needed in the steering and iFFT processing tasks, respectively.
Task execution times in seconds per iteration (values from the chart):

             Seq4       Seq6       Seq8
Steering     1.20E+00   2.27E+00   3.73E+00
Inversion    1.34E-01   3.79E-01   8.62E-01
Ensemble     3.07E-02   8.48E-02   1.21E-01

Figure 4.2. Average execution time per iteration as a function of array size for the
sequential MVDR algorithm with 45 steering directions and 576 iterations on an Ultra-1
workstation.
The minimum-memory model conserves memory by computing these vectors on the fly
as needed. Performance analyses indicate that the minimum-calculation model is five
times faster than the minimum-memory model but requires twice as much memory
capacity. In MVDR, there is no iFFT task and therefore no need to calculate and store an
inverse-FFT basis. However, the SA-CBF and MVDR algorithms both require steering
vectors in their steering tasks. Since the steering task was found to be the dominant
factor in the performance of the SA-CBF algorithm, as it is in MVDR, we expect similar
computation-vs-memory tradeoffs with the MVDR algorithm. As execution time is the
primary focus of the performance analyses in this thesis, we chose to trade
memory space for execution speed by precalculating the steering vectors and storing
them in memory to access when needed in the steering task.
The following chapter describes a distributed-parallel MVDR algorithm that
achieves promising levels of speedup by decomposing, partitioning, and scheduling in
pipeline stages the two most computationally intensive tasks, steering and matrix
inversion. Chapter 6 analyzes the performance of the parallel algorithm and compares it
with the purely sequential algorithm presented above.
CHAPTER 5
PARALLEL MVDR ALGORITHM
The level at which a computational task is partitioned among processors defines
the granularity of the parallel algorithm. In a coarse-grained decomposition the
computational tasks are relatively large, with less communication among the processors.
In a fine-grained decomposition the tasks are relatively small and require more
communication. A medium-grained decomposition is a compromise between the two.
The lower communication overhead inherent in coarse- and medium-grained
decompositions makes them the preferred methods when developing parallel algorithms
for distributed systems. A coarse-grained decomposition for parallel beamforming is one
that would assign independent beamforming iterations (i.e., the outermost loop in the
sequential basis) to each processing node and then pipeline the beamforming tasks for
overlapping concurrency of execution across the nodes. For instance, in a 4-node array,
node 0 would be assigned iterations 0, 4, 8, 12, etc., node 1 would be assigned iterations 1,
5, 9, 13, etc., and so forth. By contrast, a medium-grained decomposition is one that
would assign the same beamforming iteration to all nodes for simultaneous concurrency
of execution, where each node would be responsible for computing the power results
for a subset of the steering angles of interest. For instance, in a 4-node array that is
beamforming with 180 steering angles, node 0 would be assigned the first 45 steering
angles of all iterations, node 1 the second 45 steering angles, etc.
Coarse- and medium-grained decomposition algorithms have been developed for
parallel CBF and SA-CBF by George et al. [6] and George and Kim [7], respectively. In both
papers, the coarse-grained algorithm is called iteration decomposition and the medium-
grained algorithm is called angle decomposition. The angle-decomposition method
requires all-to-all communication to distribute the data among the nodes, while the
iteration-decomposition method requires all-to-one communication to send the data to the
scheduled node. The performance of iteration decomposition is generally superior to
angle decomposition on distributed systems because it requires less communication
among the nodes. However, since angle decomposition is not pipelined, it has a shorter
result latency (i.e., the delay from the sampling of input data until the respective
beamforming output is rendered) and makes more efficient use of memory by distributing
the steering vectors across the nodes. Moreover, when the network in the distributed
system is capable of performing a true hardware broadcast (i.e., when a sender need only
transmit a single packet onto the network for reception by all other nodes), the performance
of angle decomposition approaches that of iteration decomposition.
The lower frequency of communication inherent in a coarse-grained
decomposition makes it the most practical basis for a parallel MVDR algorithm designed
for in-array processing on a distributed sonar array. A medium-grained algorithm for
MVDR would require all nodes to invert the same matrix independently, causing undue
computational redundancy. A coarse-grained algorithm avoids redundant computations
by scheduling beamforming iterations for different data sets. Of course, the high
frequency of communication inherent to fine-grained algorithms would make them
impractical for implementation on a distributed array.
The distributed-parallel MVDR (DP-MVDR) beamforming algorithm presented
here has two main components, a computation component and a communication
component. The computation component uses round-robin scheduling of beamforming
iterations to successive nodes in the array, and both staggers and overlaps execution of
each iteration within and across nodes by pipelining. The communication component of
the algorithm reduces the communication latency of all-to-all communication with small
amounts of data by spooling multiple data samples into a single packet. In addition, a
separate thread is employed to hide communication latency by performing the all-to-all
communication while the main computational thread is processing the data received
during the previous communication cycle. The next two sections discuss the two
algorithmic components in detail, followed by performance results in Chapter 6.
5.1 Computation component of DP-MVDR
As reviewed in Chapter 4, each beamforming iteration of the sequential MVDR
algorithm is composed of an ensemble averaging task, a matrix inversion task, and a
steering task. Before the task of matrix inversion can begin, the ensemble average of the
CSDMs must be computed. The steering task then uses the inverse of the CSDM average
to steer for every angle of interest. In the DP-MVDR algorithm, each node in succession
is scheduled to beamform a different data set in a round-robin fashion. For instance,
node 0 would be assigned to process the first set of data samples collected by the array,
node 1 the second set, and so forth, causing each node to beamform using every nth data
set, where n is the number of nodes.
Furthermore, the DP-MVDR algorithm also pipelines the matrix inversion and
steering tasks within each beamforming iteration by decomposing them into n stages
each. When a node is scheduled a new beamforming iteration, it computes the ensemble
average of the CSDM and executes the first stage of the matrix inversion task. The next
n-1 stages are spent completing the computation of the matrix inversion, and the
following n stages perform the steering, thereby imposing a result latency of 2n pipeline
stages. Since a node is scheduled a new beamforming iteration every n stages, iterations
are overlapped within a node by computing a matrix inversion stage of the current
iteration and the respective steering stage of the previously scheduled iteration in the
same pipeline stage. Meanwhile, the running sum of CSDMs is updated at the beginning
of every pipeline stage. The pipeline structure of the computation component of the DP-
MVDR algorithm is shown in Figure 5.1 for n = 3.
Legend:
U: updating the running sum of CSDMs
A: averaging the CSDM
I: inversion of the CSDM
S: steering using the previously inverted CSDM
♦: beamforming iteration completed
Figure 5.1. Block diagram of the computation component of DP-MVDR (for n = 3).
As shown in the legend of the figure, the symbols in the polygons each represent
the task being performed in a given stage of the pipeline. The subscripts indicate the
beamforming iteration for which the task is associated, and the superscript (if present)
indicates the stage of the task being computed. To further aid in distinguishing between
beamforming iterations scheduled across successive nodes, a different polygon is used to
depict each iteration of beamforming. For example, the U stages have the same polygon
as the following A task in the same pipe, since the U stages update with data the ensemble
that will be averaged by the subsequent A task. Each pipeline stage is depicted by a
shaded box and the completion of a beamforming iteration by a black diamond (i.e., a
milestone).
As an example, consider the first four stages of the pipeline shown in the first row
of the diagram for node 0. In the first pipeline stage, the ensemble is updated and the
average is computed for iteration i, and the output is used by the first stage of the matrix
inversion. Meanwhile, in that same pipeline stage, the first stage of the steering task is
underway based on the results from the matrix inversion previously computed for
iteration i-3. In the second pipeline stage, new data from all nodes for iteration i+1 is
used to update the ensemble while the second stage of the inversion for iteration i and the
second stage of the steering for iteration i-3 continue. In the third pipeline stage, the
update for iteration i+2 takes place, the inversion for iteration i is completed, and the
steering for iteration i-3 is also completed, thereby ending iteration i-3 by producing the
beamforming result. Finally, in the fourth pipeline stage of node 0, the process starts all
over with the update and computation of the ensemble average for iteration i+3, the first
stage of inversion for that same iteration, and the first stage of steering for iteration i.
Once the pipeline has initialized and filled, the overlapping of iterations between
consecutive nodes in the array allows for one beamforming result to be available from the
system at the end of every pipeline stage. The decomposition scheme employed for each
task is discussed below.
In every pipeline stage a new CSDM is created from the present data set collected
from all the nodes in the system and added to the running sum. When a new iteration of
beamforming is scheduled to begin on a given node, the running sum is then divided by
the total number of sets accumulated thus far and thus ensemble averaging for that
iteration is completed. Therefore, the computational complexity of the ensemble
averaging task in the parallel algorithm remains the same as in the sequential algorithm,
O(n^2).
The Gauss-Jordan elimination algorithm loops once for every column of the CSDM.
Since the CSDMs always have n columns, a single loop of the algorithm is executed per
pipeline stage. This form of decomposition decreases the computational complexity of the
matrix inversion task from O(n^3) in the sequential algorithm to O(n^2) per pipeline
stage in the parallel algorithm over n nodes.
Figure 5.2 shows the flow of processing associated with the steering task in the
sequential algorithm and its decomposed counterpart in the parallel algorithm. The
product shown is the denominator of Eq. (3.8), and θ is the steering direction.
Figure 5.2(b) shows the decomposition of the steering task in the DP-MVDR
algorithm. The shaded elements denote the elements that are multiplied in one pipeline
stage. The conjugate-transpose of the steering vector is multiplied by one column of the
CSDM inverse, then by one element of the steering vector, for every steering direction,
which results in O(n) complexity. This process is repeated n times, one per stage, for
each column of R^{-1} and each corresponding element of s(θ). The results from each
stage are accumulated until the multiplication is completed in the nth stage.

Figure 5.2. Steering task multiplication.
(a) Sequential steering task; (b) Decomposed steering task.

As evidenced in
the performance experiments previously conducted with the sequential algorithm, the
steering task in the DP-MVDR algorithm, while less computationally complex than
matrix inversion as n → ∞ for a fixed number of steering angles, dominates the execution
time when n is less than the number of steering directions.
For further clarification of the DP-MVDR algorithm, high-level pseudocode is
provided in Figure 5.3. The outer loop in the code repeats successive iterations of
adaptive beamforming. The inner loop is responsible for coordinating the activities of
each of the n processing nodes in the array. The purpose behind the communication-
oriented instructions in the first IF/ENDIF block of the code is explained in the next
section.

FOR i = 1 TO number of beamforming iterations
    FOR k = 0 TO number of nodes - 1
        Create R matrix and add to ensemble;
        count++;               // count number of stages per comm. cycle
        IF (count == d)        // barrier reached
            Wait for communication to finish;
            count = 0;
            Start communicating next d data samples;
        ENDIF
        IF (k == my_rank)      // my_rank is the node number
            CSDM = ensemble_average();
            stage = 0;         // stage is the current stage of inversion
                               // and steering
        ENDIF
        Invert(CSDM, stage);   // CSDM is the ensemble average
        Steer(inverse_csdm(k-n), stage++);  // steer using the inverse found in the
                                            // previous iteration, and increment stage
    ENDFOR
ENDFOR
Figure 5.3. Pseudocode for the DP-MVDR algorithm.

5.2 Communication component of DP-MVDR

The expectation operation of Eq. (3.3) at the beginning of every iteration of
beamforming requires that each node have a continually updated local copy of every
other node's data sample, where a data sample is the complex value of one frequency bin.
To do so, the nodes must perform an all-to-all communication of a single sample per
node between each pipeline stage in order to meet the data requirements imposed by the
MVDR algorithm. In a distributed array this communication pattern incurs a high degree
of network contention and latency that might render the computation component of the
parallel algorithm ineffectual.
To lower the impact of all-to-all communication, we use a technique called "data
packing," where the nodes pack data (e.g., d samples) that will be needed in future
stages and send that data as one packet. Data packing eliminates the need to perform an
all-to-all communication cycle between each pipeline stage; instead, one is performed
every d stages. Data packing does not reduce the communication complexity of all-to-all
communication, which is O(n^2) in a point-to-point network where n is the number of
nodes. However, due to the overhead in setup and encapsulation for communication,
sending small amounts of data results in a high overhead-to-payload ratio, making it
desirable to pack multiple data samples together to amortize the overhead and reduce the
effective latency of communication.
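As a sketch of the idea (names and the fixed tag are illustrative assumptions):
each node appends d complex samples, two 8-byte doubles apiece, to one contiguous
buffer and then issues a single send per destination, so the per-message setup cost is
paid once per d samples rather than once per sample.

#include <mpi.h>

/* Append one complex sample to the outgoing packet. */
void pack_sample(double *packet, int slot, double re, double im)
{
    packet[2*slot]     = re;
    packet[2*slot + 1] = im;
}

/* Transmit the packed d samples as one message per destination node. */
void send_packet(const double *packet, int d, int my_rank, int num_nodes)
{
    for (int dest = 0; dest < num_nodes; dest++)
        if (dest != my_rank)
            MPI_Send((void *)packet, 2 * d, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}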
In addition to the use of data packing to reduce latency, we use multithreaded
code in order to hide latency by overlapping communication with computation. Using a
separate thread for communication is comparable to the use of a DMA controller capable
of moving data between the node's memory and the network interface, in that, after
setup, communication occurs in the background.
processing their current data set, the communication thread is exchanging the data
samples for the next set of iterations. Figure 5.4 shows a block diagram of this method.
In Figure 5.4, the squares in the foreground represent pipeline stages of
computation that overlap in time with the all-to-all communication cycles represented by
the rectangles in the background. The different types of shading depict the sets of data
that are being operated upon by each thread (e.g. the pattern in the background with the
first cycle is repeated in the pipeline stages of the second cycle). Figure 5.4 illustrates
how the data that is currently being processed was transmitted and received during the
previous communication cycle.

Figure 5.4. Overlap of computation and communication threads.
Several experiments were conducted on a testbed to determine the optimal
number of data elements per packet to transmit per communication cycle, so that
communication and computation can be overlapped with minimum overhead in the
computation thread. The testbed consisted of a cluster of Ultra-1 workstations (i.e., the
same platform used in Chapter 4) connected by a 155-Mb/s (OC-3c) Asynchronous
Transfer Mode (ATM) network via TCP/IP. The parallel code was written in the C
version of MPI [20]. MPI is a standard that defines a core of functions for message-passing
communication and synchronization. All-to-all communication was achieved by calling
multiple send and receive primitives. Each complex data sample consists of two 8-byte
floating-point values. Figure 5.5 compares the amount of time each thread dedicates to
each component in a full communication and processing cycle for different payload sizes
per packet in an 8-node system.
[Stacked bars comparing, for payloads of 1, n, and n^2 data samples per packet, the
time each thread spends in synchronization, work, and barrier during one cycle.]
Figure 5.5. Thread performance as a function of data samples per packet (n = 8).
A cycle in the communication thread consists of a barrier synchronization to
control the all-to-all transaction between nodes, communication of d data samples that is
the actual work of the thread, and thread synchronization within the node. A cycle in the
computation thread consists of the processing of d data samples and thread
synchronization. The synchronization time in the communication thread is the amount of
time that it takes to tell the computation thread that it has completed the communication
of the data, plus the amount of time it waits for the computation thread to tell it to begin a
new communication cycle. The synchronization time in the computation thread is the
amount of time it waits for the communication thread to finish, plus the amount of time it
takes to tell the communication thread to begin a new communication cycle. Thus, the
thread that finishes its work first must block to wait for the other thread to finish. To
achieve an optimal overlap of computation over communication, the goal is for the
communication thread to finish its communication cycle while the computation thread is
still processing data. Figure 5.5 shows that for payload sizes of 1 and n data samples per
packet, the computation thread finishes its work cycle before the communication thread
finishes its work cycle, and thus maximum overlap is not achieved to completely hide the
communication latency. When the payload size is n^2 data samples per packet, the amount
of time the computation thread spends processing the data is greater than the amount of
time the communication thread spends in sending the next set of data, thereby achieving
complete overlap. The amount of data to pack for computation to overlap
communication depends on the inherent performance of the processors and network in
the system. Relative to processor speed, slower networks with a higher setup penalty will
require larger amounts of packing while faster networks will require less.
There are several side effects of the use of multithreading and data packing in
the DP-MVDR algorithm. First, multithreading requires thread synchronization, which is
overhead not present in a single-threaded solution. Also, the more data that is packed,
the longer the result latency. Therefore, the number of data samples to pack should be
sufficient to overlap the communication completely while maintaining the least possible
result latency. In an n-node system where d samples per packet are transmitted per
communication cycle, the result latency with data packing is increased from 2n pipeline
stages to 2n + d since the algorithm allots d stages for communication.
An increase in memory requirements is another side effect of data packing and
multithreading. Figure 5.6 shows the data memory requirements for the sequential
algorithm and the parallel algorithm versus payload size of the packet and number of
nodes in the system. As stated earlier, each sample is a complex scalar that consists of
two 8-byte floating-point values. The sequential algorithm needs to store only one copy
of the data from each node and thus requires the least amount of memory, which is 16n
bytes. In the parallel algorithm, even without data packing the double-buffering
associated with multithreading raises the memory requirement to twice that of the
sequential algorithm: one set that is being communicated, and another set that is being
processed. In general, for an n-node system where each sample is 16 bytes, the memory
required per node to store d samples per packet from every node is 2 × d × 16n bytes.
Consider a single front-end processor executing the sequential algorithm for an 8-node
array. It requires 16n = 128 bytes to store the vector of data samples. Each node in an
8-node array executing DPMVDR and sending only one sample per packet requires 256
bytes. For n data samples per packet each node requires 2 kB, and for n^2 samples per
packet each node requires 16 kB.
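The formula and the three cases above can be checked directly; a minimal C++ sketch (not
part of the thesis programs):

#include <cstdio>

// Per-node buffer size in bytes: two buffers (one being communicated,
// one being processed), each holding d samples of 16 bytes from each
// of the n nodes.
long parallel_buffer_bytes(long n, long d)
{
    return 2 * d * 16 * n;
}

int main()
{
    long n = 8;
    printf("%ld\n", parallel_buffer_bytes(n, 1));    // 256 bytes
    printf("%ld\n", parallel_buffer_bytes(n, n));    // 2048 bytes = 2 kB
    printf("%ld\n", parallel_buffer_bytes(n, n*n));  // 16384 bytes = 16 kB
    return 0;
}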
[Figure 5.6. Data memory requirements for communication as a function of the number of
processors for various payload sizes. Four panels plot bytes per node versus the number
of processors (4 to 16) for the sequential case (no packets) and for payloads of 1, n, and
n^2 samples per packet.]
The need for data packing and multithreading in the DPMVDR algorithm is most
critical when the nodes in the array are connected by a unicast point-to-point
communications network that is slow relative to processor speed. If the network is
capable of inherently supporting hardware broadcast (e.g., a ring), where a message
destined for all other nodes in the system is transmitted only once and yet received
multiple times, then the communication complexity of the all-to-all communication
reduces from O(n^2) to O(n). This reduction in communication complexity may permit the
use of smaller degrees of data packing while still achieving the overlap in computation
over communication, and in so doing reduce the memory requirements and result latency.
Moreover, a broadcast network that is fast relative to processor speed may even mitigate
the need for multithreading. Thus, the communication component of the DPMVDR
algorithm may be tuned in its use depending on the processor and network speed and
functionality.
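The complexity reduction can be seen by simply counting messages; the sketch below (an
illustration, not from the thesis code) counts the messages placed on the network in one
all-to-all exchange under each assumption:

// With unicast, each of the n nodes sends a separate message to each
// of the other n-1 nodes: n(n-1) messages, O(n^2).
int unicast_messages(int n)   { return n * (n - 1); }

// With hardware broadcast, each node transmits once and all other
// nodes receive that single transmission: n messages, O(n).
int broadcast_messages(int n) { return n; }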
The next chapter presents performance results with the DPMVDR algorithm for
several array sizes. Since communication latency is completely hidden by packing data
samples, the execution times measured are a function of the computation thread alone.
CHAPTER 6
PARALLEL MVDR PERFORMANCE
To ascertain overall performance attributes and relative performance of inherent
tasks using a distributed systems testbed, the algorithm was implemented via several
message-passing parallel programs (i.e., one per array configuration) coded in C++ with
MPI and executed on a cluster of UltraSPARC workstations with an ATM communications
network. Each node measures the execution times for its tasks independently, and final
measurements were calculated by averaging the individual measurements across
iterations and nodes. The degree of data packing employed is d = n^2 data samples per
packet. Figure 6.1 shows the execution times for both the parallel and the sequential
algorithms for 45 steering directions. The number of nodes is varied to study the effects
of problem and system size.
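The averaging itself is straightforward; the helper below is an illustrative sketch (the
thesis programs write the raw per-iteration times to files, as shown in Appendix C, and
a reduction of this form is applied to them afterward):

// Mean of one measurement per iteration per node; 'times' holds
// iterations*nodes raw readings gathered from the per-node files.
double average_measurement(const double *times, int iterations, int nodes)
{
    double sum = 0.0;
    for (int i = 0; i < iterations * nodes; i++)
        sum += times[i];
    return sum / (iterations * nodes);
}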
The parallel times show that the steering task remains the most computationally
intensive task for both the parallel and sequential programs, since in all cases the number
of nodes is less than the number of steering directions. As the matrix inversion task in
the parallel algorithm has the same parallel complexity as the ensemble average task,
O(n^2), their execution times are approximately equal. The sequential algorithm does not
include an overhead component because it is neither pipelined nor multithreaded, and
communication latencies are ignored.
Task        Par4       Par6       Par8       Seq4       Seq6       Seq8
Overhead    1.23E-03   9.56E-04   1.36E-03   --         --         --
Steering    3.30E-01   4.12E-01   5.02E-01   1.20E+00   2.27E+00   3.73E+00
Inversion   3.62E-02   6.88E-02   1.12E-01   1.34E-01   3.79E-01   8.62E-01
Ensemble    3.42E-02   7.24E-02   1.26E-01   3.07E-02   8.48E-02   1.21E-01

Figure 6.1. Average execution time per iteration as a function of array size for the
parallel and sequential MVDR algorithms with 45 steering angles and 576 iterations on
the Ultra-1/ATM cluster.
However, the overhead time of the parallel algorithm is negligible and therefore
does not significantly affect the total execution time. The amount of time to perform the
ensemble averaging task is comparable for both algorithms since it is not affected by the
pipelining in DPMVDR.
The two decomposed tasks in DPMVDR, matrix inversion and steering, show
significant improvement over their sequential counterparts. Figure 6.2 compares
DPMVDR and the sequential algorithm in terms of scaled speedup (i.e., speedup where
the problem size grows linearly with the number of nodes in the system) and parallel
efficiency per task. Parallel efficiency is defined as the ratio of speedup to the
number of processing nodes employed in the system.
[Figure 6.2. Individual task performance as a function of the number of processors:
(a) scaled speedup per task; (b) parallel efficiency per task. Each part contains three
panels, for the ensemble averaging, matrix inversion, and steering tasks, plotted for 4, 6,
and 8 processors.]
As expected, the ensemble averaging task exhibits no speedup since it is
performed identically in both the sequential and parallel algorithms. The matrix
inversion and steering tasks achieve near-linear speedup with average parallel
efficiencies of 94% and 92%, respectively. Since these two tasks were each decomposed
into n stages, the ideal case would of course be a speedup of n and efficiency of 100%.
However, much as with any pipeline, the ideal is not realized due to the overhead
incurred from multithreading and managing the multistage computations within the
pipeline.
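As a consistency check, the 8-node columns of Figure 6.1 reproduce these per-task figures
directly; a small C++ sketch of the arithmetic (the values are copied from the figure):

#include <cstdio>

int main()
{
    // 8-node execution times from Figure 6.1.
    double seq_steer = 3.73,    par_steer = 5.02e-1;
    double seq_inv   = 8.62e-1, par_inv   = 1.12e-1;
    int nodes = 8;
    double s_steer = seq_steer / par_steer;  // ~7.4
    double s_inv   = seq_inv / par_inv;      // ~7.7
    printf("steering:  speedup %.2f, efficiency %.0f%%\n",
           s_steer, 100.0 * s_steer / nodes);
    printf("inversion: speedup %.2f, efficiency %.0f%%\n",
           s_inv, 100.0 * s_inv / nodes);
    return 0;
}

The resulting 8-node efficiencies (roughly 93% for steering and 96% for inversion) are
consistent with the multi-configuration averages quoted above.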
The previous figures show the results for each individual beamforming task.
When comparing the overall system performance of the parallel algorithm with the
sequential algorithm, we are primarily concerned with the effective execution time of
each algorithm. The effective execution time in the parallel algorithm is the amount of
time between outputs from successive iterations once the pipeline has filled (i.e., one
pipeline stage). In the sequential algorithm, the effective execution time is the same as
the result latency (i.e., the length of one beamforming iteration). For the parallel
program, timestamp functions are called at the beginning and end of each pipeline stage
to obtain the effective execution time. Each node performs its own independent
measurements, which are then averaged across iterations and nodes.
Figure 6.3 shows the overall system performance in terms of scaled speedup and
parallel efficiency. Despite the long result latency of DPMVDR, the effective execution
time is low, resulting in scaled speedups of approximately 3.4, 4.9, and 6.2 for 4, 6, and 8
nodes, respectively. The parallel efficiencies are approximately 85%, 82%, and 78%,
respectively, with an average efficiency of approximately 82%.
[Figure 6.3. Overall scaled speedup and parallel efficiency as a function of the number of
processors (4, 6, and 8).]
CHAPTER 7
CONCLUSIONS
This thesis introduces a new algorithm for parallel MVDR beamforming on
distributed systems for in-array sonar signal processing, such as a sonar whose nodes are
comprised of DSP processors with local memory and are interconnected by a point-to-point
communications network. The methods employed provide considerable speedup
with multiple nodes, thus enabling previously impractical algorithms to be implemented
in real time. Furthermore, the fault tolerance of the sonar system can be increased by
taking advantage of the distributed nature of this parallel algorithm and avoiding single
points of failure.
The parallel algorithm decomposes the most complex tasks of sequential MVDR
in a coarse-grained fashion using both a computation component and a communication
component. In the computation component, successive iterations of beamforming are
distributed across nodes in the system with round-robin scheduling and executed in an
overlapping, concurrent manner via pipelining. Within each pipeline, the matrix
inversion and steering tasks are themselves pipelined and overlapped in execution across
and within the nodes. The computational complexity of the parallel algorithm is thus
reduced to O(n^2) for a fixed number of steering directions, where n is the number of
nodes and sensors in the system.
In the communication component, provisions are made to support point-to-point
networks of moderate performance with the ability to perform only unicast
communications. Data packing techniques are employed to reduce the latency of the
all-to-all communication needed by MVDR by amortizing the setup and encapsulation
overhead, and multithreading is employed to hide the remaining latency with concurrency
in computations and communications. For a cluster of ATM-connected workstations, the
packing of n^2 data samples per packet was empirically found to provide complete overlap
of computation and communication while maintaining the minimum result latency.
However, the optimal choice of this parameter will depend on the particular processors,
network, and interfaces present in the distributed architecture of interest.
There are several side effects associated with the use of pipelining,
multithreading, and data packing in this (or any) parallel algorithm. Synchronization
between stages in the loosely coupled pipeline brings with it increased overhead, as does
the multithreading of computation and communication threads. Pipelining increases the
result latency, as do multithreading and data packing, and memory requirements
increase linearly with increases in the degree of data packing.
The results of performance experiments on a distributed system testbed indicate
that the parallel algorithm is able to provide a near-linear level of scaled speedup with
parallel efficiencies that average more than 80%. This level of efficiency is particularly
promising given that the network used in the experiments (i.e., ATM) does not support
hardware broadcast. When hardware broadcast is available in a given network, the
communication complexity drops from O(n^2) to O(n), making it likely that even higher
efficiencies can be obtained. Moreover, on distributed systems with a fast network
relative to processor speed and capable of broadcast, the need for data packing and
multithreading may be mitigated, thereby minimizing computation and communication
overhead.
Although the DPMVDR algorithm is defined in terms of a single frequency bin
of interest, it is easily extensible to multiple frequency bins since beamforming cycles are
performed independently for each bin. Thus, for each stage of each task in each
pipelined processor in the system, the computations would be repeated for each of the b
bins of interest. For example, in terms of the computation component of the DPMVDR
algorithm, the processor assigned to compute the output for beamforming iteration i
would proceed by first performing the task of ensemble averaging for each of the b bins
in a single stage. Next, in that same stage plus n-1 additional stages, the processor would
perform the task of matrix inversion for each of the b matrices that result from the
ensemble averaging. Finally, via n more stages, the processor would perform the task of
steering using each of the b inverted matrices that result from the inversion task.
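A sketch of this per-bin stage structure follows; the three task functions are placeholders
for the corresponding DPMVDR tasks, not names taken from the appendix code:

void ensemble_average(int bin);
void inversion_stage(int bin, int stage);
void steering_stage(int bin, int stage);

// One pipeline stage of the computation component, repeated for each
// of the b frequency bins of interest (0 <= stage < 2n).
void dpmvdr_stage(int stage, int n, int b)
{
    for (int bin = 0; bin < b; bin++) {
        if (stage == 0)
            ensemble_average(bin);           // one stage covers all b bins
        if (stage < n)
            inversion_stage(bin, stage);     // n stages of matrix inversion
        else
            steering_stage(bin, stage - n);  // n further stages of steering
    }
}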
Several directions for future research are anticipated. One is an implementation of this
new parallel algorithm for MVDR beamforming targeted toward an embedded,
distributed system based on low-power DSP devices. Moreover, to
complement the performance-oriented emphasis in this thesis, further work is needed to
ascertain strengths and weaknesses of the sequential and parallel MVDR algorithms in
terms of fault tolerance and system reliability. Finally, techniques described in this thesis
can be applied to more advanced beamforming algorithms such as MUSIC and matched
field processing, and future work will focus on the study of opportunities for
parallelization in these algorithms.
APPENDIX A
DATA MODEL
The data model used in the experiments with the DPMVDR algorithm is based
on a model described by Schmidt [25] for use with the MUSIC algorithm, but it is also
compatible with the MVDR algorithm. Using Schmidt's notation, the wavefronts
received at the M array elements are linear combinations of the D incident wavefronts
and noise. Thus, the model for the received data vector X is
[x_1 x_2 ... x_M]^T = [a(θ_1) a(θ_2) ... a(θ_D)] [F_1 F_2 ... F_D]^T + [w_1 w_2 ... w_M]^T    (A.1)

or

X = AF + W.    (A.2)
The complex vector, F, represents the D incident signals by amplitude and phase.
The complex vector, W, represents the noise at each sensor, whether it is isotropic noise
(i.e., noise arriving from all directions with random amplitude) or measurement noise
(i.e., instrumentation noise). The elements of the A matrix, a(θ_k), are column vectors of
the sensor response as a function of the signal arrival angles and the sensor locations.
In other words, a_jk depends on the position of the jth sensor and its response to a signal
incident from the direction of the kth signal. The equation for the elements of a(θ_k) is:
a_jk = exp(i · 2πf · τ_jk)    (A.3)

where i = sqrt(-1), d_j is the relative position of the jth sensor from the center of the
array, τ_jk = d_j sin(θ_k)/c is the time delay of the kth signal sensed at the jth sensor
(c is the speed of sound), and f is the frequency of the signal. In this thesis, the signal
is generated for only one frequency bin.
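A minimal C++ sketch of this computation, consistent with the steering-vector code in
Appendix C (the struct and function names here are simplified stand-ins for the
DPSAcomplex type and the DPSA_Complex_etothej helper defined there):

#include <cmath>

const double PI = 3.14159265358979;

struct Complex { double real, imag; };

// e^{i*arg}, as in DPSA_Complex_etothej from abf.c
Complex etothej(double arg)
{
    Complex r;
    r.real = cos(arg);
    r.imag = sin(arg);
    return r;
}

// Steering-vector element for the jth sensor and kth arrival angle:
// a_jk = e^{i*2*pi*f*tau_jk}, with tau_jk = d_j*sin(theta_k)/c,
// where d_j is the sensor offset from the array center and c is the
// speed of sound.
Complex steering_element(double d_j, double sin_theta_k,
                         double f, double c)
{
    double tau = d_j * sin_theta_k / c;
    return etothej(2.0 * PI * f * tau);
}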
APPENDIX B
VERIFICATION OF RESULTS
The following plots confirm the ability of the parallel and sequential programs to
locate signals in the presence of incoherent, isotropic noise, as a correct MVDR
algorithm should. The programs do not explicitly calculate the weight vectors that
determine the beampattern, so no beampattern plots were generated. The plots shown are
for the 8-node configuration. The test data had the following characteristics:
Table B.1. Signal characteristics

Signal frequency          30 Hz
Signal-to-noise ratio     25 dB
Directions of arrival     0°, -30°, and 30°
Noise component           Isotropic noise
Figures B.1a and B.1b show how the sequential algorithm resolved the signals arriving
from 0°, and from -30° and 30°, respectively. The two signals arriving from -30° and 30°
were not perfectly resolved, as their gain in dB is less than 0. This less-than-ideal
resolution is an inherent limitation of the MVDR algorithm. More advanced algorithms
such as MUSIC exhibit better resolution. The power plots for the parallel algorithm in
Figure B.2 are almost identical to those of the sequential algorithm, which is evidence of
its correctness. They are not identical because they were executed using different random
data generated with the same general characteristics.
[Figure B.1. Sequential 8-node results: power versus steering direction (degrees) for the
sequential MVDR program. (a) Signal arriving from 0°; (b) signals arriving from -30° and
30°.]
[Figure B.2. Parallel 8-node results: power versus steering direction (degrees) for the
parallel MVDR program. (a) Signal arriving from 0°; (b) signals arriving from -30° and
30°.]
APPENDIX C
PARALLEL AND SEQUENTIAL C++ CODE
The data used by each of the following programs was generated in MATLAB and
written as text files to be read by the programs. For the parallel algorithm, the MATLAB
program generates a separate data file for each node to read. Each node that executes the
parallel program is identified by a rank number, a unique integer assigned to it by a call
to the function MPI_Comm_rank. The node knows which file to read by finding its rank
number in the name of the data file. For the sequential algorithm, the MATLAB program
generates one data file with a matrix of values whose columns pertain to each node.
File: inline_par.C

The inline_par.C file is the parallel program that each node executes.
/* Minimum Variance Distortionless Response (MVDR) Beamforming
   Jesus Garcia
   Parallel program for MVDR algorithm
   High Performance Computing and Simulation Research Lab
   garcia@hcs.ufl.edu
   Began: 12/2/98
   Last modified: 5/99
   Filename: ~garcia/dpsa/c/par/parbak/inline_par.C
*/
#include <iostream.h>
#include <fstream.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>
#include <pthread.h>
#include <mpi.h>
#include "abf.h"

#ifndef ISE
#define PON ;
#define POFF ;
#define TOTAL_ITER 576
#else
#define TOTAL_ITER 576
#endif

/* Macro used in matrix inversion for pivoting */
#define SWAP(a, b) {temp=(a);(a)=(b);(b)=temp;}

/* Global variables used by communication and main threads */
int sig1_received = 0, signal_received_main = 0;
int my_rank, stage = 0, p, time_avg = 10000, numdata;
int *done = new int(0);
int *startnow = new int(0);
int size;
DPSAcomplex **array, *x;
MPI_Status status;
hrtime_t comm_start, comm_end, *comm_time, thread_start, thread_end,
         *thread_time;
int numcomm=0, numcsm=0, numinv=0, numbeam=0, numthread=0, iter=1,
    total, count=0, sys;
pthread_mutex_t my_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t start_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t start_cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t datalock = PTHREAD_MUTEX_INITIALIZER;
pid_t parent;

/**********************************************************************
 void thread_began(int signo)
 Description:
   Signal handler to alert the main program when the communication
   thread has begun.
**********************************************************************/
void thread_began(int signo)
{
    pthread_mutex_lock(&my_lock);
    signal_received_main = 1;
    pthread_mutex_unlock(&my_lock);
}
/**********************************************************************
 void *communicate(void *arg)
 Description:
   Thread dedicated to communication with the other nodes.
**********************************************************************/
void *communicate(void *arg)
{
    MPI_Status status;
    int mult;
    mult = sizeof(DPSAcomplex)/sizeof(double);  /* doubles per complex sample */
    if (kill(parent, SIGUSR2) == -1) {
        cout << "Node " << my_rank << " could not signal parent" << endl;
        exit(1);
    }
    while (numcomm < TOTAL_ITER*p/numdata - 1) {
        /* Wait here until the main program tells you to begin */
        pthread_mutex_lock(&start_lock);
        while (!(*startnow))
            pthread_cond_wait(&start_cond, &start_lock);
        *startnow = 0;
        pthread_mutex_unlock(&start_lock);
        pthread_mutex_lock(&datalock);
        memcpy(array[my_rank], x, numdata*sizeof(DPSAcomplex));
        pthread_mutex_unlock(&datalock);
        /* Barrier synchronization */
        MPI_Barrier(MPI_COMM_WORLD);
#ifdef ISE
        comm_start = MPI_Wtime();
#else
        comm_start = gethrtime();
#endif
        /* Begin communication: send this node's samples to every other
           node, then receive every other node's samples */
        for (int comvar=0; comvar<p; comvar++)
            if (comvar != my_rank) {
                if (MPI_Send(array[my_rank], mult*numdata, MPI_DOUBLE, comvar,
                             numcomm*my_rank+numcomm, MPI_COMM_WORLD)) {
                    cout << "Node " << my_rank;
                    perror(" could not send");
                    exit(1);
                }
            }
        for (int comvar=0; comvar<p; comvar++)
            if (comvar != my_rank) {
                if (MPI_Recv(array[comvar], mult*numdata, MPI_DOUBLE, comvar,
                             numcomm*comvar+numcomm, MPI_COMM_WORLD, &status)) {
                    cout << "Node " << my_rank;
                    perror(" could not receive");
                    exit(1);
                }
            }
#ifdef ISE
        comm_end = MPI_Wtime();
#else
        comm_end = gethrtime();
#endif
        /* End communication and tell the main program you are done
           communicating */
        pthread_mutex_lock(&my_lock);
        *done = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&my_lock);
        comm_time[numcomm] = comm_end - comm_start;
        numcomm++;
    }
    return NULL;
}
/**********************************************************************
 int main(int argc, char **argv)
 Description:
   Main program that handles all calculations and controls the
   communication thread.
**********************************************************************/
int main(int argc, char **argv)
{
    pthread_t pid;
    FILE *timeout;
    DPSAcomplex ***steering_vectors;
    DPSAcomplex **inverse_csm, **prev_inv, **currentarray;
    DPSAcomplex *col;
    DPSAcomplex w2, *w3, one;
    double *final;
    double lambda = 0.9;              // forgetting factor
    double center = 0;
    double *degs;
    double *node_distances;
    double **delay;
    int source, dest, name_length;
    char str[32], datastr[32], commstr[32], csmstr[32], threadstr[32],
         invstr[32], beamstr[32], matrixstr[12], currentarraystr[18];
    char my_name[11];
    int ise_iter;
    int index;
    DPSAcomplex **csm;
    DPSAcomplex zero;
    zero.real = zero.imag = 0.0;
    int num_of_fft = NUM_OF_FFT;
    int handling_freq_bin;
    int num_of_steering_angles = NUM_OF_STEERING_ANGLES;
    int num_of_output_angles = NUM_OF_OUTPUT_ANGLES;
    int num_of_samples = NUM_OF_SAMPLES;
    int length_of_window = LENGTH_OF_WINDOW;
    int m, g, j, k, i, h;
    int imax;
    DPSAcomplex dum, sum;
    int icol, l, irow, ll;
    DPSAcomplex pivinv, temp;
    int *indxc, *indxr, *ipiv;
    double big;
    int ii = 0, ip;
    double speed_of_sound = SPEED_OF_SOUND;
    double sampling_freq = SAMPLING_FREQ;
    double spacing = SPACING;
    double temp1, temp2;
    double freq_resolution;
    double epsilon = EPSILON;
    char filename[80];
    int numtotal = 0;
    /* MPI Initialization */
#ifdef ISE
    MPI_Notime_init(&argc, &argv);
#else
    MPI_Init(&argc, &argv);
#endif
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Get_processor_name(my_name, &name_length);
    my_name[name_length] = '\0';
#ifdef ISE
    double beam_start, beam_end, beam_time;
    double steering_start, steering_end, steering_time, inverse_start,
           inverse_end, inverse_time;
    double csm_start, csm_end, csm_time;
    double total_start, total_end, total_time;
    double comm_start, comm_end, comm_time;
#else
    hrtime_t beam_start, beam_end, *beam_time;
    hrtime_t csm_start, csm_end, *csm_time, inverse_start, inverse_end,
             *inverse_time;
#endif
    if (p == 4) numdata = p*p;
    else numdata = p*p;
    csm_time = new hrtime_t[TOTAL_ITER*p];
    comm_time = new hrtime_t[TOTAL_ITER*p/numdata - 1];
    beam_time = new hrtime_t[TOTAL_ITER*p - 2*p];
    inverse_time = new hrtime_t[TOTAL_ITER*p - p];
    thread_time = new hrtime_t[TOTAL_ITER*p/numdata - 1];
    const char *ext = ".dat";
    /* File handling code */
    char my_num[3], nodes[3];
    itoa(my_rank, my_num);
    itoa(p, nodes);
    strcpy(str, "sim");               strcat(str, my_num);       strcat(str, ext);
    strcpy(datastr, "node");          strcat(datastr, my_num);   strcat(datastr, ext);
    strcpy(commstr, "commtimes");     strcat(commstr, my_num);   strcat(commstr, ext);
    strcpy(csmstr, "csmtimes");       strcat(csmstr, my_num);    strcat(csmstr, ext);
    strcpy(threadstr, "threadtimes"); strcat(threadstr, my_num); strcat(threadstr, ext);
    strcpy(invstr, "invtimes");       strcat(invstr, my_num);    strcat(invstr, ext);
    strcpy(beamstr, "beamtimes");     strcat(beamstr, my_num);   strcat(beamstr, ext);
    ifstream mydata(datastr, ios::in);
    freq_resolution = sampling_freq/num_of_fft;
    handling_freq_bin = (int)((num_of_fft/2) + 1);
    /* Beamforming variables */
    one.real = 1;
    one.imag = 0;
    /* ---- Allocate dynamic memory ---- */
    //tempcsm = (DPSAcomplex**)malloc(p*sizeof(DPSAcomplex*));
    array = new DPSAcomplex*[p];
    currentarray = new DPSAcomplex*[p];
    csm = (DPSAcomplex**)malloc(p*sizeof(DPSAcomplex*));
    inverse_csm = (DPSAcomplex**)malloc(p*sizeof(DPSAcomplex*));
    prev_inv = (DPSAcomplex**)malloc(p*sizeof(DPSAcomplex*));
    x = new DPSAcomplex[numdata];
    col = new DPSAcomplex[p];
    final = (double*)malloc(num_of_steering_angles*sizeof(double));
    w3 = new DPSAcomplex[num_of_steering_angles];
    degs = (double*)malloc(NUM_OF_STEERING_ANGLES*sizeof(double));
    node_distances = (double*)malloc(p*sizeof(double));
    steering_vectors =
        (DPSAcomplex***)malloc(NUM_OF_STEERING_ANGLES*sizeof(DPSAcomplex**));
    delay = (double**)malloc(NUM_OF_STEERING_ANGLES*sizeof(double*));
    indxc = new int[p];
    indxr = new int[p];
    ipiv = new int[p];
    for (m=0; m<p; m++) {
        array[m] = new DPSAcomplex[numdata];
        currentarray[m] = new DPSAcomplex[numdata];
        prev_inv[m] = (DPSAcomplex*)malloc(p*sizeof(DPSAcomplex));
        inverse_csm[m] = (DPSAcomplex*)malloc(p*sizeof(DPSAcomplex));
        csm[m] = (DPSAcomplex*)malloc(p*sizeof(DPSAcomplex));
        // tempcsm[m] = (DPSAcomplex*)malloc(p*sizeof(DPSAcomplex));
    }
    for (i=0; i<num_of_steering_angles; i++) {
        delay[i] = (double*)malloc(p*sizeof(double));
        steering_vectors[i] =
            (DPSAcomplex**)malloc(handling_freq_bin*sizeof(DPSAcomplex*));
        w3[i] = zero;
        final[i] = 0.0;
        for (j=0; j<handling_freq_bin; j++)
            steering_vectors[i][j] =
                (DPSAcomplex*)malloc(p*sizeof(DPSAcomplex));
    }
    for (i=0; i<p; i++)
        for (m=0; m<p; m++)
            csm[i][m] = inverse_csm[i][m] = prev_inv[i][m] = zero;
    /* ---- Calculate relative position of the nodes ---- */
    for (i=0; i<p; i++) {
        node_distances[i] = i*SPACING;
        center += node_distances[i];
    }
    center /= p;
    for (i=0; i<p; i++) {
        node_distances[i] -= center;
    }
    /* ---- Calculate steering vectors ---- */
    k = num_of_steering_angles % 2;
    if (k == 1) {
        temp1 = (num_of_steering_angles - 1)/2;
        temp2 = 1/temp1;
        for (j=0; j<num_of_steering_angles; j++)
            degs[j] = -1 + (j*temp2);
    }
    else
        return 1;   /* the number of steering angles must be odd */
    for (i=0; i<num_of_steering_angles; i++)
        for (j=0; j<p; j++)
            delay[i][j] = node_distances[j]*degs[i]/speed_of_sound;
    free(node_distances);
    for (i=0; i<num_of_steering_angles; i++) {
        degs[i] = 180*asin(degs[i])/PI;
        for (j=0; j<handling_freq_bin; j++)
            for (k=0; k<p; k++) {
                temp1 = 2.0*PI*freq_resolution*j*delay[i][k];
                steering_vectors[i][j][k] = DPSA_Complex_etothej(temp1);
            }
    }
    free(degs);
    /* ---- Get first matrix: initialization ---- */
    strcpy(matrixstr, "matrix");
    strcat(matrixstr, nodes);
    strcat(matrixstr, ext);
    strcpy(currentarraystr, "currentarray");
    strcat(currentarraystr, nodes);
    strcat(currentarraystr, ext);
    ifstream inmatrix(matrixstr, ios::in);
    for (m=0; m<p; m++)
        for (l=0; l<p; l++)
            inmatrix >> csm[m][l].real >> csm[m][l].imag;
    inmatrix.close();
    ifstream incurrentarray(currentarraystr, ios::in);
    for (l=0; l<p; l++)
        for (int t=0; t<numdata; t++)
            incurrentarray >> currentarray[l][t].real >>
                              currentarray[l][t].imag;
    incurrentarray.close();
    /* END of initialization */
    /* Signal handler setup and thread creation */
    int firstpause = 1;
    int first = 1;
    int firstthread = 1;
    sigset_t sigset;
    struct sigaction doneact;
    doneact.sa_handler = thread_began;
    sigemptyset(&doneact.sa_mask);
    doneact.sa_flags = 0;
    parent = getpid();
    sigemptyset(&sigset);
    sigaddset(&sigset, SIGUSR2);
    if (sigaction(SIGUSR2, &doneact, NULL) == -1)
        perror("Could not install signal handler");
    if (pthread_create(&pid, NULL, communicate, NULL)) {
        perror("Thread creation was not successful");
        exit(1);
    }
    /* Wait until signal handler returns */
    while (signal_received_main == 0)
        pause();
    signal_received_main = 0;
    /* Barrier synchronization */
    MPI_Barrier(MPI_COMM_WORLD);
    cout << "Node " << my_rank << " (" << my_name
         << "): ready to begin..." << endl;
    for (sys=0; sys<TOTAL_ITER; sys++)
    {
        for (total=0; total<p; total++) {
            iter++;
            /* Read data */
            if ((first || count==numdata-1) &&
                sys < (TOTAL_ITER - (2*numdata/p))) {
                first = 0;
                for (int t=0; t<numdata; t++)
                    mydata >> x[t].real >> x[t].imag;
            }
            /* ---- End reading data ---- */
            /* Tell thread to begin first communication cycle */
            if (firstthread) {
                firstthread = 0;
                MPI_Barrier(MPI_COMM_WORLD);
                pthread_mutex_lock(&start_lock);
                *startnow = 1;
                pthread_cond_signal(&start_cond);
                pthread_mutex_unlock(&start_lock);
            }
            if (count == numdata) {
                cout << "Node " << my_rank
                     << ": data count overflow, must exit" << endl;
                exit(1);
            }
            /* CSDM creation and summation to running average */
#ifdef ISE
            csm_start = MPI_Wtime();
#else
            csm_start = gethrtime();
#endif
            for (i=0; i<p; i++)
                for (j=0; j<p; j++) {
                    if (currentarray[i][count]==zero ||
                        currentarray[j][count]==zero) {
                        cerr << "Must exit due to currentarray element = 0"
                             << endl;
                        exit(1);
                    }
                    csm[i][j] +=
                        currentarray[i][count]*!currentarray[j][count];
                }
#ifdef ISE
            csm_end = MPI_Wtime();
#else
            csm_end = gethrtime();
#endif
            csm_time[numcsm++] = csm_end - csm_start;
            count++;
            if (count==numdata && sys < (TOTAL_ITER - (2*numdata/p))) {
                /* Wait for thread to tell you it is done communicating */
                pthread_mutex_lock(&my_lock);
                thread_start = gethrtime();
                while (!(*done))
                    pthread_cond_wait(&cond, &my_lock);
                thread_end = gethrtime();
                *done = 0;
                pthread_mutex_unlock(&my_lock);
                thread_time[numthread] = thread_end - thread_start;
                numthread++;
                count = 0;
                pthread_mutex_lock(&datalock);
                for (i=0; i<p; i++)
                    memcpy(currentarray[i], array[i],
                           numdata*sizeof(DPSAcomplex));
                pthread_mutex_unlock(&datalock);
                if (sys < (TOTAL_ITER - (2*numdata/p))) {
                    /* Barrier synchronization */
                    MPI_Barrier(MPI_COMM_WORLD);
                    /* Tell thread to begin new communication cycle */
                    pthread_mutex_lock(&start_lock);
                    *startnow = 1;
                    pthread_cond_signal(&start_cond);
                    pthread_mutex_unlock(&start_lock);
                }
            }
            if (!(sys==TOTAL_ITER-1 && total>=my_rank)) {
                if (total == my_rank) {
                    /* Ensemble averaging */
#ifdef ISE
                    csm_start = MPI_Wtime();
#else
                    csm_start = gethrtime();
#endif
                    for (m=0; m<p; m++)
                        for (j=0; j<p; j++)
                            inverse_csm[m][j] =
                                csm[m][j]/(iter+time_avg);
#ifdef ISE
                    csm_end = MPI_Wtime();
#else
                    csm_end = gethrtime();
#endif
                    csm_time[numcsm-1] += csm_end - csm_start;
                    for (j=0; j<p; j++) ipiv[j] = 0;
                    stage = 0;
                }
                if (iter >= my_rank) {
                    /* Begin one stage of Gauss-Jordan elimination for
                       matrix inversion */
#ifdef ISE
                    inverse_start = MPI_Wtime();
#else
                    inverse_start = gethrtime();
#endif
                    big = 0.0;
                    for (j=0; j<p; j++)
                        if (ipiv[j] != 1)
                            for (k=0; k<p; k++) {
                                if (ipiv[k] == 0) {
                                    if (DPSA_Complex_abs(inverse_csm[j][k]) >= big) {
                                        big = DPSA_Complex_abs(inverse_csm[j][k]);
                                        irow = j;
                                        icol = k;
                                    }
                                } else if (ipiv[k] > 1) exit(1);
                            }
                    ++(ipiv[icol]);
                    if (irow != icol)
                        for (l=0; l<p; l++) SWAP(inverse_csm[irow][l],
                                                 inverse_csm[icol][l]);
                    indxr[stage] = irow;
                    indxc[stage] = icol;
                    if (inverse_csm[icol][icol] == zero) exit(1);
                    pivinv = 1.0/inverse_csm[icol][icol];
                    inverse_csm[icol][icol].real = 1.0;
                    inverse_csm[icol][icol].imag = 0.0;
                    for (l=0; l<p; l++) inverse_csm[icol][l] *= pivinv;
                    for (ll=0; ll<p; ll++)
                        if (ll != icol) {
                            dum = inverse_csm[ll][icol];
                            inverse_csm[ll][icol] = zero;
                            for (l=0; l<p; l++) inverse_csm[ll][l] -=
                                inverse_csm[icol][l]*dum;
                        }
                    if (stage == (p-1)) {
                        for (l=p-1; l>=0; l--) {
                            if (indxr[l] != indxc[l])
                                for (k=0; k<p; k++)
                                    SWAP(inverse_csm[k][indxr[l]],
                                         inverse_csm[k][indxc[l]]);
                        }
                    }
                    stage++;
#ifdef ISE
                    inverse_end = MPI_Wtime();
#else
                    inverse_end = gethrtime();
#endif
                    inverse_time[numinv++] = inverse_end - inverse_start;
                    /* End matrix inversion stage */
                    if (stage == p) {
                        // Saving it as transpose for faster memory access
                        for (i=0; i<p; i++)
                            for (j=0; j<p; j++)
                                prev_inv[i][j] = inverse_csm[j][i];
                    }
                    /* Begin one steering stage */
                    if (iter >= p+my_rank) {
#ifdef ISE
                        beam_start = MPI_Wtime();
#else
                        beam_start = gethrtime();
#endif
                        for (k=0; k<num_of_steering_angles; k++) {
                            w2 = zero;
                            for (j=0; j<p; j++)
                                w2 += !steering_vectors[k][21][j]
                                      *prev_inv[stage-1][j];
                            w3[k] += w2*steering_vectors[k][21][stage-1];
                            if (stage == p) {
                                final[k] += DPSA_Complex_abs(1.0/(w3[k]));
                                w3[k] = zero;
                            }
                        }
#ifdef ISE
                        beam_end = MPI_Wtime();
#else
                        beam_end = gethrtime();
#endif
                        beam_time[numbeam++] = beam_end - beam_start;
                        /* End steering stage */
                    }
                }
            }
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);
    /* Kill the thread */
    pthread_join(pid, NULL);
    MPI_Finalize();
    /* Do final averaging of results */
    for (k=0; k<num_of_steering_angles; k++)
        final[k] /= ((TOTAL_ITER - 2)*num_of_steering_angles/p);
    /* ================= End of beamforming =================== */
    mydata.close();
    ofstream dataout(str, ios::out);
    for (i=0; i<num_of_steering_angles; i++)
        dataout << final[i] << endl;
    dataout.close();
    free(final);
    // ---- Deallocate memory ----
    for (m=0; m<p; m++) {
        free(inverse_csm[m]);
        free(prev_inv[m]);
        free(csm[m]);
    }
    for (i=0; i<num_of_steering_angles; i++) {
        free(delay[i]);
        for (j=0; j<handling_freq_bin; j++)
            free(steering_vectors[i][j]);
        free(steering_vectors[i]);
    }
    free(steering_vectors);
    delete [] ipiv;
    delete [] indxr;
    delete [] indxc;
    delete done;
    delete startnow;
    delete [] col;
    free(delay);
    free(csm);
    free(prev_inv);
    free(inverse_csm);
    delete [] x;
    delete [] w3;
    delete [] currentarray;
    delete [] array;
    /* Write timing results to files */
    // ---- Do Commtimes ----
    ofstream commout(commstr, ios::out);
    if (!commout) {
        cerr << "Could not open commtimes .dat" << endl;
        exit(1);
    }
    for (int z=0; z<numcomm; z++) {
        commout << comm_time[z] << endl;
    }
    commout.close();
    // ---- End Commtimes ----
    // ---- Do CSMtimes ----
    ofstream cutitout(csmstr, ios::out);
    if (!cutitout) {
        cerr << "Could not open csmtimes .dat" << endl;
        exit(1);
    }
    for (int z=0; z<numcsm; z++) {
        cutitout << csm_time[z] << endl;
    }
    cutitout.close();
    // ---- End CSMtimes ----
    // ---- Do INVtimes ----
    ofstream invout(invstr, ios::out);
    if (!invout) {
        cerr << "Could not open invtimes .dat" << endl;
        exit(1);
    }
    for (int z=0; z<numinv; z++) {
        invout << inverse_time[z] << endl;
    }
    invout.close();
    // ---- End INVtimes ----
    // ---- Do threadtimes ----
    ofstream threadout(threadstr, ios::out);
    if (!threadout) {
        cerr << "Could not open threadtimes .dat" << endl;
        exit(1);
    }
    for (int z=0; z<numthread; z++) {
        threadout << thread_time[z] << endl;
    }
    threadout.close();
    // ---- End threadtimes ----
    // ---- Do beamtimes ----
    ofstream beamout(beamstr, ios::out);
    if (!beamout) {
        cerr << "Could not open beamtimes .dat" << endl;
        exit(1);
    }
    for (int z=0; z<numbeam; z++) {
        beamout << beam_time[z] << endl;
    }
    beamout.close();
    // ---- End beamtimes ----
    delete [] inverse_time;
    delete [] comm_time;
    delete [] thread_time;
    delete [] csm_time;
    delete [] beam_time;
    cout << "Node " << my_rank << ": done." << endl;
    return 0;
}
File: inline_strict_sequential.C

The inline_strict_sequential.C program is executed by a single workstation from
the same cluster that executed the parallel code.
/* Minimum Variance Distortionless Response (MVDR) Beamforming
   Jesus Garcia
   Sequential program of the MVDR algorithm
   High Performance Computing and Simulation Research Lab
   garcia@hcs.ufl.edu
   Began: 12/2/98
   Last modified: 5/99
   Filename: ~garcia/dpsa/c/seq/mincomp/strictly_sequential/
             inline_strict_sequential.C
*/
#include <iostream.h>
#include <fstream.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include "abf.h"

#ifndef ISE
#define PON ;
#define POFF ;
#define TOTAL_ITER 576
#else
#define TOTAL_ITER 576
#endif
/* Macro used in matrix inversion for pivoting */
#define SWAP(a, b) {temp=(a);(a)=(b);(b)=temp;}
int main(int argc, const char *argv[])
{
    srand(time(NULL));
    FILE *timeout;
    DPSAcomplex ***steering_vectors;
    DPSAcomplex **csm, **inverse_csm;
    DPSAcomplex *w2, *w3, one, *x;
    DPSAcomplex tempcomplex;
    double *final;
    double *angles;
    double rad;
    double center = 0;
    double *degs;
    double *node_distances;
    double **delay;
    double value;
    double db;
    double d;
    DPSAcomplex **tempcsm;
    DPSAcomplex zero;
    int time_avg = 10000;
    zero.real = zero.imag = 0.0;
    int num_of_steering_angles = NUM_OF_STEERING_ANGLES;
    int num_of_nodes = NUM_OF_NODES;
    int num_of_output_angles = NUM_OF_OUTPUT_ANGLES;
    int num_of_samples = NUM_OF_SAMPLES;
    int length_of_window = LENGTH_OF_WINDOW;
    int half_node;
    int m, g, j, k, i, div_pt, sys, h;
    int ise_iter;
    int icol, l, irow, ll;
    double big;
    DPSAcomplex dum, pivinv, temp;
    int *indxc, *indxr, *ipiv;
    double speed_of_sound = SPEED_OF_SOUND;
    double sampling_freq = SAMPLING_FREQ;
    double spacing = SPACING;
    double temp1, temp2;
    double freq_resolution;
    char filename[80], matrixstr[12], currentarraystr[16];
    /* Timing variables */
#ifdef ISE
    double beam_start, beam_end, beam_time;
    double steering_start, steering_end, steering_time, inverse_start,
           inverse_end, inverse_time;
    double csm_start, csm_end, csm_time;
    double total_time;
#else
    hrtime_t beam_start, beam_end, beam_time;
    hrtime_t csm_start, csm_end, csm_time, inverse_start, inverse_end,
             inverse_time;
    hrtime_t total_time;
#endif
    csm_time = beam_time = total_time = inverse_time = 0;
#ifdef ISE
    MPI_Iteration(&ise_iter);
#endif
    num_of_nodes = atoi(*(argv+1));
    /* Beamforming variables */
    w2 = (DPSAcomplex*)malloc(num_of_nodes*sizeof(DPSAcomplex));
    w3 = (DPSAcomplex*)malloc(sizeof(DPSAcomplex));
    one.real = 1;
    one.imag = 0;
    /* ---- Allocate dynamic memory ---- */
    csm = (DPSAcomplex**)malloc(num_of_nodes*sizeof(DPSAcomplex*));
    inverse_csm =
        (DPSAcomplex**)malloc(num_of_nodes*sizeof(DPSAcomplex*));
    final = (double*)malloc(num_of_steering_angles*sizeof(double));
    delay = (double**)malloc(NUM_OF_STEERING_ANGLES*sizeof(double*));
    degs = (double*)malloc(NUM_OF_STEERING_ANGLES*sizeof(double));
    angles = (double*)malloc(NUM_OF_OUTPUT_ANGLES*sizeof(double));
    node_distances = (double*)malloc(num_of_nodes*sizeof(double));
    tempcsm = (DPSAcomplex**)malloc(num_of_nodes*sizeof(DPSAcomplex*));
    indxc = new int[num_of_nodes];
    indxr = new int[num_of_nodes];
    ipiv = new int[num_of_nodes];
    x = new DPSAcomplex[num_of_nodes];
    for (m=0; m<num_of_nodes; m++) {
        csm[m] = (DPSAcomplex*)malloc(num_of_nodes*sizeof(DPSAcomplex));
        inverse_csm[m] =
            (DPSAcomplex*)malloc(num_of_nodes*sizeof(DPSAcomplex));
        tempcsm[m] =
            (DPSAcomplex*)malloc(num_of_nodes*sizeof(DPSAcomplex));
    }
    steering_vectors =
        (DPSAcomplex***)malloc(NUM_OF_STEERING_ANGLES*sizeof(DPSAcomplex**));
    for (i=0; i<num_of_steering_angles; i++)
    {
        delay[i] = (double*)malloc(num_of_nodes*sizeof(double));
        steering_vectors[i] =
            (DPSAcomplex**)malloc(HANDLING_FREQ_BIN*sizeof(DPSAcomplex*));
        final[i] = 0.0;
        for (j=0; j<HANDLING_FREQ_BIN; j++)
            steering_vectors[i][j] =
                (DPSAcomplex*)malloc(num_of_nodes*sizeof(DPSAcomplex));
    }
    /* ---- Calculate relative position of the nodes ---- */
    for (i=0; i<num_of_nodes; i++) {
        node_distances[i] = i*SPACING;
        center += node_distances[i];
    }
    center /= num_of_nodes;
    for (i=0; i<num_of_nodes; i++) {
        node_distances[i] -= center;
    }
    /* ---- Calculate steering vectors ---- */
    k = num_of_steering_angles % 2;
    freq_resolution = (double)SAMPLING_FREQ/NUM_OF_FFT;
    if (k == 1) {
        temp1 = (num_of_steering_angles - 1)/2;
        temp2 = 1/temp1;
        for (j=0; j<num_of_steering_angles; j++)
            degs[j] = -1 + (j*temp2);
    }
    else
        return 1;
    for (i=0; i<num_of_steering_angles; i++)
        for (j=0; j<num_of_nodes; j++)
            delay[i][j] = node_distances[j]*degs[i]/speed_of_sound;
    for (i=0; i<num_of_steering_angles; i++) {
        for (j=0; j<HANDLING_FREQ_BIN; j++) {
            for (k=0; k<num_of_nodes; k++) {
                temp1 = 2.0*PI*freq_resolution*j*delay[i][k];
                steering_vectors[i][j][k] = DPSA_Complex_etothej(temp1);
            }
        }
    }
    for (i=0; i<num_of_steering_angles; i++)
        degs[i] = 180*asin(degs[i])/PI;
    /* =========== End of initial phase and start beamforming ========== */
    int n = num_of_nodes;
    for (m=0; m<n; m++)
        for (j=0; j<n; j++)
            inverse_csm[m][j] = csm[m][j] = tempcsm[m][j] = zero;
    /* File handling */
    switch (num_of_nodes) {
    case 4:
        memcpy(matrixstr, "matrix4.dat", 12);
        memcpy(currentarraystr, "data_array4.dat", 16); break;
    case 6:
        memcpy(matrixstr, "matrix6.dat", 12);
        memcpy(currentarraystr, "data_array6.dat", 16); break;
    case 8:
        memcpy(matrixstr, "matrix8.dat", 12);
        memcpy(currentarraystr, "data_array8.dat", 16); break;
    }
    ifstream inmatrix(matrixstr, ios::in);
    ifstream ourdata(currentarraystr, ios::in);
    /* Read in initialization matrices */
    for (m=0; m<n; m++)
        for (j=0; j<n; j++)
            inmatrix >> tempcsm[m][j].real >> tempcsm[m][j].imag;
    inmatrix.close();
    /* Begin algorithm */
    int total = TOTAL_ITER*num_of_nodes;
    for (sys=0; sys<total; sys++) {
        /* Read in the data */
        for (i=0; i<n; i++)
            ourdata >> x[i].real >> x[i].imag;
        /* CSDM formation and ensemble averaging */
#ifdef ISE
        csm_start = MPI_Wtime();
#else
        csm_start = gethrtime();
#endif
        for (m=0; m<n; m++)
            for (j=0; j<n; j++)
                tempcsm[m][j] += x[m]*!x[j];
        for (m=0; m<n; m++)
            for (j=0; j<n; j++)
                inverse_csm[m][j] = tempcsm[m][j]/(sys+1+time_avg);
#ifdef ISE
        csm_end = MPI_Wtime();
#else
        csm_end = gethrtime();
#endif
        csm_time += csm_end - csm_start;
        /* Matrix inversion using Gauss-Jordan elimination */
#ifdef ISE
        inverse_start = MPI_Wtime();
#else
        inverse_start = gethrtime();
#endif
        for (j=0; j<n; j++) ipiv[j] = 0;
        for (i=0; i<n; i++) {
            big = 0.0;
            for (j=0; j<n; j++)
                if (ipiv[j] != 1)
                    for (k=0; k<n; k++) {
                        if (ipiv[k] == 0) {
                            if (DPSA_Complex_abs(inverse_csm[j][k]) >= big) {
                                big = DPSA_Complex_abs(inverse_csm[j][k]);
                                irow = j;
                                icol = k;
                            }
                        } else if (ipiv[k] > 1) exit(1);
                    }
            ++(ipiv[icol]);
            if (irow != icol)
                for (l=0; l<n; l++) SWAP(inverse_csm[irow][l],
                                         inverse_csm[icol][l]);
            indxr[i] = irow;
            indxc[i] = icol;
            if (inverse_csm[icol][icol] == zero) exit(1);
            pivinv = 1.0/inverse_csm[icol][icol];
            inverse_csm[icol][icol].real = 1.0;
            inverse_csm[icol][icol].imag = 0.0;
            for (l=0; l<n; l++) inverse_csm[icol][l] *= pivinv;
            for (ll=0; ll<n; ll++)
                if (ll != icol) {
                    dum = inverse_csm[ll][icol];
                    inverse_csm[ll][icol] = zero;
                    for (l=0; l<n; l++) inverse_csm[ll][l] -=
                        inverse_csm[icol][l]*dum;
                }
        }
        for (l=n-1; l>=0; l--) {
            if (indxr[l] != indxc[l])
                for (k=0; k<n; k++)
                    SWAP(inverse_csm[k][indxr[l]], inverse_csm[k][indxc[l]]);
        }
#ifdef ISE
        inverse_end = MPI_Wtime();
#else
        inverse_end = gethrtime();
#endif
        inverse_time += inverse_end - inverse_start;
        /* ---- End of Inverse ---- */
        /* Begin beamforming */
#ifdef ISE
        beam_start = MPI_Wtime();
#else
        beam_start = gethrtime();
#endif
        for (k=0; k<num_of_steering_angles; k++) {
            for (i=0; i<n; i++) {
                w2[i] = zero;
                for (j=0; j<n; j++)
                    w2[i] += !steering_vectors[k][21][j]*inverse_csm[j][i];
            }
            *w3 = zero;
            for (i=0; i<n; i++)
                *w3 += w2[i]*steering_vectors[k][21][i];
            final[k] += DPSA_Complex_abs(1.0/(*w3));
        }
#ifdef ISE
        beam_end = MPI_Wtime();
#else
        beam_end = gethrtime();
#endif
        beam_time += beam_end - beam_start;
    }
    /* Averaging of final results */
    for (k=0; k<num_of_steering_angles; k++)
        final[k] /= total;
    /* =================== End of beamforming =================== */
    ofstream results("result.dat", ios::out);
    ofstream angles2("angles.dat", ios::out);
    results.precision(6);
    for (i=0; i<num_of_steering_angles; i++) {
        // cout << degs[i] << " " << final[i] << endl;
        results << final[i] << endl;
        angles2 << degs[i] << endl;
    }
    angles2.close();
    beam_time /= total;
    csm_time /= total;
    inverse_time /= total;
    results << "avg csm " << csm_time << endl;
    results << "avg inverse " << inverse_time << endl;
    results << "avg beam " << beam_time << endl;
    results.close();
    /* ---- Release dynamic memory ---- */
    for (m=0; m<num_of_nodes; m++) {
        free(inverse_csm[m]);
        free(tempcsm[m]);
        free(csm[m]);
    }
    for (i=0; i<num_of_steering_angles; i++) {
        free(delay[i]);
        for (j=0; j<HANDLING_FREQ_BIN; j++)
            free(steering_vectors[i][j]);
        free(steering_vectors[i]);
    }
    free(steering_vectors);
    delete [] ipiv;
    delete [] indxr;
    delete [] indxc;
    delete [] x;
    free(delay);
    free(csm);
    free(inverse_csm);
    free(w2);
    free(w3);
    free(degs);
    free(angles);
    free(node_distances);
    free(final);
    free(tempcsm);
}
File: abf.h

The following header file, abf.h, contains the declarations of the functions used to
overload the operators for the DPSAcomplex data type. It is included in all
source files used for this thesis.
/* Header file with declarations */
#ifndef ABF_H
#define ABF_H
#include <iostream.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
typedef struct
{
    double real;
    double imag;
} DPSAcomplex;
void printmatrix(DPSAcomplex **, int);
int window(double *weight, int length_of_window);
int steered_time(double *out_tj, double center, double *degs);
void svdcomplex(DPSAcomplex **a, int m, int n, DPSAcomplex w[],
    DPSAcomplex **v);
DPSAcomplex DPSA_Complex_multiply(DPSAcomplex c1, DPSAcomplex c2);
DPSAcomplex DPSA_Complex_divide(DPSAcomplex c1, DPSAcomplex c2);
DPSAcomplex DPSA_Complex_divide2(DPSAcomplex c1, DPSAcomplex c2);
int DPSA_Complex_colvect_times_rowvect(DPSAcomplex **result, int rows1,
    DPSAcomplex *vect1, int columns, DPSAcomplex *vect2, int
    conj_vect2);
DPSAcomplex DPSA_Complex_conj(DPSAcomplex c);
DPSAcomplex DPSA_Complex_sqrt(DPSAcomplex c);
DPSAcomplex DPSA_Complex_etothej(double arg);
double DPSA_Complex_abs(DPSAcomplex c);
double DPSA_Complex_abs_nosqrt(DPSAcomplex c);
int DPSA_Complex_matmult(DPSAcomplex **result, int rows1, int columns1,
    DPSAcomplex **mat1, int rows2, int columns2, DPSAcomplex **mat2);
int DPSA_Complex_rowvect_times_colvect(DPSAcomplex *result, int
    columns1, DPSAcomplex *vect1, int rows2, DPSAcomplex *vect2, int
    conj_vect1);
int DPSA_Complex_conj_trans(DPSAcomplex **input, DPSAcomplex **result,
    long rows, long cols);
int DPSA_Nonoptimized_inplacefft(DPSAcomplex *xforms, int nn, int
    isign);
ostream &operator<<(ostream &output, const DPSAcomplex &c1);
DPSAcomplex operator-(DPSAcomplex c1, DPSAcomplex c2);
DPSAcomplex operator-(DPSAcomplex c1);
DPSAcomplex operator+(DPSAcomplex c1, DPSAcomplex c2);
DPSAcomplex operator*(DPSAcomplex c1, DPSAcomplex c2);
DPSAcomplex operator*(DPSAcomplex c1, double val);
DPSAcomplex operator*(DPSAcomplex c1, float val);
DPSAcomplex &operator++(DPSAcomplex &c1);
DPSAcomplex operator/(DPSAcomplex c1, int val);
DPSAcomplex operator/(DPSAcomplex c1, DPSAcomplex c2);
DPSAcomplex operator!(DPSAcomplex &c1);
DPSAcomplex &operator*=(DPSAcomplex &c1, double val);
DPSAcomplex &operator*=(DPSAcomplex &c1, DPSAcomplex c2);
DPSAcomplex &operator/=(DPSAcomplex &c1, DPSAcomplex c2);
DPSAcomplex &operator/=(DPSAcomplex &c1, double val);
DPSAcomplex &operator+=(DPSAcomplex &c1, DPSAcomplex c2);
DPSAcomplex &operator+=(DPSAcomplex &c1, double val);
DPSAcomplex &operator-=(DPSAcomplex &c1, double val);
DPSAcomplex &operator-=(DPSAcomplex &c1, DPSAcomplex c2);
int operator==(DPSAcomplex c1, DPSAcomplex c2);
int operator>=(DPSAcomplex c1, DPSAcomplex c2);
int operator<=(DPSAcomplex c1, DPSAcomplex c2);
int operator>(DPSAcomplex c1, DPSAcomplex c2);
int operator<(DPSAcomplex c1, DPSAcomplex c2);
DPSAcomplex operator/(double val, DPSAcomplex c2);
DPSAcomplex operator+(double val, DPSAcomplex c2);
DPSAcomplex operator*(double val, DPSAcomplex c1);
int DPSA_Complex_rowvect_times_matrix(DPSAcomplex *result, DPSAcomplex
    *vect1, int columns1, DPSAcomplex **c1, int columns2, int conj1);
void DPSA_Complex_separate(DPSAcomplex **c, int rows, int cols, double
    **a, double **b);
void reverse(char s[]);
void itoa(int n, char s[]);
#endif
File: abf.c

The file, abf.c, contains the definitions of the overloaded operators that handle the
DPSAcomplex data type, along with other utility functions.
/* MVDR library file
   Jesus Garcia
   High Performance Computing and Simulation Research Lab
   garcia@hcs.ufl.edu
   12/98
   Filename: ~garcia/dpsa/c/utils/abf.c
*/
#include <iostream.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include "abf.h"
void reverse(char s[])
{
    int c, i, j;
    for (i=0, j=strlen(s)-1; i<j; i++, j--) {
        c = s[i];
        s[i] = s[j];
        s[j] = c;
    }
}
void itoa(int n, char s[])
{
    int i, sign;
    if ((sign=n) < 0) n = -n;
    i = 0;
    do {
        s[i++] = n%10 + '0';
    } while ((n/=10) > 0);
    if (sign < 0)
        s[i++] = '-';
    s[i] = '\0';
    reverse(s);
}
void printmatrix(DPSAcomplex **a, int n)
{
    for (int row=0; row<n; row++) {
        for (int col=0; col<n; col++)
            cout << a[row][col] << " ";
        cout << endl;
    }
}
/* ---- Some of the code below came from Jeff Markwell's source ---- */
ostream &operator<<(ostream &output, const DPSAcomplex &c1)
{
    output.precision(6);
    output << c1.real << " " << c1.imag;
    return output;
}
DPSAcomplex DPSA_Complex_multiply(DPSAcomplex c1, DPSAcomplex c2)
{
    DPSAcomplex r;
    r.real = c1.real*c2.real - c1.imag*c2.imag;
    r.imag = c1.real*c2.imag + c1.imag*c2.real;
    return r;
}
DPSAcomplex DPSA_Complex_divide(DPSAcomplex c1, DPSAcomplex c2)
{
    DPSAcomplex r;
    r.imag = (c1.imag - c2.imag*c1.real/c2.real)/c2.real
             /(1 + c2.imag*c2.imag/c2.real/c2.real);
    r.real = (c1.real + c2.imag*r.imag)/c2.real;
    return r;
}
DPSAcomplex operator-(DPSAcomplex c1, DPSAcomplex c2)
{
    DPSAcomplex result;
    result.real = c1.real - c2.real;
    result.imag = c1.imag - c2.imag;
    return result;
}
DPSAcomplex operator-(DPSAcomplex c1)
{
    DPSAcomplex result;
    result.real = -c1.real;
    result.imag = -c1.imag;
    return result;
}
DPSAcomplex operator+(DPSAcomplex c1, DPSAcomplex c2)
{
    DPSAcomplex result;
    result.real = c1.real + c2.real;
    result.imag = c1.imag + c2.imag;
    return result;
}
DPSAcomplex operator+(double val, DPSAcomplex c2)
{
    DPSAcomplex result;
    result.real = c2.real + val;
    result.imag = c2.imag;
    return result;
}
DPSAcomplex operator*(DPSAcomplex c1, DPSAcomplex c2)
{
    DPSAcomplex result;
    result.real = c1.real*c2.real - c1.imag*c2.imag;
    result.imag = c1.real*c2.imag + c1.imag*c2.real;
    return result;
}
DPSAcomplex operator*(DPSAcomplex c1, double val)
{
    DPSAcomplex result;
    result.real = c1.real*val;
    result.imag = c1.imag*val;
    return result;
}
DPSAcomplex operator*(double val, DPSAcomplex c1)
{
    DPSAcomplex result;
    result.real = c1.real*val;
    result.imag = c1.imag*val;
    return result;
}
DPSAcomplex &operator++(DPSAcomplex &c1)
{
    c1.real = c1.real + 1;
    return c1;
}
DPSAcomplex operator*(DPSAcomplex c1, float val)
{
    DPSAcomplex result;
    result.real = c1.real*val;
    result.imag = c1.imag*val;
    return result;
}
DPSAcomplex operator/(DPSAcomplex c1, int val)
{
    DPSAcomplex result;
    result.real = c1.real/val;
    result.imag = c1.imag/val;
    return result;
}
DPSAcomplex &operator/=(DPSAcomplex &c1, double val)
{
    c1.real = c1.real/val;
    c1.imag = c1.imag/val;
    return c1;
}
DPSAcomplex operator/(double val, DPSAcomplex c2)
{
    DPSAcomplex temp, result;
    temp = DPSA_Complex_conj(c2);
    result = (temp*val)/(c2*temp);
    return result;
}
int operator==(DPSAcomplex c1, DPSAcomplex c2)
{
    if (c1.real == c2.real && c1.imag == c2.imag)
        return 1;
    return 0;
}
int operator>=(DPSAcomplex c1, DPSAcomplex c2)
{
    if (DPSA_Complex_abs(c1) >= DPSA_Complex_abs(c2))
        return 1;
    return 0;
}
int operator>(DPSAcomplex c1, DPSAcomplex c2)
{
    if (DPSA_Complex_abs(c1) > DPSA_Complex_abs(c2))
        return 1;
    return 0;
}
int operator<(DPSAcomplex c1, DPSAcomplex c2)
{
    if (DPSA_Complex_abs(c1) < DPSA_Complex_abs(c2))
        return 1;
    return 0;
}
int operator<=(DPSAcomplex c1, DPSAcomplex c2)
{
    if (DPSA_Complex_abs(c1) <= DPSA_Complex_abs(c2))
        return 1;
    return 0;
}
DPSAcomplex &operator*=(DPSAcomplex &c1, double val)
{
    c1.real = c1.real*val;
    c1.imag = c1.imag*val;
    return c1;
}
DPSAcomplex &operator*=(DPSAcomplex &c1, DPSAcomplex c2)
{
    DPSAcomplex result;
    result.real = c1.real*c2.real - c1.imag*c2.imag;
    result.imag = c1.real*c2.imag + c1.imag*c2.real;
    c1 = result;
    return c1;
}
DPSAcomplex &operator/=(DPSAcomplex &c1, DPSAcomplex c2)
{   // in-place complex division; the real part uses the updated c1.imag
    c1.imag = (c1.imag - c2.imag*c1.real/c2.real)/c2.real
              /(1 + c2.imag*c2.imag/c2.real/c2.real);
    c1.real = (c1.real + c2.imag*c1.imag)/c2.real;
    return c1;
}
DPSAcomplex &operator+=(DPSAcomplex &c1, DPSAcomplex c2)
{
    c1.real = c1.real + c2.real;
    c1.imag = c1.imag + c2.imag;
    return c1;
}
DPSAcomplex &operator+=(DPSAcomplex &c1, double val)
{
    c1.real = c1.real + val;
    return c1;
}
DPSAcomplex &operator-=(DPSAcomplex &c1, double val)
{
    c1.real = c1.real - val;
    return c1;
}
DPSAcomplex &operator-=(DPSAcomplex &c1, DPSAcomplex c2)
{
    c1.real = c1.real - c2.real;
    c1.imag = c1.imag - c2.imag;
    return c1;
}
DPSAcomplex operator/(DPSAcomplex c1, DPSAcomplex c2)
{
    DPSAcomplex c;
    double r, den;
    if (fabs(c2.real) >= fabs(c2.imag)) {
        r = c2.imag/c2.real;
        den = c2.real + r*c2.imag;
        c.real = (c1.real + r*c1.imag)/den;
        c.imag = (c1.imag - r*c1.real)/den;
    } else {
        r = c2.real/c2.imag;
        den = c2.imag + r*c2.real;
        c.real = (c1.real*r + c1.imag)/den;
        c.imag = (c1.imag*r - c1.real)/den;
    }
    return c;
}
DPSAcomplex operator!(DPSAcomplex &c1)
{
    DPSAcomplex result;
    result.real = c1.real;
    result.imag = -c1.imag;   /* complex conjugate */
    return result;
}
double DPSA_Complex_abs(DPSAcomplex c)
{
    double x, y, ans, temp;
    x = (double)fabs(c.real);
    y = (double)fabs(c.imag);
    if (x == 0.0) ans = y;
    else if (y == 0.0)
        ans = x;
    else if (x > y) {
        temp = y/x;
        ans = x*sqrt(1.0 + temp*temp);
    } else {
        temp = x/y;
        ans = y*sqrt(1.0 + temp*temp);
    }
    return ans;
}
DPSAcomplex DPSA_Complex_divide2(DPSAcomplex c1, DPSAcomplex c2)
{
    DPSAcomplex r;
    double abs1, abs2;
    double arg1, arg2;
    abs1 = DPSA_Complex_abs(c1);
    arg1 = atan(c1.imag/c1.real);
    arg2 = atan(c2.imag/c2.real);
    abs2 = DPSA_Complex_abs(c2);
    r.real = (abs1/abs2)*cos(arg1 - arg2);
    r.imag = (abs1/abs2)*sin(arg1 - arg2);
    return r;
}
double DPSA_Complex_abs_nosqrt(DPSAcomplex c)
{
    double val;
    val = c.real*c.real + c.imag*c.imag;
    return val;
}
DPSAcomplex DPSA_Complex_conj(DPSAcomplex c)
{
    DPSAcomplex r;
    r.real = c.real;
    r.imag = -c.imag;
    return r;
}
int DPSA_Complex_conj_trans(DPSAcomplex **input, DPSAcomplex **result,
    long rows, long cols)
{
    for (int x=0; x<rows; x++)
        for (int y=0; y<cols; y++)
            result[y][x] = DPSA_Complex_conj(input[x][y]);
    return 0;
}
DPSAcomplex DPSA_Complex_etothej(double arg)
{
    DPSAcomplex r;
    r.real = cos(arg);
    r.imag = sin(arg);
    return r;
}
int DPSA_Complex_matmult(DPSAcomplex **result, int rows1, int columns1,
    DPSAcomplex **mat1, int rows2, int columns2, DPSAcomplex **mat2)
{
    int row, column, tempidx;
    DPSAcomplex multiplication;
    if (columns1 != rows2)
        return (-1);
    for (row=0; row<rows1; row++)
    {
        for (column=0; column<columns2; column++)
        {
            result[row][column].real = 0;
            result[row][column].imag = 0;
            for (tempidx=0; tempidx<columns1; tempidx++)
            {
                multiplication = mat1[row][tempidx]*mat2[tempidx][column];
                result[row][column].real += multiplication.real;
                result[row][column].imag += multiplication.imag;
            }
        }
    }
    return 0;
}
int DPSA_Complex_colvect_times_rowvect(DPSAcomplex **result, int rows1,
    DPSAcomplex *vect1, int columns, DPSAcomplex *vect2, int conj_vect2)
{
    int row, column;
    for (row=0; row<rows1; row++)
        for (column=0; column<columns; column++)
        {
            if (conj_vect2)
                result[row][column] = DPSA_Complex_multiply(vect1[row],
                    DPSA_Complex_conj(vect2[column]));
            else
                result[row][column] = DPSA_Complex_multiply(vect1[row],
                    vect2[column]);
        }
    return 0;
}
int DPSA_Complex_rowvect_times_colvect(DPSAcomplex *result,
    int columns1, DPSAcomplex *vect1, int rows2, DPSAcomplex *vect2, int
    conj_vect1)
{
    DPSAcomplex temp, *temp2;
    temp2 = (DPSAcomplex*)malloc(columns1*sizeof(DPSAcomplex));
    int idx, i;
    if (columns1 != rows2)
        return (-1);
    result->real = 0;
    result->imag = 0;
    if (conj_vect1) for (i=0; i<columns1; i++) temp2[i] = !vect1[i];
    else for (i=0; i<columns1; i++) temp2[i] = vect1[i];
    for (idx=0; idx<columns1; idx++) {
        temp = DPSA_Complex_multiply(temp2[idx], vect2[idx]);
        result->real += temp.real;
        result->imag += temp.imag;
    }
    free(temp2);
    return 0;
}
int DPSA_Complex_rowvect_times_matrix(DPSAcomplex *result, DPSAcomplex
    *vect1, int columns1, DPSAcomplex **c1, int columns2, int conj1)
{
    int i, j;
    DPSAcomplex sum, *temp;
    temp = (DPSAcomplex*)malloc(columns1*sizeof(DPSAcomplex));
    if (conj1) for (i=0; i<columns1; i++) temp[i] = !vect1[i];
    else for (i=0; i<columns1; i++) temp[i] = vect1[i];
    for (i=0; i<columns2; i++) {
        sum.real = 0;
        sum.imag = 0;
        for (j=0; j<columns1; j++) {
            sum += temp[j]*c1[j][i];
        }
        result[i] = sum;
    }
    free(temp);
    return 0;
}
DPSAcomplex DPSA_Complex_sqrt(DPSAcomplex z)
{
    DPSAcomplex c;
    double x, y, w, r;
    if ((z.real == 0.0) && (z.imag == 0.0)) {
        c.real = 0.0;
        c.imag = 0.0;
        return c;
    } else {
        x = fabs(z.real);
        y = fabs(z.imag);
        if (x >= y) {
            r = y/x;
            w = sqrt(x)*sqrt(0.5*(1.0 + sqrt(1.0 + r*r)));
        } else {
            r = x/y;
            w = sqrt(y)*sqrt(0.5*(r + sqrt(1.0 + r*r)));
        }
        if (z.real >= 0.0) {
            c.real = w;
            c.imag = z.imag/(2.0*w);
        } else {
            c.imag = (z.imag >= 0) ? w : -w;
            c.real = z.imag/(2.0*c.imag);
        }
        return c;
    }
}
APPENDIX D
MATLAB SOURCE CODE
File: cbf_in.m

The file, cbf_in.m, was written by Ken Kim to generate one data vector for one
frequency bin. It was used to generate the data read by the parallel and sequential
programs, and also the data used by all of the MATLAB programs.
function [f_waves] = cbf_in(M, D, C, F, angles, sn_ratio)
% [f_waves] = cbf_in(M, D, C, F, angles, sn_ratio)
% Make sonar data (M-by-1) for the angle (in frequency domain)
% f_waves  : frequency data (M-by-1)
% M        : number of nodes
% D        : spacing between nodes (feet)
% C        : speed of sound (feet/sec)
% F        : desired frequency of data (Hz)
% angles   : incoming angles in vector form (degrees)
% sn_ratio : signal-to-noise ratio in decibels (dB)
% ex) f_waves = cbf_in(10,10,1500,60,[0 45 60],10);
%     generates a 10-by-1 vector with 60 Hz frequency; incoming angles
%     0, 45, 60 are superimposed; SNR is 10 dB
%
% Ken Kim : kim@hcs.ufl.edu
% High-performance Computing and Simulation Research Lab
% 2/8/99
if nargin < 6, error('Not enough arguments.'), end
%clear;
%M = 10;
%D = 10;
%C = 1500;
%F = 60;
%fs = 1000;
%numsamp = 256;
%angles = [0 45 60];
%sn_ratio = 10;
%fo = 30;
%fmin = 0.2;
%fmax = 3;
%fnum = 128;
%maxmag = 2;
num_sig = max(size(angles));
if num_sig > M
    error('Number of signals cannot exceed the number of nodes!');
end
% SNR = 10*log(P_signal/P_noise)
% P_noise = P_signal/(exp(SNR/10))
j = sqrt(-1);
pw_sig = 1;
pw_noi = 1/(10^(sn_ratio/10));
Fx = randn(num_sig,1)*sqrt(pw_sig/(2*num_sig));
Fy = randn(num_sig,1)*sqrt(pw_sig/(2*num_sig));
S = Fx + j*Fy;
Nx = randn(M,1)*sqrt(pw_noi/2);
Ny = randn(M,1)*sqrt(pw_noi/2);
N = Nx + j*Ny;
DOA = pi*angles/180;
Td = D*sin(DOA)/C;
MM = (0:1:M-1);
M_pos = MM.*D;
cl = 0;
for i = 1:M,
    cl = cl + M_pos(i);
end
cl = cl/M;
rel_pos = M_pos(1:1:M) - cl;
rel_pos = rel_pos/D;
A = exp(j*rel_pos'*Td*2*pi*F);
f_waves = A*S + N;
File: cbf_vt.m

The file, cbf_vt.m, was also written by Ken Kim to generate the replica vectors
used for conventional and MVDR beamforming.
function [vt, degs] = cbf_vt(M, D, C, F, ndeg)
% Generate steering vectors for conventional beamforming
% Single phase center (in the middle of the array)
% [vt,degs] = cbf_vt(M,D,C,F,ndeg)
% Calculate steering vectors and degrees steered
% vt   : steering vectors (M-by-ndeg) in freq. domain
% N    : vector of bad nodes
% M    : number of nodes
% D    : spacing between nodes (feet)
% C    : speed of sound (feet/sec)
% F    : processing freq. (Hz)
% ndeg : number of degrees (odd number)
%
% Ken Kim : kim@hcs.ufl.edu
% High-performance Computing and Simulation Research Lab
% 2/8/99
if nargin < 5, error('Not enough arguments'), end
mm = (0:1:M-1);
m_pos = mm.*D;
cl = 0;
for i = 1:M,
    cl = cl + m_pos(i);
end
cl = cl/M;
rel1 = m_pos(1:1:M) - cl;
if rem(ndeg,2) == 1,
    temp1 = ndeg - 1;
    temp2 = temp1/2;
    temp3 = 1/temp2;
    incdeg = (-1:temp3:1);
else
    error('Number of degrees should be an odd number');
end
degs = 180.*asin(incdeg)./pi;
delay = (rel1'*incdeg./C);
j = sqrt(-1);
vt = exp(j*2*pi*F*delay);
File: abf.m
The file, abf.m, was used to compare the MVDR and conventional beamformers. It
takes as arguments the number of nodes and the angles of the incoming signals, and it
generates beampatterns and final beamforming plots for both types of beamformers.
function abf(M, in_angle)
%function abf(M, in_angle);
% M        : number of nodes
% in_angle : incoming angles of signal
%clear
J = 0;
F = 30;                 % Base freq., processing band 0.2F-3.0F
T = 1/F;                % Period (s)
C = 1500;               % Speed of sound (m/s)
L = C/F;                % Wavelength (m)
D = 0.45;               % Spacing between nodes (m)
FS = 12*F;              % Sampling frequency (Hz)
wave_number = (2*pi)/L;
look = 0;
numsamp = 10000;
alpha = 0.9;
epsilon = .7;
nangle = 181;           % Number of angles steered (must be odd; value assumed)
replicas = zeros(1,M);
[replicas, degs] = cbf_vt(M, D, C, F, nangle);
%replicas = replicas./M*2;
%normal = conj(replicas).*replicas;
R = zeros(M, M);
Rav = zeros(M, M);
w_good_mvdr = zeros(M, nangle);
power_good_mvdr = zeros(1, nangle);
power_cbf = zeros(1, nangle);
for i = 1:numsamp,
    x = cbf_in(M, D, C, F, in_angle, 25);
    % R = x*x' + epsilon*eye(M);
    R = R + x*x';
    % Rav = (alpha)*Rav + (1-alpha)*R;
end;
R = R/numsamp;
invR = inv(R);
for k = 1:nangle,
    power_good_mvdr(k) = 1/(replicas(:,k)'*invR*replicas(:,k));
    w_good_mvdr(:,k) = invR*replicas(:,k)/(replicas(:,k)'*invR*replicas(:,k));
    power_cbf(k) = replicas(:,k)'*R*replicas(:,k);
    %ws = sum(conj(w_good_mvdr(:,k)).*replicas(:,k))
end;
figure(1)
plot(degs, 10*log10(abs(power_good_mvdr)));
axis([min(degs) max(degs) min(10*log10(real(power_good_mvdr))) ...
    max(10*log10(real(power_good_mvdr)))]);
%plot(degs, real(power_good_mvdr));
grid on;
ylabel('Power (dB)');
xlabel('look direction (degrees)');
title('MVDR Power Plot');
figure(4)
plot(degs, 10*log10(abs(power_cbf)));
axis([min(degs) max(degs) min(10*log10(real(power_cbf))) ...
    max(10*log10(real(power_cbf)))]);
%plot(degs, real(power_cbf));
grid on;
ylabel('Power (dB)');
xlabel('look direction (degrees)');
title('CBF Power Plot');
figure(2)
[res2, rads1] = beam(w_good_mvdr, look, degs);
res2 = res2./max(res2);
%plot(degs, 10*log10(res2));
plot(degs, res2);
axis([min(degs) max(degs) min(res2) max(res2)]);
grid on
xlabel('Bearing (degrees)');
ylabel('Beam power');
title('Beampattern for MVDR');
area_res2 = sum(res2)
%res2
figure(3)
[rescbf, radscbf] = beam(replicas, look, degs);
%plot(degs, 10*log10(rescbf));
plot(degs, rescbf);
axis([min(degs) max(degs) min(rescbf) max(rescbf)]);
grid on
xlabel('Bearing (degrees)');
ylabel('Beam power');
title('Beampattern for CBF');
figure(6)
polar(radscbf, rescbf);
title('Polar plot of CBF beampattern');
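The k-loop above computes, for each replica v = replicas(:,k), the classical conventional and MVDR quantities: the CBF power v'*R*v, the MVDR power 1/(v'*inv(R)*v), and the MVDR weight vector inv(R)*v/(v'*inv(R)*v). A representative run from the MATLAB prompt (source angles illustrative only) is:

abf(10, [0 45 60]);  % 10-node array, simulated sources at 0, 45, and 60 degrees

This produces the MVDR and CBF power plots (figures 1 and 4), the corresponding beampatterns (figures 2 and 3), and a polar plot of the CBF beampattern (figure 6); peaks in the power plots at the simulated source bearings provide the visual check of the beamformer outputs.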