© 2003, HCS Research Lab, University of Florida. All Rights Reserved.
PARALLEL ALGORITHMS FOR ADAPTIVE MATCHED-FIELD PROCESSING
ON DISTRIBUTED ARRAY SYSTEMS
KILSEOK CHO, ALAN D. GEORGE, AND RAJ SUBRAMANIYAN
High-performance Computing and Simulation (HCS) Research Laboratory
Department of Electrical and Computer Engineering, University of Florida
P.O. Box 116200, Gainesville, FL 32611-6200
KEONWOOK KIM
High-performance Computing and Simulation (HCS) Research Laboratory
Department of Electrical and Computer Engineering, Florida A&M University and Florida State University
2525 Pottsdamer Street, Tallahassee, FL 32310-6046
Received (to be inserted by Publisher)
Revised (to be inserted by Publisher)
Matched-field processing (MFP) localizes sources more accurately than plane-wave beamforming by employing full-wave
acoustic propagation models for the cluttered ocean environment. The minimum variance distortionless response MFP
(MVDR-MFP) algorithm incorporates the MVDR technique into the MFP algorithm to enhance beamforming
performance. Such an adaptive MFP algorithm involves intensive computational and memory requirements due to its
complex acoustic model and environmental adaptation. The real-time implementation of adaptive MFP algorithms for
large surveillance areas presents a serious computational challenge, where high-performance embedded computing and
parallel processing may be required to meet real-time constraints. In this paper, three parallel algorithms based on domain
decomposition techniques are presented for the MVDR-MFP algorithm on distributed array systems. Their parallel
performance in terms of execution times, communication times, parallel efficiencies, and memory capacities is
examined on three potential distributed systems, including two types of digital signal processor arrays and a cluster of
personal computers. The performance results demonstrate that these parallel algorithms provide a feasible solution for
real-time, scalable, and cost-effective adaptive beamforming on embedded, distributed array systems.
1. Introduction
In underwater acoustic signal processing, classical beamforming approaches based upon plane-wave
signals in a homogeneous signal field are not well suited to localizing a source in typical ocean environments.
The simple plane-wave model is a poor approximation of the signal received by a large array of sensors
from a distant source, due to refraction and multipath effects in the ocean. Over the last 20 years, full-wave
beamforming based on more accurate models of the acoustic propagation structure has been extensively
researched to overcome the limited performance of plane-wave beamforming. In particular, matched-field
processing (MFP), originally proposed by Bucker1 and first implemented by Fizell,2 combines acoustic
propagation modeling of the ocean waveguide with signal processing algorithms. The MFP algorithm
locates acoustic sources more precisely than plane-wave beamforming methods by using a full-wave
acoustic propagation model instead of a simple plane-wave model for the ocean.
The performance of a beamforming algorithm is largely dependent on the accuracy of its acoustic
propagation model. The MFP algorithm requires intensive computational and memory capacities to
implement an accurate acoustic propagation model of the ocean, and hence suffers from the complexity of
that model. The MFP algorithm is also sensitive to environmental mismatches such as
incorrect models for the ocean waveguide and acoustic sources. Recent research on MFP algorithms has
concentrated on exploiting robust adaptive algorithms to overcome environmental mismatches and on
constructing fast processing techniques to meet the substantial computational requirements of
real-world applications.
A considerable amount of work has been done to improve the performance of MFP algorithms by
employing more accurate acoustic propagation models and more adaptive algorithms in order to overcome
mismatched environmental and signal characteristics. The main approaches that have been studied to
enhance the performance and robustness of MFP algorithms are reviewed by Vaccaro.3 Specifically, the
minimum variance distortionless response (MVDR) algorithm4 has several features suited to the MFP
algorithm, such as good sidelobe suppression and modest tolerance to mismatch, which other adaptive
methods lack.5 Adaptive MFP algorithms, which combine the MVDR technique with the MFP
algorithm, have been used extensively to improve the performance and robustness of beamforming by
employing environmental perturbation constraints and dimension reduction on simulated and field data.6-10
The MVDR-MFP algorithm is used as our baseline for adaptive MFP algorithms.
Computationally, adaptive MFP algorithms are highly complex because they must compute the replica
field from a complex acoustic propagation model over dense grid points and correlate the computed field
with the observed field adaptively at each grid point. Despite the ever-increasing performance levels of
single-processor computers, the real-time implementation of adaptive MFP algorithms for many
hydrophones and broadband frequencies still presents a serious computational challenge. To keep pace with
these ever-increasing computational demands, recent research has emphasized two approaches. One approach
is to develop fast and efficient algorithms by reducing complexity and exploiting fast mathematical
techniques. The other is to develop parallel processing techniques for distributed sensor array
systems.
As for the first approach, Ozard et al.11 presented the nearest-neighbors technique for selecting replicas
to reduce the computational search within large ocean areas. By applying this technique to the Bartlett
beamformer, a speed improvement was obtained at the expense of some detection performance. Cox et al.12
described a variant of beamspace processing that reduces dimensionality, and thereby computational
requirements, by using only the beams that contain significant signal energy. A fast MFP
algorithm based on the FFT technique developed by Aravindan et al.13 reduces the computational demands
at the cost of a degraded source peak in a shallow-water channel. However, as might be anticipated, the performance
of most of these algorithms is reduced in comparison to the original beamforming algorithms.
The other approach to reducing computational cost is to decompose the computational load and memory
space of beamforming algorithms over multiple processors in parallel computer systems. Several parallel
algorithms and systolic array architectures have been described for adaptive beamforming algorithms.14,15
Despite their improved throughput, these algorithms are closely coupled to systolic array
architectures implemented with VLSI techniques and cannot be efficiently applied on modern general-purpose
distributed systems due to high interprocessor communication requirements. Parallel beamforming
algorithms have also been studied for split-aperture conventional beamforming (SACBF),16 adaptive
minimum variance distortionless response beamforming,17,18 and conventional matched-field processing
(CMFP)19 algorithms. In the aforementioned studies, different parallel algorithms based on control and
domain decomposition for beamforming are introduced, and the scalability of parallel performance on
distributed systems is examined.
The significant computational and memory requirements of the MVDR-MFP algorithm make
imperative the development and use of parallel processing techniques on real-time sonar array systems. In
this paper, three parallel algorithms suitable for the MVDR-MFP algorithm are developed to meet these
computational challenges. The parallel algorithms have been developed with a view to providing a feasible
solution for real-time and cost-effective implementation on embedded, distributed DSP systems. The first
is the frequency decomposition (FD) algorithm, which exploits data parallelism between the processing tasks of
different frequency bins. The FD algorithm statically allocates the processing of different subsets of
frequency bins to different processors by using the concurrency between beamforming tasks of different
frequency bins. The second algorithm is section decomposition (SD), which is based on data parallelism
in the processing of grid points. The SD algorithm distributes the processing of different subsets of grid
points to different processing nodes by using output domain decomposition. The third decomposition
algorithm is a variant of the iteration decomposition method known as the iteration-frequency
decomposition (IFD) scheme. The IFD algorithm distributes the processing of successive iterations to
different processors by round-robin scheduling of beamforming iterations. Within a given iteration, the
processing of different sets of frequency bins is partitioned across multiple stages of execution to balance
the load. The three parallel algorithms are statically scheduled to reduce the interprocessor communication,
synchronization, and other parallel management overheads that would otherwise deteriorate the performance of
parallel algorithms on distributed systems.
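The round-robin scheduling at the heart of the IFD scheme can be sketched as follows. This fragment is an illustration only, not the authors' implementation: it shows just the static mapping of successive beamforming iterations to nodes, and omits the within-iteration frequency staging.

```python
def ifd_schedule(n_iterations, n_nodes):
    """Iteration-frequency decomposition sketch: assign successive
    beamforming iterations to nodes round-robin. Each iteration's
    frequency bins would additionally be staged across the pipeline
    to balance load (that staging is omitted here)."""
    return [i % n_nodes for i in range(n_iterations)]

# Seven iterations over a 3-node array cycle through nodes 0, 1, 2.
schedule = ifd_schedule(7, 3)
```

Static schedules such as this one avoid run-time task negotiation between nodes, which is one reason the three algorithms keep parallel management overhead low on distributed systems.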
This paper focuses on examining and analyzing performance tradeoffs of the parallel algorithms on a
cluster of personal computers (PCs) and two arrays of digital signal processors (DSPs). The testbeds are
used to investigate the parallel performance on three different types of processors, communication
networks, and system architectures. The performance characteristics are analyzed in terms of execution
times, communication times, memory requirements, and parallel efficiencies of the sequential and parallel
algorithms.
Section 2 presents background on the MVDR-MFP algorithm and Section 3 describes the
computational tasks of the sequential algorithm and the beamforming outputs of CMFP and MVDR-MFP. In
Section 4, the configurations and features of the three parallel algorithms for MVDR-MFP are explained,
and the communication characteristics of the parallel algorithms are also examined. Section 5 describes
three experimental testbeds and their software environments. Section 6 experimentally analyzes the
performance results of the parallel algorithms. Finally, Section 7 presents conclusions and directions for
future research.
2. Overview of the MVDR-MFP Algorithm
The MFP algorithm localizes a source in the ocean by matching the field measured by a sensor array with
the replica field derived from a full-wave acoustic propagation model over the range of all expected source
positions. The MVDR algorithm, originally introduced by Capon,4 is an adaptive array processing technique
that adjusts the linear weighting of sensors to minimize the variance of the output while maintaining a unity-gain
constraint for the direction of interest. The MVDR-MFP algorithm incorporates the MVDR approach into
the MFP algorithm to enhance beamforming performance. Thus, MVDR-MFP is a high-resolution
adaptive MFP algorithm that adaptively constructs an optimum weighting vector as in MVDR and uses a
more accurate acoustic propagation model of the ocean waveguide as in MFP.
The replica field of the MVDR-MFP algorithm is generated from an acoustic propagation model based
on refractive and multipath effects of the ocean waveguide. For precise source localization, the acoustic
propagation model must be accurate; otherwise, the performance of the MVDR-MFP algorithm will be
degraded by the inaccuracy of the modeling assumptions. The ocean is modeled as a waveguide and
the source signal is assumed to be the point source solution to the wave equation. The wave equation for a
source is given by
∇²p(r,z) + (ω²/c²(r,z)) p(r,z) = −δ(r − r_s)δ(z − z_s)/(2πr)    (2.1)

where ∇² is the Laplacian operator, p(r,z) and c(r,z) represent the pressure and sound speed, respectively, at
range r and depth z, ω is the angular frequency of the source located at (r_s, z_s), and δ(x − x_0) is the delta
function whose amplitude is unity at x_0.20 The pressure field at position (r,z) from an acoustic source at
position (r_s, z_s) is obtained by solving Eq. (2.1). Numerous acoustic models have been proposed to solve the
wave equation. The complexity and solution of the wave equation depend on the model applied. In this
paper, the normal mode model is employed as the acoustic propagation model of the ocean because it
provides an accurate and computationally efficient propagation model for MFP applications. The normal
mode method expresses the acoustic pressure field in terms of a normal mode expansion and then solves for
the eigenvalues and eigenfunctions of the wave equation. The acoustic pressure field at position (r,z) is the
weighted sum of the contribution from each mode and can be calculated from
p(r,z) ≈ (i / (ρ(z_s)√(8πr))) e^(−iπ/4) Σ_m Ψ_m(z_s) Ψ_m(z) e^(i k_m r) / √(k_m)    (2.2)

where eigenfunction Ψ_m(z) and eigenvalue k_m² are the mode shape function and horizontal wave number for
mode m, respectively.21 The KRAKEN normal mode program developed by Porter22,23 is employed herein
to predict acoustic propagation in the ocean. The KRAKEN program, which is one of the most widely used
models, efficiently and accurately calculates horizontal wave numbers and mode shape functions for a
given acoustic ocean environment.
The MVDR processor is categorized as an adaptive beamformer with linear constraints as given by
w^H p = g    (2.3)
where w is the weight vector of the adaptive beamformer, p is the replica vector, and H denotes complex
conjugate transposition. Imposing this constraint ensures that signals from the steering position of
interest are passed with gain g (which equals unity in the case of MVDR), while the output power
contributed by interference signals from other positions is minimized under a minimum mean-square
criterion. Therefore, assuming that p is normalized, we obtain the weight vector by solving
minimize S = w_mvdr(r,z)^H R w_mvdr(r,z)    subject to    w_mvdr(r,z)^H p̄(r,z) = 1    (2.4)
where w_mvdr is the weight vector of the MVDR-MFP beamformer, R represents the cross-spectral
matrix (CSM) expressed in Eq. (2.5), and p̄(r,z) is the normalized replica vector obtained from Eq. (2.6).
The CSM is the covariance of the frequency-domain input data of the array, as given by

R = E[x x^H]    (2.5)

where E[·] is the expectation operator and x is a frequency-domain input data vector. The replica vector is
normalized to have unit magnitude as follows:
p̄(r,z) = p(r,z) / ‖p(r,z)‖    (2.6)
Solving Eq. (2.4) by the method of Lagrange multipliers, the weight vector is derived as:
w_mvdr(r,z) = R⁻¹ p̄(r,z) / (p̄(r,z)^H R⁻¹ p̄(r,z))    (2.7)
Finally, by substituting the w_mvdr calculated from Eq. (2.7) into Eq. (2.4), the output power for a steering
position (r, z) is obtained as
S_mvdr(r,z) = 1 / (p̄(r,z)^H R⁻¹ p̄(r,z))    (2.8)
where S_mvdr(r,z) is the detection factor at range r and depth z that represents the likelihood of detection for
a given data set. The MVDR-MFP algorithm locates sources accurately by calculating the weighting
vectors adaptively from the sample data via Eq. (2.8).
3. Sequential MVDR-MFP Algorithm
The computational tasks of the MVDR-MFP algorithm include the fast Fourier transform (FFT), sequential
regression (SER) inversion, steering, and broadband averaging. The incoming time-domain data is
transformed into the frequency domain by the FFT stage, and suitable frequency bins are then selected from the
FFT output for broadband processing. The inverse CSM is updated directly from the FFT output via the
matrix inversion lemma in the SER inversion stage. The output power of each narrowband frequency is
calculated by steering the array over all possible grid points, and the average output power for the selected
frequency set is obtained in the broadband averaging stage. The computational block diagram is depicted
in Fig. 1, where x(k) represents the frequency-domain input data vector, R⁻¹ is the inverse CSM, S_mvdr
is the MVDR-MFP beamforming output for a narrowband frequency as given in Eq. (2.8), and S_avg denotes the
MVDR-MFP beamforming output for broadband frequencies obtained from Eq. (3.2). The steering stage is
performed as many times as the number of grid points, to steer the array toward all possible source locations
under consideration. The SER inversion and steering stages are repeated for each of the selected
frequency bins for broadband processing. Finally, the outermost loop, including all stages of the MVDR-MFP
algorithm, is performed once per iteration, obtaining one snapshot of input data from the array of
sensors.
The FFT stage transforms the input sample data received by the array of sensors from the time domain to
the frequency domain. The computational complexity of the FFT task implemented by the radix-2 butterfly
method is generally O(N log N) in terms of data length. However, the FFT stage in MVDR-MFP involves a
complexity of O(N) in terms of the number of sensor nodes, because the FFT process is simply replicated
for each sensor node and the FFT data length is fixed. Therefore, the computational complexity of the FFT
stage increases linearly as the number of sensor nodes is increased.
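The per-sensor FFT with frequency-bin selection can be sketched as follows. This is an illustrative fragment, not the paper's code: the sample rate and snapshot length are assumptions, chosen so that the bin spacing is 1 Hz and the band matches the 200-231 Hz set used in the Section 3 experiment.

```python
import numpy as np

def select_frequency_bins(time_samples, fs, f_lo, f_hi):
    """Transform one sensor's snapshot to the frequency domain and keep only
    the bins in [f_lo, f_hi] for broadband processing. Each node runs this on
    its own hydrophone only, so with a fixed FFT length the total FFT cost
    grows linearly with the number of sensor nodes."""
    X = np.fft.rfft(time_samples)                       # fixed-length FFT
    freqs = np.fft.rfftfreq(len(time_samples), d=1.0 / fs)
    keep = (freqs >= f_lo) & (freqs <= f_hi)
    return freqs[keep], X[keep]

fs = 1024.0                        # Hz, assumed sample rate
t = np.arange(1024) / fs           # one-second snapshot -> 1 Hz bin spacing
x = np.sin(2 * np.pi * 210.0 * t)  # tone inside the 200-231 Hz band
freqs, bins = select_frequency_bins(x, fs, 200.0, 231.0)
```

The 32 retained bins per sensor are exactly the narrowband inputs that the SER inversion and steering stages then process bin by bin.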
[Fig. 1 depicts the pipeline: input samples → FFT → selection of frequency bins → SER inversion and steering (using the replica vectors, repeated over all frequency bins and all grid points) → broadband averaging, executed once per iteration.]
Fig. 1. Computational block diagram of the sequential MVDR-MFP algorithm.
The SER inversion stage calculates the inverse CSM. The computational intensity of matrix
inversion increases with the size of the matrix, but numerous inversion techniques exist to reduce
computation time. For example, Gauss-Jordan elimination (JDE), LU decomposition, and sequential
regression (SER) methods are widely used in beamforming applications. The choice of a suitable inversion
technique for beamforming applications affects the execution time of beamforming algorithms as well as
determines the applicability of different parallelization methods. Sinha et al.18 implemented the above three
inversion techniques to study the tradeoffs in terms of computation time and memory requirements. They
found that the sequential regression method is the most computationally efficient but has a relatively
large memory requirement. The computational complexity of the SER method is O(N²), while the
complexities of both the JDE and LU methods are O(N³) with regard to the size of the matrix. The SER method is
used as the CSM inversion method here because it is one of the most computationally efficient methods and
does not need to calculate the CSM itself. In the SER technique, the inverse CSM is initialized with nonzero values
and updated during every subsequent iteration according to the following matrix inversion equation,
R̂⁻¹_(i+1) = (1/α) [ R̂⁻¹_i − ((1 − α) R̂⁻¹_i x_(i+1) x_(i+1)^H R̂⁻¹_i) / (α + (1 − α) x_(i+1)^H R̂⁻¹_i x_(i+1)) ]    (3.1)

where α is the forgetting factor that determines the weight of the previous inverse CSM estimate relative to
the new input data vector, R̂⁻¹_i is the inverse CSM estimate of the ith iteration, and x_(i+1) is the frequency-domain
input data vector of the (i+1)th iteration.
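One SER step of Eq. (3.1) can be sketched directly in code. The fragment below is an illustration under assumed values of α and the initial estimate, not the authors' implementation; it also checks the update against the forward recursion R_(i+1) = αR_i + (1−α)x x^H that the lemma inverts.

```python
import numpy as np

def ser_update(Rinv, x, alpha):
    """One SER step of Eq. (3.1): update the inverse CSM directly from a new
    frequency-domain snapshot x via the matrix inversion lemma, avoiding an
    O(N^3) re-inversion each iteration. alpha is the forgetting factor."""
    Rinv_x = Rinv @ x
    denom = alpha + (1.0 - alpha) * np.real(x.conj() @ Rinv_x)
    return (Rinv - (1.0 - alpha) * np.outer(Rinv_x, Rinv_x.conj()) / denom) / alpha

rng = np.random.default_rng(1)
N, alpha = 4, 0.9
R = np.eye(N, dtype=complex)          # non-singular initial estimate
Rinv = np.linalg.inv(R)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
Rinv = ser_update(Rinv, x, alpha)     # inverse-domain update, Eq. (3.1)
R = alpha * R + (1.0 - alpha) * np.outer(x, x.conj())  # forward recursion
```

Because the update touches only matrix-vector products and one rank-one correction, its cost is O(N²) per snapshot, consistent with the complexity comparison above.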
The steering stage is responsible for steering the array and calculating the output power for every
steering position. The steering stage calculates the output power from the inverse CSM estimate and the
replica vectors over the grid points where sources are likely to be present. The complexity of the steering
stage for the narrowband MVDR-MFP beamformer is O(RDN²), because the steering loop has quadratic
computational complexity in terms of the number of nodes N and is executed as many times as the number
of range (R) and depth (D) grid points, as shown in Fig. 1. Along with the calculation of output power as given by
Eq. (2.8), the steering stage includes the replica vector generation task, which is invariant to the input data. Once
the replica vectors are calculated from environmental and system parameters, they do not vary until an
environmental change is encountered. The computational procedure of Eq. (2.2) for replica vector
generation is described in detail by Porter.22 The replica vector generation task has high-order complexity
due to computationally intensive procedures such as eigenvalue computation. However, as with the FFT
stage, the complexity of replica vector generation is linear with respect to the number of
nodes, since each sensor node must generate the replica vectors for the entire set of range and depth points
separately. Despite this low-order complexity in the number of nodes, the computational burden presented by the replica
vector generation task is very significant due to a substantial scalar factor. Among all the computational
tasks in this experiment, the steering stage is the most computationally intensive.
Broadband averaging is required to calculate the average beamformer output for multiple selected
frequency bins. The broadband MVDR-MFP beamformer output is given by

S_avg(r,z) = (1/B) Σ_{i=1}^{B} S_mvdr(f_i, r, z)    (3.2)

where B is the number of frequency bins selected from the FFT output, f_i represents the ith frequency bin,
and S_mvdr(f_i, r, z) is the narrowband output power of the ith frequency bin. The narrowband processing,
including the SER inversion and steering stages for each individual frequency bin, is performed as many
times as the number of selected frequency bins for broadband processing. Hence, as the number of
frequency bins is increased, the MVDR-MFP beamformer has to compute an increased number of
narrowband beamforming results. This broadband processing vastly increases the computation time of the
MVDR-MFP algorithm. By averaging the narrowband beamformer outputs over the selected frequency bins,
interference signals in the sidelobes are smoothed and signals near the main lobe are enhanced, because the
positions of sidelobes are generally frequency dependent whereas the location of the main lobe remains
constant. Thus, the detection probability of narrowband beamforming can be enhanced by broadband
averaging over multiple frequency bins.
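The averaging of Eq. (3.2), and the way frequency-dependent sidelobes wash out while the common main lobe survives, can be sketched with synthetic surfaces. The stand-in data below is purely illustrative; the bin count and grid sizes are taken from the experiment described later in this section.

```python
import numpy as np

def broadband_average(narrowband_surfaces):
    """Eq. (3.2): average the narrowband ambiguity surfaces S_mvdr(f_i, r, z)
    over the B selected frequency bins. narrowband_surfaces has shape
    (B, R, D) for R range points and D depth points."""
    return narrowband_surfaces.mean(axis=0)

rng = np.random.default_rng(2)
B, R, D = 32, 40, 80                  # bins and grid sizes used in this paper
surfaces = rng.random((B, R, D))      # stand-in frequency-dependent sidelobes
surfaces[:, 10, 25] = 5.0             # main-lobe position, fixed at every bin
avg = broadband_average(surfaces)     # sidelobes average toward 0.5; peak stays
```

After averaging, the random sidelobe clutter flattens toward its mean while the frequency-invariant main lobe remains the global peak, which is exactly the smoothing effect described above.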
To build a baseline for the three parallel MVDR-MFP algorithms, two implementation models are
considered: a minimum-memory (MM) model and a minimum-computation (MC) model. The MM model
requires less memory capacity but more computation time, calculating temporary values on the fly as
needed during execution. In contrast, the MC model requires more memory space but less
computation time, since those values are precalculated, stored in memory, and reused when needed.
The final beamformer output is commonly depicted as an ambiguity surface indicating the likelihood of
target detections. In the ambiguity surface, peak positions denote the locations that are most likely to be
targets. To compare the detection capabilities of the CMFP and MVDR-MFP algorithms, the ambiguity
surfaces for a source at 10 km in range and 50 m in depth are shown in Fig. 2a and Fig. 2b, respectively. In
this experiment, the pressure-field data was generated for a point source having 32 frequency bins at 1 Hz
intervals from 200 Hz to 231 Hz, received by a vertical array of 32 hydrophones spaced 4 m apart from
10 m to 70 m in depth. The noise component for each hydrophone has a Gaussian distribution with zero
mean and a signal-to-noise ratio (SNR) of 10 dB. The ambiguity surfaces are computed from 5 to 44 km in
range at 1 km intervals, and from 2 m to 160 m in depth at 2 m intervals. The two beamforming results
show that the MVDR-MFP beamformer localizes the source more precisely than CMFP and also
suppresses sidelobes adequately, since it calculates optimum weight vectors adaptively for every given
sample data set.
[Fig. 2 shows the two ambiguity surfaces plotted over range (5-40 km) and depth (20-160 m) with normalized power from 0 to 1: (a) the conventional MFP algorithm and (b) the minimum variance distortionless MFP algorithm.]
Fig. 2. Ambiguity surfaces of CMFP (a) and MVDR-MFP (b).
4. Parallel Algorithms for MVDR-MFP
As described in the previous section, the MVDR-MFP algorithm with a large number of sensor nodes, a
wide range of frequency bands, and dense grid points requires significant computational and memory
capacities. When beamforming algorithms are implemented on a real-time sonar system, these challenges
exceed the capabilities of single-processor computers, including conventional DSP processors. Parallel
processing is a solution for executing the MVDR-MFP algorithm in real time, achieving cost effectiveness,
and providing dependability on distributed array systems. The parallel algorithms are designed to execute
on distributed DSP and PC array systems, which consist of smart processing nodes connected through
communication networks. A smart node has its own processing power, a hydrophone to collect input
data, and a communication assist that generates outgoing messages and handles incoming messages.
Parallel algorithms need to distribute the workload evenly across processors and reduce the communication,
synchronization, and other parallelization overheads to improve performance. When a parallel algorithm
breaks sequential tasks into a collection of small tasks to be distributed over the available processors, finer-grained
strategies impose more parallelization overhead, synchronizing and communicating more frequently
between processors, but may achieve more balanced workload distribution than coarse-grained
decomposition methods. Additionally, limited concurrency is the most fundamental barrier to achieving
the necessary speedup through parallelism, per Amdahl's law.24 In this work, three parallel algorithms for
adaptive MFP are developed using domain decomposition methods to exploit more concurrency and lower
communication overhead in distributed array systems. The following sections describe the three parallel
algorithms: frequency decomposition, section decomposition, and iteration-frequency decomposition.
4.1. Frequency decomposition algorithm
The MVDR-MFP algorithm has concurrency between the beamforming operations of different frequency bins,
since the processing of one frequency bin can be executed in parallel with that of another. Frequency
decomposition (FD) statically allocates the processing of different subsets of frequency bins within a given
beamforming iteration to distinct processors. FD is an input domain decomposition technique that exploits
data parallelism between the processing tasks of different frequency bins. Thus, several processing nodes in a
distributed array system perform the same beamforming operation simultaneously on different data sets for
their assigned frequency bins, as in a single-instruction multiple-data (SIMD) machine. The FD method can
be easily implemented for the MVDR-MFP algorithm without large parallelization overhead.
In a distributed array system, the FFT operation is inherently distributed, since each
node performs the FFT operation only on data from the sensor attached to its own node. Thus, the FFT
output data from each node must be transmitted to the other processors for the subsequent beamforming stages.
Each node is responsible only for the partial beamforming output of a subset of narrowband
frequency bins, and therefore each node must share its beamforming output for all grid points to calculate
the average of the total beamforming output over all broadband frequency bins. FD requires two all-to-all
communications per iteration: one all-to-all communication to distribute the FFT data and
the other to collect the narrowband beamforming output. This communication requirement results in a
substantial communication load on a distributed array system. To lessen the communication load, a data
packing method is employed to combine the partial beamforming output of the previous iteration and the
FFT data of the current iteration into one data packet. As a result, the combined data packet is transmitted
once per iteration, after the FFT operation of the current iteration. This data packing decreases the
communication cost by eliminating the overhead of initiating communications twice per iteration.
However, the FD algorithm still requires a significantly large communication message due to the partial
beamforming results. The partial beamforming outputs for all range and depth grid points on the ambiguity
surface impose a substantially large communication message on each node, because the number of grid
points, 3200 in the case of 40 range and 80 depth points, is made large for better beamforming
resolution.
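The static bin assignment and the data packing described above can be sketched as follows. This fragment is an illustration only, not the paper's implementation; the packed message is modeled simply as a concatenated array.

```python
import numpy as np

def fd_bin_assignment(n_bins, n_nodes):
    """Frequency decomposition: statically split the selected frequency bins
    into near-equal contiguous subsets, one subset per node."""
    return np.array_split(np.arange(n_bins), n_nodes)

def pack_message(fft_bins, partial_output):
    """Data packing used to cut communication startup cost: concatenate this
    iteration's FFT bins with the previous iteration's partial beamforming
    output, so only one all-to-all exchange is needed per iteration."""
    return np.concatenate([fft_bins.ravel(), partial_output.ravel()])

subsets = fd_bin_assignment(6, 3)     # the 6-bin, 3-node case of Fig. 3
msg = pack_message(np.zeros(6, dtype=complex),
                   np.zeros(40 * 80, dtype=complex))  # 3200-point surface
```

Note that the 3200-element partial surface dominates the packed message, and its size does not shrink as nodes are added, which is the fixed-message-size property of FD discussed below.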
As shown in Fig. 3, each node performs an FFT operation on the input data received from its own
sensor for the current iteration. Each node packs the FFT output data from the current
iteration together with the partial beamforming output from the previous iteration, and then executes an all-to-all
communication for this data packet. Upon reception of the data packet, each node separates the FFT data
and the partial beamforming output. Each node obtains the final beamforming output
of the previous iteration by averaging the beamforming output over the broadband frequencies. In the 3-node
array configuration, the FFT data from the current iteration is decomposed across the three nodes: the first
subset of frequency bins is assigned to node 0, the second subset to node 1, and so on.
Each node performs the SER inversion and steers the sensor array to obtain a narrowband beamforming output
over all grid points. The narrowband beamforming output is added to the running sum to form the partial
beamforming output for the frequency bins assigned to that node. These beamforming procedures are
repeated for the next iteration. As the number of nodes is increased, the message size of each
communication per node remains fixed, since the partial beamforming outputs over all the grid points and all the
FFT output data for the selected frequency bins are broadcast to the other nodes in the FD algorithm. In the
block diagram, result latency is defined as the delay between the collection of input data and the
computation of the final output. The result latency of the FD algorithm can be slightly increased by the
data packing: each node has to wait for the FFT output data of the current iteration to be calculated
before it can be packed with the partial beamforming output from the previous iteration.
[Fig. 3 shows, for each of nodes 0-2, the per-iteration pipeline: FFT, all-to-all exchange of the packed data, SER inversion and steering for that node's one-third subset of the six frequency bins, and broadband averaging; the result latency spans from data collection in iteration i to the averaged output in iteration i+1.]
Fig. 3. Block diagram for frequency decomposition in a 3-node array configuration with 6 frequency bins.
4.2. Section decomposition algorithm
The second parallel algorithm features section decomposition (SD), based on data parallelism in the
beamforming processing of grid points. The SD algorithm distributes the processing of different subsets
of grid points within a given beamforming iteration to different processing nodes in the output domain.
Unlike the FD and IFD parallel algorithms, the SD algorithm has a sequential dependency between the
SER inversion stage and the steering stage, because the full inverse CSM is required for the
calculation of the beamforming output at even a single grid point. Therefore, the SER inversion stage is
excluded from the parallel implementation in the SD algorithm. This sequential fraction is not expected to
cause a serious computational problem in SD, since the computational cost of the SER stage for a given
system and problem size is not significant compared with the steering stage. In the SD algorithm, each
node requires two all-to-all communications per iteration, as in FD. To lessen this communication load, the data
packing technique is used in the SD algorithm as well, resulting in one all-to-all communication per iteration. As
the number of nodes is increased, the message size of each communication in the SD algorithm
decreases, because the beamforming outputs for the fixed set of grid points are divided across multiple
nodes, whereas the message size in the FD algorithm is fixed, as described in the previous section.
A block diagram for the SD method is shown in Fig. 4. In the SD algorithm, each node performs the FFT task, the data packing operation, and one all-to-all communication between nodes as in the FD algorithm. Each node executes the SER inversion task to obtain the inverse CSM for the current iteration. After that, in the 3-node array configuration, the overall grid points of the current iteration are decomposed across the three nodes: the first subset of grid points is allocated to node 0, the second subset to node 1, and so on. For its assigned subset of grid points, each node performs the steering task to obtain partial beamforming outputs for all the selected frequency bins and then calculates the average of the beamforming outputs over the broadband frequencies. These beamforming processes are repeated for the next iteration. In the SD algorithm, the communication latency is expected to be lower than that of the FD algorithm because the message size of each communication decreases with increased system size.
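The SD partitioning of grid points across nodes can be sketched as below. This is a hypothetical helper, not the paper's code; it splits the R x D grid points of one iteration into near-equal contiguous subsets, one per node.

```python
# Sketch of the SD grid-point partition (hypothetical helper):
# the R*D grid points of one iteration are split into near-equal
# contiguous subsets; each node steers only its own subset.

def grid_subset(node, n_nodes, n_grid):
    """Return the [start, stop) grid-point indices owned by `node`."""
    base, extra = divmod(n_grid, n_nodes)
    start = node * base + min(node, extra)
    stop = start + base + (1 if node < extra else 0)
    return start, stop

n_grid = 40 * 80          # range x depth grid points from the paper
subsets = [grid_subset(k, 3, n_grid) for k in range(3)]
# The subsets tile the grid with no gaps or overlap:
assert subsets[0][0] == 0 and subsets[-1][1] == n_grid
assert all(subsets[i][1] == subsets[i + 1][0] for i in range(2))
```

Because each node returns only its own subset's beamforming outputs, the per-node message shrinks roughly as 1/N, which is the property the text uses to argue SD's communication advantage over FD.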
Fig. 4. Block diagram for section decomposition in a 3-node array configuration.
4.3. Iteration-frequency decomposition algorithm
The third decomposition technique is a hybrid of frequency and iteration decomposition known as iteration-frequency decomposition (IFD). The IFD algorithm exploits parallelism across the beamforming tasks of successive iterations and frequency bins. IFD distributes the processing of successive iterations to distinct processors by round-robin scheduling of beamforming iterations. Within the iteration assigned to each node, the processing of different sets of frequency bins is also partitioned across multiple stages of execution for load balance. While one node performs the beamforming task of its assigned frequency bins in each execution stage, all the other nodes simultaneously work on their own beamforming iterations. At the beginning of an iteration, each node computes the FFT of the data collected from its own sensor and then sends the FFT results to a single node before the beamforming operation of the current iteration begins. The IFD algorithm needs only one all-to-one communication per iteration and has the smallest message size, since each node transmits only one FFT data set to the one node responsible for processing the current iteration.
As depicted in Fig. 5, each node performs an FFT operation on the input data in each stage and sends the FFT result to the node that is to perform the current iteration under round-robin scheduling. That node then partitions the beamforming tasks of the different frequency bins into stages (i.e., three in this case) within the assigned iteration: the first subset of frequency bins is processed in the first stage, the second subset in the second stage, and so on. Within an assigned iteration, each node executes the SER inversion and steering tasks for the frequency bins assigned to a stage, and obtains the final beamforming output by executing broadband averaging in the final stage. These beamforming operations are repeated for the next iteration. As shown in Fig. 5, each node performs all the beamforming tasks for its assigned iteration in the same way that a single node executes all beamforming tasks for every iteration in the sequential algorithm. The IFD algorithm includes some communication and stage overhead; as a result, its result latency is slightly higher than that of the sequential algorithm.
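The IFD schedule described above can be sketched as follows. These helpers are hypothetical (the paper gives no code); iteration i is processed entirely on node i mod N, and the frequency bins within that iteration are split evenly across the pipeline stages.

```python
# Sketch of the IFD schedule (hypothetical helpers, not the paper's code).

def owner_of_iteration(i, n_nodes):
    """Round-robin assignment of whole iterations to nodes."""
    return i % n_nodes

def freq_subset(stage, n_stages, n_bins):
    """Contiguous frequency-bin indices handled in a given stage."""
    per_stage = n_bins // n_stages
    return list(range(stage * per_stage, (stage + 1) * per_stage))

# Three nodes rotate over successive iterations:
assert [owner_of_iteration(i, 3) for i in range(6)] == [0, 1, 2, 0, 1, 2]
# Six bins split over three stages:
assert freq_subset(0, 3, 6) == [0, 1]
assert freq_subset(2, 3, 6) == [4, 5]
```

Under this schedule every node eventually runs the full sequential pipeline for its own iterations, which is why IFD's result latency approaches that of the sequential algorithm while its throughput scales with N.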
Fig. 5. Block diagram for iteration-frequency decomposition in a 3-node array configuration.
4.4. Computational complexity and communication
In this section, we analyze the computational complexities, communication patterns, and message sizes of the parallel algorithms presented in the previous sections. The computational complexities of the sequential and the three parallel algorithms are compared in Table 1, where N is the number of nodes, B is the number of frequency bins, R is the number of range grid points, and D is the number of depth grid points. In the broadband averaging implemented for this research, a running sum of the beamforming outputs over the different frequency bins is updated as soon as each narrowband beamformer output is computed in the steering stage; the final sum for every grid point is then simply divided by the number of frequency bins B in the broadband averaging stage. Broadband averaging is thus partially calculated in the steering stage to reduce the necessary memory, and its execution time is small relative to the other stages. Accordingly, the complexity and execution time of broadband averaging are included in those of the steering stage.
As described in Section 3, in the sequential algorithm with a fixed number of data points per sensor, the FFT stage for all sensor nodes requires a computational complexity of O(N). The inverse CSM using the SER algorithm involves a complexity of O(BN^2), and the steering stage has a complexity of O(BRDN^2). The total computational complexity of each algorithm is the sum of the computational complexities of all stages and is denoted by the dominant highest-order term.
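The dominant terms can be captured in a small operation-count model; the constants are illustrative only, and the entries follow the complexities stated above.

```python
# Rough operation-count model of the dominant complexity terms
# (illustrative only): N nodes, B bins, R x D grid points.

def total_ops(algorithm, N, B, R, D):
    """Dominant operation count for each algorithm variant."""
    if algorithm == "sequential":
        return B * R * D * N**2
    if algorithm in ("FD", "IFD"):
        return B * R * D * N
    if algorithm == "SD":
        return B * N**2 + B * R * D * N   # SER stage stays sequential
    raise ValueError(algorithm)

N, B, R, D = 8, 32, 40, 80
# The fully parallelized variants are N times cheaper than sequential:
assert total_ops("sequential", N, B, R, D) // total_ops("FD", N, B, R, D) == N
```

For realistic problem sizes the B*N^2 term in SD is dwarfed by B*R*D*N, which is why the text expects the non-parallelized SER stage not to be a serious bottleneck.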
Table 1. Computational complexities of the sequential and parallel algorithms.

                            Sequential    FD        SD                IFD
FFT                         O(N)          O(1)      O(1)              O(1)
SER                         O(BN^2)       O(BN)     O(BN^2)           O(BN)
Steering & broadband avg.   O(BRDN^2)     O(BRDN)   O(BRDN)           O(BRDN)
Total                       O(BRDN^2)     O(BRDN)   O(BN^2 + BRDN)    O(BRDN)
The computational complexities of all stages of the three parallel algorithms, except the SER stage in the SD algorithm, are reduced by a factor of N relative to the sequential algorithm by distributing the computational load across N processors. In the SD algorithm, the computational complexity of the SER stage is not decreased since this stage is not parallelized due to the dependency between its computational tasks. The execution times of the parallel algorithms might ideally be expected to be N times faster than the sequential algorithm; however, parallel performance is anticipated to fall short of this ideal because of sources of overhead such as interprocessor communication, synchronization, and unbalanced workload distribution.
The communication patterns and message sizes of the three parallel algorithms are shown in Table 2. Here the number of frequency bins used is 32, the numbers of range and depth grid points are 40 and 80, respectively, and the number of sensor nodes is N. The communication scheme for both the FD and SD algorithms is an all-to-all communication that requires N(N-1) send/receive unicast communications per iteration, whereas the IFD algorithm needs an all-to-one communication requiring only N-1 send/receive unicast communications per iteration. As the number of sensor nodes is increased, the message size of each communication for the FD and IFD algorithms is fixed for the above problem size, but that of the SD algorithm decreases, as shown in Table 2. From these communication characteristics, it is expected that FD will impose the highest communication time and IFD the lowest.
Table 2. Communication patterns and message sizes for three parallel algorithms.

                        FD                         SD                          IFD
Pattern                 All-to-all                 All-to-all                  All-to-one
                        communication              communication               communication
Communications          N(N-1) send/receive        N(N-1) send/receive         N-1 send/receive
per iteration
Message size per        (# freq bins x 2           (# freq bins x 8) +         # freq bins x 8
communication           + # grid points) x         (# grid points x 4          = 256 bytes
                        4 bytes per float          / # nodes) = 256 +
                        = 13056 bytes              (12800 / N) bytes
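The message sizes in Table 2 can be reproduced from the stated problem parameters. This sketch assumes 8-byte complex FFT samples and an even split of the grid outputs across nodes for SD; it is not the paper's code.

```python
# Sketch reproducing the Table 2 message sizes (assumes 8-byte complex
# FFT samples and an even grid split for SD; not the paper's code).

def msg_size(algorithm, n_nodes, bins=32, grid=40 * 80):
    """Bytes per communication for each parallel algorithm."""
    if algorithm == "FD":    # complex FFT bins packed with all grid outputs
        return (bins * 2 + grid) * 4
    if algorithm == "SD":    # complex FFT bins plus this node's grid share
        return bins * 8 + grid * 4 // n_nodes
    if algorithm == "IFD":   # a single node's complex FFT bins only
        return bins * 8
    raise ValueError(algorithm)

assert msg_size("FD", 4) == 13056     # fixed regardless of system size
assert msg_size("IFD", 4) == 256      # fixed and smallest
assert msg_size("SD", 8) < msg_size("SD", 4)   # shrinks as nodes are added
```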
5. Experimental testbeds
The performance of the parallel algorithms is analyzed on three distributed platforms: two types of DSP arrays and a PC cluster. The testbeds have similar configurations, with multiple processing units connected by loosely coupled communication links, but they contain distinct types of processing units and communication mechanisms. The two DSP arrays use two different types of DSP devices from Analog Devices, while the PC cluster machines have a general-purpose processor as their CPU. All the testbeds use the Message Passing Interface (MPI)25 as the middleware for communication and synchronization between processors in the distributed environment. The hardware and software characteristics of the DSP arrays and the PC cluster are summarized in Table 3 and described in the following sections.
Table 3. Characteristics of the three experimental testbeds.

Testbed      Processor           Link speed   Memory               Topology        OS             MPI
PC cluster   400 MHz Celeron     100 Mbps     64 MB                Switched star   Linux 2.2.16   MPI/Pro 1.5
ADSP-21062   40 MHz ADSP-21062   320 Mbps     Internal: 256 KB     Ring            N/A            MPI-SHARC 21062
DSP array                                     External: 3 MB
ADSP-21160   80 MHz ADSP-21160   640 Mbps     Internal: 512 KB     Ring            N/A            MPI-SHARC 21160
DSP array                                     External: 512 KB
5.1. ADSP-21062 DSP array
The first testbed is composed of eight Bittware Blacktip-EX DSP boards26 connected to one another via link ports, as shown in Fig. 6a. Each board includes a single 40-MHz ADSP-21062 Super Harvard ARChitecture (SHARC) processor27 from Analog Devices, 3 MB of external memory with zero wait-state access time, and interfaces for external I/O devices. The ADSP-21062 SHARC processor provides 256 KB of internal memory, multiple link ports, and direct memory access (DMA) controllers. Each link port, consisting of four bidirectional data lines and two bidirectional handshaking lines, can operate at twice the clock rate of the processor, achieving a peak throughput of 320 Mbps. In this experiment, the link ports and DMA channels are employed to provide high-speed, low-overhead communication between DSP boards.
The Blacktip-EX DSP boards provide two link ports with external connectors to communicate with other boards. The link ports are dedicated to transmit and receive channels separately to eliminate the need for external routing or external hardware. This configuration allows the boards to be arranged in a unidirectional ring topology. Although a ring does not provide communication scalability compared to other topologies, its simple routing and low hardware complexity make it a natural choice for this system.
The MPI-SHARC network service was developed and optimized for this particular architecture by Kohout and George28 to provide MPI functionality for the ADSP-21062 DSP array. Although MPI-SHARC implements a subset of the full MPI specification, its functionality and syntax are identical to MPI on common distributed systems, and the functions most commonly used in parallel applications were included in the design. This network service provides highly efficient collective communications since certain message-passing functions are optimized for this particular DSP architecture and communication topology.
5.2. ADSP-21160 DSP array
The second testbed consists of four EZ-KIT Lite DSP boards29 connected to one another like the Blacktip-EX DSP boards, as shown in Fig. 6b. Each EZ-KIT board includes a single 80-MHz ADSP-21160 SHARC processor30 from Analog Devices, 512 KB of external memory with one wait-state access time, as well as external I/O devices. Each link port, which is composed of eight bidirectional data lines and two bidirectional handshaking lines, can operate at the same clock rate as the processor, thereby providing a data transfer rate of 640 Mbps, twice that of the ADSP-21062 processor. As on the Blacktip-EX DSP board, the link ports and DMA channels on the EZ-KIT Lite board are dedicated to providing efficient communication in the ring topology between DSP boards.
In this research, the MPI-SHARC network service for the ADSP-21062 processor has been extended to provide the MPI functions for the ADSP-21160 DSP array. The functionality and syntax of the MPI-SHARC for the ADSP-21160 are identical to those of the MPI-SHARC network service for the ADSP-21062. This network service provides highly efficient collective communications since certain message-passing functions are designed for optimized performance on the ADSP-21160 processor architecture.
Fig. 6. Experimental configurations of the ADSP-21062 DSP array (a) and the ADSP-21160 DSP array (b).
5.3. PC cluster
The third testbed is a Linux-based cluster of 32 PCs, where each node consists of a 400-MHz Intel Celeron processor with 64 MB of memory. The interconnection fabric between computers is 100-Mbps switched Fast Ethernet. The MPI/Pro V1.5 middleware from MPI Software Technology is used as the message-passing and synchronization layer.
6. Performance Analysis of the MVDR-MFP Algorithm
The performance of the parallel MVDR-MFP algorithms is experimentally analyzed on the testbeds described in the previous section. The sequential MVDR-MFP algorithm, used as the baseline for comparing parallel performance, is implemented with the MM and MC models. The system and problem parameters used in this experiment are 32 frequency bins, 40 grid points in range, 80 grid points in depth, and up to 32 sensor nodes. In this section, parallel performance is analyzed in terms of execution times, communication times, data memory requirements, scaled speedup, parallel efficiency, and result latency to demonstrate the effects of the parallel algorithms presented in the previous section. In these experiments, the average number of CPU clock cycles per iteration was measured over several hundred iterations to determine the execution and communication times of one beamforming iteration more accurately. The execution and communication times were then calculated by multiplying the average number of CPU clock cycles by the CPU clock period of the testbed in use.
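The cycle-counting procedure above amounts to a simple conversion, sketched below with a hypothetical helper; the sample values are illustrative only.

```python
# Sketch of the timing method described above (hypothetical helper):
# average cycle counts over many iterations, then convert to seconds
# with the testbed's clock period.

def mean_time_per_iteration(cycle_counts, clock_hz):
    """Average measured cycle counts and convert to seconds."""
    mean_cycles = sum(cycle_counts) / len(cycle_counts)
    return mean_cycles / clock_hz

# Three hypothetical samples on a 40-MHz ADSP-21062:
samples = [4_000_000, 4_100_000, 3_900_000]
t = mean_time_per_iteration(samples, 40e6)
assert abs(t - 0.1) < 1e-12   # 4e6 cycles / 40 MHz = 100 ms
```

Averaging over many iterations before converting suppresses per-iteration jitter, which is why the text measures several hundred iterations rather than one.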
6.1. Sequential execution time
The execution time of the sequential MVDR-MFP algorithm is measured on a single processing unit in each testbed. Figs. 7a and 7b illustrate the sequential execution times for the MC and MM models on the three testbeds, respectively. The execution time results are examined up to 32 nodes in the PC cluster, eight nodes in the ADSP-21062 DSP array, and four nodes in the ADSP-21160 DSP array because of their processing node limitations.
Fig. 7. Sequential execution time per iteration vs. system size for MC model (a) and MM model (b) on three testbeds.
The execution times of the sequential algorithm on the two DSP arrays are higher than those on the PC cluster because the clock rates of their processing units are much slower. The clock rate and architecture of the processors, as well as the software environments used on the three testbeds, all affect the sequential execution time. For instance, although the clock rate of the ADSP-21062 is half that of the ADSP-21160, the sequential execution time on the ADSP-21062 DSP board is not exactly twice that of the ADSP-21160 DSP board. One reason is that there is no clock-cycle delay in accessing the external memory on the ADSP-21062 board (zero wait states), whereas there is a one-clock-cycle delay on the ADSP-21160 board (one wait state).
The execution time results show that the steering stage is the most computationally dominant stage because its complexity is much higher than that of the other stages due to the substantial number of grid points. As the number of nodes is increased, the sequential execution time in the MC model increases more rapidly than in the MM model due to the complexity of the steering stage. The steering stage includes the replica vector generation task in the MM model and excludes it in the MC model. The replica vector generation task has linear complexity in the number of nodes but is the most computationally intensive task in the steering stage, owing to the substantial scalar operations over a large number of grid points, as described in Section 3. Thus, the steering stage exhibits quadratic complexity in the MC model but linear complexity in the MM model, since the replica vector generation task overshadows the other tasks in the steering stage. The sequential execution times in the MM model are much higher than those in the MC model due to the inherent computational characteristics of the MM model.
6.2. Parallel execution time
The execution times of the parallel algorithms on the two DSP arrays and the PC cluster are illustrated in Fig. 8. Since the complexity of each parallel algorithm is reduced by a factor of N relative to the sequential algorithm, as shown in Table 1, the parallel execution times for both the MM and MC models grow only slowly as the system size is increased. The increase in parallel execution time for the MM model is slower than that for the MC model because the linear growth in sequential execution time caused by the increased number of sensor nodes in the MM model is evenly distributed across multiple processors by the parallel algorithms. However, the parallel execution times for the MC model remain much lower than those for the MM model, as with the sequential execution times, since the MC model emphasizes efficient computation.
The parallel execution times for the ADSP-21062 DSP array, the ADSP-21160 DSP array, and the PC cluster are depicted in Figs. 8a, 8b, and 8c, respectively. The steering stage remains the most computationally dominant, as in the sequential execution time, even though its complexity is reduced in the parallel algorithms. The execution time results of each parallel algorithm are about the same as their counterparts on each DSP array since the communication and parallelization overheads created by each parallel algorithm are low for the given problem sizes on both DSP arrays. The execution times of the three parallel algorithms on the PC cluster follow the same trends as those on the two DSP arrays except for the communication times: the FD algorithm handles a larger message size than the SD and IFD algorithms, so its communication times increase rapidly with system size as a result of inefficient communication on the PC cluster.
Fig. 8. Parallel execution time per iteration vs. system size on ADSP-21062 array (a), ADSP-21160 array (b), and PC cluster (c).
6.3. Communication time
Parallel execution time consists of two portions: computation time and communication time. Communication time is one of the most important factors affecting the performance of parallel algorithms on distributed systems. Therefore, the communication times of the three parallel algorithms are compared on the two DSP arrays and the PC cluster.
As shown in Fig. 9, the communication times of the FD algorithm on all testbeds are higher than those of the other parallel algorithms because of its large message size and all-to-all communication pattern. By contrast, the IFD algorithm provides the most efficient communication due to its small message size and low communication frequency. On the two DSP arrays, the communication times of FD and IFD increase steadily with system size, but those of SD decrease because the effect of the shrinking message size outweighs the effect of the increasing communication frequency in the SD algorithm. If the message size exceeds the maximum payload size in the MPI-SHARC network service, the message is divided into multiple packets and then sent to the destination nodes. As a result, a larger message size incurs more communication time than more frequent communications do, owing to the overhead of the processor's internal memory movement during communication.28 The communication time on the ADSP-21160 DSP array is about half that on the ADSP-21062 DSP array since the ADSP-21160 processor provides a communication network that is twice as fast as the ADSP-21062 processor. As the number of nodes increases, the communication times of the three parallel algorithms on the PC cluster increase rapidly, unlike on the two DSP arrays. Since the MPI services on the two DSP arrays provide a form of hardware broadcast communication, the DSP arrays can handle the collective communication more efficiently than the PC cluster. The greater communication time of FD compared to the other parallel algorithms may degrade its parallel performance.
Fig. 9. Communication time per iteration vs. system size on ADSP-21062 array (a), ADSP-21160 array (b), and PC cluster (c).
6.4. Scaled speedup and parallel efficiency
Overall system performance is analyzed next in terms of scaled speedup and parallel efficiency as depicted
in Figs. 10 and 11. Scaled speedup is defined as the ratio of sequential execution time to parallel execution
time. This performance metric takes into account the fact that the problem size is also scaled upwards as
the number of processing nodes is increased since each node has its own sensor. Parallel efficiency is the
ratio of scaled speedup to ideal speedup, which is equal to the number of processors.
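The two metrics defined above can be expressed directly; the numeric values in the usage example are hypothetical, chosen only to illustrate the definitions.

```python
# Scaled speedup and parallel efficiency as defined in the text
# (illustrative helpers; the sample numbers are hypothetical).

def scaled_speedup(t_seq, t_par):
    """Sequential time over parallel time, both at the scaled problem size."""
    return t_seq / t_par

def parallel_efficiency(t_seq, t_par, n_nodes):
    """Scaled speedup relative to the ideal speedup of n_nodes."""
    return scaled_speedup(t_seq, t_par) / n_nodes

# Hypothetical numbers: a 7.2x speedup on 8 nodes gives 90% efficiency.
assert scaled_speedup(7.2, 1.0) == 7.2
assert abs(parallel_efficiency(7.2, 1.0, 8) - 0.9) < 1e-12
```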
Fig. 10 shows the scaled speedup of the parallel algorithms measured on the three testbeds. As the results indicate, all three algorithms exhibit promising speedup with increasing system and problem size. The relative speedup results of all the algorithms on both DSP arrays are better than their counterparts on the PC cluster due to smaller communication-to-computation ratios; the DSP arrays have more efficient communication channels and slower processors. The scaled speedup of the FD-MC algorithm on the PC cluster is lower than that of the other algorithms since FD-MC has more communication overhead and less computation time than the other parallel algorithms. The IFD algorithm shows marginally better speedup than the other algorithms on all three testbeds because it has lower communication overhead due to its smaller message size and all-to-one communication pattern.
Fig. 10. Scaled speedup vs. system size on ADSP-21062 array (a), ADSP-21160 array (b), and PC cluster (c).
The parallel efficiencies for the two DSP arrays and the PC cluster are shown in Figs. 11a, 11b, and 11c, respectively. The results on the two DSP arrays show a promising parallel efficiency of between 86% and 99% for the given system sizes. However, the results on the PC cluster exhibit a lower parallel efficiency than those on the two DSP arrays since the PC cluster has less efficient communication channels and faster processors. In particular, the parallel efficiency of FD-MC on the PC cluster decreases rapidly, from 91% for two nodes to 48% for 32 nodes, because its communication-to-computation ratio rapidly increases. The SD-MC algorithm on both DSP arrays shows a lower parallel efficiency than the other algorithms because the ratio of the execution time of the parallelization overheads, including the non-parallelized SER stage, to the total execution time is relatively high for small numbers of nodes. The parallel efficiency of the SD algorithm increases with system size since this overhead ratio and the communication time both decrease. The IFD algorithm achieves marginally better parallel efficiency than either FD or SD as a result of its lower communication overhead.
Fig. 11. Parallel efficiency vs. system size on ADSP-21062 array (a), ADSP-21160 array (b), and PC cluster (c).
6.5. Result latency
The result latency on the two DSP arrays and the PC cluster is depicted in Fig. 12. It demonstrates that the result latency of IFD is much higher than that of the other algorithms because each node in IFD is responsible for the entire processing of its assigned iteration, as in the sequential algorithm. This higher result latency is one of the drawbacks of the IFD algorithm. By contrast, the FD and SD algorithms reduce their result latency by distributing the computations of the beamforming tasks over multiple processing nodes. The MC model shows far smaller result latency than the MM model due to its efficient computation, as set forth in Section 3.
Fig. 12. Result latency vs. system size on ADSP-21062 array (a), ADSP-21160 array (b), and PC cluster (c).
6.6. Data memory requirement
Fig. 13 shows the data memory requirements of the sequential and three parallel algorithms with the MM and MC models. For the sequential algorithm, MM needs much less memory capacity than MC because of its memory efficiency. For all the parallel algorithms except IFD-MC, the memory requirements are considerably reduced, by distributing the necessary data across multiple processing nodes in the MC model and by using more computation time in the MM model. By contrast, IFD-MC requires the largest memory space since it must preserve a full set of replica vectors for its assigned iteration, as in the sequential algorithm. Hence all of these algorithms except IFD-MC may be used on systems with memory-constrained processing nodes.
Fig. 13. Data memory requirements per node for FD, SD, and IFD algorithms.
7. Conclusions
Three parallel algorithms are developed in this paper to meet the substantial computational and memory
requirements of the MVDRMFP algorithm on realtime sonar array systems. They are based on domain
decomposition methods and statically scheduled to reduce interprocessor communication and other
parallelization overheads that would otherwise deteriorate performance on distributed systems. This paper
focuses on performance analysis of the parallel algorithms as implemented on three potential platforms:
two types of DSP arrays and a PC cluster. These testbeds are used to analyze the comparative parallel
performance of three different types of processors, communication networks, and system architectures.
The performance of the parallel algorithms is examined in terms of computation and communication time,
scaled speedup, parallel efficiency, result latency, and memory requirement for the three different testbeds.
The three parallel algorithms exhibit promising parallel performance with increasing system and problem size. All the parallel algorithms, with the exception of the FD-MC algorithm on the PC cluster, achieve between 81% and 99% parallel efficiency up to a system size of 32 nodes on the PC cluster, eight nodes on the ADSP-21062 DSP array, and four nodes on the ADSP-21160 DSP array. In contrast, the parallel efficiency of FD-MC on the PC cluster decreases rapidly, from 91% for two nodes to 48% for 32 nodes, due to its substantial communication overhead. The relative speedup results of all the algorithms on both DSP arrays are better than their counterparts on the PC cluster because the DSP arrays have more efficient communication channels and slower processors. For all the parallel algorithms except the IFD-MC algorithm, memory requirements are considerably reduced by distributing the necessary data across multiple processing nodes in the MC model or by using more computation time in the MM model.
With the FD algorithm, promising parallel performance is obtained with increasing system size on both DSP arrays, but not on the PC cluster, due to the cluster's inefficient communication network. The FD algorithm has a relatively small memory requirement but a longer communication time than the other algorithms. Thus, the FD algorithm would be preferable for systems with a scalable communication network and a severe memory constraint, such as some embedded, distributed DSP systems. The SD algorithm delivers better parallel performance and more efficient communication as system size increases, while at the same time requiring less memory capacity. Hence, the SD algorithm would be a suitable choice for systems with a larger problem size and severe memory constraints. The IFD algorithm achieves marginally better parallel efficiency than either FD or SD as a result of its lower communication overhead. However, this improvement comes at the cost of higher memory requirements and result latency. The IFD algorithm would be efficient for systems with sufficient memory at each processing unit and a relatively slow communication network. Given the increasing demands for computationally sophisticated forms of processing and data models projected for future sonar systems, these parallel algorithms provide a feasible solution for real-time and cost-effective implementation of beamforming algorithms on embedded, distributed array systems.
Further research can proceed in several directions. The parallel techniques presented in this paper can
be extended to more advanced beamforming algorithms, such as adaptive MFP algorithms that are robust
to environmental mismatch. Further work on fault-tolerant algorithms and architectures for MFP
beamforming will be required to sustain operation in the presence of sensor-element or processing-element
failures and to satisfy the need for increasing dependability in distributed sonar array systems. Finally,
power and energy models need to be built to analyze the power efficiency of adaptive MFP algorithms on
distributed DSP systems, so as to extend the mission time of systems operating in severe environments.
Acknowledgements
We acknowledge and appreciate the support provided by the Office of Naval Research on Grant
N00014-99-1-0278. Special thanks go to Byung Il Koh, Jeong-Hae Han, and Seokjoo Lee in the HCS Lab at the
University of Florida for their assistance and useful suggestions. This work was also made possible by
donations of equipment from Analog Devices and of MPI/Pro software from MPI Software Technologies, Inc.
References
1. H. P. Bucker, "Use of calculated sound fields and matched-field detection to locate sound sources in shallow
water," J. Acoust. Soc. Amer. 59 (2), 368-373 (1976).
2. R. G. Fizell and S. C. Wales, "Source localization range and depth in an Arctic environment," J. Acoust. Soc.
Amer. 78 (5), (1985).
3. R. J. Vaccaro, "The past, present, and future of underwater acoustic signal processing," IEEE Signal Processing
Magazine, 21-51 (1998).
4. J. Capon, "High-resolution frequency-wavenumber spectrum analysis," Proc. of the IEEE 57 (8), 1408-1419
(1969).
5. A. B. Baggeroer, W. A. Kuperman, and P. N. Mikhalevsky, "An overview of matched field methods in ocean
acoustics," IEEE J. of Oceanic Eng. 18 (4), 401-423 (1993).
6. A. B. Baggeroer, W. A. Kuperman, and H. Schmidt, "Matched field processing: Source localization in
correlated noise as an optimum parameter estimation problem," J. Acoust. Soc. Am. 83 (2), 571-587 (1988).
7. J. M. Ozard and G. H. Brooke, "Improving performance for matched field processing with a minimum variance
beamformer," J. Acoust. Soc. Am. 91 (1), 141-150 (1992).
8. C. Feuillade, W. A. Kinney, and D. R. DelBalzo, "Shallow-water matched-field localization off Panama City, Florida,"
J. Acoust. Soc. Am. 88 (1), 423-433 (1990).
9. C. L. Byrne, R. T. Brent, C. Feuillade, and D. R. DelBalzo, "A stable data-adaptive method for matched-field array
processing in acoustic waveguides," J. Acoust. Soc. Am. 87 (6), 2493-2502 (1990).
10. J. L. Krolik and W. S. Hodgkiss, "Matched field localization in an uncertain environment using constraints based
on sound-speed perturbations," Proc. IEEE Oceans '91 Ocean Technologies and Opportunities in the Pacific for
the '90s, 771-778 (1991).
11. J. M. Ozard, M. J. Wilmut, D. G. Berryman, and P. Z. Zakarauskas, "Speedup performance of a nearest-
neighbors algorithm for matched-field processing with a vertical line array," Proc. Oceans '93 Engineering in
Harmony with Ocean, 86-90 (1993).
12. H. Cox, R. M. Zeskind, and M. Myers, "A subarray approach to matched-field processing," J. Acoust. Soc.
Amer. 87 (1), 168-178 (1990).
13. S. Abravindan, N. Ramachandran, and P. S. Naidu, "Fast matched field processing," IEEE J. of Oceanic Eng. 18
(1), 1-5 (1993).
14. F. Vanpoucke and M. Moonen, "Systolic robust adaptive beamforming with an adjustable constraint," IEEE
Trans. on Aerospace and Electronic Systems 31 (2), 658-669 (1995).
15. E. T. Tang, K. J. R. Liu, and S. A. Tretter, "Optimal weight extraction for adaptive beamforming using systolic
arrays," IEEE Trans. on Aerospace and Electronic Systems 30 (2), 367-384 (1994).
16. A. George and K. Kim, "Parallel algorithms for split-aperture conventional beamforming," J. of Computational
Acoustics 7 (4), 225-244 (1999).
17. A. George, J. Garcia, K. Kim, and P. Sinha, "Distributed parallel processing techniques for adaptive sonar
beamforming," J. of Computational Acoustics 10 (1), 1-23 (2002).
18. P. Sinha, A. George, and K. Kim, "Parallel algorithms for robust broadband MVDR beamforming," J. of
Computational Acoustics 10 (1), 69-96 (2002).
19. K. Kim, "Parallel algorithms for distributed in-situ array beamforming," Ph.D. dissertation, Univ. of Florida,
(2001).
20. United States, Office of Scientific Research and Development, National Defense Research Committee, "Physics of sound in
the sea," Dept. of the Navy, Headquarters Naval Material Command, Washington, 840 (1969).
21. M. Porter, "Acoustic models and sonar systems," IEEE J. of Oceanic Engineering 18 (4), 425-437 (1993).
22. M. Porter, "The KRAKEN normal mode program," SCALANT Memorandum, SM245, September (1991).
23. M. Porter, "KRAKEN normal mode program," ftp://oalib.saic.com/pub/oalib/demo/MatMod.ps, (1997).
24. D. E. Culler and J. P. Singh, "Parallel computer architecture: A hardware/software approach," Morgan Kaufmann
Publishers Inc., San Francisco, CA, (1998).
25. Message Passing Interface Forum, "MPI: a message-passing interface standard," Technical Report CS-94-230,
Computer Science Department, Univ. of Tennessee, April (1994).
26. Bittware Research Systems, "User's Guide: Blacktip-EX," Bittware Research Systems, Concord, NH, (1997).
27. Computer Products Division, "ADSP-2106x SHARC user's manual, 2nd edition," Analog Devices Inc., Norwood,
MA, (1997).
28. J. Kohout and A. George, "A high-performance network service for parallel computing on distributed DSP
systems," Parallel Computing, accepted and in press.
29. Digital Signal Processing Division, "User's guide: ADSP-21160 EZ-KIT Lite: Hardware Rev. 2.2," Analog
Devices Inc., Norwood, MA, (2000).
30. Digital Signal Processing Division, "ADSP-21160 SHARC DSP hardware reference, 1st edition," Analog Devices
Inc., Norwood, MA, (1999).
