
Reconfigurable Computing with RapidIO for Space-Based Radar

Permanent Link: http://ufdc.ufl.edu/UFE0020221/00001

Material Information

Title: Reconfigurable Computing with RapidIO for Space-Based Radar
Physical Description: 1 online resource (104 p.)
Language: english
Creator: Conger, Chrisley David
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2007

Subjects

Subjects / Keywords: radar, rapidio, reconfigurable, space
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, M.S.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Increasingly powerful radiation-hardened FPGAs, ASICs, and conventional processors along with high-performance embedded system interconnect technologies are helping to enable the on-board processing of real-time, high-resolution radar data on satellites and other space platforms. With current processing systems for satellites based on conventional processors and non-scalable bus interconnects, it is impossible to keep up with the high throughput demands of space-based radar applications. The large datasets and real-time nature of most space-based radar algorithms call for highly-efficient data and control networks, both locally within each processing node as well as across the system interconnect. The customized architecture of FPGAs and ASICs allows for unique features, enhancements, and communication options to support such applications. Using a Ground Moving Target Indicator application as a case study, this research investigates low-level architectural design considerations on a custom-built testbed of multiple FPGAs connected over RapidIO. This work presents experimentally gathered results to provide insight into the relationship between the strenuous application demands and the underlying system architecture, as well as the practicality of using reconfigurable components for these challenging high-performance embedded computing applications.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Chrisley David Conger.
Thesis: Thesis (M.S.)--University of Florida, 2007.
Local: Adviser: George, Alan D.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2007
System ID: UFE0020221:00001


RECONFIGURABLE COMPUTING WITH RapidIO FOR SPACE-BASED RADAR


By

CHRIS CONGER













A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2007

































2007 Chris Conger


































To my favorite undergraduate professor, Dr. Fred O. Simons, Jr.









ACKNOWLEDGMENTS

I would like to thank my major professor, Dr. Alan George, for his patience, guidance, and

unwavering support through my trials as a graduate student. I thank Honeywell for their

technical guidance and sponsorship, as well as Xilinx for their generous donation of hardware

and IP cores that made my research possible. Finally, I want to express my immeasurable

gratitude towards my parents for giving me the proper motivation, wisdom, and support I needed

to reach this milestone. I can only hope to repay all of the kindness and love I have been given

and which has enabled me to succeed.









TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

2 BACKGROUND AND RELATED RESEARCH

    Ground Moving Target Indicator
    Pulse Compression
    Doppler Processing
    Space-Time Adaptive Processing
    Constant False-Alarm Rate Detection
    Corner Turns
    Embedded System Architectures for Space
    Field-Programmable Gate Arrays
    RapidIO

3 ARCHITECTURE DESCRIPTIONS

    Testbed Architecture
    Node Architecture
    External Memory Controller
    Network Interface Controller
    On-Chip Memory Controller
    PowerPC and Software API
    Co-processor Engine Architectures
    Pulse Compression Engine
    Doppler Processing Engine
    Beamforming Engine
    CFAR Engine

4 ENVIRONMENT AND METHODS

    Experimental Environment
    Measurement Procedure
    Metrics and Parameter Definitions

5 RESULTS

    Experiment 1
    Experiment 2
    Experiment 3
    Experiment 4
    Experiment 5

6 CONCLUSIONS AND FUTURE WORK

    Summary and Conclusions
    Future Work

LIST OF REFERENCES

BIOGRAPHICAL SKETCH
















































LIST OF TABLES

3-1 Software API function descriptions
3-2 Co-processor wrapper signal definitions
4-1 Experimental parameter definitions









LIST OF FIGURES

2-1 Illustration of satellite platform location, orientation, and pulse shape
2-2 GMTI data-cube
2-3 GMTI processing flow diagram
2-4 Processing dimension of pulse compression
2-5 Processing dimension of Doppler processing
2-6 Examples of common apodization functions
2-7 Processing dimension of beamforming
2-8 Processing dimension of CFAR detection
2-9 CFAR sliding window definition
2-10 Basic corner turn
2-11 Proposed data-cube decomposition to improve corner-turn performance
2-12 Example pictures of chassis and backplane hardware
2-13 Three popular serial backplane topologies
2-14 Layered RapidIO architecture vs. layered OSI architecture
2-15 RapidIO compared to other high-performance interconnects
3-1 Conceptual radar satellite processing system diagram
3-2 Experimental testbed system diagram
3-3 Node architecture block diagram
3-4 External memory controller block diagram
3-5 Network interface controller block diagram
3-6 Standardized co-processor engine wrapper diagram
3-7 Pulse compression co-processor architecture block diagram
3-8 Doppler processing co-processor architecture block diagram
3-9 Beamforming co-processor architecture block diagram
3-10 Illustration of beamforming computations
3-11 CFAR detection co-processor architecture block diagram
4-1 Testbed environment and interface
5-1 Baseline throughput performance results
5-2 Single-node co-processor processing performance
5-3 Processing and memory efficiency for the different stages of GMTI
5-4 Performance comparison between RapidIO testbed and Linux workstation
5-5 Doppler processing output comparison
5-6 Beamforming output comparison
5-7 CFAR detection output comparison
5-8 Data-cube dimensional orientation
5-9 Data-cube performance results
5-10 Local DMA transfer latency prediction and validation
5-11 Illustration of double-buffered processing
5-12 Data-cube processing latency prediction and validation
5-13 Illustration of a corner-turn operation
5-14 Corner-turn latency prediction and validation
5-15 Full GMTI application processing latency predictions
5-16 Corner-turn latency predictions with a direct SRAM-to-RapidIO data path









Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science

RECONFIGURABLE COMPUTING WITH RapidIO FOR SPACE-BASED RADAR

By

Chris Conger

August 2007

Chair: Alan D. George
Major: Electrical and Computer Engineering

Increasingly powerful radiation-hardened Field-Programmable Gate Arrays (FPGAs),

Application-Specific Integrated Circuits (ASICs), and conventional processors (along with high-

performance embedded system interconnect technologies) are helping to enable the on-board

processing of real-time, high-resolution radar data on satellites and other space platforms. With

current processing systems for satellites based on conventional processors and non-scalable bus

interconnects, it is impossible to keep up with the high throughput demands of space-based radar

applications. The large datasets and real-time nature of most space-based radar algorithms call

for highly-efficient data and control networks, both locally within each processing node as well

as across the system interconnect. The customized architecture of FPGAs and ASICs allows for

unique features, enhancements, and communication options to support such applications. Using

a ground moving target indicator application as a case study, my research investigates low-level

architectural design considerations on a custom-built testbed of multiple FPGAs connected over

RapidIO. This work presents experimentally gathered results to provide insight into the

relationship between the strenuous application demands and the underlying system architecture,

as well as the practicality of using reconfigurable components for these challenging high-

performance embedded computing applications.









CHAPTER 1
INTRODUCTION

Embedded processing systems operating in the harsh environments of space are subject to

more stringent design constraints when compared to those imposed upon terrestrial embedded

systems. Redundancy, radiation hardening, and strict power requirements are among the leading

challenges presented to space platforms, and as a result the flight systems currently in space are

mostly composed of lower-performance frequency-limited software processors and non-scalable

bus interconnects. These current architectures may be inadequate for supporting real-time, on-

board processing of sensor data of sufficient volume, and thus new components and novel

architectures need to be explored in order to enable high-performance computing in space.

Radar satellites have a potentially wide field of view looking down from space, but

maintaining resolution at that viewpoint results in very large data sets. Furthermore, target

tracking algorithms such as Ground Moving Target Indicator (GMTI) have tight real-time

processing deadlines of consecutive radar returns in order to keep up with the incoming sensor

data as well as produce useful (i.e., not stale) target detections. In previous work, it has been

shown using simulation [1-3] that on-board processing of high-resolution data on radar satellites

requires parallel processing platforms providing high throughput to all processing nodes, and

efficient processing engines to keep up with the strict real-time deadlines and large data sets.

Starting in 1997, Motorola began to develop a next-generation, high-performance, packet-

switched embedded interconnect standard called RapidIO, and soon after partnered with Mercury

Computers to complete the first version of the RapidIO specification. After the release of the 1.0

specification in 1999, the RapidIO Trade Association was formed, a non-profit corporation

composed of a collection of industry partners for the purpose of steering the development and

adoption of the RapidIO standard. Designed specifically for embedded environments, the









RapidIO standard seeks to address many of the challenges faced by embedded systems, such as

providing a small footprint for lower size and power, inherent fault tolerance, high throughput,

and scalability. Additionally, the RapidIO protocol includes flow control and fault isolation,

features that enhance the fault tolerance of a system but are virtually non-existent in bus-based

interconnects. These characteristics make the RapidIO standard an attractive option for

embedded space platforms, and thus in-depth research is warranted to consider the feasibility and

suitability of RapidIO for space systems and applications.

In addition to the emergence of high-performance interconnects, the embedded community

has also been increasingly adopting the use of hardware co-processing engines in order to boost

system performance. The main types of processing engines used aside from conventional

processors are application-specific integrated circuits (ASICs) and field-programmable gate

arrays (FPGAs). While both components provide the potential for great clock-cycle efficiency

through exploitation of parallelism and custom-tailored architectures for performing a particular

processing task, the ASIC devices are much more expensive to design and fabricate than FPGAs.

By contrast, with sufficient time and resources, ASIC devices can achieve comparable or better

performance than FPGAs, as well as lower power consumption due to the removal of

unnecessary logic that would be present in any FPGA design. Unfortunately, ASICs are rarely

re-usable for many different applications, where FPGAs may be completely reconfigured for a

wide range of end uses. FPGAs are also considered valuable prototyping components due to

their flexible, re-programmable nature.

However, it will be shown that it can be difficult at this point in time to construct a

complete system based solely on hardware components (e.g. FPGAs). Application designers are

rarely fluent in the various hardware description languages (HDL) used to develop FPGA









designs. Ideally, application engineers should still be able to develop applications in the familiar

software setting, making use of the hardware processing resources of the system transparently

through function calls in software. Furthermore, most flight systems have extensive middleware

(software) for system monitoring and control, and porting these critical legacy software

components to hardware would be fought tooth-and-nail, if it were even possible. As will be

shown in my research, there is still an important role for software in these high-performance

embedded systems, and gaining insight into the optimal combination of hardware and software

processing for this application domain is an important objective.

My work conducted an experimental study of cutting-edge system architectures for Space-

Based Radar (SBR) using FPGA processing nodes and the RapidIO high-performance

interconnect for embedded systems. Hardware processing provides the clock-cycle efficiency

necessary to enable high-performance computing with lower-frequency, radiation-hardened

components. Compared to bus-based designs, packet-switched interconnects such as RapidIO

will substantially increase the scalability, robustness, and network performance of future

embedded systems, and the small footprint and fault tolerance inherent to RapidIO suggest that it

may be an ideal fit for use with FPGAs in space systems. The data input and output

requirements of the processing engines, as well as the data paths provided by the system

interconnect and the local memory hierarchies are key design parameters that affect the ultimate

system performance. By experimenting with assorted node architecture variations and

application communication schedules, this research is able to identify critical design features as

well as suggest enhancements or alternative design approaches to improve performance for SBR

applications.









An experimental testbed was built from the ground up to provide a platform on which to

perform experimentation. Based on Xilinx FPGAs, prototyping boards, and RapidlO IP cores,

the testbed was assembled into a complete system for application development and fine-grained

performance measurement. When necessary, additional printed circuit boards (PCBs) were

designed, fabricated, assembled, and integrated with the Xilinx prototyping boards and FPGAs in

order to enhance the capabilities of each testbed node. Moving down one level in the

architecture, a top-level processing node architecture was needed with which to program each of

the FPGAs of the system. The processing node architecture is equally important to performance

as is the system topology, and the flexible, customizable nature of FPGAs was leveraged in order

to propose a novel, advanced chip-level architecture connecting external memory, network

fabric, and processing elements. In addition to proposing a baseline node architecture, the

reprogrammability of the FPGAs allows features and design parameters to be modified in order

to alter the node architecture and observe the net impact on application performance. Each of the

FPGAs in the testbed contains two embedded PowerPC405 processors, providing software

processors to the system that are intimately integrated with the reconfigurable fabric of the

FPGAs. Finally, co-processor engines are designed in HDL to perform the processing tasks of

GMTI best suited for hardware processing.

The remainder of this document will proceed as follows. Chapter 2 will present the results

of a literature review of related research, and provide background information relating to this

thesis. Topics covered in Chapter 2 include GMTI and space-based radar, RapidIO,

reconfigurable computing, as well as commercial embedded system hardware and architectures.

Chapter 3 begins the discussion of original work, by describing in detail the complete

architecture of the experimental testbed. This description includes the overall testbed topology









and components, the proposed processing node architecture (i.e. the overall FPGA design), as

well as individual co-processor engine architectures used for high-speed processing. Chapter 4

will introduce the experimental environment in detail, as well as outline the experimental

methods used for data collection. This overview includes experimental equipment and

measurement procedures, testbed setup and operation, parameter definitions, and a description of

the experiments that are performed. Chapter 5 will present the experimental results, as well as

offer discussion and analysis of those results. Chapter 6 provides some concluding remarks that

summarize the insight that was gained from this work, as well as some suggestions for future

tasks to extend the findings of this research.









CHAPTER 2
BACKGROUND AND RELATED RESEARCH

Ground Moving Target Indicator

There are a variety of specific applications within the domain of radar processing, and even

while considering individual applications there are often numerous ways to implement the same

algorithm using different orderings of more basic kernels and operations. Since this research is

architecture-centric in nature, to reduce the application space one specific radar application

known as GMTI will be selected for experimentation. GMTI is used to track moving targets on

the ground from air or space, and comes recommended from Honeywell Inc., the sponsor of this

research, as an application of interest for their future SBR systems. Much research exists in

academic and technical literature regarding GMTI and other radar processing applications [1-18],

however most deal with airborne radar platforms.



Figure 2-1. Illustration of satellite platform location, orientation, and pulse shape.

For the type of mission considered in this thesis research, the radar platform is mounted on

the satellite, pointed downward towards the ground as shown in Figure 2-1. The radar sends out

periodic pulses at a low duty cycle, with the long "silent" period used to listen for echoes. Each









echo is recorded over a period of time through an analog-to-digital converter (ADC) operating at

a given sampling frequency, corresponding to a swath of discrete points on the ground that the

transmitted pulse passes over [4]. Furthermore, each received echo is filtered into multiple

frequency channels, to help overcome noise and electronic jamming. Thus, each transmitted

pulse results in a 2-dimensional set of data, composed of multiple range cells as well as

frequency channels. When multiple consecutive pulse returns are concatenated together, the

result is a three-dimensional data-cube of raw radar returns that is passed to the high-

performance system for processing (see Figure 2-2).



Figure 2-2. GMTI data-cube.

As mentioned above, periodically the radar will send a collection of data to the processing

system. This period represents the real-time processing deadline of the system, and is referred to

as the Coherent Processing Interval (CPI). Channel dimension lengths are typically small, in the

4-32 channel range, and typical pulse dimension lengths are also relatively small, usually

between 64-256. However, the range dimension length will vary widely depending on radar

height and resolution. Current GMTI implementations on aircraft have range dimensions on the

order of 512-1024 cells [5-6, 12], resulting in data-cube sizes in the neighborhood of 32 MB or

less. However, the high altitude and wide field of view of space-based radar platforms requires










dramatic increases in the range dimension in order to preserve resolution at that altitude. Space-

based radar systems may have to deal with 64K-128K range cells, and even more in the future,

which correspond to data-cubes that are 4 GB or larger.
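To put these figures in perspective, a quick back-of-the-envelope calculation (assuming 8 bytes per complex sample and representative, illustrative dimension choices rather than measured parameters) reproduces the data-cube sizes quoted above:

    # Assumed: 8 bytes per complex sample (single-precision real and imaginary parts);
    # the dimension values below are illustrative examples only.
    bytes_per_sample = 8

    # Airborne-class cube: 1024 ranges x 256 pulses x 16 channels
    airborne = 1024 * 256 * 16 * bytes_per_sample               # 33,554,432 bytes (32 MB)

    # Space-based-class cube: 128K ranges x 256 pulses x 16 channels
    space_based = (128 * 1024) * 256 * 16 * bytes_per_sample    # 4,294,967,296 bytes (4 GB)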

GMTI is composed of a sequence of four sub-tasks, or kernels that operate on the data-

cube to finally create a list of target detections from each cube [7]. These four sub-tasks are (1)

pulse compression, (2) Doppler processing, (3) space-time adaptive processing, and (4) constant

false-alarm rate (CFAR) detection. Figure 2-3 below illustrates the flow of GMTI.

Figure 2-3. GMTI processing flow diagram (new data cube -> pulse compression -> Doppler
filtering -> beamforming -> CFAR detection -> detection report, with adaptive weight
computation performed in parallel and its weights applied to the beamforming of the next cube).

Pulse Compression

The first stage of GMTI is a processing task associated with many radar applications, and

is known as pulse compression. The reason why pulse compression is needed is based on the

fact that the pulse transmitted by the radar is not instantaneous. In an ideal world, the radar

could send out an instantaneous pulse, or impulse, that would impact each point on the ground

instantly, and the echo received by the radar would be a perfect reflection of the ground.

However, in the real world, the radar pulse is not instantaneous. As the pulse travels over the

ground, the radar receives echoes from the entire length of the pulse at any point in time. Thus,

what the radar actually receives as the echo is an image of the surface convolved with the shape









of the transmitted pulse. Pulse compression is used to de-convolve the shape of the transmitted

pulse (which is known and constant) from the received echo, leaving a "cleaned up" image of the

surface [8].

One benefit of pulse compression is that it can be used to reduce the amount of power

consumed by the radar when transmitting the pulses. A tradeoff exists where shorter pulses

result in less-convolved echoes, however require higher transmitting power in order to retain an

acceptable signal-to-noise ratio. By transmitting longer pulses, less power is required, but of

course the echo will be more "dirty" or convolved. Pulse compression thus enables the radar to

operate with lower power using longer pulses [8] and still obtain clean echoes.

In addition to the raw radar data-cube (represented by the variable a), pulse compression

uses a constant complex vector representing a frequency-domain recreation of the original radar

pulse. This vector constant, F, can be pre-computed once and re-used indefinitely [9]. Since

convolution can be performed by a simple multiplication in the frequency domain, pulse

compression works by converting each range line of the data-cube a (see Figure 2-4, henceforth

referred to as a Doppler bin) into the frequency domain with an FFT, performing a point-by-

point complex vector multiplication of the frequency-domain Doppler bin with the complex

conjugate of the vector constant F (for de-convolution), and finally converting the Doppler bin

back into the time domain with an IFFT. Each Doppler bin can be processed completely

independently [10].









Figure 2-4. Processing dimension of pulse compression, highlighting one Doppler bin.









The operations of pulse compression are defined mathematically by the following

expressions, where R, P, and C represent the lengths of the range, pulse, and channel dimensions,

respectively:

A_{c,p,k} = \sum_{r=0}^{R-1} a_{c,p,r} \, e^{-2\pi i \, rk/R},   c = 0,\ldots,C-1;  p = 0,\ldots,P-1;  k = 0,\ldots,R-1        (2.1)

A'_{c,p,k} = A_{c,p,k} \, \bar{F}_k,   c = 0,\ldots,C-1;  p = 0,\ldots,P-1;  k = 0,\ldots,R-1        (2.2)

a'_{c,p,r} = \sum_{k=0}^{R-1} A'_{c,p,k} \, e^{2\pi i \, rk/R},   c = 0,\ldots,C-1;  p = 0,\ldots,P-1;  r = 0,\ldots,R-1        (2.3)

The output, a', represents the pulse-compressed data-cube and is now passed to Doppler

processing after a re-organization of the data in system memory (the data reorganization

operation will be addressed later).
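The three steps above map directly onto an FFT, a point-wise multiply, and an IFFT along the range dimension. The following minimal NumPy sketch illustrates the mathematics only; function and variable names are illustrative, and the implementation in this work uses hardware co-processor engines (described in Chapter 3) rather than this code:

    import numpy as np

    def pulse_compress(cube, F):
        # cube: complex array of shape (C, P, R) -- channels x pulses x ranges
        # F:    complex vector of length R -- frequency-domain recreation of the pulse
        A = np.fft.fft(cube, axis=2)          # Eq. 2.1: FFT each Doppler bin (range line)
        A_prime = A * np.conj(F)              # Eq. 2.2: multiply by the conjugate of F
        return np.fft.ifft(A_prime, axis=2)   # Eq. 2.3: IFFT back to the time domain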

Doppler Processing

Doppler processing performs operations similar to pulse compression, although for a

different reason. Doppler processing is performed along the pulse dimension, and performs

apodization on the data-cube before converting it to the frequency domain along the pulse

dimension. Apodization is done by a complex vector multiplication of each pulse line of the

data-cube (henceforth referred to as a range bin, see Figure 2-5) by a constant, pre-computed

complex time-domain vector [10]. There are several standard apodization functions used for

Doppler processing, a few examples of which are given in Figure 2-6. The choice of apodization

function depends on the properties of the radar, and has no impact on the performance of the

application.

















Figure 2-5. Processing dimension of Doppler processing, highlighting one range bin.






Figure 2-6. Examples of common apodization functions, time-domain (left) and frequency-
domain (middle and right).

Each range bin of the data-cube a' is processed completely independently, and is first

multiplied by the apodization vector, g. The resulting range bin is then converted to the

frequency domain with an FFT, in preparation for the next stage. These operations are defined

mathematically by the following expressions:


b_{c,p,r} = a'_{c,p,r} \, g_p,   c = 0,\ldots,C-1;  p = 0,\ldots,P-1;  r = 0,\ldots,R-1        (2.4)

B_{c,k,r} = \sum_{p=0}^{P-1} b_{c,p,r} \, e^{-2\pi i \, pk/P},   c = 0,\ldots,C-1;  k = 0,\ldots,P-1;  r = 0,\ldots,R-1        (2.5)

The output, B, represents the processed data-cube that is ready for beamforming in the next

stage, after yet another re-organization of the data in system memory.
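The per-range-bin operations of this stage can likewise be sketched in NumPy form (illustrative only; the apodization vector g is assumed here to be a Hanning window, one of the standard choices mentioned above):

    import numpy as np

    def doppler_process(cube_pc, g=None):
        # cube_pc: complex array of shape (C, P, R) -- pulse-compressed cube a'
        # g:       apodization vector of length P (a Hanning window is assumed as an example)
        C, P, R = cube_pc.shape
        if g is None:
            g = np.hanning(P)                          # example apodization function
        b = cube_pc * g[np.newaxis, :, np.newaxis]     # Eq. 2.4: apodize each range bin
        return np.fft.fft(b, axis=1)                   # Eq. 2.5: FFT along the pulse dimension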









Space-Time Adaptive Processing

Space-time adaptive processing (STAP) is the only part of the entire GMTI application that

has any temporal dependency between different CPIs. There are two kernels that are considered

to be a part of STAP: (1) adaptive weight computation (AWC), and (2) beamforming. While

AWC is defined as a part of STAP, it is actually performed in parallel with the rest of the entire

algorithm (recall Figure 2-3). For a given data-cube, the results of AWC will not be used until

the beamforming stage of the next CPI. Beamforming, by contrast, is in the critical path of

processing, and proceeds immediately following Doppler processing using the weights provided

by AWC on the previous data-cube [10-12].

Adaptive Weight Computation

Adaptive weight computation is performed in parallel with the rest of the stages of GMTI,

and is not considered to be a performance-critical step given the extended amount of time

available to complete the processing. The output of AWC is a small set of weights known as the

steering vector, and is passed to the beamforming stage. Since AWC does not lie in the critical

path of processing, this stage is omitted in the implementation of GMTI for this thesis in order to

reduce the necessary attention paid to the application for this architecture-centric research. For

more information on the mathematical theory behind AWC and the operations performed, see

[10, 12].

Beamforming

Beamforming takes the weights provided by the AWC step, and projects each channel line

of the data-cube into one or more "beams" through matrix multiplication [10-13]. Each beam

represents one target classification, and can be formed completely independently. The formation

of each beam can be thought of as filtering the data-cube to identify the probability that a given

cell is a given type or class of target [14]. As the final step of beamforming, the magnitude of









each complex element of the data-cube is obtained in order to pass real values to CFAR detection

in the final stage. Figure 2-7 below shows the processing dimension of beamforming, along with

a visualization of the reduction that occurs to the data-cube as a result of beamforming.



Figure 2-7. Processing dimension of beamforming (left), and diagram illustrating data-cube
reduction that occurs as a result of beamforming (right).

Taking the processed, reorganized data-cube B from Doppler processing, as well as the

steering vector ω from AWC, the beams are formed with a matrix multiplication followed by a

magnituding operation on each element of the resulting beams. These operations are defined

mathematically by the following expressions:

BF_{b,p,r} = \sum_{c=0}^{C-1} \omega_{b,c} \, B_{c,p,r},   b = 0,\ldots,B-1;  p = 0,\ldots,P-1;  r = 0,\ldots,R-1        (2.6)

C_{b,p,r} = \sqrt{\mathrm{real}(BF_{b,p,r})^2 + \mathrm{imag}(BF_{b,p,r})^2},   b = 0,\ldots,B-1;  p = 0,\ldots,P-1;  r = 0,\ldots,R-1        (2.7)


The output cube, C, is a real-valued, completely processed and beamformed data-cube.

The data-cube undergoes one final reorganization in system memory before being passed to the

final stage of processing, CFAR detection.
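A compact NumPy sketch of the beamforming computation is given below (illustrative only; here weights is the B x C matrix of steering vectors provided by AWC, with B denoting the number of beams):

    import numpy as np

    def beamform(B_cube, weights):
        # B_cube:  complex array of shape (C, P, R) -- Doppler-processed cube
        # weights: complex array of shape (B, C)    -- steering vectors from AWC
        BF = np.tensordot(weights, B_cube, axes=([1], [0]))   # Eq. 2.6: weighted sum over channels
        return np.abs(BF)                                      # Eq. 2.7: magnitude for CFAR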

Constant False-Alarm Rate Detection

Constant False-Alarm Rate detection, or CFAR, is another kernel commonly associated

with radar applications [15]. CFAR is used in this case as the final stage in GMTI processing,

and makes the ultimate decision on what is and what is not a target. The goal of CFAR is to









minimize both the number of false-positives (things reported as targets that really are not targets)

as well as false-negatives (things not reported as targets that really are targets). The output of

CFAR is a detection report, containing the range, pulse, and beam cell locations of each target

within the data-cube, as well as the energy of each target.






Figure 2-8. Processing dimension of CFAR detection.

CFAR operates along the range dimension (see Figure 2-8 above) by sliding a window

along each Doppler bin, computing local averages and performing threshold comparisons for

each cell along the range dimension. In addition to the coarse-grained, embarrassingly parallel

decomposition of the data-cube across multiple processors, the computations performed by the

sliding window contain a significant amount of parallelism that may also be exploited to

further reduce the number of computations required per Doppler bin [15]. This fine-grained

parallelism is based on reusing partial results from previously computed averages, and will be

described in detail later in Chapter 3. The computations of CFAR on the data-cube C can be

expressed mathematically as:

T_{b,p,r} = \frac{1}{N_{\mathrm{cfar}}} \sum_{i=G+1}^{G+N_{\mathrm{cfar}}} \left( C_{b,p,r+i}^2 + C_{b,p,r-i}^2 \right)        (2.8)

\frac{C_{b,p,r}^2}{T_{b,p,r}} > u,   b = 0,\ldots,B-1;  p = 0,\ldots,P-1;  r = 0,\ldots,R-1        (2.9)









where G represents the number of guard cells on each side of the target cell and N_cfar represents

the size of the averaging window on either side of the target cell (see Figure 2-9).

Figure 2-9. CFAR sliding window definition.

For each cell in the Doppler bin, a local average is computed of the cells surrounding the

target cell (the target cell is the cell currently under consideration). A small number of cells

immediately surrounding the target cell, referred to as guard cells, are ignored. Outside of the

guard cells on either side of the target cell, a window of N_cfar cells is averaged. This average is

then scaled by some constant, u. The value of the target cell is then compared to the scaled

average, and if the target cell is greater than the scaled average then it is designated as a target.

The constant u is very important to the algorithm, and is the main adjustable parameter that

must be tuned to minimize the false-positives and false-negatives reported by CFAR. This

parameter is dependent on the characteristics of the specific radar sensor, and the theory behind

its determination is beyond the scope or focus of this thesis research. Fortunately, the value of

this constant does not affect the performance of the application. One final detail of the algorithm

concerns the average computations at or near the boundaries of the Doppler bin. Near the

beginning or end of the Doppler bin, one side of the sliding window will be "incomplete" or un-

filled, and thus only a right-side or left-side scaled average comparison can be used near the

boundaries as appropriate to handle these conditions [16].
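A straightforward, unoptimized sketch of the CFAR test for interior cells is given below; it computes each window average directly rather than reusing partial sums, and it simply skips the boundary cells where a one-sided average would be used (parameter names are illustrative):

    import numpy as np

    def cfar_detect(C_cube, G, N, u):
        # C_cube: real array of shape (B, P, R) -- beamformed, magnituded cube
        # G: guard cells per side; N: averaging cells (N_cfar) per side; u: scaling constant
        B, P, R = C_cube.shape
        power = C_cube ** 2
        detections = np.zeros(C_cube.shape, dtype=bool)
        for r in range(G + N, R - G - N):                      # interior target cells only
            left = power[:, :, r - G - N : r - G]              # N cells left of the guard band
            right = power[:, :, r + G + 1 : r + G + N + 1]     # N cells right of the guard band
            T = (left.sum(axis=2) + right.sum(axis=2)) / N     # local average (Eq. 2.8)
            detections[:, :, r] = power[:, :, r] > u * T       # threshold test (Eq. 2.9)
        return detections

In an efficient implementation, the window sums would instead be maintained incrementally as the window slides (adding the cell entering each window and subtracting the cell leaving it); this is the partial-result reuse referred to above and revisited in Chapter 3.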










Corner Turns

Another important operation for GMTI, and many other signal processing applications, is

that of re-organization of data for different processing tasks. In order to increase the locality of

memory accesses within any given computational stage of GMTI, it is desirable to organize the

data in memory according to the dimension that the next processing stage will operate along. In

distributed memory systems, where the data-cube is spread among all of the processing

memories and no single node contains the entire data-cube, these corner-turns, or distributed

transposes can heavily tax the local node memory hierarchies as well as the network fabric. It

should be noted here that with GMTI, the corner-turns constitute the only inter-processor

communication associated with the algorithm [10, 17]. Once all data is organized properly for a

given stage of GMTI, processing may proceed at each node independent of one another in an

embarrassingly parallel fashion. Thus, it is the performance of the corner-turn operations as the

system size is scaled up that determines the upper bound on parallel efficiency for GMTI.

It is important to understand the steps required for a corner-turn, at the memory access

level, in order to truly understand the source of inefficiency for this notorious operation. The

basic distributed corner-turn operation can be visualized by observing Figure 2-10 below.

Figure 2-10. Basic corner turn.









First, consider the most basic decomposition of the data-cube for any given task, along a

single dimension of the cube (see Figure 2-10, left side). Assume the data-cube is equally

divided among all the nodes in the system, and must be re-distributed among the nodes as

illustrated. It can be seen that each node will have to send a faction of its current data set to each

other node in the system, resulting in personalized all-to-all communication event, which

heavily taxes the system backplane with all nodes attempting to send to one another at the same

time. Simulations in [3] suggest that careful synchronization of the communication events can

help to control contention and improve corner-turn performance. However, look closer at the

local memory access pattern at each node. The local memory can be visualized as shown to the

right side of Figure 2-10. The data to be sent to each node does not completely lie in consecutive

memory locations, and still must be transposed at some point. The exact ordering of the striding,

grouping, and transposition is up to the designer, and will serve as a tradeoff study to be

considered later. Because of this memory locality issue, corner-turns pose efficiency challenges

both at the system interconnect level as well as at the node level in the local memory hierarchy.
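The per-node work in a basic corner turn can be sketched as follows (a minimal illustration assuming a simple one-dimensional decomposition and row-major storage; as noted above, the ordering of the striding, grouping, and transposition is a design choice):

    import numpy as np

    def corner_turn_segments(local_block, num_nodes):
        # local_block: this node's slice of a 2-D plane of the cube,
        #              shape (rows_local, cols_total), decomposed along the first dimension
        # Returns one contiguous, transposed segment per destination node;
        # a personalized all-to-all exchange then delivers the segments.
        cols_per_node = local_block.shape[1] // num_nodes
        segments = []
        for dest in range(num_nodes):
            # Strided read of the columns destined for node 'dest'...
            piece = local_block[:, dest * cols_per_node : (dest + 1) * cols_per_node]
            # ...transposed and packed into contiguous memory before sending.
            segments.append(np.ascontiguousarray(piece.T))
        return segments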

A workshop presentation from Motorola [18] suggests an alternate approach to

decomposition of data-cubes for GMTI, in order to improve the performance of corner-turn

operations. Instead of decomposition along a single dimension, since all critical stages of GMTI

operate on 1-dimensional data, the data-cube can be decomposed along two dimensions for each

stage. The resulting data-cube would appear decomposed as shown below in Figure 2-11,

showing two different decompositions to help visualize where the data must move from and to.

The main advantage of this decomposition is that in order to re-distribute the data, only groups of

nodes will need to communicate with one another, as opposed to a system-wide all-to-all

communication event.
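As a toy illustration of this group-wise communication (assuming, purely for the example, 16 nodes arranged as a logical 4 x 4 grid in which one grid dimension of the decomposition is preserved across the corner turn):

    def corner_turn_partners(node_id, grid_rows=4, grid_cols=4):
        # Under the assumed 2-D decomposition, a node only exchanges data with the
        # other nodes in its own row of the logical grid during the corner turn.
        assert 0 <= node_id < grid_rows * grid_cols
        row = node_id // grid_cols
        return [row * grid_cols + c for c in range(grid_cols)
                if row * grid_cols + c != node_id]

    # Example: node 5 of 16 exchanges data with only 3 partners during the corner turn,
    # instead of with all 15 other nodes.
    print(corner_turn_partners(5))   # -> [4, 6, 7]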


















Figure 2-11. Proposed data-cube decomposition to improve corner-turn performance.

Other previous work [1] suggests a processor allocation technique to improve GMTI

performance, and fits well with the decomposition approach suggested above. This allocation

technique suggests that instead of using a completely data-parallel or completely pipelined

implementation, a "staggered" parallelization should be adopted which breaks the system up into

groups of nodes. Each group of nodes receives an entire data-cube, and performs all stages of

GMTI in a data-parallel fashion. Consecutive data-cubes are handed to different groups in a

round-robin fashion, thus the "staggered" title. The benefit of this processor allocation scheme is

that the entire backplane is not taxed for corner-turns, instead only the local interconnect for each

group of processors. Tuned to a specific topology, this processor allocation strategy could fit

well with proper decomposition boundaries of the data-cube as suggested above, further

improving overall application performance.

Embedded System Architectures for Space

Given the unique physical environment and architecture of satellite systems, it is worth

reviewing typical embedded hardware to be aware of what kinds of components are available,

and where the market is moving for next-generation embedded systems. Many high-

performance embedded processing systems are housed in a small (<2 ft. x <2 ft. x <2 ft.) metal

box called a chassis (example pictured in Figure 2-12(a)). The chassis provides a stable structure

in which to install cards with various components, as well as one central "motherboard" or

backplane that is a critical component of the embedded system.

























Figure 2-12. Example pictures of chassis and backplane hardware: A) Mercury's Ensemble2
Platform, an ATCA chassis for embedded processing, and B) example of a passive backplane.

The backplane is a large printed circuit board (PCB) that provides multiple sockets as well

as connectivity between the sockets. There are two general classes of backplanes, passive

backplanes and active backplanes. Active backplanes include components such as switches,

arbiters, and bus transceivers, and actively drive the signals on the various sockets. Passive

backplanes, by contrast, only provide electrical connections (e.g. copper traces on the PCB)

between pins of the connectors, and the cards that are plugged into the sockets determine what

kind of communication protocols are used. Passive backplanes are much more popular than

active backplanes, as is indicated by the wide availability of passive backplane products

compared to the scarcity of active backplane products. Figure 2-12(b) shows an example of a

passive backplane PCB with a total of 14 card sockets.

However, even passive backplanes put some restriction on what kinds of interconnects can

be used. Most importantly (and obviously), is the pin-count of the connectors, as well as the

width and topology of the connections between sockets. Most backplanes are designed with one

or more interconnect standards in mind, and until recently almost all backplane products were









built to support some kind of bus interconnect. For example, for decades one of the most

popular embedded system backplane standards has been the VME standard [19].

In response to the growing demand for high-performance embedded processing systems,

newer backplane standards are emerging, such as the new VME/VXS extension or the ATCA

standard, that are designed from the ground-up with high-speed serial packet-switched

interconnects and topologies in mind [19, 20]. In addition to providing the proper connection

topologies between sockets to support packet-switched systems, extra care and effort must be put

into the design and layout of the traces, as well as the selection of connectors, in order to support

high-speed serial interconnects with multi-GHz speeds. High-speed signals on circuit boards are

far more susceptible to signal-integrity hazards, and much effort is put into the specification of

these backplane standards, such as VXS or ATCA, in order to provide the necessary PCB quality

to support a number of different high-speed serial interconnect standards. Figure 2-13 shows

three of the most popular serial backplane topologies:













Figure 2-13. Three popular serial backplane topologies: A) star, B) dual-star, and C) full-mesh.

Generally, a serial backplane such as VXS or ATCA will have one or more switch card

slots, and several general-purpose card slots. The example backplane pictured back in Figure 2-

12(b) shows an example of this, as the two slots to the far left are larger than the rest, indicating









that those slots are the switch sockets. Most likely, the backplane pictured in Figure 2-12(b) uses a

dual-star topology, as pictured above in Figure 2-13(b). As mentioned previously, the passive

backplane itself does not dictate which interconnect must be used, although it may restrict which

interconnects can be used. The cards that are plugged into the backplane will dictate what the

system interconnect is, as well as what processing and memory resources are to be a part of the

system.

There are many different products for processor, memory, and switch cards, all of which

conform to one chassis/backplane standard or another. They combine any number of software

processors, FPGAs and ASICs, and various memory devices, as well as connectivity through a

switch or some other communication device to the backplane connector. While there are far too

many products to review in this document, there was one particular publication at a recent High-

Performance Embedded Computing (HPEC) workshop that is very relevant to this project, and

warrants a brief discussion. The company SEAKR published a conference paper in 2005

highlighting several card products specifically designed for space-based radar processing [21].

These products include a switch card, a processor card, and a mass memory module, all of which

are based on FPGAs and use RapidIO as the interconnect. The architecture of each board is

detailed, revealing a processor board that features 4 FPGAs connected together via a RapidIO

switch, with several other ports of the switch connecting to ports leading off the board to the

backplane connector. Beyond the details of each board architecture, seeing such products begin

to emerge during the course of this thesis research is a positive indication from the industry that

pushing towards the use of FPGAs and RapidIO for high-performance radar and signal

processing is indeed a feasible and beneficial direction to go.









Field-Programmable Gate Arrays

By this point in time, FPGAs have become fairly common and as such this thesis is written

assuming that the reader already has a basic understanding of the architecture of FPGAs.

Nevertheless, this technology is explicitly mentioned here for completeness, and to ensure the

reader has the necessary basic knowledge of the capabilities and limitations of FPGAs in order to

understand the designs and results presented later. For a detailed review of basic background

information and research regarding FPGAs, see [22-27]. Since this research involved a lot of

custom low-level design with the FPGAs, several documents and application notes from Xilinx

were heavily used during the course of the research [28-33].

RapidIO

RapidIO is an open-standard, defined and steered by a collection of major industry

partners including Motorola (now Freescale) Semiconductor, Mercury Computers, Texas

Instruments, and Lucent Technologies, among others. Specially designed for embedded systems,

the RapidIO standard seeks to provide a scalable, reliable, high-performance interconnect

solution designed specifically to alleviate many of the challenges imposed upon such systems

[34, 35]. The packet-switched protocol provides system scalability while also including features

such as flow control, error detection and fault isolation, and guaranteed delivery of network

transactions. RapidIO also makes use of Low-Voltage Differential Signaling (LVDS) at the pin-

level, which is a high-speed signaling standard that enjoys higher attainable frequencies and

lower power consumption due to low voltage swing, as well as improved signal integrity due to

common-mode noise rejection between signal pairs (at the cost of a doubling of the pin count for

all LVDS connections).

As a relatively new interconnect standard, there is a vacuum of academic research in

conference and journal publications. In fact, at the conception of this research project in August









2004, there was only a single known academic research paper that featured a simulation of a

RapidIO network [36], although the paper was brief and ambiguous as far as definitions of the

simulation models and other important details. The majority of literature used to research and

become familiar with RapidIO comes from industry whitepapers, conference and workshop

presentations from companies, and standards specifications [34-42]. This thesis research [42],

and the numerous publications to come from this sponsored project [1-3, 17], contribute much-

needed academic research into the community investigating this new and promising high-

performance embedded interconnect technology.

As defined by the standard, a single instance of a RapidIO "node," or communication port,

is known as an "endpoint." The RapidlO standard was designed from the start with an explicit

intention of keeping the protocol simple in order to ensure a small logic footprint, and not require

a prohibitive amount of space on an IC to implement a RapidlO endpoint [35]. The RapidlO

protocol has three explicit layers defined, which map to the classic Layered OSI Architecture as

shown below in Figure 2-14. The RapidlO logical layer handles end-to-end transaction

management, and provides a programming model to the application designer suited for the

particular application (e.g. Message-Passing, Globally-Shared Memory, etc). The transport layer

is very simple, and in current implementations is often simply combined with the logical layer.

The transport layer is responsible for providing the routing mechanism to move packets from

source to destination using source and destination IDs. The physical layer is the most complex

part of the RapidIO standard, and handles the electrical signaling necessary to move data

between two physical devices, in addition to low-level error detection and flow control.




























Figure 2-14. Layered RapidIO architecture vs. layered OSI architecture. RapidIO has multiple
physical layer and logical layer specifications, and a common transport layer
specification [35].

RapidIO offers several variations of the logical layer specification, specifically the

message passing logical layer, the memory-mapped logical I/O logical layer, as well as the

globally-shared memory logical layer [38, 39]. Both the logical I/O and the globally-shared

memory logical layers are designed to provide direct, transparent access to remote memories

through a global address space. Any node can read or write directly from/to the memory of any

other node connected to the RapidIO fabric, simply by accessing an address located in the remote

node's address space. The major difference between these two logical layer variants is that the

globally-shared memory logical layer offers cache-coherency, while the simpler logical I/O

logical layer provides no coherence. The message passing logical layer, by contrast, is

somewhat analogous to the programming model of message passing interface (MPI), where

communication is carried out through explicit send/receive pairs. With the logical I/O and

globally-shared memory logical layers, the programmer has the option, for some request types, of requesting explicit acknowledgement through the logical layer at the receiving end, or otherwise









allowing the physical layer to guarantee successful delivery without requiring end-to-end

acknowledgements for every transfer through logical layer mechanisms. Regardless of which

logical layer variant is selected for a given endpoint implementation, the logical layer is defined

over a single, simple, common transport layer specification.

The RapidIO physical layer is defined to provide the designer a range of potential

performance characteristics, depending on the requirements of his or her specific application.

The RapidIO physical layer is bi-directional or full-duplex, is sampled on both rising and falling

edges of the clock (DDR), and defines multiple legal clock frequencies. Individual designers

may opt to run their RapidIO network at a frequency outside of the defined range. All RapidIO

physical layer variants can be classified under one of two broad categories: (1) Parallel RapidIO,

or (2) Serial RapidIO. Parallel RapidIO provides higher throughputs over shorter distances, and

has both an 8-bit and a 16-bit variant defined as part of the RapidIO standard [38]. Serial

RapidIO is intended for longer-distance, cable or backplane applications, and also has two

different variants defined in the standard: the single-lane 1x Serial and the 4-lane 4x Serial [40].

Parallel RapidIO is source-synchronous, meaning that a clock signal is transmitted along

with the data, and the receive logic at each end point is clocked on the received clock signal.

Serial RapidIO, by contrast, uses 8b/10b encoding on each serial lane in order to enable reliable

clock recovery from the serial bit stream at the receiver. This serial encoding significantly

reduces the efficiency of Serial RapidIO relative to parallel RapidIO, but nearly all high-

performance serial interconnects use this 8b/10b encoding, so Serial RapidIO does not suffer

relative to other competing serial standards. Figure 2-15 depicts both RapidIO physical layer

flavors along with several other high-performance interconnects, to show the intended domain of

each.









Figure 2-15. RapidIO compared to other high-performance interconnects, positioned across the intra-system and inter-system interconnect domains [35].

There are two basic packet types in the RapidIO protocol: (1) standard packets, and (2)

control symbols. Standard packets include types such as request packets and response packets,

and will include both a header as well as some data payload. Control symbols are small 32-bit

"symbols" or byte patterns that have special meanings and function as control messages between

communicating endpoints. These control symbols can be embedded in the middle of the

transmission of a regular packet, so that critical control information does not have to wait for a

packet transfer to complete in order to make it across the link. This control information supports

physical layer functions such as flow control, link maintenance, and endpoint training sequence

requests, among others.

Error detection in RapidIO is performed at the physical layer in order to catch errors as early as possible and thereby minimize latency penalties. Errors are detected differently

depending on the packet type, either through cyclic redundancy checks (CRC) on standard

packets, or bitwise inverted replication of control symbols. For maximum-sized packets, the









RapidIO protocol will insert two CRC checksums into the packet, one in the middle and one at

the end. By doing so, errors that occur near the beginning of a packet will be caught before the

entire packet is transmitted, and can be stopped early to decrease the latency penalty on packet

retries. This error detection mechanism in the physical layer is what the Logical I/O and

Globally-Shared Memory logical layers rely upon for guaranteed delivery if the application designer decides not to request explicit acknowledgements, as mentioned previously. The physical

layer will automatically continue to retry a packet until it is successfully transmitted, however

without explicit acknowledgement through the logical layer, the user application may have no

way of knowing when a transfer completes.

The RapidIO physical layer also specifies two types of flow control: (1) transmitter-controlled flow control, and (2) receiver-controlled flow control. This flow-control mechanism is carried out by every pair of electrically-connected RapidIO devices. Both of these flow-

control methods work as a "Go-Back-N" sliding window, with the difference lying in the method

of identifying a need to re-transmit. The more basic type of flow-control is receiver-controlled,

and relies on control symbols from the receiver to indicate that a packet was not accepted (e.g. if

there is no more buffer space in the receiving endpoint). This flow-control method is required by

the RapidIO standard to be supported by all implementations. The transmitter-controlled flow-

control method is an optional specification that may be implemented at the designer's discretion,

and provides slightly more efficient link operation. This flow-control variant works by having

the transmit logic in each endpoint monitor the amount of buffer space available in its link

partner's receive queue. This buffer status information is contained in most types of control

symbols sent by each endpoint supporting transmitter-controlled flow control, including idle

symbols, so all endpoints constantly have up-to-date information regarding their partner's buffer









space. If a transmitting endpoint observes that there is no more buffer space in the receiver, the

transmitting node will hold off transmitting until it observes the buffer space being freed, which

results in less unnecessary packet transmissions.

The negotiation between receiver-controlled and transmitter-controlled flow-control is

performed by observing a particular field of control symbols that are received. An endpoint that

supports transmitter-controlled flow-control, endpoint A, will come out of reset assuming

transmitter-controlled flow control. If endpoint A begins receiving control symbols from its

partner, endpoint B, that indicate receiver-controlled flow control (i.e. the buffer status field of

the symbol is set to a certain value), then endpoint A will revert to receiver-controlled flow control, as required by the protocol, on the assumption that endpoint B either does not support or does not currently wish to use transmitter-controlled flow control. RapidIO endpoint

implementations that do support both flow-control methods have a configuration register that

allows the user to force the endpoint to operate in either mode if desired.
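To make the interplay between the two flow-control modes concrete, the following C-style sketch models the decisions described above. It is purely illustrative: the type names, the BUF_STATUS_RX_CONTROLLED encoding, and the idea of tracking a free-buffer count in software are assumptions made for the example, not part of any actual RapidIO endpoint interface.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-link state for illustrating the two RapidIO flow-control modes. */
    typedef struct {
        bool    tx_controlled;     /* true = transmitter-controlled, false = receiver-controlled */
        uint8_t partner_free_bufs; /* free receive buffers last reported by the link partner     */
    } link_state_t;

    /* Assumed encoding of the control-symbol buffer-status field that signals
     * "receiver-controlled only"; the real RapidIO encoding is not shown here. */
    #define BUF_STATUS_RX_CONTROLLED 0xF

    /* Called whenever a control symbol arrives from the link partner. */
    void on_control_symbol(link_state_t *link, uint8_t buf_status_field)
    {
        if (buf_status_field == BUF_STATUS_RX_CONTROLLED) {
            link->tx_controlled = false;                 /* partner forces receiver-controlled mode */
        } else if (link->tx_controlled) {
            link->partner_free_bufs = buf_status_field;  /* update view of partner's buffer space   */
        }
    }

    /* Decide whether a new packet may be transmitted right now. */
    bool may_transmit(const link_state_t *link)
    {
        if (link->tx_controlled)
            return link->partner_free_bufs > 0;   /* hold off when the partner is full              */
        return true;  /* receiver-controlled: transmit and rely on retry control symbols if refused  */
    }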

The RapidIO steering committee is continually releasing extensions to the RapidIO standard in order to incorporate changes in technology and enhance the capabilities of the RapidIO protocol for future embedded systems. One such example of an extension is the RapidIO Globally-Shared Memory logical layer, which was released after the Logical I/O logical layer specification. Other protocol extensions include flow control enhancements to provide end-to-end flow control for system-level congestion avoidance, a new data streaming logical layer specification with advanced traffic classification and prioritization mechanisms, and support for multi-cast communication events, to name a few examples. These newer RapidIO

protocol extensions are not used directly in this research, however, and as such they are

mentioned for completeness but will not be discussed in further detail.









CHAPTER 3
ARCHITECTURE DESCRIPTIONS

This chapter will provide a detailed description of the entire testbed, progressing in a top-

down manner. In addition to defining the hardware structures, the application development

environment will be described so as to identify the boundary between hardware and software in

this system, as well as define the roles of each. The first section will introduce the entire testbed

architecture from a system-level, followed by a detailed look at the individual node architecture

in the second section. This node architecture is a novel design that represents a major

contribution and focus of this thesis research. Finally, the last section will present and discuss

the architectures of each of the hardware processing engines designed in the course of this

research.

Testbed Architecture

Recall that the high-performance processing system will ultimately be composed of a

collection of cards plugged into a backplane and chassis. These cards will contain components

such as processors, memory controllers, memory, external communication elements, as well as

high-speed sensor inputs and pre-processors. Figure 3-1 illustrates a conceptual high-

performance satellite processing system, of which the main processing resources are multiple

processor cards, each card containing multiple processors, each processor with dedicated

memory. The multiple processors on each card are connected together via a RapidIO switch, and the cards themselves are also connected to the system backplane through the same RapidIO

switches. While Figure 3-1 shows only three processor cards, a real system could contain

anywhere from four to sixteen such cards. The system depicted in Figure 3-1 matches both the kind of system showcased by SEAKR in [21] and the target system described by Honeywell,

the sponsor of this thesis research.










Figure 3-1. Conceptual radar satellite processing system diagram (legend: processor node, RapidIO switch, system memory).

While it is cost-prohibitive to acquire or construct a full-scale system to match the

conceptual hardware described above, it is possible to identify and capture the important

parameters of such an architecture and study those parameters on a smaller hardware platform.

There are three general parameters that affect overall application performance the most: (1)

processor selection and performance, (2) processor-local memory performance, and (3) system

interconnect selection and performance.

The experimental testbed used for this thesis research was designed from the ground-up

using very basic "building blocks" such as FPGAs on blank prototyping boards, a donated

RapidIO IP core (courtesy of Xilinx), custom-designed PCBs, and hand-coded Verilog and

C. There are a total of two Xilinx Virtex-II Pro FPGAs (XC2VP40-FF1152-6) and prototyping

boards (HW-AFX-FF1152-300), connected to each other over RapidIO. In addition to this dual-

FPGA foundation, a small collection of custom-designed PCBs were fabricated and assembled in









order to enhance the capabilities of the FPGAs on the blank prototyping boards. These PCBs

include items such as

* 2-layer global reset and configuration board

* 2-layer cable socket PCB, for high-speed cable connection to the pins of the FPGA

* 6-layer external SDRAM board: 128 MB, 64-bit @ 125 MHz

* 10-layer switch PCB, hosting a 4-port Parallel RapidIO switch from Tundra

* 2-layer parallel port JTAG cable adapter for the Tsi500 RapidIO switch

While the design, assembly, and testing of these circuits represented a major contribution

of time and self-teaching, for this thesis research the circuits themselves are simply a means to an

end to enable more advanced experimental case studies. As such, the details of these circuits

such as the schematics, physical layouts, bill of materials, etc. are omitted from this thesis

document. This information is instead provided as supplementary material (along with all

Verilog and C source code for the testbed) in a separate document.

The overall experimental testbed is a stand-alone entity; however it does have several

connections to a laboratory workstation for debugging and measurement. These connections will

be discussed in detail in Chapter 4. All software written runs on the embedded PowerPC405

processors in the FPGAs, and all other functionality of the testbed is designed using Verilog and

realized in the reconfigurable fabric of the FPGA. Each node (FPGA) of the testbed is assigned

a node ID, which represents the upper address bits of the 32-bit global address space. One node

is defined to be the master, and is responsible for initiating processing in the system.

Communication between nodes, both data and control, is performed over RapidIO, by writing or

reading to/from pre-defined locations in the external DRAM of remote nodes (Figure 3-2).
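As a small illustration of this addressing scheme, the following C sketch splits a 32-bit global address into a node ID and a local offset. The 4-bit node-ID field is an assumption made for the example; the text does not specify how many upper address bits are actually used on the testbed.

    #include <stdint.h>

    /* Assumed split of the 32-bit global address space: upper 4 bits = node ID. */
    #define NODE_ID_BITS   4u
    #define NODE_ID_SHIFT  (32u - NODE_ID_BITS)
    #define LOCAL_MASK     ((1u << NODE_ID_SHIFT) - 1u)

    static inline uint32_t addr_node_id(uint32_t global_addr)
    {
        return global_addr >> NODE_ID_SHIFT;   /* which node owns this address     */
    }

    static inline uint32_t addr_local_offset(uint32_t global_addr)
    {
        return global_addr & LOCAL_MASK;       /* offset within that node's memory */
    }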


























Figure 3-2. Experimental testbed system diagram.

Due to the potentially high cost of hardware, the high risk of prototype development, and

the complexity of full-system design from near-zero infrastructure, several unavoidable

restrictions were imposed by the experimental testbed used for this research. One major

drawback to this hardware is the absence of larger-capacity external processing memory, such as

QDR SRAM which is popular in today's embedded systems. The absence of this memory

restricts the node to the memory available internal to the FPGA, which is much smaller. Because

of this memory void, the maximum data-cube sizes able to be supported on this hardware are

capped (though the performance results and conclusions drawn from those results remain valid).

Furthermore, the high-risk nature of custom high-speed circuitry integration prevented the

successful addition of a RapidIO switch to the testbed. In addition to limiting the testbed size,

the absence of a switch also complicates the task of providing a high-speed data input into the

system. As a result of that restriction, all data to be processed for a given experiment must be

pre-loaded into system memory before processing begins, thus making a real-time processing

demonstration impossible. Despite these challenges, a capable system architecture is proposed and analyzed, and valuable insight can still be gained from the experiments conducted.









Node Architecture

The node architecture presented in this section is an original design proposed,

implemented, and tested in order to enable the experimental research of this thesis. The design

itself represents a significant development effort and contribution of this research, and it is

important to review the details of the organization and operation of this architecture in order to

understand its potential benefits for space-based radar applications. Each node of the testbed

contains three major blocks: (1) an external memory interface, (2) a network interface to the

high-performance interconnect, and (3) one or more processing elements and on-chip memory

controller. Beyond simply providing these various interfaces or components, the node

architecture must also provide a highly efficient data path between these components, as well as

a highly efficient control path across the node to enable tight control of the node's operation.

This particular architecture features multiple processing units in each node, including both

hardware processors implemented in the reconfigurable fabric and software processors in the

form of hard-core embedded PowerPC405s. Given the dual-paradigm nature of each node, the

optimal division of tasks between the hardware and software processors is critical to identify in

order to maximize system performance. A block diagram illustrating the entire node architecture

is shown in Figure 3-3.

It should be noted here that the design is SDRAM-centric, meaning that all data transfers

occur to or from external SDRAM. With a single bank of local memory for each node, the

performance of that SDRAM controller is critical to ultimate system performance. Most

processors today share this characteristic, and furthermore the conventional method of providing

connectivity to the external memory is through a local bus. Even if the system interconnect is a

packet-switched protocol, internal to each node the various components (e.g. functional units,

network interface, etc.) are connected over a local bus, such as CoreConnect from IBM or












AMBA Bus [43]. This local bus forces heavyweight bus protocol overheads and time division


multiplexing for data transfer between different components of an individual node. An alternate


approach to providing node-level connectivity to external memory is to use the programmable


fabric of the FPGA to implement a multi-ported memory controller. The multi-port memory


controller will provide a dedicated interface for each separate component, thus increasing intra-


node concurrency and allowing each port to be optimized for its particular host. Each port of the


memory controller has dedicated data FIFOs for buffering read and write data, allowing multiple


components to transfer data to/from the controller concurrently, as well as dedicated command


FIFOs to provide parallel control paths.


Figure 3-3. Node architecture block diagram.


The control network within each node of the testbed provides the ability to request locally-


initiated transfers, control co-processor activity, and handle incoming RapidIO requests









transparently. Locally-initiated data movement is controlled through a DMA engine driven by

the PowerPC (shown as two components in Figure 3-3, the DMA controller and the command

controller). Based on the requested source and destination memory addresses, the DMA engine

determines if the memory transfer is local or remote, and sends personalized commands to the

command FIFO of the external memory controller as well as the command FIFO of the other

component, as appropriate for the transfer type. Each component acknowledges the completion

of its portion of a data transfer to the DMA engine, providing a feedback path back to the

PowerPC to indicate the successful completion of memory transfers. Co-processor activity, by

contrast, is controlled by a dedicated control bus directly connecting the PowerPC to the co-

processors (shown as the dark red line near the bottom of Figure 3-3). Similar to DMA transfers,

processing and configuration commands are submitted to the co-processors and their activity is

monitored using this control bus.
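The local-versus-remote decision made by the DMA engine can be sketched in C as shown below. The command-push helpers and node-ID functions are hypothetical stand-ins for the testbed's command FIFOs and address decoding; the sketch only captures the routing logic described above, not the actual Verilog implementation.

    #include <stdint.h>

    typedef enum { DMA_READ, DMA_WRITE } dma_op_t;

    /* Hypothetical command-FIFO interfaces standing in for the external memory
     * controller, the on-chip memory (OCM) controller, and the network interface. */
    void push_sdram_cmd(dma_op_t op, uint32_t addr, uint32_t bytes);
    void push_ocm_cmd(dma_op_t op, uint32_t addr, uint32_t bytes);
    void push_netif_cmd(dma_op_t op, uint32_t addr, uint32_t bytes, uint32_t dest_node);
    uint32_t addr_node_id(uint32_t global_addr);   /* upper address bits hold the node ID */
    uint32_t local_node_id(void);

    /* Route a requested transfer to the appropriate pair of command FIFOs. */
    void dma_start(uint32_t src, uint32_t dst, uint32_t bytes)
    {
        if (addr_node_id(dst) != local_node_id()) {
            /* Remote write: read from local SDRAM, send over RapidIO. */
            push_sdram_cmd(DMA_READ, src, bytes);
            push_netif_cmd(DMA_WRITE, dst, bytes, addr_node_id(dst));
        } else if (addr_node_id(src) != local_node_id()) {
            /* Remote read: request data over RapidIO, write it into local SDRAM. */
            push_netif_cmd(DMA_READ, src, bytes, addr_node_id(src));
            push_sdram_cmd(DMA_WRITE, dst, bytes);
        } else {
            /* Local transfer, sketched here as SDRAM to co-processor memory. */
            push_sdram_cmd(DMA_READ, src, bytes);
            push_ocm_cmd(DMA_WRITE, dst, bytes);
        }
    }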

External Memory Controller

As the central component to each node in the system, the external memory controller

affects both interconnect and processor efficiency. As such, it is the highest-performance part of the system, capable of handling traffic from both the interconnect and

the processors concurrently. The interface to the controller is composed of multiple ports, each

dedicated to a single component in the node, providing dedicated data paths for every possible

direction of transfer. All connections to the external memory controller interfaces are through

asynchronous FIFOs, meaning that not only can the external memory controller operate in a

different clock domain from the rest of the node, it can also have a different data path width.

This logic isolation strategy is a key technique used through the entire node design, in order to

enable performance optimization of each section independent of the requirements of other










sections of the design. Figure 3-4 shows a very simplified block diagram of the internal logic of

the external memory controller.

Figure 3-4. External memory controller block diagram.

The external memory controller is based on a single state machine that constantly monitors

the two command queues. Notice that in Figure 3-4, there are two command FIFOs leading into

the external memory controller; one of these command FIFOs is for locally-requested memory

transfers (lower) and the other is for incoming remote transfer requests from the RapidIO fabric

(upper). Commands are decoded and executed as soon as they arrive, with a simple arbitration

scheme that prevents starvation of either interface to the controller. To further prevent

congestion at the external memory controller of each node, the external memory controller data

throughput is twice that of either the on-chip memory controller or the RapidIO interface. This

over-provisioning of the bandwidth ensures that even if a processor is writing or reading to/from

external memory, incoming requests from remote nodes can also be serviced at the full RapidIO

link speed without interfering with the processor traffic through the controller.
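The arbitration between the two command queues can be illustrated with a short C sketch. The real arbiter is a Verilog state machine, and the exact policy it implements is not spelled out here, so the alternating-priority scheme below is only one plausible, starvation-free example with hypothetical FIFO accessors.

    #include <stdbool.h>

    /* Hypothetical accessors for the two command FIFOs feeding the controller. */
    bool remote_fifo_empty(void);   /* incoming requests from the RapidIO fabric */
    bool local_fifo_empty(void);    /* locally-requested transfers (DMA engine)  */
    void service_remote_command(void);
    void service_local_command(void);

    /* One arbitration step: alternate priority so neither queue can starve. */
    void memctrl_arbitrate(void)
    {
        static bool local_turn = false;

        if (!remote_fifo_empty() && (!local_turn || local_fifo_empty())) {
            service_remote_command();
            local_turn = true;            /* give the local queue the next slot  */
        } else if (!local_fifo_empty()) {
            service_local_command();
            local_turn = false;           /* give the remote queue the next slot */
        }
    }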










Network Interface Controller

The network interface controller is a complex component, due to the multiple independent

interfaces to the actual RapidIO core. In order to provide true full-duplex functionality, the user

interface to the network controller (at the logical layer of RapidIO) must provide four parallel

control and data paths, which translates to a total of four independent state machines. This multi-

interface requirement is common for all full-duplex communication protocols, and the network

interface controller would have a similar architecture even if Ethernet or some other interconnect

standard was being used. The four interfaces to the RapidIO logical layer core can be

conceptually grouped into two pairs of interfaces, and are labeled as follows: (1a) initiator request port, (1b) initiator response port, (2a) target request port, and (2b) target response port.

The two initiator ports handle outgoing requests and incoming responses to locally-initiated

remote transfers. The two target ports handle incoming requests and outgoing responses to

remotely-initiated transfers. A simplified block diagram of the internal architecture of

the network interface controller is illustrated in Figure 3-5.

Figure 3-5. Network interface controller block diagram.









The reason for the multi-port requirement in full-duplex communication protocols is based

on the idea that even if the physical layer supports full-duplex communication, if the upper-layer

logic cannot do the same then a bottleneck would exist at the input to the interconnect fabric, and

the full-duplex capability would never be used. For example, imagine if instead of four ports,

there were only two, an initiator port and a target port. The initiator port would be responsible

for handling outgoing requests and incoming responses; requiring the same logic that performs transmissions to also receive responses would prevent truly simultaneous sending and receiving. Alternately, consider the case where the two ports are simple

transmit and receive ports. In this case, while requests could be transmitted concurrently while

receiving responses, a problem arises when remotely-initiated requests come in off the fabric in

the middle of a locally-initiated transmission. It should now be clear why there is a need for four

independent upper-layer interfaces to any full-duplex communication protocol in order to

maximize network efficiency.

The two pairs of ports, the initiator pair and the target pair, are completely isolated from

one-another. Within each pair, however, the two interface state machines must exchange some

information. Small FIFOs take information from the initiator request and target request state

machines, in order to inform the initiator response and target response state machines of the need

to receive/send a response. The information contained in these messages is minimal, including

request type, size, source ID (which is the destination ID for the response), etc. The initiator

request state machine receives messages through the command FIFO from the main DMA

engine when the PowerPC requests a remote read or write, and informs the initiator response

state machine through the internal message FIFO described above if there will be any responses

to expect. The target request state machine, by contrast, sits idle and waits for new requests to









come in from the RapidIO fabric. When a new request comes in, the target state machine

informs the target response state machine to prepare to send responses, and also places a

command in its command FIFO to the external memory controller that includes the size, address,

and type of memory access to perform. To operate synchronously with the RapidIO link, the Xilinx RapidIO core requires all logic in the network interface controller to operate at 62.5 MHz, which happens to be half the frequency of the external memory controller.

On-Chip Memory Controller

The on-chip memory (OCM) controller is fairly simple, and provides the mechanisms

necessary to enable basic read or write transactions between the FIFOs of the main data path and

the co-processor SRAM memories. Internal to the OCM controller, the data width is reduced to

32 bits from 64 bits on the SDRAM-side of the data path FIFOs, while the clock frequency

remains the same at 125 MHz. The data interface to each co-processor is a simple SRAM port;

as such, multiple co-processor interfaces can be logically combined to appear as one SRAM unit,

with different address ranges corresponding to the memories physically located in different co-

processors.

In the baseline node design, there is no method of transferring data directly from one co-

processor to another, nor is there any way to transfer data directly from the co-processor

memories to the RapidIO interface controller. There are arguable performance benefits to be

gained by enabling such data paths, and these architecture enhancements will be considered later

as part of experimentation. However, the addition of such capabilities does not come without

cost, the tradeoff in this case being an increase in development effort through a rise in the

complexity of the control logic, as well as potentially increasing the challenge of meeting timing

constraints by crowding the overall design.









PowerPC and Software API

Overall control of each node's operation is provided by one of the two embedded

PowerPC405s in each FPGA. This PowerPC is responsible for initiating data movement

throughout the system, as well as controlling co-processor activity. However, the PowerPC is

not used for any data computation or manipulation, other than briefly in some corner-turn

implementations. The PowerPC405 is relatively low performance, and moreover this particular

node design does not give the PowerPC a standard connection to the external memory. The

PowerPC processor itself is directly integrated with several peripherals implemented in the

reconfigurable fabric of the FPGA. These peripherals include a UART controller, several

memory blocks, and general-purpose I/O (GPIO) modules for signal-level integration with user

logic. One of the memory blocks is connected to the OCM controller to provide a data path for

the PowerPC, and the GPIO modules connect to architecture components such as the DMA

engine or the co-processor control busses. The external memory controller could be dedicated to the PowerPC through its standard bus interface; however, that would prevent other components in the system from connecting directly to the external memory controller (recall that the multi-port memory controller is a key feature of this node architecture). Instead, the PowerPC is

used exclusively for control, and leaves data processing tasks to the co-processor engines.

To support application development and provide a software interface to the mechanisms

provided by the node architecture, an API was created containing driver functions that handle all

signal-level manipulation necessary to initiate data movement, control co-processors, and

perform other miscellaneous actions. The functions of this API provide a simple and intuitive

wrapper to the user for use in the application code. Table 3-1 indicates and describes the most

important functions in this API, where classes D, M, and P mean data movement, miscellaneous,

and processor control, respectively.










Table 3-1. Software API function descriptions.
Function              Class  Description
dma_blocking          D      Most common data transfer function. Provide source address,
                             destination address, and size. Function will not return until the
                             entire transfer has completed.
dma_nonblocking       D      Non-blocking version of the same function described above. This
                             function will start the data transfer, and will immediately return
                             while data is still being moved by the system.
dma_ack_blocking      D      Explicit function call necessary to acknowledge DMA transfers
                             initiated with non-blocking calls. This function, once called,
                             will not return until the DMA transfer completes.
dma_ack_nonblocking   D      Non-blocking version of the acknowledgement function. This
                             function will check if the DMA transfer is complete yet, and
                             will return immediately either way with a value indicating the
                             status of the transfer.
barrier               M      Basic synchronization function, necessary to implement parallel
                             applications. This function mimics the MPI function
                             MPI_Barrier(), and ensures that all nodes reach the same point
                             in the application before continuing.
coproc_init           P      Co-processor initialization function. Provide co-processor ID and
                             any configuration data required. The function will return once
                             the co-processor has been initialized and is ready for processing.
coproc_blocking       P      Most common processor activity function. Provide co-processor
                             ID and any runtime data required. This function is blocking,
                             and will not return until the co-processor indicates that it has
                             completed processing.
coproc_nonblocking    P      Non-blocking version of the processing function described above.
                             The function will initiate processing on the indicated co-processor,
                             and will immediately return while the co-processor continues to
                             process.
coproc_wait           P      Explicit function necessary to acknowledge co-processor execution
                             runs initiated with the non-blocking function call.
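A short C fragment showing how these functions fit together may be helpful. The argument lists and the addresses below are assumptions made for illustration (the exact signatures are not listed in this document); the flow, however, matches the usage implied by Table 3-1: configure a co-processor once, move data in with a DMA transfer, process, and move the results back out.

    /* Assumed prototypes for the API functions of Table 3-1. */
    void dma_blocking(unsigned int src, unsigned int dst, unsigned int bytes);
    void coproc_init(int coproc_id, unsigned int config_data);
    void coproc_blocking(int coproc_id, unsigned int runtime_data);

    #define COPROC_PC      0            /* hypothetical pulse compression engine ID       */
    #define CHUNK_BYTES    4096         /* one co-processor input buffer's worth          */
    #define SDRAM_INPUT    0x00100000u  /* hypothetical source address in external SDRAM  */
    #define SDRAM_OUTPUT   0x00200000u  /* hypothetical result address in external SDRAM  */
    #define PC_INPUT_A     0x00001000u  /* hypothetical mapping of input buffer A         */
    #define PC_OUTPUT_A    0x00003000u  /* hypothetical mapping of output buffer A        */

    void process_one_buffer(void)
    {
        coproc_init(COPROC_PC, 0);                            /* one-time configuration  */
        dma_blocking(SDRAM_INPUT, PC_INPUT_A, CHUNK_BYTES);   /* load input buffer A     */
        coproc_blocking(COPROC_PC, 0);                        /* process, wait for done  */
        dma_blocking(PC_OUTPUT_A, SDRAM_OUTPUT, CHUNK_BYTES); /* unload output buffer A  */
    }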



Co-processor Engine Architectures

Each of the co-processor engines was designed to appear identical from the outside,

presenting a standardized data and control interface for an arbitrary co-processor. Since all co-

processors share a standard interface and control protocol, the common top-level design will be

described first before briefly defining the architecture and internal operation of each specific co-

processor engine in the following sections. The location of these co-processors in the overall

node architecture is shown near the bottom of Figure 3-3, labeled HW module 1 through HW









module N. The co-processor interface consists of a single data path, and a single control path.

Data to be processed is written into the co-processor engine through the data port, and processed

data is read back out of this port when processing completes. The control port provides the

ability to configure the engine, start processing, and monitor engine status. Figure 3-6 illustrates

this generic co-processor wrapper at the signal-level.

Figure 3-6. Standardized co-processor engine wrapper diagram.

Recall that all co-processor engines contain a small amount of SRAM memory (maximum

of 32 KB), and that this memory is the main method of passing data to and from the engine. The

single external data port can address this entire internal SRAM address space, through a standard

SRAM interface containing a data bus input and output, as well as an address input, a chip select,

and a read/write select signal (single-cycle access time). Regardless of what the processing

engine is designed to do, data is written to and read from its processing memory like a simple

SRAM, with certain address ranges defined for various purposes internal to the engine. All

internal SRAMs, interchangeably referred to as BlockRAMs or BRAMs, of Xilinx FPGAs are

true dual-port memories. This feature is critical to the design of this style of I/O for co-processor

engines. With one port of each BRAM dedicated to the external data interface, the other port is

used by the actual processing logic.










The control port of each co-processor wrapper is divided into two busses, one input and

one output. Each bit or bit field of these control busses has a specific definition, as shown below

in Table 3-2. When using multiple co-processing engines in a single FPGA, the control ports are

combined as follows:

* The control bus outputs from each co-processor are kept separate, and monitored
individually by the main processor (PowerPC)

* All control bus inputs to the co-processors are tied together

Table 3-2. Co-processor wrapper signal definitions.
Signal Name   Dir.    Width  Description
clock         input   1      Clock signal for logic internal to the actual processor
reset_n       input   1      Global reset signal for the entire design
coproc_id     input   2      Hard-coded co-processor ID; each command received on the control
                             bus is compared to this value to see if it is the intended target
                             of the command
data_in       input   32     Data bus into the co-processor for raw data
data_out      output  32     Data bus out of the co-processor for processed data
address       input   15     Address input to the co-processor
chip_select   input   1      Enable command to memory; read or write access determined by
                             wr_enable
wr_enable     input   1      Write-enable signal; when chip_select is asserted and wr_enable =
                             '1', a write is issued to the address present on the address bus;
                             when chip_select is asserted and wr_enable = '0', a read is
                             performed from the address specified on the address bus
control_in    input   16     Control bus input into the co-processor from the PowerPC. The
                             bits of the bus are defined as follows:
                             0-7: configuration data
                             8-9: coproc ID
                             10: blocking (1) / non-blocking (0)
                             11: synchronous reset
                             12: reserved (currently un-used)
                             13: command enable
                             14-15: command ("01" = config, "10" = start, "00"/"11" = reserved)
control_out   output  4      Control bus from the co-processor to the PowerPC. The bits of the
                             bus are defined as follows:
                             0: busy
                             1: done
                             2: uninitialized
                             3: reserved (un-used)
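To show how the control_in bit fields combine into a command word, the following C sketch packs the fields as listed in Table 3-2. Bit 0 is treated as the least-significant bit here; that ordering, like the function itself, is an assumption for illustration rather than a definition taken from the testbed source.

    #include <stdint.h>

    #define CMD_CONFIG 0x1   /* "01" */
    #define CMD_START  0x2   /* "10" */

    static inline uint16_t pack_control_in(uint8_t config_data, uint8_t coproc_id,
                                           int blocking, int sync_reset,
                                           int cmd_enable, uint8_t command)
    {
        uint16_t w = 0;
        w |= (uint16_t)config_data;                /* bits 0-7: configuration data   */
        w |= (uint16_t)(coproc_id  & 0x3) << 8;    /* bits 8-9: co-processor ID      */
        w |= (uint16_t)(blocking   & 0x1) << 10;   /* bit 10: blocking/non-blocking  */
        w |= (uint16_t)(sync_reset & 0x1) << 11;   /* bit 11: synchronous reset      */
                                                   /* bit 12: reserved               */
        w |= (uint16_t)(cmd_enable & 0x1) << 13;   /* bit 13: command enable         */
        w |= (uint16_t)(command    & 0x3) << 14;   /* bits 14-15: command code       */
        return w;
    }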









Some of the co-processor engines have run-time configurable parameters in the form of

internal registers in the co-processor engine itself. One example of such a configurable

parameter is for the CFAR processor, where it needs to be told what the range dimension is of

the current data-cubes so it knows how much to process before clearing its shift register for the

next processing run. As part of the wrapper standard, all co-processors must be initializedd" or

configured at least once before being used for processing. Whether or not the co-processor

currently supports configurable parameters, according to the standard upon coming out of reset

all co-processors will have their uninitialized bit set on the control bus, indicating that they need

to be configured. The PowerPC will initialize each co-processor one at a time as part of its

startup sequence by submitting config commands along with the appropriate coproc ID and

data. Once initialized, the co-processor will clear its uninitialized bit and can be used for

processing at any time. The co-processor may also be re-configured with different parameters by

executing another config command to set the config registers with the new data. This feature of

all co-processors enables virtual reconfiguration, as proposed by a recent Master's Thesis [24].

To perform data processing with the co-processor, the PowerPC will first perform a DMA

transfer of the data to be processed from external memory into the co-processor's address space.

The exact memory map of each co-processor varies, and will be defined in the appropriate co-

processor section later. Once the data has been transferred into the engine, the PowerPC submits

a start command to the appropriate engine, and monitors the busy and done bits of that co-

processor's control bus output. Upon observing the done bit being asserted, another DMA

transfer can be performed to move the processed data out of the engine and back to external

storage. It should be noted here that all of the co-processor engines implement double-buffering

in order to maximize processing efficiency and minimize idle time.
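The double-buffered flow can be sketched with the non-blocking API calls as shown below. The signatures and the buffer-selection argument are assumptions for the example; the point is simply that a DMA transfer into one buffer overlaps with processing of the other.

    /* Assumed prototypes for the relevant API functions from Table 3-1. */
    void dma_blocking(unsigned int src, unsigned int dst, unsigned int bytes);
    void coproc_nonblocking(int coproc_id, unsigned int runtime_data);
    void coproc_wait(int coproc_id);

    /* Process a data-cube in chunks, ping-ponging between two input buffers. */
    void process_cube(unsigned int cube_base, unsigned int n_chunks,
                      unsigned int chunk_bytes, const unsigned int buf_addr[2],
                      int coproc_id)
    {
        dma_blocking(cube_base, buf_addr[0], chunk_bytes);          /* prime buffer 0   */

        for (unsigned int i = 0; i < n_chunks; i++) {
            coproc_nonblocking(coproc_id, i & 1);                   /* start on buffer  */

            if (i + 1 < n_chunks)                                   /* overlap next DMA */
                dma_blocking(cube_base + (i + 1) * chunk_bytes,
                             buf_addr[(i + 1) & 1], chunk_bytes);

            coproc_wait(coproc_id);                                 /* done bit seen    */
            /* Reading results back out would be overlapped in the same manner. */
        }
    }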











Pulse Compression Engine

The pulse compression engine design adopted for this research follows that of a


commercial IP core [9] by Pentek, Inc. The engine is centered around a pipelined, selectable


FFT/IFFT core that provides streaming computation, an important feature for this high-


performance design (Figure 3-7). Recall that pulse compression operates along the range


dimension, and is composed of an FFT, followed by a point-by-point complex vector multiply


with a pre-computed vector (also stored in the co-processor's SRAM), and finally an IFFT of the


vector product. To implement this algorithm, the aforementioned FFT core is re-used for the


final IFFT as well, in order to minimize the amount of resources required to implement the pulse


compression co-processor. The numerical precision used for the pulse compression co-processor


is 16-bit fixed point (with 8.8 decimal format) as suggested in Aggarwal's work [24], resulting in


32 bits per complex data element.
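The point-by-point complex vector multiply at the center of this algorithm can be expressed compactly in the 8.8 fixed-point format. The C sketch below is a software reference for that step only (the hardware performs it with a dedicated complex multiplier); overflow and saturation handling are omitted.

    #include <stdint.h>

    /* One complex element: 16-bit real and imaginary parts in 8.8 fixed point. */
    typedef struct { int16_t re, im; } cplx88_t;

    static inline int16_t fix88_mul(int16_t a, int16_t b)
    {
        return (int16_t)(((int32_t)a * (int32_t)b) >> 8);   /* rescale product to 8.8 */
    }

    static inline cplx88_t cplx88_mul(cplx88_t a, cplx88_t b)
    {
        cplx88_t p;
        p.re = fix88_mul(a.re, b.re) - fix88_mul(a.im, b.im);
        p.im = fix88_mul(a.re, b.im) + fix88_mul(a.im, b.re);
        return p;
    }

    /* Multiply the FFT output by the pre-computed reference vector, in place. */
    void vector_multiply(cplx88_t *fft_out, const cplx88_t *ref, int n)
    {
        for (int i = 0; i < n; i++)
            fft_out[i] = cplx88_mul(fft_out[i], ref[i]);
    }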

Figure 3-7. Pulse compression co-processor architecture block diagram. The engine contains five 4 KB dual-port SRAM buffers (input buffers A/B, output buffers A/B, and a vector buffer), presented to the node as a 20 KB external memory map from 0x0000 to 0x4FFF.









Once given a command to begin processing, the engine will start by reading data out of the

specified input buffer directly into the FFT core, through a multiplexer. As soon as possible, the

FFT core is triggered to begin processing, and the state machine waits until the core begins

reading out the transformed results. At this time, the transformed data exits the FFT core one

element at a time, and is sent through a demultiplexer to a complex multiplier. The other input

of the multiplier is driven by the output of the vector constant SRAM, which is read in sync

with the data output of the FFT core. The complex multiplier output is then routed back to the

input of the FFT core, which by this point has been configured to run as an IFFT core. The final

IFFT is performed, and the core begins outputting the pulse compression results. These results

are routed to the specified output buffer, and the state machine completes the processing run by

indicating to the PowerPC through the control interface outputs that the co-processor is finished.

The architecture of the co-processor's SRAM bank is an important topic that warrants

discussion, and the pulse compression engine will be used as an example. The pulse

compression engine contains a total of five independent buffers or SRAM units (see Figure 3-7):

two input buffers, two output buffers, and a single vector constant buffer. Each buffer is 4 KB in

size, resulting in a total pulse compression co-processor memory size of 20 KB. The two SRAM

features taken advantage of for this architecture are (1) true dual-port interface, and (2) ability to

combine smaller SRAMs into one larger logical SRAM unit. The external data/SRAM port to

the co-processor has both read and write access to the entire 20 KB of memory, by combining

one port of all SRAMs to appear as single logical SRAM unit (according to the memory map

shown in Figure 3-7). However, there is no rule to say that the other port of each SRAM unit

must be combined in the same manner; therefore, the other port of each SRAM unit is kept

independent, to allow concurrent access of all buffers to the internal computational engine.












Doppler Processing Engine


The Doppler processing engine is very similar to the pulse compression co-processor, also


based on an FFT core at its heart. Recall that Doppler processing is composed of a complex


vector multiplication followed by an FFT. Upon receiving a command to begin processing, the


input data will be read out of the specified input buffer concurrently with the pre-computed


constant vector from its vector buffer (as indicated below in Figure 3-8), and fed into a complex


multiplier unit. The product output of this multiplier is used to drive the input of the FFT core,


which computes the transform before unloading the transformed results directly into the output


buffer. Like the pulse compression co-processor, the Doppler processor unit is originally


designed assuming 16-bit fixed-point precision (32-bits per complex element), with an 8.8


decimal format.


Figure 3-8. Doppler processing co-processor architecture block diagram. The engine contains a 2 KB vector buffer and four 4 KB input/output buffers (all dual-port SRAM), mapped from 0x0000 to 0x47FF.











Beamforming Engine

The beamforming co-processor is based on a matrix multiplication core described in [44].


This engine computes all output beams in parallel, by using as many multiply-accumulate


(MAC) units as there are output beams. The weight vector provided by the AWC step is stored


in a specially-designed SRAM module, that provides as many independent read-only ports to the


internal engine as there are MAC units. By doing so, the input data-cube can be read out of the


input buffer one element at a time in a streaming fashion (recall data is organized along the channel dimension for beamforming), and after an entire channel dimension has been read out, an entire output beam dimension will have been computed. While the next channel dimension is being read out of the input buffer, the previously-computed beam is registered and stored in the output buffer one element at a time. The complex-format beamformed output is reduced to a partial magnitude (no square root) before being written to the output buffer as 32-bit real values.

Figure 3-9. Beamforming co-processor architecture block diagram.










Figure 3-9 illustrates the architecture of the beamforming co-processor. Notice that unlike

the other co-processor designs, the input buffers and output buffers are different sizes. This

asymmetry is due to the reduction of one dimension of the data-cube from channels to beams, the

assumption being that the number of beams formed will be less than the number of channels in

the original cube. The actual computations performed by this architecture can be better

visualized by considering Figure 3.10 below. The weight vector for each beam to be formed is

stored in a small, independent buffer, so that the weights for all beams can be read in parallel.

The elements of the input data-cube (Bc,p,r) are read sequentially from the input buffer, and

broadcast to all of the MAC units. The other input to each MAC unit is driven by its respective

weight vector buffer, and so every C clock cycles (where C = channel dimension length) an

entire beam dimension is completely processed. These values will be registered and held for

storage, while the next set of beam elements are computed by re-reading the weight vectors

while reading the next C elements of the input data-cube.
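A floating-point C reference of this computation is given below for clarity. It produces the same result the MAC array computes (a weighted sum over channels for every beam, followed by a partial magnitude with no square root), but sequentially and in floating point rather than one broadcast element per clock in fixed point.

    typedef struct { float re, im; } cplx_t;

    /* Reference beamformer: for each (pulse, range) sample vector across channels,
     * form every output beam as a complex weighted sum, then take the partial
     * magnitude (no square root) as the co-processor does.                        */
    void beamform(const cplx_t *in,   /* [samples][channels] input elements  */
                  const cplx_t *w,    /* [beams][channels] weight vectors    */
                  float *out,         /* [samples][beams] partial magnitudes */
                  int samples, int channels, int beams)
    {
        for (int s = 0; s < samples; s++) {
            for (int b = 0; b < beams; b++) {
                cplx_t acc = {0.0f, 0.0f};
                for (int c = 0; c < channels; c++) {
                    cplx_t x  = in[s * channels + c];
                    cplx_t wv = w[b * channels + c];
                    acc.re += x.re * wv.re - x.im * wv.im;   /* complex MAC */
                    acc.im += x.re * wv.im + x.im * wv.re;
                }
                out[s * beams + b] = acc.re * acc.re + acc.im * acc.im;
            }
        }
    }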


Figure 3-10. Illustration of beamforming computations.












CFAR Engine


The CFAR detection co-processor architecture is based on a design presented in [16]. The


major component of the CFAR engine is a shift register, as wide as each data element and as


long as the CFAR sliding window length. Once processing begins, data is read out of the


appropriate input buffer and fed into this shift register. Several locations along the shift register


are tapped and used in the computation of the running local averages as well as for the


thresholding against the target cell. Recall that the ultimate goal of CFAR is to determine a list


of targets detected from each data-cube [15]. This output requirement implies that the entire


data-cube does not need to be outputted from CFAR, and in fact the minimum required


information results in a basically negligible output data size. However, to provide a simple


method of verification of processing results, the output of this CFAR co-processor is the entire


data-cube, with zeros overwriting all elements not deemed to be targets. To provide realistic


processing latency, for all performance measurements no output data is read from the processor.

Figure 3-11. CFAR detection co-processor architecture block diagram. The engine contains 8 KB dual-port SRAM input and output buffers mapped from 0x0000 to 0x7FFF.









As mentioned back in Chapter 2, CFAR is not only parallel at a coarse-grained level with

each Doppler bin being processed independently, but the computations performed for each

consecutive data element also contain significant parallelism through partial results reuse of the

locally-computed averages [15]. For each averaging window (left and right), almost all data

elements to be summed remain the same between two adjacent target cells, with the exception of

one data element on each end of the windows. As the CFAR window moves one element over,

the oldest element needs to be subtracted from the sum, and the newest element added to the

sum. By taking advantage of this property, CFAR detection can be performed in O(n) time, where n represents the length of the range dimension, independent of the length of the sliding window.
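A simplified one-dimensional C sketch of this O(n) update is shown below. Guard cells and the exact threshold rule used by the engine are not reproduced; the sketch only demonstrates the incremental window-sum update along the range dimension.

    #include <stdint.h>

    /* detect[t] = 1 if the cell exceeds scale times the local noise average,
     * where the left and right window sums are maintained incrementally.     */
    void cfar_1d(const float *power, uint8_t *detect, int n, int win, float scale)
    {
        for (int i = 0; i < n; i++)
            detect[i] = 0;                       /* edge cells default to "no target"    */
        if (n < 2 * win + 1)
            return;

        float left = 0.0f, right = 0.0f;
        for (int i = 0; i < win; i++) {          /* windows around the first target cell */
            left  += power[i];
            right += power[win + 1 + i];
        }

        for (int t = win; t + win < n; t++) {
            float noise = (left + right) / (2.0f * win);
            detect[t] = (power[t] > scale * noise) ? 1 : 0;

            if (t + win + 1 < n) {               /* slide both windows one cell right    */
                left  += power[t] - power[t - win];
                right += power[t + win + 1] - power[t + 1];
            }
        }
    }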









CHAPTER 4
ENVIRONMENT AND METHODS

This chapter will briefly describe the experimental environment and methods used in the

course of this research. The first section will define the interfaces to the testbed available for the

user, followed by a detailed description of measurement techniques and experimental procedures

in the second section. Finally, the third section will formally define all assumptions, metrics, and

parameters relevant to the experiments carried out.

Experimental Environment

The RapidIO testbed is connected to a number of cables and devices, providing a control

interface, a method of file transfer, and measurement capabilities. The testbed connects to a

workstation through both an RS-232 serial cable, as well as a USB port for configuration through

JTAG. Additionally, there is a PC-based logic analyzer that connects to up to 80 pins of the

FPGAs, as well as to the user workstation through another USB port for transfer of captured

waveforms. The embedded PowerPCs in each FPGA drive the UART ports on the testbed, and

the user can monitor and control testbed progress through a HyperTerminal window or other

similar UART communication program (only the "master" node's UART port is used, for a

single UART connection to the workstation). This HyperTerminal window interface provides

the ability to use printf()-style printouts to display application progress as well as assist in

debugging to a degree, and also enables file transfer between the user workstation and the

RapidIO testbed. In addition to the testbed itself and the application that runs on it, a suite of

basic utility C programs was written to run on the workstation for functions such as file creation,

file translation to/from the testbed data file format, comparison of testbed results with golden

results, etc. Figure 4-1 illustrates the relationship between the user, the workstation, the testbed,

and the various cables and connections.






















Figure 4-1. Testbed environment and interface.

Measurement Procedure

The main tool for collecting timing and performance measurements is a PC-based logic

analyzer from Link Instruments, Inc. The logic analyzer has a maximum sampling frequency of

500 MHz, providing extremely fine-grained timing measurements down to 2 ns. With up to

80 channels available on the logic analyzer, the desired internal signals can be brought out to unused pins of the FPGA to provide a window into the internal operation of the design. The only

signals that are invisible, or unable to be brought out for capture, are the signals internal to the

Xilinx RapidIO core, which are intentionally kept hidden by Xilinx as proprietary information.

However, the interfaces to the different RapidIO cores are visible, and provide sufficient











information to accurately characterize the operation of the interconnect. Generally, only one

logic analyzer capture can be taken per processing run on the testbed, because the testbed's execution time is extremely short compared to the time required to transfer the first capture from the logic analyzer to the workstation over USB. However, two key advantages of using a PC-based logic analyzer are the

convenience afforded by a large viewing area during use, as well as the ability to save captures

for archiving of results.

The experimental procedure is fairly simple and straightforward; however, it will be explicitly defined here for completeness. Assuming the proper files for programming the FPGAs

are on-hand, the first thing the experimenter must do is prepare the input data files to be sent to

the testbed using the utility programs that run on the user workstation. These files include data

such as pre-computed vector constants, data-cubes for GMTI processing, or other data for

debugging and testing, whatever the current experiment calls for. With all files on-hand, the

testbed is powered on and the FPGAs are configured with their appropriate bitfiles. The user is

greeted with a start-up log and prompt through the HyperTerminal interface, and transfers the

data files from the workstation to the testbed as prompted. Once all files have been loaded, the

experimenter is prompted one final time to press any key to begin the timed software procedure.

Before doing so, the measurement equipment is prepared and set to trigger a capture as needed.

The user may then commence the main timed procedure, which typically completes in under a

second. The logic analyzer will automatically take its capture and transfer the waveform to the

workstation for display, and the testbed will prompt the user to accept the output file transfer

containing the processing results. At this point, the user may reset and repeat, or otherwise

power down the testbed, and perform any necessary post-analysis of the experimental results.









In addition to the logic analyzer which provides performance measurements, the

previously-mentioned suite of utility programs provide the necessary tools for data verification

of output files from the testbed. For reference, a parallel software implementation (using MPI)

of GMTI was acquired from Northwestern University, described in [10]. This software

implementation comes with sample data-cubes, and provides a method of verification of

processing results from the RapidlO testbed. Furthermore, this software implementation was

used for some simple performance measurements in order to provide some basis on which to

compare the raw performance results of the testbed.

Metrics and Parameter Definitions

There are two basic types of operations that make up all activity on the testbed: (1) DMA

transfers and (2) processing "runs." DMA transfers involve a start-up handshake between the

PowerPC and the hardware DMA controller, the data transfer itself, and an acknowledgement

handshake at the completion. A processing "run" refers to the submission of a "process"

command to a co-processor engine, and waiting for its completion. Each processing run does not

involve a start-up handshake, but does involve a completion-acknowledgement handshake. In

order to characterize the performance of the testbed, three primitive metrics are identified that

are based on the basic operations described above. All metrics are defined from the perspective

of the user application, in order to provide the most meaningful statistics in terms of ultimately-

experienced performance, and so the control overhead from handshaking is included. These

different latency metrics are defined as shown here:

t_xfer = DMA transfer latency

t_stage = processing latency of one kernel for an entire data-cube

t_buff = processing latency for a single "process" command









Based on these latency definitions (all of which may be measured experimentally), other

classical metrics may be derived, for example throughput. Data throughput is defined as follows:


throughput_data = N_bytes / t_xfer    (4.1)

where the number of bytes (N_bytes) will be known for a specific transfer. Another pair of

important metrics used in this research is processor efficiency, or percentage of time that a

processor is busy, and memory efficiency, which illustrates the amount of time that the data path

is busy transferring data. Processor efficiency can be an indicator of whether or not the system is

capable of fully-utilizing the processor resources that are available, and can help identify

performance bottlenecks in a given application on a given architecture. Memory efficiency, by

contrast, helps indicate when there is significant idle time in the data path, meaning additional

co-processor engines could be kept fed with data without impacting the performance of the co-

processor engine(s) currently being used. These efficiency metrics are defined as follows:

eff_memory = (a · M) / (t_stage · throughput_ideal)    (4.2)


eff_proc = (N · t_buff) / t_stage    (4.3)

For memory efficiency, M is a general term representing the size of an entire data-cube

(per processor), and a is a scaling factor (usually 1 or 2) that addresses the fact that, for any

given stage, the entire data-cube might be sent and received by the co-processor engine. Some

stages, for example beamforming or CFAR, either reduce or completely consume the data, and

thus a for those cases would be less than two. Basically, the numerator of the memory efficiency

metric represents how much data is transferred during the time it took to process the cube, where









the denominator represents how much data could have been transferred in that same amount of

time. For processor efficiency, N is a general term for the number of "process" commands that

are required to process an entire data-cube. Recall that the t_stage term will include overheads

associated with repeated process and acknowledge commands being submitted to the co-

processor, as well as overhead associated with handling data transfers in the midst of the

processing runs. Any efficiency value less than 100% will indicate a processor that has to sit idle

while control handshakes and data transfers are occurring.
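
To make these definitions concrete, the short Python sketch below simply restates Equations 4.1 through 4.3 as functions. It is offered only as an illustration of how the metrics are derived from measured quantities; the numeric inputs in the example are placeholders, not measurements from the testbed.

```python
# Sketch of the metric definitions in Equations 4.1-4.3.
# Latencies are in seconds, sizes in bytes, throughputs in bytes/second.
# The numbers used in the example below are illustrative placeholders only.

def throughput_data(n_bytes, t_xfer):
    """Equation 4.1: achieved throughput of a single DMA transfer."""
    return n_bytes / t_xfer

def eff_memory(a, m_bytes, t_stage, throughput_ideal):
    """Equation 4.2: fraction of the ideal data-path bandwidth used while
    processing one data-cube (a is the scaling factor, usually 1 or 2)."""
    return (a * m_bytes) / (t_stage * throughput_ideal)

def eff_proc(n_runs, t_buff, t_stage):
    """Equation 4.3: fraction of the stage time the co-processor is kept busy."""
    return (n_runs * t_buff) / t_stage

if __name__ == "__main__":
    print(throughput_data(32 * 1024, 70e-6))       # one 32 KB transfer
    print(eff_memory(2, 4 * 2**20, 60e-3, 0.5e9))  # cube both read and written
    print(eff_proc(128, 45e-6, 60e-3))             # 128 "process" commands per cube
```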

In addition to these metrics, there are a handful of parameters that are assumed for each

experiment in this thesis, unless otherwise noted for a specific experiment. These parameters

(defined in Table 4-1) include the data-cube size, the system clock frequencies, memory sizes, as

well as the standard numerical format for all co-processors.

Table 4-1. Experimental parameter definitions.
Parameter                Value      Description
Ranges                   1024       Range dimension of raw input data-cube
Pulses                   128        Pulse dimension of raw input data-cube
Channels                 16         Channel dimension of raw input data-cube
Beams                    6          Beam dimension of data-cube after beamforming

Processor Frequency      100 MHz    PowerPC/Co-processor clock frequency
Memory Frequency         125 MHz    Main FPGA clock frequency, including SDRAM and data path
RapidIO Frequency        250 MHz    RapidIO link speed

Co-processor SRAM size   32 KB      Maximum SRAM internal to any one co-processor
FIFO size (each)         8 KB       Size of FIFOs to/from SDRAM
Numerical Format         s.7.8      Signed, 16-bit fixed-point with 8 fraction bits









CHAPTER 5
RESULTS

A total of five formal experiments were conducted, complemented by miscellaneous

observations of interest. The five experiments included: (1) basic performance analysis of local

and remote memory transfers, (2) co-processor performance analysis, (3) data verification, (4) a

corner-turn study, and (5) analytical performance modeling and projection. The following

sections present the results of each experiment, and will provide analysis of those results.

Experiment 1

The first experiment provided an initial look at the performance capability of the RapidIO

testbed for moving data. This experiment performed simple memory transfers of varying types

and sizes, and measured the completion latency of each transfer. These latencies were used to

determine the actual throughput achieved for each type of transfer, and also illustrated the

sensitivity of the architecture to transfer size by observing how quickly the throughput dropped

off for smaller operations. Local transfers addressed data movement between the external

SDRAM and the co-processor memories, and remote transfers covered movement of data from

one node's external SDRAM to another node's external SDRAM over the RapidIO fabric.

Since the internal co-processor memories were size-constrained to be a maximum of 32

KB, the range of transaction sizes used for local transfers was 64 B to 32 KB. Only two

transfer types exist for local memory accesses: reads and writes. Remote transfers could be of

any size, so transaction sizes from 64 B to 1 MB were measured experimentally for RapidIO

transfers. The RapidIO Logical I/O logical layer provides four main types of requests: (1)

NREAD, for all remote read requests (includes a response with data), (2) NWRITE_R, for write

operations requesting a response to confirm the completion of the operation, (3) NWRITE for

write operations without a response, and (4) SWRITE, a streaming write request with less packet










header information for slightly higher packet efficiency (also a response-less request type). All

four RapidIO request types are measured experimentally for the full range of transfer sizes, in

order to characterize the RapidIO testbed's remote memory access performance.


[Figure 5-1 charts: achieved throughput (Gbps) versus transfer size for local memory transfers (64 B to 32 KB) and for each RapidIO remote transfer type (64 B to 1 MB).]



Figure 5-1. Baseline throughput performance results.

First, consider the local memory transfer throughputs; even though the SDRAM interface

operates at a theoretical maximum throughput of 8 Gbps (64-bit @ 125 MHz), the on-chip

memories only operate at 4 Gbps (32-bit @ 125 MHz), and thus the transfer throughputs are

capped at 4 Gbps. Recall, the over-provisioned SDRAM throughput is provided intentionally, so

that local memory transfers will not interfere with RapidIO transfers. The overheads associated

with DMA transfer startup and acknowledgement handshakes cause the drop-off in throughput

for very small transfers, but it can be seen that at least half of the available throughput is

achieved for modest transfer sizes of 1 KB. The transfer size at which half of the available

throughput is achieved for a given interconnect or communication path is referred to as the half-

power point, and is often treated as a metric for comparison between different communication

protocols in order to rate the overall efficiency of the connection for realistic transfer sizes.
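
As a small illustration of the half-power point, the Python sketch below locates it from a list of (transfer size, throughput) pairs; the sample curve in the example is invented for demonstration purposes and is not data taken from the testbed.

```python
# Sketch: locate the half-power point of a throughput-vs-transfer-size curve.
# The sample curve below is made up for illustration, not testbed data.

def half_power_point(curve):
    """curve: list of (size_bytes, throughput) pairs, sorted by size.
    Returns the smallest measured size whose throughput reaches half of the
    throughput observed at the largest transfer size."""
    peak = curve[-1][1]   # treat the largest transfer as the available throughput
    for size, bw in curve:
        if bw >= 0.5 * peak:
            return size
    return None

if __name__ == "__main__":
    sample = [(64, 0.3), (256, 0.9), (1024, 2.1), (4096, 3.2), (32768, 3.9)]  # Gbps
    print(half_power_point(sample))   # -> 1024, i.e. a half-power point near 1 KB
```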

For remote memory transfers, as one might expect, the response-less transaction types

(NWRITE, SWRITE) achieve better throughput for smaller transfer sizes. The higher









throughput is a result of lower transfer latency, as the source endpoint does not have to wait for

the entire transfer to be received by the destination. As soon as all of the data has been passed

into the logical layer of the source endpoint, the control logic is considered done and can indicate

to the DMA engine that the transfer is complete. These transaction types take advantage of

RapidIO's guaranteed delivery to ensure that the transfers will complete without forcing the

source endpoint to sit and wait for a completion acknowledgement. The only potential drawback

to these transaction types is that the source endpoint cannot be sure of exactly when the transfer

really completes, but most of the time that does not matter. The "hump" seen around 16 KB in

NWRITE/SWRITE throughputs is due to the various FIFOs/buffers in the communication path

filling up, and causing the transfer to be capped by the actual RapidIO throughput. Before that

point, the transfer is occurring at the speed of SDRAM, however once the buffers fill up the

SDRAM must throttle itself in order to not overrun the buffers. Transfer types that require a

response do not experience this same "hump" in throughput, since for all transfer sizes they must

wait for the transfer to traverse the entire interconnect before receiving the response. As transfer

sizes grow, all transfer types settle to the maximum sustainable throughput.

One interesting observation is that for very large transfer sizes (> 256 KB), the throughput

achieved by NREAD transactions overtakes the throughput of other transfer types, including the

lower-overhead SWRITE type. The reason lies in implementation-specific overheads, most

likely differences in the state transitions (i.e. numbers of states) involved in the state machines

that control the RapidlO link, where even a single additional state can result in an observable

decrease in actual throughput. The conclusions to be taken from these results are that for smaller

transfers, unless an acknowledgement is required for some reason, the NWRITE or SWRITE

transaction types are desirable to minimize the effective latency of the transfer. However, when









moving a large block of data, it would be better to have the receiving node initiate the transfer

with an NREAD transaction as opposed to having the sending node initiate the transfer with a

write, once again in order to maximize the performance achieved for that transfer since NREADs

achieve the highest throughput for large transfer sizes. It will be shown in the coming

experiments that this particular case-study application does not move large blocks of data during

the course of data-cube processing, and as such SWRITE transaction types are used for all

remote transfers unless otherwise noted.
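
The resulting selection policy can be summarized in a few lines; the 256 KB crossover and the request-type names follow the observations above, but the function itself is only an illustrative sketch and not code from the testbed controller.

```python
# Illustrative sketch of the transaction-type policy suggested by Experiment 1.
# The threshold and choices follow the discussion above; not testbed code.

LARGE_TRANSFER = 256 * 1024   # beyond this size NREAD was observed to win

def choose_rapidio_request(n_bytes, need_ack):
    """Pick a RapidIO Logical I/O request type for a remote transfer."""
    if n_bytes > LARGE_TRANSFER:
        # Let the receiving node pull the data for best large-transfer throughput.
        return "NREAD (receiver-initiated)"
    if need_ack:
        # The response confirms completion, at the cost of added latency.
        return "NWRITE_R"
    # Response-less streaming write minimizes latency for small/medium transfers.
    return "SWRITE"

if __name__ == "__main__":
    print(choose_rapidio_request(4 * 1024, need_ack=False))   # SWRITE
    print(choose_rapidio_request(1 * 2**20, need_ack=False))  # NREAD
```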

Experiment 2

This experiment investigated the individual performance of each co-processor engine. All

processing for this experiment was done on a single node, in order to focus on the internal

operation of each node. Two latencies were measured for each co-processor: (1) t_buff, the amount

of time necessary to completely process a single buffer, and (2) t_stage, the amount of time

necessary to completely process an entire data-cube. The buffer latency t_buff does not include any

data movement to/from external memory, and only considers the control overhead associated

with starting and acknowledging the co-processor operation. Thus, this metric provides the raw

performance afforded by a given co-processor design, without any dependence on the memory

hierarchy performance. The data-cube processing latency t_stage includes all data movement as

well as processing overheads, and is used in conjunction with the buffer latency results to derive

the processor efficiency for each co-processor. Figure 5-2 below summarizes the basic

processing latencies of each of the co-processor engines; for the t_stage results, both double-

buffered (DB) and non-double-buffered (NDB) results are shown. For all remaining

experiments, double-buffered processing is assumed; the non-double-buffered results are shown

simply to provide a quantitative look at exactly what the benefit is of enabling double-buffering.





























Figure 5-2. Single-node co-processor processing performance.

As shown above in Figure 5-2, all of the co-processors complete the processing of a single

buffer of data in the microsecond range, with pulse compression taking the longest. Both

beamforming and CFAR have the same input buffer size and operate O(n) as implemented in this

thesis, so t_buff for both of those co-processors is the same, as expected. The increase of execution

time for pulse compression and Doppler processing relative to the other stages remains the same

for both t_buff and t_stage, which suggests that the performance of those two co-processors is bound

by computation time. However, consider the results of t_buff and t_stage for beamforming and CFAR

detection; while t_buff is the same for both co-processors, for some reason t_stage for beamforming is

significantly longer than that of CFAR. The reason for this disparity in t_stage is rooted in the data

transfer requirements associated with processing an entire data-cube. Recall that for CFAR

detection, no output is written back to memory since all data is consumed in this stage.

However, if both processors are computation-bound, that would not make a difference and t_stage

should also be the same between those two co-processors. To more clearly identify where the

bottleneck is, consider Figure 5-3.


[Figure 5-2 charts: kernel execution time for a single buffer and data-cube processing time for each engine (PC, DP, BF, CFAR).]





























Figure 5-3. Processing and memory efficiency for the different stages of GMTI.

The chart to the left of Figure 5-3 shows processing efficiency, again with both double-

buffered and non-double-buffered execution. The NDB results are simply shown to

quantitatively illustrate the benefit of double-buffering from a utilization perspective. As

predicted above, both pulse compression and Doppler processing are computation-bound,

meaning that the co-processor engines can be kept busy nearly 100% of the time during

processing of an entire data-cube as data transfer latencies can be completely hidden behind

buffer processing latencies. Without double-buffering, however, no co-processor engine can be

kept busy all of the time. The chart to the right in Figure 5-3 shows both processing efficiency

(double-buffered only) as well as memory efficiency, which can tell us even more about the

operation of the node during data-cube processing. Again, as predicted above, it can be seen that

while the performance of the CFAR engine is bound by computation, the beamforming engine is

not, and even when double-buffered it must sit idle about 35% of the time waiting on data

transfers. This inefficiency partially explains the disparity in t_stage between beamforming and

CFAR, shown in Figure 5-2, where beamforming takes longer to completely process an entire

data-cube than CFAR. Furthermore, recall the data reduction that occurs as a result of

beamforming; to process an entire data-cube, beamforming must process more data than CFAR.


[Figure 5-3 charts: processing efficiency and resource (memory) efficiency for each task (PC, DP, BF, CFAR).]









Even though the input buffers of each co-processor engine are the same size, beamforming must

go through more buffers of data than CFAR. The combination of more data and lower efficiency

explains why beamforming takes longer than CFAR when looking at full data-cube processing

time.

Another important observation from the efficiency results is the memory efficiency of the

different co-processor engines. Low memory efficiency implies that the data path is sitting idle

during the course of processing, since processing a single buffer takes longer than it takes to read

and write one buffer of data. Specifically, notice that pulse compression has a very low memory

efficiency of about 25%. What this low data path utilization tells us is that multiple pulse

compression co-processors could be instantiated in each node of the testbed, and all of them

could be kept fed with data. Exactly how many co-processor engines could be instantiated can

be determined by multiplying the single-engine memory efficiency until it reaches about 75% (a

realistic upper-limit considering control overheads, based on the I/O-bound beamforming engine

results). For future experiments, two pulse-compression engines per node can be safely

assumed. By contrast, all other co-processor engines already use at least ~50% of the memory

bandwidth, and as such only one co-processor engine could be supported per node. This

restriction reveals an important conclusion that it will not be worthwhile to instantiate all four

co-processor engines in each node. Instead, each node of the system should be dedicated to

performing a single stage of GMTI, which results in a pipelined parallel decomposition. If

resource utilization is not a concern, then all four co-processor engines could be instantiated in

each node of the testbed, with only one being used at a time. Doing so would enable data-

parallel decomposition of the algorithm across the system, but again at the cost of having most of

the co-processor engines sit idle the majority of the time.
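
The engine-count reasoning above can be written as a small helper; the sketch below assumes the approximate figures quoted in this experiment (about 25% memory efficiency for pulse compression, roughly 50% or more for the other engines, and a practical ceiling near 75%) and is illustrative only, not part of the testbed software.

```python
# Sketch of the engine-count estimate: add identical engines of one type until
# their combined data-path utilization would reach the practical ceiling.

def max_engines(single_engine_mem_eff, ceiling=0.75):
    """Largest number of identical engines whose summed memory efficiency
    stays strictly below the ceiling."""
    n = 1
    while (n + 1) * single_engine_mem_eff < ceiling:
        n += 1
    return n

if __name__ == "__main__":
    print(max_engines(0.25))  # pulse compression, ~25% memory efficiency -> 2 engines
    print(max_engines(0.50))  # other engines, ~50% or more -> 1 engine
```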









Finally, to establish a frame of reference for how fast these co-processor designs are

compared to other more conventional technology, a performance comparison is offered between

the RapidIO testbed and the Northwestern University software implementation executed

sequentially on a Linux workstation. Figure 5-4 below shows the result of this performance

comparison. To defend the validity of this comparison, even though the hardware being

compared is not apples-to-apples, the assumption for both cases is that the entire data-cube to be

processed resides completely in main memory (external SDRAM) both before and after

processing. Since the capability of a given node to move data between external memory and the

internal processing element(s) is a critical facet of any realistic node architecture, as long as both

systems start and end with data in analogous locations, the comparison is in fact valid and covers

more than just pure processing performance. The Linux workstation used in this comparison is a

2.4 GHz Xeon processor, with DDR SDRAM operating at a theoretical maximum of 17 Gbps.

The next experiment will provide a comparison of the output data between these two

implementations, to further defend the validity of the HW-SW comparison by showing that the

same (or approximately the same) results are achieved.


[Figure 5-4 chart: execution time of each GMTI task (PC, DP, BF, CFAR) on the RapidIO testbed versus the software implementation on the Linux workstation.]


Figure 5-4. Performance comparison between RapidIO testbed and Linux workstation.









The results shown in Figure 5-4 are significant, since the RapidIO testbed co-processors

are running at 100 MHz compared to the workstation's 2.4 GHz. The maximum clock frequency

in the overall FPGA node design is only 125 MHz, aside from the physical layer of the RapidIO

endpoint which runs at 250 MHz (however, recall that during any given stage of processing, no

inter-processor communication is occurring). Despite the relatively low frequency design,

impressive sustained performance is achieved and demonstrated. Processors operating in space

are typically unable to achieve frequencies in the multi-GHz range, and so it is important to show

that low-frequency hardware processors are still able to match or exceed the performance of

conventional processors with highly efficient architectures.

Experiment 3

To establish confidence in the architectures presented in this thesis research, the processing

results of the co-processor engines were validated against reference data, some of which was

obtained from researchers at Northwestern University [10]. Each of the co-processing engines

was provided with a known set of data, processed the entire set, and unloaded the processed

results in the form of a file returned to the user workstation. With the exception of pulse

compression and Doppler processing, a software implementation of GMTI was the source of the

reference processing results, and employs a double-precision floating point numerical format for

maximum accuracy. For the pulse compression and Doppler processing cores, it turns out that

the 16-bit fixed-point precision with 8 fractional bits is not nearly enough precision to maintain

accuracy, and to illustrate this deficiency a more simplified data set is provided for processing.
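
To illustrate why this format is so limiting, the following Python sketch quantizes a real value to a signed 16-bit number with 8 fraction bits (range of roughly ±128, resolution of 1/256). It only demonstrates the limited range and resolution; it does not reproduce the signed-magnitude arithmetic used inside the co-processor cores.

```python
# Sketch of quantizing a value to the s.7.8 format from Table 4-1 (signed,
# 16 bits total, 8 fraction bits). Demonstrates range/resolution limits only.

SCALE = 1 << 8               # 8 fraction bits -> resolution of 1/256
MAX_MAG = (1 << 15) - 1      # 15 magnitude bits -> about +/-128 range

def to_s7_8(x):
    """Quantize x to s.7.8, saturating instead of wrapping on overflow."""
    q = int(round(x * SCALE))
    q = max(-MAX_MAG, min(MAX_MAG, q))
    return q / SCALE

if __name__ == "__main__":
    print(to_s7_8(0.001))   # -> 0.0        (underflow: value lost entirely)
    print(to_s7_8(0.004))   # -> 0.00390625 (only one quantization step survives)
    print(to_s7_8(200.0))   # -> 127.99...  (saturation: value clipped)
```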

First, consider the output of the Doppler processing engine. Since pulse compression and

Doppler processing contain nearly identical operations (pulse compression simply contains one

extra FFT), only verification of Doppler processing results are shown. Pulse compression is

guaranteed to have worse error than Doppler processing, so if 16-bit fixed point with 8 fractional










bits is shown to result in intolerable error in the output of Doppler processing, there is no need to

test the pulse compression engine. As mentioned above, the reference data-cubes from the NWU

implementation of GMTI are not used for this particular kernel. The data contained in the

reference data-cubes is very dynamic, and in most cases the output obtained from running those

data sets through the testbed implementation exhibits significant overflow and underflow,

resulting in vast differences between golden results and testbed results. In order to instill

confidence that this kernel implementation is indeed performing the correct operations in spite of

round-off errors, a simplified vector is provided to the testbed implementation, as well as a

MATLAB implementation of the Doppler processing kernel.

[Figure 5-5 charts: Doppler processing output magnitude versus pulse index for the MATLAB reference implementation and for the testbed implementation.]


Figure 5-5. Doppler processing output comparison.

Figure 5-5 shows both "golden" (i.e. reference) results alongside the actual testbed output.

Notice the underflow that occurs in the middle of the testbed output, where the signal is

suppressed to zero. Furthermore, in order to attempt to avoid overflow, an overly-aggressive

scaling was employed in the testbed implementation that resulted in all numbers being scaled to

<< 1. The result of this scaling is that the majority of the available bits in the 16-bit fixed-point

number are not used, which is extremely wasteful for such limited precision. More time could

have been spent tuning the scaling of the FFT, however it should be emphasized here that the










goal of this experiment was simply to instill confidence that the correct operations are being

performed, not to tune the accuracy of each kernel implementation. Despite the obvious design

flaws in terms of precision and accuracy, the performance of the kernel implementations is

completely independent of the selected precision and scaling, and by inspection the correctness

of the kernel implementation on the testbed can be seen.

Next, consider the verification of the beamforming co-processor. For this kernel, the

actual data from the NWU testbed was fed to the testbed, and the outputs show significantly less

error than the Doppler processing and pulse compression kernels. Figure 5-6 below presents a

small window of the overall data-cube after beamforming, specifically a 6x7 grid of data points

from one particular output beam. The window was chosen to provide a "typical" picture of the

output, as showing an entire 1024x128 beam would be difficult to visually inspect.


[Figure 5-6 charts: a 6x7 window of one output beam from the testbed beamforming output (left) and the golden reference output (right), shown on a common color scale truncated at -50.]



Figure 5-6. Beamforming output comparison.

Aside from two points of underflow, (5, S3) and (6, S2), the outputs are nearly identical.

Again, a detailed analysis of the exact error obtained (e.g. mean-square error analysis) is not

provided, as fine-grained tuning of kernel accuracy is not the goal of this experiment. In fact, the









two points of underflow shown in the left chart are a result of arithmetic logic errors, since the

real data points should be close to zero, as shown in the golden results in the right chart. The

actual data values received from the testbed were near or equal to -128, but the scale of the charts

was truncated at -50 in order to provide a sufficient color scale to compare the rest of the

window. The immediate jump from 0 to -128 suggests a logic error as opposed to typical round-

off error, considering the 16-bit fixed-point format with 8 fractional bits and signed-magnitude

format. Similar to Doppler processing, correcting this logic error would not affect the

performance of the beamforming core, which is the real focus of this research. The results

shown above in Figure 5-6 are sufficient to instill confidence that the beamforming core is

performing the operations that are expected.

Finally, consider the CFAR detection kernel results. This kernel implementation does not

suffer from any minor arithmetic design flaws, and the results that are shown in Figure 5-7 best

illustrate why 16-bit fixed-point precision is simply not sufficient for GMTI processing. The

radar data from the NWU implementation is used to verify the CFAR engine, with a full

1024x128 beam of the input data shown in the top chart. As can be seen, the range of values

goes from very small near the left side of the beam, all the way to very large to the right side of

the beam. The middle chart shows the targets that are actually in the data, and the chart on the

bottom shows the targets detected by the testbed implementation. By visual inspection, when the

input data is small, there are many false-positives (targets reported that really are not targets);

when the input data is large, there are many false-negatives (targets not detected). However, in

the middle of the beam where the input data values are in the "comfort zone", the testbed

implementation is very accurate.










[Figure 5-7 charts: the CFAR input data for a full beam (top), the actual targets present in the data (middle), and the targets detected by the testbed implementation (bottom).]


Figure 5-7. CFAR detection output comparison.

One final point to make about the results observed in the above verification experiments is

that aside from beamforming (possibly), when tested independently each kernel implementation

exhibits intolerable overflow and underflow in the processing output. In a real implementation,

these kernels are chained together, and thus any round-off or over/underflow errors would

propagate through the entire application, resulting in even worse target detections by the end.

However, all hope is not lost for an FPGA implementation of GMTI, as the results simply

suggest that more precision is required if fixed-point format is desired, or alternately floating-

point could be employed to avoid the over/underflow problem altogether. However, doing so

would result in a doubling of the data set size due to the use of more bytes per element, and thus

the memory throughput would need to be doubled in order to maintain the real-time deadline.

To defend the accuracy of the performance results presented in other experiments, consider

the following upgrades to the architecture that could be implemented in order to provide this









increased throughput. Upgrading the external memory from standard SDRAM to DDR SDRAM

would double the effective throughput of the external memory controller from 8 Gbps to 16

Gbps. Also, increasing the OCM data path width from 32 bits to 64 bits would provide 8 Gbps

to on-chip memory as opposed to the current 4 Gbps. These two enhancements would provide

the necessary data throughput throughout the system to sustain the performance reported in other

experiments when using either 32-bit fixed-point, or single-precision floating point. While

floating point arithmetic cores do typically require several clock cycles per operation, they are

also pipelined, meaning once again the reported throughputs could be sustained. The only cost

would be a few clock cycles to fill the pipelines for each co-processor engine,

which translates to an additional latency on the order of a few hundred nanoseconds or so

(depending on the particular arithmetic core latency). Such a small increase in processing

latency is negligible.

Experiment 4

The fourth experiment addressed the final kernel of GMTI, the corner-turn, which

performs a re-organization of the distributed data-cube among the nodes in the system in

preparation for the following processing stage. The corner-turn operation is critical to

system performance, as well as system scalability, as it is the parallel efficiency of the corner-

turns in GMTI that determine the upper-bound on the number of useful processing nodes for a

given system architecture. Recall that the only inter-processor communication in GMTI is the

corner-turns that occur between processing stages. Since the data-cube of GMTI is non-square

(or non-cubic), the dimension that the data-cube is originally decomposed along and the target

decomposition may have an effect on the performance of the corner-turn, as it affects the

individual transfer sizes required to implement that specific corner-turn. For this experiment,









two different decomposition strategies of the data-cube are implemented, and the corner-turn

operation latency of each decomposition strategy is measured and compared.

Figure 5-8 illustrates these two different decompositions, with the red line indicating the

dimension along which processing is to be performed both before and after the corner-turn.

There are a total of three corner-turns in the implementation of GMTI used for this thesis

research (one after pulse compression, one after Doppler processing, and the final one after

beamforming), and the data-cube is distributed across two nodes in the experimental testbed.

Due to the limited system size, it is impossible to implement the two-dimensional decomposition

strategy discussed in Chapter 2, and the cube is only decomposed along a single dimension for

each stage.





[Figure 5-8 diagrams: the data-cube orientation before and after each corner-turn for decomposition strategies A and B, with the processing dimension marked in red.]

Figure 5-8. Data-cube dimensional orientation, A) primary decomposition strategy and B)
secondary decomposition strategy.

It should be noted here that other decompositions and orders of operations are possible, but

to limit the scope of this experiment, only two were selected to provide a basic

illustration of the sensitivity of performance to decomposition strategy. To avoid confusion, note

that there are a total of three corner-turns for each "type" defined above, corresponding to the

corner-turns that follow each stage of GMTI. Figure 5-9 presents the corner-turn operation

latencies that were measured for both type A and type B decompositions.










[Figure 5-9 charts: corner-turn latency following the PC, DP, and BF stages for the type A decomposition (left) and the type B decomposition (right); note the different y-axis scales.]


Figure 5-9. Data-cube performance results.

Notice the difference in y-axis scale in each of the two charts in Figure 5-9. As can be

seen from the illustrations in Figure 5-8, the final corner-turn of both type A and B has the same

starting and ending decomposition. As such, one would expect the performance of that final

corner-turn to be the same for both cases, and the experimental results confirm that expectation.

The performance of the first corner-turn in both type A and type B is also seen to be similar,

however this result is simply a coincidence. The main difference between the two

decomposition strategies is therefore the corner-turn that follows Doppler processing, where the

operation latency is significantly higher in the type B decomposition than it is for the type A

decomposition. For type A, because of the way the data-cube is decomposed before and after

Doppler processing, there is no inter-processor communication required, and each node can

independently perform a simple local transpose of its portion of the data-cube. However, for

type B not only do the nodes need to exchange data, it just so happens that half of the individual

local memory accesses are only 32 bytes, the smallest transfer size of all corner-turns considered

in this experiment. Also, since such a small amount of data is being read for each local DMA

transfer, more DMA transfers must be performed than in the other cases in order to move the









entire data-cube. Since smaller transfer sizes achieve lower throughput efficiency, the

significantly-increased operation latency of the second corner-turn for type B decompositions

can be explained by this poor dimensional decomposition strategy.

One final observation to discuss about the corner-turn operations as they are implemented

in this thesis is the fact that the low-performance PowerPC is the root cause of the inefficiency

experienced when performing corner-turns. Due to the high-overhead handshaking that occurs

before and after each individual DMA transfer, and the high number of DMA transfers that occur

for each corner-turn (as well as relatively small transfer sizes, typically from 32-512 bytes per

DMA), almost all of the operation latency is caused by control overhead in the node architecture.

With a higher-performance control network, or otherwise hardware-assisted corner-turns, the

operation latency could be dramatically improved. The final experiment performed considers

such an enhancement to the baseline node architecture, as well as other enhancements to improve

the overall GMTI application latency.

Experiment 5

The fifth and final experiment performed did not involve any operation of the testbed. Due

to the highly deterministic nature of the node architectures, analytical expressions could be

derived fairly easily to accurately model the performance of basic operations. Using these

analytical models, the performance of more complex operations was predicted in order to enable

performance projection beyond the capabilities of the current RapidIO testbed. It should be

noted that the analytical modeling in this section is not presented as a generic methodology to

model arbitrary architectures; instead, this analytical modeling approach applies only for

performance prediction of this particular testbed design. This section provides both validation

results as well as performance projections for GMTI running on an architecture that features

enhancements over the baseline design presented in Chapter 3.









The two main operations modeled in this section are: (1) DMA transfers of N bytes, D_N,

and (2) co-processor operation latencies, P_buff. The GMTI case-study application implemented

for this thesis research is composed of these two primitive operations, which can be expressed as

equations composed of known or experimentally-measured quantities. The analytical

expressions for the two primitive operations defined above are shown here:

D_N = N_bytes / BW_ideal + O_dma    (5.1)

P_buff = t_buff + O_comp    (5.2)

For DMA transfer latencies, the number of bytes (N_bytes) is known for a given transfer

size and BW_ideal is the throughput achieved as the transfer size goes to infinity (see Experiment

#1). For co-processor operation latencies, t_buff is an experimentally-measured quantity that

represents the amount of time it takes a given co-processor to completely process one input

buffer of data, once all data has been transferred to the co-processor (as defined in Experiment

#2). The two remaining terms in Equations 5.1 and 5.2 above, O_dma and O_comp, are defined as

"overhead terms" and represent the latency of control logic operations. Specifically, these

overhead terms represent the handshaking that occurs between the PowerPC controller and the

DMA engine or co-processor engine. The overhead terms O_dma and O_comp are measured

experimentally using the logic analyzer to be 1.8 µs and 1.5 µs, respectively. Using only the

terms and equations defined above, the first validation test of the analytical models is done by

predicting local DMA transfers and comparing the predicted transfer latency and throughput with

the experimentally-measured values presented in Experiment #1. Figure 5-10 shows the results

of this first simple validation experiment.
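
The two primitive models translate directly into code. The sketch below uses the measured overhead values quoted above; BW_IDEAL and the example transfer and buffer sizes are placeholders chosen only to demonstrate the calculation.

```python
# Sketch of the primitive-operation models in Equations 5.1 and 5.2.
# O_DMA and O_COMP are the measured overheads quoted above; BW_IDEAL and the
# example inputs are placeholders (about 4 Gbps for the on-chip data path).

O_DMA = 1.8e-6    # s, DMA start-up/acknowledge handshake overhead
O_COMP = 1.5e-6   # s, co-processor command/acknowledge handshake overhead

def dma_latency(n_bytes, bw_ideal):
    """Equation 5.1: D_N = N_bytes / BW_ideal + O_dma."""
    return n_bytes / bw_ideal + O_DMA

def proc_latency(t_buff):
    """Equation 5.2: P_buff = t_buff + O_comp."""
    return t_buff + O_COMP

if __name__ == "__main__":
    BW_IDEAL = 0.5e9                          # bytes/s placeholder (~4 Gbps)
    print(dma_latency(8 * 1024, BW_IDEAL))    # predicted latency of an 8 KB DMA
    print(proc_latency(45e-6))                # placeholder t_buff of 45 us
```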











[Figure 5-10 charts: measured versus predicted local memory transfer latency and throughput for transfer sizes from 64 B to 32 KB.]


Figure 5-10. Local DMA transfer latency prediction and validation.

As can be seen by the charts above, the analytical models for simple DMA transfers are

extremely accurate, due to the highly deterministic nature of the node architecture. However,

more complex situations need to be analyzed before considering the analytical models of the

testbed to be truly validated. The next validation experiment considers the latency of each co-

processor engine to process an entire data-cube (t_stage), which involves both DMA transfers as

well as co-processor activity. Analytical predictions of processing latencies will be compared

against the experimentally-measured results presented in Experiment #2. The analytical

expression that models double-buffered data-cube processing latency is shown below:


t_stage = (D + MAX(D, P_buff)) + (M · MAX(2·D, P_buff)) + (MAX(D, P_buff) + D)    (5.3)


The first term in Equation 5.3 represents the "startup" phase of double-buffering. The

second term is the dominant term and represents the "steady-state" phase of double-buffered

processing, and the third and final term represents the "wrap-up" phase that completes

processing. The constant M in Equation 5.3 is a generic term that represents the number of

buffers that must be filled and processed to complete an entire data-cube, and is known for any

given co-processor engine and data-cube size. P_buff was defined earlier in this section, and the











variable D is defined as the sum of the previously-defined D_N term and a new overhead term,

O_sys. This new overhead term is a generic term that captures "system overhead," or the non-

deterministic amount of time in between consecutive DMA transfers. The O_sys term is not

experimentally-measured, since it is not deterministic like the other two overhead terms (O_dma

and O_comp), and is instead treated as a "knob" or variable to tune the predictions produced by the

analytical expression. It should be noted here that all three overhead terms are a result of the

relatively low-performance PowerPC controller in each node.

Notice that in Equation 5.3, there are MAX(a, b) terms; these terms reflect the behavior of

double-buffered processing, which can be bounded by either computation time or

communication time. Figure 5-11 below shows an illustration of both cases, with labels

indicating the startup, steady-state, and wrap-up phases referred to earlier. Comparing the

diagram in Figure 5-11 to Equation 5.3 should clarify how the expression was arrived at to

model double-buffered processing. Figure 5-12 presents a comparison of predicted data-cube

processing latencies with experimentally measured results for each co-processor engine,

assuming an O_sys value of 3.7 µs. This value of O_sys is observed to be a realistic value for typical

delays in between DMA transfers, based on observations made from logic analyzer captures.
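
A minimal Python restatement of Equation 5.3 is given below. The 3.7 µs value of O_sys follows the discussion above, while the buffer latency and buffer count in the example are placeholders used only to contrast the computation-bound and communication-bound cases.

```python
# Sketch of the double-buffered stage model in Equation 5.3. D is one buffer
# transfer (Equation 5.1 plus the O_sys gap), p_buff is the buffer processing
# latency (Equation 5.2), and m is the number of buffers per data-cube.

O_SYS = 3.7e-6   # s, typical gap between consecutive DMA transfers (tuned value)

def t_stage(d, p_buff, m):
    """Equation 5.3: startup + steady-state + wrap-up phases."""
    startup = d + max(d, p_buff)
    steady = m * max(2 * d, p_buff)
    wrapup = max(d, p_buff) + d
    return startup + steady + wrapup

if __name__ == "__main__":
    d = 70e-6 + O_SYS                         # placeholder buffer transfer latency
    print(t_stage(d, p_buff=45e-6, m=128))    # communication-bound example
    print(t_stage(d, p_buff=200e-6, m=128))   # computation-bound example
```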


[Figure 5-11 diagram: timelines of double-buffered processing for the computation-bound and communication-bound cases, with the startup, steady-state, and wrap-up phases labeled.]










Figure 5-11. Illustration of double-buffered processing.










[Figure 5-12 charts: measured versus calculated data-cube processing time for each task (PC, DP, BF, CFAR), and the corresponding analytical projection error.]


Figure 5-12. Data-cube processing latency prediction and validation.

The prediction results for data-cube processing latencies are impressively accurate, and the

analytical model used to calculate those predictions is significantly more complex than the

DMA transfer latency predictions presented earlier in this experiment. There is one final

operation whose analytical model will be defined and validated, the corner-turn operation, before

proceeding to performance projections of enhanced architectures. Based on the performance

observations from Experiment #4, only "type A" corner-turns will be used. Recall that the

corner-turn operation is composed of many small DMA transfers and local transposes, with the

PowerPC performing the local transposes. Some of the data will stay at the same node after the

corner-turn, while the rest of the data will need to be sent to a remote node's memory. Figure 5-

13 shows a simplified illustration of the order of operations in a corner-turn.

[Figure 5-13 diagram: timeline of a corner-turn operation, divided into a local section and a remote section and composed of repeated local reads, local writes, and remote writes.]

Figure 5-13. Illustration of a corner-turn operation.









The analytical expression used to model corner-turn operations is shown in Equation 5.4.

The first line represents the "local section" as illustrated in Figure 5-13, while the second line

represents the "remote section." Since all memory transfers must involve the external memory

of the initiating node, in order to send data from local on-chip memory to a remote node, the data

must first be transferred from on-chip memory to local SDRAM, and then transferred from local

SDRAM to remote SDRAM.

t_cornerturn ≈ A · (M · (D_x + O_sys) + t_trans + N · (D_y + O_sys)) +    (5.4)
               A · (M · (D_x + O_sys) + t_trans + N · (D_y + t_RW + 2 · O_sys))

The constants A, M, and N in Equation 5.4 represent generic terms for the number of

iterations, and are determined by the size and orientation of the data-cube as well as the size of

the on-chip memory used for local transposition. The D_x and D_y terms represent local DMA

transfer latencies of two different sizes, as defined in Equation 5.1. The new terms t_trans and t_RW

are experimentally-measured quantities. The t_trans term represents the amount of time it takes to

perform the local transpose of a block of data, and t_RW is the latency of a remote memory transfer

of a given size. Figure 5-14 shows the results of the analytical model predictions.

[Figure 5-14 charts: measured versus calculated corner-turn latency for the type A corner-turns following PC, DP, and BF, and the corresponding prediction error.]

Figure 5-14. Corner-turn latency prediction and validation.









The corner-turn latency predictions are noticeably less accurate than the previous

analytical models, due to the DMA-bound progression of the operation. However, the maximum

error is still less than 5% for all cases, which is a sufficiently accurate prediction of performance

to instill confidence in the analytical model proposed in Equation 5.4. Now that the analytical

models of GMTI have been completely defined and validated, architectural enhancements will

be analyzed for potential performance improvement using these analytical models.
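
For reference, the corner-turn model can also be restated as a short function following the form of Equation 5.4 as written above; all of the numeric inputs in the example are placeholders rather than measured values.

```python
# Sketch following the form of Equation 5.4: one pass over the locally-kept
# portion of the cube and one pass over the portion bound for a remote node,
# each built from block reads, a local transpose, and block writes.

def t_corner_turn(a, m, n, d_x, d_y, t_trans, t_rw, o_sys):
    local = a * (m * (d_x + o_sys) + t_trans + n * (d_y + o_sys))
    remote = a * (m * (d_x + o_sys) + t_trans + n * (d_y + t_rw + 2 * o_sys))
    return local + remote

if __name__ == "__main__":
    # All values below are placeholders chosen only to exercise the model.
    print(t_corner_turn(a=8, m=64, n=64,
                        d_x=3e-6, d_y=5e-6,          # small local DMA latencies
                        t_trans=200e-6,              # software transpose of one block
                        t_rw=20e-6, o_sys=3.7e-6))   # remote write + system overhead
```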

The first performance projection looks at reducing the high overhead introduced by the

PowerPC controller in each node. Additionally, two pulse compression engines per node are

assumed, since there is sufficient memory bandwidth to keep both engines at near 100%

utilization based on observations from Experiment #2. In order to model system performance

assuming minimal overhead, the three overhead terms, O_dma, O_comp, and O_sys, are set to 50

nanoseconds each. While it would not be realistic to reduce the overhead terms to zero, a small

latency of 50 nanoseconds is feasible as it corresponds to several clock cycles around 100 MHz.

Highly-efficient control logic implemented in the reconfigurable fabric of the FPGA would

absolutely be able to achieve such latencies. As a complementary case-study, to further reduce

the negative impact of the PowerPC on system performance, corner-turns could be enhanced

with hardware-assisted local transposes. Instead of having the PowerPC perform the transpose

through software, a fifth co-processor engine could be designed that performs a simple data

transfer from one BRAM to another, with address registers that progress through the appropriate

order of addresses to transpose the data from the input BRAM to the output. Using this type of

co-processor engine, one data element would be moved per clock cycle, drastically reducing the

t_trans term in Equation 5.4. Figure 5-15 shows the full predicted GMTI application latency for

three cases: (1) baseline architecture (labeled ORIGINAL), (2) optimized control logic and two










pulse compression engines (labeled OPTIMIZED), and (3) optimized control logic and two

pulse compression engines, along with hardware-assisted corner-turns (labeled HW-CT).
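
The hardware-assisted transpose described above amounts to reading one BRAM sequentially while writing the other in transposed order, one element per clock cycle. The Python sketch below only models the address progression such a fifth co-processor engine could generate; it is not an implementation of the engine, and the block dimensions in the example are arbitrary.

```python
# Sketch of the address sequence for a hardware-assisted block transpose:
# read the input BRAM in row-major order, write the output BRAM in
# column-major order, one element per cycle.

def transpose_addresses(rows, cols):
    """Yield (read_addr, write_addr) pairs for one block transpose."""
    for r in range(rows):
        for c in range(cols):
            read_addr = r * cols + c       # sequential read of the input block
            write_addr = c * rows + r      # transposed position in the output block
            yield read_addr, write_addr

if __name__ == "__main__":
    # Tiny 3x4 block just to show the pattern; a real block would be sized
    # to the on-chip scratchpad used for the local transpose.
    for ra, wa in transpose_addresses(3, 4):
        print(ra, wa)
```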


Figure 5-15. Full GMTI application processing latency predictions.

Aside from halving the pulse compression latency due to having two pulse compression

engines, the other processing stages do not experience significant performance gains by

minimizing the control overhead. Because the Doppler processing and CFAR stages achieve

nearly 100% processor efficiency, all of the control logic latency associated with DMA transfers

is hidden behind the co-processor activity, so little performance gain from optimized control

logic is expected. For the beamforming stage, although it is difficult to discern from the chart,

the beamforming processing latency is reduced from 17.2 ms to 11.6 ms. This result is also

intuitive since the beamforming stage did not achieve 100% processor utilization in the baseline

design, and thus beamforming stands to benefit from improving control logic overhead.


[Figure 5-15 chart: predicted full GMTI processing latency, broken down by processing stage and corner-turn, for the ORIGINAL, OPTIMIZED, and HW-CT configurations.]









However, notice the significant improvement in corner-turn operation latencies. Since

corner-turns do not implement any latency hiding techniques (such as double-buffering), the

majority of the operation latency for the baseline design is caused by the high control overhead.

However, with only optimized control logic, and no hardware-assisted local transposes, the

performance gain is somewhat limited. The relatively low performance of the PowerPC is

further exposed by the hardware-assisted corner-turn predictions, in which the corner-turn

operations experience another significant performance improvement by offloading the local

transposes to dedicated logic in the reconfigurable fabric of the FPGA.

One additional architectural enhancement is modeled to further improve corner-turn

performance for GMTI. The second performance projection is made assuming that there is a

direct path connecting the internal on-chip memory with the RapidIO network interface at each

node, so that data from on-chip memory does not need to go through local SDRAM before being

sent to a remote node. The only kernel of this implementation of GMTI that could benefit from

such an enhancement is the corner-turn, since the only inter-processor communication occurs

during corner-turns. From Figure 5-13, it can be seen that there are two write operations that

occur per loop iteration in the remote section of a corner-turn. By enhancing the node

architecture with the new data path, the remote section would be reduced to a single write

operation that goes straight from the scratchpad memory of the local transpose to the remote

node's external SDRAM. An additional benefit of this enhancement is a reduction of the load on

local external memory. For the baseline corner-turn, both of the write operations in the remote

section require the use of local external memory, whereas with this new transfer directly from

on-chip memory to the network, the local external memory is not used at all. Figure 5-16 shows

the predicted performance improvement for the corner-turn operation with this new data path.


























Figure 5-16. Corner-turn latency predictions with a direct SRAM-to-RapidIO data path.

The results shown in Figure 5-16 are made relative to the baseline node architecture,

without the control logic optimizations or hardware-assisted corner-turns described in the

previous projections. Surprisingly, not much improvement is seen in the enhanced architecture

with the new data path between on-chip memory and the network interface. For the corner-turn

that follows Doppler processing, no improvement is seen; however, this result is expected since

no data is exchanged between nodes for this particular corner-turn. But for the other two corner-

turns, the limited performance improvement is due to a couple of reasons. First, as can be seen

by inspecting Figure 5-13, only a small portion of the overall corner-turn operation involves

remote memory transfers (which are the only transfers that benefit from the new data path). In

the two-node system, only half of the data at each node needs to be exchanged with the other

node. If the node count of the system were to increase, so too would the percentage of data at

each node that would need to be written remotely (for example, in a four-node system, 75% of

the data at each node would need to be written remotely), and as a result the performance

improvement relative to the baseline design would increase. Secondly, any performance gain

due to the new data path would be additive with a performance improvement from hardware-











assisted local transposes. Therefore, if compared to a baseline architecture that includes the

faster hardware-assisted transposes, the improvement due to the new data path would be more

significant relative to the baseline performance.









CHAPTER 6
CONCLUSIONS AND FUTURE WORK

This section will summarize the research presented in this thesis, as well as offer

concluding remarks to review the insight gained from the experiments performed in the course of

this work. Some suggestions are also offered for potential directions that could be taken to

extend the findings of this research, as well as expand the capabilities of the experimental

platform.

Summary and Conclusions

A promising new system architecture based on FPGAs was presented to support

demanding space-based radar applications. This research goes far beyond the simple

implementation and demonstration of IP cores on an FPGA, by investigating node-level and

system-level integration of cores and components. The relatively low-frequency design was able

to demonstrate impressive sustained performance, and the highly-modular architecture will ease

the integration of more high-performance components without disrupting the overall design

philosophy. By building and demonstrating one of the first working academic RapidIO testbeds,

this thesis contributed much-needed academic research on this relatively new interconnect

standard.

Several challenges were encountered during the course of this research, namely restrictions

imposed by the hardware platform such as limited node memory and insufficient numerical

precision for the case-study application. Most real systems will provide some external SRAM to

each node for processing memory, which would permit larger data sets necessary to truly

emulate realistic space-based radar data sizes. Furthermore, the limited system size prevented

scalability experiments and restricted the complexity of the communication patterns that might

stress the underlying system interconnect.









One important observation to come from this research is that, with the proposed node

architecture, a pipelined parallel decomposition would maximize the processor utilization at each

node. The highly efficient co-processor engines typically require the full bandwidth provided to

local external memory at each node in order to maximize utilization, making it impossible to use

multiple co-processor engines concurrently in each node. However, pipelined decompositions

may exhibit a higher processing latency per data cube and, due to the nature and purpose of the

real-time GMTI algorithm, this increased latency may be intolerable. Even if the system is able

to provide the necessary throughput to keep up with incoming data cubes, if the target detections

for each cube are provided too long after the raw data is acquired, the detections may be stale

and thus useless. The system designer should carefully consider how important individual co-

processor utilization is compared to the latency requirements for their particular implementation.

If co-processor utilization is not considered important, then a data-parallel decomposition would

be possible which would reduce the latency of results for each individual data cube.

A novel design approach presented in this research is the use of a multi-ported external

memory controller at each node. While most current processor technologies connect to the

memory controller through a single bus which must be shared by various on-chip components,

the node design proposed in this thesis provides multiple dedicated ports to the memory

controller. This multi-port design philosophy allows each component to have a dedicated control

and data path to the memory controller, increasing intra-node concurrency. By over-

provisioning the bandwidth to the physical external memory devices relative to any given

internal interface to the controller, each port can be serviced at its maximum bandwidth without

interfering with the other ports by using only a simple arbitration scheme.









Another design technique demonstrated in this research that takes advantage of the flexible

nature of FPGAs is the "domain isolation" of the internal components of each node. Each

component within a node, such as the external memory controller, on-chip memory controller,

co-processor engines, and network interface, operate in different clock domains and may even

have different data path widths. The logic isolation allows each part of the node to be optimized

independent of the other components, so that the requirements of one component do not place

unnecessary restrictions on the other components.

It is important to suggest a path to industry adoption of any new technology, and to

understand the concerns that might prevent the community from accepting a new technology or

design philosophy. Today, almost all computer systems make use of a dedicated memory

controller chip, typically in the form of a Northbridge chipset in compute nodes or

otherwise a dedicated memory controller on a mass-memory board in a chassis system. Instead

of replacing the main processors of each node with the FPGA node design presented in this

thesis, a less-risky approach would be to replace the dedicated memory controller chips in these

systems with this FPGA design. This approach would integrate some limited processing

capability close to the memory, circumventing the need to provide an efficient data path from

host memory to an FPGA co-processor, as well as permit a variety of high-performance

interconnects through the reconfigurable fabric of the FPGA, as opposed to the typical PCI-based

interconnects associated with most memory controller chips.

Future Work

There are a variety of promising and interesting directions that could be taken to extend the

findings of this research. While valuable insight has been gained, there were several restrictions

imposed by the experimental hardware, as well as compromises made in order to reduce the

scope of the research to a manageable extent. From expanding the size of the testbed, to









including Serial RapidIO and other logical layer variants, to experimenting with different

applications or even application domains, this thesis is just the start of a potentially large and

lucrative body of unique academic research.

The most immediate step to extend this research is to increase the size of the testbed, either

through integration of RapidIO switches with the experimental platform and increasing the node

count, or by selecting a new, more stable experimental testbed altogether. A larger node count

is necessary to mimic a realistic flight system, and switched topologies would enable more

complex inter-node traffic patterns as well as high-speed data input into the processing system.

A high-speed data source is needed to enable true real-time processing for applications such as

GMTI. Ideally, a chassis-based platform could be used to replace the stand-alone boards used

for this research, providing a more realistic platform as well as reducing the risk of development.

Furthermore, it should be noted that GMTI is only one example application of interest for

high-performance space computing. Other applications within the domain of space-based radar

and remote sensing, such as Synthetic Aperture Radar or Hyperspectral Imaging, place different

loads and requirements on the underlying system architecture. Other application domains, such

as telecommunication or autonomous control, could also be investigated to cover a diverse range

of application behaviors and requirements.

Finally, only one combination of RapidIO variants was used in this thesis research: the 8-bit

parallel physical layer with the Logical I/O logical layer. Serial RapidIO has become much more

popular since the initial release of the RapidIO specification, and other logical layer variants

such as the Message Passing logical layer would provide an interesting contrast to the

programming model offered by the memory-mapped Logical I/O logical layer variant. To

provide a thorough evaluation of RapidIO's suitability for the targeted application domain,

several high-performance interconnect protocols could be evaluated and compared in order to

highlight the strengths and weaknesses of each.









LIST OF REFERENCES


[1] D. Bueno, C. Conger, A. Leko, I. Troxel and A. George, "Virtual Prototyping and
Performance Analysis of RapidIO-based System Architectures for Space-Based Radar,"
Proc. of 8th High-Performance Embedded Computing (HPEC) Workshop, MIT Lincoln
Lab, Lexington, MA, September 28-30, 2004.

[2] D. Bueno, A. Leko, C. Conger, I. Troxel, and A. George, "Simulative Analysis of the
RapidIO Embedded Interconnect Architecture for Real-Time, Network-Intensive
Applications," Proc. of 29th IEEE Conference on Local Computer Networks (LCN) via
the IEEE Workshop on High-Speed Local Networks (HSLN), Tampa, FL, November 16-
18, 2004.

[3] D. Bueno, C. Conger, A. Leko, I. Troxel, and A. George, "RapidIO-based Space Systems
Architectures for Synthetic Aperture Radar and Ground Moving Target Indicator," Proc.
of 9th High-Performance Embedded Computing (HPEC) Workshop, MIT Lincoln Lab,
Lexington, MA, September 20-22, 2005.

[4] T. Hacker, R. Sedwick, and D. Miller, "Performance Analysis of a Space-Based GMTI
Radar System Using Separated Spacecraft Interferometry," M.S. Thesis, Dept. of
Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge, MA, 2000.

[5] M. Linderman and R. Linderman, "Real-Time STAP Demonstration on an Embedded
High Performance Computer," Proc. of 1997 IEEE National Radar Conference, Syracuse,
NY, May 13-15, 1997.

[6] R. Brown and R. Linderman, "Algorithm Development for an Airborne Real-Time STAP
Demonstration," Proc. of 1997 IEEE National Radar Conference, Syracuse, NY, May 13-
15, 1997.

[7] D. Rabideau and S. Kogon, "A Signal Processing Architecture for Space-Based GMTI
Radar," Proc. of 1999 IEEE Radar Conference, April 20-22, 1999.

[8] D. Sandwell, "SAR Image Formation: ERS SAR Processor Coded in MATLAB,"
Lecture Notes Radar and Sonar Interferometry, Dept. of Geography, University of
Zurich, Copyright 2002.

[9] Pentek, Inc., "GateFlow Pulse Compression IP Core," Product Sheet, July, 2003.

[10] A. Choudhary, W. Liao, D. Weiner, P. Varshney, R. Linderman, M. Linderman, and R.
Brown, "Design, Implementation, and Evaluation of Parallel Pipelined STAP on Parallel
Computers," IEEE Trans. on Aerospace and Electrical Systems, vol. 36, pp. 528-548,
April 2000.

[11] M. Lee, W. Liu, and V. Prasanna, "Parallel Implementation of a Class of Adaptive Signal
Processing Applications," Algorithmica, 2001.

[12] M. Lee and V. Prasanna, "High Throughput-Rate Parallel Algorithms for Space-Time
Adaptive Processing," Proc. of 2nd International Workshop on Embedded HPC Systems
and Applications, April 1997.

[13] A. Ahlander, M. Taveniku, and B. Svensson, "A Multiple SIMD Approach to Radar
Signal Processing," Proc. of 1996 IEEE TENCON, Digital Signal Processing
Applications, vol. 2, pp. 852-857, November 1996.

[14] T. Haynes, "A Primer on Digital Beamforming," Whitepaper, Spectrum Signal
Processing, Inc., March 26, 1998.

[15] J. Lebak, A. Reuther, and E. Wong, "Polymorphous Computing Architecture (PCA)
Kernel-Level Benchmarks," HPEC Challenge Benchmark Suite Specification, Rev. 1,
June 13, 2005.

[16] R. Cumplido, C. Torres, and S. Lopez, "On the Implementation of an Efficient FPGA-based
CFAR Processor for Target Detection," International Conference on Electrical and
Electronics Engineering (ICEEE), Acapulco, Guerrero, Mexico, September 8-10, 2004.

[17] D. Bueno, "Performance and Dependability of RapidIO-based Systems for Real-time
Space Applications," Ph.D. Dissertation, Dept. of Electrical and Computer Engineering,
University of Florida, Gainesville, FL, 2006.

[18] M. Skalabrin and T. Einstein, "STAP Processing on Multiprocessor Systems:
Distribution of 3-D Data Sets and Processor Allocation for Efficient Interprocessor
Communication," Proc. of 4th Annual Adaptive Sensor Array Processing (ASAP)
Workshop, March 13-15, 1996.

[19] K. Varnavas, "Serial Back-plane Technologies in Advanced Avionics Architectures,"
Proc. of 24th Digital Avionics System Conference (DASC), October 30-November 3,
2005.

[20] Xilinx, Inc., "Xilinx Solutions for Serial Backplanes," 2004.
http://www.xilinx.com/esp/networkstelecom/optical/collateral/backplanesxlnx.pdf.
Last accessed: March 2007.

[21] S. Vaillancourt, "Space Based Radar On-Board Processing Architecture," Proc. of 2005
IEEE Aerospace Conference, pp.2190-2195, March 5-12, 2005.

[22] K. Compton and S. Hauck, "Reconfigurable Computing: A Survey of Systems and
Software," ACM Computing Surveys, vol. 34, No. 2, pp. 171-210, June 2002.

[23] P. Murray, "Re-Programmable FPGAs in Space Environments," White Paper, SEAKR
Engineering, Inc., Denver, CO, July 2002.

[24] V. Aggarwal, "Remote Sensing and Imaging in a Reconfigurable Computing
Environment", M.S. Thesis, Dept. of Electrical and Computer Engineering, University of
Florida, Gainesville, FL, 2005.

[25] J. Greco, G. Cieslewski, A. Jacobs, I. Troxel, and A. George, "Hardware/software
Interface for High-performance Space Computing with FPGA Coprocessors," Proc. IEEE
Aerospace Conference, Big Sky, MT, March 4-11, 2006.

[26] I. Troxel, "CARMA: Management Infrastructure and Middleware for Multi-Paradigm
Computing," Ph.D. Dissertation, Dept. of Electrical and Computer Engineering,
University of Florida, Gainesville, FL, 2006.

[27] C. Conger, I. Troxel, D. Espinosa, V. Aggarwal and A. George, "NARC: Network-
Attached Reconfigurable Computing for High-performance, Network-based
Applications," International Conference on Military and Aerospace Programmable Logic
Devices (MAPLD), Washington D.C., September 8-10, 2005.

[28] Xilinx, Inc., "Xilinx LogiCORE RapidIO 8-bit Port Physical Layer Interface, DS243
(vl.3)," Product Specification, 2003.

[29] Xilinx, Inc., "LogiCORE RapidIO 8-bit Port Physical Layer Interface Design Guide,"
Product Manual, November 2004.

[30] Xilinx, Inc., "Xilinx LogiCORE RapidIO Logical I/O and Transport Layer Interface,
DS242 (v1.3)," Product Specification, 2003.

[31] Xilinx, Inc., "LogiCORE RapidIO Logical I/O & Transport Layer Interface Design
Guide," Product Manual, November 2004.

[32] S. Elzinga, J. Lin, and V. Singhal, "Design Tips for HDL Implementation of Arithmetic
Functions," Xilinx Application Note, No. 215, June 28, 2000.

[33] N. Gupta and M. George, "Creating High-Speed Memory Interfaces with Virtex-II and
Virtex-II Pro FPGAs," Xilinx Application Note, No. 688, May 3, 2004.

[34] S. Fuller, "RapidIO The Next Generation Communication Fabric for Embedded
Application," Book, John Wiley & Sons, Ltd., West Sussex, England, January 2005.

[35] G. Shippen, "RapidIO Technical Deep Dive 1: Architecture and Protocol," Motorola
Smart Network Developers Forum 2003, Dallas, TX, March 20-23, 2003.

[36] J. Adams, C. Katsinis, W. Rosen, D. Hecht, V. Adams, H. Narravula, S. Sukhtankar, and
R. Lachenmaier, "Simulation Experiments of a High-Performance RapidIO-based
Processing Architecture," Proc. of IEEE International Symposium on Network
Computing and Applications, Cambridge, MA, October 8-10, 2001.

[37] RapidIO Trade Association, "RapidIO Interconnect Specification Documentation
Overview," Specification, June 2002.

[38] RapidIO Trade Association, "RapidIO Interconnect Specification (Parts I-IV),"
Specification, June 2002.

[39] RapidIO Trade Association, "RapidIO Interconnect Specification, Part V: Globally
Shared Memory Logical Specification," Specification, June 2002.

[40] RapidIO Trade Association, "RapidIO Interconnect Specification, Part VI: Physical
Layer lx/4x LP-Serial Specification," Specification, June 2002.

[41] RapidIO Trade Association, "RapidIO Interconnect Specification, Part VIII: Error
Management Extensions Specification," Specification, June 2002.

[42] C. Conger, D. Bueno, and A. George, "Experimental Analysis of Multi-FPGA
Architectures over RapidIO for Space-Based Radar Processing," Proc. of 10th High-
Performance Embedded Computing (HPEC) Workshop, MIT Lincoln Lab, Lexington,
MA, September 19-21, 2006.

[43] IBM Corporation, "The CoreConnect Bus Architecture," White paper, Armonk, NY,
1999.

[44] J. Jou and J. Abraham, "Fault-Tolerant Matrix Arithmetic and Signal Processing on
Highly Concurrent Computing Structures," Proc. of IEEE, vol. 74, No. 5, pp. 732-741,
May 1986.









BIOGRAPHICAL SKETCH

Chris Conger received a Bachelor of Science in Electrical Engineering from The Florida

State University in May of 2003. In June of 2003, he also received his Engineer Intern

certificate from the Florida Board of Professional Engineers in preparation for a Professional

Engineer license. In his final year at Florida State, he began volunteering for the High-

performance Computing and Simulation (HCS) Research Lab under Dr. Alan George through

the Tallahassee branch of the lab.

Upon graduation from Florida State, Chris moved to Gainesville, FL, to continue his

education under Dr. George and with the HCS Lab at The University of Florida. He became a

research assistant in January 2004, and has since worked on a variety of research projects

involving FPGAs, parallel algorithms, and the design of high-performance embedded processing

systems.





PAGE 1

RECONFIGURABLE COMPUTING WITH RapidIO FOR SPACE-BASED RADAR By CHRIS CONGER A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE UNIVERSITY OF FLORIDA 2007 1

PAGE 2

2007 Chris Conger 2

PAGE 3

To my favorite undergraduate professor, Dr. Fred O. Simons, Jr. 3

PAGE 4

ACKNOWLEDGMENTS I would like to thank my major professor, Dr. Alan George, for his patience, guidance, and unwavering support through my trials as a graduate student. I thank Honeywell for their technical guidance and sponsorship, as well as Xilinx for their generous donation of hardware and IP cores that made my research possible. Finally, I want to express my immeasurable gratitude towards my parents for giving me the proper motivation, wisdom, and support I needed to reach this milestone. I can only hope to repay all of the kindness and love I have been given and which has enabled me to succeed. 4

PAGE 5

TABLE OF CONTENTS page ACKNOWLEDGMENTS ...............................................................................................................4 LIST OF TABLES ...........................................................................................................................7 LIST OF FIGURES .........................................................................................................................8 ABSTRACT ...................................................................................................................................10 CHAPTER 1 INTRODUCTION..................................................................................................................11 2 BACKGROUND AND RELATED RESEARCH.................................................................16 Ground Moving Target Indicator............................................................................................16 Pulse Compression..........................................................................................................18 Doppler Processing..........................................................................................................20 Space-Time Adaptive Processing....................................................................................22 Constant False-Alarm Rate Detection.............................................................................23 Corner Turns.....................................................................................................................26 Embedded System Architectures for Space............................................................................28 Field-Programmable Gate Arrays...........................................................................................32 RapidIO...................................................................................................................................32 3 ARCHITECTURE DESCRIPTIONS.....................................................................................39 Testbed Architecture...............................................................................................................39 Node Architecture...................................................................................................................43 External Memory Controller...........................................................................................45 Network Interface Controller..........................................................................................47 On-Chip Memory Controller...........................................................................................49 PowerPC and Software API............................................................................................50 Co-processor Engine Architectures........................................................................................51 Pulse Compression Engine..............................................................................................55 Doppler Processing Engine.............................................................................................57 Beamforming Engine.......................................................................................................58 CFAR Engine..................................................................................................................60 4 ENVIRONMENT AND METHODS.....................................................................................62 Experimental 
Environment.....................................................................................................62 Measurement Procedure.........................................................................................................63 Metrics and Parameter Definitions.........................................................................................65 5

PAGE 6

5 RESULTS...............................................................................................................................68 Experiment 1.................................................................................................................... .......68 Experiment 2.................................................................................................................... .......71 Experiment 3.................................................................................................................... .......76 Experiment 4.................................................................................................................... .......81 Experiment 5.................................................................................................................... .......84 6 CONCLUSIONS AND FUTURE WORK.............................................................................95 Summary and Conclusions.....................................................................................................95 Future Work............................................................................................................................97 LIST OF REFERENCES.............................................................................................................100 BIOGRAPHICAL SKETCH.......................................................................................................104 6

PAGE 7

LIST OF TABLES Table page 3-1 Software API function descriptions.......................................................................................51 3-2 Co-processor wrapper signal definitions...............................................................................53 4-1 Experimental parameter definitions.......................................................................................67 7

PAGE 8

LIST OF FIGURES Figure page 2-1 Illustration of satellite platform location, orientation, and pulse shape.................................16 2-2 GMTI data-cube....................................................................................................................17 2-3 GMTI processing flow diagram............................................................................................18 2-4 Processing dimension of pulse compression.........................................................................19 2-5 Processing dimension of Doppler processing........................................................................21 2-6 Examples of common apodization functions.........................................................................21 2-7 Processing dimension of beamforming.................................................................................23 2-8 Processing dimension of CFAR detection.............................................................................24 2-9 CFAR sliding window definition..........................................................................................25 2-10 Basic corner turn............................................................................................................. .....26 2-11 Proposed data-cube decomposition to improve corner-turn performance...........................28 2-12 Example pictures of chassis and backplane hardware.........................................................29 2-13 Three popular serial backplane topologies..........................................................................30 2-14 Layered RapidIO architecture vs. layered OSI architecture................................................34 2-15 RapidIO compared to other high-performance interconnects.............................................36 3-1 Conceptual radar satellite processing system diagram..........................................................40 3-2 Experimental testbed system diagram...................................................................................42 3-3 Node architecture block diagram...........................................................................................44 3-4 External memory controller block diagram...........................................................................46 3-5 Network interface controller block diagram..........................................................................47 3-6 Standardized co-processor engine wrapper diagram.............................................................52 3-7 Pulse compression co-processor architecture block diagram................................................55 3-8 Doppler processing co-processor architecture block diagram...............................................57 8

PAGE 9

3-9 Beamforming co-processor architecture block diagram........................................................58 3-10 Illustration of beamforming computations..........................................................................59 3-11 CFAR detection co-processor architecture block diagram..................................................60 4-1 Testbed environment and interface........................................................................................63 5-1 Baseline throughput performance results..............................................................................69 5-2 Single-node co-processor processing performance...............................................................72 5-3 Processing and memory efficiency for the different stages of GMTI...................................73 5-4 Performance comparison between RapidIO testbed and Linux workstation........................75 5-5 Doppler processing output comparison.................................................................................77 5-6 Beamforming output comparison..........................................................................................78 5-7 CFAR detection output comparison......................................................................................80 5-8 Data-cube dimensional orientation........................................................................................82 5-9 Data-cube performance results..............................................................................................83 5-10 Local DMA transfer latency prediction and validation.......................................................86 5-11 Illustration of double-buffered processing..........................................................................87 5-12 Data-cube processing latency prediction and validation.....................................................88 5-13 Illustration of a corner-turn operation.................................................................................88 5-14 Corner-turn latency prediction and validation.....................................................................89 5-15 Full GMTI application processing latency predictions.......................................................91 5-16 Corner-turn latency predictions with a direct SRAM-to-RapidIO data path.......................93 9

PAGE 10

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science RECONFIGURABLE COMPUTING WITH RapidIO FOR SPACE-BASED RADAR By Chris Conger August 2007 Chair: Alan D. George Major: Electrical and Computer Engineering Increasingly powerful radiation-hardened Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), and conventional processors (along with high-performance embedded system interconnect technologies) are helping to enable the on-board processing of real-time, high-resolution radar data on satellites and other space platforms. With current processing systems for satellites based on conventional processors and non-scalable bus interconnects, it is impossible to keep up with the high throughput demands of space-based radar applications. The large datasets and real-time nature of most space-based radar algorithms call for highly-efficient data and control networks, both locally within each processing node as well as across the system interconnect. The customized architecture of FPGAs and ASICs allows for unique features, enhancements, and communication options to support such applications. Using a ground moving target indicator application as a case study, my research investigates low-level architectural design considerations on a custom-built testbed of multiple FPGAs connected over RapidIO. This work presents experimentally gathered results to provide insight into the relationship between the strenuous application demands and the underlying system architecture, as well as the practicality of using reconfigurable components for these challenging high-performance embedded computing applications. 10

PAGE 11

CHAPTER 1 INTRODUCTION Embedded processing systems operating in the harsh environments of space are subject to more stringent design constraints when compared to those imposed upon terrestrial embedded systems. Redundancy, radiation hardening, and strict power requirements are among the leading challenges presented to space platforms, and as a result the flight systems currently in space are mostly composed of lower-performance frequency-limited software processors and non-scalable bus interconnects. These current architectures may be inadequate for supporting real-time, on-board processing of sensor data of sufficient volume, and thus new components and novel architectures need to be explored in order to enable high-performance computing in space. Radar satellites have a potentially wide field of view looking down from space, but maintaining resolution at that viewpoint results in very large data sets. Furthermore, target tracking algorithms such as Ground Moving Target Indicator (GMTI) have tight real-time processing deadlines of consecutive radar returns in order to keep up with the incoming sensor data as well as produce useful (i.e., not stale) target detections. In previous work, it has been shown using simulation [1-3] that on-board processing of high-resolution data on radar satellites requires parallel processing platforms providing high throughput to all processing nodes, and efficient processing engines to keep up with the strict real-time deadlines and large data sets. Starting in 1997, Motorola began to develop a next-generation, high-performance, packet-switched embedded interconnect standard called RapidIO, and soon after partnered with Mercury Computers to complete the first version of the RapidIO specification. After the release of the 1.0 specification in 1999, the RapidIO Trade Association was formed, a non-profit corporation composed of a collection of industry partners for the purpose of steering the development and adoption of the RapidIO standard. Designed specifically for embedded environments, the 11

PAGE 12

RapidIO standard seeks to address many of the challenges faced by embedded systems, such as providing a small footprint for lower size and power, inherent fault tolerance, high throughput, and scalability. Additionally, the RapidIO protocol includes flow control and fault isolation, features that enhance the fault tolerance of a system but are virtually non-existent in bus-based interconnects. These characteristics make the RapidIO standard an attractive option for embedded space platforms, and thus in-depth research is warranted to consider the feasibility and suitability of RapidIO for space systems and applications. In addition to the emergence of high-performance interconnects, the embedded community has also been increasingly adopting the use of hardware co-processing engines in order to boost system performance. The main types of processing engines used aside from conventional processors are application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs). While both components provide the potential for great clock-cycle efficiency through exploitation of parallelism and custom-tailored architectures for performing a particular processing task, the ASIC devices are much more expensive to design and fabricate than FPGAs. By contrast, with sufficient time and resources, ASIC devices can achieve comparable or better performance than FPGAs, as well as lower power consumption due to the removal of unnecessary logic that would be present in any FPGA design. Unfortunately, ASICs are rarely re-usable for many different applications, where FPGAs may be completely reconfigured for a wide range of end uses. FPGAs are also considered valuable prototyping components due to their flexible, re-programmable nature. However, it will be shown that it can be difficult at this point in time to construct a complete system based solely on hardware components (e.g. FPGAs). Application designers are rarely fluent in the various hardware description languages (HDL) used to develop FPGA 12

PAGE 13

designs. Ideally, application engineers should still be able to develop applications in the familiar software setting, making use of the hardware processing resources of the system transparently through function calls in software. Furthermore, most flight systems have extensive middleware (software) for system monitoring and control, and porting these critical legacy software components to hardware would be fought tooth-and-nail, if it were even possible. As will be shown in my research, there is still an important role for software in these high-performance embedded systems, and gaining insight into the optimal combination of hardware and software processing for this application domain is an important objective. My work conducted an experimental study of cutting-edge system architectures for Space-Based Radar (SBR) using FPGA processing nodes and the RapidIO high-performance interconnect for embedded systems. Hardware processing provides the clock-cycle efficiency necessary to enable high-performance computing with lower-frequency, radiation-hardened components. Compared to bus-based designs, packet-switched interconnects such as RapidIO will substantially increase the scalability, robustness, and network performance of future embedded systems, and the small footprint and fault tolerance inherent to RapidIO suggest that it may be an ideal fit for use with FPGAs in space systems. The data input and output requirements of the processing engines, as well as the data paths provided by the system interconnect and the local memory hierarchies are key design parameters that affect the ultimate system performance. By experimenting with assorted node architecture variations and application communication schedules, this research is able to identify critical design features as well as suggest enhancements or alternative design approaches to improve performance for SBR applications. 13

PAGE 14

An experimental testbed was built from the ground up to provide a platform on which to perform experimentation. Based on Xilinx FPGAs, prototyping boards, and RapidIO IP cores, the testbed was assembled into a complete system for application development and fine-grained performance measurement. When necessary, additional printed circuit boards (PCBs) were designed, fabricated, assembled, and integrated with the Xilinx prototyping boards and FPGAs in order to enhance the capabilities of each testbed node. Moving down one level in the architecture, a top-level processing node architecture was needed with which to program each of the FPGAs of the system. The processing node architecture is equally important to performance as is the system topology, and the flexible, customizable nature of FPGAs was leveraged in order to propose a novel, advanced chip-level architecture connecting external memory, network fabric, and processing elements. In addition to proposing a baseline node architecture, the reprogrammability of the FPGAs allows features and design parameters to be modified in order to alter the node architecture and observe the net impact on application performance. Each of the FPGAs in the testbed contains two embedded PowerPC405 processors, providing software processors to the system that are intimately integrated with the reconfigurable fabric of the FPGAs. Finally, co-processor engines are designed in HDL to perform the processing tasks of GMTI best suited for hardware processing. The remainder of this document will proceed as follows. Chapter 2 will present the results of a literature review of related research, and provide background information relating to this thesis. Topics covered in Chapter 2 include GMTI and space-based radar, RapidIO, reconfigurable computing, as well as commercial embedded system hardware and architectures. Chapter 3 begins the discussion of original work, by describing in detail the complete architecture of the experimental testbed. This description includes the overall testbed topology 14

PAGE 15

and components, the proposed processing node architecture (i.e. the overall FPGA design), as well as individual co-processor engine architectures used for high-speed processing. Chapter 4 will introduce the experimental environment in detail, as well as outline the experimental methods used for data collection. This overview includes experimental equipment and measurement procedures, testbed setup and operation, parameter definitions, and a description of the experiments that are performed. Chapter 5 will present the experimental results, as well as offer discussion and analysis of those results. Chapter 6 provides some concluding remarks that summarize the insight that was gained from this work, as well as some suggestions for future tasks to extend the findings of this research. 15

PAGE 16

CHAPTER 2 BACKGROUND AND RELATED RESEARCH Ground Moving Target Indicator There are a variety of specific applications within the domain of radar processing, and even while considering individual applications there are often numerous ways to implement the same algorithm using different orderings of more basic kernels and operations. Since this research is architecture-centric in nature, to reduce the application space one specific radar application known as GMTI will be selected for experimentation. GMTI is used to track moving targets on the ground from air or space, and comes recommended from Honeywell Inc., the sponsor of this research, as an application of interest for their future SBR systems. Much research exists in academic and technical literature regarding GMTI and other radar processing applications [1-18], however most deal with airborne radar platforms. Figure 2-1. Illustration of satellite platform location, orientation, and pulse shape. For the type of mission considered in this thesis research, the radar platform is mounted on the satellite, pointed downward towards the ground as shown in Figure 2-1. The radar sends out periodic pulses at a low duty cycle, with the long silent period used to listen for echoes. Each 16

PAGE 17

echo is recorded over a period of time through an analog-to-digital converter (ADC) operating at a given sampling frequency, corresponding to a swath of discrete points on the ground that the transmitted pulse passes over [4]. Furthermore, each received echo is filtered into multiple frequency channels, to help overcome noise and electronic jamming. Thus, each transmitted pulse results in a 2-dimensional set of data, composed of multiple range cells as well as frequency channels. When multiple consecutive pulse returns are concatenated together, the result is a three-dimensional data-cube of raw radar returns that is passed to the high-performance system for processing (see Figure 2-2). Figure 2-2. GMTI data-cube. As mentioned above, periodically the radar will send a collection of data to the processing system. This period represents the real-time processing deadline of the system, and is referred to as the Coherent Processing Interval (CPI). Channel dimension lengths are typically small, in the 4-32 channel range, and typical pulse dimension lengths are also relatively small, usually between 64-256. However, the range dimension length will vary widely depending on radar height and resolution. Current GMTI implementations on aircraft have range dimensions on the order of 512-1024 cells [5-6, 12], resulting in data-cube sizes in the neighborhood of 32 MB or less. However, the high altitude and wide field of view of space-based radar platforms requires 17

PAGE 18

dramatic increases in the range dimension in order to preserve resolution at that altitude. Space-based radar systems may have to deal with 64K-128K range cells, and even more in the future, which correspond to data-cubes that are 4 GB or larger. GMTI is composed of a sequence of four sub-tasks, or kernels that operate on the data-cube to finally create a list of target detections from each cube [7]. These four sub-tasks are (1) pulse compression, (2) Doppler processing, (3) space-time adaptive processing, and (4) constant false-alarm rate (CFAR) detection. Figure 2-3 below illustrates the flow of GMTI. Figure 2-3. GMTI processing flow diagram. Pulse Compression The first stage of GMTI is a processing task associated with many radar applications, and is known as pulse compression. The reason why pulse compression is needed is based on the fact that the pulse transmitted by the radar is not instantaneous. In an ideal world, the radar could send out an instantaneous pulse, or impulse, that would impact each point on the ground instantly, and the echo received by the radar would be a perfect reflection of the ground. However, in the real world, the radar pulse is not instantaneous. As the pulse travels over the ground, the radar receives echoes from the entire length of the pulse at any point in time. Thus, what the radar actually receives as the echo is an image of the surface convolved with the shape 18

PAGE 19

of the transmitted pulse. Pulse compression is used to de-convolve the shape of the transmitted pulse (which is known and constant) from the received echo, leaving a cleaned up image of the surface [8]. One benefit of pulse compression is that it can be used to reduce the amount of power consumed by the radar when transmitting the pulses. A tradeoff exists where shorter pulses result in less-convolved echoes, however require higher transmitting power in order to retain an acceptable signal-to-noise ratio. By transmitting longer pulses, less power is required, but of course the echo will be more dirty or convolved. Pulse compression thus enables the radar to operate with lower power using longer pulses [8] and still obtain clean echoes. In addition to the raw radar data-cube (represented by the variable a), pulse compression uses a constant complex vector representing a frequency-domain recreation of the original radar pulse. This vector constant, F, can be pre-computed once and re-used indefinitely [9]. Since convolution can be performed by a simple multiplication in the frequency domain, pulse compression works by converting each range line of the data-cube a (see Figure 2-4, henceforth referred to as a Doppler bin) into the frequency domain with an FFT, performing a point-by-point complex vector multiplication of the frequency-domain Doppler bin with the complex conjugate of the vector constant F (for de-convolution), and finally converting the Doppler bin back into the time domain with an IFFT. Each Doppler bin can be processed completely independently [10]. Figure 2-4. Processing dimension of pulse compression, highlighting one Doppler bin. 19

PAGE 20

The operations of pulse compression are defined mathematically by the following expressions, where R, P, and C represent the lengths of the range, pulse, and channel dimensions, respectively: 102,,,,RrrkRirpckpceaA 1,...,0 Cc 1,...,0 Pp (2.1) 1,...,0Rk kkpckpcFAA,,,, 1,...,0 Cc 1,...,0 Pp (2.2) 1,...,0Rk 102,,,,RkrkRikpcrpceAa 1,...,0 Cc 1,...,0 Pp (2.3) 1,...,0Rr The output, a, represents the pulse-compressed data-cube and is now passed to Doppler processing after a re-organization of the data in system memory (the data reorganization operation will be addressed later). Doppler Processing Doppler processing performs operations similar to pulse compression, although for a different reason. Doppler processing is performed along the pulse dimension, and performs apodization on the data-cube before converting it to the frequency domain along the pulse dimension. Apodization is done by a complex vector multiplication of each pulse line of the data-cube (henceforth referred to as a range bin, see Figure 2-5) by a constant, pre-computed complex time-domain vector [10]. There are several standard apodization functions used for Doppler processing, a few examples of which are given in Figure 2-6. The choice of apodization function depends on the properties of the radar, and has no impact on the performance of the application. 20

PAGE 21

Figure 2-5. Processing dimension of Doppler processing, highlighting one range bin. Figure 2-6. Examples of common apodization functions, time-domain (left) and frequency-domain (middle and right). Each range bin of the data-cube a is processed completely independently, and is first multiplied by the apodization vector, g. The resulting range bin is then converted to the frequency domain with an FFT, in preparation for the next stage. These operations are defined mathematically by the following expressions: prpcrpcgab,,,, 1,...,0 Cc 1,...,0 Pp (2.4) 1,...,0Rr 102,,,,PppkPirpcrkcebB 1,...,0 Cc 1,...,0 Pk (2.5) 1,...,0Rr The output, B, represents the processed data-cube that is ready for beamforming in the next stage, after yet another re-organization of the data in system memory. 21

PAGE 22

Space-Time Adaptive Processing Space-time adaptive processing (STAP) is the only part of the entire GMTI application that has any temporal dependency between different CPIs. There are two kernels that are considered to be a part of STAP: (1) adaptive weight computation (AWC), and (2) beamforming. While AWC is defined as a part of STAP, it is actually performed in parallel with the rest of the entire algorithm (recall Figure 2-3). For a given data-cube, the results of AWC will not be used until the beamforming stage of the next CPI. Beamforming, by contrast, is in the critical path of processing, and proceeds immediately following Doppler processing using the weights provided by AWC on the previous data-cube [10-12]. Adaptive Weight Computation Adaptive weight computation is performed in parallel with the rest of the stages of GMTI, and is not considered to be a performance-critical step given the extended amount of time available to complete the processing. The output of AWC is a small set of weights known as the steering vector, and is passed to the beamforming stage. Since AWC does not lie in the critical path of processing, this stage is omitted in the implementation of GMTI for this thesis in order to reduce the necessary attention paid to the application for this architecture-centric research. For more information on the mathematical theory behind AWC and the operations performed, see [10, 12]. Beamforming Beamforming takes the weights provided by the AWC step, and projects each channel line of the data-cube into one or more beams through matrix multiplication [10-13]. Each beam represents one target classification, and can be formed completely independently. The formation of each beam can be thought of as filtering the data-cube to identify the probability that a given cell is a given type or class of target [14]. As the final step of beamforming, the magnitude of 22

PAGE 23

each complex element of the data-cube is obtained in order to pass real values to CFAR detection in the final stage. Figure 2-7 below shows the processing dimension of beamforming, along with a visualization of the reduction that occurs to the data-cube as a result of beamforming. Figure 2-7. Processing dimension of beamforming (left), and diagram illustrating data-cube reduction that occurs as a result of beamforming (right). Taking the processed, reorganized data-cube B from Doppler processing, as well as the steering vector from AWC, the beams are formed with a matrix multiplication followed by a magnituding operation on each element of the resulting beams. These operations are defined mathematically by the following expressions: 10,,,,,CcrpccbrpbBB 1,...,0 Bb 1,...,0 Pp (2.6) 1,...,0Rr 2,,2,,,,imagrpbrealrpbrpbBBC 1,...,0 Bb 1,...,0 Pp (2.7) 1,...,0Rr The output cube, C, is a real-valued, completely processed and beamformed data-cube. The data-cube undergoes one final reorganization in system memory before being passed to the final stage of processing, CFAR detection. Constant False-Alarm Rate Detection Constant False-Alarm Rate detection, or CFAR, is another kernel commonly associated with radar applications [15]. CFAR is used in this case as the final stage in GMTI processing, and makes the ultimate decision on what is and what is not a target. The goal of CFAR is to 23

PAGE 24

minimize both the number of false-positives (things reported as targets that really are not targets) as well as false-negatives (things not reported as targets that really are targets). The output of CFAR is a detection report, containing the range, pulse, and beam cell locations of each target within the data-cube, as well as the energy of each target. Figure 2-8. Processing dimension of CFAR detection. CFAR operates along the range dimension (see Figure 2-8 above) by sliding a window along each Doppler bin, computing local averages and performing threshold comparisons for each cell along the range dimension. In addition to the coarse-grained, embarrassingly parallel decomposition of the data-cube across multiple processors, the computations performed by the sliding window contain a significant amount of parallelism and that may also be exploited to further reduce the amount of computations required per Doppler bin [15]. This fine-grained parallelism is based on reusing partial results from previously computed averages, and will be described in detail later in Chapter 3. The computations of CFAR on the data-cube C can be expressed mathematically as: cfarNGGiirpbirpbcfarrpbCCNT12,,2,,,,21 (2.8) rpbrpbTC,,2,, ? 1,...,0 Bb 1,...,0 Pp 1,...,0 Rr (2.9) 24

PAGE 25

Where G represents the number of guard cells on each side of the target cell and N cfar represents the size of the averaging window on either side of the target cell (see Figure 2-9). Figure 2-9. CFAR sliding window definition. For each cell in the Doppler bin, a local average is computed of the cells surrounding the target cell (the target cell is the cell currently under consideration). A small number of cells immediately surrounding the target cell, referred to as guard cells, are ignored. Outside of the guard cells on either side of the target cell, a window of N cfar cells are averaged. This average is then scaled by some constant, The value of the target cell is then compared to the scaled average, and if the target cell is greater than the scaled average then it is designated as a target. The constant is very important to the algorithm, and is the main adjustable parameter that must be tuned to minimize the false-positives and false-negatives reported by CFAR. This parameter is dependent on the characteristics of the specific radar sensor, and the theory behind its determination is beyond the scope or focus of this thesis research. Fortunately, the value of this constant does not affect the performance of the application. One final detail of the algorithm concerns the average computations at or near the boundaries of the Doppler bin. Near the beginning or end of the Doppler bin, one side of the sliding window will be incomplete or un-filled, and thus only a right-side or left-side scaled average comparison can be used near the boundaries as appropriate to handle these conditions [16]. 25

PAGE 26

Corner Turns Another important operation for GMTI, and many other signal processing applications, is that of re-organization of data for different processing tasks. In order to increase the locality of memory accesses within any given computational stage of GMTI, it is desirable to organize the data in memory according to the dimension that the next processing stage will operate along. In distributed memory systems, where the data-cube is spread among all of the processing memories and no single node contains the entire data-cube, these corner-turns, or distributed transposes can heavily tax the local node memory hierarchies as well as the network fabric. It should be noted here that with GMTI, the corner-turns constitute the only inter-processor communication associated with the algorithm [10, 17]. Once all data is organized properly for a given stage of GMTI, processing may proceed at each node independent of one another in an embarrassingly parallel fashion. Thus, it is the performance of the corner-turn operations as you scale the system size up that determines the upper-bound on parallel efficiency for GMTI. It is important to understand the steps required for a corner-turn, at the memory access level, in order to truly understand the source of inefficiency for this notorious operation. The basic distributed corner-turn operation can be visualized by observing Figure 2-10 below. Figure 2-10. Basic corner turn 26

PAGE 27

First, consider the most basic decomposition of the data-cube for any given task, along a single dimension of the cube (see Figure 2-10, left side). Assume the data-cube is equally divided among all the nodes in the system, and must be re-distributed among the nodes as illustrated. It can be seen that each node will have to send a faction of its current data set to each other node in the system, resulting in a personalized all-to-all communication event, which heavily taxes the system backplane with all nodes attempting to send to one another at the same time. Simulations in [3] suggest that careful synchronization of the communication events can help to control contention and improve corner-turn performance. However, look closer at the local memory access pattern at each node. The local memory can be visualized as shown to the right side of Figure 2-10. The data to be sent to each node does not completely lie in consecutive memory locations, and still must be transposed at some point. The exact ordering of the striding, grouping, and transposition is up to the designer, and will serve as a tradeoff study to be considered later. Because of this memory locality issue, corner-turns pose efficiency challenges both at the system interconnect level as well as at the node level in the local memory hierarchy. A workshop presentation from Motorola [18] suggests an alternate approach to decomposition of data-cubes for GMTI, in order to improve the performance of corner-turn operations. Instead of decomposition along a single dimension, since all critical stages of GMTI operate on 1-dimensional data, the data-cube can be decomposed along two dimensions for each stage. The resulting data-cube would appear decomposed as shown below in Figure 2-11, showing two different decompositions to help visualize where the data must move from and to. The main advantage of this decomposition is that in order to re-distribute the data, only groups of nodes will need to communicate with one-another, as opposed to a system-wide all-to-all communication event. 27

PAGE 28

Figure 2-11. Proposed data-cube decomposition to improve corner-turn performance. Other previous work [1] suggests a processor allocation technique to improve GMTI performance, and fits well with the decomposition approach suggested above. This allocation technique suggests that instead of using a completely data-parallel or completely pipelined implementation, a staggered parallelization should be adopted which breaks the system up into groups of nodes. Each group of nodes receives an entire data-cube, and performs all stages of GMTI in a data-parallel fashion. Consecutive data-cubes are handed to different groups in a round-robin fashion, thus the staggered title. The benefit of this processor allocation scheme is that the entire backplane is not taxed for corner-turns, instead only the local interconnect for each group of processors. Tuned to a specific topology, this processor allocation strategy could fit well with proper decomposition boundaries of the data-cube as suggested above, further improving overall application performance. Embedded System Architectures for Space Given the unique physical environment and architecture of satellite systems, it is worth reviewing typical embedded hardware to be aware of what kinds of components are available, and where the market is moving for next-generation embedded systems. Many high-performance embedded processing systems are housed in a small (<2 ft. <2 ft. <2 ft.) metal box called a chassis (example pictured in Figure 2-12(a)). The chassis provides a stable structure in which to install cards with various components, as well as one central motherboard or backplane that is a critical component of the embedded system. 28

PAGE 29

A B Figure 2-12. Example pictures of chassis and backplane hardware; A) Mercurys Ensemble2 Platform, an ATCA chassis for embedded processing, and B) example of a passive backplane. The backplane is a large printed circuit board (PCB) that provides multiple sockets as well as connectivity between the sockets. There are two general classes of backplanes, passive backplanes and active backplanes. Active backplanes include components such as switches, arbiters, and bus transceivers, and actively drive the signals on the various sockets. Passive backplanes, by contrast, only provide electrical connections (e.g. copper traces on the PCB) between pins of the connectors, and the cards that are plugged into the sockets determine what kind of communication protocols are used. Passive backplanes are much more popular than active backplanes, as is indicated by the wide availability of passive backplane products compared to the scarcity of active backplane products. Figure 2-12(b) shows an example of a passive backplane PCB with a total of 14 card sockets. However, even passive backplanes put some restriction on what kinds of interconnects can be used. Most importantly (and obviously), is the pin-count of the connectors, as well as the width and topology of the connections between sockets. Most backplanes are designed with one or more interconnect standards in mind, and until recently almost all backplane products were 29

PAGE 30

built to support some kind of bus interconnect. For example, for decades one of the most popular embedded system backplane standards has been the VME standard [19]. In response to the growing demand for high-performance embedded processing systems, newer backplane standards are emerging, such as the new VME/VXS extension or the ATCA standard, that are designed from the ground-up with high-speed serial packet-switched interconnects and topologies in mind [19, 20]. In addition to providing the proper connection topologies between sockets to support packet-switched systems, extra care and effort must be put into the design and layout of the traces, as well as the selection of connectors, in order to support high-speed serial interconnects with multi-GHz speeds. High-speed signals on circuit boards are far more susceptible to signal-integrity hazards, and much effort is put into the specification of these backplane standards, such as VXS or ATCA, in order to provide the necessary PCB quality to support a number of different high-speed serial interconnect standards. Figure 2-13 shows three of the most popular serial backplane topologies: A B C Figure 2-13. Three popular serial backplane topologies: A) star, B) dual-star, and C) full-mesh. Generally, a serial backplane such as VXS or ATCA will have one or more switch card slots, and several general-purpose card slots. The example backplane pictured back in Figure 2-12(b) shows an example of this, as the two slots to the far left are larger than the rest, indicating 30

PAGE 31

that those slots are the switch sockets. Most likely, the backplane picture in Figure 2-12(b) is a dual-star topology as is pictured above in Figure 2-13(b). As mentioned previously, the passive backplane itself does not dictate which interconnect must be used, although it may restrict which interconnects can be used. The cards that are plugged into the backplane will dictate what the system interconnect is, as well as what processing and memory resources are to be a part of the system. There are many different products for processor, memory, and switch cards, all of which conform to one chassis/backplane standard or another. They combine any number of software processors, FPGAs and ASICs, and various memory devices, as well as connectivity through a switch or some other communication device to the backplane connector. While there are far too many products to review in this document, there was one particular publication at a recent High-Performance Embedded Computing (HPEC) workshop that is very relevant to this project, and warrants a brief discussion. The company SEAKR published a conference paper in 2005 highlighting several card products specifically designed for space-based radar processing [21]. These products include a switch card, a processor card, and a mass memory module, all of which are based on FPGAs and use RapidIO as the interconnect. The architecture of each board is detailed, revealing a processor board that features 4 FPGAs connected together via a RapidIO switch, with several other ports of the switch connecting to ports leading off the board to the backplane connector. Beyond the details of each board architecture, seeing such products begin to emerge during the course of this thesis research is a positive indication from the industry that pushing towards the use of FPGAs and RapidIO for high-performance radar and signal processing is indeed a feasible and beneficial direction to go. 31


Field-Programmable Gate Arrays By this point in time, FPGAs have become fairly common and as such this thesis is written assuming that the reader already has a basic understanding of the architecture of FPGAs. Nevertheless, this technology is explicitly mentioned here for completeness, and to ensure the reader has the necessary basic knowledge of the capabilities and limitations of FPGAs in order to understand the designs and results presented later. For a detailed review of basic background information and research regarding FPGAs, see [22-27]. Since this research involved a lot of custom low-level design with the FPGAs, several documents and application notes from Xilinx were heavily used during the course of the research [28-33]. RapidIO RapidIO is an open-standard, defined and steered by a collection of major industry partners including Motorola (now Freescale) Semiconductor, Mercury Computers, Texas Instruments, and Lucent Technologies, among others. Specially designed for embedded systems, the RapidIO standard seeks to provide a scalable, reliable, high-performance interconnect solution designed specifically to alleviate many of the challenges imposed upon such systems [34, 35]. The packet-switched protocol provides system scalability while also including features such as flow control, error detection and fault isolation, and guaranteed delivery of network transactions. RapidIO also makes use of Low-Voltage Differential Signaling (LVDS) at the pin-level, which is a high-speed signaling standard that enjoys higher attainable frequencies and lower power consumption due to low voltage swing, as well as improved signal integrity due to common-mode noise rejection between signal pairs (at the cost of a doubling of the pin count for all LVDS connections). As a relatively new interconnect standard, there is a vacuum of academic research in conference and journal publications. In fact, at the conception of this research project in August 32


2004, there was only a single known academic research paper that featured a simulation of a RapidIO network [36], although the paper was brief and ambiguous as far as definitions of the simulation models and other important details. The majority of literature used to research and become familiar with RapidIO comes from industry whitepapers, conference and workshop presentations from companies, and standards specifications [34-42]. This thesis research [42], and the numerous publications to come from this sponsored project [1-3, 17], contribute much-needed academic research into the community investigating this new and promising high-performance embedded interconnect technology. As defined by the standard, a single instance of a RapidIO node, or communication port, is known as an endpoint. The RapidIO standard was designed from the start with an explicit intention of keeping the protocol simple in order to ensure a small logic footprint, and not require a prohibitive amount of space on an IC to implement a RapidIO endpoint [35]. The RapidIO protocol has three explicit layers defined, which map to the classic Layered OSI Architecture as shown below in Figure 2-14. The RapidIO logical layer handles end-to-end transaction management, and provides a programming model to the application designer suited for the particular application (e.g. Message-Passing, Globally-Shared Memory, etc). The transport layer is very simple, and in current implementations is often simply combined with the logical layer. The transport layer is responsible for providing the routing mechanism to move packets from source to destination using source and destination IDs. The physical layer is the most complex part of the RapidIO standard, and handles the electrical signaling necessary to move data between two physical devices, in addition to low-level error detection and flow control. 33


Figure 2-14. Layered RapidIO architecture vs. layered OSI architecture. RapidIO has multiple physical layer and logical layer specifications, and a common transport layer specification [35]. RapidIO offers different variations of the logical layer specification, specifically the message passing logical layer, the memory-mapped logical I/O logical layer, as well as the globally-shared memory logical layer [38, 39]. Both the logical I/O and the globally-shared memory logical layers are designed to provide direct, transparent access to remote memories through a global address space. Any node can read or write directly from/to the memory of any other node connected to the RapidIO fabric, simply by accessing an address located in the remote nodes address space. The major difference between these two logical layer variants is that the globally-shared memory logical layer offers cache-coherency, while the simpler logical I/O logical layer provides no coherence. The message passing logical layer, by contrast, is somewhat analogous to the programming model of message passing interface (MPI), where communication is carried out through explicit send/receive pairs. With the logical I/O and globally-shared memory logical layers, the programmer has the option for some request types of requesting explicit acknowledgement through the logical layer at the receiving end, or otherwise 34


allowing the physical layer to guarantee successful delivery without requiring end-to-end acknowledgements for every transfer through logical layer mechanisms. Regardless of which logical layer variant is selected for a given endpoint implementation, the logical layer is defined over a single, simple, common transport layer specification. The RapidIO physical layer is defined to provide the designer a range of potential performance characteristics, depending on the requirements of his or her specific application. The RapidIO physical layer is bi-directional or full-duplex, is sampled on both rising and falling edges of the clock (DDR), and defines multiple legal clock frequencies. Individual designers may opt to run their RapidIO network at a frequency outside of the defined range. All RapidIO physical layer variants can be classified under one of two broad categories; (1) Parallel RapidIO, or (2) Serial RapidIO. Parallel RapidIO provides higher throughputs over shorter distances, and has both an 8-bit and a 16-bit variant defined as part of the RapidIO standard [38]. Serial RapidIO is intended for longer-distance, cable or backplane applications, and also has two different variants defined in the standard, the single-lane 1x Serial and the 4-lane 4x Serial [40]. Parallel RapidIO is source-synchronous, meaning that a clock signal is transmitted along with the data, and the receive logic at each end point is clocked on the received clock signal. Serial RapidIO, by contrast, uses 8b/10b encoding on each serial lane in order to enable reliable clock recovery from the serial bit stream at the receiver. This serial encoding significantly reduces the efficiency of Serial RapidIO relative to parallel RapidIO, but nearly all high-performance serial interconnects use this 8b/10b encoding, so Serial RapidIO does not suffer relative to other competing serial standards. Figure 2-15 depicts both RapidIO physical layer flavors along with several other high-performance interconnects, to show the intended domain of each. 35
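To put the encoding overhead in concrete terms, the following C sketch computes the effective data rate of the 1x and 4x Serial RapidIO variants after 8b/10b encoding. The 3.125 Gbaud lane rate used here is only a representative example rate and is not a parameter of the testbed used in this research.

```c
#include <stdio.h>

/* Back-of-the-envelope effective data rate for Serial RapidIO links,
 * illustrating the 20% overhead of 8b/10b encoding mentioned above.
 * ASSUMPTION: the 3.125 Gbaud lane rate is used only as a common example
 * rate; it is not a parameter taken from this thesis. */
int main(void)
{
    const double lane_baud    = 3.125e9;    /* symbols per second per lane   */
    const double encoding_eff = 8.0 / 10.0; /* 8b/10b: 8 data bits per 10 line bits */
    int lanes[] = { 1, 4 };                 /* 1x and 4x Serial RapidIO variants */

    for (int i = 0; i < 2; i++) {
        double payload_bps = lane_baud * encoding_eff * lanes[i];
        printf("%dx serial: %.2f Gb/s of data before packet overheads\n",
               lanes[i], payload_bps / 1e9);
    }
    return 0;
}
```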


Figure 2-15. RapidIO compared to other high-performance interconnects [35]. There are two basic packet types in the RapidIO protocol; (1) standard packets, and (2) control symbols. Standard packets include types such as request packets and response packets, and will include both a header as well as some data payload. Control symbols are small 32-bit symbols or byte patterns that have special meanings and function as control messages between communicating endpoints. These control symbols can be embedded in the middle of the transmission of a regular packet, so that critical control information does not have to wait for a packet transfer to complete in order to make it across the link. This control information supports physical layer functions such as flow control, link maintenance, and endpoint training sequence requests, among others. Error detection in RapidIO is accomplished at the physical layer in order to catch errors as early as possible, in order to minimize latency penalties. Errors are detected differently depending on the packet type, either through cyclic redundancy checks (CRC) on standard packets, or bitwise inverted replication of control symbols. For maximum-sized packets, the 36


RapidIO protocol will insert two CRC checksums into the packet, one in the middle and one at the end. By doing so, errors that occur near the beginning of a packet will be caught before the entire packet is transmitted, and the packet can be stopped early to decrease the latency penalty on packet retries. This error detection mechanism in the physical layer is what the Logical I/O and Globally-Shared Memory logical layers rely upon for guaranteed delivery if the application designer decides not to request explicit acknowledgements as mentioned previously. The physical layer will automatically continue to retry a packet until it is successfully transmitted; however, without explicit acknowledgement through the logical layer, the user application may have no way of knowing when a transfer completes. The RapidIO physical layer also specifies two types of flow-control; (1) transmitter-controlled flow-control, and (2) receiver-controlled flow-control. This flow-control mechanism is carried out by every pair of electrically-connected RapidIO devices. Both of these flow-control methods work as a Go-Back-N sliding window, with the difference lying in the method of identifying a need to re-transmit. The more basic type of flow-control is receiver-controlled, and relies on control symbols from the receiver to indicate that a packet was not accepted (e.g. if there is no more buffer space in the receiving endpoint). This flow-control method is required by the RapidIO standard to be supported by all implementations. The transmitter-controlled flow-control method is an optional specification that may be implemented at the designer's discretion, and provides slightly more efficient link operation. This flow-control variant works by having the transmit logic in each endpoint monitor the amount of buffer space available in its link partner's receive queue. This buffer status information is contained in most types of control symbols sent by each endpoint supporting transmitter-controlled flow control, including idle symbols, so all endpoints constantly have up-to-date information regarding their partner's buffer


space. If a transmitting endpoint observes that there is no more buffer space in the receiver, the transmitting node will hold off transmitting until it observes the buffer space being freed, which results in less unnecessary packet transmissions. The negotiation between receiver-controlled and transmitter-controlled flow-control is performed by observing a particular field of control symbols that are received. An endpoint that supports transmitter-controlled flow-control, endpoint A, will come out of reset assuming transmitter-controlled flow control. If endpoint A begins receiving control symbols from its partner, endpoint B, that indicate receiver-controlled flow control (i.e. the buffer status field of the symbol is set to a certain value), then endpoint A will revert to receiver-controlled flow-control by protocol with the assumption being that for some reason endpoint B does not support or does not currently want to support transmitter-controlled flow control. RapidIO endpoint implementations that do support both flow-control methods have a configuration register that allows the user to force the endpoint to operate in either mode if desired. The RapidIO steering committee is continually releasing extensions to the RapidIO standard in order to incorporate changes in technology and enhance the capabilities of the RapidIO protocol for future embedded systems. One such example of an extension is the RapidIO Globally-Shared Memory logical layer, which was released after the Logical I/O logical layer specification. Other protocol extensions include flow control enhancements to provide end-to-end flow control for system-level congestion avoidance, a new data streaming logical layer specification with advanced traffic classification and prioritization mechanisms, and support for multi-cast communication events, to name a few examples. These newer RapidIO protocol extensions are not used directly in this research, however, and as such they are mentioned for completeness but will not be discussed in further detail. 38
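The interaction between the two flow-control modes can be summarized with a small behavioral sketch, shown below. The structure is illustrative only: the field names, the sentinel value used to signal receiver-controlled operation, and the buffer-count handling are assumptions standing in for the actual control-symbol encoding defined by the RapidIO specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified sketch of the flow-control behavior described above. The
 * struct layout and field names (partner_free_bufs, tx_flow_control) are
 * hypothetical; they only mirror the buffer-status information carried in
 * RapidIO control symbols, not the actual bit layout. */
typedef struct {
    bool    tx_flow_control;   /* operating in transmitter-controlled mode */
    uint8_t partner_free_bufs; /* last advertised free receive-buffer count */
} link_state_t;

/* Called whenever a control symbol arrives from the link partner. */
void on_control_symbol(link_state_t *ls, uint8_t buf_status_field)
{
    /* Per the text: a partner that reports a reserved "receiver-controlled"
     * value in the buffer-status field forces this end to revert. The
     * sentinel 0xFF is an assumption standing in for that reserved value. */
    if (buf_status_field == 0xFF) {
        ls->tx_flow_control = false;
    } else {
        ls->partner_free_bufs = buf_status_field;
    }
}

/* Transmit gate: in transmitter-controlled mode, hold packets whenever the
 * partner advertises no free receive buffers, avoiding doomed transmissions
 * and the Go-Back-N retries they would otherwise trigger. */
bool may_transmit(const link_state_t *ls)
{
    return !ls->tx_flow_control || ls->partner_free_bufs > 0;
}
```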


CHAPTER 3 ARCHITECTURE DESCRIPTIONS This chapter will provide a detailed description of the entire testbed, progressing in a top-down manner. In addition to defining the hardware structures, the application development environment will be described so as to identify the boundary between hardware and software in this system, as well as define the roles of each. The first section will introduce the entire testbed architecture from a system-level, followed by a detailed look at the individual node architecture in the second section. This node architecture is a novel design that represents a major contribution and focus of this thesis research. Finally, the last section will present and discuss the architectures of each of the hardware processing engines designed in the course of this research. Testbed Architecture Recall that the high-performance processing system will ultimately be composed of a collection of cards plugged into a backplane and chassis. These cards will contain components such as processors, memory controllers, memory, external communication elements, as well as high-speed sensor inputs and pre-processors. Figure 3-1 illustrates a conceptual high-performance satellite processing system, of which the main processing resources are multiple processor cards, each card containing multiple processors, each processor with dedicated memory. The multiple processors on each card are connected together via a RapidIO switch, and the cards themselves are connected to the system backplane also through the same RapidIO switches. While Figure 3-1 shows only three processor cards, a real system could contain anywhere from four to sixteen such cards. The system depicted in Figure 3-1 matches both the kind of system boasted by SEAKR in [21], as well as the target system described by Honeywell, the sponsor of this thesis research. 39


Figure 3-1. Conceptual radar satellite processing system diagram. While it is cost-prohibitive to acquire or construct a full-scale system to match the conceptual hardware described above, it is possible to identify and capture the important parameters of such an architecture and study those parameters on a smaller hardware platform. There are three general parameters that affect overall application performance the most, (1) processor selection and performance, (2) processor-local memory performance, and (3) system interconnect selection and performance. The experimental testbed used for this thesis research was designed from the ground-up using very basic building blocks such as FPGAs on blank prototyping boards, a donated RapidIO IP core (complements of Xilinx), custom-designed PCBs, and hand-coded Verilog and C. There are a total of two Xilinx Virtex-II Pro FPGAs (XC2VP40-FF1152-6) and prototyping boards (HW-AFX-FF1152-300), connected to each other over RapidIO. In addition to this dual-FPGA foundation, a small collection of custom-designed PCBs were fabricated and assembled in 40


order to enhance the capabilities of the FPGAs on the blank prototyping boards. These PCBs include:
- a 2-layer global reset and configuration board
- a 2-layer cable socket PCB, for high-speed cable connection to the pins of the FPGA
- a 6-layer external SDRAM board: 128 MB, 64-bit @ 125 MHz
- a 10-layer switch PCB, hosting a 4-port Parallel RapidIO switch from Tundra
- a 2-layer parallel-port JTAG cable adapter for the Tsi500 RapidIO switch
While the design, assembly, and testing of these circuits represented a major investment of time and self-teaching, for this thesis research the circuits themselves are simply a means to an end to enable more advanced experimental case studies. As such, details of these circuits, such as the schematics, physical layouts, and bill of materials, are omitted from this thesis document. This information is instead provided as supplementary material (along with all Verilog and C source code for the testbed) in a separate document. The overall experimental testbed is a stand-alone entity; however, it does have several connections to a laboratory workstation for debugging and measurement. These connections will be discussed in detail in Chapter 4. All software runs on the embedded PowerPC405 processors in the FPGAs, and all other functionality of the testbed is designed in Verilog and realized in the reconfigurable fabric of the FPGA. Each node (FPGA) of the testbed is assigned a node ID, which forms the upper address bits of the 32-bit global address space. One node is defined to be the master, and is responsible for initiating processing in the system. Communication between nodes, both data and control, is performed over RapidIO, by writing or reading to/from pre-defined locations in the external DRAM of remote nodes (Figure 3-2).
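Because the node ID occupies the upper bits of the 32-bit global address space, local and remote accesses can be distinguished by simple address arithmetic. The following C sketch illustrates the idea; the 8-bit/24-bit split between node ID and local offset is an assumption made for illustration, as the exact field widths are not given here.

```c
#include <stdint.h>

/* Minimal sketch of the global address map described above.
 * ASSUMPTION: the node ID occupies the top 8 bits of the 32-bit global
 * address and the remaining 24 bits address the node's local SDRAM; the
 * actual field widths used on the testbed are not stated in this section. */
#define NODE_ID_SHIFT   24u
#define LOCAL_ADDR_MASK 0x00FFFFFFu

static inline uint32_t make_global_addr(uint8_t node_id, uint32_t local_addr)
{
    return ((uint32_t)node_id << NODE_ID_SHIFT) | (local_addr & LOCAL_ADDR_MASK);
}

static inline int is_remote(uint32_t global_addr, uint8_t my_node_id)
{
    /* The DMA engine makes the same local-vs-remote decision from the
     * upper address bits before routing a transfer to RapidIO or to the
     * local SDRAM controller. */
    return (global_addr >> NODE_ID_SHIFT) != my_node_id;
}
```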


Figure 3-2. Experimental testbed system diagram. Due to the potentially high cost of hardware, the high risk of prototype development, and the complexity of full-system design from near-zero infrastructure, several unavoidable restrictions were imposed by the experimental testbed used for this research. One major drawback to this hardware is the absence of larger-capacity external processing memory, such as QDR SRAM which is popular in todays embedded systems. The absence of this memory restricts the node to the memory available internal to the FPGA, which is much smaller. Because of this memory void, the maximum data-cube sizes able to be supported on this hardware are capped (though the performance results and conclusions drawn from those results remain valid). Furthermore, the high-risk nature of custom high-speed circuitry integration prevented the successful addition of a RapidIO switch to the testbed. In addition to limiting the testbed size, the absence of a switch also complicates the task of providing a high-speed data input into the system. As a result of that restriction, all data to be processed for a given experiment must be pre-loaded into system memory before processing begins, thus making a real-time processing demonstration impossible. Despite these challenges, an impressive system architecture is proposed and analyzed and valuable insight can still be gained from the experiments conducted. 42


Node Architecture The node architecture presented in this section is an original design proposed, implemented, and tested in order to enable the experimental research of this thesis. The design itself represents a significant development effort and contribution of this research, and it is important to review the details of the organization and operation of this architecture in order to understand its potential benefits for space-based radar applications. Each node of the testbed contains three major blocks: (1) an external memory interface, (2) a network interface to the high-performance interconnect, and (3) one or more processing elements and on-chip memory controller. Beyond simply providing these various interfaces or components, the node architecture must also provide a highly efficient data path between these components, as well as a highly efficient control path across the node to enable tight control of the nodes operation. This particular architecture features multiple processing units in each node, including both hardware processors implemented in the reconfigurable fabric and software processors in the form of hard-core embedded PowerPC405s. Given the dual-paradigm nature of each node, the optimal division of tasks between the hardware and software processors is critical to identify in order to maximize system performance. A block diagram illustrating the entire node architecture is shown in Figure 3-3. It should be noted here that the design is SDRAM-centric, meaning that all data transfers occur to or from external SDRAM. With a single bank of local memory for each node, the performance of that SDRAM controller is critical to ultimate system performance. Most processors today share this characteristic, and furthermore the conventional method of providing connectivity to the external memory is through a local bus. Even if the system interconnect is a packet-switched protocol, internal to each node the various components (e.g. functional units, network interface, etc.) are connected over a local bus, such as CoreConnect from IBM or 43


AMBA Bus [43]. This local bus forces heavyweight bus protocol overheads and time division multiplexing for data transfer between different components of an individual node. An alternate approach to providing node-level connectivity to external memory is to use the programmable fabric of the FPGA to implement a multi-ported memory controller. The multi-port memory controller will provide a dedicated interface for each separate component, thus increasing intra-node concurrency and allowing each port to be optimized for its particular host. Each port of the memory controller has dedicated data FIFOs for buffering read and write data, allowing multiple components to transfer data to/from the controller concurrently, as well as dedicated command FIFOs to provide parallel control paths. Figure 3-3. Node architecture block diagram. The control network within each node of the testbed provides the ability to request locally-initiated transfers, control co-processor activity, and handle incoming RapidIO requests 44


transparently. Locally-initiated data movement is controlled through a DMA engine driven by the PowerPC (shown as two components in Figure 3-3, the DMA controller and the command controller). Based on the requested source and destination memory addresses, the DMA engine determines if the memory transfer is local or remote, and sends personalized commands to the command FIFO of the external memory controller as well as the command FIFO of the other component, as appropriate for the transfer type. Each component acknowledges the completion of its portion of a data transfer to the DMA engine, providing a feedback path back to the PowerPC to indicate the successful completion of memory transfers. Co-processor activity, by contrast, is controlled by a dedicated control bus directly connecting the PowerPC to the co-processors (shown as the dark red line near the bottom of Figure 3-3). Similar to DMA transfers, processing and configuration commands are submitted to the co-processors and their activity is monitored using this control bus. External Memory Controller As the central component to each node in the system, the external memory controller affects both interconnect as well as processor efficiency. As such, it is the most high-performance part of the system, capable of handling traffic from both the interconnect as well as the processors concurrently. The interface to the controller is composed of multiple ports, each dedicated to a single component in the node providing dedicated data paths for every possible direction of transfer. All connections to the external memory controller interfaces are through asynchronous FIFOs, meaning that not only can the external memory controller operate in a different clock domain from the rest of the node, it can also have a different data path width. This logic isolation strategy is a key technique used through the entire node design, in order to enable performance optimization of each section independent of the requirements of other 45


sections of the design. Figure 3-4 shows a very simplified block diagram of the internal logic of the external memory controller. Figure 3-4. External memory controller block diagram. The external memory controller is based on a single state machine that constantly monitors the two command queues. Notice that in Figure 3-4, there are two command FIFOs leading into the external memory controller; one of these command FIFOs is for locally-requested memory transfers (lower) and the other is for incoming remote transfer requests from the RapidIO fabric (upper). Commands are decoded and executed as soon as they arrive, with a simple arbitration scheme that prevents starvation of either interface to the controller. To further prevent congestion at the external memory controller of each node, the external memory controller data throughput is twice that of either the on-chip memory controller or the RapidIO interface. This over-provisioning of the bandwidth ensures that even if a processor is writing or reading to/from external memory, incoming requests from remote nodes can also be serviced at the full RapidIO link speed without interfering with the processor traffic through the controller. 46
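The command-dispatch behavior of the external memory controller can be approximated in software as follows. The actual controller is a Verilog state machine; this C sketch merely illustrates one simple alternating service order between the two command FIFOs that would prevent starvation of either interface, since the exact arbitration policy is not detailed here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Behavioral sketch of the external memory controller's command loop.
 * The alternating (round-robin style) service order shown is one plausible
 * way to "prevent starvation of either interface"; the policy actually
 * implemented on the testbed is not specified in this section. */
typedef struct { uint32_t addr; uint32_t len; bool is_write; } mem_cmd_t;

typedef struct {
    mem_cmd_t *entries;
    unsigned   head, tail, size;
} cmd_fifo_t;

static bool fifo_empty(const cmd_fifo_t *f) { return f->head == f->tail; }

static mem_cmd_t fifo_pop(cmd_fifo_t *f)
{
    mem_cmd_t c = f->entries[f->head];
    f->head = (f->head + 1) % f->size;
    return c;
}

void memory_controller_step(cmd_fifo_t *local_fifo,   /* DMA-engine requests        */
                            cmd_fifo_t *remote_fifo,  /* incoming RapidIO requests  */
                            void (*execute)(mem_cmd_t))
{
    static bool serve_remote_first = false;  /* flips each call: fair service */

    cmd_fifo_t *first  = serve_remote_first ? remote_fifo : local_fifo;
    cmd_fifo_t *second = serve_remote_first ? local_fifo  : remote_fifo;

    if (!fifo_empty(first))       execute(fifo_pop(first));
    else if (!fifo_empty(second)) execute(fifo_pop(second));

    serve_remote_first = !serve_remote_first;
}
```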


Network Interface Controller The network interface controller is a complex component, due to the multiple independent interfaces to the actual RapidIO core. In order to provide true full-duplex functionality, the user interface to the network controller (at the logical layer of RapidIO) must provide four parallel control and data paths, which translates to a total of four independent state machines. This multi-interface requirement is common for all full-duplex communication protocols, and the network interface controller would have a similar architecture even if Ethernet or some other interconnect standard was being used. The four interfaces to the RapidIO logical layer core can be conceptually grouped into two pairs of interfaces, and are labeled as follows: (1a) initiator request port, (1b) initiator response port, (2a) target request port, and (2b) target response port. The two initiator ports handle outgoing requests and incoming responses to locally-initiated remote transfers. The two target ports handle incoming requests and outgoing responses to remotely-initiated remote transfers. A simplified block diagram of the internal architecture of the network interface controller is illustrated in Figure 3-5. Figure 3-5. Network interface controller block diagram. 47


The reason for the multi-port requirement in full-duplex communication protocols is based on the idea that even if the physical layer supports full-duplex communication, if the upper-layer logic cannot do the same then a bottleneck would exist at the input to the interconnect fabric, and the full-duplex capability would never be used. For example, imagine if instead of four ports, there were only two, an initiator port and a target port. The initiator port would be responsible for handling outgoing requests and incoming responses, and by having the same logic that is performing transmissions need to also receive responses it would prevent the true ability to send and receive at the same time. Alternately, consider the case where the two ports are simple transmit and receive ports. In this case, while requests could be transmitted concurrently while receiving responses, a problem arises when remotely-initiated requests come in off the fabric in the middle of a locally-initiated transmission. It should now be clear why there is a need for four independent upper-layer interfaces to any full-duplex communication protocol in order to maximize network efficiency. The two pairs of ports, the initiator pair and the target pair, are completely isolated from one-another. Within each pair, however, the two interface state machines must exchange some information. Small FIFOs take information from the initiator request and target request state machines, in order to inform the initiator response and target response state machines of the need to receive/send a response. The information contained in these messages is minimal, including request type, size, source ID (which is the destination ID for the response), etc. The initiator request state machine receives messages through the command FIFO from the main DMA engine when the PowerPC requests a remote read or write, and informs the initiator response state machine through the internal message FIFO described above if there will be any responses to expect. The target request state machine, by contrast, sits idle and waits for new requests to 48


come in from the RapidIO fabric. When a new request comes in, the target state machine informs the target response state machine to prepare to send responses, and also places a command in its command FIFO to the external memory controller that includes the size, address, and type of memory access to perform. In order to operate synchronously with the RapidIO link, as a requirement of the Xilinx RapidIO core all logic in the network interface controller operates at 62.5 MHz, which also happens to be half of the frequency of the external memory controller. On-Chip Memory Controller The on-chip memory (OCM) controller is fairly simple, and provides the mechanisms necessary to enable basic read or write transactions between the FIFOs of the main data path and the co-processor SRAM memories. Internal to the OCM controller, the data width is reduced to 32 bits from 64 bits on the SDRAM-side of the data path FIFOs, while the clock frequency remains the same at 125 MHz. The data interface to each co-processor is a simple SRAM port; as such, multiple co-processor interfaces can be logically combined to appear as one SRAM unit, with different address ranges corresponding to the memories physically located in different co-processors. In the baseline node design, there is no method of transferring data directly from one co-processor to another, nor is there any way to transfer data directly from the co-processor memories to the RapidIO interface controller. There are arguable performance benefits to be gained by enabling such data paths, and these architecture enhancements will be considered later as part of experimentation. However, the addition of such capabilities does not come without cost, the tradeoff in this case being an increase in development effort through a rise in the complexity of the control logic, as well as potentially increasing the challenge of meeting timing constraints by crowding the overall design. 49


PowerPC and Software API
Overall control of each node's operation is provided by one of the two embedded PowerPC405s in each FPGA. This PowerPC is responsible for initiating data movement throughout the system, as well as controlling co-processor activity. However, the PowerPC is not used for any data computation or manipulation, other than briefly in some corner-turn implementations. The PowerPC405 offers relatively low performance, and moreover this particular node design does not give the PowerPC a standard connection to the external memory. The PowerPC processor itself is directly integrated with several peripherals implemented in the reconfigurable fabric of the FPGA. These peripherals include a UART controller, several memory blocks, and general-purpose I/O (GPIO) modules for signal-level integration with user logic. One of the memory blocks is connected to the OCM controller to provide a data path for the PowerPC, and the GPIO modules connect to architecture components such as the DMA engine or the co-processor control busses. The external memory controller could be dedicated to the PowerPC through its standard bus interface; however, that would prevent other components in the system from connecting directly to the external memory controller (recall that the multi-port memory controller is a boasted feature of this node architecture). Instead, the PowerPC is used exclusively for control, and leaves data processing tasks to the co-processor engines. To support application development and provide a software interface to the mechanisms provided by the node architecture, an API was created containing driver functions that handle all signal-level manipulation necessary to initiate data movement, control co-processors, and perform other miscellaneous actions. The functions of this API provide a simple and intuitive wrapper to the user for use in the application code. Table 3-1 lists and describes the most important functions in this API, where classes D, M, and P mean data movement, miscellaneous, and processor control, respectively.


Table 3-1. Software API function descriptions. Function Class Description dma_blocking D Most common data transfer function. Provide source address, destination address, and size. Function will not return until entire transfer has completed. dma_nonblocking D Non-blocking version of same function described above. This function will start the data transfer, and will immediately return while data is still being moved by the system. dma_ack_blocking D Explicit function call necessary to acknowledge DMA transfers initiated with non-blocking calls. This function, once called, will not return until the DMA transfer completes. dma_ack_nonblocking D Non-blocking version of the acknowledgement function. This function will check if the DMA transfer is complete yet, and will return immediately either way with a value indicating the status of the transfer. barrier M Basic synchronization function, necessary to implement parallel applications. This function mimics the MPI function MPI_Barrier(), and ensures that all nodes reach the same point in the application before continuing. coproc_init P Co-processor initialization function. Provide co-processor ID, and any configuration data required. The function will return once the co-processor has been initialized and is ready for processing. coproc_blocking P Most common processor activity function. Provide co-processor ID, and any runtime data required. This function is blocking, and will not return until the co-processor indicates that it has completed processing. coproc_nonblocking P Non-blocking version of the processing function described above. The function will initiate processing on the indicated co-processor, and will immediately return while the co-processor continues to process. coproc_wait P Explicit function necessary to acknowledge co-processor execution runs initiated with the non-blocking function call. Co-processor Engine Architectures Each of the co-processor engines was designed to appear identical from the outside, presenting a standardized data and control interface for an arbitrary co-processor. Since all co-processors share a standard interface and control protocol, the common top-level design will be described first before briefly defining the architecture and internal operation of each specific co-processor engine in the following sections. The location of these co-processors in the overall node architecture is shown near the bottom of Figure 3-3, labeled HW module 1 through HW 51


module N. The co-processor interface consists of a single data path, and a single control path. Data to be processed is written into the co-processor engine through the data port, and processed data is read back out of this port when processing completes. The control port provides the ability to configure the engine, start processing, and monitor engine status. Figure 3-6 illustrates this generic co-processor wrapper at the signal-level. Figure 3-6. Standardized co-processor engine wrapper diagram. Recall that all co-processor engines contain a small amount of SRAM memory (maximum of 32 KB), and that this memory is the main method of passing data to and from the engine. The single external data port can address this entire internal SRAM address space, through a standard SRAM interface containing a data bus input and output, as well as an address input, a chip select, and a read/write select signal (single-cycle access time). Regardless of what the processing engine is designed to do, data is written-to and read-from its processing memory like a simple SRAM, with certain address ranges defined for various purposes internal to the engine. All internal SRAMs, interchangeably referred to as BlockRAMs or BRAMs, of Xilinx FPGAs are true dual-port memories. This feature is critical to the design of this style of I/O for co-processor engines. With one port of each BRAM dedicated to the external data interface, the other port is used by the actual processing logic. 52


The control port of each co-processor wrapper is divided into two busses, one input and one output. Each bit or bit field of these control busses has a specific definition, as shown below in Table 3-2. When using multiple co-processing engines in a single FPGA, the control ports are combined as follows:
- The control bus outputs from each co-processor are kept separate, and monitored individually by the main processor (PowerPC)
- All control bus inputs to the co-processors are tied together

Table 3-2. Co-processor wrapper signal definitions.
- clock (input, 1 bit): Clock signal for logic internal to the actual processor
- reset_n (input, 1 bit): Global reset signal for the entire design
- coproc_id (input, 2 bits): Hard-coded co-processor ID; each command received on the control bus is compared to this value to see if this co-processor is the intended target of the command
- data_in (input, 32 bits): Data bus into the co-processor for raw data
- data_out (output, 32 bits): Data bus out of the co-processor for processed data
- address (input, 32 bits): Address input to the co-processor
- chip_select (input, 1 bit): Enable command to memory; read or write access determined by wr_enable
- wr_enable (input, 1 bit): Write-enable signal; when chip_select is asserted and wr_enable is asserted, a write is issued to the address present on the address bus; when chip_select is asserted and wr_enable is deasserted, a read is performed from the address specified on the address bus
- control_in (input, 16 bits): Control bus input into the co-processor from the PowerPC. The bits of the bus are defined as follows: 0-7: configuration data; 8-9: coproc_ID; 10: blocking (1)/non-blocking (0); 11: synchronous reset; 12: reserved (currently un-used); 13: command enable; 14-15: command type (config, start, or reserved)
- control_out (output, 4 bits): Control bus from the co-processor to the PowerPC. The bits of the bus are defined as follows: 0: busy; 1: done; 2: uninitialized; 3: reserved (un-used)


Some of the co-processor engines have run-time configurable parameters in the form of internal registers in the co-processor engine itself. One example of such a configurable parameter is for the CFAR processor, where it needs to be told what the range dimension is of the current data-cubes so it knows how much to process before clearing its shift register for the next processing run. As part of the wrapper standard, all co-processors must be initialized or configured at least once before being used for processing. Whether or not the co-processor currently supports configurable parameters, according to the standard upon coming out of reset all co-processors will have their uninitialized bit set on the control bus, indicating that they need to be configured. The PowerPC will initialize each co-processor one at a time as part of its startup sequence by submitting config commands along with the appropriate coproc_ID and data. Once initialized, the co-processor will clear its uninitialized bit and can be used for processing at any time. The co-processor may also be re-configured with different parameters by executing another config command to set the config registers with the new data. This feature of all co-processors enables virtual reconfiguration, as proposed by a recent Masters Thesis [24]. To perform data processing with the co-processor, the PowerPC will first perform a DMA transfer of the data to be processed from external memory into the co-processors address space. The exact memory map of each co-processor varies, and will be defined in the appropriate co-processor section later. Once the data has been transferred into the engine, the PowerPC submits a start command to the appropriate engine, and monitors the busy and done bits of that co-processors control bus output. Upon observing the done bit being asserted, another DMA transfer can be performed to move the processed data out of the engine and back to external storage. It should be noted here that all of the co-processor engines implement double-buffering in order to maximize processing efficiency and minimize idle time. 54
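Putting the control sequence together, a single processing run maps onto the API of Table 3-1 roughly as shown below. The function names and overall flow follow the table and the description above, but the argument lists, buffer addresses, and co-processor ID are hypothetical, since exact function signatures are not reproduced in this document.

```c
/* Sketch of one co-processor run using the API of Table 3-1, following the
 * sequence described above: DMA data in, issue a start command, wait for
 * done, then DMA the results back out. All constants below are assumed for
 * illustration, and the prototypes are guesses at the real signatures. */
extern void dma_blocking(unsigned int src_addr, unsigned int dst_addr,
                         unsigned int num_bytes);          /* assumed signature */
extern void coproc_blocking(int coproc_id, unsigned int runtime_data); /* assumed */

#define PC_COPROC_ID      0          /* assumed ID of the pulse compression engine */
#define SDRAM_INPUT_ADDR  0x00100000 /* assumed external SDRAM source buffer       */
#define SDRAM_OUTPUT_ADDR 0x00200000 /* assumed external SDRAM result buffer       */
#define COPROC_IN_BUF     0xC0000000 /* assumed co-processor input buffer address  */
#define COPROC_OUT_BUF    0xC0001000 /* assumed co-processor output buffer address */
#define BUF_BYTES         4096       /* one 4 KB buffer, as in the pulse compression engine */

void process_one_buffer(void)
{
    /* One-time configuration is assumed to have been done at startup via
     * coproc_init(PC_COPROC_ID, config_data). */

    /* Move one buffer of raw data into the engine's SRAM address space. */
    dma_blocking(SDRAM_INPUT_ADDR, COPROC_IN_BUF, BUF_BYTES);

    /* Kick off processing and block until the engine reports done. */
    coproc_blocking(PC_COPROC_ID, 0 /* runtime data */);

    /* Retrieve the processed buffer back to external SDRAM. */
    dma_blocking(COPROC_OUT_BUF, SDRAM_OUTPUT_ADDR, BUF_BYTES);
}
```

In practice the non-blocking variants and double-buffering described above would be used to overlap these transfers with processing; the blocking calls are shown only to keep the sequence easy to follow.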


Pulse Compression Engine The pulse compression engine design adopted for this research follows that of a commercial IP core [9] by Pentek, Inc. The engine is centered around a pipelined, selectable FFT/IFFT core that provides streaming computation, an important feature for this high-performance design (Figure 3-7). Recall that pulse compression operates along the range dimension, and is composed of an FFT, followed by a point-by-point complex vector multiply with a pre-computed vector (also stored in the co-processors SRAM), and finally an IFFT of the vector product. To implement this algorithm, the aforementioned FFT core is re-used for the final IFFT as well, in order to minimize the amount of resources required to implement the pulse compression co-processor. The numerical precision used for the pulse compression co-processor is 16-bit fixed point (with 8.8 decimal format) as suggested in Aggarwals work [24], resulting in 32 bits per complex data element. Figure 3-7. Pulse compression co-processor architecture block diagram. 55


Once given a command to begin processing, the engine will start by reading data out of the specified input buffer directly into the FFT core, through a multiplexer. As soon as possible, the FFT core is triggered to begin processing, and the state machine waits until the core begins reading out the transformed results. At this time, the transformed data exits the FFT core one element at a time, and is sent through a demultiplexer to a complex multiplier. The other input of the multiplier is driven by the output of the vector constant SRAM, which is read in synch with the data output of the FFT core. The complex multiplier output is then routed back to the input of the FFT core, which by this point has been configured to run as an IFFT core. The final IFFT is performed, and the core begins outputting the pulse compression results. These results are routed to the specified output buffer, and the state machine completes the processing run by indicating to the PowerPC through the control interface outputs that the co-processor is finished. The architecture of the co-processors SRAM bank is an important topic that warrants discussion, and the pulse compression engine will be used as an example. The pulse compression engine contains a total of five independent buffers or SRAM units (see Figure 3-7): two input buffers, two output buffers, and a single vector constant buffer. Each buffer is 4 KB in size, resulting in a total pulse compression co-processor memory size of 20 KB. The two SRAM features taken advantage of for this architecture are (1) true dual-port interface, and (2) ability to combine smaller SRAMs into one larger logical SRAM unit. The external data/SRAM port to the co-processor has both read and write access to the entire 20 KB of memory, by combining one port of all SRAMs to appear as single logical SRAM unit (according to the memory map shown in Figure 3-7). However, there is no rule to say that the other port of each SRAM unit must be combined in the same manner; therefore, the other port of each SRAM unit is kept independent, to allow concurrent access of all buffers to the internal computational engine. 56
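For reference, the data flow of the pulse compression engine corresponds to the following floating-point model. This is only a behavioral sketch: the hardware engine operates in 16-bit s.7.8 fixed point and reuses a single streaming FFT/IFFT core, and the fft_inplace/ifft_inplace routines below are assumed stand-ins for that core rather than part of the testbed code.

```c
#include <complex.h>
#include <stddef.h>

/* Floating-point reference model of the pulse compression data flow
 * described above: FFT along range, point-by-point multiply with the
 * pre-computed (frequency-domain) vector, then IFFT. The helper routines
 * are assumed; any standard FFT implementation could be substituted. */
void fft_inplace(float complex *x, size_t n);   /* assumed: forward FFT */
void ifft_inplace(float complex *x, size_t n);  /* assumed: inverse FFT */

void pulse_compress_range_line(float complex *range_line,
                               const float complex *freq_domain_filter,
                               size_t n_ranges)
{
    fft_inplace(range_line, n_ranges);

    /* Point-by-point complex multiply with the pre-computed vector, which
     * the hardware engine keeps in its vector-constant SRAM buffer. */
    for (size_t i = 0; i < n_ranges; i++)
        range_line[i] *= freq_domain_filter[i];

    ifft_inplace(range_line, n_ranges);
}
```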


Doppler Processing Engine The Doppler processing engine is very similar to the pulse compression co-processor, also based on an FFT core at its heart. Recall that Doppler processing is composed of a complex vector multiplication followed by an FFT. Upon receiving a command to begin processing, the input data will be read out of the specified input buffer concurrently with the pre-computed constant vector from its vector buffer (as indicated below in Figure 3-8), and fed into a complex multiplier unit. The product output of this multiplier is used to drive the input of the FFT core, which computes the transform before unloading the transformed results directly into the output buffer. Like the pulse compression co-processor, the Doppler processor unit is originally designed assuming 16-bit fixed-point precision (32-bits per complex element), with an 8.8 decimal format. Figure 3-8. Doppler processing co-processor architecture block diagram. 57
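A corresponding behavioral model of the Doppler processing engine reverses the order of operations, multiplying by the pre-computed constant vector before the transform. As before, the floating-point arithmetic and the assumed fft_inplace helper only approximate the 16-bit fixed-point (8.8 format) hardware.

```c
#include <complex.h>
#include <stddef.h>

/* Floating-point reference model of the Doppler processing flow described
 * above: element-wise multiply by the pre-computed constant vector, then a
 * single FFT along the pulse dimension. fft_inplace() is the same assumed
 * helper as in the pulse compression sketch. */
void fft_inplace(float complex *x, size_t n);   /* assumed forward FFT */

void doppler_process_pulse_line(float complex *pulse_line,
                                const float complex *constant_vector,
                                size_t n_pulses)
{
    for (size_t i = 0; i < n_pulses; i++)
        pulse_line[i] *= constant_vector[i];

    fft_inplace(pulse_line, n_pulses);
}
```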


Beamforming Engine
The beamforming co-processor is based on a matrix multiplication core described in [44]. This engine computes all output beams in parallel, using as many multiply-accumulate (MAC) units as there are output beams. The weight vector provided by the AWC step is stored in a specially-designed SRAM module that provides as many independent read-only ports to the internal engine as there are MAC units. By doing so, the input data-cube can be read out of the input buffer one element at a time in a streaming fashion (recall that data is organized along the channel dimension for beamforming), and after an entire channel dimension has been read out, an entire output beam dimension will have been computed. While the next channel dimension is being read out of the input buffer, the previously-computed beam is registered and stored in the output buffer one element at a time. The complex-format beamformed output is reduced to a partial magnitude (the square root is omitted) before being written to the output buffer as 32-bit real values.
Figure 3-9. Beamforming co-processor architecture block diagram.


Figure 3-9 illustrates the architecture of the beamforming co-processor. Notice that unlike the other co-processor designs, the input buffers and output buffers are different sizes. This asymmetry is due to the reduction of one dimension of the data-cube from channels to beams, the assumption being that the number of beams formed will be less than the number of channels in the original cube. The actual computations performed by this architecture can be better visualized by considering Figure 3.10 below. The weight vector for each beam to be formed is stored in a small, independent buffer, so that the weights for all beams can be read in parallel. The elements of the input data-cube (B c,p,r ) are read sequentially from the input buffer, and broadcast to all of the MAC units. The other input to each MAC unit is driven by its respective weight vector buffer, and so every C clock cycles (where C = channel dimension length) an entire beam dimension is completely processed. These values will be registered and held for storage, while the next set of beam elements are computed by re-reading the weight vectors while reading the next C element of the input data-cube. Figure 3-10. Illustration of beamforming computations. 59
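The computation of Figure 3-10 can be expressed as the reference model below, where the inner loop over beams corresponds to the bank of parallel MAC units in hardware and the partial magnitude omits the square root as described above. The floating-point types are only a stand-in for the engine's fixed-point arithmetic.

```c
#include <complex.h>
#include <stddef.h>

/* Reference model of the beamforming computation in Figure 3-10: for each
 * (pulse, range) sample vector across channels, every output beam is a
 * weighted sum over the channel dimension. In the hardware engine the beam
 * loop is fully unrolled into parallel MAC units, so one beam dimension is
 * finished every C clock cycles; here it is an ordinary loop. */
void beamform_sample(const float complex *channel_samples, /* length n_channels       */
                     const float complex *weights,         /* n_beams x n_channels    */
                     float *beam_out,                       /* length n_beams          */
                     size_t n_channels, size_t n_beams)
{
    for (size_t b = 0; b < n_beams; b++) {
        float complex acc = 0.0f;
        for (size_t c = 0; c < n_channels; c++)
            acc += weights[b * n_channels + c] * channel_samples[c];

        /* Partial magnitude: |acc|^2, square root omitted as in the text. */
        beam_out[b] = crealf(acc) * crealf(acc) + cimagf(acc) * cimagf(acc);
    }
}
```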


CFAR Engine The CFAR detection co-processor architecture is based on a design presented in [16]. The major component of the CFAR engine is a shift register, as wide as each data element and as long as the CFAR sliding window length. Once processing begins, data is read out of the appropriate input buffer and fed into this shift register. Several locations along the shift register are tapped and used in the computation of the running local averages as well as for the thresholding against the target cell. Recall that the ultimate goal of CFAR is to determine a list of targets detected from each data-cube [15]. This output requirement implies that the entire data-cube does not need to be outputted from CFAR, and in fact the minimum required information results in a basically negligible output data size. However, to provide a simple method of verification of processing results, the output of this CFAR co-processor is the entire data-cube, with zeros overwriting all elements not deemed to be targets. To provide realistic processing latency, for all performance measurements no output data is read from the processor. Figure 3-11. CFAR detection co-processor architecture block diagram. 60


As mentioned back in Chapter 2, CFAR is not only parallel at a coarse-grained level with each Doppler bin being processed independently, but the computations performed for each consecutive data element also contain significant parallelism through partial results reuse of the locally-computed averages [15]. For each averaging window (left and right), almost all data elements to be summed remain the same between two adjacent target cells, with the exception of one data element on each end of the windows. As the CFAR window moves one element over, the oldest element needs to be subtracted from the sum, and the newest element added to the sum. By taking advantage of this property, CFAR detection can be performed O(n), where n represents the length of the range dimension, independent of the width of the sliding window. 61
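The O(n) update can be made concrete with the following reference model of the sliding-window computation. The guard-cell count, the threshold scale factor, and the handling of cells near the ends of the range dimension are assumptions made for illustration only; the thesis does not give their values in this section.

```c
#include <stddef.h>

/* Reference model of the O(n) sliding-window CFAR described above: as the
 * window slides one cell, the oldest sample leaves each averaging window
 * and a new one enters, so the local sums are updated with one add and one
 * subtract instead of being recomputed from scratch. Boundary cells without
 * full windows are simply skipped in this sketch. */
void cfar_detect(const float *power, int *is_target, size_t n_ranges,
                 size_t window, size_t guard, float scale)
{
    if (n_ranges < 2 * (window + guard) + 1)
        return;

    /* Initial sums for the first target cell with full left/right windows. */
    size_t first = window + guard;
    float left_sum = 0.0f, right_sum = 0.0f;
    for (size_t i = 0; i < window; i++) {
        left_sum  += power[first - guard - 1 - i];
        right_sum += power[first + guard + 1 + i];
    }

    for (size_t t = first; t + window + guard < n_ranges; t++) {
        float local_avg = (left_sum + right_sum) / (2.0f * window);
        is_target[t] = power[t] > scale * local_avg;

        /* Slide both windows one cell to the right: an O(1) update. */
        if (t + guard + window + 1 < n_ranges) {
            left_sum  += power[t - guard]              - power[t - guard - window];
            right_sum += power[t + guard + window + 1] - power[t + guard + 1];
        }
    }
}
```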


CHAPTER 4 ENVIRONMENT AND METHODS This chapter will briefly describe the experimental environment and methods used in the course of this research. The first section will define the interfaces to the testbed available for the user, followed by a detailed description of measurement techniques and experimental procedures in the second section. Finally, the third section will formally define all assumptions, metrics, and parameters relevant to the experiments carried out. Experimental Environment The RapidIO testbed is connected to a number of cables and devices, providing a control interface, a method of file transfer, and measurement capabilities. The testbed connects to a workstation through both an RS-232 serial cable, as well as a USB port for configuration through JTAG. Additionally, there is a PC-based logic analyzer that connects to up to 80 pins of the FPGAs, as well as to the user workstation through another USB port for transfer of captured waveforms. The embedded PowerPCs in each FPGA drive the UART ports on the testbed, and the user can monitor and control testbed progress through a HyperTerminal window or other similar UART communication program (only the master nodes UART port is used, for a single UART connection to the workstation). This HyperTerminal window interface provides the ability to use printf()-style printouts to display application progress as well as assist in debugging to a degree, and also enables file transfer between the user workstation and the RapidIO testbed. In addition to the testbed itself and the application that runs on it, a suite of basic utility C programs was written to run on the workstation for functions such as file creation, file translation to/from the testbed data file format, comparison of testbed results with golden results, etc. Figure 4-1 illustrates the relationship between the user, the workstation, the testbed, and the various cables and connections. 62


Figure 4-1. Testbed environment and interface. Measurement Procedure The main tool for collecting timing and performance measurements is a PC-based logic analyzer from Link Instruments, Inc. The logic analyzer has a maximum sampling frequency of 500 MHz, providing extremely fine-grained timing measurements of down to 2 ns. With up to 80 channels available on the logic analyzer, the desired internal signals can be brought out to un-used pins of the FPGA to provide a window into the internal operation of the design. The only signals that are invisible, or unable to be brought out for capture, are the signals internal to the Xilinx RapidIO core, which are intentionally kept hidden by Xilinx as proprietary information. However, the interfaces to the different RapidIO cores are visible, and provide sufficient 63


information to accurately characterize the operation of the interconnect. Generally, only one logic analyzer capture can be taken per processing run on the testbed, due to the incredibly short execution time of the testbed compared to the speed of the logic analyzer data transfer over USB for the first capture. However, two key advantages of using a PC-based logic analyzer are the convenience afforded by a large viewing area during use, as well as the ability to save captures for archiving of results. The experimental procedure is fairly simple and straightforward, however it will be explicitly defined here for completeness. Assuming the proper files for programming the FPGAs are on-hand, the first thing the experimenter must do is prepare the input data files to be sent to the testbed using the utility programs that run on the user workstation. These files include data such as pre-computed vector constants, data-cubes for GMTI processing, or other data for debugging and testing, whatever the current experiment calls for. With all files on-hand, the testbed is powered on and the FPGAs are configured with their appropriate bitfiles. The user is greeted with a start-up log and prompt through the HyperTerminal interface, and transfers the data files from the workstation to the testbed as prompted. Once all files have been loaded, the experimenter is prompted one final time to press any key to begin the timed software procedure. Before doing so, the measurement equipment is prepared and set to trigger a capture as needed. The user may then commence the main timed procedure, which typically completes in under a second. The logic analyzer will automatically take its capture and transfer the waveform to the workstation for display, and the testbed will prompt the user to accept the output file transfer containing the processing results. At this point, the user may reset and repeat, or otherwise power down the testbed, and perform any necessary post-analysis of the experimental results. 64


In addition to the logic analyzer which provides performance measurements, the previously-mentioned suite of utility programs provide the necessary tools for data verification of output files from the testbed. For reference, a parallel software implementation (using MPI) of GMTI was acquired from Northwestern University, described in [10]. This software implementation comes with sample data-cubes, and provides a method of verification of processing results from the RapidIO testbed. Furthermore, this software implementation was used for some simple performance measurements in order to provide some basis on which to compare the raw performance results of the testbed. Metrics and Parameter Definitions There are two basic types of operations that make up all activity on the testbed, (1) DMA transfers and (2) processing runs. DMA transfers involve a start-up handshake between the PowerPC and the hardware DMA controller, the data transfer itself, and an acknowledgement handshake at the completion. A processing run refers to the submission of a process command to a co-processor engine, and waiting for its completion. Each processing run does not involve a start-up handshake, but does involve a completion-acknowledgement handshake. In order to characterize the performance of the testbed, three primitive metrics are identified that are based on the basic operations described above. All metrics are defined from the perspective of the user application, in order to provide the most meaningful statistics in terms of ultimately-experienced performance, and so the control overhead from handshaking is included. These different latency metrics are defined as shown here: t xfer = DMA transfer latency t stage = processing latency of one kernel for an entire data-cube t buff = processing latency for a single process command 65


Based on these latency definitions (all of which may be measured experimentally), other classical metrics may be derived, for example throughput. Data throughput is defined as follows:

$throughput_{data} = \frac{N_{bytes}}{t_{xfer}}$ (4.1)

where the number of bytes ($N_{bytes}$) will be known for a specific transfer. Another pair of important metrics used in this research is processor efficiency, or the percentage of time that a processor is busy, and memory efficiency, which illustrates the amount of time that the data path is busy transferring data. Processor efficiency can be an indicator of whether or not the system is capable of fully utilizing the processor resources that are available, and can help identify performance bottlenecks in a given application on a given architecture. Memory efficiency, by contrast, helps indicate when there is significant idle time in the data path, meaning additional co-processor engines could be kept fed with data without impacting the performance of the co-processor engine(s) currently being used. These efficiency metrics are defined as follows:

$eff_{memory} = \frac{\alpha \cdot M}{t_{stage} \cdot throughput_{ideal}}$ (4.2)

$eff_{proc} = \frac{N \cdot t_{buff}}{t_{stage}}$ (4.3)

For memory efficiency, M is a general term representing the size of an entire data-cube (per processor), and $\alpha$ is a scaling factor (usually 1 or 2) that addresses the fact that, for any given stage, the entire data-cube might be both sent to and received from the co-processor engine. Some stages, for example beamforming or CFAR, either reduce or completely consume the data, and thus for those cases $\alpha$ would be less than two. Basically, the numerator of the memory efficiency metric represents how much data is transferred during the time it took to process the cube, where

PAGE 67

the denominator represents how much data could have been transferred in that same amount of time. For processor efficiency, N is a general term for the number of process commands that are required to process an entire data-cube. Recall that the t_stage term will include overheads associated with repeated process and acknowledge commands being submitted to the co-processor, as well as overhead associated with handling data transfers in the midst of the processing runs. Any efficiency value less than 100% will indicate a processor that has to sit idle while control handshakes and data transfers are occurring.

In addition to these metrics, there are a handful of parameters that are assumed for each experiment in this thesis, unless otherwise noted for a specific experiment. These parameters (defined in Table 4-1) include the data-cube size, the system clock frequencies, memory sizes, as well as the standard numerical format for all co-processors.

Table 4-1. Experimental parameter definitions.

Parameter                 Value      Description
Ranges                    1024       Range dimension of raw input data-cube
Pulses                    128        Pulse dimension of raw input data-cube
Channels                  16         Channel dimension of raw input data-cube
Beams                     6          Beam dimension of data-cube after beamforming
Processor Frequency       100 MHz    PowerPC/co-processor clock frequency
Memory Frequency          125 MHz    Main FPGA clock frequency, including SDRAM and data path
RapidIO Frequency         250 MHz    RapidIO link speed
Co-processor SRAM size    32 KB      Maximum SRAM internal to any one co-processor
FIFO size (each)          8 KB       Size of FIFOs to/from SDRAM
Numerical Format          s.7.8      Signed, 16-bit fixed-point with 8 fraction bits
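To make the metric definitions above concrete, the following C sketch evaluates throughput (Eq. 4.1), memory efficiency (Eq. 4.2), and processor efficiency (Eq. 4.3) from measured latencies. The numeric inputs are illustrative placeholders, not measurements from the testbed; the 4-byte element size and sample latencies are assumptions for the example only.

    #include <stdio.h>

    /* Derived metrics from the primitive latency measurements (Eqs. 4.1-4.3).
     * Latencies are in seconds, sizes in bytes, throughput in bits/s.
     * The sample values below are illustrative only, not testbed data. */

    static double throughput_bps(double n_bytes, double t_xfer) {
        return (n_bytes * 8.0) / t_xfer;                     /* Eq. 4.1 */
    }

    static double memory_efficiency(double alpha, double cube_bytes,
                                    double t_stage, double ideal_bps) {
        /* Fraction of the ideal data-path capacity actually used while
         * processing one cube: (alpha * M) / (t_stage * throughput_ideal). */
        return (alpha * cube_bytes * 8.0) / (t_stage * ideal_bps);  /* Eq. 4.2 */
    }

    static double processor_efficiency(double n_cmds, double t_buff,
                                       double t_stage) {
        /* Fraction of t_stage during which the co-processor is busy. */
        return (n_cmds * t_buff) / t_stage;                  /* Eq. 4.3 */
    }

    int main(void) {
        /* Illustrative numbers: a 1 KB DMA transfer measured at 4 us, and a
         * hypothetical stage that issues 512 process commands per cube. */
        double t_xfer = 4.0e-6, n_bytes = 1024.0;
        double t_buff = 40.0e-6, t_stage = 25.0e-3, n_cmds = 512.0;
        double cube_bytes = 1024.0 * 128.0 * 16.0 * 4.0;  /* assumed 4 B/element */
        double ideal_bps  = 4.0e9;                        /* 32-bit @ 125 MHz path */

        printf("throughput      = %.2f Gbps\n", throughput_bps(n_bytes, t_xfer) / 1e9);
        printf("mem efficiency  = %.1f %%\n",
               100.0 * memory_efficiency(2.0, cube_bytes, t_stage, ideal_bps));
        printf("proc efficiency = %.1f %%\n",
               100.0 * processor_efficiency(n_cmds, t_buff, t_stage));
        return 0;
    }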

PAGE 68

CHAPTER 5
RESULTS

A total of five formal experiments were conducted, complemented by miscellaneous observations of interest. The five experiments included: (1) basic performance analysis of local and remote memory transfers, (2) co-processor performance analysis, (3) data verification, (4) a corner-turn study, and (5) analytical performance modeling and projection. The following sections present the results of each experiment and provide analysis of those results.

Experiment 1

The first experiment provided an initial look at the performance capability of the RapidIO testbed for moving data. This experiment performed simple memory transfers of varying types and sizes, and measured the completion latency of each transfer. These latencies were used to determine the actual throughput achieved for each type of transfer, and also illustrated the sensitivity of the architecture to transfer size by observing how quickly the throughput dropped off for smaller operations. Local transfers addressed data movement between the external SDRAM and the co-processor memories, and remote transfers covered movement of data from one node's external SDRAM to another node's external SDRAM over the RapidIO fabric. Since the internal co-processor memories were size-constrained to a maximum of 32 KB, the range of transaction sizes used for local transfers was 64 B to 32 KB. Only two transfer types exist for local memory accesses, reads and writes. Remote transfers could be of any size, so transaction sizes from 64 B to 1 MB were measured experimentally for RapidIO transfers. The RapidIO Logical I/O logical layer provides four main types of requests: (1) NREAD, for all remote read requests (includes a response with data), (2) NWRITE_R, for write operations requesting a response to confirm the completion of the operation, (3) NWRITE, for write operations without a response, and (4) SWRITE, a streaming write request with less packet

PAGE 69

header information for slightly higher packet efficiency (also a response-less request type). All four RapidIO request types are measured experimentally for the full range of transfer sizes, in order to characterize the RapidIO testbed's remote memory access performance.

Figure 5-1. Baseline throughput performance results.

First, consider the local memory transfer throughputs; even though the SDRAM interface operates at a theoretical maximum throughput of 8 Gbps (64-bit @ 125 MHz), the on-chip memories only operate at 4 Gbps (32-bit @ 125 MHz), and thus the transfer throughputs are capped at 4 Gbps. Recall that the over-provisioned SDRAM throughput is provided intentionally, so that local memory transfers will not interfere with RapidIO transfers. The overheads associated with DMA transfer startup and acknowledgement handshakes cause the drop-off in throughput for very small transfers, but it can be seen that at least half of the available throughput is achieved for modest transfer sizes of 1 KB. The transfer size at which half of the available throughput is achieved for a given interconnect or communication path is referred to as the half-power point, and is often treated as a metric for comparison between different communication protocols in order to rate the overall efficiency of the connection for realistic transfer sizes.

For remote memory transfers, as one might expect, the response-less transaction types (NWRITE, SWRITE) achieve better throughput for smaller transfer sizes. The higher

PAGE 70

throughput is a result of lower transfer latency, as the source endpoint does not have to wait for the entire transfer to be received by the destination. As soon as all of the data has been passed into the logical layer of the source endpoint, the control logic is considered done and can indicate to the DMA engine that the transfer is complete. These transaction types take advantage of RapidIO's guaranteed delivery to ensure that the transfers will complete without forcing the source endpoint to sit and wait for a completion acknowledgement. The only potential drawback to these transaction types is that the source endpoint cannot be sure of exactly when the transfer really completes, but most of the time that does not matter.

The hump seen around 16 KB in NWRITE/SWRITE throughputs is due to the various FIFOs/buffers in the communication path filling up, causing the transfer to be capped by the actual RapidIO throughput. Before that point, the transfer is occurring at the speed of SDRAM; however, once the buffers fill up, the SDRAM must throttle itself so as not to overrun the buffers. Transfer types that require a response do not experience this same hump in throughput, since for all transfer sizes they must wait for the transfer to traverse the entire interconnect before receiving the response. As transfer sizes grow, all transfer types settle to the maximum sustainable throughput.

One interesting observation is that for very large transfer sizes (> 256 KB), the throughput achieved by NREAD transactions overtakes the throughput of other transfer types, including the lower-overhead SWRITE type. The reason lies in implementation-specific overheads, most likely differences in the state transitions (i.e., numbers of states) involved in the state machines that control the RapidIO link, where even a single additional state can result in an observable decrease in actual throughput. The conclusion to be taken from these results is that for smaller transfers, unless an acknowledgement is required for some reason, the NWRITE or SWRITE transaction types are desirable to minimize the effective latency of the transfer. However, when

PAGE 71

moving a large block of data, it would be better to have the receiving node initiate the transfer with an NREAD transaction as opposed to having the sending node initiate the transfer with a write, once again in order to maximize the performance achieved for that transfer, since NREADs achieve the highest throughput for large transfer sizes. It will be shown in the coming experiments that this particular case-study application does not move large blocks of data during the course of data-cube processing, and as such SWRITE transaction types are used for all remote transfers unless otherwise noted.

Experiment 2

This experiment investigated the individual performance of each co-processor engine. All processing for this experiment was done on a single node, in order to focus on the internal operation of each node. Two latencies were measured for each co-processor: (1) t_buff, the amount of time necessary to completely process a single buffer, and (2) t_stage, the amount of time necessary to completely process an entire data-cube. The buffer latency t_buff does not include any data movement to/from external memory, and only considers the control overhead associated with starting and acknowledging the co-processor operation. Thus, this metric provides the raw performance afforded by a given co-processor design, without any dependence on the memory hierarchy performance. The data-cube processing latency t_stage includes all data movement as well as processing overheads, and is used in conjunction with the buffer latency results to derive the processor efficiency for each co-processor. Figure 5-2 below summarizes the basic processing latencies of each of the co-processor engines; for the t_stage results, both double-buffered (DB) and non-double-buffered (NDB) results are shown. For all remaining experiments, double-buffered processing is assumed; the non-double-buffered results are shown simply to provide a quantitative look at exactly what the benefit is of enabling double-buffering.
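The two primitive operations being timed here can be sketched from the controller's point of view as shown below, handshakes included. On the actual testbed these latencies were captured with a logic analyzer; the stub routines and tick counts in this sketch are hypothetical stand-ins, not the node's real control interface.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: timing a DMA transfer (t_xfer) and a processing run (t_buff)
     * from the controller's perspective.  All functions below are simulated
     * placeholders; the delays are arbitrary and illustrative only. */
    static uint64_t ticks = 0;                        /* fake free-running counter */
    static uint64_t timer_now(void)       { return ticks; }
    static void dma_start(uint32_t bytes) { ticks += 180 + bytes / 4; } /* handshake + data */
    static void dma_wait_ack(void)        { ticks += 20; }
    static void coproc_issue(void)        { ticks += 10; }
    static void coproc_wait_done(void)    { ticks += 4000; }            /* stand-in for t_buff */

    static uint64_t measure_t_xfer(uint32_t bytes) {
        uint64_t t0 = timer_now();
        dma_start(bytes);                 /* start-up handshake + transfer  */
        dma_wait_ack();                   /* completion acknowledgement     */
        return timer_now() - t0;          /* t_xfer, in clock ticks         */
    }

    static uint64_t measure_t_buff(void) {
        uint64_t t0 = timer_now();
        coproc_issue();                   /* process command (no start-up handshake)  */
        coproc_wait_done();               /* completion-acknowledgement handshake     */
        return timer_now() - t0;          /* t_buff, in clock ticks         */
    }

    int main(void) {
        printf("t_xfer(32 KB) = %llu ticks\n",
               (unsigned long long)measure_t_xfer(32 * 1024));
        printf("t_buff        = %llu ticks\n",
               (unsigned long long)measure_t_buff());
        return 0;
    }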

PAGE 72

Figure 5-2. Single-node co-processor processing performance.

As shown above in Figure 5-2, all of the co-processors complete the processing of a single buffer of data in the microsecond range, with pulse compression taking the longest. Both beamforming and CFAR have the same input buffer size and operate O(n) as implemented in this thesis, so t_buff for both of those co-processors is the same, as expected. The increase in execution time for pulse compression and Doppler processing relative to the other stages remains the same for both t_buff and t_stage, which suggests that the performance of those two co-processors is bound by computation time. However, consider the results of t_buff and t_stage for beamforming and CFAR detection; while t_buff is the same for both co-processors, for some reason t_stage for beamforming is significantly longer than that of CFAR. The reason for this disparity in t_stage is rooted in the data transfer requirements associated with processing an entire data-cube. Recall that for CFAR detection, no output is written back to memory since all data is consumed in this stage. However, if both processors are computation-bound, that would not make a difference and t_stage should also be the same between those two co-processors. To more clearly identify where the bottleneck is, consider Figure 5-3.

PAGE 73

Figure 5-3. Processing and memory efficiency for the different stages of GMTI.

The chart on the left of Figure 5-3 shows processing efficiency, again with both double-buffered and non-double-buffered execution. The NDB results are simply shown to quantitatively illustrate the benefit of double-buffering from a utilization perspective. As predicted above, both pulse compression and Doppler processing are computation-bound, meaning that the co-processor engines can be kept busy nearly 100% of the time during processing of an entire data-cube, as data transfer latencies can be completely hidden behind buffer processing latencies. Without double-buffering, however, no co-processor engine can be kept busy all of the time.

The chart on the right in Figure 5-3 shows both processing efficiency (double-buffered only) as well as memory efficiency, which can tell us even more about the operation of the node during data-cube processing. Again, as predicted above, it can be seen that while the performance of the CFAR engine is bound by computation, the beamforming engine is not, and even when double-buffered it must sit idle about 35% of the time waiting on data transfers. This inefficiency partially explains the disparity in t_stage between beamforming and CFAR, shown in Figure 5-2, where beamforming takes longer to completely process an entire data-cube than CFAR. Furthermore, recall the data reduction that occurs as a result of beamforming; to process an entire data-cube, beamforming must process more data than CFAR.

PAGE 74

Even though the input buffers of each co-processor engine are the same size, beamforming must go through more buffers of data than CFAR. The combination of more data and lower efficiency explains why beamforming takes longer than CFAR when looking at full data-cube processing time.

Another important observation from the efficiency results is the memory efficiency of the different co-processor engines. Low memory efficiency implies that the data path is sitting idle during the course of processing, since processing a single buffer takes longer than it takes to read and write one buffer of data. Specifically, notice that pulse compression has a very low memory efficiency of about 25%. What this low data path utilization tells us is that multiple pulse compression co-processors could be instantiated in each node of the testbed, and all of them could be kept fed with data. Exactly how many co-processor engines could be instantiated can be determined by multiplying the single-engine memory efficiency by the number of engines until the total approaches about 75% (a realistic upper limit considering control overheads, based on the I/O-bound beamforming engine results). For future experiments, two pulse compression engines per node can be safely assumed. By contrast, all other co-processor engines already use at least ~50% of the memory bandwidth, and as such only one co-processor engine could be supported per node.

This restriction reveals an important conclusion: it will not be worthwhile to instantiate all four co-processor engines in each node. Instead, each node of the system should be dedicated to performing a single stage of GMTI, which results in a pipelined parallel decomposition. If resource utilization is not a concern, then all four co-processor engines could be instantiated in each node of the testbed, with only one being used at a time. Doing so would enable data-parallel decomposition of the algorithm across the system, but again at the cost of having most of the co-processor engines sit idle the majority of the time.
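The sizing argument above can be written out directly. In the sketch below, the ~25% figure for pulse compression and the ~50% floor for the other engines come from the text; the exact per-engine efficiencies and the 75% ceiling are used only as approximate, illustrative inputs.

    #include <stdio.h>

    /* How many copies of a co-processor engine could be kept fed with data:
     * replicate the engine until its aggregate memory efficiency approaches
     * the ~75% practical ceiling.  Efficiencies are approximate values for
     * illustration, not exact measurements. */
    int main(void) {
        const char  *engine[]  = { "pulse compression", "Doppler processing",
                                   "beamforming", "CFAR" };
        const double mem_eff[] = { 0.25, 0.50, 0.60, 0.50 };   /* approximate */
        const double ceiling   = 0.75;

        for (int i = 0; i < 4; i++) {
            int copies = (int)(ceiling / mem_eff[i]);
            if (copies < 1) copies = 1;
            printf("%-20s ~%2.0f%% of data path -> %d engine(s) per node\n",
                   engine[i], 100.0 * mem_eff[i], copies);
        }
        return 0;
    }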

PAGE 75

Finally, to establish a frame of reference for how fast these co-processor designs are compared to other, more conventional technology, a performance comparison is offered between the RapidIO testbed and the Northwestern University software implementation executed sequentially on a Linux workstation. Figure 5-4 below shows the result of this performance comparison. To defend the validity of this comparison, even though the hardware being compared is not apples-to-apples, the assumption for both cases is that the entire data-cube to be processed resides completely in main memory (external SDRAM) both before and after processing. Since the capability of a given node to move data between external memory and the internal processing element(s) is a critical facet of any realistic node architecture, as long as both systems start and end with data in analogous locations, the comparison is in fact valid and covers more than just pure processing performance. The Linux workstation used in this comparison has a 2.4 GHz Xeon processor, with DDR SDRAM operating at a theoretical maximum of 17 Gbps. The next experiment will provide a comparison of the output data between these two implementations, to further defend the validity of the HW-SW comparison by showing that the same (or approximately the same) results are achieved.

Figure 5-4. Performance comparison between RapidIO testbed and Linux workstation.

PAGE 76

The results shown in Figure 5-4 are significant, since the RapidIO testbed co-processors are running at 100 MHz compared to the workstation's 2.4 GHz. The maximum clock frequency in the overall FPGA node design is only 125 MHz, aside from the physical layer of the RapidIO endpoint, which runs at 250 MHz (however, recall that during any given stage of processing, no inter-processor communication is occurring). Despite the relatively low-frequency design, impressive sustained performance is achieved and demonstrated. Processors operating in space are typically unable to achieve frequencies in the multi-GHz range, and so it is important to show that low-frequency hardware processors with highly efficient architectures are still able to match or exceed the performance of conventional processors.

Experiment 3

To establish confidence in the architectures presented in this thesis research, the processing results of the co-processor engines were validated against reference data, some of which was obtained from researchers at Northwestern University [10]. Each of the co-processing engines was provided with a known set of data, processed the entire set, and unloaded the processed results in the form of a file returned to the user workstation. With the exception of pulse compression and Doppler processing, a software implementation of GMTI was the source of the reference processing results, and employs a double-precision floating-point numerical format for maximum accuracy. For the pulse compression and Doppler processing cores, it turns out that the 16-bit fixed-point precision with 8 fractional bits is not nearly enough precision to maintain accuracy, and to illustrate this deficiency a more simplified data set is provided for processing.

First, consider the output of the Doppler processing engine. Since pulse compression and Doppler processing contain nearly identical operations (pulse compression simply contains one extra FFT), only verification of Doppler processing results is shown. Pulse compression is guaranteed to have worse error than Doppler processing, so if 16-bit fixed point with 8 fractional

PAGE 77

bits is shown to result in intolerable error in the output of Doppler processing, there is no need to test the pulse compression engine. As mentioned above, the reference data-cubes from the NWU implementation of GMTI are not used for this particular kernel. The data contained in the reference data-cubes is very dynamic, and in most cases the output obtained from running those data sets through the testbed implementation exhibits significant overflow and underflow, resulting in vast differences between golden results and testbed results. In order to instill confidence that this kernel implementation is indeed performing the correct operations in spite of round-off errors, a simplified vector is provided to the testbed implementation, as well as to a MATLAB implementation of the Doppler processing kernel.

Figure 5-5. Doppler processing output comparison.

Figure 5-5 shows the golden (i.e., reference) results alongside the actual testbed output. Notice the underflow that occurs in the middle of the testbed output, where the signal is suppressed to zero. Furthermore, in order to attempt to avoid overflow, an overly-aggressive scaling was employed in the testbed implementation that resulted in all numbers being scaled to << 1. The result of this scaling is that the majority of the available bits in the 16-bit fixed-point number are not used, which is extremely wasteful for such limited precision. More time could have been spent tuning the scaling of the FFT; however, it should be emphasized here that the

PAGE 78

goal of this experiment was simply to instill confidence that the correct operations are being performed, not to tune the accuracy of each kernel implementation. Despite the obvious design flaws in terms of precision and accuracy, the performance of the kernel implementations is completely independent of the selected precision and scaling, and by inspection the correctness of the kernel implementation on the testbed can be seen.

Next, consider the verification of the beamforming co-processor. For this kernel, the actual data from the NWU implementation was fed to the testbed, and the outputs show significantly less error than the Doppler processing and pulse compression kernels. Figure 5-6 below presents a small window of the overall data-cube after beamforming, specifically a 6×6 grid of data points from one particular output beam. The window was chosen to provide a typical picture of the output, as showing an entire 1024-range beam would be difficult to visually inspect.

Figure 5-6. Beamforming output comparison.

Aside from two points of underflow, (5, S3) and (6, S2), the outputs are nearly identical. Again, a detailed analysis of the exact error obtained (e.g., mean-square error analysis) is not provided, as fine-grained tuning of kernel accuracy is not the goal of this experiment. In fact, the

PAGE 79

two points of underflow shown in the left chart are a result of arithmetic logic errors, since the real data points should be close to zero, as shown in the golden results in the right chart. The actual data values received from the testbed were near or equal to -128, but the scale of the charts was truncated at -50 in order to provide a sufficient color scale to compare the rest of the window. The immediate jump from 0 to -128 suggests a logic error as opposed to typical round-off error, considering the 16-bit fixed-point format with 8 fractional bits and signed-magnitude format. Similar to Doppler processing, correcting this logic error would not affect the performance of the beamforming core, which is the real focus of this research. The results shown above in Figure 5-6 are sufficient to instill confidence that the beamforming core is performing the operations that are expected.

Finally, consider the CFAR detection kernel results. This kernel implementation does not suffer from any minor arithmetic design flaws, and the results that are shown in Figure 5-7 best illustrate why 16-bit fixed-point precision is simply not sufficient for GMTI processing. The radar data from the NWU implementation is used to verify the CFAR engine, with a full 1024-range beam of the input data shown in the top chart. As can be seen, the range of values goes from very small near the left side of the beam, all the way to very large near the right side of the beam. The middle chart shows the targets that are actually in the data, and the chart on the bottom shows the targets detected by the testbed implementation. By visual inspection, when the input data is small, there are many false positives (targets reported that really are not targets); when the input data is large, there are many false negatives (targets not detected). However, in the middle of the beam, where the input data values are in the comfort zone, the testbed implementation is very accurate.
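The dynamic-range limits described in this experiment follow directly from the s.7.8 format listed in Table 4-1. The sketch below shows one plausible conversion into that format; the two's-complement encoding, rounding, and saturation policy used here are assumptions for illustration and differ in detail from the signed-magnitude encoding used by the testbed cores.

    #include <stdio.h>
    #include <stdint.h>

    /* Sketch of a 16-bit fixed-point format with 8 fractional bits (s.7.8),
     * with saturation on overflow.  Encoding details are assumed for this
     * example; the representable range is roughly -128 to +128. */
    static int16_t to_s7_8(double x) {
        double scaled = x * 256.0;                    /* 2^8 fractional scaling */
        if (scaled >  32767.0) return  32767;         /* overflow: saturate     */
        if (scaled < -32768.0) return -32768;
        return (int16_t)(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);  /* round  */
    }

    static double from_s7_8(int16_t q) { return (double)q / 256.0; }

    int main(void) {
        /* Very small inputs quantize to zero (underflow); inputs beyond ~128
         * saturate, mirroring the false negatives/positives seen in CFAR. */
        const double samples[] = { 0.0009, 0.001, 0.5, 100.0, 200.0 };
        for (int i = 0; i < 5; i++) {
            int16_t q = to_s7_8(samples[i]);
            printf("%10.4f -> 0x%04x -> %10.4f\n",
                   samples[i], (unsigned)(uint16_t)q, from_s7_8(q));
        }
        return 0;
    }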

PAGE 80

Figure 5-7. CFAR detection output comparison.

One final point to make about the results observed in the above verification experiments is that, aside from beamforming (possibly), when tested independently each kernel implementation exhibits intolerable overflow and underflow in the processing output. In a real implementation, these kernels are chained together, and thus any round-off or over/underflow errors would propagate through the entire application, resulting in even worse target detections by the end. However, all hope is not lost for an FPGA implementation of GMTI, as the results simply suggest that more precision is required if a fixed-point format is desired; alternatively, floating point could be employed to avoid the over/underflow problem altogether. However, doing so would result in a doubling of the data set size due to the use of more bytes per element, and thus the memory throughput would need to be doubled in order to maintain the real-time deadline. To defend the accuracy of the performance results presented in other experiments, consider the following upgrades to the architecture that could be implemented in order to provide this

PAGE 81

increased throughput. Upgrading the external memory from standard SDRAM to DDR SDRAM would double the effective throughput of the external memory controller from 8 Gbps to 16 Gbps. Also, increasing the OCM data path width from 32 bits to 64 bits would provide 8 Gbps to on-chip memory as opposed to the current 4 Gbps. These two enhancements would provide the necessary data throughput throughout the system to sustain the performance reported in other experiments when using either 32-bit fixed-point or single-precision floating point. While floating-point arithmetic cores do typically require several clock cycles per operation, they are also pipelined, meaning once again the reported throughputs could be sustained. The only cost would be a nominal few clock cycles to fill the pipelines for each co-processor engine, which translates to an additional latency on the order of a few hundred nanoseconds (depending on the particular arithmetic core latency). Such a small increase in processing latency is negligible.

Experiment 4

The fourth experiment addressed the final kernel of GMTI, the corner-turn, which performs a re-organization of the distributed data-cube among the nodes in the system in preparation for the following processing stage. The corner-turn operation is critical to system performance, as well as to system scalability, as it is the parallel efficiency of the corner-turns in GMTI that determines the upper bound on the number of useful processing nodes for a given system architecture. Recall that the only inter-processor communication in GMTI is the corner-turns that occur between processing stages. Since the data-cube of GMTI is non-square (or non-cubic), the dimension that the data-cube is originally decomposed along and the target decomposition may have an effect on the performance of the corner-turn, as it affects the individual transfer sizes required to implement that specific corner-turn. For this experiment,

PAGE 82

two different decomposition strategies of the data-cube are implemented, and the corner-turn operation latency of each decomposition strategy is measured and compared. Figure 5-8 illustrates these two different decompositions, with the red line indicating the dimension along which processing is to be performed both before and after the corner-turn. There are a total of three corner-turns in the implementation of GMTI used for this thesis research (one after pulse compression, one after Doppler processing, and the final one after beamforming), and the data-cube is distributed across two nodes in the experimental testbed. Due to the limited system size, it is impossible to implement the two-dimensional decomposition strategy discussed in Chapter 2, and the cube is only decomposed along a single dimension for each stage.

Figure 5-8. Data-cube dimensional orientation, A) primary decomposition strategy and B) secondary decomposition strategy.

It should be noted here that other decompositions and orders of operations are possible, but to limit the scope of this experiment only two were selected, providing a basic illustration of the sensitivity of performance to decomposition strategy. To avoid confusion, note that there are a total of three corner-turns for each type defined above, corresponding to the corner-turns that follow each stage of GMTI. Figure 5-9 presents the corner-turn operation latencies that were measured for both type A and type B decompositions.

PAGE 83

Figure 5-9. Data-cube performance results.

Notice the difference in y-axis scale in each of the two charts in Figure 5-9. As can be seen from the illustrations in Figure 5-8, the final corner-turn of both type A and type B has the same starting and ending decomposition. As such, one would expect the performance of that final corner-turn to be the same for both cases, and the experimental results confirm that expectation. The performance of the first corner-turn in both type A and type B is also seen to be similar; however, this result is simply a coincidence. The main difference between the two decomposition strategies is therefore the corner-turn that follows Doppler processing, where the operation latency is significantly higher in the type B decomposition than it is for the type A decomposition. For type A, because of the way the data-cube is decomposed before and after Doppler processing, there is no inter-processor communication required, and each node can independently perform a simple local transpose of its portion of the data-cube. However, for type B not only do the nodes need to exchange data, it just so happens that half of the individual local memory accesses are only 32 bytes, the smallest transfer size of all corner-turns considered in this experiment. Also, since such a small amount of data is being read for each local DMA transfer, more DMA transfers must be performed than in the other cases in order to move the

PAGE 84

entire data-cube. Since smaller transfer sizes achieve lower throughput efficiency, the significantly-increased operation latency of the second corner-turn for type B decompositions can be explained by this poor dimensional decomposition strategy.

One final observation to discuss about the corner-turn operations as they are implemented in this thesis is the fact that the low-performance PowerPC is the root cause of the inefficiency experienced when performing corner-turns. Due to the high-overhead handshaking that occurs before and after each individual DMA transfer, and the high number of DMA transfers that occur for each corner-turn (as well as relatively small transfer sizes, typically from 32 to 512 bytes per DMA), almost all of the operation latency is caused by control overhead in the node architecture. With a higher-performance control network, or otherwise hardware-assisted corner-turns, the operation latency could be dramatically improved. The final experiment performed considers such an enhancement to the baseline node architecture, as well as other enhancements to improve the overall GMTI application latency.

Experiment 5

The fifth and final experiment performed did not involve any operation of the testbed. Due to the highly deterministic nature of the node architectures, analytical expressions could be derived fairly easily to accurately model the performance of basic operations. Using these analytical models, the performance of more complex operations was predicted in order to enable performance projection beyond the capabilities of the current RapidIO testbed. It should be noted that the analytical modeling in this section is not presented as a generic methodology to model arbitrary architectures; instead, this analytical modeling approach applies only for performance prediction of this particular testbed design. This section provides both validation results as well as performance projections for GMTI running on an architecture that features enhancements over the baseline design presented in Chapter 3.

PAGE 85

The two main operations modeled in this section are: (1) DMA transfers of N bytes, D_N, and (2) co-processor operation latencies, P_buff. The GMTI case-study application implemented for this thesis research is composed of these two primitive operations, which can be expressed as equations composed of known or experimentally-measured quantities. The analytical expressions for the two primitive operations defined above are shown here:

D_N = N_{bytes} / BW_{ideal} + o_{dma}    (5.1)

P_{buff} = t_{buff} + o_{comp}    (5.2)

For DMA transfer latencies, the number of bytes (N_bytes) is known for a given transfer size, and BW_ideal is the throughput achieved as the transfer size goes to infinity (see Experiment #1). For co-processor operation latencies, t_buff is an experimentally-measured quantity that represents the amount of time it takes a given co-processor to completely process one input buffer of data, once all data has been transferred to the co-processor (as defined in Experiment #2). The two remaining terms in Equations 5.1 and 5.2 above, o_dma and o_comp, are defined as overhead terms and represent the latency of control logic operations. Specifically, these overhead terms represent the handshaking that occurs between the PowerPC controller and the DMA engine or co-processor engine. The overhead terms o_dma and o_comp are measured experimentally using the logic analyzer to be 1.8 µs and 1.5 µs, respectively.

Using only the terms and equations defined above, the first validation test of the analytical models is done by predicting local DMA transfers and comparing the predicted transfer latency and throughput with the experimentally-measured values presented in Experiment #1. Figure 5-10 shows the results of this first simple validation experiment.
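The two primitive models can be evaluated directly, as in the sketch below. The overhead values are the measured 1.8 µs and 1.5 µs quoted above; BW_ideal and t_buff are placeholders to be filled in with the measured values from Experiments 1 and 2.

    #include <stdio.h>

    /* Analytical models of the two primitive operations (Eqs. 5.1 and 5.2).
     * O_DMA and O_COMP are the measured handshake overheads from the text;
     * bw_ideal and t_buff below are placeholder values, not measurements. */
    static const double O_DMA  = 1.8e-6;   /* seconds */
    static const double O_COMP = 1.5e-6;   /* seconds */

    static double dma_latency(double n_bytes, double bw_ideal_bps) {
        return (n_bytes * 8.0) / bw_ideal_bps + O_DMA;      /* Eq. 5.1: D_N    */
    }

    static double proc_latency(double t_buff) {
        return t_buff + O_COMP;                             /* Eq. 5.2: P_buff */
    }

    int main(void) {
        double bw_ideal = 4.0e9;      /* placeholder: ~4 Gbps local data path */
        double t_buff   = 40.0e-6;    /* placeholder single-buffer latency    */
        for (double bytes = 64; bytes <= 32 * 1024; bytes *= 4)
            printf("D_N(%6.0f B) = %8.2f us\n",
                   bytes, 1e6 * dma_latency(bytes, bw_ideal));
        printf("P_buff        = %8.2f us\n", 1e6 * proc_latency(t_buff));
        return 0;
    }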

PAGE 86

Figure 5-10. Local DMA transfer latency prediction and validation.

As can be seen from the charts above, the analytical models for simple DMA transfers are extremely accurate, due to the highly deterministic nature of the node architecture. However, more complex situations need to be analyzed before considering the analytical models of the testbed to be truly validated. The next validation experiment considers the latency of each co-processor engine to process an entire data-cube (t_stage), which involves both DMA transfers as well as co-processor activity. Analytical predictions of processing latencies will be compared against the experimentally-measured results presented in Experiment #2. The analytical expression that models double-buffered data-cube processing latency is shown below:

t_{stage} = [D + MAX(D, P_{buff})] + (M - 2) · MAX(D, P_{buff}) + [MAX(D, P_{buff}) + D]    (5.3)

The first term in Equation 5.3 represents the startup phase of double-buffering. The second term is the dominant term and represents the steady-state phase of double-buffered processing, and the third and final term represents the wrap-up phase that completes processing. The constant M in Equation 5.3 is a generic term that represents the number of buffers that must be filled and processed to complete an entire data-cube, and is known for any given co-processor engine and data-cube size. P_buff was defined earlier in this section, and the

PAGE 87

variable D is defined as the sum of the previously-defined D_N term and a new overhead term, o_sys. This new overhead term is a generic term that captures system overhead, or the non-deterministic amount of time in between consecutive DMA transfers. The o_sys term is not experimentally measured, since it is not deterministic like the other two overhead terms (o_dma and o_comp), and is instead treated as a knob or variable to tune the predictions produced by the analytical expression. It should be noted here that all three overhead terms are a result of the relatively low-performance PowerPC controller in each node.

Notice that in Equation 5.3 there are MAX(a, b) terms; these terms reflect the behavior of double-buffered processing, which can be bounded by either computation time or communication time. Figure 5-11 below shows an illustration of both cases, with labels indicating the startup, steady-state, and wrap-up phases referred to earlier. Comparing the diagram in Figure 5-11 to Equation 5.3 should clarify how the expression was arrived at to model double-buffered processing. Figure 5-12 presents a comparison of predicted data-cube processing latencies with experimentally measured results for each co-processor engine, assuming an o_sys value of 3.7 µs. This value of o_sys is observed to be a realistic value for typical delays in between DMA transfers, based on observations made from logic analyzer captures.

Figure 5-11. Illustration of double-buffered processing.
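Equation 5.3, as reconstructed above, can be evaluated as follows. M, D, and P_buff are placeholders to be supplied per co-processor engine and data-cube size; the two cases in the example simply contrast a computation-bound stage with an I/O-bound one.

    #include <stdio.h>

    /* Double-buffered data-cube processing latency (Eq. 5.3): a startup phase,
     * (M - 2) steady-state iterations bounded by MAX(D, P_buff), and a wrap-up
     * phase.  D already folds in the o_sys term described in the text. */
    static double max2(double a, double b) { return a > b ? a : b; }

    static double t_stage(double D, double P_buff, int M) {
        return (D + max2(D, P_buff))             /* startup phase      */
             + (M - 2) * max2(D, P_buff)         /* steady-state phase */
             + (max2(D, P_buff) + D);            /* wrap-up phase      */
    }

    int main(void) {
        /* Placeholder values: 128 buffers per cube; one computation-bound case
         * (P_buff > D) and one I/O-bound case (D > P_buff). */
        int M = 128;
        printf("computation-bound: %8.3f ms\n", 1e3 * t_stage(30e-6, 60e-6, M));
        printf("I/O-bound:         %8.3f ms\n", 1e3 * t_stage(60e-6, 30e-6, M));
        return 0;
    }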

PAGE 88

Figure 5-12. Data-cube processing latency prediction and validation.

The prediction results for data-cube processing latencies are impressively accurate, and the analytical model used to calculate those predictions is significantly more complex than the DMA transfer latency predictions presented earlier in this experiment. There is one final operation whose analytical model will be defined and validated, the corner-turn operation, before proceeding to performance projections of enhanced architectures. Based on the performance observations from Experiment #4, only type A corner-turns will be used. Recall that the corner-turn operation is composed of many small DMA transfers and local transposes, with the PowerPC performing the local transposes. Some of the data will stay at the same node after the corner-turn, while the rest of the data will need to be sent to a remote node's memory. Figure 5-13 shows a simplified illustration of the order of operations in a corner-turn.

Figure 5-13. Illustration of a corner-turn operation.
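The order of operations in Figure 5-13 can be sketched as a control loop. The DMA and transpose routines below are no-op stand-ins for the node's actual control interface (not a real API), and the loop bounds correspond to the generic A, M, and N iteration counts that appear in Equation 5.4 on the following page.

    #include <stdio.h>

    /* Sketch of the corner-turn control flow of Figure 5-13.  Local section:
     * read blocks into on-chip memory, transpose, write back locally.  Remote
     * section: same, but each output block is staged through local SDRAM and
     * then written to the remote node.  All routines are placeholders. */
    static void dma_local_read (int blk) { (void)blk; }  /* SDRAM -> on-chip (D_X)        */
    static void dma_local_write(int blk) { (void)blk; }  /* on-chip -> SDRAM (D_Y)        */
    static void dma_remote_write(int blk){ (void)blk; }  /* SDRAM -> remote SDRAM (t_RW)  */
    static void transpose_block(void)    { }             /* PowerPC software transpose (t_trans) */

    static void corner_turn(int A, int M, int N) {
        for (int a = 0; a < A; a++) {                    /* local section  */
            for (int m = 0; m < M; m++) dma_local_read(m);
            transpose_block();
            for (int n = 0; n < N; n++) dma_local_write(n);
        }
        for (int a = 0; a < A; a++) {                    /* remote section */
            for (int m = 0; m < M; m++) dma_local_read(m);
            transpose_block();
            for (int n = 0; n < N; n++) {
                dma_local_write(n);                      /* stage in local SDRAM    */
                dma_remote_write(n);                     /* SWRITE to remote SDRAM  */
            }
        }
    }

    int main(void) {
        corner_turn(8, 16, 16);                          /* placeholder iteration counts */
        printf("corner-turn sketch complete\n");
        return 0;
    }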

PAGE 89

The analytical expression used to model corner-turn operations is shown in Equation 5.4. The first line represents the local section as illustrated in Figure 5-13, while the second line represents the remote section. Since all memory transfers must involve the external memory of the initiating node, in order to send data from local on-chip memory to a remote node, the data must first be transferred from on-chip memory to local SDRAM, and then transferred from local SDRAM to remote SDRAM.

t_{cornerturn} = A · [M · (D_X + o_{sys}) + t_{trans} + N · (D_Y + o_{sys})]    (5.4)
               + A · [M · (D_X + o_{sys}) + t_{trans} + N · (D_Y + t_{RW} + 2 · o_{sys})]

The constants A, M, and N in Equation 5.4 represent generic terms for the number of iterations, and are determined by the size and orientation of the data-cube as well as the size of the on-chip memory used for local transposition. The D_X and D_Y terms represent local DMA transfer latencies of two different sizes, as defined in Equation 5.1. The new terms t_trans and t_RW are experimentally-measured quantities. The t_trans term represents the amount of time it takes to perform the local transpose of a block of data, and t_RW is the latency of a remote memory transfer of a given size. Figure 5-14 shows the results of the analytical model predictions.

Figure 5-14. Corner-turn latency prediction and validation.
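The reconstructed Equation 5.4 can be evaluated the same way as the earlier models. Every numeric input in the sketch below is a placeholder to be replaced with the measured D_X, D_Y, t_trans, and t_RW values, except o_sys, which uses the 3.7 µs figure quoted earlier; the split into local and remote sections follows the reconstruction above.

    #include <stdio.h>

    /* Corner-turn latency model (Eq. 5.4 as reconstructed above).  The first
     * term is the local section; the second is the remote section, in which
     * each block is written twice (on-chip -> local SDRAM, then local SDRAM
     * -> remote SDRAM).  Inputs are placeholders, not testbed measurements. */
    static double t_cornerturn(int A, int M, int N,
                               double D_X, double D_Y, double t_trans,
                               double t_RW, double o_sys) {
        double local_sec  = A * (M * (D_X + o_sys) + t_trans + N * (D_Y + o_sys));
        double remote_sec = A * (M * (D_X + o_sys) + t_trans
                                 + N * (D_Y + t_RW + 2.0 * o_sys));
        return local_sec + remote_sec;
    }

    int main(void) {
        double t = t_cornerturn(8, 16, 16,
                                2.5e-6, 2.5e-6, 40e-6, 6.0e-6, 3.7e-6);
        printf("predicted corner-turn latency = %.3f ms\n", 1e3 * t);
        return 0;
    }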

PAGE 90

The corner-turn latency predictions are noticeably less accurate than the previous analytical models, due to the DMA-bound progression of the operation. However, the maximum error is still less than 5% for all cases, which is a sufficiently accurate prediction of performance to instill confidence in the analytical model proposed in Equation 5.4. Now that the analytical models of GMTI have been completely defined and validated, architectural enhancements will be analyzed for potential performance improvement using these analytical models.

The first performance projection looks at reducing the high overhead introduced by the PowerPC controller in each node. Additionally, two pulse compression engines per node are assumed, since there is sufficient memory bandwidth to keep both engines at near 100% utilization based on observations from Experiment #2. In order to model system performance assuming minimal overhead, the three overhead terms, o_dma, o_comp, and o_sys, are set to 50 nanoseconds each. While it would not be realistic to reduce the overhead terms to zero, a small latency of 50 nanoseconds is feasible, as it corresponds to several clock cycles at around 100 MHz. Highly-efficient control logic implemented in the reconfigurable fabric of the FPGA would be able to achieve such latencies. As a complementary case study, to further reduce the negative impact of the PowerPC on system performance, corner-turns could be enhanced with hardware-assisted local transposes. Instead of having the PowerPC perform the transpose through software, a fifth co-processor engine could be designed that performs a simple data transfer from one BRAM to another, with address registers that progress through the appropriate order of addresses to transpose the data from the input BRAM to the output. Using this type of co-processor engine, one data element would be moved per clock cycle, drastically reducing the t_trans term in Equation 5.4. Figure 5-15 shows the full predicted GMTI application latency for three cases: (1) baseline architecture (labeled ORIGINAL), (2) optimized control logic and two

PAGE 91

pulse compression engines (labeled OPTIMIZED), and (3) optimized control logic and two pulse compression engines, along with hardware-assisted corner-turns (labeled HW-CT).

Figure 5-15. Full GMTI application processing latency predictions.

Aside from halving the pulse compression latency due to having two pulse compression engines, the other processing stages do not experience significant performance gains by minimizing the control overhead. Because the Doppler processing and CFAR stages achieve nearly 100% processor efficiency, all of the control logic latency associated with DMA transfers is hidden behind the co-processor activity, so little performance gain from optimized control logic is expected. For the beamforming stage, although it is difficult to discern from the chart, the beamforming processing latency is reduced from 17.2 ms to 11.6 ms. This result is also intuitive, since the beamforming stage did not achieve 100% processor utilization in the baseline design, and thus beamforming stands to benefit from reduced control logic overhead.
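The projection just described amounts to re-evaluating the validated stage model with the overhead terms cut to 50 ns. A minimal sketch follows, reusing the Eq. 5.3 form and folding o_dma and o_sys into the per-buffer transfer cost as a simplification; the transfer and buffer latencies are placeholders, so only the relative change matters.

    #include <stdio.h>

    /* Re-evaluating the double-buffered stage model (Eq. 5.3) with the three
     * overhead terms reduced from their measured values to an assumed 50 ns,
     * as in the optimized-control-logic projection.  xfer and t_buff are
     * placeholders; the comparison illustrates why computation-bound stages
     * gain little from lower control overhead. */
    static double max2(double a, double b) { return a > b ? a : b; }

    static double stage(double xfer, double o_dma, double o_sys,
                        double t_buff, double o_comp, int M) {
        double D = xfer + o_dma + o_sys;      /* per-buffer transfer cost   */
        double P = t_buff + o_comp;           /* per-buffer processing cost */
        return (D + max2(D, P)) + (M - 2) * max2(D, P) + (max2(D, P) + D);
    }

    int main(void) {
        int M = 128; double xfer = 30e-6, t_buff = 35e-6;
        double base = stage(xfer, 1.8e-6, 3.7e-6, t_buff, 1.5e-6, M);  /* measured overheads */
        double opt  = stage(xfer, 50e-9,  50e-9,  t_buff, 50e-9,  M);  /* optimized: 50 ns   */
        printf("baseline:  %.3f ms\noptimized: %.3f ms (%.1f%% faster)\n",
               1e3 * base, 1e3 * opt, 100.0 * (base - opt) / base);
        return 0;
    }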

PAGE 92

However, notice the significant improvement in corner-turn operation latencies. Since corner-turns do not implement any latency-hiding techniques (such as double-buffering), the majority of the operation latency for the baseline design is caused by the high control overhead. However, with only optimized control logic, and no hardware-assisted local transposes, the performance gain is somewhat limited. The relatively low performance of the PowerPC is further exposed by the hardware-assisted corner-turn predictions, in which the corner-turn operations experience another significant performance improvement by offloading the local transposes to dedicated logic in the reconfigurable fabric of the FPGA.

One additional architectural enhancement is modeled to further improve corner-turn performance for GMTI. The second performance projection is made assuming that there is a direct path connecting the internal on-chip memory with the RapidIO network interface at each node, so that data from on-chip memory does not need to go through local SDRAM before being sent to a remote node. The only kernel of this implementation of GMTI that could benefit from such an enhancement is the corner-turn, since the only inter-processor communication occurs during corner-turns. From Figure 5-13, it can be seen that there are two write operations that occur per loop iteration in the remote section of a corner-turn. By enhancing the node architecture with the new data path, the remote section would be reduced to a single write operation that goes straight from the scratchpad memory of the local transpose to the remote node's external SDRAM. An additional benefit of this enhancement is a reduction of the load on local external memory. For the baseline corner-turn, both of the write operations in the remote section require the use of local external memory, whereas with this new transfer directly from on-chip memory to the network, the local external memory is not used at all. Figure 5-16 shows the predicted performance improvement for the corner-turn operation with this new data path.

PAGE 93

Figure 5-16. Corner-turn latency predictions with a direct SRAM-to-RapidIO data path.

The results shown in Figure 5-16 are relative to the baseline node architecture, without the control logic optimizations or hardware-assisted corner-turns described in the previous projections. Surprisingly, not much improvement is seen in the enhanced architecture with the new data path between on-chip memory and the network interface. For the corner-turn that follows Doppler processing, no improvement is seen; however, this result is expected since no data is exchanged between nodes for this particular corner-turn. For the other two corner-turns, the limited performance improvement is due to two factors. First, as can be seen by inspecting Figure 5-13, only a small portion of the overall corner-turn operation involves remote memory transfers (which are the only transfers that benefit from the new data path). In the two-node system, only half of the data at each node needs to be exchanged with the other node. If the node count of the system were to increase, so too would the percentage of data at each node that would need to be written remotely (for example, in a four-node system, 75% of the data at each node would need to be written remotely), and as a result the performance improvement relative to the baseline design would increase. Second, any performance gain due to the new data path would be additive with a performance improvement from hardware-assisted local transposes.

PAGE 94

Therefore, if compared to a baseline architecture that includes the faster hardware-assisted transposes, the improvement due to the new data path would be more significant relative to the baseline performance.
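The node-count argument above is simple to write out: with the cube distributed evenly across P nodes, a corner-turn leaves 1/P of each node's data local and sends the remaining (P - 1)/P to other nodes. The short sketch below just tabulates that fraction, matching the 50% (two-node) and 75% (four-node) figures quoted in the text.

    #include <stdio.h>

    /* Fraction of each node's data written remotely during a corner-turn when
     * the cube is distributed evenly across P nodes: (P - 1) / P. */
    int main(void) {
        for (int p = 2; p <= 16; p *= 2)
            printf("%2d nodes: %5.1f%% of local data written remotely\n",
                   p, 100.0 * (p - 1) / p);
        return 0;
    }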

PAGE 95

CHAPTER 6
CONCLUSIONS AND FUTURE WORK

This chapter summarizes the research presented in this thesis, and offers concluding remarks to review the insight gained from the experiments performed in the course of this work. Some suggestions are also offered for potential directions that could be taken to extend the findings of this research, as well as expand the capabilities of the experimental platform.

Summary and Conclusions

A promising new system architecture based on FPGAs was presented to support demanding space-based radar applications. This research goes far beyond the simple implementation and demonstration of IP cores on an FPGA, by investigating node-level and system-level integration of cores and components. The relatively low-frequency design was able to demonstrate impressive sustained performance, and the highly-modular architecture will ease the integration of more high-performance components without disrupting the overall design philosophy. By building and demonstrating one of the first working academic RapidIO testbeds, this thesis contributed much-needed academic research on this relatively new interconnect standard.

Several challenges were encountered during the course of this research, namely restrictions imposed by the hardware platform such as limited node memory and insufficient numerical precision for the case-study application. Most real systems will provide some external SRAM to each node for processing memory, which would permit the larger data sets necessary to truly emulate realistic space-based radar data sizes. Furthermore, the limited system size prevented scalability experiments and restricted the complexity of the communication patterns that might stress the underlying system interconnect.

PAGE 96

One important observation to come from this research is that, with the proposed node architecture, a pipelined parallel decomposition would maximize the processor utilization at each node. The highly efficient co-processor engines typically require the full bandwidth provided to local external memory at each node in order to maximize utilization, making it impossible to use multiple co-processor engines concurrently in each node. However, pipelined decompositions may exhibit a higher processing latency per data cube and, due to the nature and purpose of the real-time GMTI algorithm, this increased latency may be intolerable. Even if the system is able to provide the necessary throughput to keep up with incoming data cubes, if the target detections for each cube are provided too long after the raw data is acquired, the detections may be stale and thus useless. The system designer should carefully consider how important individual co-processor utilization is compared to the latency requirements for their particular implementation. If co-processor utilization is not considered important, then a data-parallel decomposition would be possible, which would reduce the latency of results for each individual data cube.

PAGE 97

Another design technique demonstrated in this research that takes advantage of the flexible nature of FPGAs is the domain isolation of the internal components of each node. Each component within a node, such as the external memory controller, on-chip memory controller, co-processor engines, and network interface, operate in different clock domains and may even have different data path widths. The logic isolation allows each part of the node to be optimized independent of the other components, so that the requirements of one component do not place unnecessary restrictions on the other components. It is important to suggest a path to industry adoption of any new technology, and to understand the concerns that might prevent the community from accepting a new technology or design philosophy. Today, almost all computer systems make use of a dedicated memory controller chip, typically in the form of a Northbridge chipset in typical compute nodes or otherwise a dedicated memory controller on a mass-memory board in a chassis system. Instead of replacing the main processors of each node with the FPGA node design presented in this thesis, a less-risky approach would be to replace the dedicated memory controller chips in these systems with this FPGA design. This approach would integrate some limited processing capability close to the memory, circumventing the need to provide an efficient data path from host memory to an FPGA co-processor, as well as permit a variety of high-performance interconnects through the reconfigurable fabric of the FPGA, as opposed to the typical PCI-based interconnects associated with most memory controller chips. Future Work There are a variety of promising and interesting directions that could be taken to extend the findings of this research. While valuable insight has been gained, there were several restrictions imposed by the experimental hardware, as well as compromises made in order to reduce the scope of the research to a manageable extent. From expanding the size of the testbed, to 97

PAGE 98

including Serial RapidIO and other logical layer variants, to experimenting with different applications or even application domains, this thesis is just the start of a potentially large and lucrative body of unique academic research.

The most immediate step to extend this research is to increase the size of the testbed, either through integration of RapidIO switches with the experimental platform and increasing the node count, or by selecting a new, more stable experimental testbed altogether. A larger node count is necessary to mimic a realistic flight system, and switched topologies would enable more complex inter-node traffic patterns as well as high-speed data input into the processing system. A high-speed data source is needed to enable true real-time processing for applications such as GMTI. Ideally, a chassis-based platform could be used to replace the stand-alone boards used for this research, providing a more realistic platform as well as reducing development risk.

Furthermore, it should be noted that GMTI is only one example application of interest for high-performance space computing. Other applications within the domain of space-based radar and remote sensing, such as Synthetic Aperture Radar or Hyperspectral Imaging, place different loads and requirements on the underlying system architecture. Other application domains, such as telecommunication or autonomous control, could also be investigated to cover a diverse range of application behaviors and requirements.

Finally, only one of several RapidIO variants was used in this thesis research, the 8-bit parallel physical layer and the Logical I/O logical layer. Serial RapidIO has become much more popular since the initial release of the RapidIO specification, and other logical layer variants such as the Message Passing logical layer would provide an interesting contrast to the programming model offered by the memory-mapped Logical I/O logical layer variant. To provide a thorough evaluation of RapidIO's suitability for the targeted application domain,

PAGE 99

several high-performance interconnect protocols could be evaluated and compared in order to highlight the strengths and weaknesses of each.

PAGE 100

LIST OF REFERENCES

[1] D. Bueno, C. Conger, A. Leko, I. Troxel, and A. George, "Virtual Prototyping and Performance Analysis of RapidIO-based System Architectures for Space-Based Radar," Proc. of 8th High-Performance Embedded Computing (HPEC) Workshop, MIT Lincoln Lab, Lexington, MA, September 28-30, 2004.

[2] D. Bueno, A. Leko, C. Conger, I. Troxel, and A. George, "Simulative Analysis of the RapidIO Embedded Interconnect Architecture for Real-Time, Network-Intensive Applications," Proc. of 29th IEEE Conference on Local Computer Networks (LCN) via the IEEE Workshop on High-Speed Local Networks (HSLN), Tampa, FL, November 16-18, 2004.

[3] D. Bueno, C. Conger, A. Leko, I. Troxel, and A. George, "RapidIO-based Space Systems Architectures for Synthetic Aperture Radar and Ground Moving Target Indicator," Proc. of 9th High-Performance Embedded Computing (HPEC) Workshop, MIT Lincoln Lab, Lexington, MA, September 20-22, 2005.

[4] T. Hacker, R. Sedwick, and D. Miller, "Performance Analysis of a Space-Based GMTI Radar System Using Separated Spacecraft Interferometry," M.S. Thesis, Dept. of Aeronautics and Astronautics, Massachusetts Institute of Technology, Boston, MA, 2000.

[5] M. Linderman and R. Linderman, "Real-Time STAP Demonstration on an Embedded High Performance Computer," Proc. of 1997 IEEE National Radar Conference, Syracuse, NY, May 13-15, 1997.

[6] R. Brown and R. Linderman, "Algorithm Development for an Airborne Real-Time STAP Demonstration," Proc. of 1997 IEEE National Radar Conference, Syracuse, NY, May 13-15, 1997.

[7] D. Rabideau and S. Kogon, "A Signal Processing Architecture for Space-Based GMTI Radar," Proc. of 1999 IEEE Radar Conference, April 20-22, 1999.

[8] D. Sandwell, "SAR Image Formation: ERS SAR Processor Coded in MATLAB," Lecture Notes, Radar and Sonar Interferometry, Dept. of Geography, University of Zurich, 2002.

[9] Pentek, Inc., "GateFlow Pulse Compression IP Core," Product Sheet, July 2003.

[10] A. Choudhary, W. Liao, D. Weiner, P. Varshney, R. Linderman, M. Linderman, and R. Brown, "Design, Implementation, and Evaluation of Parallel Pipelined STAP on Parallel Computers," IEEE Trans. on Aerospace and Electrical Systems, vol. 36, pp. 528-548, April 2000.

[11] M. Lee, W. Liu, and V. Prasanna, "Parallel Implementation of a Class of Adaptive Signal Processing Applications," Algorithmica, 2001.

PAGE 101

[12] M. Lee and V. Prasanna, "High Throughput-Rate Parallel Algorithms for Space-Time Adaptive Processing," Proc. of 2nd International Workshop on Embedded HPC Systems and Applications, April 1997.

[13] A. Ahlander, M. Taveniku, and B. Svensson, "A Multiple SIMD Approach to Radar Signal Processing," Proc. of 1996 IEEE TENCON, Digital Signal Processing Applications, vol. 2, pp. 852-857, November 1996.

[14] T. Haynes, "A Primer on Digital Beamforming," Whitepaper, Spectrum Signal Processing, Inc., March 26, 1998.

[15] J. Lebak, A. Reuther, and E. Wong, "Polymorphous Computing Architecture (PCA) Kernel-Level Benchmarks," HPEC Challenge Benchmark Suite Specification, Rev. 1, June 13, 2005.

[16] R. Cumplido, C. Torres, and S. Lopez, "On the Implementation of an Efficient FPGA-based CFAR Processor for Target Detection," International Conference on Electrical and Electronics Engineering (ICEEE), Acapulco, Guerrero, Mexico, September 8-10, 2004.

[17] D. Bueno, "Performance and Dependability of RapidIO-based Systems for Real-time Space Applications," Ph.D. Dissertation, Dept. of Electrical and Computer Engineering, University of Florida, Gainesville, FL, 2006.

[18] M. Skalabrin and T. Einstein, "STAP Processing on Multiprocessor Systems: Distribution of 3-D Data Sets and Processor Allocation for Efficient Interprocessor Communication," Proc. of 4th Annual Adaptive Sensor Array Processing (ASAP) Workshop, March 13-15, 1996.

[19] K. Varnavas, "Serial Back-plane Technologies in Advanced Avionics Architectures," Proc. of 24th Digital Avionics System Conference (DASC), October 30-November 3, 2005.

[20] Xilinx, Inc., "Xilinx Solutions for Serial Backplanes," 2004. http://www.xilinx.com/esp/networks_telecom/optical/collateral/backplanes_xlnx.pdf. Last accessed: March 2007.

[21] S. Vaillancourt, "Space Based Radar On-Board Processing Architecture," Proc. of 2005 IEEE Aerospace Conference, pp. 2190-2195, March 5-12, 2005.

[22] K. Compton and S. Hauck, "Reconfigurable Computing: A Survey of Systems and Software," ACM Computing Surveys, vol. 34, no. 2, pp. 171-210, June 2002.

[23] P. Murray, "Re-Programmable FPGAs in Space Environments," White Paper, SEAKR Engineering, Inc., Denver, CO, July 2002.


[24] V. Aggarwal, "Remote Sensing and Imaging in a Reconfigurable Computing Environment," M.S. Thesis, Dept. of Electrical and Computer Engineering, University of Florida, Gainesville, FL, 2005.

[25] J. Greco, G. Cieslewski, A. Jacobs, I. Troxel, and A. George, "Hardware/Software Interface for High-performance Space Computing with FPGA Coprocessors," Proc. of 2006 IEEE Aerospace Conference, Big Sky, MT, March 4-11, 2006.

[26] I. Troxel, "CARMA: Management Infrastructure and Middleware for Multi-Paradigm Computing," Ph.D. Dissertation, Dept. of Electrical and Computer Engineering, University of Florida, Gainesville, FL, 2006.

[27] C. Conger, I. Troxel, D. Espinosa, V. Aggarwal, and A. George, "NARC: Network-Attached Reconfigurable Computing for High-performance, Network-based Applications," International Conference on Military and Aerospace Programmable Logic Devices (MAPLD), Washington, D.C., September 8-10, 2005.

[28] Xilinx, Inc., "Xilinx LogiCORE RapidIO 8-bit Port Physical Layer Interface," DS243 (v1.3), Product Specification, 2003.

[29] Xilinx, Inc., "LogiCORE RapidIO 8-bit Port Physical Layer Interface Design Guide," Product Manual, November 2004.

[30] Xilinx, Inc., "Xilinx LogiCORE RapidIO Logical I/O and Transport Layer Interface," DS242 (v1.3), Product Specification, 2003.

[31] Xilinx, Inc., "LogiCORE RapidIO Logical I/O & Transport Layer Interface Design Guide," Product Manual, November 2004.

[32] S. Elzinga, J. Lin, and V. Singhal, "Design Tips for HDL Implementation of Arithmetic Functions," Xilinx Application Note, No. 215, June 28, 2000.

[33] N. Gupta and M. George, "Creating High-Speed Memory Interfaces with Virtex-II and Virtex-II Pro FPGAs," Xilinx Application Note, No. 688, May 3, 2004.

[34] S. Fuller, RapidIO: The Next Generation Communication Fabric for Embedded Application, John Wiley & Sons, Ltd., West Sussex, England, January 2005.

[35] G. Shippen, "RapidIO Technical Deep Dive 1: Architecture and Protocol," Motorola Smart Network Developers Forum 2003, Dallas, TX, March 20-23, 2003.

[36] J. Adams, C. Katsinis, W. Rosen, D. Hecht, V. Adams, H. Narravula, S. Sukhtankar, and R. Lachenmaier, "Simulation Experiments of a High-Performance RapidIO-based Processing Architecture," Proc. of IEEE International Symposium on Network Computing and Applications, Cambridge, MA, October 8-10, 2001.


[37] RapidIO Trade Association, "RapidIO Interconnect Specification Documentation Overview," Specification, June 2002.

[38] RapidIO Trade Association, "RapidIO Interconnect Specification (Parts I-IV)," Specification, June 2002.

[39] RapidIO Trade Association, "RapidIO Interconnect Specification, Part V: Globally Shared Memory Logical Specification," Specification, June 2002.

[40] RapidIO Trade Association, "RapidIO Interconnect Specification, Part VI: Physical Layer 1x/4x LP-Serial Specification," Specification, June 2002.

[41] RapidIO Trade Association, "RapidIO Interconnect Specification, Part VIII: Error Management Extensions Specification," Specification, June 2002.

[42] C. Conger, D. Bueno, and A. George, "Experimental Analysis of Multi-FPGA Architectures over RapidIO for Space-Based Radar Processing," Proc. of 10th High-Performance Embedded Computing (HPEC) Workshop, MIT Lincoln Lab, Lexington, MA, September 19-21, 2006.

[43] IBM Corporation, "The CoreConnect Bus Architecture," White Paper, Armonk, NY, 1999.

[44] J. Jou and J. Abraham, "Fault-Tolerant Matrix Arithmetic and Signal Processing on Highly Concurrent Computing Structures," Proceedings of the IEEE, vol. 74, no. 5, pp. 732-741, May 1986.


BIOGRAPHICAL SKETCH

Chris Conger received a Bachelor of Science in Electrical Engineering from The Florida State University in May of 2003. In June of 2003, he also received his Engineer Intern certificate from the Florida Board of Professional Engineers in preparation for a Professional Engineer license. In his final year at Florida State, he began volunteering for the High-performance Computing and Simulation (HCS) Research Lab under Dr. Alan George through the Tallahassee branch of the lab. Upon graduation from Florida State, Chris moved to Gainesville, FL, to continue his education under Dr. George and with the HCS Lab at The University of Florida. He became a research assistant in January 2004 and has since worked on a variety of research projects involving FPGAs, parallel algorithms, and the design of high-performance embedded processing systems.


F20101112_AACVNO conger_c_Page_053.txt
d86892f16f0caff68c0eb9ca87686e7f
a84463c29eadcb35dafb03bc05f32fc5fb0a553a
2318 F20101112_AACVOD conger_c_Page_069.txt
7440c099210aac8b5813e97ff14dd218
33ffe1d3276031249b2c7d184e90c6e447f731f0
2218 F20101112_AACVNP conger_c_Page_054.txt
4f86b4d81fcc69f69a5a2be5338b3e7e
580b27bcbd9fce23f02c2fd57c32add42b11018a
2258 F20101112_AACVOE conger_c_Page_070.txt
f5ec69c5fc0899e09807bc73be7b7684
0bd24642b6bb8e350ea22ef721d0c884923ad1dd
3233 F20101112_AACVNQ conger_c_Page_055.txt
13b9136c160052918c9a9a104d12c9b9
f725cb919f1b0412ded307c4166b41d8a87717a5
2126 F20101112_AACVOF conger_c_Page_071.txt
fc1fec2de454ccee939277aef3e46d98
9135a878344aec340fdaca6ec2c716cf9289c4da
2248 F20101112_AACVNR conger_c_Page_056.txt
9ac20b7243536736e851bbb416fb2c10
b6babea95bcb52231dcbb811d44e1482c4dbace4
1834 F20101112_AACVOG conger_c_Page_072.txt
0ebc8f59acdfe246dfd5a24fe2e6c466
103b6bc09bbbeb77e3470bb83e658f74d29b675a
1792 F20101112_AACVNS conger_c_Page_057.txt
af111ca8eea4e92ec2c3e60a6999ed6d
ed168bc8f35af9898380dbee7647d0bc7841a3b1
2155 F20101112_AACVOH conger_c_Page_073.txt
05a45bc32c42c966e4506dd360f15d63
e6fc121d25d0ada7cd3c5805c72c4f8f702d6f4d
3335 F20101112_AACVNT conger_c_Page_058.txt
38dfd2e994791e5b581de7493d6d12d0
83dcdcbc90efcdbb09034fb6f176f8609242ec74
2119 F20101112_AACVOI conger_c_Page_074.txt
d60836a1b3864b265b147a8ec166d494
f3339b771b5db1fb2ac6ab574c068cd1d92d9b6d
3322 F20101112_AACVNU conger_c_Page_060.txt
0c7c25128ab8b2a85283484c3e4d21bb
dd1a25d9a0d73967ec50d0bd0bc2cd7c70e1503d
1987 F20101112_AACVOJ conger_c_Page_075.txt
16ba920fde6d890e069918b53085f95d
425718142e355f189f4e7683ee78e93d0db02553
872 F20101112_AACVNV conger_c_Page_061.txt
fe56b93abddac71ff46c8b078e27c1a9
c9b349bd0f2a9b71d87fac1d999cef13209a71c5
2274 F20101112_AACVOK conger_c_Page_076.txt
276f4d00ff22120a5f898891fed6b488
9467a84392643268a679c62134ab5ef61563a533
2134 F20101112_AACVNW conger_c_Page_062.txt
87447252a9c906314e679c3f1effb17e
6f7c7c261db0e05d2ad14508e98d9f24eb380210
2183 F20101112_AACVPA conger_c_Page_092.txt
309995531f8ed6cb5ff7ea9b820aaea8
31447bf494c586d81e4930a39055af2293f0b5e3
2074 F20101112_AACVOL conger_c_Page_077.txt
07f07fe4002561f0e14671410f2df1d9
8d1719350b286a1fe3f08cebec11420f90d0aeae
941 F20101112_AACVNX conger_c_Page_063.txt
e348a785b7e20a45da32612dcc23b7e8
e8478bddc9f27e02005c992cc1d98ec31e7898e6
2470 F20101112_AACVOM conger_c_Page_078.txt
0c3de76e983469991f828f5472c95820
aee0e7a9480320116b9c405f14eaaa326dea9287
2099 F20101112_AACVNY conger_c_Page_064.txt
d84a2fa5417a586cc1117578f927a2df
94b5f4f20412aa9d21f6e3178bd1c5203b97b240
1970 F20101112_AACVPB conger_c_Page_093.txt
1a5700c9a77069825fe6542539e31dcd
0feebc5489326c5337a12623a321eed4a942575a
2054 F20101112_AACVON conger_c_Page_079.txt
26e8c064026851acca536fc5e7b94b7d
d47e3389bcf6788dde31782398cc7c1d5047932c
F20101112_AACVNZ conger_c_Page_065.txt
d9be2ea5f909f18b6d90b799a4bacfbf
541133f335ccc686b1c28ebbf2f4e013f6705135
259 F20101112_AACVPC conger_c_Page_094.txt
2f30bb7c747684985fe35f6c8f62b24e
27002f44b4224bc259cc1d9328554e57dd0b28fd
1358 F20101112_AACVOO conger_c_Page_080.txt
27e716cf4bd733d3d9f894d67037bf78
cb2f43b48df78a3eaf965b91b33ea636624a6e03
1954 F20101112_AACVPD conger_c_Page_095.txt
9de74a8092935ad1f865695d93a4aafa
8fbc38de01b379d982405980e6adc8131c9308ee
2097 F20101112_AACVOP conger_c_Page_081.txt
327dd7096f7956d2818396d0fdb2a463
62b97ef6b398fad0637cadd07c6ab90f18bd3208
F20101112_AACVPE conger_c_Page_096.txt
ed94ce1b89e20fbcd800c734dc04aef1
b79ab79771019fe12e214d213f0d3d51edb1679b
1775 F20101112_AACVOQ conger_c_Page_082.txt
bf759ee28261facb2ebdd83332db0045
a94dbb42af96c65c8f29a1fc532bd0acc5f59c2f
F20101112_AACVPF conger_c_Page_097.txt
f34e5d33228734c1afd15c431f479b7a
99febd49ac8196306d1bb25c817905ee2f646d26
2328 F20101112_AACVOR conger_c_Page_083.txt
e37061fab2f904b987a8c5eed011be4d
c10d2b703110bcefa4d2cada9de58c30ff315a45
2152 F20101112_AACVPG conger_c_Page_098.txt
9a57c39835b9c7c4a57e749e18acc8d3
e1ada24f5bab0b1be2473f9baee33ff3a21a0867
2180 F20101112_AACVOS conger_c_Page_084.txt
327894e27fd4a8c9a0f8106a844909a3
d37a10ac7b609ea0797d1d7fcdb129ee48787028
159 F20101112_AACVPH conger_c_Page_099.txt
96d2d5d1ce73c10510d77bbe6d661a3f
432e73d695f87ed1b53a484a66270cb40bd14247
2010 F20101112_AACVOT conger_c_Page_085.txt
e87e4fdedbb0b25307fb889e452a5467
e65fe202bfa3fc760735d705cca7ff23392004d9
2600 F20101112_AACVPI conger_c_Page_100.txt
f210d25d5dc7f06f2a72e65394eeae3a
5e38026f2d0636e6ba221d0696215d71ed304fb6
1934 F20101112_AACVOU conger_c_Page_086.txt
f09791eb025048ff738d36bea562811d
164f795a641a73ae8dcdd4afd43c7dff0ef8d987
2418 F20101112_AACVPJ conger_c_Page_101.txt
c9551c8aec79cd44d1cde3057392a9fd
fa6d216d59589d5b66923800f68890e155034464
1759 F20101112_AACVOV conger_c_Page_087.txt
3f4affe81ae787ef69feb6f6e3ecf35a
973dedef8da508ced1d83fde457611fe324aa46b
2550 F20101112_AACVPK conger_c_Page_102.txt
13802e3af78a629002c81be1a6cd433c
faf78695494ab0b752b0d6ce89de9705c4e21d5c
1951 F20101112_AACVOW conger_c_Page_088.txt
2e262a8e82c5c5d0d4f2f43f4b55adfa
2591922a31aae27842e24b5ef0cff9495d694308
1345 F20101112_AACVPL conger_c_Page_103.txt
ffca3bc6af50bb0cd83ef89a8c5a7c4d
181ea933438e8db293b558c40664735c72360bab
1983 F20101112_AACVOX conger_c_Page_089.txt
31afaac00b0c93929ac7f2a453170edf
f72920dd0ebf78445a8822a04b401a6449db5649
5905 F20101112_AACVQA conger_c_Page_007.QC.jpg
d1a77e683274e3c54da530dfad195ca5
aabb37fe14edcb10239a82b5cc51fd8da600a802
957 F20101112_AACVPM conger_c_Page_104.txt
c1e27dc36e0ddbdb698aac612f5bf72f
6ab5493cf45ea974bbabe3f37665544ca2437746
2208 F20101112_AACVOY conger_c_Page_090.txt
ab677854649c96d6f40b60debf7c7fb3
850ab68fe72f24142e20f17efe90eec5a1fd1d27
1635 F20101112_AACVQB conger_c_Page_007thm.jpg
26bd82f3d7892c602d580e831ce772cf
d20fc7174230d0d809051ce5cb50f6db1bb88547
1936 F20101112_AACVPN conger_c_Page_001thm.jpg
1136563aaf9253b594ea3121f413b83e
3ea3e332d1552f75fc44c6a287bddac62b427ca0
1631 F20101112_AACVOZ conger_c_Page_091.txt
2b3fb9e8dd0a355f74974fefe20c861e
314e05c2fac0ab37156ee698db961320f1549896
1812093 F20101112_AACVPO conger_c.pdf
78ebbdb1eb16e40e6a6d76ca3577f7da
3ec952a2210e7571dcbbaf5eabab3e13a4ba5166
31420 F20101112_AACVQC conger_c_Page_008.QC.jpg
69f96aa0145680263e6899fdfa7032f5
73e0ed8b645affbababa28fced6c721d19865bd7
7448 F20101112_AACVPP conger_c_Page_001.QC.jpg
7fd928db4ae353e17bc505537b47f4ff
c5338fee69be718e0a61e36ef9f4199f72a2cff3
7421 F20101112_AACVQD conger_c_Page_008thm.jpg
8e18c1a2ed10ed195198172ece45135f
7d8a14ebe41be31dec655b2b7f1a93c88e9c2759
1120 F20101112_AACVPQ conger_c_Page_002.QC.jpg
e7a99836f2feec6e074dc6c354f7b494
9569474bc2550b8d5e7b50b70157a70983fc57e0
26830 F20101112_AACVQE conger_c_Page_009.QC.jpg
317b54d3a4653fbbda2cef9b50af25cc
197cc7e6dcb0ccbd59aea20d90500ab111964ef0
494 F20101112_AACVPR conger_c_Page_002thm.jpg
9fe490255de1db245ccfe4c27c63547c
88615caab9daaa7a1a3157b3a7c0d5a331137e8c
6480 F20101112_AACVQF conger_c_Page_009thm.jpg
c19b86102c2e1662863bdf41c4e94afe
58edeb253f509caa7bb1cb2b87d72b1daf838439
1702 F20101112_AACVPS conger_c_Page_003.QC.jpg
28df3f9c7eedd7f3c1d32c330fd7a721
bb71988c7e61c0e63564effb55f1e53c02bc2a42
31395 F20101112_AACVQG conger_c_Page_010.QC.jpg
a123b588167c6603bf3424a3eea515e9
a510408eb628d2d75298c4a6df983440b7ad0449
716 F20101112_AACVPT conger_c_Page_003thm.jpg
a9d06f80304f2473553b78c2c52b444d
9d80dd42674db7eeaca0c0d49fdc9f408098dcb4
7797 F20101112_AACVQH conger_c_Page_010thm.jpg
caa08febef0baade59ad3ab7da6bcff0
30bb74bf9b50dfc744586168a02c1fd99438dc14
11927 F20101112_AACVPU conger_c_Page_004.QC.jpg
b789eeba1fb8725c05acdddd7e230168
efdbc4fe75494090162b1f77a89928288f1945d3
8678 F20101112_AACVQI conger_c_Page_011thm.jpg
fec99fd25a455d1986daf7b6c79fb6f9
9cba3cc5aa904ef5e6b3ab311aae11a28e4c488c
3022 F20101112_AACVPV conger_c_Page_004thm.jpg
b7761d74c702c65ba2ff250cc7c6ed23
fe6a95f8cb9be02a2cfda67b6ff2f0e7f453e045
35507 F20101112_AACVQJ conger_c_Page_012.QC.jpg
d1de1905fc7f12737bc9be93b1a21a8e
8f13df97e0bb607dc0996ece53c5886af724c72e
25010 F20101112_AACVPW conger_c_Page_005.QC.jpg
ddb93daaf1c4e5dcac03d019f0d28f5a
9643dbc2a568006400d26bfce2b62631595ab3b5
8816 F20101112_AACVQK conger_c_Page_012thm.jpg
9756ee86231b65977581f3c0c8236851
70ad435e1a153467723be079b7c65deff3e64013
6268 F20101112_AACVPX conger_c_Page_005thm.jpg
6a1546da1642e8831b7b19102fb2e915
88244777c1ec8c43a6f5083afc22d2391de254b6
22689 F20101112_AACVRA conger_c_Page_021.QC.jpg
7eb8ea80de551cbb41f06352d88d6a07
82cf50d30e90e35725c477ce87a9d4ce8e0e2655
34264 F20101112_AACVQL conger_c_Page_013.QC.jpg
0265c876f40c50476b9aea2fed4f3005
f852df0a285bf000b91b3fd8301d32bf29b5847d
10120 F20101112_AACVPY conger_c_Page_006.QC.jpg
22ebd035fbaef00470126ec595d663a5
30662b2606262c58d23b0e8654d2f8acfc906de5
6357 F20101112_AACVRB conger_c_Page_021thm.jpg
0765797a8c7b1c82565539d4de06012d
c928c9ab4e537638e4a4a5933ebca36208c9c097
8097 F20101112_AACVQM conger_c_Page_013thm.jpg
c2cbe02e1542b23b611fddff5e2ebb0a
ce500ee0871bc93f154d36814f953ad1a2f62531
2521 F20101112_AACVPZ conger_c_Page_006thm.jpg
534333753ef625828a56e40ec8bae518
6caad8560c1cbfa954daa21f02546d57ce28f85f
33552 F20101112_AACVRC conger_c_Page_022.QC.jpg
ea857359e15daba9dddbfe8769f6f941
6f5f78f7dbd891ec3d041f11a63e996168d623ce
36728 F20101112_AACVQN conger_c_Page_014.QC.jpg
17d58aeec80aa92014e8af1398d3904d
f3af9cd79d9a69d941b98a4288620f134dae593f
9131 F20101112_AACVQO conger_c_Page_014thm.jpg
758a1da176693ace08356f0ec7b57e95
0f591eba1a102e6feaa0bfbc87ac372f0b6731c4
8322 F20101112_AACVRD conger_c_Page_022thm.jpg
34ed8ea6f4b9ce573ca3a3a60d303a96
5f58e33c7dd80877bc90b384deef33221e161472
3771 F20101112_AACVQP conger_c_Page_015thm.jpg
bc108bf2254d9f222c46aa281f272a71
c92197a00cae837d8d59b36f6410c8943f303ac2
30992 F20101112_AACVRE conger_c_Page_023.QC.jpg
d5d3b4b9e24bc58123db94fbe0181ad6
375dc05b1a621434bc0436c3c4720eae6576a1bf
26504 F20101112_AACVQQ conger_c_Page_016.QC.jpg
e3c4ce1d56f56554593254603059b69f
30b4469d961f5755764dbbb18e5ab093299381cb
7978 F20101112_AACVRF conger_c_Page_023thm.jpg
65c52cd440590773b1e5142490bfd201
dbe7daf9fc432bcfc8cff17cfd96c9ca79a045b9
6852 F20101112_AACVQR conger_c_Page_016thm.jpg
23032b5b97ff7ff53e40db454d1cd025
b94fb2ddfcfdcc27b610542b3099e4380ba8a902
26235 F20101112_AACVRG conger_c_Page_024.QC.jpg
397c8e10653a2f19a327a11eae4c4f06
c179fc4aafbb0f2246bfa8a74f7e4def11f5570e
30599 F20101112_AACVQS conger_c_Page_017.QC.jpg
2010e4457f99fb3e25b8011d941303dd
63e53a2000174dba681a353a0901b80bc05219ef
7040 F20101112_AACVRH conger_c_Page_024thm.jpg
6315563c289c8b58347363c63509a653
5895f67a6d5ce94ba9b3be6140757dd062a0b8d1
7429 F20101112_AACVQT conger_c_Page_017thm.jpg
4ddd50a8b397390c6d695a8a61b97e4c
313d81aeac4ee25addc6b13e5cb25724969a584c
33496 F20101112_AACVRI conger_c_Page_025.QC.jpg
60b3abfefc01f28928f6faea1bac346e
ebc9338898348be21bd1fb581b6aabfb6ee4fbea
33619 F20101112_AACVQU conger_c_Page_018.QC.jpg
57954c328d9061aaee326bb7a40f70cc
d693bd4083eb8f3aba48f0f2c7de1edae16ef123
8281 F20101112_AACVRJ conger_c_Page_025thm.jpg
a4276603f60fa6b3351e1e22c35fbc36
a23dc7463162f2060518ee6ce44ee5a67642a2ad
8615 F20101112_AACVQV conger_c_Page_018thm.jpg
e01b1f9078cde1300467bc3c7e6b6d09
d9a5ffb02c79fec0e707c53b7addcd8c060c00f1
35026 F20101112_AACVRK conger_c_Page_026.QC.jpg
f3d0a2fd99391513e9f22ff5352fb71a
3599a580346948bd59e60558e78327988f1855e2
31788 F20101112_AACVQW conger_c_Page_019.QC.jpg
95961289f6139007aa8d467a4c1b6926
cf61886e491c131aef36e5b1c869d16111323d1b
8570 F20101112_AACVSA conger_c_Page_034thm.jpg
ed93b5591ca9de0962c63ece44a5c716
14f977415d7e99dcef1e929fddb22a5e39eda4f6
8869 F20101112_AACVRL conger_c_Page_026thm.jpg
21c4bc9ac510ebf361bcd468bf63d4f3
c91976b5b60bc3d623285d884ec791c7d0945ca6
7880 F20101112_AACVQX conger_c_Page_019thm.jpg
b5cee0bca662c09b8beb5e158a708777
d3d7a46e57b115e9db5cd610ea7acbf19055cf5d
36028 F20101112_AACVSB conger_c_Page_035.QC.jpg
f8063c0883bff773b28e72d2f886eec9
afb514a89b110c928d19985067eab883019275db
35908 F20101112_AACVRM conger_c_Page_027.QC.jpg
d383bf001d465ce588b65178c4b4a249
8d6a5864bd4b04935e222dd146e54c19996bbea5
26673 F20101112_AACVQY conger_c_Page_020.QC.jpg
780c1b425055e874c1f2e7e07885dea1
540a3736b510408f56482d7bbc264d345a196a1a
8800 F20101112_AACVSC conger_c_Page_035thm.jpg
78db00501a98240f142a87327efa7ef0
7980635c900c1f4e0349b3eb3e43151ee052d988
8860 F20101112_AACVRN conger_c_Page_027thm.jpg
be4c362306645160cddb04851305d841
7b7e4587e2e9252ce4fc27c4e779c16763f0bab8
7288 F20101112_AACVQZ conger_c_Page_020thm.jpg
0e9dffe7c93e5a95624a243ee2993b8e
d8fef1ebb3bcf0981a0982a6eb8cd38940dabfa2
37942 F20101112_AACVSD conger_c_Page_036.QC.jpg
a1ef8ce0dbde998c81911d6bb231d3c1
fb1e0a66351fc4d9494258268cdd93c05e83f65d
8881 F20101112_AACVRO conger_c_Page_028thm.jpg
4b058e95f3ec1fcf08daca1dc5848a48
18a1855d95a702c48c02db9642bc49e69bba1471
35038 F20101112_AACVRP conger_c_Page_029.QC.jpg
9cddfaf19e72a272204ca2cdddbf1342
06e955ae8226330c4ec915ce39a75ba9f346eb27
10425 F20101112_AACVSE conger_c_Page_036thm.jpg
46b555c6a415bed9a1c6a574de792614
adca89609b5a6eb4e940d0d96578eaf6afb44ff8
9099 F20101112_AACVRQ conger_c_Page_029thm.jpg
6651cabe69cd82597417fc8e7c466a57
3668c9171858a9bbc232c1d62bcea4d654968b01
37071 F20101112_AACVSF conger_c_Page_037.QC.jpg
c2e6cf41c02ba97b9a31f76bd5b32063
b62964fc661b6c92d3431a7b477a8fb13a8a0287
37555 F20101112_AACVRR conger_c_Page_030.QC.jpg
7d38edf5e3d9342cabd90018ba1d9a22
d60fcfd41b75fc727c750a11abbb1e1780f32f74
9110 F20101112_AACVSG conger_c_Page_037thm.jpg
8e9aa8be49aebed5a8bf2ef17c98e339
0b1b19c26a4a7646121005c33c59d3aea845c048
9709 F20101112_AACVRS conger_c_Page_030thm.jpg
311e6b57c3107936df580cf8cb8ecff1
5a22212fd46cff3a7ef15efdc9955734a8b2d448
34078 F20101112_AACVRT conger_c_Page_031.QC.jpg
3a6539f71d6390dae63f428a3ab9b28f
21599671a3b86db84da119403eab7fa15338c182
35653 F20101112_AACVSH conger_c_Page_038.QC.jpg
aa0b592694b237ec5503b4bee52c5e24
5d65c1c928e4d838258f3fc6aa435618f0bccf47
8318 F20101112_AACVRU conger_c_Page_031thm.jpg
c1787f8c80f52a3c73dfcc87bad1216b
456cbb06ec7b6d5bde49dae5836e4865f1d24fe1
33144 F20101112_AACVSI conger_c_Page_039.QC.jpg
1fd052c94b5449fd38dbcfa0b554682b
7410eba81dea197123c5ccd66dcd939fc6771d06
34658 F20101112_AACVRV conger_c_Page_032.QC.jpg
bad45a63506bc54029a2478f54769165
3600c9f5eee438609bd8d2a4cde165d66c3c0f90
8149 F20101112_AACVSJ conger_c_Page_039thm.jpg
51b4544168df8819dbca4c1abab4a197
17a4785de2e382f9e516ee7c2275a74e8931161d
8719 F20101112_AACVRW conger_c_Page_032thm.jpg
7e58e1146dea6fec99e7cb9373ec4579
59ca178ef1c27ea5a39d82daf93d0b77b3989f97
29038 F20101112_AACVSK conger_c_Page_040.QC.jpg
6184d64e63c1bb66ffd0077668128505
5cdd40638b8e5aa05359f9d718eeb5b46a38af42
34139 F20101112_AACVRX conger_c_Page_033.QC.jpg
cf16c3c0eb4bf169fd1f8d3b7a2e8e85
8d9d2d04b44c1ecc1cc77f03bcce49dc10d82a61
8910 F20101112_AACVTA conger_c_Page_048thm.jpg
319928653459c3dad1428d961694a532
7697dd2523d27647d2c6d33463075a455f562236
7535 F20101112_AACVSL conger_c_Page_040thm.jpg
0b514160b0ce303a4a298bad6f8455f3
179296c04f4c174eaf9624f89baac6459ae8c6f1
8067 F20101112_AACVRY conger_c_Page_033thm.jpg
031a38369ddbb4ae6cc9e5eb63d2c1ba
a4569a7b0ae7b8bf7ebf8b99b8fa550952d90954
33908 F20101112_AACVTB conger_c_Page_049.QC.jpg
2ad496a75cf1ea0c0c0c6ebf3665a1b6
f20910f68e80824cc8cbf49233459719c833b4f4
32534 F20101112_AACVSM conger_c_Page_041.QC.jpg
c697186f40cb857f91cd5eee4816365f
02779821b9a0a26c95d19b4c311f29f0e3220b3b
33090 F20101112_AACVRZ conger_c_Page_034.QC.jpg
e63c13820c9f04544d2c678d23084999
2d6e7f99a56127072a57b5b1cb08112a3c99f50f
8637 F20101112_AACVTC conger_c_Page_049thm.jpg
851e18822f7f54e41542ca3c231b4b9f
4db35389e1a52395690a3bfd12179cb3d9c174fc
8187 F20101112_AACVSN conger_c_Page_041thm.jpg
3903a53b02418f33fabb0016a6971fb1
d772525e94eebf5458d39babf51fb61bd59fe29b
36790 F20101112_AACVTD conger_c_Page_050.QC.jpg
1845a836ea6f34445a38ff34d447b5c0
4ab8331b7f4b8a6759aeaf34c2e347cef5cda12a
33087 F20101112_AACVSO conger_c_Page_042.QC.jpg
5da0fb90c9fb0e4103d271de05d3636c
76e3d2e36925ad5093b2a3a12d8c1b75f3485586
8836 F20101112_AACVTE conger_c_Page_050thm.jpg
998d607b83c443ae905797de825eba67
9a59810c30187fb6bedf68ca421e1d135d1796d1
8270 F20101112_AACVSP conger_c_Page_042thm.jpg
93c811a3b4b5a9b31606129cab49ef9f
0abcd7ed8ffa8f6ddf5e95c9930d3835197bbcef
36744 F20101112_AACVSQ conger_c_Page_043.QC.jpg
481609add6563de64374cb546da2d0c0
86b2b981e18d29f141247f02d1c2ec9fc9171ba1
33027 F20101112_AACVTF conger_c_Page_051.QC.jpg
d0d3d23cbd7b299d96301f67b591c8f0
53e1aeb1b5cf51bc83acedd0943978372cb652f1
8750 F20101112_AACVSR conger_c_Page_043thm.jpg
c8cd2e1ac582c203cd29cdb084d50d58
293476e8ff7e592ef2079afb7eaf590f327443d5
8334 F20101112_AACVTG conger_c_Page_051thm.jpg
63cc7e3d34a9cde4c2b58334739476ac
ecf02c2e425eb1933f82cb1c3f08a659873d3a5f
31123 F20101112_AACVSS conger_c_Page_044.QC.jpg
a881fed178bc7d68c96ad90b95b48048
5851397f87fb706658a9827414aa9b8316f98cdb
30545 F20101112_AACVTH conger_c_Page_052.QC.jpg
7af444976c2894d7da96bae0aac0c560
dfea970fde462446aa56fef75890137a31dee909
7995 F20101112_AACVST conger_c_Page_044thm.jpg
ac55957af5d0c66a02fbbaedea62ece5
4dee2bff98268797dcea6443a04564cdd8b503d6
27511 F20101112_AACVTI conger_c_Page_053.QC.jpg
22beef04048f222f198ea61c8f8b4139
b6f3e9c8a3372ce1c4f233037c1c0b7306a5adb2
35844 F20101112_AACVSU conger_c_Page_045.QC.jpg
2f732be5a74b9fdd758ff94492c5a413
7d1b7376c9c096ca6640592859e2803c28ba0e2e
36506 F20101112_AACVTJ conger_c_Page_054.QC.jpg
65342c583805391a233cfd9302447b42
b5b7bd26b6d9b87da65841fa9270b24a0d75d2c8
29837 F20101112_AACVSV conger_c_Page_046.QC.jpg
3cda2b3e777b5235120f7612de6ce4cc
6f59f4d6dfecd1065e0999694b74f91639478d00
9013 F20101112_AACVTK conger_c_Page_054thm.jpg
738fa04b40554b72551e767723a69f32
bc80784d6be29a6fc5a7bcc0d02b919764524e52
7713 F20101112_AACVSW conger_c_Page_046thm.jpg
6217fe0578cf520761bc2bdc193eb048
6431349e75c862022ebb3ce80bc0d9a36dbb1450
8539 F20101112_AACVUA conger_c_Page_062thm.jpg
54b320b293cf13d3296d52cb137e4818
ac04523a3b5669f32b140be5d779aa533790c2fe
31201 F20101112_AACVTL conger_c_Page_055.QC.jpg
e492adad002fb77e1a21eae8cf868202
14198c0aa15afc14a75d5a109fc08779ee328170
29633 F20101112_AACVSX conger_c_Page_047.QC.jpg
a03d9fa24b1e5b47541c5028efe05aea
b00985bff89e737050d2e5bbb021a0eebc6edab0
28272 F20101112_AACVUB conger_c_Page_063.QC.jpg
dfbdf5bfa90ec7958e8534f7acd4efbb
aa810ec1d3a941a53832d4b1c9b91125bda95340
7854 F20101112_AACVTM conger_c_Page_055thm.jpg
3ea33f61af9bb463bcd329b96de3aa19
61aca5036508ff5faf4b546fc3b709e63adf11fc
7641 F20101112_AACVSY conger_c_Page_047thm.jpg
a388e72ecb92c0a21b91de583c4012df
4be3094e63f96e6047504c7099c24a3c6ed79f32
7930 F20101112_AACVUC conger_c_Page_063thm.jpg
f467ff5e8e0456b7a8ec5462b79e2f2d
6d091bb64e0fdf0e47578b02c543d1c6589c79c7
37745 F20101112_AACVTN conger_c_Page_056.QC.jpg
769cf8a2896715363719f7d68bf03501
401585b17f15f8a338835c033819aa9ecd85cedf
36306 F20101112_AACVSZ conger_c_Page_048.QC.jpg
e48806e61632140d10a7fb8cb6b7f778
388c8b6e9b4c7a8c0416d62968772058f33899c2
34507 F20101112_AACVUD conger_c_Page_064.QC.jpg
daae51f0b0910633905da1e985c25b60
19d92e17f171ad6b9c67bb82763b5f700e816d4f
9224 F20101112_AACVTO conger_c_Page_056thm.jpg
b7fee20629f541666e91ec4b82e3856a
b9efed5a417800e22f08808a80585801577c7a4a
8543 F20101112_AACVUE conger_c_Page_064thm.jpg
86bef639bc34e693fcd02eeac37a3397
993d34ffd62b36a843311c6d23b42b8f812e85d2
27997 F20101112_AACVTP conger_c_Page_057.QC.jpg
ac33404dcbe37625b130400a6f5de552
001673bd1d536b4044bb24a7e489b8197467953a
33304 F20101112_AACVUF conger_c_Page_065.QC.jpg
f4b444975f39632855ad3446bb96e813
03f87bafe17b54be8f381927fdb174d77e41fb6b
7479 F20101112_AACVTQ conger_c_Page_057thm.jpg
1fd050de2b16ab89743eeebfb12d9b87
c5406159bd0d1a7e3ef97319b2041a156fd5bb5d
32695 F20101112_AACVTR conger_c_Page_058.QC.jpg
9a1fdf809061e79b6b9412838868baab
511f817d10dbc844a17469f5dc031cd0fc4b1e68
8074 F20101112_AACVUG conger_c_Page_065thm.jpg
255c944458e2931d42eb923bda7eed7c
b8e1b519ea8bb75eb5bd19f3824955b50be66267
8555 F20101112_AACVTS conger_c_Page_058thm.jpg
49bdae07d16a3162630f4ed40f6aa70f
909f63b243c2ab8dbe54b226d869871a0e764d04
31407 F20101112_AACVUH conger_c_Page_066.QC.jpg
00da5b55553181513d65811c412e62dc
be9e2672c5a6b5ef928868fdc5d398f3708955fd
30660 F20101112_AACVTT conger_c_Page_059.QC.jpg
26daa8426ad431107febb924e4762129
7695711140a0935a4c84a9966f8b5ffecc8cb5c8
7751 F20101112_AACVUI conger_c_Page_066thm.jpg
575875de4be99b64b91c1e48a40fe6f1
20cd21bf4f473ab9e43e739d928d9142e4fe9e4b
8120 F20101112_AACVTU conger_c_Page_059thm.jpg
3716b911646af91fef6862755bd9a592
be1dae325408881b3aacfbda4ca8d915b6d55217
27116 F20101112_AACVUJ conger_c_Page_067.QC.jpg
f4060760cb5e1e26e520f74ac9ad08d8
e8350737e5a2337950d4d900c832e1be0a3bae07
31707 F20101112_AACVTV conger_c_Page_060.QC.jpg
e62038d0f72c827fb0f3d9d4add62865
87f20da0a6a6301c9ffa70a485b8654898a756ef
35346 F20101112_AACVUK conger_c_Page_068.QC.jpg
c3d062242de31cb38cc8d15e10a5a696
f43510196fd7e1ac4ef324efaa4eac12a59274b1
8150 F20101112_AACVTW conger_c_Page_060thm.jpg
10a6d6c199d856c6472409ba8ebecdad
ce605e249ad3bf8dc3bbd4d14a3f455cec3ff3ff
8698 F20101112_AACVUL conger_c_Page_068thm.jpg
83d36f4b82fa53346a11f56b548a3ee7
85bfead44c3a7bb0bfa3ccb3acbbef2d7f537806
15611 F20101112_AACVTX conger_c_Page_061.QC.jpg
dcb397589b1f1232d727278e7d2f59b4
40be05c7946bf2632f39127fa10d8aa9ecabffa4
9055 F20101112_AACVVA conger_c_Page_076thm.jpg
e08d08aa6bdfc2ffa5d23d89d8c0ad45
6ab527c1ff7a722d8736c404fc918b1857cb588c
35415 F20101112_AACVUM conger_c_Page_069.QC.jpg
7e58ac218470b8d3584d7d02acdc4796
53fbd9408141850006df157d8f5cb037a33017ff
4011 F20101112_AACVTY conger_c_Page_061thm.jpg
6b53b28da96e7d249430947249909fc0
1190c8e868de925599667bc20da9a5e388543a17
33515 F20101112_AACVVB conger_c_Page_077.QC.jpg
943556fb83f26dd6860969286a435e0c
fcd9444f594f1ee50a1ec7f10d6791838eba6471
F20101112_AACVUN conger_c_Page_069thm.jpg
4b0a0ffd5846634d8ac762ce08c5ca08
60158aad8ebd6ef8e658d77f457f8853f0f6ef83
34745 F20101112_AACVTZ conger_c_Page_062.QC.jpg
89db6604143699dec6035de8783840c3
56288abe6ab35016e37778ef1d6cd5a02ca77844
8218 F20101112_AACVVC conger_c_Page_077thm.jpg
ce427cbbf703114393eb93d5cfcec49b
5b108ee45033b897ee4cce88d71de480c5c75425
37331 F20101112_AACVUO conger_c_Page_070.QC.jpg
7a6d4a8a184fe7f13956f05e0f337dcc
3538d367d212e845dc5358e458060bbb8685eb3f
34593 F20101112_AACVVD conger_c_Page_078.QC.jpg
f70b3bd207d30f1208f09678dca97f7a
32f2007dafcb80b1980ad0034bb3518647dd3d31
35827 F20101112_AACVUP conger_c_Page_071.QC.jpg
0036b8648411fcc5c330f80a4a598664
fdf6561e573f0df33b15d552b0b440de70c5b5bf
9067 F20101112_AACVVE conger_c_Page_078thm.jpg
a6e9225b85aadc7aa518d79e15d6ad4e
617ee5ae46df971136e3a8e17e76899432461b2e
8558 F20101112_AACVUQ conger_c_Page_071thm.jpg
3e3a065bb9337712389e8aeab5505520
460fc08bd0013cac90dce5eed82639f04ed3b932
34167 F20101112_AACVVF conger_c_Page_079.QC.jpg
dda576c02b69d9f4aa9ca23ae857f552
72ab86353c4da4b1a28998c51951bd9fe1a61050
31404 F20101112_AACVUR conger_c_Page_072.QC.jpg
ccf8f3228b56ff4b39a18ec157a8a675
1338b3109339649fbb164fa0f03f52a4054131ed
7321 F20101112_AACVVG conger_c_Page_080thm.jpg
1b3b667eab09fa5583d99291dc4e53bc
dec84dade926a91e10c0b8ca4981bb7287be587d
8285 F20101112_AACVUS conger_c_Page_072thm.jpg
d01614b1b207fa90ea83737ab9aeedd4
426a1483bc965c87cf340faa3052778d484a28e4
35967 F20101112_AACVUT conger_c_Page_073.QC.jpg
a2f6a8af7bf241e58d298bc9511c6fab
b4b21054893b995d5aad5094e775d1873ba8269e
8485 F20101112_AACVVH conger_c_Page_081thm.jpg
41db439b22264a37a9ed3f1655a4e298
1bbf0958c3ce6a440721c83e121d488b44097161
8861 F20101112_AACVUU conger_c_Page_073thm.jpg
c83061ae63f98f4dd66257eae8d49857
da82f82f43076ae92c7da66885a721ccbe33ac6f
32506 F20101112_AACVVI conger_c_Page_082.QC.jpg
dc9cf2cd470bdbaf8c0bafda9f88e73b
c656908d9e3a5432de9eb5dcfab27552ecf7a994
35869 F20101112_AACVUV conger_c_Page_074.QC.jpg
75dbc2ccb6b967061588579508fb625d
6dcae563fcc36f31ac5132f0e7adf23514d3a109
8701 F20101112_AACVVJ conger_c_Page_082thm.jpg
1cb0e27a7bcdc2313d17b636e370ddd9
e2924fac67b5c3a2328d4f8cc874c9e6d1e89895
9036 F20101112_AACVUW conger_c_Page_074thm.jpg
c18b81e238a5107ac22b7db07ef9ed22
2f82951a74787e59e3c44d25dca8a2809e18808b
33810 F20101112_AACVVK conger_c_Page_083.QC.jpg
2eedd51b555bcbaf99922b3aeb750e20
e920289a821477af4633510941fd534ef9161db3
30780 F20101112_AACVUX conger_c_Page_075.QC.jpg
14843072b58121221720d6167d384b3d
eafe6a7b18431e1ac7d0247c9f44bdc67627ad70
26439 F20101112_AACVWA conger_c_Page_091.QC.jpg
4902873301ac602bdc744f87673e64ae
f405850dfd6802565adc75fc8152bebeb6d08090
8606 F20101112_AACVVL conger_c_Page_083thm.jpg
a8116e2738ab9a50c2e3dce2ce136458
77394a4585e29adb14ba17653c122f26b5ccf735
7644 F20101112_AACVUY conger_c_Page_075thm.jpg
9f9bd998eee8adf06a07796a90cf8c8d
418c49950519d6821d1f28d099b844b60e645309
6906 F20101112_AACVWB conger_c_Page_091thm.jpg
8b82a7608769a3ccdae8bc688852ee57
6549c7ff8feed00328a7eea2d42e95f61fdc9799
36347 F20101112_AACVVM conger_c_Page_084.QC.jpg
8307e5ca465ceb9a73eb0d2dfd3d7edd
ba8fe8959df6a0f303a1adac15f446709c24d0f1
37484 F20101112_AACVUZ conger_c_Page_076.QC.jpg
c98dbf81037f5821810fa6765cd6d3c2
4949add75304bcd26385eab2bcbd1054dd8ca0b9
36602 F20101112_AACVWC conger_c_Page_092.QC.jpg
955fcc581c7e3693ca5e1b4551f11600
d69b8a284964488fda14854921cedab6cfe25e51
8463 F20101112_AACVVN conger_c_Page_084thm.jpg
5ecab6ea28d67033596102709358e27e
eec61ceac520afc0f2524d5d62142a7052a50fc3
8890 F20101112_AACVWD conger_c_Page_092thm.jpg
1f6d2374df978d5787c8cdf841ac5d8c
80ccbf13415e56319be71f97671142ccbbdda1c2
30898 F20101112_AACVVO conger_c_Page_085.QC.jpg
6128b76fe49617ee8d80c8f40699799a
894eef31eded3e90efa3a6ed4c1b02d1975746af
32112 F20101112_AACVWE conger_c_Page_093.QC.jpg
658f63b72b529a277ea9778a8a234f2a
dc7d0053f50cf2535d618c2b5f27e96c98a55e67
7566 F20101112_AACVVP conger_c_Page_085thm.jpg
2bb25940ef4fc075ae528ac8735f8a76
649bb9984291a40c1edcf9f59c7944d2a99ef377
8112 F20101112_AACVWF conger_c_Page_093thm.jpg
1aca7a979e689dd9ef6b1bffe37bc85f
d4f2441a09c96c77bfe755fbf87c112e7211220c
32864 F20101112_AACVVQ conger_c_Page_086.QC.jpg
e0411451aef687fdb4b145e07b37a2b6
84fb8ae7d794b1879e534cd44baf7b58efc24dd9
5802 F20101112_AACVWG conger_c_Page_094.QC.jpg
0a7880e11ac4ce9a6fbb3e095ca878e9
40c047b2dda48f6a51eb1b1947ecae7e1a825fe4
8301 F20101112_AACVVR conger_c_Page_086thm.jpg
3bc5efd6b5b79e7800717531bf399f9c
3d99dfb1e10d5389fe385ccdc1a4b29392fa174d
1527 F20101112_AACVWH conger_c_Page_094thm.jpg
08f42ae7ac0784b4a448a97f7a6a0e0e
51f6329072a98bf5f804e63dfdf1e1f16f358c90
32454 F20101112_AACVVS conger_c_Page_087.QC.jpg
359c191aa3884bf4316abbea00e4b473
05ea558629997101d98e6cf3b01cd34f1ad91989
8280 F20101112_AACVVT conger_c_Page_087thm.jpg
e2d64cadd61d77ccdbd8a6812143f2c1
b66210422b217613766a6cc51eeb4f1de7dbac07
31891 F20101112_AACVWI conger_c_Page_095.QC.jpg
22774af0449121c2158c10031cfc9cb9
4102334740b9653bb0142a7be4f5050ed9fddb27
32837 F20101112_AACVVU conger_c_Page_088.QC.jpg
26b2a30b7a627df5c08413d134b2e774
0c76ed7448543a66f4b3927f199df178eed3073c
35147 F20101112_AACVWJ conger_c_Page_096.QC.jpg
6ae43f3f50462b7c616ecb6c5137c40d
ab609d701da4d26bb1a51bb957949c0a39984466
8983 F20101112_AACVVV conger_c_Page_088thm.jpg
83b598cf1d1a9d6391d700aba456d885
e113d664f70c20dbecec571ec99ccf70ab435c00
8507 F20101112_AACVWK conger_c_Page_096thm.jpg
fdfb081aae4ad81a181ff290f22676be
b72bd072bbd0a15c6356cb233f9c2fd202641edc
31923 F20101112_AACVVW conger_c_Page_089.QC.jpg
3fa389d65059d26a85b1e58d48c5a661
3491ff1ddcf2157d4aaa1a65f502b463f0610e86
4044 F20101112_AACVXA conger_c_Page_104thm.jpg
6560e0a9ea1d5c0daa27a909e24e58b1
3c212dc6fc234ceb079240372fe6c2b59274413f
36198 F20101112_AACVWL conger_c_Page_097.QC.jpg
fc8209f57137033b255530f2833bce41
bfb6d67f1e3a801c07c5d67c028073f96da82ca1
8327 F20101112_AACVVX conger_c_Page_089thm.jpg
f8c91857ce155565b3fcac54ea2090ca
de09859d0c6119b9afd3f59dcec961d15664f2f8
120999 F20101112_AACVXB UFE0020221_00001.mets
49910261bb7eecd78e596caaf92041ec
34d3477ac3864f6fcf12b6ad88f458bef4c27de3
8581 F20101112_AACVWM conger_c_Page_097thm.jpg
3b058d8350488f1d9fc02fb418448316
f0cec5a88a191bdd7dcae82b82564eb48fa94480
F20101112_AACVVY conger_c_Page_090.QC.jpg
aeb38b7624db6963493cd245e8131c13
db6ed2b239a4eb6cfc6146548840217e1bc00419
36488 F20101112_AACVWN conger_c_Page_098.QC.jpg
b9871cf56eebc994fc659e4b6dbb03e7
a7a2ed286f71880d4891a59c884c5fddfcc443d3
9093 F20101112_AACVVZ conger_c_Page_090thm.jpg
151857b840295ea4a9923bfe87c16aff
367539eb25cb596116ca3bea0fec981bf586c736
9114 F20101112_AACVWO conger_c_Page_098thm.jpg
8363ce28a59064d02861dcb22774ec92
df65768f760c7aa64962d679278774874dee7d89
4079 F20101112_AACVWP conger_c_Page_099.QC.jpg
291f0f5fca3c0cfe03736aa70e918463
f128b9ea0a05db02eaa82440f459aeeaf07f87a9
1091 F20101112_AACVWQ conger_c_Page_099thm.jpg
6c9e22d1e6a759a6a2aeae907da45c74
31e55f99e9c04b0aed0dbea8118379c44c4b2d9e
37263 F20101112_AACVWR conger_c_Page_100.QC.jpg
3cde396c01c26396c0c82ed0fef1c2e5
b0c380ea490bb67daccb9689d071a496b40542df
9201 F20101112_AACVWS conger_c_Page_100thm.jpg
a53c6cae5bc5896f62cb89bf7d32d426
6bd68a6f631b49cd55441eaf4485a8f0c8422e23
34811 F20101112_AACVWT conger_c_Page_101.QC.jpg
c399542a0126531ef26091814bd16f6c
d4c793abcdad3b229b213885d06f3aeebf8aa014
128173 F20101112_AACVAA conger_c_Page_102.jpg
6bf66ce5c7646c835aaad29d01739351
d0377b1d5381d01b46a279464a4bcc06e305ac42
8525 F20101112_AACVWU conger_c_Page_101thm.jpg
fab3168862864497a9b7395b44dc4df1
aeecf05e5dbb4c0314514539fbd3d0d2c421e355
69977 F20101112_AACVAB conger_c_Page_103.jpg
a8648d36df1bc8ad25eaed9c40c17d38
83e3b04d912f5b403fff1ab1395c0ee6c8c943df
37594 F20101112_AACVWV conger_c_Page_102.QC.jpg
8734de50dcfa26b915db0aaba556fc7d
65f843472ed4d41e4de3a42f9a8b55fb232b3b73
51500 F20101112_AACVAC conger_c_Page_104.jpg
79fedc50b5ec80c28c6dbdc59c52a0fb
2c75cacfa04c53da67e03f903e9a2141e67914c0
9104 F20101112_AACVWW conger_c_Page_102thm.jpg
5d39c6e9c5f88364e584ebf682df8303
13e47096b700b9fc6452a2bdafdb52f52277b83d
22911 F20101112_AACVAD conger_c_Page_001.jp2
ffa4251e89ba83a749c4d45376e8e92a
ec62d680ca796a509f56f332c0cb2e22208f447f
20476 F20101112_AACVWX conger_c_Page_103.QC.jpg
bd3bf6d26de26672f49e4e9ccb13c70a
cb483792af484d560539f79d98b80b00beac210b
5166 F20101112_AACVAE conger_c_Page_002.jp2
450ffd727336e43db4d2b332861b935a
0bdf5ddc740a7c72f66495d87f6772c6a01d4424
5105 F20101112_AACVWY conger_c_Page_103thm.jpg
15f45ada87a519180976c9091bb2689a
cbf96f23cb7398d7c88a66254b1e474ff392eb1d
F20101112_AACVAF conger_c_Page_003.jp2
2cea6ff5cc2dd842764530b56b734852
032ccd34af5eb3649d1fa2927fd775e9479cd12c
17297 F20101112_AACVWZ conger_c_Page_104.QC.jpg
7196bbe34c736df35f109ceeec23dc0c
b1d26a00f1afea7259736ee2761a02e250d223e4
39189 F20101112_AACVAG conger_c_Page_004.jp2
df03bc1a6d076a4ee019b463899fe18c
2596ff0ddce1b2995fac6898ad8b9873e47a80f1
1833 F20101112_AACUVA conger_c_Page_059.txt
c584a05613215ff8a640c246f2482028
6a8ff0466c04e8bab161c459ebc5d8ccb55fd9bb
1051985 F20101112_AACVAH conger_c_Page_005.jp2
d81ed99b7e5b84abdeceea7c7f6b71ef
2141b2e66f1e0a091d9f97a79623892e2520b317
54075 F20101112_AACUVB conger_c_Page_074.pro
5992a0c857cf593652032aecd48e05e8
a4143e7bd34d91e12a3bae662f408aeefdac3e4c
720890 F20101112_AACVAI conger_c_Page_006.jp2
914d4afd2e310e20cbaae9fe88095271
450fb39b34535987d4b06890a7f3c71fd7052699
8805 F20101112_AACUVC conger_c_Page_038thm.jpg
b7e93cecbe53b851ee9c8a92e188c377
b6741d2ab4b003fccb099dda07c8d4f8287cdcb6
288920 F20101112_AACVAJ conger_c_Page_007.jp2
a42d208ec1af034fbbaea2cee16c0b3c
c7750d8648348386f3cdfb51998b219dad5e1213
50242 F20101112_AACUVD conger_c_Page_058.pro
42a522b3f426b6182c752a1b9aa15312
d2f33a088165e0299a5919eef2b6757b653782c5