Title: Performance modeling and evaluation of topologies for low-latency SCI systems
Permanent Link: http://ufdc.ufl.edu/UF00100766/00001
 Material Information
Title: Performance modeling and evaluation of topologies for low-latency SCI systems
Physical Description: Book
Language: English
Creator: Gonzalez, Damian Mark, 1974-
Publisher: State University System of Florida
Place of Publication: Florida
Publication Date: 2000
Copyright Date: 2000
 Subjects
Subject: Computer networks -- Evaluation   ( lcsh )
Network performance (Telecommunication)   ( lcsh )
Electrical and Computer Engineering thesis, M.S   ( lcsh )
Dissertations, Academic -- Electrical and Computer Engineering -- UF   ( lcsh )
Genre: government publication (state, provincial, territorial, dependent)   ( marcgt )
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )
 Notes
Summary: ABSTRACT: This thesis presents an analytical characterization of SCI network performance and topology comparison from a latency perspective. Experimental methods are used to determine constituent latency components and verify the results obtained by these analytical models as close approximations of reality. In contrast with simulative models, analytical SCI models are faster to solve, yielding accurate performance estimates very quickly, and thereby broadening the design space that can be explored. Ultimately, the results obtained here serve to identify optimum topology types for a range of system sizes based on the latency performance of common parallel application demands.
Summary: KEYWORDS: Scalable Coherent Interface, latency, topology, analytical modeling, microbenchmarking
Thesis: Thesis (M.S.)--University of Florida, 2000.
Bibliography: Includes bibliographical references (p. 79-80).
System Details: System requirements: World Wide Web browser and PDF reader.
System Details: Mode of access: World Wide Web.
Statement of Responsibility: by Damian Mark Gonzalez.
General Note: Title from first page of PDF file.
General Note: Document formatted into pages; contains ix, 81 p.; also contains graphics.
General Note: Abstract copied from student-submitted information.
General Note: Vita.
 Record Information
Bibliographic ID: UF00100766
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: oclc - 47681855
alephbibnum - 002678723
notis - ANE5950



Full Text











PERFORMANCE MODELING AND EVALUATION OF TOPOLOGIES FOR
LOW-LATENCY SCI SYSTEMS

















By

DAMIAN MARK GONZALEZ


A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE

UNIVERSITY OF FLORIDA


2000




























Copyright 2000

by

Damian Mark Gonzalez



























For my family















ACKNOWLEDGMENTS

I would like to thank Dr. Alan George for giving me the opportunity to gain

valuable technical experience at the HCS Lab, and for insisting on high standards. I also

wish to thank Matthew Chidester, Kyu-Sang Park (both at the University of Florida) and

Håkon Bugge (at Scali Computer AS, Norway) for their timely and valuable assistance

throughout the development of this work. Thanks also go to the members of the HCS

Lab past and present who have each taught me valuable lessons along the way, and

helped make the experience worthwhile.
















TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTERS

1 INTRODUCTION

2 RELATED RESEARCH

3 OVERVIEW OF SCI

4 ANALYTICAL INVESTIGATION
  4.1 Point-to-Point Latency Model
  4.2 Average Latency Model

5 EXPERIMENTAL INVESTIGATION
  5.1 Benchmark Design
  5.2 Ring Experiments
  5.3 Torus Experiments
  5.4 Validation

6 ANALYTICAL PROJECTIONS
  6.1 Current System
  6.2 Enhanced System

7 CONCLUSIONS

APPENDICES

A SCIBENCH CODE LISTING

B MPIBENCH CODE LISTING

LIST OF REFERENCES

BIOGRAPHICAL SKETCH
















LIST OF TABLES



Table

1: Estimates of experimental latency components.

2: Summary of crossover points (in nodes).
















LIST OF FIGURES



Figure

1: SCI subactions.

2: Topology alternatives.

3: Latency components for a point-to-point transaction on a 3x3 torus.

4: Architectural components of a Wulfkit SCI NIC.

5: Comparison of MPI and API latency on a two-node ring.

6: One-way (OW) testing scheme.

7: Ping-pong (PP) testing scheme.

8: Shared-memory testing environment.

9: Analysis of PP testing.

10: Ring test configurations.

11: Torus test configurations.

12: Comparison of calculated latency components.

13: Validation of analytical model.

14: Inter-topology comparison of current system.

15: Inter-topology comparison of enhanced system.















Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science

PERFORMANCE MODELING AND EVALUATION OF TOPOLOGIES FOR
LOW-LATENCY SCI SYSTEMS

By

Damian Mark Gonzalez

December 2000

Chairman: Alan D. George
Major Department: Electrical and Computer Engineering

This paper presents an analytical performance characterization and topology

comparison from a latency perspective for the Scalable Coherent Interface (SCI).

Experimental methods are used to determine constituent latency components and verify

the results obtained by these analytical models as close approximations of reality. In

contrast with simulative models, analytical SCI models are faster to solve, yielding

accurate performance estimates very quickly, and thereby broadening the design space

that can be explored. Ultimately, the results obtained here serve to identify optimum

topology types for a range of system sizes based on the latency performance of common

parallel application demands.














CHAPTER 1
INTRODUCTION

Modern supercomputing is increasingly characterized by a shift away from the

traditional monolithic supercomputing systems toward a new generation of systems using

commercial-off-the-shelf (COTS) computers, tightly integrated with high-performance

System Area Networks (SANs). Together, these computers and interconnection networks

form a distributed-memory multicomputer or cluster that offers significantly better

price/performance than traditional supercomputers.

A fundamental challenge faced in designing these parallel processing systems is

that of interconnect performance. The Scalable Coherent Interface (SCI), ANSI/IEEE

Standard 1596-1992 [11], addresses this need by providing a high-performance

interconnect specifically designed to support the unique demands of parallel processing

systems. SCI offers considerable flexibility in topology choices, all based on the

fundamental structure of a ring. However, since a message from one node in a ring must

traverse every other node in that ring, this topology becomes inefficient as the number of

nodes increases. Multi-dimensional topologies and/or switches are used to minimize the

traffic paths and congestion in larger systems.

Before making design decisions between such elaborate topology alternatives, it

is first necessary to evaluate the relative performance of available topology choices

without incurring the expense of constructing a complete system. Toward this end, this

thesis presents analytical models for various SCI topologies from a latency perspective,

using experimentally-derived parameters as inputs, and later validating them against









experimental results. These validated models are then used to project tradeoffs between

topology choices and their suitability in handling common application demands. Similar

topology projections are also performed for a conceptual system featuring enhanced

switching performance.

The remainder of the thesis is organized as follows. Chapter 2 provides an

overview of related research in this area. Chapter 3 introduces the fundamentals of SCI

communication. Chapter 4 presents a derivation of analytical models based on these

fundamentals. Chapter 5 provides a description of the experimental testbed, the

calculation of experimentally derived input parameters for the models, and a validation of

the analytical models against equivalent experimental results. In Chapter 6, the models

are used to predict the performance of topology types that exceed current testbed

capabilities. Finally, Chapter 7 presents conclusions and suggests directions for future

research.














CHAPTER 2
RELATED RESEARCH

The SCI standard originated out of efforts to develop a high-performance bus that

would overcome the inherent serial bottlenecks in traditional memory buses. SCI-related

research has since progressed in many different directions, such as the use of SCI in

distributed I/O systems [18] and as a SAN interconnect [9].

Significant progress has been made in the use of SCI as a SAN interconnect

interfacing with the I/O bus. Hellwagner and Reinefeld [9] present a survey of

representative samples of such work, demonstrating results achieved thus far in a variety

of related areas. These samples include contributions in the basic definitions, hardware,

performance comparisons, implementation experiences, low-level software, higher-level

software and management tools.

Other parallel interconnects have since entered the arena, and each finds its own

niche of support. Competing interconnects include the Myrinet network [2] from

Myricom and the cLAN network from Giganet [6]. To some extent, ATM and Gigabit

Ethernet are also used in clustering solutions, but their performance characteristics (e.g.

one-way latencies on the order of 200 μs for GbE and OC-3 ATM using TCP/IP [13])

make them poorly suited for use as interconnects for latency-sensitive parallel systems.

A comparison of SCI, Myrinet and the Cray T3D interconnect has been performed by

Kurmann and Stricker [12].

Simulative models of SCI have been used to investigate issues such as fault

tolerance [14] and real-time optimizations [17]. However, simulative modeling often









requires several hours to simulate a few seconds of real execution time with any degree

of accuracy. An analytical model is orders of magnitude faster to solve, yielding

performance estimates very quickly, and thereby broadening the design space that can be

explored. Analytical modeling therefore provides a means to project network behavior in

the absence of an expensive hardware testbed and without requiring the use of complex,

computationally-intensive simulative models.

Analytical modeling of SCI has traditionally focused on cache coherency

modeling [1] or queue modeling [16] of SCI components. Relatively little work exists

for analytical modeling of SCI developed from an architectural perspective. Such a

perspective is necessary to identify bottlenecks for various systems and provide insight

into scalability and performance as a function of architectural system elements. Such an

architecturally-motivated analytical model would also offer valuable insight into the

suitability of a given system for handling common types of parallel communication

behavior.

Horn [7] follows such an architecturally-motivated approach, using information

about SCI packet types to develop an analytical representation of the interaction of

packets during an SCI transaction sequence. He develops a throughput model for a single

ring, and presents a single chart of results showing the scalability of the SCI ring for

different PCI bandwidth capabilities. This model demonstrates scalability issues from a

throughput perspective, but does not include a latency study and does not investigate

topology types beyond the basic ring. Moreover, no validation of the model used in this

study was provided.

Bugge [4] uses knowledge about the underlying hardware, coupled with an

understanding of traffic patterns of all-to-all communication to develop an analytical









throughput model for all-to-all communication on SCI. He shows the scalability of

various multicube topologies ranging from rings to four-dimensional tori. This study

makes topology recommendations for varying system sizes, based on a throughput study,

but does not include a similar investigation using a latency approach, and does not

investigate other types of traffic patterns. This throughput study also lacks a validation

exercise.

The simulative study of SCI fault tolerance performed by Sarwar and George [14]

presents analytical derivations for average paths taken by SCI request and response

packets for one- and two-dimensional topologies, paving the way for extrapolation of

topologies to higher degrees. These analytical expressions are used for verification of

simulative results, but no validations are made using experimental data.

This thesis complements and extends previous work by providing meaningful

performance projections of multiple SCI topology types using an architecturally

motivated analytical approach. In so doing, several contributions are achieved. In

contrast with existing throughput studies, latency performance is used as a basis for

comparison, since for many applications latency is a key characteristic of high-speed

networks for scalable parallel systems. Analytical models are derived and validated

against experimental data for traffic patterns that are representative of basic

communication in parallel applications. Finally, performance projections are rendered

for scalable systems with up to one-thousand nodes in terms of current and emerging

component characteristics.

The following chapter provides an overview of SCI communication as

background for subsequent development of analytical representations of latency.














CHAPTER 3
OVERVIEW OF SCI

The SCI standard was developed over the course of approximately four years, and

involved participation from a wide variety of companies and academic institutions. This

standard describes a packet-based protocol using unidirectional links that provides

participating nodes with a shared memory view of the system. It specifies transactions

for reading and writing to a shared address space, and features a detailed specification of

a distributed, directory-based, cache-coherence protocol.

Commercially available SCI-based systems follow two design classifications.

One class consists of parallel computers that employ memory bus interfaces based on

SCI, such as the Data General AViiON [5], the IBM/Sequent NUMA-Q [10], and the

HP/Convex Exemplar [3]. The second class consists of SCI-based network interface

cards (NICs) and switches for the construction of workstation and PC clusters using an

I/O bus interface, such as the Dolphin/Scali Wulfkit [15].

SCI offers many clear advantages for the unique nature of parallel computing

demands. Perhaps the most significant of these advantages is its low-latency

performance. This fundamental characteristic makes SCI well suited to support finer-

grained parallel computations. Typical systems can achieve single-digit microsecond

latency performance. SCI also offers a link data rate of 3.2 Gb/s in current systems. Yet

another advantage in using SCI is that, unlike competing systems, SCI offers support for

both the shared-memory and message-passing paradigms.









The analytical latency models developed in this thesis rely upon an understanding

of the fundamental SCI packet types and the ways in which they interact during a single

transaction. A typical transaction consists of two subactions, a request subaction and a

response subaction, as shown in Figure 1.



Figure 1: SCI subactions. (The request subaction consists of the request send and request echo packets; the response subaction consists of the response send and response echo packets.)



For the request subaction, a request packet (read or write) is sent by a requesting

node, destined for a recipient node. The recipient or responder node sends an echo

packet back to the requesting node to acknowledge receipt of the request packet. The

recipient simultaneously processes the request and then delivers its own response packet

to the network to begin the response subaction. This packet is received at the original

requesting node, and another echo is sent along the ring to the recipient to acknowledge

receipt of this response packet.

A somewhat more complicated situation arises when the source and destination

nodes do not reside on the same ring. In such a case, there are one or more intermediate

agents that accept the request packet and then act on behalf of the requester, forwarding

the packet along the new ring, and on toward the final destination. In this regard, a node









on an SCI torus topology that enacts a change in dimension acts as an agent for that

transaction.

In SCI, data is represented in terms of symbols with a symbol being a 16-bit word

(2 bytes). All transmissions are conducted based on units of symbols and multiples

thereof. Current implementations support both 16-byte (8-symbol) and 64-byte (32-

symbol) packet payload sizes. The following chapter describes the development of

analytical representations of SCI transactions using knowledge of these basic packet

types and their interaction during an SCI transaction sequence.














CHAPTER 4
ANALYTICAL INVESTIGATION

The topologies considered in this study range from simple rings to multi-

dimensional tori. This framework is shown in Figure 2. Subsequent experimentation

explores topologies having a maximum of nine nodes and two dimensions, but ultimately

analytical models are used to predict the relative performance of systems that exceed

these limits.


Figure 2: Topology alternatives. (Candidate topologies, from a simple ring to multi-dimensional tori, arranged by number of dimensions, D.)



Rings are useful because of their simplicity. They are straightforward, and no

routing is required. However, they are not scalable since it becomes inefficient for each

node to share the ring bandwidth with traffic generated by every other node in the

network. From a latency perspective also, scalability is inhibited since a round-trip









message from one node in a ring must traverse every other node in the ring. Multi-

dimensional tori address this problem by minimizing the length of traffic paths for point-

to-point communications. Subsequent analysis of multi-dimensional topologies assumes

an equal number of nodes in each dimension. Therefore, for a system with D dimensions

and n nodes in each dimension, the total number of nodes (i.e. system size) is equal to n^D.

Considering a point-to-point transaction on a one-dimensional topology, it is

assumed that the overhead processing time at the sender is equal to that at the receiver,

and these are each represented using the variable o. The variables lp and lf represent the

propagation latency per hop and the forwarding latency through a node, respectively.

The propagation latency is of course dictated by the speed of light through a medium,

whereas the forwarding latency is dependent upon the performance of the SCI adapter

interface in checking the header for routing purposes and directing the packet onto the

output link.

It is important to note that many of these events take place in parallel. For

example, for a relatively large packet, the first symbols of the packet may arrive at the

recipient before the requester has finished transmission of the complete packet onto the

network. This overlap ceases once the time spent by a packet traversing the network is

equal to the time spent putting the packet onto the physical links. Using a 16-bit wide

path, a 5 ns channel cycle time, and assuming a 40-symbol packet, the time to put this

packet onto the link is equal to 200 ns. Using typical values for forwarding and

propagation latencies (60 ns and 7 ns respectively), the time spent putting the packet onto

the link is matched by hop latencies after traversing only 3 hops. Since any overlapping









effect ceases for a relatively small number of hops, the effect of such parallel events does

not play a role in subsequent analytical development.
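As a worked restatement of this arithmetic (using the forwarding and propagation values quoted above), the link-insertion time of a 40-symbol packet and the cumulative hop latency after three hops are nearly equal:

$$t_{link} = 40 \text{ symbols} \times 5 \text{ ns} = 200 \text{ ns}, \qquad 3 \times (l_f + l_p) = 3 \times (60 + 7) \text{ ns} \approx 200 \text{ ns}$$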

For multi-dimensional tori, there is also a switching latency (ls) to be considered.

This component represents the time taken to switch dimensions from one ring to another

ring on a torus. The echo latencies are not considered in this model, since they take place

in parallel with the request and response latencies and do not contribute to the critical

path of latency.





Figure 3: Latency components for a point-to-point transaction on a 3x3 torus. (The figure labels Steps 1 through 14 along the request and response paths described below.)
















Figure 3 shows how all of these latency components play a role in a complete

request and response transaction sequence on a two-dimensional topology. Latency

components in the figure are numbered 1 through 14 to identify their ordering in time.
Step 1 represents the processing overhead in putting the request packet onto the network.

This request packet then incurs forwarding and propagation latencies (Steps 2, 3 and 4) in
traversing the horizontal ring. The packet must then switch dimensions (Step 5) and

incur forwarding and propagation latencies in traversing the vertical ring (Steps 6, 7 and









8). The request subaction is complete once the recipient incurs the processing overhead

for getting the request packet off the network (Step 9).

The response subaction begins with the processing overhead for putting the

response packet onto the network (Step 10). In traveling back to the original source, this

packet incurs a propagation latency along the vertical ring (Step 11), a switching latency

(Step 12) and then a propagation latency along the horizontal ring (Step 13). The

transaction is complete once the source node incurs the processing overhead for getting

the response packet off the network (Step 14).

At this point, it is assumed that the switching, forwarding and propagation

latencies will be largely independent of message size, since they only represent the

movement of the head of a given message. However, the overhead components rely

upon the processing of the entire message, and are therefore expected to have a

significant dependence upon message size. The validity of these assumptions is

investigated through experimental testing in Chapter 5.


4.1 Point-to-Point Latency Model

Having outlined the latency components that will be considered in this model, it is

now necessary to determine the number of times that each of these components will

appear for a given point-to-point transaction. Subsequent derivations do not incorporate

contention considerations and therefore represent the unloaded point-to-point latency.

Consider a point-to-point transaction between two nodes on an SCI network. The

overall latency of the transaction is given by

$$L_{transact} = L_{request} + L_{response} \qquad (1)$$








Using h_k to represent the number of hops from the source to the destination in the kth dimension, the transaction latency components for an SCI ring of n nodes are given by

$$L_{request} = o + h_1 \times l_p + (h_1 - 1) \times l_f + o \qquad (2)$$

$$L_{response} = o + (n - h_1) \times l_p + (n - h_1 - 1) \times l_f + o \qquad (3)$$

For a two-dimensional SCI torus with n nodes in each dimension, three cases can

occur depending upon the number of hops required in each of the two dimensions. If h1 = 0 or h2 = 0, then the previous equations can be readily applied since the transaction takes place on a single ring. For the third case, where h1 ≠ 0 and h2 ≠ 0, the request and response latencies are given by

$$L_{request} = o + h_1 \times l_p + (h_1 - 1) \times l_f + l_s + h_2 \times l_p + (h_2 - 1) \times l_f + o = 2 \times o + [h_1 + h_2] \times l_p + [(h_1 - 1) + (h_2 - 1)] \times l_f + l_s \qquad (4)$$

$$L_{response} = o + (n - h_1) \times l_p + (n - h_1 - 1) \times l_f + l_s + (n - h_2) \times l_p + (n - h_2 - 1) \times l_f + o = 2 \times o + [(n - h_1) + (n - h_2)] \times l_p + [(n - h_1 - 1) + (n - h_2 - 1)] \times l_f + l_s \qquad (5)$$

Using a minimum function to eliminate dimensions with no hop traversals, all

three cases are generalized as

$$L_{request} = 2 \times o + [h_1 + h_2] \times l_p + [(h_1 - \min(h_1,1)) + (h_2 - \min(h_2,1))] \times l_f + [\min(h_1,1) + \min(h_2,1) - 1] \times l_s \qquad (6)$$

$$L_{response} = 2 \times o + [\min(h_1,1) \times (n - h_1) + \min(h_2,1) \times (n - h_2)] \times l_p + [\min(h_1,1) \times (n - h_1 - 1) + \min(h_2,1) \times (n - h_2 - 1)] \times l_f + [\min(h_1,1) + \min(h_2,1) - 1] \times l_s \qquad (7)$$

These results are extended for D dimensions as follows:

$$L_{request} = 2 \times o + \left[\sum_{i=1}^{D} h_i\right] \times l_p + \left[\sum_{i=1}^{D} \left(h_i - \min(h_i,1)\right)\right] \times l_f + \left[\sum_{i=1}^{D} \min(h_i,1) - 1\right] \times l_s \qquad (8)$$

$$L_{response} = 2 \times o + \left[\sum_{i=1}^{D} \min(h_i,1) \times (n - h_i)\right] \times l_p + \left[\sum_{i=1}^{D} \min(h_i,1) \times (n - h_i - 1)\right] \times l_f + \left[\sum_{i=1}^{D} \min(h_i,1) - 1\right] \times l_s \qquad (9)$$
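As a concrete restatement of Equations 8 and 9, the following minimal C sketch evaluates the unloaded point-to-point model. The function and parameter names are illustrative, and the example values in main() are the 64-byte component estimates derived later in Table 1 of Chapter 5.

#include <stdio.h>

#define MAX_D 4

typedef struct {
    double o;   /* overhead (ns)                    */
    double lp;  /* propagation latency per hop (ns) */
    double lf;  /* forwarding latency (ns)          */
    double ls;  /* dimension-switching latency (ns) */
} sci_params;

static double used(int h) { return h > 0 ? 1.0 : 0.0; }

/* Equation 8: unloaded request latency, h[i] hops in dimension i (not all zero) */
double l_request(const sci_params *p, int D, const int h[])
{
    double hops = 0.0, fwd = 0.0, dims = 0.0;
    for (int i = 0; i < D; i++) {
        hops += h[i];
        fwd  += h[i] - used(h[i]);
        dims += used(h[i]);
    }
    return 2.0 * p->o + hops * p->lp + fwd * p->lf + (dims - 1.0) * p->ls;
}

/* Equation 9: unloaded response latency (the response completes each ring that was used) */
double l_response(const sci_params *p, int D, int n, const int h[])
{
    double hops = 0.0, fwd = 0.0, dims = 0.0;
    for (int i = 0; i < D; i++) {
        hops += used(h[i]) * (n - h[i]);
        fwd  += used(h[i]) * (n - h[i] - 1);
        dims += used(h[i]);
    }
    return 2.0 * p->o + hops * p->lp + fwd * p->lf + (dims - 1.0) * p->ls;
}

int main(void)
{
    sci_params p = { 2085.0, 7.0, 60.0, 670.0 };  /* 64-byte estimates (Table 1) */
    int h[MAX_D] = { 2, 1 };                      /* two hops, then one hop, on a 3x3 torus */
    printf("L_transact = %.0f ns\n",
           l_request(&p, 2, h) + l_response(&p, 2, 3, h));
    return 0;
}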



4.2 Average Latency Model

Further analysis is now performed to augment the previous point-to-point analysis

by characterizing the average distances traveled by request and response packets in a

system. The equations below extend the one- and two-dimensional average-distance

derivations of Sarwar and George [14] by developing a general form for D dimensions.

First, consider a single ring, and assume that there is a uniformly random

distribution of destination nodes for all packets. To arrive at the average number of links

traversed in a ring, a scenario having a fixed source and variable destinations is

considered. The total distance traveled for all possible source/destination pairs is

determined, and then divided by the number of destinations to determine the average

distance traveled.

The variable h1 is used to represent the number of hops in the single dimension for a given source/destination pair. For a request that has traveled h1 hops, the response will travel n - h1 hops around the remainder of the ring. Therefore, the average number

of hops for request and response packets in a ring is represented as follows:

$$\text{Average request distance} = \frac{\sum_{h_1=1}^{n-1} h_1}{n-1} = \frac{n}{2} \qquad (10)$$

$$\text{Average response distance} = \frac{\sum_{h_1=1}^{n-1} (n - h_1)}{n-1} = \frac{n}{2} \qquad (11)$$









Similarly, for a two-dimensional system, using h2 to represent the number of hops

in the second dimension, the derivation for average number of hops is as follows:


$$\text{Average request distance} = \frac{\sum_{h_1=0}^{n-1}\sum_{h_2=0}^{n-1} (h_1 + h_2)}{n^2 - 1} = \frac{2}{2} \times \frac{n^2 \times (n-1)}{n^2 - 1} \qquad (12)$$

$$\text{Average response distance} = \frac{\sum_{h_1=0}^{n-1}\sum_{h_2=0}^{n-1} \left[\min(h_1,1)(n - h_1) + \min(h_2,1)(n - h_2)\right]}{n^2 - 1} = \frac{2}{2} \times \frac{n^2 \times (n-1)}{n^2 - 1} \qquad (13)$$

Based on the results of similar derivations for three- and four-dimensional

systems, a general expression is derived for the average number of hops as a function of

D dimensions:

$$\text{Average request distance} = \text{Average response distance} = \frac{D}{2} \times \frac{n^D \times (n-1)}{n^D - 1} \qquad (14)$$

As for switching latencies, it can be shown that the average number of dimension

switches for a transaction in a torus of D dimensions is accurately represented as follows:



$$\text{Average number of dimension switches} = \frac{D \times n^{D-1} \times (n-1) - (n^D - 1)}{n^D - 1} \qquad (15)$$

For a single ring, the number of forwarding latencies is always one less than the

number of propagation latencies. However, when considering a transaction on a multi-

dimensional topology, the sum of the number of forwarding and switching latencies is

one less than the number of propagation latencies. Preceding analysis determined that, in

the average case, the number of switching latencies is given by Equation 15, and the

number of propagation latencies is given by Equation 14. As such, the number of

forwarding latencies can be determined as follows:









$$\text{Average number of forwarding delays} = \left[\frac{D}{2} \times \frac{n^D \times (n-1)}{n^D - 1}\right] - 1 - \left[\frac{D \times n^{D-1} \times (n-1)}{n^D - 1} - 1\right]$$

$$\text{Average number of forwarding delays} = \frac{D \times n^{D-1} \times (n-1) \times (n-2)}{2 \times (n^D - 1)} \qquad (16)$$

Therefore, Equation 17 represents the average latency of a request or response

packet for a D-dimensional topology.

$$L_{request} = L_{response} = 2 \times o + \left[\frac{D}{2} \times \frac{n^D \times (n-1)}{n^D - 1}\right] \times l_p + \left[\frac{D \times n^{D-1} \times (n-1) \times (n-2)}{2 \times (n^D - 1)}\right] \times l_f + \left[\frac{D \times n^{D-1} \times (n-1)}{n^D - 1} - 1\right] \times l_s \qquad (17)$$
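Equations 14 through 17 translate directly into a small routine; a minimal C sketch is given below, with illustrative names and all latencies in nanoseconds (the example values are the 64-byte estimates derived in Chapter 5).

#include <math.h>
#include <stdio.h>

/* Equation 17: average latency of a request or response packet on a D-dimensional
 * torus with n nodes per dimension */
double avg_latency(double o, double lp, double lf, double ls, int D, double n)
{
    double nD   = pow(n, D);
    double prop = (D / 2.0) * nD * (n - 1.0) / (nD - 1.0);           /* Eq. 14 */
    double sw   = D * pow(n, D - 1) * (n - 1.0) / (nD - 1.0) - 1.0;  /* Eq. 15 */
    double fwd  = prop - 1.0 - sw;                                   /* Eq. 16 */
    return 2.0 * o + prop * lp + fwd * lf + sw * ls;
}

int main(void)
{
    /* average latency on a 3x3 torus for a 64-byte message */
    printf("%.0f ns\n", avg_latency(2085.0, 7.0, 60.0, 670.0, 2, 3.0));
    return 0;
}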
In the following chapter, experimental methods are used to determine the values

for overhead, switching, forwarding and propagation latency to be used as inputs for

these analytical models.














CHAPTER 5
EXPERIMENTAL INVESTIGATION

Experimental results described in this thesis were obtained using an SCI testbed

consisting of nine PCs, each having one 400 MHz Intel Pentium-II processor and 128

MB of PC100 SDRAM. These PCs each contained 32-bit PCI bus interfaces operating at

33 MHz, and were connected using Dolphin/Scali Wulfkit [15] adapters having a link

data rate of 3.2 Gb/s. Experimental performance measurements were obtained by

configuring this system in a variety of ring and torus topologies.

Figure 4 shows a block diagram of the main components of the NIC for a single

node in any given topology. A request originating at this node will enter the NIC through

the PCI bus, at which point the PCI to SCI Bridge (PSB in Figure 4) transfers this data

from PCI to the internal B-link bus. The request send packet then traverses the B-link,

and enters the SCI network fabric through one of the Link Controllers (LC2 in Figure 4).

Together with the software overhead for processing the message on the host, these steps

collectively constitute the sender overhead (o).

Figure 4: Architectural components of a Wulfkit SCI NIC. (The PSB bridges the 32-bit, 33 MHz PCI bus to the internal B-Link, which connects to the LC2 link controllers.)









Packets entering the NIC from one of the SCI links will first encounter an LC2,

and that controller will check the header for routing purposes. Three cases can occur

based on the information contained in the packet header. In the first case, the header

could correspond with the address of the node, and in this case the packet would traverse

all intermediate components, and enter the PCI bus for further processing by the host.

Together with the associated software overhead, these steps constitute the receiver

overhead component in the analytical model (o).

Another possibility is that the packet is destined for a node that resides on another

SCI ring for which this node can serve as an agent. In such a case, the packet is sent

across the internal B-link bus, and enters the second ring through another LC2. These

components correspond with the switching delay (ls) in the analytical study.

In the third possible scenario, the incoming packet is addressed to a different node

residing on the same ring. In this case, the packet is routed back out the same LC2,

without traversing any other NIC components. These steps correspond with the

forwarding delay (lf) in the analytical study.

The Red Hat Linux (kernel 2.2.15) operating system was used on all machines in

the testbed, and each machine contained 100 Mb/s Fast Ethernet adapters for routine

traffic. The system used the Scali Software Platform 2.0.0 to provide the drivers, low-

level API (SCI USRAPI), and MPI implementation (ScaMPI [8]) to support experimental

testing.

Despite the fact that high-level implementations using MPI are highly portable

and widely used, benchmarking at that level imposes significant software overhead (e.g.

often as much as several thousand instructions per message transferred) and obscures the











underlying architectural behavior. The low-level API facilitates a memory-mapped


dataflow that bypasses much of the software overhead. As such, low-level API


benchmarking is more relevant for this work, since it helps to expose underlying


architectural phenomena.


However, although MPI results feature additional software overhead, they are


useful in providing an idea of the performance that is offered at the application level.


Therefore, MPI-level benchmarking demonstrates the performance obtained after the


addition of extra protocol and coordination overhead. Figure 5 clearly demonstrates the


performance penalty paid for the simplicity and portability of MPI, providing a


comparison of the Scali USRAPI and ScaMPI on a two-node ring using a one-way


latency test (see Figure 6). The API results are consistently lower, achieving a minimum


latency of 1.9 μs for a one-byte message, while the minimum latency achieved using


ScaMPI is 6.4 μs for a similarly sized message.


Figure 5: Comparison of MPI and API latency on a two-node ring. (One-way latency in microseconds versus message size in bytes, for messages up to 512 bytes.)









The results shown in Figure 5 demonstrate another important point. The shape of

the curves for both API and MPI suggests that the overall trend in behavior is dominated

by the performance of SCI transactions having a 64-byte packet payload size.

Transactions with 16-byte packet payloads are only significant for tests having an overall

message size less than 64 bytes. Therefore, subsequent analysis focuses on the behavior

of SCI transactions having a 64-byte packet payload size.

Since the experimental latency results are on the order of a handful of

microseconds, transient system effects (e.g. context switching, interrupts, cache misses,

UNIX watchdog, etc.) and timer resolution issues can negatively affect the results. The

results shown here were obtained by repeating experiments multiple times, calculating

maximum, minimum and average values, and then ensuring that the difference between

maximum and minimum values was less than five percent of the minimum value.

Subsequent analysis uses the minimum value obtained during a series of experiments,

since it is the most reproducible value and therefore serves to efficiently negate

undesirable system effects.


5.1 Benchmark Design

For both API- and MPI-based testing, many design alternatives are available. A

suite of benchmarks was developed as part of this work to support both MPI

(mpibench) and API (scibench) benchmarking. Both one-way (OW) and ping-pong

(PP) latency tests were used. Figures 6 and 7 explain these strategies in detail using

relevant pseudo-code. The OW test in Figure 6 uses one sender and one receiver to

measure the latency of uni-directional streaming message transfers. The PP test in












Figure 7 alternates the roles of sender and receiver for each message transferred, and the


PP latency is computed as one half the round trip time.


Mpibench performs both tests, and the standardization of the interface allows it


to be easily ported for use in other high-performance networking systems. Scibench


also performs both tests, using API calls to establish the sharing and mapping of remote


memory.


SERVER:

start timing
for (index = 0; index < reps; index++)
{
    send message of predetermined size;
}
receive acknowledgement of final message;
end timing

'One-way latency' = (end - start) / reps;


CLIENT:

for (index = 0; index < reps; index++)
{
    receive message of predetermined size;
}
send acknowledgement of final message;


Figure 6: One-way (OW) testing scheme.


SERVER:

start timing
for (index = 0; index < reps; index++)
{
    send message of predetermined size;
    receive message of predetermined size;
}
end timing

'Ping-pong latency' = ((end - start) / reps) / 2;


CLIENT:

for (index = 0; index < reps; index++)
{
    receive message of predetermined size;
    send message of predetermined size;
}


Figure 7: Ping-pong (PP) testing scheme.




Setting up shared-memory communication using the USRAPI involves a few


simple steps. First, a host computer identifies a block of local memory that it wants to


share among other members of the shared-memory group. This block is then offered to


the group using API calls to the local NIC. Remote nodes can now refer to the block, and











map it into their virtual address space. These remote nodes then access the shared block


directly using their virtual address space.


Figure 8 shows how the shared memory environment was configured to support


API-based benchmarking. The pointers pLocal and pRemote are configured to point


to the local and remote memory arrays (each having s array elements), respectively.


Once established, a write directed at pRemote will be an order of magnitude larger in


latency than one directed at pLocal (e.g. 2 μs vs. 100 ns, for a 4-byte message).


Figure 8: Shared-memory testing environment. (On both client and server, pLocal points to the local memory array of s elements, while pRemote points to the mapped view of the remote node's array.)




The API-based benchmarks use local reads (for polling a memory location for


changes) and remote writes, since remote polling would incur a significant and


unnecessary performance penalty. Writes of large messages are performed by assigning


a remote target to memcpy () function calls. However, when transferring messages of 8


bytes and smaller, direct assignments of atomic data types (char, short, int,


long long int) are used to avoid the overhead of the memcpy () function call.
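The write strategy just described can be sketched as follows; the function name and calling convention are assumptions for illustration only, with pRemote taken to be a pointer into an SCI segment mapped as in Figure 8.

#include <string.h>

void remote_write(volatile void *pRemote, const void *src, size_t nbytes)
{
    /* messages of 8 bytes or less: a direct assignment of an atomic data type
     * avoids the overhead of the memcpy() function call */
    switch (nbytes) {
    case 1: *(volatile char *)pRemote          = *(const char *)src;          return;
    case 2: *(volatile short *)pRemote         = *(const short *)src;         return;
    case 4: *(volatile int *)pRemote           = *(const int *)src;           return;
    case 8: *(volatile long long int *)pRemote = *(const long long int *)src; return;
    default:
        /* larger (or odd-sized) messages: memcpy() with a remote target */
        memcpy((void *)pRemote, src, nbytes);
        return;
    }
}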









To investigate the components of latency, multiple experiments were performed

on different topology types and the results compared to determine the transaction latency

components.


5.2 Ring Experiments

In the first series of experiments, several configurations based on a ring topology

were used to conduct PP latency testing. PP tests were used since the analytical

derivations are based on the behavior of a single message, and the compound effect of

multiple serial messages in an OW test would necessarily feature a potentially misleading

pipeline effect.





Figure 9: Analysis of PP testing. (Steps 1 through 8 mark the ping request, ping request echo, ping response, and ping response echo, followed by the corresponding pong subaction components, on the timelines of the two nodes.)



Figure 9 analyzes the execution of a PP test on SCI, with constituent steps

numbered 1 through 8 to identify the ordering of ping and pong transaction components.

Steps 1 through 4 describe the four components (see Figure 1) of a single ping









transaction. Steps 5 through 8 represent the subaction components for the corresponding

pong transaction.

The ping-pong latency is calculated to be one half the round trip time, and is

shown to be equivalent to the timing of a single ping request (step 1 in Figure 9).

Therefore, the analytical representation of a ping request (Lrequest) is used in subsequent

analysis to represent experimental PP results.

The propagation latency (lp) was determined theoretically, by considering the fact

that signals propagate through a conductor at approximately half the speed of light.

Using a value of 299,792.5 km/s for the speed of light, and assuming cables one meter in

length, the propagation latency was determined to be 7 ns. Since the propagation latency

represents the latency for the head of a message passing through a conductor, it is

therefore independent of message size.

Client and server nodes were chosen such that there is a symmetrical path

between source and destination. Figure 10 demonstrates three such testing configurations

in which tests are identified based on the number of hops traversed by a ping request

traveling from the source to destination. These path scenarios were selected so that the

ping and pong messages would each traverse the same number of hops. Such

symmetrical paths allow the PP result to be characterized by an integer number of hops,

facilitating a direct association of an experimental PP result with the corresponding

analytical representation of a ping request. Once these experimental results were

obtained, their differences were then used to determine latency components.











Figure 10: Ring test configurations. (a) one hop; (b) two hops; (c) three hops. Each configuration marks the source and destination nodes on the ring.



The first experiment performed is designed to determine the value of overhead,

and involves the one-hop test shown in Figure 10a. PP latency was measured over a

range of message sizes, and is represented analytically as follows:

$$PP_{\text{one hop}} = 2 \times o + 1 \times l_p \qquad (18)$$

Since lp has already been determined (7 ns), the overhead component is the only

unknown in Equation 18. This overhead was computed algebraically for a range of

message sizes, and the results of this computation are discussed further in the next

subsection.

The next series of ring experiments applied the difference between PP latencies

for the one-hop test and similar results obtained from a four-hop test. The difference

between these is derived analytically as follows:

$$PP_{\text{four hops}} = 2 \times o + 3 \times l_f + 4 \times l_p$$

$$PP_{\text{one hop}} = 2 \times o + 1 \times l_p$$

$$PP_{\text{four hops}} - PP_{\text{one hop}} = 3 \times l_f + 3 \times l_p \qquad (19)$$

Using the value previously obtained for propagation latency, the forwarding

latency is the only unknown in Equation 19. As such, the value for forwarding latency

was computed algebraically for a range of message sizes. The results from this

derivation are also discussed further in the next subsection.









The only remaining unknown is the switching latency component, which occurs

when a message switches dimensions from one ring to another through an agent. This

component is determined using a series of torus experiments.


5.3 Torus Experiments

The switching latency is determined using PP benchmarking for the torus-based

testing scenarios shown in Figure 11.



Figure 11: Torus test configurations. (a) non-switching; (b) switching. Each configuration marks the source and destination nodes on the torus.



Figure 11a illustrates the first test configuration, which involves movement in a

single dimension. The second configuration, shown in Figure 11b, introduces the

switching latency element. Once the latency experiments were performed for a range of

message sizes on each of these two configurations, the difference between the two sets of

results was determined algebraically. Although the topology in Figure 11a is no longer

perfectly symmetrical for ping and pong paths, the following provides a close

approximation of the algebraic difference between torus experiments:

$$PP_{\text{non-switching}} = 2 \times o + 1 \times l_f + 2 \times l_p$$

$$PP_{\text{switching}} = 2 \times o + 1 \times l_f + 3 \times l_p + l_s$$

$$PP_{\text{switching}} - PP_{\text{non-switching}} = l_s + 1 \times l_p \qquad (20)$$











As before, Equation 20 is used along with the value for propagation latency to

algebraically determine the switching latency for a range of message sizes.
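Collecting Equations 18, 19, and 20, each unknown component follows directly from the measured PP latencies and the precomputed propagation latency:

$$o = \frac{PP_{\text{one hop}} - l_p}{2}, \qquad l_f = \frac{PP_{\text{four hops}} - PP_{\text{one hop}}}{3} - l_p, \qquad l_s = \left(PP_{\text{switching}} - PP_{\text{non-switching}}\right) - l_p$$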



Figure 12: Comparison of calculated latency components. (Latency in nanoseconds, plotted on a logarithmic scale versus message size in bytes. The overhead component is 2085 ns at 64 bytes and grows with a gradient of 11.6 ns/byte, while the switching, forwarding, and propagation components remain roughly flat at 670 ns, 60 ns, and 7 ns, respectively.)




Using Equations 18, 19 and 20 along with the calculated value for propagation

latency, Figure 12 shows a comparison of all components for a range of message sizes.

This figure demonstrates the clear differences between components in terms of their

relationship with message size. The switching, forwarding and propagation components

are shown to be relatively independent of message size, whereas the overhead component

is significantly dependent upon message size. These experimental results therefore match

the original intuitive expectations.

To use these experimental results as inputs to the analytical model, Table 1

provides a summary of the estimates made for each component, for a message of m bytes.









Propagation, forwarding and switching components are assumed constant, whereas the

overhead component is represented using a linear equation.




Table 1: Estimates of experimental latency components.

Latency component         Estimate
Propagation latency (lp)  7 ns
Forwarding latency (lf)   60 ns
Switching latency (ls)    670 ns
Overhead (o)              2085 + 11.6 × (m − 64) ns
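For use in the model code, Table 1 can be expressed as a small set of constants together with a linear overhead function of the message size m (a sketch; the names are illustrative).

/* component estimates from Table 1, in nanoseconds */
static const double LP_NS = 7.0;     /* propagation latency */
static const double LF_NS = 60.0;    /* forwarding latency  */
static const double LS_NS = 670.0;   /* switching latency   */

/* overhead for an m-byte message, per the linear fit of Figure 12 */
static double overhead_ns(double m)
{
    return 2085.0 + 11.6 * (m - 64.0);
}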



5.4 Validation

Using these estimates as inputs to the analytical models, a validation exercise is

performed to confirm the models as worthy representations of reality. The first validation

exercise investigates the accuracy of the model as a function of message size, and

involves the symmetrical three-hop ring test shown in Figure 10c. This test is chosen

because one- and four-hop tests were used previously to experimentally determine the

inputs. The analytical PP latency for the three-hop test is given by the following

equation:

$$PP_{\text{three hops}} = 2 \times o + 2 \times l_f + 3 \times l_p \qquad (21)$$

Figure 13a shows the results of this validation, and demonstrates how closely the

analytical estimates match experimental results.













Figure 13: Validation of analytical model. (a) latency versus message size on a 6-node ring; (b) latency versus ring size for a 64-byte message. In both cases the analytical estimates closely track the experimental results.




The second validation exercise investigates the accuracy of the model as a


function of the number of nodes and uses a 64-byte message size on one-, two-, three-,


and four-hop tests. The analytical PP latencies for these rings are given by the following


equations:


$$PP_{\text{one hop}} = 2 \times o + 0 \times l_f + 1 \times l_p \qquad (22)$$

$$PP_{\text{two hops}} = 2 \times o + 1 \times l_f + 2 \times l_p \qquad (23)$$

$$PP_{\text{three hops}} = 2 \times o + 2 \times l_f + 3 \times l_p \qquad (24)$$

$$PP_{\text{four hops}} = 2 \times o + 3 \times l_f + 4 \times l_p \qquad (25)$$



Figure 13b shows the results of this validation, and again demonstrates the


accuracy of the analytical estimates. Although there is a slight deviation observed


between analytical and experimental results, a linear extrapolation of this deviation for


system sizes up to one-thousand nodes shows that the error never exceeds five percent


within this range.








Having now derived and validated the analytical models, the next chapter uses

these models to project the behavior of larger systems and investigates topology

alternatives that exceed the capabilities of the experimental testbed.














CHAPTER 6
ANALYTICAL PROJECTIONS

To ascertain the relative latency performance of different topology types, the

analytical models were used to investigate topologies that range from one to four

dimensions, with a maximum system size of up to one-thousand nodes. The models were

first given input parameters derived directly from the experimental analysis. Based on

the results of these analytical projections, they were then fed data for a conceptual system

featuring enhanced parameters.

Two types of applications are considered in determining topology tradeoffs. The

first type is average latency, based on the equations derived in Section 4.2. This

application provides a useful performance metric since it represents the performance that

is achieved in a typical point-to-point transaction on a given topology.

The second application type used for comparison is an unoptimized broadcast

operation, carried out using a series of unicast messages. For a given topology, having a

fixed source, the complete set of destinations is determined, and the point-to-point

latency of each of these transactions is calculated using the latency equations derived in

Section 4.1. As before, each point-to-point transaction is assumed equivalent to the

latency of a ping request, based on the analysis in Figure 9. The sum of these

transactions is determined, and is used as a basis for inter-topology comparison. This

one-to-all multi-unicast operation is also a useful metric for comparison, since such an

approach for collective communication operations is common in parallel applications and

systems.
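A minimal, self-contained C sketch of this one-to-all multi-unicast metric is shown below; it simply sums the Equation 8 ping-request latency over every destination of an n^D torus, using the 64-byte component estimates of Table 1 (the enumeration order and names are implementation choices for illustration).

#include <stdio.h>

#define MAX_D 4

/* 64-byte component estimates from Table 1 (ns) */
static const double O_NS = 2085.0, LP_NS = 7.0, LF_NS = 60.0, LS_NS = 670.0;

/* Equation 8: ping-request latency for h[] hops in each of D dimensions */
static double l_request(int D, const int h[])
{
    double hops = 0.0, fwd = 0.0, dims = 0.0;
    for (int i = 0; i < D; i++) {
        double used = h[i] > 0 ? 1.0 : 0.0;
        hops += h[i];
        fwd  += h[i] - used;
        dims += used;
    }
    return 2.0 * O_NS + hops * LP_NS + fwd * LF_NS + (dims - 1.0) * LS_NS;
}

/* sum of ping-request latencies from a fixed source to all n^D - 1 destinations */
static double multi_unicast(int D, int n)
{
    int h[MAX_D] = { 0 };
    double total = 0.0;
    for (;;) {
        int i = 0;
        while (i < D && ++h[i] == n)   /* advance h like a base-n counter */
            h[i++] = 0;
        if (i == D)                    /* wrapped back to the source: done */
            break;
        total += l_request(D, h);
    }
    return total;
}

int main(void)
{
    printf("3x3 torus, one-to-all multi-unicast: %.0f ns\n", multi_unicast(2, 3));
    return 0;
}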











6.1 Current System

To investigate the relative latency performance of different topology alternatives

using current hardware, the performance of average latency and one-to-all multi-unicast

applications was derived analytically using data obtained directly from the experimental

testing. The results obtained are shown in Figure 14. On each figure, the crossover

points identify the system sizes after which an incremental increase in the topology

dimension offers superior latency performance.


Figure 14: Inter-topology comparison of current system. (a) average latency; (b) one-to-all multi-unicast. Each plot shows latency versus total number of nodes, up to 1000, for 1D through 4D topologies, with the crossover points discussed below marked on the curves.




Since this study was conducted using topologies having equal numbers of nodes

in each dimension, the ring is the only featured topology that can offer every system size

within the range of interest. An extrapolation of the higher-dimensional topologies was

used to fill in the gaps and identify the exact crossover points at which certain topology

types surpass the performance of others. For this reason, crossover points do not

necessarily align with equi-dimensional topology types, but they still provide a useful

basis for inter-topology comparisons.
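The extrapolation just described can be sketched as a simple sweep over system sizes: for each size, evaluate the Equation 17 average latency for each dimensionality with a fractional ring size n = size^(1/D) and report where the winning dimensionality changes. The sketch below uses the 64-byte Table 1 estimates; halving ls approximates the enhanced system of Section 6.2. It is intended only to illustrate how crossover points can be located, not to reproduce Figure 14 exactly.

#include <math.h>
#include <stdio.h>

/* Equation 17 average latency (see the Section 4.2 sketch) */
static double avg_latency(double o, double lp, double lf, double ls, int D, double n)
{
    double nD   = pow(n, D);
    double prop = (D / 2.0) * nD * (n - 1.0) / (nD - 1.0);
    double sw   = D * pow(n, D - 1) * (n - 1.0) / (nD - 1.0) - 1.0;
    double fwd  = prop - 1.0 - sw;
    return 2.0 * o + prop * lp + fwd * lf + sw * ls;
}

int main(void)
{
    const double o = 2085.0, lp = 7.0, lf = 60.0, ls = 670.0;  /* current system */
    int prev_best = 1;
    for (int size = 4; size <= 1000; size++) {
        int best = 1;
        for (int D = 2; D <= 4; D++) {
            double n_best = pow((double)size, 1.0 / best);
            double n_d    = pow((double)size, 1.0 / D);
            if (avg_latency(o, lp, lf, ls, D, n_d) <
                avg_latency(o, lp, lf, ls, best, n_best))
                best = D;
        }
        if (best != prev_best) {
            printf("%dD overtakes %dD at about %d nodes\n", best, prev_best, size);
            prev_best = best;
        }
    }
    return 0;
}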









The average latency application, shown in Figure 14a, demonstrates clear

scalability differences between topologies, with a one-dimensional topology offering the

best latency performance for systems having fewer than 18 nodes. However, for this

basic ring, messages sent by any node on the ring must pass through every other node in

the entire system, thus limiting scalability. Beyond 18 nodes, the two-dimensional

topology is able to limit the traffic paths to a sufficient extent to outweigh the large

dimension-switching penalty paid as the number of dimensions increases.

The two-dimensional topology continues to lead latency performance up to 191

nodes, at which point the additional path savings achieved using a three-dimensional

topology now outweighs the added switching latency for this higher-dimensional

topology. The three-dimensional topology continues to lead latency performance up to

1000 nodes and beyond.

The situation for the one-to-all multi-unicast application, shown in Figure 14b, is

quite different. The savings achieved in going from one to two dimensions is

pronounced, but for higher dimensions, the relative latency performance does not vary

much within the range of interest. For this application, one-dimensional topologies lead

latency performance for system sizes smaller than 20 nodes, at which point the path

savings of the two-dimensional topology enables it to provide the best latency

performance up to 220 nodes. The three-dimensional topology then offers the best

latency performance up to 1000 nodes and beyond.

These results demonstrate that one- and two-dimensional topologies dominate

latency performance for small and medium system sizes. Crossover points depend

primarily upon the relative magnitude of switching and forwarding delays. Although









higher-dimensional topologies offer significant path savings for point-to-point traffic, the

associated switching penalty makes these topologies impractical for medium-sized

systems.

As a means of comparison, these results mirror those achieved by Bugge [4], who

performed similar comparisons of multi-dimensional torus topologies based strictly on a

throughput study. Although the crossover points are different, his conclusions are

equivalent, with higher-dimensional topologies becoming practical only for very large

system sizes.


6.2 Enhanced System

Advances in semiconductor manufacturing techniques have been able to sustain a

breathtaking pace of improvement in related technologies. As such, it is reasonable to

expect that systems will be available in the near future that significantly outperform

current hardware. To investigate the relative latency performance of different topology

alternatives using such enhanced hardware, the analytical models were fed data that

artificially enhanced system performance.

The component calculations in Figure 12 demonstrate the order of magnitude

difference between switching and forwarding latencies (670 ns and 60 ns respectively).

The topology comparisons in Figure 14 demonstrate that this large difference between

switching and forwarding latencies limits the practicality of higher-dimensional

topologies for medium-sized systems. An improvement in the switching latency

parameter should therefore produce promising results. To examine latency performance

of hardware having an enhanced switching latency, the original value (670 ns) is halved











(335 ns) to explore the effect this design improvement has on relative topology


performance. Figure 15 shows the results achieved after making this change.


Figure 15: Inter-topology comparison of enhanced system. (a) average latency, with crossover points at 9, 45, and 232 nodes; (b) one-to-all multi-unicast, with crossover points at 5, 9, and 18 nodes. Each plot shows latency versus total number of nodes for 1D through 4D topologies.




The average latency application, shown in Figure 15a, once again demonstrates


clear differences between topologies. The one-dimensional topology is now


outperformed by the two-dimensional topology for a system size above 9 nodes. The


two-dimensional topology leads latency performance until 45 nodes, at which point the


three-dimensional topology leads latency performance up to 232 nodes. The four-


dimensional topology now becomes a practical consideration within the range of interest


and provides the best latency performance up to 1000 nodes and beyond.


The one-to-all multi-unicast application performance, shown in Figure 15b,


reflects similar trends to those in the previous configuration. The savings achieved in


going from one to two dimensions is again more pronounced than subsequent dimension


increases, but there is a clear downward shift overall as the crossover points all occur for


smaller system sizes. One-dimensional topologies are quickly outperformed by the two-


dimensional topology (5 nodes), which then leads latency performance up to only 9









nodes, at which point the three-dimensional topology leads latency performance up to

only 18 nodes. The four-dimensional topology dominates latency performance for the

remaining range of system sizes.

Although the crossover points for multi-unicast on enhanced hardware are significantly smaller than those obtained with current hardware, this downward shift is less pronounced than in the average latency case, because the best latency for a given system size improves less in the multi-unicast comparison (Figure 15b) than in the average latency comparison (Figure 15a). Table 2

summarizes the crossover points observed for average latency and one-to-all multi-

unicast applications using both the current system and the enhanced system. This table

shows the system sizes at which the path savings of each dimension increase outweighs

the associated switching penalty, facilitating superior latency performance for the higher-

dimensional topology in each case.




Table 2: Summary of crossover points (in nodes).

Crossover point    Current system                      Enhanced system
                   Average latency   Multi-unicast     Average latency   Multi-unicast
1D -> 2D           18                20                9                 5
2D -> 3D           191               220               45                9
3D -> 4D           1831              2050              232               18
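
As a rough illustration of how such crossover points can be located once latency is expressed as a function of system size, the sketch below scans node counts and reports the first size at which a higher-dimensional model undercuts a lower-dimensional one. The two model functions avg_latency_1d and avg_latency_2d are hypothetical placeholders (a unidirectional ring whose average path grows linearly with node count versus a two-dimensional layout that pays one extra switching delay but has only square-root path growth); they are not the analytical models of this thesis, whose actual crossover values are those listed in Table 2.

#include <math.h>
#include <stdio.h>

#define T_FORWARD 60.0  /* forwarding delay per ring hop (ns)        */
#define T_SWITCH 670.0  /* switching delay per dimension change (ns) */

/* Hypothetical 1-D model: the average path on a unidirectional ring of
 * n nodes covers roughly n/2 forwarding hops. */
static double avg_latency_1d(int n)
{
    return (n / 2.0) * T_FORWARD;
}

/* Hypothetical 2-D model: two rings of roughly sqrt(n) nodes each, so
 * about sqrt(n) forwarding hops in total plus one switch traversal. */
static double avg_latency_2d(int n)
{
    return sqrt((double)n) * T_FORWARD + T_SWITCH;
}

int main(void)
{
    int n;
    for (n = 2; n <= 1000; n++) {
        if (avg_latency_2d(n) < avg_latency_1d(n)) {
            printf("2-D overtakes 1-D at %d nodes\n", n);
            break;
        }
    }
    return 0;
}

With these placeholder functions the crossover falls around 34 nodes; the thesis's full models place the 1D-to-2D crossover at 18 to 20 nodes for the current system, as shown in Table 2.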


The results indicate that enhancements in the switching latency (achieved perhaps

through the use of a wider or faster internal B-link bus, or by using wormhole routing

instead of store-and-forward at switch points) would enable higher-dimensional

topologies to become practical for smaller system sizes. Such an enhancement would









provide moderately better latency performance, but this may not justify the added

complexity of a higher-dimensional topology.

The average latency data suggests that the enhancement may be warranted, since

the best latency performance for medium-sized systems is seen to improve, although only

by a modest amount (e.g. approx. 5% improvement for a system size of 100 nodes).

However, for the one-to-all multi-unicast application, the enhanced system offers no real

improvement for medium system sizes. The enhanced hardware only offers an

improvement in multi-unicast performance for large system sizes (e.g. approx. 10%

improvement for a system size of 1000 nodes).














CHAPTER 7
CONCLUSIONS

This thesis introduces an analytical characterization of SCI network performance

and topology comparison from a latency perspective, motivated by architectural considerations. Analytical models were developed for point-to-point and average

latency of various topology types, and a validation exercise demonstrated that these

models closely match equivalent experimental results. Based on these models, this work

helps determine architectural sources of latency for various systems and provides a

straightforward means to project network behavior in the absence of an expensive

hardware testbed and without requiring the use of computationally-intensive simulative

models.

Using system parameters derived from experimental testing, topology differences

for a range of system sizes are found to be a result of the large difference between

forwarding latencies and switching latencies. Analytical projections demonstrate the

tradeoffs between path savings on higher-dimensional topologies versus the large

switching penalty paid when increasing the number of dimensions.

One-dimensional topologies offer superior latency performance for small numbers

of nodes, but are soon outperformed by two-dimensional topologies due to the inherent

lack of scalability of the basic SCI ring. Using current hardware, the two-dimensional

topology continues to lead latency performance for medium system sizes (ranging

approximately from 20 nodes to 200 nodes). For larger system sizes, the three-

dimensional topology provides the best latency performance for the remainder of the









range of interest. When using an enhanced system with a smaller switching latency,

higher-dimensional topologies become more practical for medium-sized systems, but the

improvement in best latency performance for such system sizes is only moderate.

In terms of future directions for this research, although the current models provide

an accurate approximation of the experimental data, they can be further elaborated to

include finer-grained representations of constituent network events. These improvements

could involve investigating more subtle phenomena (e.g. contention issues) thereby

enhancing the fidelity of the models. In terms of experimental analysis, further testbed

work could involve more elaborate types of topology alternatives. As the available

system resources continue to increase, further work can include studies with larger

numbers of nodes, bi-directional rings, faster network/host interfaces, and switch-

inclusive studies.

In addition, while the average latency and one-to-all multi-unicast applications

provide a practical comparison of topology types, opportunity exists for the study of more

types of traffic patterns than the ones investigated here. Some examples of such

application patterns include all-to-all, nearest-neighbor, unbalanced communication and

tree-based multicasting. Ultimately, such enhancements can be used to predict the

behavior of more complex parallel applications, and map these applications to the

topology types that best serve their needs.

















APPENDIX A
SCIBENCH CODE LISTING

/********************************************************
* SCIBENCH *
* Shared memory benchmarking for Scali USRAPI *
* *
* Damian M. Gonzalez gonzalez@hcs.ufl.edu *
* HCS Lab University of Florida, Gainesville *

 ********************************************************/
#include "scasci.h"
#include "rtl.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>

#define MB 1024*1024
#define KB 1024

#define MEMALIGN 1 /* 0 No memory alignment, 'gathering' effect is seen in the results */
/* 1 Memory aligned so that the PSB automatically flushes (recommended) */

#define ACCURACY 0.10 /* A warning is printed if (max-min) is greater than (min*ACCURACY) */

#define MAXSIZE 64*KB
/*#define MAXSIZE 4*KB */

#define NUMREPS 15 /* number of iterations made to determine max, min, avg */
#define TIMEPERPOINT 2E4 /* self explanatory, in microseconds */
#define RESOLUTION 2048 /* interval between successive points, in bytes */

#define TESTTYPE 1 /* 1= one way WITH an acknowledge at the end */
/* 2= ping pong test */

#define IAMSERVER (uServCliBool==1)

/* for timing */
struct timeval st, et;
double elapsed;
time_t timevar;

/* for statistics */
double latencies[(MAXSIZE/RESOLUTION)+1] [NUMREPS+1];
double throughputs[(MAXSIZE/RESOLUTION)+1] [NUMREPS+1];
double min,max,avg,total;

static void _GetNumber (const char *sz,unsigned32 *uValue,BOOL *fOK)









{
char *ep;

*uValue = strtoul (sz,&ep,0);
*fOK = (*ep == 0);
}


static void _GetArguments (
int argc,
char **argv,
unsigned32 *uServCliBool,
unsigned32 *uLocalModuleID,
unsigned32 *uLocalChunkID,
unsigned32 *uRemoteModuleID,
unsigned32 *uRemoteChunkID,
unsigned32 *uLocalChunkSize,
unsignedl6 *uRemoteNodeID,
BOOL *fOK
)
{
if (argc != 8)
{
*fOK = FALSE;
}
else
{
_GetNumber (argv [1],uServCliBool,fOK);
if (*fOK == FALSE)
{
fprintf (stderr,"Problem with the Server/Client boolean.\n");
}
else
{
if(*uServCliBool==0 || *uServCliBool==1)
{

_GetNumber (argv [2],uLocalModuleID,fOK);
if (*fOK == FALSE)
{
fprintf (stderr,"Invalid local module ID.\n");
}
else
{
_GetNumber (argv [3],uLocalChunkID,fOK);
if (*fOK == FALSE)
{
fprintf (stderr,"Invalid local chunk ID.\n");
}
else
{
_GetNumber (argv [4],uRemoteModuleID,fOK);
if (*fOK == FALSE)
{
fprintf (stderr,"Invalid Remote module ID.\n");
}
else
{
_GetNumber (argv [5],uRemoteChunkID,fOK);
if (*fOK == FALSE)
{
fprintf (stderr, "Invalid Remote chunk ID.\n");
}
else
{
_GetNumber (argv [6],uLocalChunkSize,fOK);
if (*fOK == FALSE)
{
fprintf (stderr,"Invalid chunk size.\n");
}
else
{
unsigned32 uTemp;
_GetNumber (argv [7],&uTemp,fOK);
if (*fOK == FALSE)
{
fprintf (stderr,"Invalid remote node ID.\n");
}
else
{
*uRemoteNodeID=(unsigned16)uTemp;
}
}
}
}
}
}
}
else
{
fprintf (stderr,"Server/Client boolean should be a 1 (SERVER) or a 0 (CLIENT).\n");
*fOK = FALSE;
}
}
}
}




static void _Usage (void)
{

printf ("***************************************************************\n");
printf ("* SCIBENCH *\n");
printf ("* Shared memory benchmarking for Scali USRAPI *\n");
printf("* *\n");
printf ("* Damian M. Gonzalez gonzalez@hcs.ufl.edu *\n");
printf ("* HCS Lab University of Florida, Gainesville *\n");
printf("* *\n");
printf ("* This program, when used properly, sets up two *\n");
printf ("* shared memory segments, on a pair of machines, *\n");
printf ("* and each machine maps the remote segment to it's *\n");
printf ("* virtual memory. The test type, maximum data size, *\n");
printf ("* number of repetitions, time per datapoint, and *\n");
printf ("* interval between successive points are all set *\n");










printf ("* using the variables near the top of the code. *\n");
printf("* *\n");
printf ("* Measures: Max/Min/Avg/Max-Min Lat/Thrpt *\n");
printf ("* Using : OneWay/PingPong tests *\n");
printf ("* Can vary: TESTTYPE MAXSIZE TIME PERPOINT *\n");
printf ("* NUMREPS ACCURACY RESOLUTION *\n");
printf ("* *\n");
printf ("***************************************************************\n");
printf ("* Instructions: *\n");
printf ("* rsh to the SERVER (nodeid 0x1100) and type: *\n");
printf ("* *\n");
printf ("* scibench 1 *\n");
printf ("* (e.g. scibench 1 1 1 1 1 1048576 0x1200 ) *\n");
printf ("* *\n");
printf ("* rsh to the CLIENT (nodeid 0x1200) and type: *\n");
printf ("* *\n");
printf ("* scibench 0 *\n");
printf ("* (e.g. scibench 0 11 1 1 1048576 0x1100 ) *\n");
printf ("***************************************************************\n");
exit(1);
}


int main (int argc,char **argv)
{
BOOL fOK;
register int j;
int repvar;
volatile int repetitions;
char pause;
unsigned32 uServCliBool;
unsigned32 uLocalModuleID;
unsigned32 uLocalChunkID;
unsigned32 uRemoteModuleID;
unsigned32 uRemoteChunkID;
unsigned32 uVal;
unsigned32 uLocalChunkSize;
unsigned32 uRemoteChunkSize;
volatile unsigned32 uMsgSize;
unsignedl6 uRemoteNodeID;
unsignedl6 uLocalPsbNodeId;
unsigned nAdapters;
void *local virtual addr raw;
void *remotevirtualaddr raw;



PSHARABLE pshm;
PCONNECTOR pcon;
ICM STATUS status;

setvbuf(stdout, NULL, _IONBF, 0);
setvbuf(stdin, NULL, _IONBF, 0);

_GetArguments (argc,argv,&uServCliBool,&uLocalModuleID,&uLocalChunkID,
&uRemoteModuleID,&uRemoteChunkID, &uLocalChunkSize,&uRemoteNodeID,&fOK);










if (!fOK)
_Usage ();


Scilnitialize (SCIUSRVERSION, &fOK);
if (!fOK)
{
fprintf (stderr,"Could not initialize USRAPI V%/u.\n",SCIUSRVERSION);
return 1;
}
nAdapters = SciGetNumberOfAdapters ();

if (nAdapters == 0)
{
fprintf (stderr,"There are no SCI adapters on this machine.\n");
SciClose();
return 1;
}
else
{
printf ("The # of SCI adapters on this machine = %d\n", nAdapters);
}

SciFlush(0);

uLocalPsbNodeId = SciGetNodeId (0);
printf ("Using local SCI device with node ID Oxx.\n",uLocalPsbNodeId);
/* allocate local memory chunk */
SciAllocateLocalChunk (&pshm,uLocalChunkSize,LMAP_CONSISTENT,&status);
if (status != ICMS_OK)
{
fprintf (stderr,"Could not allocate %u bytes sharable memory (%s).\n", uLocalChunkSize,
SciErrorString (status));
}
else
{
/* map local memory chunk */
SciMapLocalChunk(pshm,0,0,MPREADWRITE,&status,&local virtualaddrraw);
if (status != ICMSOK)
{
fprintf (stderr,"Error (%s) creating user level mapping (SciMapLocalChunk()\n",SciErrorString
(status));
}
else
{
/* offer local memory chunk */
SciOffer (pshm, 0, uLocalModuleID, uLocalChunkID, LMAP_CONSISTENT, &status);
if (status != ICMS_OK)
{
fprintf (stderr,"Error (%s) introducing memory on SCI (SciOffer()).\n",SciErrorString (status));
}
else
{
printf ("Sharing %u bytes on node Oxx as module %u chunk %u .\n",
uLocalChunkSize, uLocalPsbNodeId, uLocalModuleID, uLocalChunkID);










SciConnectToRemoteChunk (&pcon,0,uRemoteNodelD,uRemoteModuleID,
uRemoteChunkID,&status);
while (status != ICMSOK)
{
fprintf (stderr,"Could not connect to remote memory at node Oxx module %u chunk %u
(%s).\n Trying again...\n",
uRemoteNodelD,uRemoteModuleID,uRemoteChunkID,SciErrorString (status));
sleep(3);
SciConnectToRemoteChunk(&pcon,0,uRemoteNodeID,uRemoteModulelD,
uRemoteChunkID,&status);
}
printf ("Successfully connected to remote memory at node Oxx.\n", uRemoteNodeID);

printf ("Mapping module %u chunk %u from remote node
Oxox\n",uRemoteModuleID,uRemoteChunkID,uRemoteNodeID);

SciMapRemoteChunk(pcon,0,0,RMAP_GATHERING,LMAP_CONSISTENT,MPREADWRITE,
&status, &remotevirtualaddrraw);
while (status != ICMSOK)
{
fprintf (stderr,"Could not map remote memory (%s).\n Trying again...\n",
SciErrorString (status));
sleep(3);

SciMapRemoteChunk(pcon,0,0,RMAP_GATHERING,LMAP_CONSISTENT,MPREADWRITE,
&status, &remotevirtualaddrraw);
}
printf ("Successfully mapped remote memory from remote node Oxx.\n", uRemoteNodeID);

if(status == ICMS_OK)
{
volatile unsigned32 *nodeidremote;
volatile unsigned32 *nodeid_local;

volatile unsigned32 *pRemote;
volatile unsigned32 *pLocal;

volatile unsigned32 *to;
volatile long long int *to_llint;
volatile int *to int;
volatile short *toshort;
volatile char *tochar;

volatile unsigned32 *from;
volatile long long int *fromllint;
volatile int *from int;
volatile short *from short;
volatile char *from char;

int c, ok=0;
char *rbuffer;
int *pLocalTempDatabuffer;

unsigned i, sizevar;

/* the following code 'fixes' the virtual addresses so that they end with '000000000' */










volatile unsigned32 *remotevirtual_addr_aligned =
(volatile unsigned32 *)(((int)remotevirtualaddrraw + 511) & ~511);
volatile unsigned32 *localvirtual_addraligned =
(volatile unsigned32 *)(((int)localvirtual_addr_raw + 511) & ~511);

nodeidlocal = localvirtualaddraligned + 0x10;
nodeidremote = remotevirtualaddraligned + 0x10;

pLocal = localvirtual_addraligned + 0x20;
pRemote = remotevirtual_addraligned + 0x20;

*nodeidlocal = uLocalPsbNodeId; /* assign it the value of the nodeid */

uRemoteChunkSize = SciGetSizeOfRemoteChunk (pcon);
uRemoteChunkSize = uRemoteChunkSize - ((int)remotevirtual_addraligned -
(int)remotevirtualaddrraw);

sleep(3);

printf ("Remote memory module %u chunk %u with owner Oxo%x and size %d has been mapped
into user space.\n",
uRemoteModuleID,uRemoteChunkID,*nodeidremote,uRemoteChunkSize);

if (!IAMSERVER) /* client prints out information about the test */
{
time(&timevar);
printf("%s", ctime(&timevar));
printf("Number of outer loops: %d \n", NUMREPS);
printf("Maximum message size: %d bytes\n", MAXSIZE);
printf("Approximate time per datapoint: %10.6f seconds \n",
((double)TIMEPERPOINT)/1E6);
printf("Anticipated duration of test: %10.6f minutes\n",
((NUMREPS*((double)TIMEPERPOINT/1E6)*(MAXSIZE/RESOLUTION))/(60)));

}

switch (TESTTYPE)
{
case 1:
{
printf ("**ONE WAY WRITES WITH AN ACKNOWLEDGE**\n");
if (IAMSERVER) /* code for SERVER */
{
printf ("** I AM THE SERVER **\n");
for (rep_var = 0; rep_var < NUMREPS; rep_var++)
{
uMsgSize= 1;

/* synchronization phase */

tochar = (char*)pRemote;
fromchar = (char*)pLocal;
fromchar[10] = 1;
for( i=0; i < 5; i++ )
{
while(from char[10]!=(char)i);









to_char[10]=(char)i;
}
/* printf("end of first synchronization phase\n"); */

for( i=0; i < 100; i++ )
{
while(fromchar[10] !=(char)i);
to_char[10]=(char)i;
}
/* printf("end of second synchronization phase\n"); */


for(sizevar=0; uMsgSize <= MAXSIZE; sizevar++)
{
/* printf("message size = %d\n", uMsgSize);*/
printf(".");

while( (pLocalTempDatabuffer = (int*)RtlMemAlign(32, uMsgSize)) == NULL)
{
printf("\n Couldn't allocate %d kb buffer!\n", uMsgSize);
exit(1);
}

/* Can't have the CLIENT and SERVER both calculate repetions */
/* need to receive the repetitions info from the client */
/* uses three message handshake for repetitions transfer */
to_int = (int*)pRemote;
fromint = (int*)pLocal;
/* CLIENT sends 'repetitions' to the SERVER, we receive it below */

fromint[0]=0;
fromint[1]=0;
repetitions=0;

while(fromint[l]!=1)
repetitions=from_int[0];
repetitions=fromint[0];
fromint[0]=0;

/* SERVER sends 'repetitions' to the CLIENT so CLIENT can confirm that it was
received*/
while(fromint[l]!=2)
{
to_int[0] = repetitions;
to_int[1] = 1;
}

/* third synchronization phase */

fromchar[10] = 1;
for( i=0; i < 100; i++ )
{
while(fromchar[10] !=(char)i);
to char[10]=(char)i;









switch(uMsgSize)
{
case 1:
{
to_char = (char*)pRemote;
fromchar = (char*)pLocal;
if(MEMALIGN==1)
{
tochar = tochar + (64-(uMsgSize%64));
from_char = from_char + (64-(uMsgSize%64));
}
fromchar[0]='x';

while(fromchar[0] !='o'); /* can't hold an int, use x/o chars */
to_char[0]='o';

fromchar[0]=0;
};
break;
case 2:
{
to_short = (short*)pRemote;
fromshort = (short*)pLocal;
if(MEMALIGN==1)
{
toshort = to short + ((64-(uMsgSize%64))/2);
from_short = from_short + ((64-(uMsgSize%64))/2);
}
fromshort[0]=1;

while(from short[0]!=(short)(repetitions-1));
to_short[0]=(short)(repetitions-1);

fromshort[0]=0;
};
break;
case 4:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
to_int =to_int + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
fromint[0]=1;

while(from int[0]!=(int)(repetitions-1));
to_int[0]=(int)(repetitions-1);

fromint[0]=0;
};
break;
case 8:
{
to _Hint = (long long int*)pRemote;










fromllint = (long long int*)pLocal;
if(MEMALIGN==1)
{
tolint =to_llint + ((64-(uMsgSize%/64))/8);
fromlint = fromlint + ((64-(uMsgSize%/64))/8);
}
fromllint[0]= 1;

while(fromllint[0]!=(long long int)(repetitions-1));
to_llint[0]=(long long int)(repetitions-1);

fromllint[0]=0;
};
break;
default:
{

to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
toint =toint + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
from_int[((uMsgSize/4)-1)]=1;

while(fromint[((uMsgSize/4)-1)]!=(int)(repetitions-1))
{
/*
printf("uMsgSize=%d, ", uMsgSize);
printf("((uMsgSize/4)-l)=%d, ", ((uMsgSize/4)-l));
printf("fromint[((uMsgSize/4)-l)]=%d, ", fromint[((uMsgSize/4)-l)]);
printf("repetitions=%d, ", repetitions);
printf("repetitions-l=%d\n ", repetitions-1);
*/
};

pLocalTempDatabuffer[((uMsgSize/4)-l)]=(int)(repetitions-1);
memcpy((void*)toint, (void*)pLocalTempDatabuffer, uMsgSize);

fromint[((uMsgSize/4)-1)]=0;
}
}

if(uMsgSize==1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;

printf("%d / %d complete\n",repvar+1,NUMREPS);
}

else /* code for CLIENT */
{
printf ("** I AM THE CLIENT **\n");
for (rep var = 0; rep var < NUM REPS; rep var++)










{
uMsgSize= 1;

/* synchronization phase */
tochar = (char*)pRemote;
fromchar = (char*)pLocal;
fromchar[10]=1;
for( i=0; i < 5; i++)
{
to_char[10]=(char)i;
while(fromchar[10] !=(char)i)
to_char[10]=(char)i;
I
/* printf("end of first synchronization phase\n"); */
for( i=0; i < 100; i++)
{
to_char[10]=(char)i;
while(fromchar[10] !=(char)i);
I
/* printf("end of second synchronization phase\n"); */

for(sizevar=0; uMsgSize <= MAXSIZE; sizevar++)
{
printf(".");
/* printf("uMsgSize=%d\n", uMsgSize);*/
if( (pLocalTempDatabuffer = (int*)RtlMemAlign(32, uMsgSize)) == NULL)
{
printf("\n Couldn't allocate %d kb buffer!\n", uMsgSize);
exit(l);
}
/* work out the required number of repetitions */
repetitions = getreps(pRemote, pLocalTempDatabuffer, uMsgSize)*10;

/* Can't have the CLIENT and SERVER both calculate repetitions */
/* need to send the repetitions info to the server */
/* use two phase handshake */
to_int = (int*)pRemote;
from_int = (int*)pLocal;

/* CLIENT sends 'repetitions' to the SERVER to get the info across */
to_int[0] = repetitions;
usleep(20);
to_int[1]=l;
/* SERVER sends 'repetitions to the CLIENT so CLIENT can know that it was
received */
/* keep sending till we get this acknowledgement */
fromint[0]=0;
fromint[1]=0;
while(fromint[1]!=1)
{
to_int[0] = repetitions;
to_int[1] = 1;
}
usleep(50);
if (fromint[0] != repetitions)
printf("ERROR (from int[0]=%d)\n", from int[0]);









else
toint[1] = 2;

/* third synchronization phase */
fromchar[10]=1;
for( i=; i < 100; i++)
{
to_char[10]=(char)i;
while(fromchar[10] !=(char)i)
to_char[10]=(char)i;
}

switch(uMsgSize)
{
case 1:
{
to_char = (char*)pRemote;
fromchar = (char*)pLocal;
if(MEMALIGN==1)
{
to_char =to_char + (64-(uMsgSize%64));
fromchar = fromchar + (64-(uMsgSize%64));
}
fromchar[0]=1;
tochar[0]='x';
gettimeofday(&st, NULL);

for(i=0;i {
if(i!=repetitions-1)
to_char[0]='x';
else
to_char[0]='o';
}
while(fromchar[0] !='o');

gettimeofday(&et, NULL);
fromchar[0]=0;
};
break;
case 2:
{
to_short = (short*)pRemote;
fromshort = (short*)pLocal;
if(MEMALIGN==1)
{
to_short =to_short + ((64-(uMsgSize%64))/2);
fromshort = fromshort + ((64-(uMsgSize%/64))/2);
}
fromshort[0]=1;
gettimeofday(&st, NULL);

for(i=0;i to_short[0]=(short)i;
while(from short[0] !=(short)(repetitions-1));









gettimeofday(&et, NULL);
from short[0]=0;
};
break;
case 4:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
toint =to_ int + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
fromint[0]= 1;
gettimeofday(&st, NULL);

for(i=0;i to_int[0]=(int)i;
while(from int[0]!=(int)(repetitions-1));

gettimeofday(&et, NULL);
fromint[0]=0;
};
break;
case 8:
{
to_llint = (long long int*)pRemote;
fromllint = (long long int*)pLocal;
if(MEMALIGN==1)
{
toliint =to_llint + ((64-(uMsgSize%64))/8);
fromllint = fromllint + ((64-(uMsgSize%/64))/8);
}
fromllint[0]= 1;
gettimeofday(&st, NULL);

for(i=0;i to_llint[0]=(long long int)i;
while(fromllint[0]!=(long long int)(repetitions-1));

gettimeofday(&et, NULL);
fromllint[0]=0;
};
break;
default:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
toint =to_int + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
fromint[((uMsgSize/4)-1)]=1;
gettimeofday(&st, NULL);










for(i=0;i {
pLocalTempDatabuffer[((uMsgSize/4)-1)]=(int)i;
memcpy((void*)to_int, (void*)pLocalTempDatabuffer, uMsgSize);
}
while(from int[((uMsgSize/4)-1)]!=(int)(repetitions-1));

gettimeofday(&et, NULL);
from int[uMsgSize-1]=0;
}
}

if(et.tvusec < st.tv usec)
{
et.tv usec += 1E6;
et.tv sec -= 1;
}

/* calculate elapsed time in us */
elapsed=(et.tv_sec - st.tv_sec)*1E6 +
(et.tv_usec - st.tv_usec);


/* Double-checking position in array */
if(repvar==0)
{
latencies[size_var] [0] = uMsgSize;
throughputs[sizevar] [0] = uMsgSize;
}
else
{
if(latencies[sizevar] [0] != uMsgSize)
printf("Latency Array Error (latencies[%d] [0]=%10.6fuMsgSize=%d)\n",
size var,
latencies[sizevar] [0],
uMsgSize);

if(throughputs[sizevar][0] != uMsgSize)
printf("Throughput Array Error(throughputs[%od] [0]=%10.6fuMsgSize=%d)\n",
size var,
throughputs[sizevar] [0],
uMsgSize);

}

latencies[size_var] [repvar+l] = elapsed/repetitions;
throughputs[sizevar] [repvar+l]= (uMsgSize*repetitions)/elapsed;

/*
printf("%8.5f,%6d,%7d,%10.6f,%09.3f\n",
elapsed/lE6,
repetitions,
uMsgSize,
(elapsed)/repetitions,
(uMsgSize*repetitions)/(((elapsed)/1E6)*1048576));*/










if(uMsgSize==1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;

printf("%d / %d complete\n",repvar+1,NUMREPS);

print_latency_array();
print_throughput_array();
print_latency_summary();
print_throughput_summary();
}
};
break;

case 2:
{
printf ("**PING PONG TEST**\n");
if (IAMSERVER) /* code for SERVER */
{
printf ("** I AM THE SERVER **\n");
for (rep_var = 0; rep_var < NUMREPS; rep_var++)
{
uMsgSize= 1;

/* synchronization phase */

tochar = (char*)pRemote;
fromchar = (char*)pLocal;
fromchar[10] = 1;
for( i=0; i < 5; i++ )
{
while(fromchar[10] !=(char)i);
to_char[10]=(char)i;
I
/* printf("end of first synchronization phase\n"); */

for( i=0; i < 100; i++ )
{
while(fromchar[10] !=(char)i);
to_char[10]=(char)i;
I
/* printf("end of second synchronization phase\n"); */


for(sizevar-0; uMsgSize <= MAXSIZE; sizevar++)
{
/* printf("uMsgSize = %d\n", uMsgSize); */

printf(".");

while( (pLocalTempDatabuffer = (int*)RtlMemAlign(32, uMsgSize)) == NULL)
{
printf("\n Couldn't allocate %d kb buffer!\n", uMsgSize);
exit(l);











/* Can't have the CLIENT and SERVER both calculate repetions */
/* need to receive the repetitions info from the client */
/* uses three message handshake for repetitions transfer */
to_int = (int*)pRemote;
fromint = (int*)pLocal;
/* CLIENT sends 'repetitions' to the SERVER, we receive it below */

fromint[0]=0;
fromint[1]=0;
repetitions=0;

while(fromint[1]!=1)
repetitions=from_int[0];
repetitions=fromint[0];
fromint[0]=0;

/* SERVER sends 'repetitions' to the CLIENT so CLIENT can confirm that it was
received*/
while(fromint[1]!=2)
{
to_int[0] = repetitions;
to_int[1] = 1;
}

/* second synchronization phase */
fromchar[10] = 1;
for( i=0; i < 100; i++ )
{
while(fromchar[10] !=(char)i);
to_char[10]=(char)i;
}

switch(uMsgSize)
{
case 1:
{
to_char = (char*)pRemote;
fromchar = (char*)pLocal;
if(MEMALIGN==1)
{
tochar = tochar + (64-(uMsgSize%64));
from_char = from_char + (64-(uMsgSize%64));
}
fromchar[0]=1;
for(i=0;i {
while(fromchar[0] !=(char)i);
tochar[0]=(char)i;
}
fromchar[0]=0;
};
break;
case 2:
{
to short = (short*)pRemote;









fromshort = (short*)pLocal;
if(MEMALIGN==1)
{
toshort = toshort + ((64-(uMsgSize%64))/2);
from_short = from_short + ((64-(uMsgSize%64))/2);
}
fromshort[0]=1;
for(i=0;i {
while(fromshort[0] !=(short)i);
to_short[0]=(short)i;
}
fromshort[0]=0;
};
break;
case 4:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
to_int =to_int + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
fromint[0]= 1;
for(i=0;i {
while(fromint[0] !=(int)i);
toint[0]=(int)i;
}
fromint[0]=0;
};
break;
case 8:
{
to_llint = (long long int*)pRemote;
fromllint = (long long int*)pLocal;
if(MEMALIGN==1)
{
tollint =to_llint + ((64-(uMsgSize%64))/8);
fromllint = fromllint + ((64-(uMsgSize%/64))/8);
}
fromllint[0]= 1;
for(i=0;i {
while(fromllint[0]!=(long long int)i);
to_llint[0]=(long long int)i;
}
fromllint[0]=0;
};
break;
default:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)










{
toint =to_int + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
fromint[((uMsgSize/4)-1)]=1;
for(i=0;i {
while(fromint[((uMsgSize/4)-1)]!=(int)i);
pLocalTempDatabuffer[((uMsgSize/4)-1)]=(int)i;
memcpy((void*)to_int, (void*)pLocalTempDatabuffer, uMsgSize);
}
fromint[((uMsgSize/4)-1)]=0;
}
}

if(uMsgSize==1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;


printf("%d / %d complete\n",repvar+1,NUMREPS);
}

else /* code for CLIENT */
{
printf ("** I AM THE CLIENT **\n");
for (rep_var = 0; rep_var < NUMREPS; rep_var++)
{
uMsgSize= 1;

/* synchronization phase */
tochar = (char*)pRemote;
fromchar = (char*)pLocal;
fromchar[10]=1;
for( i=0; i < 5; i++)
{
to_char[10]=(char)i;
while(fromchar[10] !=(char)i)
to_char[10]=(char)i;
I
/* printf("end of first synchronization phase\n"); */
for( i=; i < 100; i++)
{
to_char[10]=(char)i;
while(fromchar[10] !=(char)i);

/* printf("end of second synchronization phase\n"); */

for(sizevar-0; uMsgSize <= MAXSIZE; sizevar++)
{
/* printf("uMsgSize = %d\n", uMsgSize); */
printf(".");

if( (pLocalTempDatabuffer = (int*)RtlMemAlign(32, uMsgSize)) == NULL)
{










printf("\n Couldn't allocate %d kb buffer!\n", uMsgSize);
exit(l);
}
/* work out the required number of repetitions */
repetitions = getreps(pRemote, pLocalTempDatabuffer, uMsgSize)*10;

/* Can't have the CLIENT and SERVER both calculate repetitions */
/* need to send the repetitions info to the server */
/* use two phase handshake */
to_int = (int*)pRemote;
from_int = (int*)pLocal;

/* CLIENT sends 'repetitions' to the SERVER to get the info across */
to_int[0] = repetitions;
usleep(20);
to_int[1]=l;
/* SERVER sends 'repetitions to the CLIENT so CLIENT can know that it was
received */
/* keep sending till we get this acknowledgement */
fromint[0]=0;
fromint[1]=0;
while(fromint[1]!=1)
{
to_int[0] = repetitions;
to_int[1] = 1;
; /* need this to context switch to allow the new write */
}
usleep(20);
if (fromint[0] != repetitions)
printf("ERROR (fromint[0]=%d)\n", fromint[0]);
else
to_int[1] = 2;

/* second synchronization phase */
fromchar[10]=1;
for( i=0; i < 100; i++)
{
to_char[10]=(char)i;
while(fromchar[10] !=(char)i)
to_char[10]=(char)i;
}

switch(uMsgSize)
{
case 1:
{
to_char = (char*)pRemote;
fromchar = (char*)pLocal;
if(MEMALIGN==1)
{
tochar = tochar + (64-(uMsgSize%64));
from_char = from_char + (64-(uMsgSize%64));
}
fromchar[0]=1;
gettimeofday(&st, NULL);
for(i=0;i








{
tochar[0]=(char)i;
while(fromchar[0] !=(char)i);
}
gettimeofday(&et, NULL);
fromchar[0]=0;
};
break;
case 2:
{
to_short = (short*)pRemote;
fromshort = (short*)pLocal;
if(MEMALIGN==1)
{
toshort = to short + ((64-(uMsgSize%64))/2);
from_short = from_short + ((64-(uMsgSize%64))/2);
}
fromshort[0]=1;
gettimeofday(&st, NULL);
for(i=0;i {
toshort[0]=(short)i;
while(fromshort[0] !=(short)i);
}
gettimeofday(&et, NULL);
fromshort[0]=0;
};
break;
case 4:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
toint =to_ int + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
fromint[0]= 1;
gettimeofday(&st, NULL);
for(i=0;i {
to_int[0]=(int)i;
while(fromint[0] !=(int)i);
}
gettimeofday(&et, NULL);
fromint[0]=0;
};
break;
case 8:
{
to_llint = (long long int*)pRemote;
fromllint = (long long int*)pLocal;
if(MEMALIGN==1)
{
tollint =to_llint + ((64-(uMsgSize%64))/8);
fromllint = fromllint + ((64-(uMsgSize%/64))/8);










}
fromllint[0]=1;
gettimeofday(&st, NULL);
for(i=0;i {
to_llint[0]=(long long int)i;
while(fromllint[0]!=(long long int)i);
}
gettimeofday(&et, NULL);
fromllint[0]=0;
};
break;
default:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
toint =toint + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
fromint[((uMsgSize/4)-1)]=1;
gettimeofday(&st, NULL);
for(i=0;i {
pLocalTempDatabuffer[((uMsgSize/4)-1)]=(int)i;
memcpy((void*)to_int, (void*)pLocalTempDatabuffer, uMsgSize);
while(fromint[((uMsgSize/4)-l)]!=(int)i);
}
gettimeofday(&et, NULL);
fromint[((uMsgSize/4)-l)]=0;
}
}


if(et.tvusec < st.tvusec)
{
et.tv usec += 1E6;
et.tv sec -= 1;
}

/* calculate elapsed time in us */
elapsed=(et.tv_sec - st.tv_sec)*1E6 +
(et.tv_usec - st.tv_usec);

/* Double-checking position in array */
if(repvar==0)
{
latencies[size_var] [0] = uMsgSize;
throughputs[sizevar] [0] = uMsgSize;
/*
printf("latencies[%d] [0]=%10.6f uMsgSize=%d\n", sizevar,
latencies[sizevar][0], uMsgSize);
printf("throughputs[%Od][0]=%10.6f uMsgSize=%d\n", sizevar,
throughputs[sizevar] [0], uMsgSize);
*/










}
else
{
if(latencies[size_var] [0] != uMsgSize)
printf("Latency Array Error (latencies[%d] [0]=%10.6f uMsgSize=%d)\n",
size var,
latencies[sizevar] [0],
uMsgSize);

if(throughputs[sizevar][0] != uMsgSize)
printf("Throughput Array Error(throughputs[%od] [0]=%10.6fuMsgSize=%d)\n",
size var,
throughputs[sizevar] [0],
uMsgSize);



latencies[size_var] [repvar+l] = (elapsed/2)/repetitions;
throughputs[sizevar][repvar+l] =
(uMsgSize*repetitions)/((((elapsed/2))/1E6)*1048576);

/*
printf("%8.5f,o%6d,%07d,%10.6f,%9.3f\n",
elapsed/1E6,
repetitions,
uMsgSize,
(elapsed/2)/repetitions,
(uMsgSize*repetitions)/(((elapsed/2)/1E6)*1048576));*/

if(uMsgSize==1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;

printf("%/d / %d complete\n",repvar+1,NUM REPS);

print_latency_array();
print_throughput_array();
print_latency_summary();
print_throughput_summary();
}
};
break;

default:
{
printf ("Invalid test type.\n");
break;
}
}

}
}


sleep(l);









SciWithdraw (pshm,0,uLocalModuleID,uLocalChunkID,&status);

if (status != ICMSOK)
fprintf (stderr,"Error (%s) withdrawing memory from SCI.\n",SciErrorString (status));
else
printf ("Memory withdrawn from SCI. Clients disconnected.\n");


SciDisconnectFromRemoteChunk (&pcon,&status);
if (status != ICMSOK)
fprintf (stderr,"Error (%s) disconnecting from remote memory.\n",SciErrorString (status));
else
printf ("Disconnected from memory.\n");

SciCloseLocalChunk (&pshm,&status);
if (status != ICMSOK)
fprintf (stderr,"Error (%s) freeing local memory.\n",SciErrorString (status));
else
printf ("Local memory freed.\n");


SciClose ();
return 0;
}


int getreps(long long int *tol, long long int *froml, int size)
{
/* this gives a rough baseline of the number of repetitions necessary for 1/10 the time per datapoint */
/* It is used for both types of tests, and admittedly isn't absolutely the same as each test */
/* It is, however, good enough, for our purposes (govm't work :)) */

int loop_count;
long long int *to, *from;

gettimeofday(&st, NULL);
for(loop_count=10; loop_count < 1E6; loop_count++)
{
to=tol;
from=froml;
switch(size)
{
case 1:
{
*((char*)to)=*((char*)from);
};
break;
case 2:
{
*((short*)to)=*((short*)from);
};
break;
case 4:
{
*((int*)to)=*((int*)from);










break;
case 8:


*(to)=*(from);
};
break;
case 16:


*(to++)=*(from++);
*(to++)=*(from++);
};
break;
case 32:


*(to++)=
*(to++)=
*(to++)=
*(to++)=


*(from++);
*(from++);
*(from++);
*(from++);


break;
default:
{
memcpy((long int *) to,(long int *) from, size );


I
gettimeofday(&et, NULL);

if(et.tv usec < st.tv usec)


et.tv usec += 1E6;
et.tv sec -= 1;

/* calculate elapsed time in us */
elapsed=(et.tv_sec - st.tv_sec)*1E6 +
(et.tv_usec - st.tv_usec);

if (elapsed > (TIMEPERPOINT/10))
return loop_count;


int print_latency_summary(void)
{
int uMsgSize=1;
int size var;
int repvar;
/* print out summary of results */
printf("--- ----------------------- \n");
printf("--------- LATENCY SUMMARY --------------\n");
printf("------------------------ ---\n");
printf("Size,Max,Min,Avg,Max-Min,((Max-Min)/Min)\n");
printf("------------------------ ---\n");


for (size var = 0; uMsgSize <= MAXSIZE; size var++)











total=0;
max=0;
min=10000;
for(rep_var = 0; repvar < NUM REPS; rep_var++)
{
if(latencies[sizevar][repvar+l] > max)
max=latencies[size_var] [rep_var+l];

if(latencies[size_var] [repvar+l] < min)
min=latencies[size_var] [repvar+1];

total = total + latencies[sizevar] [rep_var+l];

avg=total/NUM REPS;

printf("%d,%10.6f,%10.6f,%10.6f, %10.6f, %10.6f, ", uMsgSize, max, min, avg, (max-min), ((max-
min)/min) );
/* Guage accuracy of results */
if( (max-min) > (min*ACCURACY) )
printf("Insufficient accuracy ");
printf("\n");

if(uMsgSize==1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
I
return 0;



int print_throughput_summary(void)
{
int uMsgSize=1;
int size var;
int repvar;
/* print out summary of results */
printf("--- ----------------------- n");
printf("--------- THROUGHPUT SUMMARY -----------\n");
printf("--- ----------------------- n");
printf("Size,Max,Min,Avg,Max-Min,((Max-Min)/Min)\n");
printf("--- ----------------------- n");


for (sizevar = 0; uMsgSize <= MAXSIZE; size_var++)
{
total=0;
max=0;
min= 10000;
for(rep_var = 0; repvar < NUM REPS; rep_var++)
{
if(throughputs[sizevar] [repvar+l] > max)
max=throughputs[size var][rep var+1];


if(throughputs[size var] [rep var+l] < min)










min=throughputs[sizevar] [repvar+l];

total = total + throughputs[sizevar][repvar+l];
I
avg=total/NUM REPS;

printf("%d,%10.6f,%10.6f,%10.6f, %10.6f, %10.6f, ", uMsgSize, max, min, avg, (max-min), ((max-
min)/min) );
/* Guage accuracy of results */
if( (max-min) > (min*ACCURACY) )
printf("Insufficient accuracy ");
printf("\n");

if(uMsgSize== 1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
I
return 0;
}


int print_latency_array (void)
{
int uMsgSize=1;
int size var;
int repvar;
printf("\n");
printf("---------------------------------- ------- n");
printf("------------ RAW LATENCY DATA ---------------------\n");
printf("---------------------------------- ------- n");
printf("Size,Repl,Rep2,Rep3,Rep4,Rep5,Rep6,Rep7,Rep8,Rep9,Repl0 \n");
printf("---------------------------------- ------- n");
for (sizevar = 0; uMsgSize <= MAXSIZE; sizevar++)
{
/* printf("%d, ", uMsgSize);*/
printf("%10.6f, ", latencies[size_var][0]);
for(repvar = 0; repvar < NUM REPS; repvar++)
printf("%10.6f, ", latencies[sizevar] [repvar+l]);
printf("\n");

if(uMsgSize== 1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;

return 0;
}

int print_throughput_array (void)
{
int uMsgSize=1;
int size var;
int repvar;
printf("\n");
printf("---------------------------------- ------- n");










printf("------------ RAW THROUGHPUT DATA -------------\n");
printf("---------------------------------- -----\n");
printf("Size,Repl,Rep2,Rep3,Rep4,Rep5,Rep6,Rep7,Rep8,Rep9,Repl0 \n");
printf("---------------------------------- -----\n");

for (sizevar = 0; uMsgSize <= MAXSIZE; sizevar++)
{
/* printf("%d, ", uMsgSize); */
printf("%10.6f, ", throughputs[sizevar] [0]);
for(rep_var = 0; repvar < NUM REPS; rep_var++)
printf("%10.6f, ", throughputs[sizevar][repvar+l]);
printf("\n");

if(uMsgSize== 1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;

return 0;

















APPENDIX B
MPIBENCH CODE LISTING

/********************************************************
* MPIBENCH *
* Portable Message Passing benchmarks *
* *
* Damian M. Gonzalez gonzalez@hcs.ufl.edu *
* HCS Lab University of Florida, Gainesville *
********************************************************/

/********************************************************
* This is originally written for the ScaMPI *
* implementation. Alter as necessary to compile for *
* other MPI implementations. *
********************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>

#include "/opt/scali/include/mpi.h"

/* for timing */
struct timeval st, et;
double elapsed;
time_t timevar; /* to print out date of test later */

#define MB 1024*1024
#define KB 1024

#define ACCURACY 0.1 /* A warning is printed if (max-min) is greater than
(min*ACCURACY) */

#define MAXSIZE 256*KB /* LARGE in bytes */
/*#define MAXSIZE 4*KB /* SMALL in bytes */

#define NUMREPS 10 /* number of iterations made to determine max, min, avg */
#define TIMEPERPOINT 2E4 /* self explanatory, in microseconds */
#define RESOLUTION 8192 /* interval between successive points, in bytes */
/* if this is zero, a powers of two analysis is performed */

#define TESTTYPE 1 /* 1= one way test */
/* 2= ping pong test */




/* arrays to hold the raw data, for the max min avg calculations later */










/* one extra to contain the message size at location [0] for validation */
/* Note: need to change this later when I implement the variable # of reps */
double latencies[MAXSIZE/RESOLUTION+1] [NUMREPS+1];
double throughputs[MAXSIZE/RESOLUTION+1] [NUMREPS+1];
double min,max,avg,total;
/* to hold the number of repetitions for each message size that's calculated at the beginning */
/* width is two so that it may contain both message size [0] and the repetitions value [1] */
int repetitions[MAXSIZE/RESOLUTION] [2];
int rank; /* needs to be globally available */

int main (int argc, char **argv)
{
char host[20]; /* to contain the host name for printing out later */

int Stat; /* to contain status after MPIBarrier() */
int comm size;
int send_index, rec_index;
int j; /* loop variable for synchronization runs */
int sizevar, repvar;
int *buff, uMsgSize, total_loops, num datapoints, temp;

MPI Status status;

setvbuf(stdout, NULL, _IONBF, 0); /* simply to facilitate the printing of the dots during the
iterations */
setvbuf(stdin, NULL, _IONBF, 0);

MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank);
MPI_Comm_size( MPI_COMM_WORLD, &commsize);

buff = malloc(MAXSIZE);

Stat = MPI_Barrier(MPI_COMM_WORLD);
gethostname(&host[0], 20);
printf("rank:%/d, host:%s\n",rank,&host[0]);

if (rank == 1)
{
time(&timevar);
printf("%s", ctime(&timevar));

/* Calculating the time for the test */
if(RESOLUTION>0)
{
num datapoints=MAXSIZE/RESOLUTION;
I
else
{
temp=MAXSIZE;
num datapoints=0;
while(temp > 1)
{
temp=temp/2;
num datapoints++;










I
if(TEST_TYPE==1)
printf("---------- ONE WAY TEST -----------------\n");
if(TEST_TYPE==2)
printf("---------- PING PONG TEST ---------------\n");

printf("Maximum message size:0/od (%/d points/iteration) \n", MAXSIZE,
(num datapoints+2));
printf("Number of outer loops:%d \n", NUM_REPS);
printf("Number of seconds per datapoint:%10.6f \n", (TIMEPERPOINT/1E6));

printf("Anticipated duration of main test:%10.6f minutes\n",
((NUM_REPS*(TIMEPERPOINT/1E6)*(num datapoints+1))/(60)));
printf("------------------------------------- \n");


synch(5);
/* fill out repetitions array */
getreps();
synch(100);

for (rep_var = 0; repvar < NUM REPS; rep_var++)
{
uMsgSize=0;

for (sizevar = 0; uMsgSize <= MAXSIZE; sizevar++)
{
if(rank== 1)
printf(".");
if(uMsgSize==0)
total_loops=repetitions[0][1]*10; /* couldn't perform getreps for zero size */
else
total_loops = repetitions[size_var-1][1]*10;
/*
printf("rank:%d uMsgSize=%d repetitions[%od][0]=%d repetitions[%d][1l]=%d %d\n",
rank,
uMsgSize,
size var,
repetitions[sizevar] [0],
size var,
repetitions[sizevar] [1],
total_loops);
*/

switch (TEST_TYPE)
{
case 1: /* 1= one way test */
{
if (rank == 0)
{
for (rec_index=0; rec_index < total_loops; rec_index++)
{
MPI_Recv (buff, uMsgSize/4, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
}
MPI_Send (buff, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);










else
{
/* get starting time */
gettimeofday(&st, NULL);
for (send_index=0; send_index < total_loops; send_index++)
{
MPI_Send (buff, uMsgSize/4, MPI_INT, 0, 1, MPI_COMM_WORLD);
}
MPI_Recv (buff, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
gettimeofday(&et, NULL);
/* get final time */

if(et.tvusec < st.tvusec)
{
et.tv usec += 1E6;
et.tv sec -= 1;
}
/* calculate elapsed time in us */
elapsed=(et.tv_sec - st.tv_sec)*1E6 +
(et.tv_usec - st.tv_usec);

/* Double-checking position in array */
if(repvar==0)
{
latencies[sizevar] [0] = uMsgSize;
throughputs[sizevar] [0] = uMsgSize;
}
else
{
if(latencies[size_var] [0] != uMsgSize)
printf("Latency Array Error (latencies[%d] [0]=%10.6f uMsgSize=%d)\n",
size var,
latencies[size_var] [0],
uMsgSize);

if(throughputs[sizevar][0] != uMsgSize)
printf("Throughput Array Error(throughputs[%d] [0]=%10.6f
uMsgSize=%d)\n",
size var,
throughputs[sizevar] [0],
uMsgSize);



latencies[sizevar][repvar+l] = elapsed/total loops;
throughputs[sizevar] [repvar+l] =
(uMsgSize*total loops)/(((elapsed)/1E6)*1048576);

/*
printf("%7d,%8.5f,%6d,%10.6f,%9.3f\n",
uMsgSize,
elapsed/lE6,
total_loops,
elapsed/total loops,
(uMsgSize*total_loops)/((elapsed/1E6)*1048576));*/










}
break;
case 2: /* 2= ping pong test */
{
if (rank == 0)
{
for (rec_index=0; rec_index < total loops; rec_index++)
{
MPI_Recv (buff, uMsgSize/4, MPIINT, 1, 1, MPI_COMM_WORLD, &status);
MPI_Send (buff, uMsgSize/4, MPI_INT, 1, 1, MPI_COMM_WORLD);
}
}
else
{
/* get starting time */
gettimeofday(&st, NULL);
for (sendindex=0; sendindex < total loops; sendindex++)
{
MPI_Send (buff, uMsgSize/4, MPIINT, 0, 1, MPI_COMM_WORLD);
MPIRecv (buff, uMsgSize/4, MPI_INT, 0, 1, MPI_COMMWORLD, &status);
}
gettimeofday(&et, NULL);
/* get final time */

if(et.tvusec < st.tvusec)
{
et.tv usec += 1E6;
et.tv sec -= 1;
}
/* calculate elapsed time in us */
elapsed=(et.tv_sec - st.tv_sec)*1E6 +
(et.tv_usec - st.tv_usec);

if(repvar==0)
{
latencies[sizevar] [0] = uMsgSize;
throughputs[sizevar] [0] = uMsgSize;
}
else
{
if(latencies[size_var] [0] != uMsgSize)
printf("Latency Array Error (latencies[%d] [0]=%10.6f uMsgSize=%d)\n",
size var,
latencies[size_var] [0],
uMsgSize);

if(throughputs[sizevar][0] != uMsgSize)
printf("Throughput Array Error(throughputs[%d] [0]=%10.6f
uMsgSize=%d)\n",
size var,
throughputs[sizevar] [0],
uMsgSize);

latencies[sizevar][repvar+1] = (elapsed/2)/total_loops;










throughputs[sizevar] [rep_var+l] =
(uMsgSize*total loops)/((((elapsed/2))/1E6)*1048576);
/*
printf("%7d,%8.5f,%6d,%10.6f,%9.3f\n",
uMsgSize,
(elapsed)/1E6,
total_loops,
(elapsed/2)/total loops,
(uMsgSize*total_loops)/(((elapsed/2)/1E6)*1048576));*/


}
break;
default:
{
printf ("Invalid test type.\n");
break;
}
}
fflush (stdout);

if(RESOLUTION > 0)
uMsgSize=uMsgSize+RESOLUTION;

else
{
if(uMsgSize==0)
uMsgSize=1;
else
uMsgSize*=2;


I
if(rank==l)
printf("\n %d / %d complete\n",rep var+1,NUM REPS);


/* Calculate and print Minimum, Maximum and Average Latency and Throughput for each size
*/
if(rank== 1)


print latency_array();
print throughput array();
print latency_summary();
print throughput summary();


MPIFinalize();
return 0;
}

int getreps(void)
{
int loop_count;
int *tempbuff;
int size var;
int message_size=4; /* can't co-ordinate the end of testing if size is zero!! */
intj;











int start size=5;
int increment=200;

MPI Status status;

tempbuff = malloc (MAXSIZE * sizeof (char));

for (sizevar = 0; message_size <= MAXSIZE; size_var++)
{
*tempbuff=0;
switch (TESTTYPE)
{
case 1:
{
if (rank ==0)
{
for(loop_count=startsize;loop_count<1E8;loop_count+=increment)
{
for (j = 0; j < loop_count; j++)
{
MPIRecv (tempbuff, message_size/4 MPI_INT, 1, 1,
MPICOMMWORLD, &status);
if(*tempbuff==9) /* this is just a local read, shouldn't take too long */
break;
}
if(*tempbuff==9) /* this is just a local read, shouldn't take too long */
{
loop_count-=increment;/* rank zero will have counted one more, subtract this */
break;
}
MPI_Send (tempbuff,1, MPI_INT, 1, 1, MPI_COMMWORLD);
}
}
else
{
for(loop_count=startsize;loop_count< 1E8;loop_count+=increment)
{
gettimeofday(&st, NULL);
for (j = 0; j < loop_count; j++)
{
MPI_Send (tempbuff, message_size/4 MPI_INT, 0, 1,
MPICOMMWORLD);
}
MPIRecv (tempbuff, 1, MPIINT, 0, 1, MPI_COMM_WORLD, &status);
gettimeofday(&et, NULL);

if(et.tvusec < st.tvusec)
{
et.tv usec += 1E6;
et.tv sec -= 1;
}
/* calculate elapsed time in us */
elapsed=(et.tv_sec - st.tv_sec)*1E6 +
(et.tv_usec - st.tv_usec);












/* if we've reached 1/10 the time per point, return the value */
if (elapsed > (TIMEPERPOINT/10))
{
*tempbuff=9;
MPI_Send (tempbuff, message_size/4, MPI_INT, 0, 1,
MPI_COMMWORLD);
break;
}
}
}
}
break;
case 2:
{
if (rank ==0)
{
for(loop_count=startsize;loop_count< 1E8;loop_count++)
{
MPIRecv (tempbuff, message_size/4, MPI_INT, 1, 1, MPICOMMWORLD,
&status);
if(*tempbuff==9) /* this is just a local read, shouldn't take too long */
{
loop_count--;
break;
}
MPI_Send (tempbuff, message_size/4, MPI_INT, 1, 1, MPI_COMMWORLD);
}
}
else
{
gettimeofday(&st, NULL);
for(loop_count=startsize;loop_count< 1E8;loop_count++)
{
MPISend (tempbuff, message_size/4, MPI_INT, 0, 1, MPICOMMWORLD);
MPIRecv (temp_buff, message_size/4, MPIINT, 0, 1, MPI_COMM_WORLD,
&status);

gettimeofday(&et, NULL);

if(et.tvusec < st.tvusec)
{
et.tv usec += 1E6;
et.tv sec -= 1;
}
/* calculate elapsed time in us */
elapsed=(et.tv_sec - st.tv_sec)*1E6 +
(et.tv_usec - st.tv_usec);

/* if we've reached 1/10 the time per point, return the value */
if (elapsed > (TIMEPERPOINT/10))
{
*tempbuff=9;
MPI_Send (tempbuff, message_size/4, MPI_INT, 0, 1,
MPICOMM WORLD);










break;
}
}
}
}
break;
default:
{
printf ("Invalid test type.\n");
break;
}

/*
if (rank== 1)
printf("rank:%d sizevar-%d message_size=%d elapsed=%10.6f loop_count=%d\n",
rank, sizevar, message_size, elapsed, loopcount);
*/
repetitions[sizevar][0]=message_size;
repetitions[size_var][1]=loop_count;

if(RESOLUTION > 0)
{
message_size=message_size+RESOLUTION;
I
else
message_size*=2;
}



int print latency_summary(void)
{
int uMsgSize=1;
int size var;
int repvar;
/* print out summary of results */
printf("-- ---------------------- ---n");
printf("--------- LATENCY SUMMARY --------------\n");
printf("------------------------------n");
printf("Size,Max,Min,Avg,Max-Min,((Max-Min)/Min)\n");
printf("--- ----------------------- n");


for (sizevar = 0; uMsgSize <= MAXSIZE; size_var++)
{
total=0;
max=0;
min=10000;
for(rep_var = 0; rep_var < NUMREPS; rep_var++)
{
if(latencies[sizevar] [rep_var+1] > max)
max=latencies[size_var] [rep_var+l];

if(latencies[size_var] [rep_var+1] < min)
min=latencies[size var][repvar+1];










total = total + latencies[sizevar] [rep_var+l];
I
avg=total/NUM REPS;

printf("%d,o%10.6f,%10.6f,%10.6f, %10.6f, %10.6f, ", uMsgSize, max, min, avg, (max-min),
((max-min)/min));
/* Guage accuracy of results */
if( (max-min) > (min*ACCURACY) )
printf("Insufficient accuracy ");
printf("\n");

if(uMsgSize==1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
I
return 0;
}
int print throughput summary(void)
{
int uMsgSize=1;
int size var;
int repvar;
/* print out summary of results */
printf("--- ----------------------- n");
printf("--------- THROUGHPUT SUMMARY -----------\n");
printf("------------------------ ---n");
printf("Size,Max,Min,Avg,Max-Min,((Max-Min)/Min)\n");
printf("------------------------ ---n");


for (sizevar = 0; uMsgSize <= MAXSIZE; size_var++)
{
total=0;
max=0;
min= 10000;
for(rep_var = 0; repvar < NUM REPS; rep_var++)
{
if(throughputs[sizevar] [repvar+l] > max)
max=throughputs[sizevar] [rep_var+l];

if(throughputs[sizevar] [repvar+l] < min)
min=throughputs[sizevar] [repvar+l];

total = total + throughputs[sizevar] [repvar+l];

avg=total/NUM REPS;

printf("%d,%10.6f,%10.10.6f.6 0.6f, %10.6f, % %1010f, ", uMsgSize, max, min, avg, (max-min),
((max-min)/min));
/* Guage accuracy of results */
if( (max-min) > (min*ACCURACY) )
printf("Insufficient accuracy ");
printf("\n");


if(uMsgSize==l)










uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
I
return 0;
}
int print latency_array (void)
{
int uMsgSize=1;
int size var;
int repvar;
printf("\n");
printf("---------------------------------- ------- n");
printf("------------ RAW LATENCY DATA ----------------\n");
printf("---------------------------------- ------- n");
printf("Size,Repl,Rep2,Rep3,Rep4,Rep5,Rep6,Rep7,Rep8,Rep9,Repl0 \n");
printf("---------------------------------- ------- n");
for (sizevar = 0; uMsgSize <= MAXSIZE; sizevar++)
{
/* printf("%d, ", uMsgSize);*/
printf("%10.6f, ", latencies[size_var][0]);
for(repvar = 0; repvar < NUM REPS; repvar++)
printf("%10.6f, ", latencies[sizevar] [rep_var+l]);
printf("\n");

if(uMsgSize== 1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;

return 0;
}

int print throughputarray (void)
{
int uMsgSize=1;
int size var;
int repvar;
printf("\n");
printf(" ------------------------- -------- n");
printf("------------ RAW THROUGHPUT DATA -------------\n");
printf("---------------------------------- ------- n");
printf("Size,Repl,Rep2,Rep3,Rep4,Rep5,Rep6,Rep7,Rep8,Rep9,Repl0 \n");
printf("---------------------------------- ------- n");

for (sizevar = 0; uMsgSize <= MAXSIZE; sizevar++)
{
/* printf("%d, ", uMsgSize); */
printf("%10.6f, ", throughputs[sizevar] [0]);
for(repvar = 0; repvar < NUM REPS; repvar++)
printf("%10.6f, ", throughputs[sizevar][repvar+l]);
printf("\n");

if(uMsgSize== 1)
uMsgSize=RESOLUTION;
else










uMsgSize = uMsgSize+RESOLUTION;

return 0;
I



int synch(int reps)
{
int j;
int *buff;
MPI_Status status;
buff = malloc (MAXSIZE * sizeof (char));


for (j = 0; j < reps; j++)
{
/* synchronization runs */
if (rank==0)
{
MPI_Recv (buff, 0, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
MPI_Send (buff, 0, MPI_INT, 1, 1, MPI_COMM_WORLD);
}
else
{
MPI_Send (buff, 0, MPI_INT, 0, 1, MPI_COMM_WORLD);
MPI_Recv (buff, 0, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
}
return 0;
}















LIST OF REFERENCES


[1] Bennett A., Field A., Harrison P., Modeling and Validation of Shared Memory
Coherency Protocols, Performance Evaluation 28 (1996) 541-562.

[2] Boden N., Cohen D., Felderman R., Kulawik A., Seitz C., Seizovic J., Su W., Myrinet: A
Gigabit-per-Second Local Area Network, IEEE Micro 15 (1) (1995) 26-36.

[3] Brewer T., Astfalk G., The Evolution of the HP/Convex Exemplar, in: Proceedings of
COMPCON '97, San Jose, CA, February 1997, pp. 81-86.

[4] Bugge H., Affordable Scalability using Multicubes, in: H. Hellwagner, A. Reinefeld
(Eds.), SCI: Scalable Coherent Interface, LNCS State-of-the-Art Survey (Springer,
Berlin, 1999) 167-174.

[5] Clark R., SCI Interconnect Chipset and Adapter: Building Large Scale Enterprise Servers
with Pentium II Xeon SHV Nodes, White Paper, Data General Corp., Hopkinton, MA,
1998.

[6] Giganet Inc., Giganet: Building a Scalable Internet Infrastructure with Windows 2000
and Linux, White Paper, Giganet Inc., Concord, MA, 1999.

[7] Horn G., Scalability of SCI Ringlets, in: H. Hellwagner, A. Reinefeld (Eds.), SCI:
Scalable Coherent Interface, LNCS State-of-the-Art Survey (Springer, Berlin, 1999) 151-
165.

[8] Huse L., Omang K., Bugge H., Ry H., Haugsdal A., Rustad E., ScaMPI Design and
Implementation, in: H. Hellwagner, A. Reinefeld (Eds.), SCI: Scalable Coherent
Interface, LNCS State-of-the-Art Survey (Springer, Berlin, 1999) 249-261.

[9] Hellwagner H., Reinefeld A., SCI: Scalable Coherent Interface, LNCS State-of-the-Art
Survey (Springer, Berlin, 1999).

[10] International Business Machines Corp., The IBM NUMA-Q Enterprise Server
Architecture: Solving issues of Latency and Scalability in Multiprocessor Systems, White
Paper, International Business Machines Corp., Armonk, NY, 2000.

[11] IEEE, SCI: Scalable Coherent Interface, IEEE Approved Standard 1596-1992,
Piscataway, NJ, 1992.









[12] Kurmann C., Stricker T., A Comparison of Three Gigabit Technologies: SCI, Myrinet
and SGI/Cray T3D, in: Proceedings of SCI Europe '98, Bordeaux, France, September
1998, pp. 29-40.

[13] Omang K., SCI Clustering through the I/O bus: A Performance and Functionality
Analysis, Ph.D. thesis, Department of Informatics, University of Oslo, Norway, 1998.

[14] Sarwar M., George A., Simulative Performance Analysis of Distributed Switching
Fabrics for SCI-based Systems, Microprocessors and Microsystems 24 (1) (2000) 1-11.

[15] Scali Computer AS, Scali System Guide Version 2.0, White Paper, Scali Computer AS,
Oslo, Norway, 2000.

[16] Scott S., Goodman J., Vernon M., Performance of the SCI Ring, in: Proceedings of the
19th Annual International Symposium on Computer Architecture, Gold Coast, Australia,
May 1992, pp. 403-414.

[17] Windham W., Hudgins C., Schroeder J., Vertal M., An Animated Graphical Simulator for
the IEEE 1596 Scalable Coherent Interface with Real-Time Extensions, Computing for
Engineers 12 (1997) 8-13.

[18] Wu B., The Applications of the Scalable Coherent Interface in Large Data Acquisition
Systems in High Energy Physics, Ph.D. thesis, Department of Informatics, University of
Oslo, Norway, 1996.















BIOGRAPHICAL SKETCH

Damian Mark Gonzalez was born on March 5th, 1974, in San Fernando, Trinidad

and Tobago. At 19 years of age, after successfully completing the GCE Advanced Level

examinations, he left Trinidad to pursue educational opportunities as an international

student at Florida International University in Miami, Florida. During his five years at

FIU, he obtained a broad liberal arts education through his involvement in the FIU

Honors Program for four consecutive years, and also as an exchange student in England

at the University of Hull in Spring 1996. In Spring 1998, he graduated magna cum laude

from FIU with a Bachelor of Science degree in electrical engineering, and a minor in

computer science.

A desire for advanced study at a well-recognized research university led him to

the University of Florida where he pursued a Master of Science degree in the Department

of Electrical and Computer Engineering. At UF, he took advantage of the opportunity to

gain valuable technical experience and contribute to the research community as part of

the High-performance Computing and Simulation Research Laboratory (HCS).

Consistent academic performance, coupled with his experience at HCS and

elsewhere, helped to secure him a position as a software engineer with Motorola's Paging

Products Group in Boynton Beach, Florida. He is very excited about this opportunity,

and looks forward to the many new experiences that lie ahead.



