PERFORMANCE MODELING AND EVALUATION OF TOPOLOGIES FOR
LOW-LATENCY SCI SYSTEMS
By
DAMIAN MARK GONZALEZ
A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
UNIVERSITY OF FLORIDA
2000
Copyright 2000
by
Damian Mark Gonzalez
For my family
ACKNOWLEDGMENTS
I would like to thank Dr. Alan George for giving me the opportunity to gain
valuable technical experience at the HCS Lab, and for insisting on high standards. I also
wish to thank Matthew Chidester, Kyu-Sang Park (both at the University of Florida) and
Håkon Bugge (at Scali Computer AS, Norway) for their timely and valuable assistance
throughout the development of this work. Thanks also go to the members of the HCS
Lab past and present who have each taught me valuable lessons along the way, and
helped make the experience worthwhile.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTERS
1 INTRODUCTION
2 RELATED RESEARCH
3 OVERVIEW OF SCI
4 ANALYTICAL INVESTIGATION
4.1 Point-to-Point Latency Model
4.2 Average Latency Model
5 EXPERIMENTAL INVESTIGATION
5.1 Benchmark Design
5.2 Ring Experiments
5.3 Torus Experiments
5.4 Validation
6 ANALYTICAL PROJECTIONS
6.1 Current System
6.2 Enhanced System
7 CONCLUSIONS
APPENDICES
A SCIBENCH CODE LISTING
B MPIBENCH CODE LISTING
LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
1: Estimates of experimental latency components.
2: Summary of crossover points (in nodes).
LIST OF FIGURES
1: SCI subactions.
2: Topology alternatives.
3: Latency components for a point-to-point transaction on a 3x3 torus.
4: Architectural components of a Wulfkit SCI NIC.
5: Comparison of MPI and API latency on a two-node ring.
6: One-way (OW) testing scheme.
7: Ping-pong (PP) testing scheme.
8: Shared-memory testing environment.
9: Analysis of PP testing.
10: Ring test configurations.
11: Torus test configurations.
12: Comparison of calculated latency components.
13: Validation of analytical model.
14: Inter-topology comparison of current system.
15: Inter-topology comparison of enhanced system.
Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
PERFORMANCE MODELING AND EVALUATION OF TOPOLOGIES FOR
LOW-LATENCY SCI SYSTEMS
By
Damian Mark Gonzalez
December 2000
Chairman: Alan D. George
Major Department: Electrical and Computer Engineering
This thesis presents an analytical performance characterization and topology
comparison from a latency perspective for the Scalable Coherent Interface (SCI).
Experimental methods are used to determine constituent latency components and verify
the results obtained by these analytical models as close approximations of reality. In
contrast with simulative models, analytical SCI models are faster to solve, yielding
accurate performance estimates very quickly, and thereby broadening the design space
that can be explored. Ultimately, the results obtained here serve to identify optimum
topology types for a range of system sizes based on the latency performance of common
parallel application demands.
CHAPTER 1
INTRODUCTION
Modern supercomputing is increasingly characterized by a shift away from the
traditional monolithic supercomputing systems toward a new generation of systems using
commercial-off-the-shelf (COTS) computers, tightly integrated with high-performance
System Area Networks (SANs). Together, these computers and interconnection networks
form a distributed-memory multicomputer or cluster that offers significantly better
price/performance than traditional supercomputers.
A fundamental challenge faced in designing these parallel processing systems is
that of interconnect performance. The Scalable Coherent Interface (SCI), ANSI/IEEE
Standard 1596-1992 [11] addresses this need by providing a high-performance
interconnect specifically designed to support the unique demands of parallel processing
systems. SCI offers considerable flexibility in topology choices, all based on the
fundamental structure of a ring. However, since a message from one node in a ring must
traverse every other node in that ring, this topology becomes inefficient as the number of
nodes increases. Multi-dimensional topologies and/or switches are used to minimize the
traffic paths and congestion in larger systems.
Before making design decisions between such elaborate topology alternatives, it
is first necessary to evaluate the relative performance of available topology choices
without incurring the expense of constructing a complete system. Toward this end, this
thesis presents analytical models for various SCI topologies from a latency perspective,
using experimentally-derived parameters as inputs, and later validating the models against
experimental results. These validated models are then used to project tradeoffs between
topology choices and their suitability in handling common application demands. Similar
topology projections are also performed for a conceptual system featuring enhanced
switching performance.
The remainder of the thesis is organized as follows. Chapter 2 provides an
overview of related research in this area. Chapter 3 introduces the fundamentals of SCI
communication. Chapter 4 presents a derivation of analytical models based on these
fundamentals. Chapter 5 provides a description of the experimental testbed, the
calculation of experimentally derived input parameters for the models, and a validation of
the analytical models against equivalent experimental results. In Chapter 6, the models
are used to predict the performance of topology types that exceed current testbed
capabilities. Finally, Chapter 7 presents conclusions and suggests directions for future
research.
CHAPTER 2
RELATED RESEARCH
The SCI standard originated out of efforts to develop a high-performance bus that
would overcome the inherent serial bottlenecks in traditional memory buses. SCI-related
research has since progressed in many different directions, such as the use of SCI in
distributed I/O systems [18] and as a SAN interconnect [9].
Significant progress has been made in the use of SCI as a SAN interconnect
interfacing with the I/O bus. Hellwagner and Reinefeld [9] present a survey of
representative samples of such work, demonstrating results achieved thus far in a variety
of related areas. These samples include contributions in the basic definitions, hardware,
performance comparisons, implementation experiences, low-level software, higher-level
software and management tools.
Other parallel interconnects have since entered the arena, and each finds its own
niche of support. Competing interconnects include the Myrinet network [2] from
Myricom and the cLAN network from Giganet [6]. To some extent, ATM and Gigabit
Ethernet are also used in clustering solutions, but their performance characteristics (e.g.
one-way latencies on the order of 200 μs for GbE and OC-3 ATM using TCP/IP [13])
make them poorly suited for use as interconnects for latency-sensitive parallel systems.
A comparison of SCI, Myrinet and the Cray T3D interconnect has been performed by
Kurmann and Stricker [12].
Simulative models of SCI have been used to investigate issues such as fault
tolerance [14] and real-time optimizations [17]. However, simulative modeling often
requires several hours to simulate a few seconds of real execution time with any degree
of accuracy. An analytical model is orders of magnitude faster to solve, yielding
performance estimates very quickly, and thereby broadening the design space that can be
explored. Analytical modeling therefore provides a means to project network behavior in
the absence of an expensive hardware testbed and without requiring the use of complex,
computationally-intensive simulative models.
Analytical modeling of SCI has traditionally focused on cache coherency
modeling [1] or queue modeling [16] of SCI components. Relatively little work exists
for analytical modeling of SCI developed from an architectural perspective. Such a
perspective is necessary to identify bottlenecks for various systems and provide insight
into scalability and performance as a function of architectural system elements. Such an
architecturally-motivated analytical model would also offer valuable insight into the
suitability of a given system for handling common types of parallel communication
behavior.
Horn [7] follows such an architecturally-motivated approach, using information
about SCI packet types to develop an analytical representation of the interaction of
packets during an SCI transaction sequence. He develops a throughput model for a single
ring, and presents a single chart of results showing the scalability of the SCI ring for
different PCI bandwidth capabilities. This model demonstrates scalability issues from a
throughput perspective, but does not include a latency study and does not investigate
topology types beyond the basic ring. Moreover, no validation of the model used in this
study was provided.
Bugge [4] uses knowledge about the underlying hardware, coupled with an
understanding of traffic patterns of all-to-all communication to develop an analytical
throughput model for all-to-all communication on SCI. He shows the scalability of
various multicube topologies ranging from rings to four-dimensional tori. This study
makes topology recommendations for varying system sizes, based on a throughput study,
but does not include a similar investigation using a latency approach, and does not
investigate other types of traffic patterns. This throughput study also lacks a validation
exercise.
The simulative study of SCI fault tolerance performed by Sarwar and George [14]
presents analytical derivations for average paths taken by SCI request and response
packets for one- and two-dimensional topologies, paving the way for extrapolation of
topologies to higher degrees. These analytical expressions are used for verification of
simulative results, but no validations are made using experimental data.
This thesis complements and extends previous work by providing meaningful
performance projections of multiple SCI topology types using an architecturally
motivated analytical approach. In so doing, several contributions are achieved. In
contrast with existing throughput studies, latency performance is used as a basis for
comparison, since latency is a key characteristic of high-speed networks for many
applications on scalable parallel systems. Analytical models are derived and validated
against experimental data for traffic patterns that are representative of basic
communication in parallel applications. Finally, performance projections are rendered
for scalable systems with up to one-thousand nodes in terms of current and emerging
component characteristics.
The following chapter provides an overview of SCI communication as
background for subsequent development of analytical representations of latency.
CHAPTER 3
OVERVIEW OF SCI
The SCI standard was developed over the course of approximately four years, and
involved participation from a wide variety of companies and academic institutions. This
standard describes a packet-based protocol using unidirectional links that provides
participating nodes with a shared memory view of the system. It specifies transactions
for reading and writing to a shared address space, and features a detailed specification of
a distributed, directory-based, cache-coherence protocol.
Commercially available SCI-based systems follow two design classifications.
One class consists of parallel computers that employ memory bus interfaces based on
SCI, such as the Data General AViiON [5], the IBM/Sequent NUMA-Q [10], and the
HP/Convex Exemplar [3]. The second class consists of SCI-based network interface
cards (NICs) and switches for the construction of workstation and PC clusters using an
I/O bus interface, such as the Dolphin/Scali Wulfkit [15].
SCI offers many clear advantages for the unique nature of parallel computing
demands. Perhaps the most significant of these advantages is its low-latency
performance. This fundamental characteristic makes SCI well suited to support finer-
grained parallel computations. Typical systems can achieve single-digit microsecond
latency performance. SCI also offers a link data rate of 3.2 Gb/s in current systems. Yet
another advantage in using SCI is that, unlike competing systems, SCI offers support for
both the shared-memory and message-passing paradigms.
The analytical latency models developed in this thesis rely upon an understanding
of the fundamental SCI packet types and the ways in which they interact during a single
transaction. A typical transaction consists of two subactions, a request subaction and a
response subaction, as shown in Figure 1.
(Diagram: the request send and request echo form the request subaction; the response send and response echo form the response subaction.)
Figure 1: SCI subactions.
For the request subaction, a request packet (read or write) is sent by a requesting
node, destined for a recipient node. The recipient or responder node sends an echo
packet back to the requesting node to acknowledge receipt of the request packet. The
recipient simultaneously processes the request and then delivers its own response packet
to the network to begin the response subaction. This packet is received at the original
requesting node, and another echo is sent along the ring to the recipient to acknowledge
receipt of this response packet.
A somewhat more complicated situation arises when the source and destination
nodes do not reside on the same ring. In such a case, there are one or more intermediate
agents that accept the request packet and then act on behalf of the requester, forwarding
the packet along the new ring, and on toward the final destination. In this regard, a node
on an SCI torus topology that enacts a change in dimension acts as an agent for that
transaction.
In SCI, data is represented in terms of symbols with a symbol being a 16-bit word
(2 bytes). All transmissions are conducted based on units of symbols and multiples
thereof. Current implementations support both 16-byte (8-symbol) and 64-byte (32-
symbol) packet payload sizes. The following chapter describes the development of
analytical representations of SCI transactions using knowledge of these basic packet
types and their interaction during an SCI transaction sequence.
CHAPTER 4
ANALYTICAL INVESTIGATION
The topologies considered in this study range from simple rings to multi-
dimensional tori. This framework is shown in Figure 2. Subsequent experimentation
explores topologies having a maximum of nine nodes and two dimensions, but ultimately
analytical models are used to predict the relative performance of systems that exceed
these limits.
(Diagram: topology alternatives arranged by number of dimensions, D, ranging from a ring to multi-dimensional tori.)
Figure 2: Topology alternatives.
Rings are useful because of their simplicity. They are straightforward, and no
routing is required. However, they are not scalable since it becomes inefficient for each
node to share the ring bandwidth with traffic generated by every other node in the
network. From a latency perspective also, scalability is inhibited since a round-trip
message from one node in a ring must traverse every other node in the ring. Multi-
dimensional tori address this problem by minimizing the length of traffic paths for point-
to-point communications. Subsequent analysis of multi-dimensional topologies assumes
an equal number of nodes in each dimension. Therefore, for a system with D dimensions
and n nodes in each dimension, the total number of nodes (i.e. system size) is equal to n^D.
Considering a point-to-point transaction on a one-dimensional topology, it is
assumed that the overhead processing time at the sender is equal to that at the receiver,
and these are each represented using the variable o. The variables l_p and l_f represent the
propagation latency per hop and the forwarding latency through a node, respectively.
The propagation latency is of course dictated by the speed of light through a medium,
whereas the forwarding latency is dependent upon the performance of the SCI adapter
interface in checking the header for routing purposes and directing the packet onto the
output link.
It is important to note that many of these events take place in parallel. For
example, for a relatively large packet, the first symbols of the packet may arrive at the
recipient before the requester has finished transmission of the complete packet onto the
network. This overlap ceases once the time spent by a packet traversing the network is
equal to the time spent putting the packet onto the physical links. Using a 16-bit wide
path, a 5 ns channel cycle time, and assuming a 40-symbol packet, the time to put this
packet onto the link is equal to 200 ns. Using typical values for forwarding and
propagation latencies (60 ns and 7 ns respectively), the time spent putting the packet onto
the link is matched by hop latencies after traversing only 3 hops. Since any overlapping
effect ceases for a relatively small number of hops, the effect of such parallel events does
not play a role in subsequent analytical development.
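The arithmetic behind this three-hop estimate is summarized below; the labels t_inject and t_3hops are introduced here only to name the two quantities being compared.

$t_{inject} = 40\ \text{symbols} \times 5\ \text{ns/symbol} = 200\ \text{ns}$
$t_{3\ hops} = 3 \times (l_f + l_p) = 3 \times (60\ \text{ns} + 7\ \text{ns}) = 201\ \text{ns} \approx t_{inject}$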
For multi-dimensional tori, there is also a switching latency (l_s) to be considered.
This component represents the time taken to switch dimensions from one ring to another
ring on a torus. The echo latencies are not considered in this model, since they take place
in parallel with the request and response latencies and do not contribute to the critical
path of latency.
Figure 3: Latency components for a point-to-point transaction on a 3x3 torus (Steps 1 through 14).
Figure 3 shows how all of these latency components play a role in a complete
request and response transaction sequence on a two-dimensional topology. Latency
components in the figure are numbered 1 through 14 to identify their ordering in time.
Step 1 represents the processing overhead in putting the request packet onto the network.
This request packet then incurs forwarding and propagation latencies (Steps 2, 3 and 4) in
traversing the horizontal ring. The packet must then switch dimensions (Step 5) and
incur forwarding and propagation latencies in traversing the vertical ring (Steps 6, 7 and
8). The request subaction is complete once the recipient incurs the processing overhead
for getting the request packet off the network (Step 9).
The response subaction begins with the processing overhead for putting the
response packet onto the network (Step 10). In traveling back to the original source, this
packet incurs a propagation latency along the vertical ring (Step 11), a switching latency
(Step 12) and then a propagation latency along the horizontal ring (Step 13). The
transaction is complete once the source node incurs the processing overhead for getting
the response packet off the network (Step 14).
At this point, it is assumed that the switching, forwarding and propagation
latencies will be largely independent of message size, since they only represent the
movement of the head of a given message. However, the overhead components rely
upon the processing of the entire message, and are therefore expected to have a
significant dependence upon message size. The validity of these assumptions is
investigated through experimental testing in Chapter 5.
4.1 Point-to-Point Latency Model
Having outlined the latency components that will be considered in this model, it is
now necessary to determine the number of times that each of these components will
appear for a given point-to-point transaction. Subsequent derivations do not incorporate
contention considerations and therefore represent the unloaded point-to-point latency.
Consider a point-to-point transaction between two nodes on an SCI network. The
overall latency of the transaction is given by
$L_{transact} = L_{request} + L_{response}$  (1)
Using h_k to represent the number of hops from the source to the destination in the
kth dimension, the transaction latency components for an SCI ring of n nodes are given by
$L_{request} = o + h_1 \times l_p + (h_1 - 1) \times l_f + o$  (2)
$L_{response} = o + (n - h_1) \times l_p + (n - h_1 - 1) \times l_f + o$  (3)
For a two-dimensional SCI torus with n nodes in each dimension, three cases can
occur depending upon the number of hops required in each of the two dimensions. If h_1 = 0 or h_2 = 0, then the previous equations can be readily applied since the transaction takes place on a single ring. For the third case, where h_1 ≠ 0 and h_2 ≠ 0, the request and
response latencies are given by
$L_{request} = o + h_1 \times l_p + (h_1 - 1) \times l_f + l_s + h_2 \times l_p + (h_2 - 1) \times l_f + o$
$= 2 \times o + [h_1 + h_2] \times l_p + [(h_1 - 1) + (h_2 - 1)] \times l_f + l_s$  (4)
$L_{response} = o + (n - h_1) \times l_p + (n - h_1 - 1) \times l_f + l_s + (n - h_2) \times l_p + (n - h_2 - 1) \times l_f + o$
$= 2 \times o + [(n - h_1) + (n - h_2)] \times l_p + [(n - h_1 - 1) + (n - h_2 - 1)] \times l_f + l_s$  (5)
Using a minimum function to eliminate dimensions with no hop traversals, all
three cases are generalized as
$L_{request} = 2 \times o + [h_1 + h_2] \times l_p + [(h_1 - \min(h_1,1)) + (h_2 - \min(h_2,1))] \times l_f + [\min(h_1,1) + \min(h_2,1) - 1] \times l_s$  (6)
$L_{response} = 2 \times o + [\min(h_1,1) \times (n - h_1) + \min(h_2,1) \times (n - h_2)] \times l_p + [\min(h_1,1) \times (n - h_1 - 1) + \min(h_2,1) \times (n - h_2 - 1)] \times l_f + [\min(h_1,1) + \min(h_2,1) - 1] \times l_s$  (7)
These results are extended for D dimensions as follows:
$L_{request} = 2 \times o + \left[\sum_{i=1}^{D} h_i\right] \times l_p + \left[\sum_{i=1}^{D} \left(h_i - \min(h_i,1)\right)\right] \times l_f + \left[\sum_{i=1}^{D} \min(h_i,1) - 1\right] \times l_s$  (8)
$L_{response} = 2 \times o + \left[\sum_{i=1}^{D} \min(h_i,1) \times (n - h_i)\right] \times l_p + \left[\sum_{i=1}^{D} \min(h_i,1) \times (n - h_i - 1)\right] \times l_f + \left[\sum_{i=1}^{D} \min(h_i,1) - 1\right] \times l_s$  (9)
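To make the use of Equations 8 and 9 concrete, the following C sketch (written for this discussion and not part of the benchmark suite in the appendices) evaluates both expressions for an arbitrary hop vector. The component values used in main() are the experimentally derived estimates presented later in Table 1 for 64-byte messages, and the example hop vector corresponds to a transaction with two hops in each dimension of a 3x3 torus, as in Figure 3.

/* Sketch: evaluating Equations 8 and 9 for a D-dimensional torus with n
 * nodes per dimension and h[i] hops in dimension i. All times are in ns. */
#include <stdio.h>

static int min1(int h) { return h < 1 ? h : 1; }             /* min(h,1) */

double l_request(int D, const int h[], double o, double lp,
                 double lf, double ls)                        /* Eq. 8 */
{
    int i, prop = 0, fwd = 0, sw = -1;
    for (i = 0; i < D; i++) {
        prop += h[i];                       /* propagation latencies */
        fwd  += h[i] - min1(h[i]);          /* forwarding latencies  */
        sw   += min1(h[i]);                 /* dimension switches    */
    }
    return 2.0 * o + prop * lp + fwd * lf + sw * ls;
}

double l_response(int D, int n, const int h[], double o, double lp,
                  double lf, double ls)                       /* Eq. 9 */
{
    int i, prop = 0, fwd = 0, sw = -1;
    for (i = 0; i < D; i++) {
        prop += min1(h[i]) * (n - h[i]);
        fwd  += min1(h[i]) * (n - h[i] - 1);
        sw   += min1(h[i]);
    }
    return 2.0 * o + prop * lp + fwd * lf + sw * ls;
}

int main(void)
{
    int h[2] = { 2, 2 };       /* two hops in each dimension of a 3x3 torus */
    double o = 2085.0, lp = 7.0, lf = 60.0, ls = 670.0; /* Table 1, 64 bytes */
    printf("Lrequest  = %.0f ns\n", l_request(2, h, o, lp, lf, ls));
    printf("Lresponse = %.0f ns\n", l_response(2, 3, h, o, lp, lf, ls));
    return 0;
}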
4.2 Average Latency Model
Further analysis is now performed to augment the previous point-to-point analysis
by characterizing the average distances traveled by request and response packets in a
system. The equations below extend the one- and two-dimensional average-distance
derivations of Sarwar and George [14] by developing a general form for D dimensions.
First, consider a single ring, and assume that there is a uniformly random
distribution of destination nodes for all packets. To arrive at the average number of links
traversed in a ring, a scenario having a fixed source and variable destinations is
considered. The total distance traveled for all possible source/destination pairs is
determined, and then divided by the number of destinations to determine the average
distance traveled.
The variable h_1 is used to represent the number of hops in the single dimension
for a given source/destination pair. For a request that has traveled h_1 hops, the response
will travel n − h_1 hops around the remainder of the ring. Therefore, the average number
of hops for request and response packets in a ring is represented as follows:
$\text{Average request distance} = \frac{\sum_{h_1=1}^{n-1} h_1}{n-1} = \frac{n}{2}$  (10)
$\text{Average response distance} = \frac{\sum_{h_1=1}^{n-1} (n - h_1)}{n-1} = \frac{n}{2}$  (11)
Similarly, for a two-dimensional system, using h_2 to represent the number of hops
in the second dimension, the derivation for average number of hops is as follows:
$\text{Average request distance} = \frac{\sum_{h_1=0}^{n-1} \sum_{h_2=0}^{n-1} (h_1 + h_2)}{n^2 - 1} = \frac{n^2 \times (n-1)}{n^2 - 1}$  (12)
$\text{Average response distance} = \frac{\sum_{h_1=0}^{n-1} \sum_{h_2=0}^{n-1} \left[\min(h_1,1) \times (n - h_1) + \min(h_2,1) \times (n - h_2)\right]}{n^2 - 1} = \frac{n^2 \times (n-1)}{n^2 - 1}$  (13)
Based on the results of similar derivations for three- and four-dimensional
systems, a general expression is derived for the average number of hops as a function of
D dimensions:
$\text{Average request distance} = \text{Average response distance} = \frac{D}{2} \times \frac{n^D \times (n-1)}{n^D - 1}$  (14)
As for switching latencies, it can be shown that the average number of dimension
switches for a transaction in a torus of D dimensions is accurately represented as follows:
$\text{Average number of dimension switches} = \frac{D \times (n-1) \times n^{D-1} - (n^D - 1)}{n^D - 1}$  (15)
For a single ring, the number of forwarding latencies is always one less than the
number of propagation latencies. However, when considering a transaction on a multi-
dimensional topology, the sum of the number of forwarding and switching latencies is
one less than the number of propagation latencies. Preceding analysis determined that, in
the average case, the number of switching latencies is given by Equation 15, and the
number of propagation latencies is given by Equation 14. As such, the number of
forwarding latencies can be determined as follows:
$\text{Average number of forwarding delays} = \frac{D}{2} \times \frac{n^D \times (n-1)}{n^D - 1} - 1 - \frac{D \times (n-1) \times n^{D-1} - (n^D - 1)}{n^D - 1}$
$= \frac{D \times n^{D-1} \times (n-1) \times (n-2)}{2 \times (n^D - 1)}$  (16)
Therefore, Equation 17 represents the average latency of a request or response
packet for a D-dimensional topology.
$L_{request} = L_{response} = 2 \times o + \frac{D}{2} \times \frac{n^D \times (n-1)}{n^D - 1} \times l_p + \frac{D \times n^{D-1} \times (n-1) \times (n-2)}{2 \times (n^D - 1)} \times l_f + \frac{D \times (n-1) \times n^{D-1} - (n^D - 1)}{n^D - 1} \times l_s$  (17)
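The closed forms in Equations 14 through 16 can be checked by direct enumeration. The short C program below, written for this discussion only, visits every destination of a 4x4 torus from a fixed source, tallies propagation hops, dimension switches and forwarding delays according to the rules above, and compares the enumerated averages against the formulas.

/* Brute-force check of Equations 14-16 on a two-dimensional (4x4) torus. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const int D = 2, n = 4;
    long dests = 0, hops = 0, switches = 0, forwards = 0;
    int h1, h2;

    for (h1 = 0; h1 < n; h1++)
        for (h2 = 0; h2 < n; h2++) {
            int dist, dims;
            if (h1 == 0 && h2 == 0) continue;        /* skip the source  */
            dist = h1 + h2;                          /* propagation hops */
            dims = (h1 > 0) + (h2 > 0);              /* dimensions used  */
            dests++;
            hops     += dist;
            switches += dims - 1;
            forwards += dist - dims;   /* one less than hops on each ring */
        }

    {
        double nD = pow(n, D);
        printf("avg hops     : %.4f vs Eq.14 %.4f\n", (double)hops / dests,
               (D / 2.0) * nD * (n - 1) / (nD - 1));
        printf("avg switches : %.4f vs Eq.15 %.4f\n", (double)switches / dests,
               (D * (n - 1) * pow(n, D - 1) - (nD - 1)) / (nD - 1));
        printf("avg forwards : %.4f vs Eq.16 %.4f\n", (double)forwards / dests,
               D * pow(n, D - 1) * (n - 1) * (n - 2) / (2.0 * (nD - 1)));
    }
    return 0;
}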
In the following chapter, experimental methods are used to determine the values
for overhead, switching, forwarding and propagation latency to be used as inputs for
these analytical models.
CHAPTER 5
EXPERIMENTAL INVESTIGATION
Experimental results described in this thesis were obtained using an SCI testbed
consisting of nine PCs, each having one 400 MHz Intel Pentium-II processor and 128
MB of PC100 SDRAM. These PCs each contained 32-bit PCI bus interfaces operating at
33 MHz, and were connected using Dolphin/Scali Wulfkit [15] adapters having a link
data rate of 3.2 Gb/s. Experimental performance measurements were obtained by
configuring this system in a variety of ring and torus topologies.
Figure 4 shows a block diagram of the main components of the NIC for a single
node in any given topology. A request originating at this node will enter the NIC through
the PCI bus, at which point the PCI to SCI Bridge (PSB in Figure 4) transfers this data
from PCI to the internal B-link bus. The request send packet then traverses the B-link,
and enters the SCI network fabric through one of the Link Controllers (LC2 in Figure 4).
Together with the software overhead for processing the message on the host, these steps
collectively constitute the sender overhead (o).
(Diagram: the PSB bridges the 32-bit, 33 MHz PCI bus to the internal B-Link bus, which connects to the SCI Link Controllers, LC2.)
Figure 4: Architectural components of a Wulfkit SCI NIC.
Packets entering the NIC from one of the SCI links will first encounter an LC2,
and that controller will check the header for routing purposes. Three cases can occur
based on the information contained in the packet header. In the first case, the header
could correspond with the address of the node, and in this case the packet would traverse
all intermediate components, and enter the PCI bus for further processing by the host.
Together with the associated software overhead, these steps constitute the receiver
overhead component in the analytical model (o).
Another possibility is that the packet is destined for a node that resides on another
SCI ring for which this node can serve as an agent. In such a case, the packet is sent
across the internal B-link bus, and enters the second ring through another LC2. These
components correspond with the switching delay (l_s) in the analytical study.
In the third possible scenario, the incoming packet is addressed to a different node
residing on the same ring. In this case, the packet is routed back out the same LC2,
without traversing any other NIC components. These steps correspond with the
forwarding delay (l_f) in the analytical study.
The Red Hat Linux (kernel 2.2.15) operating system was used on all machines in
the testbed, and each machine contained 100 Mb/s Fast Ethernet adapters for routine
traffic. The system used the Scali Software Platform 2.0.0 to provide the drivers, low-
level API (SCI USRAPI), and MPI implementation (ScaMPI [8]) to support experimental
testing.
Despite the fact that high-level implementations using MPI are highly portable
and widely used, benchmarking at that level imposes significant software overhead (e.g.
often as much as several thousand instructions per message transferred) and obscures the
underlying architectural behavior. The low-level API facilitates a memory-mapped
dataflow that bypasses much of the software overhead. As such, low-level API
benchmarking is more relevant for this work, since it helps to expose underlying
architectural phenomena.
However, although MPI results feature additional software overhead, they are
useful in providing an idea of the performance that is offered at the application level.
Therefore, MPI-level benchmarking demonstrates the performance obtained after the
addition of extra protocol and coordination overhead. Figure 5 clearly demonstrates the
performance penalty paid for the simplicity and portability of MPI, providing a
comparison of the Scali USRAPI and ScaMPI on a two-node ring using a one-way
latency test (see Figure 6). The API results are consistently lower, achieving a minimum
latency of 1.9 μs for a one-byte message, while the minimum latency achieved using
ScaMPI is 6.4 μs for a similarly sized message.
(Plot: one-way latency in microseconds versus message size in bytes, for MPI and API.)
Figure 5: Comparison of MPI and API latency on a two-node ring.
The results shown in Figure 5 demonstrate another important point. The shape of
the curves for both API and MPI suggests that the overall trend in behavior is dominated
by the performance of SCI transactions having a 64-byte packet payload size.
Transactions with 16-byte packet payloads are only significant for tests having an overall
message size less than 64 bytes. Therefore, subsequent analysis focuses on the behavior
of SCI transactions having a 64-byte packet payload size.
Since the experimental latency results are on the order of a handful of
microseconds, transient system effects (e.g. context switching, interrupts, cache misses,
UNIX watchdog, etc.) and timer resolution issues can negatively affect the results. The
results shown here were obtained by repeating experiments multiple times, calculating
maximum, minimum and average values, and then insuring that the difference between
maximum and minimum values was less than five percent of the minimum value.
Subsequent analysis uses the minimum value obtained during a series of experiments,
since it is the most reproducible value and therefore serves to efficiently negate
undesirable system effects.
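A minimal sketch of this measurement procedure is shown below. The helper run_once() stands in for a single timed experiment (the complete benchmarks appear in the appendices), and the repetition count and five-percent tolerance follow the description above.

/* Repeat an experiment, track min/max/avg, and flag excessive spread. */
#include <stdio.h>

#define NUMREPS   15        /* repetitions per data point                */
#define TOLERANCE 0.05      /* allowed (max - min) / min spread          */

extern double run_once(void);  /* one timed latency experiment (not shown) */

double measure(void)
{
    double t, min = 0.0, max = 0.0, sum = 0.0;
    int i;
    for (i = 0; i < NUMREPS; i++) {
        t = run_once();
        if (i == 0 || t < min) min = t;
        if (i == 0 || t > max) max = t;
        sum += t;
    }
    if (max - min > min * TOLERANCE)
        fprintf(stderr, "warning: spread is %.1f%% of minimum (avg %.2f)\n",
                100.0 * (max - min) / min, sum / NUMREPS);
    return min;   /* the minimum is taken as the most reproducible value */
}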
5.1 Benchmark Design
For both API- and MPI-based testing, many design alternatives are available. A
suite of benchmarks was developed as part of this work to support both MPI
(mpibench) and API (scibench) benchmarking. Both one-way (OW) and ping-pong
(PP) latency tests were used. Figures 6 and 7 explain these strategies in detail using
relevant pseudo-code. The OW test in Figure 6 uses one sender and one receiver to
measure the latency of uni-directional streaming message transfers. The PP test in
Figure 7 alternates the roles of sender and receiver for each message transferred, and the
PP latency is computed as one half the round trip time.
Mpibench performs both tests, and the standardization of the interface allows it
to be easily ported for use in other high-performance networking systems. Scibench
also performs both tests, using API calls to establish the sharing and mapping of remote
memory.
SERVER:
start timing
for (index=0; index < reps; index++)
{
    send message of predetermined size;
}
receive acknowledgement of final message;
end timing
'One-way latency' = (end-start)/reps;

CLIENT:
for (index=0; index < reps; index++)
{
    receive message of predetermined size;
}
send acknowledgement of final message;
Figure 6: One-way (OW) testing scheme.
SERVER:
start timing
for (index=0; index < reps; index++)
{
    send message of predetermined size;
    receive message of predetermined size;
}
end timing
'Ping-pong latency' = ((end-start)/reps)/2;

CLIENT:
for (index=0; index < reps; index++)
{
    receive message of predetermined size;
    send message of predetermined size;
}
Figure 7: Ping-pong (PP) testing scheme.
Setting up shared-memory communication using the USRAPI involves a few
simple steps. First, a host computer identifies a block of local memory that it wants to
share among other members of the shared-memory group. This block is then offered to
the group using API calls to the local NIC. Remote nodes can now refer to the block, and
map it into their virtual address space. These remote nodes then access the shared block
directly using their virtual address space.
Figure 8 shows how the shared memory environment was configured to support
API-based benchmarking. The pointers pLocal and pRemote are configured to point
to the local and remote memory arrays (each having s array elements), respectively.
Once established, a write directed at pRemote will be an order of magnitude larger in
latency than one directed at pLocal (e.g. 2 μs vs. 100 ns, for a 4-byte message).
(Diagram: on both client and server, pLocal points to a local memory array of s elements and pRemote to the mapped view of the peer's remote memory array.)
Figure 8: Shared-memory testing environment.
The API-based benchmarks use local reads (for polling a memory location for
changes) and remote writes, since remote polling would incur a significant and
unnecessary performance penalty. Writes of large messages are performed by assigning
a remote target to memcpy () function calls. However, when transferring messages of 8
bytes and smaller, direct assignments of atomic data types (char, short, int,
long long int) are used to avoid the overhead of the memcpy () function call.
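The write path just described can be sketched as follows. This fragment assumes that pRemote already points at the mapped view of the remote array and pLocal at the local array, as in Figure 8; the USRAPI calls that establish the mapping are omitted, and the helper names are introduced here for illustration only.

#include <stddef.h>
#include <string.h>

static volatile char *pLocal;    /* local array, polled for incoming data */
static volatile char *pRemote;   /* mapped view of the remote array       */

/* Write an m-byte message into the remote node's memory. */
static void remote_write(const void *msg, size_t m)
{
    if (m > 8)                        /* large messages: block copy        */
        memcpy((void *)pRemote, msg, m);
    else if (m == 8)                  /* small messages: direct assignment */
        *(volatile long long int *)pRemote = *(const long long int *)msg;
    else if (m >= 4)
        *(volatile int *)pRemote = *(const int *)msg;
    else if (m >= 2)
        *(volatile short *)pRemote = *(const short *)msg;
    else
        *pRemote = *(const char *)msg;
}

/* Detect the peer's write by polling local memory with cheap local reads. */
static void wait_for(size_t offset, char expected)
{
    while (pLocal[offset] != expected)
        ;   /* spin locally; polling remote memory would be far costlier */
}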
To investigate the components of latency, multiple experiments were performed
on different topology types and the results compared to determine the transaction latency
components.
5.2 Ring Experiments
In the first series of experiments, several configurations based on a ring topology
were used to conduct PP latency testing. PP tests were used since the analytical
derivations are based on the behavior of a single message, and the compound effect of
multiple serial messages in an OW test would necessarily feature a potentially misleading
pipeline effect.
(Diagram, steps 1 through 8: ping request, ping request echo, ping response, ping response echo, then pong request, pong request echo, pong response, pong response echo.)
Figure 9: Analysis of PP testing.
Figure 9 analyzes the execution of a PP test on SCI, with constituent steps
numbered 1 through 8 to identify the ordering of ping and pong transaction components.
Steps 1 through 4 describe the four components (see Figure 1) of a single ping
transaction. Steps 5 through 8 represent the subaction components for the corresponding
pong transaction.
The ping-pong latency is calculated to be one half the round trip time, and is
shown to be equivalent to the timing of a single ping request (step 1 in Figure 9).
Therefore, the analytical representation of a ping request (Lrequest) is used in subsequent
analysis to represent experimental PP results.
The propagation latency (l_p) was determined theoretically, by considering the fact
that signals propagate through a conductor at approximately half the speed of light.
Using a value of 299,792.5 km/s for the speed of light, and assuming cables one meter in
length, the propagation latency was determined to be 7 ns. Since the propagation latency
represents the latency for the head of a message passing through a conductor, it is
therefore independent of message size.
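Written out, this estimate is simply:

$l_p \approx \frac{1\ \text{m}}{0.5 \times 299{,}792.5\ \text{km/s}} \approx 6.7\ \text{ns} \approx 7\ \text{ns}$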
Client and server nodes were chosen such that there is a symmetrical path
between source and destination. Figure 10 demonstrates three such testing configurations
in which tests are identified based on the number of hops traversed by a ping request
traveling from the source to destination. These path scenarios were selected so that the
ping and pong messages would each traverse the same number of hops. Such
symmetrical paths allow the PP result to be characterized by an integer number of hops,
facilitating a direct association of an experimental PP result with the corresponding
analytical representation of a ping request. Once these experimental results were
obtained, their differences were then used to determine latency components.
(a) one hop (b) two hops (c) three hops
Figure 10: Ring test configurations.
The first experiment performed is designed to determine the value of overhead,
and involves the one-hop test shown in Figure 10a. PP latency was measured over a
range of message sizes, and is represented analytically as follows:
$PP_{one\ hop} = 2 \times o + l_p$  (18)
Since l_p has already been determined (7 ns), the overhead component is the only
unknown in Equation 18. This overhead was computed algebraically for a range of
message sizes, and the results of this computation are discussed further in the next
subsection.
The next series of ring experiments applied the difference between PP latencies
for the one-hop test and similar results obtained from a four-hop test. The difference
between these is derived analytically as follows:
$PP_{four\ hops} = 2 \times o + 3 \times l_f + 4 \times l_p$
$PP_{one\ hop} = 2 \times o + l_p$
$PP_{four\ hops} - PP_{one\ hop} = 3 \times l_f + 3 \times l_p$  (19)
Using the value previously obtained for propagation latency, the forwarding
latency is the only unknown in Equation 19. As such, the value for forwarding latency
was computed algebraically for a range of message sizes. The results from this
derivation are also discussed further in the next subsection.
The only remaining unknown is the switching latency component, which occurs
when a message switches dimensions from one ring to another through an agent. This
component is determined using a series of torus experiments.
5.3 Torus Experiments
The switching latency is determined using PP benchmarking for the torus-based
testing scenarios shown in Figure 11.
(a) non-switching (b) switching
Figure 11: Torus test configurations.
Figure 11a illustrates the first test configuration, which involves movement in a
single dimension. The second configuration, shown in Figure 11b, introduces the
switching latency element. Once the latency experiments were performed for a range of
message sizes on each of these two configurations, the difference between the two sets of
results was determined algebraically. Although the topology in Figure 11a is no longer
perfectly symmetrical for ping and pong paths, the following provides a close
approximation of the algebraic difference between torus experiments:
$PP_{non\text{-}switching} = 2 \times o + 1 \times l_f + 2 \times l_p$
$PP_{switching} = 2 \times o + 1 \times l_f + 3 \times l_p + l_s$
$PP_{switching} - PP_{non\text{-}switching} = l_s + l_p$  (20)
As before, Equation 20 is used along with the value for propagation latency to
algebraically determine the switching latency for a range of message sizes.
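Solving Equations 18 through 20 for the three remaining unknowns is simple algebra, as the C sketch below illustrates for a single message size. The ping-pong inputs in main() are placeholder values chosen only so that the output reproduces the estimates summarized later in Table 1; they are not recorded measurements.

/* Extract o, lf and ls from measured PP latencies via Equations 18-20. */
#include <stdio.h>

#define LP_NS 7.0    /* propagation latency per one-meter hop (ns) */

void solve_components(double pp_one_hop, double pp_four_hops,
                      double pp_non_switching, double pp_switching,
                      double *o, double *lf, double *ls)
{
    *o  = (pp_one_hop - LP_NS) / 2.0;                   /* Eq. 18 */
    *lf = (pp_four_hops - pp_one_hop) / 3.0 - LP_NS;    /* Eq. 19 */
    *ls = (pp_switching - pp_non_switching) - LP_NS;    /* Eq. 20 */
}

int main(void)
{
    double o, lf, ls;
    /* Placeholder PP inputs (ns) for one message size. */
    solve_components(4177.0, 4378.0, 4345.0, 5022.0, &o, &lf, &ls);
    printf("o = %.0f ns, lf = %.0f ns, ls = %.0f ns\n", o, lf, ls);
    return 0;
}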
(Plot: calculated latency components in nanoseconds versus message size in bytes, on a logarithmic scale; overhead is roughly 2085 ns at 64 bytes and grows at about 11.6 ns/byte, while switching averages about 670 ns, forwarding about 60 ns, and propagation stays at 7 ns.)
Figure 12: Comparison of calculated latency components.
Using Equations 18, 19 and 20 along with the calculated value for propagation
latency, Figure 12 shows a comparison of all components for a range of message sizes.
This figure demonstrates the clear differences between components in terms of their
relationship with message size. The switching, forwarding and propagation components
are shown to be relatively independent of message size, whereas the overhead component
is significantly dependent upon message size. These experimental results therefore match
the original intuitive expectations.
To use these experimental results as inputs to the analytical model, Table 1
provides a summary of the estimates made for each component, for a message of m bytes.
Propagation, forwarding and switching components are assumed constant, whereas the
overhead component is represented using a linear equation.
Table 1: Estimates of experimental latency components.
Latency component           Estimate
Propagation latency (l_p)   7 ns
Forwarding latency (l_f)    60 ns
Switching latency (l_s)     670 ns
Overhead (o)                2085 + 11.6 × (m − 64) ns
5.4 Validation
Using these estimates as inputs to the analytical models, a validation exercise is
performed to confirm the models as worthy representations of reality. The first validation
exercise investigates the accuracy of the model as a function of message size, and
involves the symmetrical three-hop ring test shown in Figure 10c. This test is chosen
because one- and four-hop tests were used previously to experimentally determine the
inputs. The analytical PP latency for the three-hop test is given by the following
equation:
$PP_{three\ hops} = 2 \times o + 2 \times l_f + 3 \times l_p$  (21)
Figure 13a shows the results of this validation, and demonstrates how closely the
analytical estimates match experimental results.
(Plots: experimental versus analytical PP latency.)
(a) latency vs. message size (6-node ring) (b) latency vs. ring size (64-byte message)
Figure 13: Validation of analytical model.
The second validation exercise investigates the accuracy of the model as a
function of the number of nodes and uses a 64-byte message size on one-, two-, three-,
and four-hop tests. The analytical PP latencies for these rings are given by the following
equations:
$PP_{one\ hop} = 2 \times o + 0 \times l_f + 1 \times l_p$  (22)
$PP_{two\ hops} = 2 \times o + 1 \times l_f + 2 \times l_p$  (23)
$PP_{three\ hops} = 2 \times o + 2 \times l_f + 3 \times l_p$  (24)
$PP_{four\ hops} = 2 \times o + 3 \times l_f + 4 \times l_p$  (25)
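Evaluated with the Table 1 estimates for a 64-byte message, these four expressions reduce to a single loop; the short sketch below computes the analytical points used in this comparison.

/* Analytical PP latency for one through four hops, 64-byte messages. */
#include <stdio.h>

int main(void)
{
    double o = 2085.0, lp = 7.0, lf = 60.0;    /* Table 1 estimates (ns) */
    int h;
    for (h = 1; h <= 4; h++)                   /* Equations 22 through 25 */
        printf("%d hop(s): %.0f ns\n", h, 2.0 * o + (h - 1) * lf + h * lp);
    return 0;
}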
Figure 13b shows the results of this validation, and again demonstrates the
accuracy of the analytical estimates. Although there is a slight deviation observed
between analytical and experimental results, a linear extrapolation of this deviation for
system sizes up to one-thousand nodes shows that the error never exceeds five percent
within this range.
Having now derived and validated the analytical models, the next chapter uses
these models to project the behavior of larger systems and investigates topology
alternatives that exceed the capabilities of the experimental testbed.
CHAPTER 6
ANALYTICAL PROJECTIONS
To ascertain the relative latency performance of different topology types, the
analytical models were used to investigate topologies that range from one to four
dimensions, with a maximum system size of up to one-thousand nodes. The models were
first given input parameters derived directly from the experimental analysis. Based on
the results of these analytical projections, they were then fed data for a conceptual system
featuring enhanced parameters.
Two types of applications are considered in determining topology tradeoffs. The
first type is average latency, based on the equations derived in Section 4.2. This
application provides a useful performance metric since it represents the performance that
is achieved in a typical point-to-point transaction on a given topology.
The second application type used for comparison is an unoptimized broadcast
operation, carried out using a series of unicast messages. For a given topology, having a
fixed source, the complete set of destinations is determined, and the point-to-point
latency of each of these transactions is calculated using the latency equations derived in
Section 4.1. As before, each point-to-point transaction is assumed equivalent to the
latency of a ping request, based on the analysis in Figure 9. The sum of these
transactions is determined, and is used as a basis for inter-topology comparison. This
one-to-all multi-unicast operation is also a useful metric for comparison, since such an
approach for collective communication operations is common in parallel applications and
systems.
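A sketch of how this multi-unicast metric is evaluated from the point-to-point model is given below. It repeats the Equation 8 helper so that the fragment stands alone, uses the Table 1 estimates for 64-byte messages, and enumerates destinations with a simple odometer over the D dimensions; it is illustrative only and assumes minimal, dimension-ordered routing on unidirectional rings.

/* One-to-all multi-unicast cost: sum the ping-request latency (Eq. 8)
 * over every destination of a D-dimensional torus with n nodes per dim. */
#include <stdio.h>

static double l_request(int D, const int h[], double o, double lp,
                        double lf, double ls)
{
    int i, prop = 0, fwd = 0, sw = -1;
    for (i = 0; i < D; i++) {
        int m = h[i] < 1 ? h[i] : 1;        /* min(h_i, 1) */
        prop += h[i];
        fwd  += h[i] - m;
        sw   += m;
    }
    return 2.0 * o + prop * lp + fwd * lf + sw * ls;
}

static double multi_unicast(int D, int n, double o, double lp,
                            double lf, double ls)
{
    int h[4] = { 0, 0, 0, 0 };             /* hops per dimension, up to 4-D */
    int i, done = 0;
    double total = 0.0;

    while (!done) {
        for (i = 0; i < D && h[i] == 0; i++)
            ;
        if (i < D)                          /* skip the source (all zeros)  */
            total += l_request(D, h, o, lp, lf, ls);
        for (i = 0; i < D; i++) {           /* odometer over the D digits   */
            if (++h[i] < n) break;
            h[i] = 0;
            if (i == D - 1) done = 1;
        }
    }
    return total;
}

int main(void)
{
    double o = 2085.0, lp = 7.0, lf = 60.0, ls = 670.0;  /* Table 1, 64 B */
    printf("1-D ring,  27 nodes: %.0f ns\n", multi_unicast(1, 27, o, lp, lf, ls));
    printf("3-D torus, 27 nodes: %.0f ns\n", multi_unicast(3, 3, o, lp, lf, ls));
    return 0;
}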
6.1 Current System
To investigate the relative latency performance of different topology alternatives
using current hardware, the performance of average latency and one-to-all multi-unicast
applications was derived analytically using data obtained directly from the experimental
testing. The results obtained are shown in Figure 14. On each figure, the crossover
points identify the system sizes after which an incremental increase in the topology
dimension offers superior latency performance.
(Plots: latency in nanoseconds versus total number of nodes for 1D through 4D topologies, with crossover points marked.)
(a) average latency (b) one-to-all multi-unicast
Figure 14: Inter-topology comparison of current system.
Since this study was conducted using topologies having equal numbers of nodes
in each dimension, the ring is the only featured topology that can offer every system size
within the range of interest. An extrapolation of the higher-dimensional topologies was
used to fill in the gaps and identify the exact crossover points at which certain topology
types surpass the performance of others. For this reason, crossover points do not
necessarily align with equi-dimensional topology types, but they still provide a useful
basis for inter-topology comparisons.
The average latency application, shown in Figure 14a, demonstrates clear
scalability differences between topologies, with a one-dimensional topology offering the
best latency performance for systems having fewer than 18 nodes. However, for this
basic ring, messages sent by any node on the ring must pass through every other node in
the entire system, thus limiting scalability. Beyond 18 nodes, the two-dimensional
topology is able to limit the traffic paths to a sufficient extent to outweigh the large
dimension-switching penalty paid as the number of dimensions increases.
The two-dimensional topology continues to lead latency performance up to 191
nodes, at which point the additional path savings achieved using a three-dimensional
topology now outweighs the added switching latency for this higher-dimensional
topology. The three-dimensional topology continues to lead latency performance up to
1000 nodes and beyond.
The situation for the one-to-all multi-unicast application, shown in Figure 14b, is
quite different. The savings achieved in going from one to two dimensions is
pronounced, but for higher dimensions, the relative latency performance does not vary
much within the range of interest. For this application, one-dimensional topologies lead
latency performance for system sizes smaller than 20 nodes, at which point the path
savings of the two-dimensional topology enables it to provide the best latency
performance up to 220 nodes. The three-dimensional topology then offers the best
latency performance up to 1000 nodes and beyond.
These results demonstrate that one- and two-dimensional topologies dominate
latency performance for small and medium system sizes. Crossover points depend
primarily upon the relative magnitude of switching and forwarding delays. Although
higher-dimensional topologies offer significant path savings for point-to-point traffic, the
associated switching penalty makes these topologies impractical for medium-sized
systems.
As a means of comparison, these results mirror those achieved by Bugge [4], who
performed similar comparisons of multi-dimensional torus topologies based strictly on a
throughput study. Although the crossover points are different, his conclusions are
equivalent, with higher-dimensional topologies becoming practical only for very large
system sizes.
6.2 Enhanced System
Advances in semiconductor manufacturing techniques have been able to sustain a
breathtaking pace of improvement in related technologies. As such, it is reasonable to
expect that systems will be available in the near future that significantly outperform
current hardware. To investigate the relative latency performance of different topology
alternatives using such enhanced hardware, the analytical models were fed data that
artificially enhanced system performance.
The component calculations in Figure 12 demonstrate the order of magnitude
difference between switching and forwarding latencies (670 ns and 60 ns respectively).
The topology comparisons in Figure 14 demonstrate that this large difference between
switching and forwarding latencies limits the practicality of higher-dimensional
topologies for medium-sized systems. An improvement in the switching latency
parameter should therefore produce promising results. To examine latency performance
of hardware having an enhanced switching latency, the original value (670 ns) is halved
(335 ns) to explore the effect this design improvement has on relative topology
performance. Figure 15 shows the results achieved after making this change.
(Plots: latency in nanoseconds versus total number of nodes for 1D through 4D topologies, with crossover points marked.)
(a) average latency (b) one-to-all multi-unicast
Figure 15: Inter-topology comparison of enhanced system.
The average latency application, shown in Figure 15a, once again demonstrates
clear differences between topologies. The one-dimensional topology is now
outperformed by the two-dimensional topology for a system size above 9 nodes. The
two-dimensional topology leads latency performance until 45 nodes, at which point the
three-dimensional topology leads latency performance up to 232 nodes. The four-
dimensional topology now becomes a practical consideration within the range of interest
and provides the best latency performance up to 1000 nodes and beyond.
The one-to-all multi-unicast application performance, shown in Figure 15b,
reflects similar trends to those in the previous configuration. The savings achieved in
going from one to two dimensions is again more pronounced than subsequent dimension
increases, but there is a clear downward shift overall as the crossover points all occur for
smaller system sizes. One-dimensional topologies are quickly outperformed by the two-
dimensional topology (5 nodes), which then leads latency performance up to only 9
nodes, at which point the three-dimensional topology leads latency performance up to
only 18 nodes. The four-dimensional topology dominates latency performance for the
remaining range of system sizes.
Although the crossover points achieved for multi-unicast on enhanced hardware
are significantly smaller than those achieved using current hardware, this downward shift
is not as significant as that in the average latency case since the best latency performance
for a given system size does not improve as significantly in the multi-unicast comparison
(Figure 15b) as it does in the average latency comparison (Figure 15a). Table 2
summarizes the crossover points observed for average latency and one-to-all multi-
unicast applications using both the current system and the enhanced system. This table
shows the system sizes at which the path savings of each dimension increase outweighs
the associated switching penalty, facilitating superior latency performance for the higher-
dimensional topology in each case.
Table 2: Summary of crossover points (in nodes).
Crossover point    Current system                 Enhanced system
                   Avg. latency  Multi-unicast    Avg. latency  Multi-unicast
1D → 2D                 18            20                9             5
2D → 3D                191           220               45             9
3D → 4D               1831          2050              232            18
The results indicate that enhancements in the switching latency (achieved perhaps
through the use of a wider or faster internal B-link bus, or by using wormhole routing
instead of store-and-forward at switch points) would enable higher-dimensional
topologies to become practical for smaller system sizes. Such an enhancement would
provide moderately better latency performance, but this may not justify the added
complexity of a higher-dimensional topology.
The average latency data suggests that the enhancement may be warranted, since
the best latency performance for medium-sized systems is seen to improve, although only
by a modest amount (e.g. approx. 5% improvement for a system size of 100 nodes).
However, for the one-to-all multi-unicast application, the enhanced system offers no real
improvement for medium system sizes. The enhanced hardware only offers an
improvement in multi-unicast performance for large system sizes (e.g. approx. 10%
improvement for a system size of 1000 nodes).
CHAPTER 7
CONCLUSIONS
This thesis introduces an analytical characterization of SCI network performance
and topology comparison from a latency perspective, using architectural issues to inspire
the characterization. Analytical models were developed for point-to-point and average
latency of various topology types, and a validation exercise demonstrated that these
models closely match equivalent experimental results. Based on these models, this work
helps determine architectural sources of latency for various systems and provides a
straightforward means to project network behavior in the absence of an expensive
hardware testbed and without requiring the use of computationally-intensive simulative
models.
Using system parameters derived from experimental testing, topology differences
for a range of system sizes are found to be a result of the large difference between
forwarding latencies and switching latencies. Analytical projections demonstrate the
tradeoffs between path savings on higher-dimensional topologies versus the large
switching penalty paid when increasing the number of dimensions.
One-dimensional topologies offer superior latency performance for small numbers
of nodes, but are soon outperformed by two-dimensional topologies due to the inherent
lack of scalability of the basic SCI ring. Using current hardware, the two-dimensional
topology continues to lead latency performance for medium system sizes (ranging
approximately from 20 nodes to 200 nodes). For larger system sizes, the three-
dimensional topology provides the best latency performance for the remainder of the
range of interest. When using an enhanced system with a smaller switching latency,
higher-dimensional topologies become more practical for medium-sized systems, but the
improvement in best latency performance for such system sizes is only moderate.
In terms of future directions for this research, although the current models provide
an accurate approximation of the experimental data, they can be further elaborated to
include finer-grained representations of constituent network events. These improvements
could involve modeling more subtle phenomena (e.g., contention), thereby enhancing the
fidelity of the models. On the experimental side, further testbed work could examine a
broader range of topology alternatives. As the available system resources continue to grow,
future studies can include larger numbers of nodes, bidirectional rings, faster network/host
interfaces, and configurations that incorporate dedicated switches.
In addition, while the average latency and one-to-all multi-unicast applications
provide a practical comparison of topology types, opportunity exists for the study of more
types of traffic patterns than the ones investigated here. Some examples of such
application patterns include all-to-all, nearest-neighbor, unbalanced communication and
tree-based multicasting. Ultimately, such enhancements can be used to predict the
behavior of more complex parallel applications, and map these applications to the
topology types that best serve their needs.
APPENDIX A
SCIBENCH CODE LISTING
/********************************************************
* SCIBENCH *
* Shared memory benchmarking for Scali USRAPI *
* *
* Damian M. Gonzalez gonzalez@hcs.ufl.edu *
* HCS Lab University of Florida, Gainesville *
#include "scasci****************************************
#include "scasci.h"
#include "rtl.h"
#include <stdio.h>    /* printf, fprintf, setvbuf */
#include <stdlib.h>   /* strtoul, exit */
#include <string.h>   /* memcpy */
#include <unistd.h>   /* sleep, usleep */
#include <time.h>     /* time, ctime */
#include <sys/time.h> /* gettimeofday */
#define MB 1024*1024
#define KB 1024
#define MEMALIGN 1 /* 0 No memory alignment, 'gathering' effect is seen in the results */
/* 1 Memory aligned so that the PSB automatically flushes (recommended) */
#define ACCURACY 0.10 /* A warning is printed if (max-min) is greater than (min*ACCURACY) */
#define MAXSIZE 64*KB
/*#define MAXSIZE 4*KB */
#define NUMREPS 15 /* number of iterations made to determine max, min, avg */
#define TIMEPERPOINT 2E4 /* self explanatory, in microseconds */
#define RESOLUTION 2048 /* interval between successive points, in bytes */
#define TESTTYPE 1 /* 1= one way WITH an acknowledge at the end */
/* 2= ping pong test */
#define IAMSERVER (uServCliBool==1)
/* for timing */
struct timeval st, et;
double elapsed;
time_t timevar;
/* for statistics */
double latencies[(MAXSIZE/RESOLUTION)+1] [NUMREPS+1];
double throughputs[(MAXSIZE/RESOLUTION)+1] [NUMREPS+1];
double min,max,avg,total;
static void _GetNumber (const char *sz,unsigned32 *uValue,BOOL *fOK)
{
char *ep;
*uValue = strtoul (sz,&ep,0);
*fOK = (*ep == 0);
}
static void _GetArguments (
int argc,
char **argv,
unsigned32 *uServCliBool,
unsigned32 *uLocalModuleID,
unsigned32 *uLocalChunkID,
unsigned32 *uRemoteModuleID,
unsigned32 *uRemoteChunkID,
unsigned32 *uLocalChunkSize,
unsigned16 *uRemoteNodeID,
BOOL *fOK
)
{
if (argc != 8)
{
*fOK = FALSE;
}
else
{
_GetNumber (argv [1],uServCliBool,fOK);
if (*fOK == FALSE)
{
fprintf (stderr,"Problem with the Server/Client boolean.\n");
}
else
{
if(*uServCliBool==0 || *uServCliBool==1)
{
_GetNumber (argv [2],uLocalModuleID,fOK);
if (*fOK == FALSE)
{
fprintf (stderr,"Invalid local module ID.\n");
}
else
{
_GetNumber (argv [3],uLocalChunkID,fOK);
if (*fOK == FALSE)
{
fprintf (stderr,"Invalid local chunk ID.\n");
}
else
{
_GetNumber (argv [4],uRemoteModuleID,fOK);
if (*fOK == FALSE)
{
fprintf (stderr,"Invalid Remote module ID.\n");
}
else
{
_GetNumber (argv [5],uRemoteChunkID,fOK);
if (*fOK == FALSE)
{
fprintf (stderr, "Invalid Remote chunk ID.\n");
}
else
{
_GetNumber (argv [6],uLocalChunkSize,fOK);
if (*fOK == FALSE)
{
fprintf (stderr,"Invalid chunk size.\n");
}
else
{
unsigned32 uTemp;
_GetNumber (argv [7],&uTemp,fOK);
if (*fOK == FALSE)
{
fprintf (stderr,"Invalid remote node ID.\n");
}
else
{
*uRemoteNodeID=(unsigned16)uTemp;
}
}
}
}
}
}
}
else
{
fprintf (stderr,"Server/Client boolean should be a 1 (SERVER) or a 0 (CLIENT).\n");
*fOK = FALSE;
}
}
}
}
static void _Usage (void)
{
printf("* SCIBENCH *\n");
printf ("* Shared memory benchmarking for Scali USRAPI *\n");
printf("* *\n");
printf ("* Damian M. Gonzalez gonzalez@hcs.ufl.edu *\n");
printf ("* HCS Lab University of Florida, Gainesville *\n");
printf("* *\n");
printf ("* This program, when used properly, sets up two *\n");
printf ("* shared memory segments, on a pair of machines, *\n");
printf ("* and each machine maps the remote segment to it's *\n");
printf ("* virtual memory. The test type, maximum data size, *\n");
printf ("* number of repetitions, time per datapoint, and *\n");
printf ("* interval between successive points are all set *\n");
printf ("* using the variables near the top of the code. *\n");
printf("* *\n");
printf ("* Measures: Max/Min/Avg/Max-Min Lat/Thrpt *\n");
printf ("* Using : OneWay/PingPong tests *\n");
printf ("* Can vary: TESTTYPE MAXSIZE TIME PERPOINT *\n");
printf ("* NUMREPS ACCURACY RESOLUTION *\n");
printf ("* *\n");
printf ("***************************************************************\n");
printf ("* Instructions: *\n");
printf ("* rsh to the SERVER (nodeid 0x1100) and type: *\n");
printf ("* *\n");
printf ("* scibench 1 *\n");
printf ("* (e.g. scibench 1 1 1 1 1 1048576 0x1200 ) *\n");
printf ("* *\n");
printf ("* rsh to the CLIENT (nodeid 0x1200) and type: *\n");
printf ("* *\n");
printf ("* scibench 0 *\n");
printf ("* (e.g. scibench 0 11 1 1 1048576 0x1100 ) *\n");
printf ("***************************************************************\n");
exit(1);
}
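/*
 * Program flow (summary added for readability):
 *   1. Parse the seven command-line arguments (_GetArguments).
 *   2. Initialize the Scali USRAPI and query the local SCI node ID.
 *   3. Allocate, map, and offer a local shared-memory chunk.
 *   4. Connect to and map the remote chunk, retrying until the peer is up.
 *   5. Run the selected test (one-way with acknowledge, or ping-pong),
 *      timing 'repetitions' remote writes per message size.
 *   6. Print raw and summary latency/throughput results, then clean up.
 */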
int main (int argc,char **argv)
{
BOOL fOK;
register int j;
int repvar;
volatile int repetitions;
char pause;
unsigned32 uServCliBool;
unsigned32 uLocalModuleID;
unsigned32 uLocalChunkID;
unsigned32 uRemoteModuleID;
unsigned32 uRemoteChunkID;
unsigned32 uVal;
unsigned32 uLocalChunkSize;
unsigned32 uRemoteChunkSize;
volatile unsigned32 uMsgSize;
unsigned16 uRemoteNodeID;
unsigned16 uLocalPsbNodeId;
unsigned nAdapters;
void *local_virtual_addr_raw;
void *remote_virtual_addr_raw;
PSHARABLE pshm;
PCONNECTOR pcon;
ICM_STATUS status;
setvbuf(stdout, NULL, _IONBF, 0);
setvbuf(stdin, NULL, _IONBF, 0);
_GetArguments (argc,argv,&uServCliBool,&uLocalModuleID,&uLocalChunkID,
&uRemoteModuleID,&uRemoteChunkID, &uLocalChunkSize,&uRemoteNodeID,&fOK);
if (!fOK)
_Usage ();
SciInitialize (SCIUSRVERSION, &fOK);
if (!fOK)
{
fprintf (stderr,"Could not initialize USRAPI V%/u.\n",SCIUSRVERSION);
return 1;
I
nAdapters = SciGetNumberOfAdapters ();
if (nAdapters == 0)
{
fprintf (stderr,"There are no SCI adapters on this machine.\n");
SciClose();
return 1;
}
else
{
printf ("The # of SCI adapters on this machine = %d\n", nAdapters);
}
SciFlush(0);
uLocalPsbNodeId = SciGetNodeId (0);
printf ("Using local SCI device with node ID 0x%x.\n",uLocalPsbNodeId);
/* allocate local memory chunk */
SciAllocateLocalChunk (&pshm,uLocalChunkSize,LMAP_CONSISTENT,&status);
if (status != ICMS_OK)
{
fprintf (stderr,"Could not allocate %u bytes sharable memory (%s).\n", uLocalChunkSize,
SciErrorString (status));
}
else
{
/* map local memory chunk */
SciMapLocalChunk(pshm,0,0,MPREADWRITE,&status,&local_virtual_addr_raw);
if (status != ICMS_OK)
{
fprintf (stderr,"Error (%s) creating user level mapping (SciMapLocalChunk()\n",SciErrorString
(status));
}
else
{
/* offer local memory chunk */
SciOffer (pshm, 0, uLocalModuleID, uLocalChunkID, LMAP_CONSISTENT, &status);
if (status != ICMS_OK)
{
fprintf (stderr,"Error (%s) introducing memory on SCI (SciOffer()).\n",SciErrorString (status));
}
else
{
printf ("Sharing %u bytes on node Oxx as module %u chunk %u .\n",
uLocalChunkSize, uLocalPsbNodeId, uLocalModuleID, uLocalChunkID);
SciConnectToRemoteChunk (&pcon,0,uRemoteNodelD,uRemoteModuleID,
uRemoteChunkID,&status);
while (status != ICMSOK)
{
fprintf (stderr,"Could not connect to remote memory at node Oxx module %u chunk %u
(%s).\n Trying again...\n",
uRemoteNodelD,uRemoteModuleID,uRemoteChunkID,SciErrorString (status));
sleep(3);
SciConnectToRemoteChunk(&pcon,0,uRemoteNodeID,uRemoteModulelD,
uRemoteChunkID,&status);
}
printf ("Successfully connected to remote memory at node Oxx.\n", uRemoteNodeID);
printf ("Mapping module %u chunk %u from remote node
Oxox\n",uRemoteModuleID,uRemoteChunkID,uRemoteNodeID);
SciMapRemoteChunk(pcon,0,0,RMAP_GATHERING,LMAP_CONSISTENT,MPREADWRITE,
&status, &remotevirtualaddrraw);
while (status != ICMSOK)
{
fprintf (stderr,"Could not map remote memory (%s).\n Trying again...\n",
SciErrorString (status));
sleep(3);
SciMapRemoteChunk(pcon,0,0,RMAP_GATHERING,LMAP_CONSISTENT,MPREADWRITE,
&status, &remotevirtualaddrraw);
}
printf ("Successfully mapped remote memory from remote node Oxx.\n", uRemoteNodeID);
if(status == ICMS_OK)
{
volatile unsigned32 *nodeid_remote;
volatile unsigned32 *nodeid_local;
volatile unsigned32 *pRemote;
volatile unsigned32 *pLocal;
volatile unsigned32 *to;
volatile long long int *to_llint;
volatile int *to_int;
volatile short *to_short;
volatile char *to_char;
volatile unsigned32 *from;
volatile long long int *from_llint;
volatile int *from_int;
volatile short *from_short;
volatile char *from_char;
int c, ok=0;
char *rbuffer;
int *pLocalTempDatabuffer;
unsigned i, sizevar;
/* the following code 'fixes' the virtual addresses so that they end with '000000000' */
volatile unsigned32 *remote_virtual_addr_aligned =
(volatile unsigned32 *)(((int)remote_virtual_addr_raw + 511) & ~511);
volatile unsigned32 *local_virtual_addr_aligned =
(volatile unsigned32 *)(((int)local_virtual_addr_raw + 511) & ~511);
nodeid_local = local_virtual_addr_aligned + 0x10;
nodeid_remote = remote_virtual_addr_aligned + 0x10;
pLocal = local_virtual_addr_aligned + 0x20;
pRemote = remote_virtual_addr_aligned + 0x20;
*nodeid_local = uLocalPsbNodeId; /* assign it the value of the nodeid */
uRemoteChunkSize = SciGetSizeOfRemoteChunk (pcon);
uRemoteChunkSize = uRemoteChunkSize - ((int)remote_virtual_addr_aligned -
(int)remote_virtual_addr_raw);
sleep(3);
printf ("Remote memory module %u chunk %u with owner Oxo%x and size %d has been mapped
into user space.\n",
uRemoteModuleID,uRemoteChunkID,*nodeidremote,uRemoteChunkSize);
if (!I_AMSERVER) /* client prints out information about the test */
{
time(&timevar);
printf("%s", ctime(&timevar));
printf("Number of outer loops: %d \n", NUMREPS);
printf("Maximum message size: %d bytes\n", MAXSIZE);
printf("Approximate time per datapoint: %10.6f seconds \n",
((double)TIMEPERPOINT)/1E6);
printf("Anticipated duration of test: %10.6f minutes\n",
((NUMREPS*((double)TIMEPERPOINT/1E6)*(MAXSIZE/RESOLUTION))/(60)));
}
switch (TESTTYPE)
{
case 1:
{
printf ("**ONE WAY WRITES WITH AN ACKNOWLEDGE**\n");
if (IAMSERVER) /* code for SERVER */
{
printf ("** I AM THE SERVER **\n");
for (rep_var = 0; rep_var < NUMREPS; rep_var++)
{
uMsgSize= 1;
/* synchronization phase */
tochar = (char*)pRemote;
fromchar = (char*)pLocal;
from_char[10] = 1;
for( i=0; i < 5; i++ )
{
while(from_char[10]!=(char)i);
to_char[10]=(char)i;
}
/* printf("end of first synchronization phase\n"); */
for( i=0; i < 100; i++ )
{
while(fromchar[10] !=(char)i);
to_char[10]=(char)i;
}
/* printf("end of second synchronization phase\n"); */
for(sizevar=0; uMsgSize <= MAXSIZE; sizevar++)
{
/* printf("message size = %d\n", uMsgSize);*/
printf(".");
while( (pLocalTempDatabuffer = (int*)RtlMemAlign(32, uMsgSize)) == NULL)
{
printf("\n Couldn't allocate %d kb buffer!\n", uMsgSize);
exit(1);
}
/* Can't have the CLIENT and SERVER both calculate repetions */
/* need to receive the repetitions info from the client */
/* uses three message handshake for repetitions transfer */
to_int = (int*)pRemote;
fromint = (int*)pLocal;
/* CLIENT sends 'repetitions' to the SERVER, we receive it below */
fromint[0]=0;
fromint[1]=0;
repetitions=0;
while(fromint[1]!=1)
repetitions=from_int[0];
repetitions=fromint[0];
fromint[0]=0;
/* SERVER sends 'repetitions' to the CLIENT so CLIENT can confirm that it was
received*/
while(fromint[1]!=2)
{
to_int[0] = repetitions;
to_int[1] = 1;
}
/* third synchronization phase */
fromchar[10] = 1;
for( i=0; i < 100; i++ )
{
while(fromchar[10] !=(char)i);
to_char[10]=(char)i;
}
switch(uMsgSize)
{
case 1:
{
to_char = (char*)pRemote;
fromchar = (char*)pLocal;
if(MEMALIGN==1)
{
tochar = tochar + (64-(uMsgSize%64));
from_char = from_char + (64-(uMsgSize%64));
}
fromchar[0]='x';
while(fromchar[0] !='o'); /* can't hold an int, use x/o chars */
to_char[0]='o';
fromchar[0]=0;
};
break;
case 2:
{
to_short = (short*)pRemote;
fromshort = (short*)pLocal;
if(MEMALIGN==1)
{
toshort = to short + ((64-(uMsgSize%64))/2);
from_short = from_short + ((64-(uMsgSize%64))/2);
}
fromshort[0]=1;
while(from short[0]!=(short)(repetitions-1));
to_short[0]=(short)(repetitions-1);
fromshort[0]=0;
};
break;
case 4:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
to_int =to_int + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
fromint[0]=1;
while(from int[0]!=(int)(repetitions-1));
to_int[0]=(int)(repetitions-1);
fromint[0]=0;
};
break;
case 8:
{
to_llint = (long long int*)pRemote;
fromllint = (long long int*)pLocal;
if(MEMALIGN==1)
{
to_llint = to_llint + ((64-(uMsgSize%64))/8);
fromllint = fromllint + ((64-(uMsgSize%64))/8);
}
fromllint[0]= 1;
while(fromllint[0]!=(long long int)(repetitions-1));
to_llint[0]=(long long int)(repetitions-1);
fromllint[0]=0;
};
break;
default:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
toint =toint + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
from_int[((uMsgSize/4)-1)]=1;
while(fromint[((uMsgSize/4)-1)]!=(int)(repetitions-1))
{
/*
printf("uMsgSize=%d, ", uMsgSize);
printf("((uMsgSize/4)-l)=%d, ", ((uMsgSize/4)-l));
printf("fromint[((uMsgSize/4)-l)]=%d, ", fromint[((uMsgSize/4)-l)]);
printf("repetitions=%d, ", repetitions);
printf("repetitions-l=%d\n ", repetitions-1);
*/
};
pLocalTempDatabuffer[((uMsgSize/4)-1)]=(int)(repetitions-1);
memcpy((void*)toint, (void*)pLocalTempDatabuffer, uMsgSize);
fromint[((uMsgSize/4)-1)]=0;
}
}
if(uMsgSize==1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
printf("%/d / %d complete\n",repvar+1,NUMREPS);
}
else /* code for CLIENT */
{
printf ("** I AM THE CLIENT **\n");
for (rep_var = 0; rep_var < NUMREPS; rep_var++)
{
uMsgSize= 1;
/* synchronization phase */
tochar = (char*)pRemote;
fromchar = (char*)pLocal;
fromchar[10]=1;
for( i=0; i < 5; i++)
{
to_char[10]=(char)i;
while(fromchar[10] !=(char)i)
to_char[10]=(char)i;
}
/* printf("end of first synchronization phase\n"); */
for( i=0; i < 100; i++)
{
to_char[10]=(char)i;
while(fromchar[10] !=(char)i);
}
/* printf("end of second synchronization phase\n"); */
for(sizevar=0; uMsgSize <= MAXSIZE; sizevar++)
{
printf(".");
/* printf("uMsgSize=%d\n", uMsgSize);*/
if( (pLocalTempDatabuffer = (int*)RtlMemAlign(32, uMsgSize)) == NULL)
{
printf("\n Couldn't allocate %d kb buffer!\n", uMsgSize);
exit(1);
}
/* work out the required number of repetitions */
repetitions = getreps(pRemote, pLocalTempDatabuffer, uMsgSize)*10;
/* Can't have the CLIENT and SERVER both calculate repetitions */
/* need to send the repetitions info to the server */
/* use two phase handshake */
to_int = (int*)pRemote;
from_int = (int*)pLocal;
/* CLIENT sends 'repetitions' to the SERVER to get the info across */
to_int[0] = repetitions;
usleep(20);
to_int[1]=1;
/* SERVER sends 'repetitions to the CLIENT so CLIENT can know that it was
received */
/* keep sending till we get this acknowledgement */
fromint[0]=0;
fromint[1]=0;
while(fromint[1]!=1)
{
to_int[0] = repetitions;
to_int[1] = 1;
}
usleep(50);
if (fromint[0] != repetitions)
printf("ERROR (from int[0]=%d)\n", from int[0]);
51
else
toint[1] = 2;
/* third synchronization phase */
fromchar[10]=1;
for( i=0; i < 100; i++)
{
to_char[10]=(char)i;
while(fromchar[10] !=(char)i)
to_char[10]=(char)i;
}
switch(uMsgSize)
{
case 1:
{
to_char = (char*)pRemote;
fromchar = (char*)pLocal;
if(MEMALIGN==1)
{
to_char =to_char + (64-(uMsgSize%64));
fromchar = fromchar + (64-(uMsgSize%64));
}
fromchar[0]=1;
tochar[0]='x';
gettimeofday(&st, NULL);
for(i=0;i<repetitions;i++)
{
if(i!=repetitions-1)
to_char[0]='x';
else
to_char[0]='o';
}
while(fromchar[0] !='o');
gettimeofday(&et, NULL);
fromchar[0]=0;
};
break;
case 2:
{
to_short = (short*)pRemote;
fromshort = (short*)pLocal;
if(MEMALIGN==1)
{
to_short =to_short + ((64-(uMsgSize%64))/2);
fromshort = fromshort + ((64-(uMsgSize%64))/2);
}
fromshort[0]=1;
gettimeofday(&st, NULL);
for(i=0;i<repetitions;i++)
to_short[0]=(short)i;
while(from short[0] !=(short)(repetitions-1));
gettimeofday(&et, NULL);
from short[0]=0;
};
break;
case 4:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
toint =to_ int + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
fromint[0]= 1;
gettimeofday(&st, NULL);
for(i=0;i<repetitions;i++)
to_int[0]=(int)i;
while(from int[0]!=(int)(repetitions-1));
gettimeofday(&et, NULL);
fromint[0]=0;
};
break;
case 8:
{
to_llint = (long long int*)pRemote;
fromllint = (long long int*)pLocal;
if(MEMALIGN==1)
{
to_llint = to_llint + ((64-(uMsgSize%64))/8);
fromllint = fromllint + ((64-(uMsgSize%64))/8);
}
fromllint[0]= 1;
gettimeofday(&st, NULL);
for(i=0;i<repetitions;i++)
to_llint[0]=(long long int)i;
while(fromllint[0]!=(long long int)(repetitions-1));
gettimeofday(&et, NULL);
fromllint[0]=0;
};
break;
default:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
toint =to_int + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
fromint[((uMsgSize/4)-1)]=1;
gettimeofday(&st, NULL);
for(i=0;i<repetitions;i++)
{
pLocalTempDatabuffer[((uMsgSize/4)-1)]=(int)i;
memcpy((void*)to_int, (void*)pLocalTempDatabuffer, uMsgSize);
}
while(from int[((uMsgSize/4)-1)]!=(int)(repetitions-1));
gettimeofday(&et, NULL);
from int[uMsgSize-1]=0;
}
}
if(et.tv_usec < st.tv_usec)
{
et.tv_usec += 1E6;
et.tv_sec -= 1;
}
/* calculate elapsed time in us */
elapsed=(et.tv_sec - st.tv_sec)*1E6 +
(et.tv_usec - st.tv_usec);
/* Double-checking position in array */
if(repvar==0)
{
latencies[size_var] [0] = uMsgSize;
throughputs[sizevar] [0] = uMsgSize;
}
else
{
if(latencies[sizevar] [0] != uMsgSize)
printf("Latency Array Error (latencies[%d] [0]=%10.6fuMsgSize=%d)\n",
size var,
latencies[sizevar] [0],
uMsgSize);
if(throughputs[sizevar][0] != uMsgSize)
printf("Throughput Array Error(throughputs[%od] [0]=%10.6fuMsgSize=%d)\n",
size var,
throughputs[sizevar] [0],
uMsgSize);
}
latencies[sizevar][repvar+1] = elapsed/repetitions;
throughputs[sizevar][repvar+1] = (uMsgSize*repetitions)/elapsed;
/*
printf("%8.5f,%6d,%7d,%10.6f,%09.3f\n",
elapsed/lE6,
repetitions,
uMsgSize,
(elapsed)/repetitions,
(uMsgSize*repetitions)/(((elapsed)/1E6)*1048576));*/
if(uMsgSize==1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
printf("%d / %d complete\n",repvar+1,NUMREPS);
print_latency_array();
print_throughput_array();
print_latency_summary();
print_throughput_summary();
}
};
break;
case 2:
{
printf ("**PING PONG TEST**\n");
if (IAMSERVER) /* code for SERVER */
{
printf ("** I AM THE SERVER **\n");
for (rep_var = 0; rep_var < NUMREPS; rep_var++)
{
uMsgSize= 1;
/* synchronization phase */
tochar = (char*)pRemote;
fromchar = (char*)pLocal;
fromchar[10] = 1;
for( i=0; i < 5; i++ )
{
while(fromchar[10] !=(char)i);
to_char[10]=(char)i;
}
/* printf("end of first synchronization phase\n"); */
for( i=0; i < 100; i++ )
{
while(fromchar[10] !=(char)i);
to_char[10]=(char)i;
}
/* printf("end of second synchronization phase\n"); */
for(sizevar=0; uMsgSize <= MAXSIZE; sizevar++)
{
/* printf("uMsgSize = %d\n", uMsgSize); */
printf(".");
while( (pLocalTempDatabuffer = (int*)RtlMemAlign(32, uMsgSize)) == NULL)
{
printf("\n Couldn't allocate %d kb buffer!\n", uMsgSize);
exit(1);
}
/* Can't have the CLIENT and SERVER both calculate repetions */
/* need to receive the repetitions info from the client */
/* uses three message handshake for repetitions transfer */
to_int = (int*)pRemote;
fromint = (int*)pLocal;
/* CLIENT sends 'repetitions' to the SERVER, we receive it below */
fromint[0]=0;
fromint[1]=0;
repetitions=0;
while(fromint[1]!=1)
repetitions=from_int[0];
repetitions=fromint[0];
fromint[0]=0;
/* SERVER sends 'repetitions' to the CLIENT so CLIENT can confirm that it was
received*/
while(fromint[1]!=2)
{
to_int[0] = repetitions;
to_int[1] = 1;
}
/* second synchronization phase */
fromchar[10] = 1;
for( i=0; i < 100; i++ )
{
while(fromchar[10] !=(char)i);
to_char[10]=(char)i;
}
switch(uMsgSize)
{
case 1:
{
to_char = (char*)pRemote;
fromchar = (char*)pLocal;
if(MEMALIGN==1)
{
tochar = tochar + (64-(uMsgSize%64));
from_char = from_char + (64-(uMsgSize%64));
}
fromchar[0]=1;
for(i=0;i<repetitions;i++)
{
while(fromchar[0] !=(char)i);
tochar[0]=(char)i;
}
fromchar[0]=0;
};
break;
case 2:
{
to short = (short*)pRemote;
fromshort = (short*)pLocal;
if(MEMALIGN==1)
{
toshort = toshort + ((64-(uMsgSize%64))/2);
from_short = from_short + ((64-(uMsgSize%64))/2);
}
fromshort[0]=1;
for(i=0;i<repetitions;i++)
{
while(fromshort[0] !=(short)i);
to_short[0]=(short)i;
}
fromshort[0]=0;
};
break;
case 4:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
to_int =to_int + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
fromint[0]= 1;
for(i=0;i<repetitions;i++)
{
while(fromint[0] !=(int)i);
toint[0]=(int)i;
}
fromint[0]=0;
};
break;
case 8:
{
to_llint = (long long int*)pRemote;
fromllint = (long long int*)pLocal;
if(MEMALIGN==1)
{
tollint =to_llint + ((64-(uMsgSize%64))/8);
fromllint = fromllint + ((64-(uMsgSize%64))/8);
}
fromllint[0]= 1;
for(i=0;i<repetitions;i++)
{
while(fromllint[0]!=(long long int)i);
to_llint[0]=(long long int)i;
}
fromllint[0]=0;
};
break;
default:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
toint =to_int + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
fromint[((uMsgSize/4)-1)]=1;
for(i=0;i<repetitions;i++)
{
while(fromint[((uMsgSize/4)-1)]!=(int)i);
pLocalTempDatabuffer[((uMsgSize/4)-1)]=(int)i;
memcpy((void*)to_int, (void*)pLocalTempDatabuffer, uMsgSize);
}
fromint[((uMsgSize/4)-1)]=0;
}
}
if(uMsgSize==1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
printf("%d / %d complete\n",repvar+1,NUMREPS);
}
else /* code for CLIENT */
{
printf ("** I AM THE CLIENT **\n");
for (rep_var = 0; rep_var < NUMREPS; rep_var++)
{
uMsgSize= 1;
/* synchronization phase */
tochar = (char*)pRemote;
fromchar = (char*)pLocal;
fromchar[10]=1;
for( i=0; i < 5; i++)
{
to_char[10]=(char)i;
while(fromchar[10] !=(char)i)
to_char[10]=(char)i;
}
/* printf("end of first synchronization phase\n"); */
for( i=0; i < 100; i++)
{
to_char[10]=(char)i;
while(fromchar[10] !=(char)i);
}
/* printf("end of second synchronization phase\n"); */
for(sizevar=0; uMsgSize <= MAXSIZE; sizevar++)
{
/* printf("uMsgSize = %d\n", uMsgSize); */
printf(".");
if( (pLocalTempDatabuffer = (int*)RtlMemAlign(32, uMsgSize)) == NULL)
{
printf("\n Couldn't allocate %d kb buffer!\n", uMsgSize);
exit(1);
}
/* work out the required number of repetitions */
repetitions = getreps(pRemote, pLocalTempDatabuffer, uMsgSize)*10;
/* Can't have the CLIENT and SERVER both calculate repetitions */
/* need to send the repetitions info to the server */
/* use two phase handshake */
to_int = (int*)pRemote;
from_int = (int*)pLocal;
/* CLIENT sends 'repetitions' to the SERVER to get the info across */
to_int[0] = repetitions;
usleep(20);
to_int[1]=1;
/* SERVER sends 'repetitions to the CLIENT so CLIENT can know that it was
received */
/* keep sending till we get this acknowledgement */
fromint[0]=0;
fromint[1]=0;
while(fromint[1]!=1)
{
to_int[0] = repetitions;
to_int[1] = 1;
; /* need this to context switch to allow the new write */
}
usleep(20);
if (fromint[0] != repetitions)
printf("ERROR (fromint[0]=%d)\n", fromint[0]);
else
to_int[1] = 2;
/* second synchronization phase */
fromchar[10]=1;
for( i=0; i < 100; i++)
{
to_char[10]=(char)i;
while(fromchar[10] !=(char)i)
to_char[10]=(char)i;
}
switch(uMsgSize)
{
case 1:
{
to_char = (char*)pRemote;
fromchar = (char*)pLocal;
if(MEMALIGN==1)
{
tochar = tochar + (64-(uMsgSize%64));
from_char = from_char + (64-(uMsgSize%64));
}
fromchar[0]=1;
gettimeofday(&st, NULL);
for(i=0;i<repetitions;i++)
{
tochar[0]=(char)i;
while(fromchar[0] !=(char)i);
}
gettimeofday(&et, NULL);
fromchar[0]=0;
};
break;
case 2:
{
to_short = (short*)pRemote;
fromshort = (short*)pLocal;
if(MEMALIGN==1)
{
toshort = to short + ((64-(uMsgSize%64))/2);
from_short = from_short + ((64-(uMsgSize%64))/2);
}
fromshort[0]=1;
gettimeofday(&st, NULL);
for(i=0;i<repetitions;i++)
{
toshort[0]=(short)i;
while(fromshort[0] !=(short)i);
}
gettimeofday(&et, NULL);
fromshort[0]=0;
};
break;
case 4:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
toint =to_ int + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
fromint[0]= 1;
gettimeofday(&st, NULL);
for(i=0;i<repetitions;i++)
{
to_int[0]=(int)i;
while(fromint[0] !=(int)i);
}
gettimeofday(&et, NULL);
fromint[0]=0;
};
break;
case 8:
{
to_llint = (long long int*)pRemote;
fromllint = (long long int*)pLocal;
if(MEMALIGN==1)
{
to_llint = to_llint + ((64-(uMsgSize%64))/8);
fromllint = fromllint + ((64-(uMsgSize%64))/8);
}
fromllint[0]=1;
gettimeofday(&st, NULL);
for(i=0;i<repetitions;i++)
{
to_llint[0]=(long long int)i;
while(fromllint[0]!=(long long int)i);
}
gettimeofday(&et, NULL);
fromllint[0]=0;
};
break;
default:
{
to_int = (int*)pRemote;
fromint = (int*)pLocal;
if(MEMALIGN==1)
{
toint =toint + ((64-(uMsgSize%64))/4);
fromint = fromint + ((64-(uMsgSize%64))/4);
}
fromint[((uMsgSize/4)-1)]=1;
gettimeofday(&st, NULL);
for(i=0;i<repetitions;i++)
{
pLocalTempDatabuffer[((uMsgSize/4)-1)]=(int)i;
memcpy((void*)to_int, (void*)pLocalTempDatabuffer, uMsgSize);
while(fromint[((uMsgSize/4)-1)]!=(int)i);
}
gettimeofday(&et, NULL);
fromint[((uMsgSize/4)-1)]=0;
}
}
if(et.tv_usec < st.tv_usec)
{
et.tv_usec += 1E6;
et.tv_sec -= 1;
}
/* calculate elapsed time in us */
elapsed=(et.tv_sec - st.tv_sec)*1E6 +
(et.tv_usec - st.tv_usec);
/* Double-checking position in array */
if(repvar==0)
{
latencies[size_var] [0] = uMsgSize;
throughputs[sizevar] [0] = uMsgSize;
/*
printf("latencies[%d] [0]=%10.6f uMsgSize=%d\n", sizevar,
latencies[sizevar][0], uMsgSize);
printf("throughputs[%Od][0]=%10.6f uMsgSize=%d\n", sizevar,
throughputs[sizevar] [0], uMsgSize);
*/
}
else
{
if(latencies[size_var] [0] != uMsgSize)
printf("Latency Array Error (latencies[%d] [0]=%10.6f uMsgSize=%d)\n",
size var,
latencies[sizevar] [0],
uMsgSize);
if(throughputs[sizevar][0] != uMsgSize)
printf("Throughput Array Error(throughputs[%od] [0]=%10.6fuMsgSize=%d)\n",
size var,
throughputs[sizevar] [0],
uMsgSize);
}
latencies[sizevar][repvar+1] = (elapsed/2)/repetitions;
throughputs[sizevar][repvar+1] =
(uMsgSize*repetitions)/((((elapsed/2))/1E6)*1048576);
/*
printf("%8.5f,o%6d,%07d,%10.6f,%9.3f\n",
elapsed/1E6,
repetitions,
uMsgSize,
(elapsed/2)/repetitions,
(uMsgSize*repetitions)/(((elapsed/2)/1E6)*1048576));*/
if(uMsgSize==1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
printf("%/d / %d complete\n",repvar+1,NUM REPS);
print latency_array();
print throughput array();
print latency_summary();
print throughput summary();
}
};
break;
default:
{
printf ("Invalid test type.\n");
break;
}
}
}
}
sleep(1);
SciWithdraw (pshm,0,uLocalModuleID,uLocalChunkID,&status);
if (status != ICMS_OK)
fprintf (stderr,"Error (%s) withdrawing memory from SCI.\n",SciErrorString (status));
else
printf ("Memory withdrawn from SCI. Clients disconnected.\n");
SciDisconnectFromRemoteChunk (&pcon,&status);
if (status != ICMS_OK)
fprintf (stderr,"Error (%s) disconnecting from remote memory.\n",SciErrorString (status));
else
printf ("Disconnected from memory.\n");
SciCloseLocalChunk (&pshm,&status);
if (status != ICMS_OK)
fprintf (stderr,"Error (%s) freeing local memory.\n",SciErrorString (status));
else
printf ("Local memory freed.\n");
SciClose ();
return 0;
}
int getreps(long long int *tol, long long int *froml, int size)
{
/* this gives a rough baseline of the number of repetitions necessary for 1/10 the time per datapoint */
/* It is used for both types of tests, and admittedly isn't absolutely the same as each test */
/* It is, however, good enough, for our purposes (govm't work :)) */
int loop_count;
long long int *to, *from;
gettimeofday(&st, NULL);
for(loop_count=10; loop_count < 1E6; loop_count++)
{
to=tol;
from=froml;
switch(size)
{
case 1:
{
*((char*)to)=*((char*)from);
};
break;
case 2:
{
*((short*)to)=*((short*)from);
};
break;
case 4:
{
*((int*)to)=*((int*)from);
};
break;
case 8:
{
*(to)=*(from);
};
break;
case 16:
{
*(to++)=*(from++);
*(to++)=*(from++);
};
break;
case 32:
{
*(to++)=*(from++);
*(to++)=*(from++);
*(to++)=*(from++);
*(to++)=*(from++);
};
break;
default:
{
memcpy((long int *) to,(long int *) from, size );
}
}
gettimeofday(&et, NULL);
if(et.tv_usec < st.tv_usec)
{
et.tv_usec += 1E6;
et.tv_sec -= 1;
}
/* calculate elapsed time in us */
elapsed=(et.tv_sec - st.tv_sec)*1E6 +
(et.tv_usec - st.tv_usec);
if (elapsed > (TIMEPERPOINT/10))
return loop_count;
}
return loop_count;
}
int print_latency_summary(void)
{
int uMsgSize=1;
int size_var;
int rep_var;
/* print out summary of results */
printf("--- ----------------------- \n");
printf("--------- LATENCY SUMMARY --------------\n");
printf("------------------------ ---\n");
printf("Size,Max,Min,Avg,Max-Min,((Max-Min)/Min)\n");
printf("------------------------ ---\n");
for (size_var = 0; uMsgSize <= MAXSIZE; size_var++)
{
total=0;
max=0;
min=10000;
for(rep_var = 0; rep_var < NUMREPS; rep_var++)
{
if(latencies[size_var][rep_var+1] > max)
max=latencies[size_var][rep_var+1];
if(latencies[size_var][rep_var+1] < min)
min=latencies[size_var][rep_var+1];
total = total + latencies[size_var][rep_var+1];
}
avg=total/NUMREPS;
printf("%d,%10.6f,%10.6f,%10.6f, %10.6f, %10.6f, "uMsgSize, max, min, avg, (max-min), ((max-
min)/min) );
/* Guage accuracy of results */
if( (max-min) > (min*ACCURACY) )
printf("Insufficient accuracy ");
printf("\n");
if(uMsgSize==1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
}
return 0;
}
int print_throughput_summary(void)
{
int uMsgSize=1;
int size_var;
int rep_var;
/* print out summary of results */
printf("--- ----------------------- n");
printf("--------- THROUGHPUT SUMMARY -----------\n");
printf("--- ----------------------- n");
printf("Size,Max,Min,Avg,Max-Min,((Max-Min)/Min)\n");
printf("--- ----------------------- n");
for (size_var = 0; uMsgSize <= MAXSIZE; size_var++)
{
total=0;
max=0;
min= 10000;
for(rep_var = 0; rep_var < NUMREPS; rep_var++)
{
if(throughputs[size_var][rep_var+1] > max)
max=throughputs[size_var][rep_var+1];
if(throughputs[size_var][rep_var+1] < min)
min=throughputs[size_var][rep_var+1];
total = total + throughputs[size_var][rep_var+1];
}
avg=total/NUMREPS;
printf("%d,%10.6f,%10.6f,%10.6f,%10.6f,%10.6f, ", uMsgSize, max, min, avg, (max-min),
((max-min)/min) );
/* Guage accuracy of results */
if( (max-min) > (min*ACCURACY) )
printf("Insufficient accuracy ");
printf("\n");
if(uMsgSize== 1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
}
return 0;
}
int print_latency_array (void)
{
int uMsgSize=1;
int size_var;
int rep_var;
printf("\n");
printf("---------------------------------- ------- n");
printf("------------ RAW LATENCY DATA ---------------------\n");
printf("---------------------------------- ------- n");
printf("Size,Repl,Rep2,Rep3,Rep4,Rep5,Rep6,Rep7,Rep8,Rep9,Repl0 \n");
printf("---------------------------------- ------- n");
for (size_var = 0; uMsgSize <= MAXSIZE; size_var++)
{
/* printf("%d, ", uMsgSize);*/
printf("%10.6f, ", latencies[size_var][0]);
for(repvar = 0; repvar < NUM REPS; repvar++)
printf("%10.6f, ", latencies[sizevar] [repvar+l]);
printf("\n");
if(uMsgSize== 1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
}
return 0;
}
int print_throughput_array (void)
{
int uMsgSize=1;
int size_var;
int rep_var;
printf("\n");
printf("---------------------------------- ------- n");
printf("------------ RAW THROUGHPUT DATA -------------\n");
printf("---------------------------------- -----\n");
printf("Size,Repl,Rep2,Rep3,Rep4,Rep5,Rep6,Rep7,Rep8,Rep9,Repl0 \n");
printf("---------------------------------- -----\n");
for (size_var = 0; uMsgSize <= MAXSIZE; size_var++)
{
/* printf("%d, ", uMsgSize); */
printf("%10.6f, ", throughputs[sizevar] [0]);
for(rep_var = 0; repvar < NUM REPS; rep_var++)
printf("%10.6f, ", throughputs[sizevar][repvar+l]);
printf("\n");
if(uMsgSize== 1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
}
return 0;
}
APPENDIX B
MPIBENCH CODE LISTING
/********************************************************
* MPIBENCH *
* Portable Message Passing benchmarks *
* *
* Damian M. Gonzalez gonzalez@hcs.ufl.edu *
* HCS Lab University of Florida, Gainesville *
********************************************************/
/********************************************************
* This is originally written for the ScaMPI *
* implementation. Alter as necessary to compile for *
* other MPI implementations. *
********************************************************/
#include <stdio.h>    /* printf, setvbuf */
#include <stdlib.h>   /* malloc */
#include <unistd.h>   /* gethostname */
#include <time.h>     /* time, ctime */
#include <sys/time.h> /* gettimeofday */
#include "/opt/scali/include/mpi.h"
/* for timing */
struct timeval st, et;
double elapsed;
time_t timevar; /* to print out date of test later */
#define MB 1024*1024
#define KB 1024
#define ACCURACY 0.1 /* A warning is printed if (max-min) is greater than
(min*ACCURACY) */
#define MAXSIZE 256*KB /* LARGE in bytes */
/*#define MAXSIZE 4*KB /* SMALL in bytes */
#define NUMREPS 10 /* number of iterations made to determine max, min, avg */
#define TIMEPERPOINT 2E4 /* self explanatory, in microseconds */
#define RESOLUTION 8192 /* interval between successive points, in bytes */
/* if this is zero, a powers of two analysis is performed */
#define TESTTYPE 1 /* 1= one way test */
/* 2= ping pong test */
/* arrays to hold the raw data, for the max min avg calculations later */
/* one extra to contain the message size at location [0] for validation */
/* Note: need to change this later when I implement the variable # of reps */
double latencies[MAXSIZE/RESOLUTION+1][NUMREPS+1];
double throughputs[MAXSIZE/RESOLUTION+1][NUMREPS+1];
double min,max,avg,total;
/* to hold the number of repetitions for each message size that's calculated at the beginning */
/* width is two so that it may contain both message size [0] and the repetitions value [1] */
int repetitions[MAXSIZE/RESOLUTION] [2];
int rank; /* needs to be globally available */
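/*
 * Program flow (summary added for readability):
 *   1. MPI_Init and rank/size query; rank 1 acts as the timing side.
 *   2. getreps() calibrates the repetition count for each message size.
 *   3. For each outer iteration and message size, run the one-way or
 *      ping-pong exchange and record per-repetition latency/throughput.
 *   4. Rank 1 prints the raw arrays and min/max/avg summaries.
 */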
int main (int argc, char **argv)
{
char host[20]; /* to contain the host name for printing out later */
int Stat; /* to contain status after MPIBarrier() */
int comm_size;
int send_index, rec_index;
int j; /* loop variable for synchronization runs */
int sizevar, repvar;
int *buff, uMsgSize, total_loops, num_datapoints, temp;
MPI_Status status;
setvbuf(stdout, NULL, _IONBF, 0); /* simply to facilitate the printing of the dots during the
iterations */
setvbuf(stdin, NULL, _IONBF, 0);
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank);
MPI_Comm_size( MPI_COMM_WORLD, &comm_size);
buff = malloc(MAXSIZE);
Stat = MPI_Barrier(MPI_COMM_WORLD);
gethostname(&host[0], 20);
printf("rank:%/d, host:%s\n",rank,&host[0]);
if (rank == 1)
{
time(&timevar);
printf("%s", ctime(&timevar));
/* Calculating the time for the test */
if(RESOLUTION>0)
{
num_datapoints=MAXSIZE/RESOLUTION;
}
else
{
temp=MAXSIZE;
num_datapoints=0;
while(temp > 1)
{
temp=temp/2;
num_datapoints++;
}
}
if(TESTTYPE==1)
printf("---------- ONE WAY TEST -----------------\n");
if(TESTTYPE==2)
printf("---------- PING PONG TEST ---------------\n");
printf("Maximum message size:0/od (%/d points/iteration) \n", MAXSIZE,
(num datapoints+2));
printf("Number of outer loops:%d \n", NUM_REPS);
printf("Number of seconds per datapoint:%10.6f \n", (TIMEPERPOINT/1E6));
printf("Anticipated duration of main test:%10.6f minutes\n",
((NUM_REPS*(TIMEPERPOINT/1E6)*(num datapoints+1))/(60)));
printf("------------------------------------- \n");
synch(5);
/* fill out repetitions array */
getreps();
synch(100);
for (repvar = 0; repvar < NUMREPS; repvar++)
{
uMsgSize=0;
for (sizevar = 0; uMsgSize <= MAXSIZE; sizevar++)
{
if(rank== 1)
printf(".");
if(uMsgSize==0)
total_loops=repetitions[0][1]*10; /* couldn't perform getreps for zero size */
else
total_loops = repetitions[sizevar-1][1]*10;
/*
printf("rank:%d uMsgSize=%d repetitions[%od][0]=%d repetitions[%d][1l]=%d %d\n",
rank,
uMsgSize,
size var,
repetitions[sizevar] [0],
size var,
repetitions[sizevar] [1],
total_loops);
*/
switch (TESTTYPE)
{
case 1: /* 1= one way test */
{
if (rank == 0)
{
for (rec_index=0; rec_index < total_loops; rec_index++)
{
MPI_Recv (buff, uMsgSize/4, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
}
MPI_Send (buff, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
}
else
{
/* get starting time */
gettimeofday(&st, NULL);
for (send_index=0; send_index < total_loops; send_index++)
{
MPI_Send (buff, uMsgSize/4, MPI_INT, 0, 1, MPI_COMM_WORLD);
}
MPI_Recv (buff, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
gettimeofday(&et, NULL);
/* get final time */
if(et.tv_usec < st.tv_usec)
{
et.tv_usec += 1E6;
et.tv_sec -= 1;
}
/* calculate elapsed time in us */
elapsed=(et.tv_sec - st.tv_sec)*1E6 +
(et.tv_usec - st.tv_usec);
/* Double-checking position in array */
if(repvar==0)
{
latencies[sizevar] [0] = uMsgSize;
throughputs[sizevar] [0] = uMsgSize;
}
else
{
if(latencies[size_var] [0] != uMsgSize)
printf("Latency Array Error (latencies[%d] [0]=%10.6f uMsgSize=%d)\n",
size var,
latencies[size_var] [0],
uMsgSize);
if(throughputs[sizevar][0] != uMsgSize)
printf("Throughput Array Error(throughputs[%d] [0]=%10.6f
uMsgSize=%d)\n",
size var,
throughputs[sizevar] [0],
uMsgSize);
latencies[sizevar][repvar+1] = elapsed/total_loops;
throughputs[sizevar][repvar+1] =
(uMsgSize*total_loops)/(((elapsed)/1E6)*1048576);
/*
printf("%7d,%8.5f,%6d,%10.6f,%9.3f\n",
uMsgSize,
elapsed/lE6,
total_loops,
elapsed/total loops,
(uMsgSize*total_loops)/((elapsed/1E6)*1048576));*/
}
break;
case 2: /* 2= ping pong test */
{
if (rank == 0)
{
for (rec_index=0; rec_index < total_loops; rec_index++)
{
MPI_Recv (buff, uMsgSize/4, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
MPI_Send (buff, uMsgSize/4, MPI_INT, 1, 1, MPI_COMM_WORLD);
}
}
else
{
/* get starting time */
gettimeofday(&st, NULL);
for (send_index=0; send_index < total_loops; send_index++)
{
MPI_Send (buff, uMsgSize/4, MPI_INT, 0, 1, MPI_COMM_WORLD);
MPI_Recv (buff, uMsgSize/4, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
gettimeofday(&et, NULL);
/* get final time */
if(et.tv_usec < st.tv_usec)
{
et.tv_usec += 1E6;
et.tv_sec -= 1;
}
/* calculate elapsed time in us */
elapsed=(et.tv_sec - st.tv_sec)*1E6 +
(et.tv_usec - st.tv_usec);
if(repvar==0)
{
latencies[sizevar] [0] = uMsgSize;
throughputs[sizevar] [0] = uMsgSize;
}
else
{
if(latencies[size_var] [0] != uMsgSize)
printf("Latency Array Error (latencies[%d] [0]=%10.6f uMsgSize=%d)\n",
size var,
latencies[size_var] [0],
uMsgSize);
if(throughputs[sizevar][0] != uMsgSize)
printf("Throughput Array Error(throughputs[%d] [0]=%10.6f
uMsgSize=%d)\n",
size var,
throughputs[sizevar] [0],
uMsgSize);
latencies[sizevar][repvar+1] = (elapsed/2)/total_loops;
throughputs[sizevar][repvar+1] =
(uMsgSize*total_loops)/((((elapsed/2))/1E6)*1048576);
/*
printf("%7d,%8.5f,%6d,%10.6f,%9.3f\n",
uMsgSize,
(elapsed)/1E6,
total_loops,
(elapsed/2)/total loops,
(uMsgSize*total_loops)/(((elapsed/2)/1E6)*1048576));*/
}
break;
default:
{
printf ("Invalid test type.\n");
break;
}
}
fflush (stdout);
if(RESOLUTION > 0)
uMsgSize=uMsgSize+RESOLUTION;
else
{
if(uMsgSize==0)
uMsgSize=1;
else
uMsgSize*=2;
}
if(rank==1)
printf("\n %d / %d complete\n",repvar+1,NUMREPS);
/* Calculate and print Minimum, Maximum and Average Latency and Throughput for each size
*/
if(rank== 1)
print_latency_array();
print_throughput_array();
print_latency_summary();
print_throughput_summary();
MPIFinalize();
return 0;
}
int getreps(void)
{
int loop_count;
int *tempbuff;
int size_var;
int message_size=4; /* can't co-ordinate the end of testing if size is zero!! */
int j;
int start_size=5;
int increment=200;
MPI_Status status;
tempbuff = malloc (MAXSIZE * sizeof (char));
for (size_var = 0; message_size <= MAXSIZE; size_var++)
{
*tempbuff=0;
switch (TESTTYPE)
{
case 1:
{
if (rank ==0)
{
for(loop_count=start_size;loop_count<1E8;loop_count+=increment)
{
for (j = 0; j < loop_count; j++)
{
MPI_Recv (tempbuff, message_size/4, MPI_INT, 1, 1,
MPI_COMM_WORLD, &status);
if(*tempbuff==9) /* this is just a local read, shouldn't take too long */
break;
}
if(*tempbuff==9) /* this is just a local read, shouldn't take too long */
{
loop_count-=increment;/* rank zero will have counted one more, subtract this */
break;
}
MPI_Send (tempbuff, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
}
}
else
{
for(loop_count=start_size;loop_count<1E8;loop_count+=increment)
{
gettimeofday(&st, NULL);
for (j = 0; j < loop_count; j++)
{
MPI_Send (tempbuff, message_size/4, MPI_INT, 0, 1,
MPI_COMM_WORLD);
}
MPI_Recv (tempbuff, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
gettimeofday(&et, NULL);
if(et.tv_usec < st.tv_usec)
{
et.tv_usec += 1E6;
et.tv_sec -= 1;
}
/* calculate elapsed time in us */
elapsed=(et.tv_sec - st.tv_sec)*1E6 +
(et.tv_usec - st.tv_usec);
/* if we've reached 1/10 the time per point, return the value */
if (elapsed > (TIMEPERPOINT/10))
{
*tempbuff=9;
MPI_Send (tempbuff, message_size/4, MPI_INT, 0, 1,
MPI_COMM_WORLD);
break;
}
}
}
}
break;
case 2:
{
if (rank ==0)
{
for(loop_count=start_size;loop_count<1E8;loop_count++)
{
MPI_Recv (tempbuff, message_size/4, MPI_INT, 1, 1, MPI_COMM_WORLD,
&status);
if(*tempbuff==9) /* this is just a local read, shouldn't take too long */
{
loop_count--;
break;
}
MPI_Send (tempbuff, message_size/4, MPI_INT, 1, 1, MPI_COMM_WORLD);
}
}
else
{
gettimeofday(&st, NULL);
for(loop_count=start_size;loop_count<1E8;loop_count++)
{
MPI_Send (tempbuff, message_size/4, MPI_INT, 0, 1, MPI_COMM_WORLD);
MPI_Recv (tempbuff, message_size/4, MPI_INT, 0, 1, MPI_COMM_WORLD,
&status);
gettimeofday(&et, NULL);
if(et.tv_usec < st.tv_usec)
{
et.tv_usec += 1E6;
et.tv_sec -= 1;
}
/* calculate elapsed time in us */
elapsed=(et.tv_sec - st.tv_sec)*1E6 +
(et.tv_usec - st.tv_usec);
/* if we've reached 1/10 the time per point, return the value */
if (elapsed > (TIMEPERPOINT/10))
{
*tempbuff=9;
MPI_Send (tempbuff, message_size/4, MPI_INT, 0, 1,
MPI_COMM_WORLD);
break;
}
}
}
}
break;
default:
{
printf ("Invalid test type.\n");
break;
}
/*
if (rank== 1)
printf("rank:%d sizevar-%d message_size=%d elapsed=%10.6f loop_count=%d\n",
rank, sizevar, message_size, elapsed, loopcount);
*/
repetitions[size_var][0]=message_size;
repetitions[size_var][1]=loop_count;
if(RESOLUTION > 0)
{
message_size=message_size+RESOLUTION;
}
else
message_size*=2;
}
}
int print_latency_summary(void)
{
int uMsgSize=1;
int size_var;
int rep_var;
/* print out summary of results */
printf("-- ---------------------- ---n");
printf("--------- LATENCY SUMMARY --------------\n");
printf("------------------------------n");
printf("Size,Max,Min,Avg,Max-Min,((Max-Min)/Min)\n");
printf("--- ----------------------- n");
for (size_var = 0; uMsgSize <= MAXSIZE; size_var++)
{
total=0;
max=0;
min=10000;
for(rep_var = 0; rep_var < NUMREPS; rep_var++)
{
if(latencies[size_var][rep_var+1] > max)
max=latencies[size_var][rep_var+1];
if(latencies[size_var][rep_var+1] < min)
min=latencies[size_var][rep_var+1];
total = total + latencies[size_var][rep_var+1];
}
avg=total/NUMREPS;
printf("%d,%10.6f,%10.6f,%10.6f,%10.6f,%10.6f, ", uMsgSize, max, min, avg, (max-min),
((max-min)/min));
/* Guage accuracy of results */
if( (max-min) > (min*ACCURACY) )
printf("Insufficient accuracy ");
printf("\n");
if(uMsgSize==1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
}
return 0;
}
int print_throughput_summary(void)
{
int uMsgSize=1;
int size_var;
int rep_var;
/* print out summary of results */
printf("--- ----------------------- n");
printf("--------- THROUGHPUT SUMMARY -----------\n");
printf("------------------------ ---n");
printf("Size,Max,Min,Avg,Max-Min,((Max-Min)/Min)\n");
printf("------------------------ ---n");
for (size_var = 0; uMsgSize <= MAXSIZE; size_var++)
{
total=0;
max=0;
min= 10000;
for(rep_var = 0; rep_var < NUMREPS; rep_var++)
{
if(throughputs[size_var][rep_var+1] > max)
max=throughputs[size_var][rep_var+1];
if(throughputs[size_var][rep_var+1] < min)
min=throughputs[size_var][rep_var+1];
total = total + throughputs[size_var][rep_var+1];
}
avg=total/NUMREPS;
printf("%d,%10.6f,%10.6f,%10.6f,%10.6f,%10.6f, ", uMsgSize, max, min, avg, (max-min),
((max-min)/min));
/* Guage accuracy of results */
if( (max-min) > (min*ACCURACY) )
printf("Insufficient accuracy ");
printf("\n");
if(uMsgSize==1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
}
return 0;
}
int print_latency_array (void)
{
int uMsgSize=1;
int size_var;
int rep_var;
printf("\n");
printf("---------------------------------- ------- n");
printf("------------ RAW LATENCY DATA ----------------\n");
printf("---------------------------------- ------- n");
printf("Size,Repl,Rep2,Rep3,Rep4,Rep5,Rep6,Rep7,Rep8,Rep9,Repl0 \n");
printf("---------------------------------- ------- n");
for (size_var = 0; uMsgSize <= MAXSIZE; size_var++)
{
/* printf("%d, ", uMsgSize);*/
printf("%10.6f, ", latencies[size_var][0]);
for(repvar = 0; repvar < NUM REPS; repvar++)
printf("%10.6f, ", latencies[sizevar] [rep_var+l]);
printf("\n");
if(uMsgSize== 1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
}
return 0;
}
int print_throughput_array (void)
{
int uMsgSize=1;
int size_var;
int rep_var;
printf("\n");
printf(" ------------------------- -------- n");
printf("------------ RAW THROUGHPUT DATA -------------\n");
printf("---------------------------------- ------- n");
printf("Size,Repl,Rep2,Rep3,Rep4,Rep5,Rep6,Rep7,Rep8,Rep9,Repl0 \n");
printf("---------------------------------- ------- n");
for (size_var = 0; uMsgSize <= MAXSIZE; size_var++)
{
/* printf("%d, ", uMsgSize); */
printf("%10.6f, ", throughputs[sizevar] [0]);
for(repvar = 0; repvar < NUM REPS; repvar++)
printf("%10.6f, ", throughputs[sizevar][repvar+l]);
printf("\n");
if(uMsgSize== 1)
uMsgSize=RESOLUTION;
else
uMsgSize = uMsgSize+RESOLUTION;
}
return 0;
}
int synch(int reps)
{
int j;
int *buff;
MPI_Status status;
buff = malloc (MAXSIZE * sizeof (char));
for (j = 0; j < reps; j++)
{
/* synchronization runs */
if (rank==0)
{
MPI_Recv (buff, 0, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
MPI_Send (buff, 0, MPI_INT, 1, 1, MPI_COMM_WORLD);
}
else
{
MPI_Send (buff, 0, MPI_INT, 0, 1, MPI_COMM_WORLD);
MPI_Recv (buff, 0, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
}
return 0;
}
LIST OF REFERENCES
[1] Bennett A., Field A., Harrison P., Modeling and Validation of Shared Memory
Coherency Protocols, Performance Evaluation 28 (1996) 541-562.
[2] Boden N., Cohen D., Felderman R., Kulawik A., Seitz C., Seizovic J., Su W., Myrinet: A
Gigabit-per-Second Local Area Network, IEEE Micro 15 (1) (1995) 26-36.
[3] Brewer T., Astfalk G., The Evolution of the HP/Convex Exemplar, in: Proceedings of
COMPCON '97, San Jose, CA, February 1997, pp. 81-86.
[4] Bugge H., Affordable Scalability using Multicubes, in: H. Hellwagner, A. Reinefeld
(Eds.), SCI: Scalable Coherent Interface, LNCS State-of-the-Art Survey (Springer,
Berlin, 1999) 167-174.
[5] Clark R., SCI Interconnect Chipset and Adapter: Building Large Scale Enterprise Servers
with Pentium II Xeon SHV Nodes, White Paper, Data General Corp., Hopkinton, MA,
1998.
[6] Giganet Inc., Giganet: Building a Scalable Internet Infrastructure with Windows 2000
and Linux, White Paper, Giganet Inc., Concord, MA, 1999.
[7] Horn G., Scalability of SCI Ringlets, in: H. Hellwagner, A. Reinefeld (Eds.), SCI:
Scalable Coherent Interface, LNCS State-of-the-Art Survey (Springer, Berlin, 1999) 151-
165.
[8] Huse L., Omang K., Bugge H., Ry H., Haugsdal A., Rustad E., ScaMPI Design and
Implementation, in: H. Hellwagner, A. Reinefeld (Eds.), SCI: Scalable Coherent
Interface, LNCS State-of-the-Art Survey (Springer, Berlin, 1999) 249-261.
[9] Hellwagner H., Reinefeld A., SCI: Scalable Coherent Interface, LNCS State-of-the-Art
Survey (Springer, Berlin, 1999).
[10] International Business Machines Corp., The IBM NUMA-Q Enterprise Server
Architecture: Solving issues of Latency and Scalability in Multiprocessor Systems, White
Paper, International Business Machines Corp., Armonk, NY, 2000.
[11] IEEE, SCI: Scalable Coherent Interface, IEEE Approved Standard 1596-1992,
Piscataway, NJ, 1992.
[12] Kurmann C., Stricker T., A Comparison of Three Gigabit Technologies: SCI, Myrinet
and SGI/Cray T3D, in: Proceedings of SCI Europe '98, Bordeaux, France, September
1998, pp. 29-40.
[13] Omang K., SCI Clustering through the I/O bus: A Performance and Functionality
Analysis, Ph.D thesis, Department of Informatics, University of Oslo, Norway, 1998.
[14] Sarwar M., George A., Simulative Performance Analysis of Distributed Switching
Fabrics for SCI-based Systems, Microprocessors and Microsystems 24 (1) (2000) 1-11.
[15] Scali Computer AS, Scali System Guide Version 2.0, White Paper, Scali Computer AS,
Oslo, Norway, 2000.
[16] Scott S., Goodman J., Vernon M., Performance of the SCI Ring, in: Proceedings of the
19th Annual International Symposium on Computer Architecture, Gold Coast, Australia,
May 1992, pp. 403-414.
[17] Windham W., Hudgins C., Schroeder J., Vertal M., An Animated Graphical Simulator for
the IEEE 1596 Scalable Coherent Interface with Real-Time Extensions, Computing for
Engineers 12 (1997) 8-13.
[18] Wu B., The Applications of the Scalable Coherent Interface in Large Data Acquisition
Systems in High Energy Physics, Ph.D thesis, Department of Informatics, University of
Oslo, Norway, 1996.
BIOGRAPHICAL SKETCH
Damian Mark Gonzalez was born on March 5th, 1974, in San Fernando, Trinidad
and Tobago. At 19 years of age, after successfully completing the GCE Advanced Level
examinations, he left Trinidad to pursue educational opportunities as an international
student at Florida International University in Miami, Florida. During his five years at
FIU, he obtained a broad liberal arts education through his involvement in the FIU
Honors Program for four consecutive years, and also as an exchange student in England
at the University of Hull in Spring 1996. In Spring 1998, he graduated magna cum laude
from FIU with a Bachelor of Science degree in electrical engineering, and a minor in
computer science.
A desire for advanced study at a well-recognized research university led him to
the University of Florida where he pursued a Master of Science degree in the Department
of Electrical and Computer Engineering. At UF, he took advantage of the opportunity to
gain valuable technical experience and contribute to the research community as part of
the High-performance Computing and Simulation Research Laboratory (HCS).
Consistent academic performance, coupled with his experience at HCS and
elsewhere helped to secure him a position as a software engineer with Motorola's Paging
Products Group in Boynton Beach, Florida. He is very excited about this opportunity,
and looks forward to the many new experiences that lie ahead.
|