
Multicast Performance Modeling and Evaluation
for High-Speed Unidirectional Torus Networks*

Sarp Oral and Alan D. George*

High-Performance Computing and Simulation (HCS) Research Lab, Department of Electrical and Computer Engineering,
University of Florida, P.O. Box 116200, Gainesville, Florida 32611, USA


Abstract
This paper evaluates the performance of various unicast-based and path-based multicast protocols for high-speed torus

networks. The results of an experimental case-study on a Scalable Coherent Interface (SCI) torus network are presented.

Small-message latency models of these software-based multicast algorithms as well as analytical projections for larger

unidirectional torus systems are also introduced. The strengths and weaknesses of selected multicast protocols are

experimentally and analytically illustrated in terms of various metrics, such as startup and completion latency, CPU utilization,

and link concentration and concurrency for SCI networks under various networking and multicasting scenarios.

Keywords: Scalable Coherent Interface; Torus networks; Multicast communication; Analytical modeling; Benchmarking


1. Introduction

Collective communication primitives play a major role in parallel computing by making the applications more portable

among different platforms. Utilizing collective communication not only simplifies parallel programming but also increases the functionality and

efficiency of the parallel tasks. As a result, efficient support of collective communications is important in the design of high-

performance parallel and distributed systems.

An important primitive among collective communication operations is multicast communication. Multicast is defined as

sending a single message from a source node to a set of destination nodes. This primitive can be used as a basis for many

collective operations, such as barrier synchronization and global reduction, as well as cache invalidations in shared-memory

multiprocessor systems [1]. The multicast primitive also functions as a useful tool in parallel numerical procedures such as

matrix multiplication and transposition, eigenvalue computation, and Gauss elimination [2]. Multicast communication

algorithms can be broadly classified as unicast-based or path-based [3]. In a unicast-based algorithm, the source node sends the


* This work was supported in part by the U.S. Department of Defense, by matching funds from the University of Florida for the iVDGL
project supported by the National Science Foundation, and by equipment support from Dolphin Interconnect Solutions Inc. and Scali Computer
AS.
† A preliminary subset of this paper was presented at the High-Speed Local Networks Workshop at the 27th IEEE Local Computer Networks
(LCN) Conference, Tampa, Florida, November 2002.
Corresponding author. Tel.: +1-352-392-2552; fax: +1-352-392-8671.
E-mail addresses: oral@hcs.ufl.edu (S. Oral), george@hcs.ufl.edu (A.D. George)









message to the destination node set as unicast-routed messages [1]. Unlike the unicast-based algorithms, path-based ones

require each relaying element to transmit the message to multiple output channels simultaneously, forming a tree-like multicast

structure [4].

Torus networks are widely used in high-performance parallel computing systems, and in the context of this study we have

selected five relevant multicast algorithms from the literature to be evaluated. Among these algorithms, the separate addressing

(also known as multi-unicast) and the U-torus [5] protocols are unicast-based. The other three are path-based multicast

communication algorithms, namely S-torus, Md-torus, and Mu-torus [6].

The tradeoffs in the performance of the selected algorithms are experimentally evaluated using various metrics, including

multicast completion latency, start-up latency, CPU load, link concentration, and concurrency. Analytical models of the

selected algorithms for short messages are also presented. The experimental results are used to verify and calibrate the

analytical models. Subsequently, analytical projections of the aforementioned algorithms for larger unidirectional torus

networks are produced. The experiments are performed on a Dolphin/Scali Scalable Coherent Interface (SCI) network [7].

This interconnect is based on the ANSI/IEEE standard [8]. The SCI interconnect offers considerable flexibility in topology

choices, such as unidirectional or bi-directional tori, all based on the fundamental structure of a ring.

The next section summarizes related work, followed by a brief overview of the selected multicast algorithms in Section 3.

Section 4 provides a detailed discussion of the setup, results and analysis from the case study experiments. Section 5 presents

the short-message analytical modeling, and Section 6 discusses the projections for larger scale systems. Finally, Section 7

provides conclusions and directions for future work.

2. Related research

Research on multicast communication can be categorized into two groups: unicast-based and multi-destination-based [3].

Among the unicast-based multicasting methods, separate addressing is the simplest one, in which the source node iteratively

unicasts the message to each destination node one after another [5]. Unicast-based multicasting can also be performed in a

multi-phase communication structure, in which the destination nodes are organized in some sort of a binomial tree. The U-torus

multicast algorithm proposed by Robinson et al. [5] is a slightly modified version of the generic binomial-tree approach for

direct torus networks.

Lin and Ni [9] were the first to introduce and investigate the path-based multicasting approach. Subsequently, path-based

multicast communication has received attention and has been studied for direct networks [2, 5, 6]. Regarding path-based









studies, this research will concentrate on the S-torus, Md-torus, and Mu-torus algorithms defined by Robinson et al. [5, 6]. More

details about these and other path-based multicast algorithms can be found in [10].

SCI unicast performance analysis and modeling has been discussed in the literature [11, 12, 13, 14, 15], while collective

communication on SCI has received little attention and its multicast communication characteristics are still unclear. Limited

studies have used collective communication for assessing the scalability of various SCI topologies [16, 17], while no known

study has yet investigated the multicast performance of SCI.

3. Selected multicast algorithms

Bound by the limits of available hardware, two unicast-based and three path-based multicast algorithms are selected in the

context of this study, thereby keeping a degree of variety among different classes of multicast routing algorithms. Throughout

this work, the aggregate collection of all destination nodes and the source node is called the multicast group. Therefore, for a

given group with size d, there are d-1 destination nodes. Figure 1 represents how each algorithm operates for a group size of

10. The root node and the destination nodes are clearly marked and the message transfers are indicated. Alphabetic labels next

to each arrow indicate the individual paths, and the numerical labels represent the logical communication steps on each path.


Separate addressing

Separate addressing is the simplest unicast-based algorithm in terms of algorithmic complexity. For small group sizes and

short messages, separate addressing can be an efficient approach. However, for large messages and large group sizes, the

iterative unicast transmissions may result in large host-processor overhead. Another drawback of this protocol is linearly

increasing multicast completion latencies with increasing group sizes. Figure 1(a) illustrates separate addressing for a given

multicast problem.
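As an illustration of this iterative structure, the following minimal sketch shows how a root node might issue a separate-addressing multicast. The send_unicast primitive is a hypothetical stand-in for the platform's unicast API; this is a sketch of the protocol described above, not the testbed implementation.

```python
def separate_addressing_multicast(send_unicast, root, destinations, message):
    """Iteratively unicast `message` from `root` to every destination node.

    `send_unicast(src, dst, msg)` is a hypothetical stand-in for the
    platform's unicast primitive. The root alone serves all d-1
    destinations, which is why both the completion latency and the
    root's CPU overhead grow linearly with the group size.
    """
    for dst in destinations:
        if dst != root:                      # the root already holds the message
            send_unicast(root, dst, message)
```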

U-torus

U-torus [5] is another unicast-based multicast algorithm that uses a binomial-tree approach to reduce the total number of

required communication steps. For a given group of size d, the lower bound on the number of steps required to complete the

multicast by U-torus will be $\lceil \log_2 d \rceil$. This reduction is achieved by increasing the number of covered destination nodes by a

factor of 2 in each communication step. Figure 1(b) illustrates a typical U-torus multicast scenario.
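The recursive-doubling schedule behind this bound can be sketched as follows. This is the generic binomial-tree construction over a ranked group; the actual U-torus algorithm additionally adapts the partitioning to the torus layout, which is omitted here.

```python
import math

def binomial_tree_schedule(group):
    """Generic binomial-tree multicast schedule (simplified U-torus).

    `group` is the ranked list of group members with group[0] as the root.
    Returns a list of steps; each step is a list of (sender, receiver)
    pairs that proceed concurrently. The set of covered nodes doubles
    every step, so ceil(log2(d)) steps cover a group of size d.
    """
    d = len(group)
    steps, covered = [], 1                   # initially only the root is covered
    while covered < d:
        step = [(group[i], group[covered + i])
                for i in range(covered) if covered + i < d]
        steps.append(step)
        covered = min(2 * covered, d)
    return steps

# A group of 10 nodes is covered in ceil(log2(10)) = 4 steps.
assert len(binomial_tree_schedule(list(range(10)))) == math.ceil(math.log2(10))
```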









[Figure 1 drawings omitted: (a) Separate Addressing, (b) U-torus, (c) S-torus, (d) Md-torus, (e) Mu-torus. Legend: multicast root node, multicast destination node, idle node, message transfer path, network connection.]

Figure 1: Separate addressing (a), U-torus (b), S-torus (c), Md-torus (d), and Mu-torus (e) multicast algorithms for a
typical multicast scenario with a group size of 10. Individual message paths are marked alphabetically, and the
numerical labels represent the logical communication steps for each message path.









S-torus

S-torus is a single-phase, path-based multicast routing algorithm, defined by Robinson et al. [6], in which the destination

nodes are rank-ordered to form a Hamiltonian cycle. The ranking of the nodes is based on their respective physical locations in

the torus network. More detailed information about Hamiltonian node rankings can be found in [6]. The root node issues a

single multicast worm which visits each destination node one after another following the ordered set. At each destination node,

the header is truncated to remove the visited destination address and the worm is re-routed to the next destination. The

algorithm continues until the last destination node receives the message.
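A minimal sketch of this chained delivery is shown below. The Hamiltonian ranking is abstracted into a pre-ordered destination list, and store-and-forward relaying (as in our modified S-torus of Section 4) stands in for the original multi-destination worm; send_unicast is again a hypothetical unicast primitive.

```python
def s_torus_multicast(send_unicast, root, ordered_destinations, message):
    """Single-phase chain multicast (store-and-forward S-torus sketch).

    `ordered_destinations` must already be rank-ordered along a
    Hamiltonian cycle of the torus. The message visits the destinations
    one after another, so exactly one transfer is in flight at any time
    and the completion latency grows linearly with the group size.
    """
    current = root
    for nxt in ordered_destinations:
        send_unicast(current, nxt, message)  # relay to the next ranked node
        current = nxt                        # that node forwards in turn
```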

M-torus

Although simple, single-phase communication is known for large latency variations for large sets of destination nodes [18].

Robinson et al. proposed the multi-phase multicast routing algorithm, M-torus [6], to further improve the S-torus algorithm.

The idea was to shorten the path lengths of the multicast worms to stabilize the latency variations and to achieve better

performance by partitioning the multicast group. They introduced two variations of the M-torus algorithm, Md-torus and Mu-

torus. Md-torus uses a dimensional partitioning method based on the respective sub-torus dimensions to eliminate the costly

dimension-switching overhead. Mu-torus uses a uniform partitioning mechanism to equalize the partition lengths. In both of

these algorithms, the root node separately transmits the message to each partition and the message is then further relayed inside

the subsets using multicast worms.

For a k-ary N-dimensional torus network, where $k^N$ is the total number of nodes, the Md-torus algorithm needs N steps to

complete the multicast operation. Mu-torus is parameterized by the partitioning size, denoted by r. For a group size of d, the

Mu-torus algorithm with a partitioning size of r requires $\lceil \log_r d \rceil$ steps to complete the multicast operation. Figures 1(d) and (e)

illustrate Md-torus and Mu-torus, respectively, where r=4.
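The two partitioning policies can be contrasted with the short sketch below. The row-major node numbering (row = id // k) and the use of dimension-0 rings as partitions are illustrative assumptions, not the exact partitioning rules of [6].

```python
def md_torus_partitions(destinations, k):
    """Dimensional partitioning (Md-torus sketch) for a k-ary 2D torus:
    group the destinations by the ring (row) they occupy, assuming
    row-major node ids so that row = node_id // k."""
    rings = {}
    for node in destinations:
        rings.setdefault(node // k, []).append(node)
    return list(rings.values())

def mu_torus_partitions(ordered_destinations, r):
    """Uniform partitioning (Mu-torus sketch): split the Hamiltonian-ranked
    destination list into chunks of length r to equalize the path lengths."""
    return [ordered_destinations[i:i + r]
            for i in range(0, len(ordered_destinations), r)]

# Example on a 4-ary 2D torus (k=4) with 9 destinations and r=4:
dests = [1, 2, 3, 5, 6, 9, 10, 13, 14]
print(md_torus_partitions(dests, 4))  # [[1, 2, 3], [5, 6], [9, 10], [13, 14]]
print(mu_torus_partitions(dests, 4))  # [[1, 2, 3, 5], [6, 9, 10, 13], [14]]
```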

4. Case study

To comparatively evaluate the performance of the selected algorithms, an experimental case study is conducted over a

high-performance unidirectional SCI torus network. There are 16 nodes in the case study testbed. Each node is configured with

dual 1GHz Intel Pentium-III processors and 256MB of PC133 SDRAM. Each node also features a Dolphin SCI NIC (PCI-

64/66/D330) with 5.3 Gb/s link speed using Scali's SSP (Scali Software Platform) 3.0.1, Redhat Linux 7.2 with kernel version

2.4.7-10smp, mtrr patched, and write-combining enabled. The nodes are interconnected to form a 4x4 unidirectional torus.

For all of the selected algorithms, the polling notification method is used to lower the latencies. Although this method is

known to be effective for achieving low latencies, oftentimes it results in higher CPU loads, especially if the polling process









runs for extended periods. To further decrease the completion latencies, the multicast-tree creation is removed from the critical

path and performed at the beginning of each algorithm in every node.

Throughout the case study, modified versions of the three path-based algorithms, S-torus, Md-torus, and Mu-torus are used.

These algorithms were originally designed to use multi-destination worms. However, as with most high-speed interconnects

available on the market today, our testbed does not support multi-destination worms. Therefore, store-and-forward versions of

these algorithms are developed.

On our 4-ary 2D torus testbed, Md-torus partitions the torus network into simple 4-node rings. For a fair comparison

between the Md-torus and the Mu-torus algorithms, the partition length r of 4 is chosen for Mu-torus. Also, the partition

information for U-torus is embedded in the relayed multicast message at each step. Although separate addressing exhibits no

algorithmic concurrency, it is possible to provide some degree of concurrency by simply allowing multiple message transfers to

occur in a pipelined structure. This method is used for our separate addressing algorithm.

Case study experiments with the five algorithms are performed for various group sizes and for small and large message

sizes. Each algorithm is evaluated for each message and group size 100 times, where each execution has 50 repetitions. The

variance was found to be very small and the averages of all executions are used in this study. Four different sets of experiments

are performed to analyze the various aspects of each algorithm, which are explained in detail in the following subsections.


4.1. Multicast completion latency

Two different sets of experiments for multicast completion latency are performed, one for a message size of 2B and the

other for a message size of 512KB. Figures 2(a) and 2(b) illustrate the multicast completion latency versus group size for small

and large messages, respectively.

S-torus has the worst performance for both small and large messages. Moreover, S-torus shows a linear increase in

multicast completion latency with respect to the increasing group size, as it exhibits no parallelism in message transfers. By

contrast, the separate addressing algorithm has a higher level of concurrency due to its design and performs best for small

messages. However, it also presents linearly increasing completion latencies for large messages with increasing group size.

The Md-torus and Mu-torus algorithms exhibit similar levels of performance for both small and large messages. The difference

between these two becomes more distinctive at certain data points, such as 10 and 14 nodes for large messages. For group sizes

of 10 and 14 the partition length for Mu-torus does not provide perfectly balanced partitions, resulting in higher multicast

completion latencies. Finally, U-torus has nearly flat latency for small messages. For large messages, it exhibits similar












behavior to Mu-torus. Overall, separate addressing appears to be the best for small messages and groups, while for large


messages and groups Md-torus performs better compared to other algorithms.


[Figure 2 plots omitted: multicast completion latency vs. group size (4-16 nodes) for U-torus, S-torus, Mu-torus, Md-torus, and separate addressing; panels (a) 2B messages and (b) 512KB messages.]

Figure 2: Completion latency vs. group size for small (a) and large (b) messages.


4.2. User-level CPU utilization


User-level host processor load is measured using Linux's built-in sar utility. Figures 3(a) and (b) present the maximum


CPU utilization for the root node of each algorithm for small and large messages, respectively.




4 "


23 --- ------ ----- -... -9'"--* -.... .--.I'' 6
I 25 I 5 -,-







U-torus -- S-torus Mu-torus -- Md-torus ---- Sep Add U-torus-- -- S-tors Mu-torus --*-- Md-torus ---- Sep Add
05 -
0 0-
4 6 8 10 12 14 16 4 6 8 10 12 14 16
Multicast Group Size (in nodes) Multicast Group Size (in nodes)
-- U-torus --. S-torus A-Mu-torus -. --Md-torus ---- Sep Add -- U-torus -S-torus -A Mu-torus -- --Md-torus ..--- Sep Add

(a) 2B messages (b) 512KB messages

Figure 3: User-level CPU utilization vs. group size for small (a) and large (b) messages.


It is observed that S-torus exhibits constant CPU load for small messages independent of group size. However, for large


messages, as the group size increases the CPU load also linearly increases. The separate addressing algorithm has a nearly


linear increase in CPU load for large messages with increasing group size. By contrast, since the number of message


transmissions for the root node stays constant, Md-torus provides a nearly constant CPU overhead for small messages for every









group size, and for large messages and small group sizes. However, for group sizes greater than 10, the CPU utilization tends to

increase due to variations in the path lengths causing extended polling durations. Mu-torus exhibits an identical behavior to Md-

torus for small messages. For large messages, it also provides higher but constant CPU utilization. Finally, as can be seen, U-

torus exhibits a step-like increase due to the increase in the number of communication steps required to cover all destination

nodes at certain group sizes, such as 4, 8, and 16.


4.3. Multicast startup and tree-creation latencies

The user-level multicast startup latency is an important metric since, for small message sizes, this factor might impede the

overall communication performance. In addition, multicast tree creation latencies exhibit a similar effect. Both the startup and

the tree creation latencies are independent of the message size. Figures 4(a) and (b) present these two latencies versus group

size.

The U-torus and separate addressing algorithms have unbounded fan-out numbers and, as clearly illustrated in Figure 4(a),

the startup latencies for these two algorithms are identical and linearly increasing with group size. By contrast, the S-torus and

Md-torus algorithms have constant startup latencies because of their fixed fan-out numbers. Also, it can be seen that Mu-torus

has a step-like increasing latency after each multiple of r, the partition length.


[Figure 4 plots omitted: (a) multicast startup latency vs. group size for all five algorithms; (b) tree-creation latency vs. group size for U-torus, S-torus, Mu-torus, and Md-torus.]
Figure 4: Startup latency (a) and tree-creation latency (b) vs. group size.


Figure 4(b) presents the multicast tree creation latencies for all algorithms except separate addressing, which does not use a

tree for multicast delivery. The Mu-torus and Md-torus algorithms differ in their partitioning methods and both methods are

quite complex compared to the other algorithms, resulting in the highest multicast tree-creation latencies. U-torus has a

distributed partitioning process. From the root node perspective and compared to the two M-torus algorithms, it has lower tree-









creation latency. S-torus does not perform any partitioning; it only rank-orders the destination nodes, resulting in the lowest

latency due to the simplicity of its tree formation.


4.4. Link concentration and concurrency

Link concentration is defined here as the ratio of two components: number of link visits and number of used links. Link

visits is defined as the cumulative number of links used during the entire communication process, while used links is defined as

the number of individual links used. Link concurrency is the maximum number of messages that are in transit in the network at

any given time. Link concentration and link concurrency are given in Figures 5(a) and 5(b), respectively. Link concentration

combined with the link concurrency illustrates the degree of communication balance. The concentration and concurrency

values presented in Figure 5 are obtained by analyzing the theoretical communication structures and the experimental timings of

the algorithms.
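Both metrics can be computed directly from a trace of link usage. The sketch below assumes a simple synchronous step model, where each record names the link traversed and the logical step in which the traversal occurred; it illustrates the definitions above rather than our actual analysis scripts.

```python
from collections import Counter

def link_metrics(transfers):
    """Compute link concentration and concurrency from a transfer trace.

    `transfers` is a list of (link, step) records, one per traversal of
    a link during one logical communication step:
      concentration = link visits / number of distinct links used
      concurrency   = max number of transfers in flight in any one step
    """
    visits = len(transfers)
    used_links = len({link for link, _ in transfers})
    in_flight = Counter(step for _, step in transfers)
    return visits / used_links, max(in_flight.values())

# 6 link visits over 4 distinct links, never more than 2 at a time:
trace = [("A", 1), ("B", 1), ("A", 2), ("C", 2), ("D", 3), ("A", 3)]
print(link_metrics(trace))  # (1.5, 2)
```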


[Figure 5 plots omitted: (a) link concentration and (b) link concurrency vs. group size (4-16 nodes) for all five algorithms.]
Figure 5: Link concentration (a) and concurrency (b) vs. group size.


S-torus is a simple chained communication and there is only one active message transfer in the network at any given time.

Therefore, S-torus has the lowest and a constant link concentration and concurrency compared to other algorithms. By contrast,

due to the high parallelism provided by the recursive doubling approach, the U-torus algorithm has the highest concurrency.

Separate addressing exhibits an identical degree of concurrency to the U-torus, because of the multiple message transfers

overlapping at the same time due to network pipelining. Md-torus has a link concentration that is inversely proportional to

the increasing group size. In Md-torus, the root node first sends the message to the destination header nodes, and they relay it to

their child nodes. As the number of dimensional header nodes is constant (k in a k-ary torus), with the increasing group size

each new child node added to the group will increase the number of available links. Moreover, due to the communication









structure of the Md-torus, the number of used links increases much more rapidly compared to the number of link visits with the

increasing group size. This trend asymptotically limits the decreasing link concentration to 1. The concurrency of Md-torus is

upper bounded by k as each dimensional header relays the message over separate ringlets with k nodes in each.

The Mu-torus algorithm has low link concentration for all group sizes, as it multicasts the message to the partitioned

destination nodes over a limited number of individual paths as shown in Figure 1(e), where only a single link is used per path at

a time. By contrast, for a given partition length of constant size, an increase in the group size results in an increase in the

number of partitions as well as an increase in the number of individual paths. This trait results in more messages being

transferred concurrently at any given time over the entire network.

5. Multicast Latency Modeling

The experiments throughout this study have investigated the performance of multicast algorithms over a 2D torus network

having a maximum of 16 nodes. However, ultimately modeling will be a key tool in predicting the relative performance of

these algorithms for system sizes that far exceed our current testbed capabilities. Also, by modifying the model, future systems

with improved characteristics could be evaluated quickly and accurately. The model presented in the next subsections assumes

an equal number of nodes in each dimension for a given N-dimensional torus network. Our small-message latency model

follows the LogP model [19].

LogP is a general-purpose model for distributed-memory machines with asynchronous unicast communication. A detailed

description of the LogP model is given in [19]. Under LogP, a new communication operation can be issued at most once every

g CPU cycles, and a one-way small-message delivery to a remote location is formulated as

$t_{communication} = 2L + 2 \times (o_{sender} + o_{receiver})$  (1)

where L is the upper bound on the latency for the delivery of a message from its source processor to its target processor,

and $o_{sender}$ and $o_{receiver}$ represent the sender and receiver communication overheads, respectively.

As noted in [20], today's high-performance NICs and interconnection systems are fast enough that any packet can be

injected into the network as soon as the host processor produces it without any further delays. Therefore, g is negligible for

modeling high-performance interconnects. This observation yields the relaxed LogP model for one-way unicast

communication, given as

$t_{communication} = o_{sender} + t_{network} + o_{receiver}$  (2)

where $t_{network}$ is the total time spent between the injection of the message into the network and its drainage at the receiver.
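As a worked instance of Eq. 2, the helper below sums the three components of the relaxed one-way latency; the nanosecond values in the example are placeholders, not measurements from our testbed.

```python
def t_communication(o_sender_ns, t_network_ns, o_receiver_ns):
    """Relaxed LogP one-way unicast latency (Eq. 2): with the gap g
    dropped, the cost is the sender overhead plus the in-network time
    plus the receiver overhead."""
    return o_sender_ns + t_network_ns + o_receiver_ns

# Placeholder example: 2000 ns of overhead per side, 500 ns in the network.
print(t_communication(2000, 500, 2000))  # 4500 ns
```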









Following this approach, a model is proposed to capture the multicast communication performance of high-performance

torus networks on an otherwise unloaded system. The model is based on the concept that each multicast message can be

expressed as sequences of serially forwarded unicast messages from root to destinations. The proposed model is formulated as

$t_{multicast} = \max_{\forall\,paths} \left[ o_{sender} + t_{network} + o_{receiver} \right]$  (3)

where the max[ ] operation yields the total multicast latency for the deepest path over all paths and $t_{multicast}$ is the time interval

between the initiation of the multicast and the last destination's reception of the message. Figure 6 illustrates this concept on a

sample binomial-tree multicast scenario. For this scenario, $t_{multicast}$ is determined over the deepest path, which is the route from

node 0 to node 7.








[Figure 6 drawing omitted: a binomial multicast tree rooted at node 0; the deepest path runs from node 0 to node 7.]




Figure 6: A sample multicast scenario for a given binomial tree


Gonzalez et al. [14] modeled the unicast performance of direct SCI networks and observed that $t_{network}$ can be divided into

smaller components as given by:

$t_{network} = h_p \times L_p + h_f \times L_f + h_s \times L_s + h_i \times L_i$  (4)

Here $h_p$, $h_f$, $h_s$, and $h_i$ represent the total number of hops, forwarding nodes, switching nodes, and intermediate nodes,

respectively. Similarly, $L_p$, $L_f$, and $L_s$ denote the propagation delay per hop, the forwarding delay through a node, and the

switching delay through a node, respectively. $L_i$ denotes the intermediate delay through a node, which is the sum of the receiving

overhead of the message, its processing time, and the sending overhead of the message. Figure 7(a) illustrates an arbitrary

mapping of the problem given in Figure 6 to a 2D 4x4 torus network, while Figure 7(b) visually breaks down $t_{multicast}$ over the

same arbitrary mapping.
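Combining Eqs. 3 and 4, the deepest-path evaluation can be sketched as follows. Each path is summarized by its four hop counts, and the parameter values in the example are illustrative magnitudes only, not the calibrated values derived below.

```python
def t_network(counts, L_p, L_f, L_s, L_i):
    """Eq. 4: in-network time of one path. `counts` carries
    (h_p, h_f, h_s, h_i): hops, forwarding nodes, switching nodes,
    and intermediate (relaying) nodes along the path."""
    h_p, h_f, h_s, h_i = counts
    return h_p * L_p + h_f * L_f + h_s * L_s + h_i * L_i

def t_multicast(paths, o_sender, o_receiver, **delays):
    """Eq. 3: the multicast completion time is set by the deepest path."""
    return max(o_sender + t_network(p, **delays) + o_receiver for p in paths)

# Illustrative values (ns): three paths; the last one is the deepest.
paths = [(2, 1, 0, 0), (3, 2, 1, 1), (5, 3, 1, 2)]
print(t_multicast(paths, o_sender=2000, o_receiver=2000,
                  L_p=14, L_f=60, L_s=670, L_i=210000))
```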

Following the method outlined in [14] and using the obtained experimental results from our case study, the model

parameters are measured and calculated for short message sizes. Assuming that electrical signals propagate through a

copper conductor at approximately half the speed of light, and observing that our SCI torus testbed is connected with 2m-long








cables, we obtain a propagation latency of 14ns per link. Since $L_p$ represents the latency for the head of a message passing

through a conductor, it is therefore independent of message size [14].



[Figure 7 drawings omitted. Legend: root node, forwarding node, switching node, intermediate node, destination node; panel (b) spans $o_{sender}$, $t_{network}$, and $o_{receiver}$.]
Figure 7: Arbitrary mapping of the multicast problem given in Figure 6 to a 2D 4-ary torus (a), and break-down of
$t_{multicast}$ over the deepest path for the given multicast scenario (b)

Model parameters $o_{sender}$, $o_{receiver}$, $L_f$, and $L_s$ were obtained through ring-based and torus-based API unicast experiments.

Ring-based experiments were performed with two-, four-, and six-node ring configurations. For each setup, $h_p$ and $L_p$ are

known. Plugging these values into Eqs. 2 and 4 and taking algebraic differences between the two-, four-, and six-node experiments

yields $o_{sender}$, $o_{receiver}$, and $L_f$. Switching latency, $L_s$, is determined through torus-based experiments by comparing the latency of

message transfer within a dimension versus between dimensions. Figure 8 plots the $L_p$, $L_f$, $L_s$, $o_{sender}$, and $o_{receiver}$ model

parameters for various short message sizes.

$L_p$, $L_f$, $L_s$, $o_{sender}$, and $o_{receiver}$ are communication-bound parameters and are dominated by NIC and interconnect

performance. $L_i$ depends upon interconnect performance as well as the node's computational power, and is formulated as:

$L_i = o_{sender} + t_{process} + o_{receiver}$  (5)

Calculations based on the experimental data obtained previously show that each multicast algorithm's processing load,

$t_{process}$, is different. This load is composed mainly of the time needed to check the child node positions on the multicast tree and

the calculation time of the child node(s) for the next iteration of the communication. Also, $t_{process}$ is observed to be the dominant

component of $L_i$ for short message sizes. Moreover, compared to the $t_{process}$ and $L_i$ parameters, the $L_p$,

$L_f$, and $L_s$ values are drastically smaller, and thus relatively negligible. Dropping these negligible parameters, Eq. 3

can be simplified and expressed as









$t_{multicast} \approx (\text{total number of destination nodes}) \times (o_{sender} + t_{process} + o_{receiver})$  (6)



for the separate addressing model. For the Md-torus, Mu-torus, U-torus, and S-torus algorithms the simplified model can be

formulated as in Eq. 7. As can be seen, the modeling is straightforward for separate addressing, and for the remaining

algorithms the modeling problem is now reduced to identifying the number of intermediate nodes over the longest multicast

message path. Moreover, without loss of generality, the $o_{sender}$ and $o_{receiver}$ values can be treated as equal to one another [19]

for simplicity, and we represent them by a single parameter, o.


$t_{multicast} = \max_{\forall\,paths} \left[ o_{sender} + h_i \times L_i + o_{receiver} \right] = \max_{\forall\,paths} \left[ 2 \times o + h_i \times L_i \right]$  (7)

Of course, the variable $L_i$ is not involved in the separate addressing algorithm. The reason is simply that separate addressing

consists of a series of unicast message transmissions from the source to the destination nodes, so there are no intermediate

nodes needed to relay the multicast message to other nodes. Table 1 presents the $t_{process}$ and $L_i$ values for short message sizes.

$t_{process}$ is independent of the multicast message and group size but strictly dependent on the performance of the host machine.

Therefore, for different computing platforms, different $t_{process}$ values will be obtained.


[Figure 8 plot omitted: measured model parameters (propagation, forwarding, switching, and overhead latencies) vs. message size from 128 to 512 bytes on a logarithmic axis; the switching latency averages 670ns, and the overhead grows from 1994ns to 3203ns.]

Figure 8: Measured and calculated model parameters for short message sizes:
$L_p$ = 14ns, $L_f$ = 60ns, $L_s$ = 670ns, and $o = 1994 + 3.15 \times (M - 128)$ ns
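The fitted overhead expression from the caption can be evaluated directly; a one-line helper using this calibration is shown below.

```python
def overhead_ns(message_bytes):
    """Per-side communication overhead o for short messages, from the
    Figure 8 calibration: o = 1994 + 3.15 * (M - 128) ns for an
    M-byte message."""
    return 1994 + 3.15 * (message_bytes - 128)

print(overhead_ns(128))  # 1994.0 ns at the calibration point
print(overhead_ns(512))  # 3203.6 ns for the largest short-message size
```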

Table 1: Calculated $t_{process}$ and $L_i$ values

                       $t_{process}$ (µs)    $L_i$ (µs)
Separate Addressing            7                N/A
Md-torus                     206             206 + 2×o
Mu-torus                     201             201 + 2×o
U-torus                      629             629 + 2×o
S-torus                      265             265 + 2×o










The following subsections will discuss and provide more detail about the simplified model for each multicast algorithm.

For each algorithm, the modeled values, the actual testbed measurements, and the modeling error will also be presented.


5.1. Separate addressing model

With the separate addressing algorithm, for a group size of G, there are G-1 destination nodes that the root node alone must

serve. Therefore, a simplified model for separate addressing can be expressed as:


$t_{multicast} \approx (G-1) \times (o_{sender} + t_{process} + o_{receiver})$  (8)

Figure 9 plots the simplified model and the actual measurements with various multicast group sizes for a 128-byte multicast

message using the o and $t_{process}$ values previously defined. The results show that the model is accurate with an average error of

approximately 3%.



[Figure 9 plot omitted: modeled vs. measured completion latency, and modeling error, vs. multicast group size (4-16 nodes).]

Figure 9: Simplified model vs. actual measurements for separate addressing algorithm

5.2. Md-torus model

For Md-torus, the total number of intermediate nodes for any communication path is observed to be a function of G, the

multicast group size, and k, the number of nodes in a given dimension. The simplified Md-torus model is formulated as:



$t_{multicast} \approx 2 \times o + (\lceil G/k \rceil - 1) \times L_i$  (9)


The simplified model and actual measurements for various group sizes with 128-byte messages are plotted in Figure 10.

As can be seen, the simplified model is accurate with an average error of approximately 2%.










[Figure 10 plot omitted: modeled vs. measured completion latency, and modeling error, vs. multicast group size (4-16 nodes).]

Figure 10: Simplified model vs. actual measurements for Md-torus algorithm

5.3. Mu-torus model

The number of partitions for the Mu-torus algorithm, denoted by p, is a byproduct of the multicast group size, G, and the

partition length, r. For systems with r equal to G, there exists only one partition and the multicast message is propagated in a

chain-type communication mechanism among the destination nodes. Under this condition, the number of intermediate nodes is

simply two less than the group size; the subtracted two are the root and the last destination nodes. For systems with two or

more partitions, the number of intermediate nodes becomes a function of the group size, the partition length, and the number

of nodes in a given dimension. The simplified model is given as:

$t_{multicast} \approx \begin{cases} 2 \times o + \lceil (G-r)/k \rceil \times L_i, & p \ge 2 \\ 2 \times o + (G-2) \times L_i, & p = 1 \end{cases}$  (10)

[Figure 11 plot omitted: modeled vs. measured completion latency, and modeling error, vs. multicast group size (4-16 nodes).]

Figure 11: Simplified model vs. actual measurements for Mu-torus algorithm

Figure 11 shows the small-message model versus actual measurements for 128-byte messages. The results show that the

model is accurate with an average error of approximately 2%.










5.4. U-torus model

For U-torus the minimum number of communication steps required to cover all destination nodes can be expressed as

$\lceil \log_2 G \rceil$. The number of intermediate nodes in the U-torus algorithm is a function of the minimum required communication

steps, the group size, and the number of nodes in a given dimension. The simplified U-torus model is given as:

$t_{multicast} \approx 2 \times o + \lceil \log_2 G \rceil \times \frac{(G \bmod k) + k}{k} \times L_i$  (11)

Figure 12 presents the short-message model and actual measurements for various group sizes. The results show the model

is accurate with an average error of approximately 2%.


[Figure 12 plot omitted: modeled vs. measured completion latency, and modeling error, vs. multicast group size (4-16 nodes).]

Figure 12: Simplified model vs. actual measurements for U-torus algorithm

5.5. S-torus model

S-torus is a chain-type communication algorithm and can be modeled identically to the single partition case of the Mu-torus

algorithm. The simplified model for S-torus is formulated as:


t..lhcast & 2 x o + (G -2)x L, (12)


[Figure 13 plot omitted: modeled vs. measured completion latency, and modeling error, vs. multicast group size (4-16 nodes).]


Figure 13: Simplified model vs. actual measurements for S-torus algorithm









The results from the short-message model versus actual measurements for 128-byte messages are shown in Figure 13. As

previously stated, S-torus routing is based on a Hamiltonian circuit. This type of routing ensures that each destination node will

receive only one copy of the message, but some forwarding nodes (i.e. non-destination nodes that are on the actual message

path) may be visited more than once for routing purposes. Moreover, depending on the group size, single-phase routing

traverses many unnecessary channels, creating more traffic and possible contention. Therefore, S-torus has unavoidably large

latency variations due to the varying and sometimes extremely long message paths [18]. The small-message model presented is

incapable of tracking these large variations in the completion latency that are inherent to the S-torus multicast algorithm. The

modeling error is relatively high, unlike the other algorithms evaluated in this study. The instability of the modeling error is not

expected to lessen with increasing group size.

6. Analytical Projections


To evaluate and compare the short-message performance of the multicast algorithms for larger systems, the simplified

models are used to investigate 2D torus-based parallel systems with 64 and 256 nodes. The effects of different partition lengths

(i.e. r = 8, r = 16, r = 32) for the Mu-torus algorithm over these system sizes are also investigated analytically with these

projections. The results of the projections are plotted in Figure 14.
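The projection procedure is straightforward to reproduce. The sketch below evaluates the simplified models of Eqs. 8-12 using the Table 1 parameters, with the per-side overhead o rounded to 2 µs for 128-byte messages; the equation forms follow the reconstructions in Section 5, and the outputs are model estimates, not the plotted measurements.

```python
import math

O = 2.0  # per-side overhead o for 128-byte messages, in us (~1994 ns)
T_PROC = {"sep": 7, "md": 206, "mu": 201, "u": 629, "s": 265}  # Table 1 (us)
L_I = {a: t + 2 * O for a, t in T_PROC.items()}  # L_i = t_process + 2*o

def sep_addr(G):                # Eq. 8: the root serves all G-1 destinations
    return (G - 1) * (2 * O + T_PROC["sep"])

def md_torus(G, k):             # Eq. 9: deepest ringlet chain of ~G/k nodes
    return 2 * O + (math.ceil(G / k) - 1) * L_I["md"]

def mu_torus(G, k, r):          # Eq. 10: single partition vs. p >= 2
    if G <= r:
        return 2 * O + (G - 2) * L_I["mu"]
    return 2 * O + math.ceil((G - r) / k) * L_I["mu"]

def u_torus(G, k):              # Eq. 11: ceil(log2 G) recursive-doubling steps
    return 2 * O + math.ceil(math.log2(G)) * ((G % k) + k) / k * L_I["u"]

def s_torus(G):                 # Eq. 12: one chain through G-2 relaying nodes
    return 2 * O + (G - 2) * L_I["s"]

# Projected completion latencies (us) on an 8x8 torus (k=8), full group G=64:
G, k = 64, 8
for name, t in [("sep", sep_addr(G)), ("md", md_torus(G, k)),
                ("mu r=8", mu_torus(G, k, 8)), ("u", u_torus(G, k)),
                ("s", s_torus(G))]:
    print(f"{name}: {t:.0f} us")
```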


[Figure 14 plots omitted: projected small-message completion latency vs. group size for all algorithms, including Mu-torus with r=8, r=16, and r=32; panels (a) 8x8 torus (up to 64 nodes) and (b) 16x16 torus (up to 256 nodes), on logarithmic latency axes.]
Figure 14: Small-message latency projections for an 8x8 torus system (a), and a 16x16 torus system (b)

The Md-torus algorithm has step-like completion latency due to the fact that, with every new k destination nodes added to

the group, a new ringlet is introduced into the multicast communication, which increases the completion latency. The optimal

performance for the 8x8 torus network is obtained when the Mu-torus has a partition length of 8 and for the 16x16 torus









network when the partition length is 16. Therefore, it is surmised that the optimal partition length for Mu-torus is equal to k for

2D direct SCI torus networks.

U-torus, on average, has slowly increasing completion latency with increasing group sizes. U-torus, Mu-torus, and Md-torus

all tend to have similar asymptotic latencies. S-torus and separate addressing, as expected, have linearly increasing latencies

with increasing group sizes. Although separate addressing is the best choice for small-scale systems, it loses its advantage with

increasing group sizes for short messages. S-torus, by contrast, again proves to be a poor choice, as it is simply the worst

performing algorithm for group sizes greater than 8.

7. Conclusions

In this study, five different multicast algorithms for high-performance torus networks are evaluated. These algorithms are

analyzed on direct SCI torus networks. The performance characteristics of these algorithms are experimentally examined under

different system and network configurations using various metrics, such as multicast completion latency, root-node CPU

utilization, multicast startup and tree creation latencies, and link concentration and concurrency, to evaluate their key strengths

and weaknesses. Based on the results obtained, small-message latency models for each algorithm are defined. The models are

observed to be accurate. Projections for larger systems are also presented and evaluated.

Based on the experimental analysis it is observed that the separate addressing algorithm is the best choice for small

messages or small group sizes from the perspective of multicast completion latency and CPU utilization due to its simple and

cost effective structure. The Md-torus algorithm performs best from the perspective of completion latency for large messages or

large group sizes, because of the balance provided by its use of dimensional partitioning. In addition, Md-torus incurs a very

low CPU overhead and achieves high concurrency for all the message and group sizes considered. The U-torus and Mu-torus

algorithms perform better when the individual multicast path depths are approximately equal. Furthermore, the M,-torus

algorithm exhibits its best performance when group size is an exact multiple of the partition length. The U-torus and Mu-torus

algorithms have nearly constant CPU utilizations for small and large messages alike. Moreover, the U-torus algorithm has the

highest concurrency among all algorithms evaluated, due to the high parallelism provided by the recursive-doubling method. S-

torus is always the worst performer from the perspective of completion latency and CPU utilization, because of its lack of

concurrency and its extensive communication overhead. As expected, S-torus exhibits a nearly linear increase in completion

latency and CPU utilization for large messages with increasing group size.

The small-message latency models, using only a few parameters, capture the essential mechanisms of the multicast

communication over the given platforms. The models are accurate for all evaluated algorithms except the S-torus algorithm.









Small-message multicast latency projections for larger torus systems are provided using these models. Projection results show

that with increasing group size the U-torus, Mu-torus, and Md-torus algorithms tend toward similar, asymptotically bounded

latencies. Therefore, it is possible to choose an optimal multicast algorithm among these three for larger systems, based on the

multicast completion latency and other metrics such as CPU utilization or network link concentration and concurrency.

Projection results also show that S-torus and separate addressing have unbounded and linearly increasing completion latencies

with increasing group sizes, which makes them unsuitable for large-scale systems. It is also possible and straightforward to

project the multicast performance of larger-scale 2D torus networks with our model. Applying the simplified models to other

torus networks and/or multicast communication schemes is possible with a minimal calibration effort.

Overall, the results of this research make it clear that no single multicast algorithm is best in all cases for all metrics. For

example, as the number of dimensions in the network increases, the Md-torus algorithm becomes dominant. By contrast, for

networks with fewer dimensions supporting a large number of nodes, the Mu-torus and the U-torus algorithms are most

effective. Separate addressing is an efficient and cost-effective choice for small-scale systems. Finally, S-torus is determined to

be inefficient as compared to the alternative algorithms in all the cases evaluated. This inefficiency is due to the extensive

length of the paths used to multicast, which in turn leads to long and widely varying completion latencies, as well as a high

degree of root-node CPU utilization.

There are several possibilities for future research, one of which is integrating and evaluating the performance of the selected

algorithms with higher communication layers such as MPI. Another possible direction is to extend our SAN-based research as

a low-level communication service for other high-performance networks, not only for cluster networks but also for

hierarchical grid networks.

8. References

[1] P.K. McKinley, H. Xu, A.H. Esfahanian, and L.M. Ni, Unicast-Based Multicast Communication in Wormhole-Routed Networks, IEEE
Transactions on Parallel and Distributed Systems 5 (12) (1994) 1252-1265.
[2] P.K. McKinley, Y. Tsai, and D.F. Robinson, Collective Communication in Wormhole-Routed Massively Parallel Computers, IEEE
Computer 28 (2) (1995) 39-50.
[3] Y. Tseng, D.K. Panda, and T. Lai, A Trip-Based Multicasting Model in Wormhole-Routed Networks with Virtual Channels, IEEE
Transactions on Parallel and Distributed Systems 7 (2) (1996) 138-150.
[4] R. Kesavan and D.K. Panda, Multicasting on Switch-Based Irregular Networks using Multi-Drop Path-Based Multi-Destination
Worms, Proc. of Parallel Computer Routing and Communication, Second International Workshop (PCRCW'97), Atlanta, Georgia, Jun.
1997, pp. 179-192.
[5] D.F. Robinson, P.K. McKinley, and B.H.C. Cheng, Optimal Multicast Communication in Wormhole-Routed Torus Networks, IEEE
Transactions on Parallel and Distributed Systems 6 (10) (1995) 1029-1042.









[6] D.F. Robinson, P.K. McKinley, and B.H.C. Cheng, Path-Based Multicast Communication in Wormhole-Routed Unidirectional Torus
Networks, Journal of Parallel and Distributed Computing 45 (2) (1997) 104-121.
[7] Scali Computer AS, Scali System Guide Version 2.0, White Paper, Scali Computer AS, Oslo, Norway, 2000.
[8] IEEE, SCI: Scalable Coherent Interface, IEEE Approved Standard 1596-1992, 1992.
[9] X. Lin and L.M. Ni, Deadlock-Free Multicast Wormhole Routing in Multicomputer Networks, Proc. of 18th Annual International
Symp. on Computer Architecture, Toronto, Canada, May 1991, pp. 116-124.
[10] L.M. Ni and P.K. McKinley, A Survey of Wormhole Routing Techniques In Direct Networks, IEEE Computer 26 (2) (1993) 62-76.
[11] K. Omang and B. Parady, Performance Of Low-Cost Ultrasparc Multiprocessors Connected By SCI, Technical Report, Department of
Informatics, University of Oslo, Norway, 1996.
[12] M. Ibel, K.E. Schauser, C.J. Scheiman and M. Weis, High-Performance Cluster Computing using SCI, Proc. of Hot Interconnects
Symposium V, Palo Alto, CA, Aug. 1997.
[13] M. Sarwar and A. George, Simulative Performance Analysis of Distributed Switching Fabrics for SCI-Based Systems, Microprocessors
and Microsystems 24 (1) (2000) 1-11.
[14] D. Gonzalez, A. George, and M. Chidester, Performance Modeling and Evaluation of Topologies for Low-Latency SCI Systems,
Microprocessor and Microsystems 25 (7) (2001) 343-356.
[15] R. Todd, M. Chidester, and A. George, Comparative Performance Analysis of Directed Flow Control for Real-Time SCI, Computer
Networks 37 (4) (2001) 391-406.
[16] H. Bugge, Affordable Scalability using Multicubes, in: H. Hellwagner, A. Reinfeld (Eds.), SCI: Scalable Coherent Interface, LNCS
State-of-the-Art Survey, Springer, Berlin, Germany, 1999, pp. 167-174.
[17] L.P. Huse, Collective Communication on Dedicated Clusters Of Workstations, Proc. of 6th PVM/MPI European Users Meeting
(EuroPVM/MPI '99), Sabadell, Barcelona, Spain, Sep. 1999, pp. 469-476.
[18] H. Wang and D.M. Blough, Tree-Based Multicast in Wormhole-Routed Torus Networks, Proc. of Int. Conf. on Parallel and
Distributed Processing Techniques and Applications (PDPTA'98), Las Vegas, Nevada, Jul. 1998, pp. 702-709.
[19] D.E. Culler, R. Karp, D.A. Patterson, A. Sahay, K.E. Shauser, E. Santos, R. Subramonian, and T. von Eicken, LogP: Towards a
Realistic Model of Parallel Computation, Proc. of ACM 4th SIGPLAN Symp. on Principles and Practices of Parallel Programming, San
Diego, California, May 1993, pp. 1-12.
[20] E. Deelman, A. Dube, A. Hoisie, Y. Luo, R. Oliver, D. Sundaram-Stukel, H. Wasserman, V.S. Adve, R. Bagrodia, J.C. Browne, E.
Houstis, O. Lubeck, J. Rice, P. Teller, and M.K. Vernon, POEMS: End-to-end Performance Design of Large Parallel Adaptive
Computational Systems, Proc. of First International Workshop on Software and Performance '98, WOSP '98, Santa Fe, New Mexico,
Oct. 1998, pp. 18-30.



