Multicast Performance Modeling and Evaluation for High-Speed Unidirectional Torus Networks*†

Sarp Oral and Alan D. George‡

High-Performance Computing and Simulation (HCS) Research Lab, Department of Electrical and Computer Engineering, University of Florida, P.O. Box 116200, Gainesville, Florida 32611, USA

* This work was supported in part by the U.S. Department of Defense, by matching funds from the University of Florida for the iVDGL project supported by the National Science Foundation, and by equipment support of Dolphin Interconnect Solutions Inc. and Scali Computer AS.
† A preliminary subset of this paper was presented at the High-Speed Local Networks Workshop at the 27th IEEE Local Computer Networks (LCN) Conference, Tampa, Florida, November 2002.
‡ Corresponding author. Tel.: +1 352 392 2552; fax: +1 352 392 8671.
Email addresses: oral@hcs.ufl.edu (S. Oral), george@hcs.ufl.edu (A.D. George)
Abstract
This paper evaluates the performance of various unicast-based and path-based multicast protocols for high-speed torus networks. The results of an experimental case study on a Scalable Coherent Interface (SCI) torus network are presented. Small-message latency models of these software-based multicast algorithms, as well as analytical projections for larger unidirectional torus systems, are also introduced. The strengths and weaknesses of selected multicast protocols are experimentally and analytically illustrated in terms of various metrics, such as startup and completion latency, CPU utilization, and link concentration and concurrency, for SCI networks under various networking and multicasting scenarios.
Keywords: Scalable Coherent Interface; Torus networks; Multicast communication; Analytical modeling; Benchmarking
1. Introduction
Collective communication primitives play a major role in parallel computing by making applications more portable among different platforms. Utilizing collective communication not only simplifies parallel tasks but also increases their functionality and efficiency. As a result, efficient support of collective communication is important in the design of high-performance parallel and distributed systems.
An important primitive among collective communication operations is multicast communication. Multicast is defined as sending a single message from a source node to a set of destination nodes. This primitive can be used as a basis for many collective operations, such as barrier synchronization and global reduction, as well as cache invalidations in shared-memory multiprocessor systems [1]. The multicast primitive also functions as a useful tool in parallel numerical procedures such as matrix multiplication and transposition, eigenvalue computation, and Gaussian elimination [2]. Multicast communication algorithms can be broadly classified as unicast-based or path-based [3]. In a unicast-based algorithm, the source node sends the message to the destination node set as unicast-routed messages [1]. Unlike the unicast-based algorithms, path-based ones
require each relaying element to transmit the message to multiple output channels simultaneously, forming a tree-like multicast
structure [4].
Torus networks are widely used in high-performance parallel computing systems, and in the context of this study we have selected five relevant multicast algorithms from the literature to be evaluated. Among these algorithms, the separate addressing (also known as multi-unicast) and the U-torus [5] protocols are unicast-based. The other three are path-based multicast communication algorithms, namely S-torus, Md-torus, and Mu-torus [6].
The trade-offs in the performance of the selected algorithms are experimentally evaluated using various metrics, including multicast completion latency, startup latency, CPU load, link concentration, and concurrency. Analytical models of the selected algorithms for short messages are also presented. The experimental results are used to verify and calibrate the analytical models. Subsequently, analytical projections of the aforementioned algorithms for larger unidirectional torus networks are produced. The experiments are performed on a Dolphin/Scali Scalable Coherent Interface (SCI) network [7]. This interconnect is based on the ANSI/IEEE standard [8]. The SCI interconnect offers considerable flexibility in topology choices, such as unidirectional or bidirectional tori, all based on the fundamental structure of a ring.
The next section summarizes related work, followed by a brief overview of the selected multicast algorithms in Section 3. Section 4 provides a detailed discussion of the setup, results, and analysis from the case-study experiments. Section 5 presents the short-message analytical modeling, and Section 6 discusses the projections for larger-scale systems. Finally, Section 7 provides conclusions and directions for future work.
2. Related research
Research on multicast communication can be categorized into two groups, unicast-based and multidestination-based [3]. Among the unicast-based multicasting methods, separate addressing is the simplest one, in which the source node iteratively unicasts the message to each destination node one after another [5]. Unicast-based multicasting can also be performed in a multiphase communication structure, in which the destination nodes are organized in some form of binomial tree. The U-torus multicast algorithm proposed by Robinson et al. [5] is a slightly modified version of the generic binomial-tree approach for direct torus networks.
Lin and Ni [9] were the first to introduce and investigate the path-based multicasting approach. Subsequently, path-based multicast communication has received attention and has been studied for direct networks [2, 5, 6]. Regarding path-based studies, this research concentrates on the S-torus, Md-torus, and Mu-torus algorithms defined by Robinson et al. [5, 6]. More details about these and other path-based multicast algorithms can be found in [10].
SCI unicast performance analysis and modeling have been discussed in the literature [11, 12, 13, 14, 15], while collective communication on SCI has received little attention and its multicast communication characteristics are still unclear. Limited studies have used collective communication for assessing the scalability of various SCI topologies [16, 17], while no known study has yet investigated the multicast performance of SCI.
3. Selected multicast algorithms
Bound by the limits of available hardware, two unicast-based and three path-based multicast algorithms are selected in the context of this study, thereby keeping a degree of variety among different classes of multicast routing algorithms. Throughout this work, the aggregate collection of all destination nodes and the source node is called the multicast group. Therefore, for a given group with size d, there are d - 1 destination nodes. Figure 1 represents how each algorithm operates for a group size of 10. The root node and the destination nodes are clearly marked, and the message transfers are indicated. Alphabetic labels next to each arrow indicate the individual paths, and the numerical labels represent the logical communication steps on each path.
Separate addressing
Separate addressing is the simplest unicast-based algorithm in terms of algorithmic complexity. For small group sizes and short messages, separate addressing can be an efficient approach. However, for large messages and large group sizes, the iterative unicast transmissions may result in large host-processor overhead. Another drawback of this protocol is its linearly increasing multicast completion latency with increasing group size. Figure 1(a) illustrates separate addressing for a given multicast problem.
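To make the mechanism concrete, a minimal sketch is given below, assuming a hypothetical point-to-point unicast primitive; it is illustrative only and omits the pipelined variant used later in our case study.

```python
def separate_addressing(unicast, root, destinations, message):
    # The root alone serves every destination, one unicast at a time;
    # completion latency therefore grows linearly with the group size.
    # `unicast` is a hypothetical point-to-point send primitive.
    for dest in destinations:        # d - 1 destinations for group size d
        unicast(root, dest, message)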
U-torus
U-torus [5] is another unicast-based multicast algorithm; it uses a binomial-tree approach to reduce the total number of required communication steps. For a given group of size d, the lower bound on the number of steps required to complete the multicast by U-torus is ⌈log₂ d⌉. This reduction is achieved by increasing the number of covered destination nodes by a factor of 2 in each communication step. Figure 1(b) illustrates a typical U-torus multicast scenario.
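The recursive-doubling structure can be sketched as follows; the list representation, ordering, and names are assumptions of this sketch, and the real U-torus additionally orders nodes to suit torus routing.

```python
def u_torus_schedule(group):
    # Binomial-tree (recursive-doubling) schedule: `group` is assumed to be
    # the rank-ordered multicast group with the root at index 0. Each step
    # doubles the number of covered nodes, so the whole group is covered in
    # ceil(log2(len(group))) steps. Illustrative sketch only.
    steps, covered = [], 1
    while covered < len(group):
        pairs = [(group[i], group[i + covered])
                 for i in range(covered) if i + covered < len(group)]
        steps.append(pairs)          # (sender, receiver) pairs for this step
        covered *= 2
    return steps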
Figure 1: Separate addressing (a), U-torus (b), S-torus (c), Md-torus (d), and Mu-torus (e) multicast algorithms for a typical multicast scenario with a group size of 10. Individual message paths are marked alphabetically, and the numerical labels represent the logical communication steps for each message path.
S-torus
S-torus is a single-phase, path-based multicast routing algorithm, defined by Robinson et al. [6], in which the destination nodes are rank-ordered to form a Hamiltonian cycle. The ranking of the nodes is based on their respective physical locations in the torus network. More detailed information about Hamiltonian node rankings can be found in [6]. The root node issues a single multicast worm which visits each destination node one after another, following the ordered set. At each destination node, the header is truncated to remove the visited destination address, and the worm is rerouted to the next destination. The algorithm continues until the last destination node receives the message.
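Anticipating the store-and-forward variant used later in Section 4 (our hardware lacks multidestination worms), the chain can be sketched as follows, with a hypothetical send primitive:

```python
def s_torus(send, root, ranked_destinations, message):
    # Single multicast chain: the root sends to the first-ranked destination,
    # and each destination relays the message to the next one in the ordered
    # set, ending when the last destination receives it. Sketch only;
    # `send` and the node identifiers are hypothetical.
    chain = [root] + list(ranked_destinations)   # Hamiltonian rank order
    for relay, nxt in zip(chain, chain[1:]):
        send(relay, nxt, message)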
M-torus
Although simple, single-phase communication is known for large latency variations for large sets of destination nodes [18]. Robinson et al. proposed the multiphase multicast routing algorithm, M-torus [6], to further improve upon the S-torus algorithm. The idea is to shorten the path lengths of the multicast worms to stabilize the latency variations and to achieve better performance by partitioning the multicast group. They introduced two variations of the M-torus algorithm, Md-torus and Mu-torus. Md-torus uses a dimensional partitioning method based on the respective sub-torus dimensions to eliminate the costly dimension-switching overhead. Mu-torus uses a uniform partitioning mechanism to equalize the partition lengths. In both of these algorithms, the root node separately transmits the message to each partition, and the message is then further relayed inside the subsets using multicast worms.
For a k-ary N-dimensional torus network, where k^N is the total number of nodes, the Md-torus algorithm needs N steps to complete the multicast operation. Mu-torus is parameterized by the partitioning size, denoted by r. For a group size of d, the Mu-torus algorithm with a partitioning size of r requires ⌈log_r(d)⌉ steps to complete the multicast operation. Figures 1(d) and (e) illustrate Md-torus and Mu-torus respectively, where r = 4.
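The two partitioning schemes can be sketched as follows, under the assumption (ours, for illustration) that nodes are numbered row-major on a k-ary 2-D torus, so that integer division by k identifies a ring; the real algorithms also build the relay schedule within each partition.

```python
def md_partitions(destinations, k):
    # Dimensional partitioning (Md-torus): group destinations by the ring
    # they occupy, so no multicast worm ever switches dimensions.
    # Row-major node numbering is a hypothetical convention of this sketch.
    rings = {}
    for node in destinations:
        rings.setdefault(node // k, []).append(node)
    return list(rings.values())

def mu_partitions(ranked_destinations, r):
    # Uniform partitioning (Mu-torus): cut the rank-ordered destination set
    # into chains of length r (the last chain may be shorter).
    return [ranked_destinations[i:i + r]
            for i in range(0, len(ranked_destinations), r)]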
4. Case study
To comparatively evaluate the performance of the selected algorithms, an experimental case study is conducted over a high-performance unidirectional SCI torus network. There are 16 nodes in the case-study testbed. Each node is configured with dual 1-GHz Intel Pentium-III processors and 256 MB of PC133 SDRAM. Each node also features a Dolphin SCI NIC (PCI 64/66, D330) with a 5.3 Gb/s link speed, using Scali's SSP (Scali Software Platform) 3.0.1 and RedHat Linux 7.2 with kernel version 2.4.7-10smp, mtrr-patched and with write-combining enabled. The nodes are interconnected to form a 4×4 unidirectional torus.
For all of the selected algorithms, the polling notification method is used to lower the latencies. Although this method is known to be effective for achieving low latencies, it often results in higher CPU loads, especially if the polling process runs for extended periods. To further decrease the completion latencies, the multicast-tree creation is removed from the critical path and performed at the beginning of each algorithm in every node.
Throughout the case study, modified versions of the three path-based algorithms, S-torus, Md-torus, and Mu-torus, are used. These algorithms were originally designed to use multidestination worms. However, as with most high-speed interconnects available on the market today, our testbed does not support multidestination worms. Therefore, store-and-forward versions of these algorithms are developed.
On our 4-ary 2-D torus testbed, Md-torus partitions the torus network into simple 4-node rings. For a fair comparison between the Md-torus and Mu-torus algorithms, a partition length r of 4 is chosen for Mu-torus. Also, the partition information for U-torus is embedded in the relayed multicast message at each step. Although separate addressing exhibits no algorithmic concurrency, it is possible to provide some degree of concurrency by simply allowing multiple message transfers to occur in a pipelined structure. This method is used for our separate addressing algorithm.
Case-study experiments with the five algorithms are performed for various group sizes and for small and large message sizes. Each algorithm is evaluated for each message and group size 100 times, where each execution has 50 repetitions. The variance was found to be very small, and the averages of all executions are used in this study. Four different sets of experiments are performed to analyze the various aspects of each algorithm, which are explained in detail in the following subsections.
4.1. Multicast completion latency
Two different sets of experiments for multicast completion latency are performed, one for a message size of 2B and the
other for a message size of 512KB. Figures 2(a) and 2(b) illustrate the multicast completion latency versus group size for small
and large messages, respectively.
S-torus has the worst performance for both small and large messages. Moreover, S-torus shows a linear increase in multicast completion latency with respect to increasing group size, as it exhibits no parallelism in message transfers. By contrast, the separate addressing algorithm has a higher level of concurrency due to its design and performs best for small messages. However, it also presents linearly increasing completion latencies for large messages with increasing group size. The Md-torus and Mu-torus algorithms exhibit similar levels of performance for both small and large messages. The difference between these two becomes more distinctive at certain data points, such as 10 and 14 nodes for large messages. For group sizes of 10 and 14, the partition length for Mu-torus does not provide perfectly balanced partitions, resulting in higher multicast completion latencies. Finally, U-torus has nearly flat latency for small messages. For large messages, it exhibits similar behavior to Mu-torus. Overall, separate addressing appears to be the best for small messages and groups, while for large messages and groups Md-torus performs better compared to the other algorithms.
(a) 2B messages (b) 512KB messages
Figure 2: Completion latency vs. group size for small (a) and large (b) messages.
4.2. User-level CPU utilization
User-level host-processor load is measured using Linux's built-in sar utility. Figures 3(a) and (b) present the maximum CPU utilization for the root node of each algorithm for small and large messages, respectively.
4 "
23    ... 9'"* .... ..I'' 6
I 25 I 5 ,
Utorus  Storus Mutorus  Mdtorus  Sep Add Utorus  Stors Mutorus * Mdtorus  Sep Add
05 
0 0
4 6 8 10 12 14 16 4 6 8 10 12 14 16
Multicast Group Size (in nodes) Multicast Group Size (in nodes)
 Utorus . Storus AMutorus . Mdtorus  Sep Add  Utorus Storus A Mutorus  Mdtorus .. Sep Add
(a) 2B messages (b) 512KB messages
Figure 3: User-level CPU utilization vs. group size for small (a) and large (b) messages.
It is observed that S-torus exhibits a constant CPU load for small messages, independent of group size. However, for large messages, as the group size increases the CPU load also linearly increases. The separate addressing algorithm has a nearly linear increase in CPU load for large messages with increasing group size. By contrast, since the number of message transmissions for the root node stays constant, Md-torus provides a nearly constant CPU overhead for small messages at every group size, and for large messages at small group sizes. However, for group sizes greater than 10, the CPU utilization tends to increase due to variations in the path lengths causing extended polling durations. Mu-torus exhibits behavior identical to Md-torus for small messages. For large messages, it also provides higher but constant CPU utilization. Finally, as can be seen, U-torus exhibits a step-like increase due to the increase in the number of communication steps required to cover all destination nodes at certain group sizes, such as 4, 8, and 16.
4.3. Multicast startup and tree-creation latencies
The user-level multicast startup latency is an important metric since, for small message sizes, this factor might impede the overall communication performance. In addition, multicast tree-creation latencies exhibit a similar effect. Both the startup and the tree-creation latencies are independent of the message size. Figures 4(a) and (b) present these two latencies versus group size.
The U-torus and separate addressing algorithms have unbounded fan-out numbers and, as clearly illustrated in Figure 4(a), the startup latencies for these two algorithms are identical and linearly increasing with group size. By contrast, the S-torus and Md-torus algorithms have constant startup latencies because of their fixed fan-out numbers. Also, it can be seen that Mu-torus has a step-like increase in latency after each multiple of r, the partition length.
Figure 4: Startup latency (a) and tree-creation latency (b) vs. group size.
Figure 4(b) presents the multicast tree-creation latencies for all algorithms except separate addressing, which does not use a tree for multicast delivery. The Mu-torus and Md-torus algorithms differ in their partitioning methods, and both methods are quite complex compared to the other algorithms, resulting in the highest multicast tree-creation latencies. U-torus has a distributed partitioning process; from the root node's perspective, and compared to the two M-torus algorithms, it has lower tree-creation latency. S-torus does not perform any partitioning and only rank-orders the destination nodes, resulting in the lowest latency due to the simplicity of its tree formation.
4.4. Link concentration and concurrency
Link concentration is defined here as the ratio of two components: number of link visits and number of used links. Link
visits is defined as the cumulative number of links used during the entire communication process, while used links is defined as
the number of individual links used. Link concurrency is the maximum number of messages that are in transit in the network at
any given time. Link concentration and link concurrency are given in Figures 5(a) and 5(b), respectively. Link concentration
combined with the link concurrency illustrates the degree of communication balance. The concentration and concurrency
values presented in Figure 5 are obtained by analyzing the theoretical communication structures and the experimental timings of
the algorithms.
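Given a communication trace, both metrics reduce to a few lines of computation; in the sketch below, link_visits and transfers are hypothetical trace representations (directed links with repetitions, and per-message start/end times, respectively).

```python
def link_concentration(link_visits):
    # Ratio of cumulative link visits to the number of distinct links used,
    # e.g. link_visits = [(0, 1), (1, 2), (0, 1), ...].
    return len(link_visits) / len(set(link_visits))

def link_concurrency(transfers):
    # Maximum number of messages simultaneously in transit: sweep over the
    # start (+1) and end (-1) events, tracking the running count.
    events = [(s, 1) for s, e in transfers] + [(e, -1) for s, e in transfers]
    active = peak = 0
    for _, delta in sorted(events):   # ends sort before coincident starts
        active += delta
        peak = max(peak, active)
    return peak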
Figure 5: Link concentration (a) and concurrency (b) vs. group size.
S-torus is a simple chained communication, and there is only one active message transfer in the network at any given time. Therefore, S-torus has the lowest, and a constant, link concentration and concurrency compared to the other algorithms. By contrast, due to the high parallelism provided by the recursive-doubling approach, the U-torus algorithm has the highest concurrency. Separate addressing exhibits a degree of concurrency identical to that of U-torus, because multiple message transfers overlap in time due to network pipelining. Md-torus has a link concentration that is inversely proportional to the group size. In Md-torus, the root node first sends the message to the dimensional header nodes, and they relay it to their child nodes. As the number of dimensional header nodes is constant (k in a k-ary torus), with increasing group size each new child node added to the group will increase the number of available links. Moreover, due to the communication structure of Md-torus, the number of used links increases much more rapidly than the number of link visits with increasing group size. This trend asymptotically limits the decreasing link concentration to 1. The concurrency of Md-torus is upper-bounded by k, as each dimensional header relays the message over a separate ringlet with k nodes in each.
The Mu-torus algorithm has low link concentration for all group sizes, as it multicasts the message to the partitioned destination nodes over a limited number of individual paths, as shown in Figure 1(e), where only a single link is used per path at a time. At the same time, for a given partition length of constant size, an increase in the group size results in an increase in the number of partitions, as well as an increase in the number of individual paths. This trait results in more messages being transferred concurrently at any given time over the entire network.
5. Multicast Latency Modeling
The experiments throughout this study have investigated the performance of multicast algorithms over a 2-D torus network having a maximum of 16 nodes. However, ultimately, modeling will be a key tool in predicting the relative performance of these algorithms for system sizes that far exceed our current testbed capabilities. Also, by modifying the model, future systems with improved characteristics could be evaluated quickly and accurately. The model presented in the next subsections assumes an equal number of nodes in each dimension for a given N-dimensional torus network. Our small-message latency model follows the LogP model [19].
LogP is a general-purpose model for distributed-memory machines with asynchronous unicast communication. A detailed description of the LogP model is given in [19]. Under LogP, a new communication operation can be issued at most once every g CPU cycles, and a one-way small-message delivery to a remote location is formulated as

t_communication = 2L + 2 × (o_sender + o_receiver)    (1)

where L is the upper bound on the latency for the delivery of a message from its source processor to its target processor, and o_sender and o_receiver represent the sender and receiver communication overheads, respectively.
As noted in [20], today's high-performance NICs and interconnection systems are fast enough that any packet can be injected into the network as soon as the host processor produces it, without any further delays. Therefore, g is negligible for modeling high-performance interconnects. This observation yields the relaxed LogP model for one-way unicast communication, given as

t_communication = o_sender + t_network + o_receiver    (2)

where t_network is the total time spent between the injection of the message into the network and the drainage of the message from it.
Following this approach, a model is proposed to capture the multicast communication performance of high-performance torus networks on an otherwise unloaded system. The model is based on the concept that each multicast message can be expressed as sequences of serially forwarded unicast messages from the root to the destinations. The proposed model is formulated as

t_multicast = Max[o_sender + t_network + o_receiver] over ∀ paths    (3)

where the Max[·] operation yields the total multicast latency for the deepest path over all paths, and t_multicast is the time interval between the initiation of the multicast and the last destination's reception of the message. Figure 6 illustrates this concept on a sample binomial-tree multicast scenario. For this scenario, t_multicast is determined over the deepest path, which is the route from node 0 to node 7.
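Operationally, Eq. 3 enumerates every root-to-destination path and keeps the slowest; the sketch below assumes the path set and a per-path t_network function are supplied by the caller.

```python
def t_multicast(paths, o_sender, o_receiver, t_network):
    # Eq. 3: the completion latency is governed by the deepest path,
    # e.g. the route from node 0 to node 7 in the Figure 6 scenario.
    # `paths` and `t_network` are assumptions of this sketch.
    return max(o_sender + t_network(path) + o_receiver for path in paths)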
Figure 6: A sample multicast scenario for a given binomial tree
Gonzalez et al. [14] modeled the unicast performance of direct SCI networks and observed that t_network can be divided into smaller components, as given by:

t_network = h_p × L_p + h_f × L_f + h_s × L_s + h_i × L_i    (4)

Here h_p, h_f, h_s, and h_i represent the total number of hops, forwarding nodes, switching nodes, and intermediate nodes, respectively. Similarly, L_p, L_f, and L_s denote the propagation delay per hop, the forwarding delay through a node, and the switching delay through a node, respectively. L_i denotes the intermediate delay through a node, which is the sum of the receiving overhead of the message, the processing time, and the sending overhead of the message. Figure 7(a) illustrates an arbitrary mapping of the problem given above in Figure 6 to a 2-D 4×4 torus network, while Figure 7(b) visually breaks down t_multicast over the same arbitrary mapping.
Following the method outlined in [14] and using the experimental results obtained from our case study, the model parameters are measured and calculated for short message sizes. Assuming that electrical signals propagate through a copper conductor at approximately half the speed of light, and observing that our SCI torus testbed is connected with 2-m long cables, this results in 14 ns of propagation latency per link. Since L_p represents the latency for the head of a message passing through a conductor, it is independent of message size [14].
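Both Eq. 4 and the propagation estimate reduce to simple arithmetic, as sketched below; tallying the h counts per path from a mapping such as Figure 7 is assumed to be done elsewhere.

```python
def t_network(h_p, h_f, h_s, h_i, L_p, L_f, L_s, L_i):
    # Eq. 4: per-hop propagation, forwarding, switching, and
    # intermediate-node delays accumulated along one path.
    return h_p * L_p + h_f * L_f + h_s * L_s + h_i * L_i

# Propagation delay of one 2 m copper link at ~half the speed of light:
# 2 m / (0.5 * 3e8 m/s) = 13.3 ns, i.e. the 14 ns quoted above.
L_p_ns = 2.0 / (0.5 * 3.0e8) * 1e9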
Figure 7: Arbitrary mapping of the multicast problem given in Figure 6 to a 2-D 4-ary torus (a), and breakdown of t_multicast over the deepest path for the given multicast scenario (b)
The model parameters o_sender, o_receiver, L_f, and L_s were obtained through ring-based and torus-based API unicast experiments. Ring-based experiments were performed with two-, four-, and six-node ring configurations. For each setup, h_p and L_p are known. Plugging these values into Eqs. 2 and 4 and taking algebraic differences between the two-, four-, and six-node experiments yields o_sender, o_receiver, and L_f. The switching latency, L_s, is determined through torus-based experiments by comparing the latency of message transfers within a dimension versus between dimensions. Figure 8 plots the L_p, L_f, L_s, o_sender, and o_receiver model parameters for various short message sizes.
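The algebraic-difference step can be illustrated as follows, assuming (as the ring experiments imply) that a ring unicast involves no switching or intermediate nodes and that the hop and forwarding counts of each setup are known; the variable names are ours.

```python
def ring_parameters(t_a, t_b, hp_a, hp_b, hf_a, hf_b, L_p):
    # Differencing two ring experiments cancels the constant overheads and
    # isolates the forwarding delay (Eqs. 2 and 4 with h_s = h_i = 0);
    # substituting back then isolates o_sender + o_receiver.
    L_f = ((t_b - t_a) - (hp_b - hp_a) * L_p) / (hf_b - hf_a)
    overhead = t_a - hp_a * L_p - hf_a * L_f   # = o_sender + o_receiver
    return L_f, overhead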
L_p, L_f, L_s, o_sender, and o_receiver are communication-bound parameters and are dominated by NIC and interconnect performance. L_i is dependent upon interconnect performance as well as the node's computational power, and is formulated as:

L_i = o_sender + t_process + o_receiver    (5)
Calculations based on the experimental data obtained previously show that each multicast algorithm's processing load, t_process, is different. This load is composed mainly of the time needed to check the child-node positions on the multicast tree and the time to calculate the child node(s) for the next iteration of the communication. Also, t_process is observed to be the dominant component of L_i for short message sizes. Moreover, compared to the t_process and L_i parameters, the L_p, L_f, and L_s values are drastically smaller, and thus relatively negligible. Dropping these negligible parameters, Eq. 3 can be simplified and expressed as

t_multicast ≈ (total number of destination nodes) × (o_sender + t_process) + o_receiver    (6)
for the separate addressing model. For the Md-torus, Mu-torus, U-torus, and S-torus algorithms, the simplified model can be formulated as in Eq. 7. As can be seen, the modeling is straightforward for separate addressing, and for the remaining algorithms the modeling problem is now reduced to identifying the number of intermediate nodes over the longest multicast message path. Moreover, without loss of generality, the o_sender and o_receiver values can be treated as equal to one another [19] for simplicity, and we represent them by the single parameter o.
t_multicast ≈ Max[o_sender + h_i × L_i + o_receiver] over ∀ paths
            = Max[2 × o + h_i × L_i] over ∀ paths    (7)
Of course, the variable L_i is not involved in the separate addressing model. The reason is simply that separate addressing consists of a series of unicast message transmissions from the source to the destination nodes, so there are no intermediate nodes needed to relay the multicast message to other nodes. Table 1 presents the t_process and L_i values for short message sizes. t_process is independent of the multicast message and group size, but strictly dependent on the performance of the host machine. Therefore, different t_process values will be obtained for different computing platforms.
Figure 8: Measured and calculated model parameters for short message sizes: L_p = 14 ns, L_f = 60 ns, L_s = 670 ns, and o = 1994 + 3.15 × (M - 128) ns, where M is the message size in bytes
Table 1: Calculated t_process and L_i values (in µs)

Algorithm               t_process    L_i
Separate Addressing     7            N/A
Md-torus                206          206 + 2×o
Mu-torus                201          201 + 2×o
U-torus                 629          629 + 2×o
S-torus                 265          265 + 2×o
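For reference, the simplified models of the following subsections (Eqs. 8-12) can be transcribed directly into code. The sketch below assumes the µs units of Table 1 and the 128-byte overhead o = 1994 ns ≈ 1.994 µs from Figure 8; the intermediate-node counts for Eqs. 9 and 10 follow the reconstructions given in Sections 5.2 and 5.3.

```python
import math

O = 1.994                                     # o for 128-byte messages (us)
T_PROC = {'sep': 7.0, 'md': 206.0, 'mu': 201.0, 'u': 629.0, 's': 265.0}
# Table 1: L_i = t_process + 2*o (N/A for separate addressing).
L_I = {a: t + 2.0 * O for a, t in T_PROC.items() if a != 'sep'}

def t_sep(G):                                 # Eq. 8, separate addressing
    return (G - 1) * (O + T_PROC['sep'] + O)

def t_md(G, k):                               # Eq. 9, Md-torus
    return 2 * O + (math.ceil(G / k) - 1) * L_I['md']

def t_mu(G, k, r):                            # Eq. 10, Mu-torus
    p = math.ceil(G / r)                      # number of partitions
    h_i = math.ceil((G - r) / k) + r - 1 if p >= 2 else G - 2
    return 2 * O + h_i * L_I['mu']

def t_u(G, k):                                # Eq. 11, U-torus
    steps = math.ceil(math.log2(G))
    return 2 * O + steps * ((G % k) + k) / k * L_I['u']

def t_s(G):                                   # Eq. 12, S-torus
    return 2 * O + (G - 2) * L_I['s']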
The following subsections discuss the simplified model for each multicast algorithm in more detail. For each algorithm, the modeled values, the actual testbed measurements, and the modeling error are presented.
5.1. Separate addressing model
With the separate addressing algorithm, for a group size of G, there are G - 1 destination nodes that the root node alone must serve. Therefore, a simplified model for separate addressing can be expressed as:

t_multicast ≈ (G - 1) × (o_sender + t_process + o_receiver)    (8)

Figure 9 plots the simplified model and the actual measurements for various multicast group sizes with a 128-byte multicast message, using the o and t_process values previously defined. The results show that the model is accurate, with an average error of 3%.
Figure 9: Simplified model vs. actual measurements for the separate addressing algorithm
5.2. Md-torus model
For Md-torus, the total number of intermediate nodes for any communication path is observed to be a function of G, the multicast group size, and k, the number of nodes in a given dimension. The simplified Md-torus model is formulated as:

t_multicast ≈ 2 × o + (⌈G/k⌉ - 1) × L_i    (9)

The simplified model and the actual measurements for various group sizes with 128-byte messages are plotted in Figure 10. As can be seen, the simplified model is accurate, with an average error of 2%.
Figure 10: Simplified model vs. actual measurements for the Md-torus algorithm
5.3. Mu-torus model
The number of partitions for the Mu-torus algorithm, denoted by p, is determined by the multicast group size, G, and the partition length, r. For systems with r equal to G, there exists only one partition, and the multicast message is propagated in a chain-type communication mechanism among the destination nodes. Under this condition, the number of intermediate nodes is simply two less than the group size; the two subtracted nodes are the root and the last destination node. For systems with two or more partitions, the number of intermediate nodes becomes a function of the group size, the partition length, and the number of nodes in a given dimension. The simplified model is given as:

t_multicast ≈ 2 × o + (⌈(G - r)/k⌉ + r - 1) × L_i,   p ≥ 2
t_multicast ≈ 2 × o + (G - 2) × L_i,                 p < 2    (10)

Figure 11 shows the small-message model versus the actual measurements for 128-byte messages. The results show that the model is accurate, with an average error of 2%.

Figure 11: Simplified model vs. actual measurements for the Mu-torus algorithm
5.4. U-torus model
For U-torus, the minimum number of communication steps required to cover all destination nodes can be expressed as ⌈log₂ G⌉. The number of intermediate nodes in the U-torus algorithm is a function of the minimum required number of communication steps, the group size, and the number of nodes in a given dimension. The simplified U-torus model is given as:

t_multicast ≈ 2 × o + ⌈log₂ G⌉ × (((G mod k) + k)/k) × L_i    (11)

Figure 12 presents the short-message model and the actual measurements for various group sizes. The results show the model is accurate, with an average error of 2%.
Figure 12: Simplified model vs. actual measurements for the U-torus algorithm

5.5. S-torus model
S-torus is a chain-type communication algorithm and can be modeled identically to the single-partition case of the Mu-torus algorithm. The simplified model for S-torus is formulated as:

t_multicast ≈ 2 × o + (G - 2) × L_i    (12)

Figure 13: Simplified model vs. actual measurements for the S-torus algorithm
The results from the short-message model versus the actual measurements for 128-byte messages are shown in Figure 13. As previously stated, S-torus routing is based on a Hamiltonian circuit. This type of routing ensures that each destination node will receive only one copy of the message, but some forwarding nodes (i.e., non-destination nodes that are on the actual message path) may be visited more than once for routing purposes. Moreover, depending on the group size, single-phase routing traverses many unnecessary channels, creating more traffic and possible contention. Therefore, S-torus has unavoidably large latency variations due to the varying and sometimes extremely long message paths [18]. The small-message model presented here is incapable of tracking these large variations in completion latency, which are inherent to the S-torus multicast algorithm. The modeling error is therefore relatively high, unlike those of the other algorithms evaluated in this study. The instability of the modeling error is not expected to lessen with increasing group size.
6. Analytical Projections
To evaluate and compare the short-message performance of the multicast algorithms for larger systems, the simplified models are used to investigate 2-D torus-based parallel systems with 64 and 256 nodes. The effects of different partition lengths (i.e., r = 8, r = 16, and r = 32) for the Mu-torus algorithm over these system sizes are also investigated analytically with these projections. The results of the projections are plotted in Figure 14.
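As a usage example, the Figure 14 curves can be reproduced by evaluating the Section 5 model sketch over the larger tori; the driver below is hypothetical, with k the per-dimension node count and r the Mu-torus partition length.

```python
# Project small-message completion latency (us) on 8x8 and 16x16 tori,
# reusing t_sep, t_s, t_md, t_mu, and t_u from the Section 5 sketch.
for k in (8, 16):
    for G in range(4, k * k + 1, 4):
        print(f"k={k} G={G:3d} "
              f"sep={t_sep(G):8.1f} s={t_s(G):8.1f} md={t_md(G, k):7.1f} "
              f"mu={t_mu(G, k, r=k):7.1f} u={t_u(G, k):7.1f}")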
Figure 14: Small-message latency projections for an 8×8 torus system (a), and a 16×16 torus system (b)
The Md-torus algorithm has a step-like completion latency due to the fact that, with every new k destination nodes added to the group, a new ringlet is introduced into the multicast communication, which increases the completion latency. The optimal performance for the 8×8 torus network is obtained when Mu-torus has a partition length of 8, and for the 16×16 torus network when the partition length is 16. Therefore, it is surmised that the optimal partition length for Mu-torus is equal to k for 2-D direct SCI torus networks.
U-torus, on average, has slowly increasing completion latency with increasing group size. U-torus, Mu-torus, and Md-torus all tend to have similar asymptotic latencies. S-torus and separate addressing, as expected, have linearly increasing latencies with increasing group sizes. Although separate addressing is the best choice for small-scale systems, it loses its advantage with increasing group sizes for short messages. S-torus, by contrast, again proves to be a poor choice, as it is simply the worst-performing algorithm for group sizes greater than 8.
7. Conclusions
In this study, five different multicast algorithms for high-performance torus networks are evaluated. These algorithms are analyzed on direct SCI torus networks. The performance characteristics of these algorithms are experimentally examined under different system and network configurations using various metrics, such as multicast completion latency, root-node CPU utilization, multicast startup and tree-creation latencies, and link concentration and concurrency, to evaluate their key strengths and weaknesses. Based on the results obtained, small-message latency models for each algorithm are defined. The models are observed to be accurate. Projections for larger systems are also presented and evaluated.
Based on the experimental analysis, it is observed that the separate addressing algorithm is the best choice for small messages or small group sizes from the perspective of multicast completion latency and CPU utilization, due to its simple and cost-effective structure. The Md-torus algorithm performs best from the perspective of completion latency for large messages or large group sizes, because of the balance provided by its use of dimensional partitioning. In addition, Md-torus incurs a very low CPU overhead and achieves high concurrency for all the message and group sizes considered. The U-torus and Mu-torus algorithms perform better when the individual multicast path depths are approximately equal. Furthermore, the Mu-torus algorithm exhibits its best performance when the group size is an exact multiple of the partition length. The U-torus and Mu-torus algorithms have nearly constant CPU utilization for small and large messages alike. Moreover, the U-torus algorithm has the highest concurrency among all algorithms evaluated, due to the high parallelism provided by the recursive-doubling method. S-torus is always the worst performer from the perspective of completion latency and CPU utilization, because of its lack of concurrency and its extensive communication overhead. As expected, S-torus exhibits a nearly linear increase in completion latency and CPU utilization for large messages with increasing group size.
The small-message latency models, using only a few parameters, capture the essential mechanisms of multicast communication over the given platforms. The models are accurate for all evaluated algorithms except the S-torus algorithm. Small-message multicast latency projections for larger torus systems are provided using these models. Projection results show that with increasing group size the U-torus, Mu-torus, and Md-torus algorithms tend to have asymptotically bounded, similar latencies. Therefore, it is possible to choose an optimal multicast algorithm among these three for larger systems, based on the multicast completion latency and other metrics such as CPU utilization or network link concentration and concurrency. Projection results also show that S-torus and separate addressing have unbounded, linearly increasing completion latencies with increasing group sizes, which makes them unsuitable for large-scale systems. It is also possible and straightforward to project the multicast performance of larger-scale 2-D torus networks with our model. Applying the simplified models to other torus networks and/or multicast communication schemes is possible with a minimal calibration effort.
Overall, the results of this research make it clear that no single multicast algorithm is best in all cases for all metrics. For example, as the number of dimensions in the network increases, the Md-torus algorithm becomes dominant. By contrast, for networks with fewer dimensions supporting a large number of nodes, the Mu-torus and U-torus algorithms are most effective. Separate addressing is an efficient and cost-effective choice for small-scale systems. Finally, S-torus is determined to be inefficient compared to the alternative algorithms in all the cases evaluated. This inefficiency is due to the extensive length of the paths used to multicast, which in turn leads to long and widely varying completion latencies, as well as a high degree of root-node CPU utilization.
There are several possibilities for future research, one of which is integrating and evaluating the performance of the selected algorithms with higher communication layers such as MPI. Another possible direction is to extend our SAN-based research as a low-level communication service for other high-performance networks, including use not only in cluster networks but also in hierarchical grid networks.
8. References
[1] P.K. McKinley, H. Xu, A.H. Esfahanian, and L.M. Ni, Unicast-Based Multicast Communication in Wormhole-Routed Networks, IEEE Transactions on Parallel and Distributed Systems 5 (12) (1994) 1252-1265.
[2] P.K. McKinley, Y. Tsai, and D.F. Robinson, Collective Communication in Wormhole-Routed Massively Parallel Computers, IEEE Computer 28 (2) (1995) 39-50.
[3] Y. Tseng, D.K. Panda, and T. Lai, A Trip-Based Multicasting Model in Wormhole-Routed Networks with Virtual Channels, IEEE Transactions on Parallel and Distributed Systems 7 (2) (1996) 138-150.
[4] R. Kesavan and D.K. Panda, Multicasting on Switch-Based Irregular Networks using Multi-Drop Path-Based Multi-Destination Worms, Proc. of Parallel Computer Routing and Communication, Second International Workshop (PCRCW '97), Atlanta, Georgia, Jun. 1997, pp. 179-192.
[5] D.F. Robinson, P.K. McKinley, and B.H.C. Cheng, Optimal Multicast Communication in Wormhole-Routed Torus Networks, IEEE Transactions on Parallel and Distributed Systems 6 (10) (1995) 1029-1042.
[6] D.F. Robinson, P.K. McKinley, and B.H.C. Cheng, Path-Based Multicast Communication in Wormhole-Routed Unidirectional Torus Networks, Journal of Parallel and Distributed Computing 45 (2) (1997) 104-121.
[7] Scali Computer AS, Scali System Guide Version 2.0, White Paper, Scali Computer AS, Oslo, Norway, 2000.
[8] IEEE, SCI: Scalable Coherent Interface, IEEE Approved Standard 1596-1992, 1992.
[9] X. Lin and L.M. Ni, Deadlock-Free Multicast Wormhole Routing in Multicomputer Networks, Proc. of 18th Annual International Symp. on Computer Architecture, Toronto, Canada, May 1991, pp. 116-124.
[10] L.M. Ni and P.K. McKinley, A Survey of Wormhole Routing Techniques in Direct Networks, IEEE Computer 26 (2) (1993) 62-76.
[11] K. Omang and B. Parady, Performance of Low-Cost UltraSPARC Multiprocessors Connected by SCI, Technical Report, Department of Informatics, University of Oslo, Norway, 1996.
[12] M. Ibel, K.E. Schauser, C.J. Scheiman, and M. Weis, High-Performance Cluster Computing using SCI, Proc. of Hot Interconnects Symposium V, Palo Alto, CA, Aug. 1997.
[13] M. Sarwar and A. George, Simulative Performance Analysis of Distributed Switching Fabrics for SCI-Based Systems, Microprocessors and Microsystems 24 (1) (2000) 1-11.
[14] D. Gonzalez, A. George, and M. Chidester, Performance Modeling and Evaluation of Topologies for Low-Latency SCI Systems, Microprocessors and Microsystems 25 (7) (2001) 343-356.
[15] R. Todd, M. Chidester, and A. George, Comparative Performance Analysis of Directed Flow Control for Real-Time SCI, Computer Networks 37 (4) (2001) 391-406.
[16] H. Bugge, Affordable Scalability using Multicubes, in: H. Hellwagner, A. Reinfeld (Eds.), SCI: Scalable Coherent Interface, LNCS State-of-the-Art Survey, Springer, Berlin, Germany, 1999, pp. 167-174.
[17] L.P. Huse, Collective Communication on Dedicated Clusters of Workstations, Proc. of 6th PVM/MPI European Users' Meeting (EuroPVM/MPI '99), Sabadell, Barcelona, Spain, Sep. 1999, pp. 469-476.
[18] H. Wang and D.M. Blough, Tree-Based Multicast in Wormhole-Routed Torus Networks, Proc. of Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA '98), Las Vegas, Nevada, Jul. 1998, pp. 702-709.
[19] D.E. Culler, R. Karp, D.A. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, LogP: Towards a Realistic Model of Parallel Computation, Proc. of ACM 4th SIGPLAN Symp. on Principles and Practice of Parallel Programming, San Diego, California, May 1993, pp. 1-12.
[20] E. Deelman, A. Dube, A. Hoisie, Y. Luo, R. Oliver, D. Sundaram-Stukel, H. Wasserman, V.S. Adve, R. Bagrodia, J.C. Browne, E. Houstis, O. Lubeck, J. Rice, P. Teller, and M.K. Vernon, POEMS: End-to-End Performance Design of Large Parallel Adaptive Computational Systems, Proc. of First International Workshop on Software and Performance (WOSP '98), Santa Fe, New Mexico, Oct. 1998, pp. 18-30.
