Title: Reliability modeling of SCI ring-based topologies
Permanent Link: http://ufdc.ufl.edu/UF00094776/00001
 Material Information
Title: Reliability modeling of SCI ring-based topologies
Physical Description: Book
Language: English
Creator: Sarwar, Mushtaq
George, Alan D.
Collins, D.
Affiliation: FAMU-FSU -- College of Engineering
University of Florida
University of Florida
Publisher: High-performance Computing and Simulation Research Laboratory, Department of Electrical and Computer Engineering, University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2000
Copyright Date: 2000
 Record Information
Bibliographic ID: UF00094776
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.




© 2000, HCS Research Lab. All Rights Reserved


Reliability Modeling of SCI Ring-Based Topologies

M. Sarwar¹, A. George², and D. Collins¹

High-performance Computing and Simulation (HCS) Research Laboratory
¹Department of Electrical and Computer Engineering, FAMU-FSU College of Engineering
²Department of Electrical and Computer Engineering, University of Florida

Abstract
Performance evaluation and reliability prediction are two important factors in the study of multiprocessor and
cluster interconnects. One such interconnect is the Scalable Coherent Interface (SCI). SCI is a point-to-point,
ring-based interconnect that can be configured in various switched-ring topologies such as counter-rotating rings and
tori. While performance analyses of SCI-based interconnects have been discussed in the literature, reliability
evaluation has not received much attention. In addition, the reliability of SCI interconnects configured in many of
today's popular topologies cannot be deduced from earlier work on network reliability as link failures within an SCI
interconnect are not independent of one another. A single link failure within the topology results in the failure of the
entire ringlet to which the link belongs. This paper presents the results of a reliability study on 1D and 2D k-ary
n-cube switching fabrics for the Scalable Coherent Interface based on ring elimination rather than link elimination.
The study is conducted using reliability models created in UltraSAN, a tool based on Stochastic Activity Networks.
The models are verified using both combinatorial and Markov modeling. The results demonstrate that the inherent
reliability characteristics of a single-ring system can be significantly enhanced by the addition of a second redundant
ring. By contrast, the results show that the reliability of a torus does not increase significantly with the addition of
redundant rings. Hence, the cost of adding redundant rings to certain topologies may or may not be justified,
depending upon the degree of reliability sought.

1. Introduction
The inevitable shift towards parallel computing systems has accentuated the need for more reliable networks.
As the number of components in the network increases, so too does the failure rate of the system. Classic fault-
tolerant systems use redundant interconnects to provide a measure of fault tolerance. However, other systems often
provide fault tolerance by using the inherent redundancy within an interconnect rather than using completely
redundant interconnect components. Hypercubes, meshes, tori and other networks that are topologically isomorphic
to the family of k-ary n-cube networks are some examples of such topologies, where n is the dimension of the cube,
k is the radix (e.g., k = 2 for a hypercube), and the number of nodes in the network is k^n. These topologies provide
fault tolerance in the form of multiple paths from any source node to any destination node. In this paper we target
the inherent redundancy in such networks in order to determine the reliability of SCI [1] topologies. The k-ary n-
cube family was chosen since SCI is a ring-based network, and hence k-ary n-cubes provide an ideal framework for
scalable SCI systems.
Reliability analysis of redundant fault-tolerant systems is typically performed using one of two methods,
combinatorial modeling and Markov modeling [2]. Combinatorial modeling is applicable to cases where the system
can be broken down into series and parallel combinations of components. For more complex systems that cannot be
represented using distinct series-parallel combinations, Markov modeling is employed. A higher-level modeling
approach based on Petri nets can also be used, providing a graphical view of the operation of a system and the
interaction between failures. Petri nets are used to determine the reliability of systems by generating the Markov
state space for the model and solving the associated Chapman-Kolmogorov equations. Some of the more popular
Petri net packages include DSPNexpress [3], GreatSPN [4], HARP [5], SPNP [6], and SURF-2 [7]. For this paper,
the reliability of SCI-based 1D and 2D k-ary n-cubes is studied using UltraSAN [8], a tool based on Stochastic
Activity Networks developed by Sanders et al. at the University of Illinois at Urbana-Champaign. UltraSAN was
chosen since it permits easy modeling of replicable systems and provides a broad range of analytic solvers.
The remainder of this paper is organized as follows. Section 2 presents related research in the areas of fault-
tolerant network reliability modeling, with an emphasis on ring-based architectures and the use of Petri net models.
Section 3 describes the characteristics of the SCI model implemented in UltraSAN with a description of the
assumptions made in creating the models. Analytical verification of the UltraSAN models of the studied k-ary n-
cube systems is detailed in Section 4, using both combinatorial and Markov models. Section 5 examines the
reliability results obtained from the UltraSAN model. The reliabilities of the unidirectional and bi-directional tori and
the single and counter-rotating ring configurations are analyzed and compared, allowing informed decisions to be
made regarding network organization and level of redundancy for SCI-based applications with reliability
requirements. Finally, the conclusions and possible directions for future research are presented in Section 6.

2. Related research
Fault tolerance of interconnect topologies can be measured in terms of the terminal reliability or network
reliability. Terminal reliability is the probability that there exists at least one path from a given node to a destination
node. It is most commonly used to assess the reliability of multi-stage interconnection networks (MINs) as is done
by Colbourn et al. [9] and Varma and Raghavendra [10]. This paper concentrates on network reliability, or the
probability that there exists at least one path from every node to all other nodes.
Network reliability can be assessed using either combinatorial or Markov modeling. Combinatorial modeling
of networks requires the decomposition of a network into subnets and determining the reliability of the entire system
as a combination of the subnets. Cheng and Ibe [11] and Menezes and Bakhru [12] use such a method to evaluate
the reliability of shuffle-exchange networks. Markov modeling can also be used as an alternative to combinatorial
modeling or in conjunction with combinatorial modeling. Blake and Trivedi [13] use continuous time Markov
chains to determine the reliability of shuffle-exchange networks. In [14], Blake and Trivedi use Markov modeling
in conjunction with combinatorial modeling by dividing the studied MINs into a two-level model, obtaining the
reliability of each subsystem using Markov modeling and the system reliability using a series system comprised of
the Markov components. In [15], Balakrishnan and Reibman use Markov modeling to determine the reliability of
private networks where the minimal operational path is dictated by the application. The Balakrishnan-Reibman
models present an example where combinatorial analysis is no longer feasible since the reliability models are
dependent upon the communication paths. Since the communication paths can take any form, they cannot be
accurately represented as series-parallel combinations.
This paper concentrates on the reliability of SCI networks configured as k-ary n-cubes. The reliability of ring-
based architectures has been studied in Smith and Trivedi [16]. The topology targeted in that paper is the forward
loop backward hop (FLBH) network. The FLBH class of 1D ring topologies, which includes daisy-chain loops,
forward loop parity hop networks, and chordal rings, has also been studied by Raghavendra and Silvester [17].
Work has also been conducted on 2D architectures such as the Manhattan street network (MSN) and the torus. In
[18], Chung et al. present the terminal reliability analysis of an MSN and a 2D Torus. Chen and Berger [19] present
a reliability analysis of Manhattan street networks showing the complexity of the Markov model for the MSN.
Complexity arises from the interdependence between link failures as shown in their paper. In the model presented in
this paper, the requirements of the SCI protocol help to simplify the link interdependence as entire rings containing
link failures are eliminated. The ring elimination cuts down dramatically on the state space of the corresponding
Markov model. In addition, by using the ring as the basic building block of the k-ary n-cube models, Petri nets can
be used to determine the reliabilities of the systems.
Network reliability analysis using Petri nets has not been carried out extensively. Most reliability models based
on Petri nets deal with small redundant systems with a fixed number of components. The reason for this limited
usage of Petri nets is that only systems that employ replicable building blocks can be easily modeled. Benitez and
Fortez [20] demonstrate the use of a Petri net model for determining the reliability of fault-tolerant processor arrays.
In their paper, the processor array is considered to be a replicable system with a single row of processors as the
building block of the model. Since SCI uses a fabric of switched rings, and each ring is eliminated if it contains a
single faulty link, the k-ary n-cubes being studied can be easily created using replicable components, making ring-
based systems with distributed switching ideal for modeling with Petri nets. The primary benefit of modeling a
system with Petri nets is the ease with which the model can be created and replicated to create larger models. In
addition, a Petri net provides an easy-to-understand visual representation of the system and the interaction between faults.

3. SCI UltraSAN model
The simplest configuration of an SCI interconnect is a ring traversing all nodes. The ring is based on the
architecture of an SCI interface illustrated in Fig. 1. Incoming packets to the interface pass through an address
decoder. If the packet is destined for the local node, the decoder places it into the request- or response-input queue.
If the packet is destined for another downstream node, it is forwarded to the bypass FIFO. To output a packet, the
SCI node must have sufficient free space in its bypass FIFO to hold all incoming symbols. When there are no
packets waiting in the output queue or there is insufficient free space in the bypass FIFO for the output queue data to
be sent, data from the bypass FIFO is transmitted on the node's output link. If the bypass queue is empty, then idle
symbols are transmitted. Idle symbols also carry flow-control information and at least one must precede any send or











echo packet. This flow-control information is used to inhibit upstream nodes from sending data when the bypass
FIFO must be emptied to allow the output queue to be emptied.
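
The output-link behavior described above can be sketched as a simple decision function. This is a minimal illustration written for this text, not code from the SCI standard or from the UltraSAN model: the function name, the queue representation, and the symbol-count bookkeeping are all assumptions, and echo packets and flow-control bits are ignored.

```python
from collections import deque

def next_output_symbol_source(output_queue, bypass_fifo, bypass_capacity):
    """Decide what the node places on its output link next.

    output_queue: deque of packet lengths (in symbols) waiting to be sent.
    bypass_fifo: deque of symbols passing through to downstream nodes.
    bypass_capacity: total size of the bypass FIFO, in symbols.
    All three are hypothetical stand-ins for the hardware queues.
    """
    free_space = bypass_capacity - len(bypass_fifo)
    # A locally generated packet may be sent only if the bypass FIFO can
    # absorb the symbols that arrive while the packet is on the link.
    if output_queue and free_space >= output_queue[0]:
        return "send"
    # Otherwise drain pass-through traffic, if there is any.
    if bypass_fifo:
        return "bypass"
    # Nothing to transmit: emit an idle symbol.
    return "idle"
```

The ordering of the checks mirrors the prose: local packets go first when space allows, pass-through traffic second, and idle symbols fill the remaining slots.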
Larger SCI networks are based on multiple ringlets connected together to create more complex topologies
through the use of agents. An agent is essentially an SCI-to-SCI bridge used to interconnect two or more rings. The
topologies studied in this paper include the unidirectional and bi-directional forms of the topologies illustrated in
Fig. 2. In [21] we presented performance analysis of these SCI-based topologies using the node model shown in
Fig. 3. Each switch in a distributed switching fabric provides the ability to interface up to four SCI ringlets and also
acts as an interface to the processing unit at that node. The UltraSAN models presented in this paper are based on
this switch model.

Fig. 1. The SCI interface

Fig. 2. (a) Dual-ring k-ary 1-cube, (b) bi-directional k-ary 2-cube

Fig. 3. 4-port switch model


In the UltraSAN model, it is assumed that each node consists of the processing node and a crossbar. The SCI
interfaces are used to connect the node into ring-based topologies. The number of SCI interfaces can be increased
from one unit to four units, permitting the node to access up to four rings, hence providing it with four input and four
output ports.










Due to the inherent characteristics of a register-insertion ring and their key role in SCI, all links in an SCI ring
must be operational in order for that ring to operate correctly. A single link failure eliminates the entire ring from a
multi-ring system, requiring the system to reconfigure and use the remaining links to continue normal operation.
The UltraSAN model uses this fact as the basis of determining a failed state for the network. If a node is
disconnected from the remainder of the network either due to ring failures or the failure of the crossbar at that node,
then the entire system is considered to be in a failed state. If a set of ring failures causes the network to be disjoint,
then that condition will also result in a failed state.
To create the dual-ring topology, the single-ring model is simply duplicated. The torus model is created using
one or more single-ring models representing the rows in a torus with each row sharing the vertical rings. Each row
monitors ring failures within itself and the vertical rings connected to it. By doing so, any one of the duplicated row
subnets representing the rows can detect and signal a network failure.
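
The failed-state rule used by the models (ring elimination followed by a connectivity check) can be illustrated outside UltraSAN with a short sketch. The code below is our own illustration, not the UltraSAN model itself; the function name, the row/column indexing of rings, and the use of a union-find structure are assumptions made for the example. It treats a k x k unidirectional torus in which node (r, c) sits on row ring r and column ring c.

```python
def torus_operational(k, failed_rows, failed_cols, failed_crossbars=frozenset()):
    """Return True if a k x k unidirectional torus is still fully connected.

    A ring with any failed link is eliminated as a whole, so only
    ring-level failures are tracked here (hypothetical helper).
    """
    if failed_crossbars:
        return False  # a failed crossbar isolates its node from the network
    parent = list(range(k * k))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Every surviving ring lets all of its nodes reach one another.
    for r in range(k):
        if r not in failed_rows:
            for c in range(1, k):
                union(r * k, r * k + c)
    for c in range(k):
        if c not in failed_cols:
            for r in range(1, k):
                union(c, r * k + c)
    # Operational only if all nodes ended up in one connected component.
    return len({find(x) for x in range(k * k)}) == 1
```

For a 9-node torus (k = 3), losing one row ring leaves the system operational, while additionally losing any column ring isolates the node at their intersection, and losing all three row rings leaves three disjoint column rings.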

4. Model Verification
The models are verified using combinatorial modeling for the single and counter-rotating ring systems, and
Markov modeling for the torus model. The single and counter-rotating ring systems can be represented as series-parallel
combinations of components. The tori systems cannot, and hence require the use of Markov modeling to
determine their reliability.

4.1 1D topologies (k-ary 1-cubes)
In the single ring model, a link failure results in a system failure. In addition, it is assumed that a system failure
occurs if a single node is unable to communicate with the rest of the network. Such a case can occur due to a
crossbar failure resulting in the isolation of the attached node from the network. The reliability of single ring
systems of n nodes can therefore be verified analytically by assuming a series system of links and crossbars. The
system reliability is expressed as:

R_{system}(t) = R_{link}^{n}(t) \, R_{crossbar}^{n}(t)    (1)

In Equation 1, based on the Exponential Failure Law (EFL), R_{link}(t) = e^{-\lambda_{link} t} and R_{crossbar}(t) = e^{-\lambda_{crossbar} t}. The failure
rates of the links and the crossbars were estimated using the Handbook for Reliability Prediction of Electronic
Equipment (MIL-HDBK-217F) [23]. These failure rates are \lambda_{link} = 3.5 \times 10^{-6} failures per hour and \lambda_{crossbar} = 1 \times 10^{-6}
failures per hour, respectively. In the series model expressed by Equation 1, all links and crossbars must operate
correctly for the system to remain in an operational state.
The analytical model for the counter-rotating ring systems is a combined series-parallel system in which at least
one of the two rings must be operational and all crossbars must be operational for the system to remain in an
operational state. The reliability of the counter-rotating ring systems is expressed as:

R_{system}(t) = \{1 - [1 - R_{link}^{n}(t)]^{2}\} \, R_{crossbar}^{n}(t)    (2)

The expression 1 - [1 - R_{link}^{n}(t)]^{2} represents the reliability of having at least one of the two rings operational. This value is
then multiplied by the crossbar term R_{crossbar}^{n}(t) to account for the n crossbars that must remain
operational. Fig. 4 shows the UltraSAN and analytical model reliability results for the single ring (SR) and counter-
rotating ring (CRR) systems. In both cases the reliabilities obtained from the UltraSAN models are identical to the
values obtained analytically.
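
The two analytical models of this subsection are simple enough to evaluate directly. The sketch below is an illustration added for clarity, not the authors' verification code. It assumes the exponential failure law for every component, a link failure rate of 3.5e-6 failures per hour (our reading of the value quoted above), and a crossbar failure rate of 1e-6 failures per hour; the function and constant names are hypothetical.

```python
import math

LAMBDA_LINK = 3.5e-6      # failures/hour; assumed reading of the link rate
LAMBDA_CROSSBAR = 1.0e-6  # failures/hour, as quoted in the text

def r_single_ring(n, t):
    """Series model: n links and n crossbars must all survive to time t."""
    return math.exp(-n * (LAMBDA_LINK + LAMBDA_CROSSBAR) * t)

def r_counter_rotating(n, t):
    """Two n-link rings in parallel, in series with n crossbars."""
    r_ring = math.exp(-n * LAMBDA_LINK * t)
    return (1.0 - (1.0 - r_ring) ** 2) * math.exp(-n * LAMBDA_CROSSBAR * t)

# Tabulate both models at a 10000-hour mission time for the studied sizes.
for n in (9, 16, 25, 36):
    print(n, round(r_single_ring(n, 10000), 4), round(r_counter_rotating(n, 10000), 4))
```

Because every component is exponential, the series product of n links and n crossbars collapses into a single exponential with rate n times the sum of the two failure rates.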

4.2 2D topologies (k-ary 2-cubes)

Due to the inter-ring dependencies, the 2D topologies cannot be represented using a distinct series-parallel
model, hence Markov modeling must be employed. In addition, the state spaces of the Markov models for the 2D
systems increase rapidly with each increment in k. For this reason, the smaller Markov model of a 9-node torus is
used to verify the reliability results obtained from UltraSAN. Fig. 5 depicts the Markov model for a 9-node
unidirectional torus. Fig. 6 shows the determination of the critical rings after a given ring failure. Failure of any
critical ring will then result in a network failure. The larger tori models use a simple extension of the 9-node torus
model, and thus the verification below adds credence to their accuracy as well. Equation 3 gives the reliability of a
9-node unidirectional torus as a function of mission time.


R_{9-node}(t) = e^{-6\lambda_{ring} t} - 6e^{-5\lambda_{ring} t} + 6e^{-4\lambda_{ring} t}    (3)


Fig. 4. UltraSAN and analytical reliabilities for single ring and counter-rotating ring systems


In Equation 3, \lambda_{ring} is the failure rate of each ringlet. For the 9-node torus, setting \lambda_{ring} = 3\lambda_{link} accounts for the
three links making up each ringlet. To determine the system reliability, Equation 3 is multiplied by R_{crossbar}^{9}(t) to
account for the 9 crossbars within the system. For a detailed description of the approach used to derive the
reliability expression above, the reader is directed to [2]. Fig. 7 illustrates the accuracy of the UltraSAN model,
where the analytical and UltraSAN results for the 9-node torus are nearly identical.
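
As an independent sanity check on the ring-level expression, the Markov chain can be integrated numerically and compared against the closed form e^{-6λt} - 6e^{-5λt} + 6e^{-4λt}. The sketch below is ours and rests on an assumed reading of Fig. 5: transition rates of 6λ from state 0 to state 1, 2λ from state 1 to state 2, 3λ from state 1 to the failed state, and 4λ from state 2 to the failed state, where λ is the ring failure rate. The function names and the choice of a fourth-order Runge-Kutta integrator are illustrative only.

```python
import math

def torus9_closed_form(lam, t):
    """Closed-form ring-level reliability for ring failure rate lam."""
    return (math.exp(-6 * lam * t)
            - 6 * math.exp(-5 * lam * t)
            + 6 * math.exp(-4 * lam * t))

def torus9_markov(lam, t, steps=2000):
    """Integrate the three operational states (0, 1, 2) with RK4."""
    def deriv(p):
        p0, p1, p2 = p
        return (-6 * lam * p0,
                6 * lam * p0 - 5 * lam * p1,   # state 1 is left at rate 3 + 2
                2 * lam * p1 - 4 * lam * p2)

    p = (1.0, 0.0, 0.0)  # all rings operational at t = 0
    h = t / steps
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv(tuple(x + 0.5 * h * d for x, d in zip(p, k1)))
        k3 = deriv(tuple(x + 0.5 * h * d for x, d in zip(p, k2)))
        k4 = deriv(tuple(x + h * d for x, d in zip(p, k3)))
        p = tuple(x + h / 6.0 * (a + 2 * b + 2 * c + d)
                  for x, a, b, c, d in zip(p, k1, k2, k3, k4))
    return sum(p)  # probability of not having reached the failed state
```

Calling both functions with the same ring failure rate and mission time, for example `torus9_markov(1.05e-5, 10000)` and `torus9_closed_form(1.05e-5, 10000)`, should give values that agree to within the integration error.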


[State diagram: operational states 0, 1, and 2, and failed state F. Transitions: 0 to 1 with probability 6λΔt, 1 to 2 with 2λΔt, 1 to F with 3λΔt, and 2 to F with 4λΔt; self-loops of 1-6λΔt, 1-5λΔt, and 1-4λΔt on states 0, 1, and 2, and 1.0 on F.]

Fig. 5. Markov model for a 9-node unidirectional torus

Fig. 6. 9-node torus with critical and non-critical rings after one ring failure

Fig. 7. UltraSAN and analytical reliability results for the 9-node torus

5. Reliability Results

In this section, the reliability results obtained for the k-ary n-cubes listed in Table 1 are presented. Afterwards,
in the next section, three case studies are presented to demonstrate the application of the reliability results to SCI
systems requiring varying levels of fault tolerance.


Table 1. 1D and 2D model descriptions

1D Systems                                          2D Systems
Single Ring               Counter-Rotating Ring     Unidirectional Torus      Bi-directional Torus
9-ary 1-cube (9-node)     9-ary 1-cube (9-node)     3-ary 2-cube (9-node)     3-ary 2-cube (9-node)
16-ary 1-cube (16-node)   16-ary 1-cube (16-node)   4-ary 2-cube (16-node)    4-ary 2-cube (16-node)
25-ary 1-cube (25-node)   25-ary 1-cube (25-node)   5-ary 2-cube (25-node)    5-ary 2-cube (25-node)
36-ary 1-cube (36-node)   36-ary 1-cube (36-node)   6-ary 2-cube (36-node)    6-ary 2-cube (36-node)



The reliabilities of the systems presented in this section are based on estimated component reliabilities. Even
though the individual component reliabilities may not be precise for any particular implementation, maintaining the
values constant throughout the evaluation process permits a relatively fair comparison of the systems.

5.1 1D Systems
The 1D SCI systems consist of single and counter-rotating ring topologies where each ring traverses all nodes
within the system. The reliabilities determined from the UltraSAN models for the 1D single ring (SR) and counter-
rotating ring (CRR) systems are shown in Fig. 8. The trend shows a decrease in reliability with each increment in
ring size.
The ratio of the analytical reliabilities, from Equations 1 and 2, for the counter-rotating ring and single ring
systems is given by

\frac{R_{CRR}(t)}{R_{SR}(t)} = 2 - R_{link}^{n}(t)    (4)

This equation shows that the reliability of each counter-rotating ring system ranges from 1 to 2 times that of the
comparable single ring system. R_{link}^{n}(t) is a function of both time and the number of nodes in the system. It is
observed that as time tends to infinity, the reliability of each counter-rotating ring system approaches twice the
reliability of the comparable single ring system of equal size. However, at this point, both reliabilities are themselves
close to zero. Examining the effect of the number of nodes n at a constant time, it is seen that as the number of
nodes increases, the ratio of the reliabilities of the counter-rotating ring systems to the single ring systems increases.
Hence, it can be concluded that the addition of a second ring to the single ring systems improves the reliability of the
larger systems more significantly than the smaller ones.
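
For reference, the ratio follows in two steps from the single ring and counter-rotating ring models of Section 4.1. The derivation below is a worked restatement added for clarity; it writes R_{link}(t) and R_{crossbar}(t) for the reliability of one link and one crossbar, and n for the number of nodes.

```latex
\frac{R_{CRR}(t)}{R_{SR}(t)}
  = \frac{\{1 - [1 - R_{link}^{n}(t)]^{2}\}\, R_{crossbar}^{n}(t)}
         {R_{link}^{n}(t)\, R_{crossbar}^{n}(t)}
  = \frac{2R_{link}^{n}(t) - R_{link}^{2n}(t)}{R_{link}^{n}(t)}
  = 2 - R_{link}^{n}(t)
```

The crossbar terms cancel, so the ratio depends only on the links. Since R_{link}^{n}(t) = e^{-n\lambda_{link} t} lies between 0 and 1, the ratio lies between 1 and 2, and it grows with either n or t.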


Fig. 8. Reliability curves of the single ring and counter-rotating ring systems

5.2 2D Systems
An increase in the number of dimensions might be expected to provide an increase in reliability. Hence a move
from the 1D counter-rotating ring to a 2D unidirectional torus should provide an increase in reliability. In both
topologies, each node shares two ringlets, supplying two input/output ports per node. Since the reliabilities of the
1D systems are dependent upon the number of nodes per ringlet, a reduced number of nodes per ringlet and an
increase in the number of ringlets comprising the system might be expected to provide a higher overall system
reliability. However, this expectation was found not to be the case.
From both the analytical and UltraSAN models, it is found that the reliabilities of the counter-rotating ring and
unidirectional tori systems were identical. A closer examination of the topologies shows that after a single link
failure, an n/(2n-1) probability exists that a second failure will result in a node disconnection. This probability holds
true for both the counter-rotating ring and tori topologies of n nodes. The benefit of using a torus over a counter-
rotating ring is the ability of the torus to degrade, permitting communication between some nodes. The counter-
rotating ring system is incapable of such graceful degradation. For example, in the case of a 9-node counter-rotating
ring system, two ring failures cause the entire system to fail. For a 9-node torus, the failure of a single ring creates
three critical rings. A failure of any one of the three critical rings will create a disjoint system in which 8 of the
nodes still have the ability to communicate. By adding a counter-rotating ring alongside each ring in the
unidirectional torus, an added degree of fault tolerance can be achieved. The reliabilities of both the unidirectional
and bi-directional tori are shown in Fig. 9.
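
The n/(2n-1) figure can be checked by counting links. The sketch below is an illustration of that counting argument only; it assumes that every surviving link is equally likely to be the next one to fail (all links share one failure rate), and the function names are hypothetical.

```python
from fractions import Fraction

def crr_second_failure_fatal(n):
    """Counter-rotating ring of n nodes: 2n links, n per ring."""
    remaining_links = 2 * n - 1   # one link has already failed
    critical_links = n            # every link of the surviving ring
    return Fraction(critical_links, remaining_links)

def torus_second_failure_fatal(k):
    """k x k unidirectional torus: 2k rings of k links each."""
    remaining_links = 2 * k * k - 1
    critical_links = k * k        # k perpendicular rings, k links each
    return Fraction(critical_links, remaining_links)
```

For the 9-node case, both functions count 9 critical links among the 17 that remain, which is n/(2n-1) with n = 9.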



Fig. 9. Reliability curves of the unidirectional and bi-directional tori systems












A comparison of the unidirectional tori (UT) and bi-directional tori (BT) curves shows a different trend from
that seen in the 1D systems. For the 1D systems, the reliability curves showed an expected trend wherein the
counter-rotating rings demonstrated a consistently higher reliability than the single ring systems. For the 2D
systems, the trend for small systems is reversed. For example, in Fig. 9b, the reliability of the 9-node unidirectional
torus is higher than that of the comparable bi-directional torus. As the number of nodes is increased to 16, an
intersection point is seen at a mission time of 3000 hours. For mission times less than 3000 hours, the reliability of
the 16-node unidirectional torus surpasses the reliability of the bi-directional torus. Beyond 3000 hours, the 16-node
bi-directional torus provides a higher reliability than its unidirectional counterpart. This trend also occurs for the 9-
node and 25-node systems at mission times of 13000 and 1500 hours, respectively.


5.3 Topology Comparisons
In order to compare the reliabilities of the k-ary n-cubes, the increase in reliability of one topology can be
calculated relative to another. For example, the reliability increase of the counter-rotating ring relative to the single
ring topology is expressed as

reliability increase = \frac{R_{CRR}(t) - R_{SR}(t)}{R_{SR}(t)}    (5)

where R_{CRR}(t) is the reliability of the counter-rotating ring system at time t and R_{SR}(t) is the reliability of the
single ring system at the same time t. Fig. 10 shows the increase in reliability of the CRR systems relative to
comparable SR systems for CRR reliabilities ranging from 0.7 to 0.99. As the reliability demands on the CRR
system increase, the improvement in reliability of the CRR over the SR systems decreases. Fig. 11 shows the
reliability increase for BT over UT, for BT reliabilities ranging from 0.7 to 0.99, superimposed over the CRR-to-SR
comparison of Fig. 10.


Fig. 10. Increase in reliabilities of the CRR systems relative to comparable SR systems

Fig. 11. Increase in reliabilities of the BT systems relative to comparable UT systems










Unlike the CRR-to-SR reliability comparison, the reliability increase of the bi-directional tori systems is a function
of the system size. For BT systems with a reliability of 0.95, the reliability increase relative to comparable UT
systems falls within the range -1.86% to 3.50% for the 9-, 16-, 25-, and 36-node systems. This range narrows to
-0.45% to 0.04% for a higher BT reliability of 0.99. In contrast, for the CRR-to-SR comparison, the increases in
reliability were 11.88% and 3.11% at CRR reliabilities of 0.95 and 0.99, respectively. Hence, the shift from
unidirectional tori to bi-directional tori does not provide a significant reliability increase for practical
reliabilities of 0.95 and higher. At these higher reliabilities, the CRR systems show a more significant increase in reliability
relative to the single ring topologies. Also, for the 9-node and 16-node bi-directional tori with reliabilities greater
than 0.8 and 0.9 respectively, the reliability increase relative to comparable unidirectional tori is negative. These
two examples demonstrate that redundancy can in fact reduce the overall system reliability. This reduction in
reliability is the result of an increase in the number of components that could fail.
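
The CRR-to-SR increase at a given target reliability can be estimated from the analytical models alone. The sketch below is an illustration under stated assumptions, not the procedure used to produce the figures: it assumes exponential failures, a link failure rate of 3.5e-6 failures per hour (our reading of the source), a crossbar failure rate of 1e-6 failures per hour, and a simple bisection search for the mission time at which the CRR system reaches the target. All function names are hypothetical, and the output is not guaranteed to match the quoted percentages exactly.

```python
import math

LAMBDA_LINK = 3.5e-6      # failures/hour; assumed reading of the link rate
LAMBDA_CROSSBAR = 1.0e-6  # failures/hour

def r_sr(n, t):
    """Single ring: n links and n crossbars in series."""
    return math.exp(-n * (LAMBDA_LINK + LAMBDA_CROSSBAR) * t)

def r_crr(n, t):
    """Counter-rotating ring: two rings in parallel, n crossbars in series."""
    r_ring = math.exp(-n * LAMBDA_LINK * t)
    return (1.0 - (1.0 - r_ring) ** 2) * math.exp(-n * LAMBDA_CROSSBAR * t)

def mission_time_for(target, n, hi=1.0e6):
    """Bisect for the time at which the CRR reliability falls to target."""
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if r_crr(n, mid) > target:
            lo = mid   # still above target, so the crossing is later
        else:
            hi = mid
    return 0.5 * (lo + hi)

def crr_increase_at(target, n):
    """Relative increase (R_CRR - R_SR) / R_SR where R_CRR equals target."""
    t = mission_time_for(target, n)
    return (r_crr(n, t) - r_sr(n, t)) / r_sr(n, t)
```

Both models depend on n and t only through the product n·t, so the increase at a fixed target reliability comes out the same for every ring size, which is consistent with the observation that the CRR-to-SR comparison is not a function of system size.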


6. Conclusions
In this paper, the reliabilities of 1D and 2D k-ary n-cubes were evaluated using UltraSAN, a tool based on
Stochastic Activity Networks. The accuracy of the models was verified using both combinatorial and Markov
techniques. The feasibility of modeling SCI ring-based networks using ring elimination rather than link elimination
is also demonstrated. This contribution shows the feasibility of modeling relatively large networks using this
technique when the basic components are replicable. Furthermore, this research develops a framework regarding the
inherent reliability of SCI-based networks, laying a portion of the groundwork for the use of SCI-based networks for
mission-critical applications.
A comparison of the single and dual ring systems showed that the reliabilities of the 9-node, 16-node, and 25-
node single-ring systems exceed the reliabilities of the larger 16-node, 25-node, and 36-node dual-ring systems for
mission times exceeding 14000, 13000, and 12000 hours respectively. These examples represent several such trends
that can be found for single- and dual-ring systems of various sizes where a tradeoff exists between the number of
nodes required by an application versus the reliability desired for the application. Similarly, it was shown that the 9-
node and 16-node unidirectional tori provided a higher reliability than the 16-node and 25-node bi-directional tori
for mission times less than 13000 hours and 4000 hours respectively. This insight is an important one for those
considering the organization of SCI-based networks for mission-critical applications, for which the penalty of a
network failure may be very expensive.
Comparing the reliability results of the dual-ring systems and unidirectional tori, it was shown that
reliabilities of both topologies were identical for equal-sized networks. For target reliabilities of 0.8, 0.85, 0.9, and
0.95, the percentage improvements in reliability of the dual-ring systems over the single-ring systems were 32.49%,
27.10%, 20.11%, and 11.88%, respectively. The same comparison conducted using 2D tori yielded percentage
improvements in reliability of the bi-directional tori over the unidirectional tori of 36.61%, 21.74%, 10.38%, and
2.82%, for target reliabilities of 0.8, 0.85, 0.9, and 0.95, respectively. These results are particularly significant, as
they demonstrate the improvements in reliability that can be achieved by doubling the number of links in the
network without a radical reorganization of network topology. Of course, it is noteworthy that moving to a dual-
ring or bi-directional implementation of the same topology also generally results in an improvement in the effective
bisection bandwidth and latency of these networks.
The unidirectional tori and dual-ring systems provide the same percentage improvement in reliability over the
single-ring systems. The bi-directional torus provides a higher percentage improvement in reliability over the
single-ring systems for target reliability values less than 0.9. The percentage improvements in reliability of the bi-
directional tori over the single-ring systems were 105.14%, 70%, 24.71%, and 9.98% for target reliabilities of 0.8,
0.85, 0.9, and 0.95, respectively. These results demonstrate that the bi-directional torus is not as effective as the
dual-ring systems and the unidirectional tori in providing high levels of reliability when the target reliabilities
exceed 0.9. The lower improvement in reliability is the result of the increased number of components in the
bi-directional torus topology. For target reliability values of 0.9 or less, the bi-directional tori demonstrate a larger
percentage improvement than the dual-ring systems and unidirectional tori. This observation points to the suitability
of such a network organization for applications with very long mission times and no provision for repair.
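
The percentage figures above compare two reliability curves at a fixed target reliability. The following Python sketch illustrates one possible way such a comparison can be computed, assuming the improvement is measured as the relative increase in the mission time at which each curve falls to the target; that reading is an assumption, not a definition taken from this paper. The names mission_time, percent_improvement, series_curve, duplicated_curve, and RATE are hypothetical, and the two curves are textbook placeholders (a single exponential unit and a simple duplicated unit), not the SCI models evaluated here, so the printed numbers do not reproduce the results reported above.

```python
import math


def mission_time(reliability, target, t_hi=1.0, tol=1e-9):
    """Return the time at which a decreasing reliability(t) falls to target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    # Grow the bracket until the curve has dropped to or below the target.
    while reliability(t_hi) > target:
        t_hi *= 2.0
    t_lo = 0.0
    # Bisection: reliability(t_lo) >= target >= reliability(t_hi) throughout.
    while t_hi - t_lo > tol * max(1.0, t_hi):
        mid = 0.5 * (t_lo + t_hi)
        if reliability(mid) > target:
            t_lo = mid
        else:
            t_hi = mid
    return 0.5 * (t_lo + t_hi)


def percent_improvement(baseline, improved, target):
    """Relative gain in mission time of `improved` over `baseline`, in percent."""
    t_base = mission_time(baseline, target)
    t_impr = mission_time(improved, target)
    return 100.0 * (t_impr - t_base) / t_base


RATE = 1e-5  # hypothetical failure rate (failures per hour), for illustration only


def series_curve(t):
    # Placeholder baseline: a single unit with an exponential lifetime.
    return math.exp(-RATE * t)


def duplicated_curve(t):
    # Placeholder alternative: two such units in parallel (either one suffices).
    return 1.0 - (1.0 - math.exp(-RATE * t)) ** 2


if __name__ == "__main__":
    for target in (0.8, 0.85, 0.9, 0.95):
        gain = percent_improvement(series_curve, duplicated_curve, target)
        print(f"target {target:.2f}: {gain:.1f}% longer mission time")
```

In practice the two callables would be replaced by reliability curves obtained from the actual network models; the bisection only requires that each curve start at 1 and decrease toward 0 over time.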
There exist several possible directions for future research. One such direction is the reliability analysis of larger
networks of the same topology using the technique of ring elimination. Another possible direction is the reliability
analysis of SCI-based network topologies in other popular configurations, such as meshes. Yet another possible
area for future research is the application of the techniques described in this paper to other emerging high-
performance networks.











References
1. IEEE, 1596-1992 IEEE Standard for Scalable Coherent Interface (SCI), Piscataway, NJ: IEEE Service Center, 1993.
2. B. W. Johnson, Design and Analysis of Fault-Tolerant Digital Systems, Addison-Wesley, 1989.
3. C. Lindemann, "DSPNexpress: A Software Package for the Efficient Solution of Deterministic and Stochastic Petri Nets,"
Proceedings of the Sixth International Conference on Modeling Techniques and Tools for Computer Systems Performance
Evaluation, pp. 15-29, Edinburgh, Great Britain, 1992.
4. G. Chiola, "GreatSPN 1.5 Software Architecture," Proc. 5th Int. Conf. on Modeling Techniques and Tools for Computer
Performance Evaluation, Torino, Italy, Feb. 1991.
5. S. J. Bavuso, J. B. Dugan, K. S. Trivedi, E. M. Rothman, W. E. Smith, "Analysis of Typical Fault-Tolerant Architectures
using HARP," IEEE Transactions on Reliability, vol. R-36, no. 2, pp. 176-185, June 1987.
6. G. Ciardo, J. Muppala, and K. S. Trivedi, "SPNP: Stochastic Petri Net Package," Proceedings of the Fourth International
Workshop on Petri Nets and Performance Models, pp. 142-151, Kyoto, Japan, December 1989.
7. C. Beounes et al., "SURF-2: A Program for Dependability Evaluation of Complex Hardware and Software Systems,"
Proceedings 23rd Int. Symp. on Fault-Tolerant Computing (FTCS-23), IEEE, Toulouse, France, June 1993.
8. W. H. Sanders, W. D. Obal, M. A. Qureshi, F. K. Widjanarko, "The UltraSAN Modeling Environment," Performance
Evaluation, vol. 24, no. 1-2, pp. 89-115, November 1995.
9. C. Colbourn, J. Devitt, S. Harms, D. Daryl, "Assessing Reliability of Multistage Interconnection Networks," IEEE
Transactions on Computers, vol. 42, no. 10, pp. 1207-1221, October 1993.
10. A. Varma, C. Raghavendra, "Reliability Analysis of Redundant-Path Interconnection Networks," IEEE Transactions on
Reliability, vol. R-38, no. 1, pp. 130-137, April 1989.
11. X. Cheng, O. Ibe, "Reliability of a Class of Multistage Interconnection Networks," IEEE Transactions on Parallel and
Distributed Systems, vol. 3, pp. 241-246, March 1992.
12. B. Menezes, U. Bakhru, "New Bounds on the Reliability of Augmented Shuffle-Exchange Networks," IEEE Transactions on
Computers, vol. 44, no. 1, pp. 123-129, January 1995.
13. J. Blake, K. Trivedi, "Reliability Analysis of Interconnection Networks Using Hierarchical Composition," IEEE
Transactions on Reliability, vol. 38, no. 1, pp. 111-120, April 1989.
14. J. Blake, K. Trivedi, "Multistage Interconnection Network Reliability," IEEE Transactions on Computers, vol. 38, no. 11,
pp. 1600-1604, November 1989.
15. M. Balakrishnan, A. Reibman, "Reliability Models for Fault-Tolerant Private Network Applications," IEEE Transactions on
Computers, vol. 43, no. 9, pp. 1039-1053, September 1994.
16. E. Smith, K. Trivedi, "Dependability Evaluation of a Class of Multi-Loop Topologies for Local Area Networks," IBM
Journal of Research and Development, vol. 33, no. 5, pp. 511-523, September 1989.
17. C. Raghavendra, J. Silvester, "A Survey of Multi-Connected Loop Topologies for Local Computer Networks," Computer
Networks and ISDN Systems, vol. 11, no. 1, pp. 29-42, January 1986.
18. T. Chung, N. Sharma, D. Agrawal, "Cost-Performance Trade-offs in Manhattan Street Network versus 2-D Torus," IEEE
Transactions on Computers, vol. 43, no. 2, pp. 240-243, February 1994.
19. Z. Chen, T. Berger, "Reliability and Availability Analysis of Manhattan Street Networks," IEEE Transactions on
Communications, vol. 42, no. 2/3/4, pp. 511-522, February/March/April 1994.
20. N. Lopez-Benitez, J. Fortes, "Detailed Modeling and Reliability Analysis of Fault-Tolerant Processor Arrays," IEEE
Transactions on Computers, vol. 41, no. 9, pp. 1193-1200, September 1992.
21. M. Sarwar and A. D. George, "Simulative Performance Analysis of Switching Fabrics for Scalable SCI Networks,"
Microprocessors and Microsystems, vol. 24, no. 1, pp. 1-11, March 2000.
22. K. Kibria, Interconnect Systems Solution, hii'. *,, i .-us.com/LincCore.htm
23. MIL-HDBK-217F: Handbook for Reliability Prediction of Electronic Equipment, Defense Printing Service, Philadelphia,
PA.
24. W. Yost, "Cost Effective Fault Tolerance for Network Routing," Master of Science Thesis, University of Washington, 1995.



