© 2000, HCS Research Lab. All Rights Reserved.
Reliability Modeling of SCI Ring-Based Topologies
M. Sarwar¹, A. George², and D. Collins¹
High-performance Computing and Simulation (HCS) Research Laboratory
¹Department of Electrical and Computer Engineering, FAMU-FSU College of Engineering
²Department of Electrical and Computer Engineering, University of Florida
Abstract
Performance evaluation and reliability prediction are two important factors in the study of multiprocessor and
cluster interconnects. One such interconnect is the Scalable Coherent Interface (SCI). SCI is a point-to-point,
ring-based interconnect that can be configured in various switched-ring topologies such as counter-rotating rings and
tori. While performance analyses of SCI-based interconnects have been discussed in the literature, reliability
evaluation has not received much attention. In addition, the reliability of SCI interconnects configured in many of
today's popular topologies cannot be deduced from earlier work on network reliability as link failures within an SCI
interconnect are not independent of one another. A single link failure within the topology results in the failure of the
entire ringlet to which the link belongs. This paper presents the results of a reliability study on 1D and 2D k-ary
n-cube switching fabrics for the Scalable Coherent Interface based on ring elimination rather than link elimination.
The study is conducted using reliability models created in UltraSAN, a tool based on Stochastic Activity Networks.
The models are verified using both combinatorial and Markov modeling. The results demonstrate the inherent
reliability characteristics of a single-ring system can be significantly enhanced by the addition of a second redundant
ring. By contrast, the results show that the reliability of a torus does not increase significantly with the addition of
redundant rings. Hence, the cost of adding redundant rings to certain topologies may or may not be justified,
depending upon the degree of reliability sought.
1. Introduction
The inevitable shift towards parallel computing systems has accentuated the need for more reliable networks.
As the number of components in the network increases, so too does the failure rate of the system. Classic
fault-tolerant systems use redundant interconnects to provide a measure of fault tolerance. However, other systems often
provide fault tolerance by using the inherent redundancy within an interconnect rather than using completely
redundant interconnect components. Hypercubes, meshes, tori and other networks that are topologically isomorphic
to the family of k-ary n-cube networks are some examples of such topologies, where n is the dimension of the cube,
k is the radix (e.g. k = 2 for a hypercube), and the number of nodes in the network is k^n. These topologies provide
fault tolerance in the form of multiple paths from any source node to any destination node. In this paper we target
the inherent redundancy in such networks in order to determine the reliability of SCI [1] topologies. The k-ary
n-cube family was chosen since SCI is a ring-based network, and hence k-ary n-cubes provide an ideal framework for
scalable SCI systems.
Reliability analysis of redundant fault-tolerant systems is typically performed using one of two methods,
combinatorial modeling and Markov modeling [2]. Combinatorial modeling is applicable to cases where the system
can be broken down into series and parallel combinations of components. For more complex systems that cannot be
represented using distinct series-parallel combinations, Markov modeling is employed. A higher-level modeling
approach based on Petri nets can also be used, providing a graphical view of the operation of a system and the
interaction between failures. Petri nets are used to determine the reliability of systems by generating the Markov
state space for the model and solving the associated Chapman-Kolmogorov equations. Some of the more popular
Petri net packages include DSPNexpress [3], GreatSPN [4], HARP [5], SPNP [6], and SURF-2 [7]. For this paper,
the reliability of SCI-based 1D and 2D k-ary n-cubes is studied using UltraSAN [8], a tool based on Stochastic
Activity Networks developed by Sanders et al. at the University of Illinois at Urbana-Champaign. UltraSAN was
chosen since it permits easy modeling of replicable systems and provides a broad range of analytic solvers.
The remainder of this paper is organized as follows. Section 2 presents related research in the areas of
fault-tolerant network reliability modeling, with an emphasis on ring-based architectures and the use of Petri net models.
Section 3 describes the characteristics of the SCI model implemented in UltraSAN with a description of the
assumptions made in creating the models. Analytical verification of the UltraSAN models of the studied k-ary
n-cube systems is detailed in Section 4, using both combinatorial and Markov models. Section 5 examines the
reliability results obtained from the UltraSAN model. The reliability of the unidirectional and bidirectional tori and
the single and counter-rotating ring configurations are analyzed and compared, allowing informed decisions to be
made regarding network organization and level of redundancy for SCI-based applications with reliability
requirements. Finally, the conclusions and possible directions for future research are presented in Section 6.
2. Related research
Fault tolerance of interconnect topologies can be measured in terms of the terminal reliability or network
reliability. Terminal reliability is the probability that there exists at least one path from a given node to a destination
node. It is most commonly used to assess the reliability of multistage interconnection networks (MINs) as is done
by Colbourn et al. [9] and Varma and Raghavendra [10]. This paper concentrates on network reliability, or the
probability that there exists at least one path from every node to all other nodes.
Network reliability can be assessed using either combinatorial or Markov modeling. Combinatorial modeling
of networks requires decomposing a network into subnets and determining the reliability of the entire system
as a combination of the subnets. Cheng and Ibe [11] and Menezes and Bakhru [12] use such a method to evaluate
the reliability of shuffle-exchange networks. Markov modeling can also be used as an alternative to combinatorial
modeling or in conjunction with combinatorial modeling. Blake and Trivedi [13] use continuous-time Markov
chains to determine the reliability of shuffle-exchange networks. In [14], Blake and Trivedi use Markov modeling
in conjunction with combinatorial modeling by dividing the studied MINs into a two-level model, obtaining the
reliability of each subsystem using Markov modeling and the system reliability using a series system comprised of
the Markov components. In [15], Balakrishnan and Reibman use Markov modeling to determine the reliability of
private networks where the minimal operational path is dictated by the application. The Balakrishnan-Reibman
models present an example where combinatorial analysis is no longer feasible since the reliability models are
dependent upon the communication paths. Since the communication paths can take any form, they cannot be
accurately represented as series-parallel combinations.
This paper concentrates on the reliability of SCI networks configured as k-ary n-cubes. The reliability of
ring-based architectures has been studied in Smith and Trivedi [16]. The topology targeted in that paper is the
forward-loop backward-hop (FLBH) network. The FLBH class of 1D ring topologies, which includes daisy chain loop,
forward-loop parity-hop networks, and chordal rings, has also been studied by Raghavendra and Silvester [17].
Work has also been conducted on 2D architectures such as the Manhattan street network (MSN) and torus. In
[18], Chung et al. present the terminal reliability analysis of an MSN and a 2D torus. Chen and Berger [19] present
a reliability analysis of Manhattan street networks showing the complexity of the Markov model for the MSN.
Complexity arises from the interdependence between link failures as shown in their paper. In the model presented in
this paper, the requirements of the SCI protocol help to simplify the link interdependence as entire rings containing
link failures are eliminated. The ring elimination cuts down dramatically on the state space of the corresponding
Markov model. In addition, by using the ring as the basic building block of the k-ary n-cube models, Petri nets can
be used to determine the reliabilities of the systems.
Network reliability analysis using Petri nets has not been carried out extensively. Most reliability models based
on Petri nets deal with small redundant systems with a fixed number of components. The reason for this limited
usage of Petri nets is that only systems that employ replicable building blocks can be easily modeled. Lopez-Benitez and
Fortes [20] demonstrate the use of a Petri net model for determining the reliability of fault-tolerant processor arrays.
In their paper, the processor array is considered to be a replicable system with a single row of processors as the
building block of the model. Since SCI uses a fabric of switched rings, and each ring is eliminated if it contains a
single faulty link, the k-ary n-cubes being studied can be easily created using replicable components, making
ring-based systems with distributed switching ideal for modeling with Petri nets. The primary benefit of modeling a
system with Petri nets is the ease with which the model can be created and replicated to create larger models. In
addition, it provides an easy-to-understand, visual representation of the system and the interaction between faults.
3. SCI UltraSAN model
The simplest configuration of an SCI interconnect is a ring traversing all nodes. The ring is based on the
architecture of an SCI interface illustrated in Fig. 1. Incoming packets to the interface pass through an address
decoder. If the packet is destined for the local node, the decoder places it into the request or response input queue.
If the packet is destined for another downstream node, it is forwarded to the bypass FIFO. To output a packet, the
SCI node must have sufficient free space in its bypass FIFO to hold all incoming symbols. When there are no
packets waiting in the output queue or there is insufficient free space in the bypass FIFO for the output queue data to
be sent, data from the bypass FIFO is transmitted on the node's output link. If the bypass queue is empty, then idle
symbols are transmitted. Idle symbols also carry flow-control information and at least one must precede any send or
echo packet. This flow-control information is used to inhibit upstream nodes from sending data when the bypass
FIFO must be emptied to allow the output queue to be emptied.
Larger SCI networks are based on multiple ringlets connected together to create more complex topologies
through the use of agents. An agent is essentially an SCI-to-SCI bridge used to interconnect two or more rings. The
topologies studied in this paper include the unidirectional and bidirectional forms of the topologies illustrated in
Fig. 2. In [21] we presented performance analysis of these SCI-based topologies using the node model shown in
Fig. 3. Each switch in a distributed switching fabric provides the ability to interface up to four SCI ringlets and also
acts as an interface to the processing unit at that node. The UltraSAN models presented in this paper are based on
this switch model.
Fig. 1. The SCI interface
Fig. 2. (a) Dual-ring k-ary 1-cube, (b) bidirectional k-ary 2-cube
Fig. 3. 4-port switch model
In the UltraSAN model, it is assumed that each node consists of the processing node and a crossbar. The SCI
interfaces are used to connect the node into ring-based topologies. The number of SCI interfaces can be increased
from one unit to four units, permitting the node to access up to four rings and hence providing it with four input and four
output ports.
Due to the inherent characteristics of register-insertion rings and their key role in SCI, all links in an SCI ring
must be operational in order for that ring to operate correctly. A single link failure eliminates the entire ring from a
multi-ring system, requiring the system to reconfigure and use the remaining links to continue normal operation.
The UltraSAN model uses this fact as the basis of determining a failed state for the network. If a node is
disconnected from the remainder of the network either due to ring failures or the failure of the crossbar at that node,
then the entire system is considered to be in a failed state. If a set of ring failures causes the network to be disjoint,
then that condition will also result in a failed state.
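The failure criterion above can be illustrated with a short Python sketch. It is not the UltraSAN model: it assumes a unidirectional k-ary 2-cube made of k row rings and k column rings, ignores crossbar failures, and uses hypothetical names (`torus_rings`, `network_failed`) introduced only for this illustration. A ring that contains a failed link is removed as a whole, and the network is declared failed when the surviving rings leave a node isolated or the node set disjoint.

```python
from itertools import product


def torus_rings(k):
    """Return the 2k rings of a unidirectional k-ary 2-cube as node sets.

    Nodes are (row, col) pairs. Ring ("row", r) visits every node in row r,
    and ring ("col", c) visits every node in column c.
    """
    rings = {}
    for i in range(k):
        rings[("row", i)] = {(i, c) for c in range(k)}
        rings[("col", i)] = {(r, i) for r in range(k)}
    return rings


def network_failed(k, failed_rings):
    """Apply ring elimination and report whether the network is disconnected.

    failed_rings is a set of ring identifiers; each such ring holds at least
    one failed link and is therefore removed as a whole.
    """
    rings = torus_rings(k)
    alive = [nodes for ring_id, nodes in rings.items()
             if ring_id not in failed_rings]
    all_nodes = set(product(range(k), repeat=2))

    # Grow the set of nodes reachable from node (0, 0) through surviving rings.
    reached = {(0, 0)}
    changed = True
    while changed:
        changed = False
        for nodes in alive:
            if nodes & reached and not nodes <= reached:
                reached |= nodes
                changed = True
    return reached != all_nodes


if __name__ == "__main__":
    first = ("row", 0)
    critical = [ring for ring in torus_rings(3)
                if ring != first and network_failed(3, {first, ring})]
    print("critical rings after losing", first, ":", critical)
```

Under these assumptions, losing one row ring of the 9-node torus leaves the three column rings critical and the two remaining row rings non-critical.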
To create the dual-ring topology, the single-ring model is simply duplicated. The torus model is created using
one or more single-ring models representing the rows in a torus with each row sharing the vertical rings. Each row
monitors ring failures within itself and the vertical rings connected to it. By doing so, any one of the duplicated row
subnets representing the rows can detect and signal a network failure.
4. Model Verification
The models are verified using combinatorial modeling for the single and counter-rotating ring systems, and
Markov modeling for the torus model. The single and counter-rotating ring systems can be represented as
series-parallel combinations of components. The tori systems cannot and hence require the use of Markov modeling to
determine their reliability.
4.1 1D topologies (k-ary 1-cubes)
In the single ring model, a link failure results in a system failure. In addition, it is assumed that a system failure
occurs if a single node is unable to communicate with the rest of the network. Such a case can occur due to a
crossbar failure resulting in the isolation of the attached node from the network. The reliability of single ring
systems of n nodes can therefore be verified analytically by assuming a series system of links and crossbars. The
system reliability is expressed as:
\[ R_{system}(t) = R_{link}^{n}(t)\, R_{crossbar}^{n}(t) \tag{1} \]
In Equation 1, based on the Exponential Failure Law (EFL), $R_{link}(t) = e^{-\lambda_{link} t}$ and $R_{crossbar}(t) = e^{-\lambda_{crossbar} t}$. The failure
rates of the links and the crossbars were estimated using the Handbook for Reliability Prediction of Electronic
Equipment (MIL-HDBK-217F) [23]. These failure rates are $\lambda_{link} = 3.5 \times 10^{-6}$ failures per hour and $\lambda_{crossbar} = 1 \times 10^{-6}$
failures per hour, respectively. In the series model expressed by Equation 1, all links and crossbars must operate
correctly for the system to remain in an operational state.
The analytical model for the counter-rotating ring systems is a combined series-parallel system in which at least
one of the two rings must be operational and all crossbars must be operational for the system to remain in an
operational state. The reliability of the counter-rotating ring systems is expressed as:
\[ R_{system}(t) = \left\{ 1 - \left[ 1 - R_{link}^{n}(t) \right]^{2} \right\} R_{crossbar}^{n}(t) \tag{2} \]
The expression $1 - [1 - R_{link}^{n}(t)]^{2}$ represents the reliability of having one of the two rings operational. This value is
then multiplied by the reliabilities of the crossbars, $R_{crossbar}^{n}(t)$, to account for the n crossbars that must remain
operational. Fig. 4 shows the UltraSAN and analytical model reliability results for the single ring (SR) and
counter-rotating ring (CRR) systems. In both cases the reliabilities obtained from the UltraSAN models are identical to the
values obtained analytically.
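Equations 1 and 2 can be evaluated directly. The following Python sketch is illustrative only; the function names are hypothetical, and the default failure rates (3.5e-6 and 1e-6 failures per hour for links and crossbars) are assumptions used for the example rather than values taken from the UltraSAN models.

```python
import math


def single_ring_reliability(n, t, lam_link=3.5e-6, lam_crossbar=1.0e-6):
    """Equation 1: series system of n links and n crossbars."""
    r_link = math.exp(-lam_link * t)
    r_crossbar = math.exp(-lam_crossbar * t)
    return r_link ** n * r_crossbar ** n


def counter_rotating_reliability(n, t, lam_link=3.5e-6, lam_crossbar=1.0e-6):
    """Equation 2: two parallel n-link rings in series with n crossbars."""
    r_ring = math.exp(-lam_link * t) ** n
    r_crossbar = math.exp(-lam_crossbar * t)
    return (1.0 - (1.0 - r_ring) ** 2) * r_crossbar ** n


if __name__ == "__main__":
    mission_time = 10000.0  # hours, chosen only for illustration
    for n in (9, 16, 25, 36):
        sr = single_ring_reliability(n, mission_time)
        crr = counter_rotating_reliability(n, mission_time)
        print(f"n={n:2d}  SR={sr:.4f}  CRR={crr:.4f}")
```

Passing the failure rates as parameters keeps the sketch usable with other component estimates.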
4.2 2D topologies (k-ary 2-cubes)
Due to the inter-ring dependencies, the 2D topologies cannot be represented using a distinct series-parallel
model, hence Markov modeling must be employed. In addition, the state spaces of the Markov models for the 2D
systems increase rapidly with each increment in k. For this reason, the smaller Markov model of a 9-node torus is
used to verify the reliability results obtained from UltraSAN. Fig. 5 depicts the Markov model for a 9-node
unidirectional torus. Fig. 6 shows the determination of the critical rings after a given ring failure. Failure of any
critical ring will then result in a network failure. The larger tori models use a simple extension of the 9-node torus
model, and thus the verification below adds credence to their accuracy as well. Equation 3 gives the reliability of a
9-node unidirectional torus as a function of mission time.
\[ R_{9node}(t) = e^{-6\lambda_{ring} t} - 6e^{-5\lambda_{ring} t} + 6e^{-4\lambda_{ring} t} \tag{3} \]
[Figure: two panels plotting reliability against mission time (hours) for the 9-, 16-, 25-, and 36-node SR and CRR systems, each with analytical and UltraSAN curves.]
Fig. 4. UltraSAN and analytical reliabilities for single ring and counter-rotating ring systems
In Equation 3, $\lambda_{ring}$ is the failure rate of each ringlet. For the 9-node torus, setting $\lambda_{ring} = 3\lambda_{link}$ accounts for the
three links making up each ringlet. To determine the system reliability, Equation 3 is multiplied by $R_{crossbar}^{9}(t)$ to
account for the 9 crossbars within the system. For a detailed description of the approach used to derive the
reliability expression above, the reader is directed to [2]. Fig. 7 illustrates the accuracy of the UltraSAN model,
where the analytical and UltraSAN results for the 9-node torus are nearly identical.
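As a cross-check on Equation 3, the Markov chain can be integrated numerically and compared with the closed form. The Python sketch below is a minimal illustration under stated assumptions: three operational states (no ring failed, one ring failed, two parallel rings failed) with transition rates 6λ, 2λ, 3λ, and 4λ as read here from Fig. 5, a simple fixed-step fourth-order Runge-Kutta integrator in place of any solver used by the authors, and hypothetical function names and default failure rates.

```python
import math


def torus9_closed_form(lam_ring, t):
    """Equation 3: ring-level reliability of the 9-node unidirectional torus."""
    return (math.exp(-6 * lam_ring * t)
            - 6 * math.exp(-5 * lam_ring * t)
            + 6 * math.exp(-4 * lam_ring * t))


def torus9_markov(lam_ring, t, steps=2000):
    """Integrate the Chapman-Kolmogorov equations with fixed-step RK4.

    p[0]: no ring failed, p[1]: one ring failed, p[2]: two parallel rings
    failed. The failed state is absorbing and is not tracked explicitly.
    """
    def deriv(p):
        return (
            -6 * lam_ring * p[0],
            6 * lam_ring * p[0] - 5 * lam_ring * p[1],
            2 * lam_ring * p[1] - 4 * lam_ring * p[2],
        )

    p = (1.0, 0.0, 0.0)
    h = t / steps
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv(tuple(x + 0.5 * h * d for x, d in zip(p, k1)))
        k3 = deriv(tuple(x + 0.5 * h * d for x, d in zip(p, k2)))
        k4 = deriv(tuple(x + h * d for x, d in zip(p, k3)))
        p = tuple(x + h / 6.0 * (a + 2 * b + 2 * c + d)
                  for x, a, b, c, d in zip(p, k1, k2, k3, k4))
    return sum(p)  # probability of being in any operational state


def torus9_system(t, lam_link=3.5e-6, lam_crossbar=1.0e-6):
    """Equation 3 with lam_ring = 3 * lam_link, times the nine crossbars."""
    return torus9_closed_form(3 * lam_link, t) * math.exp(-9 * lam_crossbar * t)


if __name__ == "__main__":
    lam_ring = 3 * 3.5e-6
    for t in (5000.0, 10000.0, 20000.0):
        print(t, torus9_closed_form(lam_ring, t), torus9_markov(lam_ring, t))
```

Agreement between the two functions indicates that Equation 3 is the solution of the three-state chain assumed above.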
[State diagram: operational states 0, 1, and 2 and failed state F. Transitions: 0 to 1 with probability 6λΔt; 1 to 2 with 2λΔt; 1 to F with 3λΔt; 2 to F with 4λΔt. Self-loops: 1-6λΔt, 1-5λΔt, and 1-4λΔt on states 0, 1, and 2, and 1.0 on F.]
Fig. 5. Markov model for a 9-node unidirectional torus
Fig. 6. 9-node torus with critical and non-critical rings after one ring failure
[Figure: reliability (0.60 to 1.00) against mission time (hours), with analytical and UltraSAN curves for the 9-node torus.]
Fig. 7. UltraSAN and analytical reliability results for the 9-node torus
5. Reliability Results
In this section, the reliability results obtained for the k-ary n-cubes listed in Table 1 are presented. Afterwards,
in the next section, three case studies are presented to demonstrate the application of the reliability results to SCI
systems requiring varying levels of fault tolerance.
Table 1. 1D and 2D model descriptions
| 1D Systems: Single Ring | 1D Systems: Counter-Rotating Ring | 2D Systems: Unidirectional Torus | 2D Systems: Bidirectional Torus |
|---|---|---|---|
| 9-ary 1-cube (9-node) | 9-ary 1-cube (9-node) | 3-ary 2-cube (9-node) | 3-ary 2-cube (9-node) |
| 16-ary 1-cube (16-node) | 16-ary 1-cube (16-node) | 4-ary 2-cube (16-node) | 4-ary 2-cube (16-node) |
| 25-ary 1-cube (25-node) | 25-ary 1-cube (25-node) | 5-ary 2-cube (25-node) | 5-ary 2-cube (25-node) |
| 36-ary 1-cube (36-node) | 36-ary 1-cube (36-node) | 6-ary 2-cube (36-node) | 6-ary 2-cube (36-node) |
The reliabilities of the systems presented in this section are based on estimated component reliabilities. Even
though the individual component reliabilities may not be precise for any particular implementation, maintaining the
values constant throughout the evaluation process permits a relatively fair comparison of the systems.
5.1 1D Systems
The 1D SCI systems consist of single and counter-rotating ring topologies where each ring traverses all nodes
within the system. The reliabilities determined from the UltraSAN models for the 1D single ring (SR) and
counter-rotating ring (CRR) systems are shown in Fig. 8. The trend shows a decrease in reliability with each increment in
ring size.
The ratio of the analytical reliabilities, from Equations 1 and 2, for the counter-rotating ring and single ring
systems is given by
\[ \frac{R_{CRR}(t)}{R_{SR}(t)} = 2 - R_{link}^{n}(t) \tag{4} \]
This equation shows that the reliability of each counter-rotating ring system ranges from 1 to 2 times that of the
comparable single ring system. $R_{link}^{n}(t)$ is a function of both time and the number of nodes in the system. It is
observed that as time tends to infinity, the reliability of each counter-rotating ring system approaches twice the
reliability of the comparable single ring system of equal size. However, at this point, the reliabilities are almost
insignificant. Examining the effect of the number of nodes n at a constant time, it is seen that as the number of
nodes increases, the ratio of the reliabilities of the counter-rotating ring systems to the single ring systems increases.
Hence, it can be concluded that the addition of a second ring to the single ring systems improves the reliability of the
larger systems more significantly than the smaller ones.
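For reference, the ratio follows in one step from the series model of the single ring and the series-parallel model of the counter-rotating ring. Writing x for the reliability of one n-link ring (a shorthand introduced here, not used elsewhere in the paper), the crossbar terms cancel:

```latex
\begin{align*}
\frac{R_{CRR}(t)}{R_{SR}(t)}
  &= \frac{\left[1 - (1 - x)^{2}\right] R_{crossbar}^{n}(t)}{x \, R_{crossbar}^{n}(t)}
   = \frac{2x - x^{2}}{x}
   = 2 - x,
  \qquad x = R_{link}^{n}(t) = e^{-n \lambda_{link} t}.
\end{align*}
```

Since 0 < x ≤ 1, the ratio equals 1 at t = 0 and approaches 2 as either t or n grows, which is the behavior described above.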
[Figure: two panels plotting reliability against mission time (hours) for the 9-, 16-, 25-, and 36-node SR and CRR systems.]
Fig. 8. Reliability curves of the single ring and counter-rotating ring systems
5.2 2D Systems
An increase in the number of dimensions might be expected to provide an increase in reliability. Hence a move
from the 1D counter-rotating ring to a 2D unidirectional torus should provide an increase in reliability. In both
topologies, each node shares two ringlets, supplying two input/output ports per node. Since the reliabilities of the
1D systems are dependent upon the number of nodes per ringlet, a reduced number of nodes per ringlet and an
increase in the number of ringlets comprising the system might be expected to provide a higher overall system
reliability. However, this expectation was found not to be the case.
From both the analytical and UltraSAN models, it is found that the reliabilities of the counter-rotating ring and
unidirectional tori systems were identical. A closer examination of the topologies shows that after a single link
failure, an n/(2n-1) probability exists that a second failure will result in a node disconnection. This probability holds
true for both the counter-rotating ring and tori topologies of n nodes. The benefit of using a torus over a
counter-rotating ring is the ability of the torus to degrade, permitting communication between some nodes. The
counter-rotating ring system is incapable of such graceful degradation. For example, in the case of a 9-node counter-rotating
ring system, two ring failures cause the entire system to fail. For a 9-node torus, the failure of a single ring creates
three critical rings. A failure of any one of the three critical rings will create a disjoint system in which 8 of the
nodes still have the ability to communicate. By adding a counter-rotating ring alongside each ring in the
unidirectional torus, an added degree of fault tolerance can be achieved. The reliabilities of both the unidirectional
and bidirectional tori are shown in Fig. 9.
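The n/(2n-1) figure can be reproduced by counting links. The Python sketch below shows one reading that yields this value, not necessarily the authors' derivation: after the first link failure, each of the remaining 2n-1 links is assumed equally likely to fail next, and the second failure is fatal when it falls in a ring whose loss, together with the first ring, disconnects a node. The function name and ring labels are hypothetical.

```python
from fractions import Fraction


def second_failure_fatal_fraction(ring_sizes, fatal_pair):
    """Fraction of second link failures that bring the network down.

    ring_sizes maps a ring identifier to its number of links. fatal_pair(a, b)
    returns True when losing rings a and b together disconnects the network.
    The first failed link is taken to lie in the first ring listed, and every
    surviving link is assumed equally likely to fail next.
    """
    ring_ids = list(ring_sizes)
    first = ring_ids[0]
    total_links = sum(ring_sizes.values())
    fatal_links = sum(size for ring, size in ring_sizes.items()
                      if ring != first and fatal_pair(first, ring))
    return Fraction(fatal_links, total_links - 1)


# 9-node counter-rotating ring: two rings of 9 links; losing both is fatal.
crr = {"cw": 9, "ccw": 9}
crr_fraction = second_failure_fatal_fraction(crr, lambda a, b: True)

# 9-node unidirectional torus: 3 row rings and 3 column rings of 3 links each.
# Losing one row ring and one column ring isolates the node where they cross.
torus = {(kind, i): 3 for kind in ("row", "col") for i in range(3)}
torus_fraction = second_failure_fatal_fraction(torus, lambda a, b: a[0] != b[0])

print(crr_fraction, torus_fraction)  # both equal 9/17, i.e. n/(2n-1) for n = 9
```

The two topologies differ in what happens when the second failure is not fatal: in the counter-rotating ring it can only land on the ring that is already lost, whereas in the torus it may also remove a further non-critical ring.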
[Figure: two panels, (a) and (b), plotting reliability against mission time (hours) for the 9-, 16-, 25-, and 36-node UT and BT systems.]
Fig. 9. Reliability curves of the unidirectional and bidirectional tori systems
A comparison of the unidirectional tori (UT) and bidirectional tori (BT) curves shows a different trend from
that seen in the 1D systems. For the 1D systems, the reliability curves showed an expected trend wherein the
counter-rotating rings demonstrated a consistently higher reliability than the single ring systems. For the 2D
systems, the trend for small systems is reversed. For example, in Fig. 9b, the reliability of the 9-node unidirectional
torus is higher than that of the comparable bidirectional torus. As the number of nodes is increased to 16, an
intersection point is seen at a mission time of 3000 hours. For mission times less than 3000 hours, the reliability of
the 16-node unidirectional torus surpasses the reliability of the bidirectional torus. Beyond 3000 hours, the 16-node
bidirectional torus provides a higher reliability than its unidirectional counterpart. This trend also occurs for the
9-node and 25-node systems at mission times of 13000 and 1500 hours, respectively.
5.3 Topology Comparisons
In order to compare the reliabilities of the k-ary n-cubes, the increase in reliability of one topology can be
calculated relative to another. For example, the reliability increase of the counter-rotating ring relative to the single
ring topology is expressed as
\[ \text{reliability increase} = \frac{R_{CRR}(t) - R_{SR}(t)}{R_{SR}(t)} \tag{5} \]
where $R_{CRR}(t)$ is the reliability of the counter-rotating ring system at time t and $R_{SR}(t)$ is the reliability of the
single ring system at the same time t. Fig. 10 shows the increase in reliability of the CRR systems relative to
comparable SR systems for CRR reliabilities ranging from 0.7 to 0.99. As the reliability demands on the CRR
system increase, the improvement in reliability of the CRR over the SR systems decreases. Fig. 11 shows the
reliability increase for BT over UT, for BT reliabilities ranging from 0.7 to 0.99, superimposed over the CRR-to-SR
comparison of Fig. 10.
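The relative increase of Equation 5 can be computed at a target CRR reliability by first solving for the mission time at which the CRR system reaches that reliability. The Python sketch below does this by bisection. It is illustrative only: the function names are hypothetical, the analytical models of Section 4.1 stand in for the UltraSAN results, and the failure rates (3.5e-6 and 1e-6 failures per hour) are assumptions.

```python
import math

LAM_LINK = 3.5e-6      # assumed link failure rate, failures per hour
LAM_CROSSBAR = 1.0e-6  # assumed crossbar failure rate, failures per hour


def r_sr(n, t):
    """Single ring reliability: n links and n crossbars in series."""
    return math.exp(-n * (LAM_LINK + LAM_CROSSBAR) * t)


def r_crr(n, t):
    """Counter-rotating ring reliability: two parallel rings, n crossbars."""
    r_ring = math.exp(-n * LAM_LINK * t)
    return (1.0 - (1.0 - r_ring) ** 2) * math.exp(-n * LAM_CROSSBAR * t)


def time_at_reliability(reliability_fn, n, target, t_hi=1.0e7):
    """Bisect for the mission time at which reliability_fn(n, t) equals target.

    Relies on reliability_fn decreasing monotonically from 1 at t = 0.
    """
    t_lo = 0.0
    for _ in range(200):
        t_mid = 0.5 * (t_lo + t_hi)
        if reliability_fn(n, t_mid) > target:
            t_lo = t_mid
        else:
            t_hi = t_mid
    return 0.5 * (t_lo + t_hi)


def crr_increase_over_sr(n, target_crr):
    """Equation 5 at the time when the CRR reliability equals target_crr."""
    t = time_at_reliability(r_crr, n, target_crr)
    return (r_crr(n, t) - r_sr(n, t)) / r_sr(n, t)


if __name__ == "__main__":
    for target in (0.80, 0.85, 0.90, 0.95, 0.99):
        print(target, f"{100 * crr_increase_over_sr(9, target):.2f}%")
```

With these assumed rates the sketch gives an increase of about 12% at a CRR reliability of 0.95 and about 3% at 0.99, and the result does not depend on n, because both reliabilities depend on n and t only through the product n·t.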
[Figure: reliability increase (%) of CRR over SR plotted against CRR reliability from 0.70 to 0.99.]
Fig. 10. Increase in reliabilities of the CRR systems relative to comparable SR systems
[Figure: reliability increase (%) of BT over UT plotted against BT reliability from 0.70 to 0.99, with curves for the 36-, 25-, 16-, and 9-node systems and the CRR-to-SR curve superimposed.]
Fig. 11. Increase in reliabilities of the BT systems relative to comparable UT systems
Unlike the CRR-to-SR reliability comparison, the reliability increase of the bidirectional tori systems is a function
of the system size. For BT systems with a reliability of 0.95, the reliability increase relative to comparable UT
systems falls within the range -1.86% to 3.50% for the 9-, 16-, 25-, and 36-node systems. This range decreases to
-0.45% to 0.04% for a higher BT reliability of 0.99. In contrast, for the CRR-to-SR comparison, the increases in
reliability were 11.88% and 3.11% at CRR reliabilities of 0.95 and 0.99, respectively. Hence, the shift from
unidirectional tori to bidirectional tori does not provide a significantly large reliability increase for practical
reliabilities of 0.95 and higher. At these higher reliabilities, the CRR systems show a more significant increase in reliability
relative to the single ring topologies. Also, for the 9-node and 16-node bidirectional tori with reliabilities greater
than 0.8 and 0.9 respectively, the reliability increase relative to comparable unidirectional tori is negative. These
two examples demonstrate that redundancy could in fact reduce the overall system reliability. This reduction in
reliability is the result of an increase in the number of components that could fail.
6. Conclusions
In this paper, the reliabilities of 1D and 2D k-ary n-cubes were evaluated using UltraSAN, a tool based on
Stochastic Activity Networks. The accuracy of the models was verified using both combinatorial and Markov
techniques. The feasibility of modeling SCI ring-based networks using ring elimination rather than link elimination
is also demonstrated. This contribution shows the feasibility of modeling relatively large networks using this
technique when the basic components are replicable. Furthermore, this research develops a framework regarding the
inherent reliability of SCI-based networks, laying a portion of the groundwork for the use of SCI-based networks for
mission-critical applications.
A comparison of the single and dual ring systems showed that the reliabilities of the 9-node, 16-node, and
25-node single-ring systems exceed the reliabilities of the larger 16-node, 25-node, and 36-node dual-ring systems for
mission times exceeding 14000, 13000, and 12000 hours respectively. These examples represent several such trends
that can be found for single and dual-ring systems of various sizes where a tradeoff exists between the number of
nodes required by an application versus the reliability desired for the application. Similarly, it was shown that the
9-node and 16-node unidirectional tori provided a higher reliability than the 16-node and 25-node bidirectional tori
for mission times less than 13000 hours and 4000 hours respectively. This insight is an important one for those
considering the organization of SCI-based networks for mission-critical applications, for which the penalty of a
network failure may be very expensive.
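The single-ring versus dual-ring crossover times can be estimated from the analytical models of Section 4.1. The Python sketch below bisects for the mission time at which a smaller single ring overtakes a larger dual ring. It is a rough check under assumptions, not a reproduction of the UltraSAN results: the function names are hypothetical and the failure rates (3.5e-6 and 1e-6 failures per hour) are assumed.

```python
import math

LAM_LINK = 3.5e-6      # assumed link failure rate, failures per hour
LAM_CROSSBAR = 1.0e-6  # assumed crossbar failure rate, failures per hour


def r_sr(n, t):
    """Single ring reliability: n links and n crossbars in series."""
    return math.exp(-n * (LAM_LINK + LAM_CROSSBAR) * t)


def r_crr(n, t):
    """Dual (counter-rotating) ring reliability: two parallel rings, n crossbars."""
    r_ring = math.exp(-n * LAM_LINK * t)
    return (1.0 - (1.0 - r_ring) ** 2) * math.exp(-n * LAM_CROSSBAR * t)


def crossover_time(n_sr, n_crr, t_lo=1.0, t_hi=50000.0):
    """Bisect for the time at which the n_sr-node single ring overtakes the
    n_crr-node dual ring, assuming the dual ring leads at t_lo and trails at t_hi."""
    def gap(t):
        return r_sr(n_sr, t) - r_crr(n_crr, t)

    for _ in range(100):
        t_mid = 0.5 * (t_lo + t_hi)
        if gap(t_mid) < 0.0:
            t_lo = t_mid
        else:
            t_hi = t_mid
    return 0.5 * (t_lo + t_hi)


if __name__ == "__main__":
    for n_sr, n_crr in ((9, 16), (16, 25), (25, 36)):
        print(n_sr, n_crr, round(crossover_time(n_sr, n_crr)), "hours")
```

With these assumed rates, each crossover falls within the 1000-hour interval just below the corresponding mission time quoted above.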
Comparing the reliability results of the dual-ring systems and unidirectional systems, it was shown that
reliabilities of both topologies were identical for equal-sized networks. For target reliabilities of 0.8, 0.85, 0.9, and
0.95, the percentage improvements in reliability of the dual-ring systems over the single-ring systems were 32.49%,
27.10%, 20.11%, and 11.88%, respectively. The same comparison conducted using 2D tori yielded percentage
improvements in reliability of the bidirectional tori over the unidirectional tori of 36.61%, 21.74%, 10.38%, and
2.82%, for target reliabilities of 0.8, 0.85, 0.9, and 0.95, respectively. These results are particularly significant, as
they demonstrate the improvements in reliability that can be achieved by doubling the number of links in the
network without a radical reorganization of network topology. Of course, it is noteworthy that moving to a
dual-ring or bidirectional implementation of the same topology also generally results in an improvement in the effective
bisection bandwidth and latency of these networks.
The unidirectional tori and dual-ring systems provide the same percentage improvement in reliability over the
single-ring systems. The bidirectional torus provides a higher percentage improvement in reliability over the
single-ring systems for target reliability values less than 0.9. The percentage improvements in reliability of the
bidirectional tori over the single-ring systems were 105.14%, 70%, 24.71%, and 9.98% for target reliabilities of 0.8,
0.85, 0.9, and 0.95, respectively. These results demonstrate that the bidirectional torus is not as effective as the
dual-ring systems and the unidirectional tori in providing high levels of reliability when the target reliabilities
exceed 0.9. The lower improvement in reliability is the result of an increased number of components in the
bidirectional torus topology. For target reliability values 0.9 or less, the bidirectional tori demonstrate a larger
percentage improvement than the dual-ring systems and unidirectional tori. This observation points to the suitability
of such a network organization for applications with very long mission times and no provision for repair.
There exist several possible directions for future research. One such direction is the reliability analysis of larger
networks of the same topology using the technique of ring elimination. Another possible direction is the reliability
analysis of SCI-based network topologies in other popular configurations, such as meshes. Yet another possible
area for future research is the application of the techniques described in this paper to other emerging
high-performance networks.
References
1. IEEE, 1596-1992 IEEE Standard for Scalable Coherent Interface (SCI), Piscataway, NJ: IEEE Service Center, 1993.
2. B. W. Johnson, Design and Analysis of Fault-Tolerant Digital Systems, Addison-Wesley, 1989.
3. C. Lindemann, "DSPNexpress: A Software Package for the Efficient Solution of Deterministic and Stochastic Petri Nets,"
Proceedings of the Sixth International Conference on Modeling Techniques and Tools for Computer Systems Performance
Evaluation, pp. 15-29, Edinburgh, Great Britain, 1992.
4. G. Chiola, "GreatSPN 1.5 Software Architecture," Proc. 5th Int. Conf. on Modeling Techniques and Tools for Computer
Performance Evaluation, Torino, Italy, Feb. 1991.
5. S. J. Bavuso, J. B. Dugan, K. S. Trivedi, E. M. Rothman, W. E. Smith, "Analysis of Typical Fault-Tolerant Architectures
using HARP," IEEE Transactions on Reliability, vol. R-36, no. 2, pp. 176-185, June 1987.
6. G. Ciardo, J. Muppala, and K. S. Trivedi, "SPNP: Stochastic Petri Net Package," Proceedings of the Fourth International
Workshop on Petri Nets and Performance Models, pp. 142-151, Kyoto, Japan, December 1989.
7. C. Beounes et al., "SURF-2: A Program for Dependability Evaluation of Complex Hardware and Software Systems,"
Proceedings 23rd Int. Symp. on Fault-Tolerant Computing (FTCS-23), IEEE, Toulouse, France, June 1993.
8. W. H. Sanders, W. D. Obal, M. A. Qureshi, F. K. Widjanarko, "The UltraSAN Modeling Environment," Performance
Evaluation, vol. 24, no. 1-2, pp. 89-115, November 1995.
9. C. Colbourn, J. Devitt, S. Harms, D. Daryl, "Assessing Reliability of Multistage Interconnection Networks," IEEE
Transactions on Computers, vol. 42, no. 10, pp. 1207-1221, October 1993.
10. A. Varma, C. Raghavendra, "Reliability Analysis of Redundant-Path Interconnection Networks," IEEE Transactions on
Reliability, vol. 38, no. 1, pp. 130-137, April 1989.
11. X. Cheng, O. Ibe, "Reliability of a Class of Multistage Interconnection Networks," IEEE Transactions on Parallel and
Distributed Systems, vol. 3, pp. 241-246, March 1992.
12. B. Menezes, U. Bakhru, "New Bounds on the Reliability of Augmented Shuffle-Exchange Networks," IEEE Transactions on
Computers, vol. 44, no. 1, pp. 123-129, January 1995.
13. J. Blake, K. Trivedi, "Reliability Analysis of Interconnection Networks Using Hierarchical Composition," IEEE
Transactions on Reliability, vol. 38, no. 1, pp. 111-120, April 1989.
14. J. Blake, K. Trivedi, "Multistage Interconnection Network Reliability," IEEE Transactions on Computers, vol. 38, no. 11,
pp. 1600-1604, November 1989.
15. M. Balakrishnan, A. Reibman, "Reliability Models for Fault-Tolerant Private Network Applications," IEEE Transactions on
Computers, vol. 43, no. 9, pp. 1039-1053, September 1994.
16. E. Smith, K. Trivedi, "Dependability Evaluation of a Class of Multi-Loop Topologies for Local Area Networks," IBM
Journal of Research and Development, vol. 33, no. 5, pp. 511-523, September 1989.
17. C. Raghavendra, J. Silvester, "A Survey of Multi-Connected Loop Topologies for Local Computer Networks," Computer
Networks and ISDN Systems, vol. 11, no. 1, pp. 29-42, January 1986.
18. T. Chung, N. Sharma, D. Agrawal, "Cost-Performance Tradeoffs in Manhattan Street Network versus 2-D Torus," IEEE
Transactions on Computers, vol. 43, no. 2, pp. 240-243, February 1994.
19. Z. Chen, T. Berger, "Reliability and Availability Analysis of Manhattan Street Networks," IEEE Transactions on
Communications, vol. 42, no. 2/3/4, pp. 511-522, February/March/April 1994.
20. N. Lopez-Benitez, J. Fortes, "Detailed Modeling and Reliability Analysis of Fault-Tolerant Processor Arrays," IEEE
Transactions on Computers, vol. 41, no. 9, pp. 1193-1200, September 1992.
21. M. Sarwar and A. D. George, "Simulative Performance Analysis of Switching Fabrics for Scalable SCI Networks,"
Microprocessors and Microsystems, vol. 24, no. 1, pp. 1-11, March 2000.
22. K. Kibria, Interconnect Systems Solution, hii'. *,, i .us.com/LincCore.htm
23. MIL-HDBK-217F: Handbook for Reliability Prediction of Electronic Equipment, Defense Printing Service, Philadelphia,
PA.
24. W. Yost, "Cost Effective Fault Tolerance for Network Routing," Master of Science Thesis, University of Washington, 1995.
