Title: Large-scale distributed services
Permanent Link: http://ufdc.ufl.edu/UF00100757/00001
 Material Information
Title: Large-scale distributed services
Physical Description: Book
Language: English
Creator: Vorapanya, Anek
Publisher: State University System of Florida
Place of Publication: Florida
Publication Date: 2000
Copyright Date: 2000
 Subjects
Subject: Database management -- Reliability   ( lcsh )
Reliability (Engineering)   ( lcsh )
Heterogeneous computing   ( lcsh )
Computer and Information Science and Engineering thesis, Ph. D   ( lcsh )
Dissertations, Academic -- Computer and Information Science and Engineering -- UF   ( lcsh )
Genre: government publication (state, provincial, territorial, dependent)   ( marcgt )
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )
 Notes
Summary: ABSTRACT: In wide-area networks, the Internet in particular, a message-passing distributed system experiences frequent network failures and fluctuating communication latency. Because of its scale and different administrative domains, large-scale distributed systems usually consist of heterogeneous components. Heterogeneity and failures induce asynchrony, which varies from mild to extreme. Inability to acquire correct and up-to-date knowledge about the other elements of the system makes it difficult for a process to decide correctly how to proceed. Failures, asynchrony, and limited knowledge contribute to the difficulty of building reliable large-scale distributed services. In this research, we present our approach for building reliable large-scale distributed services. Essentially, we combine mutual failure detection with group membership and group multicast, and package them together into a group communication system on top of which reliable and scalable distributed services can be built. Our group communication system uses a hierarchical group structure to achieve scalability. To provide reliable services, the group communication system depends on failure information reported by a local mutual failure detector as well as the defined properties of the group communication system itself. A mutual failure detector is responsible for detecting the responsiveness of remote processes and the ability to communicate with those processes. It uses bounded delay on correct messages or network time-to-live of messages to help it detect process responsiveness and communication failures. The group communication system supports reliable and scalable multicast with single-source ordering semantic.
Summary: ABSTRACT (cont.): With our group communication system, building reliable large-scale distributed services, especially service replication, becomes easier than using standard network programming tools that are based on unicast communications. Currently we're using these tools to build our distributed conferencing system (DCS). There are a number of issues that still need to be addressed to make these tools complete, in particular the data consistency issue through the multicast message delivery ordering and the security issue in group communication services. Since our architecture is a rooted tree, one can build a strong multicast message delivery semantic without much difficulty. The security issue requires further research in secure group communications.
Summary: KEYWORDS: distributed systems, reliability, failure detection, group communication
Thesis: Thesis (Ph. D.)--University of Florida, 2000.
Bibliography: Includes bibliographical references (p. 108-111).
System Details: System requirements: World Wide Web browser and PDF reader.
System Details: Mode of access: World Wide Web.
Statement of Responsibility: by Anek Vorapanya.
General Note: Title from first page of PDF file.
General Note: Document formatted into pages; contains xi, 112 p.; also contains graphics.
General Note: Vita.
 Record Information
Bibliographic ID: UF00100757
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: oclc - 45837273
alephbibnum - 002640024
notis - ANA6855



Full Text











LARGE-SCALE DISTRIBUTED SERVICES


By

ANEK VORAPANYA
















A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


2000


































Copyright 2000

by

Anek Vorapanya

















This work is dedicated to my late father, Kumnuan Vorapanya.















ACKNOWLEDGMENTS

I'm grateful to my advisor, Dr. Richard Newman. This work would not have been

possible without him giving me the opportunity to conduct my research and supporting

me with research assistantships throughout my entire study. He is both my mentor and

my good friend. I thank my committee members, Dr. Randy Chow, Dr. Mike Conlon,

Dr. Tim Davis, and Dr. Beverly Sanders for serving on my committee and for all

suggestions that improved my work. I also thank the CISE department for supporting

me with teaching assistantships during my study.

My thanks go to my good friends who have helped me so much during my study,

especially Ratanaporn Awiphan, Paveena C'!i .. .- ,!i iwongse, Pakphong Cli(' i i An-

drew Lomonosov, Karnchana Panichakarn, Sutharin Pathomvanich, Somkiet Songpet-

mongkol, Saowapa Sukuntee and Prin Suvimolthirabutr.

I'm thankful for my mother, Auempon, and the entire Vorapanya family who, for

as long as I can remember, have believed in me, never given up on me and supported

me financially and morally. This family has, without doubt, patiently waited to see me

succeed during this seemingly endless journey of self-fulfillment. I always feel fortunate

to be surrounded by these people who understand the importance of education. Without

them, it would be unimaginably difficult for me to make this wish come true.

Lastly, I'm thankful for the one special person, Ple, who has been closer to my heart

more than anyone else and who has helped me persevere during the past eight years.















TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF FIGURES

ABSTRACT

CHAPTERS

1 INTRODUCTION

    1.1 Problems and Motivations
    1.2 Our Approach
    1.3 Contributions
    1.4 Dissertation Outline

2 BASICS AND PREVIOUS WORK

    2.1 Message-Passing Systems
    2.2 Reliable Distributed Computing
    2.3 Agreement Problem
    2.4 Failure Detectors
    2.5 Group Communication

3 LARGE-SCALE DISTRIBUTED SERVICES

    3.1 System and Failure Models
    3.2 Requirements
    3.3 Reliable Communications

4 MUTUAL FAILURE DETECTION

    4.1 Failures
    4.2 Failure Detection
    4.3 Mutual Failure Detection
    4.4 Mutual Heartbeat Failure Detector
        4.4.1 Definitions
        4.4.2 Failure Detector Machine and Its Operations
        4.4.3 Analysis of Mutual Failure Detector
        4.4.4 Failure Detector Simulation
        4.4.5 Analytical and Simulation Results

5 MEMBERSHIP AND MULTICAST

    5.1 Group Communication
    5.2 Asynchronous Hierarchical Service Group
    5.3 Membership Service
        5.3.1 Membership Properties
        5.3.2 Membership Protocol
        5.3.3 Correctness
    5.4 Multicast Service
        5.4.1 Multicast Properties
        5.4.2 Multicast Protocol
        5.4.3 Correctness

6 CONCLUSION AND FUTURE WORK

REFERENCES

BIOGRAPHICAL SKETCH















LIST OF FIGURES

3.1  The layer architecture of our system

4.1  Mutual heartbeat failure detection

4.2  Process p sends a message to process q. The message takes d1 time delay before it
     reaches q. q responds to the message within deadline Δ (d1 + d2) with some probability
     r. The response from q takes d2 time delay before it reaches p.

4.3  A failure detection machine. The states R and N represent the responsiveness and
     non-responsiveness states of a monitored process, respectively.

4.4  A continuous series of certain events forces the failure detector to change the state of a
     process. In the figure, a series of c consecutive missing heartbeats from q makes p's
     failure detector change the state of q.

4.5  The failure detector initially classifies a remote process to be in some state (N in this
     case). Then it changes q's state to R when the condition to do so has been met. Later
     on, it may change q's state back to N again. The cycle repeats indefinitely (depending
     upon the probability of successes and failures of q with respect to p).

4.6  Probability of the failure detector reporting the R state, P(report R), for fixed p and
     varied v, as obtained by a direct probabilistic analysis. For lower v, the graph is lower.

4.7  Probability of the failure detector reporting the N state, P(report N), for fixed v and
     varied p, as obtained by a direct probabilistic analysis. For lower p, the graph is lower.

4.8  Expected dwell time in the R state, EDTR, of the failure detector for c = 3. The top
     graph uses a maximum try of 30 and the bottom graph uses a maximum try of 10.

4.9  Expected dwell time in the N state, EDTN, of the failure detector for c = 3. The top
     graph uses a maximum try of 30 and the bottom graph uses a maximum try of 10.

4.10 Probability of the failure detector reporting the R state, P(report R), for fixed p and
     varied v. For lower v, the graph is lower.

4.11 Probability of the failure detector reporting the R state, P(report R), for fixed p and
     varied v. For lower v, the graph is lower.

4.12 3-D surface plot of the probability of the failure detector reporting the R state,
     P(report R), for a probability of success of q (P_s) with all combinations of p and v.

4.13 3-D surface plot of the probability of the failure detector reporting the R state,
     P(report R), for a probability of success of q (P_s) with all combinations of p and v.

4.14 3-D surface plot of the probability of the failure detector reporting the R state,
     P(report R), for a probability of success of q (P_s) with all combinations of p and v.

4.15 3-D surface plot of the probability of the failure detector reporting the R state,
     P(report R), for a probability of success of q (P_s) with all combinations of p and v.

4.16 Probability of the failure detector reporting the N state, P(report N), for fixed v and
     varied p. For lower p, the graph is lower.

4.17 Probability of the failure detector reporting the N state, P(report N), for fixed v and
     varied p. For lower p, the graph is lower.

4.18 3-D surface plot of the probability of the failure detector reporting the N state,
     P(report N), for a probability of success of q (P_s) with all combinations of v and p.

4.19 3-D surface plot of the probability of the failure detector reporting the N state,
     P(report N), for a probability of success of q (P_s) with all combinations of v and p.

4.20 3-D surface plot of the probability of the failure detector reporting the N state,
     P(report N), for a probability of success of q (P_s) with all combinations of v and p.

4.21 3-D surface plot of the probability of the failure detector reporting the N state,
     P(report N), for a probability of success of q (P_s) with all combinations of v and p.

4.22 Average dwell time in the R state, ADTR, of the failure detector as a function of p and
     v; p is fixed and v is varied. Note that in all ADTR graphs, the ADTR values are plotted
     on a logarithmic scale. The upper end point of each line in ADTR is infinity (not
     shown). For lower v, the graph is lower.

4.23 Average dwell time in the R state of the failure detector as a function of p and v; p is
     fixed and v is varied. For lower v, the graph is lower.

4.24 Average dwell time in the N state, ADTN, of the failure detector as a function of v and
     p; v is fixed and p is varied. Note that in all ADTN graphs, the ADTN values are plotted
     on a logarithmic scale. The upper end point of each line in ADTN is infinity (not
     shown). For lower p, the graph is lower.

4.25 Average dwell time in the N state of the failure detector as a function of v and p; v is
     fixed and p is varied. For lower p, the graph is lower.

5.1  Two extremes of tree height. In (a), a tall relay tree has low parallelism in multicast
     diffusion. In (b), a fat relay tree has high parallelism in multicast diffusion. The number
     next to each arrow indicates the multicast diffusion step number since the multicast
     message entered the service group boundary at some relay, and j i k where i is a small
     integer.

5.2  When a parent relay in a group is non-responsive to its children, the children relocate
     themselves to the upper ancestor. In this scenario, G actually fails. If that is not the
     case, C will still be a child of D.

5.3  When a root relay or one or more of its immediate children fails, there are many
     possibilities that need to be considered. The goal is to avoid the unnecessary need to
     split the group into many smaller groups.

5.4  When a root relay in a group is non-responsive to its children, the leftmost sibling of
     the root takes over as a new root of the relay group.















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



LARGE-SCALE DISTRIBUTED SERVICES



By

Anek Vorapanya

August 2000

Chairman: Dr. Richard E. Newman
Major Department: Computer and Information Science and Engineering

In wide-area networks, the Internet in particular, a message-passing distributed sys-

tem experiences frequent network failures and fluctuating communication latency. Be-

cause of its scale and different administrative domains, large-scale distributed systems

usually consist of heterogeneous components. Heterogeneity and failures induce asyn-

chrony, which varies from mild to extreme. Inability to acquire correct and up-to-date

knowledge about the other elements of the system makes it difficult for a process to de-

cide correctly how to proceed. Failures, asynchrony, and limited knowledge contribute

to the difficulty of building reliable large-scale distributed services.

In this research, we present our approach for building reliable large-scale distributed

services. Essentially, we combine mutual failure detection with group membership and

group multicast, and package them together into a group communication system on









top of which reliable and scalable distributed services can be built. Our group commu-

nication system uses a hierarchical group structure to achieve scalability. To provide

reliable services, the group communication system depends on failure information re-

ported by a local mutual failure detector as well as the defined properties of the group

communication system itself. A mutual failure detector is responsible for detecting the

responsiveness of remote processes and the ability to communicate with those processes.

It uses bounded delay on correct messages or network time-to-live of messages to help

it detect process responsiveness and communication failures. The group communication

system supports reliable and scalable multicast with single-source ordering semantic.

With our group communication system, building reliable large-scale distributed ser-

vices, especially service replication, becomes easier than using standard network pro-

gramming tools that are based on unicast communications. Currently we're using these

tools to build our distributed conferencing system (DCS).

There are a number of issues that still need to be addressed to make these tools

complete, in particular the data consistency issue through the multicast message delivery

ordering and the security issue in group communication services. Since our architecture

is a rooted tree, one can build a strong multicast message delivery semantic without much

difficulty. The security issue requires further research in secure group communications.















CHAPTER 1
INTRODUCTION

1.1 Problems and Motivations

With the advent of a global data communication infrastructure, such as the Inter-

net, numerous network-based applications are being created and deployed. Wide-area

collaborative applications are an important class of network-based applications that will

benefit greatly from this new global communication medium. Collaborative systems use

computing and communication to enhance human interaction and collaboration. Our

distributed conferencing system (DCS) is an example of such a system.

The distributed conferencing system provides role-based conferencing services to users

at many DCS service domains or sites. The distributed conferencing system consists of a

collection of sites, a DCS terminology for DCS service domains. Each site consists of a

small group of highly available, synchronous aggregate servers and a number of clients

attached to one of the aggregate servers, which acts as the primary server of that site [ ].

The rest of the servers are backups. Distributed conferencing sites interact to exchange

information crucial for providing conferencing services.

To achieve the highest possible level of conferencing service availability, we will need

to allow each site to be autonomous as much as possible. The entire DCS system shares

some information, such as conference information, access control information, decision or

voting information. This information must be replicated and distributed to all sites in a








timely manner [ ]. With this information replicated, DCS clients can access conference

services from any DCS site.

Distributed conferencing users can participate in conferences created and maintained

by the DCS. A DCS daemon at a site maintains conference and user information and

provides conference services to its local users (within the same DCS service domain), as

well as remote users (from another DCS service domain). Each DCS user runs a DCS

client which contacts any available DCS daemons on behalf of the user to gain access to

the conference services or any other services provided by the DCS.

In role-based distributed conference services, each DCS client has a specific role, which

can be changed in each conference. Each client can participate in more than one confer-

ence simultaneously and may do so in many different roles. To support this role-based

conferencing service, the DCS system incorporates many functionalities, including con-

ference management, event notifications, decision supports, directory services, security

services, location transparent access, access control services, role management, among

many other services.

Beyond the current state of DCS, we want to extend its highly available server

approach so that it will work better over wide-area networks, such as the Internet, in

the following ways:

Location transparency and increased service reliability and availability: The DCS

servers and the DCS clients can be anywhere on the network. The current model

of DCS uses a set of synchronous servers within a DCS site to provide resilient

and responsive conference services. A client can choose which DCS site will be

his conference service provider. This provides direct support for highly mobile








DCS clients, as well as making the conference services more available regardless

of the location of the DCS clients. This group of synchronous DCS servers uses

the primary and backup approach to achieve high service availability, but within a

local site. We would like to extend this service availability to the global wide-area

networks.

Reliable group communication support: To support conference services and DCS

database and directory maintenance among many different DCS sites, we need

reliable group communication infrastructures. This reliable group communication

infrastructure will also be used to achieve our goal of location transparency and

service reliability and availability just described.

A number of challenging problems need to be addressed before we can take full advan-

tage of this new global data communication infrastructure in large collaborative systems

such as the DCS. Because of the massive scale of the Internet, which consists of many

wide-area networks, the complex interaction of different types of wide-area networks, the

distributed nature of the control and management of these networks, unexpected com-

munication disruption and fluctuating communication latency are common. And since

the Internet consists of many different administrative domains of many smaller wide-area

networks, asynchrony, due to the heterogeneity in hardware and software components,

is unavoidable.

It is also common for collaborative applications to interact frequently. Thus, the

inability to acquire correct or up-to-date knowledge about the other elements of the

system makes it difficult for distributed processes in these collaborative applications to

correctly decide how to proceed. These factors, i.e., process and communication failures,








asynchrony, and lack of common knowledge, together contribute to the difficulty of

providing reliable, available and scalable distributed services.

There are some proposed systems that try to address different issues related to

reliable large-scale distributed systems. Most of the recent works focus on the problem

called a primary partition group communication system, i.e., group membership and

group multicast. Some of the primary partition systems focus only on the membership

problem and some of them focus only on the message multicast issue. A few systems

are similar to ours, but some of them are too theoretical in nature and some

have different requirements than ours.

Essentially, we want to build distributed conferencing services that are reliable, avail-

able and scalable over wide-area networks. A large amount of research has been done in

the area of small-scale distributed systems with various synchrony guarantees. However,

there is not as much research that deals exclusively with the issue of service reliability,

availability and scalability in distributed services at the level of the Internet.

Because of the omnipresence of asynchrony and the frequent failures in this global

network, we need a more suitable architectural model that better represents the Internet,

as well as a sufficiently reliable failure detection mechanism, which is critical to building

reliable distributed services at this scale. The challenge of creating a more suitable model

for large-scale distributed services over wide-area networks, as well as other necessary

tools, must be met if we are to provide services such as our distributed conferencing

services on a global scale.









1.2 Our Approach

We tackle this challenge by using a mutual failure detector and a weak group com-

munication system that are both efficient and scalable. A weak group communication

system provides weak group membership and group multicast services. Instead of using

strong group membership, as in almost every published system, it is much more appro-

priate to use weak group membership in large-scale distributed systems. This is because

it would be cost-prohibitive to agree on a large group of processes on every event that

occurs and causes the change in the group membership. Certainly, this limits the use of

our approach. However, that is the price that we have to pay to gain the service reliability,

availability and scalability that we need in wide-area networks.

Since our process group structure is hierarchical, our group communication services

scale easily. Also, a hierarchical structure allows for more efficient message diffusion

during reliable multicast transmissions due to the non-existence of cycles in our com-

munication model. This makes multicast in our systems efficient, which is necessary in

large-scale distributed systems. With our mutual failure detector and weak group com-

munication system, we can build efficient, available and scalable large-scale distributed

services over wide-area networks.


1.3 Contributions

The key contributions of this doctoral research are:

Definition of an efficient architecture and an asynchronous model for large-scale

collaborative distributed services over a very large network such as the Internet.








Specification with a correctness proof and construction of a combined, efficient

process and communication failure detector which is suitable for large-scale dis-

tributed systems experiencing frequent communication failures. This mutual fail-

ure detector is the foundation of our weak group communication system.

Specification with a correctness proof and construction of a weak group commu-

nication system, namely group membership and group multicast services that are

suitable for distributed services with service reliability, availability and scalability

goals.


1.4 Dissertation Outline

We review previous work in fault-tolerant distributed computing that relates to our

research with the focus on well-known results. Other projects similar to ours are also

discussed. Then we establish the problem requirements of large-scale distributed services

and discuss the proposed hierarchical group structure model. We present our mutual

failure detector and argue its suitability for failure detection in large-scale distributed

systems. We also supply the arguments about its performance and robustness, and give

a correctness proof. Then we describe our weak group communication services which rely

on our mutual failure detector. We give the protocols along with the correctness proof.

At the end, we summarize our work and discuss possible future research extensions based

on this work.















CHAPTER 2
BASICS AND PREVIOUS WORK

2.1 Message-Passing Systems

A message-passing distributed system consists of a number of processes and commu-

nication channels. Processes in a message-passing system communicate with each other

by explicitly exchanging messages over unreliable communication channels. These pro-

cesses are concurrent, but they are usually cooperative and work collaboratively toward

a common goal.

In real message-passing systems, processes and communication channels are usually

heterogeneous. Thus they are always different in a number of characteristics. The

differences of timing characteristics among these processes and communication channels

have the most significant impact on the ability of the distributed system to make progress

or to solve important problems. The timing characteristic of system components are

often referred to as system synchrony, which includes communication delay, local clock

drift and relative process speed. Communication delay probably has the most impact

on large-scale distributed services over wide-area networks.

Some system synchrony can easily be bounded under certain environments. When

that happens, we have a synchronous system. This particular type of system is usually a

very small distributed system consisting of mainly homogeneous components connected

together by a very high speed interconnection network, possibly in a cabinet or a very

small room or on a small campus. One of our previous team members used these








synchrony assumptions to successfully implement the aggregate servers for the DCS [ ].

Many fundamental problems are known to be solvable in the synchronous model [ ,

].

It is difficult to build a message-passing synchronous system that always satisfies

these timing properties. This is due to the fact that probabilistically, system components

may fail, or system transient load may vary unpredictably. These failures, transient or

permanent, lead to asynchrony. Thus, for a large-scale, wide-area distributed system,

which is an interconnection of a large number of heterogeneous components across a

large geographic distance, it is unrealistic to assume that the bounds exist for system

synchrony. A weaker model that is suitable for large, heterogeneous distributed systems

is the asynchronous model.

The message-passing asynchronous model assumes less about timing properties com-

pared to the synchronous model. Since the asynchronous model assumes very little about

timing information of events that are usually provided by real distributed systems, algo-

rithms or protocols, designed for the asynchronous system, are more general and more

portable. They should run correctly in networks with arbitrary timing guarantees. One

obvious consequence is that the protocols do not utilize the system synchrony. This

could severely hurt the performance of the systems.

In practice, even an asynchronous system may indeed satisfy synchrony properties

temporarily. Also, most distributed protocols do not require that the system be syn-

chronized indefinitely for these protocols to terminate. More often than not, distributed

protocols only require that some synchrony exists for only a short period of time, in








particular during a certain phase of the protocols. That is sufficient for the protocols to

terminate and the system can make progress toward the termination of the protocols.

The model that does satisfy certain synchrony is called a partially synchronous

model [ ]. In this model, certain synchrony can be expected. This model is very

practical and useful. This is the model that we think is the most suitable for real system

design, both small and large.


2.2 Reliable Distributed Computing

The difficulty of constructing a reliable large-scale distributed system lies mainly

in the asynchrony and failures within the system. Also, when more components are

involved, more information needs to be obtained and more coordination, which becomes

more difficult and expensive with scaling, may be required if these components are to

proceed in concert. Here we briefly examine the key factors that contribute significantly

to the difficulty of the construction of reliable distributed systems, namely asynchrony,

failures and lack of common knowledge.

Asynchrony usually comes with the heterogeneity of the system components and the

way these components are structured. System transient loads, network congestion, un-

predictable process scheduling strategies, and most important, temporary or permanent

failures are sources of asynchrony. Asynchrony is unavoidable in large-scale systems over

wide-area networks.

Failures can cause asynchrony, as in the case of a communication disruption that

delays messages arbitrarily. In asynchronous systems, the effect of a process being slow

or the network being slow is indistinguishable. However, failures and the inability to








identify them in a timely manner are a main obstacle to the construction of reliable

distributed systems.

Many types of failure have been studied and classified. They are fail-stop failures [ ],

crash failures [ ], omission failures [ ], and, the most difficult to deal with, Byzan-

tine failures [ ]. A failure type that is commonly assumed, and the one in which

we are interested, is the crash failure model. In crash failures, the fact that a process

has failed may or may not be detectable by other processes.

Knowledge plays a significant role in the design of distributed systems. Halpern

and Moses showed that in any kind of distributed system, common knowledge cannot

be attained if communications are not reliable [ ]. Knowledge that we can deduce

in asynchronous systems is thus very limited. If we cannot agree on some common

knowledge, it is difficult to make various components of the system agree upon something

or take coordinated actions, which is the key to reliable distributed computing [ ].

This inability to acquire correct or current knowledge about the other elements of the

system makes it more difficult for a process to decide correctly what actions it would

have to take to complete the protocol.

Next we discuss some known approaches to reliable distributed computing.


2.3 Agreement Problem

There are two common approaches to building reliable distributed systems: reliable

broadcast and consensus. The consensus approach is based on distributed agreement or

consensus [ ], while the reliable broadcast approach is based on process group

communication [ , , ].








It is a common practice to use some form of distributed agreement or consensus to

make distributed systems more fault-tolerant [ ]. Consensus is very straightforward if

every process in the distributed system is reliable, i.e., no processes fail during any phase

of the protocol.

Consensus in the presence of failures is very difficult. This is especially true when

very few or no assumptions can be made about the underlying system, or what kinds

of failures that are permitted in the system, or how much we can know about the other

processes in the system. A very important impossibility result states that distributed

consensus cannot be deterministically achieved in completely asynchronous systems with

failures [ ]. This is true even when communication is completely reliable, and only one

process is allowed to fail, and it can only fail by crashing. This impossibility result has

a very significant implication as protocols in distributed systems usually rely on some

form of consensus.

Since the knowledge about process failures is critical to the possibility of consensus,

it is easy to see why consensus is possible in completely asynchronous systems with

fail-stop failures. Fail-stop processes are required to halt when they fail, and the fact

that they have failed can be reliably detected by other correct processes [ ]. With the

ability to reliably detect failures, consensus can be solved [ ].

It is very important to emphasize that the impossibility result of consensus doesn't

mean that consensus can never be deterministically achieved in real asynchronous sys-

tems with crash failures or any stronger types of failure. What it really means is that

for asynchronous systems with crash failures, every deterministic consensus protocol has

a possibility of non-termination which violates one of the properties of consensus. This








usually happens when some process has to wait for a very long time without knowing

with absolute certainty that the process that it is waiting for is very slow or has already

failed. However, the chance for this unfortunate incident to happen is indeed quite small

unless failures are so frequent or accurate failure detection is difficult.

Since consensus is very important to building reliable distributed systems, there are

many attempts to circumvent it. A certain amount of synchronism boosts the power of

the system to identify faulty remote processes. This is, in fact, the key idea behind the

partially synchronous model [ ].

An exhaustive list of minimally required synchrony properties for the possibility of

consensus has already been identified [ ]. An attempt to provide a modular consensus

service can be found in [ ]. Another interesting approach to circumvent the impossi-

bility result of consensus is to use unreliable failure detectors [ ]. Failure detection

is one of the main topics of this research.


2.4 Failure Detectors

A very important observation about the impossibility of distributed consensus in

completely asynchronous systems is that it is inherently difficult for asynchronous pro-

cesses in a consensus protocol to determine whether remote processes are faulty or simply

very slow. In other words, consensus is not always deterministically solvable because

accurate failure detection in completely asynchronous systems is impossible.

Chandra and Toueg showed that to solve a difficult problem such as consensus,

one doesn't need a very accurate failure detector [ ]. By augmenting a completely

asynchronous system with unreliable failure detectors, ones that strictly satisfy a very








specific set of properties and can make infinitely many false suspicions, consensus is

shown to be solvable in this new model.

Suppose that we have a consensus protocol that relies on failure information from

failure detectors to help a group of processes reach an agreement. If the required prop-

erties of the failure detector are satisfied in some execution of the consensus protocol,

then the protocol guarantees that all correct processes will reach an agreement and the

protocol will terminate in that execution. If however, some or all of the failure detector

properties do not hold during an execution of the consensus protocol, the consensus

protocol may never terminate in that execution. Thus failure detectors help ensure the

liveness requirement of the consensus protocol. Every consensus protocol, including the

ones that rely on unreliable failure detectors, must guarantee that the safety requirement

of the consensus protocol is never violated, i.e., different decisions will never be reached

by any correct processes during this non-termination period.

The work on unreliable failure detectors for solving consensus identifies many types

of failure detectors. However, they are for crash failures only. Omission and Byzantine

failure detectors have also been studied recently [ ].

There are many known different implementations of failure detection mechanism.

Most of them are designed specifically for solving consensus or related problems [ ].

Some of them are designed to be used as a foundation for building other distributed

services [ ]. Sergent et al. gave a list of different implementations of crash failure

detectors [ ].
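
As a concrete illustration only, and not the mutual failure detector developed in Chapter 4,
the following sketch shows one common style of heartbeat-based crash failure detector: a
remote process is suspected when its heartbeat is overdue, and the timeout for that process
is raised whenever a suspicion turns out to be false, so that suspicions tend to become more
accurate over time. All names and parameter values here are illustrative assumptions.

    import time

    class HeartbeatFailureDetector:
        """Illustrative unreliable crash failure detector (heartbeat + adaptive timeout)."""

        def __init__(self, initial_timeout=2.0, backoff=1.5):
            self.initial_timeout = initial_timeout   # seconds before the first suspicion
            self.backoff = backoff                   # growth factor after a false suspicion
            self.last_heartbeat = {}                 # process id -> time of last heartbeat
            self.timeout = {}                        # process id -> current timeout
            self.suspected = set()

        def heartbeat(self, pid):
            """Record a heartbeat received from process pid."""
            if pid in self.suspected:
                # The suspicion was false: be more conservative about pid from now on.
                self.suspected.discard(pid)
                self.timeout[pid] = self.timeout.get(pid, self.initial_timeout) * self.backoff
            self.timeout.setdefault(pid, self.initial_timeout)
            self.last_heartbeat[pid] = time.monotonic()

        def suspects(self):
            """Return the set of processes currently suspected to have crashed."""
            now = time.monotonic()
            for pid, last in self.last_heartbeat.items():
                if now - last > self.timeout[pid]:
                    self.suspected.add(pid)
            return set(self.suspected)

A detector of this kind can wrongly suspect a correct but slow process; that is exactly the
kind of mistake that the consensus protocols discussed above are designed to tolerate.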









2.5 Group Communication

Another approach to reliable distributed computing is group communication, a vari-

ation of the atomic broadcast approach. A group communication system provides two

services, group membership and group multicast. Informally, group membership main-

tains memberships of groups within the system by using some kind of distributed agree-

ment protocols to enforce the agreement of group memberships among processes. A

group membership protocol keeps track of the dynamic changes of group membership.

A membership change event occurs when a process leaves or joins a group. A process

may leave a group because it fails, voluntarily leaves, or it is forced to leave. A process

joins a group usually by being selected to replace a process that has just left the group

or to increase the availability of services provided by the process group.

The process agreement required in a group membership protocol is different from

that of consensus. In group membership, a process that is suspected to have crashed

can be removed from the group or killed, even if the suspicion is, in fact, incorrect.

In other words, a group membership allows the exclusion of some correct processes

from participating in the group membership protocol, but all correct processes must

participate in the consensus protocol. Also, consensus requires progress in all executions,

but group membership does not require any changes in the groups if no processes fail,

request to join, or request to leave.

Group membership has a weaker set of requirements than consensus. Even with

weaker requirements, group membership is still not possible with crash failures [ ]. This

impossibility result applies to primary-partition group membership such as ISIS, Totem

and NewTop only [ , ]. This impossibility result still holds even if we allow the








erroneously suspected non-faulty processes to be removed or killed as in the ISIS system.

A non-primary partition group membership allows divergences of membership, i.e., not

all membership changes require distributed agreement by all members. Therefore, it

escapes from this impossibility result.

Based on a thesis that a group of processes will make distributed computing more

reliable, Birman created the ISIS tool kit, a virtual synchrony group communication sys-

tem [ ]. Essentially, virtual synchrony guarantees that whenever a membership change

occurs, the membership change event appears to be atomic to every process. In other

words, virtual synchrony guarantees that every multicast is totally ordered with respect

to membership change events. The set of messages delivered before or after each mem-

bership change is consistent at every process. It tries to provide the effect of a perfect

failure detector or a simulation of fail-stop failure.

The ISIS system suffers the problem of false failure suspicions. Specifically, if a

process is suspected to fail, it is forced to be removed from its group. This is to guarantee

that its failure detector is perfect. This could lead to a very unstable system when it

experiences frequent, transient network congestion or unusually long communication

delays, which can cause many processes to be suspected as faulty.

Totem is a fault-tolerant multicast group communication system that provides a

reliable, totally ordered multicasting service over local area networks (LANs) [ ].

It extends the virtual synchrony concept introduced by ISIS. It exploits the available

hardware broadcasts to achieve high performance. Totem is designed for both real-time

and fault-tolerant applications. It is scalable to multiple LANs but limited to the same

geographical area. Totem relies on synchrony heavily.








Totem and ISIS are primary group membership systems. A primary partition re-

stricts the progress of the system to only the primary partition. By limiting the progress

to only the primary partition, consistency management becomes easier. But since it is

possible that a primary partition cannot be formed in some situation and this situation

is common, the system may never make any progress. This is a serious shortcoming of

this approach.

Because of the various conditions imposed by large-scale networks, not all member

processes in a group can communicate with each other at all times due to communication

failures. Thus a process cannot be assured that a message it sends will be received by

every process in the destination group. Also, a process cannot be assured that a received

message will also be received by other members in the same group. It is very difficult for

a system to make any progress when processes cannot communicate. Thus in large-scale

networks where communication failures are common, a non-primary partition group

membership system is more appropriate. This type of system allows progress to be

made in all partitions, primary or not. This can increase the performance of the system,

as well as service availability. However, these systems lose the ability to maintain strict

consistency at all times.

Transis [ ] is one of the earlier group communication systems that intends to deal

with the network partition problem. Transis is very close to our work except that it tries

to achieve a strong non-primary partition group communication system. The Transis

approach allows operations to happen in many partitions simultaneously and supports

recovery through merging. This allows the system to continue its operations if there are








enough functioning processes. Multicast messages can be diffused gradually. Transis

provides transport-level multicast services.

Relacs is another ongoing research project in group communication [ ]. In Relacs,

each member has a notion of the current view, which is an unordered list of the members.

Each member in the current view is guaranteed either to accept that same view or to

be removed from that view. Messages sent in the current view are delivered to the

surviving members of the current view, and messages received in the current view are

received by all surviving members in the current view. This is called view synchrony,

because all members that can communicate appear to see a failure at the same logical

time, significantly reducing the number of failure scenarios.

Recently, Guerraoui and Schiper proposed a generic consensus service to be used to

solve the group membership problem [ ]. That work focuses on the consensus service

as the basis for solving many agreement problems. However, it covers only the primary

partition group membership. Prisco et al. presented a new formal model of asynchronous

group membership based on the I/O automata. The specification, however, covers only

the safety properties [ ].

By observing that both consensus and group membership require distributed agree-

ment, Lin and Hadzilacos recently showed that failure detectors called exclude oracles

and include oracles can be used to solve the primary partition group membership prob-

lem in asynchronous systems [ ]. However, they considered only the group membership

problem, not the group multicast problem, and they assumed reliable communication.















CHAPTER 3
LARGE-SCALE DISTRIBUTED SERVICES

3.1 System and Failure Models

We often see claims that by assuming a very weak model, such as a highly asyn-

chronous model, it would increase the applicability of the results, such as impossibility

results. The reason used to support this claim is that synchrony is proba-

bilistic. What this means is that no one can really guarantee synchrony in real systems,

obviously due to failures which are probabilistic. Certainly, the less one assumes about

synchrony in a model, the better the chance the model will be applicable to real systems

because synchrony is indeed probabilistic. Thus, the real benefit of using a very weak

model, which is easier to define and justify, is that it is more applicable to real systems

since it assumes very little.

However, being overly pessimistic about our operating environment also means that

we do not take advantage of certain characteristics of the environment that may hold,

though not always, most of the time. If we do not take advantage of this increasing

power of the system, we may have already made a poor engineering design decision from

the very beginning.

For our purpose of building a large-scale distributed system, we do not see the ben-

efit of assuming a very weak model, such as a highly asynchronous model. If a model

is too weak, though it can represent the system, it may fail to capture all essential

characteristics and properties of the systems, especially dynamic ones. Therefore, from








an engineering perspective, using a very weak model is not a good way of solving our

problem. For example, if we were to assume a highly asynchronous model, the infor-

mation about remote process failures obtained through a local failure detector cannot

be considered as reliable, even though it may be accurate most of the time. This gives

us little power to solve our problem as we must rely on knowledge about process and

communication failures. The fact that most of the time failure detectors in large-scale

systems are accurate by correctly reporting failure status of remote processes is not fully

exploited.

By exploiting certain properties that hold most of the time, we can build more ac-

curate failure detectors. Thus, an asynchronous system model with a justifiable amount

of synchrony is the most suitable model for our purpose of system building.

Based on the above observation, we model our system as an asynchronous one. It

consists of a set of communicating processes and an unreliable communication trans-

port network. These processes perform certain tasks defined by distributed protocols.

Message passing is the only form of coordination, synchronization and communication

among processes allowed in our system. Even though the communication transport net-

work is unreliable, we assume that there is a reliable transport network layer on top of

it. All interprocess communications rely on this reliable transport communication layer.

We do assume a bounded message delay for correct messages. If a message is to

arrive at its destination correctly, it must happen within this bound. In real systems,

the network imposes a delay bound for correct messages. Even though we do not have

a delay bound on every message, with this delay bound on correct messages, incorrect

messages will be dropped by the network and automatically become lost messages. If








a message takes longer than this bound in the network, it will eventually be dropped.

Thus, this bound helps a sender determine whether a message got through or whether it

should wait longer for a positive acknowledgment.

We rely on this assumed delay bound on correct messages in our model both for

failure detection and for reliable communications. However, we do not require

that our model be in any way synchronous, which would render it unsuitable for large-scale

systems. In the environment that we deploy our system, the Internet and wide-area

networks, this correct message delay bound assumption is certainly justifiable. We can

enforce this correct message delay bound assumption through the time-to-live parameter

of network packets.
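
As an aside on mechanism, the time-to-live parameter mentioned above is exposed by the
standard sockets API. The fragment below is a hedged sketch of how a hop limit might be
set on a UDP socket; the concrete values (a hop limit of 16 and the example address) are
illustrative and not taken from this dissertation. Note that IP time-to-live is a hop count
rather than a literal time bound, but it still prevents a datagram from circulating in the
network indefinitely.

    import socket

    # Create a UDP socket and bound the number of router hops its datagrams may take.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, 16)   # example hop limit

    # A datagram sent through this socket is either delivered within the hop bound
    # or discarded by the network; it never lingers indefinitely.
    sock.sendto(b"heartbeat", ("192.0.2.1", 9999))          # example destination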

We do not explicitly specify the time limit for a process to move from one state to

the next. This is one of the reasons why our system is qualified as an asynchronous

system. This does not mean that the system can take as much time as is required to

complete a transition from one configuration to another, as is the case of completely

asynchronous systems. On the contrary, this is usually specified through a regular event

such as mutual failure detection. Our mutual failure detection, in a way, requires the

two processes to synchronize this mutual failure detection activity.

We assume a deterministic system. Processes are deterministic in the sense that

either they behave according to their specification or completely stop functioning, a

crash failure semantic. A process fails or crashes if it stops executing the protocol

prematurely. In other words, a process fails if it no longer makes any state transitions

on any inputs as specified by the protocol when it is not at the terminal state of the

protocol yet. A process is non-faulty if it never stops executing the protocol. This








doesn't mean that non-faulty processes don't stop somewhere. Usually when a protocol

terminates, it requires that all processes move to some terminal states and they do not

change their state again, i.e., they move to the terminal state and never come out of it.

We assume a non-persistent process in the sense that a process is not required to save

its state before it crashes, and if it recovers from a crash, it is not required to restore its

state to the one before the crash even though it may choose to do so at the application

level. We assume that it recovers with a new identity but not necessarily with some

predefined initial state.

We assume an unreliable point-to-point transport network. A process can directly

communicate with any process in the system. Packet routing is transparent in our

model and thus irrelevant to our discussion. A point-to-point direct link in our network

is faulty, with respect to a receiving process, if the process does not regularly receive

heartbeats through the link. Otherwise the link is non-faulty, i.e., the receiver regularly

receives heartbeat messages on the link.

By regularly receiving heartbeat messages, we do not mean that it can never miss

any single heartbeat message. Heartbeat missing could be temporary and it's up to

the failure detectors to decide how many consecutive heartbeat messages must be lost

before the link is considered faulty. Not all links will be monitored through this heartbeat

message mechanism. But those that are being monitored must continuously send and

receive heartbeat messages. The state of a link does not survive crashing. In other

words, messages in transit will simply be discarded and must be retransmitted by the

source.
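
A minimal sketch of the per-link bookkeeping just described might look as follows. The
threshold c of consecutive missed heartbeats and the method names are illustrative
assumptions; the detector actually specified in Chapter 4 is more involved.

    class LinkMonitor:
        """Illustrative state for one monitored point-to-point link."""

        def __init__(self, c=3):
            self.c = c              # consecutive misses tolerated before declaring the link faulty
            self.missed = 0         # current run of missed heartbeat periods
            self.faulty = False

        def on_heartbeat(self):
            """A heartbeat message arrived on the link."""
            self.missed = 0
            self.faulty = False

        def on_period_elapsed(self, heartbeat_seen):
            """Called once per heartbeat period; a single miss is tolerated."""
            if heartbeat_seen:
                self.on_heartbeat()
            else:
                self.missed += 1
                if self.missed >= self.c:
                    self.faulty = True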








In our model, a transport link that connects a process p to another process q can be

viewed as a part of q in some sense. The reason is that if q fails, this link also fails, at

least from p's perspective. However, the reverse does not hold. This is what makes our

mutual failure detector works.


3.2 Requirements

We want our DCS to operate over a global wide-area network. Service reliability,

availability and scalability are important requirements in this environment.

Reliability is albv-- important in any kind of service. We want our DCS to provide

reliable services even with frequent network and process failures. As we mentioned

earlier, failures are common in our environment. We need to have a way to detect these

failures and adapt to it in a timely manner.

The kind of service availability that we need is the one that allows DCS clients to

access services from any DCS site whether or not it is the client home site, i.e., the

original DCS site in which the user registered. By completely replicating the DCS

services throughout the entire global network, a client is more likely to be able to access

the DCS services from anywhere regardless of where he is located because the same

services are now available at the point where a DCS client can most conveniently access

them. The DCS must make sure that all services are available to a client wherever and

whenever he needs to access them.

It is not uncommon for a DCS to have thousands of sites providing conferencing

services simultaneously in every part of the world. Our distributed conferencing system

must work well from a campus-wide setting with a few departmental DCS sites to a


























global-scale DCS system with hundreds of thousands of DCS sites. Because growth in

user demand for conferencing services at the global scale cannot be foreseen, and because

increased demand affects the performance of the system, the system must scale well from

a handful of sites to thousands or more sites.


3.3 Reliable Communications

A reliable communication transport network is essential for communicating processes

since it is easier to develop applications on top of a reliable transport network. But if we

were to add reliable transport communication to our group communication subsystem,

it would make group communication unnecessarily complex and more difficult to main-

tain. Therefore, we add a reliable communication layer on top of our unreliable UDP

transport network instead. This shields the group communication subsystem, and thus

the applications from the unreliability of the lower-level transport network. Figure 3.1

depicts the layer architecture used by our system.


Figure 3.1: The layer architecture of our system. [Layers, from top to bottom: Applications;
Membership & Multicast; Reliable Datagram and Failure Detector; Unreliable Datagram (UDP).]
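
To make the layering concrete, a sketch of the interfaces between the layers might look like
the following. The class and method names are illustrative only and do not reflect the actual
DCS implementation; they simply mirror the boxes in Figure 3.1.

    class UnreliableDatagram:
        """Bottom layer: wraps UDP; messages may be lost, duplicated, or reordered."""
        def usend(self, dest, payload): ...
        def urecv(self): ...

    class ReliableDatagram:
        """Middle layer: retransmits until acknowledged (Section 3.3); consults the failure detector."""
        def __init__(self, udp, failure_detector): ...
        def send(self, msg): ...
        def receive(self): ...

    class GroupCommunication:
        """Top layer below the applications: membership and multicast services (Chapter 5)."""
        def __init__(self, reliable_datagram): ...
        def multicast(self, group, msg): ...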








In Figure 3.1, we assume a reliable transport network layer through send() and

receive() primitives. We justify this assumption by providing the protocols necessary to

achieve this.

The following list outlines the steps involved in transforming an unreliable transport

network to a more reliable transport network. Note that our prime goal here is reliability

and because of that, we may need to sacrifice some efficiency here.

When a process p sends a message m to another process q and q does not receive m within the time period tb (the delay bound on correct messages), p can use this fact to conclude that m will never arrive at q, because m will have been dropped by the underlying unreliable transport network. This is valid under our assumption of an upper delay bound tb on correct messages. The question is what an appropriate bound tb is. In our case, we can set this period to the time-to-live of unreliable datagrams (UDP) plus some positive margin.

Out-of-order messages are not an issue, since we do not concern ourselves with the semantics of message delivery ordering here. Message ordering can be done at the group multicast layer, which sits directly on top of the reliable datagram layer. Refer to Figure 3.1 for our layered architecture.

To deal with lost messages, we can tag every message with a unique sequence number. If a message is missing from the sequence, the loss is detected by the sender, since it will not receive a positive acknowledgment for that message, and the sender will retransmit it.

To ensure that a process eventually receives messages sent to it, we use a simple transmit-acknowledge mechanism. If a process p has a message m to send to another process q, it does the following: p periodically sends m to q until it gets a positive acknowledgment m(ack) from q. We can set this period to tb plus some positive margin. But before sending m to q, p queries its local failure detector about the failure status of q. If p and q are mutually responsive, then p proceeds with the transmission of m and waits for m(ack) from q. Otherwise, p holds m until its local failure detector reports that p and q are mutually responsive, and p then starts the transmission of m.

It is possible that m or m(ack) or both may be lost in transit. If m is lost, p will keep sending m periodically (as long as p and q are mutually responsive) until p receives m(ack). If m(ack) is lost (after q has already received m), p cannot know that q has received m, and p will retransmit m to q until it gets m(ack) back from q; q ignores the duplicate copies and keeps sending m(ack). When at least one m(ack) arrives at p, p stops sending m and removes it from the sending buffer.
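To make the periodic retransmission loop concrete, here is a minimal sketch of the sender side in Python. The names u_send, is_responsive and acked are hypothetical stand-ins introduced only for the illustration (an unreliable datagram send, a query to the local failure detector, and the set of message identifiers already acknowledged by q); this is not the dissertation's implementation.

    import time

    T_B = 2.0      # assumed delay bound tb on correct messages (seconds)
    MARGIN = 0.5   # assumed positive margin added to tb

    def reliable_send(msg_id, payload, q, u_send, is_responsive, acked):
        """Retransmit (msg_id, payload) to q until its acknowledgment is seen.

        u_send, is_responsive and acked are assumptions of this sketch: the
        unreliable datagram primitive, the local failure detector query, and
        the set of message ids already acknowledged by q."""
        while msg_id not in acked:
            if is_responsive(q):              # hold m while p and q are not mutually responsive
                u_send((msg_id, payload), q)  # the datagram may be lost; we simply try again
            time.sleep(T_B + MARGIN)          # retransmission period: tb plus a positive margin
        # m(ack) has arrived: m can be dropped from the sending buffer.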

The two reliable transport communication primitives, send() and receive(), are described in Protocols 3.1 and 3.2. These two communication primitives are implemented on top of the unreliable datagram transport protocol, UDP. u_send() and u_recv() are unreliable non-blocking send and unreliable blocking receive, respectively. Notationally, (...) is used to describe the content of a message, i.e., a sequence of message components, and receive(-) means receive any message from any process.

The next lemma shows that a process p can reliably send a message m to another

process q using send(m). Here we assume that q's identity, the receiver of the message,

is encoded in the message m itself.
















Protocol 3.1: For any two processes p and q, p reliably non-blockingly sends a message
m to q by calling send(m).


1: Initially assume an outgoing message buffer outbp = {} at process p

2: // Background thread for sending messages.
3: > outgoing_message_thread():
4:    do forever, periodically:
5:       for each (m, q) in outbp
6:          Query the status of q from p's local failure detector
7:          if p and q are mutually responsive
8:             u_send(m); // send m to q.
9:          else if q is crashed
10:            outbp = outbp - {(m, q)}; // recycle m

11: // Reliably send a message m to q.
12: > send(m):
13:    // Simply add m to outbp.
14:    // The actual transmission occurs in outgoing_message_thread() above.
15:    outbp = outbp + {(m, q)};














Protocol 3.2: For any two processes p and q, p accepts incoming messages, acknowledges
them, and non-blockingly receives a message m from q by calling receive(-).



1: Initially assume an incoming message buffer inbp = {} at process p

2: // Background thread for accepting messages.
3: > incoming_message_thread():
4:    do forever
5:       u_recv(m); q = sender(m); // Wait for an incoming message m from process q.
6:       if m = m'(ack)
7:          outbp = outbp - {(m', q)}; // m' has been delivered; stop retransmitting it.
8:       else
9:          u_send(m(ack)); // Send an acknowledgment back to q.
10:         if (m, q) not in inbp
11:            inbp = inbp + {(m, q)};

12: // Reliably receive a message m from q.
13: > receive(q):
14:    if there exists m such that (m, q) in inbp
15:       inbp = inbp - {(m, q)};
16:       return m;
17:    else
18:       return ⊥; // Empty message








Lemma 3.1. Protocols 3.1 and 3.2 together implement reliable non-blocking send, i.e., for any two processes p and q, a message m sent out by p using send(m) eventually arrives at q, if p does not fail and p and q are eventually mutually responsive.

PROOF: Since we assume that p, the sender, does not fail after m is submitted to p's outgoing message queue outbp (Protocol 3.1, line 15), and p and q are mutually responsive, p must repeatedly send m to q (Protocol 3.1, lines 5-9) as long as m is in the outgoing message buffer. The only way that m can be removed from the queue is when p receives an acknowledgment m(ack) from q (Protocol 3.2, lines 5-8), which indicates that m has already been received by q. □

In the above lemma, it is not necessary that p and q be responsive at the time that p submits m to its outgoing message queue outbp; it is only required that p and q are eventually mutually responsive. Also, the lemma does not consider what would happen if p fails after m has been submitted to p's outgoing message queue but before m arrives at q. In that case, the message will be lost, and a retransmission may never occur if p does not recover, or if p does not retransmit m after it recovers because it either cannot reconstruct m or decides not to retransmit it. We do not address reliable message transmission across process crashes in this work.

The next lemma shows that if a process p uses send(m) to reliably send a message

m to another process q, then q can use receive(-) to reliably receive m. In fact, q can

use receive(-) to reliably receive any message sent to it.

Lemma 3.2. Protocols 3.1 and 3.2 together implement reliable non-blocking receive, i.e., for any two processes p and q, a message m sent out by p using send() is eventually received by q if both p and q do not fail, they are eventually mutually responsive, and q calls the function receive(-) as many times as necessary.

PROOF: From Lemma 3.1, send() is reliable, i.e., a message m from a non-faulty process p will eventually arrive at q, and m is added to the message queue at q, if p and q are eventually mutually responsive. If q is non-faulty and q tries to receive as many times as necessary, then eventually m will be chosen from q's incoming message buffer inbq (Protocol 3.2, lines 14-16). □

Lemmas 3.1 and 3.2 guarantee that a message m eventually arrives at q provided both p and q are eventually mutually responsive. They also guarantee that if a process q tries hard enough to receive m after m has arrived at q (i.e., while m is sitting in inbq), q will eventually receive m. However, if q cannot be reached by p and vice versa, the protocols provide no guarantee about the arrival of m at the receiver q. Also, a message m will be buffered and re-sent by p to q as many times as needed (possibly infinitely often if p is non-faulty).

The two lemmas rest on the assumption that our failure detectors are correct. However, when p's failure detector incorrectly reports that p and q are not mutually responsive, but they are not permanently disconnected, p stops sending m to q but never removes m from its outstanding message sending queue outbp. Since m is never removed from p's outgoing message queue (at least until it is acknowledged by q), if both p and q are indeed responsive, eventually p's failure detector reports that p and q are mutually responsive, and p starts sending m to q again until it receives a positive acknowledgment m(ack) from q.








Now, what if p and q are actually only temporarily non-responsive or temporarily disconnected? Can p still use send(m) to reliably send a message m to q? It is clear from Protocol 3.1 that p stops sending when its local failure detector reports that p and q are not mutually responsive. However, it does not remove m from its outgoing message queue. That means that if q is only temporarily disconnected from p (and most of the time this is the case), then send() does its job well. However, if p's failure detector reports that q is non-responsive or disconnected, i.e., that q has crashed, send() simply stops sending messages to q. We can periodically scan for messages that are safe to remove from p's outgoing message buffer outbp.

Clearly, the send() function in Protocol 3.1 is quiescent, which is very desirable in large-scale systems, as it helps reduce the number of messages that are very unlikely to reach their destination.

The two protocols only guarantee that messages eventually arrive at the destination process, i.e., that they are put into the incoming message buffer of that process. They may or may not be received or consumed by the process, i.e., taken out of the incoming message buffer. But if the process tries to receive incoming messages infinitely often, i.e., it does not crash, then all arriving messages will be consumed by the process. Note that to guarantee that all incoming messages are consumed by calling receive(), the process needs to survive only process failures, not network failures, because we need to consider only messages that are already in the incoming message buffer of the process.

We can increase the reliability of the transport network further by ensuring that

all accepted messages will eventually be received or consumed by a receiving process.

We can do this by writing all arriving messages to some stable storage before we send








back positive acknowledgments to the sending process. This will be useful in certain

situations, especially failure recovery. However, this is not the focus of our work.

Note that the above reliable communication protocols only guarantee eventual delivery of messages if the two communicating parties are mutually responsive. They do not, however, guarantee the order of the messages. If a process p sends a message m to a process q before it sends another message m' to q, it may or may not be the case that q receives m before it receives m'. The order of messages can be guaranteed, however, by imposing a message delivery semantic at the level of group multicast in the group communication layer. We also assume that if a message gets through to a receiver, its content will never be corrupted, either intentionally or unintentionally.















CHAPTER 4
MUTUAL FAILURE DETECTION

In the previous chapter, we defined our system model based on certain timing parameters that exist in large asynchronous systems, especially ones operating on the Internet. Specifically, we set a delay bound for correct messages. In this chapter, we use this bound to help us achieve accurate and reliable mutual failure detection between processes.


4.1 Failures

Message-passing communications among processes over a wide-area network experience delayed and lost messages. When communication delays are excessive, delayed messages will be dropped according to our correct-message delay bound constraint. Thus, the message delay and lost message problems are really the same problem under our assumption of a bounded delay on correct messages.

Transient communication failures are common in all networks, large or small. Process crashes are also common. But because, in our setting, a process is a highly resilient one (such as a highly available DCS server), it is less likely to fail. Still, processes can and do fail.

We are interested in the ability to tell whether a remote process is responsive or not.

A remote process is responsive if there is evidence that shows that it can respond in a

timely fashion to requests from a local process. Processes can be non-responsive, which








happens when the two processes cannot determine the responsiveness of one another for

some extended period of time.

In a system in which communication is not perfect, being responsive does not mean that there are no lost or delayed messages. However, the lost and delayed messages are only a very small fraction of all messages. System health and the level of system responsiveness ultimately determine how small this fraction of the messages should be. In a highly responsive and highly synchronous system, missing a continuous series of messages for a few seconds could mean a disaster. But in very large asynchronous systems, this is common and expected. Thus, at a very large scale, it does not make much sense to focus on the loss of a few randomly lost or delayed messages.

As stated in the previous chapter, we consider only process communication at the transport level, where a communication channel is simply an abstraction attached to a process. In other words, a channel does not exist by itself but only together with the process that uses it. With processes augmented by abstract transport channels, the communication mode between a pair of processes is not necessarily symmetric. In other words, a failure of the transport link from p to q is independent of a failure of the transport link from q to p.


4.2 Failure Detection

Essentially, failure detection is a mechanism for continuously monitoring remote

process failure status and/or its ability to communicate with a local process. A failure

detector is an implementation of a failure detection mechanism. It provides a list of

processes that it suspects to have failed. The list of suspects is dynamic. By definition,








an unreliable failure detector can make mistakes. In other words, it may wrongly suspect

that a process has failed when, in fact, it has not.

A failure detector adds a process to its list of suspects if it suspects that the process is faulty. Later, the failure detector may remove the process from the list if it believes otherwise. This behavior of adding and then removing suspects is typical of unreliable failure detectors. The more reliable a failure detector is, the fewer mistakes it makes. The failure detectors in which we are interested are completely distributed, autonomous, and not completely reliable. Thus, at any given time, it is not necessary for any two failure detectors to contain the same list of suspects.

We are only interested in time-varying failures. The failure status of a process, whether or not it is augmented by transport communication links, is time-varying. By the time a local failure detector reports that a remote process is not responsive, this information may no longer be true. Thus, any failure detector that detects time-varying failures of remote processes is inherently unreliable: the failure status of a remote process could already have changed by the time the failure detector draws a conclusion about that status based on the information it has.

Our failure detection scheme is based on the fact that a non-faulty process with a

non-faulty transport link should be responsive to an outside excitation, such as receiving

some command message that needs a response. The responsiveness of the system to

outside excitations is used to determine the failure status of remote processes.

Messages never traverse any physical network indefinitely when they are being routed

to their destination. Depending on the characteristics and other statistical performance








and usage parameters of the network, a bound must be set to restrict the life span of

network packets. This bound is usually known as time-to-live (TTL). It is this TTL

parameter, along with an appropriate mechanism, that we use to help us measure the

responsiveness of remote processes.

Specifically, our failure detection scheme uses time-outs, which are, in turn, based on

the network TTL parameter. The use of time-out based failure detection is independent

of the application that relies on the outputs of failure detectors, i.e., the list of remote

process failure status. In our system, timing dependency is limited to failure detection

only, and it does not imply that our upper-level protocols or our applications that rely on

failure status information depend highly on this timing restriction for them to function

properly.

Using not-so-perfect failure detectors does imply something about how the information should be used and what needs to be done in situations where failure detectors are wrong, i.e., reach incorrect conclusions about the failure status of other processes. What we need is a way to minimize the impact of such incidents. Upper-level protocols that rely on information from failure detectors should always realize that the information they have about the failure status of other processes could be wrong. For example, in the last chapter, we saw that the function send() understands that its local failure detector could be wrong about a process being mutually responsive. Therefore, it never takes any unrecoverable action based on that information. Rather, it uses that information as a general guideline for what it should do, while remaining ready to take other appropriate actions in case that information is incorrect.









4.3 Mutual Failure Detection

Allowing a process to take an arbitrary amount of time to complete a single step is far from realistic, even in large asynchronous systems. We may need to set some limit on how much time a process can use to complete a step. If a process repeatedly violates this time constraint, we say that the process is faulty. In real systems, setting a limit on the time a process takes to complete a step is extremely difficult, since we have no idea what exactly a step would be in general. Also, each step could take a very different amount of time to complete, which is why most general-purpose systems are asynchronous. Therefore, it makes no sense to impose synchrony between steps taken by different processes.

Instead of defining a step and a time limit to complete a step, and then using this time constraint to help us detect failures, we seek a simpler and more practical failure detection approach for large asynchronous systems. Specifically, for the purpose of failure detection, we require that a process regularly exchange heartbeats with other processes. This regular exchange of heartbeats serves as an indication to the other processes that the process is still responsive, even though we do not know whether or not it is really doing its job correctly. For the purpose of failure detection, if a process is sending out its heartbeats regularly, then that is enough for the other processes to decide, based on these heartbeats alone, that the process emitting regular heartbeats is functioning and thus non-faulty. Figure 4.1 depicts our heartbeat mutual failure detection scheme.

Mutual heartbeat failure detection requires that a process must periodically send

out heartbeats to the monitoring processes regardless of what it is doing. If a process












Figure 4.1: Mutual heartbeat failure detection. The panels show two processes A and B exchanging heartbeat values (v, v + 1) in the mutually responsive, non-mutually responsive, mutually non-responsive, and crashed cases.








can maintain regular heartbeats, then we say it is responsive. Otherwise, it is non-responsive. It is reasonable to do failure detection this way simply because, if we always had to check at every step of the protocol execution whether or not a remote process is faulty, failure detection would probably cost more than the real job of the process.

Since we cannot guarantee that all heartbeats will be received precisely at the time that they are supposed to be, and some of them may even get lost, we can only deal with these possibilities in our failure detection scheme and try to accommodate the inherent imperfections of the scheme within our system. We define our failure detection protocol in a way that allows some margin of consecutively missing heartbeats. Delayed heartbeat messages should never arrive at the monitoring process, because we normally set the period of heartbeats to be higher than the time-to-live parameter of the transport network.

Mutual failure detection is very useful for large-scale distributed services with reliability and service availability goals. A process needs to know whether or not it can mutually communicate with other processes. Mutual failure detectors provide information that helps processes in a large process group over a wide-area network adapt to transient failures quickly.

As an example, consider our group membership. A process must find out whether or not its parent process is still accessible. At the same time, the parent relay needs to know whether or not it can effectively communicate with all of its children, because it is the responsibility of the parent process to multicast the messages it receives to all of its children within the same group. For the purpose of maintaining service reliability and availability, as soon as a process concludes that its parent process is no longer accessible, it must find a new parent process and rejoin the group at some other point. While the parent and its child are non-mutually responsive, all multicast messages must be buffered until communication recovers or until the child decides to disconnect itself from its parent.

Since the information that two processes with their transport links are mutually responsive is strictly relative to an observer, this information is non-transferable to other observers. In other words, if two processes are engaged in mutual heartbeat failure detection, whatever the result of this monitoring may be, it cannot be passed on directly and used as is by a third process.

Because of the non-transferable property of mutual failure status, we will have to

find a way to reduce the amount of failure detection needed in large-scale systems. A

tree configuration of a process group is exactly what we can use here. It reduces the

amount of failure detection that needs to be done while maintaining the level of

failure detection accuracy that we need.

Reliable and accurate large-scale failure detection over a wide-area network is difficult and incurs a very high cost because of the number of processes involved in the detection scheme. An appropriate detection scheme must be devised to minimize the number of heartbeat messages. Fortunately, our architecture lends itself to a very efficient failure detection scheme.

Because we use a hierarchical group structure (see Chapter 5), a process needs to deal only with its parent process and its immediate children in its tree group. By dealing only with our immediate neighbors, i.e., the parent and all children in a group,











Figure 4.2: Process p sends a message to process q. The message takes d1 time delay before it reaches q. q responds to the message within deadline Δ − (d1 + d2) with some probability r. The response from q takes d2 time delay before it reaches p.

we manage to gain more synchrony (though probabilistically) out of the asynchronous system. This gain in synchrony helps in failure detection, because failure detection tends to be more accurate when more synchrony is available. Also, since we deal only with our immediate neighbors, we do not periodically flood the entire system with a large number of heartbeats.


4.4 Mutual Heartbeat Failure Detector

4.4.1 Definitions

Let p and q be two processes, a be the transport link from p to q, and b be the transport link from q to p. In the following, all probabilities are with regard to process p.

We say that q is (r, Δ)-responsive with respect to p if q replies to a request of p within deadline Δ with probability r.








Let Pa be the reliability of link a, Pb be the reliability of link b, and Pq be a local responsiveness measure of process q. We define Pq to be the probability that q will process a message within a period of Δ − (d1 + d2), where d1 is the delivery delay on link a for the message from p to q and d2 is the delivery delay on link b for the response from q to p. Therefore, Pfa is the probability that link a fails (its faulty rate), Pfb is the probability that link b fails (its faulty rate), and Pfq is the probability that process q is "not fast enough" (its faulty rate):

    Pfa = 1 - Pa;    Pfb = 1 - Pb;    Pfq = 1 - Pq

The probability that p will not receive a response from q within time Δ is

    Pf(q)(Δ) = Pfa + (1 - Pfa)Pfq + (1 - Pfa)(1 - Pfq)Pfb
             = Pfa + Pfq - PfaPfq + (1 - Pfa - Pfq + PfaPfq)Pfb
             = Pfa + Pfq - PfaPfq + (Pfb - PfaPfb - PfqPfb + PfaPfqPfb)
             = Pfa + Pfq + Pfb - PfaPfq - PfaPfb - PfqPfb + PfaPfqPfb

We define only two states of a process: R for the process being responsive and N for the process being non-responsive. A process q is in state R if

    Ps(q) = 1 - Pf(q) ≥ r

and q is in state N if

    Ps(q) < r
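As a quick numerical check of this expression, the sketch below evaluates Pf(q)(Δ) directly from the three component faulty rates in Python; the sample values are made up for illustration and are not taken from the dissertation.

    def p_fail(p_fa, p_fb, p_fq):
        """Probability that p gets no response from q within Delta: the request
        is lost on link a, or q is too slow, or the reply is lost on link b
        (independent events, as in the derivation above)."""
        return p_fa + (1 - p_fa) * p_fq + (1 - p_fa) * (1 - p_fq) * p_fb

    # Illustrative numbers: 2% loss on each link, q too slow 1% of the time.
    p_f = p_fail(0.02, 0.02, 0.01)
    p_s = 1 - p_f                       # Ps(q)
    r = 0.9                             # assumed responsiveness threshold
    print(round(p_f, 6), "R" if p_s >= r else "N")   # 0.049204 R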








Let Pq(report R) be the probability that q is "reported" to be in state R by p's

failure detector, and Pq(report N) be the probability that q is "reported" to be in state

N by p's failure detector.


4.4.2 Failure Detector Machine and Its Operations

Figure 4.3 shows a state machine representation of our mutual failure detector. In the figure, the two states R and N represent the responsive and non-responsive states of a monitored process, respectively. The states RN and NR are internal to the mutual failure detector. We require only two state transitions, namely H' > H and ¬(H' > H). Table 4.1 explains the meaning of these states and transitions.

The state machine in Figure 4.3 is a variant of a standard state machine because, in every state except the R and N states, we assume that there is a counter inside the state. This counter records the number of consecutive events of interest to that particular state. For example, in the NR state, as long as the event H' > H has not occurred consecutively at least ρ − 1 further times, the failure detector stays in this state. But after the ρth consecutive H' > H event, the failure detector moves to the R state and this counter is reset to 0.

The parameters ρ and ν in Figure 4.3 are used to modify the behavior of our mutual failure detector. For example, if we increase the parameter ν of the machine, it takes longer for a process to be classified as non-responsive. This also implies that the detector becomes less responsive to changes in the state of the process it is monitoring.





















Figure 4.3: A failure detection machine. The states R and N represent the responsive and non-responsive states of a monitored process, respectively; RN and NR are internal intermediate states. The transitions are driven by the events H' > H and ¬(H' > H): ν − 1 further ¬(H' > H) events move RN to N, and ρ − 1 further H' > H events move NR to R.
















Table 4.1: The states and transitions of the mutual failure detector depicted in Figure 4.3, assuming that p is the monitoring process and q is the monitored process. States in parentheses are the states that the failure detector reports for q.

MFD State       Meaning
R               q is suspected to be responsive by p.
                The failure detector reports the R state (responsive) for q.
N               q is suspected to be non-responsive by p.
                The failure detector reports the N state (non-responsive) for q.
RN              q is suspected to be less responsive to p's messages.
                The failure detector reports the R state for q.
NR              q is suspected to be more responsive to p's messages.
                The failure detector reports the N state for q.

MFD Transition  Meaning
H' > H          The incoming heartbeat value H' from q is greater than the
                current heartbeat value H (modulo M) of q at p.
¬(H' > H)       Either no heartbeat arrives from q, or the incoming heartbeat
                value H' from q is less than or equal to the current heartbeat
                value H (modulo M) of q at p.








In our mutual failure detection scheme, every heartbeat message contains an integer value called a heartbeat value. A heartbeat value v in a heartbeat message is monotone non-decreasing modulo M for some value M, i.e., 0 ≤ v < M.

The rule for engaging in mutual heartbeat failure detection between two processes is simple: any process that receives a heartbeat message with a heartbeat value v from some process is required to respond to this heartbeat by returning to the sender a heartbeat with the value v + 1 (modulo M).

Based on this rule of mutual failure detection, an incoming heartbeat value from a remote process is always either equal to or larger than the last heartbeat value that the local process sent out. The only exception to this rule is when a process holds a heartbeat value of M − 1. In that case, it is expected that the incoming heartbeat value will be either M − 2 or 0 (due to the modulo M arithmetic). Every heartbeat value is initially set to 0.
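The heartbeat rule fits in a few lines of Python; the sketch below is only an illustration, with M chosen arbitrarily and the half-range comparison being one plausible way (not spelled out in the text) of treating the wrap from M − 1 to 0 as an increase.

    M = 1 << 16   # assumed modulus for heartbeat values

    def respond(v):
        """Rule of engagement: answer a heartbeat value v with v + 1 (modulo M)."""
        return (v + 1) % M

    def is_newer(h_new, h_cur):
        """An H' > H test under modulo-M arithmetic: h_new counts as newer if it
        is ahead of h_cur by less than half the value space, so the wrap from
        M - 1 to 0 is still treated as an increase (an assumption of this sketch)."""
        return 0 < (h_new - h_cur) % M < M // 2

    assert respond(M - 1) == 0       # wrap-around case
    assert is_newer(0, M - 1)        # 0 counts as "greater than" M - 1 after the wrap
    assert not is_newer(M - 1, 0)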

We do not know whether a transition from the R state to the N state is caused by a transient or a permanent failure. If the failure is transient and we jump too quickly to the N state, we could reach a wrong conclusion about the responsiveness of the remote process. Recall that DCS servers are highly available synchronous servers. It therefore makes more sense to be optimistic when we do failure detection: a process is not likely to be faulty, since we have already made it highly resilient.

Increasing the detection phase transition period does not necessarily lead to more accuracy in mutual failure detection. It may, in fact, hinder the ability to detect changes in the failure status of remote processes quickly. We must also be careful not to end up oscillating between states.








The behavior of the failure detector can be modified by changing its two parameters, ν and ρ. For the transition to the N state, ν is the number of consecutive missing heartbeats or non-increasing heartbeat values received by the monitoring process before it decides that the monitored process is non-responsive. For the transition back to the R state, ρ is the number of consecutive, monotonically increasing (modulo M) heartbeat values exchanged between the two processes before the monitored process is again reported as responsive.

Based on the changes in the heartbeat values sent and received between processes, for each monitored process a failure detector reports one of the following two process states:

R (responsive): When p and q can regularly exchange heartbeats, we say that the two processes are mutually responsive. For our detector to report a mutually responsive state, the two processes must have gone through a series of continuous exchanges of increasing heartbeat values. A local failure detector at p reports that a remote process q is mutually responsive if the current state of its local detector for q is either R or RN.

N (non-responsive): Since we assume that every transport communication link is a part of some process, an asymmetric communication mode must also be assumed. In other words, communications from p to q are independent of communications from q to p.

Suppose processes p and q are both correct and are exchanging heartbeats when a communication disruption occurs and q loses its ability to communicate with p, but not the other way around. Then, while q is regularly receiving heartbeats from p, no heartbeat from q ever arrives at p. Since it is possible that the communication disruption from q to p is transient, and since q is non-faulty, q keeps trying to send heartbeats back to p. q keeps sending the same heartbeat value H + 1 (modulo M) to p, since it always receives the same heartbeat value H from p. Because q receives a constant heartbeat value from p, it knows that none of the heartbeats it sent to p has ever reached p. After ν consecutive, constant heartbeats from p, q's detector identifies p as non-responsive. In the meantime, p has not received heartbeats from q for a while; after ν consecutive missing heartbeats, p's failure detector also reports that q is non-responsive.

A local failure detector at p reports that a remote process q is non-responsive if the current state of its local detector for q is either N or NR.

Our mutual failure detector does not report intermediate transition states. For example, suppose that p has its local detector monitoring q, and the detector is reporting to p that p and q are currently mutually responsive. If, after a while, q either does not increase its heartbeat but sends back many heartbeats with the same value, or does not send anything back to p at all, then p's detector moves q's state to RN, which means that the detector suspects that q may soon become non-responsive. But since q may only be experiencing a transient network failure, p's detector cannot commit to that change yet. While q's state at p is RN, p's detector reports to p (if p queries its detector) that p and q are still mutually responsive. Only once p's detector changes q's state from RN to N does it report that q is non-responsive.
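The way the ν and ρ counters drive the R, RN, NR and N states can be restated compactly in code. The sketch below is a plain re-statement of Figure 4.3 and Table 4.1 in Python, not the dissertation's implementation, and the default counter values are arbitrary.

    class MutualFailureDetectorState:
        """Per-monitored-process state machine of Figure 4.3 (a sketch).

        nu  = consecutive bad events (missing or non-increasing heartbeat)
              needed to move from R, through RN, to N.
        rho = consecutive good events (H' > H) needed to move from N,
              through NR, back to R."""

        def __init__(self, nu=3, rho=3, initial="N"):
            self.nu, self.rho = nu, rho
            self.state = initial          # one of "R", "RN", "NR", "N"
            self.count = 0                # consecutive events of interest

        def observe(self, heartbeat_increased):
            if self.state in ("R", "RN"):
                if heartbeat_increased:
                    self.state, self.count = "R", 0       # H' > H: back to R
                else:
                    self.state = "RN"                     # suspected less responsive
                    self.count += 1
                    if self.count >= self.nu:             # nu-th bad event in a row
                        self.state, self.count = "N", 0
            else:                                         # "N" or "NR"
                if not heartbeat_increased:
                    self.state, self.count = "N", 0       # still non-responsive
                else:
                    self.state = "NR"                     # suspected more responsive
                    self.count += 1
                    if self.count >= self.rho:            # rho-th good event in a row
                        self.state, self.count = "R", 0

        def reported(self):
            # Per Table 4.1: R and RN report responsive; N and NR report non-responsive.
            return "responsive" if self.state in ("R", "RN") else "non-responsive"

With ν = ρ = 1 the machine simply mirrors the last observation, which is exactly the over-eager behavior that larger counter values are meant to dampen.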











Figure 4.4: A continuous series of certain events forces the failure detector to change the state of a process. In the figure, a series of c consecutive missing heartbeats from q, beginning at the point where p starts to miss them, makes p's failure detector change the state of q from R to N.


4.4.3 Analysis of Mutual Failure Detector

Let p and q be two processes. The following analysis of our mutual failure detector is based on Figure 4.4, and all probabilities are with regard to process p.

Let Pch(Pe, c, k) be the probability that the failure detector changes the state of q on exactly the kth try, where Pe is the probability that an expected event occurs and c, c > 0, is the number of consecutive expected events required to change the state of q.









We have

    Pch(Pe, c, 1)     = 0
    Pch(Pe, c, 2)     = 0
    ...
    Pch(Pe, c, c)     = Pe^c
    Pch(Pe, c, c + 1) = (1 - Pe) Pe^c
    Pch(Pe, c, c + 2) = (1 - Pcum(Pe, c, 1))(1 - Pe) Pe^c
    Pch(Pe, c, c + 3) = (1 - Pcum(Pe, c, 2))(1 - Pe) Pe^c

where

    Pcum(Pe, c, k) = Σ_{j=1}^{k} Pch(Pe, c, j)

is the probability that the failure detector changes the state of q no later than the kth try.

In general, if 0 < c < k, then

    Pch(Pe, c, k) = (1 - Pcum(Pe, c, k - c - 1))(1 - Pe) Pe^c

and, for k = c,

    Pch(Pe, c, k) = Pe^c

Otherwise, for 0 < k < c,

    Pch(Pe, c, k) = 0








Let EDT(Pe, c) be the expected dwell time before the failure detector changes to a new non-intermediate state for a given c and Pe:

    EDT(Pe, c) = Σ_{j=1}^{∞} j Pch(Pe, c, j)

Thus,

    EDTN(Pe, ρ) = Σ_{j=1}^{∞} j Pch(R)(Pe, ρ, j)

and

    EDTR(Pe, ν) = Σ_{j=1}^{∞} j Pch(N)(Pe, ν, j)

The probability that q is "reported" to be in state R and the probability that q is "reported" to be in state N by p's failure detector are

    Pq(report R)(Pe, ρ, ν) = EDTR(Pe, ν) / (EDTR(Pe, ν) + EDTN(Pe, ρ))

    Pq(report N)(Pe, ρ, ν) = EDTN(Pe, ρ) / (EDTR(Pe, ν) + EDTN(Pe, ρ))

Note that if Ps is the probability that q is responsive, then Pe for EDTR is 1 − Ps and Pe for EDTN is Ps.
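These quantities are easy to evaluate numerically. The sketch below (an illustration only) computes Pch from the recurrence as reconstructed above, truncates the infinite EDT sums at a maximum try exactly as the graphs in this section do, and then forms the reporting probabilities.

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def p_ch(p_e, c, k):
        """Probability of a state change on exactly the k-th try, given that c
        consecutive expected events (each with probability p_e) are required."""
        if k < c:
            return 0.0
        if k == c:
            return p_e ** c
        p_cum = sum(p_ch(p_e, c, j) for j in range(1, k - c))   # Pcum(Pe, c, k - c - 1)
        return (1.0 - p_cum) * (1.0 - p_e) * p_e ** c

    def edt(p_e, c, max_try=200):
        """Expected dwell time, with the infinite sum truncated at max_try."""
        return sum(j * p_ch(p_e, c, j) for j in range(1, max_try + 1))

    def report_probs(p_s, nu, rho, max_try=200):
        """Approximate P(report R) and P(report N) for responsiveness probability p_s."""
        edt_r = edt(1.0 - p_s, nu, max_try)   # leaving R needs bad events, probability 1 - p_s
        edt_n = edt(p_s, rho, max_try)        # leaving N needs good events, probability p_s
        return edt_r / (edt_r + edt_n), edt_n / (edt_r + edt_n)

    print(report_probs(0.9, nu=3, rho=3))     # q responds in time 90% of the time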

Based on the state machine representation of our mutual failure detector, we want to analyze its responsiveness and its accuracy.

From the definition of our failure detector, ν is the number of consecutive events required before the failure detector changes from state R to state N, and ρ is the number of consecutive events required before the failure detector changes from state N to state R.

Let Ps(q) be the probability that q succeeds in responding to a message from p within time Δ, and Pf(q) = 1 − Ps(q) be the probability that q fails to respond to a message from p within time Δ. For convenience, let

    φ = Pf(q) = 1 − Ps(q)

If Pcum(N)(φ, ν, k) is the probability that the failure detector changes the state of q from R to N no later than the kth try and Pch(N)(φ, ν, k) is the probability that the failure detector changes the state of q from R to N on exactly the kth try, then

    Pcum(N)(φ, ν, k) = Σ_{j=1}^{k} Pch(N)(φ, ν, j)

where

    Pch(N)(φ, ν, k) = 0                                              for 0 < k < ν
                    = φ^ν                                            for k = ν
                    = (1 − Pcum(N)(φ, ν, k − ν − 1))(1 − φ) φ^ν       for 0 < ν < k


We define the following properties for our mutual failure detector:

Responsiveness: This property indicates how fast the failure detector is with respect to changing its reported state of q when q has changed its state. It is described by the EDT parameter: a high EDT means the detector resists changing the state of q more than a lower EDT does.

Accuracy: This property indicates how accurate the failure detector is. For each value of Ps(q), we would like P(report R) to be as close to Ps(q) as possible.













Figure 4.5: The failure detector initially classifies a remote process to be in some state (N in this case). Then it changes q's state to R when the condition to do so has been met. Later on, it may change q's state back to N again. The cycle repeats indefinitely (depending upon the probabilities of success and failure of q with respect to p). The figure shows the alternating periods N1, R1, N2, R2, ... with lengths Π(Ri) and Π(Ni).

4.4.4 Failure Detector Simulation

We ran a simulation to study the actual behavior of the failure detector. The parameters ρ and ν used in the simulation range from 1 to 12. The probability of success Ps of q at responding to p's heartbeat messages increases from 0.0 to 1.0 in increments of 0.1. For each combination of Ps, ρ and ν, a simulation has 10 runs and each run has 10^6 tries (steps); thus, the total number of tries Σ for any particular set of values of Ps, ρ and ν is 10^7. The initial state of q at p's failure detector is random, with an unbiased 50% chance of being either the R or the N state.

The calculations of the probability of reporting a state (P(report)) by the failure detector and of the average dwell time (ADT) are as follows.

The probability of the failure detector reporting the R state is

    P(report R) = Σ(R) / Σ

where Σ(R) is the number of tries in which the failure detector reports the R state. Similarly, the probability of the failure detector reporting the N state is

    P(report N) = Σ(N) / Σ

where Σ(N) is the number of tries in which the failure detector reports the N state.

The average dwell time in the R state, ADTR, is

    ADTR = (1 / X(N)) Σ_{i=1}^{X(N)} Π(Ri)

where X(N) is the number of times the failure detector changes the state from R to N and Π(Ri) is the ith period in which the failure detector reports the R state. See Figure 4.5. Similarly, the average dwell time in the N state, ADTN, is

    ADTN = (1 / X(R)) Σ_{i=1}^{X(R)} Π(Ni)

where X(R) is the number of times the failure detector changes the state from N to R and Π(Ni) is the ith period in which the failure detector reports the N state.
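The simulation just described fits in a few lines. The sketch below is an illustration rather than the code used for the figures; it folds the RN and NR states into the reported R and N states, which is what the reporting rule does anyway.

    import random

    def simulate(p_s, nu, rho, tries=10**6, seed=1):
        """Monte Carlo estimate of P(report R), P(report N), ADTR and ADTN for a
        detector that needs nu consecutive failures to leave R and rho consecutive
        successes to leave N (a sketch, not the thesis implementation)."""
        rng = random.Random(seed)
        state = rng.choice("RN")              # unbiased 50% initial state
        streak = 0                            # consecutive events of interest
        ticks = {"R": 0, "N": 0}              # tries spent reporting each state
        changes = {"R": 0, "N": 0}            # transitions into each state
        for _ in range(tries):
            good = rng.random() < p_s         # heartbeat arrived and increased
            if state == "R":
                streak = 0 if good else streak + 1
                if streak >= nu:
                    state, streak = "N", 0
                    changes["N"] += 1
            else:
                streak = streak + 1 if good else 0
                if streak >= rho:
                    state, streak = "R", 0
                    changes["R"] += 1
            ticks[state] += 1
        adt_r = ticks["R"] / max(changes["N"], 1)   # average length of an R period
        adt_n = ticks["N"] / max(changes["R"], 1)   # average length of an N period
        return ticks["R"] / tries, ticks["N"] / tries, adt_r, adt_n

    print(simulate(0.9, nu=3, rho=3, tries=10**5))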


4.4.5 Analytical and Simulation Results

The graphs in Figures 4.6 to 4.9 show the plots of a direct probabilistic analysis of the failure detector. Figures 4.6 and 4.7 show the probability of the failure detector reporting the R and N states, respectively.

The P(report R) and P(report N) obtained through the direct probabilistic analysis are somewhat different from those obtained through simulation (see Figures 4.10 to 4.15 and Figures 4.16 to 4.21), especially for small values of ρ and ν. With larger values of ρ and ν, the results from the direct probabilistic analysis and the simulation

































Figure 4.6: Probability of the failure detector reporting the R state, P(report R), for fixed ρ and varied ν, as obtained by a direct probabilistic analysis. For lower ν, the graph is lower.




































Figure 4.7: Probability of the failure detector reporting the N state, P(report N), for fixed ν and varied ρ, as obtained by a direct probabilistic analysis. For lower ρ, the graph is lower.

























Figure 4.8: Expected dwell time in R state, EDTR, of failure detector for c = 3. The
top graph uses a maximum try of 30 and the bottom graph uses a maximum try of 10.


are quite similar. We believe that these differences occur because the EDT values that we used in the calculation of P(report R) and P(report N) are only a very rough estimation of EDT. We will see below that the maximum number of tries used in the calculation of EDT is much lower than what it should be, which is infinity.

Figures 4.8 and 4.9 show the expected dwell time, EDT, of the failure detector for a fixed c value, c = 3 in this case. With a different value of c, the graph looks similar, but the point where the graph intersects the y-axis (EDT) is the value of c.

From

    EDT(Pe, c) = Σ_{j=1}^{∞} j Pch(Pe, c, j)

































Figure 4.9: Expected dwell time in N state, EDTN, of failure detector for c = 3. The
top graph uses a maximum try of 30 and the bottom graph uses a maximum try of 10.








the exact EDT calculation requires a sum to infinity. Thus, we must limit j from 1 to some maximum try value, 30 in our graphs. Because of this, the EDT at Pe = 0 is always 0. In the EDTR graph, Figure 4.8, when we increase the maximum try from 10 to 20 and to 30, we see that the EDTR values go up and the peaks move to the right, as we expect. Therefore, if we were to sum Pch(Pe, c, j) up to a much higher maximum try, the EDTR at Pe = 0 would approach infinity, as we expect. The case for EDTN, Figure 4.9, is similar. The EDTN values increase with larger values of the maximum try, but the peaks now move to the left, as they should. We anticipate that the expected dwell time EDT and the average dwell time ADT (Figures 4.22 to 4.25) should be similar, and that is the case here.

The graphs in Figures 4.10 to 4.25 show the actual behavior of our failure detector according to the simulation. The graphs are highly symmetric, which is expected, as our failure detector is designed to behave that way.

















Figure 4.10: Probability of the failure detector reporting the R state, P(report R), for fixed ρ and varied ν. For lower ν, the graph is lower.
















Figure 4.11: Probability of the failure detector reporting the R state, P(report R), for fixed ρ and varied ν. For lower ν, the graph is lower.























Figure 4.12: 3-D surface plot of the probability of the failure detector reporting the R state, P(report R), for the probability of success of q (Ps) with all combinations of ρ and ν.






















Figure 4.13: 3-D surface plot of the probability of the failure detector reporting the R state, P(report R), for the probability of success of q (Ps) with all combinations of ρ and ν.






















Figure 4.14: 3-D surface plot of the probability of the failure detector reporting the R state, P(report R), for the probability of success of q (Ps) with all combinations of ρ and ν.
































Figure 4.15: 3-D surface plot of the probability of the failure detector reporting the R state, P(report R), for the probability of success of q (Ps) with all combinations of ρ and ν.

















Figure 4.16: Probability of the failure detector reporting the N state, P(report N), for fixed ν and varied ρ. For lower ρ, the graph is lower.






















Figure 4.17: Probability of the failure detector reporting the N state, P(report N), for fixed ν and varied ρ. For lower ρ, the graph is lower.




















Figure 4.18: 3-D surface plot of the probability of the failure detector reporting the N state, P(report N), for the probability of success of q (Ps) with all combinations of ν and ρ.






















Figure 4.19: 3-D surface plot of the probability of the failure detector reporting the N state, P(report N), for the probability of success of q (Ps) with all combinations of ν and ρ.




















Figure 4.20: 3-D surface plot of the probability of the failure detector reporting the N state, P(report N), for the probability of success of q (Ps) with all combinations of ν and ρ.






































Figure 4.21: 3-D surface plot of the probability of the failure detector reporting the N state, P(report N), for the probability of success of q (Ps) with all combinations of ν and ρ.























































Figure 4.22: Average dwell time in the R state, ADTR, of the failure detector as a function of ρ and ν; ρ is fixed and ν is varied. Note that in all ADTR graphs, the ADTR values are plotted on a logarithmic scale. The upper end point of each line in ADTR is infinity (not shown). For lower ν, the graph is lower.


Figure 4.23: Average dwell time in the R state of the failure detector as a function of ρ and ν; ρ is fixed and ν is varied. For lower ν, the graph is lower.


Figure 4.24: Average dwell time in the N state, ADTN, of the failure detector as a function of ν and ρ; ν is fixed and ρ is varied. Note that in all ADTN graphs, the ADTN values are plotted on a logarithmic scale. The upper end point of each line in ADTN is infinity (not shown). For lower ρ, the graph is lower.


0.8 1


0.2 0.4 0.6
Ps


0.2 0.4 0.6
Ps


0.8 1


0.8 1


















106


N 4
10



02


z
o10 1



10-2
0 0.2 0.4 0.6 0.8
Ps


0 0.2 0.4 0.6 0.8
Ps


106


N 4
10

-E
06102


z
L 100



10-2
0 0.2 0.4 0.6
Ps


106



104



102


0
o10
LJ
w


10-2
0 0.2 0.4 0.6
Ps


0 0.2 0.4 0.6
Ps


106



S104



102


0
S10
LU


10-2
0.8 1 0


0.2 0.4 0.6
Ps


Figure 4.25: Average dwell time in N state of failure detector as a function of v and p;
v is fixed and p is varied. For lower p, the graph is lower.


0.8 1


0.8 1


0.8 1















CHAPTER 5
MEMBERSHIP AND MULTICAST

5.1 Group Communication

Essentially, we want to disseminate messages to a group of processes over a large

geographical area, also known as multicasting. To support a multicast service to a

group of processes, we must know to which processes to send multicast messages. This

is accomplished through a group membership service. Group membership and group

multicast services together form what is commonly known as group communication

services.

There are two kinds of group communication systems: primary partition systems and non-primary partition systems. In a primary partition system, progress is required to happen only in the primary partition. In a non-primary partition system, by contrast, progress happens concurrently in all partitions. Both types of group membership require distributed agreement, but with a different degree of strictness.

5.2 Asynchronous Hierarchical Service Group

We consider each DCS site as a single entity, represented by its primary server. We collectively call the primary and all backup servers at a site a relay, because they participate in relaying multicast messages. Therefore, in our model, a large-scale distributed conferencing system is simply a collection of interconnected relays.








To achieve our service reliability, availability and scalability goals, we apply the group communication concept to a hierarchical group structure. An asynchronous hierarchical service group, or service group, is a tree of relays. Most of the time it consists of only a subset of all relays in the system. A service group spans multiple DCS sites, and it is formed when there is a need for replicating data and services among those sites.

Our asynchronous hierarchical service group is different from the highly available synchronous aggregate servers at a local DCS site. Relays in our asynchronous hierarchical service group do not have a primary and backup relationship. Each relay in a service group plays a similar role, except for the root relay, which is responsible for imposing the multicast delivery ordering. Each relay in a service group provides services to a DCS client connected to it in exactly the same way. Therefore, in our case, a client can choose to access the DCS from any relay. But in the case of the highly available synchronous aggregate servers, a client can access services only through the primary server of a DCS site.

Also, in the highly available aggregate servers, a logical ring structure connects the servers to form a closed circle. A ring structure has been shown to be very effective for the purpose of fault tolerance in a small, highly synchronous environment. However, ring structures do not scale as well as hierarchical structures.

The root of a service group is the root of the tree that represents the service group.

Unlike other relays in a service group, a root relay has an extra responsibility to order

multicast messages. Multicast messages are delivered according to the message ordering

semantic specified by the group multicast service.








A clear advantage of using a tree structure is that failure detection and multicast message propagation can be done efficiently. The number of failure detections at a relay is directly proportional to the number of relays that connect to it. The number of relay connections at a relay in a graph structure is usually much higher than in a tree structure, due to the redundancy in the graph structure: for the same number of relays, the maximum number of connections in a graph is O(N^2), while it is only O(N) for a tree. Therefore, the maximum number of mutual failure detections required at a relay in a tree-based service group is usually much lower than in a graph-based service group.
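To give a feel for the numbers, the following small sketch compares the maximum connection counts (and hence the amount of mutual failure detection) for a fully connected graph and a tree of the same size; the relay count is arbitrary and chosen only for illustration.

    def max_connections(n):
        """Maximum pairwise connections for n relays: a fully connected graph
        has n(n - 1)/2 links, a tree has only n - 1."""
        return {"graph": n * (n - 1) // 2, "tree": n - 1}

    print(max_connections(1000))   # {'graph': 499500, 'tree': 999}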

Multicasting can be achieved efficiently with tree structures. The key idea is to bring messages to the root by bubbling multicast messages up the tree, as the sketch below illustrates. Because there are no cyclic paths in a tree structure, as there are in a general graph structure, multicast message routing is not redundant. This is important when message diffusion is used to propagate messages to a large number of interconnected relays. The ordering of multicast messages can also be done easily, since we use the root as the point at which all multicast messages are ordered. The use of a hierarchical structure for multicast communications can be found in some recent systems [ ].
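The sketch below illustrates the bubbling idea in Python: a multicast entering the service group at an arbitrary relay is passed hop by hop to the root, which then diffuses it down every subtree. The relay layout and names are invented for the example; this is not the DCS implementation.

    class Relay:
        """A relay node in an asynchronous hierarchical service group (sketch)."""

        def __init__(self, name, parent=None):
            self.name, self.parent, self.children = name, parent, []
            if parent is not None:
                parent.children.append(self)

        def multicast(self, message):
            """Bubble the message up toward the root, which orders it and
            starts the downward diffusion to every relay in the group."""
            if self.parent is not None:
                self.parent.multicast(message)   # bubble up one level at a time
            else:
                self.diffuse(message)            # root: begin the downward wave

        def diffuse(self, message):
            print(f"{self.name} delivers {message!r}")
            for child in self.children:
                child.diffuse(message)           # fan-out; parallel in a fat tree

    # Hypothetical layout: a root with two subtrees.
    root = Relay("root")
    a, b = Relay("a", root), Relay("b", root)
    a1, a2, b1 = Relay("a1", a), Relay("a2", a), Relay("b1", b)
    a2.multicast("hello")                        # enters the group at a non-root relay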

Unlike a ring structure, which is a highly constrained structure, an unconstrained

tree is allowed to grow in all directions, as long as it does not violate the non-cyclic

requirement of tree structures. If a tree is allowed to grow without any constraint, however, system performance can be seriously degraded. If the tree height increases quickly, it will take longer for a multicast message to propagate to all relays, since message diffusion parallelism rarely exists in this case. Figure 5.1 (a) depicts this scenario.








Thus, an unconstrained growth of a service group tree could hinder the ability to diffuse

multicast messages quickly from a relay to the rest of the service group.

A constrained growth of a service group helps facilitate the appropriate scaling of the system, as well as the efficiency of multicast diffusion among relays. We allow a relay tree to expand both horizontally (increasing its branching factor) and vertically (increasing its height). But whenever possible, we prefer the tree to expand horizontally, to increase the parallelism in multicast diffusion. Figure 5.1 (b) shows the scenario that we would like to occur.

Even with these advantages, however, we do not want to put too much burden on balancing any relay tree. Whenever possible, a tree with a higher branching factor is preferred. However, increasing the branching factor of a tree also increases the amount of mutual failure detection at each relay node. Because increased mutual failure detection affects the performance of the entire system, we have to balance the increase in mutual failure detection against the increase in multicast diffusion parallelism. This trade-off should be decided dynamically at run time. Sometimes network connectivity determines how relays should be interconnected for communication performance and efficiency reasons, and under that circumstance we cannot do much to control the tree structure.
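As one possible illustration of this runtime trade-off, the sketch below chooses a parent for a joining relay by preferring shallow relays (horizontal growth) whose failure detection load is still under a cap; the cap value, the candidate tuples, and the tie-breaking rule are all assumptions for illustration, not parameters taken from the dissertation.

MAX_FAILURE_DETECTIONS = 8   # assumed per-relay budget for mutual failure detection

def choose_parent(candidates):
    """candidates: list of (relay_depth, current_degree) tuples for reachable relays."""
    affordable = [c for c in candidates if c[1] < MAX_FAILURE_DETECTIONS]
    pool = affordable or candidates          # fall back if every relay is saturated
    # Shallow first (horizontal growth), then least-loaded to spread detection cost.
    return min(pool, key=lambda c: (c[0], c[1]))

# Example: candidates at depths 1, 1, and 2 with degrees 9, 3, and 1.
best = choose_parent([(1, 9), (1, 3), (2, 1)])   # -> (1, 3): shallow and not saturated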

It is important to note that our asynchronous hierarchical service group requires that a multicast message entering the service group at a non-root relay bubble up the tree rather than go from that relay directly to the root and then propagate downward. If we allowed a multicast message to be sent directly from a relay in a service group to the root of the same service group instead of bubbling it up, all relays in the service











[Figure 5.1 appears here.]

Figure 5.1: Two extremes of tree height. In (a), a tall relay tree has low parallelism in multicast diffusion. In (b), a fat relay tree has high parallelism in multicast diffusion. The number next to each arrow indicates the multicast diffusion step number since the multicast message entered the service group boundary at some relay.








group must do mutual failure detection with the root of the service group. This breaks the hierarchical communication structure of the asynchronous hierarchical service group and requires an enormous increase in the failure detection load at the root.


5.3 Membership Service

Group membership can be divided into strong and weak group membership. In a strong group membership, non-faulty members of a group always agree on the total order of every membership change event. Membership change events occur when a relay joins or voluntarily leaves a service group, or when a relay fails (an involuntary leave). If these changes are frequent, maintaining a strong group membership would be cost prohibitive. Since a strong group membership requires distributed agreement by all correct relays on every membership change, it cannot be achieved unless a certain amount of synchrony can be guaranteed. It would also be difficult to adapt to changes, which are more frequent in large-scale systems.

In a weak group membership, membership changes are preserved only with respect to the membership itself, not necessarily their order. It is weak in the sense that it allows a divergence of group membership, but the group membership eventually converges; in other words, eventually every group member has the same group membership. By weakening the group membership, we can manage large groups more efficiently, especially when membership changes occur frequently, whether voluntarily or involuntarily.
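One simple way to picture this convergence, not prescribed by the dissertation, is to treat each relay's membership as a set-valued view that may diverge while sub-groups are partitioned and that merges once the sub-groups are mutually responsive again.

def merge_views(view_a: set, view_b: set) -> set:
    """Merge two (possibly divergent) membership views of the same service group."""
    return view_a | view_b

# Two sub-groups of one service group, temporarily partitioned:
view_a = {"p", "q", "r"}
view_b = {"p", "s", "t"}
converged = merge_views(view_a, view_b)   # {"p", "q", "r", "s", "t"} once they can communicate again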

Our system is of the non-primary partition type with a weak group membership. A service group can be temporarily divided into many sub-groups, each of which maintains a partial membership of the entire group. We allow a divergence of group membership among these sub-groups, which together form a single group, for an unbounded period. Two or more sub-groups that belong to the same group will eventually merge as soon as they learn of the existence of one another and are mutually responsive. The divergence period cannot be bounded since we have no way of knowing how long it will take for two or more sub-groups to become mutually responsive to one another again.


5.3.1 Membership Properties

We define our group membership with the following properties:

Safety: Only non-responsive relays can be removed from the group.

Integration: Eventually two mutually responsive sub-groups merge to form a single

group and the resulting group maintains an overall hierarchical structure.

Partial Agreement: Eventually every mutually responsive relay in a group agrees

on the partial membership of the group.

These three properties are essential for large-scale group membership. The agreement property enforces only a partial agreement on the group membership. It is partial in the sense that only mutually responsive relays agree upon the partial membership.

The integration property ensures the liveness of the group membership by requiring that every pair of mutually responsive relays of the same group merge as soon as they discover each other. Thus sub-groups exist only when it is absolutely unavoidable, e.g., due to communication failures. A suspected relay that is non-responsive to a group receives no messages during the non-responsive period, but if it is non-faulty, it will try to rejoin the group. The original parent of such a relay can safely remove any non-responsive relay.

The integration property also requires that a newly formed group (resulting from the

merger of the two sub-groups) still maintains a hierarchical structure. This hierarchical

structure of a service group is important as mentioned at the beginning of this chapter.

The safety property ensures that a relay can be forcibly removed from its group

membership only if the relay is non-responsive to the group. Since a relay is connected

with a group only through a single parent due to the hierarchical structure constraint,

a relay is non-responsive to its group only if it is non-responsive to its parent in the

service group.


5.3.2 Membership Protocol

When a new relay joins a group, only the new relay and the relay at the point of the join need to agree upon the new membership. When a relay leaves a group, it needs only to agree with its immediate neighbors that it is leaving. Other relays in the group do not need to be involved.

A relay q joins a service group as follows. First, q finds a relay p that belongs to the group q wishes to join. Then q contacts p and announces its intention to join the group. The relay p grants q's request if either of the following two conditions is satisfied: q has no children (i.e., it is not the root of a sub-group), or q is not one of the ancestors of p in the service group. The ancestors of a relay p are all relays along the path from the parent of p up to the root of the tree. If p grants q's request to join, then p becomes the new parent of q. Both q and p start mutual failure detection, and p also starts flooding its new child q with multicast messages, based on the list of missing multicast messages that q reported to p.
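The sketch below restates this join rule in Python. The Relay fields and the stand-in start_mutual_failure_detection helper are illustrative assumptions rather than the system's actual interface.

from dataclasses import dataclass, field

@dataclass
class Relay:
    name: str
    parent: "Relay | None" = None
    children: list = field(default_factory=list)
    message_log: dict = field(default_factory=dict)   # seq -> multicast message

def ancestors(relay):
    """All relays on the path from relay's parent up to the root of the tree."""
    node, result = relay.parent, []
    while node is not None:
        result.append(node)
        node = node.parent
    return result

def start_mutual_failure_detection(a, b):
    pass  # stand-in: both relays would begin monitoring each other's responsiveness

def handle_join_request(p, q, missing_seqs):
    """p decides whether to accept q as a new child, following the rule above."""
    q_is_subgroup_root = len(q.children) > 0
    if q_is_subgroup_root and q in ancestors(p):
        return False                              # granting the request would create a cycle
    q.parent = p
    p.children.append(q)
    start_mutual_failure_detection(p, q)
    missing = [p.message_log[s] for s in missing_seqs if s in p.message_log]
    # p would now flood q with these missed multicast messages.
    return True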

A relay q may decide to leave a group because it cannot reach its parent relay p for an extended period of time, as determined through mutual failure detection. However, our group membership requires that q try to rejoin the group at a later time, either through another relay r in the same service group or through the same relay p if p later becomes accessible from q. This guarantees that all sub-groups of the same group eventually merge as soon as they are mutually responsive to each other. Figure 5.2 shows how a relay relocates itself when its parent relay in a group is non-responsive.

Usually q will try to rejoin the service group through an immediate ancestor of p, so that the height of the tree is not increased unnecessarily. For example, in Figure 5.2, H and I may decide to rejoin the service group at any of the following relays: B, B' (subtree), F, F', C, or C'. By rejoining the service group through one of these relays, the total height of the tree is not increased. Only under a very special circumstance (such as when a root fails, which is discussed later) do we allow a sibling to rejoin a service group through another sibling of the same failed parent.
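A hedged sketch of this relocation behavior follows. The is_responsive and request_join callables stand in for the mutual failure detector and the join protocol, and the retry loop with its interval is an illustrative choice only.

import time

def relocate(q, failed_parent, ancestors_of_parent, is_responsive, request_join,
             retry_interval=5.0):
    """Keep trying to rejoin through an ancestor of the failed parent, so the tree
    height does not grow; fall back to the old parent if it becomes reachable again."""
    while True:
        for candidate in ancestors_of_parent:
            if is_responsive(q, candidate) and request_join(candidate, q):
                return candidate                 # rejoined without increasing tree height
        if is_responsive(q, failed_parent) and request_join(failed_parent, q):
            return failed_parent                 # the parent was merely unreachable for a while
        time.sleep(retry_interval)               # keep trying: sub-groups must eventually merge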

When a root or an immediate child of a root fails, the situation becomes more complicated. This is unlike the normal situation, in which a relay simply finds a new relay to be its parent. Two issues arise: who will become the new root, and how to avoid unnecessarily splitting the group because of network partitions. Figure 5.3 depicts the different scenarios that can occur. In Figure 5.3 (a), a relay tree consists of only five relays: R, A, B, C, and D. R is the root of this service group. A is assigned to be the left-most












[Figure 5.2 appears here.]

Figure 5.2: When a parent relay in a group is non-responsive to its children, the children relocate themselves to an upper ancestor. In this scenario, G actually fails; if that were not the case, G would still be a child of D.








sibling of the parent relay R, B is the next left-most sibling, and so forth. A dark line with a double-headed arrow indicates that the two relays are mutually responsive. A dotted line with a double-headed arrow indicates that the two relays are non-responsive to each other.

In a benign situation in which only the root fails and every other condition is perfect, the children of the root cannot use the same procedure of moving to the upper level of the tree, because a root has no parent. When the root of a service group fails, a relay from within the service group must be chosen as the new root and take over the role of the failed root. Usually the first immediate child of the failed root takes over as the new root; the first immediate child of a relay is the left-most of that relay's children. A failed root, if it recovers, can rejoin the service group as a new leaf relay, not immediately as the root again. However, it may become the root again if it is later chosen for that role through the same procedure by which the left-most child of a failed root is chosen to become a new root.
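The sketch below captures this takeover rule under assumed helper names: the left-most surviving immediate child is promoted, and a recovered old root rejoins only as a leaf.

def choose_new_root(children_left_to_right, is_alive):
    """children_left_to_right: the failed root's immediate children, left-most first.
    is_alive: predicate supplied by the mutual failure detector (assumed interface)."""
    for child in children_left_to_right:
        if is_alive(child):
            return child          # the first (left-most) surviving child takes over
    return None                   # no surviving child to promote

def on_old_root_recovery(recovered_root, request_join, some_relay_in_group):
    # The recovered root does not reclaim its old role immediately; it rejoins as a
    # leaf and may become root again only through the normal takeover procedure.
    request_join(some_relay_in_group, recovered_root)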

There are many possibilities involving the root and its children, as depicted in Figure 5.3. Because the failures of a root and of its children need special attention, each of them is required to do mutual failure detection with the rest. We now consider all the cases in Figure 5.3.

In a benign situation, such as in Figure 5.3 (2), relay B is non-responsive to its

parent R, which is also the root of the tree. However, B is still mutually responsive with

all children of R, including A. Therefore, B can learn from A that there is no need for

B to take the role of the root since A is both alive and mutually responsive with R, the

current root.












[Figure 5.3 appears here. R is the root of the tree; it has four children: A, B, C, and D. A is the left-most immediate child of R, B is the second left-most child, and so forth.]

Figure 5.3: When a root relay or one or more of its immediate children fail, there are many possibilities that need to be considered. The goal is to avoid unnecessarily splitting the group into many smaller groups.








In Figure 5.3 (3), relay A, the left-most of R's children, is non-responsive to R, the root of the tree. It is A's responsibility to try to take over the role of the service group root, since it is possible that R is faulty. Since A is mutually responsive with at least some of the children of R, including B and D, and at least one of these two (in fact both) is also mutually responsive with R, A can learn from either B or D that R is, in fact, still alive. Therefore, A gives up the attempt to take over the role of the root and tries to rejoin the service group by becoming a child of either B or D.

In a rarer situation, as in Figure 5.3 (4), A is again non-responsive to R. This time, however, the only sibling relay that is mutually responsive with A, namely D, is also non-responsive to R. The other two siblings, B and C, even though they are mutually responsive with R, are not mutually responsive with A. Thus, if A asks D about the status of R, D could say that it, too, thinks R is faulty.

However, note that D is mutually responsive with both B and C, and they are mutually responsive with R. If D asks B or C (or both) about R, they will tell D that R is still alive. A single relay that says R is alive is sufficient proof to conclude that R is alive, because B (or C) must have solid evidence to back up the claim: the heartbeats that B and C receive regularly from R. Thus, D can tell A that R is indeed alive, and again A gives up the attempt to become the new root of the service group.
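The following sketch illustrates this query step; ask_about_root is an assumed interface rather than the dissertation's API. A child that cannot reach the root asks its mutually responsive siblings whether any of them still has evidence that the root is alive, and a single positive answer is treated as sufficient proof, so the asker abandons its takeover attempt.

def root_appears_alive(asker, responsive_siblings, ask_about_root):
    for sibling in responsive_siblings:
        # Each sibling answers from its own evidence (recent heartbeats from the root)
        # or from a positive answer obtained from its own responsive siblings.
        if ask_about_root(asker, sibling):
            return True        # one credible witness suffices
    return False               # no witness: proceed with the takeover attempt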

In the case of Figure 5.3 (5), the root fails. When A notices that R is non-responsive, it queries all of its mutually responsive sibling relays, B and D, about the status of R. Even though C is non-responsive to A, A can still query either B or D about C's view on












[Figure 5.4 appears here. (a) R, the root, is faulty. (b) A, the left-most child, takes over.]

Figure 5.4: When a root relay in a group is non-responsive to its children, the left-most sibling of the root takes over as a new root of the relay group.


R. Eventually A, B, C and D will reach an agreement that R is non-responsive to all

of its children. Thus, A will become a new root of the service group with no objection.

Figure 5.4 shows the change of the root of the tree.

If C is non-faulty but is non-responsive to A, B and D, as in the case of Figure 5.3,

then C knows that it is isolated from all other siblings. C eventually locates a new relay

to join the service group. C will eventually learn that the new root has been assigned

and R is no longer available. But that information is not important to C anymore since

it is no longer a child of R.

In Figure 5.3 (6), A fails. Since A is the first relay that would have to take the role of the root if R were to fail, someone must take over A's responsibility when A fails. Because A has failed, every relay connected to it, including R, B, and all other siblings of A, must see that A is non-responsive. R immediately informs B that it is the new left-most sibling and must be ready to take over as the new root if the situation warrants; B must accept this responsibility. Then R informs all of its surviving children about A's departure and B's new role as the left-most child. No children of R will object, since A is indeed non-responsive to all other siblings.

In the last case, Figure 5.3 (7), both R and A fail. All non-faulty children of R, including B, C, and D, notice the same condition. Since A, which is supposed to take over the role of the root when R fails, has also failed, B consults with the rest of R's children, C and D, to learn whether they see the situation differently. C and D see the situation the same way B does. Therefore, B can ask C and D to agree to let B become the new root of the service group. If C or D were to become non-responsive to B during this transition period, it would simply rejoin the service group at some other relay and eventually learn that B is the new root.

There is yet another possibility, in which the root R is non-responsive to all of its children but is still alive. In this case, A will reach agreement with whichever relays it is mutually responsive with and become the new root for that sub-group. Eventually R and A will be mutually responsive again, and the sub-groups will merge, with R becoming a new child of some relay in the new sub-group.

During this disconnection period, the root R may decide not to accept new multicast messages sent to the service group through it. Since it is very unlikely that all of its children (there are quite a number of them) would fail simultaneously, the root is convinced that one of its children may have already taken over the role of the root and formed a sub-group with the rest of the service group. Therefore, if some relay outside the service group wants to send multicast messages to the group through R, R should refuse them. If that happens, the outside relay will simply send the messages to the service group


