Hierarchical Matrix Timestamps for Scalable Update
Propagation*
T. Johnson K. Jeong
June 25, 1996
Abstract
Update propagation is a technique for the weakly consistent replication of a database.
Operations are performed at the local copy of the database, then are asynchronously propagated
to the replicas. In large-scale applications where weak consistency can be tolerated (e.g.,
network routing tables, bulletin boards, etc.), update propagation can give much better
performance than conventional database techniques. However, standard algorithms for ensuring that
an update is eventually propagated to all replicas require that each site store and communicate
an O(N^2) matrix timestamp, where N is the number of sites that store a replica. If N is very
large (e.g., N = 10,000 or larger), then the cost of maintaining the matrix timestamp limits
the scalability of the replicated database. Also, standard algorithms require global membership
information, and keeping the membership information up to date can be difficult.
In this paper, we present an algorithm that can significantly reduce the size of the matrix
timestamp. The sites that store a replica of the database are partitioned among a set of
domains. We replace matrix timestamps with hierarchical matrix timestamps, which have precise
information about processors in the same domain, and summary information about other
domains. If there are O(√N) domains, the hierarchical matrix timestamp can require only O(N)
space. Furthermore, the hierarchical matrix timestamp allows hierarchical replica management.
The cost of the reduced space of the hierarchical matrix timestamp is a lower quality estimate
of the progress of the other processors. As a consequence, updates must be buffered for a longer
time to ensure safety, increasing the size of the update log. We present detailed simulation
studies that quantify this tradeoff. We explore a set of optimization techniques for reducing the
average log size. When all of the optimizations are applied, there is no increase in the log size
as compared to the regular update propagation algorithm.
1 Introduction
Update propagation is a low-overhead technique for implementing a weakly consistent replicated
database. Many applications, such as network routing tables [4] and groupware, require only weak
consistency and are sensitive to replication performance. For example, Lotus Notes replication [8]
is based on a variant of update propagation. Other researchers have proposed methods by which
reliable servers [9] or database applications [6, 12] can be built using update propagation techniques.
*Submitted to the 10th International Workshop on Distributed Algorithms
There are several alternative solutions to the problem of ensuring the safety of updates in a large
system. One method is to use an epidemic algorithm, and settle for probabilistic guarantees [4].
Alternatively, one can propagate the values of the updated data objects instead of the operations
on the objects [8, 12]. However, data value propagation is not efficient if the data objects are much
larger than the description of the operations on them (e.g., insert a key into a table).
Reliable update propagation requires that every site is eventually informed of every update.
Every site is required to store each update in a buffer until the update is stable, that is, delivered
to every site. The standard method for determining the stability of an update is to use a matrix
timestamp (which we present in detail in the next section). If there are N sites that store a
replica of the database, then the matrix timestamp requires O(N^2) storage. Reliable multicast
schemes [3, 14] use matrix timestamps to ensure reliable delivery in spite of failures, and attach
vector timestamps to each multicast message to advance matrix timestamps at other processors.
Reliable update propagation schemes [15, 13, 1, 9] must propagate their matrix timestamps in
order to advance the matrix timestamps at other processors. If the number of replicas is very
large, then the storage and communications overhead of matrix timestamps becomes prohibitive.
For example, if there are 10,000 replicas, then the matrix timestamp requires 100 million entries.
Given the intended applications of update propagation schemes (e.g., groupware, routing tables,
etc.), such large numbers of replicas are quite feasible.
Meldal, Sankar, and Vera [11] show how to reduce the size of timestamps for causal
communication when communication occurs only along predefined links. This technique can be extended to
reduce the size of matrix timestamps, but the requirement of predetermined communication links
makes it inflexible. In [7], Heddaya, Hsu, and Weihl reduce the cost of reliable update propagation
by storing and propagating an O(N) size summary of the matrix timestamp instead of the matrix
timestamp itself. However, this algorithm does not make use of hierarchical structure, and each
site must know exactly which other sites also maintain replicas.
The contribution of this work is to provide a simple, efficient, and hierarchical method for
ensuring the reliable propagation of updates to all replicas, without resorting to value propagation.
We partition the replicas into a set of domains. A domain corresponds to a set of replicas that are
in close communication with each other, for example at the same location (e.g., replicas at the
University of Florida), or even on the same LAN. Each site stores two matrix timestamps. One
of the matrix timestamps contains precise information about replicas in the same domain, and the
other contains summary information about the other domains. If the N sites are evenly divided
into O(√N) domains, then the matrix timestamps are of size O(N) apiece.
Maintaining a very large system requires the use of a hierarchical structure. For example,
Internet addresses are hierarchical, and are typically composed of a domain address and a site
address within the domain. The hierarchical organization gives system administrators flexibility in
assigning addresses to local machines. Similarly, a hierarchical organization of the replicas increases
the administrator's flexibility in managing the local replicas. Because our matrix timestamps
contain only summary information about replicas in other domains, the system administrators can
start new replicas, shut down existing replicas, or move their locations without invoking a global
system reorganization. The algorithm described in [7] requires global membership lists, so that
local changes do require a global change to the membership list.
Storing less precise information in the matrix timestamps means that it takes longer to detect
when an update becomes stable. We perform a simulation study to quantify the increase in the log
size and to evaluate techniques for reducing it. We find that the most valuable optimization is to
strongly prefer to communicate with other replicas in the same domain. This result is nice because
local communication is cheaper than remote communication, and fast propagation to sites in the
same domain is more important than to remote domains. We present three methods for reducing
the log size. By using all three optimizations, we can reduce the average log size to that of the
regular matrix timestamp algorithm. We also tested the algorithm described in [7], and found that
it increases log sizes by ..11' over the regular algorithm.
2 Reliable Update Propagation
In this section, we present the reliable update propagation problem, and the standard method used
to solve it. Each site that stores a replica of the database is permitted to perform updates on the
database. Each site p keeps a log, Lp, of the updates that it has processed. The log is an (ordered)
listing of event records, where each event corresponds to an update. Each event record e contains
the fields e.op (the operation and its parameters), e.p (the site that first executed the operation)
and e.VTS (a vector timestamp attached to the event).
Sites distribute information by exchanging their logs. Several authors [7, 15, 1, 9] require
that updates satisfy the consistent log property (similar to requiring that multicasts be causally
delivered). Given a log L, the first i events of L are denoted by L[i]. If event e is in L, then its
position is denoted by index(e). We will use L[e] as a shorthand notation for L[index(e)].
Consistent log property: Let e be an event that is first executed at processor p. Then for
every processor q = 1,..., N and every event f,
    f ∈ Lp[e]  ⇒  f ∈ Lq[e]
Requiring consistent logs is similar to requiring reliable propagation because a site that misses
an update will not be able to receive any events that causally follow the missing event. The logs
must be garbage collected at some point. If processor p knows that every other processor has heard
of an event e, then p can delete e from Lp. Garbage collection requires a processor p to know, for
every other processor q, a lower bound on the set of events that must be in Lq. This lower bound is
expressed as an array of vector timestamps, one for every other processor q. Such an array is called
a matrix timestamp.

receive_log(Pp, Pq) // receiver: Pp, sender: Pq
1. receive log Lq from q
   receive matrix timestamp Mq from q
2. for every e in Lq
     if e ∉ Lp then append e to Lp (in causal order)
3. for j = 1 to m // update your timestamp.
     Mp[p][j] = max(Mp[p][j], Mq[q][j])
4. for i = 1 to m, i ≠ p // update your timestamp lower bound array.
     for j = 1 to m
       Mp[i][j] = max(Mp[i][j], Mq[i][j])
5. for i = 1 to m // compute the stable updates.
     stable[i] = min(Mp[1][i],..., Mp[m][i])
6. for every e in Lp
     if e.VTS[e.p] ≤ stable[e.p]
       remove e from Lp
Program 1: Regular log propagation and matrix timestamp update.
The algorithm we present here is based on those presented in [1]. The data structures stored
at a site are:
  count            The number of events originated at self.
  L                The log of events that have been received.
  M[1..m][1..m]    M[p][q] is a lower bound on the maximum number of events initiated
                   at q known to exist in Lp. Thus, the vector timestamp M[p][*] is
                   knowledge about Lp.
  stable[1..m]     stable[q] is the highest numbered event originated by q that
                   every site has received.
When a site executes an update, it increments count, sets M[self][self]=count, attaches the vector
timestamp M[self][*] to the update, and puts the update in the log. When a site p propagates its
log to processor q, it sends the events that q might not have in Lq (which can be determined by
examining M[self][*] and M[q][*]). Finally, the procedure for receiving and integrating a log that is
distributed to you is shown in Program 1.
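
The regular algorithm described above can be sketched in Python as follows. This is only an illustrative sketch under our own assumptions: the names Event, Site, execute_update, events_for, and receive_log are hypothetical, indices are 0-based, the sender's events are assumed to arrive in causal order, and an event is treated as removable once its number is at or below stable[e.p], following the definition of stable given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    op: str      # the operation and its parameters
    p: int       # site that first executed the operation
    vts: tuple   # vector timestamp attached when the event was created


class Site:
    def __init__(self, me, m):
        self.me, self.m = me, m
        self.count = 0                             # events originated at self
        self.log = []                              # event records, in causal order
        self.M = [[0] * m for _ in range(m)]       # matrix timestamp
        self.stable = [0] * m

    def execute_update(self, op):
        self.count += 1
        self.M[self.me][self.me] = self.count
        e = Event(op, self.me, tuple(self.M[self.me]))
        self.log.append(e)
        return e

    def events_for(self, q):
        # Send only the events that q might not have, judged from M[q][*].
        return [e for e in self.log if e.vts[e.p] > self.M[q][e.p]]

    def receive_log(self, sender, events, Mq):
        known = set(self.log)
        for e in events:                           # step 2: merge the logs
            if e not in known:
                self.log.append(e)
                known.add(e)
        for j in range(self.m):                    # step 3: own row
            self.M[self.me][j] = max(self.M[self.me][j], Mq[sender][j])
        for i in range(self.m):                    # step 4: rows of other sites
            if i != self.me:
                for j in range(self.m):
                    self.M[i][j] = max(self.M[i][j], Mq[i][j])
        for i in range(self.m):                    # step 5: column minima
            self.stable[i] = min(self.M[r][i] for r in range(self.m))
        # step 6: drop events that every site is known to hold
        self.log = [e for e in self.log if e.vts[e.p] > self.stable[e.p]]
```

With two sites, one update at site 0 becomes stable at site 1 as soon as it is received, and at site 0 after one reply, so both logs end up empty.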
The matrix M[*][*] is often termed a matrix timestamp, and is used to detect when every
processor has learned of some event. However, the O(N^2) space of the matrix timestamp limits its
scalability (note that the matrix timestamp must be propagated as well as stored). Our motivation
in this research is to take advantage of the natural structure of a distributed system to reduce the
overhead caused by the matrix timestamp.
3 Hierarchical Matrix Timestamps
In the following algorithms, the system consists of N replica sites which communicate only via
message passing. We assume that the system is hierarchically organized and managed. The sites
are partitioned into groups which we call domains. However, no particular partitioning strategy is
assumed and a site can send a message to any other site, regardless of partitioning.
The domains and the sites in the domains all have unique identifiers. Domains are referred to
as D1, D2, ..., Dm. For processes, we use two types of notation. A process is denoted as either Pi
(the ith process in the system) or P(j,k) (the kth process in the domain Dj). The latter notation
reflects a hierarchical structure and is used when we need to express domains explicitly.
In this section, we present a new matrix timestamp, the hierarchical matrix timestamp (hereafter
HMT), which is based on the hierarchical organization. A HMT consists of an array of hierarchical
vector timestamps (hereafter, HVT) plus a summary matrix. The HVT is one site's knowledge about
another site's log, similar to the regular vector timestamp in the regular matrix timestamp. The
main idea of the HMT is to maintain precise information about other sites in local domains but only
loose estimates about sites in remote domains. The motivation for the design approach is twofold.
First, we can significantly reduce the O(N^2) space requirement for matrix timestamps. Second,
sites store only summary information about remote domains, so each domain can be independently
managed.
Hierarchical Vector Timestamps A HVT consists of two components: a process-to-process
vector timestamp (a PP vector timestamp) and a process-to-domain vector timestamp (a PD vector
timestamp). In a HVT, a process maintains temporal information for each individual process of the
local domain in the PP vector timestamp (as in conventional vector timestamps), and summary
information about each remote domain in the PD vector timestamp. Vector timestamps require
sites to maintain local clocks, which can be operation counts or Lamport clocks [10]. For HVTs,
Lamport clocks are preferable, because they ensure that all sites report similar times, avoiding
performance degradation due to lagging sites. Agrawal and Malpani [1] suggest the use of real-time
clocks for similar reasons. In this work, the processes in each domain share one Lamport clock.
Suppose that site P(i,j) maintains its knowledge about some site's log L in a HVT consisting of
the PP and PD timestamps, denoted by pp and pd. If pp[s] = r, then the log L contains the records
of all the events that have occurred at P(i,s) (i.e., the sth process in domain Di) with timestamp
r or less.
Therefore, the semantics of the PP vector timestamp is the same as that of the regular vector
timestamp, except that a PP vector timestamp uses the Lamport clock and only contains
information for those processes in the same domain. If pd[t] = r, then L contains at least the records of
all events originating in Dt (i.e., every site in the domain) that are timestamped r or less.
In Figure 1, we illustrate the idea of the HVT by converting a regular vector timestamp to
the corresponding HVT. Nine processes are grouped into three domains and three processes are
assigned to each domain. We convert the vector timestamp into a HVT that a site in domain D0
would generate. In the example, the PD vector timestamp stores 10 (i.e., the minimum Lamport
time) for domain D0, 14 for domain D1, and 16 for domain D2, instead of the sequences (10,15,13),
(14,16,18) and (16,19,18), respectively. Note that information in the first entry of the PD vector
timestamp is redundant with that in the PP vector timestamp. Note also that the site contains in
its log events from domain 1 numbered larger than 14. The PD timestamp makes a safe estimate
of the events in the remote logs.
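
The conversion described for Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name to_hvt and the 0-based process numbering are our assumptions, while the timestamp values are the ones quoted in the text above.

```python
def to_hvt(vts, domains, local_domain):
    """Convert a regular vector timestamp into a (PP, PD) pair.

    vts[i] is the timestamp known for process i; domains[d] lists the
    process indices that belong to domain d.
    """
    # PP: precise entries for the processes of the local domain only.
    pp = [vts[i] for i in domains[local_domain]]
    # PD: one safe (minimum) summary entry per domain.
    pd = [min(vts[i] for i in members) for members in domains]
    return pp, pd


vts = [10, 15, 13, 14, 16, 18, 16, 19, 18]
domains = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
pp, pd = to_hvt(vts, domains, local_domain=0)
print(pp)  # [10, 15, 13]
print(pd)  # [10, 14, 16]
```

The first PD entry repeats information already present in the PP vector, which is the redundancy noted above.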
The space requirement for an HVT depends on how the system is partitioned, but it is usually
less than the O(N) of a regular vector timestamp. For example, if N processes are partitioned into
√N domains, then the HVT contains 2 × √N entries. However, HVTs do not maintain as up-to-date
information about processes in remote domains as regular vector timestamps.
Hierarchical Matrix Timestamps A HMT consists of three components: a process-to-process
matrix timestamp (the PP matrix timestamp), a process-to-domain matrix timestamp (the PD matrix
timestamp), and a domain-to-domain matrix timestamp (the DD matrix timestamp). The PP and
PD matrix timestamps are an array of HVTs, one for every site in the local domain (analogous
to the regular matrix timestamp). The DD matrix timestamp is summary information about the
entire system.
Since the PP and PD matrix timestamps are an array of HVTs, we explain only the DD matrix
timestamp. Suppose that dd is site P's DD matrix timestamp. If dd[i][j] = r, then P knows that
the log of each site in domain Di contains at least the records of all the updates in domain Dj
timestamped r or less. As notational conventions, we call the rows of matrix timestamps vector
timestamps. For example, the ith row of a PP matrix timestamp pp is called the ith vector
timestamp of pp and denoted as pp[i].
Figure 2 illustrates how an HMT is generated from an array of HVTs. As in Figure 1, this
example consists of nine processes partitioned into three domains. In the example, the left dotted
box represents an array of HVTs received by a site in domain D0 and the right dotted box is the
HMT of the site. Let's consider HVTs for the processes of domain D2 (i.e., P6, P7, P8) in the left
dotted box. The HVTs show that P6's log, P7's log, and P8's log contain at least the records of
all the updates in domain D1 timestamped 26, 15 and 15 or less, respectively. This is summarized
as dd[2][1] = 15. That is, the site (in domain 0) knows that every site in domain 2 has received
at least the events from domain 1 timestamped 15 or less. This estimation is safe, as we take a
minimum.
The space requirement for a HMT is O(N) when there are N processes evenly partitioned into
√N domains. More precisely, a HMT requires 3N entries.
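
The entry counts can be checked with a short calculation. The sketch below assumes the breakdown implied by the definitions above (a PP matrix of n × n entries, a PD matrix of n × m entries, and a DD matrix of m × m entries, for m domains of n sites each) and assumes that m divides N evenly; the function names are ours.

```python
import math


def regular_entries(N):
    """Entries in a regular N x N matrix timestamp."""
    return N * N


def hmt_entries(N, m):
    """Entries in an HMT for N sites split evenly into m domains."""
    n = N // m                       # sites per domain (assumes m divides N)
    return n * n + n * m + m * m     # PP + PD + DD


N = 10_000
m = math.isqrt(N)                    # 100 domains of 100 sites each
print(regular_entries(N))            # 100000000
print(hmt_entries(N, m))             # 30000, i.e. 3N
```

With far fewer domains the PP term dominates: for N = 10,000 and m = 10 the same formula gives 1,010,100 entries, which is why the O(√N) partitioning matters.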
Notice that the update safety information can be computed from the DD matrix timestamp in
the same way that it can be computed from matrix M. In the example in Figure 2, a site with the
HMT can determine that every site in every domain knows about at least all updates numbered
15 or less made by any site in domain 1. So, any updates from domain 1 that are numbered 15 or
less can be safely discarded from the log.
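
The summarization step and the stability test can be sketched as below. Only the values 26, 15, and 15 (what P6, P7, and P8 know about domain D1) and the resulting entry 15 come from the text; every other number is a hypothetical filler chosen for illustration, and the function names are ours.

```python
def summarize_domain(pd_rows):
    """DD row for one domain: column-wise minimum of its sites' PD vectors."""
    return [min(col) for col in zip(*pd_rows)]


def stable_from_dd(dd):
    """stable[j]: the timestamp up to which every domain holds D_j's updates."""
    return [min(col) for col in zip(*dd)]


# PD vectors of P6, P7, P8. The middle column (26, 15, 15) is from the text;
# the outer columns are hypothetical.
d2_rows = [[12, 26, 30],
           [11, 15, 28],
           [14, 15, 31]]
dd2 = summarize_domain(d2_rows)          # dd[2] == [11, 15, 28]

# Rows for domains 0 and 1 are hypothetical.
dd = [[20, 17, 9],
      [13, 22, 8],
      dd2]
stable = stable_from_dd(dd)
print(dd2[1])     # 15
print(stable[1])  # 15: updates from domain 1 numbered 15 or less are stable
```

Taking minima twice (within a domain, then across domains) keeps the estimate safe at the cost of lagging behind the true state.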
4 HMT Propagation Algorithm
In this section, we present algorithms for update log propagation based on HMTs. Because the
HMTs maintain asymmetric information about sites in local and remote domains, we need to use
different algorithms to propagate HMTs between sites in the same domain and different domains.
Programs 2 and 3 show the algorithms for local and remote propagation, respectively. The
propagation consists of a pair of operations: send and receive. In the figures, we present only the receive
operation because the send operation is straightforward (see Section 2).
We explain the algorithms in Programs 2 and 3 by comparing them with the regular receive
algorithm presented in Section 2 (hereafter, the regular algorithm). Updating a matrix timestamp
or HMT consists of two parts:
A: Update knowledge about the local log to reflect newly received events.
B: Update knowledge about the other sites' logs.
In the receive algorithm, the PP, PD, and DD matrix timestamps of process P(d,i) are denoted
as PPi, PDi, and DDi, respectively. We assume that there are m domains and n processes per
domain in the system to simplify the presentation.
receive_log(P(d,p), P(d,q)) // receiver: P(d,p), sender: P(d,q)
1. receive log Lq from P(d,q)
   receive matrix timestamp (PPq, PDq, DDq) from P(d,q)
2. for every e in Lq
     if e ∉ Lp then append e to Lp (in causal order)
3. // update the HVT for the local log
   for i = 1 to n
     PPp[p][i] = max(PPp[p][i], PPq[q][i])
   for i = 1 to m
     PDp[p][i] = max(PDp[p][i], PDq[q][i])
   // compute a summary about the local domain
   PDp[p][d] = MIN(PPp[p][i] for i = 1 to n)
   // update the Lamport clock
   PPp[p][p] = max(PPp[p][p], PPq[q][q]) + 1
4. // update timestamps for the other processes' logs
   for i = 1 to n (i ≠ p) // HVTs for the other local processes' logs
     for j = 1 to n
       PPp[i][j] = max(PPp[i][j], PPq[i][j])
     for j = 1 to m
       PDp[i][j] = max(PDp[i][j], PDq[i][j])
   for i = 1 to m, j = 1 to m // DD timestamp for remote processes' logs
     DDp[i][j] = max(DDp[i][j], DDq[i][j])
5. // update the DD vector timestamp for the local domain
   for j = 1 to m
     DDp[d][j] = max(DDp[d][j], MIN(PDp[i][j] for i = 1 to n))
6. // compute stable updates
   for j = 1 to m
     stable[j] = MIN(DDp[i][j] for i = 1 to m)
Program 2: Receive Operation for Local Propagation

Local Algorithm Program 2 shows the receive algorithm for local propagation (hereafter, the
local algorithm). The first two steps are identical to those in the regular algorithm. Step 3
corresponds to part A. The step updates the HVT (PPp[p], PDp[p]) for the receiver's log in the same
way as in the regular algorithm. The only difference from the regular algorithm is that the HVT
consists of two components. Therefore, this step is composed of two for loops and the Lamport
clock is updated.
Step 4 updates the receiver's local knowledge about the other processes' log by comparing matrix
timestamps with those from the sender (similar to the regular algorithm). Step 5 recomputes the
DDp[d] vector timestamp (i.e., the summary about the receiver Pp's own domain) based on new
updates to the PDp matrix timestamp (i.e., Pp's knowledge about the logs of processes in the local
domain).
Step 6 computes stable updates based on the DDp matrix timestamp (i.e., Pp's local knowledge
about the summary about all the processes' logs in each domain).
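
Steps 3 through 6 of the local algorithm can be sketched in Python as follows. This is a rough illustration only: it omits the log merge of steps 1 and 2, uses 0-based indices, and the names HMT, vmax, and receive_local are hypothetical.

```python
class HMT:
    """Hierarchical matrix timestamp of one site (n local sites, m domains)."""

    def __init__(self, n, m):
        self.pp = [[0] * n for _ in range(n)]  # pp[i][j]: site i's log vs. local site j
        self.pd = [[0] * m for _ in range(n)]  # pd[i][j]: site i's log vs. domain j
        self.dd = [[0] * m for _ in range(m)]  # dd[i][j]: domain i's logs vs. domain j


def vmax(a, b):
    """Element-wise maximum of two equal-length vectors."""
    return [max(x, y) for x, y in zip(a, b)]


def receive_local(mine, p, d, theirs, q):
    """Site p of domain d merges the HMT sent by site q of the same domain."""
    n, m = len(mine.pp), len(mine.dd)
    # Step 3: update the HVT for the receiver's own log.
    mine.pp[p] = vmax(mine.pp[p], theirs.pp[q])
    mine.pd[p] = vmax(mine.pd[p], theirs.pd[q])
    mine.pd[p][d] = min(mine.pp[p])                           # local-domain summary
    mine.pp[p][p] = max(mine.pp[p][p], theirs.pp[q][q]) + 1   # Lamport clock
    # Step 4: update knowledge about the other processes' logs.
    for i in range(n):
        if i != p:
            mine.pp[i] = vmax(mine.pp[i], theirs.pp[i])
            mine.pd[i] = vmax(mine.pd[i], theirs.pd[i])
    for i in range(m):
        mine.dd[i] = vmax(mine.dd[i], theirs.dd[i])
    # Step 5: recompute the DD row of the local domain.
    for j in range(m):
        mine.dd[d][j] = max(mine.dd[d][j], min(mine.pd[i][j] for i in range(n)))
    # Step 6: stable timestamps, one per domain.
    return [min(mine.dd[i][j] for i in range(m)) for j in range(m)]
```

Note that the stable vector stays at zero for a domain until some DD row for every domain has advanced, which is why remote summaries matter for log truncation.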
receive_log(P(d,p), P(d',q)) // receiver: P(d,p), sender: P(d',q)
1. receive log Lq from P(d',q)
   receive only (PDq[q] vector, DDq matrix) from P(d',q)
2. for every e in Lq
     if e ∉ Lp then append e to Lp (in causal order)
3. // update the PDp[p] vector for the local log
   for j = 1 to m
     PDp[p][j] = max(PDp[p][j], PDq[q][j])
4. // update DD timestamps
   for i = 1 to m, j = 1 to m
     DDp[i][j] = max(DDp[i][j], DDq[i][j])
5. // update the DD timestamp for the local domain
   for j = 1 to m
     DDp[d][j] = max(DDp[d][j], MIN(PDp[i][j] for i = 1 to n))
6. // compute stable updates
   for j = 1 to m // the stable updates
     stable[j] = MIN(DDp[i][j] for i = 1 to m)
Program 3: Receive Operation for Remote Propagation
Remote Algorithm In the remote algorithm, the PPq and PDq matrix timestamps of the sender
P(d',q) contain the sender's knowledge about the logs of processes in its own domain Dd'; in
particular, PPq[q] and PDq[q] are knowledge about P(d',q)'s own log, which is being received. However,
the receiver P(d,p) in the different domain Dd does not maintain such detailed information in its
HMT, but only the summary of the information, which can be found in P(d',q)'s DDq. Therefore, the
sender does not include the entire HMT, but only the PDq[q] vector timestamp and the DDq matrix
in a propagation message. Note that the PDq[q] vector timestamp is included as information about
P(d',q)'s log, and the PPq[q] vector timestamp is already summarized in PDq[q][d'].
Step 3 (which corresponds to part A:) updates the receiver's HVT about the local log, but only
the PDp[p] vector timestamp component. Likewise, Step 4 updates only the DDp matrix timestamp.
Steps 5 and 6 are identical to those in the local algorithm.
Note that unlike the local algorithm, the receiver does not need to know details (e.g., the member
list) about processes in the remote sender's domain (i.e., the PPq and PDq matrix timestamps).
That is, the remote algorithm allows each domain to manage its local processes independently.
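
The timestamp part of the remote receive operation (steps 3 through 6) can be sketched as follows. This is an illustrative sketch with 0-based indices; the function name receive_remote and the plain-list representation are our assumptions, and the log merge of steps 1 and 2 is omitted.

```python
def receive_remote(pd, dd, p, d, sender_pd_row, sender_dd):
    """Site p of domain d merges a message from a site in another domain.

    pd is the receiver's PD matrix (n x m), dd its DD matrix (m x m).
    The message carries only the sender's own PD vector and its DD matrix.
    Both pd and dd are updated in place; the stable vector is returned.
    """
    n, m = len(pd), len(dd)
    # Step 3: the received log raises the receiver's own PD vector.
    pd[p] = [max(a, b) for a, b in zip(pd[p], sender_pd_row)]
    # Step 4: merge the domain-level summaries.
    for i in range(m):
        dd[i] = [max(a, b) for a, b in zip(dd[i], sender_dd[i])]
    # Step 5: recompute the summary row of the receiver's own domain.
    for j in range(m):
        dd[d][j] = max(dd[d][j], min(pd[i][j] for i in range(n)))
    # Step 6: an update is stable once every domain's summary covers it.
    return [min(dd[i][j] for i in range(m)) for j in range(m)]
```

The receiver never touches a PP matrix here, so it needs no per-site knowledge of the sender's domain.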
5 Performance
The HMT-based propagation scheme trades smaller matrix timestamps for slower propagation
of information about remote sites' logs (not the log records themselves). More specifically, only
loose but safe estimates about remote domains are transferred. Therefore, log record truncation in
the HMT-based scheme is slower than in the regular matrix timestamp based scheme, resulting
in larger log sizes.
To quantify the tradeoff, we implemented both the regular and the HMT propagation algorithms
in a simulator. Each site generates one update per unit time (the inter-update times are chosen
independently from an exponential distribution). In addition, each site propagates its log to another
site once per unit of time (again chosen from an exponential distribution). We found that the log
size is directly proportional to the ratio of the average inter-event time and the average
inter-propagation time, so we do not vary these parameters in our experiments. We assumed that
executing the log propagation protocol requires a negligible amount of time. This assumption is
reasonable, because we are concerned with measuring the log size.
While the simulations execute, we measure the average log size and the average time until an
update becomes stable (i.e., propagated to every site). The time until stability should be interpreted
as a multiple of the interpropagation time. We ran the simulation until 800,000 updates had
occurred. The ', '.' confidence intervals are within :;.
Under the regular log propagation algorithm, a site uniformly randomly chooses another site
to be the destination for its log propagation. The HMT algorithms can take advantage of locality.
We evenly partition the sites into a varying number of domains. When a site propagates its log,
it chooses a site in the same domain with probability Plocal and a remote site with probability
1 − Plocal. We call Plocal the local preference.
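
A destination could be selected according to the local preference as in the sketch below. The text specifies only the probability of choosing a local versus a remote site; the uniform choice within each group, the function name choose_destination, and the (domain, site) pair representation are our assumptions.

```python
import random


def choose_destination(me, domains, p_local, rng=random):
    """Pick a propagation target for site me = (domain index, site index).

    domains[d] is the number of sites in domain d. With probability p_local
    the target is another site of the same domain, otherwise a remote site.
    """
    d, s = me
    local_peers = [(d, k) for k in range(domains[d]) if k != s]
    remote_peers = [(j, k) for j in range(len(domains)) if j != d
                    for k in range(domains[j])]
    # Fall back to the non-empty group when the other one is empty.
    if local_peers and (not remote_peers or rng.random() < p_local):
        return rng.choice(local_peers)
    return rng.choice(remote_peers)


rng = random.Random(0)
target = choose_destination((0, 1), [3, 3], 0.7, rng)
```

A site never selects itself, and a system with a single site (no peers at all) is outside the scope of this sketch.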
In Figures 3 and 4, we plot the average log size against a varying local preference. For reference,
we also plot the average log size for the regular update propagation algorithm. We use 4, 6, 8,
10, and 12 domains. The charts show that the proper setting of the local preference reduces the
average log size of the HMT algorithm to about 70% larger than that of the regular algorithm.
Note that the log size is not overly sensitive to the precise setting of Plocal.
By examining a large quantity of experimental results, we observed that the optimal setting
of the local preference depends almost entirely on the number of sites per domain. In Figure 5,
we plot our empirically determined optimal value of Plocal against the number of processors per
domain.
6 Reducing the Log Size
The average log size of the HMT algorithm is moderately larger than that of the regular algorithm.
But, one can expect better performance if the log size can be reduced. In this section, we investigate
three orthogonal techniques for reducing the average log size.
6.1 Timestamp-only Propagation
In this section, we investigate the heuristic of sending some HMT-only messages to speed up HMT
propagation (this technique was proposed in [1]). These messages are processed using the normal
receive algorithm, but without log updates and without part A of the local HMT update. Because
the HMTs are small, the overhead of this extra communication is low.
When a site propagates an HMT-only message, it propagates the HMT to a site in the same
domain with probability PHMT, and to a site in a remote domain with probability 1 − PHMT.
In Figure 6, we investigate the optimal setting for PHMT for a system of 60 processors and 8
domains, sending HMT-only messages at the same rate as log propagation (these results are
representative of all situations we encountered). In this particular experiment, we found that setting
Plocal = .7 and PHMT = .8 gives the best performance. However, performance is not overly sensitive
to the exact setting of these parameters, and setting PHMT = Plocal gives near-optimal performance.
Next, we investigate the effect of varying the rate of sending HMT-only messages on the average
log size. In Figure 7, we use the same parameters as those for the experiment in Figure 6, except
that we set PHMT = .8 and vary the rate at which a site sends HMT-only messages. We find that
obtaining a new decrement in the log size requires a doubling of the rate of sending HMT-only
messages. We stopped the experiment with 1 HMT-only message per log propagation because of
the declining performance improvement.
6.2 K-safe Propagation
The algorithms so far presented for the HMT-based scheme guarantee that only stable updates are
ever removed from any log (i.e., log truncation based on HMTs is safe). However, these algorithms
are conservative when transferring HMTs between sites, especially between different domains. In this section,
we present an optimistic propagation method for HMTs which is aimed at improving the speed of
remote HMT propagation.
In K-safe update propagation, a site can remove an update from its log if the update has been
propagated to at least K sites in every domain. The motivation for this definition is that local sites
work to ensure that all other sites in a domain receive every update. If the update is propagated
to K sites in domain D, sites in other domains can be assured that every site in D will eventually
receive the update. We note that a similar idea was proposed informally in [6]. A caution about
optimistic update propagation techniques is that it is possible that log propagation may violate
causality, due to the eager garbage collection.
In K-safe propagation, the sender P(d,p) optimistically advances the DDp[d] vector timestamp
(i.e., the summary about the logs in the local domain) before remote propagation as follows:
    for j = 1 to m
      DDp[d][j] = KthMAX(PDp[i][j], i = 1 to n)
However, this update is only for propagation messages and does not affect the sender's own DDp[d].
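
The optimistic advance can be sketched as below, reading KthMAX as the K-th largest value in a column of the PD matrix (so that at least K local sites have reached that timestamp). The function names and the 0-based indexing are our assumptions.

```python
def kth_max(values, k):
    """Return the k-th largest value (k = 1 gives the maximum)."""
    return sorted(values, reverse=True)[k - 1]


def optimistic_dd_row(pd, k):
    """DD row to place in an outgoing remote message under K-safe propagation.

    pd[i][j] is what local site i is known to hold from domain j. The result
    is used only in the message; the sender's stored DD row is not changed.
    """
    n, m = len(pd), len(pd[0])
    return [kth_max([pd[i][j] for i in range(n)], k) for j in range(m)]


pd = [[5, 2],
      [3, 9],
      [7, 4]]
print(optimistic_dd_row(pd, 1))  # [7, 9]
print(optimistic_dd_row(pd, 2))  # [5, 4]
print(optimistic_dd_row(pd, 3))  # [3, 2], same as the safe column minimum
```

Setting K equal to the number of sites in the domain recovers the conservative MIN rule of Programs 2 and 3.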
K-safe propagation needs to (1) detect if an update has been propagated to at least K
sites in every domain and (2) detect if a log propagation attempt violates causality. Due to the
optimistic advance operation by the sender, the same Step 6 in Programs 2 and 3 computes updates
propagated to at least K sites.
The K-safe technique assumes that Pi's log Li maintains two vector timestamps Ci and Si. If
Ci[j] = r, then the log contains all the update records at Process Pj timestamped r or less. If Si[j]
= r, then Li has already truncated all the update records at Process Pj timestamped r or less.
The receiver can detect causality violation by checking the following condition:
Causality violation: Creceiver [i] < Ssender [i] for some Process Pi
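
The check can be written as a one-line predicate. In this sketch, c_receiver and s_sender stand for the C vector of the receiver and the S vector of the sender defined above; the function name is ours.

```python
def violates_causality(c_receiver, s_sender):
    """True if the sender has already truncated records the receiver lacks."""
    return any(c < s for c, s in zip(c_receiver, s_sender))


print(violates_causality([3, 5], [2, 5]))  # False: nothing missing was truncated
print(violates_causality([3, 5], [4, 0]))  # True: record 4 of process 0 is gone
```

When the predicate is true, the receiver would have to reject the propagation, since the sender can no longer supply the missing records.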
K-safe propagation allows each domain to choose a different constant for K depending on
the reliability of the domain. Due to space limitations, we omit any further discussion.
Performance We ran an experiment to determine the performance improvement that can be
achieved using the K-safe technique. In Figure 8, we use 60 processors and 8 domains, vary the K
in the K-safe algorithm, and plot the average log size against the local preference. Note that k0
represents the normal HMT algorithm, and k2 is the 6-safe algorithm in a domain with 8 sites.
Using K-safe results in a significant performance improvement when the local preference is tuned
to the proper value. We found that 1-safe, 2-safe, and 3-safe all had about the same performance.
The 1-safe and 2-safe algorithms experienced a small number of rejected update propagations (less
than .11".'f), but the rejected propagations had no effect on the average time to fully propagate a
message.
6.3 Log-based Compensation
3. for i = 1 to n
     PPp[p][i] = Cp[i] (for every Pi in Dd)
   for j = 1 to m
     PDp[p][j] = MIN(Cp[i] for every Pi in Dj)
Program 4: Log-based HMT Compensation

The HMT algorithm presented in Section 4 makes a clear separation between local and remote
information. When a site s in domain D propagates information to a site r in domain D', site
s gives to r the value to place in r's PD matrix timestamp. Using this mechanism, site r does
not need any information about sites in the remote domain D. However, this clear separation of
domains causes a significant loss of information.
Let us consider the following example. Let s1 and s2 be the two sites in domain D which
propagate their logs to site r in domain D'. Suppose that when s1 propagates its log to r, its PP
vector timestamp (i.e., its own entry in the PP matrix timestamp) is (5, 2). Then, s1 gives site r
the value 2 to put in r's PD matrix timestamp. Suppose further that when s2 propagates its log to
r, its PP vector timestamp is (2, 5), again giving r the value 2 to put in its PD matrix timestamp.
Now, r's entry in its PD timestamp about what it knows about updates in domain D is 2, even
though r has in its log all updates from D timestamped 5 or less.
We can advance the PD timestamp at site r more aggressively by letting site r update its vector
in its PD matrix by examining its log, L. The technique for doing so is based on the definition of the
HVT, illustrated in Figure 1. The local and remote algorithms in Programs 2 and 3 are modified
to use the Step 3 shown in Program 4. The modified step assumes the two vector timestamps
introduced previously for the K-safe technique.
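A minimal Python sketch of the modified Step 3 follows. It is not the authors' implementation: the function name log_based_step3 and the list-of-lists domain layout are hypothetical, and c[i] is assumed to be the largest timestamp such that the site's log holds every update from process i up to that timestamp. The input vector is the regular vector timestamp shown in Figure 1.

```python
def log_based_step3(c, domains, own_domain):
    """Recompute a site's own PP and PD rows from its log-derived vector c.

    c          -- assumed: c[i] is the highest timestamp such that the log
                  holds all updates from process i up to that timestamp
    domains    -- list of lists of process ids, one list per domain
    own_domain -- index of the domain this site belongs to
    """
    # Precise information about every process in the local domain.
    pp_row = [c[i] for i in domains[own_domain]]
    # Summary information: the minimum over the members of each domain.
    pd_row = [min(c[i] for i in members) for members in domains]
    return pp_row, pd_row


# Regular vector timestamp for processes 0..8, three domains of three sites.
c = [10, 15, 13, 14, 16, 18, 16, 19, 18]
domains = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

pp_row, pd_row = log_based_step3(c, domains, own_domain=0)
print(pp_row)  # [10, 15, 13]
print(pd_row)  # [10, 14, 16]
```

The two rows match the PP and PD parts of the hierarchical vector timestamp in Figure 1. Note that the sketch needs the membership of every domain, which is why this optimization depends on membership information for remote domains.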
Note that for an effective application of the log-based compensation optimization, a site must
know about the currently active sites in other domains. Therefore, log-based compensation
requires that local changes in replica membership lists be propagated to
all domains. While a discussion of these techniques is beyond the scope of this paper, we note that
domains can still be independently managed and membership changes asynchronously propagated
to remote domains using techniques similar to those proposed for group membership algorithms
[2, 5].
The performance characteristics of update propagation using log compensation are similar to
those of the basic HMT algorithm. We summarize the performance gain possible using log
compensation in the next section.
7 Discussion
We repeated the experiments summarized in the previous sections for 24, 36, and 48 processors as
well as 60 processors. In Figure 9, we plot the best performance (smallest log size as we vary the local
preference) of the basic HMT algorithm, the HMT algorithm with each of the three optimizations
applied separately, and the regular update propagation algorithm. We set the number of domains
to be approximately the square root of the number of processors (4, 6, 6, and 8, respectively). The
HMT-only algorithm propagated as many HMT-only messages as logs.
We find that the average log size increases in direct proportion to the number of processors for
each of the algorithms. Given the stability of the charts, we can extrapolate the results to larger
numbers of processors. The normal HMT algorithm results in a log size about 1.7 times larger than
that of regular update propagation. The 2-safe, HMT-only, and log-compensation optimizations
applied individually result in a log 1.25, 1.35, and 1.6 times larger than that of the regular update
algorithm, respectively. Note that each of the optimizations to the regular HMT algorithm has some kind of
cost (additional messages, changed delivery semantics, or a mixing of local and remote knowledge).
Next, we plot the average log size of the best regular HMT algorithm, the best HMT algorithm using
2-safe, the best HMT algorithm using all three optimizations, and the regular matrix timestamp
algorithm in Figure 10. We also plot the average log size when the algorithm of [7] is used (labeled
simple). This chart shows the performance tradeoff among the algorithm options. The regular HMT
algorithm has a clean separation between local sites and remote domains, but at the cost of the
largest average log sizes. The regular timestamp algorithm has a small average log size, but at
the cost of very large matrix timestamps and global membership lists. The simple algorithm of [7]
propagates O(N)-size data structures for log truncation and has a moderate average log size, but
still requires global membership lists. The best single optimization (2-safe) gives smaller logs than
the simple algorithm. If all three of the optimizations are applied to the regular HMT algorithm,
the average log size is almost the same as that of the regular matrix timestamp algorithm.
8 Conclusions and Future Work
In this paper, we present an algorithm for storing and propagating hierarchical matrix timestamps.
A matrix timestamp is used in reliable update propagation (multicast) systems to determine when
an update has been propagated to all sites (i.e., stable). Once an update is determined to be stable,
it may be deleted from a site's log.
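The stability test described above can be sketched in Python as follows. This is a generic illustration of log truncation with a matrix timestamp, not code from the paper: the names is_stable and truncate_log are hypothetical, and matrix[k][i] is assumed to hold the latest timestamp of site i's updates that site k is known to have received.

```python
def is_stable(matrix, origin, timestamp):
    """An update is stable once every site is known to have received it."""
    return all(row[origin] >= timestamp for row in matrix)


def truncate_log(log, matrix):
    """Keep only the log records that are not yet stable.

    Each record is a tuple (origin site, timestamp, payload).
    """
    return [
        (origin, ts, payload)
        for (origin, ts, payload) in log
        if not is_stable(matrix, origin, ts)
    ]


# Assumed example: row k is what site k is known to have seen from sites 0..2.
matrix = [
    [5, 3, 4],
    [4, 3, 2],
    [5, 2, 4],
]
log = [(0, 4, "a"), (0, 5, "b"), (1, 2, "c"), (2, 3, "d")]

print(truncate_log(log, matrix))  # [(0, 5, 'b'), (2, 3, 'd')]
```

In this example the column minima are 4, 2, and 2, so records "a" and "c" are stable and can be deleted, while "b" and "d" must stay in the log.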
One problem with matrix timestamps is their O(N^2) space use. Another problem is the need
for global membership lists. We take the approach of partitioning replica sites into domains.
The hierarchical matrix timestamp (HMT) stores precise information about other sites in the
same domain, and summary information about other domains. The HMT
can reduce the space use to O(N) if the replica sites are evenly partitioned into O(√N)
domains. Furthermore, requiring only summary information about sites in remote domains greatly
increases the flexibility in large scale system management. We note that the algorithm of [7] cannot
take advantage of the K-safe optimization.
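The space comparison can be made concrete with a small Python sketch. The entry counts are an assumption based on the layout suggested by Figure 2: each site is taken to store a PP matrix with one row and one column per local site, and a PD matrix with one row per local site and one column per domain. The function names are hypothetical.

```python
import math


def matrix_entries(n):
    """Entries in a regular N x N matrix timestamp."""
    return n * n


def hmt_entries(n, num_domains):
    """Entries in an HMT, assuming an even partition into num_domains domains."""
    sites_per_domain = n // num_domains
    pp = sites_per_domain * sites_per_domain  # precise local information
    pd = sites_per_domain * num_domains       # per-domain summaries
    return pp + pd


for n in (100, 10_000):
    d = math.isqrt(n)  # about sqrt(N) domains
    print(n, matrix_entries(n), hmt_entries(n, d))
# 100 10000 200
# 10000 100000000 20000
```

With about √N domains, both PP and PD hold about N entries, so the total under this assumed layout is 2N entries rather than N^2.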
Using summary information about remote sites reduces the storage cost of the HMT, but results
in reduced knowledge about progress at remote sites. The result is an increase in the update log
size. We run simulation experiments to precisely quantify this tradeoff. We provide techniques
to reduce the log size and precisely specify parameter settings. An application of all optimization
techniques results in the same average log size as with the regular matrix timestamp algorithm.
The hierarchical vector timestamp and the hierarchical matrix timestamp are useful whenever
a large set of sites need to compute global predicates indicating that all sites have learned of a
special event (e.g., garbage collection, causal communication, consistent checkpointing, etc.). The
particular application we chose, garbage collection of logs in reliable update propagation schemes,
has special need of this technique because the matrix timestamp must be propagated. For future
work, we will apply the methods in this paper to other applications such as causal communications
and distributed predicate evaluation.
References
[1] D. Agrawal and A. Malpani. Efficient dissemination of information in computer networks. The
Computer Journal, 34(6):534-541, 1991.
[2] Y. Amir, D. Dolev, S. Kramer, and D. Malki. Membership algorithms for multicast communication
groups. In 6th International Workshop on Distributed Algorithms, Springer-Verlag,
pages 292-312, November 1992.
[3] K. Birman, A. Schiper, and P. Stephenson. Lightweight causal and atomic group multicast.
ACM Trans. on Computer Systems, 9(3):272-314, 1991.
[4] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and
D. Terry. Epidemic algorithms for replicated database maintenance. In Proceedings of the 6th
Annual ACM Symposium on Principles of Distributed Computing, pages 1-12. ACM, August
1987.
[5] D. Dolev, D. Malki, and R. Strong. An asynchronous membership protocol that tolerates
partitions. Technical Report TR94-6, Institute of Computer Science, The Hebrew University
of Jerusalem, 1994. Available at http://www.cs.huji.ac.il/papers/transis/transis.html.
[6] A. R. Downing, I. G. Greenberg, and J. M. Peha. OSCAR: A system for weak-consistency
replication. In IEEE Workshop on Management of Replicated Data, Houston, TX, November
1990.
[7] A. Heddaya, M. Hsu, and W. Weihl. Two phase gossip: Managing distributed event histories.
Information Sciences, 49:35-57, 1989.
[8] L. Kawell Jr., S. Beckhardt, T. Halvorsen, R. Ozzie, and I. Greif. Replicated document
management in a group communication system. In Proc. 2nd Conf. on Computer-Supported
Cooperative Work, 1988.
[9] R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat. Providing high availability using lazy
replication. ACM Transactions on Computer Systems, 10(4):360-391, November 1992.
[10] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications
of the ACM, 21(7):558-565, 1978.
[11] S. Meldal, S. Sankar, and J. Vera. Exploiting locality in maintaining potential causality. In
ACM Symp. Principles of Distributed Computing, pages 231-239, 1991.
[12] M. Rabinovich, N. Gehani, and A. Kononov. Scalable update propagation in epidemic replicated
databases. In Proc. 5th Conf. Extended Database Technologies, pages 207-222, 1996.
[13] S.K. Sarin and N. Lynch. Discarding obsolete information in a replicated database. IEEE
Trans. on Software Engineering, 13(1):39-47, 1987.
[14] A. Schiper, J. Eggli, and A. Sandoz. A new algorithm to implement causal ordering. In Proc.
3rd Intl. Workshop on Distributed Algorithms, pages 219-232, 1989.
[15] G.T.J. Wuu and A.J. Bernstein. Efficient solutions to the replicated log and dictionary problems.
In Proc. 3rd Symp. on Principles of Distributed Computing, pages 233-242, 1984.
[Figure: for processes 0 to 8, identified by (domain, index) pairs 0(0,0), 1(0,1), 2(0,2), 3(1,0), 4(1,1), 5(1,2), 6(2,0), 7(2,1), 8(2,2), the regular vector timestamp is (10, 15, 13, 14, 16, 18, 16, 19, 18). Selecting the minimum for each remote domain gives (10, 15, 13, 14, 14, 14, 16, 16, 16). The hierarchical vector timestamp consists of PP = (10, 15, 13), information about local processes 0 to 2, and PD = (10, 14, 16), summary information about domains 0 to 2.]
Figure 1: Hierarchical Vector Timestamp in Domain 0
[Figure: example PP matrices (indexed by process id) and PD matrices (indexed by process id and domain id) for processes 0 to 8 in domains 0 to 2; summary entries are obtained by selecting the minimum.]
Figure 2: Hierarchical Matrix Timestamp
[Figure: "Log size vs. local preference, 24 processors". Average log size (y-axis, 0 to 1,600) against local preference in % (x-axis, 0 to 80), with curves for 4, 6, 8, 10, and 12 domains and for the regular algorithm.]
Figure 3: Average log sizes of the HMT and regular propagation algorithms.
[Figure: "Log size vs. local preference, 60 processors". Average log size (y-axis, 0 to 5,000) against local preference in % (x-axis, 0 to 80), with curves for 4, 6, 8, 10, and 12 domains and for the regular algorithm.]
Figure 4: Average log sizes of the HMT and regular propagation algorithms.
[Figure: "Optimal local communication preference". Local communication preference in % (y-axis, 0 to 100) against processors per domain (x-axis).]
Figure 5: Empirically observed optimal settings of the local preference.
[Figure: "Log size vs. local preference, 60 processors; 8 domains, send TS-only messages". Average log size (y-axis, 0 to 5,000) against local preference in % (x-axis, 0 to 80), with a "no TS only" curve and curves labeled "1, 40% local", "1, 60% local", and "1, 80% local".]
Figure 6: Finding the optimal local preference for HMT-only messages.
[Figure: "Log size vs. local preference, 60 processors; 8 domains, send TS-only messages". Average log size (y-axis) against local preference in % (x-axis), with a "no TS only" curve and curves labeled ".125, 80% local", ".25, 80% local", ".5, 80% local", and "1, 80% local".]
Figure 7: Varying the rate of sending HMT-only messages.
[Figure: "Log size vs. local preference, 60 processors; 8 domains, vary K-safe". Average log size (y-axis, 0 to 5,000) against local preference in % (x-axis, 0 to 80), with curves k0, k1, k2, and k3.]
Figure 8: The performance of varying K-safe.
[Figure: "Log size vs. number of processors; number of domains is square root of number of processors". Average log size (y-axis, 0 to 2,000) against number of processors (x-axis, 20 to 60), with curves for best normal, best 2-safe, best TS-only, best log comp., and regular.]
Figure 9: Comparison of the HMT algorithms.
[Figure: "Log size vs. number of processors; number of domains is square root of number of processors". Average log size (y-axis, 0 to 2,000) against number of processors (x-axis), with curves for best normal, best 2-safe, best all opt., simple, and regular.]
Figure 10: Comparison of all algorithms.
