Designing Distributed Search Structures with Lazy Updates*
Theodore Johnson Padmashree Krishna
Dept. of Computer and Information Science
University of Florida
Gainesville, Fl 32611-2024
Abstract
Very large database systems require distributed storage for expansibility and high throughput, which
means that they need distributed search structures for fast and efficient access to the data. In a highly
parallel distributed search structure, parts of the index must be replicated to avoid serialization bottlenecks.
Designing distributed and replicated search structures is made difficult by the complex interaction of the
search structure concurrency control and the replica coherency algorithms. In this paper, we present an
approach to maintaining distributed data structures that uses lazy updates, which take advantage of the
semantics of the search structure operations to allow for scalable and low-overhead replication. Lazy
updates can be used to design distributed search structures that support very high levels of concurrency.
The alternatives to lazy update algorithms (eager updates) use synchronization to ensure consistency,
while lazy update algorithms avoid blocking. Since lazy updates avoid the use of synchronization, they are
much easier to implement than eager update algorithms. We develop a correctness theory for lazy update
algorithms, then present lazy update algorithms to maintain a dB tree, which is a distributed B+ tree
that replicates its interior nodes for highly parallel access. We show how the algorithms can be applied to
the construction of other distributed search structures.
1 Introduction
A distributed search structure is a search structure that is stored across several different processors that
communicate through message passing. A distributed search structure is used to index and manage dis-
tributed storage, or to provide a location-independent name service. As distributed and parallel database
techniques are used to improve transaction processing performance in a scalable manner, distributed search
structures become a critical issue.
A common problem with distributed search structures is that they are single-rooted. If the root node is
not replicated, it becomes a bottleneck and overwhelms the processor that stores it [2]. A search structure
node can be replicated by one of several well-known algorithms [3, 8]. However, these algorithms synchronize
operations, which reduces concurrency and creates significant communications overhead.
While some distributed search structure algorithms have been proposed (see section 1.1.1 for references),
most work on concurrent or parallel-access search structures has assumed a shared-memory implementation
(see [34, 18] for a survey). One reason for the limited number of distributed search structures is the difficulty
in designing them. A methodology for designing distributed search structures is needed before they can
become popularly available.
*This work was partially supported by USRA grant number 5555-19. Part of the work was performed while Theodore Johnson was an ASEE summer faculty fellow at NASA's National Space Science Data Center.
Techniques exist to reduce the cost of maintaining replicated data and for increasing concurrency. Joseph
and Birman [19] propose a method in which updates in a distributed database are piggybacked on synchro-
nization messages. Ladin, Liskov, and Shira propose lazy replication for maintaining replicated servers [22],
making use of the dependencies in the operations to determine if a server's data is sufficiently up-to-date.
Lazy update algorithms are similar to lazy replication algorithms because both use the semantics of an
operation to reduce the cost of maintaining replicated copies. The effects of an operation can be lazily sent
to the other servers, perhaps on piggybacked messages.
Lazy updates have a number of pragmatic advantages over more eager algorithms. They significantly
reduce maintenance overhead. They are highly concurrent, since they permit concurrent reads, reads concur-
rent with updates, and concurrent updates (at different nodes). Finally, they are easy to implement because
they avoid the use of synchronization. We present in this paper general-purpose techniques for designing
distributed search structures using lazy updates.
We first present a framework for showing the correctness of lazy update algorithms. We next discuss an
increasingly general set of lazy update algorithms for implementing a distributed B-tree, the dB-tree [15, 16].
Finally, we show how some additional distributed search structures can be written.
1.1 Distributed B-trees
We initiated our study of distributed search structures by designing a distributed B-tree, the dB-tree. We
use the dB-tree as a running example to illustrate techniques for designing distributed search structures.
1.1.1 Previous Work
A B-tree is a multi-ary tree in which every path from the root to a leaf is the same length. The tree is kept
in balance by adjusting the number of children in each node. In a B+-tree, the keys are stored in the leaves
and the non-leaf nodes serve as the index. A B-link tree is a B+-tree in which every node contains a pointer
to its right sibling [7].
Previous work on parallel-access search structures (see [34] for a survey) has concentrated on concurrent
or shared-memory implementations. Particularly notable are the B-link tree algorithms [24, 32, 23] which
we use as a base for the dB-tree. These algorithms have been found to provide the highest concurrency of
all concurrent B-tree algorithms [17]. In addition, operations on a B-link tree access one node at a time.
A B-link tree's high performance and node independence makes it the most attractive starting point for
constructing a distributed search structure.
The key to the high performance of the B-link tree algorithms is the use of the half-split action, illustrated in Figure 1.

Figure 1: Half-split action (initial state; the half-split; completing the split at the parent).

If a key is inserted into a full node, the node must be split and a pointer to the new sibling
inserted into the parent (the standard B-tree insertion algorithm). In a B-link tree, this action is broken
into two steps. First, the sibling is created and linked into the node list, and half the keys are moved from
the node to the new sibling (the half-split). Second, the split is completed by inserting a pointer to the new
sibling into the parent. If the parent overflows, the process continues recursively.
During the time between the half-split of the node and the completion of the split at the parent, an
operation that is descending the tree can misnavigate and read the half-split node when it is searching for
a key that moved to the sibling. In this case, it will detect the mistake using range information stored in
the node and use the link to the sibling to recover from the error. As a result, all actions on the B-link tree
index are completely local. A search operation examines one node at a time to find its key, and an insert
operation searches for the node that contains its key, performs the insert, then restructures the tree from
the bottom up.
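To make the half-split concrete, the following minimal sketch (a single-process illustration with assumed names Node, half_split, and search, not the paper's dB-tree code) creates a sibling, moves half the keys, and shows a search recovering through the right link after the node has been half-split but before the split is completed at the parent.

```python
# Minimal single-process sketch of a B-link node, its half-split, and
# link-based recovery from misnavigation.  Node, half_split, and search are
# illustrative names and simplifications, not the paper's dB-tree code.

class Node:
    def __init__(self, keys, high_key, right=None):
        self.keys = sorted(keys)     # keys held by this node
        self.high_key = high_key     # upper bound of this node's key range
        self.right = right           # pointer to the right sibling (the link)

def half_split(node):
    """Step 1 of a B-link split: create the sibling and link it in.
    Step 2 (inserting a pointer to the sibling into the parent) happens later."""
    mid = len(node.keys) // 2
    sibling = Node(node.keys[mid:], node.high_key, node.right)
    node.keys = node.keys[:mid]
    node.high_key = sibling.keys[0]  # node now covers only keys below the separator
    node.right = sibling
    return sibling, node.high_key    # (new sibling, separator to post to the parent)

def search(node, key):
    """Follow right links to recover if the key has moved to a sibling."""
    while key >= node.high_key and node.right is not None:
        node = node.right            # misnavigation recovery via the link
    return key in node.keys

# A search started at the old node still succeeds after the half-split.
n = Node([10, 20, 30, 40], high_key=100)
half_split(n)
assert search(n, 40)                 # 40 moved to the sibling; the link finds it
```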
Some work has been done to develop a distributed B-tree. Colbrook et al. [6] developed a pipelined
algorithm. Wang and Weihl [39, 41] have proposed that parallel B-trees be stored using Multi-version
Memory, a special cache coherence algorithm for linked data structures. Multi-version Memory permits
only a single update to occur on a replicated node at any point in time (analogous to value logging [40, 3] in
transaction systems). Our algorithm permits concurrent updates on replicated nodes (analogous to transition
logging [40, 3]).
Peleg [12, 30] has proposed several structures for implementing a distributed dictionary. The concern of
these papers is the message complexity of access and data balancing. However, the issues of efficiency and
concurrent access are not addressed. Matsliach and Shmueli [27] propose methods for distributing search
structures in a way that has a high space utilization. The authors assume that the index is stored in shared
memory, however, and don't address issues of concurrent restructuring.
Dietzfelbinger and Meyer auf der Heide [9] give algorithms for implementing a hash table on a syn-
chronous network. Ranade [31] gives algorithms and performance bounds for implementing a search tree in
a synchronous butterfly or mesh network.
Some related work has been done to implement hash tables. Yen and Bastani [43] have developed
algorithms for implementing a hash table on a SIMD parallel computer, such as a CM2. The authors
examine the use of chaining, linear probing, and double hashing to handle bucket overflows.
Ellis [11] has proposed algorithms for a distributed hash table. The directories of the table are replicated
among several sites, and the data buckets are distributed among several sites. Ellis' algorithm for maintaining
the replicated directories is similar in many ways to our lazy update algorithms. Litwin, Neimat, and
Schneider [26] propose a simple yet effective algorithm for a distributed linear hash table that uses a form of
a replicated directory. They extended this work [38] to the order-preserving RP* hash table. The algorithms
in these works are similar to the algorithms discussed here, but the updates are performed reactively instead
of proactively.
Other work on distributed search structures has primarily addressed distributing a search structure over
several disks. These works include [37, 28, 33].
The contribution of this work is to present a method for constructing algorithms for distributed search
structures. Unlike much of the previous work, our algorithms are explicitly designed for asynchronous
distributed systems (or asynchronous parallel processors). While Ellis [11] and Litwin, Neimat, and Schneider
[26, 38] have also proposed distributed search structure algorithms with a similar flavor, we present a structure
for understanding, designing, and proving correct more complex distributed search structure algorithms.
1.1.2 The dB-tree
We use the dB-tree as a running example to demonstrate the application of lazy updates. In this section we
briefly describe the dB-tree [15, 16].
The B-link tree protocol has two features that make it a promising starting point for designing a dis-
tributed search structure. First, all restructuring is performed as a sequence of local actions. Second, the
B-link tree allows an operation to recover from a misnavigation. As a result, global co-ordination
is not needed to restructure the index.
The dB-tree [15, 16] implements the B-link tree algorithm as a distributed protocol (as in [2, 11, 26]).
An operation on the index (search, insert, or delete) is performed as a sequence of actions on the nodes in
the search structure, which are distributed among the processors. Each processor that maintains part of
the search structure has two components: a queue manager, which maintains a queue of pending actions
(the message queue); and a node manager, which repeatedly takes an action from the queue manager and
performs the action on a node. The action execution typically generates a subsequent action on another
node (for example, searching one index node leads to searching another node). If the next node to process
is stored locally, then a new entry is put into the message queue. Otherwise, the node manager sends a
message to the appropriate remote queue manager indicating the action to be taken. We assume that the
processing of one action can't be interrupted by the processing of another action, so an action is implicitly
atomic.
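A minimal single-threaded sketch of this processing model is shown below; the Processor class, the in-memory network map, and the action signature are assumptions made for illustration rather than the paper's implementation.

```python
# Single-threaded sketch of the per-processor queue manager / node manager
# pair.  The Processor class, the in-memory "network" map, and the action
# signature are illustrative assumptions, not the paper's implementation.

from collections import deque

class Processor:
    def __init__(self, pid, network):
        self.pid = pid
        self.network = network        # pid -> Processor; stands in for message passing
        self.nodes = {}               # locally stored node copies, keyed by node id
        self.message_queue = deque()  # the queue manager's queue of pending actions

    def post(self, entry):
        """Queue manager: accept a pending (action, node_id, params) entry."""
        self.message_queue.append(entry)

    def run_one(self):
        """Node manager: take one action and run it atomically on a local node.
        The action returns subsequent actions of the form
        (next_action, next_node_id, next_params, owner_pid), which are routed
        to the local queue or to the remote queue manager that stores the node."""
        if not self.message_queue:
            return
        action, node_id, params = self.message_queue.popleft()
        for next_action, next_node, next_params, owner in action(self.nodes[node_id], params):
            target = self if owner == self.pid else self.network[owner]
            target.post((next_action, next_node, next_params))
```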
All operations start by accessing the root of the search structure. A search operation follows the link
protocol [24, 32] until it reaches the leaf that can contain the key for which it is searching, then reports the
result of the search. An insert or delete operation searches for the leaf that can contain the key to be inserted or deleted, then performs the operation. An insert might cause the node to become too full, in which case
the node is half split. A delete might also cause similar restructuring [16].
If there is only one copy of the root, then access to the index is serialized. Therefore, we want to replicate
the root widely in order to improve parallelism. As we increase the degree of replication, however, the cost
of maintaining coherent copies of a node increases. Since the root is rarely updated, maintaining coherence
at the root isn't a problem. A leaf is rarely accessed, but a significant portion of the accesses are updates.
As a result, wide replication of leaf nodes is prohibitively expensive.
In the dB-tree the leaf nodes are stored on a single processor. We apply the rule that if a processor stores
a leaf node, it stores every node on the path from the root to that leaf (a previous work [1] shows that this is
a good strategy). An example of a dB-tree that uses this replication policy is shown in Figure 2. The dB-tree
replication policy stores the root everywhere, the leaves at a single processor, and the intermediate nodes at
a moderate level of replication. As a result, an operation can be initiated at every processor simultaneously,
but the effects of updates are localized. As a side effect, an operation can perform much of its searching
locally, reducing the number of messages passed. This paper focuses on replication and design strategies, and
a performance analysis of dB-tree implementation choices is beyond the scope of this work.
The replication strategy for a dB-tree helps to reduce the cost of maintaining a distributed search struc-
ture, but the replication strategy alone is not enough. If every node update required the execution of an
available-copies algorithm [3], the overhead of maintaining replicated copies could be prohibitive. Instead,
we take advantage of the semantics of the actions on the search structure nodes and use lazy updates to maintain the replicated copies inexpensively.

Figure 2: A dB-tree
We note that many of the actions on a dB-tree node commute. For example, consider the sequence of
actions that occurs in Figure 3. Suppose that nodes A and B split at "about the same time". Pointers to
the new siblings must be inserted into the parent, of which there are two copies. A pointer to A' is inserted
into the first copy of the parent and a pointer to B' is inserted into the second copy of the parent. At this
point, the search structure is inconsistent, since not only does the parent not contain a pointer to one of its
children, but the two copies of the parent don't contain the same value.
The tree in Figure 3 is still usable, since no node has been made unavailable. Further, the copies of
the parents will eventually converge to the same value. Therefore, there is no need for one insert action
to synchronize with other insert actions on the node. The tree is always navigable, so the execution of an
insert doesn't block a search action. We call node actions with such loose synchronization requirements lazy
updates.
Before concluding this introduction, we should mention some useful characteristics of lazy updates.
First, when a lazy update is performed at one copy of a node, it must also be performed at the other copies.
Since the lazy update commutes with other updates, there is no pressing need to inform the other copies of
the update immediately. Instead, the lazy update can be piggybacked onto messages used for other purposes,
greatly reducing the cost of replication management (as in [19, 22]). Second, index node searches and updates
commute, so that one copy of a node may be read while another copy is being updated. Further, two updates
to the copies of a node may proceed at the same time. As a result, the dB-tree not only supports concurrent
read actions on different copies of its nodes, it supports concurrent reads and updates, and also concurrent
updates.
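The claim that inserts on copies commute can be checked directly. The following toy sketch (a node copy modeled as a map from keys to child pointers; all names illustrative) applies two inserts to two copies in opposite orders and verifies that the copies converge to the same value.

```python
# Two copies of an index node, modeled as key -> child-pointer maps.
# The inserts arrive in opposite orders at the two copies, yet both copies
# converge to the same value, so no ordering agreement (and no blocking)
# is needed between the insert actions.

def insert(copy, key, child):
    new_copy = dict(copy)            # apply the action, producing the final value
    new_copy[key] = child
    return new_copy

copy1 = {"B": "node_B"}
copy2 = {"B": "node_B"}

copy1 = insert(insert(copy1, "A'", "ptr_A'"), "B'", "ptr_B'")   # A' then B'
copy2 = insert(insert(copy2, "B'", "ptr_B'"), "A'", "ptr_A'")   # B' then A'

assert copy1 == copy2                # the copies converge without synchronization
```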
Figure 3: Lazy inserts. Nodes A and B half-split at about the same time; a pointer to A' is inserted into copy 1 of the parent and a pointer to B' is inserted into copy 2.
2 Correctness
Shasha and Goodman [34] provide a framework for showing that (non-replicated) concurrent search
structures are correct and serializable. We would like to show that when algorithms that have been vali-
dated by the methods of [34] are translated to replicated search structures, they are still correct. Here, we
concentrate on the replica coherence protocols.
In this section, we present the theory for the case when replicated nodes never increase their key range.
That is, replicated nodes never merge. This assumption makes the presentation of the correctness theory
much simpler, because the set of actions on a node are well-defined. In Section 4, we extend the theory to
handle merged nodes, and present algorithms that correctly handle actions on merged nodes.
Intuitively, we want the replicated nodes of the search structure to contain the same value eventually. We
can ensure the coherence of the copies by serializing the actions on the nodes (perhaps via an available-copies
algorithm [3]). However, we want to be lazy about the maintenance. In addition, the replica update protocol
should support the design of distributed search structure algorithms. In this section, we describe a model of
distributed search structure computation and establish correctness criteria for lazy updates.
Let A" be the set of nodes that exist in the search structure during execution E (we will define C more
precisely below). A node n E AN of the logical search structure might be stored at several different processors.
We say that the physically stored replicas of the logical node are copies of the logical node. We denote by
copiest(n) the set of copies that correspond to node n at time t.1
An operation (i.e., search for a key, insert a key) is performed by executing a sequence of actions (i.e.,
search this copy) on the copies of the nodes of the search structure. The specification of an action on a copy
has two components: a final value c' and a subsequent action set SA. An action that modifies a node is
By "time", we mean an ordering on the events that occur in S that is consistent with causality [5].
performed on one of the copies first, then is relayed to the remaining copies. The action that performs the
first modification is an initial action, and the actions that are relayed to the other copies are the relayed
actions. We distinguish between the initial and the relayed actions. Thus, the specification of an action is:
a^s(p, c) → (c', SA)
When action a with parameter p is performed on copy c, copy c is replaced by c' (the final value of a)
and the subsequent actions in SA are scheduled for execution. Each subsequent action in SA is of the form
(a_j, p_j, c_j), indicating that action a_j with parameter p_j should be performed on copy c_j (perhaps c and c_j are copies of different nodes). If copy c_j is stored locally, the processor puts the action in the message queue. Otherwise, the action is sent to a processor that stores c_j. If the subsequent action can be performed on any copy of a node (e.g., "search the next node"), we assume that the action chooses the copy. If the action
is a return value action, a message containing the return value is sent to the processor that initiated the
operation. If the final value of a(p, c) is c for every valid p and c, then a is a non-update action; otherwise a
is an update action. The superscript s is either init or relay, indicating an initial or a relayed action. We
also distinguish initial actions by writing them in capitals, and relayed actions by writing them in lowercase
(I and i for an insert, for example).
We need to make some assumptions about the execution of the actions. First, we assume that the
execution of the actions is deterministic. Second, we assume that update actions on replicated nodes execute
in two phases: the initial action and the relayed actions. The initial action might execute several times
at different nodes because of navigation. When the initial action navigates to the correct node, it will fire,
perform its modifications, and issue the relayed actions. Generally, relayed actions have empty subsequent
action sets.
In order to discuss the commutativity of actions, we will need to specify whether the order of two actions can be exchanged. If action a^s with parameter p can be performed on c to produce subsequent action set SA, then the action is valid; otherwise the action is invalid. We note that the validity of an action does not
depend on the final value. The intuition behind this definition is that we only care about the values in copies
when the system is at rest, but it is hard to revise actions sent to other copies.
An algorithm might require that some actions must be performed on all copies of a node, or on all copies
of several nodes, "simultaneously". Thus, we group some action sequences into atomic action sequences,
or AAS. The execution of an AAS at a copy is initiated by an AAS_start action and terminated by an AAS_finish action. A copy may have one or more AAS currently executing. An AAS will commute with some actions (possibly other AAS_start actions), and conflict with others. We assume that the node manager
at each processor is aware of the AAS-action conflict relationships, and will block actions that conflict with
currently executing AAS. The AAS is the distributed analogue of the shared memory lock, and can be used
to implement a similar kind of synchronization (as in [11]). However, lazy updates are preferable.
Example To make the concepts more clear, let us consider a simple example of a distributed B-tree and
write the specifications of the possible actions. The algorithm that we specify will implement a dB-tree along
the lines discussed in Section 1.1. To make the algorithm easier to specify, we assume that
1. The dB-tree has already been created.
2. No new copies of interior nodes are created.
3. Interior nodes are never split.
4. The only operations are to search for a key and to insert a key.
These assumptions are unrealistic, but they let us avoid some difficulties that are discussed in later
sections. The algorithm is presented in Table 1. A search operation is initiated by calling the Search(k,c)
action on a copy c of the root node, and an insert operation is initiated by calling the Insert_d(k,c) action
on a copy of the root node. The execution of the action is specified by the modification made to the copy,
and the subsequent action set. Typically, more than one execution is possible, so we list the conditions for
each execution. In the algorithm, we make use of the function keyrange_c(k, c'). This function returns true if, based on the information in c, key k can be found in the subtree rooted at c', and false otherwise.
The search operation returns a value by scheduling the Answer(t/f,O) action, where t/f is true or false,
depending on whether k is in c, and O represents the task that issued the query. The Insert_d performs the
downwards phase of the insert operation. When the proper leaf node is found, the insert is performed. If the
leaf was full, it is half-split, creating the new leaf sibling. The restructuring phase of the insert is initiated
by sending the Insert_u action to the parent of c. The exact parent of c might not be known, hence Insert_u
must be able to navigate sideways. When the initial update Insert_u finally fires on copy c of node n, the
relayed updates insert_u are sent to the other copies of n.
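The rows of Table 1 translate almost directly into functions that map an action's parameters and a copy to the copy's final value and subsequent action set. The sketch below shows the Search and Insert_d cases; the copy layout and the helpers keyrange, pick_child, and halfsplit are simplified assumptions, not the paper's definitions.

```python
# Sketch of a few rows of Table 1 as functions of the form
# (action parameters, copy) -> (final value of the copy, subsequent actions).
# The copy layout and the helpers keyrange, pick_child, and halfsplit are
# simplified assumptions, not the paper's definitions.

def keyrange(copy, key):
    return copy["low"] <= key < copy["high"]

def pick_child(copy, k):
    """Assumed helper: choose the child whose key range contains k."""
    return next(c for lo, hi, c in copy["children"] if lo <= k < hi)

def halfsplit(copy, k):
    """Assumed helper: move the upper half of the keys to a new sibling."""
    keys = sorted(copy["keys"] | {k})
    mid = len(keys) // 2
    lower = dict(copy, keys=set(keys[:mid]), high=keys[mid])
    sibling = dict(copy, keys=set(keys[mid:]), low=keys[mid])
    return lower, sibling, keys[mid]

def search(copy, k, origin):
    if not copy["is_leaf"]:
        if keyrange(copy, k):                          # descend to a child
            return copy, [("Search", k, pick_child(copy, k))]
        return copy, [("Search", k, copy["right"])]    # misnavigated: follow the link
    return copy, [("Answer", k in copy["keys"], origin)]

def insert_d(copy, k, origin):
    if not copy["is_leaf"]:
        target = pick_child(copy, k) if keyrange(copy, k) else copy["right"]
        return copy, [("Insert_d", k, target)]
    if len(copy["keys"]) < copy["capacity"]:           # room in the leaf
        return dict(copy, keys=copy["keys"] | {k}), [("Answer", True, origin)]
    lower, sib, sep = halfsplit(copy, k)               # full leaf: half-split first
    return lower, [("Insert_u", (sib, sep), copy["parent"])]
```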
2.1 Histories
In order to capture the conditions under which actions on a copy commute, we model the value of a copy
by its history (as in [3, 13, 42]). We note that our algorithms do not require that nodes be stored as history
lists; we use histories only as a modeling technique. A typical algorithm will make use of a small amount of
stored information to make its decisions.
Search(k,c)
  Condition: c is not a leaf; keyrange_c(k, c) is true; c' is a child of c with keyrange_c(k, c') true.
    Final value of c: c.  Subsequent actions: Search(k, c').
  Condition: keyrange_c(k, c) is false; keyrange_c(k, c') is true.
    Final value of c: c.  Subsequent actions: Search(k, c').
  Condition: c is a leaf; keyrange_c(k, c) is true.
    Final value of c: c.  Subsequent actions: Answer(t/f, O).

Insert_d(k,c)
  Condition: c is not a leaf; keyrange_c(k, c) is true; c' is a child of c with keyrange_c(k, c') true.
    Final value of c: c.  Subsequent actions: Insert_d(k, c').
  Condition: keyrange_c(k, c) is false; keyrange_c(k, c') is true.
    Final value of c: c.  Subsequent actions: Insert_d(k, c').
  Condition: c is a leaf; keyrange_c(k, c) is true; c is not full.
    Final value of c: c ∪ k.  Subsequent actions: Answer(true, O).
  Condition: c is a leaf; keyrange_c(k, c) is true; c is full.
    Final value of c: (sib, sep) = halfsplit(c ∪ k).  Subsequent actions: Insert_u(sib, sep, parent(c)).

Insert_u(k,n,c)
  Condition: keyrange_c(k, c) is false; keyrange_c(k, c') is true.
    Final value of c: c.  Subsequent actions: Insert_u(k, n, c').
  Condition: keyrange_c(k, c) is true.
    Final value of c: c ∪ (k, n).  Subsequent actions: insert_u(k, n, c') for every other copy c' of n.

insert_u(k,n,c)
  Condition: none.
    Final value of c: c ∪ (k, n).  Subsequent actions: none.

Table 1: Actions for a simple dB-tree.
An execution E_t is a record of all operations that have been initiated and all actions that have been
executed up to time t (where t represents a consistent state). The record of an operation specifies its first
action. The record of an action is discussed below.
The total history of copy c ∈ copies_t(n) consists of the pair (V_c, A_c^t), where V_c is the initial value of c and A_c^t is a totally-ordered set of actions executed on c up to time t. We define correctness in terms of the update actions, since non-update actions should not be required to execute at every copy. The (update) history of a copy is a pair (V_c, A_c), where V_c is the same initial value as in the total history and A_c is A_c^t with the non-update actions deleted (and the order on the update actions preserved). To remove the distinction between initial and relayed actions, we define the uniform history U(H) to be the update history H with each action a^s(p, c) replaced by a(p, c).
The actions in E_t aren't necessarily the same as the actions in the copy histories, {H_c | c ∈ copies(n), n ∈ N}. We will permit the algorithms to re-write their histories to ensure that all copy histories meet some correctness requirements, discussed below.
Since a history (whether total, update, or uniform) is totally ordered, we can assign index numbers to the actions in the history. We will need to keep track of the subsequent action sets in the history, so we will denote an action as (a, p, c, SA), where a is the action, p is the parameter of the action, c is the copy, and SA is the subsequent action set that was generated. So, we can write the history of copy c, (V_c, A_c), as H_c = V_c ∏_{j=1}^{m} (a_j, p_j, c, SA_j), where A_c = ((a_1, p_1, c, SA_1), (a_2, p_2, c, SA_2), ..., (a_m, p_m, c, SA_m)). The product sign ∏ denotes the application of actions to the copy and lets us specify boundaries, but should not be understood as implying properties such as associativity and commutativity. Where there is no confusion, we will abbreviate ∏_{j=1}^{m} (a_j, p_j, c, SA_j) as ∏_1^m a_j.
The final value of a history is the final value of the last action in the history. Suppose that H_c = V_c ∏_{j=1}^{m} a_j, and that V_c is the final value of H' = V' ∏_{j=1}^{m'} a'_j. Then H* = (V' ∏_{j=1}^{m'} a'_j) ∏_{j=1}^{m} a_j is the backwards extension of H_c by H'. It is easy to see that H_c and H* have the same final value, and the last m actions in H* have the same subsequent action sets as the m actions in H_c. When a node is created, it has an initial value, V_n. When a copy of a node is created it is given an initial value, which we call the initial value of the copy. The initial value should be chosen in some meaningful way, and will typically be equivalent to the history of the creating copy, or to a synthesis of the histories of the existing copies. In either case, the new copy will have a backwards extension that corresponds to the history of update actions performed on the node. If a copy is deleted, we will retain its history to capture backwards extensions, but no further actions will be performed on it.
When we compare copy histories, we assume that the execution has reached a quiescent state. That is, every action that is specified in an operation or a subsequent action set in E_t has been executed. In the presentation, we do not explicitly tie our observations to a particular point in time and instead implicitly assume a quiescent point in time t. We denote the set of all initial update actions performed on a copy of node n by time t as M_n, the update history of node n. Similarly, the set of all actions performed on a copy of node n is the total history of node n.
We recall that an action on a copy is valid if the action on the current value of the copy produces its associated subsequent action set. A history H_c is valid if update action a_j is valid on V_c ∏_{k=1}^{j-1} a_k for every update action a_j in H_c. We need a definition by which we can compare the histories of two different copies and say that they are the same. We say that history H_1 is compatible with history H_2 if

1. H_1 and H_2 have the same initial value.

2. The actions in H_1 can be rearranged to form a valid history H_1' such that U(H_1') = U(H_2).

3. The final values of H_1, H_1', and H_2 are the same.
Let N be the set of all nodes accessed in E. Our correctness criteria for the replica maintenance algorithms are the following:

Compatible History Requirement: A node n ∈ N with initial value V_n and update action set M_n has compatible histories if, at the end of computation E,

1. Every copy c ∈ copies(n) with history H_c has a backwards extension H'_c = B_c H_c such that the initial value of H'_c is V_n and the update actions in H'_c are exactly the actions in M_n.

2. There is a valid history H_s (the standard history) with initial value V_n and update actions M_n such that every H'_c is compatible with H_s.
If an algorithm guarantees that every node has a compatible history, then it meets the compatible history
requirement.
Complete History Requirement: We will say that an action a is issued if a is an initial action that appears in the subsequent action set of a', a' ∈ E_t, such that a and a' have different actions and parameters. If
every action that is issued appears as a fired action in some node's total history, then the computation meets
the complete history requirement. If every computation that an algorithm produces satisfies the complete
history requirement, then the algorithm satisfies the complete history requirement.
Ordered History Requirement: We define an ordered action to be one that belongs to a class τ such that all actions of class τ are time-ordered with each other (we assume a causal and total order exists). A history H is an ordered history if for any ordered actions h_1, h_2 ∈ H of class τ, if h_1 <_τ h_2 then h_1 < h_2 in H. An algorithm meets the ordered history requirement if, for every node, its copies meet the compatible history requirement with a standard ordered history H_s.
The compatible history requirement guarantees that every node is single-copy equivalent when the com-
putation terminates. We note that our condition for rearranging histories is a condition on the subsequent action sets rather than on the intermediate values of the nodes. The copies need only have
the same value at the end of the computation, but the subsequent actions can't be posthumously issued or
withdrawn without a special protocol. However, we let search actions misnavigate.
The complete history requirement tells us that we must route every issued action to a copy. While the
compatible history requirement applies to update actions only, the complete history requirement applies to
all actions, including searches and relayed actions.
The ordered history requirement lets us remove explicit synchronization constraints on the equivalent
concurrent algorithm by shifting the constraints to the copy coherence algorithm. The power of this constraint
is explored in later sections.
2.1.1 Lazy Updates
An update action must be performed on all copies of a node. With no further information about the action,
it must be performed via an AAS to ensure that the conflicting actions are ordered in the same way at all
copies. However, some actions commute with almost all other actions, removing the need for an AAS. In
Figure 3, the final value of the node is the same at either copy, and the search structure is always in a good
state. Therefore, there is no need to agree on the order of execution. We provide a rough taxonomy of the
degree of synchronization that different updates require.
Lazy Update: We say that a search structure update is a lazy update if it commutes with all other lazy
updates, so synchronization is not required.
Semi synchronous update: Other updates are almost lazy updates, but they conflict with some but not
all other actions. For example, the actions may belong to a class of ordered actions. We call these semi
synchronous updates. A semi synchronous action requires special treatment, but does not require the
activation of an AAS.
Synchronous Update: A synchronous update requires an AAS for correct execution. We note that the
AAS might block only a subclass of other actions, or might extend to the copies of several different nodes.
3 Algorithms
In this section, we describe algorithms for the lazy maintenance of dB-trees, under increasingly relaxed
assumptions about structure. We work from a simple fixed-copies distributed B-tree to a more complex
variable-copies B-tree, and develop the tools and techniques that we need along the way. The algorithms
and proofs developed in this section demonstrate techniques for designing distributed search structures.
We assume that the network is reliable, delivering every message exactly once in order. In addition, we
initially assume that only search and insert operations are performed on the dB-tree. In Section 4, we discuss
how nodes can be merged or deleted.
3.1 Fixed-Position Copies
For this algorithm, we assume that every node has a fixed set of copies. This assumption lets us concentrate
on specifying lazy updates. Every node contains pointers to its children, its parent, and its right sibling.
When a node is created, its set of copies are also created, and copies of the node are neither created nor
destroyed.
We use the dB-tree algorithm described in Section 1.1. The actions in this algorithm are the same as
those in Table 1, with the exception that an Insert_u action will half-split an interior node if it is too full.
Since the half-split is being executed on a replicated node, it becomes its own action (i.e., not combined with
the insert, as in Table 1), and has an initial and a relayed version. The initial half-split creates the copies of the sibling node, adjusts the key range of the local copy c, deletes the keys (pointers) that are no longer in the key range of c, and adjusts the sibling pointers. The relayed half-split adjusts the key range, removes keys (pointers), and adjusts sibling pointers.
By design, search actions can always execute, so we do not explicitly discuss them. Since the leaves are
not replicated, we do not discuss actions on leaves either. So, the only actions of interest are the insert and
half-split actions on the interior nodes. We abbreviate the initial and relayed insert_u actions as I and i,
respectively, and the initial and relayed half-split actions as S and s. The first step in designing a replication
algorithm is to specify the commutativity relationships between actions.
1. Any two insert actions on a copy commute. As in Sagiv's algorithm [32], we need to take care to
perform out-of-order inserts properly.
2. Half-split operations do not commute. As an example, since a half-split action modifies the right-sibling
pointer, the final value of a copy depends on the order in which the half-splits are processed.
3. Relayed half-split actions commute with relayed inserts, but not with initial inserts. Suppose that in
history Hp, initial insert action I(A) is performed before a half-split action s that removes A's range
from p. Then, if the order of I and s is switched, I becomes an invalid action. A relayed insert action
has no subsequent actions, and the final value of the node is the same in either ordering. Therefore,
relayed half-splits and relayed inserts commute.
4. Initial half-split actions don't commute with relayed insert actions. One of the subsequent actions of
an initial half-split action is to create the new sibling. The key that is inserted either will or won't
appear in the sibling, depending on whether it occurs before or after the half-split.
By our classification methods, an insert is a lazy update and a half-split is a synchronous update. If the
ordering between half-splits and inserts isn't maintained, the result is lost updates. To see why this occurs,
examine Figure 4. In this scenario, there are three copies of a node n, copy 1, copy 2, and copy 3. Copy
1 performs an initial insert of key k concurrent with copy 2 processing a half-split that removes k from the
node. When copy 1 performs the relayed split, it throws away key k, and copy 2 ignores the relayed insert
because the key is out of range. The end result is that k does not appear in any copy of any node.
We present two algorithms to manage fixed-copy nodes. To order the half-splits, both algorithms use a
primary copy (PC), which executes all initial half-split actions (non-PC copies never execute initial half-split actions, only relayed half-splits). The algorithms differ in how the insert and half-split actions are
ordered. The synchronous algorithm uses the order of half-splits and inserts at the primary copy as the
standard to which all other copies must adhere (that is, the PC generates the standard history, H_s). The
semisynchronous algorithm requires that the ordering at the primary copy be consistent with the ordering
at all other nodes (see Figure 5).
We do not require that all initial insert actions are performed at the PC, so copies might find that they
exceed their maximum capacity. However, since each copy is maintained serially, it is a simple matter to
add overflow blocks.
Figure 4: An example of the lost-insert problem (copy 1 performs Insert(k) while copy 2 processes a half-split that removes k's range; after the relayed actions are processed, k appears in no copy).
3.1.1 Synchronous Splits
Algorithm: To specify the algorithm, we only need to specify the execution of the insert and half-split
actions at the interior nodes. A half-split action will invoke an AAS, and so is composed of several further
actions. To simplify the presentation, we describe the action executions by listing the steps taken, instead
of using a table. The synchronous split algorithm ensures that splits and inserts are ordered the
same way at the PC and at the non-PC copies. Figure 5 illustrates an execution.
Half-split: Only the PC executes initial half-split actions. Non-PC copies execute relayed half-split actions.
When the PC detects that it must half-split the node, it does the following:
1. Performs a split_start AAS locally. This AAS blocks all initial insert actions, but not relayed
insert or search actions.
2. The PC sends a split_start AAS to all of the other copies.
3. The PC waits for acknowledgements from all of the copies of the AAS.
4. When the PC receives all of the acknowledgements, it performs the half-split, creating all copies
of the new sibling and sending them the sibling's initial value.
5. The PC sends a split_end AAS to all copies, and performs a split_end AAS on itself.
When a non-PC copy receives a split_start AAS, it blocks the execution of initial inserts and sends an acknowledgement to the PC. The executions of further initial insert actions on the copy are blocked until the PC sends a split_end AAS. When the copy processes the split_end AAS, it modifies the range of the copy, modifies the right-sibling pointer, discards pointers that are no longer in the node's range, and unblocks the initial insert actions. That is, the split takes effect when the split_end is processed.
Insert: When a copy receives an initial insert action it does the following:
1. Checks to see if the insert is in the copy's range. If not, the insert action is sent to the right
sibling.
2. If the insert is in range, and the copy is performing a split AAS, the insert is blocked.
3. Otherwise, the insert is performed and relayed insert actions are sent to all of the other copies.
When a copy receives a relayed insert action, it checks to see if the insert is in the copy's range. If so,
the copy performs the insert. Otherwise, the action is discarded.
We note that since non-PC copies can't initiate a half-split action, they may be required to perform an
insert on a too-full node. Actions on a copy are performed on a single processor, so it is not a problem to
attach a temporary overflow bucket. The PC will soon detect the overflow condition and issue a half-split,
correcting the problem.
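A sketch of the primary copy's side of this protocol appears below. The send() function, the copy addresses, and the node fields are illustrative assumptions; acknowledgements and messages are assumed to be delivered to the handler methods by the surrounding system.

```python
# Sketch of the primary-copy (PC) side of the synchronous split.  send(),
# the copy addresses, and the node fields ("keys", "low", "high", "right")
# are illustrative assumptions, not the paper's implementation.

def in_range(node, key):
    return node["low"] <= key < node["high"]

def half_split(node):
    """Assumed helper: move the upper half of the keys to a new sibling."""
    keys = sorted(node["keys"])
    mid = len(keys) // 2
    sibling = dict(node, keys=set(keys[mid:]), low=keys[mid])
    node["keys"], node["high"] = set(keys[:mid]), keys[mid]
    return sibling

class PrimaryCopy:
    def __init__(self, node, copies, send):
        self.node, self.copies, self.send = node, copies, send
        self.splitting = False      # a split AAS is in progress at this copy
        self.deferred = []          # initial inserts blocked by the AAS
        self.acks_needed = 0

    def start_split(self):
        self.splitting = True                            # local split_start
        self.acks_needed = len(self.copies)
        for c in self.copies:
            self.send(c, ("split_start",))               # copies block initial inserts

    def on_ack(self):
        self.acks_needed -= 1
        if self.acks_needed == 0:                        # every copy has acknowledged
            sibling = half_split(self.node)              # create the sibling's copies
            for c in self.copies:
                self.send(c, ("split_end", sibling))     # copies apply the split
            self.splitting = False                       # local split_end
            pending, self.deferred = self.deferred, []
            for key in pending:                          # resume blocked inserts
                self.on_initial_insert(key)

    def on_initial_insert(self, key):
        if not in_range(self.node, key):
            self.send(self.node["right"], ("Insert_u", key))  # misnavigated: go right
        elif self.splitting:
            self.deferred.append(key)                    # blocked by the split AAS
        else:
            self.node["keys"].add(key)
            for c in self.copies:
                self.send(c, ("insert_u", key))          # relay the insert lazily
```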
Theorem 1 The synchronous split algorithm satisfies the complete, compatible, and ordered history require-
ments.
Proof: We observe that nodes always split to the right, so if an action misnavigates, it can always recover
its path (that is, Shasha and Goodman's fourth link-algorithm guideline is satisfied [34]). Therefore, the
synchronous split algorithm satisfies the complete history requirement.
Since there are no ordered actions, the synchronous split algorithm vacuously satisfies the ordered history
requirement.
We show that the synchronous algorithm produces compatible histories by showing that the histories at
each node are compatible with the uniform history at the primary copy. First, consider the ordering of the
half-split actions (a half-split is performed at a node when the split_end AAS is executed). All initial half-
split actions are performed at the PC, then are relayed to the other copies. Since we assume that messages
are received in the order sent, all half-splits are processed in the same order at all nodes.
Consider an initial insert I and a relayed half-split s performed at non-PC copy c. If I < s in H_c, then I must have been performed at c before the AAS_start for s arrived at c (because the AAS_start blocks initial inserts). Therefore, I's relayed insert i must have been sent to the PC before the acknowledgement of s was sent. By message ordering, i is received at the PC before S is performed at the PC, so i < S in H_PC. If s < I in H_c, then S < i in H_PC, because S < s and I < i (due to message passing causality). □
We note that this algorithm makes good use of lazy updates. For example, only the PC needs an
acknowledgement of the split AAS. If every communications channel between copies had to be flushed, a
split action would require O(|copies(n)|^2) messages instead of the O(|copies(n)|) messages that this algorithm
uses. Furthermore, search actions are never blocked.
Figure 5: Synchronous and semi synchronous split ordering. The synchronous split algorithm blocks new initial inserts while a split executes; the semi synchronous split algorithm never blocks inserts and instead rewrites history at the PC (issuing a correction when the insert is no longer in range) to ensure compatible histories.
3.1.2 Semi synchronous Splits
We can greatly improve on the synchronous-split algorithm. For example, the synchronous split algorithm
blocks initial inserts when a split is being performed. Furthermore, 3(|copies(n)| - 1) messages are required to perform the split. By applying the "trick" of rewriting history, we can obtain a simpler algorithm that never blocks insert actions and requires only |copies(n)| - 1 messages per split (and therefore is optimal).
The synchronous-split algorithm ensures that an initial insert I and a relayed split s at a non-PC copy are performed in the same order as the corresponding relayed insert i and initial split S are performed at the PC, with the ordering at the PC setting the standard. We can turn this requirement around and let the
non-PC copies determine the ordering on initial inserts and relayed splits, and place the burden on the PC
to comply with the ordering.
Suppose that the PC performs initial split S, then receives a relayed insert i_c from copy c, where I_c was performed before s at c (see Figure 5). We can keep H_PC compatible with H_c by rewriting H_PC, inserting i_c before S in H_PC. If i_c's key is in the PC's range, then H_PC can be rewritten by performing i_c on the PC. Otherwise, i_c's key should have been sent to the sibling that S created. Fortunately, the PC can correct its mistake by creating a new initial insert with i_c's key and sending it to the sibling. This is the basis for the semi synchronous split algorithm.
Algorithm: The semi synchronous split algorithm is the same as the synchronous split algorithm, with
the following exceptions:
1. When the PC detects that a split needs to occur, it performs the initial split (creates the copies of the
new sibling, etc.), then sends relayed split actions to the other copies.
2. When a non-PC copy receives a relayed split action, it performs the relayed split.
3. If the PC receives a relayed insert and the insert is not in the range of the PC, the PC creates an initial
insert action and sends it to the right neighbor.
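The essential difference is in how the PC handles a relayed insert that arrives after a split the PC has already performed; a sketch of that rule is shown below (same illustrative node fields and send() as before, not the paper's code).

```python
# Sketch of the PC's relayed-insert rule under the semi-synchronous split.
# No blocking: if the relayed insert is out of range because of a split the
# PC has already performed, the PC issues a corrective initial insert instead.

def pc_on_relayed_insert(node, key, send):
    if node["low"] <= key < node["high"]:
        node["keys"].add(key)                    # in range: i moves before S in H_PC
    else:
        send(node["right"], ("Insert_u", key))   # out of range: correction to the sibling
```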
Theorem 2 The semi synchronous split algorithm satisfies the complete, compatible, and ordered history requirements.
Proof: The semi synchronous algorithm can be shown to produce complete and ordered histories in the same
manner as in the proof of Theorem 1.
We need to show that all copies of a node have compatible histories. Since relayed inserts and relayed
splits commute, we need only consider the cases when at least one of the actions is an initial action. Suppose
that copy c performs initial insert I after relayed split s. Then, by message causality, the PC has already
performed S, so the PC will perform i after S.
Suppose that c performs I before s and the PC performs i after S. If i is in the range of the PC after S, then i can be moved before S in H_PC without modifying any other actions. If i is no longer in the range of the PC after S, then moving i before S in H_PC requires that S's subsequent action set be modified to include sending i to the new sibling. This is exactly the action that the algorithm takes. So, we re-write H_PC to make it compatible with the standard history. □
Discussion: Theorem 2 shows that we can take advantage of the semantics of the insert and split actions
to lazily manage replicated copies of the interior nodes of the B-tree. The key trick here is to examine the
copy histories and issue a 'correction' if an incompatibility is discovered. Note that while the algorithms
are motivated by an examination of copy histories, the actual algorithms use only a small amount of saved state
to make decisions.
In the next section, we observe a different type of lazy copy management that also simplifies implemen-
tation and improves performance.
3.2 Single-copy Mobile Nodes
In this section, we briefly examine the problem of lazy node-mobility. The solution requires an algorithm
to enforce ordered actions. We assume that there is only a single copy of each node, but that the nodes
of the B-tree can migrate from processor to processor (typically, to perform load-balancing). When a node
migrates, the host processor can broadcast its new location to every other processor that manages the node.
However, this approach wastes a large amount of effort and doesn't solve the garbage collection
problems.
The algorithms that we propose inform the node's immediate neighbors of the new address. In order to
find the neighbors, a node contains links to both its left and right sibling, as well as to its parent and its
children. When a node migrates to a different processor, it leaves behind a forwarding address. If a message
arrives for a node that has migrated, the message is routed by the forwarding address. We are left with
the problem of garbage-collecting the forwarding addresses (when is it safe to reclaim the space used by a
forwarding address). As with the fixed-copies scenario, we propose an eager and a lazy algorithm to satisfy
the protocol. We have implemented the lazy protocol, and found it effectively supports data balancing [20].
The eager algorithm ensures that a forwarding address exists until the processor is guaranteed that no
message will arrive for it. Unfortunately, obtaining such a guarantee is complex and requires much message
passing and synchronization. We omit the details of the eager algorithm to save space.
Suppose that a node migrates and doesn't leave behind a forwarding address. If a message arrives for
the migrated node, then the message clearly has misnavigated. This situation is similar to the misnavigated
operations in the concurrent B-link protocol, which suggests that we can use a similar mechanism to recover
from the error. We need to find a pointer to follow. If the processor stores a tree node, then that node
contains the first link on the path to the correct destination. So the error-recovery mechanism is to find a
node that is 'close' to the destination, and follow that set of links.
The other issue to address is the ordering of the actions on the nodes (since there is only one copy, every
node history is vacuously compatible). The possible actions are the following: insert, split, migrate, and
link-change. The link-change actions are a new development in that they are issued from an external source,
and need to be performed in the order issued. The problem that can arise is shown in Figure 6. In this
scenario, nodes A and B are neighbors. Node A half-splits to produce node sib 1. Sib 1 points to A and B,
and a link-change action is sent to node B to inform node B of its new left neighbor. Before node B can
process this link-change operation, sib 1 half-splits to produce sib 2, and another link-change action is sent
to B. If node B processes the link-change for sib 2 before the link-change for sib 1, the backwards list will be disordered for an indefinite period of time.
Figure 6: Possible inconsistency if back links are modified out of order.
Algorithm: Every node has two additional identifiers, a version number and a level. The version number
allows us to lazily produce ordered histories. The level, which indicates the distance to a leaf, aids in recovery
from misnavigation. An operation is executed by executing its B-link tree actions, so we only need to specify
the execution of the actions.
Out-of-range: When a message arrives at a node, the processor first checks if the node is in range. This
check includes testing to see if the node level and the message destination level match. If the message
is out of range or on the wrong level, the node routes it in the appropriate direction.
Migration: When a node migrates,
1. All actions on the node are blocked until the migration terminates (equivalently, migration is
atomic).
2. A copy of the node is made on a remote processor, and the copy is a duplicate (with the exception
that the version number increments).
3. A link-change action is sent to all known neighbors of the node.
4. The original node is deleted.
Insert: Inserts are performed locally.
Half-split: Half-splits are performed locally by placing the sibling on the same processor and assigning the
sibling a version number one greater than the half-split node's. An insert action is sent to the parent,
and a link-change action is sent to the right neighbor.
Link-change: When a node receives a link-change action, it updates the indicated link only if the update's
version number is greater than the link's current version number. If the update is performed, the new
version number is recorded.
Missing Node: If a message arrives for a node at a processor, but the processor doesn't store the node,
the processor performs the out-of-range action at a locally stored node at the same or higher level. If
the processor doesn't store such a node, the action is sent to the root.
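The link-change rule reduces to a version-number comparison. A minimal sketch is shown below; the field names are assumptions for illustration.

```python
# Sketch of the version-numbered link-change action.  Each link stores the
# neighbor's address and the version at which the link was last set; a
# late-arriving link-change (lower version) is simply discarded, which is
# the "rewrite history" step used in the proof.  Field names are assumed.

def on_link_change(node, side, new_neighbor, new_version):
    """side is 'left' or 'right'."""
    link = node["links"][side]                   # {"addr": ..., "version": ...}
    if new_version > link["version"]:
        link["addr"] = new_neighbor              # perform the overwriting update
        link["version"] = new_version
    # else: an earlier link-change arrived late; ignore it
```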
Theorem 3 The lazy algorithm satisfies the complete, compatible, and ordered history requirements.
Proof: There is only a single copy of a node, so the histories are vacuously compatible. The only ordered
actions are the link-change actions. The node at the end of a link can only change due to a split or a
migration. In both cases, the node's version number increments. When a link-change action arrives at the
correct destination, it is performed only if the version number of the new node is larger than the version
number of the current node. If the update not performed, the node's history is rewritten to insert the link
change into its proper place. Let 1 be a link-change action that is not performed, and let 1 be an ordered
action of class C. Let aj be the ordered action of class C in He that is ordered immediately after 1 (there is
no ak such that 1
can be rewritten so that it remains valid.
Since the lists are maintained properly, every action takes a good state to a good state and is eventually able to find its destination. Therefore, the algorithm produces complete histories. □
We note that an implementation of the lazy single-copy algorithm can use forwarding addresses to improve
efficiency and reduce overhead. The forwarding addresses are not required for correctness, so they can be
garbage-collected at convenient intervals.
Discussion This section shows two techniques for implementing distributed search structures. The first
technique takes advantage of the fact that an action does not directly access a node, but must request access
from the node manager. As a result, an action can try to access a non-existent node, as long as a recovery
path exists.
The second technique discussed in this section shows how to make the node management protocol respect
ordered actions. Since we assume that a natural time ordering exists between ordered actions, it is possible to
assign timestamps to the actions. The particular ordered actions in the single-copy algorithm are overwriting
actions (i.e, modifying link pointers). Since the result of executing an overwriting ordered action doesn't
depend on the previous ordered actions, an overwriting action doesn't need to wait for the preceding ordered
actions to execute. Instead, late actions are discarded. If an ordered action is not overwriting, then the later
action must block until the earlier actions are executed. For example, if nodes can be deleted (as in [16]),
then the insertion and the deletion of a pointer to a child are non-overwriting ordered actions. If the delete
action arrives first, it must be held to kill the insert action when it arrives (as discussed in Section 4).
3.3 Variable Copies
In this section, we discuss how to allow processors to join and leave the replication of the index nodes (so
we can use this algorithm to implement a never-merge dB-tree). We assume that the leaf nodes are not
replicated, and that the PC of a node never changes. The lazy algorithm that we propose combines elements
of the lazy fixed-copy and migrating-node algorithms by using lazy splits, version numbers, and message
recovery.
To allow for data-balancing, we let the leaf level nodes migrate. The leaf level nodes aren't replicated,
so we can manage them with the lazy algorithm for migrating nodes (section 3.2). We want to maintain
the dB-tree property that if a processor owns a leaf node, it has a copy of every node on the path from the
root to the leaf. If a processor obtains a new leaf node, it must join the set of copies for every node from the
root to the leaf which it does not already help maintain. If the processor sends off the last child of a node,
it unjoins the set of processors that maintain the parent (applied recursively). When a processor joins
or unjoins a node replication, the neighboring nodes are informed of the new cooperating processor with a
link-change action. To facilitate link-change actions, we require that a node have pointers to both its left
and right sibling. Therefore, a split action generates a link-change subsequent action for the right sibling, as
well as an insert action for the parent.
We assume that every node has a PC that never changes (we remove this assumption later). The primary
copy is responsible for performing all initial split actions and for registering all join and unjoin actions. The join
and unjoin actions are analogous to the migrate actions. Hence, every join or unjoin registration increments
the version number of the node. The version number permits the correct execution of ordered actions, and
also helps ensure that copies which join a replication obtain a complete history (see Figure 7). When a
processor unjoins a replication, it will ignore all relayed actions on that node and perform error recovery on
all initial action requests.
Algorithm:
Out-of-range: If a copy receives an initial action that is out-of-range the copy sends the action across the
appropriate link. Relayed actions that are out of range are discarded.
Insert: 1. When a copy receives an initial insert action, it performs the insert and sends relayed-insert
actions to the other node copies that it is aware of. The copy attaches its version number to the
update.
2. When a non-PC copy receives a relayed insert, it performs the insert if it is in range, and discards
it otherwise.
3. When the PC receives a relayed insert action, it tests to see if the relayed insert action is in range.
(a) If the insert is in range, the PC performs the insert. The PC then relays the insert action
to all copies that joined the replication at a later version than the version attached to the
relayed insert.
(b) If the insert is not in range, the PC sends an initial insert action to the appropriate neighbor.
Split: 1. When the PC detects that its copy is too full, it performs a half-split action by creating a new
sibling on several processors, designating one of them to be the PC, and transferring half of its
keys to the copies of the new sibling. The PC sets the starting version number of the new sibling
to be its own version number plus one. Finally, the PC sends an insert action to the parent, a
link-change action to the PC of its old right sibling, and relayed-split actions to the other copies.
2. When a non-PC copy receives a relayed half-split action, it performs the half-split locally.
Join: When a processor joins the replication of a node, it sends a join action to the PC of the node. The PC
increments the version number of the node and sends a copy to the requester. The PC then informs
every processor in the replication of the new member, and performs a link-change action on all of its
neighbors.
Unjoin: When a processor unjoins a replication of a node, it sends an unjoin action to the PC and deletes
its copy. The processor discards relayed actions on the node and performs error recovery on the initial
actions. When the PC receives the unjoin action, it removes the processor from the list of copies, relays
the unjoin to the other copies, and performs a link-change action on all of its neighbors.
Relayed join/unjoin: When a non-PC copy receives a join or an unjoin action, it updates its list of
participants and its version number.
Link-change: A link-change action is executed using the migrating-node algorithm.
Missing-node: When a processor receives an initial action for a node that it doesn't manage, it submits
the action to a 'close' node, or returns the action to the sender (as in the mobile nodes algorithm).
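The following is a rough sketch of the primary copy's bookkeeping for joins and relayed inserts. The messaging function send, the key-range fields, and all other names are assumptions made for illustration, not part of the algorithm's specification.

```python
# Sketch of the primary copy's handling of joins and relayed inserts.
# All names are illustrative; 'send' stands in for the messaging layer.

class PrimaryCopy:
    def __init__(self, low, high, send):
        self.version = 0
        self.entries = {}            # key -> child pointer
        self.low, self.high = low, high
        self.copies = {}             # processor id -> version at which it joined
        self.send = send

    def in_range(self, key):
        return self.low <= key < self.high

    def handle_join(self, proc):
        self.version += 1
        self.copies[proc] = self.version
        self.send(proc, ("copy", dict(self.entries), self.version))
        for other in self.copies:
            if other != proc:
                self.send(other, ("new-member", proc, self.version))

    def handle_relayed_insert(self, key, ptr, sender_version, neighbor):
        if not self.in_range(key):
            self.send(neighbor, ("initial-insert", key, ptr))   # redirect out-of-range work
            return
        self.entries[key] = ptr
        # Relay to every copy that joined after the sender's version, since the
        # sender could not have known about those copies when it relayed.
        for proc, joined_at in self.copies.items():
            if joined_at > sender_version:
                self.send(proc, ("relayed-insert", key, ptr))
```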
Theorem 4 The variable-copies algorithm satisfies the complete, compatible, and ordered history require-
ments.
Proof: We can show that the variable-copies algorithm produces complete and ordered histories by using the
proof of theorem 3. If we can show that for every node n, the history of every copy c ∈ copies(n) has a
backwards extension H' whose uniform update actions are exactly M_n, then the proof of theorem 2 shows
that the variable-copies algorithm produces compatible histories.
For a node n with primary copy PC, let A_j be the set of update actions performed on the PC while the
PC has version number j. When copy c is created, the PC updates its version number to k and gives c an
initial value V_c whose backwards extension B_k contains all uniform update actions in A_0 through A_{k-1}.
The PC next informs all other copies of the new replication member. After a copy c' is informed of c, c'
will send all of its updates to c. The copy c' might perform some initial updates concurrently with c's
joining copies(n). These concurrent updates are detected by the PC through the version-number algorithm
and are relayed to c. Therefore, at the end of a computation every copy c ∈ copies(n) has every update in
M_n in its uniform history. Thus, the variable-copies algorithm produces compatible histories. □
Figure 7: Incomplete histories due to concurrent joins and inserts. (An insert that is executed concurrently
with a join is not sent to the new copy.)
Discussion This section synthesizes the techniques presented in the previous two sections, contributing a
method for ensuring that every copy has a complete history. If the PC wishes to give up its responsibilities,
it contacts a non-PC copy c' to transfer information and responsibility for being the PC. At this point c'
becomes the PC. The old PC then informs all other non-PC copies of the new PC, and waits for their
acknowledgement. If an unjoin request arrives, it is relayed to the new PC, c'.
4 Handling Deletes and Merged Nodes
The discussion of the previous section assumed that all operations are search or insert operations. In this
section, we discuss how to handle the deletion of nodes, and more generally the shifting of responsibility for
a key range between two nodes.
If only leaf nodes may be deleted, then the insert and the delete of the pointer to the leaf node are ordered
actions: the delete must come after the insert. Unfortunately, the insert and delete are not overwriting
actions (as in Section 3.2). If a node is asked to delete a pointer that does not exist (but is in its key
range), the delete action is delayed until the corresponding insert action arrives. The delayed delete action
is remembered at the copy, and is executed immediately after the corresponding insert action is executed.
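A minimal sketch of this rule, with illustrative names: a delete that arrives before its matching insert is remembered, and it cancels the insert when the insert finally arrives, which has the same effect as executing the delete immediately after the insert.

```python
# Sketch: a non-overwriting ordered action (delete of a leaf pointer) that
# arrives before its matching insert is remembered at the copy.

class InteriorCopy:
    def __init__(self):
        self.entries = {}              # key -> child pointer
        self.pending_deletes = set()   # deletes waiting for their matching insert

    def apply_insert(self, key, child):
        if key in self.pending_deletes:
            self.pending_deletes.discard(key)   # the remembered delete kills the insert
            return
        self.entries[key] = child

    def apply_delete(self, key):
        if key in self.entries:
            del self.entries[key]
        else:
            self.pending_deletes.add(key)       # remember the delete until the insert arrives
```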
When a node is deleted, some care must be taken to preserve the double linked list in which it exists.
That algorithm is described in [16]. One leaf node might transfer some of its key range to an adjacent leaf,
in order to perform rebalancing [7]. The parent must be informed of the shift in the key range. This action
is a link-change action, and is an ordered action.
If the key ranges of the interior nodes can increase (due to the merging or rebalancing of an interior node),
then some new problems involving the synchronization of insert or delete actions with split and merge actions
can occur. In particular, a relayed insert (delete) might arrive at a copy of a node after the node has split
off and then merged back a portion of its key range. If the anti-action occurred while the key space of the
action resided at a different node, then the action might be performed twice.
For example, consider the execution illustrated in Figure 8. The initial action I(k) is performed at copy
c1 and relayed to the other copies. The key range containing k is split off from the node, then later merged
back in. A relayed insert i(k) is performed at the PC before the merge, and at copy c2 after the merge. If
key k is deleted at the sibling before the merge, then i(k) at c2 inserts k after all d(k) actions are quiescent.
4.1 Merged Histories
In Section 2, a replicated node can only lose key space (i.e., split into two nodes). Because the histories of
two nodes are never shared, we avoided the issue of specifying the initial values and backwards extensions of
split nodes.

Figure 8: Duplicated operations due to merged histories. (I(k) and D(k) are performed at the sibling between
the split and the merge; the relayed I(k) then arrives at c2 after the merge.)

When a node gains a new key space (due to a merge or a rebalancing), it also gains the history
of the actions that occurred on that key range. In this section, we discuss how the history of a merged node
can be constructed.
We need to distinguish between two kinds of actions: actions on the node (n-actions), and actions on
entries that a copy of the node contains (r-actions). An n-action specifies a node as its target, whereas an
r-action specifies a key value as its target.
Given a copy of a node, c, let range(c) be the key range of the entries that can exist in c. Given an
action a, let range(a) be the key range that is the target of a. Let R be a key range. We define H(R) to be
the set of all r-actions a that were fired on some copy c and such that range(a) is in R. That is,

H(R) = { a | a was fired on some copy c and range(a) ⊆ R }
When a replicated node engages in restructuring, it can split and create a new node, delete itself and
give its key range to a neighboring node, send a portion of its key range to a neighboring node, or accept
a portion of the key range of a neighboring node. Each of these n-actions changes the key range of a node,
either by increasing it or by decreasing it. We call an n-action that decreases the node range a split action,
and one that increases the key range a merge action. Note that a split action might send its key range to
another node that executes a merge action, or it might create a new node. If a node is deleted, it splits off
its entire key range, and then becomes dormant.
When a node n splits off a portion R' of its key range, it sends the entries (i.e., key range and pointer
pairs) in R' to node n'. These entries correspond to a history H(R') of r-actions such that a ∈ H(R') ⇒
range(a) ⊆ R'. This history is received by n' and is incorporated into H_c for every c ∈ copies(n') when c
executes the corresponding merge action. We denote the incorporated actions due to a merge as m-actions.
An m-action modifies the value of a copy, and does not instigate subsequent actions.
Given a history of a copy, the record of an r-action a might appear several times. The first appearance
might be due to an initial or a relayed action (i.e., not an m-action); the subsequent appearances are due to
m-actions. Yet, there might be no guarantee that the effects of the action are reflected in the final value of
the copy. To account for this possibility, we need to make the following definition.
Def: An action a appears in history H_c of copy c if there is an a_i in H_c such that a_i is the same action as
a, and there is no a_j in H_c, j > i, that removes range(a) from range(c).
Next we need to replace the complete history requirement with one that better expresses the way histories
are transferred between nodes.
Merge-Complete History Requirement: If every subsequent action that is issued appears in the
history H_c of some node copy c exactly once, then the computation meets the merge-complete history requirement.
If every computation that an algorithm produces meets the merge-complete history requirement, then the
algorithm satisfies the merge-complete history requirement.
The merge-complete history requirement only requires that there is some copy of a node such that an
issued action appears in the history, and appears in the history only once. The compatible history requirement
ensures that all copies of the node are also merge-complete.
4.2 Algorithm
The problem that can occur when nodes can merge as well as split is that actions can be executed twice at
some copies. This problem can occur if an action occurs at the PC before the split, and at a non-PC copy
after the merge.
To detect a potential problem, we divide a node's lifetime into epochs. When a node is created, it has
epoch number 0. During an epoch, the node may execute a number of split actions. A new epoch begins
when the node executes a merge, and the node has executed at least one split during the epoch. Each copy of
a node keeps track of its current epoch, and also the number of split actions it has executed. These numbers
are attached to every relayed action that a copy relays to the other copies of a node.
When a relayed r-action a arrives at a copy c and has an epoch number smaller than c's epoch, c
next checks a's split number. If a's split number is smaller than c's split number, then a might have
been lost or duplicated during the change in epochs. Let e be the node pointer in the parameter p of a. The
copy c performs the following protocol:
1. If c is the PC, then c executes a and transmits its state with respect to e to all copies.
2. If c is not the PC, then c asks the PC for its value of the state with respect to e. Copy c uses the reply
from the PC to make its state consistent with respect to e.
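The check can be sketched as follows; broadcast_state and ask_pc_for_state stand in for the messages described in steps 1 and 2, and the attribute names are assumptions made for illustration.

```python
# Sketch of the epoch/split-number check on a relayed r-action.
# The messaging helpers are passed in; everything here is illustrative.

def handle_relayed_action(copy, action, ask_pc_for_state, broadcast_state):
    stale = action.epoch < copy.epoch and action.splits < copy.splits
    if not stale:
        copy.apply(action)                     # same epoch: the usual lazy-update rules apply
        return
    # The action may have been lost or duplicated across the epoch change.
    entry = action.target_entry                # the node pointer e in the action's parameter
    if copy.is_primary:
        copy.apply(action)
        broadcast_state(entry, copy.get_state(entry))   # step 1: PC pushes its state for e
    else:
        copy.set_state(entry, ask_pc_for_state(entry))  # step 2: adopt the PC's view of e
```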
Theorem 5 The algorithm for handling node merges is correct.
Proof: Suppose a copy of a node executes an initial action A, and all relayed actions arrive at all other copies
during the same epoch. Then the previous correctness arguments still apply. So, suppose that a relayed
action a arrives at a copy c when c is in a later epoch than the one in which A was executed. In this case, the
copy's history is made consistent with the PC's history. By the split protocol, all actions are executed once.
The PC will execute each action only once, so all copies will execute the action only once. □
5 Applications to Other Structures
The techniques described in this paper are not limited to the implementation of the dB-tree. For example,
there is no requirement that the nodes of the distributed search structure be of a limited size. The distributed
search structure might be limited to having three levels. If an average node stores 10,000 pointers, three
levels are sufficient to index one trillion data items. In addition, the storage overhead per processor is limited by the
shallow tree depth. We can also apply the techniques to search-structures that support multi-attribute range
queries (such as the hB-tree [25]), provided that they support the link technique.
To illustrate the application of these methods to other distributed search structures, we analyze a dis-
tributed hash table due to Ellis. In [11], Ellis describes a distributed and concurrent extendible hash table.
An extendible hash table is one type of hash table that increases the number of buckets available in the table
in response to increasing storage demand (such hash tables are called dynamic hash tables). The extendible
hash table contains a set of data buckets, and a directory that contains pointers to the data buckets. Ini-
tially, the hash table contains a single bucket, and the hash directory contains a single entry pointing to
that bucket. When this bucket fills, it splits into two buckets. The original bucket is labeled '0' and the new
bucket is labeled '1'. All data items whose hash function's least significant bit is a '0' are put in the bucket
labeled '0', and the remaining keys are placed in the bucket marked '1'. The directory doubles in size to
accommodate the new pointer. In general, when a bucket labeled 'tag' becomes too full, it splits into two
buckets, the original marked '0tag' and the new bucket marked '1tag'. The keys from the original bucket
are distributed among the original and the new bucket based on which of the labels the least significant bits
of their hash functions match.
The directory size is 2^l, where l is the number of bits in the longest label. The directory is indexed by
the least significant l bits of the hash function of the input key. If there is no bucket that is labeled with
the directory index, the directory entry points to the bucket with the longest matching suffix. If a bucket
splits to produce buckets with longer labels than any other in the directory, the directory doubles in length
to accommodate the new labels. If a bucket with label 'x tag' becomes empty, it can merge with the bucket
labeled 'x̄ tag' (the label with the bit x complemented), if it exists.
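As a sketch of the directory indexing just described (illustrative Python, not Ellis' code): the directory holds 2^l bucket references, and a key is routed through the entry selected by the least significant l bits of its hash.

```python
# Sketch: indexing an extendible-hash directory by the l least significant
# bits of the key's hash value. Illustrative only; a real implementation
# would use a stable hash function rather than Python's built-in hash().

def directory_index(key, l):
    return hash(key) & ((1 << l) - 1)      # least significant l bits

def bucket_for(directory, key, l):
    """directory is a list of 2**l bucket references; an entry whose index has
    no exactly-matching bucket label points to the longest-suffix bucket."""
    return directory[directory_index(key, l)]
```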
Ellis observed that the extendible hashing algorithm can be made into a highly concurrent hash table by
linking together the buckets [10]. Suppose that a bucket with label tag is ready to split, and tag is not one
of the longest labels in the hash table (that is, |tag| < l). Then there are directory entries with suffixes 0tag
and 1tag, both of which point to the bucket labeled tag. When the bucket splits, it is relabeled 0tag and
the new bucket is labeled 1tag. The bucket labeled 0tag stores a pointer to the bucket labeled 1tag. If an
operation reads a directory entry labeled with a suffix of 1tag and is directed to the bucket labeled 0tag,
it can detect the misnavigation and determine the correct bucket to access. As a result, bucket splitting can
be performed concurrently with directory accesses.
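A sketch of the recovery step: if a directory entry is stale, the operation chases the buckets' forward links until it reaches a bucket whose label is a suffix of the key's hash. The label_matches and forward_link attributes are assumptions made for illustration.

```python
# Sketch: recovering from misnavigation by following forward links between
# buckets that have split. Illustrative names only.

def find_bucket(start_bucket, key):
    bucket = start_bucket
    while not bucket.label_matches(key):   # is the bucket's label a suffix of hash(key)?
        bucket = bucket.forward_link       # chase the split chain to the correct bucket
    return bucket
```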
Ellis further observed that the concurrent extendible hash table can be made into a distributed extendible
hash table in which multiple copies of the hash directories exist [11]. The key observation is that since
operations can recover from misnavigation, the directory copies can contain outdated information. In this
section we analyze the distributed algorithm using lazy updates.
A picture (taken from [11]) of a distributed extendible hash table is shown in Figure 9. There are two
copies of the directory, three active buckets, and one deleted bucket. The buckets are double linked. The
forwards link is used to recover from outdated directory information, and is generated according to the rules
discussed above. The backwards link is used to maintain the list. There are two incorrect links in the two
directories. Directory 1 has not yet been informed that bucket '0' has split into buckets '00' and '01', so
its entry for '10' points to the wrong bucket. Directory 2 has not yet been informed that bucket '11' was
merged into bucket '01', so its entry for '11' points to a deleted bucket.
There are three types of operations on the hash table: search, insert, and delete. A search operation reads
a local copy of the directory, determines a bucket to visit, then sends itself to the processor that manages
the bucket. When the operation reaches the bucket, it traverses the links until it finds the correct bucket,
then performs the search on the bucket. The insert and delete operations similarly navigate to the correct
bucket, then perform their operations. If the bucket becomes too full (or empty) it is split (or merged). The
directories are informed of the restructuring, and they update their directory entries accordingly.
Ellis' algorithm maintains the doubly-linked list of buckets using a locking-style algorithm. We do not
consider the bucket management further. The directories process three types of actions: search, split update,
and merge update. The search structure has been set up so that search actions on the directories are never
blocked.

Figure 9: A distributed extendible hash table (figure from Ellis).

In addition, an action on one directory entry does not affect, and therefore does not need to block,
another directory entry. However, actions on the same directory entry must be performed in the order issued
(since they are link-change actions).
In Ellis' algorithm, each bucket and each directory entry contains a version number. Whenever a bucket
splits or merges, all buckets involved increase their version number so that it is one greater than the previous
version numbers. Since directory entries refer to buckets, these version numbers can be used to satisfy the
ordered history requirement.
There is one complication, namely that a split update or a merge update action can affect many directory
entries. If the length of the longest bucket label is l, and the label of the bucket undergoing restructuring is t
bits, then the update affects 2^(l-t) directory entries. Ellis blocks an update until it is the succeeding update for all directory
entries involved. We note that at any particular directory entry, the updates are overwriting updates. Thus
the same effect can be had by applying the ordered overwriting update algorithm at each entry individually.
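A sketch of this per-entry treatment, with illustrative names: each affected directory entry keeps the version number of the bucket it points to, and a restructuring update simply overwrites any entry that carries an older version, without blocking.

```python
# Sketch: applying a split/merge update as independent overwriting updates on
# each affected directory entry, ordered by bucket version numbers.

def apply_restructure_update(directory, versions, affected_entries, new_bucket, new_version):
    for i in affected_entries:             # the 2**(l-t) entries with the matching suffix
        if new_version > versions[i]:
            directory[i] = new_bucket
            versions[i] = new_version
        # an older update is simply discarded; no directory entry ever blocks
```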
Suppose that the set of directories is fixed. Then since no operation blocks any other and the ordered
history requirement is satisfied, our modification of Ellis' algorithm is a correct lazy-update algorithm on the
directories. Furthermore, no PC is necessary. If directories can join and unjoin the replication, an agreement
protocol is necessary (such as the one discussed in section 3.3).
We note one final detail, which involves garbage-collecting the deleted nodes. As is shown in Figure 9,
a deleted node drains into a (currently) live node. In Ellis' algorithm, deleted nodes are garbage-collected
only after all directories can guarantee that no operation will access that node. Since a lost operation can
recover its path by searching a directory, there is no need to synchronize garbage collection.
Discussion As this section shows, the techniques described in this paper can be used to design a distributed
search structure starting from a concurrent search structure. The implementor can take the following steps:
1. Select a concurrent implementation of the search structure that uses the link technique.
2. Translate the concurrent algorithm into a distributed algorithm.
3. Determine the ordered actions and the commutativity between actions.
4. Apply the algorithms in section 2 to manage the node copies.
In addition to hash tables and B-link trees, other concurrent search structures that use the link
technique have been proposed. For example, Parker [29] gives a link-type algorithm for a concurrent trie.
A multiattribute range query search structure, such as the hB-tree [25] can also be modified to serve as a
distributed search structure.
6 Failures and Recovery
The algorithms in this paper have not explicitly accounted for processor failures and recovery. However,
because lazy updates require little synchronization, a message recovery strategy (such as those discussed in
[22, 36, 14, 35, 4]) can be applied. When a server recovers, it recovers its message state, and applies the
actions it received.
7 Conclusion
We present algorithms for implementing lazy updates for distributed search structures. Lazy updates avoid
the need for synchronization between copies, and permit concurrent searches and concurrent updates. In
other works [1, 21] we discuss implementation and performance issues, while this work concentrates on
algorithms and correctness proofs. The application of lazy updates is demonstrated by applying them
to the dB-tree, a distributed B-tree. After presenting algorithms and proofs for lazy updates, we discuss
methods for designing distributed search structures.
References
[1] P.A. Krishna and T. Johnson. Index replication in a distributed B-tree. In Conference on Management
of Data, pages 207-224, 1994.
[2] F.B. Bastani, S.S. Iyengar, and I-Ling Yen. Concurrent maintenance of data structures in a distributed
environment. The Computer Journal, 21(2):165-174, 1988.
[3] P.A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database
Systems. Addison-Wesley, 1987.
[4] A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle. Fault tolerance under UNIX. ACM
Transactions on Computer Systems, 7(1):1-24, 1989.
[5] K.M. Chandy and L. Lamport. Distributed snapshots: Determining global states of distributed systems.
ACM Transactions on Computer Systems, 3(1):63-75, 1985.
[6] A. Colbrook, E.A. Brewer, C.N. Dellarocas, and W.E. Weihl. An algorithm for concurrent search trees.
In Proceedings of the 1991 International Conference on Parallel Processing, pages III-138 to III-141, 1991.
[7] D. Comer. The ubiquitous B-tree. ACM Comp. Surveys, 11:121-137, 1979.
[8] S.B. Davidson, H. Garcia-Molina, and D. Skeen. Consistency in partitioned networks. Computing
Surveys, 17(3):342-370, 1985.
[9] M. Dietzfelbinger and F. Meyer auf der Heide. An optimal parallel dictionary. In Proc. ACM Symp. on
Parallel Algorithms and Architectures, pages 360-368, 1989.
[10] C.S. Ellis. Extendible hashing for concurrent operations and distributed data. In Proceedings of the
ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pages 106-116, Portland,
OR, 1983.
[11] C.S. Ellis. Distributed data structures: A case study. IEEE Transactions on Computers, C-34(12):1178-
1185, 1985.
[12] K.Z. Gilon and D. Peleg. Compact deterministic distributed dictionaries. In Proceedings of the Tenth
Annual ACM Symposium on Principles of Distributed Computing, pages 81-94. ACM, 1991.
[13] M. Herlihy and J. Wing. Linearizability: A correctness condition for concurrent objects. ACM Trans-
actions on Programming Languages and Systems, 12(3):463-492, 1990.
[14] D.B. Johnson and W. Zwaenepoel. Recovery in distributed systems using optimistic message logging
and checkpointing. Journal of Algorithms, 11(3):462-491, 1990.
[15] T. Johnson and A. Colbrook. A distributed data-balanced dictionary based on the B-link tree. In Proc.
Int'l Parallel Processing Symp., pages 319-325, 1992.
[16] T. Johnson and A. Colbrook. A distributed, replicated, data-balanced search structure. To appear in the
Int'l Journal of High-Speed Computing. Available at ftp.cis.ufl.edu:cis/tech-reports/tr93/tr93-028.ps.Z,
1992.
[17] T. Johnson and D. Shasha. A framework for the performance analysis of concurrent B-tree algorithms.
In ACM Symp. on Principles of Database Systems, pages 273-287, 1990.
[18] T. Johnson and D. Shasha. The performance of concurrent data structure algorithms. Transactions on
Database Systems, pages 51-101, March 1993.
[19] T.A. Joseph and K.P. Birman. Low cost management of replicated data in fault-tolerant distributed
systems. ACM Trans. on Computer Systems, 4(1):54-70, 1986.
[20] P. Krishna and T. Johnson. Implementing distributed search structures. Technical Report UF CIS
TR92-032, Available at anonymous ftp site cis.ufl.edu, University of Florida, Dept. of CIS, 1992.
[21] P.A. Krishna and T. Johnson. Highly scalable data balanced distributed B-trees. Technical Report
95-015, University of Florida, Dept. of CISE, 1995. Available at ftp.cis.ufl.edu:cis/tech-reports.
[22] R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat. Providing high reliability using lazy replication. ACM
Trans. Computer Systems, 10(4):360-391, 1992.
[23] V. Lanin and D. Shasha. A symmetric concurrent B-tree algorithm. In 1986 Fall Joint Computer
Conference, pages 380-389, 1986.
[24] P.L. Lehman and S.B. Yao. Efficient locking for concurrent operations on B-trees. ACM Transactions
on Database Systems, 6(4):650-670, 1981.
[25] D.B. Lomet and B. Salzberg. The hB-tree: A multiattribute indexing method with good guaranteed
performance. ACM Trans. Database Systems, 14(4):625-658, 1990.
[26] W. Litwin, M. Neimat, and D.A. Schneider. LH*: Linear hashing for distributed files. In Proc. 1993
ACM SIGMOD, pages 327-336, 1993.
[27] G. Matsliach and O. Shmueli. An efficient method for distributing search structures. In Symposium on
Parallel and Distributed Information Systems, pages 159-166, 1991.
[28] G. Matsliach and O. Shmueli. An efficient method for distributing search structures. In Proceedings
of the First International Conference on Parallel and Distributed Information Systems, pages 159-166,
1991.
[29] J.D. Parker. A concurrent search structure. Journal of Parallel and Distributed Computing, 7, 1989.
[30] D. Peleg. Distributed data structures: A complexity oriented view. In Fourth Int'l Workshop on
Distributed Algorithms, pages 71-89, Bari, Italy, 1990.
[31] A. Ranade. Maintaining dynamic ordered sets on processor networks. In Proc. ACM Symp. on Parallel
Algorithms and Architectures, pages 127-137, 1992.
[32] Y. Sagiv. Concurrent operations on B*-trees with overtaking. In 4th ACM Symp. on Principles of Database
Systems, pages 28-37. ACM, 1985.
[33] B. Seeger and P. Larson. Multi-disk B-trees. In Proceedings of the 1991 ACM SIGMOD Conference,
pages 436-445, 1991.
[34] D. Shasha and N. Goodman. Concurrent search structure algorithms. ACM Transactions on Database
Systems, 13(1):53-90, 1988.
[35] A.P. Sistla and J.L. Welch. Efficient distributed recovery using message logging. In Proc. Symp. on
Principles of Distributed Computing, pages 223-238, 1989.
[36] R.E. Strom and S. Yemini. Optimistic recovery in distributed systems. ACM Transactions on Computer
Systems, 3(3):204-226, 1985.
[37] R. Vingralek, Y. Breitbart, and G. Weikum. Distributed file organization with scalable cost/performance.
In Proceedings of the 1994 ACM SIGMOD Conference, pages 253-264, 1994.
[38] W. Litwin, M. Neimat, and D.A. Schneider. RP*: A family of order-preserving scalable distributed data
structures. In Proceedings of the 1994 VLDB Conference, pages 342-353, 1994.
[39] P. Wang. An in-depth analysis of concurrent b-tree algorithms. Technical Report MIT/LCS/TR-496,
MIT Laboratory for Computer Science, 1991.
[40] W.E. Weihl. The impact of recovery on concurrency control. Technical Report MIT/LCS/TM-382b,
MIT Laboratory for Computer Science, 1989.
[41] W.E. Weihl and P. Wang. Multi-version memory: Software cache management for concurrent B-trees.
In Proc. 2nd IEEE Symp. on Parallel and Distributed Processing, 1990.
[42] M.H. Wong and D. Agrawal. Context-based synchronization: An approach beyond semantics for
concurrency control. In Proc. 1993 ACM SIGMOD, pages 278-287, 1993.
[43] I.L. Yen and F. Bastani. Hash tables in massively parallel systems. In Int'l Parallel Processing Sympo-
sium, pages 660-664, 1992.