Lazy Updates for Distributed Search Structures
Theodore Johnson and Padmashree Krishna
Dept. of CIS
University of Florida
December 1, 1992
Very large database systems require distributed storage, which means that they need distributed
search structures for fast and efficient access to the data. In this paper, we present an approach to
maintaining distributed data structures that uses lazy updates, which take advantage of the semantics
of the search structure operations to allow for scalable and low-overhead replication. Lazy updates
can be used to design distributed search structures that support very high levels of concurrency. The
alternatives to lazy update algorithms (vigorous updates) use synchronization to ensure consistency.
Hence, lazy update algorithms are a distributed analogue of shared-memory lock-free search structure
algorithms. Since lazy updates avoid the use of synchronization, they are much easier to implement than
vigorous update algorithms. We demonstrate the application of lazy updates to the dB-tree, which is a
distributed B+ tree that replicates its interior nodes for highly parallel access. We develop a correctness
theory for lazy updates so that our algorithms can be applied to other distributed search structures.
A common problem with distributed search structures is that they are often single-rooted. If the root node
is not replicated, it becomes a bottleneck and overwhelms the node that stores it (as noted in ). A search
structure node can be replicated by one of several well-known algorithms . However, these algorithms
synchronize operations, which reduces concurrency, and create a significant communications overhead.
Techniques exist to reduce the cost of maintaining replicated data and for increasing concurrency. Ladin,
Liskov, and Shrira propose lazy replication for maintaining replicated servers . Lazy replication uses the
dependencies that exist in the operations to determine if a server's data is sufficiently up-to-date to execute
a new request. Several authors have explored the construction of non-blocking and wait-free concurrent data
structures in a shared-memory environment [6, 20]. These algorithms enhance concurrency because a slow
operation never blocks a fast operation.
Lazy update algorithms are similar to lazy replication algorithms because both use the semantics of an
operation to reduce the cost of maintaining replicated copies. The effects of an operation can be lazily sent
to the other servers, perhaps on piggybacked messages. The lazy replication algorithm blocks an operation
until the local data is sufficiently up-to-date. In contrast, a non-blocking wait-free concurrent data structure
never blocks an operation. The lazy update algorithms are similar in that the execution of a remote operation
never blocks a local operation; hence they are a distributed analogue of non-blocking algorithms.
Lazy updates have a number of pragmatic advantages over more vigorous replication algorithms. They
significantly reduce maintenance overhead. They are highly concurrent, since they permit concurrent reads,
reads concurrent with updates, and concurrent updates (at different nodes). Finally, they are easy to
implement because they avoid the use of synchronization.
Despite the benefits of the lazy update approach, implementors might be reluctant to use it without
correctness guarantees. In this paper, we first present a framework for showing the correctness of lazy
update algorithms. We next discuss lazy update algorithms for implementing a distributed B-tree, the dB-
tree . We present three algorithms, the last of which can implement a dB-tree that never merges nodes
and performs data balancing on leaf nodes (we have previously found that never merging nodes results in
little loss in space utilization , and data balancing on the leaf level is low-overhead and effective ).
The methods that we present can be applied to other distributed search structures, such as hash tables .
1.1 The dB-tree
We initiated our study of distributed search structures by examining the design of a highly concurrent
distributed B-tree. A B-tree is a multi-ary tree in which every path from the root to the leaf is the same
length. The tree is kept in balance by adjusting the number of children in each node. In a B+-tree, the keys
are stored in the leaves and the non-leaf nodes serve as the index. A B-link tree is a B+-tree in which every
node contains a pointer to its right sibling .
Previous work on parallel-access search structures (see  for a survey) has concentrated on concurrent
or shared-memory implementations. Particularly notable are the B-link tree algorithms [17, 18, 16], which
we use as a base for our work. Concurrent B-link tree algorithms [17, 18] have been found to provide the
highest concurrency of all concurrent B-tree algorithms . In addition, operations on a B-link tree access
one node at a time. A B-link tree's high performance and node independence make it the most attractive
starting point for constructing a distributed search structure.
The key to the high performance of the B-link tree algorithms is the use of the half-split operation,
illustrated in Figure 1. If a key is inserted into a full node, the node must be split and a pointer to the new
sibling inserted into the parent (the standard B-tree insertion algorithm). In a B-link tree, this action is
broken into two steps. First, the sibling is created and linked into the node list, and half the keys are moved
from the node to the new sibling (the half-split). Second, the split is completed by inserting a pointer to the
new sibling into the parent. If the parent overflows, the process continues recursively.
Figure 1: Half-split operation
During the time between the half-split of the node and the completion of the split at the parent, an
operation that is descending the tree can misnavigate and read the half-split node when it is searching for
a key that moved to the sibling. In this case, it will detect the mistake using range information stored in
the node and use the link to the sibling to recover from the error. As a result, all actions on the B-link tree
index are completely local. A search operation examines one node at a time to find its key, and an insert
operation searches for the node that contains its key, performs the insert, then restructures the tree from
the bottom up.
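The recovery from a misnavigation can be illustrated with a minimal Python sketch. It assumes that each node stores an exclusive upper bound on its key range (here `high`) and a right-sibling link (here `right`); the names `Node`, `half_split`, and `search` are hypothetical and chosen for illustration, and `half_split` handles leaves only.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    keys: list                      # sorted keys (leaf) or separator keys (index node)
    high: float                     # exclusive upper bound of the node's key range
    right: Optional["Node"] = None  # link to the right sibling
    children: list = field(default_factory=list)  # empty for a leaf


def half_split(leaf: Node) -> Node:
    """First step of a split: create the sibling, move the upper half, link it in."""
    mid = len(leaf.keys) // 2
    sibling = Node(keys=leaf.keys[mid:], high=leaf.high, right=leaf.right)
    leaf.keys, leaf.high, leaf.right = leaf.keys[:mid], sibling.keys[0], sibling
    return sibling


def search(node: Node, key) -> bool:
    """Descend one node at a time, following right links after a misnavigation."""
    while True:
        # Range check: the key may have moved to a sibling during a half-split.
        while key >= node.high and node.right is not None:
            node = node.right
        if not node.children:
            return key in node.keys
        node = node.children[bisect_right(node.keys, key)]
```

In this sketch a search that reaches a half-split leaf before the parent has been updated still finds its key, because the range check sends it across the link to the new sibling.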
Some work has been done to develop a distributed B-tree. Colbrook et al.  developed a pipelined
algorithm. Wang and Weihl  use a special form of cache coherence to implement a parallel B-tree, so
that it can be implemented on a shared-nothing architecture with the appropriate underlying software.
The dB-tree [9, 10] implements the B-link tree algorithm as a distributed protocol (as in ). An
operation on the index (search, insert, or delete) is performed as a sequence of actions on the nodes in
the search structure, which are distributed among different processors. Each processor that maintains part
of the search structure has two components: a queue manager and a node manager. The queue manager
maintains the message queue, which stores pending actions to perform on locally stored nodes. The node
manager repeatedly takes an action from the queue manager and performs the action on a node. The action
execution typically generates a subsequent action on another node (for example, searching one index node
leads to searching another node). If the next node to process is stored locally, then a new entry is put into
the message queue. Otherwise, the node manager sends a message to the appropriate remote queue manager
indicating the action to be taken. We assume that the processing of one action can't be interrupted by the
processing of another action, so an action on a node is implicitly atomic.
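The division of labour between the queue manager and the node manager can be sketched as follows. This is a simplified single-threaded model, not the dB-tree implementation: the `Processor` class, the `location` map from node identifiers to processors, and the dictionary that stands in for the network are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import deque


class Processor:
    def __init__(self, pid, location, network):
        self.pid, self.location, self.network = pid, location, network
        self.queue = deque()  # queue manager: pending actions on local nodes
        self.nodes = {}       # locally stored nodes, keyed by node id
        self.sent = 0         # number of remote messages sent

    def route(self, action):
        """Queue the action locally, or send it to the remote queue manager."""
        owner = self.location[action[1]]
        if owner != self.pid:
            self.sent += 1
        self.network[owner].queue.append(action)

    def step(self, results):
        """Node manager: take one action and run it to completion (atomic)."""
        if not self.queue:
            return False
        kind, node_id, key = self.queue.popleft()
        node = self.nodes[node_id]
        if "children" in node:  # index node: generate the subsequent action
            child = node["children"][bisect_right(node["seps"], key)]
            self.route((kind, child, key))
        else:                   # leaf: produce the return value
            results.append((key, key in node["keys"]))
        return True


def run(network, results):
    """Drive every processor until no queue holds a pending action."""
    while any(p.step(results) for p in network.values()):
        pass
```

Because `step` finishes one action before taking the next, each action on a node is atomic in this model, which mirrors the assumption stated above.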
All operations start by accessing the root of the search structure. If there is only one copy of the root,
then access to the index is serialized. Therefore, we want to replicate the root widely in order to improve
parallelism. As we increase the degree of replication, however, the cost of maintaining coherent copies of a
node increases. Since the root is rarely updated, maintaining coherence at the root isn't a problem. A leaf
is rarely accessed, but a significant portion of the accesses are updates. As a result, wide replication of leaf
nodes is prohibitively expensive.
Figure 2: A dB-tree
In the dB-tree the leaf nodes are stored on a single processor. We apply the rule that if a processor stores
a leaf node, it stores every node on the path from the root to that leaf. An example of a dB-tree that uses
this replication policy is shown in Figure 2. The dB-tree replication policy stores the root everywhere, the
leaves at a single processor, and the intermediate nodes at a moderate level of replication. As a result, an
operation can be initiated at every processor simultaneously, but the effects of updates are localized. As a
side effect, an operation can perform much of its searching locally, reducing the number of messages passed.
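The replication rule can be made concrete with a short Python sketch that derives the set of processors holding a copy of each node from the placement of the leaves. The function name `replica_sets` and the dictionary encoding of the tree are hypothetical.

```python
def replica_sets(parent, leaf_owner):
    """Return node -> set of processors that store a copy of the node.

    parent maps each node to its parent (the root maps to None);
    leaf_owner maps each leaf to the single processor that stores it.
    """
    copies = {}
    for leaf, proc in leaf_owner.items():
        node = leaf
        # A processor that stores a leaf stores every node on the path to the root.
        while node is not None:
            copies.setdefault(node, set()).add(proc)
            node = parent[node]
    return copies
```

With four leaves on four processors under two interior nodes, the root ends up on all four processors, each interior node on two, and each leaf on one.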
The replication strategy for a dB-tree helps to reduce the cost of maintaining a distributed search
structure, but the replication strategy alone is not enough. If every node update required the execution of an
available-copies algorithm , the overhead of maintaining replicated copies would be prohibitive. Instead,
we take advantage of the semantics of the actions on the search structure nodes and use lazy updates to
maintain the replicated copies inexpensively.
We note that many of the actions on a dB-tree node commute. For example, consider the sequence of
actions that occurs in Figure 3. Suppose that nodes A and B split at "about the same time". Pointers to
the new siblings must be inserted into the parent, of which there are two copies. A pointer to A' is inserted
into the first copy of the parent and a pointer to B' is inserted into the second copy of the parent. At this
point, the search structure is inconsistent, since not only does the parent not contain a pointer to one of its
children, but the two copies of the parent don't contain the same value.
The tree in Figure 3 is still usable, since no node has been made unavailable. Further, the copies of the
parents will eventually converge to the same value. Therefore, there is no need for one insert action on a
node to synchronize with another insert action on a node. The tree is always navigable, so the execution of
an insert doesn't block a search action. We call node actions with such loose synchronization requirements
lazy updates.
Figure 3: Lazy inserts. Nodes A and B half-split while there are two copies of the parent; a pointer to A' is
then inserted into copy 1 of the parent and a pointer to B' into copy 2.
Before we terminate this introduction, we should mention some useful characteristics of lazy updates.
First, when a lazy update is performed at one copy of a node, it must also be performed at the other copies.
Since the lazy update commutes with other updates, there is no pressing need to inform the other copies of
the update immediately. Instead, the lazy update can be piggybacked onto messages used for other purposes,
greatly reducing the cost of replication management (this is similar to the lazy replication techniques ).
Second, index node searches and updates commute, so that one copy of a node may be read while another
copy is being updated. Further, two updates to the copies of a node may proceed at the same time. As
a result, the dB-tree not only supports concurrent read actions on different copies of its nodes, it supports
concurrent reads and updates, and also concurrent updates.
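The convergence of the two parent copies in Figure 3 can be checked with a small Python sketch. It models a copy of an index node as a sorted tuple of child pointers, which is an assumption made for illustration; `insert` and `replay` are hypothetical helper names.

```python
def insert(copy, pointer):
    """Insert a child pointer into one copy of an index node (kept sorted)."""
    return tuple(sorted(set(copy) | {pointer}))


def replay(initial, pointers):
    """Apply a sequence of insert actions to a copy, in the given order."""
    copy = initial
    for p in pointers:
        copy = insert(copy, p)
    return copy


parent = ("A", "B", "C")
# Copy 1 performs the initial insert of A', then the relayed insert of B'.
copy1 = replay(parent, ["A'", "B'"])
# Copy 2 performs the initial insert of B', then the relayed insert of A'.
copy2 = replay(parent, ["B'", "A'"])
```

The two copies differ after their first insert but hold the same value once both inserts have been applied, regardless of the order.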
2 Correctness of Distributed Search Structures
Shasha and Goodman  provide a framework for proving the correctness of non-replicated concurrent
data structures. We make extensive use of their framework in order to discuss operation correctness. We
omit most details here to save space, but we note that if the distributed analogue of a link-type search
structure algorithm follows the Shasha-Goodman link algorithm guidelines, it will produce strictly serializable
(or linearizable) executions. However, we would like the distributed search structure to satisfy additional
correctness constraints. For example, when a distributed computation terminates, every copy of a node
should have the same value. Performing concurrency control on the copies is the subject of this paper.
3 Copy Correctness
We intuitively want the replicated nodes of the distributed search structure to contain the same value
eventually. We can ensure the coherence of the copies by serializing the actions on the
nodes (perhaps via an available-copies algorithm ). However, we want to be lazy about the maintenance.
In this section, we describe a model of distributed search structure computation and establish correctness
criteria for lazy updates.
A node of the logical search structure might be stored at several different processors. We say that the
physically stored replicas of the logical node are copies of the logical node. We denote by $copies_t(n)$ the set
of copies that correspond to node n at (global snapshot) time t.
An operation is performed by executing a sequence of actions on the copies of the nodes of the search
structure. Thus, the specification of an action on a copy has two components: a final value c' and a
subsequent action set SA. An action that modifies a node (an update action) is performed on one of the
copies first, then is relayed to the remaining copies. We distinguish between the initial action and the relayed
actions. Thus, the specification of an action is:
$a^t(p, c) = (c', SA)$
When action a with parameter p is performed on copy c, copy c is replaced by c' and the subsequent actions
in SA are scheduled for execution. Each subsequent action in SA is of the form $(a_i, p_i, c_i)$, indicating that
action $a_i$ with parameter $p_i$ should be performed on copy $c_i$. If copy $c_i$ is stored locally, the processor puts
the action in the set of executable actions. If $c_i$ is stored remotely, then the action is sent to the processor
that stores $c_i$. If the action is a return value action, a message containing the return value is sent to the
processor that initiated the operation. If the final value of a(p, c) is c for every valid p and c, then a is a
non-update action; otherwise a is an update action. The superscript t is either i or r, indicating an initial
or a relayed action. We also distinguish initial actions by writing them in capitals, and relayed actions by
writing them in lowercase (I and i for an insert, for example).
In order to discuss the commutativity of actions, we will need to specify whether the order of two actions
can be exchanged. If action $a^t$ with parameter p can be performed on c to produce subsequent action set SA,
then the action is valid; otherwise the action is invalid. We note that the validity of an action does not
depend on the final value.
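The action specification can be read as a function from a tag, a parameter, and a copy to a final value and a subsequent action set. The Python sketch below is one possible encoding under that reading; the `Copy` record, the tuple layout of the subsequent actions, and the `others` argument (the names of the remaining copies) are assumptions made for illustration.

```python
from typing import NamedTuple


class Copy(NamedTuple):
    name: str
    low: int         # inclusive lower bound of the copy's key range
    high: int        # exclusive upper bound of the copy's key range
    keys: frozenset


def insert(t, p, c, others):
    """An update action: returns (c', SA); t is 'i' (initial) or 'r' (relayed)."""
    c_new = c._replace(keys=c.keys | {p})
    # An initial insert schedules relayed inserts at the other copies;
    # a relayed insert has no subsequent actions.
    sa = [("insert", "r", p, o) for o in others] if t == "i" else []
    return c_new, sa


def search(t, p, c, others):
    """A non-update action: the final value equals c for every p and c."""
    return c, [("return", p in c.keys)]
```

The initial and the relayed insert produce the same final value and differ only in their subsequent action sets, which is the distinction that the superscript t records.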
An algorithm might require that some actions must be performed on all copies of a node, or on all copies
of several nodes, "simultaneously". Thus, we group some action sequences into atomic action sequences,
or AAS. The execution of an AAS at a copy is initiated by an AASstart action and terminated by an
AASfinish action. A copy may have one or more AAS currently executing. An AAS will commute with
some actions (possibly other AASstart actions), and conflict with others. We assume that the node manager
at each processor is aware of the AAS-action conflict relationships, and will block actions that conflict with
currently executing AAS. The AAS is the distributed analogue of the shared memory lock, and can be used
to implement a similar kind of synchronization. However, lazy updates are preferable.
In order to capture the conditions under which actions on a copy commute, we model the value of a copy by
its history (as in ). Formally, the total history of copy $c \in copies_t(n)$ consists of the pair $(I_c, A'_c)$, where
$I_c$ is the initial value of c and $A'_c$ is a totally ordered set of actions on c. We define correctness in terms of
the update actions, since non-update actions should not be required to execute at every copy. The (update)
history of a copy is a pair $(I_c, A_c)$, where $I_c$ is the same initial value as in the total history and $A_c$ is $A'_c$ with
the non-update actions deleted (and the order on the update actions preserved). To remove the distinction
between initial and relayed actions, we define the uniform history, U(H), to be the update history H with
each action $a^t$ replaced by a. Finally, we will write the history of copy c, $(I_c, A_c)$, as $H_c = I_c \prod_{i=1}^{m} a_i$, where
$A_c = (a_1, \ldots, a_m)$.
Suppose that $H_c = I_c \prod_{i=1}^{m} a_i$, and that $I_c$ is the final value of $H' = I' \prod_{j} a'_j$. Then
$H^* = (I' \prod_{j} a'_j) \prod_{i=1}^{m} a_i$ is the backwards extension of $H_c$ by $H'$. It is easy to see that $H_c$ and $H^*$ have the same value,
and the last m actions in $H^*$ have the same subsequent action sets as the m actions in $H_c$. When a node
is created, it has an initial value, $I_n$. When a copy of a node is created it is given an initial value, which
we call the original value of the copy. This initial value should be chosen in some meaningful way, and will
typically be equivalent to the history of the creating copy, or to a synthesis of the histories of the existing
copies. In either case, the new copy will have a backwards extension that corresponds to the history of
update actions performed on the copy. If a copy of a node is deleted, then we no longer need to worry about
the node contents. We denote the set of all initial update actions performed on node n by $M_n$.
We recall that an action on a copy is valid if the action on the current value of the copy has its associated
subsequent action. A history is valid if action $a_i$ is valid on $I_c \prod_{j=1}^{i-1} a_j$ for every $i = 1, \ldots, m$. The final
value of a history is the final value of the last action in the history. Two histories are compatible if they are
valid, have the same final values, and have the same uniform updates. If $H_1$ and $H_2$ are compatible, then
we write $H_1 \equiv H_2$.
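The definitions of validity and compatibility can be exercised on a toy model. The Python sketch below assumes that a copy's value is a triple (low, high, keys), that a split at key p shrinks the range to [low, p), that an out-of-range relayed insert is discarded, and that an out-of-range initial insert is invalid at that copy. These modelling choices and the function names are assumptions made for illustration, not definitions taken from the text.

```python
def apply(value, action):
    """Apply one tagged action; return the new value, or None if it is invalid."""
    low, high, keys = value
    kind, tag, p = action
    in_range = low <= p < high
    if kind == "insert":
        if in_range:
            return (low, high, keys | {p})
        # Out of range: a relayed insert is discarded, an initial insert is invalid.
        return value if tag == "r" else None
    # "split" at key p: keep [low, p) and drop the keys that move to the sibling.
    return (low, p, frozenset(k for k in keys if k < p)) if in_range else None


def final_value(history):
    """Final value of a history (initial value, actions), or None if invalid."""
    value, actions = history
    for action in actions:
        value = apply(value, action)
        if value is None:
            return None
    return value


def uniform_updates(history):
    """The update actions with the initial/relayed distinction removed."""
    return sorted((kind, p) for kind, _tag, p in history[1])


def compatible(h1, h2):
    """Both valid, same final value, same uniform updates."""
    v1, v2 = final_value(h1), final_value(h2)
    return v1 is not None and v1 == v2 and uniform_updates(h1) == uniform_updates(h2)
```

In this model two histories that apply the same inserts in different orders are compatible, whereas moving an initial insert after a split that removes its key from the range yields an invalid history.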
Our correctness criteria for the replica maintenance algorithms are the following:
Compatible History Requirement: A node n with initial value $I_n$ and update action set $M_n$ has
compatible histories if, at the end of the computation C,
1. Every copy $c \in copies(n)$ with history $H_c$ has a backwards extension $B_c$ such that the update actions
in $H'_c = B_c H_c$ are exactly the actions in $M_n$.
2. Every backwards extension $H'_c$ can be rearranged to form $H^*_c$ such that $U(H^*_c) = U(H^*_{c'})$ for every
$c, c' \in copies(n)$, and every $H^*_c$ is valid.
If an algorithm guarantees that every node has a compatible history, then it meets the compatible history
requirement.
Complete History Requirement: If every subsequent action that is issued appears in some node's
update action set, then the computation meets the complete history requirement. If every computation that
an algorithm produces satisfies the complete history requirement, then the algorithm satisfies the complete
history requirement.
Ordered History Requirement: We define an ordered action as one that belongs to a class r such
that all actions of class r are time-ordered with each other (we assume a total order exists). A history H
is an ordered history if for any ordered actions $h_1, h_2 \in H$ of class r, if $h_1 <_r h_2$ then $h_1 < h_2$ in H. An
algorithm meets the ordered history requirement if every node has a compatible history that is an ordered
history.
The compatible history requirement guarantees that every node is single-copy equivalent when the
computation terminates. We note that our condition for rearranging uniform histories is a condition of the
subsequent action sets rather than a condition of the intermediate values of the nodes. The copies need only
to have the same value at the end of the computation, but the subsequent actions can't be posthumously
issued or withdrawn without a special protocol.
The complete history requirement tells us that we must route every issued action to a copy. A deleted
node is conceptually retained in the search structure to satisfy the complete history requirement. The ordered
history requirement lets us remove explicit synchronization constraints on the equivalent parallel algorithm
by shifting the constraints to the copy coherence algorithm.
3.2 Lazy Updates
An update action must be performed on all copies of a node. With no further information about the action,
it must be performed via an AAS to ensure that the conflicting actions are ordered in the same way at all
copies. However, some actions commute with almost all other actions, removing the need for an AAS.
In Figure 3, the final value of the node is the same at either copy, and the search structure is always in a
good state. Therefore, there is no need to agree on the order of execution. We provide a rough taxonomy of
the degree of synchronization that different updates require.
Lazy Update: We say that a search structure update is a lazy update if it commutes with all other lazy
updates, so synchronization is not required.
Semi synchronous update: Other updates are almost lazy updates, but they conflict with some but not
all other actions. For example, the actions may belong to a class of ordered actions. We call these semi
synchronous updates. A semi synchronous action requires special treatment, but does not require the
activation of an AAS.
Synchronous Update: A synchronous update requires an AAS for correct execution. We note that the
AAS might block only a subclass of other actions, or might extend to the copies of several different
nodes.
In this section, we describe algorithms for the lazy maintenance of several different dB-tree algorithms. We
work from a simple fixed-copies distributed B-tree to a more complex variable-copies B-tree, and develop the
tools and techniques that we need along the way. For all of the algorithms that we develop, we assume that
only search and insert operations are performed on the dB-tree. In addition, we assume that the network is
reliable, delivering every message exactly once in order.
4.1 Fixed-Position Copies
For this algorithm, we assume that every node has a fixed set of copies. This assumption lets us concentrate
on specifying lazy updates. Every node contains pointers to its children, its parent, and its right sibling.
When a node is created, its set of copies is also created, and copies of the node are neither created nor
destroyed afterwards.
A search operation issues a search action for the root. The search action is a straightforward translation
of the action that a shared-memory B-link tree algorithm takes at a node. An insert operation searches for
the correct leaf using search actions, then performs an insert action on the leaf. If the leaf becomes too full,
the operation restructures the dB-tree by issuing half-split and insert actions. The insert action adds a new
key at the leaves and adds a pointer to a child in the non-leaf nodes. The half-split action creates a new
sibling (and the sibling's copies), transfers keys from the half-split node to the sibling, modifies the node to
point to the sibling, and sends an insert action to the parent.
The first step in designing a distributed algorithm is to specify the commutativity relationships between
actions:
1. Any two insert actions on a copy commute. As in Sagiv's algorithm , we need to take care to
perform out-of-order inserts properly.
2. Half-split operations do not commute. Since a half-split action modifies the right-sibling pointer, the
final value of a copy depends on the order in which the half-splits are processed.
3. Relayed half-split actions commute with relayed inserts, but not with initial inserts. Suppose
that in history Hp, initial insert action I(A) is performed before a half-split action s that removes A's
range from p. Then, if the order of I and s is switched, I becomes an invalid action. A relayed
insert action has no subsequent actions, and the final value of the node is the same in either ordering.
Therefore, relayed half-splits and relayed inserts commute.
4. Initial half-split actions don't commute with relayed insert actions. One of the subsequent actions of
an initial half-split action is to create the new sibling. The key that is inserted either will or won't
appear in the sibling, depending on whether it occurs before or after the half-split.
By our classification methods, an insert is a lazy update and a half-split is a semisynchronous update.
If the ordering between half-splits and inserts isn't maintained, the result is lost updates (see Figure 4).
We next present two algorithms to manage fixed-copy nodes. To order the half-splits, both algorithms
use a primary copy (PC), which executes all initial half-split actions (non-PC copies never execute initial
half-split actions, only relayed half-splits). The algorithms differ in how the insert and half-split actions are
ordered. The synchronous algorithm uses the order of half-splits and inserts at the primary copy as the
standard to which all other copies must adhere. The semisynchronous algorithm requires that the ordering
at the primary copy be consistent with the ordering at all other nodes (see Figure 5).
We do not require that all initial insert actions are performed at the PC, so copies might find that they
exceed their maximum capacity. However, since each copy is maintained serially, it is a simple matter to
add overflow blocks.
4.1.1 Synchronous Splits
Algorithm: An operation is executed by submitting an action, and each action generates subsequent
actions until the operation is completed. An operation is executed by executing its B-link tree actions, as
discussed previously. Thus, all we need to do is specify the execution of an action at a copy. The synchronous
split algorithm uses an AAS to ensure that splits and inserts are ordered the same way at the PC and at the
non-PC copies (see Figure 5).
Figure 4: An example of the lost-insert problem. If half-split S1 reduces the range of the node to exclude
I4's key, then I4's key is lost: the PC ignores an out-of-range relayed insert, and the copies discard I4's key
when they perform the relayed split.
Half-split Only the PC executes initial half-split actions. Non-PC copies execute relayed half-split actions.
When the PC detects that it must half-split the node, it does the following:
1. Performs a split_start AAS locally. This AAS blocks all initial insert actions, but not relayed
insert or search actions.
2. The PC sends a split_start AAS to all of the other copies.
3. The PC waits for acknowledgements of the AAS from all of the other copies.
4. When the PC receives all of the acknowledgements, it performs the half-split, creating all copies
of the new sibling and sending them the sibling's original value.
5. The PC sends a split_end AAS to all copies, and performs a split_end AAS on itself.
When a non-PC copy receives a split_start AAS, it blocks the execution of initial inserts and sends an
acknowledgement to the PC. The executions of further initial insert actions on the copy are blocked
until the PC sends a split_end AAS. When the copy processes the split_end AAS, it modifies the range
of the copy, modifies the right-sibling pointer, discards pointers that are no longer in the node's range,
and unblocks the initial insert actions.
Insert When a copy receives an initial insert action it does the following:
1. Checks to see if the insert is in the copy's range. If not, the insert action is sent to the right
sibling.
2. If the insert is in range, and the copy is performing a split AAS, the insert is blocked.
3. Otherwise, the insert is performed and relayed insert actions are sent to all of the other copies.
When a copy receives a relayed insert action, it checks to see if the insert is in the copy's range. If so,
the copy performs the insert. Otherwise, the action is discarded.
Search When a copy receives a search action, it examines the node's current state and issues the appropriate
subsequent action.
We note that since non-PC copies can't initiate a half-split action, they may be required to perform an
insert on a too-full node. Actions on a copy are performed on a single processor, so it is not a problem to
attach a temporary overflow bucket. The PC will soon detect the overflow condition and issue a half-split,
correcting the problem.
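The half-split and insert rules above can be sketched in Python. The sketch is a simplified model under stated assumptions: direct method calls stand in for reliable in-order messages, the `forwarded` list stands in for sending an insert to the right sibling, and the names `Copy`, `begin_split`, and `finish_split` are hypothetical.

```python
class Copy:
    def __init__(self, low, high):
        self.low, self.high = low, high
        self.keys = set()
        self.splitting = False  # True between split_start and split_end
        self.blocked = []       # initial inserts held while the AAS executes
        self.peers = []         # the other copies of this node
        self.forwarded = []     # inserts passed on to the right sibling

    def initial_insert(self, key):
        if not (self.low <= key < self.high):
            self.forwarded.append(key)   # step 1: out of range
        elif self.splitting:
            self.blocked.append(key)     # step 2: blocked by the split AAS
        else:
            self.keys.add(key)           # step 3: perform, then relay
            for peer in self.peers:
                peer.relayed_insert(key)

    def relayed_insert(self, key):
        if self.low <= key < self.high:  # otherwise the action is discarded
            self.keys.add(key)

    def split_start(self):
        self.splitting = True
        return "ack"

    def split_end(self, new_high):
        self.high = new_high
        self.keys = {k for k in self.keys if k < new_high}
        self.splitting = False
        pending, self.blocked = self.blocked, []
        for key in pending:              # unblock the held initial inserts
            self.initial_insert(key)


def begin_split(pc):
    """Steps 1 to 3: start the AAS at every copy and collect acknowledgements."""
    pc.split_start()
    return [peer.split_start() for peer in pc.peers]


def finish_split(pc, split_key):
    """Steps 4 and 5: half-split at the PC, then end the AAS at every copy."""
    sibling_keys = {k for k in pc.keys if k >= split_key}
    for copy in [pc] + pc.peers:
        copy.split_end(split_key)
    return sibling_keys
```

An initial insert that arrives at a non-PC copy between `begin_split` and `finish_split` is held and then either performed in the reduced range or forwarded, so it is never silently discarded in this model.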
Theorem 1 The synchronous split algorithm satisfies the complete, compatible, and ordered history
requirements.
Proof: We observe that the fourth link-algorithm guideline is satisfied, so that whenever an action arrives at
a copy, its parameter is within the copy's inreach. Therefore, the synchronous split algorithm satisfies the
complete history requirement.
Since there are no ordered actions, the synchronous split algorithm vacuously satisfies the ordered history
requirement.
We show that the synchronous algorithm produces compatible histories by showing that the histories at
each node are compatible with the uniform history at the primary copy. First, consider the ordering of the
half-split actions (a half-split is performed at a node when the split_end AAS is executed). All initial half-
split actions are performed at the PC, then are relayed to the other copies. Since we assume that messages
are received in the order sent, all half-splits are processed in the same order at all nodes.
Consider an initial insert I and a relayed half-split s performed at non-PC copy c. If I < s in $H_c$, then I
must have been performed at c before the AASstart for s arrived at c (because the AASstart blocks initial
inserts). Therefore, I's relayed insert i must have been sent to the PC before the acknowledgement of s was
sent. By message ordering, i is received at the PC before S is performed at the PC, so i < S in $H_{pc}$. If
s < I in $H_c$, then S < i in $H_{pc}$, because S < s and I < i (due to message passing causality). □
We note that this algorithm makes good use of lazy updates. For example, only the PC needs an
acknowledgement of the split AAS. If every communications channel between copies had to be flushed, a
split action would require $O(|copies(n)|^2)$ messages instead of the $O(|copies(n)|)$ messages that this algorithm
uses. Furthermore, search actions are never blocked.
Figure 5: Synchronous and semi synchronous split ordering. The synchronous split algorithm blocks new
inserts while a split executes. The semi-synchronous split algorithm never blocks inserts; instead, it rewrites
history to ensure compatible histories: if a late insert is in range, the PC rewrites history, and if the insert
is not in range, the PC rewrites history and issues a correction.
4.1.2 Semi synchronous Splits
We can greatly improve on the synchronous-split algorithm. For example, the synchronous split algorithm
blocks initial inserts when a split is being performed. Furthermore, 3 |copies(n)| messages are required to
perform the split. By applying the "trick" of rewriting history, we can obtain a simpler algorithm
that never blocks insert actions and requires only |copies(n)| messages per split (and therefore is optimal).
The synchronous-split algorithm ensures that an initial insert I and a relayed split s at a non-PC node
are performed in the same order as the corresponding relayed insert i and initial split S are performed at
the PC, with the ordering in the PC setting the standard. We can turn this requirement around and let the
non-PC copies determine the ordering on initial inserts and relayed splits, and place the burden on the PC
to comply with the ordering.
Suppose that the PC performs initial split S, then receives a relayed insert $i_c$ from c, where $I_c$ was
performed before s at c (see Figure 5). We can keep $H_{pc}$ compatible with $H_c$ by rewriting $H_{pc}$, inserting
$i_c$ before S in $H_{pc}$. If $i_c$'s key is in the PC's range, then $H_{pc}$ can be rewritten by performing $i_c$ on the PC.
Otherwise, $i_c$'s key should have been sent to the sibling that s created. Fortunately, the PC can correct its
mistake by creating a new initial insert with $i_c$'s key, and sending it to the sibling. This is the basis for the
semi synchronous split algorithm.
Algorithm: The semi synchronous split algorithm is the same as the synchronous split algorithm, with
the following exceptions:
1. When the PC detects that a split needs to occur, it performs the initial split (creates the copies of the
new sibling, etc.), then sends relayed split actions to the other copies.
2. When a non-PC copy receives a relayed split action, it performs the relayed split.
3. If the PC receives a relayed insert and the insert is not in the range of the PC, the PC creates an initial
insert action and sends it to the right neighbor.
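The PC's side of the three rules above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class `PrimaryCopy`, its half-open key range `[low, high)`, the `outbox` message list, and the destination name `"right_sibling"` are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class PrimaryCopy:
    """Hypothetical primary copy (PC) of one node covering keys in [low, high)."""
    low: int
    high: int
    keys: set = field(default_factory=set)
    outbox: list = field(default_factory=list)  # (destination, action, key) tuples

    def initial_split(self, split_key, other_copies):
        """Rule 1: perform the initial split, then relay it; no acknowledgements are awaited."""
        moved = {k for k in self.keys if k >= split_key}
        self.keys -= moved
        self.high = split_key
        for copy in other_copies:
            self.outbox.append((copy, "relayed_split", split_key))
        return moved  # keys handed to the new right sibling

    def relayed_insert(self, key):
        """Rule 3: a relayed insert may have been issued before the sender saw the split."""
        if self.low <= key < self.high:
            # In range: performing the insert now is equivalent to having
            # performed it before the split (history is rewritten implicitly).
            self.keys.add(key)
        else:
            # Out of range: correct the history by issuing a new initial insert.
            self.outbox.append(("right_sibling", "initial_insert", key))
```

In this sketch a split costs one relayed-split message per other copy, and an insert that arrives late is either absorbed or forwarded, so inserts are never blocked.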
Theorem 2 The semi synchronous split algorithm satisfies the complete, consistent, and ordered history
requirements.
Proof: The semi synchronous algorithm can be shown to produce complete and ordered histories in the same
manner as in the proof of Theorem 1.
We need to show that all copies of a node have compatible histories. Since relayed inserts and relayed
splits commute, we need only consider the cases when at least one of the actions is an initial action. Suppose
that copy c performs initial insert I after relayed split s. Then, by message causality, the PC has already
performed S, so the PC will perform i after S.
Suppose that c performs I before s and PC performs i after S. If i is in the range of PC after S, then
i can be moved before S in Hpc without modifying any other actions. If i is no longer in the range of PC
after S, then moving i before S in Hpc requires that S's subsequent action be modified to include sending
i to the new sibling. This is exactly the action that the algorithm takes. □
Theorem 2 shows that we can take advantage of the semantics of the insert and split actions to lazily
manage replicated copies of the interior nodes of the B-tree. In the next section, we observe a different type
of lazy copy management that also simplifies implementation and improves performance.
4.2 Single-copy Mobile Nodes
In this section, we briefly examine the problem of lazy node mobility. We assume that there is only a single
copy of each node, but that the nodes of the B-tree can migrate from processor to processor (typically, to
perform load-balancing). When a node migrates, the host processor can broadcast its new location to every
other processor that manages the node (as is done in Emerald ). However, this algorithm requires large
amounts of wasted effort, and doesn't solve the garbage collection problems.
The algorithms that we propose inform the node's immediate neighbors of the new address. In order to
find the neighbors, a node contains links to both its left and right sibling, as well as to its parent and its
children. When a node migrates to a different processor, it leaves behind a forwarding address. If a message
arrives for a node that has migrated, the message is routed by the forwarding address. We are left with
the problem of garbage-collecting the forwarding addresses (when is it safe to reclaim the space used by a
forwarding address). As with the fixed-copies scenario, we propose an eager and a lazy algorithm to satisfy
the protocol. We have implemented the lazy protocol, and found that it effectively supports data balancing.
The eager algorithm ensures that a forwarding address exists until the processor is guaranteed that no
message will arrive for it. Unfortunately, obtaining such a guarantee is complex and requires much message
passing and synchronization. We omit the details of the eager algorithm to save space.
Suppose that a node migrates and doesn't leave behind a forwarding address. If a message arrives for
the migrated node, then the message clearly has misnavigated. This situation is similar to the misnavigated
operations in the concurrent B-link protocol, which suggests that we can use a similar mechanism to recover
from the error. We need to find a pointer to follow. If the processor stores a tree node, then that node
contains the first link on the path to the correct destination. So the error-recovery mechanism is to find a
node that is 'close' to the destination, and follow that set of links.
The other issue to address is the ordering of the actions on the nodes (since there is only one copy, every
node history is vacuously compatible). The possible actions are the following: insert, split, migrate, and
link-change. The link-change actions are a new development in that they are issued from an external source,
and need to be performed in the order issued.
Algorithm: Every node has two additional identifiers, a version number and a level. The version number
allows us to lazily produce ordered histories. The level, which indicates the distance to a leaf, aids in recovery
from misnavigation. An operation is executed by executing its B-link tree actions, so we only need to specify
the execution of the actions.
Out-of-range: When a message arrives at a node, the processor first checks if the node is in range. This
check includes testing to see if the node level and the message destination level match. If the message
is out of range or on the wrong level, the node routes it in the appropriate direction.
Migration: When a node migrates,
1. All actions on the node are blocked until the migration terminates.
2. A copy of the node is made on a remote processor, and the copy is a duplicate (with the exception
that the version number increments).
3. A link-change action is sent to all known neighbors of the node.
4. The original node is deleted.
Insert: Inserts are performed locally.
Half-split: Half-splits are performed locally by placing the sibling on the same processor and assigning the
sibling a version number one greater than the half-split node's. An insert action is sent to the parent,
and a link-change action is sent to the right neighbor.
Link-change: When a node receives a link-change action, it updates the indicated link only if the update's
version number is greater than the link's current version number. If the update is performed, the new
version number is recorded.
Missing Node: If a message arrives for a node at a processor, but the processor doesn't store the node,
the processor performs the out-of-range action at a locally stored node. If the processor doesn't store
a search structure node, the action is sent to the root.
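The out-of-range and missing-node rules can be sketched as a routing function. This is an illustration under stated assumptions: the `Node` fields, the half-open range `[low, high)`, the single `child` link, and the order in which level and range are tested are choices made here, not details given in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical tree node; links hold the addresses of neighboring nodes."""
    level: int                    # distance to a leaf
    low: int                      # node covers keys in [low, high)
    high: int
    parent: Optional[str] = None
    left: Optional[str] = None
    right: Optional[str] = None
    child: Optional[str] = None   # simplification: a real node keeps one link per key range


def route(node, key, level):
    """Out-of-range rule: return None if the message belongs at this node,
    otherwise the link along which to forward it."""
    if level > node.level:
        return node.parent        # destination is higher in the tree
    if key < node.low:
        return node.left
    if key >= node.high:
        return node.right
    if level < node.level:
        return node.child         # destination is lower in the tree
    return None


def deliver(local_nodes, root_address, key, level):
    """Missing-node rule: restart at any locally stored node, else send to the root."""
    if not local_nodes:
        return root_address
    return route(local_nodes[0], key, level)
```

Because every hop follows a link toward the destination, a message that arrives at the wrong processor is redirected rather than lost, which is why forwarding addresses are not needed for correctness.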
Theorem 3 The lazy algorithm satisfies the complete, compatible, and ordered history requirements.
Proof: There is only a single copy of a node, so the histories are vacuously compatible. Each action takes a
good state to a good state, so every action eventually finds its destination. Therefore, the algorithm produces
complete histories.
The only ordered actions are the link-change actions. The node at the end of a link can only change
due to a split or a migration. In both cases, the node's version number increments. When a link-change
action arrives at the correct destination, it is performed only if the version number of the new node is larger
than the version number of the current node. If the update is not performed, the node's history is rewritten to
insert the link change into its proper place. Let l be a link-change action that is not performed, and let l be
an ordered action of class C. Let ai be the ordered action of class C in Hc that is ordered immediately after
S(there is no aj such that 1
the history can be rewritten so that it remains valid. □
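The version-number test on link-change actions can be illustrated with a short sketch. The `Link` record and the function name `apply_link_change` are hypothetical; only the comparison rule comes from the algorithm description.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical link record: where a neighbor is believed to live."""
    address: str   # processor that holds the neighbor
    version: int   # neighbor's version number when the link was last updated


def apply_link_change(link, new_address, new_version):
    """Perform the update only if it carries a larger version number.

    A stale update is skipped, which corresponds to rewriting the history
    so that the skipped action appears before the newer one.
    """
    if new_version > link.version:
        link.address = new_address
        link.version = new_version
        return True
    return False
```

For example, if the update for version 3 arrives before the update for version 2, the second call is ignored and the link keeps pointing at the version 3 location, so the final state is the same as if the actions had arrived in the order issued.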
We note that an implementation of the lazy single-copy algorithm can use forwarding addresses to improve
efficiency and reduce overhead. The forwarding addresses are not required for correctness, so they can be
garbage-collected at convenient intervals.
4.3 Variable Copies
In this scenario, we assume that leaf level nodes can migrate, and that processors can join and leave the
replication of the index nodes (so we can use this algorithm to implement a never-merge dB-tree). We
assume that the leaf nodes are not replicated, and that the PC of a node never changes.
The lazy algorithm that we propose combines elements of the lazy fixed-copy and migrating-node algo-
rithms by using lazy splits, version numbers, and message recovery.
To allow for data-balancing, we let the leaf level nodes migrate. The leaf level nodes aren't replicated,
so we can manage them with the lazy algorithm for migrating nodes (section 4.2). We want to maintain
the dB-tree property that if a processor owns a leaf node, it has a copy of every node on the path from the
root to the leaf. If a node obtains a new leaf node, it must join the set of copies for every node from the
root to the leaf which it does not already help maintain. If the processor sends off the last child of a node,
it unjoins the set of processors that maintain the parent (applied recursively). When a processor joins
or unjoins a node replication, the neighboring nodes are informed of the new cooperating processor with a
link-change action. To facilitate link-change actions, we require that a node have pointers to both its left
and right sibling. Therefore, a split action generates a link-change subsequent action for the right sibling, as
well as an insert action for the parent.
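The path property described above (a processor that owns a leaf holds a copy of every node from the root to that leaf) can be sketched with two helper functions. The names `nodes_to_join` and `nodes_to_unjoin`, and the dictionaries used to describe the tree, are assumptions made for illustration only.

```python
def nodes_to_join(path, held):
    """Interior nodes on a root-to-leaf path whose replication the processor must join.

    path: interior node ids from the root down to the leaf's parent.
    held: set of node ids the processor already helps maintain.
    """
    return [node for node in path if node not in held]


def nodes_to_unjoin(leaf, parent, local_children):
    """Walk upward from a departing leaf, collecting nodes left with no local child.

    parent: dict mapping each node id to its parent id (the root has no entry).
    local_children: dict mapping a node id to the set of its children held
    locally; it is updated in place as children depart.
    """
    unjoined = []
    child = leaf
    while child in parent:
        node = parent[child]
        remaining = local_children.setdefault(node, set())
        remaining.discard(child)
        if remaining:
            break                 # the node still has a local child, so stop
        unjoined.append(node)     # last child left: unjoin, then check its parent
        child = node
    return unjoined
```

Each node returned by either helper would then trigger a join or unjoin action at that node's PC, followed by link-change actions to the neighbors.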
We assume that every node has a PC that never changes (we can relax this assumption). The primary
copy is responsible for performing all initial split actions and for registering all join and unjoin actions. The join
and unjoin actions are analogous to the migrate actions. Hence, every join or unjoin registration increments
the version number of the node. The version number permits the correct execution of ordered actions, and
also helps ensure that copies which join a replication obtain a complete history (see Figure 6). When a
processor unjoins a replication, it will ignore all relayed actions on that node and perform error recovery on
all initial action requests.
Out-of-range: If a copy receives an initial action that is out-of-range, the copy sends the action across the
appropriate link. Relayed actions that are out of range are discarded.
Insert: 1. When a copy receives an initial insert action, it performs the insert and sends relayed-insert
actions to the other node copies that it is aware of. The copy attaches its version number to the
relayed-insert actions.
2. When a non-PC copy receives a relayed insert, it performs the insert if it is in range, and discards
it otherwise.
3. When the PC receives a relayed insert action, it tests to see if the relayed insert action is in range.
(a) If the insert is in range, the PC performs the insert. The PC then relays the insert action
to all copies that joined the replication at a later version than the version attached to the
relayed insert.
(b) If the insert is not in range, the PC sends an initial insert action to the appropriate neighbor.
Split: 1. When the PC detects that its copy is too full, it performs a half-split action by creating a new
sibling on several processors, designating one of them to be the PC, and transferring half of its
keys to the copies of the new sibling. The PC sets the starting version number of the new sibling
to be its own version number plus one. Finally, the PC sends an insert action to the parent, a
link-change action to the PC of its old right sibling, and relayed-split actions to the other copies.
2. When a non-PC copy receives a relayed half-split action, it performs the half-split locally.
Join: When a processor joins a replication of a copy, it sends a join action to the PC of the node. The PC
increments the version number of the node and sends a copy to the requester. The PC then informs
every processor in the replication of the new member, and performs a link-change action on all of its
neighbors.
Unjoin: When a processor unjoins a replication of a node, it sends an unjoin action to the PC and deletes
its copy. The processor discards relayed actions on the node and performs error recovery on the initial
actions. When the PC receives the unjoin action, it removes the processor from the list of copies, relays
the unjoin to the other copies, and performs a link-change action on all of its neighbors.
Relayed join/unjoin: When a non-PC copy receives a join or an unjoin action, it updates its list of
participants and its version number.
Link-change: A link-change action is executed using the migrating-node algorithm.
Missing-node: When a processor receives an initial action for a node that it doesn't manage, it submits
the action to a 'close' node, or returns the action to the sender.
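The interaction between joins and relayed inserts at the PC can be sketched as follows. This is a simplified illustration: the class `VariablePC`, its `joined_at` table, and the `outbox` list are hypothetical, and the range test and split handling are omitted (every insert is assumed to be in range).

```python
from dataclasses import dataclass, field


@dataclass
class VariablePC:
    """Hypothetical primary copy that tracks the version at which each copy joined."""
    version: int = 0
    keys: set = field(default_factory=set)
    joined_at: dict = field(default_factory=dict)  # processor -> version when it joined
    outbox: list = field(default_factory=list)     # (destination, action, payload) tuples

    def join(self, processor):
        """Register a join: bump the version, hand over a copy, inform the other copies."""
        self.version += 1
        self.joined_at[processor] = self.version
        for other in self.joined_at:
            if other != processor:
                self.outbox.append((other, "relayed_join", processor))
        return set(self.keys), self.version

    def relayed_insert(self, key, sender, sender_version):
        """Perform a relayed insert, then relay it to copies the sender could not know about."""
        self.keys.add(key)
        for processor, joined_version in self.joined_at.items():
            if processor != sender and joined_version > sender_version:
                self.outbox.append((processor, "relayed_insert", key))
```

The version attached to a relayed insert tells the PC which copies the sender was unaware of, so an insert that runs concurrently with a join still reaches the new copy.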
Theorem 4 The variable-copies algorithm satisfies the complete, compatible, and ordered history requirements.
Proof: We can show that the variable-copies algorithm produces complete and ordered histories by using
the proof of Theorem 3. If we can show that for every node n, the history of every copy c ∈ copies(n) has a
backwards extension H' whose uniform update actions are exactly Mn, then the proof of Theorem 2 shows
that the variable copies algorithm produces compatible histories.
For a node n with primary copy PC, let Ai be the set of update actions performed on PC when the
PC has version number i. When copy c is created, the PC updates its version number to j and gives c an
initial value Ic = InBj, where Bj is the backwards extension of Ic to In, and contains all uniform update
actions in A1 through A(j-1). The PC next informs all other copies of the new replication member. After a
copy c' is informed of c, c' will send all of its updates to c. The copy c' might perform some initial updates
concurrent with c's joining copies(n). These concurrent updates are detected by the PC by the version
number algorithm and are relayed to c. Therefore, at the end of a computation every copy c ∈ copies(n)
has every update in Mn in its uniform history. Thus, the variable copies algorithm produces compatible
histories. □
[Figure 6: Incomplete histories due to concurrent joins and inserts.
Columns: new copy, primary copy, old copy. An insert that is executed concurrently with the join is not
sent to the new copy.]
We present algorithms for implementing lazy updates on a dB-tree, a distributed B-tree. The algorithms in
this paper can be used to implement a dB-tree that never merges empty nodes and performs data-balancing
on the leaves (we have previously found that the free-at-empty policy provides good space utilization
and that leaf-level data balancing is effective and low-overhead). We provide a correctness theory for
lazy updates, so lazy update techniques can be applied to other distributed and
replicated search structures. Lazy updates, like lazy replication, permit the efficient maintenance of the
replicated index nodes. Since little synchronization is required, lazy updates permit concurrent search and
modification of a node, and even concurrent modification of a node. Finally, distributed search structures
that use lazy updates are easier to implement than more restrictive algorithms because lazy updates avoid
the use of synchronization.
Our plans for future work include developing lazy update algorithms for node merging and node
deletion (for a dE-tree). We will apply lazy updates to other distributed data structures, such as hash
tables, tries, and parallel file structures. Finally, we will investigate fault-tolerant lazy updates.
 F.B. Bastani, S.S. Iyengar, and I-Ling Yen. Concurrent maintenance of data structures in a distributed
environment. The Computer Journal, 21(2):165-174, 1988.
 P.A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database
Systems. Addison-Wesley, 1987.
 A. Colbrook, E.A. Brewer, C.N. Dellarocas, and W.E. Weihl. An algorithm for concurrent search trees.
In Proceedings of the 20th International Conference on Parallel Processing, pages III138-III141, 1991.
 D. Comer. The ubiquitous B-tree. ACM Comp. Surveys, 11:121-137, 1979.
 C.S. Ellis. Distributed data structures: A case study. IEEE Transactions on Computing, C-34(12):1178-1185,
1985.
 M. Herlihy. A methodology for implementing highly concurrent data structures. In Proceedings of the
Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 197-206.
 M. Herlihy and J. Wing. Linearizability: A correctness condition for concurrent objects. ACM Trans-
actions on Programming Languages and Systems, 12(3):463-492, 1990.
 T. Johnson. The Performance of Concurrent Data Structure Algorithms. PhD thesis, NYU Dept. of
Computer Science, 1990.
 T. Johnson and A. Colbrook. A distributed data-balanced dictionary based on the B-link tree. In Proc.
Int'l Parallel Processing Symp., pages 319-325, 1992.
 T. Johnson and A. Colbrook. A distributed data-balanced dictionary based on the B-link tree. Technical
Report TR-530, MIT LCS, 1992. Also available via anonymous ftp at cis.ufl.edu:cis/tech-reports.
 T. Johnson and D. Shasha. Utilization of B-trees with inserts, deletes and modifies. In ACM
SIGACT/SIGMOD/SIGART Symposium on Principles of Database Systems, pages 235-246, 1989.
 T. Johnson and D. Shasha. A framework for the performance analysis of concurrent B-tree algorithms.
In ACM Symp. on Principles of Database Systems, pages 273-287, 1990.
 E. Jul, H. Levy, N. Hutchinson, and A. Black. Fine-grained mobility in the Emerald system. ACM
Transactions on Computer Systems, 6(1):109-133, 1988.
 P. Krishna and T. Johnson. Implementing distributed search structures. Technical Report UF CIS
TR92-032, Available at anonymous ftp site cis.ufl.edu, University of Florida, Dept. of CIS, 1992.
 R. Ladin, B. Liskov, and L. Shrira. Lazy replication: Exploiting the semantics of distributed services. In
ACM Principles of Distributed Computing, pages 43-57, 1990.
 V. Lanin and D. Shasha. A symmetric concurrent B-tree algorithm. In 1986 Fall Joint Computer
Conference, pages 380-389, 1986.
 P.L. Lehman and S.B. Yao. Efficient locking for concurrent operations on B-trees. ACM Transactions
on Database Systems, 6(4):650-670, 1981.
 Y. Sagiv. Concurrent operations on B*-trees with overtaking. In 4th ACM Symp. Principles of Database
Systems, pages 28-37. ACM, 1985.
 D. Shasha and N. Goodman. Concurrent search structure algorithms. ACM Transactions on Database
Systems, 13(1):53-90, 1988.
 J. Turek. Resilient Computation in the Presence of Slowdowns. PhD thesis, NYU Dept. of Computer
Science.
 W.E. Weihl and P. Wang. Multi-version memory: Software cache management for concurrent B-trees.
In Proc. 2nd IEEE Symp. Parallel and Distributed Processing, pages 650-655, 1990.