Supporting Insertions and Deletions in Striped Parallel Filesystems

Theodore Johnson
University of Florida, Dept. of CIS
Gainesville, FL 32611-2024


September 23, 1992


Abstract
The dramatic improvements in the processing rates of parallel computers are turning many
compute-bound jobs into IO-bound jobs. Parallel file systems have been proposed to better match
IO throughput to processing power. Many parallel file systems stripe files across numerous disks;
each disk has its own controller. A striped file can be appended (or prepended) to and maintain its
structure. However, a block can't be inserted into or deleted from the middle of the file, since doing
so would destroy the regular striping structure of the file. In this paper, we present a distributed file
structure that maintains files in indexed striped extents on a message passing multiprocessor. This
approach allows highly parallel random and sequential reads, and also allows insertion and deletion
into the middle of the file.


1 Introduction

Researchers have observed that the performance of I/O subsystems has not kept pace with the increasing
performance of the processors, especially in parallel systems [1]. A very large I/O bandwidth can be
created by attaching disk drives to each node of a parallel computer, but this bandwidth cannot be used
effectively if files are stored sequentially on the disks [3]. A parallel filesystem is a file system in which
the files are stored on multiple disks and the disk drives are located on different processors.
A common method for implementing a parallel filesystem is to use disk striping [17], in which con-
secutive blocks in a file are stored on different disk drives. Kim [8] has proposed disk synchronization, in
which a file is byte-wise distributed across the disks, and the disks synchronously read a block. Reddy
and Banerjee [13, 14, 15] compare the performance of disk synchronization and disk striping.
The Bridge filesystem [3, 2, 9] is a parallel striped file system implemented on a BBN Butterfly. Files
in Bridge are striped across multiple disks in a regular pattern. The Bridge file system can support
highly parallel block file reads and file appends. Based on these operations, the authors can implement
a highly parallel secondary storage merge sort.
Inserting or deleting a block in the middle of a striped parallel file requires that the file be reorganized,
since otherwise the regular pattern of blocks striped across disks is destroyed by an out-of-order block
or a gap. Dibble, Scott, and Ellis concede that Bridge does not support these operations, and suggest a
linked-list implementation of the file that trades off the capability for interior inserts and deletes at the
expense of slow random access [3].
In this paper, we present a file index structure that allows inserts and deletes in the middle of a
parallel striped file (if the records in the file are ordered), and that permits fast random access and
highly parallel block reads. We base our file structure on the dB-tree [6, 7], a highly parallel distributed
index structure. Thus, the file structures that we propose can be implemented on a message passing
parallel computer.


2 Parallel Striped Files

In this section, we review the structure of a parallel striped file, discuss the operations on the file, and
describe the difficulty of performing insertions or deletions in the middle of the file. We will use the
Bridge filesystem [3, 2, 9] as a concrete example.
The Bridge file system consists of a number of Local File Systems (or LFSs) distributed throughout
the parallel computer. Each LFS is managed by a single processor. The blocks of a file are distributed in
round-robin fashion among L LFSs, d_0, d_1, ..., d_{L-1}. If the first block of the file is stored at LFS d_s, then
the i-th block of the file is stored at LFS d_{(i + s) mod L}. If there are M LFSs on the parallel computer,
then we will assume here that L = M.
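As an illustration of this round-robin mapping (the function and variable names below are ours, not part of Bridge), the following sketch shows how a block number is resolved to an LFS and how a sequential read of k blocks is grouped so that each LFS reads only its local blocks:

```python
def lfs_for_block(i, s, L):
    """LFS that stores block i of a file whose first block is at LFS s."""
    return (i + s) % L

def schedule_sequential_read(first, k, s, L):
    """Group blocks first .. first+k-1 by LFS.  Each LFS can then read its
    own blocks independently, so the read takes about ceil(k/L) rounds."""
    per_lfs = {d: [] for d in range(L)}
    for i in range(first, first + k):
        per_lfs[lfs_for_block(i, s, L)].append(i)
    return per_lfs

# Example: L = 4 LFSs, file starting at LFS 1; blocks 0..6 map to
# {0: [3], 1: [0, 4], 2: [1, 5], 3: [2, 6]}
```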
A striped file can support the following file operations in parallel:

Sequential Read: The M processors can read the i-th through the (i + k - 1)-th blocks of F in parallel
by having each LFS read its local blocks. The k blocks can be read in O(k/M) disk accesses, with
an I/O throughput of min(k, M).

Concurrent Read: The processes in a computation might read the blocks of a given file independently.
We assume that each process issues a random access read request for a single block. A process
directs its read request to the appropriate LFS, which is easily determined from the block number.
Each LFS can service its requests independently. Up to M requests can be served simultaneously.
If there are k simultaneous requests, the throughput will be less than min(k, M) due to collisions.

Block Append: The M processors can append or prepend k blocks to F by determining the processor
that stores the last (first) block of F, determining which processor should store which block of the
appended or prepended data, then sending the blocks to the appropriate processors. Again, the k
blocks can be written in O(k/M) disk accesses, with an I/O throughput of min(k, M).

Update: Portions of the interior of a file can be overwritten, since overwriting will not change the
striping pattern. A large block can be updated in parallel and many small blocks can be updated
concurrently, assuming that concurrency control issues have been settled.

For many types of files and applications, the only operations on the files are read and append, so that
the mechanism of striping leads to a substantial increase in throughput [1]. In other types of applications,
it would be very useful to be able to insert or delete data blocks from the middle of the file. For example,
the file might contain a quad or oct tree that stores image data in locational code [18], which is being
revised. In a striped file, a block can't be inserted or deleted from the middle of the file without requiring
an expensive restructuring to maintain the regular striped pattern (see figure 1).
The regular pattern of the assignment of disk blocks to LFSs is important for two reasons. First,
it permits parallel read operations without collisions between read requests. Second, it makes indexing
into the file simple. When a processor issues a random access read request, it first determines which LFS
stores the block by a simple modulo calculation, then sends the request to that LFS. The LFS consults
a local index to determine which disk block to return [3, 2].
Allowing irregularities in the striping pattern is not a satisfactory solution, because file reads must
first be filtered to remove the gaps, and indexing the file becomes difficult. Restructuring the file
can be very expensive if the file is long. Dibble, Scott, and Ellis [3] suggest that the file can be broken
into a series of striped extents, where each extent is linked to its successor (an extent is a contiguous
set of blocks with an uninterrupted striping pattern). A striped extent-based file will suffer moderate














Figure 1: Insertion and deletion require striped file restructuring


performance degradation when a sequential read operation is performed (due to the misalignment at
extent boundaries). The most serious problem, though, is that random access to the file will be very
slow, since the extents must be traversed to find a block in the interior of the file.
In this paper, we propose an index structure for parallel striped files. The index, which is based on
the dB-tree [7, 6], is a highly parallel distributed data structure, so an indexed striped file system can
be built on a distributed memory parallel computer. Our approach is similar to the one proposed in [3],
in that we maintain the file as a sequence of linked, striped extents. We use the dB-tree to index and to
help maintain the extents. The striped extents provide highly parallel sequential access, and the dB-tree
index provides fast and highly concurrent random access.
We note that in this paper, we only address the issues of indexing, accessing, and modifying an
indexed striped file. We don't address issues of distributing the data once it is retrieved [1, 3], caching
[9, 10, 4], and other file system issues.


3 The dE-tree

Our approach to maintaining a striped file is to break it into extents, then keep an index into the extents.
Therefore, one major problem that we need to solve is that of maintaining a distributed index. In this
section, we discuss our highly parallel distributed index, the dB-tree, which is jointly managed by all
of the LFSs in the file system. The dB-tree allows concurrent searches and updates, and can initiate a
search from any LFS in the file system. Readers who are familiar with B+-trees can skip to section 4, as
the distributed index is based on a B+-tree whose leaves correspond to the striped extents.
The base for the dB-tree is the concurrent (shared memory) B-link tree [12, 16, 11], which is a B+-
tree in which every node contains a pointer to its right neighbor. The B-link tree is easy to distribute
because global restructuring operations are performed one node at a time. In the concurrent B-link tree
algorithm, each node also contains a field, highest, which is the highest valued key that can be stored
in the subtree rooted at that node. The B-link tree algorithms use the additional information stored in
the nodes to let an operation recover if it misnavigates in the tree due to out-of-date information.
In the concurrent B-link tree algorithm described by Sagiv [16], insert operations place no more than
one lock on the data structure at a time. Search operations start by placing an R (read) lock on the root
and then searching the root to determine the next node to access. The search operation then unlocks
the root and places an R lock on the next node. The search operation continues accessing the interior
nodes in this manner until it reaches the leaf that could contain the key for which it is searching. If
the operation reads a node and finds that the highest field is lower than the key it is searching for, it
follows the right pointer of the node. When the search operation reaches the leaf that would contain the
key that it is searching for, it searches the leaf for the key and returns success or failure depending on
whether the key is or is not in the leaf.
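The search can be summarized by the following sketch (a minimal, lock-free rendering with illustrative field names; the actual algorithms appear in [12, 16, 11]):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    highest: int                     # highest key that may appear in this subtree
    is_leaf: bool
    keys: List[int] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    right: Optional["Node"] = None   # right-neighbor link

def blink_search(root: Node, key: int) -> bool:
    """Search for key; recover from concurrent half-splits by following
    right links whenever the node's highest field is lower than the key."""
    node = root
    while True:
        while node.highest < key:    # misnavigated: key now lies to the right
            node = node.right
        if node.is_leaf:
            return key in node.keys
        node = _child_for(node, key)

def _child_for(node: Node, key: int) -> Node:
    # Routine B+-tree routing: child j covers keys up to keys[j].
    for sep, child in zip(node.keys, node.children):
        if key <= sep:
            return child
    return node.children[-1]
```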
An insert operation works in two phases, searching and restructuring. The search phase of the insert
operation uses the same algorithm as the search operation, except that it places W (exclusive write) locks
on the leaf nodes. When the insert operation reaches the appropriate leaf, it inserts its key if the key is
not already in the leaf. If the leaf is now too full, the insert operation must split the leaf and restructure
the tree, as in the usual B-tree algorithm. An operation holds at most one lock at a time, so it must
break the restructuring into disjoint pieces. So, when an insert operation finds that it must restructure
the tree, it performs a half-split operation (see figure 2). A half-split operation consists of creating a new
node (the sibling), transferring half of the node's keys to the sibling, and linking the sibling into the leaf
list. In order to complete the split, the insert operation releases the lock on the node, locks the parent,
and inserts a pointer to the sibling. If the parent becomes too full, the insert operation applies the same
restructuring steps. Notice that for a period of time, there is no pointer in the parent to the sibling.
Operations can still reach the sibling because of the right pointers and highest field.
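A half-split can be sketched as follows, reusing the illustrative Node type from the previous sketch (locking and the later parent update are elided):

```python
def half_split(node: Node) -> Node:
    """Move the upper half of an over-full node into a new right sibling and
    link the sibling into the node list; the caller then locks the parent and
    inserts a pointer to the sibling to complete the split."""
    mid = len(node.keys) // 2
    sibling = Node(highest=node.highest,
                   is_leaf=node.is_leaf,
                   keys=node.keys[mid:],
                   children=node.children[mid:] if not node.is_leaf else [],
                   right=node.right)
    node.keys = node.keys[:mid]
    if not node.is_leaf:
        node.children = node.children[:mid]
    node.highest = node.keys[-1]     # keys above this now live in the sibling
    node.right = sibling             # the sibling stays reachable via the link
    return sibling
```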


Figure 2: Half-split operation (panels: initially; half-split; complete the split)













The dB-tree algorithm builds on the concurrent B-link algorithms. A sample dB-tree (distributed
B-tree) is shown in figure 3. The leaves of the dB-tree are distributed among four processors. The interior
nodes of the dB-tree are replicated among several processors. The rule for replicating the interior is that
if a processor owns a leaf, then it owns a copy of every node on the path from the root to the leaf,
inclusive. Each node on a level has links to both neighbors, not just the right neighbor. Keeping the
nodes on a level in a doubly-linked list simplifies replication control and the deletion of nodes (as will be
discussed later), and also aids in downward range queries.


Figure 3: A dB-tree (leaves L1-L6 are distributed among four processors; the root and the interior nodes
I1 and I2 are replicated on several processors)

The dB-tree has three operations defined on it: search(v), insert(v), and delete(v). These operations
have the usual semantics. An operation on the dB-tree performs suboperations on the tree nodes in
order to perform its computation. The suboperations are performed atomically by a processor on the
processor's local data. Each processor has a queue of pending suboperations that need to be performed
on its local nodes. A processor that helps to maintain the dB-tree accepts suboperations from other
processors and carries out the required actions. Conversely, a processor can send an operation to a
different processor.
A search operation that originates at one of the participating processors starts by performing a node-
search suboperation on the locally stored copy of the root. A search operation that originates at a
processor that does not help maintain the dB-tree must transmit its request to one of the participating
processors. The node-search suboperation is completely local: the node can be read in parallel at every
processor that owns a copy of the node, and a node-search at one processor does not block an update of
the same node at another processor. The node-search suboperation determines the next node that the
operation must access. A copy of the next node on the search path may be stored locally, in which case
the local copy should be used, or all copies may be stored remotely, in which case the processor must
send the operation to a processor that stores a copy of the node.
As an example, if a search operation for a key stored in L3 originates at processor 3, then the operation
reads processor 3's copy of the root and I1, then transmits a request to read L3 at processor 4. If a
processor has a choice of sites to send the search operation to, it can make the choice based on any
reasonable criterion, such as locality or estimated load. For example, if a search request for a key stored
at L5 originates at processor 3, processor 3 chooses between processors 1, 2, and 4 to service the request
to search node I2. When the search operation reaches the correct leaf, it returns a success message to the
originating processor if the key is in the leaf, and a failure message otherwise.
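The routing decision can be sketched as follows (the copy table, load estimates, and return convention are our assumptions; the paper only requires that some copy of the node be used):

```python
from typing import Dict, Set, Tuple

def choose_processor(holders: Set[int], load: Dict[int, float]) -> int:
    """Pick a processor holding a copy of the node, here the least loaded."""
    return min(holders, key=lambda p: load.get(p, 0.0))

def route_node_search(node_id: str, me: int,
                      copy_table: Dict[str, Set[int]],
                      load: Dict[int, float]) -> Tuple[str, int]:
    """Run the node-search suboperation locally if this processor holds a
    copy of the node; otherwise name the processor to forward it to."""
    holders = copy_table[node_id]
    if me in holders:
        return ("local", me)
    return ("remote", choose_processor(holders, load))

# Mirroring figure 3: the root is replicated on all processors, I2 on 1, 2, 4.
copies = {"root": {1, 2, 3, 4}, "I2": {1, 2, 4}}
# route_node_search("root", 3, copies, {}) -> ("local", 3)
# route_node_search("I2", 3, copies, {1: 0.2, 2: 0.7, 4: 0.4}) -> ("remote", 1)
```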
If an insert or delete operation does not cause restructuring, then it is executed in the same way













Figure 4: Lazy inserts (nodes A and B half-split, creating A' and B'; A' is inserted into copy 1 of the
parent and B' into copy 2, in different orders)


as a search operation, except that the action on the leaf is to insert or delete a key. If restructuring
is required, the half split operation (or a corresponding half-merge operation) is used to perform the
restructuring in a series of local actions.
If restructuring a node must be performed atomically, then every copy of a node must be locked, which
is expensive. The structure of the dB-tree allows lazy updates, in which operations may be performed
out of order. In the example shown in figure 4, nodes A and B have split, and pointers to the new nodes
(A' and B') need to be inserted into the copies of the parent. In the time between the creation of a
sibling and the insertion of a pointer to that sibling into the parent, the sibling can still be reached by the
search, insert, and delete suboperations by the right and left pointers. Thus, if two inserts to the parent
node are pending, it does not matter which is performed first; they can be performed in a different order
in different copies of the parent node.
Not every update can be lazy (consider, for example, half-splits), but many node updates can be
lazy updates. As a result, the dB-tree can be maintained with little synchronization or communication
overhead.
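The commutativity behind lazy updates can be illustrated with a toy sketch (illustrative only; the actual replica-maintenance protocol is given in [6, 7]): pending child-pointer insertions touch distinct keys, so each copy of the parent may apply them in any order and still converge.

```python
def apply_pending(parent_entries: dict, pending: list) -> dict:
    """Apply pending (separator key -> child) insertions to one copy of a
    parent node; the result is independent of the application order."""
    entries = dict(parent_entries)
    for sep, child in pending:
        entries[sep] = child
    return entries

copy1 = apply_pending({10: "A", 30: "B", 50: "C"}, [(20, "A'"), (40, "B'")])
copy2 = apply_pending({10: "A", 30: "B", 50: "C"}, [(40, "B'"), (20, "A'")])
assert copy1 == copy2     # both copies of the parent end up identical
```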
The dE-tree is a practical distributed index constructed from the dB-tree. The key observation is
that, instead of maintaining separate leaves for the consecutive keys stored in a processor, the dE-tree
maintains a single leaf (i.e., an extent) that holds only key range information and stores the keys in
a local data structure. Figure 5 shows a four-processor dB-tree in which processors store extents of
consecutive leaves, and the equivalent dE-tree. The extents in the dE-tree are managed as the leaves in a
dB-tree: when a processor decides that it is too heavily loaded, it transfers some of its keys to another
processor and performs what looks like a split or a merge operation.













Figure 5: The dE-tree (an extent-balanced dB-tree and the equivalent dE-tree, distributed over four
processors)


4 Indexed Striped Files

The dB-tree algorithms can provide an index structure that allows us to insert into or delete from a striped
file, and yet still perform parallel reads and fast random access. We assume that the file is composed
of records, each of which can be identified by a key, and the keys can be ordered. This assumption is
reasonable, because a request such as "insert this data after the 100-th block in the file" loses meaning
when data blocks are being inserted and deleted concurrently. Further, we note that many interesting
file types have this form, such as a file that contains a quad or oct tree [18], or a sorted database.
We assume that the records fit evenly onto the blocks. In addition to read, append, prepend, and
overwrite operations, we support the following:

Insert: Insert k blocks into F after the record with key v.

Delete: Delete k blocks from F starting at the record with key v.

4.1 Implementation
Instead of maintaining a single striped file, we maintain a sequence of independently striped extents.
The idea is that on an insertion or a deletion, we can either reorganize the extent or create a new extent.
The dB-tree index helps to manage the striped extents and provides an index to the extents. We will call
such a file an indexed striped file. This file structure is similar to the method that Dibble, Scott, and Ellis
proposed [3], but the index allows fast random access. In addition, the index aids in the maintenance of
the file.
An example of an indexed striped file is shown in figure 6. The file is broken into a number of extents,
each of which is independently striped across M disks (i.e., a striped extent). The extents are indexed












Figure 6: An Indexed Striped File (file i-node, index, and striped extents)


by a dB-tree. The nodes of the dB-tree are stored in the LFSs. The index is used for managing the
extents, as well as for providing an index for random access. The I-node for a file contains a description
of the file, a pointer to the root of the index, and a pointer to the leftmost extent (to facilitate sequential
access). The nodes of the index are maintained in the LFSs. The root of the index is replicated at all of
the LFSs, so that an index search can be initiated at all LFSs simultaneously.
The data structure for a striped extent is shown in figure 7. The dB-tree index points to an extent
stub. This stub contains the information necessary to read the extent: the extent name, the extent
length, and the processor that stores the first block of the extent. The index is used to allow random
accesses into the extent. We do not specify the exact nature of the index, nor is one always necessary.
Since each processor stores many extents, the extent name is used to tie the stored extents to the
extent stub. Finally, we note that the starting processor isn't necessarily the processor that maintains
the extent stub. Each processor that participates in the file system maintains an extent index, identified
with the extent name. The extent index (which can be associated with the file index) points to the
locally stored blocks.
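A rough sketch of these structures (field names are ours; the paper does not fix a layout):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ExtentStub:
    """What the index points to: enough information to read the extent."""
    name: str               # extent name, ties the stub to per-LFS indices
    length: int             # number of blocks in the extent
    start_processor: int    # LFS holding the extent's first block

@dataclass
class LocalExtentIndex:
    """Kept at each LFS: extent name -> locally stored block numbers."""
    blocks: Dict[str, List[int]] = field(default_factory=dict)

def lfs_for_extent_block(stub: ExtentStub, i: int, M: int) -> int:
    """LFS holding block i of an extent striped over M LFSs."""
    return (stub.start_processor + i) % M
```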
Scheduling a parallel read operation in a striped file is easy, because a request to read the i-th
block is directed to the i modulo M-th LFS. The striped extents can be read in parallel using the same
methods. If the set of blocks read crosses extent boundaries, then some synchronization is required at
the extent boundaries. The new extent can be prepared in advance to begin reading out file blocks, but
the mismatch in the striping will cause a decrease in I/O throughput. Let d_last be the LFS that holds
the last block of the first extent, and d_offset the LFS that holds the first block of the second extent.
In the left hand example in figure 8, reading the last d_last blocks in the first extent can be overlapped
with reading the first M - d_offset + 1 blocks of the second extent because d_offset > d_last. In this case,
d_offset - d_last - 1 LFSs are idled for one step. In the example on the right hand side, reading the first
row of the second extent can't be overlapped with reading the last row of the first extent because
d_offset < d_last. In this case, a total of M - (d_last - d_offset + 1) LFSs are idled when these rows are read.
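This bookkeeping amounts to the following small calculation (1-based LFS numbering, as in figure 8; purely illustrative):

```python
def idle_lfss_at_boundary(d_last: int, d_offset: int, M: int) -> int:
    """Number of LFSs left idle for one I/O step when a sequential read
    crosses from an extent whose last block is on LFS d_last into an extent
    whose first block is on LFS d_offset (LFSs numbered 1..M)."""
    if d_offset > d_last:
        # The last row of extent 1 and the first row of extent 2 use disjoint
        # LFSs, so they can be read in the same step.
        return d_offset - d_last - 1
    # The rows overlap and must be read in separate steps.
    return M - (d_last - d_offset + 1)

# Example with M = 6: a perfectly aligned boundary idles no LFS, while a
# boundary where d_offset = d_last idles 5 of the 6 LFSs:
# idle_lfss_at_boundary(3, 4, 6) -> 0;  idle_lfss_at_boundary(3, 3, 6) -> 5
```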
Appending (prepending) to a striped extent file is handled in the same manner as appending (prepend-
ing) to a striped file: a simple calculation directs the blocks to processors, which add the blocks to the set
of locally managed blocks. In the case of the striped extent file, it is the last (first) extent that receives
the new blocks. Similarly, blocks can be appended (prepended) to any of the striped extents in the file.
If blocks are to be inserted into or deleted from the middle of an extent, then there are several

















Figure 7: Striped Extent Storage (the extent stub records the extent name, extent length, and starting
processor s; each LFS keeps an index, keyed by extent name, to its locally stored blocks)


possibilities. If the number of inserted or deleted blocks is a multiple of M, then no global restructuring
is required; the data blocks are directed to the processors, which insert the blocks into the indices of
their locally managed files.
In general, inserting the blocks into the file destroys the striping pattern and requires some type of
restructuring. There are two restructuring possibilities: reorganizing the extent, or creating a new extent.
The option chosen depends on the relative costs of each action. Suppose that k blocks are inserted into an
L-block extent that is striped over M processors, starting at the i-th block of the extent. Reorganizing
the striped extent requires that one of the two parts of the extent be moved, so that min(i, L - i) blocks
must be moved. If reorganizing the extent is too expensive, then the extent can be broken at the insertion
(deletion) point. Creating a new extent requires that a processor be picked, and all processors must be
informed of the new extent. When a processor is informed of the new extent, it creates a new extent entry
and transfers the corresponding portion of the old extent's index into the new extent entry. Finally, the
new extent must be registered with the dB-tree index. So, creating a new extent is relatively inexpensive.
On the other hand, the new extent is a disruption in the striping pattern, and so is going to slow down
a parallel read of the entire file. We provide criteria for deciding whether to reorganize an extent or to
create a new extent in the next section.
An extent might receive conflicting insert, delete, and read requests. While concurrency control
protocols exist for the dB-tree index, concurrent access to the file is best controlled by the file itself.
The processor that maintains the extent stub arbitrates among conflicting read and write requests to the
extent, permitting concurrent reads but requiring exclusive writes.
It is also possible, and sometimes desirable, to merge extents. For example, all blocks in the extent
might be deleted, the boundaries of the neighboring extents might match, or an extent might become
smaller than is deemed desirable. In these cases, the extent can be merged with a neighbor, using an
algorithm similar to that for a delete in the dB-tree.


4.2 An Optimization

Suppose that the extents in a file are not required to have the same striping pattern, and instead
are striped according to some arbitrary permutation of the LFSs. We can permit arbitrary striping
permutations by recording the permutation in the extent stub. It is clear that if the length of the extent
is no greater than M, then an arbitrary number of records (up to M) can be deleted by rearranging the












Figure 8: Parallelism lost due to striping mismatch (left: reading the end of extent 1 can be overlapped
with reading the beginning of extent 2, and only 1 LFS is idled; right: the rows cannot be overlapped,
and five LFSs are idled)


permutation. Similarly, if s blocks are inserted into an extent with L blocks, and L + s < M, then only
the s blocks that are inserted need to be transferred, the space for them being created by rearranging
the permutation. The question is whether arbitrary permutations help us insert or delete blocks when
an extent contains more than M blocks.
Suppose s blocks are to be inserted starting at position i of an extent that contains L = M + k
blocks, where k + s < M. If L/2 < i < M - k - s, then the s blocks can be inserted into the extent using
only k + 2s block transfers (including the block transfers needed to insert the new data). Figure 9 shows
the necessary block transfers and rearrangements. The blocks following the insertion point are kept in
the same LFSs, as are the i - k - s blocks that precede the insertion point. The first k + s blocks in the
extent are transferred so that their pattern matches that of the last k + s blocks in the extent. The s
blocks that are inserted are placed in the unused LFSs, which are the LFSs that hold the k-th through
the (k + s - 1)-st blocks in the original extent. If the insertion point, i, is within k + s of either end, then
it requires fewer transfers to rearrange the extent as is shown in figure 1. If k + s < i < L/2, then a
symmetric algorithm can be used. If k + 2s < min(i, L - i), then changing the striping permutation can
substantially reduce the cost of inserting a block.
Figure 9 also shows an algorithm for deleting s blocks using k transfers. This algorithm is similar to
the insertion algorithm.
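The resulting trade-off can be summarized in a short sketch (our notation, using the transfer counts above): reorganizing costs roughly min(i, L - i) transfers, while re-striping via a new permutation costs about k + 2s when it applies.

```python
def insertion_transfer_costs(i: int, s: int, L: int, M: int) -> dict:
    """Approximate block-transfer counts for inserting s blocks at position i
    of an L-block extent striped over M LFSs (k = L - M)."""
    k = L - M
    costs = {"reorganize": min(i, L - i)}
    if k >= 0 and k + s < M:
        costs["repermute"] = k + 2 * s   # only described for L = M + k, k + s < M
    return costs

# Example: L = 20, M = 16 (k = 4), inserting s = 2 blocks at i = 9 gives
# {'reorganize': 9, 'repermute': 8}, so changing the permutation is cheaper.
```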

4.3 Decision Making
When an extent is updated by an insert or a delete, we have the choice of reorganizing the extent,
splitting the extent, or possibly merging the extent with a neighboring extent. The best action to take
is the one that requires the least overhead and allows the most parallelism in future operations. We will

















Figure 9: Algorithms for fast insertion and deletion in an extent (inserting s blocks requires k + 2s
transfers; deleting s blocks requires k transfers)


measure the overhead by the cost of an action, as we discuss later.
The appropriate file organization (long extents or short extents) depends on the type of access to the file.
Crockett [1] describes file access methods as being either sequential or direct. Parallel sequential access
can be either partitioned (each processor reads a different section of the file) or interleaved (all processors
read the same parts of the file). Direct access means that the file is read in an arbitrary order. Direct
access can be either global (all processors can read all of the file) or partitioned (each processor accesses
its own part of the file exclusively). Kotz and Ellis [9] provide a similar taxonomy of parallel file access.
From the perspective of a striped indexed file, the important characteristics of the file access are
whether the access is highly sequential or highly random, and how many update operations are performed.
If the file access is highly sequential and updates are rarely performed, then the extents should be
long (if blocks are never inserted into or deleted from the middle of the file, it can consist of a single
extent). If updates are common, then the extents should be short. If the file access is highly random,
then it is desirable that the extents be of a moderate length in order to hash the random access indexing
responsibilities to the LFSs.
The cost of performing an action (restructure, split, merge) after an update has two parts: the cost
of actually performing the action, and the reduction in future performance. In reorganizing the extent,
the cost is shifting the data blocks on one of the ends. If the extent is L blocks long and the insertion or
deletion point is after the i-th block, then we denote the reorganization cost by R(i, L, M). The function
R is architecture-dependent, but has a lower bound of ⌈min(i, L - i)/M⌉ if L > 2M. Dibble, Scott, and
Ellis report that the time it takes to copy an N block file with M LFSs is O(N/M + log(M)). The future
cost of reorganizing the extent is the additional future cost of random searches in the file. The more
extents that a file contains, the greater parallelism is possible in performing random searches or updates.
We denote the future cost of reorganizing the extent by I(i, L, E, P), where E is the number of extents in
the file (which can be estimated while searching the index) and P is the expected number of processors
concurrently performing a random access. An extent stub can perform indexing for only one operation













at a time, and updating operations can block other operations. Thus, the future cost of reorganizing the
extent is the additional blocking of random access operations that will occur if the extent is not split.
The amount of blocking can be estimated via ball-and-urn models [5].
The immediate cost of splitting the extent is the cost of creating a new extent and inserting it into
the dB-tree index. We denote this cost by S(i, L, M). This cost includes the overhead required to find
a processor to host the new extent stub, communicate to the LFSs that store the old extent, and insert
a new pointer into the dB-tree index. The future cost is primarily the performance lost when a parallel
read crosses extent boundaries. We denote this cost by B(i, k, L). As discussed previously, between 1
and M LFSs will be idle during a round of I/O due to the mismatch in the extent boundaries (see figure
8). There are two ways to calculate B(i, k, L): counting the current size of the gap between the extents, or
estimating the gap size to be some average value (such as (M - 1)/2). The performance lost is relative
to the potential parallelism. If the gap size is estimated to be G, then B(i, k, L) = G/(L + k).
A comparison of the immediate costs is simple, while a comparison of the future costs requires
knowledge of how the file is used. If the file access is mostly sequential, then the future cost of splitting
an extent should be heavily weighted. If the file access is mostly random, then a greater weight should
be placed on the future cost of reorganizing the extent. We denote the weight attached to sequential
accesses by W_s, and the weight attached to random accesses by W_r. The decision to split or restructure
an extent is based on the choice that has the lowest present and future cost:

create a new extent if S(i, L, M) + W_s B(i, k, L) < R(i, L, M) + W_r I(i, L, E, P)
reorganize the extent otherwise

After an extent is reorganized due to an update, the system can consider whether to merge an extent
with its neighbor: for example, it might do so if the extent is very small, or if the striping patterns
agree. Again, the criterion for merging is whether the future cost of not merging is greater than the
current and future cost of merging. Let E1 and E2 be the extents being considered, and let M(E1, E2)
be the cost to merge the two extents. The cost of merging extents includes the cost of any copying that
needs to be performed, and the cost of deleting an entry in the dB-tree index. Thus, the extents should
be merged if W_s B(i, k, L) > M(E1, E2) + W_r I(i, L, E, P).
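A sketch of this decision rule (the cost and weight values are inputs the file system would estimate; names are ours):

```python
def choose_update_action(S, R, B, I, merge_cost, W_s, W_r):
    """Choose the action after an insert or delete.  S and R are the immediate
    costs of splitting and reorganizing, B and I the future costs of an extra
    extent boundary and of a longer extent, weighted by the sequential (W_s)
    and random (W_r) access mix; merge_cost stands for M(E1, E2)."""
    if S + W_s * B < R + W_r * I:
        return "create new extent"
    # Otherwise reorganize; afterwards a merge pays off if the future cost of
    # keeping the boundary exceeds the cost of merging.
    if W_s * B > merge_cost + W_r * I:
        return "reorganize and merge with neighbor"
    return "reorganize"
```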

4.4 Extensions

We can define additional operations on an indexed striped file. For example, a scan operation consists of
an initial random access followed by a number of get-next-record operations. The initial lookup returns
the position of the desired record in the extent and places a read-lock on the records in the extent.
Subsequent get-next-record operations can be issued directly, without going through the index of the
extent stub. When an extent boundary is crossed, the lock on the previous extent is released and the
new extent is read-locked. Multiple scan operations can be executed simultaneously.
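A sketch of the scan protocol (the lock and extent-traversal primitives are assumed interfaces, not specified by the paper):

```python
def scan(index, start_key, consume):
    """Scan records in key order from start_key.  The initial lookup goes
    through the index and read-locks the first extent; get-next-record then
    proceeds directly, re-locking only when an extent boundary is crossed."""
    extent, pos = index.lookup(start_key)
    extent.read_lock()
    while extent is not None:
        record = extent.record_at(pos)
        if record is None:               # reached the end of this extent
            nxt = extent.next_extent()
            extent.read_unlock()
            if nxt is not None:
                nxt.read_lock()
            extent, pos = nxt, 0
            continue
        consume(record)
        pos += 1
```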
If the parallel file has enough structure, then the indexed striped file structure can be specialized to yield
a more efficient file structure [1]. For example, the file might consist of a series of buckets, where each
bucket must occur in sorted order but the data items in the buckets need not be sorted. In this case,
each bucket is stored in a different extent. Since there is no order on the data items in the buckets,
inserted records can be placed at the end of the extent so that no reorganization is required. When data
items are deleted, records are removed from the end of the extent to fill the hole. If a bucket becomes
large, it can be split into multiple extents to facilitate parallel indexing.


5 Conclusions

Highly parallel supercomputers need highly parallel file systems in order to support data-intensive
computations. A common solution is to stripe files across multiple, physically separated disks. While
striped files allow highly parallel block reads, supporting record insertion and record deletion together
with random access is difficult.
We present a file structure for indexed striped files that permits highly parallel sequential reads, allows
blocks to be inserted into and deleted from the middle of the file, and also permits fast random access.
The approach is to break files into extents in which the striping pattern is maintained. An insertion or
deletion can lead to either the reorganization of the extent, or the creation of a new extent, depending on
the present and future cost of either option. The distributed index on the extents allows each extent to
be managed independently, allowing highly parallel indexing, random access reads, and random access
updates. We discuss how highly structured files can further benefit from the techniques applied to the
striped indexed file.


References

[1] T.W. Crockett. File concepts for parallel I/O. In Proceedings Supercomputing 1988, pages 574-589,
1988.

[2] P.C. Dibble and M.L. Scott. Beyond striping: The Bridge multiprocessor file system. Computer
Architecture News, 15(5):32-39, 1989.

[3] P.C. Dibble, M.L. Scott, and C.S. Ellis. Bridge: A high-performance file system for parallel proces-
sors. In IEEE Int'l Conf. on Distributed Computing Systems, pages 154-161, 1988.

[4] C.S. Ellis and D. Kotz. Prefetching in file systems for MIMD multiprocessors. In Int'l Conf. on
Parallel Processing, pages I:306-314, 1989.

[5] W. Feller. An Introduction to Probability Theory and Its Applications, volume 1. John Wiley, 1950.

[6] T. Johnson and A. Colbrook. A distributed data-balanced dictionary based on the b-link tree. Tech-
nical Report TR-530, MIT Laboratory for Computer Science, 1992. Also available via anonymous
ftp at cis.ufl.edu:cis/techreports.

[7] T. Johnson and A. Colbrook. A distributed data-balanced dictionary based on the b-link tree. In
Proc. Int'l Parallel Processing Symposium, pages 319-325, 1992.

[8] M.Y. Kim. Synchronized disk interleaving. IEEE Transactions on Computers, C-35(11):978-988, 1986.

[9] D. Kotz and C.S. Ellis. Prefetching in file systems for MIMD multiprocessors. IEEE Trans. on
Parallel and Distributed Systems, 1(2):218-230, 1990.

[10] D. Kotz and C.S. Ellis. Practical prefetching techniques for parallel file systems. In Int'l Conf. on
Parallel and Distributed Information Systems, pages 182-189, 1991.

[11] V. Lanin and D. Shasha. A Symmetric Concurrent B-Tree Algorithm. In 1986 Proceedings Fall
Joint Computer Conference, pages 380-386, November 1986.

[12] P.L. Lehman and S.B. Yao. Efficient locking for concurrent operations on B-trees. ACM Transactions
on Database Systems, 6(4):650-670, 1981.

[13] A.L.N. Reddy and P. Banerjee. An evaluation of multiple-disk I/O systems. IEEE Transactions on
Computers, 38(12):1680-1690, 1989.

[14] A.L.N. Reddy and P. Banerjee. Performance evaluation of multiple-disk I/O systems. In Int'l
Conference on Parallel Processing, pages I:315-318, 1989.













[15] A.L.N. Reddy and P. Banerjee. A study of parallel disk organizations. Computer Architecture News,
17(5), 1989.

[16] Y. Sagiv. Concurrent Operations on B-Trees with Overtaking. Journal of Computer and System
Sciences, 33(2):275-296, October 1986.

[17] K. Salem and H. Garcia-Molina. Disk striping. In Int'l Conf. on Data Engineering, pages 336-342,
1986.

[18] H. Samet. The quadtree and related hierarchical data structures. Computing Surveys, 16(2):187-260,
1984.



