Maintaining Very Large Samples Using the Geometric File

Material Information

Maintaining Very Large Samples Using the Geometric File
Pol, Abhijit A
Place of Publication:
[Gainesville, Fla.]
University of Florida
Publication Date:
Physical Description:
1 online resource (122 p.)

Thesis/Dissertation Information

Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering
Computer and Information Science and Engineering
Committee Chair:
Jermaine, Christophe
Committee Members:
Kahveci, Tamer
Dobra, Alin
Hammer, Joachim
Ahuja, Ravindra K.
Graduation Date:


Subjects / Keywords:
Buffer storage ( jstor )
Databases ( jstor )
Datasets ( jstor )
Estimate reliability ( jstor )
Index numbers ( jstor )
International conferences ( jstor )
Random sampling ( jstor )
Recordings ( jstor )
Sampling bias ( jstor )
Statistical discrepancies ( jstor )
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
biased, databases, file, indexing, sampling
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Computer Engineering thesis, Ph.D.


Sampling is one of the most fundamental data management tools available. It is one of the most powerful methods for building a one-pass synopsis of a data set, especially in a streaming environment where the assumption is that there is too much data to store all of it permanently. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a 'sample' is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples in an online manner from streaming data. We present a new data organization called the geometric file and online algorithms for maintaining very large, on-disk samples. The algorithms are designed for any environment where a large sample must be maintained online in a single pass through a data set. The geometric file organization meets the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. We modify the classic reservoir sampling algorithm to compute a fixed-size sample in a single pass over a data set, where the goal is to bias the sample using an arbitrary, user-defined weighting function. We also describe how the geometric file can be used to perform a biased reservoir sampling. While a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. We therefore develop efficient techniques which allow a geometric file to itself be sampled in order to produce smaller data objects. Efficiently searching and discovering information from the geometric file is essential for query processing. A natural way to support this is to build an index structure.
We discuss three secondary index structures and their maintenance as new records are inserted into a geometric file. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis (Ph.D.)--University of Florida, 2007.
Adviser: Jermaine, Christophe.
Statement of Responsibility:
by Abhijit A Pol.

Record Information

Source Institution:
Rights Management:
Copyright Pol, Abhijit A. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
LD1780 2007 ( lcc )


This item has the following downloads:

Full Text

The above two methods implement simple random sampling (SRS) without replacement with successive draws. An alternative method for SRS with fixed size is to select units with replacement, and then to reject the sample if there are duplicates. We discuss one such method here, called Sampford's Method.

Sampford's Method: In this method we first draw a unit with probability αi, and in the remaining N − 1 draws, which are carried out with replacement, we use the selection probabilities βi = Kαi/(1 − Nαi), where K is the normalizing constant. If there are any duplicates in the sample, we start again from the beginning and repeat the procedure until the desired sample with no duplicates is obtained. The main drawback of this sampling design is that as N becomes large it becomes likely that duplicates will occur in each sampling round.
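The rejection loop above can be sketched in a few lines of Python (a minimal illustration, not the dissertation's code; `sampford_sample` and `draw` are hypothetical helper names, and the weights are assumed small enough that N × αi < 1 for every unit):

```python
import random

def sampford_sample(weights, N, rng=None):
    """Sampford's rejection scheme (sketch): draw one unit with probability
    a_i, then N-1 more with replacement using b_i proportional to
    a_i / (1 - N*a_i); restart whenever the N draws contain a duplicate."""
    rng = rng or random.Random(0)
    total = sum(weights)
    a = [w / total for w in weights]          # normalized so sum(a) == 1
    b = [ai / (1 - N * ai) for ai in a]       # requires N * a_i < 1
    bsum = sum(b)
    b = [bi / bsum for bi in b]               # K is the normalizing constant
    while True:
        sample = [draw(a, rng)]               # first draw with probabilities a_i
        for _ in range(N - 1):                # remaining draws with replacement
            sample.append(draw(b, rng))
        if len(set(sample)) == N:             # reject on duplicates, else done
            return sample

def draw(probs, rng):
    """Inverse-CDF draw of an index from a probability vector."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u <= acc:
            return i
    return len(probs) - 1
```

As the text notes, the expected number of restarts grows with N, since every extra draw is another chance of a duplicate.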

While a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. We therefore develop efficient techniques which allow a geometric file to itself be sampled in order to produce smaller data objects.

Efficiently searching and discovering information from the geometric file is essential for query processing. A natural way to support this is to build an index structure. We discuss three secondary index structures and their maintenance as new records are inserted into a geometric file.

existing subsample with the records from a new buffer flush, a simple, efficient, sequential overwrite of the existing subsample's largest segment generally suffices.

3.5 Characterizing Subsample Decay

To describe the geometric file in detail, we begin with an analogy between the samples in a subsample S that are lost over time, and radioactive decay. Imagine that we have 100 grams of Uranium at an initial point of time (U0 = 100), and a decay rate (1 − α) = 0.1 with a retention rate of α. On day one, the mass of Uranium decays to U0 × α = 90 grams, because the Uranium loses U0 × (1 − α) = 10 grams of its mass. We define n = U0 × (1 − α) to be the mass of Uranium lost on the very first day, giving n = 10 for our example.

On day two (with U1 = 90), the Uranium further decays to U1 × α = 81 grams, this time losing U1 × (1 − α) = U0 × α × (1 − α) = n × α = 9 grams of its mass. On day three, it further decays by n × α² = 8.1 grams, and so on. The decay process is allowed to continue until we have less than 3 grams of Uranium remaining.
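The day-by-day decay described above can be reproduced with a short simulation (values taken from the example; variable names are ours):

```python
# Decay of a subsample, mirroring the Uranium example:
# U0 = 100 grams, retention rate alpha = 0.9, stop below 3 grams.
U0, alpha = 100.0, 0.9
n = U0 * (1 - alpha)                    # mass lost on the very first day: 10
mass, day, losses = U0, 0, []
while mass >= 3.0:
    losses.append(mass * (1 - alpha))   # day (day+1) loses n * alpha**day grams
    mass *= alpha                       # retained mass after the day
    day += 1
```

Running this, the per-day losses follow the geometric series n, nα, nα², ..., which is exactly what the three observations below formalize.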

Continuing with the Uranium analogy, three questions that are relevant to our problem of maintaining very large samples from a data stream are:

What is the amount of Uranium lost on any given ith day?
How can the initial mass of Uranium, 100 grams, be expressed in terms of n and α?
How many days will it take before we are left with 3 grams or less of Uranium?

These questions can be answered using the following three simple observations related to geometric series:

Observation 1: Given a retention rate α < 1 and n taken to be the first term of a geometric series, the ith term is given by n × α^(i−1) for any n ∈ ℝ.

Observation 2: Given a retention rate α < 1, it holds that Σ_{i=1}^{∞} n × α^(i−1) = n/(1 − α) for any n ∈ ℝ.

Observation 3: Given a retention rate α < 1, define f(j) as n/(1 − α) × α^j. From Observation 2, it follows that the largest j such that f(j) ≥ β is j = ⌊(log β − log n + log(1 − α)) / log α⌋. We denote this floor by
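Observation 3 can be checked numerically; the sketch below (using the example's values n = 10, α = 0.9, β = 3) confirms that the floor expression picks out the largest j with f(j) ≥ β:

```python
import math

# Numeric check of Observation 3: f(j) = (n / (1 - alpha)) * alpha**j is the
# remaining mass after j decay steps; the largest j with f(j) >= beta is
# floor((log(beta) - log(n) + log(1 - alpha)) / log(alpha)).
def largest_j(n, alpha, beta):
    return math.floor((math.log(beta) - math.log(n) + math.log(1 - alpha))
                      / math.log(alpha))

n, alpha, beta = 10.0, 0.9, 3.0
f = lambda j: (n / (1 - alpha)) * alpha ** j
j = largest_j(n, alpha, beta)
```

With these values j = 33: f(33) ≈ 3.09 is still at least β, while f(34) ≈ 2.78 falls below it, matching the 34-day answer to the third question above.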


of records was selected to be inserted into the reservoir (as many as each of the five options could handle). The goal was to test how many new records could be added to the reservoir in 20 hours, while at the same time expelling existing records from the reservoir as is required by the reservoir algorithm. The number of new samples processed by each of the five options (that is, the number of records added to disk) is plotted as a function of time in Figure 7-1 (a). By "number of samples processed" we mean the number of records that are actually inserted into the reservoir, and not the number of records that have passed through the data stream.

Insertion experiment 2: This experiment is identical to Experiment 1, except that the 50GB sample was composed of 50 million 1KB records. Results are plotted in Figure 7-1 (b). Thus, we test the effect of record size on the five options.

Insertion experiment 3: This experiment is identical to Experiment 1, except that the amount of buffer memory is reduced to 150MB for each of the five options. The virtual memory option used all 150MB for an LRU buffer, and the four other options allocated 100MB to the LRU buffer and 50MB to the buffer for new samples. Results are plotted in Figure 7-1 (c). This experiment tests the effect of a constrained amount of main memory.

7.1.2 Discussion of Experimental Results

All three experiments suggest that the multiple geometric files option is superior to the other options. In Experiments 1 and 2, the multiple geometric files option was able to write new samples to disk almost at the maximum sustained speed of the hard disk, at around 40 MB/sec.

It is worthwhile to point out a few specific findings. Each of the five options writes the first 50GB of data from the stream more or less directly to disk, as the reservoir is large enough to hold all of the data as long as the total is less than 50GB. However, Figure 7-1 (a) and (b) show that only the multiple geometric files option does not have much of a decline in performance after the reservoir fills (at least in Experiments 1 and 2). This is why the scan and virtual memory options plateau after the amount of data inserted reaches 50GB. There is something of a decline in performance in all of the methods once the reservoir fills in Experiment 3 (with restricted buffer memory), but it is far less severe for the multiple geometric files option than for the other options.

Table 7-2. Query timing results for 1KB records, |R| = 10 million, and |B| = 50K

Scheme           Selectivity   Index Time   File Time   Total Time
Segment-Based    Point Query      38.2890      0.0226      38.3116
                 10 recs          40.2477      0.1803      40.2480
                 100 recs         43.2856      0.8766      44.1622
                 1000 recs        45.6276      6.2571      51.8847
Subsample-Based  Point Query       0.87551     0.02382      0.89937
                 10 recs           1.12740     0.15867      1.28607
                 100 recs          1.74911     1.10544      2.85455
                 1000 recs         2.09980     5.96637      8.06617
LSM-Tree-Based   Point Query       0.00012     0.01996      0.02008
                 10 recs           0.00015     0.01263      0.01278
                 100 recs          0.00019     0.79358      0.79377
                 1000 recs         0.00056     5.82210      5.82266

Once the reservoir is initialized, both the segment-based and the subsample-based index structures perform an equal number of disk seeks. Finally, the LSM-tree-based index structure is the slowest amongst the three. The LSM-tree maintains the index by processing insertions and deletions more aggressively than the other two options, demanding more rolling merges and more disk seeks per buffer flush.

Table 7-4 also shows the insertion figures for the smaller, 200B record size. Not surprisingly, all three index structures show similar insertion patterns, but since they have to process a larger number of records the insertion rates are slower than in the case of the 1KB record size. We also observed and plotted the disk footprint size for the three index structures (Figure 7-7 and Figure 7-8). As expected, all three index structures initially grow fairly quickly. The segment-based and the subsample-based index structures stabilize soon after the reservoir is filled, whereas the LSM-Tree-based structure stabilizes a little later, when the removal of stale records from the rolling merges stabilizes.

The subsample-based index structure has the largest footprint (almost 1/5th of the geometric file size). This is expected, as stale index records are removed from the B+-trees only when the

net worth of American households. In the general case, many millions of samples may be needed to estimate the net worth of the average household accurately (due to a small ratio between the average household's net worth and the standard deviation of this statistic across all American households). However, if the same set of records held information about the size of each household, only a few hundred records would be needed to obtain similar accuracy for an estimate of the average size of an American household, since the ratio of average household size to the standard deviation of household size across households in the United States is greater than 2.

Thus, to estimate the answer to these two queries, vastly different sample sizes are needed.

Since there is no single sample size that is optimal for answering all queries and the required sample size can vary dramatically from query to query, this part of the dissertation considers the problem of generating a sample of size N from a data stream using an existing geometric file that contains a large sample of records from the stream, where N ≤ |R|. We will consider two specific problems. First, we consider the case where N is known beforehand. We will refer to a sample retrieved in this manner as a batch sample. We will also consider the case where N is not known beforehand, and we want to implement an iterative function GetNext. Each call to GetNext results in an additional sampled record being returned to the caller, and so N consecutive calls to GetNext result in a sample of size N. We will refer to a sample retrieved in this manner as an online or sequential sample.
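The GetNext interface can be illustrated with a toy Python generator (a deliberately naive stand-in: it samples uniformly from an in-memory list rather than from an actual geometric file):

```python
import random

def get_next(records, rng=None):
    """Generator version of GetNext (toy sketch): each next() call yields one
    more record drawn uniformly without replacement, so N consecutive calls
    yield a random sample of size N."""
    rng = rng or random.Random(0)
    pool = list(records)
    while pool:
        # Swap a uniformly chosen remaining record to the end and yield it.
        i = rng.randrange(len(pool))
        pool[i], pool[-1] = pool[-1], pool[i]
        yield pool.pop()

sampler = get_next(range(1000))
batch = [next(sampler) for _ in range(5)]   # an online sample of size N = 5
```

The point of the interface is that the caller never commits to N up front; stopping after any number of calls leaves a valid random sample of that size.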

1.4 Index Structures For The Geometric File

A geometric file could easily contain a sample several gigabytes or even terabytes in size. A huge sample like this may often contain too much information, and it becomes expensive to scan all the records of a sample to find those (most likely very few) records that match a given condition. A natural way to speed up the search and discovery of those records from a geometric file that have a particular value for a particular attribute is to build an index structure. In this part of the dissertation we discuss and compare three different index structures for the geometric file.

In general, an index is a data structure that lets us find a record without having to look at more than a small fraction of all possible records. An index is referred to as a primary index if it
Insertion into the C0 component has no I/O cost associated with it. However, its size is limited by the size of the available memory. Thus, we must efficiently migrate part of the C0 component to the disk-resident C1 component.

Whenever the C0 component reaches a threshold size, an ongoing rolling merge process removes some records (a contiguous segment) from the C0 component and merges them into the C1 component on disk. The rolling merge process is depicted pictorially in Figure 2.2 of the original LSM-Tree paper [44]. The rolling merge is repeated for migration between higher components of an LSM-Tree in a similar manner. Thus, there is a certain amount of delay before records in the C0 component migrate out to the disk-resident C1 and higher components. Deletions are performed concurrently in batch fashion, similar to inserts.
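A toy sketch of the rolling-merge idea, with both components modeled as sorted in-memory lists (a drastic simplification of the disk-resident structure described here; function and parameter names are ours):

```python
import bisect

def rolling_merge(c0, c1, threshold, batch):
    """Toy rolling merge: while the in-memory C0 holds more than `threshold`
    keys, migrate a contiguous run of `batch` smallest keys into the sorted,
    "disk-resident" C1 (both modeled as sorted key lists)."""
    while len(c0) > threshold:
        migrating = c0[:batch]      # contiguous segment of smallest keys
        del c0[:batch]
        for key in migrating:
            bisect.insort(c1, key)  # merge into C1, keeping it sorted
    return c0, c1

c0, c1 = [1, 3, 5, 7, 9], [2, 4]
rolling_merge(c0, c1, threshold=2, batch=2)
```

A real LSM-tree moves whole multi-page runs and keeps the batch write sequential, which is exactly why it avoids the per-record random I/O of an ordinary B+-tree insert.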

The disk-resident components of an LSM-tree are comparable to a B+-tree structure, but are optimized for sequential disk access, with nodes 100% full. Lower levels of the tree are packed together in contiguous, multi-page disk blocks for better I/O performance during the rolling merge.


6.5.2 Index Maintenance and Look-Ups

As in the case of the previously proposed index structures, every time the buffer is filled and partitioned into segments, we create an index record for each buffered record and bulk insert them all into an LSM-tree index. The index record is comprised of five fields: (1) the key value, (2) the disk page number of the record, (3) an offset within the page, (4) the segment number to which the record belongs, and (5) the subsample number to which the record belongs. The segment and subsample numbers are recorded with each index record to determine its staleness. Every time a record is migrated from a lower component to a higher disk-based component, the rolling merge additionally identifies stale records and removes them from the tree structure. We refer to an index record as a stale record if it is indexing a record either from a subsample that has decayed completely, or a segment of a subsample that has been overwritten.
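The five-field index record and the staleness test just described can be sketched as follows (field and function names are illustrative, not from the dissertation):

```python
from dataclasses import dataclass

# Sketch of the five-field index record described above, plus the staleness
# test: a record is stale if its subsample has fully decayed, or if its
# particular segment has since been overwritten.
@dataclass
class IndexRecord:
    key: int
    page: int        # disk page number of the data record
    offset: int      # offset within the page
    segment: int     # segment number within the subsample
    subsample: int   # subsample number within the geometric file

def is_stale(rec, live_subsamples, overwritten):
    """live_subsamples: subsample numbers still present in the file;
    overwritten: set of (subsample, segment) pairs already overwritten."""
    if rec.subsample not in live_subsamples:
        return True                   # whole subsample has decayed away
    return (rec.subsample, rec.segment) in overwritten
```

A rolling merge would call something like `is_stale` on each migrating index record and simply drop the stale ones instead of rewriting them.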

We use the existing LSM-Tree-based point query and range query algorithms to perform index look-ups. As in the case of the previously proposed index structures, we sort the valid index

This file organization has several significant benefits for use in maintaining a very large

sample from a data stream:

Performing a buffer flush requires absolutely no reads from disk.

Each buffer flush requires only T random disk head movements; all other disk I/Os are
sequential writes. To add the new samples from the buffer into the geometric file to create
a new subsample S, we need only seek to the position that will be occupied by each of S's
on-disk segments.

Even if segments are not block-aligned, only the first and last block in each over-written
segment must be read and then re-written (to preserve the records from adjacent segments).

Algorithm 3 Reservoir Sampling with a Geometric File
1: Set numSubsamples = 0
2: for int i = 1 to ∞ do
3:   Wait for a new record r to appear in the stream
4:   if i ≤ |R| then
5:     Add r to B
6:     if Count(B) == |B| × α^numSubsamples then
7:       Randomize the ordering of the records in B
8:       Set n = Count(B) × (1 − α)
9:       Partition B into segments of size n, nα, nα², and so on
10:      Flush the first T segments to the disk
11:      Store the group of remaining segments in main memory
12:      numSubsamples++
13:      B = ∅
14:  else
15:    with probability |R|/i do
16:      with probability Count(B)/|R| do
17:        Replace a random record in B with r
18:      else do
19:        Add r to B
20:      if Count(B) == |B| then
21:        Partition the buffer into segments of size n, nα, nα², and so on (see Section 3.7.1)
22:        for each segment sgj from B do
23:          Overwrite the largest segment of the jth largest subsample of R with sgj
24:        B = ∅
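The partitioning step used in Steps 9 and 21 can be sketched in Python; segment sizes here are rounded to whole records, so a real implementation would need a more careful rounding rule, and this sketch ignores the randomness requirements discussed in Section 3.7.1 (function name is ours):

```python
# Split a flushed buffer of |B| records into segments of (approximate) sizes
# n, n*alpha, n*alpha**2, ..., where n = |B| * (1 - alpha).
def partition_buffer(buffer, alpha):
    n = len(buffer) * (1 - alpha)      # size of the largest segment
    segments, start, size = [], 0, n
    while start < len(buffer):
        take = max(1, round(size))     # at least one record per segment
        segments.append(buffer[start:start + take])
        start += take
        size *= alpha                  # next segment is a factor alpha smaller
    return segments

segs = partition_buffer(list(range(100)), 0.9)
```

Every record lands in exactly one segment, and the first segment holds the n = |B|(1 − α) records that will be overwritten first as the subsample decays.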

3.7.1 Introducing the Required Randomness

One issue that needs to be addressed is the partitioning of the buffer into segments in Algorithm 3, Step (21). In order to maintain the algorithm's correctness, when the buffer is flushed

Figure 7-5. Sum query estimation accuracy for zipf=0.8. [Plot of observed variance versus correlation factor (0 to 1) for three scenarios: biased sampling w/o skewed records, unbiased reservoir sampling, and biased sampling worst case.]

By testing each of the three different scenarios described in the previous subsection over a set of data sets created by varying zipf as well as the correlation factor, we can see the effect of data skew and of bias function quality on the relative quality of the estimator produced by each of the three scenarios.

For each experiment, we generate a data stream of one million records and obtain a sample of size 1000. For each of the three scenarios and each of the data sets that we test, we repeat the sampling process 1000 times over the same data stream in Monte-Carlo fashion. The variance of the corresponding estimator is reported as the observed variance of the 1000 estimates. The observed Monte-Carlo variances are depicted in Figures 7-3, 7-4, 7-5, and 7-6.

7.2.2 Discussion

It is possible to draw a couple of conclusions based on the experimental results. Most significant is that biased sampling under the pathological record ordering shows qualitative performance similar to the biased sampling without any overweight records. Even though in the pathological case the sample might not be biased exactly as specified by the user-defined function f, the number of records not sampled according to f is usually small, and the resulting estimator typically suffers from an increase in variance of around a factor of ten or less. This demonstrates

Algorithm 10 Construction and Maintenance of a Segment-Based Index Structure
1: Set n = |B| × (1 − α)
2: Set totSegsInSubsam = ⌊(log β − log n + log(1 − α)) / log α⌋
3: Set totSubsamInR = 0
4: Set totSegsInR = 0
5: Set numRecs = 0
6: while numRecs < |R| do
7:   numRecs += |B| × α^totSubsamInR
8:   totSegsInR += totSegsInSubsam
9:   totSubsamInR++
10: Set BTree = array of size totSegsInR
11: for int i = 1 to ∞ do
12:  if Buffer B is partitioned then
13:    for each segment sgj in B do
14:      Build a B+-Tree BTj
15:      if i ≤ |R| then
16:        Flush BTj to the disk at the next available spot in the Index File
17:      else
18:        Overwrite the B+-Tree for the largest segment of the jth largest subsample of R with BTj
19:      Record BTj's root and its disk position in the BTree array

6.3.3 Index Look-Up and Search

A segment-based index structure is a collection of B+-Trees, one for each segment of the geometric file. Any index-based search involves looking up all B+-Tree indexes. We use the existing B+-Tree-based point query and range query algorithms and re-run them for each entry in the B+-Tree array. The algorithm returns all index records that satisfy the search criteria. We sort the valid index records by their page number attribute. We then retrieve the actual records from the geometric file and return them as a query result.
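The look-up procedure just described can be sketched with each per-segment B+-Tree modeled as a Python dict (an illustrative stand-in for a real B+-Tree; names are ours):

```python
# Probe every per-segment index, gather matching index records, and sort
# them by page number so the geometric file is then read in page order.
def point_query(key, segment_indexes):
    hits = []
    for tree in segment_indexes:          # one look-up per segment index
        if key in tree:
            hits.extend(tree[key])
    return sorted(hits, key=lambda rec: rec["page"])

indexes = [
    {7: [{"page": 42, "offset": 1}]},
    {3: [{"page": 8, "offset": 0}]},
    {7: [{"page": 5, "offset": 2}]},
]
result = point_query(7, indexes)
```

The sort by page number is what lets the final retrieval pass over the geometric file proceed in a single, mostly sequential sweep.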

We expect a segment-based index structure to be a compact structure, as there is exactly one index record present in the index structure for each record in the geometric file, and the index structure is maintained as records are deleted from the file.

6.4 A Subsample-Based Index Structure

Although compact, the segment-based index structure has a little too many small indexes. The requirement that we perform a look-up using every single one of a large number of B+-Trees can easily degrade the performance of index-based search. A geometric file could easily have multiple thousands of segments in it; even with two disk seeks per B+-Tree to retrieve an index record, a simple point query may require thousands of disk seeks to return the query results. An alternative to a segment-based index structure is to build a B+-Tree index for each subsample of the geometric file. We refer to this approach as a subsample-based index structure.

6.4.1 Index Construction and Maintenance

Every time the buffer accumulates the desired number of samples for a new subsample, we build a single B+-Tree index for all the buffered records. As in the case of a segment-based index structure, we construct an index record for each buffered record and then bulk insert them all to create a B+-Tree index. The structure of the index record for a subsample-based index structure is the same as that of a segment-based index structure, except that we add an attribute recording the segment number to which the buffered record belongs. As discussed subsequently, we use the segment number associated with the index record to determine if it is stale. We remember each B+-Tree added to the structure by keeping track of its root node in an array structure.

As in the case of a segment-based index structure, we arrange the B+-Tree indexes on disk in a single index file. However, we need a slightly different approach, because during start-up subsamples are flushed to the geometric file until the reservoir is full. Thereafter, subsamples of the same size |B| are added to the reservoir. Since each B+-Tree will index no more than |B| records, we can bound the size of a B+-Tree index. We use this bound to pre-allocate a fixed-size slot on disk for each B+-Tree. Furthermore, for every buffer flush after the reservoir is full, exactly one subsample is added to the file and the smallest subsample of the file decays completely, keeping the number of subsamples in a geometric file constant. We use this information to lay out the subsample-based B+-Trees on disk and maintain them as new records are sampled from the data stream.

Thus, if totSubsamples is the total number of subsamples in R, we first allocate totSubsamples fixed-size slots in the index file. Initially all the slots are empty. During start-up, as a new B+-Tree is built, we seek to the next available slot and write out the B+-Tree in a sequential


At the end of my dissertation I would like to thank all those people who made this dissertation possible and an enjoyable experience for me.

First of all, I wish to express my sincere gratitude to my adviser, Chris Jermaine, for his patient guidance, encouragement, and excellent advice throughout this study. If I had access to a magic create-your-own-adviser tool, I still would not have ended up with anyone better than Chris. He always introduces me to interesting research problems. He is around whenever I have a question, but at the same time encourages me to think on my own and work on any problems that interest me.

I am also indebted to Alin Dobra for his support and encouragement. Alin is a constant source of enthusiasm. The only topic I have not discussed with him is strategies of Gator football.


I am grateful to my dissertation committee members Tamer Kahveci, Joachim Hammer, and

Ravindra Ahuja for their support and their encouragement.

I acknowledge the Department of Industrial and Systems Engineering, Ravindra Ahuja, and chair Donald Hearn for the financial support and advice I received during the initial years of my


Finally, I would like to express my deepest gratitude for the constant support, understanding,

and love that I received from my parents during the past years.

geometric file is that it organizes the records to be overwritten systematically on the disk, by making the observation that each existing subsample loses approximately the same fraction of its remaining records every time.

3.10 Multiple Geometric Files

The value of α can have a significant effect on geometric file performance. If α = 0.999, we can expect to spend up to 95% of our time on random disk head movements. However, if we were instead able to choose α = 0.9, then we would reduce the number of disk head movements by a factor of 100, and we would spend only a tiny fraction of the total processing time on seeks. Unfortunately, as things stand, we are not free to choose α. According to Lemma 1, α is fixed by the ratio |B|/|R|. That is, for a fixed desired size of reservoir, we need a larger buffer to lower the value of α.

However, there is a way to improve the situation. Given a buffer of fixed capacity |B| and desired sample size |R|, we choose a smaller value α′ < α, and then maintain more than one geometric file at the same time to achieve a large enough sample. Specifically, we need to maintain m = (1 − α′)/(1 − α) geometric files at once. These files are identical to what we have described thus far, except that the parameter α′ is used to compute the sizes of a subsample's on-disk segments, and the size of each file is |R|/m. The remainder of this section describes the details of how multiple geometric files are used to achieve greater efficiency.
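The parameter arithmetic can be illustrated numerically (values chosen for illustration; following Lemma 1 we take α = 1 − |B|/|R|):

```python
import math

# Given buffer size |B| and reservoir size |R|, alpha = 1 - |B|/|R|.
# Choosing a smaller alpha' means maintaining m = (1 - alpha')/(1 - alpha)
# geometric files, each holding a sample of size |R|/m.
B, R = 50_000, 5_000_000
alpha = 1 - B / R                  # 0.99 for these illustrative values
alpha_prime = 0.9                  # chosen to cut down random disk seeks
m = math.ceil((1 - alpha_prime) / (1 - alpha))
per_file = R // m                  # records held by each of the m files
```

The trade-off is explicit: a smaller α′ means cheaper flushes per file, but more files to cycle through on each pass.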

3.11 Reservoir Sampling with Multiple Geometric Files

The reservoir sampling algorithm with multiple geometric files is similar to Algorithm 3. Each of the m geometric files is still treated as a set of decaying subsamples, and each subsample is partitioned into a set of segments of exponentially decreasing size, just as is done in Algorithm 3, Steps (5)-(13). The only difference is that as each file is created, the parameter α′ is used instead of α in Steps (6), (8)-(9), and each of the m geometric files is filled one after another, in turn. Thus, each subsample of each geometric file will have segments of size n, nα′, nα′², and so on.


Table 1-1. Population: student records

Rec # Name Class Salary ($/month)
1 James Junior 1200
2 Tom Freshman 520
3 Sandra Junior 1250
4 Jim Senior 1500
5 Ashley Sophomore 700
6 Jennifer Freshman 530
7 Robert Sophomore 750
8 Frank Freshman 580
9 Rachel Freshman 605
10 Tim Freshman 550
11 Maria Sophomore 760
12 Monica Freshman 600
Total Salary: 9545.00

Table 1-2. Random sample of size 4

Rec # Name Class Salary ($/month)
2 Tom Freshman 520
5 Ashley Sophomore 700
8 Frank Freshman 580
12 Monica Freshman 600

Other cases where a biased sample is preferable abound. For example, if the goal is to monitor the packets flowing through a network, one may choose to weight more recent packets more heavily, since they would tend to figure more prominently in most query workloads.

We propose a simple modification to the classic reservoir sampling algorithm [11, 38] in order to derive a very simple algorithm that permits the sort of fixed-size, biased sampling given in the example. Our method assumes the existence of an arbitrary, user-defined weighting function f which takes as an argument a record ri, where f(ri) > 0 describes the record's utility

Table 1-3. Biased sample of size 4

Rec # Name Class Salary ($/month)
1 James Junior 1200
4 Jim Senior 1500
7 Robert Sophomore 750
11 Maria Sophomore 760

which is the desired probability.

This expression can then be used in conjunction with the next lemma to compute the variance of the natural estimator for q.

Lemma 7. The variance of q̂ is

    Var(q̂) = Σj g²(rj)/Pr[rj ∈ Ri]
           + Σj≠k Pr[{rj, rk} ∈ Ri] × g(rj)g(rk) / (Pr[rj ∈ Ri] × Pr[rk ∈ Ri])
           − q²

Proof. Let Xj be the indicator variable for the event rj ∈ Ri, so that q̂ = Σj Xj g(rj)/Pr[rj ∈ Ri]. Then

    Var(q̂) = E[q̂²] − (E[q̂])²
           = E[ Σj Xj² g²(rj)/Pr²[rj ∈ Ri]
             + Σj≠k Xj Xk g(rj)g(rk)/(Pr[rj ∈ Ri] Pr[rk ∈ Ri]) ] − q²
           = Σj g²(rj)/Pr[rj ∈ Ri]
             + Σj≠k Pr[{rj, rk} ∈ Ri] g(rj)g(rk)/(Pr[rj ∈ Ri] Pr[rk ∈ Ri]) − q²

where the last step uses E[Xj²] = E[Xj] = Pr[rj ∈ Ri] and E[Xj Xk] = Pr[{rj, rk} ∈ Ri].

This proves the lemma.

By using the result of Lemma 6 to compute Pr[{rj, rk} ∈ Ri], the variance of the estimator is then easily obtained for a specific query. In practice, the variance itself must be estimated by considering only the sampled records, as we typically do not have access to each and every rj during query processing. The q² term and the two sums in the expression of the variance are thus computed over each rj in the sample of the biased geometric file rather than over the entire reservoir.
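A sketch of this computation, with all inclusion probabilities supplied as inputs (names and example numbers are ours; a plug-in estimate evaluated only over sampled records is an approximation to the true variance, not an exact value):

```python
def estimated_variance(sample, g, pr1, pr2, q_hat):
    """Evaluate the Lemma 7 expression over the sampled records only:
    pr1[j] = Pr[r_j in R_i]; pr2[(j, k)] = Pr[{r_j, r_k} in R_i]."""
    var = -q_hat ** 2                          # the -q^2 term
    for j, rj in enumerate(sample):
        var += g(rj) ** 2 / pr1[j]             # single-record terms
        for k, rk in enumerate(sample):
            if k != j:                         # pairwise terms
                var += pr2[(j, k)] * g(rj) * g(rk) / (pr1[j] * pr1[k])
    return var

# Tiny illustrative inputs: two sampled records with made-up probabilities.
sample = ["a", "b"]
g = lambda r: {"a": 2.0, "b": 3.0}[r]
pr1 = [0.5, 0.5]
pr2 = {(0, 1): 0.25, (1, 0): 0.25}
var = estimated_variance(sample, g, pr1, pr2, q_hat=5.0)
```

The pairwise loop is why computing Pr[{rj, rk} ∈ Ri] efficiently matters: the sum touches every pair of sampled records.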

There is one additional issue regarding biased sampling that is worth some additional discussion: how to efficiently compute the value Pr[{rj, rk} ∈ Ri] in order to estimate

3.12 Speed-Up Analysis

4 BIASED RESERVOIR SAMPLING  58

    4.1 A Single-Pass Biased Sampling Algorithm  59
        4.1.1 Biased Reservoir Sampling  59
        4.1.2 So, What Can Go Wrong? (And a Simple Solution)  60
        4.1.3 Adjusting Weights of Existing Samples  62
    4.2 Worst Case Analysis for Biased Reservoir Sampling Algorithm  65
        4.2.1 The Proof for the Worst Case  66
        4.2.2 The Proof of Theorem 1: The Upper Bound on totalDist  73
    4.3 Biased Reservoir Sampling With The Geometric File  75
    4.4 Estimation Using a Biased Reservoir  76

5 SAMPLING THE GEOMETRIC FILE  80

    5.1 Why Might We Need To Sample From a Geometric File?  80
    5.2 Different Sampling Plans for the Geometric File  80
    5.3 Batch Sampling From a Geometric File  81
        5.3.1 A Naive Algorithm  81
        5.3.2 A Geometric File Structure-Based Algorithm  82
        5.3.3 Batch Sampling Multiple Geometric Files  84
    5.4 Online Sampling From a Geometric File  84
        5.4.1 A Naive Algorithm  84
        5.4.2 A Geometric File Structure-Based Algorithm  85
    5.5 Sampling A Biased Sample  88

    6.1 Why Index a Geometric File?  89
    6.2 Different Index Structures for the Geometric File  90
    6.3 A Segment-Based Index Structure  91
        6.3.1 Index Construction During Start-up  91
        6.3.2 Maintaining Index During Normal Operation  92
        6.3.3 Index Look-Up and Search  93
    6.4 A Subsample-Based Index Structure  93
        6.4.1 Index Construction and Maintenance  94
        6.4.2 Index Look-Up  95
    6.5 A LSM-Tree-Based Index Structure  96
        6.5.1 An LSM-Tree Index  96
        6.5.2 Index Maintenance and Look-Ups  97

7 BENCHMARKING  99

    7.1 Processing Insertions  99
        7.1.1 Experiments Performed  99
        7.1.2 Discussion of Experimental Results  100

Definition 1. If Ri is the biased sample of the first i records produced by a data stream, the value f'(rj) is the true weight of a record rj if and only if

    Pr[rj ∈ Ri] = |R| f'(rj) / Σ_{k=1}^{i} f'(rk)

What we will be able to guarantee is then twofold:

1. First, we will be able to guarantee that f'(rj) will be exactly f(rj) if (|R| f(rk))/totalWeight ≤ 1 for all k ≥ j.

2. We can also guarantee that we can compute the true weight for a given record to unbias
any estimate made using our sample (see Section 4.4).

In other words, our biased sample can still be used to produce unbiased estimates that

are correct on expectation [16], but the sample might not be biased exactly as specified by the

user-defined function f, if the value of f(r) tends to fluctuate wildly. While this may seem like a

drawback, the number of records not sampled according to f will usually be small. Furthermore,

since the function used to measure the utility of a sample in biased sampling is usually the result

of an approximate answer to a difficult optimization problem [15] or the application of a heuristic

[52], having a small deviation from that function might not be of much concern.

We present a single-pass biased sampling algorithm that provides both guarantees outlined

above as Algorithm 7; Lemma 4 proves the correctness of the algorithm.
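The acceptance step at the heart of such an algorithm can be sketched as follows (the function, the state dictionary, and the specific clipping rule for overweight records are illustrative assumptions, not the exact pseudocode of Algorithm 7):

```python
import random

def biased_reservoir_insert(reservoir, capacity, record, weight, state):
    """One step of single-pass biased reservoir sampling (sketch).

    state["totalWeight"] is the running sum of adjusted weights f'."""
    if len(reservoir) < capacity:
        # The first |R| records always enter the reservoir.
        reservoir.append(record)
        state["totalWeight"] += weight
        return
    # An "overweight" record would get acceptance probability > 1; clip
    # its adjusted weight f'(r) so that capacity * f' / totalWeight == 1.
    adjusted = weight
    if capacity * weight > state["totalWeight"] + weight:
        adjusted = state["totalWeight"] / (capacity - 1)
    state["totalWeight"] += adjusted
    # Accept with probability |R| * f'(r) / totalWeight, evicting a
    # uniformly chosen victim on acceptance.
    if random.random() < capacity * adjusted / state["totalWeight"]:
        reservoir[random.randrange(capacity)] = record
```

Note that for an overweight record the clipped weight makes the acceptance probability exactly 1, which is the point at which f'(r) starts to differ from f(r).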

Lemma 4. Let Ri be the state of the biased sample just after the ith record in the stream has

been processed. Using the biased sampling described in Algorithm 7, we are guaranteed that

for each Ri and for each record rj produced by the data stream such that j ≤ i, we have

    Pr[rj ∈ Ri] = |R| f'(rj) / Σ_{m=1}^{i} f'(rm)

Proof. We know that the probability of selecting the ith record for the reservoir is |R| f(ri)/totalWeight.

There are then two cases to explore: the first, when the reservoir is full but before we encounter

an overweight record rl, and the second, after we encounter such an rl.

Case (i): The proof of this case is very similar to the proof of Lemma 3. We simply use

f' instead of f to prove the desired result.







flushes assigned to replace them. In order to accomplish this, we note that if we did not perform

consolidation and instead replaced a segment from each subsample with exactly those records

assigned to overwrite records from that subsample, then on expectation a subsample would

lose all of the records in its largest segment after m buffer flushes. Thus, if we somehow delay

overwriting the largest segment in each file for m buffer flushes, we could sidestep the problem

of losing too many records due to consolidation.

The way to accomplish this is to overwrite subsamples in a lazy manner. We merge the

buffer with the (j mod m)th geometric file, but we do not overwrite any of the valid samples

stored in the file until the next time we get to the file. We can achieve this by allocating enough

extra space in each geometric file to hold a complete, empty subsample in each geometric

file. This subsample is referred to as the dummy. The dummy never decays in size, and never

stores its own samples. Rather, it is used as a buffer that allows us to sidestep the problem of

a subsample decaying too quickly. When a new subsample is added to a geometric file, the

new subsample overwrites the segments of the dummy rather than overwriting the largest segment of any

existing subsample. Thus, we have protected the segments of subsamples that contain valid data by

overwriting the dummy's records instead.

When records are merged from the buffer into the dummy, the space previously owned by

the dummy is given up to allow storage of the file's newest subsample. After this flush, the largest

segment from each of the subsamples in the file is given up to reconstitute the new dummy.

Because the records in (new) dummy's segments will not be over-written until the next time that

this particular geometric file is written to, all of the data that is contained within it is protected.

Note that with a dummy subsample, we no longer have a problem with a subsample losing

its samples too quickly. Instead, a subsample may have slightly too many samples present on disk

at any given time, buffered by the file's dummy. These extra samples can easily be ignored during

query processing. The only additional cost we incur with the dummy is that each of the geometric

files on disk must have |B| additional units of storage allocated. The use of a dummy subsample

is illustrated in Figure 3-5.
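The lazy-overwrite mechanism can be modelled in a few lines (a toy, segment-granularity model; the dictionary layout and function name are assumptions for illustration, not the on-disk format):

```python
def flush_to_file(gfile, new_segments):
    """One buffer flush into a single geometric file (toy model).

    gfile["subsamples"] is a list of subsamples, each a list of segment
    labels ordered largest-first; gfile["dummy"] is the list of segment
    slots currently owned by the dummy."""
    # (1) The new subsample overwrites only the dummy's slots, so no
    #     valid subsample loses a segment during this flush.
    slots = len(gfile["dummy"])
    gfile["subsamples"].append(new_segments[:slots])
    # (2) Afterwards, every subsample gives up its largest remaining
    #     segment to reconstitute the dummy for the next visit.
    gfile["dummy"] = [s.pop(0) for s in gfile["subsamples"] if s]
    gfile["subsamples"] = [s for s in gfile["subsamples"] if s]
```

Running one flush shows the invariant: valid segments survive the flush that writes the new subsample and are only surrendered to the dummy afterwards.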

the variance during query evaluation. Computing Pr[{rj, rk} ∈ Ri] requires that we be
able to compute two subexpressions for each sampled record pair:

    (|R| - 1) f'(rj) f'(rk) / (Σ_{l=1}^{j} f'(rl) × Σ_{l=1}^{k} f'(rl))

and

    Π_{l=k+1}^{i} (1 - 2 Pr[rl ∈ Rl] / |R|)
The first subexpression can easily be computed with the help of the running total totalWeight

along with the weight multipliers associated with each subsample. When sample records are

added to the reservoir, we store, like the attribute ri.weight, two more attributes with each record:

ri.oldTotalWeight and ri.oldM. The first attribute gets its value from the current value of totalWeight,

whereas M(ri) is stored in the second attribute. When a query is evaluated and we need

to compute the first subexpression for a given record pair rj and rk, we compute the terms in its

denominator as follows:

    Σ_{l=1}^{k-1} f'(rl) = rk.oldTotalWeight × M(rk)

    Σ_{l=1}^{k} f'(rl) = Σ_{l=1}^{k-1} f'(rl) + f'(rk) = Σ_{l=1}^{k-1} f'(rl) + (rk.weight × M(rk))

The second subexpression can also easily be computed if we maintain, at all times, a running total

subexp2Total for the sum Σ log(1 - 2 Pr[rl ∈ Rl]/|R|). When a new record is added to the

reservoir, the current value of subexp2Total is stored as another attribute, ri.subexp2Val, along

with each record. When a query is evaluated, for a given record pair rj and rk we simply evaluate

    Π_{l=k+1}^{i} (1 - 2 Pr[rl ∈ Rl]/|R|) = exp(subexp2Total - rk.subexp2Val)

associated with randomization and sampling from a data management perspective. However,

the assumption underlying the CONTROL project is that all of the data are present and can

be archived by the system; online sampling is not considered. Our work is complementary to

the CONTROL project in that their algorithms could make use of our samples. For example,

a sample maintained as a geometric file could easily be used as input to a ripple join or online


2.2 Biased Sampling Related Work

Our biased sampling algorithm is based on the reservoir sampling algorithm, which was first

proposed in the 1960s [11, 38]. Recently, Gemulla et al. [29] extended the reservoir sampling

algorithm to handle deletions. In their algorithm, called "random pairing" (RP), every deletion

from the dataset is eventually compensated for by a subsequent insertion. The RP algorithm keeps

track of uncompensated deletions and uses this information while performing the inserts. The

algorithm guards the bound on the sample size and at the same time utilizes the sample space

effectively to provide a stable sample. Another extension to the classic reservoir sampling

algorithm has recently been proposed by Brown and Haas for warehousing of sample data [10].

They propose hybrid reservoir sampling for independent and parallel uniform random sampling

of multiple streams. These algorithms can be used to maintain a warehouse of sampled data that

shadows the full-scale data warehouse. They have also provided methods for merging samples

from different streams to create a uniform random sample.

The problem of temporally biased sampling in a stream environment has also been considered.

Babcock et al. [7] presented a sliding-window approach that restricts the horizon of the sample

in order to bias it toward recent streaming records. However, this solution has the potential

to completely lose the entire history of past stream data that is not a part of the sliding window. The

work done by Aggarwal [5] addresses this limitation and presents a biased sampling method that

provides temporal bias toward recent records while still keeping representation from the stream

history. This work exploits some interesting properties of the class of memoryless bias functions

to present a single-pass biased sampling algorithm for these types of bias functions. However,

3.11.3 Handling the Stacks in Multiple Geometric Files

One final issue that should be considered is maintenance of the stacks associated with each

subsample of the (j mod m)th geometric file. Just as in the single file case, the purpose of the

stack associated with a subsample is to store samples that are still valid, but whose space must

be given up in order to store new samples from the buffer that have been flushed to disk. With

multiple geometric files, this does not change. It is possible that when the buffer is written to the

dummy subsample in a file, the dummy may still contain valid samples from a subsample in that

file. Specifically, one or more of the dummy's segments may contain valid samples from the last

subsample to own the segment. In that case, the valid samples are saved to that subsample's stack

before the dummy is over-written.

3.12 Speed-Up Analysis

The increase in speed achieved using multiple geometric files can be dramatic. The time

required to flush a set of new samples to disk as a new subsample is dominated by the need to

perform random disk head movements. For each subsample, we need two random movements

to overwrite its largest segment (one to read the location and one to write a new segment) and

then two more seeks for its stack adjustment; a total of around 40 ms/segment. The number of

segments required to write a new subsample to disk in the case of multiple geometric files (and

thus the number of random disk head movements required) is given by Lemma 2.

Lemma 2. Let u = (log(1/α'))^{-1}. Multiple geometric files can be used to maintain an online

sample of arbitrary size with a cost of O(u × (log |B|)/|B|) random disk head movements for each

newly sampled record.

Proof. We know that for every buffer flush, m segments in the buffer are grouped to form a

consolidated segment. All such consolidated segments are then used to overwrite the largest on-

disk segments of the subsamples stored in a single geometric file. From Observation 3, we know

that the number of on-disk segments of a subsample (and thus the number of consolidated segments)

is ⌈(log β - log n + log(1 - α')) / log α'⌉. Substituting n = (1 - α') × |B| and simplifying the expression (as well as

[Figure: relative error versus correlation factor for three schemes: biased sampling w/o skewed records, unbiased reservoir sampling, and the biased sampling worst case.]

Figure 7-4. Sum query estimation accuracy for zipf=0.5.

Attribute B is the attribute that is actually aggregated by the SUM query. Each set is generated

so that attributes A and B both have a certain amount of Zipfian skew, specified by the parameter

zipf. In each case, the bias function f is defined so as to minimize the variance for a SUM query

evaluated over attribute A.

In addition to the parameter zipf, each data set also has a second parameter which we term

the correlation factor. This is the probability that attribute A has the same value as attribute B. If

the correlation factor is 1, then A and B are identical, and since the bias function is defined so as

to minimize the variance of a query over A, the bias function also minimizes the variance of an

estimate over the actual query attribute B. Thus, a correlation factor of 1 provides for a perfect

bias function. As the correlation factor decreases, the quality of the bias function for a query over

attribute B declines, because the chance increases that a record deemed important by looking at

attribute A is, in fact, one that should not be included in the sample. This models the case where

one can only guess at the correct bias function beforehand; for example, when queries with an

arbitrary relational selection predicate may be issued. A small correlation factor corresponds to

the case when the guessed-at bias function is actually very incorrect.

of the existing subsamples during a buffer flush. Though we may be able to avoid rebuilding the

entire file, the fact that the buffer must over-write a subset of each on-disk subsample presents

a challenge when trying to maintain acceptable performance, because this naturally leads to

fragmentation (see the discussion of the localized overwrite extension in Section 3.3). For

example, if there are 100 on-disk subsamples, the buffer must be split 100 ways in order to write

to a portion of each of the 100 on-disk subsamples. This fragmented buffer then becomes a new

subsample, and subsequent buffer flushes that need to replace a random portion of this subsample

must somehow efficiently overwrite a random subset of the subsample's fragmented data.

The geometric file uses a careful, on-disk data organization in order to avoid such fragmentation.

The key observation behind the geometric file is that the number of records of a subsample

that are replaced with records from the buffered sample can be characterized with reasonable

accuracy using a geometric series (hence the name geometric file). As buffered samples are added to

the reservoir via buffer flushes, we observe that each existing subsample loses approximately the

same fraction of its remaining records every time, where the fraction of records lost is governed

by the ratio of the size of a buffered sample to the overall size of the reservoir. By "loses", we

mean that the subsample has some of its records replaced in the reservoir with records from a

subsequent subsample. Thus, the size of a subsample decays approximately in an exponential

manner as buffered samples are added to the reservoir.
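The resulting segment sizes form a truncated geometric series, which can be sketched as follows (the function and its parameter names are illustrative assumptions, not the dissertation's exact sizing rule):

```python
def segment_sizes(buffer_size, reservoir_size, min_size=1.0):
    """Expected segment sizes for a new subsample (sketch).

    With each buffer flush a subsample keeps a fraction
    alpha = 1 - |B|/|R| of its remaining records, so the segment given
    up at the k-th subsequent flush has expected size
    |B| * (1 - alpha) * alpha**k."""
    alpha = 1.0 - buffer_size / reservoir_size
    sizes, remaining = [], float(buffer_size)
    while remaining * (1.0 - alpha) >= min_size:
        seg = remaining * (1.0 - alpha)   # largest remaining segment
        sizes.append(seg)
        remaining -= seg
    sizes.append(remaining)               # small tail segment
    return sizes
```

For example, with |B| = 100 and |R| = 1000, the segments shrink by a factor of 0.9 per flush: 10, 9, 8.1, and so on, which is why a fixed layout of exponentially decreasing segments matches the expected decay.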

This exponential decay is used to great advantage in the geometric file, because it suggests

a way to organize the data in order to avoid problems with fragmentation. Each subsample is

partitioned into a set of segments of exponentially decreasing size. These segments are sized

so that every time a buffered sample is added to the reservoir, we expect that each existing

subsample loses exactly the set of records contained in its largest remaining segment. As a

result, each subsample loses one segment to the newly-created subsample every time the buffer is

emptied, and a geometric file can be organized into a fixed and unchanging set of segments that

are stored as contiguous runs of blocks on disk. Because the set of segments is fixed beforehand,

fragmentation and update performance are not problematic: in order to replace records in an


A geometric file is a simple random sample (without replacement) from a data stream. In

this chapter we develop techniques which allow a geometric file to itself be sampled in order to

produce smaller sets of data objects that are themselves random samples (without replacement)

from the original data stream. The goal of the algorithms described in this chapter is to efficiently

support further sampling of a geometric file by making use of its own structure.

5.1 Why Might We Need To Sample From a Geometric File?

In Section 3.2, we argued that small samples frequently do not provide enough accuracy,

especially in the case when the resulting statistical estimator has a very high variance. However,

while in the general case a very large sample can be required to answer a difficult query, a

huge sample may often contain too much information. For example, reconsider the problem

of estimating the average net worth of American households as described in Section 3.2. In

the general case, many millions of samples may be needed to estimate the net worth of the

average household accurately (due to a small ratio between the average household's net worth

and the standard deviation of this statistic across all American households). However, if the same

set of records held information about the size of each household, only a few hundred records

would be needed to obtain similar accuracy for an estimate of the average size of an American

household, since the ratio of average household size to the standard deviation of household size

across households in the United States is greater than 2. Thus, to estimate the answer to these two

queries, vastly different sample sizes are needed.
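The gap between the two queries follows from the standard CLT-based sample-size calculation, sketched below (the z value and all input figures are illustrative assumptions, not census data):

```python
def required_sample_size(mean, stddev, rel_error, z=1.96):
    """Approximate sample size for a confidence interval of
    +/- rel_error * mean around an estimated mean (standard CLT-based
    bound; the 95% z value is an illustrative choice)."""
    return (z * stddev / (rel_error * mean)) ** 2

# Net worth: standard deviation dwarfs the mean -> millions of samples.
n_networth = required_sample_size(mean=100_000, stddev=2_000_000, rel_error=0.01)
# Household size: mean/stddev ratio above 2 -> a few hundred samples.
n_size = required_sample_size(mean=2.5, stddev=1.2, rel_error=0.05)
```

Because the required size grows with (stddev/mean)², the two estimates differ by several orders of magnitude even over the same set of records.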

5.2 Different Sampling Plans for the Geometric File

Since there is no single sample size that is optimal for answering all queries and the required

sample size can vary dramatically from query to query, this chapter considers the problem of

generating a sample of size N from a data stream using an existing geometric file that contains a

large sample of records from the stream, where N < |R|. We will consider two specific problems.

First, we consider the case where N is known beforehand. We will refer to a sample retrieved

in this manner as a batch sample. Batch samples of fixed size have been suggested for use in

    f'(rj) / Σ_{k=1}^{i} f'(rk) = f(rj) / (|R| f(rmax) + Σ_{k=|R|+1}^{i} f(rk))

Since Σ_{k=1}^{|R|} f(rk) ≤ |R| f(rmax), we have

    f(rj) / (|R| f(rmax) + Σ_{k=|R|+1}^{i} f(rk)) ≤ f(rj) / (Σ_{k=1}^{|R|} f(rk) + Σ_{k=|R|+1}^{i} f(rk)) = f(rj) / Σ_{k=1}^{i} f(rk)

We can therefore conclude that

    f'(rj) / Σ_{k=1}^{i} f'(rk) ≤ f(rj) / Σ_{k=1}^{i} f(rk)

This proves the second part of the lemma.

The proof of the second proposition regarding the effect of the first |R| records:

We now turn our attention to the effect of the first |R| records of the stream on the worst-case
distance. If rmax appears as the |R|th record in the worst case, then using the result of Lemma 5
and summing over all j < |R|, we know that

    Σ_{j=1}^{|R|-1} f'(rj) / Σ_{k=1}^{i} f'(rk) = (|R| - 1) f(rmax) / (|R| f(rmax) + Σ_{k=|R|+1}^{i} f(rk))

and

    Σ_{j=1}^{|R|-1} f(rj) / Σ_{k=1}^{i} f(rk) ≤ (|R| - 1) f(rmax) / ((|R| - 1) f(rmax) + Σ_{k=|R|}^{i} f(rk))

track of the B+-Trees for each segment in the geometric file. Each array entry simply stores the

position of a B+-Tree root node.

Rather than maintaining a file for each B+-Tree created, we organize multiple B+-Trees in

a single disk file. We refer to this single file as the index file. The index file, in a sense, is similar to

the log-structured file system proposed by Ousterhout [45]. In a log-structured file system, as files

are modified, their contents are written out to disk as logs in a sequential stream. This allows

writes in full-cylinder units, with only track-to-track seeks. Thus the disk operates at nearly its

full bandwidth. The index file enjoys similar performance benefits. Every time a B+-Tree

is created for a memory resident segment, it is written to the index file in a sequential stream at

the next available position. The array maintaining all B+-Tree root nodes is augmented with the

starting disk position of the B+-Tree.

Finally, we do not index segments that are never flushed to disk. These segments are

typically very small (the size of a disk block), and it is efficient to search them using a sequential

memory scan when the geometric file is queried.

6.3.2 Maintaining Index During Normal Operation

Maintaining a segment-based index structure is exceedingly simple. During normal

operation as a new subsample and its segments are formed, we build a B+-Tree index for each

in-memory segment just like we did during the start-up. The only difference is that the B+-Trees

are written to the disk in a slightly different manner. As an in-memory segment overwrites an

on-disk segment, the B+-Tree for the in-memory segment overwrites the B+-Tree for the on-disk

segment. We then update the B+-Tree array entry with the root node of the new B+-Tree that is added

to the index structure. Thus, index maintenance for records newly inserted into the geometric

file and the records that are deleted from the file is handled at the same time.

The algorithm used to construct and maintain a segment-based index structure is given as

Algorithm 10.
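The construction and maintenance just described can be sketched as follows (an in-memory stand-in: sorted lists play the role of the per-segment B+-Trees, and the class layout is an illustrative assumption, not Algorithm 10 itself):

```python
class SegmentIndex:
    """Segment-based index for a geometric file (sketch)."""

    def __init__(self):
        self.roots = {}       # segment id -> position of its tree in the index file
        self.index_file = []  # append-only log of serialized trees

    def index_segment(self, seg_id, records, key=lambda r: r):
        # Build the tree for the in-memory segment, then append it to the
        # index file in a sequential stream at the next available position.
        tree = sorted(records, key=key)
        self.roots[seg_id] = len(self.index_file)
        self.index_file.append(tree)

    def lookup(self, seg_id):
        # The array entry always points at the newest tree for a segment;
        # overwriting the entry retires the old tree, so inserts into the
        # geometric file and deletes from it are handled at the same time.
        return self.index_file[self.roots[seg_id]]
```

Writing each new tree at the tail of the index file keeps index maintenance sequential, which is the property borrowed from log-structured file systems.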

Figure 3-5. Speeding up the processing of new samples using multiple geometric files. (a) Initial configuration: each of the m geometric files has an additional dummy segment that holds no data. (b) The jth new subsample is added by overwriting the dummy in the i = (j mod m)th geometric file. (c) Existing subsamples give their largest segment to reconstitute the dummy; the data in these segments are protected until the next time the dummy is over-written. (d) The next m - 1 buffer flushes write new subsamples to the other m - 1 geometric files, using the same process. The mth buffer flush again overwrites the dummy in the ith geometric file, and the process is repeated from step (c).





[Figure: relative error versus correlation factor for three schemes: biased sampling w/o skewed records, unbiased reservoir sampling, and the biased sampling worst case.]

Figure 7-6. Sum query estimation accuracy for zipf=1.

that even for very skewed data sets, it is difficult even for an adversary to come up with a data

ordering that can significantly alter the quality of the user-defined bias function.

We also observe that for a low zipf parameter and a low correlation factor, unbiased

sampling outperforms biased sampling. In other words, it is actually preferable not to bias in

this case. This is because the low zipf value assigns relatively uniform values to attribute B,

rendering an optimal biased scheme little different from uniform sampling. Furthermore, as the

correlation factor decreases, the weighting scheme used by both biased sampling schemes becomes

less accurate, hence the higher variance. As the weighting scheme becomes very inaccurate, it

is better not to bias at all. Not surprisingly, there are more cases where the biased scheme under

the pathological ordering is actually worse than the unbiased scheme. However, as the correlation

factor increases and the bias scheme becomes more accurate, it quickly becomes preferable to


7.3 Sampling From a Geometric File

We have also implemented and benchmarked four techniques for sampling geometric

files, as discussed in Chapter 5. Specifically, we have compared the naive batch sampling and the

online sampling algorithms against the geometric file structure-based batch sampling and online


In this chapter, we detail three sets of benchmarking experiments. In the first set of experi-

ments, we attempt to measure the ability of the geometric file to process a high-speed stream of

data records. In the second set of experiments, we examine the various algorithms for producing

smaller samples from a large, disk-based geometric file. Finally, in the third set of experiments,

we compare the three index structures for the geometric file for build time, disk space, and index

look-up speed.

7.1 Processing Insertions

In order to test the relative ability of the geometric file to process a high-speed stream of

insertions, we have implemented and benchmarked five alternatives for maintaining a large

reservoir on disk: the three alternatives discussed in Section 3.3, the geometric file, and the

framework described in Section 3.10 for using multiple geometric files at once. In the remainder

of this section, we refer to these alternatives as the virtual memory, scan, local overwrite, geo

file, and multiple geo files options. An α' value of 0.9 was used for the multiple geo files option.

All implementation was performed in C++. Benchmarking was performed using a set of

Linux workstations, each equipped with 2.4 GHz Intel Xeon Processors. 15,000 RPM, 80GB

Seagate SCSI hard disks were used to store each of the reservoirs. Benchmarking of these disks

showed a sustained read/write rate of 35-50 MB/second, and an "across the disk" random data

access time of around 10ms.

7.1.1 Experiments Performed

The following three experiments were performed:

Insertion experiment 1: The task in this experiment was to maintain a 50GB reservoir holding

a sample of 1 billion 50-byte records from a synthetic data stream. Each of the five alternatives was

allowed 600MB of buffer memory to work with when maintaining the reservoir. For the scan,

local overwrite, geo file, and multiple geo files options, 100MB was used as an LRU buffer for

disk reads/writes, and 500MB was used to buffer newly sampled records before processing. The

virtual memory option used all 600MB as an LRU buffer. In the experiment, a continual stream

of the data (or "sample view" [43]) may be desirable. In order to save time and/or computer

resources, queries can then be evaluated over the sample rather than the original data, as long as

the user can tolerate some carefully controlled inaccuracy in the query results.

This particular application has two specific requirements that are addressed by the

dissertation. First, it may be necessary to use quite a large sample in order to achieve acceptable

accuracy; perhaps on the order of gigabytes in size. This is especially true if the sample will

be used to answer selective queries or aggregates over attributes with high variance (see Sec-

tion 3.2). Second, whatever the required sample size, it is often independent of the size of the

database, since estimation accuracy depends primarily on sample size.1 In other words, the

required sample size will generally not grow as the database size increases, as long as other

factors such as query selectivity remain relatively constant. Thus, this application requires that

we be able to maintain a large, disk-based, fixed-size random sample of the archived data, even as

new data are added to the warehouse. This is precisely the problem we tackle in the dissertation.

For another example of a case where existing sampling methods can fall short, consider

stream-based data management tasks, such as network monitoring (for an example of such an

application, we point to the Gigascope project from AT&T Laboratories [18-20]). Given the

tremendous amount of data transported over today's computer networks, the only conceivable

way to facilitate ad-hoc, after-the-fact query processing over the set of packets that have passed

through a network router is to build some sort of statistical model for those packets. The most

obvious choice would be to produce a very large, statistically random sample of the packets that

have passed through the router. Again, maintaining such a sample is precisely the problem we

tackle in this dissertation. While other researchers have tackled the problem of maintaining an

1 The unimportance of database size for certain queries is due to the fact that the bias and
variance of many sampling-based estimators are related far more to sample size than to the
sampling fraction (see Cochran [16] for a thorough treatment of finite population random
sampling).

Algorithm 2 Reservoir Sampling with a Buffer
1: for int i = 1 to ∞ do
2:   Wait for a new record r to appear in the stream
3:   if i ≤ |R| then
4:     Add r directly to R and continue
5:   else
6:     with probability |R|/i do
7:       with probability Count(B)/|R| do
8:         // new samples can overwrite buffered samples
9:         Replace a random record in B with r
10:      else do
11:        Add r to B
12:  if Count(B) == |B| then
13:    Scan the reservoir R and empty B in one pass
14:    B = ∅
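Algorithm 2 can be rendered as runnable code (a sketch; the one-pass reservoir scan of line 13 is modelled here as overwriting |B| distinct random reservoir slots, not as a literal disk scan):

```python
import random

def reservoir_with_buffer(stream, R_size, B_size):
    """Runnable rendering of Algorithm 2 (sketch)."""
    R, B = [], []
    for i, r in enumerate(stream, start=1):
        if i <= R_size:                       # lines 3-4
            R.append(r)
            continue
        if random.random() < R_size / i:      # line 6: keep r at all?
            if random.random() < len(B) / R_size:
                # line 9: new samples can overwrite buffered samples
                B[random.randrange(len(B))] = r
            else:                             # line 11
                B.append(r)
            if len(B) == B_size:              # lines 12-14: flush
                for pos, rec in zip(random.sample(range(R_size), B_size), B):
                    R[pos] = rec
                B = []
    return R, B
```

The buffered records logically belong to the reservoir at all times; the flush merely materializes them on disk, which is why the sample remains a true random sample between flushes.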

a record that has been randomly selected for replacement by line (9) of Algorithm 2, and so all

of the database blocks must be updated. Thus, it makes sense to rely on fast, sequential I/O to

update the entire file in a single pass. The drawback of this approach is that every time that the

buffer fills, we are effectively rebuilding the entire reservoir to process a set of buffered records

that are a small fraction of the existing reservoir size.

The localized overwrite extension. We will do better if we enforce a requirement that all

samples are stored in a random order on disk. If data are clustered randomly, then we can simply

write the buffer sequentially to disk at any arbitrary position. Because of the random clustering,

we can guarantee that wherever the buffer is written to disk, the new samples will overwrite a

random subset of the records in the reservoir and preserve the correctness of the algorithm. The

problem with this solution is that after the buffered samples are added, the data are no longer

clustered randomly and so a randomized overwrite cannot be used a second time. The data are

now clustered by insertion time, since the buffered samples were the most recently seen in the

data stream, and were written to a single position on disk. Any subsequent buffer flush will need

to overwrite portions of both the new and the old records to preserve the algorithm's correctness,

requiring an additional random disk head movement. With each subsequent flush, maintaining

randomness will become more costly, as data become more and more clustered by insertion time.

[15] Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach for approx-
imate answering of aggregate queries. In: ACM SIGMOD International Conference on
Management of Data (2001)

[16] Cochran, W.: Sampling Techniques. Wiley and Sons (1977)

[17] Transaction Processing Performance Council: TPC-H Benchmark (2004)

[18] Cranor, C., Gao, Y., Johnson, T., Shkapenyuk, V., Spatscheck, O.: Gigascope: High
performance network monitoring with an SQL interface. In: ACM SIGMOD International
Conference on Management of Data (2002)

[19] Cranor, C., Johnson, T., Spatscheck, O., Shkapenyuk, V.: Gigascope: A stream database for
network applications. In: ACM SIGMOD International Conference on Management of Data

[20] Cranor, C., Johnson, T., Spatscheck, O., Shkapenyuk, V.: The Gigascope stream database.
IEEE Data Engineering Bulletin 26(1), 27-32 (2003)

[21] Dobra, A., Garofalakis, M., Gehrke, J., Rastogi, R.: Processing complex aggregate queries
over data streams. In: ACM SIGMOD International Conference on Management of Data

[22] Duffield, N., Lund, C., Thorup, M.: Charging from sampled network usage. In: IMW '01:
Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement, pp. 245-256.
ACM Press, New York, NY, USA (2001)

[23] Estan, C., Naughton, J.F.: End-biased samples for join cardinality estimation. In: ICDE '06:
Proceedings of the 22nd International Conference on Data Engineering, p. 20.
IEEE Computer Society, Washington, DC, USA (2006)

[24] Estan, C., Varghese, G.: New directions in traffic measurement and accounting: Focusing on
the elephants, ignoring the mice. ACM Trans. Comput. Syst. 21(3), 270-313 (2003)

[25] Olken, F., Rotem, D.: Random sampling from B+ trees. In: International Conference on Very
Large Data Bases (1989)

[26] Olken, F., Rotem, D.: Random sampling from database files: A survey. In: International Working
Conference on Scientific and Statistical Database Management (1990)

[27] Olken, F., Rotem, D., Xu, P.: Random sampling from hash files. In: ACM SIGMOD Interna-
tional Conference on Management of Data (1990)

[28] Ganguly, S., Gibbons, P., Matias, Y., Silberschatz, A.: Bifocal sampling for skew-resistant
join size estimation. In: ACM SIGMOD International Conference on Management of Data

[Figure panels (a)-(e): new samples are written to the geometric file between low and high disk addresses as the file is built.]

Figure 3-3. Building a geometric file.

Several data structures and algorithms have been proposed to speed up index inserts such

as the LSM-Tree [44], Buffer-Tree [6], and Y-Tree [12]. These papers consider the problem of

providing I/O efficient indexing for a database experiencing a very high record insertion rate

which is impossible to handle using a traditional B+-Tree indexing structure. In general these

methods buffer a large set of insertions and then scan the entire base relation, which is typically

organized as a B+-Tree, adding the new data to the structure all at once.

Any of the above methods could trivially be used to maintain a large random sample of a

data stream. Every time a sampling algorithm probabilistically selects a record for insertion, it

must overwrite, at random, an existing record of the reservoir. Once an evictee is determined,

we can attach its location as a position identifier (a number between 1 and R) with a new sample

record. This position field is then used to insert the new record into these index structures. While

performing the efficient batch inserts, if an index structure discovers that a record with the same

position identifier exists, it simply overwrites the old record with the newer one.
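As a rough sketch of this scheme (in Python, with a plain dict standing in for any of the index structures above; the function and variable names are illustrative, not from the dissertation):

```python
import random

def maintain_reservoir_via_index(stream, R_size, seed=0):
    """Sketch: emulate reservoir maintenance through an index keyed on a
    position identifier between 1 and R_size; inserting a record whose
    position identifier already exists overwrites the older record."""
    rng = random.Random(seed)
    index = {}                                   # stand-in for an LSM/Buffer/Y-tree
    for i, rec in enumerate(stream, start=1):
        if i <= R_size:
            index[i] = rec                       # start-up: fill positions 1..R
        elif rng.random() < R_size / i:          # select with probability R/i
            pos = rng.randrange(1, R_size + 1)   # random evictee position
            index[pos] = rec                     # insert overwrites the old record
    return index
```

A real implementation would batch these inserts into the index structure rather than applying them one at a time.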

However, none of these methods can come close to the raw write speed of the disk, as

the geometric file can [13]. In a sense, the issue is that while the indexing provided by these

structures could be used to implement efficient, disk-based reservoir sampling, it is too heavy-

duty a solution. We would end up paying too much in terms of disk I/O to send a new record to

overwrite a specific, existing record chosen at the time the new record is inserted, when all one

really needs is to have a new record overwrite any random, existing record.

There has been much recent interest in approximate query processing over data streams

(a very small subset of these papers is listed in the References section [1, 21, 34]); even some

work on sampling from a data stream [7]. This work is very different from our own, in that most

existing approximation techniques try to operate in very small space. Instead, our focus is on

making use of today's very large and very inexpensive secondary storage to physically store the

largest snapshot possible of the stream.

Finally, we mention the U.C. Berkeley CONTROL project [37] (which resulted in the

development of online aggregation [33] and ripple joins [32]). This work does address issues

flush it in a single scan of the reservoir and overwrite the records as dictated by the sorted order

of the position array. It is obvious that this process is equivalent to steps (5-6) of Algorithm 1

as far as correctness is concerned.

Logically, steps (7-14) of Algorithm 2 implement exactly this process. The

probability that we will generate a random position between 1 and |R| that is already in the

position array of size |B| is |B|/|R|. Step (7) of Algorithm 2 decides whether to overwrite a

random buffered record with a newly sampled record. Once the buffer is full, step (13) performs

a one pass buffer-reservoir merging by generating sequential random positions in the reservoir on

the fly.
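The buffering-plus-sorted-flush logic described above might be sketched as follows (a simplified in-memory model; `buffered_reservoir` and its parameters are illustrative names, and a Python list stands in for the on-disk reservoir):

```python
import random

def buffered_reservoir(stream, R_size, B_size, seed=0):
    """Sketch of Algorithm 2's buffering logic: lazily collect sampled
    records keyed by their random reservoir positions, and flush them
    to the reservoir in sorted (sequential) order."""
    rng = random.Random(seed)
    reservoir = []                 # stand-in for the on-disk reservoir
    buf = {}                       # position -> buffered record
    for i, rec in enumerate(stream, start=1):
        if len(reservoir) < R_size:
            reservoir.append(rec)  # start-up: fill the reservoir
            continue
        if rng.random() < R_size / i:          # sample with probability |R|/i
            pos = rng.randrange(R_size)        # random evictee position
            buf[pos] = rec         # collision: overwrite the buffered record
            if len(buf) == B_size:             # buffer full: one-pass merge
                for pos in sorted(buf):        # sequential, sorted order
                    reservoir[pos] = buf[pos]
                buf.clear()
    for pos in sorted(buf):        # final flush of any remaining records
        reservoir[pos] = buf[pos]
    return reservoir
```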

3.9.2 Correctness of the Reservoir Sampling Algorithm with a Geometric File

In Algorithm 2 we store the samples sequentially on the disk and overwrite them in a

random order. Though correct, the algorithm demands almost a complete scan of the reservoir (to

perform all random overwrites) for every buffer flush. We can do better if we instead force the

samples to be stored in a random order on disk so that they can be replaced via an overwrite using

sequential I/Os. The localized overwrite extension discussed before uses this idea. Every time a

buffer is flushed to the reservoir it is randomized in main memory and written as a random cluster

on the disk. We maintain the correctness of this technique by splitting the random cluster

N ways, where N is the number of existing clusters on the disk, and by overwriting a random

subset of each existing cluster. This avoids the problem of clustering by insertion time. However, the

drawback of this technique is that the solution deteriorates because of fragmentation of clusters.

The geometric file overcomes the drawbacks of these two techniques and can be viewed

as a combination of Algorithm 2 and the idea used in the localized overwrite extension. The

correctness of the Geometric file is results directly from the correctness of these two techniques.

In case of the geometric file the entire sample in the main memory (referred to as a subsample)

is randomized and flushed into the reservoir. Furthermore, each new subsample is split into

exactly as many segments as there are existing subsamples on the disk. These segments

then overwrite a random portion of each disk-based subsample. The only difference with the

Eventually, this solution will deteriorate, unless we periodically re-randomize the entire reservoir.

Unfortunately, re-randomizing the entire reservoir is as costly as performing an external-memory

sort of the entire file containing samples, and requires taking the sample off-line.

3.4 The Geometric File

The three extensions to Algorithm 1 can be used to maintain a large, on-disk sample, but

all of them have drawbacks. In this section, we discuss a fourth algorithm and an associated

data organization called the geometric file to address these pitfalls. The geometric file is best

seen as an extension of the massive rebuild option given as Algorithm 2. Just like Algorithm

2, the geometric file makes use of a main-memory buffer that allows new samples selected by

the reservoir algorithm to be added to the on-disk reservoir in a lazy fashion. However, the

key difference between Algorithm 2 and the algorithms used by the geometric file is that the

geometric file makes use of a far more efficient algorithm for merging those new samples into the


Intuitive description: Except for Step (13) of Algorithm 2, the basic algorithm employed

by the geometric file is not much different. As far as Step (13) is concerned, the difference

between the geometric file and the massive rebuild extension is that the geometric file empties

the buffer more efficiently, in order to avoid scanning or periodically re-randomizing the entire


To accomplish this, the entire sample in main memory that is flushed into the reservoir is

viewed as a single subsample or a stratum [16], and the reservoir itself is viewed as a collection

of subsamples, each formed via a single buffer flush. Since the records in a subsample are a non-

random subset of the records in the reservoir (they are sampled from the stream during a specific

time period), each new subsample needs to overwrite a true, random subset of the records in the

reservoir in order to maintain the correctness of the reservoir sampling algorithm. If this can be

done efficiently, we can avoid rebuilding the entire reservoir in order to process a buffer flush.

At first glance, it may seem difficult to achieve the desired efficiency. The buffered records

that must be added to the reservoir will typically overwrite a subset of the records stored in each

analogous to Olken and Rotem's procedure for choosing the number of records to select from

each hash bucket when performing batched sampling from a hashed file [26]. Once the number

of sampled records from each segment has been determined, sampling those records can be

done with an efficient sequential read since, within each on-disk segment, all records are stored

in a randomized order. The key algorithmic issue is how to calculate the contribution of each

subsample. Since this contribution is a multivariate hypergeometric random variable, we can

use an approach analogous to Algorithm 4, which is used to partition the buffer to form the

segments of a subsample. In other words, we can view retrieving N samples from a geometric

file as analogous to choosing N random records to overwrite when new records are added to the file.

The resulting algorithm can be described as follows. To start with, we partition the sample

space of N records into segments of varying size exactly as in Algorithm 4. We refer to these

segments of the sample space as sampling segments. The sampling segments are then filled with

samples from the disk using a series of sequential reads, analogous to the set of writes that are

used to add new samples to the geometric file. The largest sampling segment obtains all of its

records from the largest subsample, the next largest sampling segment obtains all of its records

from the second largest subsample, and so on.

Algorithm 8 Batch Sampling a Geometric File
1: Set NS = Number of subsamples in a geometric file
2: for i = 1 to NS do
3: Set RecsInSubsam[i] = Size of ith subsample
4: Set RecsToRead[i] = 0
5: for i = 1 to N do
6: Choose j such that Pr[choosing j] = RecsInSubsam[j]/|R|
7: RecsInSubsam[j] - -
8: RecsToRead[j] + +
9: for i = 1 to NS do
10: Append to batchsample RecsToRead[i] records from the ith subsample
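A sketch of Algorithm 8's partitioning step, assuming the subsample sizes are known (names are illustrative; reading the chosen records sequentially from each subsample is elided):

```python
import random

def batch_sample(subsample_sizes, n, seed=0):
    """Sketch of Algorithm 8: decide how many records to read from each
    subsample so that the n draws form a without-replacement sample."""
    rng = random.Random(seed)
    recs_in = list(subsample_sizes)      # RecsInSubsam
    to_read = [0] * len(recs_in)         # RecsToRead
    remaining = sum(recs_in)             # records still available
    for _ in range(n):
        # choose subsample j with probability proportional to its remaining size
        pick = rng.randrange(remaining)
        j = 0
        while pick >= recs_in[j]:
            pick -= recs_in[j]
            j += 1
        recs_in[j] -= 1
        to_read[j] += 1
        remaining -= 1
    return to_read   # then read to_read[i] records sequentially from subsample i
```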

When using this algorithm, some care needs to be taken when N approaches the size

of a geometric file. Specifically, when all disk segments of a subsample are returned to a

corresponding sampling segment, we must also consider the subsample's in-memory buffered


Figure 7-8. Disk footprint for 200B record size

speed are shown in Table 7.4, the disk space used by the three index structures is plotted in Figure 7-8,

and the index look-up speed is tabulated in Table 7.4.2. Thus, we test the effect of record size on

the three index structures.

7.4.2 Discussion

It is possible to draw a few conclusions based on the experimental results. The subsample-

based index structure shows the best build time, the segment-based index structure has the most

compact disk footprint, whereas the LSM-tree-based index structure has the best response to the

index look-ups.

Table 7.4 shows the millions of records inserted into the geometric file after ten hours of insertions

and concurrent updates to the index structure. For comparison we present the number of records

inserted into a geometric file when no index structure is maintained (the "no index" column). It is

clear that the subsample-based index structure performs the best on insertions, with performance

comparable to the "no index" option. This difference reflects the cost of concurrently maintaining

the index structure. The segment-based index structure does the next best. It is slower than the

subsample-based index structure because of the higher number of seeks performed during start-up.

Recall that during start-up the segment-based index must write a B+-tree for each segment.

size of a random sample obtained with a sampling probability of 1/τ, where τ is the threshold used

by these algorithms. Thus, the threshold τ is carefully selected to control the sample size and, if

required, it is increased to honor the upper bound of the sample size.

The problem of implementing a fixed-size sampling design with desired and unequal inclusion

probabilities has been studied in statistics. The monograph Theory of Sample Surveys [50]

discusses several methods for such a sampling technique, which is of some practical importance

in survey sampling. This monograph begins by discussing two designs which mimic simple

random sampling without replacement with selection probabilities for a given draw that are not

the same for all the units. We first summarize these techniques.

Successive Sampling: Let the selection probabilities be p_1, p_2, ..., p_L such that p_i > 0 and

Σ_i p_i = 1, and let the desired sample size be N = 2. Then the design suggests that we draw r with
probability p_r, and then q with probability p_q/(1 − p_r). The inclusion probabilities can be expressed

in terms of the selection probabilities by the fact that r is included if it is drawn on the first draw,

or on the second draw not having been chosen on the first. Thus, the inclusion probability π_r

is given by p_r (1 + Σ_{q≠r} p_q/(1 − p_q)). Similarly, the value for the joint inclusion probability π_{rq} can be

deduced. The monograph suggests that the value of p_r be found using an iterative computation
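The closed-form inclusion probability above can be sketched and checked numerically (a direct transcription of the formula, with illustrative names):

```python
def successive_inclusion_probs(p):
    """Inclusion probabilities under successive sampling with N = 2 draws:
    pi_r = p_r * (1 + sum_{q != r} p_q / (1 - p_q))."""
    return [p_r * (1 + sum(p_q / (1 - p_q) for q, p_q in enumerate(p) if q != r))
            for r, p_r in enumerate(p)]
```

A quick sanity check is that the inclusion probabilities must sum to the sample size N = 2.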


Fellegi's Method: This method is very much like successive sampling described above,

except that the selection probabilities are different for the second draw. The second-draw

probabilities are chosen such that the marginal selection probabilities for both draws are the

same. This feature makes the method suitable for rotating sampling, as in labor-force sampling,

where a fixed proportion of the sample is replaced each month. The procedure is as follows: the

first draw is made with probability p_r = α_r, and then q is drawn with probability p'_q/(1 − p'_r), where

p'_1, ..., p'_L is another set of selection probabilities chosen so that

    Σ_{r≠q} α_r p'_q/(1 − p'_r) = α_q,

where the α_i are specified positive numbers such that Σ_i α_i = 1.

approach is that the more records we fetch sequentially from the disk during a single call
to GetNext, the longer the response time will be for the particular call to GetNext during
which we fetch those blocks. This is particularly worrisome if we spend a lot of time to
fetch blocks which are never used (which will be the case if the user intends to draw only a
relatively small-sized sample.)

*Fetch few. If we fetch a small number of blocks at the time of buffer refills, we reduce the
maximum response time for any given GetNext call. However, we then need more seeks
to sample N records. This approach can be problematic if the user intends to draw a
relatively large sample from the file.

In order to discuss such considerations more concretely, we note that the time required to process

a GetNext call is proportional to the number of blocks fetched on the call, assuming that the cost

to perform the required in-memory calculations is minimal. If b blocks are fetched during a

particular call, we spend s + br time units on that particular call to GetNext, where s is the seek

time and r is time required to scan a block. Once these b blocks are fetched we incur zero cost for

next bn calls to GetNext, where n is the blocking factor (number of records per block). Thus, in

the case where blocks are fetched at the first call to GetNext, we incur the total cost of s + br to

sample bn records, and have a response time of s + br units at the first call to GetNext, with all

subsequent calls having zero cost.

Now imagine that instead we split b blocks into two chunks of size b/2 each, and read a

chunk-at-a-time. Thus, the first GetNext call will cost us s + br/2 time units. Once these bn/2

records are used up, we read the next chunk of blocks. The total cost in this scenario is 2s + br, with a

response time of s + br/2 time units once at the start and again midway through. Note

that although the maximum response time on any call to GetNext is reduced by half, we required

more time to sample bn records. The question then becomes: how do we reconcile response time

with overall sampling time to give the user optimal performance?
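The arithmetic above generalizes to splitting the b blocks into any number of equal chunks; a small cost model (illustrative names, uniform chunk sizes assumed):

```python
def fetch_cost(b, s, r, chunks):
    """Total fetch time and worst single-call response time when b blocks
    are fetched in `chunks` equal pieces (s = seek time, r = per-block
    read time); each refill call pays one seek plus its chunk's reads."""
    per_chunk = b / chunks
    total = chunks * (s + per_chunk * r)     # one seek per chunk
    max_response = s + per_chunk * r         # cost of each refill call
    return total, max_response
```

With one chunk this reproduces the s + br total cost above; with two chunks, the 2s + br total and s + br/2 response time.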

The systematic approach we take to answering this question is based on minimizing the

average square sum of response time over all GetNext calls. This idea is similar to the widely

utilized sum-square-error or MSE criterion, which tries to keep the average error or "cost" from

being too high, but also penalizes particularly poor individual errors or costs. However, one

We therefore also need to show that the pairwise values Pr[r_k, r_l ∈ R_i] have the correct

value. All three-way inclusion probabilities must also be correct, as well as all four-way inclusion

probabilities, and so on. In other words, we need to show that for a set S of interest, Pr[S ⊆ R_i]
has the correct value, for all S ⊆ R.
The proof that reservoir sampling maintains the correct inclusion probability for any set

of interest is actually very similar to the univariate inclusion probability correctness discussed

above. We know that the univariate inclusion probability Pr[r_k ∈ R_i] = |R|/i. For any arbitrary

value of |S| ≤ |R|, assume that we have the correct probabilities when we have seen i − 1 input
records, i.e. Pr[S ⊆ R_{i−1}] = C(|R|, |S|) / C(i−1, |S|). When the ith record is processed (i > |R|), we have

Pr[S ⊆ R_i] = Pr[S ⊆ R_{i−1}] Pr[r_i ∈ R_i] Pr[none of S's records are expelled] +

              Pr[S ⊆ R_{i−1}] Pr[r_i ∉ R_i]

            = (C(|R|, |S|) / C(i−1, |S|)) ((|R|/i) ((|R| − |S|)/|R|) + (1 − |R|/i))

            = (C(|R|, |S|) / C(i−1, |S|)) ((i − |S|)/i)

            = C(|R|, |S|) / C(i, |S|),

which is the desired probability.
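The induction can be checked mechanically for small cases by unrolling every random choice of the algorithm into an exact distribution over reservoir contents (a brute-force sketch, not the dissertation's proof; names are illustrative):

```python
from fractions import Fraction

def reservoir_distribution(n, R):
    """Exact distribution over reservoir contents after processing records
    1..n with classic reservoir sampling, obtained by enumerating every
    random choice with exact rational probabilities."""
    dist = {frozenset(range(1, R + 1)): Fraction(1)}   # start-up fill
    for i in range(R + 1, n + 1):
        nxt = {}
        for sample, pr in dist.items():
            # record i is skipped with probability 1 - R/i
            nxt[sample] = nxt.get(sample, 0) + pr * (1 - Fraction(R, i))
            # otherwise it replaces a uniformly chosen victim
            for victim in sample:
                s2 = (sample - {victim}) | {i}
                nxt[s2] = nxt.get(s2, 0) + pr * Fraction(R, i) / R
        dist = nxt
    return dist
```

Every size-s subset S of the n records should then appear with probability C(|R|, s)/C(n, s), matching the derivation above.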

3.2 Sampling: Sometimes a Little is not Enough

One advantage of random sampling is that samples usually offer statistical guarantees on the

estimates they are used to produce. Typically, a sample can be used to produce an estimate for

a query result that is guaranteed to have error less than ε with a probability δ (see Cochran for a

nice introduction to sampling [16]). The δ value is known as the confidence of the estimate.

Very large samples are often required to provide accurate estimates with suitably high

confidence. The need for very large samples can be easily explained in the context of the

Central Limit Theorem (CLT) [27]. The CLT implies that if we use a random sample of size

N to estimate the mean μ of a set of numbers, the error of our estimate is usually normally
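To make the CLT argument concrete, the usual normal-approximation sample-size bound N ≈ (z_{δ/2} σ / ε)² can be sketched as follows (illustrative names; assumes the population standard deviation σ is known):

```python
from statistics import NormalDist

def required_sample_size(sigma, eps, delta):
    """CLT-based sketch: smallest N such that the sample mean of N draws
    (population std dev sigma) is within eps of the true mean with
    probability at least 1 - delta."""
    z = NormalDist().inv_cdf(1 - delta / 2)   # two-sided normal quantile
    return int((z * sigma / eps) ** 2) + 1
```

For example, a tight error bound (ε = 0.01σ at 95% confidence) already demands tens of thousands of samples, which is the source of the "very large sample" requirement.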


In this chapter, we first review the literature on reservoir sampling algorithms. We then

present the summary of existing work on biased sampling.

2.1 Related Work on Reservoir Sampling

Sampling has a very long history in the data management literature, and research continues

unabated today [2, 3, 8, 14, 15, 28, 32, 33, 35, 51, 52]. However, most previous papers

(including the aforementioned references) are concerned with how to use a sample, and not with

how to actually store or maintain one. Most of these algorithms could be viewed as potential

users of a large sample maintained as a geometric file.

As mentioned in the introduction chapter, a series of papers by Olken and Rotem (including

two papers listed in the References section [25, 27]) probably constitute the most well-known

body of research detailing how to actually compute samples in a database environment. Olken

and Rotem give an excellent survey of work in this area [26]. However, most of this work is very

different than ours, in that it is concerned primarily with sampling from an existing database

file, where it is assumed that the data to be sampled from are all present on disk and indexed by

the database. Single pass sampling is generally not the goal, and when it is, management of the

sample itself as a disk-based object is not considered.

The algorithms in this dissertation are based on reservoir sampling, which was first de-

veloped in the 1960s [11, 38]. In his well-known paper [53], Vitter extends this early work by

describing how to decrease the number of random numbers required to perform the sampling.

Vitter's techniques could be used in conjunction with our own, but the focus of existing work

on reservoir sampling is again quite different from ours; management of the sample itself is not

considered, and the sample is implicitly assumed to be small and in-memory. However, if we re-

move the requirement that our sample of size N be maintained on-line, so that it is always a valid

snapshot of the stream and must evolve over time, then there are sequential sampling techniques

related to reservoir sampling that could be used to build (but not maintain) a large, on-disk sample

(see Vitter [54], for example).

online sample targeted towards more recent data [7], no existing methods have considered how to

handle very large samples that exceed the available main memory.

In this dissertation we describe a new data organization called the geometric file and related

online algorithms for maintaining a very large, disk-based sample from a data stream. The

dissertation is divided into four parts. In the first part we describe the geometric file organization

and detail how geometric files can be used to maintain a very large simple random sample. In

the second part we propose a simple modification to the classical reservoir sampling algorithm

to compute a biased sample in a single pass over the data stream and describe how the geometric

file can be used to maintain a very large biased sample. In the third part we develop techniques

which allow a geometric file to itself be sampled in order to produce smaller sets of data objects.

Finally, in the fourth part, we discuss secondary index structures for the geometric file. Index

structures are useful to speed up search and discovery of required information from a huge

sample stored in a geometric file. The index structures must be maintained concurrently with

constant updates to the geometric file and at the same time provide efficient access to its records.

We now give an introduction to these four parts of the dissertation in subsequent sections.

1.1 The Geometric File

If one accepts the notion that being able to maintain a very large (but fixed size) random

sample from a data stream is an important problem, it is reasonable to ask: Is maintaining such

a sample difficult or costly using modern algorithms and hardware? Fortunately, modern storage

hardware gives us the capacity to inexpensively store very large samples that should suffice for

even difficult and emerging applications. A terabyte of commodity hard disk storage now costs

less than $1,000. Given current trends, we should see storage costs of $1,000 per petabyte by

the year 2020. However, even given such large storage capacities, it turns out that maintaining a

large sample is difficult using current technology. The problem is not purchasing the hardware to

store the sample; rather, the problem is actually getting the samples onto disk, so as to guarantee

the statistical randomness of the sample, in the face of data streams that may exceed tens of

gigabytes per minute in the case of a network monitoring application.

3.9 Why Is Reservoir Sampling with a Geometric File Correct?

We discuss the correctness of the geometric file by answering the following questions:

1. Why is the classical reservoir sampling algorithm (presented as Algorithm 1) correct? That
is, what is the invariant maintained by Algorithm 1?

2. Why is the obvious disk-based extension of Algorithm 1 (presented as Algorithm 2)
correct? That is, how does Algorithm 2 maintain the invariant of Algorithm 1 via the use of
a main-memory buffer?

3. Why is the proposed geometric-file-based sampling technique in Algorithm 3 correct?

We have answered the first question in Section 3.1. We discuss the second and third

questions here.

3.9.1 Correctness of the Reservoir Sampling Algorithm with a Buffer

Algorithm 2 makes use of a main-memory buffer of size |B| to buffer new samples.

The buffered samples logically represent a set of samples that should have been used to replace

on-disk samples in order to preserve the correctness of the sampling algorithm, but that have not

yet been moved to disk for performance reasons (that is, due to lazy writes).

It is not hard to see that the invariant maintained by Algorithm 1 is also maintained by

Algorithm 2 in step (6). The new records are sampled with the same probability |R|/i. The only

difference is that newly sampled records are added to the reservoir using steps (7-14) instead of

simple steps (5-6) of Algorithm 1. We now discuss why these steps are equivalent.

One straightforward way of keeping the sampled records in the buffer and doing lazy writes

is as follows. Every time we decide to add a new sample to the buffer (i.e., with probability

|R|/i), we also generate a random number between 1 and |R| to decide its position in the reservoir.

However, we store this position in the position array and thus avoid an immediate disk seek.

If we happen to generate a position that is already in the position array, we overwrite the

corresponding record in the buffer with the newly sampled record. If we would have flushed that

record to disk using the classic algorithm (rather than buffering it), we would have replaced it

with the newly sampled record. Thus we would obtain the same result. Once the buffer is full we

Definition 2. If f is the user-defined bias function and f' is the actual bias function, then the

distance between these two functions is defined as totalDist(f, f') = Σ_{i=1}^{N} dist(r_i), where

    dist(r_i) = | f'(r_i) / Σ_{k=1}^{N} f'(r_k) − f(r_i) / Σ_{k=1}^{N} f(r_k) |

For a data stream with no overweight records, totalDist(f, f') = 0 (the best case). The

worst-case distance is given by Theorem 1 and is analyzed and proved in the Appendix of this


Theorem 1. Given a set of streaming records r_1, r_2, ..., r_N and a user-defined weighting function

f, Algorithm 7 will sample with an actual bias function f' where totalDist(f, f') is upper
bounded by an expression (derived in the Appendix) that depends only on |R| and on the sorted

weights f(r'_1), ..., f(r'_N), where r'_1, r'_2, ..., r'_N is the permutation (reordering) of the streaming

records such that f(r'_1) ≤ f(r'_2) ≤ ... ≤ f(r'_N).

According to this theorem, the worst case occurs when the reservoir is initially filled (on

startup) with the R records having the smallest possible weights (that is, we have the smallest

totalWeight when the reservoir is filled) and we encounter the record with the largest weight

immediately thereafter. We evaluate the effect of this worst-possible ordering in the experimental

section of this dissertation.

4.2 Worst Case Analysis for Biased Reservoir Sampling Algorithm

Algorithm 7 computes a biased sample according to f', where f' is a "close" function to a

user-defined weighting function f according to the following distance metric:

totalDist(f, f') = Σ_{i=1}^{N} dist(r_i), where

    dist(r_i) = | f'(r_i) / Σ_{k=1}^{N} f'(r_k) − f(r_i) / Σ_{k=1}^{N} f(r_k) |
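This metric is straightforward to compute directly (a sketch with illustrative names):

```python
def total_dist(f_target, f_actual, records):
    """Distance between the user-defined bias function f and the actual
    bias function f': sum over records of the absolute difference of
    their normalized weights."""
    sum_t = sum(f_target(r) for r in records)
    sum_a = sum(f_actual(r) for r in records)
    return sum(abs(f_actual(r) / sum_a - f_target(r) / sum_t) for r in records)
```

Note that the metric compares normalized weights, so it is invariant to rescaling either bias function by a constant.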

each), whereas 10, 344 might mean that 400 seconds of disk time is spent on random disk I/Os.

This is important when one considers that the time required to write 1GB to a disk sequentially is

only around 25 seconds. While minimizing α is vital, it turns out that we do not have the freedom

to choose α. In fact, to guarantee that the total size of all existing subsamples is |R|, the choice of α is

governed by the ratio of |R| to the size of the buffer |B|:

Lemma 1. (The size of a geometric file is |R|) ⟺ (1 − α = |B|/|R|)

Proof. In the proof (and consequently the Lemma) we ignore the fact that |B| × α^{i−1} may not

be integral; we also ignore the storage associated with auxiliary structures such as the stacks and

the beta segments. In this case, the geometric file is simply a collection of subsamples of decaying

size. We know that the largest subsample on disk is created by the most recent buffer flush and

has |B| records in it. From Observation 1, the size of the ith subsample of a file is |B| × α^{i−1}.
It then follows from Observation 2 that the total size of all subsamples of a geometric file is

Σ_{i=1}^{∞} |B| × α^{i−1} = |B|/(1 − α) = |R|, and thus (1 − α) = |B|/|R|.
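The geometric decay of subsample sizes, and the identity in Lemma 1, can be checked numerically (a sketch with illustrative names; `min_size` truncates the infinite series):

```python
def subsample_sizes(B, alpha, min_size=1e-6):
    """Sizes |B| * alpha^(i-1) of a geometric file's subsamples, truncated
    once they drop below min_size; with 1 - alpha = |B|/|R|, the sizes
    sum (essentially) to |R| = |B|/(1 - alpha)."""
    sizes = []
    s = float(B)
    while s >= min_size:
        sizes.append(s)
        s *= alpha
    return sizes
```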

We will address this limitation in Section 3.10.

3.8.2 Choosing a Value for Beta

It turns out that the choice of β is actually somewhat unimportant, with far less impact

than α. For example, if we allocate 32KB for holding our β in-memory samples for each

subsample, and |B|/|R| is 0.01, then as described above, adding a new subsample requires that

1029 segments be written, which will require on the order of 1029 seeks. Redoing this calculation

with 1MB allocated to buffer samples from each on-disk subsample, the number of on-disk
segments is ⌊log(β/|B|)/log α⌋, or 687. By increasing the amount of main memory devoted

to holding the smallest segments for each subsample by a factor of 32, we are able to reduce

the number of disk head movements by less than a factor of two. Thus, we will not consider

optimizing β. Rather, we will fix β to hold a set of samples equivalent to the system block size,

and search for a better way to increase performance.

4.3 Biased Reservoir Sampling With The Geometric File

It is easy to use the biased reservoir sampling algorithm with a geometric file. To use the

geometric file for biased sampling, it is vital that we be able to compute the true weight of any

given record. To allow this, we will require that the following auxiliary information be stored:

Each record r will have its effective weight r.weight stored along with it in the geometric
file on disk. Once totalWeight becomes large, we can expect that for each new record r,
r.weight = f(r). However, for the initial records from the data stream, these two values
will not necessarily be the same.

Each subsample S_i will have a weight multiplier M_i associated with it. Again, for subsam-
ples containing records produced by the data stream after totalWeight becomes large, M_i
will typically be one. For efficiency, M_i can be buffered in main memory. Along with the
effective weight, the weight multiplier can give us the true weight for a given record, which
will be M_i × r.weight.

Algorithmic changes: Given that we need to store this auxiliary information, the algorithms

for sampling from a data stream using the geometric file will require three changes to support

biased sampling. These modifications are described now:

During start-up. To begin with, the reservoir is filled with the first |R| records from the
stream. For each of these initial records, r.weight is set to one. Let totalWeight be the
sum of f(r) over the first |R| records. When the reservoir is finished filling, M_i is set to
totalWeight/|R| for every one of the initial subsamples. In this way, the true weight of
each of the first |R| records produced by the data stream is set to be the mean value of f(r)
for the first |R| records. Giving the first |R| records a uniform true weight is a necessary
evil, since they will all be overwritten by subsequent buffer flushes with equal probability.

As subsequent records are produced by the data stream. Just as suggested by Algo-
rithm 4, additional records produced by the stream are added to the buffer with probability
(|R| f(r_i))/totalWeight, so that at least initially, the true weight of the ith record is exactly
f(r_i). The interesting case is when (|R| f(r_i))/totalWeight > 1 when the ith record is produced by the
data stream. In this case, we must scale the true weight of every existing record up so that
(|R| f(r_i))/totalWeight = 1. To accomplish this, we do the following:
1. For each on-disk subsample, M_j is set to M_j × (|R| f(r_i))/totalWeight.
2. For each sampled record still in the buffer, r_j.weight is set to r_j.weight × (|R| f(r_i))/totalWeight.
3. Finally, totalWeight is set to |R| f(r_i).
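These three scale-up steps might be sketched as follows (illustrative names; the on-disk multipliers and buffered record weights are modeled as in-memory lists):

```python
def process_new_weight(f_r, total_weight, R_size, multipliers, buffered_weights):
    """Sketch of the scale-up step: if the insertion probability
    |R|*f(r)/totalWeight would exceed 1, scale every existing true
    weight up so that it becomes exactly 1."""
    p = R_size * f_r / total_weight
    if p > 1:
        for j in range(len(multipliers)):        # 1. on-disk subsample multipliers
            multipliers[j] *= p
        for j in range(len(buffered_weights)):   # 2. records still in the buffer
            buffered_weights[j] *= p
        total_weight = R_size * f_r              # 3. new totalWeight
        p = 1.0
    return p, total_weight
```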

7.3.2 Discussion of Experimental Results

Not surprisingly, these results suggest that the geometric file structure based sampling

methods are superior to the more obvious naive algorithms, both in the batch and online case.

As expected, the naive batch sampling algorithm took almost constant time to obtain a batch

sample of any size, as it requires a scan of the entire geometric file to retrieve any batch sample.

The geometric file structure based algorithm can produce a small-size batch sample very fast, and

the total sampling time increases linearly with sample size. The time required for the geometric

file structure based algorithm is well below the time required by the naive approach even when

1/10 of the file is sampled. In the case of online sampling, the geometric file structure based

algorithm clearly outperformed the naive approach; this was not surprising, as the naive approach

must expend one disk seek per sample. For both batch and online sampling, the multiple geometric

files framework showed results analogous to the single geometric file case.

As expected, and then demonstrated by the variance plots, the variance of the online naive approach

is smaller than that of the geometric file structure based algorithm. Despite this somewhat larger variance

(less than 10 times for 100k samples) in the response times, the structure-based approach

executed an order of magnitude faster (more than 100 times for 100k samples) than the naive

approach for any number of records sampled, justifying our approach of minimizing the average

square sum of the response time. In other words, we got enough added speed for a small enough

added variance in response time to make the trade-off acceptable. As more and more samples are

obtained, the variance of the structure-based algorithm approached that of the naive algorithm,

making the trade-off even more reasonable for large intended sample sizes.

Finally, we point out that both geometric file structure based algorithms, in the batch and online cases, were able to read sample records from disk at close to the maximum sustained speed of the hard disk, around 45 MB/sec. This is comparable to the rate of a sequential read from disk, the best we can hope for.

Table 7-3. Query timing results for 200-byte records, |R| = 50 million, and |B| = 250k

entire subsample decays. On the other hand, the segment-based index structure has the smallest footprint, since at every buffer flush all stale records are removed from the index structure. This results in a very compact index structure. The disk space usage of the LSM-Tree-based index structure lies between these two index structures. Although stale records are removed at every rolling merge from the part of the index structure being merged, not all of the stale records in the structure are removed at once. As soon as the rate of removal of stale records stabilizes, the disk footprint also becomes stable.

Finally, we compared the index look-up speed of these three index structures. We report index look-up and geometric file access times for queries of different selectivities. As expected, the geometric file access time remains constant irrespective of the index structure option and increases linearly as the query produces more output tuples. The index look-up time varied across the three index structures. The segment-based index structure (the slowest) was orders of magnitude slower than the LSM-Tree-based index structure (the fastest). This is mainly because the segment-based index structure requires index lookups in several thousand B+-Trees for any selectivity query, whereas the LSM-Tree-based structure uses a single LSM-Tree, requiring a small, constant number of seeks. The performance of the subsample-based index structure lies in

Scheme            Selectivity    Index Time   File Time   Total Time
Segment-based     Point Query    6.2488       0.0338      6.2826
                  10 recs        9.6186       0.1267      9.7453
                  100 recs       12.9885      0.9288      13.9173
                  1000 recs      17.6891      5.9754      23.6645
Subsample-based   Point Query    2.50717      0.0156      2.5227
                  10 recs        4.92744      0.1763      5.1037
                  100 recs       7.2387       0.8637      8.1024
                  1000 recs      9.9837       6.1363      16.1200
LSM-Tree-based    Point Query    0.00505      0.0174      0.0224
                  10 recs        0.00967      0.1565      0.1661
                  100 recs       0.01440      0.8343      0.8487
                  1000 recs      0.05987      4.9961      5.0559


a catastrophic event, but it increases the disk I/O associated with stack maintenance and leads to

fragmentation, and so it is an event that we would like to render very rare.

To avoid this, we observe that if the stack associated with a subsample S contains any samples at a given moment, then S has had fewer of its own samples removed than expected. Thus, our problem of bounding the growth of S's stack is equivalent to bounding the difference between the expected and the observed number of samples that S loses as |B| new samples are added to the reservoir, over all possible values of |B|.

To bound this difference, we first note that after adding |B| new samples to the reservoir, the probability that any existing sample in the reservoir has been over-written by a new sample is 1 − (1 − 1/|R|)^|B|. During the addition of new records to the reservoir, we can view a subsample S of initial size |B| as a set of |B| identical, independent Bernoulli trials (coin flips). The ith trial determines whether the ith sample was removed from S. Given this model, the number of samples remaining in S after |B| new samples have been added to the reservoir is binomially distributed with |B| trials and P = Pr[s ∈ S remains] = (1 − 1/|R|)^|B|. Since we are interested in characterizing the variance in the number of samples removed from S primarily when |B|P is large, the binomial distribution can be approximated with very high accuracy using a normal distribution with mean μ = |B|P and standard deviation σ = √(|B|P(1 − P)) [42]. Simple arithmetic implies that the greatest variance is achieved when a subsample has on expectation lost 50% of its records to new samples (P = 0.5); at this point the standard deviation σ is 0.5√|B|. Since we want to ensure that stack overruns are essentially impossible, we choose a stack size of 3√|B|. This allows the amount of data remaining in a given subsample to be up to six standard deviations from the norm without a stack overflow, and is not too costly an additional overhead. A quick lookup in a standard table of normal probabilities tells us that this will yield only around a 10^−9 probability that any given subsample overflows its stack. While achieving such a small probability may seem like overkill, it is important to remember that many thousands of subsamples may be created during the life of the geometric file, and we want to ensure that very few of them overflow their respective stacks. If 100,000 on-disk segments

not deleted from the index tree until the subsample completely decays (when the entire tree is deleted). We refer to an index record as a stale record if it belongs to a segment of a subsample that has already been overwritten (lost).

Recall that we have recorded a segment number in an additional field along with each index record. For a given subsample, we keep track of which of its segments have decayed so far and use this information to ignore index records that are stale. The search returns all valid index records that satisfy the search criteria. We first sort these index records by their page number attribute and then retrieve the actual records from the geometric file and return them as the query result. Although the subsample-based index structure maintains and must search far fewer B+-Trees compared to the segment-based index structure, we expect a reasonable search time per B+-Tree due to the smaller size and the lazy deletion policy.

6.5 An LSM-Tree-Based Index Structure

An alternative to the segment-based and subsample-based index structures is to build a single index structure for the entire geometric file and maintain it as new records are inserted into the file. Thus, we design a third index structure that makes use of the LSM-Tree index [44]. The LSM-Tree is a disk-based data structure designed to provide low-cost indexing in an environment with a high rate of inserts and deletes.

6.5.1 An LSM-Tree Index

An LSM-tree is composed of two or more tree-like component data structures. The smallest component of the index always resides entirely in main memory (referred to as the C0 tree), and all other, larger components reside on disk (referred to as C1, C2, ..., Cj). A schematic picture of an LSM-tree with two components is depicted in Figure 2.1 of the original LSM-Tree paper [44]. Although the C1 (and higher) components are disk-resident, the most frequently referenced nodes (in general, nodes at higher levels) of these trees are buffered in main memory for performance.

LSM-Tree insertions and deletions: Index records are first inserted into the main-memory-resident C0 component, after which they migrate to the C1 component that is stored on disk.
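As a rough sketch of this two-component idea (a toy stand-in, not the actual LSM-Tree of [44], which uses B-Tree-like components and a rolling merge process): an in-memory C0 is periodically merged into a sorted "on-disk" C1, and lookups consult C0 first so that the newest data wins:

```python
import bisect

class TinyLSM:
    """Toy two-component LSM: a dict-based C0 in memory and a sorted
    list of (key, value) pairs standing in for the disk-resident C1."""

    def __init__(self, c0_capacity=4):
        self.c0 = {}                 # memory-resident component
        self.c1 = []                 # sorted (key, value) pairs ("on disk")
        self.c0_capacity = c0_capacity

    def insert(self, key, value):
        self.c0[key] = value
        if len(self.c0) >= self.c0_capacity:
            self._merge()            # migrate C0 entries into C1

    def _merge(self):
        # Merge C0 into C1; C0 entries overwrite older C1 entries.
        merged = dict(self.c1)
        merged.update(self.c0)
        self.c1 = sorted(merged.items())
        self.c0 = {}

    def lookup(self, key):
        if key in self.c0:           # newest component is checked first
            return self.c0[key]
        i = bisect.bisect_left(self.c1, (key,))
        if i < len(self.c1) and self.c1[i][0] == key:
            return self.c1[i][1]
        return None
```

The real structure keeps C1 as a tree and merges lazily and sequentially, which is what makes the high insert rate affordable.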

to disk, it must overwrite a truly random subset of the records on disk. Thus, when performing the flush, we need to randomly choose records from the reservoir to replace. This implies that the on-disk subsamples (which are of expected size n/(1−α), nα/(1−α), nα²/(1−α), and so on) will lose around n, nα, nα² records, and so on, respectively. However, while the number of records replaced in a subsample S will on expectation be proportional to the size of S (and hence equal to the size of S's largest on-disk segment), this replacement must be performed in a randomized fashion. The situation can be illustrated as follows. Say we have a set of numbers, divided into three buckets, as shown in Figure 3-4. Now, we want to add five additional numbers to our set by randomly replacing five existing numbers. While we do expect numbers to be replaced in a way that is proportional to bucket size (Figure 3-4 (b)), this is not always what will happen (Figure 3-4 (c)).

Algorithm 4 Randomized Segmentation of the buffer
1: for each subsample i in the reservoir R do
2:   Set Ni = Number of records in Si
3:   Set Mi = 0
4: for each record r in the buffer B do
5:   Randomly choose a victim subsample Si such that Pr[choosing Si] = Ni / Σj Nj
6:   Ni−−; Mi++
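A direct Python transcription of Algorithm 4 (illustrative; only the counts matter, so the records themselves are not represented):

```python
import random

def randomized_segmentation(subsample_sizes, buffer_size, rng=random):
    """For each buffered record, pick a victim subsample with probability
    proportional to its current size. Returns M, where M[i] is the number
    of buffered records assigned to overwrite subsample S_i."""
    N = list(subsample_sizes)        # N[i] = records currently in subsample S_i
    M = [0] * len(N)
    for _ in range(buffer_size):
        total = sum(N)
        # draw victim index i with Pr = N[i] / sum_j N[j]
        x = rng.uniform(0, total)
        i = 0
        while x >= N[i]:
            x -= N[i]
            i += 1
        N[i] -= 1                    # the victim subsample shrinks by one
        M[i] += 1
    return M
```

Because each draw is proportional to the current sizes, the expected M[i] is proportional to subsample size, but any particular outcome can deviate, which is exactly the variance handled in Section 3.7.2.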

In order to correctly introduce this variance into the geometric file, we need to add a few additional steps to Algorithm 3. Before we add a new subsample to disk via a buffer flush in Step (21), we first perform a logical, randomized partitioning of the buffer into segments, described by Algorithm 4. In Algorithm 4, each newly-sampled record is randomly assigned to replace a sample from an existing, on-disk subsample, so that the probability of each subsample losing a record is proportional to its size. The result of Algorithm 4 is an array of M values, where Mi tells Step (21) of Algorithm 3 how many records should be assigned to overwrite the ith on-disk

3.7.2 Handling the Variance

Of course, there is no guarantee that M1 = n, M2 = nα, M3 = nα², and so on, so there is no guarantee that Algorithm 3 will overwrite exactly the number of records contained in each

The worst case for Algorithm 7 occurs when (1) the reservoir is initially filled with the |R| records having the smallest possible weights and (2) we encounter the record r^max with the largest weight immediately thereafter. Theorem 1 presented an upper bound on totalDist(f, f') in this worst case. In this section, we first provide the proof of this worst case for Algorithm 7 and then prove the upper bound on totalDist(f, f') given by Theorem 1.

4.2.1 The Proof for the Worst Case

To prove the worst case for Algorithm 7, we first prove the following three propositions. These proofs lead us to the worst-case argument. If we denote the record with the highest weight in the stream as r^max, and use r_i^max to denote the case where r^max is located at position i in the stream, then for any given random ordering of the streaming records r_1, ..., r_{i−1}, r^max, ..., r_N, we prove that

1. Moving the record r^max earlier in the range r_{|R|+1} ... r_N can not decrease totalDist(f, f').

2. When we are initially filling the reservoir, choosing the |R| records with the smallest possible weights maximizes totalDist(f, f').

3. Reordering any record that appears after r^max in the range r_{i+1} ... r_N can not increase totalDist(f, f').

The proof of the first proposition regarding moving r^max earlier in the stream:

We prove this proposition by showing that if we move r_i^max to r_{i−1}^max, totalDist(f, f') can not decrease. If r^max is not an overweight record, the claim trivially holds, as moving a non-overweight record does not change totalDist(f, f'). If r^max is an overweight record, we prove that totalDist(f, f') increases because of the move. We first compute totalDist1(f, f') for r_i^max and then compute totalDist2(f, f') for r_{i−1}^max. We prove the claim by showing totalDist2(f, f') − totalDist1(f, f') ≥ 0.

1. An Expression for totalDist1(f, f') for r_i^max

We start with the totalDist formula

Reservoir sampling can be very efficient, with time complexity less than linear in the size of the stream. Variations on the algorithm allow it to "go to sleep" for a period of time, during which it only counts the number of records that have passed by [53]. After a certain number of records have been seen, the algorithm "wakes up" and captures the next record from the stream.

Correctness of the reservoir sampling algorithm: The reservoir sampling process can be viewed as a two-phase process: (1) adding the first |R| records to the reservoir, and (2) adding subsequent records until the input is consumed. A reservoir algorithm should maintain the following invariant in the second phase: after each record is processed, the reservoir should be a simple random sample of size |R| of the records processed so far. Algorithm 1 maintains this invariant in steps (2-6) as follows [11, 38]. The ith record processed (i > |R|) is added to the reservoir with probability |R|/i by step 4. We need to show that for all other records processed thus far, the inclusion probability is also |R|/i. Let rk be any record in the reservoir s.t. k ≠ i. Let Ri denote the state of the reservoir just after the addition of the ith record. Thus, we are interested in Pr[rk ∈ Ri]:

Pr[rk ∈ Ri] = Pr[rk ∈ R_{i−1}] × (Pr[ri ∈ Ri] × Pr[rk not replaced by ri] + Pr[ri ∉ Ri])
            = (|R|/(i−1)) × ((|R|/i) × ((|R|−1)/|R|) + (1 − |R|/i))
            = (|R|/(i−1)) × ((i−1)/i)
            = |R|/i
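The classic reservoir algorithm whose invariant is proved above can be written in a few lines of Python (a sketch of Algorithm 1; the dissertation's version of course operates over disk-resident data):

```python
import random

def reservoir_sample(stream, R_size, rng=random):
    """Maintain a simple random sample (without replacement) of size |R|
    over a stream of unknown length."""
    reservoir = []
    for i, record in enumerate(stream, start=1):
        if i <= R_size:
            reservoir.append(record)                   # phase 1: fill the reservoir
        elif rng.random() < R_size / i:                # phase 2: keep with prob |R|/i
            reservoir[rng.randrange(R_size)] = record  # evict a uniform victim
    return reservoir
```

After the ith record is processed, every record seen so far is in the reservoir with probability |R|/i, which is exactly the invariant derived above.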

The correctness of the inclusion probability alone is not sufficient to prove the required invariant. Consider the systematic sampling described in Chapter 8 of Cochran's book [16]. To select a sample of |R| units, systematic sampling takes a unit at random from the first k units and "every kth" unit thereafter. Although the inclusion probability in systematic sampling is the same as in simple random sampling, the properties of the sample, such as its variance, can be far different. It is known that the variance of systematic sampling can be better or worse than that of simple random sampling, depending on the data heterogeneity and the correlation coefficient between pairs of sampled units.
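A small simulation illustrates the point (illustrative code, not from the dissertation): on a perfectly ordered population, both schemes give each unit the same inclusion probability, yet the variances of the resulting sample means differ sharply:

```python
import random

def sample_mean_variance(pop, R, trials, method, rng):
    """Empirical variance of the sample mean under the given scheme."""
    means = []
    k = len(pop) // R
    for _ in range(trials):
        if method == "systematic":      # one random start, then every kth unit
            start = rng.randrange(k)
            s = pop[start::k][:R]
        else:                           # simple random sample without replacement
            s = rng.sample(pop, R)
        means.append(sum(s) / len(s))
    mu = sum(means) / len(means)
    return sum((m - mu) ** 2 for m in means) / len(means)

rng = random.Random(7)
population = list(range(1000))          # a strong trend: perfectly ordered values
v_sys = sample_mean_variance(population, 10, 2000, "systematic", rng)
v_srs = sample_mean_variance(population, 10, 2000, "srs", rng)
# On trending data, a systematic sample is forced to spread across the whole
# range, so its means vary far less than the means of simple random samples.
```

On negatively correlated or periodic data the comparison can flip, which is the point made above.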

several approximate query processing applications [1, 21, 30, 34, 39]. In general, the drawback

of making use of a batch sample is that the accuracy of any estimator which makes use of the

sample is fixed at the time that the sample is taken, whereas the benefit of batch sampling is that

the sample can be drawn with very high efficiency.

We will also consider the case where N is not known beforehand, and we want to implement an iterative function GetNext. Each call to GetNext results in an additional sampled record being returned to the caller, and so N consecutive calls to GetNext result in a sample of size N. We will refer to a sample retrieved in this manner as an online or sequential sample. The drawback of online sampling compared to batch sampling is that it is generally less efficient to obtain a sample of size N using online methods. However, since the consumer of the sample can call GetNext repeatedly until an estimator with enough accuracy is obtained, online sampling is more flexible than batch sampling. An online sample retrieved from a geometric file can be useful for many applications, including online aggregation [32, 33]. In online aggregation, a database system tries to quickly gather enough information to approximate the answer to an aggregate query. As more and more information is gathered, the approximation quality improves, and the online sampling procedure is halted when the user is happy with the approximation accuracy.
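The consumer side of online sampling can be sketched as a loop that keeps calling GetNext until a CLT-based confidence interval is tight enough. This is an illustrative sketch only: GetNext is simulated here by a random draw, and the stopping rule is a generic 95% interval rather than the dissertation's estimators:

```python
import math
import random

def online_average(get_next, target_half_width, z=1.96, min_n=30, max_n=100000):
    """Call get_next() until the 95% CI on the running mean is narrow enough.
    Returns (estimate, number_of_calls)."""
    n, total, total_sq = 0, 0.0, 0.0
    while n < max_n:
        x = get_next()
        n += 1
        total += x
        total_sq += x * x
        if n >= min_n:
            mean = total / n
            var = (total_sq - n * mean * mean) / (n - 1)   # running sample variance
            half_width = z * math.sqrt(max(var, 0.0) / n)  # CLT-based interval
            if half_width <= target_half_width:
                break
    return total / n, n

rng = random.Random(42)
estimate, calls = online_average(lambda: rng.uniform(0, 100), target_half_width=2.0)
```

The loop stops as soon as the user-specified accuracy is reached, which is exactly why drawing one sample at a time is worth its extra per-record cost.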

5.3 Batch Sampling From a Geometric File

5.3.1 A Naive Algorithm

The most obvious way to implement batch sampling is to make use of the reservoir sampling algorithm to draw a sample of size N from a geometric file of size |R| in a single pass. As the following lemma asserts, the resulting sample is also a sample of size N from the original data stream.

Lemma 8. The reservoir sampling algorithm over a geometric file produces a correct random

sample of the stream.

Proof. If S is the batch sample of size N retrieved from a geometric file R of size |R| using the reservoir sampling algorithm, then we know from the correctness of the reservoir sampling
As the buffer fills. When the buffer fills and the jth subsample is to be created and written
to disk, Mj is set to 1.

4.4 Estimation Using a Biased Reservoir

The biased sampling algorithm presented gives a user the opportunity to make use of

different weighting algorithms and estimators, depending upon the particular application domain.

We discuss one such simple estimator, the standard Horvitz-Thompson estimator [50] for a

sample computed using our algorithm. We derive the correlation covariancee) between the

Bernoulli random variables governing the sampling of two records ri and rj using our algorithm

and use this covariance to derive the variance of a Horvitz-Thomson estimator. Combined with

the Central Limit Theorem, the variance can then be used to provide bounds on the estimator's

accuracy. The estimator is suitable for the SUM aggregate function (and, by extension, the

AVERAGE and COUNT aggregates) over a single database table for which the reservoir is

maintained. Though handling more complicated queries using the biased sample is beyond the

scope of the paper, it is straightforward to extend the analysis of this Section to more complicated

queries such as joins [32].

Imagine that we have the following single-table query, whose (unknown) answer is q:

SELECT SUM(g1(r))
FROM R AS r
WHERE g2(r)

Given such a query, let g(r) = g1(r) if g2(r) evaluates to true, and 0 otherwise. Let Ri be the state of the biased sample just after the ith record in the stream has been processed. Then the unbiased Horvitz-Thompson estimator for the query answer q can be written as q̂ = Σ_{r ∈ Ri} g(r)/Pr[r ∈ Ri]. In the Horvitz-Thompson estimator, each record is weighted according to the inverse of its sampling probability.
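In code, the estimator is simply an inverse-probability-weighted sum (a sketch; in the real system the inclusion probabilities come from the biased sampler's own bookkeeping):

```python
def horvitz_thompson(sample, g, inclusion_prob):
    """Unbiased estimate of q = sum of g(r) over the whole stream, computed
    from a sample in which r was included with probability inclusion_prob(r)."""
    return sum(g(r) / inclusion_prob(r) for r in sample)
```

For example, if each record was sampled with probability 0.5, every sampled record simply counts twice; records sampled with probability one contribute their exact value, which is why the estimator is unbiased for any bias function.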

Next, we derive the variance of this estimator. To do this, we need a result similar to Lemma 3 that can be used to compute the probability Pr[{rj, rk} ⊆ Ri] under our biased sampling

Making use of the set of stacks is fairly straightforward. Imagine that nα^(i−1) of a buffer's records are sent to overwrite a segment from an existing subsample Si, but according to Algorithm 4, Mi should have been. Then, there are two possible cases:

Case 1: Mi is smaller than nα^(i−1) by some number of records ε. In this case, ε records are removed from the segment that is about to be over-written and pushed onto Si's stack in order to buffer them. This is necessary because these records logically should not be over-written by the records that are going to be added to the disk, but they will be.

Case 2: Mi is larger than nα^(i−1) by some number of records ε. In this case, ε records are popped off of Si's stack to reflect the additional records that should have been removed from Si, but were not.

These stack operations are performed just prior to Step (23) in Algorithm 3. Note that since the final group of segments from a subsample, of total size β, is buffered in main memory, its maintenance does not require any stack operations. Once a subsample has lost all of its on-disk samples, overwrites of records in this set can be handled by simply replacing the records directly.

3.7.3 Bounding the Variance

Because the stacks associated with each subsample will be used with high frequency as insertions are processed, each stack must be maintained with extreme efficiency. Writes should be entirely sequential, with no random disk head movements. To ensure this efficiency and avoid any sort of online reorganization, it is desirable to pre-allocate space for each of the stacks on disk.

To pre-allocate space for these stacks, we need to characterize how much overflow we can expect from a given subsample, which will bound the growth of the subsample's stack. It is important to have a good characterization of the expected stack growth. If we allocate too much space for the stacks, then we allocate disk space for storage that is never used. If we allocate too little space, then the top of one stack may grow up into the base of another. If a stack does overflow, it can be handled by buffering the additional records temporarily in memory or moving the stack to a new location on disk until the stack can again fit in its allocated space. This is not

problem we face using this strategy in the context of online sampling is that we do not know beforehand the value of N, the number of records to be sampled.

Algorithm 9 GetNext for Online Sampling
1: Set NS = Number of subsamples in a geometric file
2: for i = 1 to NS do
3:   Set RecsInSubsam[i] = Size of ith subsample
4:   Set BufferedSubsamSize[i] = 0
5: Randomly choose a subsample Si such that Pr[choosing Si] = RecsInSubsam[i] / |R|
6: RecsInSubsam[i]−−
7: if BufferedSubsamSize[i] == 0 then
8:   Set numRecs to the minimum of bs/r and RecsInSubsam[i]
9:   Read and buffer numRecs records of Si
10:  BufferedSubsamSize[i] = numRecs
11: BufferedSubsamSize[i]−−
12: Return the next available buffered record of Si
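A sketch of Algorithm 9 in Python (illustrative: records are simulated as (subsample, index) pairs, the refill size is left as a plain parameter, and the bookkeeping is simplified relative to the pseudocode):

```python
import random

class OnlineSampler:
    """Draw one record at a time, choosing a subsample with probability
    proportional to its remaining unsampled records, and refilling a
    per-subsample buffer in chunks (one seek plus a sequential read)."""

    def __init__(self, subsample_sizes, chunk_records, rng=random):
        self.remaining = list(subsample_sizes)        # unread records on "disk"
        self.buffered = [0] * len(subsample_sizes)    # records ready in memory
        self.read_pos = [0] * len(subsample_sizes)    # next record index per subsample
        self.chunk = chunk_records
        self.rng = rng

    def get_next(self):
        total = sum(self.remaining) + sum(self.buffered)
        if total == 0:
            return None                               # file exhausted
        # pick subsample i with Pr proportional to its unsampled records
        x = self.rng.uniform(0, total)
        i = 0
        while x >= self.remaining[i] + self.buffered[i]:
            x -= self.remaining[i] + self.buffered[i]
            i += 1
        if self.buffered[i] == 0:
            # refill: one disk seek, then a sequential read of `chunk` records
            num = min(self.chunk, self.remaining[i])
            self.buffered[i] = num
            self.remaining[i] -= num
        self.buffered[i] -= 1
        rec = (i, self.read_pos[i])
        self.read_pos[i] += 1
        return rec
```

Buffering a chunk per refill is what amortizes the seek: the next several GetNext calls that land on the same subsample are served from memory.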

To address this issue, we use a simple heuristic. Every time we refill a buffer, we look at the number of records already sampled from a subsample and assume that the user will ask for the same number of samples as the algorithm progresses. This gives us the planning horizon for which we can determine the number of blocks to be fetched. We also use the obvious constraint that the total number of samples fetched from a subsample should not exceed the number of records in the subsample. Given this, an analytic solution to the problem of minimizing the average squared cost over all calls to GetNext is as follows:

If there are b records per block, then let N/b be the number of blocks in the planning horizon, and let X be the number of equal-size chunks that we read, one per buffer refill. Our goal is to determine the value of X and the number of blocks in each chunk.

We know that the time to read a chunk is proportional to s + (N/b × r)/X, and thus the square sum of the response times of all GetNext calls is X(s + (N/b × r)/X)².

In order to derive a formula for the value of X that minimizes this, we simply differentiate it with respect to X and then solve for the zero:

d/dX [X(s + (N/b × r)/X)²]
  = d/dX [Xs² + 2(N/b)sr + (N/b × r)²/X]
  = s² − (N/b × r)²/X²

Setting this to zero and solving gives X = (N/b × r)/s.
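A quick numeric check of this result, using hypothetical values for the seek time s and the total sequential read time T = N/b × r:

```python
def cost(X, s, T):
    """Square-sum of GetNext response times when the planning horizon is read
    in X equal chunks: each chunk costs one seek (s) plus T/X of sequential read."""
    return X * (s + T / X) ** 2

s = 0.01          # hypothetical seek time: 10 ms
T = 5.0           # hypothetical N/b * r: 5 s of sequential read over the horizon
X_star = T / s    # the analytic optimum from setting the derivative to zero
best = min(range(1, 2001), key=lambda X: cost(X, s, T))  # brute-force check
```

The brute-force minimum over integer chunk counts lands on the analytic optimum T/s, confirming the derivative calculation.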


Random sampling is a ubiquitous data management tool, but relatively little research from the data management community has been concerned with how to actually compute and maintain a sample. In this dissertation we have considered the problem of random sampling from a data stream, where the sample to be maintained is very large and must reside on secondary storage. We have developed the geometric file organization, which can be used to maintain an online sample of arbitrary size with an amortized cost of O(ω × log_|B| (|R|/|B|)) random disk head movements for each newly sampled record. The multiplier ω can be made very small by making use of a small amount of additional disk space.

We have presented a modified version of the classic reservoir sampling algorithm that

is exceedingly simple, and is applicable for biased sampling using any arbitrary user-defined

weighting function f. Our algorithm computes, in a single pass, a biased sample Ri (without

replacement) of the i records produced by a data stream.

We have also discussed certain pathological cases where our algorithm can provide a correctly biased sample only for a slightly modified bias function f'. We have analytically bounded how far f' can be from f in such a pathological case. We have also experimentally evaluated the practical significance of this difference.

We have also derived the variance of a Horvitz-Thompson estimator making use of a sample computed using our algorithm. Combined with the Central Limit Theorem, the variance can then be used to provide bounds on the estimator's accuracy. The estimator is suitable for the SUM aggregate function (and, by extension, the AVERAGE and COUNT aggregates) over a single database table for which the reservoir is maintained.

We have developed efficient techniques which allow a geometric file itself to be sampled in order to produce smaller data objects. We considered two sampling techniques: (1) batch sampling, where the sample size is known beforehand, and (2) online sampling, which implements an iterative function GetNext to retrieve one sample at a time. The goal of these algorithms was to efficiently support further sampling of a geometric file by making use of its own structure.

Figure 3-4. Distributing new records to existing subsamples. (a) Five new samples randomly replace existing samples, which are grouped into three buckets holding 1/5, 1/5, and 3/5 of the total. (b) Most likely outcome: new samples distributed proportionally. (c) Possible (though unlikely) outcome: new samples all distributed to the smallest bucket.

subsample's largest segment. To handle this problem, we associate a stack (or buffer¹) with each of the subsamples. The stack associated with a subsample will buffer any of the subsample's records that logically should not have been over-written during a buffer flush into the subsample (because Mi for some buffer flush for that subsample was smaller than expected), but whose space had to be claimed by the buffer flush in order to write a new subsample to disk. If the size of the stack is positive, it means that the corresponding subsample is larger than expected, because it has had fewer of its records over-written than expected. We also allow a negative stack size. This simply means that some of the subsample's records should have been over-written but were not, because an Mi value for that subsample was larger than expected. A stack size of −k means that k of the subsample's on-disk records logically are not part of the reservoir (even though they are physically present on disk), and should be ignored during query processing.

¹ We use the term "stack" rather than "buffer" to clearly differentiate the extra storage associated with each subsample from the buffer B.


Abhijit Pol was born and brought up in the state of Maharashtra in India. He received his Bachelor of Engineering from Government College of Engineering Pune (COEP), University of Pune, one of the oldest and most prestigious engineering colleges in India, in 1999. Abhijit majored in mechanical engineering and obtained a distinguished record, ranking second in the university merit ranking. He was then employed in the Research and Development department of Kirloskar Oil Engines Ltd. for one year.

Abhijit received his first Master of Science from the University of Florida in 2002, majoring in industrial and systems engineering. Abhijit then worked as a researcher in the Department of Computer and Information Science and Engineering at the University of Florida. He received his second Master of Science and his Doctor of Philosophy (Ph.D.) in computer engineering in 2007.

During his studies at the University of Florida, Abhijit coauthored a textbook titled "Developing Web-Enabled Decision Support Systems." He taught the Web-DSS course several times in the Department of Industrial and Systems Engineering at the University of Florida. He presented several tutorials at workshops and conferences on the need for and importance of teaching DSS material, and he also taught at two instructor-training workshops on DSS development.

Abhijit's research focus is in the area of databases, with special interests in approximate

query processing, physical database design, and data streams. He has presented research papers

at several prestigious database conferences and performed research at the Microsoft Research

Lab. He is now a Senior Software Engineer in the Strategic Data Solutions group at Yahoo! Inc.

If we let X = [|R|f(r^max) + Σ_{k=i+1}^{N} f(r_k)] and Y = [(|R| − 2)f(r^max) + Σ_{k=i+1}^{N} f(r_k)], then substituting the values for X and Y, the difference totalDist2(f, f') − totalDist1(f, f') simplifies to a difference of two fractions with the common numerator 2 × f(r^swap). Since [|R|f(r^max) + Σ_{k=i+1}^{N} f(r_k)] > (|R| − 1)f(r^max), we have

totalDist1(f, f') ≥ (2 × f(r^swap)) / Σ_{k=1}^{N} f(r_k)
                   − (2 × f(r^swap)) / [|R|f(r^max) + Σ_{k=i+1}^{N} f(r_k) + f(r^swap)]

and since [|R|f(r^max) + f(r^swap) + Σ_{k=i+1}^{N} f(r_k)] ≥ Σ_{k=1}^{N} f(r_k), we have

totalDist2(f, f') − totalDist1(f, f') ≥ (2 × f(r^swap)) / Σ_{k=1}^{N} f(r_k)
                                       − (2 × f(r^swap)) / [|R|f(r^max) + Σ_{k=i+1}^{N} f(r_k) + f(r^swap)]
                                       ≥ 0

It turns out that in the worst-case scenario we might have to buffer almost the entire data stream. We describe the case by construction. For a given arbitrary reservoir size |R| and stream size N, we add the first |R| records, all with the same weight wt1 = 1, to the reservoir. Next, we set f(r_{|R|+1}) = [Σ_{k=1}^{|R|} f(r_k)/|R|] + 1 = wt1 + 1 = 2. The inclusion probability of r_{|R|+1} is |R|f(r_{|R|+1})/Σ_{k=1}^{|R|+1} f(r_k) = 2|R|/(|R| + 2) > 1. Since r_{|R|+1} is an overweight record, we buffer it. We construct the remaining records of the stream with f(r_{|R|+2}) = ... = f(r_N) = f(r_{|R|+1}) = 2, so as to have all of them overweight, and we must buffer them all. The priority queue thus contains N − |R| records in it. Since f(r_i) = 1 ∀i ≤ |R|, we have |R|f(r_i)/Σ_{k=1}^{N} f(r_k) = |R|/[|R| + 2(N − |R|)] < 1, and since f(r_i) = 2 ∀i > |R| and N > |R|, we have |R|f(r_i)/Σ_{k=1}^{N} f(r_k) = 2|R|/[|R| + 2(N − |R|)] < 1. Thus, for a well-defined bias function f and the constructed stream, the required queue size is N − |R|. We therefore conclude that for N > |R|, the size of the buffer required for delayed insertion of the overweight records is O(N).
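The construction is easy to check numerically (a sketch; we assume, as in the delayed-insertion scheme, that a buffered overweight record does not contribute to totalWeight until it is actually inserted):

```python
def count_buffered(R_size, N):
    """Adversarial stream: |R| records of weight 1 fill the reservoir, then
    every remaining record has weight 2. A record is overweight when
    |R| * f(r) exceeds the totalWeight of the current reservoir."""
    total = float(R_size)          # the first |R| unit-weight records
    buffered = 0
    for _ in range(N - R_size):
        w = 2.0
        if R_size * w > total:     # overweight: delay it; totalWeight unchanged
            buffered += 1
        else:
            total += w
    return buffered
```

Every record after the first |R| is overweight on arrival, so the queue grows to N − |R| records, matching the O(N) bound above.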

We stress that though this upper bound is quite poor (requiring that we buffer the entire data stream!), it is in fact a worst-case scenario, and the approach will often be feasible in practice. This is because weights will often increase monotonically over time (as in the case where newer records tend to be more relevant for query processing than older ones). Still, given the poor worst-case upper bound, a more robust solution is required, which we now describe.

4.1.3 Adjusting Weights of Existing Samples

Another, orthogonal method for handling overweight records (that can be applied when

the available buffer memory is exceeded) is to simply adjust the bias function and try to do the

best that we can. Specifically, when we encounter an overweight record, we simply bump up the

weights of all existing samples so as to ensure the inclusion probability of the current record is

exactly one. Of course, as a result of this we will not be able to ensure that the weight of each

record ri is exactly f(ri). We describe what we will be able to guarantee in the context of the true

weight of a record:

sampling algorithms. We have also tested these four techniques with the framework that makes use of multiple geometric files. All of the algorithms were implemented on top of the geometric file prototype that was benchmarked in the previous sections.

7.3.1 Experiments Performed

To compare the various options, we used the following setup. We first initialize a geometric file by sampling and adding records from a synthesized data stream to the file for a period of several hours. This ensures a realistic scenario for testing: the reservoir in the file that is to be tested has been filled, a reasonable portion of each initial subsample has been over-written, some of the smaller initial subsamples have been removed from the file, and a number of new subsamples have been created. The parameters used in building the geometric file are the same as those described in Experiment 2 of the previous section (a 50GB file with 50 million, 1KB records). Given such a file, the following set of experiments were performed:

Sampling experiment 1: The goal of this experiment was to compare the two options for

obtaining a batch sample from a geometric file: the naive algorithm, and then the geometric file

structure based algorithm. For both algorithms, we plot the time to perform the sampling as a

function of the desired sample size. Figure 7-2 (a) depicts the plot for a single geometric file;

Figure 7-2 (b) shows an analogous plot for the multiple geometric files option.

Sampling experiment 2: This experiment is analogous to Sampling Experiment 1, except that

online sampling is performed via multiple successive calls to GetNext. The number of records

sampled with multiple calls to GetNext versus the elapsed time is plotted in Figure 7-2 (c)

for both the naive algorithm and the more advanced, geometric file structure based algorithm

designed to increase the sampling rate and even out the response times. The analogous plot for

multiple geometric file case is shown in Figure 7-2 (d). We also plot the variance in response

times over all calls to GetNext as a function of the number of calls to GetNext in Figures 7-

2(e) and 7-2(f) (the first is for a single geometric file; the second is with multiple files). Taken

together, these plots show the trade-off between overall processing time and the potential for

waiting for a long time in order to obtain a single sample.

Figure 3-1. Decay of a subsample after multiple buffer flushes. (The figure shows a subsample's numbered segments before the first buffer flush and after each successive flush, with the beta segment stored in main memory.)

β that are buffered in main memory (subsequently referred to as the "beta segment"), then the

ith buffer flush into R will on expectation overwrite exactly one on-disk segment from S. S

loses an additional segment with every buffer flush until the subsample has only its beta segment

remaining. At the point that only the subsample's beta segment remains, the samples contained

therein can be replaced directly. The reason that the beta segment is buffered in main memory is

that overwriting a segment requires at least one random disk head movement, which is costly. By

storing the beta segment in main memory, we can reduce the number of disk head movements

with little main-memory storage cost. The process is depicted in Figure 3-1.

are replaced, then using a stack of size 3√|B| will yield a very reasonable probability that we experience no overflows of (1 − 10⁻⁹)^100,000, or 99.99%. In practice, the actual probability of experiencing no overflows will be even greater. This is due to the fact that the standard deviation in subsample size for most of a subsample's lifespan will be much less than 0.5√|B|, due to the high percentage of its lifespan during which it has an associated p of less than 0.5 as it slowly loses all of its samples.

3.8 Choosing Parameter Values

Given a specified file size and buffer size, two parameters associated with using the

geometric file must be chosen: α, which is the fraction of a subsample's records that remain after the addition of a new subsample, and β, which is the total size of a subsample's segments that are

buffered in memory.

3.8.1 Choosing a Value for Alpha

In general, it is desirable to minimize α. Decreasing α decreases the number of segments

used to store each subsample. Fewer segments means fewer random disk head movements

are required to write a new subsample to disk, since each segment requires around four disk

seeks to write (one to read the location and one to write a new segment, and similarly two more

considering the cost of subsequently adjusting the stack of the previous owner).

To illustrate the importance of minimizing α, imagine that we have a 1GB buffer and a stream producing 100B records, and we want to maintain a 1TB sample. Assume that we use an α value of 0.99. Thus, each subsample is originally 1GB, and |B| = 10⁷. From Observation 2 we know that n/(1 − α) must be 10⁷, so we must use n = 10⁵. If we choose β = 320 (so that β is around the size of one 32KB disk block), then from Observation 3 we will require ⌈(log 320 − log 10⁷)/log 0.99⌉ ≈ 1029 segments to store the entire new subsample.

Now, consider the situation if α = 0.999. A similar computation shows that we will now require 10,344 segments to store the same 1GB subsample. This is an order-of-magnitude difference, with significant practical importance. With four disk seeks per segment, 1029 segments might mean that we spend around 40 seconds of disk time in random I/Os (at 10ms per random seek).
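As a sanity check, the two segment counts can be recomputed from the Observation 3 formula (a sketch; the figures |B| = 10**7, β = 320, four seeks per segment, and 10 ms per seek are all taken from the text above):

```python
import math

def num_segments(alpha, buffer_records=10**7, beta=320):
    # Observation 3: segments needed to store one subsample of |B| records
    return math.ceil((math.log(beta) - math.log(buffer_records)) / math.log(alpha))

for alpha in (0.99, 0.999):
    segs = num_segments(alpha)
    seek_time = segs * 4 * 0.010  # four seeks per segment at 10 ms each
    print(f"alpha={alpha}: {segs} segments, ~{seek_time:.0f} s of random I/O")
```

Raising α from 0.99 to 0.999 multiplies the segment count, and hence the random-seek cost per buffer flush, by roughly ten.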

Figure 7-1. Results of benchmarking experiments (processing insertions). (a) 50-byte records, 600MB buffer space; (b) 1KB records, 600MB buffer space; (c) 50-byte records, 150MB buffer space. Each panel plots progress against time elapsed (0 hrs to 20 hrs) for the geometric file, multiple geometric files, local overwrite, and scan & virtual memory options.

Algorithm 5 Randomized Segmentation of the Buffer for Multiple Geometric Files
1: for each Sij, the ith subsample in the jth file do
2:   Set Nij = number of records in Sij
3:   Set N′ij = 0
4: for each record r in the buffer B do
5:   Randomly choose a victim subsample Sij such that Pr[choosing Sij] = Nij / Σ_{k,l} Nkl
6:   N′ij = N′ij + 1; add r to the buffer segment for Sij

However, processing additional records from the stream is somewhat different. As more and

more records are produced by the stream, new samples are captured and are added to the buffer

exactly as in Algorithm 3, Steps (15)-(20), until the buffer is full. Once the buffer is full, its record order is then randomized, just as in a single geometric file. Next, the buffer is flushed to disk.

This is where the algorithm is modified. Overwriting records on disk with records from the buffer

is somewhat different, in two primary ways, as discussed next.

Partitioning the buffer: In Algorithm 4, the buffer is partitioned so that the size of each

buffer segment is on expectation proportional to the current size of subsamples in a single file.

In the case of multiple geometric files, we partition the buffer just as in Algorithm 4; however,

we randomly partition the buffer across all subsamples from all geometric files. The number

of buffer segments after the partitioning is the same as the total number of subsamples in the

entire reservoir, and the size of each buffer segment is on expectation proportional to the current

size of each of the subsamples from one of the geometric files. This allows us to maintain the

correctness of the reservoir sampling algorithm. The buffer partitioning steps in case of multiple

geometric files are given in Algorithm 5.
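The partitioning step can be sketched as follows (illustrative names; `subsample_sizes` maps (i, j) to N_ij, and each record's victim choice matches Step 5 of Algorithm 5):

```python
import random

def partition_buffer(buffer, subsample_sizes):
    """Randomly split buffered records into one segment per subsample,
    choosing each record's victim subsample with probability
    proportional to that subsample's current size."""
    keys = list(subsample_sizes)
    weights = [subsample_sizes[k] for k in keys]
    segments = {k: [] for k in keys}
    for record in buffer:
        victim = random.choices(keys, weights=weights)[0]
        segments[victim].append(record)
    return segments
```

On expectation each segment's size is proportional to its subsample's size, which is what the correctness argument for reservoir sampling requires.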

Merging buffer segments with multiple geometric files: This step requires quite a

different approach compared to Algorithm 3's buffer merge algorithm. We discuss all the intricacies subsequently, but at a high level, the largest segment of each subsample from only

one geometric file is over-written with samples from the buffer. This allows for considerable

speedup, as we discuss in Section 3.12. At first, this would seem to compromise the correctness

of the algorithm: logically, the buffered samples must over-write samples from every one of the

geometric files (in fact, this is precisely why the buffer is partitioned across all geometric files, as


ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

1 INTRODUCTION
  1.1 The Geometric File
  1.2 Biased Reservoir Sampling
  1.3 Sampling The Sample
  1.4 Index Structures For The Geometric File

2 RELATED WORK
  2.1 Related Work on Reservoir Sampling
  2.2 Biased Sampling Related Work

3 THE GEOMETRIC FILE
  3.1 Reservoir Sampling
  3.2 Sampling: Sometimes a Little is not Enough
  3.3 Reservoir for Very Large Samples
  3.4 The Geometric File
  3.5 Characterizing Subsample Decay
  3.6 Geometric File Organization
  3.7 Reservoir Sampling With a Geometric File
    3.7.1 Introducing the Required Randomness
    3.7.2 Handling the Variance
    3.7.3 Bounding the Variance
  3.8 Choosing Parameter Values
    3.8.1 Choosing a Value for Alpha
    3.8.2 Choosing a Value for Beta
  3.9 Why Reservoir Sampling with a Geometric File is Correct
    3.9.1 Correctness of the Reservoir Sampling Algorithm with a Buffer
    3.9.2 Correctness of the Reservoir Sampling Algorithm with a Geometric File
  3.10 Multiple Geometric Files
  3.11 Reservoir Sampling with Multiple Geometric Files
    3.11.1 Consolidation And Merging
    3.11.2 How Can Correctness Be Maintained?
    3.11.3 Handling the Stacks in Multiple Geometric Files

  7.2 Biased Reservoir Sampling
    7.2.1 Experimental Setup
    7.2.2 Discussion
  7.3 Sampling From a Geometric File
    7.3.1 Experiments Performed
    7.3.2 Discussion of Experimental Results
  7.4 Index Structures For The Geometric File
    7.4.1 Experiments Performed
    7.4.2 Discussion

8 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH

between these two structures. This is expected as the structure maintains many fewer B+-Trees

than the segment-based index but far more than the LSM-Tree-based structure.

In general, the subsample-based index structure gives the best build time with reasonable index look-up speed, at the cost of a slightly larger disk footprint. The LSM-Tree-based index structure makes use of reasonable disk space and gives the best query performance at the cost of a slow insertion rate or build time. The segment-based index structure gives comparable build

time and has the most compact disk footprint, but suffers considerably when it comes to index



In this chapter we give an introduction to the basic reservoir sampling algorithm that was

proposed to obtain an online random sample of a data stream. The algorithm assumes that

the sample maintained is small enough to fit in main memory in its entirety. We discuss and

motivate why very large sample sizes can be mandatory in common situations. We describe three

alternatives for maintaining very large, disk-based samples in a streaming environment. We then

introduce the geometric file organization and present algorithms for reservoir sampling with the

geometric file. We also describe how multiple geometric files can be maintained all-at-once to

achieve considerable speedup.

3.1 Reservoir Sampling

The classic algorithm for maintaining an online random sample of a data stream is known

as reservoir sampling [11, 38]. To maintain a reservoir sample R of target size |R|, the following

loop is used:

Algorithm 1 Reservoir Sampling
1: Add first |R| items from the stream directly to R
2: for int i = |R| + 1 to ∞ do
3:   Wait for a new record r to appear in the stream
4:   with probability |R|/i do
5:     Remove a randomly selected record from R
6:     Add r to R
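As a concrete sketch, the loop above can be written in a few lines of Python (the stream is any iterable; `capacity` plays the role of |R|):

```python
import random

def reservoir_sample(stream, capacity):
    """Algorithm 1: maintain a uniform random sample, without
    replacement, of everything seen so far on the stream."""
    reservoir = []
    for i, record in enumerate(stream, start=1):
        if i <= capacity:
            reservoir.append(record)           # Step 1: fill the reservoir
        elif random.random() < capacity / i:   # Step 4: probability |R|/i
            # Steps 5-6: evict a random resident and trap the new record
            reservoir[random.randrange(capacity)] = record
    return reservoir
```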

A key benefit of the reservoir algorithm is that after each execution of the for loop, it can

be shown that the set R is a true, uniform random sample (without replacement) of the first i

records from the stream. Thus, at all times, the algorithm maintains an unbiased snapshot of all

of the data produced by the stream. The name "reservoir sampling" is an apt one. The sample R

serves as a reservoir that buffers certain records from the data stream. New records appearing in

the stream may be trapped by the reservoir, whose limited capacity then forces an existing record

to exit the reservoir.

extensions of the reservoir algorithm to on-disk samples all have serious drawbacks. We discuss

the obvious extensions now.

The virtual memory extension. The most obvious adaptation for very large sample sizes is

to simply treat the reservoir as if it were stored in virtual memory. The problem with this solution

is that every new sample that is added to the reservoir will overwrite a random, existing record on

disk, and so it will require two random disk I/Os: one to read in the block where the record will

be written, and one to re-write it with the new sample. This means we can sample only on the

order of 50 records per second at 10ms per random I/O per disk. Currently, a terabyte of storage

requires as few as five disks, giving us a sampling rate of only 5 x 50 = 250 records per second.

To put this in perspective, it would take months to sample enough 100-byte records to fill that reservoir.
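The arithmetic behind this estimate can be checked directly (all figures taken from the text above):

```python
# Virtual-memory extension throughput, using the text's figures:
# 2 random I/Os per sampled record at 10 ms each, 5 disks in parallel,
# and a 1TB reservoir of 100-byte records.
records_per_sec = 5 * 1 / (2 * 0.010)        # 5 disks x 50 records/s = 250
reservoir_records = 10**12 // 100            # 10 billion records
days_to_fill = reservoir_records / records_per_sec / 86_400
print(records_per_sec, round(days_to_fill))  # roughly 463 days, i.e. months
```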


The massive rebuild extension. As an alternative, when new samples are selected from the

stream, they are not added to the on-disk reservoir immediately. Rather, we make use of all of

our available main memory to buffer new samples. At all times, the records stored in the buffer

B logically represent a set of samples that should have been used to replace on-disk samples in

order to preserve the correctness of the reservoir algorithm, but that have not yet been moved to

disk for performance reasons. When the buffer B fills, we simply scan the entire reservoir R, and

replace a random subset of the existing records with the new, buffered samples. The modified

algorithm is given as Algorithm 2. Count(B) refers to the current number of records in B. Note

that since the records contained in B logically represent records in the reservoir that have not yet

been added to disk, a newly-sampled record can either be assigned to replace an on-disk record,

or it can be assigned to replace a buffered record (this is decided in Step (7) of the algorithm).

In a realistic scenario, the ratio of the number of disk blocks to the number of records

buffered in main memory may approach or even exceed one. For example, a 1 TB database with

128 KB blocks will have 7.8 million blocks; and for such a relatively large database it is realistic

to expect that we have access to enough memory to buffer millions of records. As the number of

buffered records per block meets or exceeds one, most or all of the blocks on disk will contain

To relate this back to the task of the reservoir sampling, imagine that our large, disk-based

reservoir sample R is maintained using a reservoir sampling algorithm in conjunction with a

main memory buffer B (as in Algorithm 2). Recall that the way reservoir sampling works is

that new samples from the data stream are chosen to overwrite random samples currently in the

reservoir. The buffer temporarily stores these new samples, delaying the overwrite of a random

set of records that are already stored on disk. Once the buffer is full, all new samples are merged

with R by overwriting a random subset of the existing samples in R.

Consider some arbitrary subsample S of R (so S ⊆ R), with capacity |S|. Since the buffer B represents the samples that have already over-written an equal number of records of R, a buffer flush overwrites exactly |B| samples of R. Thus, on expectation the merge will overwrite (|S| × |B|)/|R| samples of S. If we define α = 1 − |B|/|R|, then on expectation, S should lose |S| × (1 − α) of its own records due to the buffer flush.¹ We refer to this loss as subsample decay.

We can roughly describe the expected decay of S after repeated buffer merges using the

three observations stated before. If the subsample retention rate is α = 1 − |B|/|R|, then:

From Observation 1, it follows that the ith buffer merge, on expectation, removes n × α^(i−1) samples from what remains of S.

From Observation 2, it follows that the initial size of a subsample is |S| = n/(1 − α).

From Observation 3, it follows that the expected number of merges required until S has β or fewer samples left is T.

The net result of this is that it is possible to characterize the expected decay of any arbitrary

subset of the records in our disk-based sample as new records are added to the sample through

multiple emptyings of the buffer. If we view S as being composed of T on-disk "segments"

of exponentially decreasing size, plus a single, special group of final segments of total size

1 Actually, this is only a fairly tight approximation to the expected rate of decay. It is not an
exact characterization because these expressions treat the emptying of the buffer into the reservoir
as a single, atomic event, rather than a set of individual record additions (See Section 3.7).
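Taken together, the observations give a simple model of the expected decay; the sketch below assumes a subsample starts at |B| records and uses α = 1 − |B|/|R| and n = |B|(1 − α), per Observations 1 and 2:

```python
import math

def decay_schedule(buffer_records, reservoir_records, beta):
    """Expected size of a subsample after each buffer merge, plus the
    closed-form merge count from Observation 3."""
    alpha = 1 - buffer_records / reservoir_records   # retention rate
    n = buffer_records * (1 - alpha)                 # first-merge loss
    size, sizes, i = buffer_records, [buffer_records], 1
    while size > beta:
        size -= n * alpha ** (i - 1)   # Observation 1: loss at merge i
        sizes.append(size)
        i += 1
    merges = math.ceil((math.log(beta) - math.log(buffer_records))
                       / math.log(alpha))            # Observation 3
    return sizes, merges
```

The per-merge losses telescope, so the expected size after t merges is simply |B| × α^t, which is why the segment sizes shrink geometrically.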

[45] Ousterhout, J.K., Douglis, F.: Beating the I/O bottleneck: A case for log-structured file systems. Operating Systems Review 23(1), 11-28 (1989)

[46] Gibbons, P.B., Matias, Y., Poosala, V.: Fast incremental maintenance of approximate histograms. ACM Transactions on Database Systems 27(3), 261-298 (2002)

[47] Pol, A., Jermaine, C.: Biased reservoir sampling. IEEE Transactions on Knowledge and Data Engineering (under review)

[48] Pol, A., Jermaine, C., Arumugam, S.: Maintaining very large random samples using the geometric file. The VLDB Journal (2007)

[49] Shao, J.: Mathematical Statistics. Springer-Verlag (1999)

[50] Thompson, M.E.: Theory of Sample Surveys. Chapman and Hall (1997)

[51] Toivonen, H.: Sampling large databases for association rules. In: International Conference on Very Large Data Bases (1996)

[52] Ganti, V., Lee, M.-L., Ramakrishnan, R.: Icicles: Self-tuning samples for approximate query answering. In: International Conference on Very Large Data Bases (2000)

[53] Vitter, J.: Random sampling with a reservoir. ACM Transactions on Mathematical Software 11(1), 37-57 (1985)

[54] Vitter, J.: An efficient algorithm for sequential random sampling. ACM Transactions on Mathematical Software 13(1), 58-67 (1987)

ignoring the floor) we compute the number of segments to write as (log β − log |B|)/log α′. If we let ω = (log(1/α′))⁻¹, the number of segments can be expressed as ω(log |B| − log β). Assuming a constant number c of random seeks per segment written to the disk, the total random disk head movements required per record is ωc(log |B| − log β)/|B|, which is O(ω × log |B|/|B|). □

In the case of multiple geometric files we use additional space for m dummy subsamples. Thus, the total storage required by all geometric files is |R| + (m × |B|). If we wish to maintain a 1TB reservoir of 100B samples with 1GB of memory, we can achieve α′ = 0.9 by using only 1.1TB of disk storage in total. For α′ = 0.9, we need to write fewer than 100 segments per 1GB buffer flush. At 40 ms/segment, this is only 4 seconds of random disk head movements to write 1GB of new samples to disk.

In order to test the relative ability of the geometric file to process a high-speed stream of

insertions, we have implemented and benchmarked five alternatives for maintaining a large

reservoir on disk: the three alternatives discussed in Section 3.3, the geometric file, and the

framework described in Section 3.10 for using multiple geometric files at once. We present these

benchmarking results in Chapter 7.

manner. When the reservoir is full, we have used all of the slots exactly once. During normal operation, every time the buffer is full, the slot corresponding to the smallest subsample in the reservoir (which is about to decay completely) is used to write out a newly built B+-Tree. Thus, during normal operation B+-Tree slots are used in round-robin fashion.

The algorithm used to construct and maintain a subsample-based index structure is given as

Algorithm 11.

Algorithm 11 Construction and Maintenance of a Subsample-Based Index Structure
1: Set totSubsamInR = ⌈(log β − log |B| + log(1 − α))/log α⌉
2: Set BTree allTrees[totSubsamInR]
3: Set btIndex = 0
4: for int i = 1 to ∞ do
5:   if Buffer B is partitioned then
6:     for each segment j in B do
7:       allTrees[btIndex].BuildBTree(j)
8:       btIndex + +
9:     if i > |R| then
10:      btIndex = btIndex mod totSubsamInR

6.4.2 Index Look-Up

In the subsample-based index structure, after every buffer flush, exactly one B+-Tree is

created and written to the disk, making insertions in the index structure very efficient. However,

most of the deletions are deferred until the subsample decays completely. Thus, although every subsample loses its records to the new subsample, B+-Tree records are deleted from the index structure only when the entire B+-Tree is to be deleted. In other words, at any given time all B+-Trees except the one most recently inserted contain stale records that must be ignored during the


A search on the subsample-based index structure involves looking up all B+-Tree indexes, one for each subsample in the geometric file. We modify the existing B+-Tree-based point query and range query algorithms and run them for each entry in the B+-Tree array of the index structure.
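The multi-tree lookup can be sketched as follows (dict-backed stand-ins for the B+-Trees; the `live_record_ids` liveness test is an illustrative placeholder for however staleness is actually detected):

```python
def point_query(all_trees, live_record_ids, key):
    """Probe every per-subsample B+-Tree for `key` and drop stale
    entries, i.e. index records whose underlying sample has already
    been overwritten in the geometric file."""
    results = []
    for tree in all_trees:                  # one tree per subsample
        for rid in tree.get(key, []):
            if rid in live_record_ids:      # ignore stale records
                results.append(rid)
    return results
```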

The modification is required to ignore the stale records in the B+-Trees. As mentioned before, the subsample corresponding to a B+-Tree may lose its segments, but the index records are

totalDist_i(f, f') = Σ_{k=1}^{N} | f'(r_k)/Σ_{l=1}^{N} f'(r_l) − f(r_k)/Σ_{l=1}^{N} f(r_l) |

Since r^max is the ith record of the stream, using the result of Lemma 5 (given below) we re-write the totalDist formula as

totalDist_i(f, f') = Σ_{j=1}^{i−1} | f'(r_j)/Σ_{k=1}^{N} f'(r_k) − f(r_j)/Σ_{k=1}^{N} f(r_k) | + Σ_{j=i}^{N} | f'(r_j)/Σ_{k=1}^{N} f'(r_k) − f(r_j)/Σ_{k=1}^{N} f(r_k) |

We know that ∀j < i, f'(r_j) = (|R| − 1) f(r^max) f(r_j)/Σ_{k=1}^{i−1} f(r_k), and ∀j > i, f'(r_j) = f(r_j). We also know that Σ_{k=1}^{N} f'(r_k) = |R| f(r^max) + Σ_{k=i+1}^{N} f(r_k). Therefore, the above equation simplifies to


[1] Das, A., Gehrke, J., Riedewald, M.: Approximate join processing over data streams. In: ACM SIGMOD International Conference on Management of Data (2003)

[2] Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: ACM SIGMOD International Conference on Management of Data (2000)

[3] Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: Join synopses for approximate query answering. In: ACM SIGMOD International Conference on Management of Data (1999)

[4] Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: The Aqua approximate query answering system. In: ACM SIGMOD International Conference on Management of Data (1999)

[5] Aggarwal, C.C.: On biased reservoir sampling in the presence of stream evolution. In: VLDB'2006: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 607-618. VLDB Endowment (2006)

[6] Arge, L.: The buffer tree: A new technique for optimal I/O-algorithms. In: International Workshop on Algorithms and Data Structures (1995)

[7] Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: SODA'02: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 633-634. Society for Industrial and Applied Mathematics (2002)

[8] Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: ACM SIGMOD International Conference on Management of Data (2003)

[9] Bayer, R., McCreight, E.M.: Organization and maintenance of large ordered indexes. In: SIGFIDET Workshop, pp. 107-141 (1970)

[10] Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: ICDE'06: Proceedings of the 22nd International Conference on Data Engineering, p. 6. IEEE Computer Society, Washington, DC, USA (2006)

[11] Fan, C.T., Muller, M.E., Rezucha, I.: Development of sampling plans by using sequential (item by item) techniques and digital computers. Journal of the American Statistical Association 57, 387-402 (1962)

[12] Jermaine, C., Datta, A., Omiecinski, E.: A novel index supporting high volume data warehouse insertion. In: International Conference on Very Large Data Bases (1999)

[13] Jermaine, C., Omiecinski, E., Yee, W.G.: The partitioned exponential file for database storage management. In: International Conference on Very Large Data Bases (1999)

[14] Chaudhuri, S., Das, G., Datar, M., Motwani, R., Narasayya, V.: Overcoming limitations of sampling for aggregation queries. In: ICDE (2001)

with probability |R| f(r_i)/Σ_{l=1}^{i} f(r_l) in Step (8) of the algorithm, and the probability requirement trivially holds for the new record. We now must prove this fact for r_k, for all k < i. Since R_{i−1} is correct, we know that for k < i, Pr[r_k ∈ R_{i−1}] = |R| f(r_k)/Σ_{l=1}^{i−1} f(r_l). Then there are two cases to consider; either the new record r_i is chosen for the reservoir, or it is not. If r_i is not chosen, then r_k remains in the reservoir for k < i. If r_i is chosen, then r_k remains in the reservoir if r_k is not selected for expulsion from the reservoir (the chance of this happening if r_i is chosen is (|R| − 1)/|R|). Thus, the probability that a record r_k is in R_i is

Pr[r_k ∈ R_i] = Pr[r_k ∈ R_{i−1}] × (Pr[r_i ∈ R_i] × (|R| − 1)/|R| + (1 − Pr[r_i ∈ R_i]))
             = Pr[r_k ∈ R_{i−1}] × (|R| − Pr[r_i ∈ R_i])/|R|
             = (|R| f(r_k)/Σ_{l=1}^{i−1} f(r_l)) × (Σ_{l=1}^{i−1} f(r_l)/Σ_{l=1}^{i} f(r_l))
             = |R| f(r_k)/Σ_{l=1}^{i} f(r_l)

This is the desired result and proves the statement of the lemma.

4.1.2 So, What Can Go Wrong? (And a Simple Solution)

This simple modification to the reservoir sampling algorithm will give us the desired biased sample as long as the probability |R| f(r_i)/totalWeight never exceeds one. If this value does exceed one, then the correctness of the algorithm is not preserved. Unfortunately, we may very well see such meaningless probabilities, especially early on as the reservoir is
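The failure mode is easy to reproduce in a sketch of the naive acceptance step (illustrative names; this is not the corrected Algorithm 7):

```python
import random

def naive_biased_step(reservoir, record, total_weight, f):
    """Naive biased reservoir step: the new record replaces a random
    resident with probability |R| * f(r) / totalWeight. Early in the
    stream this 'probability' can exceed one, breaking correctness."""
    total_weight += f(record)
    p = len(reservoir) * f(record) / total_weight
    if p > 1.0:
        raise ValueError("acceptance probability exceeds one")
    if random.random() < p:
        reservoir[random.randrange(len(reservoir))] = record
    return total_weight
```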

records and any records contained in its stack in order to obtain the desired size sample. The detailed algorithm is presented as Algorithm 8.

It is clear that this algorithm obtains the desired batch sample by scanning exactly N records, as opposed to scanning the entire reservoir, at the cost of a few random disk seeks. Since the sampling process is analogous to the process of adding more samples to the file, it is just as efficient, requiring O(ω × log |B|/N) random disk head movements for each newly sampled record, as described in Lemma 2.

5.3.3 Batch Sampling Multiple Geometric Files

A geometric file structure based batch sampling algorithm can be extended to allow efficient

batch sampling from multiple geometric files in the same way that the insertion algorithm for

new samples into the geometric file can be extended to allow insertions into multiple geometric

files. The extension is fairly straightforward, with an additional first step in which we determine the number of records to be sampled from each geometric file. Once this number is determined, we

execute Algorithm 8 on each file in order to obtain the desired batch sample.
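That first step can be sketched as follows (illustrative; choosing each of the N records' source file with probability proportional to the file's remaining records is one simple way to realize the correct allocation):

```python
import random

def allocate_batch(n, file_sizes):
    """Decide how many of the n batch-sampled records to draw from each
    geometric file: each of the n picks chooses a file with probability
    proportional to its remaining (not-yet-chosen) records."""
    counts = [0] * len(file_sizes)
    remaining = list(file_sizes)
    for _ in range(n):
        j = random.choices(range(len(remaining)), weights=remaining)[0]
        counts[j] += 1
        remaining[j] -= 1
    return counts
```

Decrementing the weights as records are allocated mirrors sampling without replacement from the union of the files.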

5.4 Online Sampling From a Geometric File

5.4.1 A Naive Algorithm

One straightforward way of supporting online sampling from a geometric file is to implement the iterative function GetNext as follows. For every call to GetNext, we simply generate a random number i between 1 and the size of the file |R|, and then return the record at the ith position in the geometric file. Care must be taken to avoid choosing the same record of R more than once in order to obtain a correct sample without replacement. For example, to sample N records from R, the numbers 0 through N − 1 could be hashed or randomized using a bijective pseudo-random function onto the domain 0 through |R| − 1, and the resulting N numbers used to generate the sample. To pick the next record to sample, we simply hash N.
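One way to realize such a bijective pseudo-random function is a small Feistel network combined with cycle-walking; the sketch below is illustrative (the hash, key, and round count are arbitrary choices, not the dissertation's):

```python
import hashlib

def _feistel(x, bits, key=b"demo-key", rounds=4):
    """Bijective mixing of `bits`-bit integers (`bits` must be even)."""
    half = bits // 2
    mask = (1 << half) - 1
    left, right = x >> half, x & mask
    for r in range(rounds):
        h = hashlib.sha256(key + bytes([r]) + right.to_bytes(8, "big")).digest()
        left, right = right, left ^ (int.from_bytes(h[:8], "big") & mask)
    return (left << half) | right

def permuted_index(i, domain_size, key=b"demo-key"):
    """Map i in [0, domain_size) to a unique pseudo-random index in the
    same range; cycle-walking keeps the mapping bijective."""
    bits = max((domain_size - 1).bit_length(), 2)
    bits += bits % 2                       # Feistel needs an even width
    x = _feistel(i, bits, key)
    while x >= domain_size:                # walk back into the domain
        x = _feistel(x, bits, key)
    return x
```

Because the mapping is a bijection on 0 through |R| − 1, hashing 0, 1, 2, … in turn never repeats a record position.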

It is easy to see that the naive algorithm will give us a correct online sample of a geometric

file. However, we will use one disk seek per call to GetNext. Since each random I/O requires

Furthermore, this degeneration in performance could probably be reduced by using a smaller

value for a'.

As expected, the local overwrite option performs very well early on, especially in the

first two experiments (see Section 3.3 for a discussion of why this is expected). Even with

limited buffer memory in Experiment 3, it uniformly outperforms a single geometric file.

Furthermore, with enough buffer memory in Experiments 1 and 2, the local overwrite option

is competitive with the multiple geofiles option early on. However, fragmentation becomes a

problem and performance decreases over time. Unless offline re-randomization of the file is

possible periodically, this degradation probably precludes long-term use of the local overwrite


It is interesting that as demonstrated by Experiment 3 (and explained in Section 3.8) a

single geometric file is very sensitive to the ratio of the size of the reservoir to the amount of

available memory for buffering new records from the stream. The geofile option performs well

in Experiments 1 and 2 when this ratio is 100, but rather poorly in Experiment 3 when the ratio is


Finally, we point out the general unusability of the scan and virtual memory options: scan generally outperformed virtual memory, but both generally did poorly. Except in Experiment 1, with large memory and small record size, with these two options more than 97% of the

hours or so after the reservoir first fills, only a tiny fraction of additional processing occurs due to

the inefficiency of the two options.

7.2 Biased Reservoir Sampling

In Section 4.1 we gave an upper bound for the distance between the actual bias function f'

computed using our reservoir algorithm, and the desired, user-defined bias function f. While

useful, this bound does not tell the entire story. In the end, what a user of a biased sampling

algorithm is interested in is not how close the bias function that is actually computed is to the

user-specified one, but instead the key question is what sort of effect any deviation has on the

algorithm that:

Pr[S ⊆ R] = (# of subsets of size |R| of the data stream D that contain S)/(# of subsets of size |R| of D) = C(|D| − N, |R| − N)/C(|D|, |R|),

where C(n, k) denotes the binomial coefficient. Now, imagine that S ⊆ R. If we obtain a sample of size N from R using the reservoir algorithm, the probability that we choose precisely S is:

Pr[S sampled from R | S ⊆ R] = 1/C(|R|, N)

Thus we have:

Pr[S sampled from R] = Pr[S sampled from R | S ⊆ R] × Pr[S ⊆ R] = C(|D| − N, |R| − N)/(C(|R|, N) × C(|D|, |R|)) = 1/C(|D|, N)

This is precisely the probability we would expect if we sampled directly from the stream without replacement. □
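The final identity can be checked numerically with exact rational arithmetic (toy sizes, chosen arbitrarily):

```python
from fractions import Fraction
from math import comb

# Verify C(|D|-N, |R|-N) / (C(|R|, N) * C(|D|, |R|)) == 1 / C(|D|, N)
D, R, N = 30, 12, 5
lhs = Fraction(comb(D - N, R - N), comb(R, N) * comb(D, R))
rhs = Fraction(1, comb(D, N))
assert lhs == rhs
```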

Unfortunately, though it is very simple, the naive algorithm will be inefficient for drawing

a small sample from a large geometric file since it requires a full scan of the geometric file to

obtain a true random sample for any value of N. Since the geometric file may be gigabytes in

size, this can be problematic.

5.3.2 A Geometric File Structure-Based Algorithm

We can do better if we make use of the structure of a geometric file itself. The intuitive

outline of this approach is as follows. To obtain a batch sample of size N, we pre-calculate

how many records from each on-disk subsample will be included in the batch sample, and then

we read the appropriate number of records sequentially from the various segments of each

subsample. The process of choosing the number of records to select from each subsample is

simple random sample. In Chapter 4 we propose a single-pass biased reservoir sampling algorithm. In Chapter 5 we develop techniques that can be used to sample geometric files to obtain a small sample. In Chapter 6 we present secondary index structures for the geometric file. In Chapter 7 we discuss the benchmarking results. The dissertation is concluded in Chapter 8.

Most of the work in this dissertation is either already published or under review for publication. The material from Chapter 3 is from the paper with Christopher Jermaine and Subramanian Arumugam that was originally published in SIGMOD 2004 [36]. The work presented in Chapter 4 is submitted to TKDE and is under review [47]. The material in Chapter 5 is part of a journal paper accepted at The VLDB Journal [48]. The results in Chapter 7 are taken from the above three papers as well.

For a given set of records, Σ_{k=1}^{|R|} f(r_k) + Σ_{k=|R|+1}^{N} f(r_k) is a constant. Therefore, dist(r_j) increases as Σ_{k=1}^{|R|} f(r_k) decreases. In other words, dist(r_j) is maximum for the smallest possible Σ_{k=1}^{|R|} f(r_k). Thus, totalDist(f, f') is largest when the reservoir is initially filled with the |R| records having the smallest possible weights. This proves the claim in the second proposition.

The proof of the third proposition, regarding the reordering of records after r^max, is immediate: since r^max is the highest-weight record of the stream, no record after r^max can be an overweight record.

From the above three propositions, we can conclude that the worst case for Algorithm 7 occurs when (1) the reservoir is initially filled with the |R| records having the smallest possible weights and (2) we encounter the record r^max with the largest weight immediately thereafter.
4.2.2 The Proof of Theorem 1: The Upper Bound on totalDist
To derive the upper bound, we start with the totalDist formula and give its value in the
worst case

totalDist(f, f') = (f'(r_1)/Σ_{k=1}^{N} f'(r_k) - f(r_1)/Σ_{k=1}^{N} f(r_k)) + ... + (f'(r_{|R|})/Σ_{k=1}^{N} f'(r_k) - f(r_{|R|})/Σ_{k=1}^{N} f(r_k)) + ... + (f'(r_N)/Σ_{k=1}^{N} f'(r_k) - f(r_N)/Σ_{k=1}^{N} f(r_k))

We know that in the worst case r^max appears as the (|R|+1)th record in the stream; using the result of Lemma 5, we rewrite the totalDist formula as

Current techniques suitable for maintaining samples from a data stream are based on

reservoir sampling [11, 38]. Reservoir sampling algorithms can be used to dynamically maintain

a fixed-size sample of N records from a stream, so that at any given instant, the N records in

the sample constitute a true random sample of all of the records that have been produced by the

stream. However, as we will discuss in this dissertation, the problem is that existing reservoir

techniques are suitable only when the sample is small enough to fit into main memory.

Given that there are limited techniques for maintaining very large samples, the problem

addressed in the first part of this dissertation is as follows:

Given a main memory buffer B large enough to hold |B| records, can we develop efficient algorithms for dynamically maintaining a massive random sample containing exactly N records from a data stream, where N > |B|?

Key design goals for the algorithms we develop are

1. The algorithms must be suitable for streaming data, or any similar environment where a
large sample must be maintained on-line in a single pass through a data set, with the strict
requirement that the sample always be a true, statistically random sample of fixed size N
(without replacement) from all of the data produced by the stream thus far.

2. When maintaining the sample, the fraction of I/O time devoted to reads should be close to
zero. Ideally, there would never be a need to read a block of samples from disk simply to
add one new sample and subsequently write the block out again.

3. The fraction of I/O time spent performing random I/Os should also be close to zero. Costly random disk seeks should be few and far between. Almost all I/O should be sequential.

4. Finally, the amount of data written to disk should be bounded by the total size of all of the
records that are ever sampled.

The geometric file meets each of the requirements listed above. With memory large enough to buffer |B| > 1 records, the geometric file can be used to maintain an online sample of arbitrary size with an amortized cost of O(ω × log |B| / |B|) random disk head movements for each newly sampled record (see Section 3.12). The multiplier ω can be made arbitrarily small by making use of additional disk space. A rigorous benchmark of the geometric file demonstrates its superiority over the obvious alternatives.

Since each record retrieved by the naive algorithm requires a random disk seek of around 10 milliseconds, the naive algorithm can only sample around 6,000 records from the geometric file per minute per disk. This performance is unacceptable for most applications.

5.4.2 A Geometric File Structure-Based Algorithm

As in the case of the batch sampling algorithm, we can make use of the structure of a geometric file to efficiently support online sampling.

Instead of selecting a random record of a geometric file, we randomly pick a subsample

and choose its next available record as a return value of GetNext. This is analogous to the classic

online sampling algorithm for sampling from a hashed file [26], where first a hash bucket is

selected and then a record is chosen. Since the selection of a random record within a subsample

is sequential, we may reduce the number of costly disk seeks if we read the subsample in its

entirety, and buffer the subsample's records in memory. Using this basic methodology, we now

describe how a call to the GetNext will be processed:

We first randomly pick a subsample S_i, with the probability of selecting i proportional to the size of the ith subsample.

Next, we look for buffered records of S_i; if such records exist, we choose and return the first available record as the return value of GetNext. If no buffered records are found, we fetch and buffer a number of blocks of records from subsample S_i, and then return the first buffered record as the return value of GetNext.
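The two steps above can be sketched in a small in-memory model, where Python lists stand in for on-disk subsamples and the class and parameter names are illustrative. To preserve the without-replacement guarantee in this simplified setting, the sketch picks a subsample in proportion to its records not yet returned.

```python
import random

class OnlineSampler:
    """Minimal in-memory sketch of GetNext over a set of subsamples.

    Each list stands in for an on-disk subsample of the geometric file;
    a real implementation would fetch blocks from disk. Records within
    a subsample are assumed to be stored in random order, so reading
    them sequentially yields a random record."""

    def __init__(self, subsamples, blocks_per_refill=4, block_size=2):
        self.on_disk = [list(s) for s in subsamples]   # unread records
        self.buffers = [[] for _ in subsamples]        # buffered records
        self.remaining = [len(s) for s in subsamples]  # unread + buffered
        self.chunk = blocks_per_refill * block_size    # records per refill

    def get_next(self):
        total = sum(self.remaining)
        if total == 0:
            return None  # every record has been returned exactly once
        # Step 1: pick a subsample with probability proportional to the
        # number of its records not yet returned.
        r = random.randrange(total)
        i = 0
        while r >= self.remaining[i]:
            r -= self.remaining[i]
            i += 1
        # Step 2: if the buffer for this subsample is empty, refill it
        # with a sequential chunk, then return the next buffered record.
        if not self.buffers[i]:
            self.buffers[i] = self.on_disk[i][:self.chunk]
            del self.on_disk[i][:self.chunk]
        self.remaining[i] -= 1
        return self.buffers[i].pop(0)
```

Because each subsample is consumed sequentially and never re-read, repeated calls to get_next enumerate the whole file in a random order, that is, a sample without replacement.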

Since the records from each subsample are read and buffered in memory sequentially, we

are guaranteed to choose each record of the reservoir at most once, giving us the desired random

sample without replacement. A proof of this is simple, and analogous to the proof of Lemma

3. However, thus far we have not considered a very important question: How many blocks of a

subsample Si should we fetch at the time of buffer refill? In general there are two extremes that

we may consider:

Fetch many. If we fetch a large number of blocks at the time of the buffer refill, we reduce
the overall time to sample N records for large N. This is due to the fact that by fetching
many blocks using a sequential read, we amortize the seek time over a large number of
blocks and at the same time we prepare ourselves for future calls to GetNext; once the
records are fetched from disk, the response time for subsequent calls to GetNext is almost
instantaneous (only in-memory computations are required). However, the drawback of this

Setting this to zero, we have X = Nr/bs. Thus, we divide the N/b blocks into Nr/bs chunks and read bs/r blocks from a subsample every time we refill the buffer. It turns out

that when this solution is used, the number of blocks read at the time of buffer refill depends

on the ratio of the seek time to the block scan time. Since this solution is independent of the

planning horizon, we always read bs/r blocks irrespective of the number of records sampled so

far. Algorithm 9 gives the detailed online sampling algorithm.

5.5 Sampling A Biased Sample

We end this chapter by noting that if a geometric file's sample is correctly biased, then the batch and online sampling algorithms we have given will also produce a correctly biased sample with no modification, as described by the following lemma.

Lemma 9. A simple, equal-probability random sample from a correctly biased geometric file will

be correctly biased if the sample stored by the geometric file is correctly biased.

Proof. In biased sampling, the probability of a record r being accepted into a geometric file is |R| × f(r)/totalWeight, where f(r) is the weight of the record under consideration and totalWeight is the sum of the weights of all records from the stream so far.

Let Sample be the biased sample of the geometric file; then we have

Pr[i ∈ Sample] = Pr[Selecting i from S_i] × Pr[Selecting S_i] × Pr[i ∈ S_i]
= (1/|S_i|) × (|S_i|/|R|) × (|R| × f(r_i)/totalWeight)
= f(r_i)/totalWeight

We examine the various algorithms for producing smaller samples from a large, disk-based

geometric file in Chapter 7 of this dissertation.

in subsequent query processing. We then compute (in a single pass) a biased sample Ri of the i

records produced by a data stream. Ri is fixed-size, and the probability of sampling the jth record

from the stream is proportional to f(rj) for all j < i. This is a fairly simple and yet powerful

definition of biased sampling, and is general enough to support many applications.

The key contributions of this part of the dissertation are as follows:

1. We present a modified version of the classic reservoir sampling algorithm that is ex-
ceedingly simple, and is applicable for biased sampling using any arbitrary user-defined
weighting function f.

2. In most cases, our algorithm is able to produce a correctly biased sample. However, given
certain pathological data sets and data orderings, this may not be the case. Our algorithm
adapts in this case and provides a correctly biased sample for a slightly modified bias
function f'. We analytically bound how far f' can be from f in such a pathological case,
and experimentally evaluate the practical significance of this difference.

3. We describe how to perform a biased reservoir sampling and maintain large biased samples
with the geometric file.

4. Finally, we derive the correlation (covariance) between the Bernoulli random variables governing the sampling of two records ri and rj using our algorithm. We use this covariance to derive the variance of a Horvitz-Thompson estimator making use of a sample computed using our algorithm.
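The Horvitz-Thompson estimator mentioned in item 4 can be sketched as follows, assuming each record's inclusion probability has the form min(1, |R| × f(r)/totalWeight); the function name and record layout are illustrative, not part of the dissertation's implementation.

```python
def horvitz_thompson_sum(sample, f, total_weight, reservoir_size):
    """Sketch: Horvitz-Thompson estimate of a SUM over a biased sample.

    Each sampled record r carries a value r['v']. Under the bias
    function f, its inclusion probability is assumed to be
        pi(r) = min(1, reservoir_size * f(r) / total_weight),
    so dividing each sampled value by pi(r) gives an unbiased estimate
    of the SUM over the whole stream."""
    estimate = 0.0
    for r in sample:
        pi = min(1.0, reservoir_size * f(r) / total_weight)
        estimate += r['v'] / pi
    return estimate
```

For example, with uniform weights (f(r) = 1), a stream of four records, and a reservoir of two, every inclusion probability is 0.5 and each sampled value is simply doubled.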

1.3 Sampling The Sample

A geometric file is a simple random sample (without replacement) from a data stream.

In this part of the dissertation we develop techniques which allow a geometric file to itself be

sampled in order to produce smaller sets of data objects that are themselves random samples

(without replacement) from the original data stream. The goal of the algorithms described in

this part is to efficiently support further sampling of a geometric file by making use of its own structure.


Small samples frequently do not provide enough accuracy, especially in the case when

the resulting statistical estimator has a very high variance. However, while in the general

case a very large sample can be required to answer a difficult query, a huge sample may often

contain too much information. For example, consider the problem of estimating the average

© 2007 Abhijit A. Pol


Despite the variety of alternatives for approximate query processing [1, 21, 30, 34, 39],

sampling is still one of the most powerful methods for building a one-pass synopsis of a data set,

especially in a streaming environment where the assumption is that there is too much data to store

all of it permanently. Sampling's many benefits include:

Sampling is the most widely-studied and best understood approximation technique cur-
rently available. Sampling has been studied for hundreds of years, and many fundamental
results describe the utility of random samples (such as the Central Limit Theorem, Cher-
noff, Hoeffding and Chebyshev bounds [16, 49]).

Sampling is the most versatile approximation technique available. Most data processing
algorithms can be used on a random sample of a data set rather than the original data with
little or no modification. For example, almost any data mining algorithm for building a
decision tree classifier can be run directly on a sample.

Sampling is the most widely-used approximation technique. Sampling is common in data
mining, statistics, and machine learning. The sheer number of recent papers from ICDE,
VLDB, and SIGMOD [2, 3, 8, 14, 15, 28, 32, 33, 35, 46, 51, 52] that use samples testifies to sampling's popularity as a data management tool.

Given the obvious importance of random sampling, it is perhaps surprising that there has

been very little work in the data management community on how to actually perform random

sampling. The most well-known papers in this area are due to Olken and Rotem [25, 27], who

also offer the definitive survey of related work through the early 1990s [26]. However, this work

is relevant mostly for sampling from data stored in a database, and implicitly assumes that a

"sample" is a small data structure that is easily stored in main memory.

Such assumptions are sometimes overly restrictive. Consider the problem of approximate

query processing. Recent work has suggested the possibility of maintaining a sample of a large

database and then executing analytic queries over the sample rather than the original data as a

way to speed up processing [4, 31]. Given the most recent TPC-H benchmark results [17], it is

clear that processing standard report-style queries over a large, multi-terabyte data warehouse

may take hours or days. In such a situation, maintaining a fully materialized random sample

[Figure 7-3 plots sum query estimation accuracy against the correlation factor (0 to 1) for three methods: biased sampling without skewed records, unbiased reservoir sampling, and biased sampling in the worst case.]

Figure 7-3. Sum query estimation accuracy for zipf=0.2.

particular estimation task that is to be performed. Perhaps the easiest way to detail the practical

effect of a pathological data ordering is through experimentation.

In this section we present experimental results evaluating the practical significance of a worst-case data ordering. Specifically, we design a set of experiments to compute the error (variance) one would expect when sampling for the answer to a SUM query in the following three scenarios:


1. When a biased sample is computed using our reservoir algorithm with the data ordered so
as to produce no overweight records.

2. When an unbiased sample is computed using the classical reservoir sampling algorithm.

3. When a biased sample is computed using our reservoir algorithm, with records arranged so as to produce the bias function furthest from the user-specified one, as described by Theorem 1.

By examining the results, it should become clear exactly what sort of practical effect on the

accuracy of an estimator one might expect due to a pathological ordering.

7.2.1 Experimental Setup

In our experiments, we evaluated a SUM query over a set of synthetic data streams having

various statistical properties. In each experiment, every record has two attributes: A and B.

Furthermore, we know that Algorithm 7 accepts the first |R| records of the stream with probability 1. No weight adjustments are triggered for the first |R| records, irrespective of their weights. Therefore, the earliest position r^max can appear in the stream is right after the reservoir is filled. This proves the proposition. We now turn to proving Lemma 5, which was used in the previous proof.
Lemma 5. If r^max appears as the ith record of the stream, then ∀j < i we have: f'(r_j)/Σ_{k=1}^{i} f'(r_k) ≥ f(r_j)/Σ_{k=1}^{i} f(r_k), and ∀j > i we have: f'(r_j)/Σ_{k=1}^{j} f'(r_k) ≤ f(r_j)/Σ_{k=1}^{j} f(r_k).
Proof. When we encounter r^max as the ith record of the stream, we increase the weights of r_j ∀j < i by a factor of C = (|R| - 1)f(r_i)/(totalWeight - f(r_i)), and adjust Σ_{k=1}^{i} f'(r_k) = C × Σ_{k=1}^{i-1} f(r_k) + f(r_i) = |R| × f(r_i).
We also know that ∀j > i, f'(r_j) = f(r_j).
Part 1: ∀j < i we have

f'(r_j)/Σ_{k=1}^{i} f'(r_k) - f(r_j)/Σ_{k=1}^{i} f(r_k) = (C × f(r_j))/(Σ_{k=1}^{i-1} C × f(r_k) + f(r_i)) - f(r_j)/Σ_{k=1}^{i} f(r_k)

Since C > 1, we have

(C × f(r_j))/(Σ_{k=1}^{i-1} C × f(r_k) + f(r_i)) ≥ f(r_j)/(Σ_{k=1}^{i-1} f(r_k) + f(r_i)) = f(r_j)/Σ_{k=1}^{i} f(r_k)

We can therefore conclude that

f'(r_j)/Σ_{k=1}^{i} f'(r_k) ≥ f(r_j)/Σ_{k=1}^{i} f(r_k)

This proves the first part of the lemma.

Part 2: ∀j > i we have

Algorithm 7 Biased Reservoir Sampling (Adjusting Weights of Existing Samples)
1: Set totalWeight = 0
2: for int i = 1 to ∞ do
3:   Wait for a new record ri to appear in the stream
4:   Set ri.weight = f(ri)
5:   totalWeight = totalWeight + f(ri)
6:   if i ≤ |R| then
7:     Add ri directly to R
8:   else
9:     if |R|f(ri)/totalWeight ≤ 1 then
10:      with probability |R|f(ri)/totalWeight do
11:        Remove a randomly selected record from R
12:        Add ri to R
13:    else
14:      for each record j in R do
15:        rj.weight = ((|R| - 1)f(ri)/(totalWeight - f(ri))) × rj.weight
16:      totalWeight = |R|f(ri)
17:      Remove a randomly selected record from R
18:      Add ri to R
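A compact executable rendering of Algorithm 7 might look like the following sketch. The per-record weight bookkeeping of lines 4 and 15 is collapsed into the running totalWeight, which is all the acceptance test needs, so this is a deliberately simplified version rather than a line-by-line transcription.

```python
import random

def biased_reservoir(stream, f, R_size):
    """Sketch of the weight-adjusting biased reservoir (Algorithm 7
    style): fixed-size reservoir with special handling of records whose
    acceptance probability would exceed one."""
    R = []
    total_weight = 0.0
    for r in stream:
        w = f(r)
        total_weight += w
        if len(R) < R_size:
            R.append(r)            # lines 6-7: fill the reservoir directly
            continue
        p = R_size * w / total_weight
        if p <= 1.0:
            # lines 9-12: accept with probability |R|f(ri)/totalWeight,
            # evicting a uniformly random victim.
            if random.random() < p:
                R[random.randrange(R_size)] = r
        else:
            # lines 13-18: overweight record. Rescaling existing weights
            # by C = (|R|-1)f(ri)/(totalWeight-f(ri)) makes the adjusted
            # total equal to |R|f(ri), so ri is accepted outright.
            total_weight = R_size * w
            R[random.randrange(R_size)] = r
    return R
```

With all weights equal the sketch degenerates to classical reservoir sampling; with an extreme weight arriving after the fill, the overweight branch fires and the record is always admitted.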

Case (ii): If |R|f(ri)/totalWeight > 1, we scale the true weight of every existing sample so as to have totalWeight = |R| × f(ri). This is done by first setting C = (|R| - 1)f(ri)/(totalWeight - f(ri)) and then scaling up f(r_k) → C × f(r_k) ∀k < i. As a result of this linear scaling, we have

Pr[r_j ∈ R_i] = (|R| × C × f(r_j)) / (Σ_{k=1}^{i-1} C × f(r_k) + f(r_i)) = |R| × f'(r_j) / Σ_{k=1}^{i} f'(r_k)

An important factor to consider while determining the applicability of Algorithm 7 is the deviation of f' from f. That is: how far off from the correct weighting can we be, in the worst case? When the stream has no overweight records, we expect f' to be exactly equal to f, but it may be very far away under certain circumstances. To address this, we define a distance metric in Definition 2 and evaluate the worst-case distance between f' and f.

analytically bound how far f' can be from f in such a pathological case, and experimentally evaluate the practical significance of this difference. Finally, we derive the correlation (covariance) between the Bernoulli random variables governing the sampling of two records ri and rj using our algorithm. We use this covariance to derive the variance of a Horvitz-Thompson estimator making use of a sample computed using our algorithm.

The rest of the chapter is organized as follows. We describe a single-pass biased sampling algorithm. We also define a distance metric to evaluate the worst-case deviation from the user-defined weighting function f. Finally, we derive a simple estimator for a biased reservoir. The experiments performed to test our algorithms are presented in Chapter 7.

4.1 A Single-Pass Biased Sampling Algorithm

We introduced the classical reservoir sampling algorithm that maintains an unbiased

sample of a data stream in the previous chapter. We will extend this algorithm to give our biased

reservoir sampling algorithm and prove various properties and pathological cases for the same.

4.1.1 Biased Reservoir Sampling

It turns out that in most cases, one may produce a correctly biased sample by simply modifying the reservoir algorithm to maintain a current sum totalWeight over all observed f(ri). Then, incoming records are added to the reservoir so that the probability of sampling record rj is |R| × f(rj)/totalWeight. This basic version of the algorithm is given as Algorithm 6. It is possible to prove that this modified algorithm results in a correctly biased sample, provided that the "probability" from line (8) of Algorithm 6 does not exceed one.
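This claim can also be checked empirically with a short simulation of the basic accept-with-probability scheme. The weights below are hypothetical and chosen so that no record is overweight during the run (and the first |R| weights are equal, so the initial fill introduces no bias).

```python
import random

def basic_biased_reservoir(weights, R_size):
    """Basic scheme in the spirit of Algorithm 6: accept record j with
    probability R_size * f(rj) / totalWeight, evicting a uniformly
    random victim. Correct only while that probability never exceeds
    one (no overweight records)."""
    R, total = [], 0.0
    for j, w in enumerate(weights):
        total += w
        if len(R) < R_size:
            R.append(j)
        elif random.random() < R_size * w / total:
            R[random.randrange(R_size)] = j
    return R

# Empirical check: Pr[rj in R] should approach R_size * f(rj) / sum(f).
random.seed(7)
weights = [1.0, 1.0, 2.0, 3.0, 1.0, 2.0]   # hypothetical, no overweight record
trials = 20000
hits = [0] * len(weights)
for _ in range(trials):
    for j in basic_biased_reservoir(weights, 2):
        hits[j] += 1
expected = [2 * w / sum(weights) for w in weights]
observed = [h / trials for h in hits]
```

Over many trials the observed inclusion frequencies settle near the expected values [0.2, 0.2, 0.4, 0.6, 0.2, 0.4], matching the biased-sampling guarantee formalized next.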

Lemma 3. Let Ri be the state of the biased sample just after the ith record in the stream has been processed. Using the biased sampling described in Algorithm 6, we are guaranteed that for each Ri and for each record rj produced by the data stream such that j ≤ i, we have Pr[rj ∈ Ri] = |R| × f(rj)/Σ_{k=1}^{i} f(rk).

Proof. We need to prove that when a new record ri appears in the stream, then for each record rj from the stream, Pr[rj ∈ Ri] = |R| × f(rj)/Σ_{k=1}^{i} f(rk). A new record produced by the stream is sampled


Figure page

3-1 Decay of a subsample after multiple buffer flushes. 38
3-2 Basic structure of the geometric file. 39
3-3 Building a geometric file. 43
3-4 Distributing new records to existing subsamples. 44
3-5 Speeding up the processing of new samples using multiple geometric files. 54
4-1 Adjustment of r_i^max to r_{i-1}^max. 69
7-1 Results of benchmarking experiments (Processing insertions). 101
7-2 Results of benchmarking experiments (Sampling from a geometric file). 102
7-3 Sum query estimation accuracy for zipf=0.2. 104
7-4 Sum query estimation accuracy for zipf=0.5. 105
7-5 Sum query estimation accuracy for zipf=0.8. 106
7-6 Sum query estimation accuracy for zipf=1. 107
7-7 Disk footprint for 1KB record size. 110
7-8 Disk footprint for 200B record size. 112

records by their page number attribute and retrieve the actual records from the geometric file as a

query result.

In Chapter 7, we evaluate and compare the three index structures suggested in this chapter

experimentally by measuring build time and disk footprint as new records are inserted into the

geometric file. We also compare the efficiency of these structures for point and range queries.

Efficiently searching and discovering information from the geometric file is essential for

query processing. A natural way to support this functionality is to build an index structure. We

discussed three secondary index structures and their maintenance as new records are inserted

into a geometric file. The segment-based and the subsample-based index structures are designed around the structure of the geometric file. The third, the LSM-tree-based index structure, makes use of the LSM-tree, an efficient structure for handling bulk insertions and deletions. We compared these structures for build time, disk space used, and index look-up time.

ACKNOWLEDGMENTS

At the end of my dissertation I would like to thank all those people who made this dissertation possible and an enjoyable experience for me. First of all I wish to express my sincere gratitude to my adviser Chris Jermaine for his patient guidance, encouragement, and excellent advice throughout this study. If I would have access to a magic tool create-your-own-adviser, I still would not have ended up with anyone better than Chris. He always introduces me to interesting research problems. He is around whenever I have a question, but at the same time encourages me to think on my own and work on any problems that interest me. I am also indebted to Alin Dobra for his support and encouragement. Alin is a constant source of enthusiasm. The only topic I have not discussed with him is strategies of Gator football games. I am grateful to my dissertation committee members Tamer Kahveci, Joachim Hammer, and Ravindra Ahuja for their support and their encouragement. I acknowledge the Department of Industrial and Systems Engineering, Ravindra Ahuja, and chair Donald Hearn for the financial support and advice I received during initial years of my studies. Finally, I would like to express my deepest gratitude for the constant support, understanding, and love that I received from my parents during the past years.


TABLE OF CONTENTS

ACKNOWLEDGMENTS 4
LIST OF TABLES 8
LIST OF FIGURES 9
ABSTRACT 10

CHAPTER
1 INTRODUCTION 12
  1.1 The Geometric File 14
  1.2 Biased Reservoir Sampling 16
  1.3 Sampling The Sample 18
  1.4 Index Structures For The Geometric File 19
2 RELATED WORK 22
  2.1 Related Work on Reservoir Sampling 22
  2.2 Biased Sampling Related Work 24
3 THE GEOMETRIC FILE 28
  3.1 Reservoir Sampling 28
  3.2 Sampling: Sometimes a Little is not Enough 30
  3.3 Reservoir for Very Large Samples 31
  3.4 The Geometric File 34
  3.5 Characterizing Subsample Decay 36
  3.6 Geometric File Organization 40
  3.7 Reservoir Sampling With a Geometric File 40
    3.7.1 Introducing the Required Randomness 41
    3.7.2 Handling the Variance 42
    3.7.3 Bounding the Variance 45
  3.8 Choosing Parameter Values 47
    3.8.1 Choosing a Value for Alpha 47
    3.8.2 Choosing a Value for Beta 48
  3.9 Why Reservoir Sampling with a Geometric File is Correct? 49
    3.9.1 Correctness of the Reservoir Sampling Algorithm with a Buffer 49
    3.9.2 Correctness of the Reservoir Sampling Algorithm with a Geometric File 50
  3.10 Multiple Geometric Files 51
  3.11 Reservoir Sampling with Multiple Geometric Files 51
    3.11.1 Consolidation And Merging 53
    3.11.2 How Can Correctness Be Maintained? 53
    3.11.3 Handling the Stacks in Multiple Geometric Files 56
  3.12 ... 56
4 BIASED RESERVOIR SAMPLING 58
  4.1 A Single-Pass Biased Sampling Algorithm 59
    4.1.1 Biased Reservoir Sampling 59
    4.1.2 So, What Can Go Wrong? (And a Simple Solution) 60
    4.1.3 Adjusting Weights of Existing Samples 62
  4.2 Worst Case Analysis for Biased Reservoir Sampling Algorithm 65
    4.2.1 The Proof for the Worst Case 66
    4.2.2 The Proof of Theorem 1: The Upper Bound on totalDist 73
  4.3 Biased Reservoir Sampling With The Geometric File 75
  4.4 Estimation Using a Biased Reservoir 76
5 SAMPLING THE GEOMETRIC FILE 80
  5.1 Why Might We Need To Sample From a Geometric File? 80
  5.2 Different Sampling Plans for the Geometric File 80
  5.3 Batch Sampling From a Geometric File 81
    5.3.1 A Naive Algorithm 81
    5.3.2 A Geometric File Structure-Based Algorithm 82
    5.3.3 Batch Sampling Multiple Geometric Files 84
  5.4 Online Sampling From a Geometric File 84
    5.4.1 A Naive Algorithm 84
    5.4.2 A Geometric File Structure-Based Algorithm 85
  5.5 Sampling A Biased Sample 88
6 INDEX STRUCTURES FOR THE GEOMETRIC FILE 89
  6.1 Why Index a Geometric File? 89
  6.2 Different Index Structures for the Geometric File 90
  6.3 A Segment-Based Index Structure 91
    6.3.1 Index Construction During Start-up 91
    6.3.2 Maintaining Index During Normal Operation 92
    6.3.3 Index Look-Up and Search 93
  6.4 A Subsample-Based Index Structure 93
    6.4.1 Index Construction and Maintenance 94
    6.4.2 Index Look-Up 95
  6.5 A LSM-Tree-Based Index Structure 96
    6.5.1 An LSM-Tree Index 96
    6.5.2 Index Maintenance and Look-Ups 97
7 BENCHMARKING 99
  7.1 Processing Insertions 99
    7.1.1 Experiments Performed 99
    7.1.2 Discussion of Experimental Results 100
  7.2 ... 103
    7.2.1 Experimental Setup 104
    7.2.2 Discussion 106
  7.3 Sampling From a Geometric File 107
    7.3.1 Experiments Performed 108
    7.3.2 Discussion of Experimental Results 109
  7.4 Index Structures For The Geometric File 110
    7.4.1 Experiments Performed 110
    7.4.2 Discussion 112
8 CONCLUSION 116
REFERENCES 118
BIOGRAPHICAL SKETCH 122


Table page

1-1 Population: student records 17
1-2 Random sample of the size = 4 17
1-3 Biased sample of the size = 4 17
7-1 Millions of records inserted in 10 hrs 110
7-2 Query timing results for 1k record, |R| = 10 million, and |B| = 50k 113
7-3 Query timing results for 200 bytes record, |R| = 50 million, and |B| = 250k 114










43 ])maybedesirable.Inordertosavetimeand/orcomputerresources,queriescanthenbeevaluatedoverthesampleratherthantheoriginaldata,aslongastheusercantoleratesomecarefullycontrolledinaccuracyinthequeryresults. Thisparticularapplicationhastwospecicrequirementsthatareaddressedbythedis-sertation.First,itmaybenecessarytousequitealargesampleinordertoachieveacceptableaccuracy;perhapsontheorderofgigabytesinsize.Thisisespeciallytrueifthesamplewillbeusedtoanswerselectivequeriesoraggregatesoverattributeswithhighvariance(seeSec-tion 3.2 ).Second,whatevertherequiredsamplesize,itisoftenindependentofthesizeofthedatabase,sinceestimationaccuracydependsprimarilyonsamplesize Foranotherexampleofacasewhereexistingsamplingmethodscanfallshort,considerstream-baseddatamanagementtasks,suchasnetworkmonitoring(foranexampleofsuchanapplication,wepointtotheGigascopeprojectfromAT&TLaboratories[ 18 20 ]).Giventhetremendousamountofdatatransportedovertoday'scomputernetworks,theonlyconceivablewaytofacilitatead-hoc,after-the-factqueryprocessingoverthesetofpacketsthathavepassedthroughanetworkrouteristobuildsomesortofstatisticalmodelforthosepackets.Themostobviouschoicewouldbetoproduceaverylarge,statisticallyrandomsampleofthepacketsthathavepassedthroughtherouter.Again,maintainingsuchasampleispreciselytheproblemwetackleinthisdissertation.Whileotherresearchershavetackledtheproblemofmaintainingan 16 ]forathoroughtreatmentofnitepopulationrandomsampling). 13


[7], no existing methods have considered how to handle very large samples that exceed the available main memory.

In this dissertation we describe a new data organization called the geometric file and related online algorithms for maintaining a very large, disk-based sample from a data stream. The dissertation is divided into four parts. In the first part we describe the geometric file organization and detail how geometric files can be used to maintain a very large simple random sample. In the second part we propose a simple modification to the classical reservoir sampling algorithm to compute a biased sample in a single pass over the data stream, and describe how the geometric file can be used to maintain a very large biased sample. In the third part we develop techniques which allow a geometric file to itself be sampled in order to produce smaller sets of data objects. Finally, in the fourth part, we discuss secondary index structures for the geometric file. Index structures are useful to speed up search and discovery of required information from a huge sample stored in a geometric file. The index structures must be maintained concurrently with constant updates to the geometric file, and at the same time provide efficient access to its records.

We now give an introduction to these four parts of the dissertation in subsequent sections.


[11, 38]. Reservoir sampling algorithms can be used to dynamically maintain a fixed-size sample of N records from a stream, so that at any given instant, the N records in the sample constitute a true random sample of all of the records that have been produced by the stream. However, as we will discuss in this dissertation, the problem is that existing reservoir techniques are suitable only when the sample is small enough to fit into main memory.

Given that there are limited techniques for maintaining very large samples, the problem addressed in the first part of this dissertation is as follows:

1. The algorithms must be suitable for streaming data, or any similar environment where a large sample must be maintained on-line in a single pass through a data set, with the strict requirement that the sample always be a true, statistically random sample of fixed size N (without replacement) from all of the data produced by the stream thus far.

2. When maintaining the sample, the fraction of I/O time devoted to reads should be close to zero. Ideally, there would never be a need to read a block of samples from disk simply to add one new sample and subsequently write the block out again.

3. The fraction of I/O time spent performing random I/Os should also be close to zero. Costly random disk seeks should be few and far between; almost all I/O should be sequential.

4. Finally, the amount of data written to disk should be bounded by the total size of all of the records that are ever sampled.

The geometric file meets each of the requirements listed above. With memory large enough to buffer |B| > 1 records, the geometric file can be used to maintain an online sample of arbitrary size with an amortized cost of O(ω log |B| / |B|) random disk head movements for each newly sampled record (see Section 3.12). The multiplier ω can be made arbitrarily small by making use of additional disk space. A rigorous benchmark of the geometric file demonstrates its superiority over the obvious alternatives.


The need for biased sampling can easily be illustrated with the example population given in Table 1-1. This particular data set contains records describing graduate student salaries in a university academic department, and our goal is to guess the total graduate student salary. Imagine that a simple random sample of the data set is drawn, as shown in Table 1-2. The four sampled records are then used to guess that the total student salary is (520 + 700 + 580 + 600) × 12/4 = $7200, which is considerably less than the true total of $9545. The problem is that we happened to miss most of the high-salary students, who are generally more important when computing the overall total.

Now, imagine that we weight each record, so that the probability of including any given record with a salary of 700 or greater in the sample is (2)(4/12), and the probability of including a given record with a salary less than 700 is (1/2)(4/12). Thus, our sample will tend to include those records with higher values, which are more important to the overall sum. The resulting biased sample is depicted in Table 1-3. The standard Horvitz-Thompson estimator [50] is then applied to the sample (where each record is weighted according to the inverse of its sampling probability), which gives us an estimate of (1200 + 1500 + 750)(12/8) + (580)(24/4) = $8655. This is obviously a better estimate than $7200, and the fact that it is better than the original estimate is not just accidental: if one chooses the weights carefully, it is easily possible to produce a sample whose associated estimator has lower variance (and hence higher accuracy) than the simple, uniform-probability sample. For instance, the variance of the estimator in the student salary example is 2.533 × 10^6 under uniform-probability sampling, and it is 5.083 × 10^5 under the biased sampling scheme.


Table 1-1. Population: student records

Rec#  Name      Class      Salary ($/month)
1     James     Junior     1200
2     Tom       Freshman    520
3     Sandra    Junior     1250
4     Jim       Senior     1500
5     Ashley    Sophomore   700
6     Jennifer  Freshman    530
7     Robert    Sophomore   750
8     Frank     Freshman    580
9     Rachel    Freshman    605
10    Tim       Freshman    550
11    Maria     Sophomore   760
12    Monica    Freshman    600
Total salary: 9545.00

Table 1-2. Random sample of size 4

Rec#  Name      Class      Salary ($/month)
2     Tom       Freshman    520
5     Ashley    Sophomore   700
8     Frank     Freshman    580
12    Monica    Freshman    600

Other cases where a biased sample is preferable abound. For example, if the goal is to monitor the packets flowing through a network, one may choose to weight more recent packets more heavily, since they would tend to figure more prominently in most query workloads.

We propose a simple modification to the classic reservoir sampling algorithm [11, 38] in order to derive a very simple algorithm that permits the sort of fixed-size, biased sampling given in the example. Our method assumes the existence of an arbitrary, user-defined weighting function f which takes as an argument a record ri, where f(ri) > 0 describes the record's utility

Table 1-3. Biased sample of size 4

Rec#  Name      Class      Salary ($/month)
1     James     Junior     1200
4     Jim       Senior     1500
7     Robert    Sophomore   750
11    Maria     Sophomore   760
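The arithmetic in the example can be checked with a short script. This is an illustrative sketch, not code from the dissertation; the biased-sample salaries below follow the ones used in the text's Horvitz-Thompson calculation ($1200, $1500, $750, and $580).

```python
# Sketch of the student-salary example: a uniform estimate vs. a
# Horvitz-Thompson estimate over a biased sample.
population = [1200, 520, 1250, 1500, 700, 530, 750, 580, 605, 550, 760, 600]

# Uniform sample of 4 records (Table 1-2): scale the sample sum by 12/4.
uniform_sample = [520, 700, 580, 600]
uniform_estimate = sum(uniform_sample) * len(population) / len(uniform_sample)

# Biased scheme: salary >= 700 is included with probability (2)(4/12) = 8/12;
# salary < 700 with probability (1/2)(4/12) = 4/24.
def inclusion_prob(salary):
    return 8 / 12 if salary >= 700 else 4 / 24

# Horvitz-Thompson: weight each sampled record by 1/inclusion_prob.
biased_sample = [1200, 1500, 750, 580]
ht_estimate = sum(s / inclusion_prob(s) for s in biased_sample)

print(uniform_estimate)  # 7200.0
print(ht_estimate)       # ~8655 (up to float rounding); true total is 9545
```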


The key contributions of this part of the dissertation are as follows:

1. We present a modified version of the classic reservoir sampling algorithm that is exceedingly simple, and is applicable for biased sampling using any arbitrary user-defined weighting function f.

2. In most cases, our algorithm is able to produce a correctly biased sample. However, given certain pathological data sets and data orderings, this may not be the case. Our algorithm adapts in this case and provides a correctly biased sample for a slightly modified bias function f'. We analytically bound how far f' can be from f in such a pathological case, and experimentally evaluate the practical significance of this difference.

3. We describe how to perform a biased reservoir sampling and maintain large biased samples with the geometric file.

4. Finally, we derive the correlation (covariance) between the Bernoulli random variables governing the sampling of two records ri and rj using our algorithm. We use this covariance to derive the variance of a Horvitz-Thompson estimator making use of a sample computed using our algorithm.

Small samples frequently do not provide enough accuracy, especially in the case when the resulting statistical estimator has a very high variance. However, while in the general case a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. For example, consider the problem of estimating the average


Since there is no single sample size that is optimal for answering all queries, and the required sample size can vary dramatically from query to query, this part of the dissertation considers the problem of generating a sample of size N from a data stream using an existing geometric file that contains a large sample of records from the stream, where N ≤ R. We will consider two specific problems. First, we consider the case where N is known beforehand. We will refer to a sample retrieved in this manner as a batch sample. We will also consider the case where N is not known beforehand, and we want to implement an iterative function GetNext. Each call to GetNext results in an additional sampled record being returned to the caller, and so N consecutive calls to GetNext result in a sample of size N. We will refer to a sample retrieved in this manner as an online or sequential sample.

In general, an index is a data structure that lets us find a record without having to look at more than a small fraction of all possible records. An index is referred to as a primary index if it


With these goals in mind, we discuss three secondary index structures for the geometric file: (1) a segment-based index, (2) a subsample-based index, and (3) a Log-Structured Merge-Tree- (LSM-) based index. The first two indexes are developed around the structure of the geometric file. Multiple B+-tree indexes are maintained for each segment or subsample in a geometric file. As new records are added to the file in units of a segment or subsample, a new B+-tree indexing the new records is created and added to the index structure. Also, an existing B+-tree is deleted from the structure when all the records indexed by it are deleted from the file. The third index structure makes use of the LSM-tree index [44], a disk-based data structure designed to provide low-cost indexing in an environment with a high rate of inserts and deletes. We evaluate and compare these three index structures experimentally by measuring build time and disk footprint as new records are inserted in the geometric file. We also compare the efficiency of these structures for point and range queries.

… in Chapter 2. In Chapter 3 we present the geometric file organization and show how this structure can be used to maintain a very large


… In Chapter 4 we propose a single-pass biased reservoir sampling algorithm. In Chapter 5 we develop techniques that can be used to sample geometric files to obtain a small sample. In Chapter 6 we present secondary index structures for the geometric file. In Chapter 7 we discuss the benchmarking results. The dissertation is concluded in Chapter 8.

Most of the work in the dissertation is either already published or is under review for publication. The material from Chapter 3 is from the paper with Christopher Jermaine and Subramanian Arumugam that was originally published in SIGMOD 2004 [36]. The work presented in Chapter 4 is submitted to TKDE and is under review [47]. The material in Chapter 5 is part of a journal paper accepted at VLDBJ [48]. The results in Chapter 7 are taken from the above three papers as well.


In this chapter, we first review the literature on reservoir sampling algorithms. We then present a summary of existing work on biased sampling.

… [2, 3, 8, 14, 15, 28, 32, 33, 35, 51, 52]. However, most previous papers (including the aforementioned references) are concerned with how to use a sample, and not with how to actually store or maintain one. Most of these algorithms could be viewed as potential users of a large sample maintained as a geometric file.

As mentioned in the introduction chapter, a series of papers by Olken and Rotem (including two papers listed in the References section [25, 27]) probably constitute the most well-known body of research detailing how to actually compute samples in a database environment. Olken and Rotem give an excellent survey of work in this area [26]. However, most of this work is very different than ours, in that it is concerned primarily with sampling from an existing database file, where it is assumed that the data to be sampled from are all present on disk and indexed by the database. Single-pass sampling is generally not the goal, and when it is, management of the sample itself as a disk-based object is not considered.

The algorithms in this dissertation are based on reservoir sampling, which was first developed in the 1960s [11, 38]. In his well-known paper [53], Vitter extends this early work by describing how to decrease the number of random numbers required to perform the sampling. Vitter's techniques could be used in conjunction with our own, but the focus of existing work on reservoir sampling is again quite different from ours; management of the sample itself is not considered, and the sample is implicitly assumed to be small and in-memory. However, if we remove the requirement that our sample of size N be maintained on-line so that it is always a valid snapshot of the stream and must evolve over time, then there are sequential sampling techniques related to reservoir sampling that could be used to build (but not maintain) a large, on-disk sample (see Vitter [54], for example).


[44], Buffer-Tree [6], and Y-Tree [12]. These papers consider the problem of providing I/O-efficient indexing for a database experiencing a very high record insertion rate, which is impossible to handle using a traditional B+-tree indexing structure. In general these methods buffer a large set of insertions and then scan the entire base relation, which is typically organized as a B+-tree, all at once, adding the new data to the structure.

Any of the above methods could trivially be used to maintain a large random sample of a data stream. Every time a sampling algorithm probabilistically selects a record for insertion, it must overwrite, at random, an existing record of the reservoir. Once an evictee is determined, we can attach its location as a position identifier (a number between 1 and R) to a new sample record. This position field is then used to insert the new record into these index structures. While performing the efficient batch inserts, if an index structure discovers that a record with the same position identifier exists, it simply overwrites the old record with the newer one.

However, none of these methods can come close to the raw write speed of the disk, as the geometric file can [13]. In a sense, the issue is that while the indexing provided by these structures could be used to implement efficient, disk-based reservoir sampling, it is too heavy-duty a solution. We would end up paying too much in terms of disk I/O to send a new record to overwrite a specific, existing record chosen at the time the new record is inserted, when all one really needs is to have a new record overwrite any random, existing record.

There has been much recent interest in approximate query processing over data streams (a very small subset of these papers is listed in the References section [1, 21, 34]); there has even been some work on sampling from a data stream [7]. This work is very different from our own, in that most existing approximation techniques try to operate in very small space. Instead, our focus is on making use of today's very large and very inexpensive secondary storage to physically store the largest snapshot possible of the stream.

Finally, we mention the U.C. Berkeley CONTROL project [37] (which resulted in the development of online aggregation [33] and ripple joins [32]). This work does address issues


[11, 38]. Recently, Gemulla et al. [29] extended the reservoir sampling algorithm to handle deletions. In their algorithm, called random pairing (RP), every deletion from the data set is eventually compensated by a subsequent insertion. The RP algorithm keeps track of uncompensated deletions and uses this information while performing the inserts. The algorithm guards the bound on the sample size, and at the same time utilizes the sample space effectively to provide a stable sample. Another extension to the classic reservoir sampling algorithm has been recently proposed by Brown and Haas for warehousing of sample data [10]. They propose hybrid reservoir sampling for independent and parallel uniform random sampling of multiple streams. These algorithms can be used to maintain a warehouse of sampled data that shadows the full-scale data warehouse. They have also provided methods for merging samples from different streams to create a uniform random sample.

The problem of temporally biased sampling in a stream environment has been considered. Babcock et al. [7] presented the sliding window approach, which restricts the horizon of the sample in order to bias the sample towards the recent streaming records. However, this solution has the potential to completely lose the entire history of past stream data that is not a part of the sliding window. The work done by Aggarwal [5] addresses this limitation and presents a biased sampling method so that we can have temporal bias towards recent records while still keeping representation from the stream history. This work exploits some interesting properties of the class of memoryless bias functions to present a single-pass biased sampling algorithm for this type of bias function. However,


Another piece of work on single-pass sampling with a nonuniform distribution is due to Kolonko and Wasch [40]. They present a single-pass algorithm to sample a data stream of unknown size (that is, not known beforehand) to obtain a sample of arbitrary size n such that the probability of selecting a data item i depends on the individual item. The weight or fitness of the item that is used for its probabilistic selection is derived using exponentially distributed auxiliary values, with the weight as the parameter of the exponential distribution; the largest auxiliary values determine the sample. Like the temporally biased sampling method discussed above, this algorithm cannot be directly adapted for arbitrary user-defined bias functions.

Surprisingly, the above three papers are the only pieces of work known to the authors on how to perform single-pass biased sampling over large data sets or streaming data.

Another body of related work is the set of papers from the network usage area [22, 24, 41]. These papers present techniques for estimating the total network traffic (or usage) based on a sample of the flow records produced by routers. Since these flows typically have heavy-tailed distributions, the techniques presented in these papers make use of size-dependent sampling schemes. In general, such schemes work by sampling all the records whose traffic is above a certain threshold and sampling the rest with probability proportional to their traffic. Although such techniques introduce sampling bias, where size can be thought of as the weight of a record, there are key differences between such techniques and the algorithm presented in this dissertation. The goal of our algorithm is to obtain a fixed-size biased sample that complies with an arbitrary user-defined bias function. The goal of the size-dependent sampling scheme is to obtain a sample that will provide the best accuracy for estimating the total network traffic that follows a specific distribution. The sample gathered by these schemes is not necessarily a fixed-size biased sample. It only guarantees that the expected sample size is no larger than the expected sample


The problem of implementing a fixed-size sampling design with desired, unequal inclusion probabilities has been studied in statistics. The monograph Theory of Sample Surveys [50] discusses several methods for such a sampling technique, which is of some practical importance in survey sampling. This monograph begins by discussing two designs which mimic simple random sampling without replacement, with selection probabilities for a given draw that are not the same for all the units. We first summarize these techniques.


with fixed size is to select units with replacement, and then to reject the sample if there are duplicates. We discuss one such method here, called Sampford's method.


In this chapter we give an introduction to the basic reservoir sampling algorithm that was proposed to obtain an online random sample of a data stream. The algorithm assumes that the sample maintained is small enough to fit in main memory in its entirety. We discuss and motivate why very large sample sizes can be mandatory in common situations. We describe three alternatives for maintaining very large, disk-based samples in a streaming environment. We then introduce the geometric file organization and present algorithms for reservoir sampling with the geometric file. We also describe how multiple geometric files can be maintained all-at-once to achieve considerable speedup.

… [11, 38]. To maintain a reservoir sample R of target size |R|, the following loop is used:
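A minimal sketch of this classic loop in Python (illustrative only; the function name and use of Python's random module are ours, and the step numbering of the original pseudocode is not reproduced):

```python
import random

def reservoir_sample(stream, R):
    """Classic reservoir sampling: after i records have been seen, each
    record seen so far is in the reservoir with probability R/i."""
    reservoir = []
    for i, record in enumerate(stream, start=1):
        if i <= R:
            reservoir.append(record)                 # fill phase
        elif random.random() < R / i:                # accept with probability R/i...
            reservoir[random.randrange(R)] = record  # ...evicting a random victim
    return reservoir

sample = reservoir_sample(range(100_000), 1000)
print(len(sample))  # always exactly 1000
```

Note that the reservoir is updated in place, one record at a time; it is exactly this fine-grained, random-position overwrite that becomes problematic once the reservoir lives on disk.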


[53]. After a certain number of records have been seen, the algorithm wakes up and captures the next record from the stream.

Algorithm 1 maintains this invariant in steps (2-6) as follows [11, 38]. When the ith record is processed (i > |R|), it is added to the reservoir with probability |R|/i by step 4. We need to show that for all other records processed thus far, the inclusion probability is |R|/i. Let rk be any record in the reservoir s.t. k ≠ i. Let Ri denote the state of the reservoir just after addition of the ith record. Thus, we are interested in

Pr[rk ∈ Ri] = Pr[rk ∈ Ri−1] × (1 − (R/i)(1/R)) = (R/(i−1)) × ((i−1)/i) = R/i

… [16]. To select a sample of |R| units, systematic sampling takes a unit at random from the first k units and every kth unit thereafter. Although the inclusion probability in systematic sampling is the same as in simple random sampling, the properties of a sample such as its variance can be far different. It is known that the variance of systematic sampling can be better or worse than that of simple random sampling, depending on data heterogeneity and the correlation coefficient between pairs of sampled units.


The proof that reservoir sampling maintains the correct inclusion probability for any set of interest is actually very similar to the univariate inclusion probability correctness discussed above. We know that the univariate inclusion probability is Pr[rk ∈ Ri] = R/i. For any arbitrary value of |S| ≤ |R|, assume that we have the correct probabilities when we have seen i−1 input records, i.e. Pr[S ⊆ Ri−1] = C(|R|, |S|) / C(i−1, |S|), where C(a, b) denotes the binomial coefficient "a choose b". When the ith record is processed (i > |R|), we have

Pr[S ⊆ Ri] = Pr[S ⊆ Ri−1] × ((1 − R/i) + (R/i)((R − |S|)/R))
           = (C(|R|, |S|) / C(i−1, |S|)) × ((i − |S|)/i)
           = C(|R|, |S|) / C(i, |S|)

… (see [16]). This value is known as the confidence of the estimate.

Very large samples are often required to provide accurate estimates with suitably high confidence. The need for very large samples can be easily explained in the context of the Central Limit Theorem (CLT) [27]. The CLT implies that if we use a random sample of size N to estimate the mean of a set of numbers, the error of our estimate is usually normally


1. The error is inversely proportional to the square root of the sample size.

2. The error is directly proportional to the standard deviation of the set over which we are estimating the mean.

The significance of this observation is that the sample size required to produce an accurate estimate can vary tremendously in practice, and grows quadratically with increasing standard deviation. For example, say that we use a random sample of 100 students at a university to estimate the average student age. Imagine that the average age is 20, with a standard deviation of 2 years. According to the CLT, our sample-based estimate will be accurate to within 2.5% with confidence of around 98%, giving us an accurate guess as to the correct answer with only 100 sampled students.

Now, consider a second scenario. We want to use a second random sample to estimate the average net worth of households in the United States, which is around $140,000, with a standard deviation of at least $5,000,000. Because the standard deviation is so large, a quick calculation shows we will need more than 12 million samples to achieve the same statistical guarantees as in the first case.

Required sample sizes can be far larger when standard database operations like relational selection and join are considered, because these operations can effectively magnify the variance of our estimate. For example, the work on ripple joins [32] provides an excellent example of how variance can be magnified by sampling over the relational join operator.
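Both back-of-the-envelope calculations follow directly from the CLT's normal approximation. A sketch (the function names are ours, purely illustrative):

```python
from math import erf, sqrt

def confidence(n, sigma, mu, rel_err):
    """P(|estimate - mu| < rel_err*mu) for a size-n sample mean, under the
    CLT normal approximation: 2*Phi(z) - 1 = erf(z/sqrt(2))."""
    z = rel_err * mu * sqrt(n) / sigma
    return erf(z / sqrt(2))

def required_n(sigma, mu, rel_err, z):
    """Sample size keeping the error within rel_err*mu at z normal units."""
    return (z * sigma / (rel_err * mu)) ** 2

# Ages example: n = 100, sigma = 2, mu = 20, within 2.5%.
print(confidence(100, 2, 20, 0.025))               # ~0.988 ("around 98%")

# Net-worth example: sigma = $5M, mu = $140K, same 2.5% / z = 2.5 target.
print(required_n(5_000_000, 140_000, 0.025, 2.5))  # ~12.8 million samples
```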


… Algorithm 2. Count(B) refers to the current number of records in B. Note that since the records contained in B logically represent records in the reservoir that have not yet been added to disk, a newly-sampled record can either be assigned to replace an on-disk record, or it can be assigned to replace a buffered record (this is decided in Step (7) of the algorithm).

In a realistic scenario, the ratio of the number of disk blocks to the number of records buffered in main memory may approach or even exceed one. For example, a 1TB database with 128KB blocks will have 7.8 million blocks; and for such a relatively large database it is realistic to expect that we have access to enough memory to buffer millions of records. As the number of buffered records per block meets or exceeds one, most or all of the blocks on disk will contain


… Algorithm 2, and so all of the database blocks must be updated. Thus, it makes sense to rely on fast, sequential I/O to update the entire file in a single pass. The drawback of this approach is that every time the buffer fills, we are effectively rebuilding the entire reservoir to process a set of buffered records that are a small fraction of the existing reservoir size.
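That drawback can be put in rough numbers. A sketch (the model and parameter values are ours, purely illustrative): every flush of |B| buffered records forces a rewrite of all |R| on-disk records, so each newly sampled record costs |R|/|B| record writes.

```python
# Rough write-amplification model of the "massive rebuild" approach:
# flushing a full buffer of |B| records rewrites the whole |R|-record
# reservoir, i.e. |R|/|B| record writes per newly sampled record.
def rebuild_write_amplification(B, R):
    return R / B

# e.g., a 10-million-record buffer feeding a 10-billion-record reservoir:
print(rebuild_write_amplification(10**7, 10**10))  # 1000.0
```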


… Algorithm 1 can be used to maintain a large, on-disk sample, but all of them have drawbacks. In this section, we discuss a fourth algorithm and an associated data organization called the geometric file to address these pitfalls. The geometric file is best seen as an extension of the massive rebuild option given as Algorithm 2. Just like Algorithm 2, the geometric file makes use of a main-memory buffer that allows new samples selected by the reservoir algorithm to be added to the on-disk reservoir in a lazy fashion. However, the key difference between Algorithm 2 and the algorithms used by the geometric file is that the geometric file makes use of a far more efficient algorithm for merging those new samples into the reservoir.

Compared to Algorithm 2, the basic algorithm employed by the geometric file is not much different. As far as Step (13) is concerned, the difference between the geometric file and the massive rebuild extension is that the geometric file empties the buffer more efficiently, in order to avoid scanning or periodically re-randomizing the entire reservoir.

To accomplish this, the entire sample in main memory that is flushed into the reservoir is viewed as a single subsample or a stratum [16], and the reservoir itself is viewed as a collection of subsamples, each formed via a single buffer flush. Since the records in a subsample are a non-random subset of the records in the reservoir (they are sampled from the stream during a specific time period), each new subsample needs to overwrite a true, random subset of the records in the reservoir in order to maintain the correctness of the reservoir sampling algorithm. If this can be done efficiently, we can avoid rebuilding the entire reservoir in order to process a buffer flush.

At first glance, it may seem difficult to achieve the desired efficiency. The buffered records that must be added to the reservoir will typically overwrite a subset of the records stored in each


(see Section 3.3). For example, if there are 100 on-disk subsamples, the buffer must be split 100 ways in order to write to a portion of each of the 100 on-disk subsamples. This fragmented buffer then becomes a new subsample, and subsequent buffer flushes that need to replace a random portion of this subsample must somehow efficiently overwrite a random subset of the subsample's fragmented data.

The geometric file uses a careful, on-disk data organization in order to avoid such fragmentation. The key observation behind the geometric file is that the number of records of a subsample that are replaced with records from the buffered sample can be characterized with reasonable accuracy using a geometric series (hence the name "geometric file"). As buffered samples are added to the reservoir via buffer flushes, we observe that each existing subsample loses approximately the same fraction of its remaining records every time, where the fraction of records lost is governed by the ratio of the size of a buffered sample to the overall size of the reservoir. By "loses," we mean that the subsample has some of its records replaced in the reservoir with records from a subsequent subsample. Thus, the size of a subsample decays approximately in an exponential manner as buffered samples are added to the reservoir.

This exponential decay is used to great advantage in the geometric file, because it suggests a way to organize the data in order to avoid problems with fragmentation. Each subsample is partitioned into a set of segments of exponentially decreasing size. These segments are sized so that every time a buffered sample is added to the reservoir, we expect that each existing subsample loses exactly the set of records contained in its largest remaining segment. As a result, each subsample loses one segment to the newly-created subsample every time the buffer is emptied, and a geometric file can be organized into a fixed and unchanging set of segments that are stored as contiguous runs of blocks on disk. Because the set of segments is fixed beforehand, fragmentation and update performance are not problematic: in order to replace records in an


On day two (with U1 = 90), the Uranium further decays to U2 = 81 grams, this time losing U1(1−α) = U0(1−α)α = nα = 9 grams of its mass. On day three, it further decays by nα² = 8.1 grams, and so on. The decay process is allowed to continue until we have less than τ grams of Uranium remaining.

Continuing with the Uranium analogy, three questions that are relevant to our problem of maintaining very large samples from a data stream are: …

These questions can be answered using the following three simple observations related to geometric series: … ⌊log_α(τ/U0)⌋; we denote this floor by ℓ.


(Algorithm 2). Recall that the way reservoir sampling works is that new samples from the data stream are chosen to overwrite random samples currently in the reservoir. The buffer temporarily stores these new samples, delaying the overwrite of a random set of records that are already stored on disk. Once the buffer is full, all new samples are merged with R by overwriting a random subset of the existing samples in R.

Consider some arbitrary subsample S of R (so S ⊆ R), with capacity |S|. Since the buffer B represents the samples that have already overwritten an equal number of records of R, a buffer flush overwrites exactly |B| samples of R. Thus, on expectation the merge will overwrite |S| × |B|/|R| samples of S. If we define |B|/|R| = 1 − α, then on expectation, S should lose |S|(1 − α) of its own records due to the buffer flush.

We can roughly describe the expected decay of S after repeated buffer merges using the three observations stated before. If the subsample retention rate is α = 1 − |B|/|R|, then: …

The net result of this is that it is possible to characterize the expected decay of any arbitrary subset of the records in our disk-based sample as new records are added to the sample through multiple emptyings of the buffer. If we view S as being composed of on-disk segments of exponentially decreasing size, plus a special, single group of final segments of total size τ (see Section 3.7).
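The expected decay and the segment sizes it induces can be sketched numerically. This is an illustrative sketch (the toy parameter values are ours), using the symbols as above: retention rate alpha = 1 − |B|/|R|, and tau for the threshold below which a subsample's final segments stay buffered in memory.

```python
def expected_remaining(S, alpha, flushes):
    """Expected records a subsample of original size S still holds after
    the given number of buffer flushes: S * alpha^flushes."""
    return S * alpha ** flushes

def segment_sizes(S, alpha, tau):
    """Segments sized to the expected per-flush loss: flush k removes
    about S*(1-alpha)*alpha^k records. Sizing stops at threshold tau;
    the remainder would be kept as the in-memory group of final segments."""
    sizes, remaining = [], S
    while remaining * (1 - alpha) >= tau:
        lost = remaining * (1 - alpha)
        sizes.append(round(lost))
        remaining -= lost
    return sizes, remaining

# Toy numbers: |S| = 10,000, |B|/|R| = 0.1 (so alpha = 0.9), tau = 50.
sizes, tail = segment_sizes(10_000, 0.9, 50)
print(sizes[:3], len(sizes))  # [1000, 900, 810] 29
```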


Figure 3-1. Decay of a subsample after multiple buffer flushes.


Figure 3-2. Basic structure of the geometric file.


… Figure 3-1, we can organize our large, disk-based sample as a set of decaying subsamples. At any point of time, the largest subsample was created by the most recent flushing of the buffer into R, and has not yet lost any segments. The second largest subsample was created by the second most recent buffer flush; it lost its largest segment in the most recent buffer flush. In general, the ith largest subsample was created by the ith most recent buffer flush, and it has had i−1 segments removed by subsequent buffer flushes. The overall file organization is depicted in Figure 3-2.

… Algorithm 3. The terms n, α, and τ carry the meaning discussed in Section 3.5. The process described by Algorithm 3 is depicted graphically in Figure 3-3. First, the file is filled with the initial data produced by the stream (a through c). To add the first records to the file, the buffer is allowed to fill with samples. The buffered records are then randomly grouped into segments, and the segments are written to disk to form the largest initial subsample (a). For the second initial subsample, the buffer is only allowed to fill to α|B| before being written out (b). For the third initial subsample, the buffer fills to α²|B| before it is written (c). This is repeated until the reservoir has completely filled (as was shown in Figure 3-2). At this point, new samples must overwrite existing ones. To facilitate this, the buffer is again allowed to fill to capacity. Records are then randomly grouped into segments of appropriate size, and those segments overwrite the largest segment of each existing subsample (d). This process is then repeated indefinitely, as long as the stream produces new records (e and f).
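The fill phases (a) through (c) can be checked numerically: successive flushes of |B|, α|B|, α²|B|, … records have total |B|/(1−α) = |R|, which is why the reservoir fills exactly. A sketch with toy numbers (the function name and cutoff handling are ours, purely illustrative):

```python
# Initial fill of the file: flush sizes |B|, alpha*|B|, alpha^2*|B|, ...
# form a geometric series with total |B|/(1-alpha) = |R|.
def initial_flush_sizes(B, alpha, R):
    sizes, total, size = [], 0, float(B)
    while total + int(size) <= R and int(size) >= 1:
        sizes.append(int(size))
        total += int(size)
        size *= alpha
    return sizes

# Toy numbers: |B| = 1000, alpha = 0.9, so |R| = 1000/0.1 = 10,000.
sizes = initial_flush_sizes(1000, 0.9, 10_000)
print(sizes[:3], sum(sizes))  # [1000, 900, 810] and a total close to 10,000
```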


… (Section 3.7.1) … Algorithm 3, Step (21). In order to maintain the algorithm's correctness, when the buffer is flushed


… Figure 3-4. Now, we want to add five additional numbers to our set by randomly replacing five existing numbers. While we do expect numbers to be replaced in a way that is proportional to bucket size (Figure 3-4(b)), this is not always what will happen (Figure 3-4(c)).

… Algorithm 3. Before we add a new subsample to disk via a buffer flush in Step (21), we first perform a logical, randomized partitioning of the buffer into segments, described by Algorithm 4. In Algorithm 4, each newly-sampled record is randomly assigned to replace a sample from an existing, on-disk subsample, so that the probability of each subsample losing a record is proportional to its size. The result of Algorithm 4 is an array of Mi values, where Mi tells Step (21) of Algorithm 3 how many records should be assigned to overwrite the ith on-disk subsample.

… Algorithm 3 will overwrite exactly the number of records contained in each


Figure 3-3. Building a geometric file.
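The size-proportional random assignment performed by the buffer-partitioning step can be sketched as follows. This is a simplified, illustrative version (our own code, producing only the counts Mi), not the dissertation's Algorithm 4:

```python
import random

def partition_buffer(buffer_size, subsample_sizes, rng):
    """Assign each of buffer_size new records to overwrite a record of an
    existing subsample, chosen with probability proportional to its size.
    Returns M, where M[i] counts records aimed at the i-th subsample."""
    total = sum(subsample_sizes)
    M = [0] * len(subsample_sizes)
    for _ in range(buffer_size):
        r = rng.random() * total          # uniform in [0, total)
        for i, size in enumerate(subsample_sizes):
            r -= size
            if r < 0:                     # landed in subsample i's slice
                M[i] += 1
                break
    return M

M = partition_buffer(1000, [500, 300, 150, 50], random.Random(1))
print(M, sum(M))  # counts roughly proportional to 500:300:150:50; sum is 1000
```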


Figure 3-4. Distributing new records to existing subsamples.

subsample's largest segment. To handle this problem, we associate a stack (or buffer


… Algorithm 4, Mi should have been …. Then, there are two possible cases: …

These stack operations are performed just prior to Step (23) in Algorithm 3. Note that since the final group of segments from a subsample, of total size τ, is buffered in main memory, its maintenance does not require any stack operations. Once a subsample has lost all of its on-disk samples, overwrites of records in this set can be handled by simply replacing the records directly.

To pre-allocate space for these stacks, we need to characterize how much overflow we can expect from a given subsample, which will bound the growth of the subsample's stack. It is important to have a good characterization of the expected stack growth. If we allocate too much space for the stacks, then we allocate disk space for storage that is never used. If we allocate too little space, then the top of one stack may grow up into the base of another. If a stack does overflow, it can be handled by buffering the additional records temporarily in memory or moving the stack to a new location on disk until the stack can again fit in its allocated space. This is not


To avoid this, we observe that if the stack associated with a subsample S contains any samples at a given moment, then S has had fewer of its own samples removed than expected. Thus, our problem of bounding the growth of S's stack is equivalent to bounding the difference between the expected and the observed number of samples that S loses as |B| new samples are added to the reservoir, over all possible values for |B|.

To bound this difference, we first note that after adding |B| new samples into the reservoir, the probability that any existing sample in the reservoir has been overwritten by a new sample is 1 − (1 − 1/|R|)^|B| … [42]. Simple arithmetic implies that the greatest variance is achieved when a subsample has on expectation lost 50% of its records to new samples (P = 0.5); at this point the standard deviation is 0.5√


To illustrate the importance of minimizing ω, imagine that we have a 1GB buffer and a stream producing 100B records, and we want to maintain a 1TB sample. Assume that we use an α value of 0.99. Thus, each subsample is originally 1GB, and |B| = 10^7. From Observation 2 we know that ⌊log₀.₉₉(τ/|B|)⌋ = 1029 segments are needed to store the entire new subsample.

Now, consider the situation if α = 0.999. A similar computation shows that we will now require 10,344 segments to store the same 1GB subsample. This is an order-of-magnitude difference, with significant practical importance. With four disk seeks per segment, 1029 segments might mean that we spend around 40 seconds of disk time in random I/Os (at 10ms


… |R|. We will address this limitation in Section 3.10. … ⌊log₀.₉₉ …⌋, or 687. By increasing the amount of main memory devoted to holding the smallest segments for each subsample by a factor of 32, we are able to reduce the number of disk head movements by less than a factor of two. Thus, we will not consider optimizing τ further. Rather, we will fix τ to hold a set of samples equivalent to the system block size, and search for a better way to increase performance.
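The order-of-magnitude gap between the two α values follows from the base of the logarithm in the geometric-series observations. A sketch (our own code; the ratio argument 3.2e-5 is an illustrative stand-in for the τ/|B| ratio in the example, chosen to match the segment counts quoted above):

```python
from math import floor, log

def num_segments(alpha, ratio):
    """Segments needed before a geometrically decaying subsample falls
    below a given fraction `ratio` of its starting size: floor of
    log base alpha of ratio."""
    return floor(log(ratio) / log(alpha))

# ratio = 3.2e-5 is an illustrative stand-in for tau/|B| in the example:
print(num_segments(0.99, 3.2e-5), num_segments(0.999, 3.2e-5))  # 1029 10344
```

Since log(α) ≈ −(1 − α) for α near one, the segment count (and hence the number of seeks) grows roughly like 1/(1 − α).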


1. Why is the classical reservoir sampling algorithm (presented as Algorithm 1) correct? That is, what is the invariant maintained by Algorithm 1?

2. Why is the obvious disk-based extension of Algorithm 1 (presented as Algorithm 2) correct? That is, how does Algorithm 2 maintain the invariant of Algorithm 1 via the use of a main memory buffer?

3. Why is the proposed geometric file based sampling technique in Algorithm 3 correct?

We have answered the first question in Section 3.1. We discuss the second and third questions here.

Algorithm 2 makes use of the main memory buffer of size |B| to buffer new samples. The buffered samples logically represent a set of samples that should have been used to replace on-disk samples in order to preserve the correctness of the sampling algorithm, but that have not yet been moved to disk for performance reasons (that is, due to lazy writes).

It is not hard to see that the invariant maintained by Algorithm 1 is also maintained by Algorithm 2 in step (6). The new records are sampled with the same probability |R|/i. The only difference is that newly sampled records are added to the reservoir using steps (7-14) instead of the simple steps (5-6) of Algorithm 1. We now discuss why these steps are equivalent.

One straightforward way of keeping the sampled records in the buffer and doing lazy writes is as follows. Every time we decide to add a new sample to the buffer (i.e., with probability |R|/i), we also generate a random number between 1 and R to decide its position in the reservoir. However, we store this position in the position array and thus avoid an immediate disk seek. If we happen to generate a position that is already in the position array, we overwrite the corresponding record in the buffer with the newly sampled record. Had we flushed that record to disk using the classic algorithm (rather than buffering it), we would have replaced it with the newly sampled record, so we obtain the same result. Once the buffer is full we


Algorithm 1 as far as correctness is concerned.

Logically, steps (7-14) of Algorithm 2 actually implement exactly this process. The probability that we will generate a random position between 1 and |R| that is already in the position array of size |B| is |B|/|R|. Step (7) of Algorithm 2 decides whether to overwrite a random buffered record with a newly sampled record. Once the buffer is full, step (13) performs a one-pass buffer-reservoir merge by generating sequential random positions in the reservoir on the fly.

In Algorithm 2 we store the samples sequentially on the disk and overwrite them in a random order. Though correct, the algorithm demands almost a complete scan of the reservoir (to perform all random overwrites) for every buffer flush. We can do better if we instead force the samples to be stored in a random order on disk so that they can be replaced via an overwrite using sequential I/Os. The localized overwrite extension discussed before uses this idea. Every time a buffer is flushed to the reservoir it is randomized in main memory and written as a random cluster on the disk. We maintain the correctness of this technique by splitting the random cluster N ways, where N is the number of existing clusters on the disk, and by overwriting a random subset of each existing cluster. This avoids the problem of clustering by insertion time. However, the drawback of this technique is that the solution deteriorates because of fragmentation of clusters.

The geometric file overcomes the drawbacks of these two techniques and can be viewed as a combination of Algorithm 2 and the idea used in the localized overwrite extension. The correctness of the geometric file results directly from the correctness of these two techniques. In the case of the geometric file, the entire sample in main memory (referred to as a subsample) is randomized and flushed into the reservoir. Furthermore, each new subsample is split into exactly as many segments as the number of existing subsamples on the disk. These segments then overwrite a random portion of each disk-based subsample. The only difference with the


Algorithm 1, is fixed by the ratio |B|/|R|. That is, for a fixed desired size of reservoir we need a larger buffer to lower the value of α.

However, there is a way to improve the situation. Given a buffer of fixed capacity |B| and desired sample size |R|, we choose a smaller value α′ < α, and then maintain more than one geometric file at the same time to achieve a large enough sample. Specifically, we need to maintain m = (1−α′)/(1−α) geometric files at once. These files are identical to what we have described thus far, except that the parameter α′ is used to compute the sizes of a subsample's on-disk segments, and the size of each file is |R|/m (cf. Algorithm 3). Each of the m geometric files is still treated as a set of decaying subsamples, and each subsample is partitioned into a set of segments of exponentially decreasing size, just as is done in Algorithm 3, Steps (5)-(13). The only difference is that as each file is created, the parameter α′ is used instead of α in Steps (6), (8)-(9), and each of the m geometric files is filled one after another, in turn. Thus, each subsample of each geometric file will have segments of size n, nα′, nα′², and so on.
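As a small sketch (the function and variable names are ours, not from the text), the exponentially decreasing segment sizes n, nα′, nα′², ... of one subsample can be generated like this, truncating once a segment would shrink below one disk block:

```python
def subsample_segment_sizes(n, alpha, block_records):
    """Segment sizes for one subsample: n, n*alpha, n*alpha^2, ...,
    truncated once a segment would fall below one disk block."""
    sizes = []
    size = float(n)
    while size >= block_records:
        sizes.append(int(size))
        size *= alpha
    return sizes
```

With n = 1000, α = 0.5, and a 100-record block this yields sizes 1000, 500, 250, 125; a smaller α produces fewer, faster-decaying segments, which is exactly the trade-off the multiple-file construction exploits.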


Algorithm 3, Steps (15)-(20), until the buffer is full. Once the buffer is full, its record order is randomized, just as in a single geometric file. Next the buffer is flushed to disk. This is where the algorithm is modified. Overwriting records on disk with records from the buffer is somewhat different, in two primary ways, as discussed next.

In Algorithm 4, the buffer is partitioned so that the size of each buffer segment is, on expectation, proportional to the current size of the subsamples in a single file. In the case of multiple geometric files, we partition the buffer just as in Algorithm 4; however, we randomly partition the buffer across all subsamples from all geometric files. The number of buffer segments after the partitioning is the same as the total number of subsamples in the entire reservoir, and the size of each buffer segment is, on expectation, proportional to the current size of each of the subsamples from one of the geometric files. This allows us to maintain the correctness of the reservoir sampling algorithm. The buffer partitioning steps in the case of multiple geometric files are given in Algorithm 5.

The second difference concerns Algorithm 3's buffer merge algorithm. We discuss all the intricacies subsequently, but at a high level, the largest segment of each subsample from only one geometric file is overwritten with samples from the buffer. This allows for considerable speedup, as we discuss in Section 3.12. At first, this would seem to compromise the correctness of the algorithm: logically, the buffered samples must overwrite samples from every one of the geometric files (in fact, this is precisely why the buffer is partitioned across all geometric files, as


In Sections 3.11.1 to 3.11.3, we describe in detail an algorithm that is able to maintain the correctness of the sample.

Once the segments assigned to the various files have been consolidated, the resulting segments are used to overwrite subsamples from a single geometric file using exactly the algorithm from Section 3.4, subject to the constraint that the jth buffer merge overwrites subsamples from the (j mod m)th geometric file.

Our remedy to this problem is to delay overwriting a subsample's largest segment until the time that all (or most) of the records that will be overwritten on disk are invalid, in the sense that they have logically been overwritten by having records from subsequent buffer


Speeding up the processing of new samples using multiple geometric files.


The way to accomplish this is to overwrite subsamples in a lazy manner. We merge the buffer with the (j mod m)th geometric file, but we do not overwrite any of the valid samples stored in the file until the next time we get to the file. We can achieve this by allocating enough extra space in each geometric file to hold a complete, empty subsample. This subsample is referred to as the dummy. The dummy never decays in size, and never stores its own samples. Rather, it is used as a buffer that allows us to sidestep the problem of a subsample decaying too quickly. When a new subsample is added to a geometric file, the new subsample overwrites segments of the dummy rather than overwriting the largest segment of any existing subsample. Thus, we have protected segments of subsamples that contain valid data by overwriting the dummy's records instead.

When records are merged from the buffer into the dummy, the space previously owned by the dummy is given up to allow storage of the file's newest subsample. After this flush, the largest segment from each of the subsamples in the file is given up to reconstitute the new dummy. Because the records in the (new) dummy's segments will not be overwritten until the next time that this particular geometric file is written to, all of the data contained within it is protected.

Note that with a dummy subsample, we no longer have a problem with a subsample losing its samples too quickly. Instead, a subsample may have slightly too many samples present on disk at any given time, buffered by the file's dummy. These extra samples can easily be ignored during query processing. The only additional cost we incur with the dummy is that each of the geometric files on disk must have |B| additional units of storage allocated. The use of a dummy subsample is illustrated in Figure 3-5.


Proof (of Lemma 2). The number of segments per subsample is ⌈log_{α′} c⌉. Substituting n = (1−α′)|B| and simplifying the expression (as well as


⌈log_{α′}(·)⌉, which is on the order of log |B|. If we let ω = (log(1/α′))⁻¹, the number of segments can be expressed as ω(log |B| − log ℓ), where ℓ is the size of the smallest segment. Assuming a constant number c of random seeks per segment written to the disk, the total number of random disk head movements required per record is ωc((log |B| − log ℓ)/|B|), which is O(ω log |B| / |B|).

In the case of multiple geometric files we use additional space for m dummy subsamples. Thus, the total storage required by all geometric files is |R| + (m|B|). If we wish to maintain a 1TB reservoir of 100B samples with 1GB of memory, we can achieve α′ = 0.9 by using only 1.1TB of disk storage in total. For α′ = 0.9, we need to write fewer than 100 segments per 1GB buffer flush. At 40 ms/segment, this is only 4 seconds of random disk head movements to write 1GB of new samples to disk.

In order to test the relative ability of the geometric file to process a high-speed stream of insertions, we have implemented and benchmarked five alternatives for maintaining a large reservoir on disk: the three alternatives discussed in Section 3.3, the geometric file, and the framework described in Section 3.10 for using multiple geometric files at once. We present these benchmarking results in Chapter 7.
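The "fewer than 100 segments per 1GB flush" figure can be checked with a short sketch (our own names; we assume a 1GB buffer of 100-byte records, i.e. 10⁷ records, decaying geometrically down to a 1000-record block):

```python
import math

def segments_per_flush(buffer_records, block_records, alpha):
    """Number of geometrically decaying segments written per buffer flush:
    sizes fall from ~|B| records down to one disk block, so the count is
    ceil(log_alpha(block/buffer))."""
    return math.ceil(math.log(block_records / buffer_records, alpha))
```

For α′ = 0.9 this gives 88 segments per flush, consistent with the "fewer than 100" estimate above.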


In this chapter we propose a simple modification to the classic reservoir sampling algorithm [11, 38] in order to derive a very simple algorithm that permits the sort of fixed-size, biased sampling given in the example. Our method assumes the existence of an arbitrary, user-defined weighting function f which takes as an argument a record ri, where f(ri) > 0 describes the record's utility in subsequent query processing. We then compute (in a single pass) a biased sample Ri of the i records produced by a data stream. Ri is fixed-size, and the probability of sampling the jth record from the stream is proportional to f(rj) for all j ≤ i. This is a fairly simple and yet powerful definition of biased sampling, and is general enough to support many applications.

Of course, one straightforward way to sample according to a well-defined bias function would be to make a complete pass over the data set to compute the total weight of all the records, Σ_{j=1}^{N} f(rj). During a second pass, we can then choose the ith record of the data set with probability |R| f(ri) / Σ_{j=1}^{N} f(rj).

In most cases, our algorithm is able to produce a correctly biased sample. However, given certain pathological data sets and data orderings, this may not be the case. Our algorithm adapts in this case and provides a correctly biased sample for a slightly modified bias function f′. We


The rest of the chapter is organized as follows. We describe a single-pass biased sampling algorithm. We also define a distance metric to evaluate the worst-case deviation from the user-defined weighting function f. Finally, we derive a simple estimator for a biased reservoir. The experiments performed to test our algorithms are presented in Chapter 7.

Our basic method is given as Algorithm 6. It is possible to prove that this modified algorithm results in a correctly biased sample, provided that the probability from line (8) of Algorithm 6 does not exceed one.

Using Algorithm 6, we are guaranteed that for each Ri and for each record rj produced by the data stream such that j ≤ i, we have Pr[rj ∈ Ri] = |R| f(rj) / totalWeight.

Proof.




We define an overweight record to be a record ri for which |R| f(ri) / totalWeight > 1.

An important factor to consider while determining the feasibility of maintaining such a queue in the general case is providing an upper bound on its size. This can be done by considering the worst possible ordering of the records input into the algorithm, subject to the constraint that the bias function is well-defined. In general, we describe the user-defined weighting function f as being well-defined if |R| f(ri) ≤ Σ_{j=1}^{i} f(rj) for all i.


We stress that though this upper bound is quite poor (requiring that we buffer the entire data stream!), it is in fact a worst-case scenario, and the approach will often be feasible in practice. This is because weights will often increase monotonically over time (as in the case where newer records tend to be more relevant for query processing than older ones). Still, given the poor worst-case upper bound, a more robust solution is required, which we now describe.


1. First, we will be able to guarantee that f′(rj) will be exactly f(rj) if (|R| f(rk)) / totalWeight ≤ 1 for all k > j.

2. We can also guarantee that we can compute the true weight for a given record in order to unbias any estimate made using our sample (see Section 4.4).

In other words, our biased sample can still be used to produce unbiased estimates that are correct on expectation [16], but the sample might not be biased exactly as specified by the user-defined function f, if the value of f(r) tends to fluctuate wildly. While this may seem like a drawback, the number of records not sampled according to f will usually be small. Furthermore, since the function used to measure the utility of a sample in biased sampling is usually the result of an approximate answer to a difficult optimization problem [15] or the application of a heuristic [52], having a small deviation from that function might not be of much concern.

We present a single-pass biased sampling algorithm that provides both guarantees outlined above as Algorithm 7, and Lemma 4 proves the correctness of the algorithm.

Using Algorithm 7, we are guaranteed that for each Ri and for each record rj produced by the data stream such that j ≤ i, we have Pr[rj ∈ Ri] = |R| f′(rj) / totalWeight.

Proof. The proof parallels that of Lemma 3. We simply use f′ instead of f to prove the desired result.
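The acceptance rule underlying this family of algorithms can be sketched as follows. This is a simplification with our own names: it accepts record i with probability |R|·f(ri)/totalWeight and assumes no record is overweight, whereas Algorithm 7 in the text additionally adjusts weights when that probability would exceed one.

```python
import random

def biased_reservoir(stream, f, R_size):
    """Simplified single-pass biased reservoir: record i is accepted with
    probability R_size*f(r_i)/totalWeight and replaces a uniformly chosen
    victim. Assumes no record is overweight (probability never exceeds 1)."""
    reservoir, total_weight = [], 0.0
    for record in stream:
        total_weight += f(record)
        if len(reservoir) < R_size:
            reservoir.append(record)
        elif random.random() < R_size * f(record) / total_weight:
            reservoir[random.randrange(R_size)] = record
    return reservoir
```

Note that with a constant weighting function (f(r) = 1 for all r) the acceptance probability degenerates to |R|/i, i.e., the classical, unbiased reservoir sampling algorithm.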


Algorithm 7 is the deviation of f′ from f. That is: how far off from the correct weighting can we be, in the worst case? When the stream has no overweight records, we expect f′ to be exactly equal to f, but it may be very far away under certain circumstances. To address this, we define a distance metric in Definition 2 and evaluate the worst-case distance between f′ and f.


Theorem 1, and is analyzed and proved in the Appendix. Algorithm 7 will sample with an actual bias function f′ where totalDist(f, f′) is upper bounded by Σ_{k=|R|}^{N} f(r′_k) / Σ_{k=1}^{|R|−1} f(r′_k). Algorithm 7 computes a biased sample according to f′, where f′ is a function close to the user-defined weighting function f according to the following distance metric:


Algorithm 7 occurs when (1) the reservoir is initially filled with the |R| records having the smallest possible weights, and (2) we encounter the record rmax with the largest weight immediately thereafter. Theorem 1 presented an upper bound on totalDist(f, f′) in this worst case. In this section, we first provide the proof of this worst case for Algorithm 7 and then prove the upper bound on totalDist(f, f′) given by Theorem 1.

To establish the worst case for Algorithm 7, we first prove the following three propositions. These proofs lead us to the worst-case argument. If we denote the record with the highest weight in the stream as rmax and use rmax_i to denote the case where rmax is located at position i in the stream, then for any given random ordering of the streaming records r1, ..., r_{i−1}, rmax_i, ..., rN, we prove that:

1. Moving the record rmax_i earlier in the range r_{|R|} ... r_N cannot decrease totalDist(f, f′).

2. When we are initially filling the reservoir, choosing the |R| records with the smallest possible weights maximizes totalDist(f, f′).

3. Reordering any record that appears after rmax_i in the range r_{i+1} ... r_N cannot increase totalDist(f, f′).


Using Lemma 5 (given below) we re-write the totalDist formula as


Using Lemma 5 we re-write the totalDist formula as


Adjustment of rmax_i to rmax_{i−1}. We subtract Equation (4-...) from Equation (4-...) as follows. Figure 4-1 shows the adjustment of rmax_i to rmax_{i−1}. We denote the record that is swapped with rmax as rswap. The above equation further simplifies to

[ |R| f(rmax) + Σ_{k=i+1}^{N} f(rk) ]




Algorithm 7 accepts the first |R| records of the stream with probability 1. No weight adjustments are triggered for the first |R| records, irrespective of their weights. Therefore, the earliest position at which rmax can appear in the stream is right after the reservoir is filled. This proves the proposition. We now turn to proving Lemma 5, which was used in the previous proof.

Proof.


Lemma 5, ∀j

From the above three propositions, we can conclude that the worst case for Algorithm 7 occurs when (1) the reservoir is initially filled with the |R| records having the smallest possible weights, and (2) we encounter the record rmax with the largest weight immediately thereafter.

Theorem 1: The Upper Bound on totalDist. Using Lemma 5 we re-write the totalDist formula as


Using Equation (4-...), the above equation simplifies to

In the worst case the reservoir is initially filled with the |R| records having the smallest possible weights. If r1, r2, ..., rN are the records in appearance order, then we define r′1, r′2, ..., r′N as the permutation (reordering) of the records such that f(r′1) ≤ f(r′2) ≤ ... ≤ f(r′N). The condition requiring the reservoir to be filled with the smallest possible weights can then be written as


1. For each on-disk subsample, Mj is set to |R| Mj f(ri) / totalWeight.

2. For each sampled record still in the buffer, rj.weight is set to |R| rj.weight f(ri) / totalWeight.

3. Finally, totalWeight is set to |R| f(ri).


[50] for a sample computed using our algorithm. We derive the correlation (covariance) between the Bernoulli random variables governing the sampling of two records ri and rj using our algorithm, and use this covariance to derive the variance of a Horvitz-Thompson estimator. Combined with the Central Limit Theorem, the variance can then be used to provide bounds on the estimator's accuracy. The estimator is suitable for the SUM aggregate function (and, by extension, the AVERAGE and COUNT aggregates) over a single database table for which the reservoir is maintained. Though handling more complicated queries using the biased sample is beyond the scope of this work, it is straightforward to extend the analysis of this section to more complicated queries such as joins [32].

Imagine that we have the following single-table query, whose (unknown) answer is q:

SELECT SUM(...) FROM TABLE AS r

Next, we derive the variance of this estimator. To do this, we need a result similar to Lemma 3 that can be used to compute the probability Pr[{rj, rk} ⊆ Ri] under our biased sampling scheme.
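The Horvitz-Thompson idea itself is compact enough to state as code. A minimal sketch (our own names; `inclusion_prob` stands for the per-record inclusion probability Pr[r ∈ Ri], which for our scheme is |R| f′(r)/totalWeight):

```python
def ht_sum_estimate(sample, value, inclusion_prob):
    """Horvitz-Thompson estimator for a SUM aggregate: each sampled record's
    value is weighted by the inverse of its inclusion probability, which
    makes the estimate unbiased on expectation."""
    return sum(value(r) / inclusion_prob(r) for r in sample)
```

For example, if every record were included with probability 1 the estimate reduces to the exact sum, and if half the records were included with probability 0.5 each value would be counted twice to compensate.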


Algorithm 7, for each Ri and for each record pair {rj, rk} produced by the data stream where j < k ≤ i,

This expression can then be used in conjunction with the next lemma to compute the variance of the natural estimator for q.

By using the result of Lemma 6 to compute Pr[{rj, rk} ⊆ Ri], the variance of the estimator is then easily obtained for a specific query. In practice, the variance itself must be estimated by considering only the sampled records, as we typically do not have access to each and every rj during query processing. The q² term and the two sums in the expression of the variance are thus computed over each rj in the sample of the biased geometric file rather than over the entire reservoir.

There is one additional issue regarding biased sampling that is worth some additional discussion: how to efficiently compute the value Pr[{rj, rk} ⊆ Ri] in order to estimate




A geometric file is a simple random sample (without replacement) from a data stream. In this chapter we develop techniques which allow a geometric file to itself be sampled in order to produce smaller sets of data objects that are themselves random samples (without replacement) from the original data stream. The goal of the algorithms described in this chapter is to efficiently support further sampling of a geometric file by making use of its own structure.

In Section 3.2, we argued that small samples frequently do not provide enough accuracy, especially in the case when the resulting statistical estimator has a very high variance. However, while in the general case a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. For example, reconsider the problem of estimating the average net worth of American households as described in Section 3.2. In the general case, many millions of samples may be needed to estimate the net worth of the average household accurately (due to a small ratio between the average household's net worth and the standard deviation of this statistic across all American households). However, if the same set of records held information about the size of each household, only a few hundred records would be needed to obtain similar accuracy for an estimate of the average size of an American household, since the ratio of average household size to the standard deviation of household size across households in the United States is greater than 2. Thus, to estimate the answer to these two queries, vastly different sample sizes are needed.


[1, 21, 30, 34, 39]. In general, the drawback of making use of a batch sample is that the accuracy of any estimator which makes use of the sample is fixed at the time that the sample is taken, whereas the benefit of batch sampling is that the sample can be drawn with very high efficiency.

We will also consider the case where N is not known beforehand, and we want to implement an iterative function GetNext. Each call to GetNext results in an additional sampled record being returned to the caller, and so N consecutive calls to GetNext result in a sample of size N. We will refer to a sample retrieved in this manner as an online or sequential sample. The drawback of online sampling compared to batch sampling is that it is generally less efficient to obtain a sample of size N using online methods. However, since the consumer of the sample can call GetNext repeatedly until an estimator with enough accuracy is obtained, online sampling is more flexible than batch sampling. An online sample retrieved from a geometric file can be useful for many applications, including online aggregation [32, 33]. In online aggregation, a database system tries to quickly gather enough information so as to approximate the answer to an aggregate query. As more and more information is gathered, the approximation quality is improved, and the online sampling procedure is halted when the user is happy with the approximation accuracy.

5.3.1 A Naive Algorithm

Proof.


Each of the (|D| choose N) possible size-N subsets of the file is equally likely, so any given subset is returned with probability (|D| choose N)⁻¹.

Unfortunately, though it is very simple, the naive algorithm will be inefficient for drawing a small sample from a large geometric file, since it requires a full scan of the geometric file to obtain a true random sample for any value of N. Since the geometric file may be gigabytes in size, this can be problematic.
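A full-scan batch sampler of this naive kind can be sketched with the standard selection-sampling technique (akin to Knuth's Algorithm S; the names here are ours):

```python
import random

def naive_batch_sample(records, N):
    """Naive batch sampling by full scan: keep each record with probability
    (samples still needed)/(records still unseen). This yields a uniform
    size-N sample without replacement, but must touch all |D| records."""
    sample, needed, remaining = [], N, len(records)
    for r in records:
        if needed > 0 and random.random() < needed / remaining:
            sample.append(r)
            needed -= 1
        remaining -= 1
    return sample
```

The loop always runs over the entire file regardless of N, which is precisely the inefficiency that the structure-aware algorithms below are designed to avoid.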


[26]. Once the number of sampled records from each segment has been determined, sampling those records can be done with an efficient sequential read since, within each on-disk segment, all records are stored in a randomized order. The key algorithmic issue is how to calculate the contribution of each subsample. Since this contribution is a multivariate hypergeometric random variable, we can use an approach analogous to Algorithm 4, which is used to partition the buffer to form the segments of a subsample. In other words, we can view retrieving N samples from a geometric file as analogous to choosing N random records to overwrite when new records are added to the file.

The resulting algorithm can be described as follows. To start with, we partition the sample space of N records into segments of varying size exactly as in Algorithm 4. We refer to these segments of the sample space as sampling segments. The sampling segments are then filled with samples from the disk using a series of sequential reads, analogous to the set of writes that are used to add new samples to the geometric file. The largest sampling segment obtains all of its records from the largest subsample, the next largest sampling segment obtains all its records from the second largest subsample, and so on.

When using this algorithm, some care needs to be taken when N approaches the size of a geometric file. Specifically, when all disk segments of a subsample are returned to a corresponding sampling segment, we must also consider the subsample's in-memory buffered


given as Algorithm 8. It is clear that this algorithm obtains the desired batch sample by scanning exactly N records, as against the entire scan of reservoir sampling, at the cost of a few random disk seeks. Since the sampling process is analogous to the process of adding more samples to the file, it is just as efficient, requiring O(ω log |B| / N) random disk head movements for each newly sampled record, as described in Lemma 2.

In the case of multiple geometric files, we run Algorithm 8 on each file in order to obtain the desired batch sample.

5.4.1 A Naive Algorithm

It is easy to see that a naive algorithm will give us a correct online sample of a geometric file. However, we will use one disk seek per call to GetNext. Since each random I/O requires


Instead of selecting a random record of a geometric file, we randomly pick a subsample and choose its next available record as the return value of GetNext. This is analogous to the classic online sampling algorithm for sampling from a hashed file [26], where first a hash bucket is selected and then a record is chosen. Since the selection of a random record within a subsample is sequential, we may reduce the number of costly disk seeks if we read the subsample in its entirety and buffer the subsample's records in memory. Using this basic methodology, we now describe how a call to GetNext will be processed:

Since the records from each subsample are read and buffered in memory sequentially, we are guaranteed to choose each record of the reservoir at most once, giving us the desired random sample without replacement. A proof of this is simple, and analogous to the proof of Lemma 3. However, thus far we have not considered a very important question: How many blocks of a subsample Si should we fetch at the time of a buffer refill? In general there are two extremes that we may consider:


In order to discuss such considerations more concretely, we note that the time required to process a GetNext call is proportional to the number of blocks fetched on the call, assuming that the cost to perform the required in-memory calculations is minimal. If b blocks are fetched during a particular call, we spend s + br time units on that particular call to GetNext, where s is the seek time and r is the time required to scan a block. Once these b blocks are fetched, we incur zero cost for the next bn calls to GetNext, where n is the blocking factor (number of records per block). Thus, in the case where all b blocks are fetched at the first call to GetNext, we incur a total cost of s + br to sample bn records, and have a response time of s + br units at the first call to GetNext, with all subsequent calls having zero cost.

Now imagine that instead we split the b blocks into two chunks of size b/2 each, and read a chunk at a time. Thus, the first GetNext call will cost us s + br/2 time units. Once these bn/2 records are used up we read the next chunk of blocks. The total cost in this scenario is 2s + br, with a response time of s + br/2 time units once at the starting point and once mid-way through. Note that although the maximum response time on any call to GetNext is reduced by half, we require more time to sample bn records. The question then becomes: How do we reconcile response time with overall sampling time to give the user optimal performance?

The systematic approach we take to answering this question is based on minimizing the average square sum of response times over all GetNext calls. This idea is similar to the widely utilized sum-square-error or MSE criterion, which tries to keep the average error or cost from being too high, but also penalizes particularly poor individual errors or costs. However, one


d/dX [ X (s + ((N/b)r)/X)² ] = d/dX [ Xs² + 2(N/b)sr + ((N/b)r)²/X ] = s² − ((N/b)r)²/X²,

which is zero when X = (N/b)(r/s).
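The resulting chunk count is easy to compute. A small sketch (our own names), where the total scan time T = (N/b)·r and the optimum from setting the derivative of X(s + T/X)² to zero is X = T/s:

```python
def optimal_chunk_count(n_records, blocking_factor, seek_time, block_scan_time):
    """Minimize the sum of squared response times X*(s + T/X)^2 over the
    number of chunks X, where T = (N/b)*r is the total scan time; the
    optimum is X = T/s (at least one chunk)."""
    total_scan = (n_records / blocking_factor) * block_scan_time
    return max(1, round(total_scan / seek_time))
```

For instance, with N = 100000 records, b = 100 records per block, a 10 ms seek, and a 1 ms block scan, the criterion prescribes 100 chunks: many small reads when seeks are cheap relative to scanning, few large reads when they are not.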


Algorithm 9 gives the detailed online sampling algorithm.

Proof. Let Sample be the biased sample of the geometric file; then we have Pr[i ∈ Sample] = Pr[selecting i from Si] · Pr[selecting Si] · Pr[i ∈ Si] = (1/|R|) · |R| f(r) · ...

presented in Chapter 7 of this dissertation.


Efficiently searching and discovering required information from a sample stored in a geometric file is essential to speed up query processing. A natural way to support this functionality is to build an index structure for the geometric file. In this chapter we discuss three secondary index structures for the geometric file. The goal is to maintain the index structures as new records are inserted into the geometric file and, at the same time, provide efficient access to the desired information in the file.

FROM Transaction
WHERE StoreState = 'FL' AND TransDate > 1/1/2007


A natural way to speed up the search and discovery of those records from a geometric file that have a particular value for a particular attribute (or attributes) is to build an index structure. In general, an index is a data structure that lets us find a record without having to look at more than a small fraction of all possible records. Thus, in our example, we could use an index built on either StoreState or TransDate (or both) to quickly access a specific set of records and test them for the conditions in the WHERE clause. In this chapter we focus on building such an index structure for the geometric file.

Apart from providing efficient access to the desired information in the file, a key consideration is that the index for the geometric file must be maintained as new records are inserted. For instance, we could build a secondary index on an attribute when the new records are bulk inserted into the geometric file. We must then determine how to merge the new secondary index with the existing indexes built for the rest of the file. Furthermore, we must maintain the index as existing records are overwritten with newly inserted records and hence are deleted from the geometric file.

With these goals in mind, we discuss three secondary index structures for the geometric file: (1) a segment-based index, (2) a subsample-based index, and (3) a Log-Structured Merge-Tree- (LSM-) based index. The first two indexes are developed around the structure of the geometric file. Multiple B+-tree indexes [9] are maintained for each segment or subsample in a geometric


[44], a disk-based data structure designed to provide low-cost indexing in an environment with a high rate of inserts and deletes.

In the subsequent sections we discuss construction, maintenance, and querying of these three types of indexes.

We detail construction and maintenance of a segment-based index structure in this section.

We use Algorithm 3 from Chapter 3 during start-up to fill the reservoir. Every time the buffer accumulates the desired number of records, it is segmented and flushed to the disk. We build a B+-tree index for each segment just before it is written out to the disk. For each buffered record of a segment we construct an index record. An index record is comprised of the value of the attribute on which the index is being built (the key value) and the position of the buffered record on the disk. The position is stored as a number pair: a page number and an offset within a page. The index records are then used to create an index using the bulk insertion algorithm for a B+-tree. We use a simple array-based data structure to keep


Rather than maintaining a file for each B+-tree created, we organize multiple B+-trees in a single disk file. We refer to this single file as the index file. The index file, in a sense, is similar to the log-structured file system proposed by Ousterhout [45]. In a log-structured file system, as files are modified, the contents are written out to the disk as logs in a sequential stream. This allows writes in full-cylinder units, with only track-to-track seeks. Thus the disk operates at nearly its full bandwidth. The index file enjoys similar performance benefits. Every time a B+-tree is created for a memory-resident segment, it is written to the index file in a sequential stream at the next available position. The array maintaining all B+-tree root nodes is augmented with the starting disk position of the B+-tree.

Finally, we do not index segments that are never flushed to the disk. These segments are typically very small (the size of a disk block) and it is efficient to search them using a sequential memory scan when the geometric file is queried.

The algorithm used to construct and maintain a segment-based index structure is given as Algorithm 10.
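The append-only bookkeeping described above can be sketched in a few lines (a toy model with our own names; `data` stands in for the on-disk index file and serialized trees are opaque byte strings):

```python
class IndexFile:
    """Sketch of the single 'index file': each per-segment B+-tree is
    appended as one sequential write, and an in-memory array records the
    starting offset of every tree so it can be located later."""
    def __init__(self):
        self.data = bytearray()   # stands in for the on-disk index file
        self.roots = []           # (tree_id, starting_offset) per B+-tree

    def append_tree(self, tree_id, serialized_tree):
        offset = len(self.data)
        self.data += serialized_tree          # one sequential write
        self.roots.append((tree_id, offset))
        return offset
```

Because every tree lands at the current end of the file, writes are strictly sequential, which is the log-structured property the text borrows from Ousterhout's design.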


...⌈log_α c⌉...

We expect a segment-based index structure to be a compact structure, as there is exactly one index record present in the index structure for each record in the geometric file, and the index structure is maintained as records are deleted from the file.


As in the case of a segment-based index structure, we arrange the B+-tree indexes on disk in a single index file. However, we need a slightly different approach, because during start-up subsamples are flushed to the geometric file until the reservoir is full. Thereafter, subsamples of the same size |B| are added to the reservoir. Since each B+-tree will index no more than |B| records, we can bound the size of a B+-tree index. We use this bound to pre-allocate a fixed-size slot on disk for each B+-tree. Furthermore, for every buffer flush after the reservoir is full, exactly one subsample is added to the file and the smallest subsample of the file decays completely, keeping the number of subsamples in a geometric file constant. We use this information to lay out the subsample-based B+-trees on disk and maintain them as new records are sampled from the data stream.

Thus, if totSubsamples is the total number of subsamples in R, we first allocate totSubsamples fixed-size slots in the index file. Initially all the slots are empty. During start-up, as a new B+-tree is built, we seek to the next available slot and write out the B+-tree in a sequential


The algorithm used to construct and maintain a subsample-based index structure is given as Algorithm 11. ...⌈log_α c⌉...

A search on the subsample-based index structure involves looking up all B+-tree indexes, one for each subsample in the geometric file. We modify the existing B+-tree-based point query and range query algorithms and run them for each entry in the B+-tree array of the index structure. The modification is required to ignore the stale records in the B+-trees. As mentioned before, the subsample corresponding to a B+-tree may lose its segments, but the index records are


Recall that we have recorded a segment number in an additional field along with each index record. For a given subsample, we keep track of which of its segments have decayed so far and use this information to ignore the index records that are stale. We return all valid index records that satisfy the search criteria. We first sort these index records by their page number attribute and then retrieve the actual records from the geometric file and return them as the query result.

Although the subsample-based index structure maintains and must search far fewer B+-trees compared to the segment-based index structure, we expect reasonable search time per B+-tree due to its bounded size and the lazy deletion policy.

The third index structure is based on the LSM-Tree [44]. The LSM-Tree is a disk-based data structure designed to provide low-cost indexing in an environment with a high rate of inserts and deletes [44].

Although the C1 (and higher) components are disk-resident, the most frequently referenced nodes (in general, nodes at higher levels) of these trees are buffered in main memory for performance reasons.
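To make the C0/C1 split concrete, here is a deliberately tiny two-component sketch (our own names; a real LSM-Tree performs an incremental rolling merge of contiguous key ranges rather than merging everything at once):

```python
class TinyLSM:
    """Toy two-component LSM sketch: an in-memory C0 dict and a sorted,
    disk-like C1 list. When C0 reaches a threshold, a batch merge moves
    its records into C1, mimicking (crudely) the rolling merge."""
    def __init__(self, c0_limit):
        self.c0, self.c1, self.c0_limit = {}, [], c0_limit

    def insert(self, key, value):
        self.c0[key] = value
        if len(self.c0) >= self.c0_limit:
            merged = dict(self.c1)       # existing C1 entries
            merged.update(self.c0)       # newer C0 entries win
            self.c1 = sorted(merged.items())
            self.c0.clear()

    def lookup(self, key):
        if key in self.c0:               # check the memory component first
            return self.c0[key]
        return dict(self.c1).get(key)
```

Inserts are absorbed in memory and reach disk only in batches, which is exactly the property that makes the LSM-based index attractive for the geometric file's high insert/delete rate.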


Whenever the C0 component reaches a threshold size, an ongoing rolling merge process removes some records (a contiguous segment) from the C0 component and merges them into the C1 component on disk. The rolling merge process is depicted pictorially in Figure 2.2 of the original LSM-Tree paper [44]. The rolling merge is repeated for migration between higher components of an LSM-Tree in a similar manner. Thus, there is a certain amount of delay before records in the C0 component migrate out to the disk-resident C1 and higher components. Deletions are performed concurrently in batch fashion, similar to inserts.

The disk-resident components of an LSM-Tree are comparable to a B+-tree structure, but are optimized for sequential disk access, with nodes 100% full. Lower levels of the tree are packed together in contiguous, multi-page disk blocks for better I/O performance during the rolling merge.

We use the existing LSM-Tree-based point query and range query algorithms to perform index look-ups. As in the case of the previously proposed index structures, we sort the valid index


In Chapter 7, we evaluate and compare the three index structures suggested in this chapter experimentally by measuring build time and disk footprint as new records are inserted into the geometric file. We also compare the efficiency of these structures for point and range queries.


In this chapter, we detail three sets of benchmarking experiments. In the first set of experiments, we attempt to measure the ability of the geometric file to process a high-speed stream of data records. In the second set of experiments, we examine the various algorithms for producing smaller samples from a large, disk-based geometric file. Finally, in the third set of experiments, we compare the three index structures for the geometric file on build time, disk space, and index look-up speed.

We benchmark the three alternatives discussed in Section 3.3, the geometric file, and the framework described in Section 3.10 for using multiple geometric files at once. In the remainder of this section, we refer to these alternatives as the virtual memory, scan, local overwrite, geofile, and multiple geofiles options. An α′ value of 0.9 was used for the multiple geofiles option.

All implementation was performed in C++. Benchmarking was performed using a set of Linux workstations, each equipped with 2.4GHz Intel Xeon processors. 15,000 RPM, 80GB Seagate SCSI hard disks were used to store each of the reservoirs. Benchmarking of these disks showed a sustained read/write rate of 35-50MB/second, and an across-the-disk random data access time of around 10ms.


Figure 7-1(a). By number of samples processed we mean the number of records that are actually inserted into the reservoir, and not the number of records that have passed through the data stream.

Figure 7-1(b). Thus, we test the effect of record size on the five options.

Figure 7-1(c). This experiment tests the effect of a constrained amount of main memory.

It is worthwhile to point out a few specific findings. Each of the five options writes the first 50GB of data from the stream more or less directly to disk, as the reservoir is large enough to hold all of the data as long as the total is less than 50GB. However, Figures 7-1(a) and (b) show that only the multiple geofiles option does not have much of a decline in performance after the reservoir fills (at least in Experiments 1 and 2). This is why the scan and virtual memory options plateau after the amount of data inserted reaches 50GB. There is something of a decline in performance in all of the methods once the reservoir fills in Experiment 3 (with restricted buffer memory), but it is far less severe for the multiple geofiles option than for the other options.


Results of benchmarking experiments (processing insertions).


Results of benchmarking experiments (sampling from a geometric file).


As expected, the local overwrite option performs very well early on, especially in the first two experiments (see Section 3.3 for a discussion of why this is expected). Even with limited buffer memory in Experiment 3, it uniformly outperforms a single geometric file. Furthermore, with enough buffer memory in Experiments 1 and 2, the local overwrite option is competitive with the multiple geofiles option early on. However, fragmentation becomes a problem and performance decreases over time. Unless offline re-randomization of the file is possible periodically, this degradation probably precludes long-term use of the local overwrite option.

It is interesting that, as demonstrated by Experiment 3 (and explained in Section 3.8), a single geometric file is very sensitive to the ratio of the size of the reservoir to the amount of available memory for buffering new records from the stream. The geofile option performs well in Experiments 1 and 2 when this ratio is 100, but rather poorly in Experiment 3 when the ratio is 1000.

Finally, we point out the general unusability of the scan and virtual memory options. Scan generally outperformed virtual memory, but both generally did poorly. Except in Experiment 1, with large memory and small record size, with these two options more than 97% of the processing of records from the stream occurs in the first half hour as the reservoir fills. In the 19.5 hours or so after the reservoir first fills, only a tiny fraction of additional processing occurs due to the inefficiency of the two options.

In Section 4.1 we gave an upper bound for the distance between the actual bias function f′ computed using our reservoir algorithm and the desired, user-defined bias function f. While useful, this bound does not tell the entire story. In the end, what the user of a biased sampling algorithm is interested in is not how close the bias function that is actually computed is to the user-specified one; the key question is what sort of effect any deviation has on the


Figure 7-3. SUM query estimation accuracy for zipf = 0.2.

particular estimation task that is to be performed. Perhaps the easiest way to detail the practical effect of a pathological data ordering is through experimentation.

In this section we present experimental results evaluating the practical significance of a worst-case data ordering. Specifically, we design a set of experiments to compute the error (variance) one would expect when sampling for the answer to a SUM query in the following three scenarios:

1. When a biased sample is computed using our reservoir algorithm, with the data ordered so as to produce no overweight records.

2. When an unbiased sample is computed using the classical reservoir sampling algorithm.

3. When a biased sample is computed using our reservoir algorithm, with records arranged so as to produce the bias function furthest from the user-specified one, as described by Theorem 1.

By examining the results, it should become clear exactly what sort of practical effect on the accuracy of an estimator one might expect due to a pathological ordering.
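To make the idea of single-pass biased sampling concrete, here is a sketch using the Efraimidis-Spirakis key method. This is a stand-in, not the dissertation's reservoir algorithm, but like it, it computes a weighted sample without replacement in one pass using an arbitrary user-defined weighting function f:

```python
import heapq
import random

def biased_reservoir(stream, weight, k, rng=None):
    """Single-pass biased sampling without replacement using the
    Efraimidis-Spirakis key method: each record gets key u**(1/w), and
    the k records with the largest keys are retained."""
    rng = rng or random.Random(7)
    heap = []  # min-heap of (key, record); the root is the weakest kept key
    for record in stream:
        w = weight(record)
        if w <= 0:
            continue  # zero-weight records are never sampled
        key = rng.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, record))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, record))
    return [record for _, record in heap]
```

Records with larger weights receive keys closer to 1 and are therefore more likely to survive, which is the qualitative behavior the scenarios above probe.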


Figure 7-4. SUM query estimation accuracy for zipf = 0.5.

Attribute B is the attribute that is actually aggregated by the SUM query. Each set is generated so that attributes A and B both have a certain amount of Zipfian skew, specified by the parameter zipf. In each case, the bias function f is defined so as to minimize the variance for a SUM query evaluated over attribute A.

In addition to the parameter zipf, each data set also has a second parameter, which we term the correlation factor. This is the probability that attribute A has the same value as attribute B. If the correlation factor is 1, then A and B are identical, and since the bias function is defined so as to minimize the variance of a query over A, the bias function also minimizes the variance of an estimate over the actual query attribute B. Thus, a correlation factor of 1 provides for a perfect bias function. As the correlation factor decreases, the quality of the bias function for a query over attribute B declines, because the chance increases that a record deemed important by looking at attribute A is, in fact, one that should not be included in the sample. This models the case where one can only guess at the correct bias function beforehand, for example, when queries with an arbitrary relational selection predicate may be issued. A small correlation factor corresponds to the case when the guessed-at bias function is actually very incorrect.
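The data generation just described can be sketched as follows. This is a hypothetical generator matching the description; the actual value domains and sampler used in the experiments may differ:

```python
import random

def zipf_stream(n, zipf, corr, n_values=100, seed=11):
    """Yield n (A, B) pairs: both attributes are Zipfian-skewed over
    {1, ..., n_values}, and with probability `corr` (the correlation
    factor) attribute B is set equal to attribute A."""
    rng = random.Random(seed)
    values = list(range(1, n_values + 1))
    weights = [1.0 / (v ** zipf) for v in values]  # Zipf(zipf) mass
    for _ in range(n):
        a = rng.choices(values, weights)[0]
        # With probability corr, B copies A; otherwise B is drawn
        # independently from the same skewed distribution.
        b = a if rng.random() < corr else rng.choices(values, weights)[0]
        yield a, b
```

Setting corr = 1 reproduces the perfect-bias case (A and B identical), while corr near 0 models a badly guessed bias function.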


Figure 7-5. SUM query estimation accuracy for zipf = 0.8.

By testing each of the three scenarios described in the previous subsection over a set of data sets created by varying zipf as well as the correlation factor, we can see the effect of data skew and of bias function quality on the relative quality of the estimator produced by each of the three scenarios.

For each experiment, we generate a data stream of one million records and obtain a sample of size 1000. For each of the three scenarios and each of the data sets that we test, we repeat the sampling process 1000 times over the same data stream in Monte-Carlo fashion. The variance of the corresponding estimator is reported as the observed variance of the 1000 estimates. The observed Monte-Carlo variances are depicted in Figures 7-3, 7-4, 7-5, and 7-6.
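The Monte-Carlo methodology can be sketched generically. The helper names below are illustrative, and far fewer trials and smaller data are used than in the actual experiments:

```python
import random
import statistics

def expanded_sum_estimate(data, k, rng):
    """Estimate SUM(data) from a uniform sample of size k by scaling
    the sample sum by N / k."""
    sample = rng.sample(data, k)
    return sum(sample) * len(data) / k

def monte_carlo_variance(data, k, trials=200):
    """Repeat the randomized estimation over the same data stream and
    report the mean and observed variance of the estimates, mirroring
    the Monte-Carlo procedure described above."""
    estimates = [expanded_sum_estimate(data, k, random.Random(t))
                 for t in range(trials)]
    return statistics.mean(estimates), statistics.pvariance(estimates)
```

The observed variance of the repeated estimates is exactly the quantity plotted in Figures 7-3 through 7-6 for each scenario.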


Figure 7-6. SUM query estimation accuracy for zipf = 1.

The results show that even for very skewed data sets, it is difficult for even an adversary to come up with a data ordering that can significantly alter the quality of the user-defined bias function.

We also observe that for a low zipf parameter and a low correlation factor, unbiased sampling outperforms biased sampling. In other words, it is actually preferable not to bias in this case. This is because the low zipf value assigns relatively uniform values to attribute B, rendering an optimal biased scheme little different from uniform sampling. Furthermore, as the correlation factor decreases, the weighting scheme used by both biased sampling schemes becomes less accurate, hence the higher variance. As the weighting scheme becomes very inaccurate, it is better not to bias at all. Not surprisingly, there are more cases where the biased scheme under the pathological ordering is actually worse than the unbiased scheme. However, as the correlation factor increases and the bias scheme becomes more accurate, it quickly becomes preferable to bias.

We have also benchmarked the sampling algorithms of Chapter 5. Specifically, we have compared the naive batch sampling and the online sampling algorithms against the geometric file structure-based batch sampling and online


sampling algorithms. Figure 7-2(a) depicts the plot for a single geometric file; Figure 7-2(b) shows an analogous plot for the multiple geometric files option. Response times are plotted in Figure 7-2(c) for both the naive algorithm and the more advanced, geometric file structure-based algorithm designed to increase the sampling rate and even out the response times. The analogous plot for the multiple geometric file case is shown in Figure 7-2(d). We also plot the variance in response times over all calls to GetNext, as a function of the number of calls to GetNext, in Figures 7-2(e) and 7-2(f) (the first is for a single geometric file; the second is with multiple files). Taken together, these plots show the trade-off between overall processing time and the potential for waiting a long time in order to obtain a single sample.
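The flavor of the GetNext interface can be illustrated with a toy iterator. This is illustrative only: the geometric-file algorithms schedule block reads quite differently, and a statistically random emission order additionally requires that records were assigned to blocks at random in the first place:

```python
import random

def get_next(sample_blocks, seed=3):
    """Toy GetNext-style iterator: emits stored sample records one at a
    time, but reads an entire block per disk access so that the disk
    work is amortized over many GetNext calls. Blocks are visited in
    random order and shuffled internally before records are emitted."""
    rng = random.Random(seed)
    order = list(range(len(sample_blocks)))
    rng.shuffle(order)
    for b in order:
        block = list(sample_blocks[b])
        rng.shuffle(block)
        for record in block:
            yield record
```

Reading a whole block per seek is what lowers total processing time, at the cost of occasional long waits when a new block must be fetched, which is precisely the response-time variance trade-off shown in Figures 7-2(e) and 7-2(f).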


As expected, and as demonstrated by the variance plots, the variance of the online naive approach is smaller than that of the geometric file structure-based algorithm. However, despite this somewhat larger variance in the response times (less than 10 times for 100k samples), the structure-based approach executed orders of magnitude faster (more than 100 times for 100k samples) than the naive approach for any number of records sampled, justifying our approach of minimizing the average square sum of the response times. In other words, we got enough added speed for a small enough added variance in response time to make the trade-off acceptable. As more and more samples are obtained, the variance of the structure-based algorithm approached the variance of the naive algorithm, making the trade-off even more reasonable for large intended sample sizes.

Finally, we point out that both of the geometric file structure-based algorithms, in the batch and online cases, were able to read sample records from disk at almost the maximum sustained speed of the hard disk, at around 45 MB/sec. This is comparable to the rate of a sequential read from disk, the best we can hope for.


Figure 7-7. Disk footprint for 1KB record size.

Table 7-1. Millions of records inserted in 10 hrs

                  No Index   Subsample-Based   Segment-Based   LSM-Tree
  1KB records     13700      12550             10960           9680
  200B records    12810      7230              8030            2930

In Chapter 6 we introduced three index structures for the geometric file: the segment-based, the subsample-based, and the LSM-tree-based index structures. In this section, we experimentally evaluate and compare these three index structures by measuring build time and disk footprint as new records are inserted into the geometric file. We also compare the efficiency of these structures for point and range queries. All of the index structures were implemented on top of the geometric file prototype that was benchmarked in the previous sections.


The ten hours of insertion into the geometric file ensures that a reasonable number of insertions and deletions are performed on an index structure. Given such a file, we collected the following three pieces of information for each of the three index structures under consideration.

With these metrics in mind, we performed the following two sets of experiments. The insertion results are shown in Table 7-1, the disk space used by the three index structures is plotted in Figure 7-7, and the index look-up speed is tabulated in Table 7-2.


Figure 7-8. Disk footprint for 200B record size.

The insertion results for this record size are likewise shown in Table 7-1, the disk space used by the three index structures is plotted in Figure 7-8, and the index look-up speed is tabulated in Table 7-3. Thus, we test the effect of record size on the three index structures.

Table 7-1 shows millions of records inserted into the geometric file after ten hours of insertions and concurrent updates to the index structure. For comparison, we present the number of records inserted into a geometric file when no index structure is maintained (the no index column). It is clear that the subsample-based index structure performs the best on insertions, with performance comparable to the no index option. This difference reflects the cost of concurrently maintaining the index structure. The segment-based index structure does the next best. It is slower than the subsample-based index structure because of the higher number of seeks performed during start-up. Recall that during start-up the segment-based index must write a B+-tree for each segment.


Table 7-2. Query timing results for 1KB records, |R| = 10 million, and |B| = 50k

                     Selectivity   Index Time   File Time   Total Time
  Segment-Based
                     Point Query   38.2890      0.0226      38.3116
                                   40.2477      0.1803      40.2480
                                   43.2856      0.8766      44.1622
                                   45.6276      6.2571      51.8847
  Subsample-Based
                     Point Query   0.87551      0.02382     0.89937
                                   1.12740      0.15867     1.28607
                                   1.74911      1.10544     2.85455
                                   2.09980      5.96637     8.06617
  LSM-Tree
                     Point Query   0.00012      0.01996     0.02008
                                   0.00015      0.01263     0.01278
                                   0.00019      0.79358     0.79377
                                   0.00056      5.82210     5.82266

Once the reservoir is initialized, both the segment-based and the subsample-based index structures perform an equal number of disk seeks. Finally, the LSM-tree-based index structure is the slowest amongst the three. The LSM-tree maintains the index by processing insertions and deletions more aggressively than the other two options, demanding more rolling merges and more disk seeks per buffer flush.

Table 7-1 also shows the insertion figures for the smaller, 200B record size. Not surprisingly, all three index structures show similar insertion patterns, but since they have to process a larger number of records, the insertion rates are slower than in the case of the 1KB record size. We also observed and plotted the disk footprint size for the three index structures (Figures 7-7 and 7-8). As expected, all three index structures initially grow fairly quickly. The segment-based and the subsample-based index structures stabilize soon after the reservoir is filled, whereas the LSM-tree-based structure stabilizes a little later, when the removal of stale records by the rolling merges stabilizes.

The subsample-based index structure has the largest footprint (almost 1/5th of the geometric file size). This is expected, as stale index records are removed from the B+-trees only when the


Table 7-3. Query timing results for 200B records, |R| = 50 million, and |B| = 250k

                     Selectivity   Index Time   File Time   Total Time
  Segment-Based
                     Point Query   6.2488       0.0338      6.2826
                                   9.6186       0.1267      9.7453
                                   12.9885      0.9288      13.9173
                                   17.6891      5.9754      23.6645
  Subsample-Based
                     Point Query   2.50717      0.0156      2.5227
                                   4.92744      0.1763      5.1037
                                   7.2387       0.8637      8.1024
                                   9.9837       6.1363      16.1200
  LSM-Tree
                     Point Query   0.00505      0.0174      0.0224
                                   0.00967      0.1565      0.1661
                                   0.01440      0.8343      0.8487
                                   0.05987      4.9961      5.0559

entire subsample decays. On the other hand, the segment-based index structure has the smallest footprint, as at every buffer flush all stale records are removed from the index structure. This results in a very compact index structure. The disk space usage of the LSM-tree-based index structure lies between these two index structures. Although at every rolling merge stale records are removed from the part of the index structure that is merging, not all of the stale records in the structure are removed at once. As soon as the rate of removal of stale records stabilizes, the disk footprint also becomes stable.

Finally, we compared the index look-up speed of these three index structures. We report index look-up and geometric file access times for queries of different selectivities. As expected, the geometric file access time remains constant irrespective of the index structure option, and increases linearly as the query produces more output tuples. The index look-up time varied across the three index structures. The segment-based index structure (the slowest) was orders of magnitude slower than the LSM-tree-based index structure (the fastest). This is mainly because the segment-based index structure requires index lookups in several thousand B+-trees for any selectivity query, whereas the LSM-tree-based structure uses a single LSM-tree, requiring a small, constant number of seeks. The performance of the subsample-based index structure lies in


between. In general, the subsample-based index structure gives the best build time with reasonable index look-up speed, at the cost of a slightly larger disk footprint. The LSM-tree-based index structure makes use of reasonable disk space and gives the best query performance, at the cost of a slow insertion rate or build time. The segment-based index structure gives comparable build time and has the most compact disk footprint, but suffers considerably when it comes to index look-ups.


Random sampling is a ubiquitous data management tool, but relatively little research from the data management community has been concerned with how to actually compute and maintain a sample. In this dissertation, we have considered the problem of random sampling from a data stream, where the sample to be maintained is very large and must reside on secondary storage. We have developed the geometric file organization, which can be used to maintain an online sample of arbitrary size with an amortized cost of O(ω log |B| / |B|) random disk head movements for each newly sampled record. The multiplier ω can be made very small by making use of a small amount of additional disk space.

We have presented a modified version of the classic reservoir sampling algorithm that is exceedingly simple and is applicable for biased sampling using any arbitrary, user-defined weighting function f. Our algorithm computes, in a single pass, a biased sample R_i (without replacement) of the i records produced by a data stream.

We have also discussed certain pathological cases where our algorithm can provide a correctly biased sample only for a slightly modified bias function f′. We have analytically bounded how far f′ can be from f in such a pathological case. We have also experimentally evaluated the practical significance of this difference.

We have also derived the variance of a Horvitz-Thompson estimator making use of a sample computed using our algorithm. Combined with the Central Limit Theorem, the variance can then be used to provide bounds on the estimator's accuracy. The estimator is suitable for the SUM aggregate function (and, by extension, the AVERAGE and COUNT aggregates) over a single database table for which the reservoir is maintained.

We have developed efficient techniques which allow a geometric file to itself be sampled in order to produce smaller data objects. We considered two sampling techniques: (1) batch sampling, where the sample size is known beforehand, and (2) online sampling, which implements an iterative function GetNext to retrieve one sample record at a time. The goal of these algorithms was to efficiently support further sampling of a geometric file by making use of its own structure.
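The Horvitz-Thompson SUM estimator referred to above has the following generic textbook form; the dissertation's contribution is the variance derivation for the biased reservoir, not this formula itself:

```python
def horvitz_thompson_sum(sampled_values, inclusion_probs):
    """Horvitz-Thompson estimate of a population SUM: each sampled
    value is inflated by the inverse of its inclusion probability,
    which makes the estimator unbiased for any sampling design in
    which every record has a known, nonzero inclusion probability."""
    return sum(v / p for v, p in zip(sampled_values, inclusion_probs))
```

For a uniform sample of k records out of N, every inclusion probability is k/N and the estimator reduces to the familiar scaled sample sum; a biased design simply supplies per-record probabilities instead.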






Abhijit Pol was born and brought up in the state of Maharashtra in India. He received his Bachelor of Engineering from Government College of Engineering Pune (COEP), University of Pune, one of the most prestigious and oldest engineering colleges in India, in 1999. Abhijit majored in mechanical engineering and obtained a distinguished record. He ranked second in the university merit ranking. He was employed in the Research and Development department of Kirloskar Oil Engines Ltd for one year. Abhijit received his first Master of Science from the University of Florida in 2002. He majored in industrial and systems engineering. Abhijit then worked as a researcher in the Department of Computer and Information Science and Engineering at the University of Florida. He received his second Master of Science and Doctor of Philosophy (Ph.D.) in computer engineering in 2007. During his studies at the University of Florida, Abhijit coauthored a textbook titled Developing Web-Enabled Decision Support Systems. He taught the Web-DSS course several times in the Department of Industrial and Systems Engineering at the University of Florida. He presented several tutorials at workshops and conferences on the need and importance of teaching DSS material, and he also taught at two instructor-training workshops on DSS development. Abhijit's research focus is in the area of databases, with special interests in approximate query processing, physical database design, and data streams. He has presented research papers at several prestigious database conferences and performed research at the Microsoft Research Lab. He is now a Senior Software Engineer in the Strategic Data Solutions group at Yahoo! Inc.