Citation |

- Permanent Link:
- https://ufdc.ufl.edu/UFE0021132/00001
## Material Information- Title:
- Maintaining Very Large Samples Using the Geometric File
- Creator:
- Pol, Abhijit A
- Place of Publication:
- [Gainesville, Fla.]
Florida - Publisher:
- University of Florida
- Publication Date:
- 2007
- Language:
- english
- Physical Description:
- 1 online resource (122 p.)
## Thesis/Dissertation Information- Degree:
- Doctorate ( Ph.D.)
- Degree Grantor:
- University of Florida
- Degree Disciplines:
- Computer Engineering
Computer and Information Science and Engineering - Committee Chair:
- Jermaine, Christophe
- Committee Members:
- Kahveci, Tamer
Dobra, Alin Hammer, Joachim Ahuja, Ravindra K. - Graduation Date:
- 8/11/2007
## Subjects- Subjects / Keywords:
- Buffer storage ( jstor )
Databases ( jstor ) Datasets ( jstor ) Estimate reliability ( jstor ) Index numbers ( jstor ) International conferences ( jstor ) Random sampling ( jstor ) Recordings ( jstor ) Sampling bias ( jstor ) Statistical discrepancies ( jstor ) Computer and Information Science and Engineering -- Dissertations, Academic -- UF biased, databases, file, indexing, sampling - Genre:
- bibliography ( marcgt )
theses ( marcgt ) government publication (state, provincial, terriorial, dependent) ( marcgt ) born-digital ( sobekcm ) Electronic Thesis or Dissertation Computer Engineering thesis, Ph.D.
## Notes- Abstract:
- Sampling is one of the most fundamental data management tools available. It is one of the most powerful methods for building a one-pass synopsis of a data set, especially in a streaming environment where the assumption is that there is too much data to store all of it permanently. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a 'sample' is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples in an online manner from streaming data. We present a new data organization called the geometric file and online algorithms for maintaining a very large, on-disk samples. The algorithms are designed for any environment where a large sample must be maintained online in a single pass through a data set. The geometric file organization meets the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. We modify the classic reservoir sampling algorithm to compute a fixed-size sample in a single pass over a data set, where the goal is to bias the sample using an arbitrary, user-defined weighting function. We also describe how the geometric file can be used to perform a biased reservoir sampling. While a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. We therefore develop efficient techniques which allow a geometric file to itself be sampled in order to produce smaller data objects. Efficiently searching and discovering information from the geometric file is essential for query processing. A natural way to support this is to build an index structure. We discuss three secondary index structures and their maintenance as new records are inserted to a geometric file. ( en )
- General Note:
- In the series University of Florida Digital Collections.
- General Note:
- Includes vita.
- Bibliography:
- Includes bibliographical references.
- Source of Description:
- Description based on online resource; title from PDF title page.
- Source of Description:
- This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
- Thesis:
- Thesis (Ph.D.)--University of Florida, 2007.
- Local:
- Adviser: Jermaine, Christophe.
- Statement of Responsibility:
- by Abhijit A Pol.
## Record Information- Source Institution:
- UFRGP
- Rights Management:
- Copyright Pol, Abhijit A. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
- Classification:
- LD1780 2007 ( lcc )
## UFDC Membership |

Downloads |

## This item has the following downloads: |

Full Text |

The above two methods implement simple random sampling (SRS) without replacement with successive draws. An alternative method for SRS with fixed size is to select units for replacement, and then to reject the sample if there are duplicates. We discuss one such method here, called Sampford's Method. Sampford's Method: In this method we will first draw r with probability a, and in the remaining N- 1 draws, which are carried out with replacement, we use the selection probabilities 3, = Kai/(1 Nao), where K is the normalizing constant. If there are any duplicates in the sample we start again from the beginning and repeat the procedure until the desired sample with no duplicates is obtained. The main drawback of this sampling design is that as N becomes large it becomes likely that duplicates will occur in each sampling round. While a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. We therefore develop efficient techniques which allow a geometric file to itself be sampled in order to produce smaller data objects. Efficiently searching and discovering information from the geometric file is essential for query processing. A natural way to support this is to build an index structure. We discuss three secondary index structures and their maintenance as new records are inserted to a geometric file. existing subsample with the records from a new buffer flush, a simple, efficient, sequential overwrite of the existing subsample's largest segment generally suffices. 3.5 Characterizing Subsample Decay To describe the geometric file in detail, we begin with an analogy between the samples in a subsample S that are lost over time, and radioactive decay. Imagine that we have 100 grams of Uranium at an initial point of time (Uo = 100), and a decay rate (1 a) = 0.1 with a retention rate of a. On day one, the mass of Uranium decays to Uo x a = 90 grams, because the Uranium loses Uo x (1 a) = 10 grams of its mass. We define n = Uo x (1 a) to be the mass of Uranium lost on the very first day, giving n = 10 for our example. On day two, (with U1 = 90) the Uranium further decays to U1 x a = 81 grams, this time losing U1 x (1 a) = Uo x a x (1 a) n x a = 9 grams of its mass. On day three, it further decays by n x a2 = 7.2 grams, and so on. The decay process is allowed to continue until we have less than 3 grams of Uranium remaining. Continuing with the Uranium analogy, three questions that are relevant to our problem of maintaining very large samples from a data stream are What is the amount of Uranium lost on any given ith day? How can the initial mass of Uranium, 100 grams, can be expressed in terms of n and a? How many days it will take for us before we are left with 3 grams or less of Uranium? These questions can be answered using the following three simple observations related to geometric series: Observation 1: Given a retention rate a < 1 and n to be the first term of a geometric series, the ith term is given by n x a-1 for any nE cR. Observation 2: Given a retention rate a < 1, it holds that C"n x a -1 "r for any n R. Observation 3: Given a retention rate a < 1, define f(j) as x aj. From Observation 2, it follows that the largest j such that f(j) > P3 is j log3-loglog(1-a)]. We denote this floor by T. of records was selected to be inserted into the reservoir (as many as each of the five options could handle). The goal was to test how many new records could be added to the reservoir in 20 hours, while at the same time expelling existing records from the reservoir as is required by the reservoir algorithm. The number of new samples processed by each of the five options (that is, the number of records added to disk) is plotted as a function of time in Figure 7-1 (a). By "number of samples processed" we mean the number of records that are actually inserted into the reservoir, and not the number of records that have passed through the data stream. Insertion experiment 2: This experiment is identical to Experiment 1, except that the 50GB sample was composed of 50 million, 1KB records. Results are plotted in Figure 7-1 (b). Thus, we test the effect of record size on the five options. Insertion experiment 3: This experiment is identical to Experiment 1, except that the amount of buffer memory is reduced to 150MB for each of the five options. The virtual memory option used all 150MB for an LRU buffer, and the four other options allocated 100MB to the LRU buffer and 50MB to the buffer for new samples. Results are plotted in Figure 7-1 (c). This experiment tests the effect of a constrained amount of main memory. 7.1.2 Discussion of Experimental Results All three experiments suggest that the multiple geo files option is superior to the other options. In Experiments 1 and 2, the multiple geofiles option was able to write new samples to disk almost at the maximum sustained speed of the hard disk, at around 40 MB/sec. It is worthwhile to point out a few specific findings. Each of the five options writes the first 50GB of data from the stream more or less directly to disk, as the reservoir is large enough to hold all of the data as long as the total is less than 50GB. However, Figure 7-1 (a) and (b) show that only the multiple geofiles option does not have much of a decline in performance after the reservoir fills (at least in Experiments 1 and 2). This is why the scan and virtual memory options plateau after the amount of data inserted reaches 50GB. There is something of a decline in performance in all of the methods once the reservoir fills in Experiment 3 (with restricted buffer memory), but it is far less severe for the multiple geofiles option than for the other options. Table 7-2. Query timing results for 1k record, R| = 10 million, and IBI = 50k Scheme Selectivity Index Time File Time Total Time Point Query 38.2890 0.0226 38.3116 S 10 recs 40.2477 0.1803 40.2480 Segment-Based 100 recs 43.2856 0.8766 44.1622 1000 recs 45.6276 6.2571 51.8847 Point Query 0.87551 0.02382 0.89937 10 recs 1.12740 0.15867 1.28607 Subsample-Based Subsample-Based 100 recs 1.74911 1.10544 2.85455 1000 recs 2.09980 5.96637 8.06617 Point Query 0.00012 0.01996 0.02008 10 recs 0.00015 0.01263 0.01278 LSM-Tree 100 recs 0.00019 0.79358 0.79377 1000 recs 0.00056 5.82210 5.82266 Once the reservoir is initialized, both the segment-based and the subsample-based index structure perform an equal number of disk seeks. Finally, the LSM-tree-based index structure is slowest amongst the three. The LSM-tree maintains the index by processing insertions and deletions more aggressively than other two options, demanding more rolling merges and more disk seeks per buffer flush. Table 7.4 also shows the insertion figures for the smaller, 200B record size. Not surprisingly, all three index structures shows similar insertion patterns, but since they have to process a larger number of records the insertion rates are slower than in the case of the 1KB record size. We also observed and plotted the disk footprint size for three index structures (Figure 7-7 and Figure 7-8). As expected, all three index structures initially grow fairly quickly. The segment-based and the subsample-based index structures stabilize soon after the reservoir is filled, whereas the LSM-Tree-based structure stabilizes a little later when the removal of stale records from the rolling merges stabilizes. The subsample-based index structure has the largest footprint (almost 1/5th of the geometric file size). This is expected as stale index records is removed from the B+-trees only when the net worth of American households. In the general case, many millions of samples may be needed to estimate the net worth of the average household accurately (due to a small ratio between the average household's net worth and the standard deviation of this statistic across all American households). However, if the same set of records held information about the size of each household, only a few hundred records would be needed to obtain similar accuracy for an estimate of the average size of an American household, since the ratio of average household size to the standard deviation of sample size across households in the United States is greater than 2. Thus, to estimate the answer to these two queries, vastly different sample sizes are needed. Since there is no single sample size that is optimal for answering all queries and the required sample size can vary dramatically from query to query, this part of dissertation considers the problem of generating a sample of size N from a data stream using an existing geometric file that contains a large sample of records from the stream, where N < R. We will consider two specific problems. First, we consider the case where N is known beforehand. We will refer to a sample retrieved in this manner as a batch sample. We will also consider the case where N is not known beforehand, and we want to implement an iterative function GetNext. Each call to GetNext results in an additional sampled record being returned to the caller, and so N consecutive calls to GetNext results in a sample of size N. We will refer a sample retrieved in this manner as an online or sequential sample. 1.4 Index Structures For The Geometric File A geometric file could easily contain a sample of size several gigabytes or even terabytes. A huge sample like this may often contain too much information and it becomes expensive to scan all the records of a sample to find those (most likely very few) records that match a given condition. A natural way to speed up the search and discovery of those records from a geometric file that have a particular value for a particular attribute is to build an index structure. In this part of the dissertation we discuss and compare three different index structures for the geometric file. In general an index is a data structure that lets us find a record without having to look at more than a small fraction of all possible records. An index is referred to as primary index if it Insertion into the Co component has no I/O cost associated with it. However, its size is limited by the size of the available memory. Thus, we must efficiently migrate part of the Co component to the disk-resident C1 component. Whenever the Co component reaches a threshold size an ongoing rolling merge process removes some records (a contiguous segment) from the Co component and merges them into the Ci component on disk. The the rolling merge process is depicted pictorially in Figure 2.2 of the original LSM-Tree paper [44]. The rolling merge is repeated for migration between higher components of an LSM-Tree in similar manner. Thus, there is a certain amount of delay before records in the Co component migrate out to the disk-resident Ci and higher components. Deletions are performed concurrently in batch fashion similar to inserts. The disk resident components of an LSM-tree are comparable to a B+-tree structure, but are optimized for sequential disk access, with nodes 100% full. Lower levels of the tree are packed together in contiguous, multi-page disk blocks for better I/O performance during the rolling merge. 6.5.2 Index Maintenance and Look-Ups As in case of previously proposed index structures, every time the buffer is fulled and partitioned into segments, we create an index record for each buffered record and bulk insert them all into an LSM-tree index. The index record is comprised of five fields: (1) the key value, (2) the disk page number of the record, (3) an offset within the page, (4) the segment number to which the record belongs, and (5) the subsample number to which the record belongs. The segment and subsample number are recorded with each index record to determine its staleness. Every time a record is migrated from a lower component to a higher disk based component, the rolling merge additionally identifies stale records and removes them from the tree structure. We refer to an index record as a stale record if it is indexing a record either from a subsample that is decayed completely, or a segment of a subsample that is overwritten. We use the existing LSM-Tree-based point query and range query algorithms to perform index look-ups. As in case of previously proposed index structures, we sort the valid index This file organization has several significant benefits for use in maintaining a very large sample from a data stream: Performing a buffer flush requires absolutely no reads from disk. Each buffer flush requires only T random disk head movements; all other disk I/Os are sequential writes. To add the new samples from the buffer into the geometric file to create a new subsample S, we need only seek to the position that will be occupied by each of S's on-disk segments. Even if segments are not block-aligned, only the first and last block in each over-written segment must be read and then re-written (to preserve the records from adjacent segments). Algorithm 3 Reservoir Sampling with a Geometric File 1: Set numSubsamples = 0 2: for int i = 1 to oo do 3: Wait for a new record r to appear in the stream 4: if i < R then 5: Add r to B 6: if Count(B) == IB|InmSubsamples then 7: Randomize the ordering of the records in B 8: Set n Count(B) x (1 a) 9: Partition B into segments of size n, na, na2, and so on 10: Flush the first T segments to the disk 11: Store the group of remaining segments in main memory 12: numSubsamples + + 13: B = 14: else 15: with probability IR|/i do 16: with probability Count(B)/ IR do 17: Replace a random record in B with r 18: else do 19: Add r to B 20: if Count(B) = BI then 21: Partition the buffer into segments of size n, na, na2, and so on (see Section 3.7.1) 22: for each segment sgj from B do 23: Overwrite the largest segment of jth largest subsample of R with sgj 24: B = 3.7.1 Introducing the Required Randomness One issue that needs to be addressed is the partitioning the buffer into segments in Algo- rithm 3 Step (21). In order to maintain the algorithm's correctness, when the buffer is flushed 1.6e+17 Biased sampling w/o skewed records 1.4e+17 Unbiased reservoir sampling Biased sampling worst case 1.2e+17 l e+17 ------ S 8e+16 7 6e+16 4e+16 2e+16 0 'I' I I' 0 0.2 0.4 0.6 0.8 1 Correlation Factor Figure 7-5. Sum query estimation accuracy for zipf=O.8. By testing each of the three different scenarios described in the previous subsection over a set of data sets created by varying zipf as well as the correlation factor, we can see the effect of data skew and of bias function quality on the relative quality of the estimator produced by each of the three scenarios. For each experiment, we generate a data stream of one million records and obtain a sample of size 1000. For each of the three scenarios and each of the data sets that we test, we repeat the sampling process 1000 times over the same data stream in Monte-Carlo fashion. The variance of the corresponding estimator is reported as the observed variance of the 1000 estimates. The observed Monte-Carlo variances are depicted in Figures 7-3, 7-4, 7-5, and 7-6. 7.2.2 Discussion It is possible to draw a couple of conclusions based on the experimental results. Most significant is that biased sampling under the pathological record ordering shows qualitative performance similar to the biased sampling without any overweight records. Even though in the pathological case the sample might not be biased exactly as specified by the user-defined function f, the number of records not sampled according to f is usually small, and the resulting estimator typically suffers from an increase in variance of around a factor of ten of less. This demonstrates Algorithm 10 Construction and Maintenance of a Segment-Based Index Structure 1: Set n |B| x (1 a) 2: Set totSegslnSubsam -,o log n+og(- 3: Set totSubsamlnR 0 4: Set totSegslnR = 0 5: Set numRecs = 0 6: while numRecs < |R1 do 7: numRecs+ = IBI X atotSubamInR 8: totSegslnR+ = totSegslnSubsam totSubsamlnR 9: totSubsamlnR + + 10: Set BTree array of size totSegslnR 11: for inti = 1 to oo do 12: if Buffer B is partitioned then 13: for each segment sgj in B do 14: Build a B+-Tree BTj 15: if i < |R then 16: Flush BTj on the disk at next available spot in the Index File 17: else 18: Overwrite the B+-Tree for the largest segment of jth largest subsample of R with BTj 19: Record BTj root and its disk position in BTree array 6.3.3 Index Look-Up and Search A segment-based index structure is a collection of B+-Trees, one for each segment of the geometric file. Any index-based search involves looking up all B+-Tree indexes. We use the existing B+-Tree-based point query and range query algorithms and re-run them for each entry in the B+-Tree array. The algorithm returns all index records that satisfy the search criteria. We sort the valid index records by their page number attribute. We then retrieve the actual records from the geometric file and return them as a query result. We expect a segment-based index structure to be a compact structure as there is exactly one index record present in the index structure for each record in the geometric file, and the index structure is maintained as new records are deleted from the file. 6.4 A Subsample-Based Index Structure Although compact, the segment-based index structure has little too many small indexes. The requirement that we perform a look-up using every single one of a large number of B+-Tree can easily degrade the performance of index-based search. A geometric file could easily have multiple thousands of segments in it; even with two disk seeks per B+-Tree to retrieve an index record, a simple point query may required thousands of disk seeks to return the query results. An alternative to a segment-based index structure is to build a B+-Tree index for each subsample of the geometric file. We refer this approach as a subsample-based index structure. 6.4.1 Index Construction and Maintenance Every time the buffer accumulates the desired number of samples for a new subsample, we build a single B+-Tree index for all the buffered records. As in the case of a segment-based index structure, we construct an index record for each buffer record and then bulk insert them all to create a B+-Tree index. The structure of the index record for a subsample-based index structure is the same as that of a segment-based index structure, except that we add an attribute recording the segment number to which the buffered record belongs. As discussed subsequently, we use the segment number associated with the index record to determine if it is stale. We remember the B+-Tree added to the structure by keeping track of its root node in an array structure. As in the case of a segment-based index structure, we arrange the B+-Tree indexes on disk in a single index file. However, we need a slightly different approach, because during the start-up subsamples are flushed to the geometric file, until the reservoir is full. Thereafter subsamples of the same size |B are added to the reservoir. Since each B+-Tree will index no more than |B records, we can bound the size of a B+-Tree index. We use this bound to pre- allocate a fix-sized slot on disk for each B+-Tree. Furthermore, for every buffer flush after the reservoir is full, exactly one subsample is added to the file and the smallest subsample of the file decays completely, keeping the number of subsamples in a geometric file constant. We use this information to lay out the subsample-based B+-Trees on disk and maintain them as new records are sampled from the data stream. Thus, if totSubsamples is the total number subsamples in R, we first allocate fixed-size totSubsamples slots in the index file. Initially all the slots are empty. During start-up, as a new B+-Tree is built, we seek to the next available slot and write out the B+-Tree in a sequential ACKNOWLEDGMENTS At the end of my dissertation I would like to thank all those people who made this disserta- tion possible and an enjoyable experience for me. First of all I wish to express my sincere gratitude to my adviser Chris Jermaine for his patient guidance, encouragement, and excellent advice throughout this study. If I would have access to magic tool create-your-own-adviser, I still would not have ended up with anyone better than Chris. He always introduces me to interesting research problems. He is around whenever I have a question, but at the same time encourages me to think on my own and work on any problems that interest me. I am also indebted to Alin Dobra for his support and encouragement. Alin is a constant source of enthusiasm. The only topic I have not discussed with him is strategies of Gator football games. I am grateful to my dissertation committee members Tamer Kahveci, Joachim Hammer, and Ravindra Ahuja for their support and their encouragement. I acknowledge the Department of Industrial and Systems Engineering, Ravindra Ahuja, and chair Donald Heam for the financial support and advice I received during initial years of my studies. Finally, I would like to express my deepest gratitude for the constant support, understanding, and love that I received from my parents during the past years. geometric file is that it organizes the records to be overwritten systematically on the disk by making the observation that each existing subsample loses approximately the same fraction of its remaining records every time. 3.10 Multiple Geometric Files The value of a can have a significant effect on geometric file performance. If a = 0.999, we can expect to spend up to 95' of our time on random disk head movements. However, if we were instead able to choose a = 0.9, then we reduce the number of disk head movements by factor of 100, and we would spend only a tiny fraction of the total processing time on seeks. Unfortunately, as things stand, we are not free to choose a. According to Lemma 1, a is fixed by the ratio IB|/ R|. That is, for a fixed desired size of reservoir we need a larger buffer to lower the value of a. However, there is a way to improve the situation. Given a buffer of fixed capacity IBI and desired sample size IR|, we choose a smaller value a' < a, and then maintain more than one geometric file at the same time to achieve a large enough sample. Specifically, we need to maintain m =(1-,) geometric files at once. These files are identical to what we have described thus far, except that the parameter a' is used to compute the sizes of a subsample's on-disk segments and size of each file is -. The remainder of this Section describes the details of how in multiple geometric files are used to achieve greater efficiency. 3.11 Reservoir Sampling with Multiple Geometric Files The reservoir sampling algorithm with multiple geometric files is similar to the Algorithm 3. Each of the m geometric files is still treated as a set of decaying subsamples, and each subsample is partitioned into a set of segments of exponentially decreasing size, just as is done in Algorithm 3, Steps (5)-(13). The only difference is that as each file is created, the parameter a' is used instead of a in Steps (6), (8)-(9), and each of the m geometric files is filled after one another, in turn. Thus, each subsample of each geometric file will have segments of size n, na', na'2 and so on. Table 1-1. Population: student records Rec # Name Class Salary ($/month) 1 James Junior 1200 2 Tom Freshman 520 3 Sandra Junior 1250 4 Jim Senior 1500 5 Ashley Sophomore 700 6 Jennifer Freshman 530 7 Robert Sophomore 750 8 Frank Freshman 580 9 Rachel Freshman 605 10 Tim Freshman 550 11 Maria Sophomore 760 12 Monica Freshman 600 Total Salary: 9545.00 Table 1-2. Random sample of the size=4 Rec # Name Class Salary ($/month) 2 Tom Freshman 520 5 Ashley Sophomore 700 8 Frank Freshman 580 12 Monica Freshman 600 Other cases where a biased sample is preferable abound. For example, if the goal is to monitor the packets flowing through a network, one may choose to weight more recent packets more heavily, since they would tend to figure more prominently in most query workloads. We propose a simple modifications to the classic reservoir sampling algorithm [11, 38] in order to derive a very simple algorithm that permits the sort of fixed-size, biased sampling given in the example. Our method assumes the existence of an arbitrary, user-defined weighting function f which takes as an argument a record ri, where f(ri) > 0 describes the record's utility Table 1-3. Biased sample of the size=4 Rec # Name Class Salary ($/month) 1 James Junior 1200 4 Jim Senior 1500 7 Robert Sophomore 750 11 Maria Sophomore 760 which is the desired probability. This expression can then be used in conjunction with the next lemma to compute the variance of the natural estimator for q. Lemma 7. The variance of a is g'(rj) + r 2Pr[(rj,rk)lg(rJ)g(rk) 2 i, Pr[rjER,] 2 ,rk Pr[rjERi]Pr[rkRi q Proof Var(q) = E[q2] (E[q])2 Sg(rj) 2- E[ Z Pr]r rjERi rjERi [1 E g(rj) q2 SPr [rj Ri] r3Rrj,rlc E [Xj]g2(r) + 2XjXkg(rj)g(r) SP2[r, E[r R, Pr[rj E R1]Pr[rk G R] Pr2[rj E R} .' Pr[rj E Ri]Pr[rfC Ri] rj Pr[r RPr[r rj rj,rk This proves the lemma. By using the result of Lemma 6 to compute Pr [{rj, rk E Ri], the variance of the estimator is then easily obtained for a specific query. In practice, the variance itself must be estimated by considering only the sampled records as we typically do not have access to each and every rj during query processing. The q2 term and the two sums in the expression of variance are thus computed over each rj in the sample of biased geometric file rather than over the entire reservoir. There is one additional issue regarding biased sampling that is worth some additional discussion: how to efficiently compute the value Pr[{rj, rk} E Ri] in order to estimate 3.12 Speed-Up Analysis .. .................... 4 BIASED RESERVOIR SAMPLING .. ........................ 58 4.1 A Single-Pass Biased Sampling Algorithm ...... . . . 59 4.1.1 Biased Reservoir Sampling . . . ....... 59 4.1.2 So, What Can Go Wrong? (And a Simple Solution) . . ... 60 4.1.3 Adjusting Weights of Existing Samples . . . 62 4.2 Worst Case Analysis for Biased Reservoir Sampling Algorithm . ... 65 4.2.1 The Proof for the Worst Case . . . ...... ..... 66 4.2.2 The Proof of Theorem 1: The Upper Bound on totalDist . ... 73 4.3 Biased Reservoir Sampling With The Geometric File . . . 75 4.4 Estimation Using a Biased Reservoir ................ . .. 76 5 SAMPLING THE GEOMETRIC FILE .................. ..... 80 5.1 Why Might We Need To Sample From a Geometric File? . . .... 80 5.2 Different Sampling Plans for the Geometric File . . . ..... 80 5.3 Batch Sampling From a Geometric File ................ .. .. 81 5.3.1 A Naive Algorithm .......... . . . .. 81 5.3.2 A Geometric File Structure-Based Algorithm . . . 82 5.3.3 Batch Sampling Multiple Geometric Files . . . 84 5.4 Online Sampling From a Geometric File ............. ... .. 84 5.4.1 A Naive Algorithm .......... . . . .. 84 5.4.2 A Geometric File Structure-Based Algorithm . . . 85 5.5 Sampling A Biased Sample ............. . . ..... 88 6 INDEX STRUCTURES FOR THE GEOMETRIC FILE . . . 89 6.1 Why Index a Geometric File? ........... . . ...... 89 6.2 Different Index Structures for the Geometric File . . . 90 6.3 A Segment-Based Index Structure .................. ... .. 91 6.3.1 Index Construction During Start-up .... . . ... 91 6.3.2 Maintaining Index During Normal Operation . . . 92 6.3.3 Index Look-Up and Search .................. ...... .. 93 6.4 A Subsample-Based Index Structure ................... .... .. 93 6.4.1 Index Construction and Maintenance ... . . .. 94 6.4.2 Index Look-Up .................. ............ .. 95 6.5 A LSM-Tree-Based Index Structure .................. ..... .. 96 6.5.1 An LSM-Tree Index ............. . . ..... 96 6.5.2 Index Maintenance and Look-Ups .............. ... .. 97 7 BENCHMARKING .................. ................ .. 99 7.1 Processing Insertions .................. ............. .. 99 7.1.1 Experiments Performed .................. ....... 99 7.1.2 Discussion of Experimental Results ... . . ... 100 Definition 1. If Ri is the biased sample of the first i records produced by a data stream, the value is the true weight of a record rj if and only if Pr[rj c Ri] = f' (r ) k= I f(rk) What we will be able to guarantee is then twofold: 1. First, we will be able to guarantee that f'(rj) will be exactly f(rj) if (|R| f(ri))/totalWeight < 1 for all k > j. 2. We can also guarantee that we can compute the true weight for a given record to unbiased any estimate made using our sample (see Section 4.4). In other words, our biased sample can still be used to produce unbiased estimates that are correct on expectation [16], but the sample might not be biased exactly as specified by the user-defined function f, if the value of f(r) tends to fluctuate wildly. While this may seem like a drawback, the number of records not sampled according to f will usually be small. Furthermore, since the function used to measure the utility of a sample in biased sampling is usually the result of an approximate answer to a difficult optimization problem [15] or the application of a heuristic [52], having a small deviation from that function might not be of much concern. We present a single-pass biased sampling algorithm that provides both guarantees outlined above as Algorithm 7, and Lemma 4 proves the correctness of the algorithm. Lemma 4. Let Ri be a state of the biased sample just after the ith record in the stream has been processed. Using the biased sampling described in Al 'r I, ii 7, we are guaranteed that for each Ri and for each record rj produced by the data stream such that j < i, we have, Pr[rj E Ri]- Lif'(r) Z=1 f'(rm)" Proof We know that the probability of selecting ith record in the reservoir is IR f(ri)/totalWeight. Then, there are two cases to explore. The first, when the reservoir is full and before we encounter an overweight record rl, and the the second after we encounter such an rl. Case (i): The proof of this case is very similar to the proof of the Lemma 3. We simply use f' instead of f to prove the desired result. MAINTAINING VERY LARGE SAMPLES USING THE GEOMETRIC FILE By ABHIJIT A. POL A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2007 flushes assigned to replace them. In order to accomplish this, we note that if we did not perform consolidation and instead replaced a segment from each subsample with exactly those records assigned to overwrite records from that subsample, then on expectation a subsample would lose all of the records in its largest segment after m buffer flushes. Thus, if we somehow delay overwriting the largest segment in each file for m buffer flushes, we could sidestep the problem of losing too many records due to consolidation. The way to accomplish this is to overwrite subsamples in a lazy manner. We merge the buffer with the (j mod m)th geometric file, but we do not overwrite any of the valid samples stored in the file until the next time we get to the file. We can achieve this by allocating enough extra space in each geometric file to hold a complete, empty subsample in each geometric file. This subsample is referred to as the dummy. The dummy never decays in size, and never stores its own samples. Rather, it is used as a buffer that allows us to sidestep the problem of a subsample decaying too quickly. When a new subsample is added to a geometric file, the new subsample overwrites segments of dummy rather than overwriting largest segment of any existing subsamples. Thus, we have protected segments of subsamples that contain valid data by overwriting dummy's records instead. When records are merged from the buffer into the dummy, the space previously owned by the dummy is given up to allow storage of the file's newest subsample. After this flush, the largest segment from each of the subsamples in the file is given up to reconstitute the new dummy. Because the records in (new) dummy's segments will not be over-written until the next time that this particular geometric file is written to, all of the data that is contained within it is protected. Note that with a dummy subsample, we no longer have a problem with a subsample losing its samples too quickly. Instead, a subsample may have slightly too many samples present on disk at any given time, buffered by the file's dummy. These extra samples can easily be ignored during query processing. The only additional cost we incur with dummy is that each of the geometric files on disk must have IBI additional units of storage allocated. The use of a dummy subsample is illustrated in Figure 3-5. the variance during query evaluation. Computing Pr[{rj, rk }E R] requires that we be able to compute two subexpressions for each sampled record pair: RI- 1)f(rj) f'(rk) and y -I f'(rI) k Jf(r1I) H 2Pr[r ERi] l=k+1 2RP The first subexpressions can be easily computed with the help of running total totalWeight along with the weight multipliers associated with each subsample. When sample records are added to the reservoir, like attribute ri.weight, we store another attribute with each record, ri.oldTotalWeight and r\.oldM. The first attribute gets its value from current value of totalWeight, whereas the M(ri) is stored in the second attribute. When a query is evaluated and we need to compute the first subexpressions for a given record pair rj and rk, we compute terms in its denominator as follows: Sf'(rn) Tk.oldTotalWeight x Mr l 1 k-l k k Sf'(r) E '() f'(rk) '(r) (rk.weight x M(rk)) 1=1 1=1 l=1 The second subexpressions can also be easily computed if we maintain a running total subexp2Total for the sum log (1 2Pr[rER]) at all times. When a new record is added to the reservoir, the current values of subexp2Total is stored as another attribute r .\~1\, \p2Val along with each record. When a query is evaluated, for a given record pair rj and rk we simply evaluate ni l (1 I2Pr[r ER] (subexp2Total-rk.subexp2Val) l=k+1 |R| associated with randomization and sampling from a data management perspective. However, the assumption underlying the CONTROL project is that all of the data are present and can be archived by the system; online sampling is not considered. Our work is complementary to the CONTROL project in that their algorithms could make use of our samples. For example, a sample maintained as a geometric file could easily be used as input to a ripple join or online aggregation. 2.2 Biased Sampling Related Work Our biased sampling algorithm is based on reservoir sampling algorithm which was first proposed in the 1960s [11, 38]. Recently, Gemulla et al. [29] extended the reservoir sampling algorithm to handle deletions. In their algorithm called "random pairing" (RP) every deletion from the dataset is eventually compensated by a subsequent insertion. The RP Algorithm keeps tracks of uncompensated deletions and uses this information while performing the inserts. The Algorithm guards the bound on the sample size and at the same time utilizes the sample space effectively to provides a stable sample. Another extension to the classic reservoir sampling algorithm has been recently proposed by Brown and Haas for warehousing of sample data [10]. They propose hybrid reservoir sampling for independent and parallel uniform random sampling of multiple streams. These algorithms can be used to maintain a warehouse of sampled data that shadows the full-scale data warehouse. They have also provided methods for merging samples for different streams to create a uniform random sample. The problem of temporal biased sampling in a stream environment has been considered. Babcock et al. [7] presented the sliding window approach with restricted horizon of the sample to biased the sample towards the recent streaming records. However, this solution has a potential to completely lose the entire history of past stream data that is not a part of sliding window. The work done by Aggarwal [5] addresses this limitation and presents a biased sampling method so that we can have temporal bias for recent records as well as we keep representation from stream history. This work exploits some interesting properties of the class of memory-less bias functions to present a single-pass biased sampling algorithm for these type of biased functions. However, 3.11.3 Handling the Stacks in Multiple Geometric Files One final issue that should be considered is maintenance of the stacks associated with each subsamples of the (j mod m)th geometric file. Just as in the single file case, the purpose of the stack associated with a subsample is to store samples that are still valid, but whose space must be given up in order to store new samples from the buffer that have been flushed to disk. With multiple geometric files, this does not change. It is possible that when the buffer is written to the dummy subsample in a file, the dummy may still contain valid samples from a subsample in that file. Specifically, one or more of the dummy's segments may contain valid samples from the last subsample to own the segment. In that case, the valid samples are saved to that subsample's stack before the dummy is over-written. 3.12 Speed-Up Analysis The increase in speed achieved using multiple geometric files can be dramatic. The time required to flush a set of new samples to disk as a new subsample is dominated by the need to perform random disk head movements. For each subsample, we need two random movements to overwrite its largest segment (one to read the location and one to write a new segment) and then two more seeks for its stack adjustment; a total of around 40 ms/segment. The number of segments required to write a new subsample to disk in the case of multiple geometric files (and thus the number of random disk head movements required) is given by Lemma 2. Lemma 2. Let u = (log(1/ac'))-1. Multiple geometric files can be used to maintain an online sample of arbitrary size with a cost of O(u x log IB /|B ) random disk head movements for each newly sampled record. Proof We know that for every buffer flush, m segments in the buffer are grouped to form a consolidated segment. All such consolidated segments are then used to overwrite the largest on- disk segments of the subsamples stored in a single geometric file. From Observation 3, we know that the number on-disk segments of a subsample (and thus the number of consolidated segments) is log 3-log+log(l-a) ]. Substituting n = (1- a') x BI and simplifying the expression (as well as log a' 8e+14 Biased sampling w/o skewed records 7e+14 Unbiased reservoir sampling SBiased sampling worst case 6e+14 -.. 5e+14 5 4e+14 7 3e+14 2e+14 le+14 0 I-I I I 0 0.2 0.4 0.6 0.8 1 Correlation Factor Figure 7-4. Sum query estimation accuracy for zipf=0.5. Attribute B is the attribute that is actually aggregated by the SUM query. Each set is generated so that attributes A and B both have a certain amount of Zipfian skew, specified by the parameter zipf. In each case, the bias function f is defined so as to minimize the variance for a SUM query evaluated over attribute A. In addition to the parameter zipf, each data set also has a second parameter which we term the correlation factor. This is the probability that attribute A has the same value as attribute B. If the correlation factor is 1, then A and B are identical, and since the bias function is defined so as to minimize the variance of a query over A, the bias function also minimizes the variance of an estimate over the actual query attribute B. Thus, a correlation factor of 1 provides for a perfect bias function. As the correlation factor decreases, the quality of the bias function for a query over attribute B declines, because the chance increases that a record deemed important by looking at attribute A is, in fact, one that should not be included in the sample. This models the case where one can only guess at the correct bias function beforehand for example, when queries with an arbitrary relational selection predicate may be issued. A small correlation factor corresponds to the case when the guessed-at bias function is actually very incorrect. of the existing subsamples during a buffer flush. Though we may be able to avoid rebuilding the entire file, the fact that the buffer must over-write a subset of each on-disk subsample presents a challenge when trying to maintain acceptable performance, because this naturally leads to fragmentation (see the discussion of the localized overwrite extension in Section 3.3). For example, if there are 100 on-disk subsamples, the buffer must be split 100 ways in order to write to a portion of each of the 100 on-disk subsamples. This fragmented buffer then becomes a new subsample, and subsequent buffer flushes that need to replace a random portion of this subsample must somehow efficiently overwrite a random subset of the subsample's fragmented data. The geometric file uses a careful, on-disk data organization in order to avoid such fragmen- tation. The key observation behind the geometric file is that the number of records of a subsample that are replaced with records from buffered sample can be characterized with reasonable accu- racy using a geometric series (hence the name geometricfile). As buffered samples are added to the reservoir via buffer flushes, we observe that each existing subsample loses approximately the same fraction of its remaining records every time, where the fraction of records lost is governed by the ratio of the size of a buffered sample to the overall size of the reservoir. By "loses", we mean that the subsample has some of its records replaced in the reservoir with records from a subsequent subsample. Thus, the size of a subsample decays approximately in an exponential manner as buffered samples are added to the reservoir. This exponential decay is used to great advantage in the geometric file, because it suggests a way to organize the data in order to avoid problems with fragmentation. Each subsample is partitioned into a set of segments of exponentially decreasing size. These segments are sized so that every time a buffered sample is added to the reservoir, we expect that each existing subsample loses exactly the set of records contained in its largest remaining segment. As a result, each subsample loses one segment to the newly-created subsample every time the buffer is emptied, and a geometric file can be organized into a fixed and unchanging set of segments that are stored as contiguous runs of blocks on disk. Because the set of segments is fixed beforehand, fragmentation and update performance are not problematic: in order to replace records in an CHAPTER 5 SAMPLING THE GEOMETRIC FILE A geometric file is a simple random sample (without replacement) from a data stream. In this chapter we develop techniques which allow a geometric file to itself be sampled in order to produce smaller sets of data objects that are themselves random samples (without replacement) from the original data stream. The goal of the algorithms described in this chapter is to efficiently support further sampling of a geometric file by making use of its own structure. 5.1 Why Might We Need To Sample From a Geometric File? In Section 3.2, we argued that small samples frequently do not provide enough accuracy, especially in the case when the resulting statistical estimator has a very high variance. However, while in the general case a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. For example, reconsider the problem of estimating the average net worth of American households as described in Section 3.2. In the general case, many millions of samples may be needed to estimate the net worth of the average household accurately (due to a small ratio between the average household's net worth and the standard deviation of this statistic across all American households). However, if the same set of records held information about the size of each household, only a few hundred records would be needed to obtain similar accuracy for an estimate of the average size of an American household, since the ratio of average household size to the standard deviation of sample size across households in the United States is greater than 2. Thus, to estimate the answer to these two queries, vastly different sample sizes are needed. 5.2 Different Sampling Plans for the Geometric File Since there is no single sample size that is optimal for answering all queries and the required sample size can vary dramatically from query to query, this chapter considers the problem of generating a sample of size N from a data stream using an existing geometric file that contains a large sample of records from the stream, where N < R. We will consider two specific problems. First, we consider the case where N is known beforehand. We will refer to a sample retrieved in this manner as a batch sample. Batch samples of fixed size have been suggested for use in f(rj) k-I ff(rk) f(rj) IRlf(r-ax) Zk i+l f(rA) f(rj) IR f(Tr ) Z hi+l f(Trk) f(rj) k-N f(rk) f(rj) 'ik1 f(rk) + Yk -i+l f(Tk) Since Ck, f(rk) < IRlf(rTma), we have f(rj) lRlf(ra.x) + Ek>i+L f(rk) < f(rj) 1 f(rk) We can therefore conclude that f(rj) < f(rj) :k-1 f(Trk) k-1 f(rk) f(rj) k-I ff(rk) This proves the second part of the lemma. The proof of the second proposition regarding the effect of the first IR records: We now turn our attention to the effect of first IR| record of the stream on the worst-case distance. If rm" appears as the IR1th record in the worst case, then using the result of Lemma 5, Vj < IR| we know that k f'(rk) f(rj) k1 ff(rk) (II-1)f (r ma") |R-1 (f ()+) IRlf(rmax) + z7 IRI+I f(rk) (Il|-1R) f (rm) (IRI- )f (rma-) f(rj) (IR1 1)f(rT ax) + k HI f(rk) f(rj) k-1 ff(rk) f(rk) Yk I f(Tk) f'(rj) k 1if'(rk) f(rj) YNk- f'(rk) track of the B+-Trees for each segment in the geometric file. Each array entry simply stores the position of a B+-Tree root node. Rather than maintaining a file for each B+-Tree created, we organize multiple B+-Trees on a single disk file. We refer this single file as indexfile. The index file, in a sense, is similar to the log-structure file system proposed by Ousterhout [45]. In log-structured file system, as files are modified, the contents are written out to the disk as logs in a sequential stream. This allows writes in full-cylinder units, with only track-to-track seeks. Thus the disk operates at nearly its full bandwidth. The index file enjoys the similar performance benefits. Every time a B+-Tree is created for a memory resident segment, it is written to the index file in a sequential stream at the next available position. The array maintaining all B+-Tree root nodes is augmented with the starting disk position of the B+-Tree. Finally, we do not index segments that are never flushed to the disk. These segments are typically very small (a size of a disk block) and it is efficient to search them using sequential memory scan when geometric file is queried. 6.3.2 Maintaining Index During Normal Operation Maintaining a segment-based index structure is exceedingly simple. During normal operation as a new subsample and its segments are formed, we build a B+-Tree index for each in-memory segment just like we did during the start-up. The only difference is that the B+-Trees are written to the disk in slightly different manner. As an in-memory segment overwrites the on-disk segment, a B+-Tree for an in-memory segment overwrites the B+-Tree for the on-disk segment. We update the B+-Tree array entry for the root node of the new B+-Tree that is added to the index structure. Thus, the index maintenance for records newly inserted into the geometric file and the records that are deleted from the file is handled at the same time. The algorithm used to construct and maintain a segment-based index structure is given as Algorithm 10. (a) Initial configuration. Each of the m geometric files has an additional dummy segment that holds no data. -, - mmmm( )m mm- mmmml[][] m- mVy u***00 ---segments initially owned by the dummy (c) Existing subsamples give their largest segment to reconstitute the dummy. The data in these segments are protected until the next time the dummy is over-written. newly reconstituted dummy (b) The jth new subsample is added by overwriting the dummy in the i = (j mod m)th geometric file ***- -Onn -- .mmmm - mNNOD array of m geometric files newly added subsample (d) The next m -1 buffer flushes write new subsamples to the other m 1 geometric files, using the same process. The mth buffer flush again overwrites the dum- my in the ith geometric file, and the pro- cess is repeated from step (c). mEmmm. newly added subsample Figure 3-5. Speeding up the processing of new samples using multiple geometric files. ~~~ "~~ ~~~~ ~ ~~~ "~" ann~n 1.4e+18 Biased sampling w/o skewed records Unbiased reservoir sampling 1.2e+18 Biased sampling worst case le+18 8e+17 6e+17 4e+17 2e+17 0 0 -------------i-'^ 0 0.2 0.4 0.6 0.8 1 Correlation Factor Figure 7-6. Sum query estimation accuracy for zipf=l. that even for very skewed data sets, it is difficult for even an adversary to come up with a data ordering that can significantly alter the quality of the user-defined bias function. We also observe that for a low zipf parameter and a low correlation factor, unbiased sampling outperforms biased sampling. In other words, it is actually preferable not to bias in this case. This is because the low zipf value assigns relatively uniform values to attribute B, rendering an optimal biased scheme little different from uniform sampling. Furthermore, as the correlation factor decreases, the weighting scheme used both biased sampling schemes becomes less accurate, hence the higher variance. As the weighting scheme becomes very inaccurate, it is better not to bias at all. Not surprisingly, there are more cases where the biased scheme under the pathological ordering is actually worse than the unbiased scheme. However, as the correlation factor increases and the bias scheme becomes more accurate, it quickly becomes preferable to bias. 7.3 Sampling From a Geometric File We have also implemented and benchmarked four sampling techniques to sample geometric files discussed in the Chapter 5. Specifically, we have compared the naive batch sampling and the online sampling algorithms against a geometric file structure based batch sampling and online CHAPTER 7 BENCHMARKING In this chapter, we detail three sets of benchmarking experiments. In the first set of experi- ments, we attempt to measure the ability of the geometric file to process a high-speed stream of data records. In the second set of experiments, we examine the various algorithms for producing smaller samples from a large, disk-based geometric file. Finally, in the third set of experiments, we compare the three index structures for the geometric file for build time, disk space, and index look-up speed. 7.1 Processing Insertions In order to test the relative ability of the geometric file to process a high-speed stream of insertions, we have implemented and benchmarked five alternatives for maintaining a large reservoir on disk: the three alternatives discussed in Section 3.3, the geometric file, and the framework described in Section 3.10 for using multiple geometric files at once. In the remainder of this Section, we refer to these alternatives as the virtual memory, scan, local overwrite, geo file, and multiple geofiles options. An a' value of 0.9 was used for the multiple geo files option. All implementation was performed in C++. Benchmarking was performed using a set of Linux workstations, each equipped with 2.4 GHz Intel Xeon Processors. 15,000 RPM, 80GB Seagate SCSI hard disks were used to store each of the reservoirs. Benchmarking of these disks showed a sustained read/write rate of 35-50 MB/second, and an "across the disk" random data access time of around 10ms. 7.1.1 Experiments Performed The following three experiments were performed: Insertion experiment 1: The task in this experiment was to maintain a 50GB reservoir holding a sample of 1 billion, 50B records from a synthetic data stream. Each of the five alternatives was allowed 600MB of buffer memory to work with when maintaining the reservoir. For the scan, local overwrite, geo file, and multiple geo files options, 100MB was used as an LRU buffer for disk reads/writes, and 500MB was used to buffer newly sampled records before processing. The virtual memory option used all 600MB as an LRU buffer. In the experiment, a continual stream of the data (or "sample view" [43]) may be desirable. In order to save time and/or computer resources, queries can then be evaluated over the sample rather than the original data, as long as the user can tolerate some carefully controlled inaccuracy in the query results. This particular application has two specific requirements that are addressed by the dis- sertation. First, it may be necessary to use quite a large sample in order to achieve acceptable accuracy; perhaps on the order of gigabytes in size. This is especially true if the sample will be used to answer selective queries or aggregates over attributes with high variance (see Sec- tion 3.2). Second, whatever the required sample size, it is often independent of the size of the database, since estimation accuracy depends primarily on sample size 1 In other words, the required sample size will generally not grow as the database size increases, as long as other factors such as query selectivity remain relatively constant. Thus, this application requires that we be able to maintain a large, disk-based, fixed-size random sample of the archived data, even as new data are added to the warehouse. This is precisely the problem we tackle in the dissertation. For another example of a case where existing sampling methods can fall short, consider stream-based data management tasks, such as network monitoring (for an example of such an application, we point to the Gigascope project from AT&T Laboratories [18-20]). Given the tremendous amount of data transported over today's computer networks, the only conceivable way to facilitate ad-hoc, after-the-fact query processing over the set of packets that have passed through a network router is to build some sort of statistical model for those packets. The most obvious choice would be to produce a very large, statistically random sample of the packets that have passed through the router. Again, maintaining such a sample is precisely the problem we tackle in this dissertation. While other researchers have tackled the problem of maintaining an 1 The unimportance of database size for certain queries is due to the fact that the bias and vari- ance of many sampling-based estimators are related far more to sample size than to the sampling fraction (see Cochran [16] for a thorough treatment of finite population random sampling). Algorithm 2 Reservoir Sampling with a Buffer 1: for int i = 1 to oo do 2: Wait for a new record r to appear in the stream 3: if i < R then 4: Add r directly to R and continue 5: else 6: with probability IR|/i do 7: with probability Count(B)/IRI do 8: //new samples can overwrite buffered samples 9: Replace a random record in B with r 10: else do 11: Add r to B 12: if Count(B) == BI then 13: Scan the reservoir R and empty B in one pass 14: B = a record that has been randomly selected for replacement by line (9) of Algorithm 2, and so all of the database blocks must be updated. Thus, it makes sense to rely on fast, sequential I/O to update the entire file in a single pass. The drawback of this approach is that every time that the buffer fills, we are effectively rebuilding the entire reservoir to process a set of buffered records that are a small fraction of the existing reservoir size. The localized overwrite extension. We will do better if we enforce a requirement that all samples are stored in a random order on disk. If data are clustered randomly, then we can simply write the buffer sequentially to disk at any arbitrary position. Because of the random clustering, we can guarantee that wherever the buffer is written to disk, the new samples will overwrite a random subset of the records in the reservoir and preserve the correctness of the algorithm. The problem with this solution is that after the buffered samples are added, the data are no longer clustered randomly and so a randomized overwrite cannot be used a second time. The data are now clustered by insertion time, since the buffered samples were the most recently seen in the data stream, and were written to a single position on disk. Any subsequent buffer flush will need to overwrite portions of both the new and the old records to preserve the algorithm's correctness, requiring an additional random disk head movement. With each subsequent flush, maintaining randomness will become more costly, as data become more and more clustered by insertion time. [15] Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach for approx- imate answering of aggregate queries. In: ACM SIGMOD International Conference on Management of Data (2001) [16] Cochran, W.: Sampling Techniques. Wiley and Sons (1977) [17] Council, T.P.: TPC-H Benchmark. http://www.tpc.org (2004) [18] Cranor, C., Gao, Y, Johnson, T., Shkapenyuk, V., Spatscheck, O.: Gigascope high per- formance network monitoring with an sql interface. In: ACM SIGMOD International Conference on Management of Data (2002) [19] Cranor, C., Johnson, T., Spatscheck, O., Shkapenyuk, V.: Gigascope: A stream database for network applications. In: ACM SIGMOD International Conference on Management of Data (2003) [20] Cranor, C., Johnson, T., Spatscheck, O., Shkapenyuk, V.: The gigascope stream database. In: IEEE Data Engineering Bulletin, pp. 26(1): 27-32 (2003) [21] Dobra, A., Garofalakis, M., Gehrke, J., Rastogi, R.: Processing complex aggregate queries over data streams. In: ACM SIGMOD International Conference on Management of Data (2002) [22] Duffield, N., Lund, C., Thorup, M.: Charging from sampled network usage. In: IMW '01: Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement, pp. 245-256. ACM Press, New York, NY, USA (2001) [23] Estan, C., Naughton, J.E: End-biased samples for join cardinality estimation. In: ICDE '06: Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), p. 20. IEEE Computer Society, Washington, DC, USA (2006) [24] Estan, C., Varghese, G.: New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst. 21(3), 270-313 (2003) [25] F. Olken, D.R.: Random sampling from b+ trees. In: International Conference on Very Large Data Bases (1989) [26] F. Olken, D.R.: Random sampling from database files a survey. In: International Working Conference on Scientific and Statistical Database Management (1990) [27] F. Olken D. Rotem, P.X.: Random sampling from hash fies. In: ACM SIGMOD Interna- tional Conference on Management of Data (1990) [28] Ganguly, S., Gibbons, P., Matias, Y, Silberschatz, A.: Bifocal sampling for skew-resistant join size estimation. In: ACM SIGMOD International Conference on Management of Data (1996) -low address on disk `u -,new samples high address on disk (a) (-low address on disk new C samples high address on disk (b) low address on disk uI samples high address on disk samples n dd-- ` samf high address on disk high addss on disk (d) (e) ew - )les sam high address on disk (f) new iples Figure 3-3. Building a geometric file. Several data structures and algorithms have been proposed to speed up index inserts such as the LSM-Tree [44], Buffer-Tree [6], and Y-Tree [12]. These papers consider problem of providing I/O efficient indexing for a database experiencing a very high record insertion rate which is impossible to handle using a traditional B+-Tree indexing structure. In general these methods buffer a large set of insertions and then scan the entire base relation, which is typically organized as a B+-Tree, at once adding new data to the structure. Any of the above methods could trivially be used to maintain a large random sample of a data stream. Every time a sampling algorithm probabilistically selects a record for insertion, it must overwrite, at random, an existing record of the reservoir. Once an evictee is determined, we can attach its location as a position identifier (a number between 1 and R) with a new sample record. This position field is then used to insert the new record into these index structures. While performing the efficient batch inserts, if an index structure discovers that a record with the same position identifier exists, it simply overwrites the old record with the newer one. However, none of these methods can come close to the raw write speed of the disk, as the geometric file can [13]. In a sense, the issue is that while the indexing provided by these structures could be used to implement efficient, disk-based reservoir sampling, it is too heavy- duty a solution. We would end up paying too much in terms of disk I/O to send a new record to overwrite a specific, existing record chosen at the time the new record is inserted, when all one really needs is to have a new record overwrite any random, existing record. There has been much recent interest in approximate query processing over data streams (a very small subset of these papers is listed in the References section [1, 21, 34]); even some work on sampling from a data stream [7]. This work is very different from our own, in that most existing approximation techniques try to operate in very small space. Instead, our focus is on making use of today's very large and very inexpensive secondary storage to physically store the largest snapshot possible of the stream. Finally, we mention the U.C. Berkeley CONTROL project [37] (which resulted in the development of online aggregation [33] and ripple joins [32]). This work does address issues flush it in a single scan of the reservoir and overwrite the records as dictated by the sorted order of the position array. It is obvious that this process is equivalent to the steps (5-6) of Algorithm 1 as far as correctness is concerned. Logically, steps (7-14) of Algorithm 2 actually implement exactly this process. The probability that we will generate a random position between 1 and IRI that is already in the position array of size BI is IBI/R. Step (7) of Algorithm 2 decides whether to overwrite a random buffered record with a newly sampled record. Once the buffer is full, step (13) performs a one pass buffer-reservoir merging by generating sequential random positions in the reservoir on the fly. 3.9.2 Correctness of the Reservoir Sampling Algorithm with a Geometric File In Algorithm 2 we store the samples sequentially on the disk and overwrite them in a random order. Though correct, the algorithm demands almost a complete scan of the reservoir (to perform all random overwrites) for every buffer flush. We can do better if we instead force the samples to be stored in a random order on disk so that they can be replaced via an overwrite using sequential I/Os. The localized overwrite extension discussed before use this idea. Every time a buffer is flushed to the reservoir it is randomized in main memory and written as a random cluster on the disk. We maintain the correctness of this technique by splitting the random cluster in N-ways where N is the number of existing clusters on the disk and by overwriting random subset of each existing cluster. This avoids the problem of clustering by insertion time. However, the drawback of this technique is that the solution deteriorates because of fragmentation of clusters. The geometric file overcomes the drawbacks of these two techniques and can be viewed as a combination of Algorithm 2 and the idea used in the localized overwrite extension. The correctness of the Geometric file is results directly from the correctness of these two techniques. In case of the geometric file the entire sample in the main memory (referred to as a subsample) is randomized and flushed into the reservoir. Furthermore, each new subsample is split into exactly those many segments as the number of existing subsamples on the disk. These segments then overwrite a random portion of each disk-based subsample. The only difference with the Eventually, this solution will deteriorate, unless we periodically re-randomize the entire reservoir. Unfortunately, re-randomizing the entire reservoir is as costly as performing an external-memory sort of the entire file containing samples, and requires taking the sample off-line. 3.4 The Geometric File The three extensions to Algorithm 1 can be used to maintain a large, on-disk sample, but all of them have drawbacks. In this section, we discuss a fourth algorithm and an associated data organization called the geometric file to address these pitfalls. The geometric file is best seen as an extension of the massive rebuild option given as Algorithm 2. Just like Algorithm 2, the geometric file makes use of a main-memory buffer that allows new samples selected by the reservoir algorithm to be added to the on-disk reservoir in a lazy fashion. However, the key difference between Algorithm 2 and the algorithms used by the geometric file is that the geometric file makes use of a far more efficient algorithm for merging those new samples into the reservoir. Intuitive description: Except for Step (13) of Algorithm 2, the basic algorithm employed by the geometric file is not much different. As far as Step (13) is concerned, the difference between the geometric file and the massive rebuild extension is that the geometric file empties the buffer more efficiently, in order to avoid scanning or periodically re-randomizing the entire reservoir. To accomplish this, the entire sample in main memory that is flushed into the reservoir is viewed as a single subsample or a stratum [16], and the reservoir itself is viewed as a collection of subsamples, each formed via a single buffer flush. Since the records in a subsample are non- random subset of the records in the reservoir (they are sampled from the stream during a specific time period), each new subsample needs to overwrite a true, random subset of the records in the reservoir in order to maintain the correctness of the reservoir sampling algorithm. If this can be done efficiently, we can avoid rebuilding the entire reservoir in order to process a buffer flush. At first glance, it may seem difficult to achieve the desired efficiency. The buffered records that must be added to the reservoir will typically overwrite a subset of the records stored in each analogous to Olken and Rotem's procedure for choosing the number of records to select from each hash bucket when performing batched sampling from a hashed file [26]. Once the number of sampled records from each segment has been determined, sampling those records can be done with an efficient sequential read since within each on disk segment, all records are store in a randomized order. The key algorithmic issue is how to calculate the contribution of each subsample. Since this contribution is a multivariate hypergeometric random variable, we can use an approach analogous to Algorithm 4, which is used to partition the buffer to form the segments of a subsample. In other words, we can view retrieving N samples from a geometric file analogous to choosing N random records to overwrite when new records are added to the file. The resulting algorithm can be described as follows. To start with, we partition the sample space of N records into segments of varying size exactly as in Algorithm 4. We refer to these segments of the sample space as sampling segments. The sampling segments are then filled with samples from the disk using a series of sequential reads, analogous to the set of writes that are used to add new samples to the geometric file. The largest sampling segment obtains all of its records from the largest subsample, the next largest sampling segment obtains all its record from second largest subsample, and so on. Algorithm 8 Batch Sampling a Geometric File 1: Set NS = Number of subsamples in a geometric file 2: for i =1 to NS do 3: Set RecsInSubsam[i] = Size of ith subsample 4: Set RecsToRead[i] = 0 5: for i =1 to NS do 6: Choose j such that Pr[choosing j]= RecslnS, l.u, [i] /1RI 7: RecslnS ,F..,[di] - 8: RecsToR ..i[ j] + + 9: for i =1 to NS do 10: Append to batchsample RecsToRead[i] records from the ith subsample When using this algorithm, some care needs to be taken when N approaches to the size of a geometric file. Specifically, when all disk segments of a subsample are returned to a corresponding sampling segment, we must also consider the subsample's in-memory buffered 14 Subsample-based - Segment-based 12 LSM-Tree m 10 o 8 / co 6 c0 o 4 0 2 4 6 8 10 12 Time elapsed in Hrs Figure 7-8. Disk footprint for 200B record size speed are shown in Table 7.4, the disk space used by three index structure is plotted in Figure 7-8, and the index look-up speed in tabulated in Table 7.4.2. Thus, we test the effect of record size on the three index structure. 7.4.2 Discussion It is possible to draw a few conclusions based on the experimental results. The subsample- based index structure shows the best build time, the segment-based index structure has the most compact disk footprint, whereas the LSM-tree-based index structure has best response to the index look-ups. Table 7.4 shows millions of records inserted into geometric file after ten hours of insertions and concurrent updates to the index structure. For comparison we present the number of records inserted into a geometric file when no index structure is maintained (the "no index" column). It is clear that the subsample-based index structure performs the best on insertions, with performance comparable to the "no index" option. This difference reflects the cost of concurrently maintaining the index structure. The segment based index structure does the next best. It is slower than the subsample-based index structure because of higher number of seeks performed during the start- up. Recall that during start-up the segment-based index must write a B+-tree for each segment. size of a random sample obtained with sampling probability of 1/7, where r is the threshold used by these algorithms. Thus, the threshold r is carefully selected to control the sample size and if required, it is increased to honor the upper bound of the sample size. The problem of implementing fixed size sampling design with desired and unequal inclusion probabilities has been studied in statistics. The monogram Theory of Sample Surveys [50] discusses several methods for such a sampling technique, which is of some practical importance in survey sampling. This monogram begins by discussing two designs which mimic simple random sampling without replacement with selection probabilities for a given draw that are not the same for all the units. We first summarize these techniques. Successive Sampling: Let the selection probabilities pi, p2,... PL such that pi > 0 and i =1 1, and desired sample size be N =2. Then the design suggests that we draw r with probability pr, and q with probability pq/l( Pr). The inclusion probabilities can be expressed in terms of the selection probabilities by the fact that r is included if it is drawn on the first draw, or on the second draw not having been chosen on the first. Thus, the inclusion probability r, is given by p, (1 + Zqr p,/(1 p,)). Similarly, the value for the joint probability rrq can be deduced. The monogram suggests that the value of pr be found using an iterative computation method. Fellegi's Method: This method is very much like Successive sampling described above except that the selection probabilities are different for the second draw. The second draw probabilities are chosen such that the marginal selection probabilities for the both draws are the same. This feature makes this method suitable for rotating sampling as in labor force sampling where a fixed proportion of the sample is replaced each month. The procedure is as follows: the first draw is made with probability pr = a, and then q with probability p,/(l pr), where pi,... PL is another set of selection probabilities chosen so that C "rPq/l( Pr) a nq r sq Where ci are specified positive numbers such that Eii or. approach is that the more records we fetch sequentially from the disk during a single call to GetNext, the longer the response time will be for the particular call to GetNext during which we fetch those blocks. This is particularly worrisome if we spend a lot of time to fetch blocks which are never used (which will be the case if the user intends to draw only a relatively small-sized sample.) *Fetch few. If we fetch small number of blocks at the time of buffer refills, we reduce the maximum response time for any given GetNext call. However, we then need more seeks to sample N records. The approach can be problematic if user intends to draw a relatively large sample from the file. In order to discuss such considerations more concretely, we note that the time required to process GetNext call is proportional to the number of blocks fetched on the call, assuming that the cost to perform the required in-memory calculations is minimal. If b blocks are fetched during a particular call, we spend s + br time units on that particular call to GetNext, where s is the seek time and r is time required to scan a block. Once these b blocks are fetched we incur zero cost for next bn calls to GetNext, where n is the blocking factor (number of records per block). Thus, in the case where blocks are fetched at the first call to GetNext, we incur the total cost of s + br to sample bn records, and have a response time of s + br units at the first call to GetNext, with all subsequent calls having zero cost. Now imagine that instead we split b blocks into two chunks of size b/2 each, and read a chunk-at-a-time. Thus, the first GetNext call will cost us s + br/2 time units. Once these bn/2 records are used up we read next chuck of blocks. The total cost in this scenario is 2s + br with a response time of s + br/2 time units once at the starting point and other mid-way through. Note that although the maximum response time on any call to GetNext is reduced by half, we required more time to sample bn records. The question then becomes, How do we reconcile response time with overall sampling time to give the user optimal performance? The systematic approach we take to answering this question is based on minimizing the average square sum of response time over all GetNext calls. This idea is similar to the widely utilized sum-square-error or MSE criterion, which tries to keeps the average error or "cost" from being too high, but also penalizes particularly poor individual errors or costs. However, one We therefore also need to show that the pairwise values Pr[rk, rl E Ri] has the correct value. All three-way inclusion probabilities must also be correct, as well as all four-way inclusion probabilities, and so on. In other words we need to show that for a set S of interest, Pr[S E Ri] has the correct value, for all S C R. The proof that reservoir sampling maintains the correct inclusion probability for any set of interest is actually very similar to the univariate inclusion probability correctness discussed above. We know that the univariate inclusion probability Pr[rk Ri] = R/i. For any arbitrary value of IS| < |R|, assume that we have the correct probabilities when we have seen i 1 input records, i.e. Pr[S E R-] = () / (i). When the ith record is processed (i > |R|), we have Pr[S Ri] = Pr[S E R i_l]Pr[ri e Ri]Pr[None of S's records are expelled ]+ (Pr[S E Ri1]Pr[r, ( RJ]) Pr[SR G + )] (I) R s_ S1S) P- \ = P [SE r15] -1 i + - (S) [(I \S) which is the desired probability. 3.2 Sampling: Sometimes a Little is not Enough One advantage of random sampling is that samples usually offer statistical guarantees on the estimates they are used to produce. Typically, a sample can be used to produce an estimate for a query result that is guaranteed to have error less E than with a probability 6 (see Cochran for a nice introduction to sampling [16]). The 6 value is known as the c(nifid ,h c of the estimate. Very large samples are often required to provide accurate estimates with suitably high confidence. The need for very large samples can be easily explained in the context of the Central Limit Theorem (CLT) [27]. The CLT implies that if we use a random sample of size N to estimate the mean p of a set of numbers, the error of our estimate is usually normally CHAPTER 2 RELATED WORK In this chapter, we first review the literature on reservoir sampling algorithms. We then present the summary of existing work on biased sampling. 2.1 Related Work on Reservoir Sampling Sampling has a very long history in the data management literature, and research continues unabated today [2, 3, 8, 14, 15, 28, 32, 33, 35, 51, 52]. However, the most previous papers (including the aforementioned references) are concerned with how to use a sample, and not with how to actually store or maintain one. Most of these algorithms could be viewed as potential users of a large sample maintained as a geometric file. As mentioned in the introduction chapter, a series of papers by Olken and Rotem (including two papers listed in the References section [25, 27]) probably constitute the most well-known body of research detailing how to actually compute samples in a database environment. Olken and Rotem give an excellent survey of work in this area [26]. However, most of this work is very different than ours, in that it is concerned primarily with sampling from an existing database file, where it is assumed that the data to be sampled from are all present on disk and indexed by the database. Single pass sampling is generally not the goal, and when it is, management of the sample itself as a disk-based object is not considered. The algorithms in this dissertation are based on reservoir sampling, which was first de- veloped in the 1960s [11, 38]. In his well-known paper [53], Vitter extends this early work by describing how to decrease the number of random numbers required to perform the sampling. Vitter's techniques could be used in conjunction with our own, but the focus of existing work on reservoir sampling is again quite different from ours; management of the sample itself is not considered, and the sample is implicitly assumed to be small and in-memory. However, if we re- move the requirement that our sample of size N be maintained on-line so that it is always a valid snapshot of the stream and must evolve over time, then sequential sampling techniques related to reservoir sampling that could be used to build (but not maintain) a large, on-disk sample (see Vitter [54], for example). online sample targeted towards more recent data [7], no existing methods have considered how to handle very large samples that exceed the available main memory. In this dissertation we describe a new data organization called the geometric file and related online algorithms for maintaining a very large, disk-based sample from a data stream. The dissertation is divided into four parts. In the first part we describe the geometric file organization and detail how geometric files can be used to maintain a very large simple random sample. In the second part we propose a simple modification to the classical reservoir sampling algorithm to compute a biased sample in a single pass over the data stream and describe how the geometric file can be used to maintain a very large biased sample. In the third part we develop techniques which allow a geometric file to itself be sampled in order to produce smaller sets of data objects. Finally, in the fourth part, we discuss secondary index structures for the geometric file. Index structures are useful to speed up search and discovery of required information from a huge sample stored in a geometric file. The index structures must be maintained concurrently with constant updates to the geometric file and at the same time provide efficient access to its records. We now give an introduction to these four parts of the dissertation in subsequent sections. 1.1 The Geometric File If one accepts the notion that being able to maintain a very large (but fixed size) random sample from a data stream is an important problem, it is reasonable to ask: Is maintaining such a sample difficult or costly using modern algorithms and hardware? Fortunately, modem storage hardware gives us the capacity to inexpensively store very large samples that should suffice for even difficult and emerging applications. A terabyte of commodity hard disk storage now costs less than $1,000. Given current trends, we should see storage costs of $1,000 per petabyte by the year 2020. However, even given such large storage capacities, it turns out that maintaining a large sample is difficult using current technology. The problem is not purchasing the hardware to store the sample; rather, the problem is actually getting the samples onto disk, so as to guarantee the statistical randomness of the sample, in the face of data streams that may exceed tens of gigabytes per minute in the case of a network monitoring application. 3.9 Why Reservoir Sampling with a Geometric File is Correct? We discuss the correctness of the geometric file by answering the following questions: 1. Why is the classical reservoir sampling algorithm (presented as Algorithm 1) correct? That is what is the invariant maintained by the Algorithm 1? 2. Why is the obvious disk-based, extension of Algorithm 1 (presented as Algorithm 2) correct? That is how does Algorithm 2 maintain the invariant of Algorithm 1 via the use of a main memory buffer? 3. Why is the proposed geometric file based sampling technique in Algorithm 3 correct? We have answered the first question in Section 3.1. We discuss the second and third questions here. 3.9.1 Correctness of the Reservoir Sampling Algorithm with a Buffer The Algorithm 2 makes use of the main memory buffer of size IBI to buffer new samples. The buffered samples logically represent a set samples that should have been used to replace on-disk samples in order to preserve the correctness of the sampling algorithm, but that have not yet been moved to disk for performance reasons (that is, due to lazy writes). It is not hard to see that the invariant maintained by Algorithm 1 is also maintained by Algorithm 2 in step (6). The new records are sampled with the same probability I R/i. The only difference is that newly sampled records are added to the reservoir using steps (7-14) instead of simple steps (5-6) of Algorithm 1. We now discuss why these steps are equivalent. One straightforward way of keeping the sampled records in the buffer and do lazy writes is as follows. Every time we decide to add a new sample to the buffer (i.e. with probability I R/i), we also generate a random number between 1 and R to decide its position in the reservoir. However, we store this position in the position array and thus avoid an immediate disk seek. If we happen to generate a position that is already in the position array, we overwrite the corresponding record in the buffer with the newly sampled record. If we would have flushed that record to disk using the classic algorithm (rather than buffering it), we would have replaced it with the newly sampled record. Thus we would obtain the same result. Once the buffer is full we Definition 2. Iff is the user-defined bias function and f' is the actual bias function, then the distance between ,hl \,, two functions is defined as totalDist(f, f') = EC dist(ri), where dist(r) f'(ri) f(r) k-1 fP(rk) k- (O For a data stream with no overweight records, totalDist(f, f') = 0 (the best case). The worst case distance is given by the Theorem 1 and is analyzed and proved in the Appendix of this paper. Theorem 1. Given a set of streaming records rl, r2,... rN and a user-defined weighting function f, Alg,,ritihm 7 will sample with an actual bias function f' where totalDist(f, f') is upper bounded by k N E ( f() 11() (RI p )f(r' ) + = R f(+) 1 f(r ) R ff((r) k IR1+1 f(r'k) and r, r, ., r' is the permutation (reordering) of the streaming records such that f(r') < f(r) < <_ f(rK') According to this Theorem, the worst case occurs when the reservoir is initially filled (on startup) with the R records having the smallest possible weights (that is, we have the smallest totalWeight when the reservoir is filled) and we encounter the record with the largest weight immediately thereafter. We evaluate the effect of this worst-possible ordering in the experimental section of the paper. 4.2 Worst Case Analysis for Biased Reservoir Sampling Algorithm Algorithm 7 computes a biased sample according to f', where f' is a "close" function to a user-defined weighting function f according to the following distance metric: N f'(ri) f(ri) totalDist(f, f') = dist(ri), where dist(r)) = rfi iN N il 2kl f'(rI ) C-,l f(rk) each), whereas 10, 344 might mean that 400 seconds of disk time is spent on random disk I/Os. This is important when one considers that the time required to write 1GB to a disk sequentially is only around 25 seconds. While minimizing a is vital, it turns out that we do not have the freedom to choose a. In fact, to guarantee that the sum of all existing subsamples is IR|, the choice of a is governed by the ratio of IR| to the size of the buffer IB|: Lemma 1. (The size of a geometric file is IR) <> ((1 a = ) Proof. In the proof, (and consequently the Lemma) we ignore the fact that IB x -'1 may not be integral, we also ignore the storage associated with auxiliary structures such as the stacks and the beta segments. In this case, the geometric file is simply collection of subsamples of decaying size. We know that the largest subsample on disk is created by the most recent buffer flush and has |B records in it. From Observation 1 the size of the ith subsample of a file is IBI x ai-1 It then follows from Observation 2 that the total size of all subsamples of a geometric file is 1, |BI x a'-1 BI and thus (1 a) = . We will address this limitation in Section 3.10. 3.8.2 Choosing a Value for Beta It turns out that the choice of 3 is actually somewhat unimportant, with far less impact than a. For example, if we allocate 32KB for holding our 3 in-memory samples for each subsample, and |B|/|RI is 0.01, then as described above, adding a new subsample requires that 1029 segments be written, which will require on the order of 1029 seeks. Redoing this calculation with 1MB allocated to buffer samples from each on-disk subsample, the number of on-disk segments is Lo0 og 10+log(1-0.99)] or 687. By increasing the amount of main memory devoted to holding the smallest segments for each subsample by a factor of 32, we are able to reduce the number of disk head movements by less than a factor of two. Thus, we will not consider optimizing 3. Rather, we will fix 3 to hold a set of samples equivalent to the system block size, and search for a better way to increase performance. 4.3 Biased Reservoir Sampling With The Geometric File It is easy to use the biased reservoir sampling algorithm with a geometric file. To use the geometric file for biased sampling, it is vital that we be able to compute the true weight of any given record. To allow this, we will require that the following auxiliary information be stored: Each record r will have its effective weight r.weight stored along with it in the geometric file on disk. Once totalWeight becomes large, we can expect that for each new record r, r.weight = f(r). However, for the initial records from the data stream, these two values will not necessarily be the same. Each subsample Si will have a weight multiplier i3 associated with it. Again, for subsam- ples containing records produced by the data stream after totalWeight becomes large, .3 will typically be one. For efficiency, 3 [ can be buffered in main memory. Along with the effective weight, the weight multiplier can give us the true weight for a given record, which will be M(r) x r.weight. Algorithmic changes: Given that we need to store this auxiliary information, the algorithms for sampling from a data stream using the geometric file will require three changes to support biased sampling. These modifications are described now: During start-up. To begin with, the reservoir is filled with the first IR| records from the stream. For each of these initial records, r.weight is set to one. Let "totalWeight" be the sum of f(r) over the first IRI records. When the reservoir is finished filling, 3 1 is set to totalWeight/IRI for every one of the initial subsamples. In this way, the true weight of each of the first IRI records produced by the data stream is set to be the mean value of f(r) for the first IRI records. Giving the first IRI records a uniform true weight is a necessary evil, since they will all be overwritten by subsequent buffer flushes with equal probability. As subsequent records are produced by the data stream. Just as suggested by Algo- rithm 4, additional records produced by the stream are added to the buffer with probability (I|R f(ri))/totalWeight, so that at least initially, the true weight of the ith record is exactly f(ri). The interesting case is when I Rf(r) > 1 when the ith record is produced by the data stream. In this case, we must scale the true weight of every existing record up so that tot" .a 1. To accomplish this, we do the following: 1. For each on-disk subsample, Mj is set to be IRIf(r) 1. totalWeight 2. For each sampled record still in the buffer, rj.weight is set to txr.weight f(r = . 3. Finally, totalWeight is set to IR f(ri). 7.3.2 Discussion of Experimental Results Not surprisingly, these results suggest that the geometric file structure based sampling methods are superior over the more obvious naive algorithms, both in the batch and online case. As expected, the naive batch sampling algorithm took almost constant time to obtain batch sample of any size as it requires scan of the entire geometric file to retrieve any batch sample. The geometric file structure based algorithm can produce a small-size batch sample very fast, and the total sampling time increases linearly with sample size. The time required for the geometric file structure based algorithm is well below the time required by the naive approach even when 1/10 of file is sampled. In case of online sampling, geometric file structure based algorithm clearly outperformed naive approach and this was not surprising as it must expend one disk seek per sample. For both, batch and online sampling multiple geometric files framework showed results analogues to single geometric file case. As expected and then demonstrated by variance plots, the variance of online naive approach is smaller than geometric file structure based algorithm. Although with this little larger variance (less than 10 times for 100k samples) in the response times, the structure based approach executed order of magnitude faster (more than 100 times for 100k samples) than the naive approach for any number of records sampled, justifying our approach of minimizing the average square sum of the response time. In other words, we got enough added speed for a small enough added variance in response time to make the trade-off acceptable. As more and more samples are obtained the variance of structure based algorithm approached variance of the naive algorithm making the trade-off even more reasonable for large intended sample sizes. Finally, we point out that both the geometric file structure based algorithms, batch and online case, were able to read sample records from disk almost at the maximum sustained speed of the hard disk, at around 45 MB/sec. This is comparable to the rate of a sequential read from disk, the best we can hope for. Table 7-3. Query timing results for 200 bytes record, |R| entire subsample decays. On the other hand, the segment-based index structure has the smallest footprint as at every buffer flush all stale records are removed from the index structure. This results in a very compact index structure. The disk space usage of the LSM-Tree-based index structure lies between these two index structures. Although at every rolling merge, stale records are removed from the part of index structure that is merging, not all of the stale records in the structure are removed all at once. As soon as the rate of removal of stale records stabilizes the disk footprint also becomes stable. Finally, we compared the index look-up speed of these three index structures. We report index look-up and geometric file access times for different selectivity queries. As expected, the geometric file access time remains constant irrespective of the index structure option and increases linearly as the query produces more output tuples. The index look-up time varied for the three index structures. The segment-based index structure (the slowest) was an order of magnitude slower than the LSM-Tree-based index structure (the fastest). This is mainly because the segment-based index structure requires index lookups in several thousand B+-Trees for any selectivity query, where the LSM-Tree-based structure uses a singe LSM-Tree, requiring a small, constant number of seeks. The performance of the subsample-based index structure lies in Scheme Selectivity Index Time File Time Total Time Point Query 6.2488 0.0338 6.2826 -10 recs 9.6186 0.1267 9.7453 Segment-Based S 100 recs 12.9885 0.9288 13.9173 1000 recs 17.6891 5.9754 23.6645 Point Query 2.50717 0.0156 2.5227 ~ 10 recs 4.92744 0.1763 5.1037 Subsample-Based u 100 recs 7.2387 0.8637 8.1024 ~ 1000 recs 9.9837 6.1363 16.1200 Point Query 0.00505 0.0174 0.0224 ~ 10 recs 0.00967 0.1565 0.1661 LSM-Tree 100 recs 0.01440 0.8343 0.8487 ~ 1000 recs 0.05987 4.9961 5.0559 = 50 million, and IBI = 250k a catastrophic event, but it increases the disk I/O associated with stack maintenance and leads to fragmentation, and so it is an event that we would like to render very rare. To avoid this, we observe that if the stack associated with a sub-sample S contains any samples at a given moment, then S has had fewer of its own samples removed than expected. Thus, our problem of bounding the growth of S's stack is equivalent to bounding the difference between the expected and the observed number of samples that S loses as IBI new samples are added to the reservoir, over all possible values for IBI. To bound this difference, we first note that after adding IBI new samples into the reservoir, the probability that any existing sample in the reservoir has been over-written by a new sample is 1 1 -{ During the addition of new records to the reservoir, we can view a subsample S of initial size IBI as a set of IBI identical, independent Bernoulli trials (coin flips). The ith trial determines whether the ith sample was removed from S. Given this model, the number of samples remaining in S after IBI new samples have been added to the reservoir is binomially distributed with IBI trials and P = Pr[s E S remains] = 1 1 Since we are interested in characterizing the variance in the number of samples removed from S primarily when IB|P is large, the binomial distribution can be approximated with very high accuracy using a normal distribution with mean = B IP and standard deviation a = IBIP(1 P) [42]. Simple arithmetic implies that the greatest variance is achieved when a subsample has on expectation lost 50' of its records to new sample (P = 0.5); at this point the standard deviation a is 0.5 B/B. Since we want to ensure that stack overruns are essentially impossible, we choose a stack size of 3/B This allows the amount of data remaining in a given subsample to be up to six standard deviations from the norm without a stack overflow, and is not too costly an additional overhead. A quick lookup in a standard table of normal probabilities tells us that this will yield only around a 109- probability that any given subsample overflows its stack. While achieving such a small probability may seem like overkill, it is important to remember that many thousands of subsamples may be created in all during the life of the geometric file, and we want to ensure that very few of them overflow their respective stacks. If 100, 000 on-disk segments not deleted from the index tree until the subsample completely decays (when the entire tree is deleted). We refer to a index record as a stale record if it belongs to a segment of a subsample that is already overwritten (lost). Recall that we have recorded a segment number in an additional field along with each index record. For a given subsample, we keep track of which of its segments are decayed so far and use this information to ignore the index records that are stale. We returns all valid index records that satisfy the search criteria. We first sort these index records by their page number attribute and then then retrieve the actual records from the geometric file and return them as a query result. Although, the subsample-based index structure maintains and must search far fewer B+Trees compared to the segment-based index structure, we except reasonable search time per B+-Tree due to the smaller size and lazy deletion policy. 6.5 A LSM-Tree-Based Index Structure An alternative to the segment-based and subsample-based index structure is to build a single index structure for the entire geometric file, and maintain it as new records are inserted in the file. Thus, we design the third index structure that makes use of the LSM-tree index [44]. The LSM-Tree is a disk-based data structure designed to provide low-cost indexing in an environment with a high rate of inserts and deletes. 6.5.1 An LSM-Tree Index An LSM-tree is composed of two or more tree-like component data structures. The smallest component of the index always resides entirely in main memory (referred as the Co tree), and all other larger components reside on disk (referred as C1, C2, ..., Cj). The schematic picture of an LSM-tree of two components is depicted in Figure 2.1 of the original LSM-Tree paper [44]. Although C1 (and higher) components are disk-resident, the most frequently referred nodes (in general nodes at higher level) of these trees are buffered in main memory for performance reasons. LSM-Tree insertions and deletions: Index records are first inserted into the main-memory- resident Co component, after which they migrate to the C1 component that is stored on disk. to disk it must overwrite a truly random subset of the records on disk. Thus, when performing the flush, we need to randomly choose records from the reservoir to replace. This implies that the on-disk subsamples (which are expectedly of size 1 7, and so on) will lose around 1- 1-aI 1-a n, na, na2 records, and so on, respectively. However, while the number of records replaced in a subsample S will on expectation be proportional to the size of S (and hence equal to the size of S's largest on-disk segment) this replacement must be performed in a randomized fashion. The situation can be illustrated as follows. Say we have a set of numbers, divided into three buckets, as shown in Figure 3-4. Now, we want to add five additional numbers to our set, by randomly replacing five existing numbers. While we do expect numbers to be replaced in a way that is proportional to bucket size (Figure 3-4 (b)), this is not always what will happen (Figure 3-4 (c)). Algorithm 4 Randomized Segmentation of the buffer 1: for each subsample i in the reservoir R do 2: Set Ni= Number of records in Si 3: Set .11 0 4: for each record r in the buffer B do 5: Randomly choose a victim subsample Si such that Pr[choosing 5J] = Nil / Nj 6: N, -; 1 + + In order to correctly introduce this variance into the geometric file, we need to add a few additional steps to Algorithm 3. Before we add a new subsample to disk via a buffer flush in Step (21), we first perform a logical, randomized partitioning of the buffer into segments, described by Algorithm 4. In Algorithm 4, each newly-sampled record is randomly assigned to replace a sample from an existing, on-disk subsample so that the probability of each subsample losing a record is proportional to its size. The result of Algorithm 4 is an array of 11 values, where 3 1 tells Step (21) of Algorithm 3 how many records should be assigned to overwrite the ith on-disk subsample. 3.7.2 Handling the Variance Of course, there is no guarantee that M1 = n, i = na, = na2, and so on, so there is no guarantee that Algorithm 3 will overwrite exactly the number of records contained in each The worst case for Algorithm 7 occurs when (1) the reservoir is initially filled with the R records having the smallest possible weights and (2) we encounter the record rmx with the largest weight immediately thereafter. Theorem 1 presented an upper bound on totalDist(f, f') in this worst case. In this section, we first provide the proof of this worst case for Algorithm 7 and then prove the upper bound on totalDist(f, f') given by Theorem 1. 4.2.1 The Proof for the Worst Case To prove the worst case for Algorithm 7, we first prove the following three propositions. These proofs lead us to the worst-case argument. If we denote the record with the highest weight in the stream as rmx and use rma to denote the case where r"m is located at position i in the stream, then for any given random ordering of the streaming records ri,..., ri-1, rmax, TN, we prove that 1. Moving the record rma" earlier in the range rjI ... rN can not decrease totalDist(f, f'). 2. When we are initially filling the reservoir, choosing |RI records with smallest possible weight maximizes totalDist(f, f'). 3. Reordering of any record that appears after rmax in the range r~+l ... rN can not increase totalDist(f, f'). The proof of the first proposition regarding moving rm~ earlier in the stream: We prove this proposition by showing that if we move r~m to r-a, totalDist(f, f) can not decrease. If rmax is not an overweight record, the claim trivially holds as moving non- overweight record does not change totalDist(f, f'). If r"ma is an overweight record, we prove that totalDist(f, f') increases because of the move. We first compute totalDisti(f, f') for rm and then compute totalDist2(f, f') for rma. We prove the claim by showing totalDist2(f, f') totalDisti(f, f') > 0. 1. An Expression for totalDist (f, f') for rax We start with the totalDist formula Reservoir sampling can be very efficient, with time complexity less than linear in the size of the stream. Variations on the algorithm allow it to "go to sleep" for a period of time during which it only counts the number of records that have passed by [53]. After a certain number of records have been seen, the algorithm "wakes up" and capture the next record from the stream. Correctness of the reservoir sampling algorithm: The reservoir sampling process can be viewed as two phase process: (1) adding the first R records to the reservoir, and (2) adding subsequent records until the input is consumed. A reservoir algorithm should maintain following invariant in the second phase: after each record is processed, a reservoir should be a simple random sample of size R of the records processed so far Algorithm 1 maintains this invariant in steps (2-6) as follows [11, 38]. The ith record processed (i > IRI), it is added to the reservoir with probability IRl/i by step 4. We need to show that for all other records processed thus far, the inclusion probability is IR /i. Let rk be any record in the reservoir s.t. k / i. Let Ri denote the state of the reservoir just after addition of the ith record. Thus, we are interested in the Pr[rk r Ri] Pr[rk G Ri] = Pr[rk G R-i1]Pr[ri E Ri]Pr[rk i Ri] + (Pr[rk G R,-i]Pr[ri i R]) R[R t R R iR- 1R i i The correctness of the inclusion probability alone is not sufficient to prove the required invariant. Consider the systematic sampling described in the Chapter 8 of Cohran book [16]. To select a sample of IRI units, systematic sampling takes a unit at random from the first k units and "every kth" unit thereafter. Although the inclusion probability in systematic sampling is the same as in simple random sampling, the properties of a sample such as variance can be far different. It is known that the variance of the systematic sampling can be better or worse compared to a simple random sampling depending on data heterogeneity and correlation coefficient between pairs of sampled units. several approximate query processing applications [1, 21, 30, 34, 39]. In general, the drawback of making use of a batch sample is that the accuracy of any estimator which makes use of the sample is fixed at the time that the sample is taken, whereas the benefit of batch sampling is that the sample can be drawn with very high efficiency. We will also consider the case where N is not known beforehand, and we want to implement an iterative function GetNext. Each call to GetNext results in an additional sampled record being returned to the caller, and so N consecutive calls to GetNext results in a sample of size N. We will refer a sample retrieved in this manner as an online or sequential sample. The drawback of online sampling compared to batch sampling is that it is generally less efficient to obtain a sample of size N using online methods. However, since the consumer of the sample can call GetNext repeatedly until an estimator with enough accuracy is obtained, online sampling is more flexible than batch sampling. An online sample retrieved from a geometric file can be useful for many applications, including online aggregation [32, 33]. In online 'riLlion. a database system tries to quickly gather enough information so as to approximate answer to an aggregate query. As more and more information is gathered, the approximation quality is improved, and the online sampling procedure is halted when the user is happy with the approximation accuracy. 5.3 Batch Sampling From a Geometric File 5.3.1 A Naive Algorithm The most obvious way to implement batch sampling is to make use of the reservoir sampling algorithm to raw a sample of size N from a geometric file of size IRI in a single pass. As the following lemma asserts, the resulting sample is also a sample of size N from the original data stream. Lemma 8. The reservoir sampling algorithm over a geometric file produces a correct random sample of the stream. Proof If S is the batch sample of size N retrieved from a geometric file R of size |R| using the reservoir sampling algorithm, then we know from the correctness of the reservoir sampling As the buffer fills. When the buffer fills and the jth subsample is to be created and written to disk, Mj is set to 1. 4.4 Estimation Using a Biased Reservoir The biased sampling algorithm presented gives a user the opportunity to make use of different weighting algorithms and estimators, depending upon the particular application domain. We discuss one such simple estimator, the standard Horvitz-Thompson estimator [50] for a sample computed using our algorithm. We derive the correlation covariancee) between the Bernoulli random variables governing the sampling of two records ri and rj using our algorithm and use this covariance to derive the variance of a Horvitz-Thomson estimator. Combined with the Central Limit Theorem, the variance can then be used to provide bounds on the estimator's accuracy. The estimator is suitable for the SUM aggregate function (and, by extension, the AVERAGE and COUNT aggregates) over a single database table for which the reservoir is maintained. Though handling more complicated queries using the biased sample is beyond the scope of the paper, it is straightforward to extend the analysis of this Section to more complicated queries such as joins [32]. Imagine that we have the following single-table query, whose (unknown) answer is q: SELECT SUM g(r) FROM THE_TABLE AS r WHERE g2(r) Given such a query, let g(r) = g9(r) if g2(r) evaluates to true 0 otherwise. Let Ri be a state of the biased sample just after the ith record in the stream has been processed. Then the unbiased Horvitz-Thompson estimator for the query answer q can be written as q = re pr al]i In the Horvitz-Thompson estimator, each record is weighted according to the inverse of its sampling probability. Next, we derive the variance of this estimator. To do this, we need a result similar to Lemma 3 that can be used to compute the probability Pr[{rj, rk} E Ri] under our biased sampling scheme. Making use of the set of stacks is fairly straightforward. Imagine that na(i-1) of a buffer's records are sent to overwrite a segment from an existing subsample Si, but according to Algo- rithm 4, 1 [ should have been. Then, there are two possible cases: Case 1: i is smaller than na(i-1) by some number of records e In this case, E records are removed from the segment that is about to be over-written and pushed onto Si's stack in order to buffer them. This is necessary because these records logically should not be over-written by the records that are going to be added to the disk, but they will be. Case 2: 1 [ is larger than na(i-1) by some number of records e. In this case, e records are popped off of Si's stack to reflect the additional records that should have been removed from Si, but were not. These stack operations are performed just prior to Step (23) in Algorithm 3. Note that since the final group of segments from a subsample of total size 3 are buffered in main memory, their maintenance does not require any stack operations. Once a subsample has lost all of its on-disk samples, overwrites of records in this set can be handled by simply replacing the records directly. 3.7.3 Bounding the Variance Because the stacks associated with each subsample will be used with high frequency as insertions are processed, each stack must be maintained with extreme efficiency. Writes should be entirely sequential, with no random disk head movements. To assure this efficiency and avoid any sort of online reorganization, it is desirable to pre-allocate space for each of the stacks on disk. To pre-allocate space for these stacks, we need to characterize how much overflow we can expect from a given subsample, which will bound the growth of the subsample's stack. It is important to have a good characterization of the expected stack growth. If we allocate too much space for the stacks, then we allocate disk space for storage that is never used. If we allocate too little space, then the top of one stack may grow up into the base of another. If a stack does overflow, it can be handled by buffering the additional records temporarily in memory or moving the stack to a new location on disk until the stack can again fit in its allocated space. This is not problem we face using this strategy in the context of online sampling is that we do not know before hand the value of N, the number of records to be sampled. Algorithm 9 GetNext for Online Sampling 1: Set NS = Number of subsamples in a geometric file 2: for i = 1 to NS do 3: Set RecsInSubsam[i] = Size of ith subsample 4: Set BufferedSubsamSize[i] = 0 5: Randomly choose a subsample Si such that Pr[choosing i] = RecsInSubsam[i] /1RI 6: RecsInSubsam[i]- - 7: if BufferedSubsamSize[i] == 0 then 8: Set numRecs to minimum of sf/r and RecsInSubsam[i] 9: Read and buffer numRecs records of Si 10: BufferedSubsamSize[i] = numRecs 11: Buf ferSubsamSize[i] - 12: Return the next available buffered record of Si To address this issue, we use a simple heuristic. Every time we refill a buffer, we look at the number of records already sampled from a subsample and assume that the user will ask for the same number of samples as the algorithm progresses. This gives us the planning horizon for which we can determine the number of blocks to be fetched. We also use the obvious constraint that the total number of samples fetched from the subsample should not exceed the number of records in a subsample. Given this, an analytic solution to the problem of minimizing the average squared cost over all calls to GetNext is as follows: If there are b number of records per blocks then let N/b be the number of blocks in the planning horizon, and let X be the number of equal size chunks that we read on every buffer refill. Our goal is to determine the value of X and the number of blocks in each chunk. We know that the time to read a chunk is proportional to s + (N/b x r)/X, and thus the square sum of response time of all GetNext calls is X(s + (N/b x r)/X)2 In order to derive a formula for the value of X that minimizes this, we simply differentiate it with respect to X and then solve for the zero. dX = Xs+ (:N/b xr)/X)2) S(Xs2 + 2Nsr + (N/b x r)2/X) dX =s2 (N/b x r)2/X2 CHAPTER 8 CONCLUSION Random sampling is a ubiquitous data management tool, but relatively little research from the data management community has been concerned with how to actually compute and maintain a sample. In this dissertation we have considered the problem of random sampling from a data stream, where the sample to be maintained is very large and must reside on secondary storage. We have developed the geometric file organization which can be used to maintain an online sample of arbitrary size with an amortized cost of O(w x logB IB/ B ) random disk head movements for each newly sampled record. The multiplier u can be made very small by making use of a small amount of additional disk space. We have presented a modified version of the classic reservoir sampling algorithm that is exceedingly simple, and is applicable for biased sampling using any arbitrary user-defined weighting function f. Our algorithm computes, in a single pass, a biased sample Ri (without replacement) of the i records produced by a data stream. We have also discussed certain pathological cases where our algorithm can provide a correctly biased sample for a slightly modified bias function f'. We have analytically bound how far f' can be from f in such a pathological case. We have also experimentally evaluated the practical significance of this difference. We have also derived the variance of a Horvitz-Thomson estimator making use of a sample computed using our algorithm. Combined with the Central Limit Theorem, the variance can then be used to provide bounds on the estimator's accuracy. The estimator is suitable for the SUM aggregate function (and, by extension, the AVERAGE and COUNT aggregates) over a single database table for which the reservoir is maintained. We have developed efficient techniques which allow a geometric file to itself be sampled in order to produce smaller data objects. We considered two sampling techniques (1) a batch sampling when sample size is known before hand and (2) an online sampling which implements an iterative function GetNext to retrieve a sample at-a-time. The goal of these algorithms was to efficiently support further sampling of a geometric file by making use of its own structure. 1/5 of total 1/5 of total 3/5 of total (a) Five new samples randomly replace existing samples which are grouped into three (b) Most likely outcome: new samples distributed proportionally (c) Possible (though unlikely) outcome: new samples all distributed to smallest bucket Figure 3-4. Distributing new records to existing subsamples. subsample's largest segment. To handle this problem, we associate a stack (or buffer1 ) with each of the subsamples. The stack associated with a subsample will buffer any of a subsample's records that logically should not have been over-written during buffer flush into the subsample (because 3 [ for some buffer flush for that subsample was smaller than expected), but whose space had to be claimed by the buffer flush in order to write a new subsample to disk. If the size of the stack is positive, it means that the corresponding subsample is larger than expected because it has had fewer of its records over-written than expected. We also allow a negative stack size. This simply means that some of the subsample's records should have been over-written but were not, because an 3 [ value for that subsample was larger than expected. A stack size of -k means that k of the subsample's on-disk records logically are not part of the reservoir (even though they are physically present on disk), and should be ignored during query processing. 1 We use the term "stack" rather than "buffer" to clearly differentiate the extra storage associ- ated with each subsample from the buffer B. BIOGRAPHICAL SKETCH Abhijit Pol was born and brought up in state of Maharashtra in India. He received his Bachelor of Engineering from Government College of Engineering Pune (COEP) University of Pune, one of the most prestigious and oldest engineering college in India, in 1999. Abhijit majored in mechanical engineering and obtained a distinguished record. He ranked second in the university merit ranking. He was employed in the Research and Development department of Kirloskar Oil Engines Ltd for one year. Abhijit received his first Master of Science from University of Florida in 2002. He majored in industrial and systems engineering. Abhijit then worked as a researcher in the Department of Computer and Information Science and Engineering at the University of Florida. He received his second Master of Science and Doctor of Philosophy (Ph.D) in computer engineering in 2007. During his studies at University of Florida, Abhijit coauthored a text book titled "Develop- ing Web-Enabled Decision Support Systems." He taught the Web-DSS course several times in the Department of Industrial and Systems Engineering at the University of Florida. He presented several tutorials at workshops and conferences on the need and importance of teaching DSS material, and he also taught at two instructor-training workshops on DSS development. Abhijit's research focus is in the area of databases, with special interests in approximate query processing, physical database design, and data streams. He has presented research papers at several prestigious database conferences and performed research at the Microsoft Research Lab. He is now a Senior Software Engineer in the Strategic Data Solutions group at Yahoo! Inc. 2)f(r-.a) E i+1 f(rk)] and Y k- IY have totalDist2(..) totalDisti(..) 2 x f(rswa) N- f(rk) 2 x f(rswap Ek- f(rk) X- f(rswa) Y + f(rswP) f(rswa) x [X + Y] Y x [Y + f(rswa)] Substituting the values for X and Y, the equation further simplifies to totalDist2(..) totalDisti(..) 2 x f(rwap) Nk-1 f(Rk) f (rswap) [(|R 2)f(r-na) l+ f(rk) + Rf(r-a) + E i1 (ik) Y x [Y + f(rPs)] 2 x f(rswp) 2 x ( f(r) 2 x f(rswaP) x (IRI- )f(rra) Since [IR f(rm ) ki+l /(k)] > (IR 1)f(r'ma), we have totalDist2(..) totalDisti(..) > 2 x f(rswp) N /f(Tk) 2 x f(rswap) [IRf(ri ) + Z i7 f(rk) + f(rswap)] Since [| f(r rna) + f(rswa) + + (rk)] > -1 1 f(rk), we have totalDist2(f, f') totalDisti(f, f') 2 x f(rswap) k 1 f(rk) 2 x f(rswap) Nk1 f(rk) >0 [IR f(rm,) k-i+f(rk)] x [IRIf(rr )+ i+1 f(rk) + f(rswa)] [(1R If we let X [Il(r -)+E i f(rk)],we It turns out that in the worst-case scenario we might have to buffer almost the entire data stream. We describe the case by construction. For a given arbitrary reservoir size |R| and the stream size N, we add first |R| records, all with the same weight wtl 1, to the reservoir. Next, we set f(rRl+l ) = i f (rk)/IR] + 1 1wt + 1 2. The inclusion probability of rRa+a1 is |R f(rlRnl+)/l z( f(rK) = 21 /(IR + 2) > 1. Since rRl+1 is an overweight record, we buffer it. We construct the remaining records of the stream with f(fIRI+2) ... f(rN) = (rlR|+1) = 2 so as to have all of them overweight and we must buffer them all. The priority queue thus contains N IRI records in it. Since f(ri) = 1 Vi < IR, we have: RIlf (r)/ k L f(rk) IRI/[ IR + 2(N RI)] < 1, and since f(ri) 2 Vi > RI and N > IR, we have: IRIf(r)/ E i f(rk) 21RI/[ R + 2(N I)] < 1. Thus, for a well-defined biased function f and the constructed stream the required queue size is N IRI. We therefore conclude that for N > IRI, the size of the buffer required for delayed insertion of the overweight records is 0(N). We stress that though this upper bound is quite poor (requiring that we need to buffer the entire data stream!) it is in fact a worst-case scenario, and the approach will often be feasible in practice. This is because weights will often increase monotonically over time (as in the case where newer records tend to be more relevant for query processing than older ones). Still, given the poor worst-case upper bound, a more robust solution is required, which we now describe. 4.1.3 Adjusting Weights of Existing Samples Another, orthogonal method for handling overweight records (that can be applied when the available buffer memory is exceeded) is to simply adjust the bias function and try to do the best that we can. Specifically, when we encounter an overweight record, we simply bump up the weights of all existing samples so as to ensure the inclusion probability of the current record is exactly one. Of course, as a result of this we will not be able to ensure that the weight of each record ri is exactly f(ri). We describe what we will be able to guarantee in the context of the true weight of a record: sampling algorithms. We have also tested these four techniques with the framework that make use of multiple geometric files. All of the algorithms were implemented on top of the geometric file prototype that was benchmarked in the previous sections. 7.3.1 Experiments Performed To compare the various options, we used the following setup. We first initialize a geometric file by sampling and adding records from a synthesized data stream to the files for a period of several hours. This ensures a realistic scenario for testing: the reservoir in the file that is to be tested has been filled, a reasonable portion of each initial subsample has been over-written, and some of the smaller initial subsamples have been removed from the file, and a number of new subsamples have been create. The parameters used in building the geometric file are the same as those describe in Experiment 2 of the previous section (a 50GB file with 50 million, 1KB records). Given such a file, the following set of experiments were performed: Sampling experiment 1: The goal of this experiment was to compare the two options for obtaining a batch sample from a geometric file: the naive algorithm, and then the geometric file structure based algorithm. For both algorithms, we plot the time to perform the sampling as a function of the desired sample size. Figure 7-2 (a) depicts the plot for a single geometric file; Figure 7-2 (b) shows an analogous plot for the multiple geometric files option. Sampling experiment 2: This experiment is analogous to Sampling Experiment 1, except that online sampling is performed via multiple successive calls to GetNext. The number of records sampled with multiple calls to GetNext versus the elapsed time is plotted in Figure 7-2 (c) for both the naive algorithm and the more advanced, geometric file structure based algorithm designed to increase the sampling rate and even out the response times. The analogous plot for multiple geometric file case is shown in Figure 7-2 (d). We also plot the variance in response times over all calls to GetNext as a function of the number of calls to GetNext in Figures 7- 2(e) and 7-2(f) (the first is for a single geometric file; the second is with multiple files). Taken together, these plots show the trade-off between overall processing time and the potential for waiting for a long time in order to obtain a single sample. segment 2: 2 segments numbered V segment 0: segment 1: no samples and after: 3 samples total n samples na samples stored in main n memory before first buffer flush: samples total 1-a disk after first buffer flush: disk after second buffer flush: disk after third buffer flush: disk O0O after buffer flush number y: disk Figure 3-1. Decay of a subsample after multiple buffer flushes. /3 that are buffered in main memory (subsequently referred to as the "beta segment"), then the ith buffer flush into R will on expectation overwrite exactly one on-disk segment from S. S loses an additional segment with every buffer flush until the subsample has only its beta segment remaining. At the point that only the subsample's beta segment remains, the samples contained therein can be replaced directly. The reason that the beta segment is buffered in main memory is that overwriting a segment requires at least one random disk head movement, which is costly. By storing the beta segment in main memory, we can reduce the number of disk head movements with little main-memory storage cost. The process is depicted in Figure 3-1. are replaced, then using a stack of size 3 IB1 will yield a very reasonable probability that we experience no overflows of (1 10-9)100 000, or 99.9, I'-.. In practice, the actual probability of experiencing no overflows will be even greater. This is due to the fact that the standard deviation in subsample size for most of a subsample's lifespan will be much less than 0.5 BVB due to the high percentage of its lifespan that it has an associated P of less than 0.5 as it slowly loses all of its samples. 3.8 Choosing Parameter Values Given a specified file size and buffer size, two parameters associated with using the geometric file must be chosen: a, which is the fraction of a subsample's records that remain after the addition of a new subsample, and 3, which is the total size of a subsample's segments that are buffered in memory. 3.8.1 Choosing a Value for Alpha In general, it is desirable to minimize a. Decreasing a decreases the number of segments used to store each subsample. Fewer segments means fewer random disk head movements are required to write a new subsample to disk, since each segment requires around four disk seeks to write (one to read the location and one to write a new segment, and similarly two more considering the cost of subsequently adjusting the stack of the previous owner). To illustrate the importance of minimizing a, imagine that we have a 1GB buffer and a stream producing 100B records, and we want to maintain a 1TB sample. Assume that we use an a value of 0.99. Thus, each subsample is originally 1GB, and | B = 107. From Observa- tion 2 we know that -- must be 107, so we must use n = 105. If we choose / = 320 (so that 3 is around the size of one 32KB disk block), then from Observation 3 we will require Sog320 log0log(-.99) 1029 segments to store the entire new subsample. Now, consider the situation if a = 0.999. A similar computation shows that we will now require 10, 344 segments to store the same 1GB subsample. This is an order-of-magnitude difference, with significant practical importance. With four disk seeks per segment, 1029 segments might mean that we spend around 40 seconds of disk time in random I/Os (at 10ms (a) 50 Byte records, 600MB buffer space hrs 4 hrs 8 hrs 12 hrs 16 hrs 20 hr Time elapsed geo file local overwrite scan & virtual mem 8 hrs 12 hrs 16 hrs 20 hr Time elapsed multiple geo files 7 local geo file overwrte (c) 50 Byte records, 150MB buffer space scan & virtual mem- 0 hrs 4 hrs 8 hrs 12 hrs 16 hrs 20 hrs Time elapsed Figure 7-1. Results of benchmarking experiments (Processing insertions). multiple geo files \V, (b) 1KB records, 600MB buffer space 7 Algorithm 5 Randomized Segmentation of the Buffer for Multiple Geometric Files 1: for each Sij, the ith subsample in jth file do 2: Set Nij= Number of records in Sij 3: Set 1. 0 4: for each record r in the buffer B do 5: Randomly choose a victim subsample Sij such that Pr[choosing ij] = Nij/ C8 Nkl 6: Nj -;V + + However, processing additional records from the stream is somewhat different. As more and more records are produced by the stream, new samples are captured and are added to the buffer exactly as in Algorithm 3 Steps (15)-(20) until buffer is full. Once the buffer is full, its record order is then randomized, just as is in a single geometric file. Next the buffer is flushed to disk. This is where the algorithm is modified. Overwriting records on disk with records from the buffer is somewhat different, in two primary ways, as discussed next. Partitioning the buffer: In Algorithm 4, the buffer is partitioned so that the size of each buffer segment is on expectation proportional to the current size of subsamples in a single file. In case of multiple geometric files, we partition the buffer just like in Algorithm 4; however, we randomly partition the buffer across all subsamples from all geometric files. The number of buffer segments after the partitioning is the same as the total number of subsamples in the entire reservoir, and the size of each buffer segment is on expectation proportional to the current size of each of the subsamples from one of the geometric files. This allows us to maintain the correctness of the reservoir sampling algorithm. The buffer partitioning steps in case of multiple geometric files are given in Algorithm 5. Merging buffer segments with multiple geometric files: This step requires quite a different approach compared to Algorithm 3's buffer merge algorithm. We discuss all the intricacies subsequently, but at high-level, the largest segment of each subsample from only one geometric file is over-written with samples from the buffer. This allows for considerable speedup, as we discuss in Section 3.12. At first, this would seem to compromise the correctness of the algorithm: logically, the buffered samples must over-write samples from every one of the geometric files (in fact, this is precisely why the buffer is partitioned across all geometric files, as TABLE OF CONTENTS page ACKNOWLEDGMENTS ................ ...................... 4 LIST OF TABLES ...................................... 8 LIST OF FIGURES ................. ............. ....... 9 ABSTRACT ................................... ....... 10 CHAPTER 1 INTRODUCTION .................................. 12 1.1 The Geometric File ................... .......... 14 1.2 Biased Reservoir Sampling ................... ........ 16 1.3 Sampling The Sample ................... .......... 18 1.4 Index Structures For The Geometric File ........ ........ .... 19 2 RELATED WORK ..................... .............. 22 2.1 Related Work on Reservoir Sampling .......... ....... ....... 22 2.2 Biased Sampling Related Work ................... ....... 24 3 THE GEOMETRIC FILE .......... ........... ........... 28 3.1 Reservoir Sampling ................... .......... 28 3.2 Sampling: Sometimes a Little is not Enough ........ ........ .. 30 3.3 Reservoir for Very Large Samples ......... ................ 31 3.4 The Geometric File ......... .. .............. 34 3.5 Characterizing Subsample Decay .......... ................ 36 3.6 Geometric File Organization .................. ......... .. 40 3.7 Reservoir Sampling With a Geometric File .... . . .... 40 3.7.1 Introducing the Required Randomness . . . ..... 41 3.7.2 Handling the Variance ............ . . ...... 42 3.7.3 Bounding the Variance .................. ....... 45 3.8 Choosing Parameter Values ............. . . .... 47 3.8.1 Choosing a Value for Alpha .................. ...... .. 47 3.8.2 Choosing a Value for Beta ...... . . ...... 48 3.9 Why Reservoir Sampling with a Geometric File is Correct? . . .... 49 3.9.1 Correctness of the Reservoir Sampling Algorithm with a Buffer . 49 3.9.2 Correctness of the Reservoir Sampling Algorithm with a Geometric File .50 3.10 Multiple Geometric Files ............ . . . 51 3.11 Reservoir Sampling with Multiple Geometric Files . . . 51 3.11.1 Consolidation And Merging .................. ...... .. 53 3.11.2 How Can Correctness Be Maintained? ..... . . ..... 53 3.11.3 Handling the Stacks in Multiple Geometric Files . . .... 56 7.2 Biased Reservoir Sampling ............ . . .... 103 7.2.1 Experimental Setup ............ . . ... 104 7.2.2 Discussion ................. . . . ... 106 7.3 Sampling From a Geometric File .................. ...... .. 107 7.3.1 Experiments Performed .................. ....... 108 7.3.2 Discussion of Experimental Results .... . . .... 109 7.4 Index Structures For The Geometric File ....... . . ... 110 7.4.1 Experiments Performed .................. ....... 110 7.4.2 Discussion ................. . . . ... 112 8 CONCLUSION ................... ........ ........ 116 REFERENCES ...................... . . . 118 BIOGRAPHICAL SKETCH ................. . . ..... 122 between these two structures. This is expected as the structure maintains many fewer B+-Trees than the segment-based index but far more than the LSM-Tree-based structure. In general the subsample-based index structure gives the best build time with reasonable index look-up speed at the cost of slightly larger disk footprint. The LSM-Tree-based index structure makes use of reasonable disk space and gives the best query performance at the cost of slow insertion rate or build time. The segment-based index structure gives comparable build time and has the most compact disk footprint, but suffers considerably when it comes to index look-ups. CHAPTER 3 THE GEOMETRIC FILE In this chapter we give an introduction to the basic reservoir sampling algorithm that was proposed to obtain an online random sample of a data stream. The algorithm assumes that the sample maintained is small enough to fit in main memory in its entirety. We discuss and motivate why very large sample sizes can be mandatory in common situations. We describe three alternatives for maintaining very large, disk-based samples in a streaming environment. We then introduce the geometric file organization and present algorithms for reservoir sampling with the geometric file. We also describe how multiple geometric files can be maintained all-at-once to achieve considerable speed up. 3.1 Reservoir Sampling The classic algorithm for maintaining an online random sample of a data stream is known as reservoir sampling [11, 38]. To maintain a reservoir sample R of target size |R|, the following loop is used: Algorithm 1 Reservoir Sampling 1: Add first |R| items from the stream directly to R 2: for int i = RI + 1 to oo do 3: Wait for a new record r to appear in the stream 4: with probability IR|/i do 5: Remove a randomly selected record from R 6: Add r to R A key benefit of the reservoir algorithm is that after each execution of the for loop, it can be shown that the set R is a true, uniform random sample (without replacement) of the first i records from the stream. Thus, at all times, the algorithm maintains an unbiased snapshot of all of the data produced by the stream. The name "reservoir sampling" is an apt one. The sample R serves as a reservoir that buffers certain records from the data stream. New records appearing in the stream may be trapped by the reservoir, whose limited capacity then forces an existing record to exit the reservoir. extensions of the reservoir algorithm to on-disk samples all have serious drawbacks. We discuss the obvious extensions now. The virtual memory extension. The most obvious adaptation for very large sample sizes is to simply treat the reservoir as if it were stored in virtual memory. The problem with this solution is that every new sample that is added to the reservoir will overwrite a random, existing record on disk, and so it will require two random disk I/Os: one to read in the block where the record will be written, and one to re-write it with the new sample. This means we can sample only on the order of 50 records per second at 10ms per random I/O per disk. Currently, a terabyte of storage requires as few as five disks, giving us a sampling rate of only 5 x 50 = 250 records per second. To put this in perspective, it would take months to sample enough 100 byte records to fill that terabyte. The massive rebuild extension. As an alternative, when new samples are selected from the stream, they are not added to the on-disk reservoir immediately. Rather, we make use of all of our available main memory to buffer new samples. At all times, the records stored in the buffer B logically represent a set samples that should have been used to replace on-disk samples in order to preserve the correctness of the reservoir algorithm, but that have not yet been moved to disk for performance reasons. When the buffer B fills, we simply scan the entire reservoir R, and replace a random subset of the existing records with the new, buffered samples. The modified algorithm is given as Algorithm 2. Count(B) refers to the current number of records in B. Note that since the records contained in B logically represent records in the reservoir that have not yet been added to disk, a newly-sampled record can either be assigned to replace an on-disk record, or it can be assigned to replace a buffered record (this is decided in Step (7) of the algorithm). In a realistic scenario, the ratio of the number of disk blocks to the number of records buffered in main memory may approach or even exceed one. For example, a 1 TB database with 128 KB blocks will have 7.8 million blocks; and for such a relatively large database it is realistic to expect that we have access to enough memory to buffer millions records. As the number of buffered records per block meets or exceeds one, most or all of the blocks on disk will contain To relate this back to the task of the reservoir sampling, imagine that our large, disk-based reservoir sample R is maintained using a reservoir sampling algorithm in conjunction with a main memory buffer B (as in Algorithm 2). Recall that the way reservoir sampling works is that new samples from the data stream are chosen to overwrite random samples currently in the reservoir. The buffer temporarily stores these new samples, delaying the overwrite of a random set of records that are already stored on disk. Once the buffer is full, all new samples are merged with the R by overwriting a random subset of the existing samples in R. Consider some arbitrary subsample S of R (so S C R), with capacity IS|. Since the buffer B represents the samples that have already over-written the equal number of records of R, a buffer flush overwrites exactly |B| samples of R. Thus, on expectation the merge will overwrite SlxB samples of S. If we define 1 a then on expectation, S should lose IS| x (1 a) of its own records due to the buffer flush1 We refer this loss as subsample decay. We can roughly describe the expected decay of S after repeated buffer merges using the three observations stated before. If the subsample retention rate a = 1 then: From Observation 1, it follows that the ith buffer merge, on expectation, removes n x a-1 samples from what remains of S. From Observation 2, it follows that the initial size of a subsample ISI = - From Observation 3, it follows that the expected number of merges required until S has 3 or less samples left is T. The net result of this is that it is possible to characterize the expected decay of any arbitrary subset of the records in our disk-based sample as new records are added to the sample through multiple emptyings of the buffer. If we view S as being composed of T on-disk "segments" of exponentially decreasing size, plus a special, a single group of final segments of total size 1 Actually, this is only a fairly tight approximation to the expected rate of decay. It is not an exact characterization because these expressions treat the emptying of the buffer into the reservoir as a single, atomic event, rather than a set of individual record additions (See Section 3.7). [45] Ousterhout, J.K., Douglis, F.: Beating the i/o bottleneck: A case for log-structured file systems. Operating Systems Review 23(1), 11-28 (1989) [46] P.B. Gibbons Y. Matias, VP.: Fast incremental maintenance of approximate histograms. In: ACM Transactions on Database Systems, pp. 27(3): 261-298 (2002) [47] Pol, A., Jermaine, C.: Biased reservoir sampling. IEEE Transactions on Knowledge and Data Engineering [48] Pol, A., Jermaine, C., Arumugam, S.: Maintaining very large random samples using the geometric file. VLDBJ (2007) [49] Shao, J.: Mathematical Statistics. Springer-Verlag (1999) [50] Thompson, M.E.: Theory of Sample Surveys. Chapman and Hall (1997) [51] Toivonen, H.: Sampling large databases for association rules. In: International Conference on Very Large Data Bases (1996) [52] V. Ganti M.-L. Lee, R.R.: Icicles self-tuning samples for approximate query answering. In: International Conference on Very Large Data Bases (2000) [53] Vitter, J.: Random sampling with a reservoir. In: ACM Transactions on Mathematical Software (1985) [54] Vitter, J.: An efficient algorithm for sequential random sampling. In: ACM Transactions on Mathematical Software, pp. 13(1): 58-67 (1987) ignoring the floor) we compute the number of segments to write as 1(log 3 log |BI). If we let u = (log(1/a'))-1 the number of segments can be expressed as w(log IBI log/3). Assuming a constant number c of random seeks per segment written to the disk, the total random disk head movements required per record is wc ((log IB log /3)/IB ), which is O(u x log IB I/ IB). D In case of multiple geometric files we use additional space for m dummy subsamples. Thus, the total storage required by all geometric files is |R| + (m x IBI). If we wish to maintain a 1TB reservoir of 100B samples with 1GB of memory, we can achieve a' = 0.9 by using only 1.1TB of disk storage in total. For a' = 0.9, we need to write less than 100 segments per 1GB buffer flush. At 40 ms/segment, this is only 4 seconds of random disk head movements to write 1GB of new samples to disk. In order to test the relative ability of the geometric file to process a high-speed stream of insertions, we have implemented and bench-marked five alternatives for maintaining a large reservoir on disk: the three alternatives discussed in Section 3.3, the geometric file, and the framework described in Section 3.10 for using multiple geometric files at once. We present these benchmarking results in Chapter 7. manner. When the reservoir is full, we have used all of the slots exactly once. During the normal operation, every time buffer is full, the slot corresponding to the smallest subsample in the reservoir (which is about to decay completely) is used to write out newly built B+-Tree. Thus, during normal operations B+-Tree slots are used in round-robin fashion. The algorithm used to construct and maintain a segment-based index structure is given as Algorithm 11. Algorithm 11 Construction and Maintenance of a Subsample-Based Index Structure 1: Set totSubsamnInR = [iog3-og IB+g(i-a) log a 2: Set BTree allTrees [totSubsamInR] 3: Set btlndex = 0 4: for int i = 1 to oo do 5: if Buffer B is partitioned then 6: for each segment j in B do 7: allTrees[btIndex].BuildBTree(j) 8: btlndex + + 9: ifi > R1 then 10: btlndex = btli,.i ,'. I. I ubsamInR 6.4.2 Index Look-Up In the subsample-based index structure, after every buffer flush, exactly one B+-Tree is created and written to the disk, making insertions in the index structure very efficient. However, most of the deletions are deferred until the subsample decays completely. Thus, although every subsample losses its records to the new subsample, B+-Tree records are deleted from the index structure only when the entire B+-Tree is to be deleted. In other words, at any given time all B+-Trees except the one recently inserted contains stale records that must be ignored during the search. A search on subsample-based index structure involves looking up all B+-Tree indexes, one for each subsample in the geometric file. We modify the existing B+-Tree-based point query and range query algorithms and run them for each entry in the B+-Tree array of the index structure. The modification is required to ignore the stale records in the B+Trees. As mentioned before, the subsample corresponding to a B+-Tree may lose its segments, but the index records are totalDist(f, f') f'(rl) Yk-1 f'(ik) f(ri) k- If(rk) f(ri) N +( =k-1 f (k) f'(ri-i) : N k I, fl(rk) Sf'(rN) c 1*I fl(Tk) f(ri-1) + f (rN) =k-1 f(Tk) Since r"x' is the ith record of the stream, using the result of Lemma 5 (given below) we re-write the totalDist formula as totalDisti(f, f') f'(ri) 1 f f( k) f'(rl) ( k f'(rIk) f(ri) k-1f(rk) VE~nr.) f l . f(rl) :N Zk-1 f(Tk) f'(ri) Y k 1 f'(r~k) f'(ri-1) + + N >3k 1 f'(rk) f(rN) N >3k 1 f(rk) ( f(r) N ( fk1 f( f1 if'( 1 Y.N .,P f(ri-1) kZ1 ff(rk) f'(rN) =k 1 f'(rk) f(ri-) rk) 1 kf(ik) f'(r N) rk) 1I f'(rk) We know that Vj < i, f'(rj) (IR-1)f (r' )f(rj) and Vj > i, f'(rj) f- (rj). We also know k-I1f r I R f(rM a) k i+1 f(rik). Therefore, the above equation simplifies to f'(ri-1) + N Y=1 kf (rk) f(rN) N t f( k) that E -I f '(rk) REFERENCES [1] A. Das J. Gehrke, M.R.: Approximate join processing over data streams. In: ACM SIGMOD International Conference on Management of Data (2003) [2] Acharya, S., Gibbons, P., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: ACM SIGMOD International Conference on Management of Data (2000) [3] Acharya, S., Gibbons, P., Poosala, V., Ramaswamy, S.: Join synopses for approximate query answering. In: ACM SIGMOD International Conference on Management of Data (1999) [4] Acharya, S., P.B. Gibbons, V.P., Ramaswamy, S.: The aqua approximate query answering system. In: ACM SIGMOD International Conference on Management of Data (1999) [5] Aggarwal, C.C.: On biased reservoir sampling in the presence of stream evolution. In: VLDB'2006: Proceedings of the 32nd international conference on Very large data bases, pp. 607-618. VLDB Endowment (2006) [6] Arge, L.: The buffer tree: A new technique for optimal i/o-algorithms. In: International Workshop on Algorithms and Data Structures (1995) [7] Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: SODA'02: Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 633-634. Society for Industrial and Applied Mathematics (2002) [8] Babcock, B., S. Chaudhuri, G.D.: Dynamic sample selection for approximate query processing. In: ACM SIGMOD International Conference on Management of Data (2003) [9] Bayer, R., McCreight, E.M.: Organization and maintenance of large ordered indexes. In: SIGFIDET Workshop, pp. 107-141 (1970) [10] Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: ICDE '06: Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), p. 6. IEEE Computer Society, Washington, DC, USA (2006) [11] C. Fan M. Muller, I.R.: Development of sampling plans by using sequential (item by item) techniques and digital computers. In: Journal of American Statistical Association, pp. 57: 387-402 (1962) [12] C. Jermaine A. Datta, E.O.: A novel index supporting high volume data warehouse insertion. In: International Conference on Very Large Data Bases (1999) [13] C. Jermaine E. Omiecinski, W.Y: The partitioned exponential file for database storage management. In: International Conference on Very Large Data Bases (1999) [14] Chaudhuri, S., Das, G., Datar, M., Motwani, R., Narasayya, V.: Overcoming limitations of sampling for aig gricglion queries. In: ICDE (2001) with probability If(rj) in Step (8) of the algorithm and the probability requirement trivially E1=1 f(ri) holds for the new record. We now must prove this fact for rk, for all k < i. Since R is correct, we know that for k < i, Pr[rk E R~ -1i (r) i Then there are two cases to consider; either EI=i f(rl) the new record ri is chosen for the reservoir, or it is not. If ri is not chosen, then rk remains in the reservoir for k < i. If ri is chosen, then rk remains in the reservoir if rk is not selected for expulsion from the reservoir (the chance of this happening if ri is chosen is (|R| 1)/|R|). Thus, the probability that a record rk is in Ri is Pr[rk e Ri] Pr[rk E Pr[rk G Pr[rk E Ri-1] Pr[rk E Ri-1] Pr[rk E Ri-i] = Pr[rk E Ri-1] t R f (rk) ' IRIf(rk) Y i1 f(r') This is the desired results and prn Pr[ri E Ri]Pr[rk not expelled] + (Pr[rk E Ri-1]Pr[r, Ri) (Pr[ri G R] ( ) + 1 Pr[r G Rj]) (R\Pr[r, G R] Pr[ri c R] \R\ R\Pr[r, e R}\ IRI + IRI (|R| Pr[rC e R1]) AR (f: f(ri) Ste stt t of t l oves the statement of the lemma. 4.1.2 So, What Can Go Wrong? (And a Simple Solution) This simple modification to the reservoir sampling algorithm will give us the desired biased sample as long as the probability IRI f(ri)/totalWeight never exceeds one. If this value does exceed one, then the correctness of the algorithm is not preserved. Unfortunately, we may very well see such meaningless probabilities, especially early on as the reservoir is records and any records contained in its stack in order to obtain desired size sample. The detailed algorithm is presented as Algorithm 8. It is clear that this algorithm obtains the desired batch sample by scanning exactly N records as against the entire scan of the reservoir sampling at the cost of few random disk seeks. Since the sampling process is analogous to the process of adding more samples to the file, it is just as efficient, requiring O(w x log BI /N) random disk head movements for each newly sampled record, as described in Lemma 2. 5.3.3 Batch Sampling Multiple Geometric Files A geometric file structure based batch sampling algorithm can be extended to allow efficient batch sampling from multiple geometric files in the same way that the insertion algorithm for new samples into the geometric file can be extended to allow insertions into multiple geometric files. The extension is fairly straightforward with additional first step where we determine the number of records to be sampled from each geometric file. Once this number is determined, we execute Algorithm 8 on each file in order to obtain the desired batch sample. 5.4 Online Sampling From a Geometric File 5.4.1 A Naive Algorithm One straightforward way of supporting online sampling from a geometric file is to imple- ment the iterative function GetNext as follows. For every call to GetNext, we simply generate a random number i between 1 and size of the file IRI, and then return a record at the ith position in the geometric file. Care must be taken to avoid choosing same record of R more than once in order to obtain a correct sample without replacement. For example, to sample N records from R, the numbers 0 through N 1 could be hashed or randomized using a bijective pseudo-random function onto the domain 0 through IRI 1, and the resulting N numbers used to generate the sample. To pick the next record to sample, we simply hash N. It is easy to see that a naive algorithm will give us a correct online sample of a geometric file. However, we will use one disk seek per call to GetNext. Since each random I/O requires Furthermore, this degeneration in performance could probably be reduced by using a smaller value for a'. As expected, the local overwrite option performs very well early on, especially in the first two experiments (see Section 3.3 for a discussion of why this is expected). Even with limited buffer memory in Experiment 3, it uniformly outperforms a single geometric file. Furthermore, with enough buffer memory in Experiments 1 and 2, the local overwrite option is competitive with the multiple geofiles option early on. However, fragmentation becomes a problem and performance decreases over time. Unless offline re-randomization of the file is possible periodically, this degradation probably precludes long-term use of the local overwrite option. It is interesting that as demonstrated by Experiment 3 (and explained in Section 3.8) a single geometric file is very sensitive to the ratio of the size of the reservoir to the amount of available memory for buffering new records from the stream. The geofile option performs well in Experiments 1 and 2 when this ratio is 100, but rather poorly in Experiment 3 when the ratio is 1000. Finally, we point out the general unusability of the scan and virtual memory options, scan generally outperformed virtual memory, but both generally did poorly. Except in experiment 1 with large memory and small record size, with these two options more than 97'. of the processing of records from the stream occurs in the first half hour as the reservoir fills. In the 19.5 hours or so after the reservoir first fills, only a tiny fraction of additional processing occurs due to the inefficiency of the two options. 7.2 Biased Reservoir Sampling In Section 4.1 we gave an upper bound for the distance between the actual bias function f' computed using our reservoir algorithm, and the desired, user-defined bias function f. While useful, this bound does not tell the entire story. In the end, what a user of a biased sampling algorithm is interested in is not how close the bias function that is actually computed is to the user-specified one, but instead the key question is what sort of effect any deviation has on the algorithm that: # of subsets of size N E R Pr[S C] # of such subsets in a data stream D (l| R| NJ N) Now, imagine that S E R. If we obtain a sample of size N from R using the reservoir algorithm, the probability that we choose precisely S is: Pr[S sampled from RS c R] 1/( I Thus we have: Pr[S sampled from R] = Pr[S sampled from R S E R] x Pr[S E R] 1 (|R (IRI) (l D) 1 This is precisely the probability we would expect if we sampled directly from the stream without replacement. O Unfortunately, though it is very simple, the naive algorithm will be inefficient for drawing a small sample from a large geometric file since it requires a full scan of the geometric file to obtain a true random sample for any value of N. Since the geometric file may be gigabytes in size, this can be problematic. 5.3.2 A Geometric File Structure-Based Algorithm We can do better if we make use of the structure of a geometric file itself. The intuitive outline of this approach is as follows. To obtain a batch sample of size N, we pre-calculate how many records from each on-disk subsample will be included in the batch sample, and then we read the appropriate number of records sequentially from the various segments of each subsample. The process of choosing the number of records to select from each subsample is simple random sample. In Chapter 4 we propose a single pass biased reservoir sampling algorithm. In Chapter 5 we develop techniques that can be used to sample geometric files to obtain a small size sample. In Chapter 6 we present secondary index structures for the geometric file. In Chapter 7 we discuss the benchmarking results. The dissertation is concluded in Chapter 8. Most of the work in the dissertation is either already published or is under review for publication. The material from Chapter 3 is from the paper with Christopher Jermaine and Subramanian Arumugam that was originally published in SIGMOD 2004 [36]. The work presented in Chapter 4 is submitted to TKDE and is under review [47]. The material in Chapter 5 is the part of journal paper accepted at VLDBJ [48]. The results in the Chapter 7 are taken from above three papers as well. For a given set of records, ZE I 1 f(rk) + | kRI f(rk) is a constant. Therefore, dist(rj) increases as E I1 f(rk) decreases. In other words, the dist(rj) is maximum for the smallest possible ZElj 1 f(rk). Thus, the totalDist(f, f') is largest when the reservoir is initially filled with the |R| records having the smallest possible weights. This proves the claim in the second proposition. The proof of the third proposition regarding reordering of records after r"mx: This is immediate since r" m is the highest weight record of the stream, no record after r"' can be an overweight record. From the above three propositions, we can conclude that the worst case for Algorithm 7 occurs when (1) the reservoir is initially filled with the |R| records having the smallest possible weights and, (2) we encounter the record r"' with the largest weight immediately thereafter. 4.2.2 The Proof of Theorem 1: The Upper Bound on totalDist To derive the upper bound, we start with the totalDist formula and give its value in the worst case totalDst(f, (r) f(rl) f(rlR -1) f(r lR-1) + SEtotalDist(f(, f') N E+ + k-1 f'(rk) k, 1f(rk) N I f'(rk) k-NI f(rk) f (rlRI) f(rRI) f'(rN) f(rN) ki, fl'(k) k- fE(rk) Y I f'(rk) Eki f(rk) We know that in the worst case r ma appears as the IR1th record in the stream, using the result of Lemma 5 we re-write the totalDist formula as Current techniques suitable for maintaining samples from a data stream are based on reservoir sampling [11, 38]. Reservoir sampling algorithms can be used to dynamically maintain a fixed-size sample of N records from a stream, so that at any given instant, the N records in the sample constitute a true random sample of all of the records that have been produced by the stream. However, as we will discuss in this dissertation, the problem is that existing reservoir techniques are suitable only when the sample is small enough to fit into main memory. Given that there are limited techniques for maintaining very large samples, the problem addressed in the first part of this dissertation is as follows: Given a main memory buffer B large enough to hold BI records, can we develop efficient nlgt,'rihii,\ for dynamically maintaining a massive random sample containing exactly N records from a data stream, where N > IBI ? Key design goals for the algorithms we develop are 1. The algorithms must be suitable for streaming data, or any similar environment where a large sample must be maintained on-line in a single pass through a data set, with the strict requirement that the sample always be a true, statistically random sample of fixed size N (without replacement) from all of the data produced by the stream thus far. 2. When maintaining the sample, the fraction of I/O time devoted to reads should be close to zero. Ideally, there would never be a need to read a block of samples from disk simply to add one new sample and subsequently write the block out again. 3. The fraction I/O of time spent performing random I/Os should also be close to zero. Costly random disk seeks should be few and far between. Almost all I/O should be sequential. 4. Finally, the amount of data written to disk should be bounded by the total size of all of the records that are ever sampled. The geometric file meets each of the requirements listed above. With memory large enough to buffer IBI > 1 records, the geometric file can be used to maintain an online sample of arbitrary size with an amortized cost of O(u x log B / B ) random disk head movements for each newly sampled record (see Section 3.12). The multiplier u can be made arbitrarily small by making use of additional disk space. A rigorous benchmark of the geometric file demonstrates its superiority over the obvious alternatives. around 10 milliseconds, the naive algorithm can only sample around 6, 000 records from the geometric file per minute per disk. This performance is unacceptable for most applications. 5.4.2 A Geometric File Structure-Based Algorithm As in the case of batch sampling algorithm, we can make use of the structure of a geometric file to efficiently support online sampling. Instead of selecting a random record of a geometric file, we randomly pick a subsample and choose its next available record as a return value of GetNext. This is analogous to the classic online sampling algorithm for sampling from a hashed file [26], where first a hash bucket is selected and then a record is chosen. Since the selection of a random record within a subsample is sequential, we may reduce the number of costly disk seeks if we read the subsample in its entirety, and buffer the subsample's records in memory. Using this basic methodology, we now describe how a call to the GetNext will be processed: We first randomly pick a subsample Si, with the probability of selecting i proportional to the size of ith subsample. Next, we look for buffered records of Si; if such records exist, we choose and return the first available record as the return value of GetNext. If no buffered records are found, we fetch and buffer a number of blocks of records from subsample Si; these records are then buffered. We return the first buffered record as the return value of GetNext. Since the records from each subsample are read and buffered in memory sequentially, we are guaranteed to choose each record of the reservoir at most once, giving us desired random sample without replacement. A proof of this is simple, and analogous to the proof of Lemma 3. However, thus far we have not considered a very important question: How many blocks of a subsample Si should we fetch at the time of buffer refill? In general there are two extremes that we may consider: Fetch many. If we fetch a large number of blocks at the time of the buffer refill, we reduce the overall time to sample N records for large N. This is due to the fact that by fetching many blocks using a sequential read, we amortize the seek time over a large number of blocks and at the same time we prepare ourselves for future calls to GetNext; once the records are fetched from disk, the response time for subsequent calls to GetNext is almost instantaneous (only in-memory computations are required). However, the drawback of this Setting this to zero, we have X = Nr/bs. Thus, we divide N/b blocks into Nr/bs chunks and read bs/r number of blocks from a subsample every time we refill the buffer. It turns out that when this solution is used, the number of blocks read at the time of buffer refill depends on the ratio of the seek time to the block scan time. Since this solution is independent of the planning horizon, we always read bs/r blocks irrespective of the number of records sampled so far. Algorithm 9 gives the detailed online sampling algorithm. 5.5 Sampling A Biased Sample We end this chapter by noting that if a geometric files sample is correctly biased, then batch and online sampling algorithms we have given will also produce a correctly biased sample with no modification, as described by the following lemma. Lemma 9. A simple, equal-probability random sample from a correctly biased geometric file will be correctly biased if the sample stored by the geometric file is correctly biased. Proof In biased sampling, the probability of record being accepted in a geometric file is I xf (r) where is the weight of the record under consideration and the totalWeight is the sum totalWeight' of weights of all records from the stream so far. Let the Sample be the biased sample of the geometric file, then we have Pr[i E Sample] = Pr[Selecting i from Si] x Pr[Selecting Si] x Pr[i E Si] 1 sl IRf(r) S, |R totalWeight f(r) totalWeight We examine the various algorithms for producing smaller samples from a large, disk-based geometric file in chapter 7 of this dissertation. in subsequent query processing. We then compute (in a single pass) a biased sample Ri of the i records produced by a data stream. Ri is fixed-size, and the probability of sampling the jth record from the stream is proportional to f(rj) for all j < i. This is a fairly simple and yet powerful definition of biased sampling, and is general enough to support many applications. The key contributions of this part of dissertation are as follows: 1. We present a modified version of the classic reservoir sampling algorithm that is ex- ceedingly simple, and is applicable for biased sampling using any arbitrary user-defined weighting function f. 2. In most cases, our algorithm is able to produce a correctly biased sample. However, given certain pathological data sets and data orderings, this may not be the case. Our algorithm adapts in this case and provides a correctly biased sample for a slightly modified bias function f'. We analytically bound how far f' can be from f in such a pathological case, and experimentally evaluate the practical significance of this difference. 3. We describe how to perform a biased reservoir sampling and maintain large biased samples with the geometric file. 4. Finally, we derive the correlation covariancee) between the Bernoulli random variables gov- erning the sampling of two records ri and rj using our algorithm. We use this covariance to derive the variance of a Horvitz-Thomson estimator making use of a sample computed using our algorithm. 1.3 Sampling The Sample A geometric file is a simple random sample (without replacement) from a data stream. In this part of the dissertation we develop techniques which allow a geometric file to itself be sampled in order to produce smaller sets of data objects that are themselves random samples (without replacement) from the original data stream. The goal of the algorithms described in this part is to efficiently support further sampling of a geometric file by making use of its own structure. Small samples frequently do not provide enough accuracy, especially in the case when the resulting statistical estimator has a very high variance. However, while in the general case a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. For example, consider the problem of estimating the average 2007 Abhijit A. Pol CHAPTER 1 INTRODUCTION Despite the variety of alternatives for approximate query processing [1, 21, 30, 34, 39], sampling is still one of the most powerful methods for building a one-pass synopsis of a data set, especially in a streaming environment where the assumption is that there is too much data to store all of it permanently. Sampling's many benefits include: Sampling is the most widely-studied and best understood approximation technique cur- rently available. Sampling has been studied for hundreds of years, and many fundamental results describe the utility of random samples (such as the Central Limit Theorem, Cher- noff, Hoeffding and Chebyshev bounds [16, 49]). Sampling is the most versatile approximation technique available. Most data processing algorithms can be used on a random sample of a data set rather than the original data with little or no modification. For example, almost any data mining algorithm for building a decision tree classifier can be run directly on a sample. Sampling is the most widely-used approximation technique. Sampling is common in data mining, statistics, and machine learning. The sheer number of recent papers from ICDE, VLDB, and SIGMOD [2, 3, 8, 14, 15, 28, 32, 33, 35, 46, 51, 52] that use samples testify to sampling's popularity as a data management tool. Given the obvious importance of random sampling, it is perhaps surprising that there has been very little work in the data management community on how to actually perform random sampling. The most well-known papers in this area are due to Olken and Rotem [25, 27], who also offer the definitive survey of related work through the early 1990s [26]. However, this work is relevant mostly for sampling from data stored in a database, and implicitly assumes that a "sample" is a small data structure that is easily stored in main memory. Such assumptions are sometimes overly restrictive. Consider the problem of approximate query processing. Recent work has suggested the possibility of maintaining a sample of a large database and then executing analytic queries over the sample rather than the original data as a way to speed up processing [4, 31]. Given the most recent TPC-H benchmark results [17], it is clear that processing standard report-style queries over a large, multi-terabyte data warehouse may take hours or days. In such a situation, maintaining a fully materialized random sample 5e+13 Biased sampling w/o skewed records 4.5e+13 -. Unbiased reservoir sampling Biased sampling worst case 4e+13 3.5e+13 (D S 3e+13 5 2.5e+13 0 2e+13 ---------------------- ---------------" ^---~: ----------^ 1.5e+13 1e+13 5e+12 0 --I I I 0 0.2 0.4 0.6 0.8 1 Correlation Factor Figure 7-3. Sum query estimation accuracy for zipf=0.2. particular estimation task that is to be performed. Perhaps the easiest way to detail the practical effect of a pathological data ordering is through experimentation. In this section we present the experimental results evaluating practical significance of a worst-case data ordering. Specifically, we design a set of experiments to compute the error (variance) one would expect when sampling for the answer to a SUM query in following there scenarios: 1. When a biased sample is computed using our reservoir algorithm with the data ordered so as to produce no overweight records. 2. When an unbiased sample is computed using the classical reservoir sampling algorithm. 3. When a biased sample computed using our reservoir algorithm, with records arranged so as to produce the bias function furthest from the user-specified one, as described by the Theorem 1. By examining the results, it should become clear exactly what sort of practical effect on the accuracy of an estimator one might expect due to a pathological ordering. 7.2.1 Experimental Setup In our experiments, we evaluated a SUM query over a set of synthetic data streams having various statistical properties. In each experiment, every record has two attributes: A and B. Furthermore, we know that Algorithm 7 accepts the first IR| records of the stream with probability 1. No weight adjustments are triggered for first IR| records irrespective of their weights. Therefore, the earliest position rmx can appear in the stream is right after the reservoir is filled. This proves the proposition. We now turn to proving Lemma 5, which was used in the previous proof. Lemma 5. If r' appears as the ith record of the stream, then Vj < i we have: '(r) > j) f') (rk)f) -(r-) and Vj > i we have: 77(r) < f(r- ) k l f(rk) k=lif (k) f(k)) Proof When we encounter r"m as the ith record of the stream, we increase the weights of rj Vj < i by a factor of C = 1)f and adjust E i f'(rk) R (ra) + + (r). =1 f(r k) We also know that Vj > i, f'(rj) = f(rj). Part 1: Vj < i we have f'(rj) f(rj) C x f(rj) f(rj) k1 fl(k) f (r) ) f (r) S () f( Tk) L = fl(Tk) k 1 Tk) Cx f(rj) f(rj) i-1 C X f (rk)Z ki f(rk) -I f(rk) Since C > 1, we have C x f(rj) f (rj) -i C x (rk) + Ek f(rk) -' lf(rk) We can therefore conclude that f'(rj) f(rj) > f(rj) f(rj) k I f '(rk) k-i f(rk) fE (rk) f(rk) >0 This proves the first part of the lemma. Part 2: Vj > i we have Algorithm 7 Biased Reservoir Sampling (Adjusting Weights of Existing Samples) 1: Set totalWeight = 0 2: for int i = 1 to oo do 3: Wait for a new record ri to appear in the stream 4: Set r ". ':,l,, = f(ri) 5: totalWeight = totalWeight + f(ri) 6: if i < R then 7: Add ri directly to R 8: else 9: if If(rI) < 1 then totalWeight 10: with probability lR/(r' do totalWeight 11: Remove a randomly selected record from R 12: Add ri to R 13: else 14: for each record j in R do 15: rj.weight = (I RI -)f(ri) x rj.weight totalWeight- f(ri) X igh 16: totalWeight = IRf(ri) 17: Remove a randomly selected record from R 18: Add ri to R Case (ii): If Iti (t > 1, we scale the true weight of every existing sample so as to have totalWeight I= f(rf). This is done by first setting C = IR 1)(rs) and then scaling up totalWeight- f (ri) f(rk) C x f(rk) Vk < i. As a result of this linear scaling, we have IRl x C x f(r) Pr [rj E Ri] totalWeight R x C x f (ri) E c x f (T) + f(ri) IRIf'(rk) l 1 fl(ni) An important factor to consider while determining the applicability of Algorithm 7 is the deviation of f' from f. That is: how far off from the correct weighting can we be, in the worst case? When stream has no overweight records, we expect f' to be exactly equal to f, but it may be very far away under certain circumstances. To address this, we define a distance metric in Definition 2 and evaluate the worse case distance between f' and f. analytically bound how far f' can be from f in such a pathological case, and experimentally eval- uate the practical significance of this difference. Finally, we derive the correlation covariancee) between the Bernoulli random variables governing the sampling of two records ri and rj using our algorithm. We use this covariance to derive the variance of a Horvitz-Thomson estimator making use of a sample computed using our algorithm. The rest of the chapter is organized as follows. We describes a single-pass biased sampling algorithm. We also define a distance metric to evaluate the worst case deviation from the user- defined weighting function f. Finally, we derive a simple estimator for a biased reservoir. The experiments performed to test our algorithms are presented in Chapter 7. 4.1 A Single-Pass Biased Sampling Algorithm We introduced the classical reservoir sampling algorithm that maintains an unbiased sample of a data stream in the previous chapter. We will extend this algorithm to give our biased reservoir sampling algorithm and prove various properties and pathological cases for the same. 4.1.1 Biased Reservoir Sampling It turns out that in most cases, one may produce a correctly biased sample by simply modifying the reservoir algorithm to maintain a current sum total Weight over all observed f(ri). Then, incoming records are added to the reservoir so that the probability of sampling record rj is f(rj)/totalWeight. This basic version of the algorithm is given as Algorithm 6. It is possible to prove that this modified algorithm results in a correctly biased sample, provided that the "probability" from line (8) of Algorithm 6 does not exceed one. Lemma 3. Let Ri be a state of the biased sample just after the ith record in the stream has been processed. Using the biased sampling described in Alg, rihil 6, we are guaranteed that for each Ri and for each record rj produced by the data stream such that j < i, we have Pr[rj EG i] I Rif(r3) E 1f/(rk) Proof We need to prove that when a new record ri appears in the stream, then for each record rj from the stream, Pr[rj E R1] = f( A new records produced by the stream is sampled El=I f(rIl) LIST OF FIGURES Figure page 3-1 Decay of a subsample after multiple buffer flushes. ..... . . 38 3-2 Basic structure of the geometric file. .................. ........ 39 3-3 Building a geometric file. .................. ............ 43 3-4 Distributing new records to existing subsamples. .................. ..44 3-5 Speeding up the processing of new samples using multiple geometric files. ...... 54 4-1 Adjustment of r'" to rrn . ........ . ...69 7-1 Results of benchmarking experiments (Processing insertions). . . ... 101 7-2 Results of benchmarking experiments (Sampling from a geometric file). . ... 102 7-3 Sum query estimation accuracy for zipf=0.2. ................... 104 7-4 Sum query estimation accuracy for zipf=0.5. ................... 105 7-5 Sum query estimation accuracy for zipf=0.8. ................... 106 7-6 Sum query estimation accuracy for zipf=l. ................... ....... 107 7-7 Disk footprint for 1KB record size .................. .......... 110 7-8 Disk footprint for 200B record size .................. ........ 112 records by their page number attribute and retrieve the actual records from the geometric file as a query result. In Chapter 7, we evaluate and compare the three index structures suggested in this chapter experimentally by measuring build time and disk footprint as new records are inserted into the geometric file. We also compare the efficiency of these structures for point and range queries. Efficiently searching and discovering information from the geometric file is essential for query processing. A natural way to support this functionality is to build an index structure. We discussed three secondary index structures and their maintenance as new records are inserted into a geometric file. The segment-based and the subsample-based index structures are designed around structure of the geometric file. The third index structure, the LSM-tree-based index structure makes use of LSM-tree structure, an efficient structure to handle bulk insertion and deletion. We compared these structures for build time, disk space used, and index look-up time. |

Full Text |

PAGE 1 1 PAGE 2 2 PAGE 3 3 PAGE 4 AttheendofmydissertationIwouldliketothankallthosepeoplewhomadethisdisserta-tionpossibleandanenjoyableexperienceforme.FirstofallIwishtoexpressmysinceregratitudetomyadviserChrisJermaineforhispatientguidance,encouragement,andexcellentadvicethroughoutthisstudy.IfIwouldhaveaccesstomagictoolcreate-your-own-adviser,IstillwouldnothaveendedupwithanyonebetterthanChris.Healwaysintroducesmetointerestingresearchproblems.HeisaroundwheneverIhaveaquestion,butatthesametimeencouragesmetothinkonmyownandworkonanyproblemsthatinterestme.IamalsoindebtedtoAlinDobraforhissupportandencouragement.Alinisaconstantsourceofenthusiasm.TheonlytopicIhavenotdiscussedwithhimisstrategiesofGatorfootballgames.IamgratefultomydissertationcommitteemembersTamerKahveci,JoachimHammer,andRavindraAhujafortheirsupportandtheirencouragement.IacknowledgetheDepartmentofIndustrialandSystemsEngineering,RavindraAhuja,andchairDonaldHearnforthenancialsupportandadviceIreceivedduringinitialyearsofmystudies.Finally,Iwouldliketoexpressmydeepestgratitudefortheconstantsupport,understanding,andlovethatIreceivedfrommyparentsduringthepastyears. 4 PAGE 5 page ACKNOWLEDGMENTS .................................... 4 LISTOFTABLES ....................................... 8 LISTOFFIGURES ....................................... 9 ABSTRACT ........................................... 10 CHAPTER 1INTRODUCTION .................................... 12 1.1TheGeometricFile ................................. 14 1.2BiasedReservoirSampling ............................. 16 1.3SamplingTheSample ................................ 18 1.4IndexStructuresForTheGeometricFile ...................... 19 2RELATEDWORK .................................... 22 2.1RelatedWorkonReservoirSampling ........................ 22 2.2BiasedSamplingRelatedWork ........................... 24 3THEGEOMETRICFILE ................................. 28 3.1ReservoirSampling ................................. 28 3.2Sampling:SometimesaLittleisnotEnough .................... 30 3.3ReservoirforVeryLargeSamples ......................... 31 3.4TheGeometricFile ................................. 34 3.5CharacterizingSubsampleDecay .......................... 36 3.6GeometricFileOrganization ............................ 40 3.7ReservoirSamplingWithaGeometricFile ..................... 40 3.7.1IntroducingtheRequiredRandomness ................... 41 3.7.2HandlingtheVariance ............................ 42 3.7.3BoundingtheVariance ........................... 45 3.8ChoosingParameterValues ............................. 47 3.8.1ChoosingaValueforAlpha ......................... 47 3.8.2ChoosingaValueforBeta .......................... 48 3.9WhyReservoirSamplingwithaGeometricFileisCorrect? ............ 49 3.9.1CorrectnessoftheReservoirSamplingAlgorithmwithaBuffer ...... 49 3.9.2CorrectnessoftheReservoirSamplingAlgorithmwithaGeometricFile 50 3.10MultipleGeometricFiles .............................. 51 3.11ReservoirSamplingwithMultipleGeometricFiles ................ 51 3.11.1ConsolidationAndMerging ......................... 53 3.11.2HowCanCorrectnessBeMaintained? ................... 53 3.11.3HandlingtheStacksinMultipleGeometricFiles .............. 56 5 PAGE 6 ................................. 56 4BIASEDRESERVOIRSAMPLING ........................... 58 4.1ASingle-PassBiasedSamplingAlgorithm ..................... 59 4.1.1BiasedReservoirSampling ......................... 59 4.1.2So,WhatCanGoWrong?(AndaSimpleSolution) ............ 60 4.1.3AdjustingWeightsofExistingSamples ................... 62 4.2WorstCaseAnalysisforBiasedReservoirSamplingAlgorithm .......... 65 4.2.1TheProoffortheWorstCase ........................ 66 4.2.2TheProofofTheorem 1 :TheUpperBoundontotalDist 73 4.3BiasedReservoirSamplingWithTheGeometricFile ............... 75 4.4EstimationUsingaBiasedReservoir ........................ 76 5SAMPLINGTHEGEOMETRICFILE ......................... 80 5.1WhyMightWeNeedToSampleFromaGeometricFile? ............. 80 5.2DifferentSamplingPlansfortheGeometricFile .................. 80 5.3BatchSamplingFromaGeometricFile ...................... 81 5.3.1ANaiveAlgorithm ............................. 81 5.3.2AGeometricFileStructure-BasedAlgorithm ............... 82 5.3.3BatchSamplingMultipleGeometricFiles ................. 84 5.4OnlineSamplingFromaGeometricFile ...................... 84 5.4.1ANaiveAlgorithm ............................. 84 5.4.2AGeometricFileStructure-BasedAlgorithm ............... 85 5.5SamplingABiasedSample ............................. 88 6INDEXSTRUCTURESFORTHEGEOMETRICFILE ................ 89 6.1WhyIndexaGeometricFile? ............................ 89 6.2DifferentIndexStructuresfortheGeometricFile ................. 90 6.3ASegment-BasedIndexStructure ......................... 91 6.3.1IndexConstructionDuringStart-up ..................... 91 6.3.2MaintainingIndexDuringNormalOperation ................ 92 6.3.3IndexLook-UpandSearch ......................... 93 6.4ASubsample-BasedIndexStructure ........................ 93 6.4.1IndexConstructionandMaintenance .................... 94 6.4.2IndexLook-Up ............................... 95 6.5ALSM-Tree-BasedIndexStructure ........................ 96 6.5.1AnLSM-TreeIndex ............................. 96 6.5.2IndexMaintenanceandLook-Ups ..................... 97 7BENCHMARKING .................................... 99 7.1ProcessingInsertions ................................ 99 7.1.1ExperimentsPerformed ........................... 99 7.1.2DiscussionofExperimentalResults ..................... 100 6 PAGE 7 ............................. 103 7.2.1ExperimentalSetup ............................. 104 7.2.2Discussion .................................. 106 7.3SamplingFromaGeometricFile .......................... 107 7.3.1ExperimentsPerformed ........................... 108 7.3.2DiscussionofExperimentalResults ..................... 109 7.4IndexStructuresForTheGeometricFile ...................... 110 7.4.1ExperimentsPerformed ........................... 110 7.4.2Discussion .................................. 112 8CONCLUSION ...................................... 116 REFERENCES ......................................... 118 BIOGRAPHICALSKETCH .................................. 122 7 PAGE 8 Table page 1-1Population:studentrecords ................................ 17 1-2Randomsampleofthesize=4 ............................... 17 1-3Biasedsampleofthesize=4 ................................ 17 7-1Millionsofrecordsinsertedin10hrs ........................... 110 7-2Querytimingresultsfor1krecord,jRj=10million,andjBj=50k 113 7-3Querytimingresultsfor200bytesrecord,jRj=50million,andjBj=250k 114 8 PAGE 9 Figure page 3-1Decayofasubsampleaftermultiplebufferushes. ................... 38 3-2Basicstructureofthegeometricle. ........................... 39 3-3Buildingageometricle. ................................. 43 3-4Distributingnewrecordstoexistingsubsamples. ..................... 44 3-5Speedinguptheprocessingofnewsamplesusingmultiplegeometricles. ....... 54 4-1Adjustmentofrmaxitormaxi1 69 7-1Resultsofbenchmarkingexperiments(Processinginsertions). ............. 101 7-2Resultsofbenchmarkingexperiments(Samplingfromageometricle). ........ 102 7-3Sumqueryestimationaccuracyforzipf=0.2. ....................... 104 7-4Sumqueryestimationaccuracyforzipf=0.5. ....................... 105 7-5Sumqueryestimationaccuracyforzipf=0.8. ....................... 106 7-6Sumqueryestimationaccuracyforzipf=1. ........................ 107 7-7Diskfootprintfor1KBrecordsize ............................ 110 7-8Diskfootprintfor200Brecordsize ............................ 112 9 PAGE 10 Samplingisoneofthemostfundamentaldatamanagementtoolsavailable.Itisoneofthemostpowerfulmethodsforbuildingaone-passsynopsisofadataset,especiallyinastreamingenvironmentwheretheassumptionisthatthereistoomuchdatatostoreallofitpermanently.However,mostcurrentresearchinvolvingsamplingconsiderstheproblemofhowtouseasample,andnothowtocomputeone.Theimplicitassumptionisthatasampleisasmalldatastructurethatiseasilymaintainedasnewdataareencountered,eventhoughsimplestatisticalargumentsdemonstratethatverylargesamplesofgigabytesorterabytesinsizecanbenecessarytoprovidehighaccuracy.Noexistingworktacklestheproblemofmaintainingverylarge,disk-basedsamplesinanonlinemannerfromstreamingdata. Wepresentanewdataorganizationcalledthegeometricleandonlinealgorithmsformain-tainingaverylarge,on-disksamples.Thealgorithmsaredesignedforanyenvironmentwherealargesamplemustbemaintainedonlineinasinglepassthroughadataset.Thegeometricleorganizationmeetsthestrictrequirementthatthesamplealwaysbeatrue,statisticallyrandomsample(withoutreplacement)ofallofthedataprocessedthusfar. Wemodifytheclassicreservoirsamplingalgorithmtocomputeaxed-sizesampleinasinglepassoveradataset,wherethegoalistobiasthesampleusinganarbitrary,user-denedweightingfunction.Wealsodescribehowthegeometriclecanbeusedtoperformabiasedreservoirsampling. 10 PAGE 11 Efcientlysearchinganddiscoveringinformationfromthegeometricleisessentialforqueryprocessing.Anaturalwaytosupportthisistobuildanindexstructure.Wediscussthreesecondaryindexstructuresandtheirmaintenanceasnewrecordsareinsertedtoageometricle. 11 PAGE 12 Despitethevarietyofalternativesforapproximatequeryprocessing[ 1 21 30 34 39 ],samplingisstilloneofthemostpowerfulmethodsforbuildingaone-passsynopsisofadataset,especiallyinastreamingenvironmentwheretheassumptionisthatthereistoomuchdatatostoreallofitpermanently.Sampling'smanybenetsinclude: 16 49 ]). 2 3 8 14 15 28 32 33 35 46 51 52 ]thatusesamplestestifytosampling'spopularityasadatamanagementtool. Giventheobviousimportanceofrandomsampling,itisperhapssurprisingthattherehasbeenverylittleworkinthedatamanagementcommunityonhowtoactuallyperformrandomsampling.Themostwell-knownpapersinthisareaareduetoOlkenandRotem[ 25 27 ],whoalsoofferthedenitivesurveyofrelatedworkthroughtheearly1990s[ 26 ].However,thisworkisrelevantmostlyforsamplingfromdatastoredinadatabase,andimplicitlyassumesthatasampleisasmalldatastructurethatiseasilystoredinmainmemory. Suchassumptionsaresometimesoverlyrestrictive.Considertheproblemofapproximatequeryprocessing.Recentworkhassuggestedthepossibilityofmaintainingasampleofalargedatabaseandthenexecutinganalyticqueriesoverthesampleratherthantheoriginaldataasawaytospeedupprocessing[ 4 31 ].GiventhemostrecentTPC-Hbenchmarkresults[ 17 ],itisclearthatprocessingstandardreport-stylequeriesoveralarge,multi-terabytedatawarehousemaytakehoursordays.Insuchasituation,maintainingafullymaterializedrandomsample 12 PAGE 13 43 ])maybedesirable.Inordertosavetimeand/orcomputerresources,queriescanthenbeevaluatedoverthesampleratherthantheoriginaldata,aslongastheusercantoleratesomecarefullycontrolledinaccuracyinthequeryresults. Thisparticularapplicationhastwospecicrequirementsthatareaddressedbythedis-sertation.First,itmaybenecessarytousequitealargesampleinordertoachieveacceptableaccuracy;perhapsontheorderofgigabytesinsize.Thisisespeciallytrueifthesamplewillbeusedtoanswerselectivequeriesoraggregatesoverattributeswithhighvariance(seeSec-tion 3.2 ).Second,whatevertherequiredsamplesize,itisoftenindependentofthesizeofthedatabase,sinceestimationaccuracydependsprimarilyonsamplesize Foranotherexampleofacasewhereexistingsamplingmethodscanfallshort,considerstream-baseddatamanagementtasks,suchasnetworkmonitoring(foranexampleofsuchanapplication,wepointtotheGigascopeprojectfromAT&TLaboratories[ 18 20 ]).Giventhetremendousamountofdatatransportedovertoday'scomputernetworks,theonlyconceivablewaytofacilitatead-hoc,after-the-factqueryprocessingoverthesetofpacketsthathavepassedthroughanetworkrouteristobuildsomesortofstatisticalmodelforthosepackets.Themostobviouschoicewouldbetoproduceaverylarge,statisticallyrandomsampleofthepacketsthathavepassedthroughtherouter.Again,maintainingsuchasampleispreciselytheproblemwetackleinthisdissertation.Whileotherresearchershavetackledtheproblemofmaintainingan 16 ]forathoroughtreatmentofnitepopulationrandomsampling). 13 PAGE 14 7 ],noexistingmethodshaveconsideredhowtohandleverylargesamplesthatexceedtheavailablemainmemory. Inthisdissertationwedescribeanewdataorganizationcalledthegeometricleandrelatedonlinealgorithmsformaintainingaverylarge,disk-basedsamplefromadatastream.Thedissertationisdividedintofourparts.Intherstpartwedescribethegeometricleorganizationanddetailhowgeometriclescanbeusedtomaintainaverylargesimplerandomsample.Inthesecondpartweproposeasimplemodicationtotheclassicalreservoirsamplingalgorithmtocomputeabiasedsampleinasinglepassoverthedatastreamanddescribehowthegeometriclecanbeusedtomaintainaverylargebiasedsample.Inthethirdpartwedeveloptechniqueswhichallowageometricletoitselfbesampledinordertoproducesmallersetsofdataobjects.Finally,inthefourthpart,wediscusssecondaryindexstructuresforthegeometricle.Indexstructuresareusefultospeedupsearchanddiscoveryofrequiredinformationfromahugesamplestoredinageometricle.Theindexstructuresmustbemaintainedconcurrentlywithconstantupdatestothegeometricleandatthesametimeprovideefcientaccesstoitsrecords. Wenowgiveanintroductiontothesefourpartsofthedissertationinsubsequentsections. 14 PAGE 15 11 38 ].Reservoirsamplingalgorithmscanbeusedtodynamicallymaintainaxed-sizesampleofNrecordsfromastream,sothatatanygiveninstant,theNrecordsinthesampleconstituteatruerandomsampleofalloftherecordsthathavebeenproducedbythestream.However,aswewilldiscussinthisdissertation,theproblemisthatexistingreservoirtechniquesaresuitableonlywhenthesampleissmallenoughtotintomainmemory. Giventhattherearelimitedtechniquesformaintainingverylargesamples,theproblemaddressedintherstpartofthisdissertationisasfollows: 1. Thealgorithmsmustbesuitableforstreamingdata,oranysimilarenvironmentwherealargesamplemustbemaintainedon-lineinasinglepassthroughadataset,withthestrictrequirementthatthesamplealwaysbeatrue,statisticallyrandomsampleofxedsizeN(withoutreplacement)fromallofthedataproducedbythestreamthusfar. 2. Whenmaintainingthesample,thefractionofI/Otimedevotedtoreadsshouldbeclosetozero.Ideally,therewouldneverbeaneedtoreadablockofsamplesfromdisksimplytoaddonenewsampleandsubsequentlywritetheblockoutagain. 3. ThefractionI/OoftimespentperformingrandomI/Osshouldalsobeclosetozero.Costlyrandomdiskseeksshouldbefewandfarbetween.AlmostallI/Oshouldbesequential. 4. Finally,theamountofdatawrittentodiskshouldbeboundedbythetotalsizeofalloftherecordsthatareeversampled. Thegeometriclemeetseachoftherequirementslistedabove.WithmemorylargeenoughtobufferjBj>1records,thegeometriclecanbeusedtomaintainanonlinesampleofarbitrarysizewithanamortizedcostofO(!logjBj=jBj)randomdiskheadmovementsforeachnewlysampledrecord(seeSection 3.12 ).Themultiplier!canbemadearbitrarilysmallbymakinguseofadditionaldiskspace.Arigorousbenchmarkofthegeometricledemonstratesitssuperiorityovertheobviousalternatives. 15 PAGE 16 Theneedforbiasedsamplingcaneasilybeillustratedwithanexamplepopulation,giveninTable 1.2 .Thisparticulardatasetcontainsrecordsdescribinggraduatestudentsalariesinauniversityacademicdepartment,andourgoalistoguessthetotalgraduatestudentsalary.Imaginethatasimplerandomsampleofthedatasetisdrawn,asshownintheTable 1-2 .Thefoursampledrecordsarethenusedtoguessthatthetotalstudentsalaryis(520+700+580+600)12=4=$7200,whichisconsiderablylessthanthetruetotalof$9545.Theproblemisthatwehappenedtomissmostofthehigh-salarystudentswhoaregenerallymoreimportantwhencomputingtheoveralltotal. Now,imaginethatweweighteachrecord,sothattheprobabilityofincludinganygivenrecordwithasalary700orgreaterinthesampleis(2)(4=12),andtheprobabilityofincludingagivenrecordwithasalarylessthan700is(1=2)(4=12).Thus,oursamplewilltendtoincludethoserecordswithhighervalues,thataremoreimportanttotheoverallsum.TheresultingbiasedsampleisdepictedinTable 1-3 .ThestandardHorvitz-Thompsonestimator[ 50 ]isthenappliedtothesample(whereeachrecordisweightedaccordingtotheinverseofitssamplingprobability),whichgivesusanestimateof(1200+1500+750)(12=8)+(580)(24=4)=$8655.Thisisobviouslyabetterestimatethan$7200,andthefactthatitisbetterthentheoriginalestimateisnotjustaccidental:ifonechoosestheweightscarefully,itiseasilypossibletoproduceasamplewhoseassociatedestimatorhaslowervariance(andhencehigheraccuracy)thanthesimple,uniform-probabilitysample.Forinstance,thevarianceoftheestimatorinthestudentsalaryexampleis2:533106undertheuniform-probabilitysamplinganditis5:083105underthebiasedsamplingscheme. 16 PAGE 17 Population:studentrecords Rec# NameClassSalary($/month) 1 JamesJunior12002 TomFreshman5203 SandraJunior12504 JimSenior15005 AshleySophomore7006 JenniferFreshman5307 RobertSophomore7508 FrankFreshman5809 RachelFreshman60510 TimFreshman55011 MariaSophomore76012 MonicaFreshman600 TotalSalary:9545.00 Table1-2. Randomsampleofthesize=4 Rec# NameClassSalary($/month) 2 TomFreshman5205 AshleySophomore7008 FrankFreshman58012 MonicaFreshman600 Othercaseswhereabiasedsampleispreferableabound.Forexample,ifthegoalistomonitorthepacketsowingthroughanetwork,onemaychoosetoweightmorerecentpacketsmoreheavily,sincetheywouldtendtoguremoreprominentlyinmostqueryworkloads. Weproposeasimplemodicationstotheclassicreservoirsamplingalgorithm[ 11 38 ]inordertoderiveaverysimplealgorithmthatpermitsthesortofxed-size,biasedsamplinggivenintheexample.Ourmethodassumestheexistenceofanarbitrary,user-denedweightingfunctionfwhichtakesasanargumentarecordri,wheref(ri)>0describestherecord'sutility Table1-3. Biasedsampleofthesize=4 Rec# NameClassSalary($/month) 1 JamesJunior12004 JimSenior15007 RobertSophomore75011 MariaSophomore760 17 PAGE 18 Thekeycontributionsofthispartofdissertationareasfollows: 1. Wepresentamodiedversionoftheclassicreservoirsamplingalgorithmthatisex-ceedinglysimple,andisapplicableforbiasedsamplingusinganyarbitraryuser-denedweightingfunctionf. 2. Inmostcases,ouralgorithmisabletoproduceacorrectlybiasedsample.However,givencertainpathologicaldatasetsanddataorderings,thismaynotbethecase.Ouralgorithmadaptsinthiscaseandprovidesacorrectlybiasedsampleforaslightlymodiedbiasfunctionf0.Weanalyticallyboundhowfarf0canbefromfinsuchapathologicalcase,andexperimentallyevaluatethepracticalsignicanceofthisdifference. 3. Wedescribehowtoperformabiasedreservoirsamplingandmaintainlargebiasedsampleswiththegeometricle. 4. Finally,wederivethecorrelation(covariance)betweentheBernoullirandomvariablesgov-erningthesamplingoftworecordsriandrjusingouralgorithm.WeusethiscovariancetoderivethevarianceofaHorvitz-Thomsonestimatormakinguseofasamplecomputedusingouralgorithm. Smallsamplesfrequentlydonotprovideenoughaccuracy,especiallyinthecasewhentheresultingstatisticalestimatorhasaveryhighvariance.However,whileinthegeneralcaseaverylargesamplecanberequiredtoansweradifcultquery,ahugesamplemayoftencontaintoomuchinformation.Forexample,considertheproblemofestimatingtheaverage 18 PAGE 19 Sincethereisnosinglesamplesizethatisoptimalforansweringallqueriesandtherequiredsamplesizecanvarydramaticallyfromquerytoquery,thispartofdissertationconsiderstheproblemofgeneratingasampleofsizeNfromadatastreamusinganexistinggeometriclethatcontainsalargesampleofrecordsfromthestream,whereNR.Wewillconsidertwospecicproblems.First,weconsiderthecasewhereNisknownbeforehand.Wewillrefertoasampleretrievedinthismannerasabatchsample.WewillalsoconsiderthecasewhereNisnotknownbeforehand,andwewanttoimplementaniterativefunctionGetNext.EachcalltoGetNextresultsinanadditionalsampledrecordbeingreturnedtothecaller,andsoNconsecutivecallstoGetNextresultsinasampleofsizeN.Wewillreferasampleretrievedinthismannerasanonlineorsequentialsample. Ingeneralanindexisadatastructurethatletsusndarecordwithouthavingtolookatmorethanasmallfractionofallpossiblerecords.Anindexisreferredtoasprimaryindexifit 19 PAGE 20 Withthesegoalsinmind,wediscussthreesecondaryindexstructuresforthegeometricle:(1)asegment-basedindex,(2)asubsample-basedindex,and(3)aLog-StructuredMerge-Tree-(LSM-)basedindex.Thersttwoindexesaredevelopedaroundthestructureofthegeometricle.MultipleB+-treeindexesaremaintainedforeachsegmentorsubsampleinageometricle.Asnewrecordsareaddedtotheleinunitsofasegmentorsubsample,anewB+-treeindexingnewrecordsiscreatedandaddedtotheindexstructure.Also,anexistingB+-treeisdeletedfromthestructurewhenalltherecordsindexedbyitaredeletedfromthele.ThethirdindexstructuremakesuseoftheLSM-treeindex[ 44 ]-adisked-baseddatastructuredesignedtoprovidelow-costindexinginanenvironmentwithahighrateofinsertsanddeletes.Weevaluateandcomparethesethreeindexstructuresexperimentallybymeasuringbuildtimeanddiskfootprintasnewrecordsareinsertedinthegeometricle.Wealsocompareefciencyofthesestructuresforpointandrangequeries. 2 .InChapter 3 wepresentthegeometricleorganizationandshowhowthisstructurecanbeusedtomaintainaverylarge 20 PAGE 21 4 weproposeasinglepassbiasedreservoirsamplingalgorithm.InChapter 5 wedeveloptechniquesthatcanbeusedtosamplegeometriclestoobtainasmallsizesample.InChapter 6 wepresentsecondaryindexstructuresforthegeometricle.InChapter 7 wediscussthebenchmarkingresults.ThedissertationisconcludedinChapter 8 Mostoftheworkinthedissertationiseitheralreadypublishedorisunderreviewforpublication.ThematerialfromChapter 3 isfromthepaperwithChristopherJermaineandSubramanianArumugamthatwasoriginallypublishedinSIGMOD2004[ 36 ].TheworkpresentedinChapter 4 issubmittedtoTKDEandisunderreview[ 47 ].ThematerialinChapter 5 isthepartofjournalpaperacceptedatVLDBJ[ 48 ].TheresultsintheChapter 7 aretakenfromabovethreepapersaswell. 21 PAGE 22 Inthischapter,werstreviewtheliteratureonreservoirsamplingalgorithms.Wethenpresentthesummaryofexistingworkonbiasedsampling. 2 3 8 14 15 28 32 33 35 51 52 ].However,themostpreviouspapers(includingtheaforementionedreferences)areconcernedwithhowtouseasample,andnotwithhowtoactuallystoreormaintainone.Mostofthesealgorithmscouldbeviewedaspotentialusersofalargesamplemaintainedasageometricle. Asmentionedintheintroductionchapter,aseriesofpapersbyOlkenandRotem(includingtwopaperslistedintheReferencessection[ 25 27 ])probablyconstitutethemostwell-knownbodyofresearchdetailinghowtoactuallycomputesamplesinadatabaseenvironment.OlkenandRotemgiveanexcellentsurveyofworkinthisarea[ 26 ].However,mostofthisworkisverydifferentthanours,inthatitisconcernedprimarilywithsamplingfromanexistingdatabasele,whereitisassumedthatthedatatobesampledfromareallpresentondiskandindexedbythedatabase.Singlepasssamplingisgenerallynotthegoal,andwhenitis,managementofthesampleitselfasadisk-basedobjectisnotconsidered. Thealgorithmsinthisdissertationarebasedonreservoirsampling,whichwasrstde-velopedinthe1960s[ 11 38 ].Inhiswell-knownpaper[ 53 ],Vitterextendsthisearlyworkbydescribinghowtodecreasethenumberofrandomnumbersrequiredtoperformthesampling.Vitter'stechniquescouldbeusedinconjunctionwithourown,butthefocusofexistingworkonreservoirsamplingisagainquitedifferentfromours;managementofthesampleitselfisnotconsidered,andthesampleisimplicitlyassumedtobesmallandin-memory.However,ifwere-movetherequirementthatoursampleofsizeNbemaintainedon-linesothatitisalwaysavalidsnapshotofthestreamandmustevolveovertime,thensequentialsamplingtechniquesrelatedtoreservoirsamplingthatcouldbeusedtobuild(butnotmaintain)alarge,on-disksample(seeVitter[ 54 ],forexample). 22 PAGE 23 44 ],Buffer-Tree[ 6 ],andY-Tree[ 12 ].ThesepapersconsiderproblemofprovidingI/OefcientindexingforadatabaseexperiencingaveryhighrecordinsertionratewhichisimpossibletohandleusingatraditionalB+-Treeindexingstructure.Ingeneralthesemethodsbufferalargesetofinsertionsandthenscantheentirebaserelation,whichistypicallyorganizedasaB+-Tree,atonceaddingnewdatatothestructure. Anyoftheabovemethodscouldtriviallybeusedtomaintainalargerandomsampleofadatastream.Everytimeasamplingalgorithmprobabilisticallyselectsarecordforinsertion,itmustoverwrite,atrandom,anexistingrecordofthereservoir.Onceanevicteeisdetermined,wecanattachitslocationasapositionidentier(anumberbetween1andR)withanewsamplerecord.Thispositioneldisthenusedtoinsertthenewrecordintotheseindexstructures.Whileperformingtheefcientbatchinserts,ifanindexstructurediscoversthatarecordwiththesamepositionidentierexists,itsimplyoverwritestheoldrecordwiththenewerone. However,noneofthesemethodscancomeclosetotherawwritespeedofthedisk,asthegeometriclecan[ 13 ].Inasense,theissueisthatwhiletheindexingprovidedbythesestructurescouldbeusedtoimplementefcient,disk-basedreservoirsampling,itistooheavy-dutyasolution.WewouldenduppayingtoomuchintermsofdiskI/Otosendanewrecordtooverwriteaspecic,existingrecordchosenatthetimethenewrecordisinserted,whenallonereallyneedsistohaveanewrecordoverwriteanyrandom,existingrecord. Therehasbeenmuchrecentinterestinapproximatequeryprocessingoverdatastreams(averysmallsubsetofthesepapersislistedintheReferencessection[ 1 21 34 ]);evensomeworkonsamplingfromadatastream[ 7 ].Thisworkisverydifferentfromourown,inthatmostexistingapproximationtechniquestrytooperateinverysmallspace.Instead,ourfocusisonmakinguseoftoday'sverylargeandveryinexpensivesecondarystoragetophysicallystorethelargestsnapshotpossibleofthestream. Finally,wementiontheU.C.BerkeleyCONTROLproject[ 37 ](whichresultedinthedevelopmentofonlineaggregation[ 33 ]andripplejoins[ 32 ]).Thisworkdoesaddressissues 23 PAGE 24 11 38 ].Recently,Gemullaetal.[ 29 ]extendedthereservoirsamplingalgorithmtohandledeletions.Intheiralgorithmcalledrandompairing(RP)everydeletionfromthedatasetiseventuallycompensatedbyasubsequentinsertion.TheRPAlgorithmkeepstracksofuncompensateddeletionsandusesthisinformationwhileperformingtheinserts.TheAlgorithmguardstheboundonthesamplesizeandatthesametimeutilizesthesamplespaceeffectivelytoprovidesastablesample.AnotherextensiontotheclassicreservoirsamplingalgorithmhasbeenrecentlyproposedbyBrownandHaasforwarehousingofsampledata[ 10 ].Theyproposehybridreservoirsamplingforindependentandparalleluniformrandomsamplingofmultiplestreams.Thesealgorithmscanbeusedtomaintainawarehouseofsampleddatathatshadowsthefull-scaledatawarehouse.Theyhavealsoprovidedmethodsformergingsamplesfordifferentstreamstocreateauniformrandomsample. Theproblemoftemporalbiasedsamplinginastreamenvironmenthasbeenconsidered.Babcocketal.[ 7 ]presentedtheslidingwindowapproachwithrestrictedhorizonofthesampletobiasedthesampletowardstherecentstreamingrecords.However,thissolutionhasapotentialtocompletelylosetheentirehistoryofpaststreamdatathatisnotapartofslidingwindow.TheworkdonebyAggarwal[ 5 ]addressesthislimitationandpresentsabiasedsamplingmethodsothatwecanhavetemporalbiasforrecentrecordsaswellaswekeeprepresentationfromstreamhistory.Thisworkexploitssomeinterestingpropertiesoftheclassofmemory-lessbiasfunctionstopresentasingle-passbiasedsamplingalgorithmforthesetypeofbiasedfunctions.However, 24 PAGE 25 Anotherpieceofworkonsingle-passsamplingwithanonuniformdistributionisduetoKolonkoandWasch[ 40 ].Theypresentasingle-passalgorithmtosampleadatastreamofunknownsize(thatis,notknowbeforehand)toobtainasampleofarbitrarysizensuchthattheprobabilityofselectingadataitemiisdependontheindividualitem.Theweightortnessoftheitemthatisusedforitsprobabilisticselectionisderivedusingexponentiallydistributedauxiliaryvalueswiththeparameteroftheexponentialdistributionandthelargestauxiliaryvaluedeterminesthesample.Liketemporalbiasedsamplingmethoddiscussedabove,thisalgorithmcannotbedirectlyadaptedforarbitraryuser-denedbiasedfunctions. Surprisingly,abovethreepapersaretheonlypiecesofworkthatareknowtoauthorsonhowtoperformasingle-passbiasedsamplingoverlargedatasetsorstreamingdata. Theanotherbodyofrelatedworkisthepapersfromnetworkusagearea[ 22 24 41 ].Thesepaperspresenttechniquesforestimatingthetotalnetworktrafc(orusage)basedonthesampleofaowrecordsproducedbyrouters.Sincetheseowstypicallyhaveheavy-taileddistributions,thetechniquespresentedinthesepapersmakeuseofsize-dependentsamplingscheme.Ingeneral,suchschemesworkbysamplingalltherecordswhosetrafcisabovecertainthresholdandsamplingtherestwithprobabilityproportionaltotheirtrafc.Although,suchtechniquesintroducesamplingbiaswheresizecanbethoughtastheweightofarecord,therearekeydifferencesbetweensuchtechniquesandthealgorithmpresentedinthisdissertation.Thegoalofouralgorithmistoobtainaxedsizebiasedsamplethatcomplywiththearbitraryuser-denedbiasedfunction.Thegoalofthesize-dependentsamplingschemeistoobtainasamplethatwillprovidethebestaccuracyforestimatingthetotalnetworktrafcthatfollowsaspecicdistribution.Thesamplegatheredbytheseschemesisnotnecessarilyaxedsizebiasedsample.Itonlyguaranteesthattheexpectedsamplesizeisnolargerthantheexpectedsample 25 PAGE 26 Theproblemofimplementingxedsizesamplingdesignwithdesiredandunequalinclusionprobabilitieshasbeenstudiedinstatistics.ThemonogramTheoryofSampleSurveys[ 50 ]discussesseveralmethodsforsuchasamplingtechnique,whichisofsomepracticalimportanceinsurveysampling.Thismonogrambeginsbydiscussingtwodesignswhichmimicsimplerandomsamplingwithoutreplacementwithselectionprobabilitiesforagivendrawthatarenotthesameforalltheunits.Werstsummarizethesetechniques. 26 PAGE 27 withxedsizeistoselectunitsforreplacement,andthentorejectthesampleifthereareduplicates.Wediscussonesuchmethodhere,calledSampford'sMethod. 27 PAGE 28 Inthischapterwegiveanintroductiontothebasicreservoirsamplingalgorithmthatwasproposedtoobtainanonlinerandomsampleofadatastream.Thealgorithmassumesthatthesamplemaintainedissmallenoughtotinmainmemoryinitsentirety.Wediscussandmotivatewhyverylargesamplesizescanbemandatoryincommonsituations.Wedescribethreealternativesformaintainingverylarge,disk-basedsamplesinastreamingenvironment.Wethenintroducethegeometricleorganizationandpresentalgorithmsforreservoirsamplingwiththegeometricle.Wealsodescribehowmultiplegeometriclescanbemaintainedall-at-oncetoachieveconsiderablespeedup. 11 38 ].TomaintainareservoirsampleRoftargetsizejRj,thefollowingloopisused: 28 PAGE 29 53 ].Afteracertainnumberofrecordshavebeenseen,thealgorithmwakesupandcapturethenextrecordfromthestream. 1 maintainsthisinvariantinsteps(2-6)asfollows[ 11 38 ].Theithrecordprocessed(i>jRj),itisaddedtothereservoirwithprobabilityjRj=ibystep4.Weneedtoshowthatforallotherrecordsprocessedthusfar,theinclusionprobabilityisjRj=i.Letrkbeanyrecordinthereservoirs.t.k6=i.LetRidenotethestateofthereservoirjustafteradditionoftheithrecord.Thus,weareinterestedinthePr[rk2Ri] i11 i=R i1R i1 i=R i 16 ].ToselectasampleofjRjunits,systematicsamplingtakesaunitatrandomfromtherstkunitsandeverykthunitthereafter.Althoughtheinclusionprobabilityinsystematicsamplingisthesameasinsimplerandomsampling,thepropertiesofasamplesuchasvariancecanbefardifferent.Itisknownthatthevarianceofthesystematicsamplingcanbebetterorworsecomparedtoasimplerandomsamplingdependingondataheterogeneityandcorrelationcoefcientbetweenpairsofsampledunits. 29 PAGE 30 Theproofthatreservoirsamplingmaintainsthecorrectinclusionprobabilityforanysetofinterestisactuallyverysimilartotheunivariateinclusionprobabilitycorrectnessdiscussedabove.WeknowthattheunivariateinclusionprobabilityPr[rk2Ri]=R=i.ForanyarbitraryvalueofjSjjRj,assumethatwehavethecorrectprobabilitieswhenwehaveseeni1inputrecords,i.e.Pr[S2Ri1]=jRjjSj=i1jSj.Whentheithrecordisprocessed(i>jRj),wehave i1S R+1R i=jRjjSj i1jSjR iS i+1R i=jRjjSj ijSjiS iiS i=jRjjSj ijSj 16 ]).Thevalueisknownasthecondenceoftheestimate. Verylargesamplesareoftenrequiredtoprovideaccurateestimateswithsuitablyhighcondence.TheneedforverylargesamplescanbeeasilyexplainedinthecontextoftheCentralLimitTheorem(CLT)[ 27 ].TheCLTimpliesthatifweusearandomsampleofsizeNtoestimatethemeanofasetofnumbers,theerrorofourestimateisusuallynormally 30 PAGE 31 1. Theerrorisinverselyproportionaltothesquarerootofthesamplesize. 2. Theerrorisdirectlyproportionaltothestandarddeviationofthesetoverwhichweareestimatingthemeanover. Thesignicanceofthisobservationisthatthesamplesizerequiredtoproduceanaccurateestimatecanvarytremendouslyinpractice,andgrowsquadraticallywithincreasingstandarddeviation.Forexample,saythatweusearandomsampleof100studentsatauniversitytoestimatetheaveragestudentsage.Imaginethattheaverageageis20withastandarddeviationof2years.AccordingtotheCLT,oursample-basedestimatewillbeaccuratetowithin2.5%withcondenceofaround98%,givingusanaccurateguessastothecorrectanswerwithonly100sampledstudents. Now,considerasecondscenario.WewanttouseasecondrandomsampletoestimatetheaveragenetworthofhouseholdsintheUnitedStates,whichisaround$140,000,withastandarddeviationofatleast$5,000,000.Becausethestandarddeviationissolarge,aquickcalculationshowswewillneedmorethan12millionsamplestoachievethesamestatisticalguaranteesasintherstcase. Requiredsamplesizescanbefarlargerwhenstandarddatabaseoperationslikerelationalselectionandjoinareconsidered,becausetheseoperationscaneffectivelymagnifythevarianceofourestimate.Forexample,theworkonripplejoins[ 32 ]providesanexcellentexampleofhowvariancecanbemagniedbysamplingovertherelationaljoinoperator. 31 PAGE 32 2 .Count(B)referstothecurrentnumberofrecordsinB.NotethatsincetherecordscontainedinBlogicallyrepresentrecordsinthereservoirthathavenotyetbeenaddedtodisk,anewly-sampledrecordcaneitherbeassignedtoreplaceanon-diskrecord,oritcanbeassignedtoreplaceabufferedrecord(thisisdecidedinStep(7)ofthealgorithm). Inarealisticscenario,theratioofthenumberofdiskblockstothenumberofrecordsbufferedinmainmemorymayapproachorevenexceedone.Forexample,a1TBdatabasewith128KBblockswillhave7.8millionblocks;andforsucharelativelylargedatabaseitisrealistictoexpectthatwehaveaccesstoenoughmemorytobuffermillionsrecords.Asthenumberofbufferedrecordsperblockmeetsorexceedsone,mostoralloftheblocksondiskwillcontain 32 PAGE 33 2 ,andsoallofthedatabaseblocksmustbeupdated.Thus,itmakessensetorelyonfast,sequentialI/Otoupdatetheentireleinasinglepass.Thedrawbackofthisapproachisthateverytimethatthebufferlls,weareeffectivelyrebuildingtheentirereservoirtoprocessasetofbufferedrecordsthatareasmallfractionoftheexistingreservoirsize. 33 PAGE 34 1 canbeusedtomaintainalarge,on-disksample,butallofthemhavedrawbacks.Inthissection,wediscussafourthalgorithmandanassociateddataorganizationcalledthegeometricletoaddressthesepitfalls.ThegeometricleisbestseenasanextensionofthemassiverebuildoptiongivenasAlgorithm 2 .JustlikeAlgorithm 2 ,thegeometriclemakesuseofamain-memorybufferthatallowsnewsamplesselectedbythereservoiralgorithmtobeaddedtotheon-diskreservoirinalazyfashion.However,thekeydifferencebetweenAlgorithm 2 andthealgorithmsusedbythegeometricleisthatthegeometriclemakesuseofafarmoreefcientalgorithmformergingthosenewsamplesintothereservoir. 2 ,thebasicalgorithmemployedbythegeometricleisnotmuchdifferent.AsfarasStep(13)isconcerned,thedifferencebetweenthegeometricleandthemassiverebuildextensionisthatthegeometricleemptiesthebuffermoreefciently,inordertoavoidscanningorperiodicallyre-randomizingtheentirereservoir. Toaccomplishthis,theentiresampleinmainmemorythatisushedintothereservoirisviewedasasinglesubsampleorastratum[ 16 ],andthereservoiritselfisviewedasacollectionofsubsamples,eachformedviaasinglebufferush.Sincetherecordsinasubsamplearenon-randomsubsetoftherecordsinthereservoir(theyaresampledfromthestreamduringaspecictimeperiod),eachnewsubsampleneedstooverwriteatrue,randomsubsetoftherecordsinthereservoirinordertomaintainthecorrectnessofthereservoirsamplingalgorithm.Ifthiscanbedoneefciently,wecanavoidrebuildingtheentirereservoirinordertoprocessabufferush. Atrstglance,itmayseemdifculttoachievethedesiredefciency.Thebufferedrecordsthatmustbeaddedtothereservoirwilltypicallyoverwriteasubsetoftherecordsstoredineach PAGE 35 3.3 ).Forexample,ifthereare100on-disksubsamples,thebuffermustbesplit100waysinordertowritetoaportionofeachofthe100on-disksubsamples.Thisfragmentedbufferthenbecomesanewsubsample,andsubsequentbufferushesthatneedtoreplacearandomportionofthissubsamplemustsomehowefcientlyoverwritearandomsubsetofthesubsample'sfragmenteddata. Thegeometricleusesacareful,on-diskdataorganizationinordertoavoidsuchfragmen-tation.Thekeyobservationbehindthegeometricleisthatthenumberofrecordsofasubsamplethatarereplacedwithrecordsfrombufferedsamplecanbecharacterizedwithreasonableaccu-racyusingageometricseries(hencethenamegeometricle).Asbufferedsamplesareaddedtothereservoirviabufferushes,weobservethateachexistingsubsamplelosesapproximatelythesamefractionofitsremainingrecordseverytime,wherethefractionofrecordslostisgovernedbytheratioofthesizeofabufferedsampletotheoverallsizeofthereservoir.Byloses,wemeanthatthesubsamplehassomeofitsrecordsreplacedinthereservoirwithrecordsfromasubsequentsubsample.Thus,thesizeofasubsampledecaysapproximatelyinanexponentialmannerasbufferedsamplesareaddedtothereservoir. Thisexponentialdecayisusedtogreatadvantageinthegeometricle,becauseitsuggestsawaytoorganizethedatainordertoavoidproblemswithfragmentation.Eachsubsampleispartitionedintoasetofsegmentsofexponentiallydecreasingsize.Thesesegmentsaresizedsothateverytimeabufferedsampleisaddedtothereservoir,weexpectthateachexistingsubsamplelosesexactlythesetofrecordscontainedinitslargestremainingsegment.Asaresult,eachsubsamplelosesonesegmenttothenewly-createdsubsampleeverytimethebufferisemptied,andageometriclecanbeorganizedintoaxedandunchangingsetofsegmentsthatarestoredascontiguousrunsofblocksondisk.Becausethesetofsegmentsisxedbeforehand,fragmentationandupdateperformancearenotproblematic:inordertoreplacerecordsinan 35 PAGE 36 Ondaytwo,(withU1=90)theUraniumfurtherdecaystoU1=81grams,thistimelosingU1(1)=U0(1)=n=9gramsofitsmass.Ondaythree,itfurtherdecaysbyn2=7:2grams,andsoon.ThedecayprocessisallowedtocontinueuntilwehavelessthangramsofUraniumremaining. ContinuingwiththeUraniumanalogy,threequestionsthatarerelevanttoourproblemofmaintainingverylargesamplesfromadatastreamare Thesequestionscanbeansweredusingthefollowingthreesimpleobservationsrelatedtogeometricseries: logc.Wedenotethisoorby. 36 PAGE 37 2 ).Recallthatthewayreservoirsamplingworksisthatnewsamplesfromthedatastreamarechosentooverwriterandomsamplescurrentlyinthereservoir.Thebuffertemporarilystoresthesenewsamples,delayingtheoverwriteofarandomsetofrecordsthatarealreadystoredondisk.Oncethebufferisfull,allnewsamplesaremergedwiththeRbyoverwritingarandomsubsetoftheexistingsamplesinR. ConsidersomearbitrarysubsampleSofR(soSR),withcapacityjSj.SincethebufferBrepresentsthesamplesthathavealreadyover-writtentheequalnumberofrecordsofR,abufferushoverwritesexactlyjBjsamplesofR.Thus,onexpectationthemergewilloverwritejSjjBj jRjsamplesofS.IfwedenejBj jRj=1,thenonexpectation,SshouldlosejSj(1)ofitsownrecordsduetothebufferush WecanroughlydescribetheexpecteddecayofSafterrepeatedbuffermergesusingthethreeobservationsstatedbefore.Ifthesubsampleretentionrate=1jBj jRj,then: Thenetresultofthisisthatitispossibletocharacterizetheexpecteddecayofanyarbitrarysubsetoftherecordsinourdisk-basedsampleasnewrecordsareaddedtothesamplethroughmultipleemptyingsofthebuffer.IfweviewSasbeingcomposedofon-disksegmentsofexponentiallydecreasingsize,plusaspecial,asinglegroupofnalsegmentsoftotalsize 3.7 ). 37 PAGE 38 Decayofasubsampleaftermultiplebufferushes. 3-1 38 PAGE 39 Basicstructureofthegeometricle. 39 PAGE 40 3-1 ,wecanorganizeourlarge,diskbasedsampleasasetofdecayingsubsamples.Atanypointoftime,thelargestsubsamplewascreatedbythemostrecentushingofthebufferintoR,andhasnotyetlostanysegments.Thesecondlargestsubsamplewascreatedbythesecondmostrecentbufferush;itlostitslargestsegmentinthemostrecentbufferush.Ingeneral,theithlargestsubsamplewascreatedbytheithmostrecentbufferush,andithashadi1segmentsremovedbysubsequentbufferushes.TheoverallleorganizationisdepictedinFigure 3-2 3 .Thetermsn,,andcarrythemeaningdiscussedinSection 3.5 .ThisprocessdescribedbyAlgorithm 3 isdepictedgraphicallyinFigure 3-3 .First,theleislledwiththeinitialdataproducedbythestream(athroughc).Toaddtherstrecordstothele,thebufferisallowedtollwithsamples.Thebufferedrecordsarethenrandomlygroupedintosegments,andthesegmentsarewrittentodisktoformthelargestinitialsubsample(a).Forthesecondinitialsubsample,thebufferisonlyallowedtolltojBjofitscapacitybeforebeingwrittenout(b).Forthethirdinitialsubsample,thebufferllstojBj2ofitscapacitybeforeitiswritten(c).Thisisrepeateduntilthereservoirhascompletelylled(aswasshowninFigure 3-2 ).Atthispoint,newsamplesmustoverwriteexistingones.Tofacilitatethis,thebufferisagainallowedtolltocapacity.Recordsarethenrandomlygroupedintosegmentsofappropriatesize,andthosesegmentsoverwritethelargestsegmentofeachexistingsubsample(d).Thisprocessisthenrepeatedindenitely,aslongasthestreamproducesnewrecords(eandf). 40 PAGE 41 3.7.1 ) 3 Step(21).Inordertomaintainthealgorithm'scorrectness,whenthebufferisushed 41 PAGE 42 3-4 .Now,wewanttoaddveadditionalnumberstoourset,byrandomlyreplacingveexistingnumbers.Whilewedoexpectnumberstobereplacedinawaythatisproportionaltobucketsize(Figure 3-4 (b)),thisisnotalwayswhatwillhappen(Figure 3-4 (c)). 3 .BeforeweaddanewsubsampletodiskviaabufferushinStep(21),werstperformalogical,randomizedpartitioningofthebufferintosegments,describedbyAlgorithm 4 .InAlgorithm 4 ,eachnewly-sampledrecordisrandomlyassignedtoreplaceasamplefromanexisting,on-disksubsamplesothattheprobabilityofeachsubsamplelosingarecordisproportionaltoitssize.TheresultofAlgorithm 4 isanarrayofMivalues,whereMitellsStep(21)ofAlgorithm 3 howmanyrecordsshouldbeassignedtooverwritetheithon-disksubsample. 3 willoverwriteexactlythenumberofrecordscontainedineach 42 PAGE 43 Buildingageometricle. 43 PAGE 44 Distributingnewrecordstoexistingsubsamples. subsample'slargestsegment.Tohandlethisproblem,weassociateastack(orbuffer 44 PAGE 45 4 ,Mishouldhavebeen.Then,therearetwopossiblecases: ThesestackoperationsareperformedjustpriortoStep(23)inAlgorithm 3 .Notethatsincethenalgroupofsegmentsfromasubsampleoftotalsizearebufferedinmainmemory,theirmaintenancedoesnotrequireanystackoperations.Onceasubsamplehaslostallofitson-disksamples,overwritesofrecordsinthissetcanbehandledbysimplyreplacingtherecordsdirectly. Topre-allocatespaceforthesestacks,weneedtocharacterizehowmuchoverowwecanexpectfromagivensubsample,whichwillboundthegrowthofthesubsample'sstack.Itisimportanttohaveagoodcharacterizationoftheexpectedstackgrowth.Ifweallocatetoomuchspaceforthestacks,thenweallocatediskspaceforstoragethatisneverused.Ifweallocatetoolittlespace,thenthetopofonestackmaygrowupintothebaseofanother.Ifastackdoesoverow,itcanbehandledbybufferingtheadditionalrecordstemporarilyinmemoryormovingthestacktoanewlocationondiskuntilthestackcanagaintinitsallocatedspace.Thisisnot 45 PAGE 46 Toavoidthis,weobservethatifthestackassociatedwithasub-sampleScontainsanysamplesatagivenmoment,thenShashadfewerofitsownsamplesremovedthanexpected.Thus,ourproblemofboundingthegrowthofS'sstackisequivalenttoboundingthedifferencebetweentheexpectedandtheobservednumberofsamplesthatSlosesasjBjnewsamplesareaddedtothereservoir,overallpossiblevaluesforjBj. Toboundthisdifference,werstnotethatafteraddingjBjnewsamplesintothereservoir,theprobabilitythatanyexistingsampleinthereservoirhasbeenover-writtenbyanewsampleis111 42 ].Simplearithmeticimpliesthatthegreatestvarianceisachievedwhenasubsamplehasonexpectationlost50%ofitsrecordstonewsample(P=0:5);atthispointthestandarddeviationis0:5p 46 PAGE 47 Toillustratetheimportanceofminimizing,imaginethatwehavea1GBbufferandastreamproducing100Brecords,andwewanttomaintaina1TBsample.Assumethatweuseanvalueof0.99.Thus,eachsubsampleisoriginally1GB,andjBj=107.FromObserva-tion2weknowthatn log0:99c=1029segmentstostoretheentirenewsubsample. Now,considerthesituationif=0:999.Asimilarcomputationshowsthatwewillnowrequire10;344segmentstostorethesame1GBsubsample.Thisisanorder-of-magnitudedifference,withsignicantpracticalimportance.Withfourdiskseekspersegment,1029segmentsmightmeanthatwespendaround40secondsofdisktimeinrandomI/Os(at10ms 47 PAGE 48 jRj jRj. WewilladdressthislimitationinSection 3.10 log0:99cor687.Byincreasingtheamountofmainmemorydevotedtoholdingthesmallestsegmentsforeachsubsamplebyafactorof32,weareabletoreducethenumberofdiskheadmovementsbylessthanafactoroftwo.Thus,wewillnotconsideroptimizing.Rather,wewillxtoholdasetofsamplesequivalenttothesystemblocksize,andsearchforabetterwaytoincreaseperformance. 48 PAGE 49 1. Whyistheclassicalreservoirsamplingalgorithm(presentedasAlgorithm 1 )correct?ThatiswhatistheinvariantmaintainedbytheAlgorithm 1 ? 2. Whyistheobviousdisk-based,extensionofAlgorithm 1 (presentedasAlgorithm 2 )correct?ThatishowdoesAlgorithm 2 maintaintheinvariantofAlgorithm 1 viatheuseofamainmemorybuffer? 3. WhyistheproposedgeometriclebasedsamplingtechniqueinAlgorithm 3 correct? WehaveansweredtherstquestioninSection 3.1 .Wediscussthesecondandthirdquestionshere. 2 makesuseofthemainmemorybufferofsizejBjtobuffernewsamples.Thebufferedsampleslogicallyrepresentasetsamplesthatshouldhavebeenusedtoreplaceon-disksamplesinordertopreservethecorrectnessofthesamplingalgorithm,butthathavenotyetbeenmovedtodiskforperformancereasons(thatis,duetolazywrites). ItisnothardtoseethattheinvariantmaintainedbyAlgorithm 1 isalsomaintainedbyAlgorithm 2 instep(6).ThenewrecordsaresampledwiththesameprobabilityjRj=i.Theonlydifferenceisthatnewlysampledrecordsareaddedtothereservoirusingsteps(7-14)insteadofsimplesteps(5-6)ofAlgorithm 1 .Wenowdiscusswhythesestepsareequivalent. Onestraightforwardwayofkeepingthesampledrecordsinthebufferanddolazywritesisasfollows.Everytimewedecidetoaddanewsampletothebuffer(i.e.withprobabilityjRj=i),wealsogeneratearandomnumberbetween1andRtodecideitspositioninthereservoir.However,westorethispositioninthepositionarrayandthusavoidanimmediatediskseek.Ifwehappentogenerateapositionthatisalreadyinthepositionarray,weoverwritethecorrespondingrecordinthebufferwiththenewlysampledrecord.Ifwewouldhaveushedthatrecordtodiskusingtheclassicalgorithm(ratherthanbufferingit),wewouldhavereplaceditwiththenewlysampledrecord.Thuswewouldobtainthesameresult.Oncethebufferisfullwe 49 PAGE 50 1 asfarascorrectnessisconcerned. Logically,steps(7-14)ofAlgorithm 2 actuallyimplementexactlythisprocess.Theprobabilitythatwewillgeneratearandompositionbetween1andjRjthatisalreadyinthepositionarrayofsizejBjisjBj=R.Step(7)ofAlgorithm 2 decideswhethertooverwritearandombufferedrecordwithanewlysampledrecord.Oncethebufferisfull,step(13)performsaonepassbuffer-reservoirmergingbygeneratingsequentialrandompositionsinthereservoironthey. 2 westorethesamplessequentiallyonthediskandoverwritetheminarandomorder.Thoughcorrect,thealgorithmdemandsalmostacompletescanofthereservoir(toperformallrandomoverwrites)foreverybufferush.WecandobetterifweinsteadforcethesamplestobestoredinarandomorderondisksothattheycanbereplacedviaanoverwriteusingsequentialI/Os.Thelocalizedoverwriteextensiondiscussedbeforeusethisidea.Everytimeabufferisushedtothereservoiritisrandomizedinmainmemoryandwrittenasarandomclusteronthedisk.WemaintainthecorrectnessofthistechniquebysplittingtherandomclusterinN-wayswhereNisthenumberofexistingclustersonthediskandbyoverwritingrandomsubsetofeachexistingcluster.Thisavoidstheproblemofclusteringbyinsertiontime.However,thedrawbackofthistechniqueisthatthesolutiondeterioratesbecauseoffragmentationofclusters. ThegeometricleovercomesthedrawbacksofthesetwotechniquesandcanbeviewedasacombinationofAlgorithm 2 andtheideausedinthelocalizedoverwriteextension.ThecorrectnessoftheGeometricleisresultsdirectlyfromthecorrectnessofthesetwotechniques.Incaseofthegeometricletheentiresampleinthemainmemory(referredtoasasubsample)israndomizedandushedintothereservoir.Furthermore,eachnewsubsampleissplitintoexactlythosemanysegmentsasthenumberofexistingsubsamplesonthedisk.Thesesegmentsthenoverwritearandomportionofeachdisk-basedsubsample.Theonlydifferencewiththe 50 PAGE 51 1 ,isxedbytheratiojBj=jRj.Thatis,foraxeddesiredsizeofreservoirweneedalargerbuffertolowerthevalueof. However,thereisawaytoimprovethesituation.GivenabufferofxedcapacityjBjanddesiredsamplesizejRj,wechooseasmallervalue0<,andthenmaintainmorethanonegeometricleatthesametimetoachievealargeenoughsample.Specically,weneedtomaintainm=(10) (1)geometriclesatonce.Theselesareidenticaltowhatwehavedescribedthusfar,exceptthattheparameter0isusedtocomputethesizesofasubsample'son-disksegmentsandsizeofeachleisjRj 3 .Eachofthemgeometriclesisstilltreatedasasetofdecayingsubsamples,andeachsubsampleispartitionedintoasetofsegmentsofexponentiallydecreasingsize,justasisdoneinAlgorithm 3 ,Steps(5)-(13).Theonlydifferenceisthataseachleiscreated,theparameter0isusedinsteadofinSteps(6),(8)-(9),andeachofthemgeometriclesislledafteroneanother,inturn.Thus,eachsubsampleofeachgeometriclewillhavesegmentsofsizen;n0;n02andsoon. 51 PAGE 52 3 Steps(15)-(20)untilbufferisfull.Oncethebufferisfull,itsrecordorderisthenrandomized,justasisinasinglegeometricle.Nextthebufferisushedtodisk.Thisiswherethealgorithmismodied.Overwritingrecordsondiskwithrecordsfromthebufferissomewhatdifferent,intwoprimaryways,asdiscussednext. 4 ,thebufferispartitionedsothatthesizeofeachbuffersegmentisonexpectationproportionaltothecurrentsizeofsubsamplesinasinglele.Incaseofmultiplegeometricles,wepartitionthebufferjustlikeinAlgorithm 4 ;however,werandomlypartitionthebufferacrossallsubsamplesfromallgeometricles.Thenumberofbuffersegmentsafterthepartitioningisthesameasthetotalnumberofsubsamplesintheentirereservoir,andthesizeofeachbuffersegmentisonexpectationproportionaltothecurrentsizeofeachofthesubsamplesfromoneofthegeometricles.Thisallowsustomaintainthecorrectnessofthereservoirsamplingalgorithm.ThebufferpartitioningstepsincaseofmultiplegeometriclesaregiveninAlgorithm 5 3 'sbuffermergealgorithm.Wediscussalltheintricaciessubsequently,butathigh-level,thelargestsegmentofeachsubsamplefromonlyonegeometricleisover-writtenwithsamplesfromthebuffer.Thisallowsforconsiderablespeedup,aswediscussinSection 3.12 .Atrst,thiswouldseemtocompromisethecorrectnessofthealgorithm:logically,thebufferedsamplesmustover-writesamplesfromeveryoneofthegeometricles(infact,thisispreciselywhythebufferispartitionedacrossallgeometricles,as 52 PAGE 53 3.11.1 to 3.11.3 ,wedescribeindetailanalgorithmthatisabletomaintainthecorrectnessofthesample. Oncethesegmentsassignedtothevariousleshavebeenconsolidated,theresultingsegmentsareusedtooverwritesubsamplesfromasinglegeometricleusingexactlythealgorithmfromSection 3.4 ,subjecttotheconstraintthatthejthbuffermergeoverwritessubsamplesfromthe(jmodm)thgeometricle. Ourremedytothisproblemistodelayoverwritingasubsample'slargestsegmentuntilthetimethatall(ormost)oftherecordsthatwillbeover-writtenondiskareinvalid,inthesensethattheyhavelogicallybeenover-writtenbyhavingrecordsfromsubsequentbuffer 53 PAGE 54 Speedinguptheprocessingofnewsamplesusingmultiplegeometricles. 54 PAGE 55 Thewaytoaccomplishthisistooverwritesubsamplesinalazymanner.Wemergethebufferwiththe(jmodm)thgeometricle,butwedonotoverwriteanyofthevalidsamplesstoredintheleuntilthenexttimewegettothele.Wecanachievethisbyallocatingenoughextraspaceineachgeometricletoholdacomplete,emptysubsampleineachgeometricle.Thissubsampleisreferredtoasthedummy.Thedummyneverdecaysinsize,andneverstoresitsownsamples.Rather,itisusedasabufferthatallowsustosidesteptheproblemofasubsampledecayingtooquickly.Whenanewsubsampleisaddedtoageometricle,thenewsubsampleoverwritessegmentsofdummyratherthanoverwritinglargestsegmentofanyexistingsubsamples.Thus,wehaveprotectedsegmentsofsubsamplesthatcontainvaliddatabyoverwritingdummy'srecordsinstead. Whenrecordsaremergedfromthebufferintothedummy,thespacepreviouslyownedbythedummyisgivenuptoallowstorageofthele'snewestsubsample.Afterthisush,thelargestsegmentfromeachofthesubsamplesintheleisgivenuptoreconstitutethenewdummy.Becausetherecordsin(new)dummy'ssegmentswillnotbeover-writtenuntilthenexttimethatthisparticulargeometricleiswrittento,allofthedatathatiscontainedwithinitisprotected. Notethatwithadummysubsample,wenolongerhaveaproblemwithasubsamplelosingitssamplestooquickly.Instead,asubsamplemayhaveslightlytoomanysamplespresentondiskatanygiventime,bufferedbythele'sdummy.Theseextrasamplescaneasilybeignoredduringqueryprocessing.TheonlyadditionalcostweincurwithdummyisthateachofthegeometriclesondiskmusthavejBjadditionalunitsofstorageallocated.TheuseofadummysubsampleisillustratedinFigure 3-5 55 PAGE 56 2 Proof. log0c.Substitutingn=(10)jBjandsimplifyingtheexpression(aswellas 56 PAGE 57 log0(loglogjBj).Ifwelet!=(log(1=0))1thenumberofsegmentscanbeexpressedas!(logjBjlog).Assumingaconstantnumbercofrandomseekspersegmentwrittentothedisk,thetotalrandomdiskheadmovementsrequiredperrecordis!c((logjBjlog)=jBj),whichisO(!logjBj=jBj). Incaseofmultiplegeometriclesweuseadditionalspaceformdummysubsamples.Thus,thetotalstoragerequiredbyallgeometriclesisjRj+(mjBj).Ifwewishtomaintaina1TBreservoirof100Bsampleswith1GBofmemory,wecanachieve0=0:9byusingonly1.1TBofdiskstorageintotal.For0=0:9,weneedtowritelessthan100segmentsper1GBbufferush.At40ms/segment,thisisonly4secondsofrandomdiskheadmovementstowrite1GBofnewsamplestodisk. Inordertotesttherelativeabilityofthegeometricletoprocessahigh-speedstreamofinsertions,wehaveimplementedandbench-markedvealternativesformaintainingalargereservoirondisk:thethreealternativesdiscussedinSection 3.3 ,thegeometricle,andtheframeworkdescribedinSection 3.10 forusingmultiplegeometriclesatonce.WepresentthesebenchmarkingresultsinChapter 7 57 PAGE 58 Inthischapterweproposeasimplemodicationstotheclassicreservoirsamplingalgorithm[ 11 38 ]inordertoderiveaverysimplealgorithmthatpermitsthesortofxed-size,biasedsamplinggivenintheexample.Ourmethodassumestheexistenceofanarbitrary,user-denedweightingfunctionfwhichtakesasanargumentarecordri,wheref(ri)>0describestherecord'sutilityinsubsequentqueryprocessing.Wethencompute(inasinglepass)abiasedsampleRioftheirecordsproducedbyadatastream.Riisxed-size,andtheprobabilityofsamplingthejthrecordfromthestreamisproportionaltof(rj)forallji.Thisisafairlysimpleandyetpowerfuldenitionofbiasedsampling,andisgeneralenoughtosupportmanyapplications. Ofcourse,onestraightforwardwaytosampleaccordingtoawell-denedbiasfunctionwouldbetomakeacompletepassoverthedatasettocomputethetotalweightofalltherecords,PNj=1f(rj).Duringasecondpass,wecanthenchoosetheithrecordofthedatasetwithprobabilityjRjf(ri) Inmostcases,ouralgorithmisabletoproduceacorrectlybiasedsample.However,givencertainpathologicaldatasetsanddataorderings,thismaynotbethecase.Ouralgorithmadaptsinthiscaseandprovidesacorrectlybiasedsampleforaslightlymodiedbiasfunctionf0.We 58 PAGE 59 Therestofthechapterisorganizedasfollows.Wedescribesasingle-passbiasedsamplingalgorithm.Wealsodeneadistancemetrictoevaluatetheworstcasedeviationfromtheuser-denedweightingfunctionf.Finally,wederiveasimpleestimatorforabiasedreservoir.TheexperimentsperformedtotestouralgorithmsarepresentedinChapter 7 6 .Itispossibletoprovethatthismodiedalgorithmresultsinacorrectlybiasedsample,providedthattheprobabilityfromline(8)ofAlgorithm 6 doesnotexceedone. 6 ,weareguaranteedthatforeachRiandforeachrecordrjproducedbythedatastreamsuchthatji,wehavePr[rj2Ri]=jRjf(rj) Proof. 59 PAGE 60 60 PAGE 61 1 ) WedeneanoverweightrecordtobearecordriforwhichjRjf(ri) Animportantfactortoconsiderwhiledeterminingthefeasibilityofmaintainingsuchaqueueinthegeneralcaseisprovidinganupperboundonitssize.Thiscanbedonebyconsider-ingtheworstpossibleorderingoftherecordsinputintothealgorithm,subjecttotheconstraintthatthebiasfunctioniswell-dened.Ingeneral,wedescribetheuser-denedweightingfunctionfasbeingwell-denedifjRjf(ri) 61 PAGE 62 Westressthatthoughthisupperboundisquitepoor(requiringthatweneedtobuffertheentiredatastream!)itisinfactaworst-casescenario,andtheapproachwilloftenbefeasibleinpractice.Thisisbecauseweightswilloftenincreasemonotonicallyovertime(asinthecasewherenewerrecordstendtobemorerelevantforqueryprocessingthanolderones).Still,giventhepoorworst-caseupperbound,amorerobustsolutionisrequired,whichwenowdescribe. 62 PAGE 63 1. First,wewillbeabletoguaranteethatf0(rj)willbeexactlyf(rj)if(jRjf(ri))=totalWeight1forallk>j. 2. Wecanalsoguaranteethatwecancomputethetrueweightforagivenrecordtounbiasedanyestimatemadeusingoursample(seeSection 4.4 ). Inotherwords,ourbiasedsamplecanstillbeusedtoproduceunbiasedestimatesthatarecorrectonexpectation[ 16 ],butthesamplemightnotbebiasedexactlyasspeciedbytheuser-denedfunctionf,ifthevalueoff(r)tendstouctuatewildly.Whilethismayseemlikeadrawback,thenumberofrecordsnotsampledaccordingtofwillusuallybesmall.Furthermore,sincethefunctionusedtomeasuretheutilityofasampleinbiasedsamplingisusuallytheresultofanapproximateanswertoadifcultoptimizationproblem[ 15 ]ortheapplicationofaheuristic[ 52 ],havingasmalldeviationfromthatfunctionmightnotbeofmuchconcern. Wepresentasingle-passbiasedsamplingalgorithmthatprovidesbothguaranteesoutlinedaboveasAlgorithm 7 ,andLemma 4 provesthecorrectnessofthealgorithm. 7 ,weareguaranteedthatforeachRiandforeachrecordrjproducedbythedatastreamsuchthatji,wehave,Pr[rj2Ri]=jRjf0(rj) Proof. 3 .Wesimplyusef0insteadofftoprovethedesiredresult. 63 PAGE 64 7 isthedeviationoff0fromf.Thatis:howfarofffromthecorrectweightingcanwebe,intheworstcase?Whenstreamhasnooverweightrecords,weexpectf0tobeexactlyequaltof,butitmaybeveryfarawayundercertaincircumstances.Toaddressthis,wedeneadistancemetricinDenition 2 andevaluatetheworsecasedistancebetweenf0andf. 64 PAGE 65 1 andisanalyzedandprovedintheAppendixofthispaper. 7 willsamplewithanactualbiasfunctionf0wheretotalDist(f;f0)isupperboundedbyPNk=jRjf(r0k)PjRj1k=1f(r0k) 7 computesabiasedsampleaccordingtof0,wheref0isaclosefunctiontoauser-denedweightingfunctionfaccordingtothefollowingdistancemetric: PAGE 66 7 occurswhen(1)thereservoirisinitiallylledwiththeRrecordshavingthesmallestpossibleweightsand(2)weencountertherecordrmaxwiththelargestweightimmediatelythereafter.Theorem 1 presentedanupperboundontotalDist(f;f0)inthisworstcase.Inthissection,werstprovidetheproofofthisworstcaseforAlgorithm 7 andthenprovetheupperboundontotalDist(f;f0)givenbyTheorem 1 7 ,werstprovethefollowingthreepropositions.Theseproofsleadustotheworst-caseargument.Ifwedenotetherecordwiththehighestweightinthestreamasrmaxandusermaxitodenotethecasewherermaxislocatedatpositioniinthestream,thenforanygivenrandomorderingofthestreamingrecordsr1;:::;ri1;rmaxi;:::;rN,weprovethat 1. MovingtherecordrmaxiearlierintherangerjRj:::rNcannotdecreasetotalDist(f;f0). 2. Whenweareinitiallyllingthereservoir,choosingjRjrecordswithsmallestpossibleweightmaximizestotalDist(f;f0). 3. Reorderingofanyrecordthatappearsafterrmaxiintherangeri+1:::rNcannotincreasetotalDist(f;f0). 66 PAGE 67 5 (givenbelow)were-writethetotalDistformulaas 67 PAGE 68 5 were-writethetotalDistformulaas PAGE 69 Adjustmentofrmaxitormaxi1 4 )fromEquation( 4 )asfollows: 4-1 showstheadjustmentofrmaxitormaxi1.Wedenotetherecordthatisswappedwithrmaxasrswap.Theaboveequationfurthersimpliesto hjRjf(rmax)+PNk=i+1f(rk)i PAGE 70 Y=2f(rswap) PAGE 71 7 acceptstherstjRjrecordsofthestreamwithprobability1.NoweightadjustmentsaretriggeredforrstjRjrecordsirrespectiveoftheirweights.Therefore,theearliestpositionrmaxcanappearinthestreamisrightafterthereservoirislled.Thisprovestheproposition.WenowturntoprovingLemma 5 ,whichwasusedinthepreviousproof. Proof. 71 PAGE 72 5 ,8j PAGE 73 Fromtheabovethreepropositions,wecanconcludethattheworstcaseforAlgorithm 7 occurswhen(1)thereservoirisinitiallylledwiththejRjrecordshavingthesmallestpossibleweightsand,(2)weencountertherecordrmaxwiththelargestweightimmediatelythereafter. 1 :TheUpperBoundontotalDist 5 were-writethetotalDistformulaas 73 PAGE 74 4 ),theaboveequationsimpliesto IntheworstcasethereservoirisinitiallylledwiththejRjrecordshavingthesmallestpossibleweights.Ifr1;r2;:::rNaretherecordsinappearanceorderthenwedener01;r02;:::;r0Nasthepermutation(reordering)oftherecordssuchthatf(r01)f(r02)f(r0N):Theconditionrequiringreservoirlledwiththesmallestpossibleweightscanbethenwrittenas 74 PAGE 75 1. Foreachon-disksubsample,MjissettobejRjMjf(ri) 2. Foreachsampledrecordstillinthebuffer,rj:weightissettojRjrj:weightf(ri) 3. Finally,totalWeightissettojRjf(ri). 75 PAGE 76 50 ]forasamplecomputedusingouralgorithm.Wederivethecorrelation(covariance)betweentheBernoullirandomvariablesgoverningthesamplingoftworecordsriandrjusingouralgorithmandusethiscovariancetoderivethevarianceofaHorvitz-Thomsonestimator.CombinedwiththeCentralLimitTheorem,thevariancecanthenbeusedtoprovideboundsontheestimator'saccuracy.TheestimatorissuitablefortheSUMaggregatefunction(and,byextension,theAVERAGEandCOUNTaggregates)overasingledatabasetableforwhichthereservoirismaintained.Thoughhandlingmorecomplicatedqueriesusingthebiasedsampleisbeyondthescopeofthepaper,itisstraightforwardtoextendtheanalysisofthisSectiontomorecomplicatedqueriessuchasjoins[ 32 ]. Imaginethatwehavethefollowingsingle-tablequery,whose(unknown)answerisq: TABLEASr Next,wederivethevarianceofthisestimator.Todothis,weneedaresultsimilartoLemma 3 thatcanbeusedtocomputetheprobabilityPr[frj;rkg2Ri]underourbiasedsamplingscheme. 76 PAGE 77 7 ,foreachRiandforeachrecordpairfrj;rkgproducedbythedatastreamwherej PAGE 78 Thisexpressioncanthenbeusedinconjunctionwiththenextlemmatocomputethevarianceofthenaturalestimatorforq. ByusingtheresultofLemma 6 tocomputePr[frj;rkg2Ri],thevarianceoftheestimatoristheneasilyobtainedforaspecicquery.Inpractice,thevarianceitselfmustbeestimatedbyconsideringonlythesampledrecordsaswetypicallydonothaveaccesstoeachandeveryrjduringqueryprocessing.Theq2termandthetwosumsintheexpressionofvariancearethuscomputedovereachrjinthesampleofbiasedgeometricleratherthanovertheentirereservoir. Thereisoneadditionalissueregardingbiasedsamplingthatisworthsomeadditionaldiscussion:howtoefcientlycomputethevaluePr[frj;rkg2Ri]inordertoestimate 78 PAGE 79 TherstsubexpressionscanbeeasilycomputedwiththehelpofrunningtotaltotalWeightalongwiththeweightmultipliersassociatedwitheachsubsample.Whensamplerecordsareaddedtothereservoir,likeattributeri.weight,westoreanotherattributewitheachrecord,ri.oldTotalWeightandri.oldM.TherstattributegetsitsvaluefromcurrentvalueoftotalWeight,whereastheM(ri)isstoredinthesecondattribute.Whenaqueryisevaluatedandweneedtocomputetherstsubexpressionsforagivenrecordpairrjandrk,wecomputetermsinitsdenominatorasfollows:kXl=1f0(rl)=rk:oldTotalWeightM(rk) PAGE 80 Ageometricleisasimplerandomsample(withoutreplacement)fromadatastream.Inthischapterwedeveloptechniqueswhichallowageometricletoitselfbesampledinordertoproducesmallersetsofdataobjectsthatarethemselvesrandomsamples(withoutreplacement)fromtheoriginaldatastream.Thegoalofthealgorithmsdescribedinthischapteristoefcientlysupportfurthersamplingofageometriclebymakinguseofitsownstructure. 3.2 ,wearguedthatsmallsamplesfrequentlydonotprovideenoughaccuracy,especiallyinthecasewhentheresultingstatisticalestimatorhasaveryhighvariance.However,whileinthegeneralcaseaverylargesamplecanberequiredtoansweradifcultquery,ahugesamplemayoftencontaintoomuchinformation.Forexample,reconsidertheproblemofestimatingtheaveragenetworthofAmericanhouseholdsasdescribedinSection 3.2 .Inthegeneralcase,manymillionsofsamplesmaybeneededtoestimatethenetworthoftheaveragehouseholdaccurately(duetoasmallratiobetweentheaveragehousehold'snetworthandthestandarddeviationofthisstatisticacrossallAmericanhouseholds).However,ifthesamesetofrecordsheldinformationaboutthesizeofeachhousehold,onlyafewhundredrecordswouldbeneededtoobtainsimilaraccuracyforanestimateoftheaveragesizeofanAmericanhousehold,sincetheratioofaveragehouseholdsizetothestandarddeviationofsamplesizeacrosshouseholdsintheUnitedStatesisgreaterthan2.Thus,toestimatetheanswertothesetwoqueries,vastlydifferentsamplesizesareneeded. 80 PAGE 81 1 21 30 34 39 ].Ingeneral,thedrawbackofmakinguseofabatchsampleisthattheaccuracyofanyestimatorwhichmakesuseofthesampleisxedatthetimethatthesampleistaken,whereasthebenetofbatchsamplingisthatthesamplecanbedrawnwithveryhighefciency. WewillalsoconsiderthecasewhereNisnotknownbeforehand,andwewanttoimplementaniterativefunctionGetNext.EachcalltoGetNextresultsinanadditionalsampledrecordbeingreturnedtothecaller,andsoNconsecutivecallstoGetNextresultsinasampleofsizeN.Wewillreferasampleretrievedinthismannerasanonlineorsequentialsample.ThedrawbackofonlinesamplingcomparedtobatchsamplingisthatitisgenerallylessefcienttoobtainasampleofsizeNusingonlinemethods.However,sincetheconsumerofthesamplecancallGetNextrepeatedlyuntilanestimatorwithenoughaccuracyisobtained,onlinesamplingismoreexiblethanbatchsampling.Anonlinesampleretrievedfromageometriclecanbeusefulformanyapplications,includingonlineaggregation[ 32 33 ].Inonlineaggregation,adatabasesystemtriestoquicklygatherenoughinformationsoastoapproximateanswertoanaggregatequery.Asmoreandmoreinformationisgathered,theapproximationqualityisimproved,andtheonlinesamplingprocedureishaltedwhentheuserishappywiththeapproximationaccuracy. 5.3.1ANaiveAlgorithm Proof. 81 PAGE 82 jDjN jDjN=1 Unfortunately,thoughitisverysimple,thenaivealgorithmwillbeinefcientfordrawingasmallsamplefromalargegeometriclesinceitrequiresafullscanofthegeometricletoobtainatruerandomsampleforanyvalueofN.Sincethegeometriclemaybegigabytesinsize,thiscanbeproblematic. 82 PAGE 83 26 ].Oncethenumberofsampledrecordsfromeachsegmenthasbeendetermined,samplingthoserecordscanbedonewithanefcientsequentialreadsincewithineachondisksegment,allrecordsarestoreinarandomizedorder.Thekeyalgorithmicissueishowtocalculatethecontributionofeachsubsample.Sincethiscontributionisamultivariatehypergeometricrandomvariable,wecanuseanapproachanalogoustoAlgorithm 4 ,whichisusedtopartitionthebuffertoformthesegmentsofasubsample.Inotherwords,wecanviewretrievingNsamplesfromageometricleanalogoustochoosingNrandomrecordstooverwritewhennewrecordsareaddedtothele. Theresultingalgorithmcanbedescribedasfollows.Tostartwith,wepartitionthesamplespaceofNrecordsintosegmentsofvaryingsizeexactlyasinAlgorithm 4 .Werefertothesesegmentsofthesamplespaceassamplingsegments.Thesamplingsegmentsarethenlledwithsamplesfromthediskusingaseriesofsequentialreads,analogoustothesetofwritesthatareusedtoaddnewsamplestothegeometricle.Thelargestsamplingsegmentobtainsallofitsrecordsfromthelargestsubsample,thenextlargestsamplingsegmentobtainsallitsrecordfromsecondlargestsubsample,andsoon. Whenusingthisalgorithm,somecareneedstobetakenwhenNapproachestothesizeofageometricle.Specically,whenalldisksegmentsofasubsamplearereturnedtoacorrespondingsamplingsegment,wemustalsoconsiderthesubsample'sin-memorybuffered 83 PAGE 84 8 ItisclearthatthisalgorithmobtainsthedesiredbatchsamplebyscanningexactlyNrecordsasagainsttheentirescanofthereservoirsamplingatthecostoffewrandomdiskseeks.Sincethesamplingprocessisanalogoustotheprocessofaddingmoresamplestothele,itisjustasefcient,requiringO(!logjBj=N)randomdiskheadmovementsforeachnewlysampledrecord,asdescribedinLemma2. 8 oneachleinordertoobtainthedesiredbatchsample. 5.4.1ANaiveAlgorithm Itiseasytoseethatanaivealgorithmwillgiveusacorrectonlinesampleofageometricle.However,wewilluseonediskseekpercalltoGetNext.SinceeachrandomI/Orequires 84 PAGE 85 Insteadofselectingarandomrecordofageometricle,werandomlypickasubsampleandchooseitsnextavailablerecordasareturnvalueofGetNext.Thisisanalogoustotheclassiconlinesamplingalgorithmforsamplingfromahashedle[ 26 ],whererstahashbucketisselectedandthenarecordischosen.Sincetheselectionofarandomrecordwithinasubsampleissequential,wemayreducethenumberofcostlydiskseeksifwereadthesubsampleinitsentirety,andbufferthesubsample'srecordsinmemory.Usingthisbasicmethodology,wenowdescribehowacalltotheGetNextwillbeprocessed: Sincetherecordsfromeachsubsamplearereadandbufferedinmemorysequentially,weareguaranteedtochooseeachrecordofthereservoiratmostonce,givingusdesiredrandomsamplewithoutreplacement.Aproofofthisissimple,andanalogoustotheproofofLemma3.However,thusfarwehavenotconsideredaveryimportantquestion:HowmanyblocksofasubsampleSishouldwefetchatthetimeofbufferrell?Ingeneraltherearetwoextremesthatwemayconsider: 85 PAGE 86 Inordertodiscusssuchconsiderationsmoreconcretely,wenotethatthetimerequiredtoprocessGetNextcallisproportionaltothenumberofblocksfetchedonthecall,assumingthatthecosttoperformtherequiredin-memorycalculationsisminimal.Ifbblocksarefetchedduringaparticularcall,wespends+brtimeunitsonthatparticularcalltoGetNext,wheresistheseektimeandristimerequiredtoscanablock.OncethesebblocksarefetchedweincurzerocostfornextbncallstoGetNext,wherenistheblockingfactor(numberofrecordsperblock).Thus,inthecasewhereblocksarefetchedattherstcalltoGetNext,weincurthetotalcostofs+brtosamplebnrecords,andhavearesponsetimeofs+brunitsattherstcalltoGetNext,withallsubsequentcallshavingzerocost. Nowimaginethatinsteadwesplitbblocksintotwochunksofsizeb=2each,andreadachunk-at-a-time.Thus,therstGetNextcallwillcostuss+br=2timeunits.Oncethesebn=2recordsareusedupwereadnextchuckofblocks.Thetotalcostinthisscenariois2s+brwitharesponsetimeofs+br=2timeunitsonceatthestartingpointandothermid-waythrough.NotethatalthoughthemaximumresponsetimeonanycalltoGetNextisreducedbyhalf,werequiredmoretimetosamplebnrecords.Thequestionthenbecomes,Howdowereconcileresponsetimewithoverallsamplingtimetogivetheuseroptimalperformance? ThesystematicapproachwetaketoansweringthisquestionisbasedonminimizingtheaveragesquaresumofresponsetimeoverallGetNextcalls.Thisideaissimilartothewidelyutilizedsum-square-errororMSEcriterion,whichtriestokeepstheaverageerrororcostfrombeingtoohigh,butalsopenalizesparticularlypoorindividualerrorsorcosts.However,one 86 PAGE 87 dX(X(s+(N=br)=X)2)=d dX(Xs2+2Nsr+(N=br)2=X)=s2(N=br)2=X2 PAGE 88 9 givesthedetailedonlinesamplingalgorithm. Proof. LettheSamplebethebiasedsampleofthegeometricle,thenwehavePr[i2Sample]=Pr[SelectingifromSi]Pr[SelectingSi]Pr[i2Si]=1 jRjjRjf(r) 7 ofthisdissertation. 88 PAGE 89 Efcientlysearchinganddiscoveringrequiredinformationfromasamplestoredinageo-metricleisessentialtospeedupqueryprocessing.Anaturalwaytosupportthisfunctionalityistobuildanindexstructureforthegeometricle.Inthischapterwediscussthreesecondaryindexstructuresforthegeometricle.Thegoalistomaintaintheindexstructuresasnewrecordsareinsertedtothegeometricleandatthesametimeprovideefcientaccesstothedesiredinformationinthele. FROMTransaction WHEREStoreState='FL'ANDTransDate>1/1/2007 89 PAGE 90 Anaturalwaytospeedupthesearchanddiscoveryofthoserecordsfromageometriclethathaveaparticularvalueforaparticularattribute(s)istobuildanindexstructure.Ingeneralanindexisadatastructurethatletsusndarecordwithouthavingtolookatmorethanasmallfractionofallpossiblerecords.Thus,inourexample,wecouldusetheindexbuiltoneitherStoreStateorTransDate(orboth)toquicklyaccessspecicsetofrecordsandtestthemfortheconditionsintheWHEREclause.Inthischapterwefocusonbuildingsuchanindexstructureforthegeometricle. Apartfromprovidingefcientaccesstothedesiredinformationinthele,akeyconsider-ationisthattheindexforthegeometriclemustbemaintainedasnewrecordsareinserted.Forinstance,wecouldbuildasecondaryindexonanattributewhenthenewrecordsarebulkinsertedintothegeometricle.Wemustthendeterminehowdowemergethenewsecondaryindexwiththeexistingindexesbuiltfortherestofthele.Furthermore,wemustmaintaintheindexasexistingrecordsarebeingoverwrittenwithnewlyinsertedrecordsandhencearedeletedfromthegeometricle. Withthesegoalsinmind,wediscussthreesecondaryindexstructuresforthegeometricle:(1)asegment-basedindex,(2)asubsample-basedindex,and(3)aLog-StructuredMerge-Tree-(LSM-)basedindex.Thersttwoindexesaredevelopedaroundthestructureofthegeometricle.MultipleB+-treeindexes[ 9 ]aremaintainedforeachsegmentorsubsampleinageometric 90 PAGE 91 44 ]-adisk-baseddatastructuredesignedtoprovidelow-costindexinginanenvironmentwithahighrateofinsertsanddeletes. Inthesubsequentsectionswediscussconstruction,maintenance,andqueryingofthesethreetypesofindexes. Wedetailconstructionandmaintenanceofasegment-basedindexstructureinthissection. 3 fromChapter 3 duringstart-uptollthereservoir.Everytimethebufferaccumulatesthedesirednumberofrecords,itissegmentedandushedtothedisk.WebuildaB+-treeindexforeachsegmentjustbeforetheyarewrittenouttothedisk.Foreachbufferedrecordofasegmentweconstructanindexrecord.Anindexrecordiscomprisedofthevalueoftheattributeonwhichtheindexisgettingbuilt(thekeyvalue)andthepositionofthebufferedrecordonthedisk.Thepositionisstoredasanumberpair:apagenumberandoffsetwithinapage.TheindexrecordsarethenusedtocreateanindexusingthebulkinsertionalgorithmforaB+-Tree.Weuseasimplearray-baseddatastructuretokeep 91 PAGE 92 RatherthanmaintainingaleforeachB+-Treecreated,weorganizemultipleB+-Treesonasinglediskle.Wereferthissingleleasindexle.Theindexle,inasense,issimilartothelog-structurelesystemproposedbyOusterhout[ 45 ].Inlog-structuredlesystem,aslesaremodied,thecontentsarewrittenouttothediskaslogsinasequentialstream.Thisallowswritesinfull-cylinderunits,withonlytrack-to-trackseeks.Thusthediskoperatesatnearlyitsfullbandwidth.Theindexleenjoysthesimilarperformancebenets.EverytimeaB+-Treeiscreatedforamemoryresidentsegment,itiswrittentotheindexleinasequentialstreamatthenextavailableposition.ThearraymaintainingallB+-TreerootnodesisaugmentedwiththestartingdiskpositionoftheB+-Tree. Finally,wedonotindexsegmentsthatareneverushedtothedisk.Thesesegmentsaretypicallyverysmall(asizeofadiskblock)anditisefcienttosearchthemusingsequentialmemoryscanwhengeometricleisqueried. Thealgorithmusedtoconstructandmaintainasegment-basedindexstructureisgivenasAlgorithm 10 92 PAGE 93 logc Weexpectasegment-basedindexstructuretobeacompactstructureasthereisexactlyoneindexrecordpresentintheindexstructureforeachrecordinthegeometricle,andtheindexstructureismaintainedasnewrecordsaredeletedfromthele. 93 PAGE 94 Asinthecaseofasegment-basedindexstructure,wearrangetheB+-Treeindexesondiskinasingleindexle.However,weneedaslightlydifferentapproach,becauseduringthestart-upsubsamplesareushedtothegeometricle,untilthereservoirisfull.ThereaftersubsamplesofthesamesizejBjareaddedtothereservoir.SinceeachB+-TreewillindexnomorethanjBjrecords,wecanboundthesizeofaB+-Treeindex.Weusethisboundtopre-allocateax-sizedslotondiskforeachB+-Tree.Furthermore,foreverybufferushafterthereservoirisfull,exactlyonesubsampleisaddedtotheleandthesmallestsubsampleoftheledecayscompletely,keepingthenumberofsubsamplesinageometricleconstant.Weusethisinformationtolayoutthesubsample-basedB+-Treesondiskandmaintainthemasnewrecordsaresampledfromthedatastream. Thus,iftotSubsamplesisthetotalnumbersubsamplesinR,werstallocatexed-sizetotSubsamplesslotsintheindexle.Initiallyalltheslotsareempty.Duringstart-up,asanewB+-Treeisbuilt,weseektothenextavailableslotandwriteouttheB+-Treeinasequential 94 PAGE 95 Thealgorithmusedtoconstructandmaintainasegment-basedindexstructureisgivenasAlgorithm 11 logc Asearchonsubsample-basedindexstructureinvolveslookingupallB+-Treeindexes,oneforeachsubsampleinthegeometricle.WemodifytheexistingB+-Tree-basedpointqueryandrangequeryalgorithmsandrunthemforeachentryintheB+-Treearrayoftheindexstructure.ThemodicationisrequiredtoignorethestalerecordsintheB+Trees.Asmentionedbefore,thesubsamplecorrespondingtoaB+-Treemayloseitssegments,buttheindexrecordsare 95 PAGE 96 Recallthatwehaverecordedasegmentnumberinanadditionaleldalongwitheachindexrecord.Foragivensubsample,wekeeptrackofwhichofitssegmentsaredecayedsofarandusethisinformationtoignoretheindexrecordsthatarestale.Wereturnsallvalidindexrecordsthatsatisfythesearchcriteria.Werstsorttheseindexrecordsbytheirpagenumberattributeandthenthenretrievetheactualrecordsfromthegeometricleandreturnthemasaqueryresult. Although,thesubsample-basedindexstructuremaintainsandmustsearchfarfewerB+Treescomparedtothesegment-basedindexstructure,weexceptreasonablesearchtimeperB+-Treeduetothesmallersizeandlazydeletionpolicy. 44 ].TheLSM-Treeisadisk-baseddatastructuredesignedtoprovidelow-costindexinginanenvironmentwithahighrateofinsertsanddeletes. 44 ]. AlthoughC1(andhigher)componentsaredisk-resident,themostfrequentlyreferrednodes(ingeneralnodesathigherlevel)ofthesetreesarebufferedinmainmemoryforperformancereasons. 96 PAGE 97 WhenevertheC0componentreachesathresholdsizeanongoingrollingmergeprocessremovessomerecords(acontiguoussegment)fromtheC0componentandmergesthemintotheC1componentondisk.ThetherollingmergeprocessisdepictedpictoriallyinFigure2.2oftheoriginalLSM-Treepaper[ 44 ].TherollingmergeisrepeatedformigrationbetweenhighercomponentsofanLSM-Treeinsimilarmanner.Thus,thereisacertainamountofdelaybeforerecordsintheC0componentmigrateouttothedisk-residentC1andhighercomponents.Deletionsareperformedconcurrentlyinbatchfashionsimilartoinserts. ThediskresidentcomponentsofanLSM-treearecomparabletoaB+-treestructure,butareoptimizedforsequentialdiskaccess,withnodes100%full.Lowerlevelsofthetreearepackedtogetherincontiguous,multi-pagediskblocksforbetterI/Operformanceduringtherollingmerge. WeusetheexistingLSM-Tree-basedpointqueryandrangequeryalgorithmstoperformindexlook-ups.Asincaseofpreviouslyproposedindexstructures,wesortthevalidindex 97 PAGE 98 InChapter 7 ,weevaluateandcomparethethreeindexstructuressuggestedinthischapterexperimentallybymeasuringbuildtimeanddiskfootprintasnewrecordsareinsertedintothegeometricle.Wealsocomparetheefciencyofthesestructuresforpointandrangequeries. 98 PAGE 99 Inthischapter,wedetailthreesetsofbenchmarkingexperiments.Intherstsetofexperi-ments,weattempttomeasuretheabilityofthegeometricletoprocessahigh-speedstreamofdatarecords.Inthesecondsetofexperiments,weexaminethevariousalgorithmsforproducingsmallersamplesfromalarge,disk-basedgeometricle.Finally,inthethirdsetofexperiments,wecomparethethreeindexstructuresforthegeometricleforbuildtime,diskspace,andindexlook-upspeed. 3.3 ,thegeometricle,andtheframeworkdescribedinSection 3.10 forusingmultiplegeometriclesatonce.IntheremainderofthisSection,werefertothesealternativesasthevirtualmemory,scan,localoverwrite,geole,andmultiplegeolesoptions.An0valueof0:9wasusedforthemultiplegeolesoption. AllimplementationwasperformedinC++.BenchmarkingwasperformedusingasetofLinuxworkstations,eachequippedwith2.4GHzIntelXeonProcessors.15,000RPM,80GBSeagateSCSIharddiskswereusedtostoreeachofthereservoirs.Benchmarkingofthesedisksshowedasustainedread/writerateof35-50MB/second,andanacrossthediskrandomdataaccesstimeofaround10ms. 99 PAGE 100 7-1 (a).Bynumberofsamplesprocessedwemeanthenumberofrecordsthatareactuallyinsertedintothereservoir,andnotthenumberofrecordsthathavepassedthroughthedatastream. 7-1 (b).Thus,wetesttheeffectofrecordsizeontheveoptions. 7-1 (c).Thisexperimentteststheeffectofaconstrainedamountofmainmemory. Itisworthwhiletopointoutafewspecicndings.Eachoftheveoptionswritestherst50GBofdatafromthestreammoreorlessdirectlytodisk,asthereservoirislargeenoughtoholdallofthedataaslongasthetotalislessthan50GB.However,Figure 7-1 (a)and(b)showthatonlythemultiplegeolesoptiondoesnothavemuchofadeclineinperformanceafterthereservoirlls(atleastinExperiments1and2).Thisiswhythescanandvirtualmemoryoptionsplateauaftertheamountofdatainsertedreaches50GB.ThereissomethingofadeclineinperformanceinallofthemethodsoncethereservoirllsinExperiment3(withrestrictedbuffermemory),butitisfarlesssevereforthemultiplegeolesoptionthanfortheotheroptions. 100 PAGE 101 Resultsofbenchmarkingexperiments(Processinginsertions). 101 PAGE 102 Resultsofbenchmarkingexperiments(Samplingfromageometricle). 102 PAGE 103 Asexpected,thelocaloverwriteoptionperformsverywellearlyon,especiallyinthersttwoexperiments(seeSection 3.3 foradiscussionofwhythisisexpected).EvenwithlimitedbuffermemoryinExperiment3,ituniformlyoutperformsasinglegeometricle.Furthermore,withenoughbuffermemoryinExperiments1and2,thelocaloverwriteoptioniscompetitivewiththemultiplegeolesoptionearlyon.However,fragmentationbecomesaproblemandperformancedecreasesovertime.Unlessofinere-randomizationoftheleispossibleperiodically,thisdegradationprobablyprecludeslong-termuseofthelocaloverwriteoption. ItisinterestingthatasdemonstratedbyExperiment3(andexplainedinSection 3.8 )asinglegeometricleisverysensitivetotheratioofthesizeofthereservoirtotheamountofavailablememoryforbufferingnewrecordsfromthestream.ThegeoleoptionperformswellinExperiments1and2whenthisratiois100,butratherpoorlyinExperiment3whentheratiois1000. Finally,wepointoutthegeneralunusabilityofthescanandvirtiualmemoryoptions.scangenerallyoutperformedvirtualmemory,butbothgenerallydidpoorly.Exceptinexperiment1withlargememoryandsmallrecordsize,withthesetwooptionsmorethan97%oftheprocessingofrecordsfromthestreamoccursinthersthalfhourasthereservoirlls.Inthe19:5hoursorsoafterthereservoirrstlls,onlyatinyfractionofadditionalprocessingoccursduetotheinefciencyofthetwooptions. 4.1 wegaveanupperboundforthedistancebetweentheactualbiasfunctionf0computedusingourreservoiralgorithm,andthedesired,user-denedbiasfunctionf.Whileuseful,thisbounddoesnottelltheentirestory.Intheend,whatauserofabiasedsamplingalgorithmisinterestedinisnothowclosethebiasfunctionthatisactuallycomputedistotheuser-speciedone,butinsteadthekeyquestioniswhatsortofeffectanydeviationhasonthe 103 PAGE 104 Sumqueryestimationaccuracyforzipf=0.2. particularestimationtaskthatistobeperformed.Perhapstheeasiestwaytodetailthepracticaleffectofapathologicaldataorderingisthroughexperimentation. Inthissectionwepresenttheexperimentalresultsevaluatingpracticalsignicanceofaworst-casedataordering.Specically,wedesignasetofexperimentstocomputetheerror(variance)onewouldexpectwhensamplingfortheanswertoaSUMqueryinfollowingtherescenarios: 1. Whenabiasedsampleiscomputedusingourreservoiralgorithmwiththedataorderedsoastoproducenooverweightrecords. 2. Whenanunbiasedsampleiscomputedusingtheclassicalreservoirsamplingalgorithm. 3. Whenabiasedsamplecomputedusingourreservoiralgorithm,withrecordsarrangedsoastoproducethebiasfunctionfurthestfromtheuser-speciedone,asdescribedbytheTheorem 1 Byexaminingtheresults,itshouldbecomeclearexactlywhatsortofpracticaleffectontheaccuracyofanestimatoronemightexpectduetoapathologicalordering. 104 PAGE 105 Sumqueryestimationaccuracyforzipf=0.5. AttributeBistheattributethatisactuallyaggregatedbytheSUMquery.EachsetisgeneratedsothatattributesAandBbothhaveacertainamountofZipanskew,speciedbytheparameterzipf.Ineachcase,thebiasfunctionfisdenedsoastominimizethevarianceforaSUMqueryevaluatedoverattributeA. Inadditiontotheparameterzipf,eachdatasetalsohasasecondparameterwhichwetermthecorrelationfactor.ThisistheprobabilitythatattributeAhasthesamevalueasattributeB.Ifthecorrelationfactoris1,thenAandBareidentical,andsincethebiasfunctionisdenedsoastominimizethevarianceofaqueryoverA,thebiasfunctionalsominimizesthevarianceofanestimateovertheactualqueryattributeB.Thus,acorrelationfactorof1providesforaperfectbiasfunction.Asthecorrelationfactordecreases,thequalityofthebiasfunctionforaqueryoverattributeBdeclines,becausethechanceincreasesthatarecorddeemedimportantbylookingatattributeAis,infact,onethatshouldnotbeincludedinthesample.Thismodelsthecasewhereonecanonlyguessatthecorrectbiasfunctionbeforehandforexample,whenquerieswithanarbitraryrelationalselectionpredicatemaybeissued.Asmallcorrelationfactorcorrespondstothecasewhentheguessed-atbiasfunctionisactuallyveryincorrect. 105 PAGE 106 Sumqueryestimationaccuracyforzipf=0.8. Bytestingeachofthethreedifferentscenariosdescribedintheprevioussubsectionoverasetofdatasetscreatedbyvaryingzipfaswellasthecorrelationfactor,wecanseetheeffectofdataskewandofbiasfunctionqualityontherelativequalityoftheestimatorproducedbyeachofthethreescenarios. Foreachexperiment,wegenerateadatastreamofonemillionrecordsandobtainasampleofsize1000.Foreachofthethreescenariosandeachofthedatasetsthatwetest,werepeatthesamplingprocess1000timesoverthesamedatastreaminMonte-Carlofashion.Thevarianceofthecorrespondingestimatorisreportedastheobservedvarianceofthe1000estimates.TheobservedMonte-CarlovariancesaredepictedinFigures 7-3 7-4 7-5 ,and 7-6 106 PAGE 107 Sumqueryestimationaccuracyforzipf=1. thatevenforveryskeweddatasets,itisdifcultforevenanadversarytocomeupwithadataorderingthatcansignicantlyalterthequalityoftheuser-denedbiasfunction. Wealsoobservethatforalowzipfparameterandalowcorrelationfactor,unbiasedsamplingoutperformsbiasedsampling.Inotherwords,itisactuallypreferablenottobiasinthiscase.ThisisbecausethelowzipfvalueassignsrelativelyuniformvaluestoattributeB,renderinganoptimalbiasedschemelittledifferentfromuniformsampling.Furthermore,asthecorrelationfactordecreases,theweightingschemeusedbothbiasedsamplingschemesbecomeslessaccurate,hencethehighervariance.Astheweightingschemebecomesveryinaccurate,itisbetternottobiasatall.Notsurprisingly,therearemorecaseswherethebiasedschemeunderthepathologicalorderingisactuallyworsethantheunbiasedscheme.However,asthecorrelationfactorincreasesandthebiasschemebecomesmoreaccurate,itquicklybecomespreferabletobias. 5 .Specically,wehavecomparedthenaivebatchsamplingandtheonlinesamplingalgorithmsagainstageometriclestructurebasedbatchsamplingandonline 107 PAGE 108 7-2 (a)depictstheplotforasinglegeometricle;Figure 7-2 (b)showsananalogousplotforthemultiplegeometriclesoption. 7-2 (c)forboththenaivealgorithmandthemoreadvanced,geometriclestructurebasedalgorithmdesignedtoincreasethesamplingrateandevenouttheresponsetimes.TheanalogousplotformultiplegeometriclecaseisshowninFigure 7-2 (d).WealsoplotthevarianceinresponsetimesoverallcallstoGetNextasafunctionofthenumberofcallstoGetNextinFigures 7-2 (e)and 7-2 (f)(therstisforasinglegeometricle;thesecondiswithmultipleles).Takentogether,theseplotsshowthetrade-offbetweenoverallprocessingtimeandthepotentialforwaitingforalongtimeinordertoobtainasinglesample. 108 PAGE 109 Asexpectedandthendemonstratedbyvarianceplots,thevarianceofonlinenaiveapproachissmallerthangeometriclestructurebasedalgorithm.Althoughwiththislittlelargervariance(lessthan10timesfor100ksamples)intheresponsetimes,thestructurebasedapproachexecutedorderofmagnitudefaster(morethan100timesfor100ksamples)thanthenaiveapproachforanynumberofrecordssampled,justifyingourapproachofminimizingtheaveragesquaresumoftheresponsetime.Inotherwords,wegotenoughaddedspeedforasmallenoughaddedvarianceinresponsetimetomakethetrade-offacceptable.Asmoreandmoresamplesareobtainedthevarianceofstructurebasedalgorithmapproachedvarianceofthenaivealgorithmmakingthetrade-offevenmorereasonableforlargeintendedsamplesizes. Finally,wepointoutthatboththegeometriclestructurebasedalgorithms,batchandonlinecase,wereabletoreadsamplerecordsfromdiskalmostatthemaximumsustainedspeedoftheharddisk,ataround45MB/sec.Thisiscomparabletotherateofasequentialreadfromdisk,thebestwecanhopefor. 109 PAGE 110 Diskfootprintfor1KBrecordsize Table7-1. Millionsofrecordsinsertedin10hrs Subsample-Based Segment-Based LSM-Tree 13700 12550 10960 9680 12810 7230 8030 2930 6 weintroducedthreeindexstructuresforthegeometricle:thesegment-based,thesubsample-based,andtheLSM-tree-basedindexstructure.Inthissection,weexperimentallyevaluateandcomparethesethreeindexstructuresbymeasuringbuildtimeanddiskfootprintasnewrecordsareinsertedinthegeometricle.Wealsocompareefciencyofthesestructuresforpointandrangequeries.Alloftheindexstructureswereimplementedontopofthegeometricleprototypethatwasbenchmarkedintheprevioussections. 110 PAGE 111 6 .Thetenhoursofinsertioninthegeometricleensuresthatareasonablenumberofinsertionsanddeletionsareperformedonanindexstructure.Givensuchale,wecollectedfollowingthreepiecesofinformationforeachofthethreeindexstructuresunderconsideration. Withthesemetricsinmindweperformedfollowingtwosetsofexperiments: 7.4 ,thediskspaceusedbythreeindexstructureisplottedinFigure 7-7 ,andtheindexlook-upspeedintabulatedinTable 7.4.1 111 PAGE 112 Diskfootprintfor200Brecordsize speedareshowninTable 7.4 ,thediskspaceusedbythreeindexstructureisplottedinFigure 7-8 ,andtheindexlook-upspeedintabulatedinTable 7.4.2 .Thus,wetesttheeffectofrecordsizeonthethreeindexstructure. Table 7.4 showsmillionsofrecordsinsertedintogeometricleaftertenhoursofinsertionsandconcurrentupdatestotheindexstructure.Forcomparisonwepresentthenumberofrecordsinsertedintoageometriclewhennoindexstructureismaintained(thenoindexcolumn).Itisclearthatthesubsample-basedindexstructureperformsthebestoninsertions,withperformancecomparabletothenoindexoption.Thisdifferencereectsthecostofconcurrentlymaintainingtheindexstructure.Thesegmentbasedindexstructuredoesthenextbest.Itisslowerthanthesubsample-basedindexstructurebecauseofhighernumberofseeksperformedduringthestart-up.Recallthatduringstart-upthesegment-basedindexmustwriteaB+-treeforeachsegment. 112 PAGE 113 Querytimingresultsfor1krecord,jRj=10million,andjBj=50k Selectivity IndexTime FileTime TotalTime PointQuery 38.2890 0.0226 38.3116 40.2477 0.1803 40.2480 43.2856 0.8766 44.1622 45.6276 6.2571 51.8847 Subsample-Based PointQuery 0.87551 0.02382 0.89937 1.12740 0.15867 1.28607 1.74911 1.10544 2.85455 2.09980 5.96637 8.06617 LSM-Tree PointQuery 0.00012 0.01996 0.02008 0.00015 0.01263 0.01278 0.00019 0.79358 0.79377 0.00056 5.82210 5.82266 Oncethereservoirisinitialized,boththesegment-basedandthesubsample-basedindexstructureperformanequalnumberofdiskseeks.Finally,theLSM-tree-basedindexstructureisslowestamongstthethree.TheLSM-treemaintainstheindexbyprocessinginsertionsanddeletionsmoreaggressivelythanothertwooptions,demandingmorerollingmergesandmorediskseeksperbufferush. Table 7.4 alsoshowstheinsertionguresforthesmaller,200Brecordsize.Notsurprisingly,allthreeindexstructuresshowssimilarinsertionpatterns,butsincetheyhavetoprocessalargernumberofrecordstheinsertionratesareslowerthaninthecaseofthe1KBrecordsize.Wealsoobservedandplottedthediskfootprintsizeforthreeindexstructures(Figure 7-7 andFigure 7-8 ).Asexpected,allthreeindexstructuresinitiallygrowfairlyquickly.Thesegment-basedandthesubsample-basedindexstructuresstabilizesoonafterthereservoirislled,whereastheLSM-Tree-basedstructurestabilizesalittlelaterwhentheremovalofstalerecordsfromtherollingmergesstabilizes. Thesubsample-basedindexstructurehasthelargestfootprint(almost1=5thofthegeometriclesize).ThisisexpectedasstaleindexrecordsisremovedfromtheB+-treesonlywhenthe 113 PAGE 114 Querytimingresultsfor200bytesrecord,jRj=50million,andjBj=250k Selectivity IndexTime FileTime TotalTime PointQuery 6.2488 0.0338 6.2826 9.6186 0.1267 9.7453 12.9885 0.9288 13.9173 17.6891 5.9754 23.6645 Subsample-Based PointQuery 2.50717 0.0156 2.5227 4.92744 0.1763 5.1037 7.2387 0.8637 8.1024 9.9837 6.1363 16.1200 LSM-Tree PointQuery 0.00505 0.0174 0.0224 0.00967 0.1565 0.1661 0.01440 0.8343 0.8487 0.05987 4.9961 5.0559 entiresubsampledecays.Ontheotherhand,thesegment-basedindexstructurehasthesmallestfootprintasateverybufferushallstalerecordsareremovedfromtheindexstructure.Thisresultsinaverycompactindexstructure.ThediskspaceusageoftheLSM-Tree-basedindexstructureliesbetweenthesetwoindexstructures.Althoughateveryrollingmerge,stalerecordsareremovedfromthepartofindexstructurethatismerging,notallofthestalerecordsinthestructureareremovedallatonce.Assoonastherateofremovalofstalerecordsstabilizesthediskfootprintalsobecomesstable. Finally,wecomparedtheindexlook-upspeedofthesethreeindexstructures.Wereportindexlook-upandgeometricleaccesstimesfordifferentselectivityqueries.Asexpected,thegeometricleaccesstimeremainsconstantirrespectiveoftheindexstructureoptionandincreaseslinearlyasthequeryproducesmoreoutputtuples.Theindexlook-uptimevariedforthethreeindexstructures.Thesegment-basedindexstructure(theslowest)wasanorderofmagnitudeslowerthantheLSM-Tree-basedindexstructure(thefastest).Thisismainlybecausethesegment-basedindexstructurerequiresindexlookupsinseveralthousandB+-Treesforanyselectivityquery,wheretheLSM-Tree-basedstructureusesasingeLSM-Tree,requiringasmall,constantnumberofseeks.Theperformanceofthesubsample-basedindexstructureliesin 114 PAGE 115 Ingeneralthesubsample-basedindexstructuregivesthebestbuildtimewithreasonableindexlook-upspeedatthecostofslightlylargerdiskfootprint.TheLSM-Tree-basedindexstructuremakesuseofreasonablediskspaceandgivesthebestqueryperformanceatthecostofslowinsertionrateorbuildtime.Thesegment-basedindexstructuregivescomparablebuildtimeandhasthemostcompactdiskfootprint,butsuffersconsiderablywhenitcomestoindexlook-ups. 115 PAGE 116 Randomsamplingisaubiquitousdatamanagementtool,butrelativelylittleresearchfromthedatamanagementcommunityhasbeenconcernedwithhowtoactuallycomputeandmaintainasample.Inthisdissertationwehaveconsideredtheproblemofrandomsamplingfromadatastream,wherethesampletobemaintainedisverylargeandmustresideonsecondarystorage.WehavedevelopedthegeometricleorganizationwhichcanbeusedtomaintainanonlinesampleofarbitrarysizewithanamortizedcostofO(!logjBj=jBj)randomdiskheadmovementsforeachnewlysampledrecord.Themultiplier!canbemadeverysmallbymakinguseofasmallamountofadditionaldiskspace. Wehavepresentedamodiedversionoftheclassicreservoirsamplingalgorithmthatisexceedinglysimple,andisapplicableforbiasedsamplingusinganyarbitraryuser-denedweightingfunctionf.Ouralgorithmcomputes,inasinglepass,abiasedsampleRi(withoutreplacement)oftheirecordsproducedbyadatastream. Wehavealsodiscussedcertainpathologicalcaseswhereouralgorithmcanprovideacorrectlybiasedsampleforaslightlymodiedbiasfunctionf0.Wehaveanalyticallyboundhowfarf0canbefromfinsuchapathologicalcase.Wehavealsoexperimentallyevaluatedthepracticalsignicanceofthisdifference. WehavealsoderivedthevarianceofaHorvitz-Thomsonestimatormakinguseofasamplecomputedusingouralgorithm.CombinedwiththeCentralLimitTheorem,thevariancecanthenbeusedtoprovideboundsontheestimator'saccuracy.TheestimatorissuitablefortheSUMaggregatefunction(and,byextension,theAVERAGEandCOUNTaggregates)overasingledatabasetableforwhichthereservoirismaintained. Wehavedevelopedefcienttechniqueswhichallowageometricletoitselfbesampledinordertoproducesmallerdataobjects.Weconsideredtwosamplingtechniques(1)abatchsamplingwhensamplesizeisknownbeforehandand(2)anonlinesamplingwhichimplementsaniterativefunctionGetNexttoretrieveasampleat-a-time.Thegoalofthesealgorithmswastoefcientlysupportfurthersamplingofageometriclebymakinguseofitsownstructure. 116 PAGE 117 117 PAGE 118 [1] A.DasJ.Gehrke,M.R.:Approximatejoinprocessingoverdatastreams.In:ACMSIGMODInternationalConferenceonManagementofData(2003) [2] Acharya,S.,Gibbons,P.,Poosala,V.:Congressionalsamplesforapproximateansweringofgroup-byqueries.In:ACMSIGMODInternationalConferenceonManagementofData(2000) [3] Acharya,S.,Gibbons,P.,Poosala,V.,Ramaswamy,S.:Joinsynopsesforapproximatequeryanswering.In:ACMSIGMODInternationalConferenceonManagementofData(1999) [4] Acharya,S.,P.B.Gibbons,V.P.,Ramaswamy,S.:Theaquaapproximatequeryansweringsystem.In:ACMSIGMODInternationalConferenceonManagementofData(1999) [5] Aggarwal,C.C.:Onbiasedreservoirsamplinginthepresenceofstreamevolution.In:VLDB'2006:Proceedingsofthe32ndinternationalconferenceonVerylargedatabases,pp.607.VLDBEndowment(2006) [6] Arge,L.:Thebuffertree:Anewtechniqueforoptimali/o-algorithms.In:InternationalWorkshoponAlgorithmsandDataStructures(1995) [7] Babcock,B.,Datar,M.,Motwani,R.:Samplingfromamovingwindowoverstreamingdata.In:SODA'02:ProceedingsofthethirteenthannualACM-SIAMsymposiumonDiscretealgorithms,pp.633.SocietyforIndustrialandAppliedMathematics(2002) [8] Babcock,B.,S.Chaudhuri,G.D.:Dynamicsampleselectionforapproximatequeryprocessing.In:ACMSIGMODInternationalConferenceonManagementofData(2003) [9] Bayer,R.,McCreight,E.M.:Organizationandmaintenanceoflargeorderedindexes.In:SIGFIDETWorkshop,pp.107(1970) [10] Brown,P.G.,Haas,P.J.:Techniquesforwarehousingofsampledata.In:ICDE'06:Proceedingsofthe22ndInternationalConferenceonDataEngineering(ICDE'06),p.6.IEEEComputerSociety,Washington,DC,USA(2006) [11] C.FanM.Muller,I.R.:Developmentofsamplingplansbyusingsequential(itembyitem)techniquesanddigitalcomputersi.In:JournalofAmericanStatisticalAssociation,pp.57:387(1962) [12] C.JermaineA.Datta,E.O.:Anovelindexsupportinghighvolumedatawarehouseinsertion.In:InternationalConferenceonVeryLargeDataBases(1999) [13] C.JermaineE.Omiecinski,W.Y.:Thepartitionedexponentiallefordatabasestoragemanagement.In:InternationalConferenceonVeryLargeDataBases(1999) [14] Chaudhuri,S.,Das,G.,Datar,M.,Motwani,R.,Narasayya,V.:Overcominglimitationsofsamplingforaggregationqueries.In:ICDE(2001) 118 PAGE 119 [15] Chaudhuri,S.,Das,G.,Narasayya,V.:Arobust,optimization-basedapproachforapprox-imateansweringofaggregatequeries.In:ACMSIGMODInternationalConferenceonManagementofData(2001) [16] Cochran,W.:SamplingTechniques.WileyandSons(1977) [17] Council,T.P.:TPC-HBenchmark.http://www.tpc.org(2004) [18] Cranor,C.,Gao,Y.,Johnson,T.,Shkapenyuk,V.,Spatscheck,O.:Gigascopehighper-formancenetworkmonitoringwithansqlinterface.In:ACMSIGMODInternationalConferenceonManagementofData(2002) [19] Cranor,C.,Johnson,T.,Spatscheck,O.,Shkapenyuk,V.:Gigascope:Astreamdatabasefornetworkapplications.In:ACMSIGMODInternationalConferenceonManagementofData(2003) [20] Cranor,C.,Johnson,T.,Spatscheck,O.,Shkapenyuk,V.:Thegigascopestreamdatabase.In:IEEEDataEngineeringBulletin,pp.26(1):27(2003) [21] Dobra,A.,Garofalakis,M.,Gehrke,J.,Rastogi,R.:Processingcomplexaggregatequeriesoverdatastreams.In:ACMSIGMODInternationalConferenceonManagementofData(2002) [22] Dufeld,N.,Lund,C.,Thorup,M.:Chargingfromsamplednetworkusage.In:IMW'01:Proceedingsofthe1stACMSIGCOMMWorkshoponInternetMeasurement,pp.245.ACMPress,NewYork,NY,USA(2001) [23] Estan,C.,Naughton,J.F.:End-biasedsamplesforjoincardinalityestimation.In:ICDE'06:Proceedingsofthe22ndInternationalConferenceonDataEngineering(ICDE'06),p.20.IEEEComputerSociety,Washington,DC,USA(2006) [24] Estan,C.,Varghese,G.:Newdirectionsintrafcmeasurementandaccounting:Focusingontheelephants,ignoringthemice.ACMTrans.Comput.Syst.21(3),270(2003) [25] F.Olken,D.R.:Randomsamplingfromb+trees.In:InternationalConferenceonVeryLargeDataBases(1989) [26] F.Olken,D.R.:Randomsamplingfromdatabaseles-asurvey.In:InternationalWorkingConferenceonScienticandStatisticalDatabaseManagement(1990) [27] F.OlkenD.Rotem,P.X.:Randomsamplingfromhashes.In:ACMSIGMODInterna-tionalConferenceonManagementofData(1990) [28] Ganguly,S.,Gibbons,P.,Matias,Y.,Silberschatz,A.:Bifocalsamplingforskew-resistantjoinsizeestimation.In:ACMSIGMODInternationalConferenceonManagementofData(1996) PAGE 120 [29] Gemulla,R.,Lehner,W.,Haas,P.J.:Adipinthereservoir:maintainingsamplesynopsesofevolvingdatasets.In:VLDB'2006:Proceedingsofthe32ndinternationalconferenceonVerylargedatabases,pp.595.VLDBEndowment(2006) [30] Gunopulos,D.,Kollios,G.,Tsotras,V.,Domeniconi,C.:Approximatingmulti-dimensionalaggregaterangequeriesoverrealattributes.In:ACMSIGMODInternationalConferenceonManagementofData(2000) [31] Haas,P.:Theneedforspeed:Speedingupdb2usingsampling.In:IDUGSolutionsJournal(2003) [32] Haas,P.J.,Hellerstein,J.M.:RipplejoinsforOnlineAggregation.In:ACMSIGMODInternationalConferenceonManagementofData,pp.287298(1999) [33] Hellerstein,J.M.,Haas,P.J.,Wang,H.J.:OnlineAggregation.In:ACMSIGMODInternationalConferenceonManagementofData,pp.171(1997) [34] J.GehrkeF.Korn,D.S.:Oncomputingcorrelatedaggregatesovercontinualdatastreams.In:ACMSIGMODInternationalConferenceonManagementofData(2001) [35] Jermaine,C.:Robustestimationwithsamplingandapproximatepre-aggregation.In:InternationalConferenceonVeryLargeDataBases(2003) [36] Jermaine,C.,Pol,A.,Arumugam,S.:Onlinemaintenanceofverylargerandomsamples.In:ACMSIGMODInternationalConferenceonManagementofData,pp.299(2004) [37] J.M.HellersteinR.Avnur,V.R.:Informixundercontrolonlinequeryprocessing.In:DataMiningandKnowledgeDiscovery,pp.4(4):281(2000) [38] Joens,T.:Anoteonsamplingfromatapele.In:CommunicationsoftheACM,p.5:343(1964) [39] J.S.Vitter,M.W.:Approximatecomputationofmultidimensionalaggregatesofsparsedatausingwavelets.In:ACMSIGMODInternationalConferenceonManagementofData(1999) [40] Kolonko,M.,Wasch,D.:Sequentialreservoirsamplingwithanonuniformdistribution.ACMTrans.Math.Softw.32(2),257(2006).DOIhttp://doi.acm.org/10.1145/1141885.1141891 [41] Manku,G.S.,Motwani,R.:Approximatefrequencycountsoverdatastreams.In:VLDBConference(2002) [42] N.L.Johnson,S.K.:DiscreteDistributions.HoughtonMifin(1969) [43] Olken,F.:RandomSamplingfromDatabases.In:Ph.D.Dissertation(1993) [44] O'Neil,P.,Cheng,E.,Gawlick,D.,O'NeilJ,E.:Thelog-structuredmerge-tree.In:ActaInformatica,pp.33:351(1996) PAGE 121 [45] Ousterhout,J.K.,Douglis,F.:Beatingthei/obottleneck:Acaseforlog-structuredlesystems.OperatingSystemsReview23(1),11(1989) [46] P.B.GibbonsY.Matias,V.P.:Fastincrementalmaintenanceofapproximatehistograms.In:ACMTransactionsonDatabaseSystems,pp.27(3):261(2002) [47] Pol,A.,Jermaine,C.:Biasedreservoirsampling.IEEETransactionsonKnowledgeandDataEngineering [48] Pol,A.,Jermaine,C.,Arumugam,S.:Maintainingverylargerandomsamplesusingthegeometricle.VLDBJ(2007) [49] Shao,J.:MathematicalStatistics.Springer-Verlag(1999) [50] Thompson,M.E.:TheoryofSampleSurveys.ChapmanandHall(1997) [51] Toivonen,H.:Samplinglargedatabasesforassociationrules.In:InternationalConferenceonVeryLargeDataBases(1996) [52] V.GantiM.-L.Lee,R.R.:Icicles-self-tuningsamplesforapproximatequeryanswering.In:InternationalConferenceonVeryLargeDataBases(2000) [53] Vitter,J.:Randomsamplingwithareservoir.In:ACMTransactionsonMathematicalSoftware(1985) [54] Vitter,J.:Anefcientalgorithmforsequentialrandomsampling.In:ACMTransactionsonMathematicalSoftware,pp.13(1):58(1987) PAGE 122 AbhijitPolwasbornandbroughtupinstateofMaharashtrainIndia.HereceivedhisBachelorofEngineeringfromGovernmentCollegeofEngineeringPune(COEP)UniversityofPune,oneofthemostprestigiousandoldestengineeringcollegeinIndia,in1999.Abhijitmajoredinmechanicalengineeringandobtainedadistinguishedrecord.Herankedsecondintheuniversitymeritranking.HewasemployedintheResearchandDevelopmentdepartmentofKirloskarOilEnginesLtdforoneyear.AbhijitreceivedhisrstMasterofSciencefromUniversityofFloridain2002.Hemajoredinindustrialandsystemsengineering.AbhijitthenworkedasaresearcherintheDepartmentofComputerandInformationScienceandEngineeringattheUniversityofFlorida.HereceivedhissecondMasterofScienceandDoctorofPhilosophy(Ph.D)incomputerengineeringin2007.DuringhisstudiesatUniversityofFlorida,AbhijitcoauthoredatextbooktitledDevelop-ingWeb-EnabledDecisionSupportSystems.HetaughttheWeb-DSScourseseveraltimesintheDepartmentofIndustrialandSystemsEngineeringattheUniversityofFlorida.HepresentedseveraltutorialsatworkshopsandconferencesontheneedandimportanceofteachingDSSmaterial,andhealsotaughtattwoinstructor-trainingworkshopsonDSSdevelopment.Abhijit'sresearchfocusisintheareaofdatabases,withspecialinterestsinapproximatequeryprocessing,physicaldatabasedesign,anddatastreams.HehaspresentedresearchpapersatseveralprestigiousdatabaseconferencesandperformedresearchattheMicrosoftResearchLab.HeisnowaSeniorSoftwareEngineerintheStrategicDataSolutionsgroupatYahoo!Inc. 122 |