
Maintaining Very Large Samples Using the Geometric File

Permanent Link: http://ufdc.ufl.edu/UFE0021132/00001

Material Information

Title: Maintaining Very Large Samples Using the Geometric File
Physical Description: 1 online resource (122 p.)
Language: english
Creator: Pol, Abhijit A
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2007

Subjects

Subjects / Keywords: biased, databases, file, indexing, sampling
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Sampling is one of the most fundamental data management tools available. It is one of the most powerful methods for building a one-pass synopsis of a data set, especially in a streaming environment where the assumption is that there is too much data to store all of it permanently. However, most current research involving sampling considers the problem of how to use a sample, not how to compute one. The implicit assumption is that a 'sample' is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples, gigabytes or terabytes in size, can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples in an online manner from streaming data. We present a new data organization called the geometric file, along with online algorithms for maintaining very large, on-disk samples. The algorithms are designed for any environment where a large sample must be maintained online in a single pass through a data set. The geometric file organization meets the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. We modify the classic reservoir sampling algorithm to compute a fixed-size sample in a single pass over a data set, where the goal is to bias the sample using an arbitrary, user-defined weighting function, and we describe how the geometric file can be used to perform this biased reservoir sampling. While a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. We therefore develop efficient techniques that allow a geometric file itself to be sampled in order to produce smaller data objects. Efficiently searching and discovering information in the geometric file is essential for query processing; a natural way to support this is to build an index structure. We discuss three secondary index structures and their maintenance as new records are inserted into a geometric file.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Abhijit A Pol.
Thesis: Thesis (Ph.D.)--University of Florida, 2007.
Local: Adviser: Jermaine, Christopher.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2007
System ID: UFE0021132:00001


64767 F20101129_AAAIEO pol_a_Page_025.pro
d209a6cac32f216303f55d0b9759da8f
0e3137b595bdc8249d0a5a893f5aa79217bebb6e
5722 F20101129_AAAJHQ pol_a_Page_038thm.jpg
4d0e36aa15909d7a97965531128cc2b4
61591c11ef49d74abaef9eae2c694e691339dd50
1051911 F20101129_AAAIFE pol_a_Page_042.jp2
e0aa9657ea92080dff72329271dbd14e
3143840c2b311724fdaddff8432a8b0fe2366b4f
27346 F20101129_AAAJIG pol_a_Page_006.QC.jpg
a9bc2af1eb8436daecbc3865e1894b31
55bbbecc75320889164423c9d263d358feb2d120
133269 F20101129_AAAIEP pol_a_Page_086.jp2
da5548defbb67edc2df6cffdc6d111ba
e15ba1a70ef3e1258a14b0387a4beee71e992d9c
27214 F20101129_AAAJHR pol_a_Page_042.QC.jpg
1b2866b402d170c2e703a802eee3cb00
fe61331415814753dd849efb1eefe2764ae1668f
F20101129_AAAIFF pol_a_Page_092.jp2
ce720c9fe99a44a4cab9e9640a46dfe7
1ba55d56f76d56307da3764e129072aa82dc2a51
6396 F20101129_AAAJIH pol_a_Page_006thm.jpg
aae429cdd898e7e968f96e296319b1dc
4fcea5ec518e487b1a5aaf37de0269d0837de073
F20101129_AAAIEQ pol_a_Page_095.tif
cf46f4adfd9380ee1ef29a166752578d
f2bdad96ef543a69326244f2c5043d070899f4ce
4749 F20101129_AAAJHS pol_a_Page_071thm.jpg
996ce955beb0b909385b0e31088e2524
e7420d20928b4e941e031d7ecc2e101b0591c3de
21578 F20101129_AAAIFG pol_a_Page_054.QC.jpg
43fa79ae9a60a979c581fb356b36f17e
27f181ce015ffc2d85ea7c274671ecba9fc026d2
2540 F20101129_AAAJII pol_a_Page_008thm.jpg
a0cd3221f94bfab85f91def7956516fc
47181a55d0753e8030967975f699cf0df452d5b0
2093 F20101129_AAAIER pol_a_Page_036.txt
f62b06e074641912744bd460d2374d24
58f68e22b0b9450a6ae8f8d10aadd78671b76a8c
6981 F20101129_AAAJHT pol_a_Page_001.QC.jpg
ad6ce50a6d1071fb930ff3615d2aa235
124d424f6325c075252a78a27f273190f5727dd5
27295 F20101129_AAAIFH pol_a_Page_081.QC.jpg
7c5eebcff1814c99751d24970a45ceb4
43e6aa0f3e22182bf6ed36a34be1eeb703bb92d1
15172 F20101129_AAAJIJ pol_a_Page_009.QC.jpg
724052da585e957ed0e54eb647d3034d
ca99ee58b44b01b4e74bdfbab2205f25f87aa913
F20101129_AAAIES pol_a_Page_034.txt
6f7a5448a3903aff4d02ea54a7a13ae4
54b60bdd49293edd375031c9a62e026e9f5682f3
7416 F20101129_AAAJHU pol_a_Page_080thm.jpg
c21279a05bfe2eb1be98b86b185a92ed
13805d76684a0f60b50f79724f3a08484f7066ef
22803 F20101129_AAAIET pol_a_Page_106.QC.jpg
bc50f572e3a1fb0bb38f9c5d316c7d92
70d082fea533ad6d046260cabb835cbef458f53c
19105 F20101129_AAAJHV pol_a_Page_073.QC.jpg
aef3ba7f5ed359e1882e1a34f7b1db80
4bb4ba642790500f8dd1a856611e64107f9ad095
44846 F20101129_AAAIFI pol_a_Page_106.pro
4ca6dc5f9858b62c93a79624c3f77926
58d96b3dbb0dae2f244613b1022115803c94bac5
22309 F20101129_AAAJIK pol_a_Page_010.QC.jpg
61cefdd0d4ae536f24f0dd90da4a69c3
5bdf3a0536e9aa834f85c351bc07d42bce367101
7552 F20101129_AAAIEU pol_a_Page_023thm.jpg
9d28160019cdc72050a7cb5b45d67965
b745b26f09dcfcfd9ca759dd5fef6ffe91618ec4
15356 F20101129_AAAJHW pol_a_Page_121.QC.jpg
05b6ee0accd2a2ad1ef28f5be7611846
dc9d70d89509378f9cf45e81d2f912b9c5c6a219
31802 F20101129_AAAJJA pol_a_Page_035.QC.jpg
95f8760664194863c4855b5cdf8d8ad8
26c0969c82121a28d0c6f006cdb339a5a76a052f
7198 F20101129_AAAIFJ pol_a_Page_042thm.jpg
df7fc4579deb49263a730b6ff1194600
56a56fa7e12c0bdc8d5475de3c57aff4f4d1772f
7127 F20101129_AAAJIL pol_a_Page_013thm.jpg
60c04f780407132be7ef569df7d4cfee
88bf250ec10ebc3a7863f2a4f7abf91dd1084f0e
2289 F20101129_AAAIEV pol_a_Page_001thm.jpg
c6ec37bb532f4cbf1dd509414ad6133e
1f9b4c18003a6af7986cc42c646632d779d702f6
5231 F20101129_AAAJHX pol_a_Page_102thm.jpg
31cbacea019f5ba6b34442d1441fe3a5
ee4ab074b59ebc88d1c4cbcb36c53e0db606f554
22783 F20101129_AAAJJB pol_a_Page_036.QC.jpg
0d2dd1a5a0a84c533fb1018b4bc5be21
e409ccf6caa6aa1a4b01d8296204cd388fb66194
66619 F20101129_AAAIFK pol_a_Page_122.jpg
1eb67b1185c14c6a197fb6189ec573a3
e0ef44ad99e84c5289d0d861f2da732ecda11167
28230 F20101129_AAAJIM pol_a_Page_015.QC.jpg
9b4912a6eae68aa19eb36c39ba293366
07e4dfdf572e9422bed8e5309a2461cc902edf95
56749 F20101129_AAAIEW pol_a_Page_016.pro
b735045d76ed1b94e002ac2b77f639c4
8764aed0a1f8f407aa4100bb882f6fbf65a221c6
4987 F20101129_AAAJHY pol_a_Page_004thm.jpg
39270af85393d44a8c7cc8312fe9bb01
3cdb5cfbb74685fc4087f904a23289c03cfc0fee
6277 F20101129_AAAJJC pol_a_Page_036thm.jpg
ac5885ece83267671c5b59eee8c096cc
9160ca6892e1f7ee8eb95ffd92fade7bd8496c1b
27183 F20101129_AAAJIN pol_a_Page_016.QC.jpg
e519ef14d275cee2fa718c6dcaaed5a1
696c8dab0273d16f05aeebfa5fd98c33448fcbed
9540 F20101129_AAAIEX pol_a_Page_115.QC.jpg
b3a12406fb65770b05fa443732ec80b6
ca60acd12b3802e427fa8e4a9c9635ed386cf4b0
8500 F20101129_AAAJHZ pol_a_Page_011.QC.jpg
1d7f26114d9218dadacb7c2b7604ce2b
e2c55ef005c4dbd0c1082a3f144db32ecfa08617
F20101129_AAAIGA pol_a_Page_104.tif
f27518c56987295228614e99336842ae
9886c58cae0d8fd8eead84b19535779b44f2ee23
1976 F20101129_AAAIFL pol_a_Page_112.txt
b5bfa58642bb42dc2b71483334cc95a1
639f84e8455c0247618dfd7625be858d2b23252a
27879 F20101129_AAAJJD pol_a_Page_037.QC.jpg
88d9772ffdd3653acf7dafcec0367055
d96bb68ff971d9848491efad12da9792989a2f0b
6872 F20101129_AAAJIO pol_a_Page_016thm.jpg
5d9e54d7dc3dc5eaf8036fe264326457
0874d7d6c96d07d7fe06fe803a53a07e3d293ccc
1051926 F20101129_AAAIEY pol_a_Page_033.jp2
a5730f52b46bdd39c9455c7a017d84f4
5a96d9e594717c170450401ad187fda99815cd37
17110 F20101129_AAAIGB pol_a_Page_004.QC.jpg
99473c24069afc3a675559bd354f17b4
50cd196e8517c9b43b2d2e8ba820fb2530845ce5
8423998 F20101129_AAAIFM pol_a_Page_039.tif
5edc4bddd232ffe25bde21145af68be9
8d539ec480892cf72ca82d4046a0320e4a1afa2d
29543 F20101129_AAAJJE pol_a_Page_040.QC.jpg
ebc6c5ba301fdea34c2d31227fa8bc20
7f8edfc0ae00c95240d7f6cd4e9a9f5ef802a755
28667 F20101129_AAAJIP pol_a_Page_019.QC.jpg
6475ba88d04807cf40cdf41ef7436076
6932588428e70e70c8aee35248d82102ed5d3676
83489 F20101129_AAAIEZ pol_a_Page_031.jpg
20ba0ce66f8b683dc4248dda59b041d0
cd8e77e0671df838368f53e6dc467415b2c309f0
F20101129_AAAIGC pol_a_Page_082.tif
ade562da24df51b0c6bc93756dc29967
0ec7550d1bba7f6de7b121d36993b2df72ad1512
698 F20101129_AAAIFN pol_a_Page_117.txt
ea2b24c7c2f61864cd5a1bf3a41a8ab1
6fb602dbaffaf62009dafaadcd249dabbedee327
7165 F20101129_AAAJJF pol_a_Page_040thm.jpg
ba290b740a1768c46d67e6f6c1605edb
a2014b5b795d4765dc01ce6658bf81b577127b85
3554 F20101129_AAAJIQ pol_a_Page_021thm.jpg
2104cc326344125b2b7fbc2342f6ae28
e954d508731b02d902872561cb9cbe0a4e2d5fb0
75785 F20101129_AAAIGD pol_a_Page_004.jp2
051d140dfbf57e50fa62f85fd0173346
79fb7a3fe79c133059ccb35cb722f234a7c4243b
F20101129_AAAIFO pol_a_Page_022.tif
6a3580a317b02405805678502c2d7fd1
31f6117e7826dbf3c696d8faaa74709cb94b99a6
5921 F20101129_AAAJJG pol_a_Page_041thm.jpg
8c9de56b11c05895d82ce7fa461476b2
de4ede57a45e293ef7c1759eee7d387a645298e4
28073 F20101129_AAAJIR pol_a_Page_022.QC.jpg
1145d4fbceab8660757923bf5b6d3e1c
a8de5707f6a691f3450237a359f21027e0ad572c
30658 F20101129_AAAIGE pol_a_Page_100.QC.jpg
d7699458c099760ca7204e956fdabd7e
9d668f7bbe1d264515bff753680434333477ebac
832746 F20101129_AAAIFP pol_a_Page_017.jp2
3d508f44cde42c05f482d3cf3f59b595
631ae8c4709926afb962031c5aa2fe0f554caa5e
31691 F20101129_AAAJJH pol_a_Page_046.QC.jpg
333ea2eb83f49831c9c834d3c5462660
3089c48f9b4a3d32458c9d7a644b8a6121966499
29859 F20101129_AAAJIS pol_a_Page_024.QC.jpg
ef6ec37ce84d9a1cd71d83b5d5be24ce
604c141ab76ae87ff6aedfdc14a107f5e4c4515d
F20101129_AAAIGF pol_a_Page_052.tif
55fffda7ce875807691cf9b7d52dd344
d3cbbdc208c5031df2069e8464e8b046a9288654
1047601 F20101129_AAAIFQ pol_a_Page_028.jp2
1a3e09a570de47c7f3dec5a59596d1d7
5816e08ad5e7f3f1710aca0677726aa19d5e88d1
25831 F20101129_AAAJJI pol_a_Page_047.QC.jpg
c991ae36444985294e7b0145dd321e6b
39eceed247db72ffa9b68b70d11f2905f0a2ee76
30412 F20101129_AAAJIT pol_a_Page_025.QC.jpg
bdd77a6efb6210fae50fa8b4024ab856
94ea892924c4c11ccf51d300c9a5ce33116db343
7747 F20101129_AAAIGG pol_a_Page_046thm.jpg
04655a3ee70da1cadda73cbbab3327b1
c54039e0c3e6dfa8d0dccf307adcadb680ba104a
51563 F20101129_AAAIFR pol_a_Page_095.pro
092d94d3236a51356d2644df7c5b1dbb
2ff6921f0f39997385da4574f374c200689fe7b2
6813 F20101129_AAAJJJ pol_a_Page_047thm.jpg
f7f9d32678515d303e710a165f5f6c8a
793845989733260164388c0f03c71fe061f54839
11993 F20101129_AAAJIU pol_a_Page_027.QC.jpg
3eae443fbc196c01bbfaae9d640e438a
9fdef625f8f7cf5806156b493a1f658dd827d0a7
33566 F20101129_AAAIGH pol_a_Page_078.pro
9bafaa27e265dc27abe4445b8ecd516f
25ee1767ff1472310804526b08d271568c2e9e34
49673 F20101129_AAAIFS pol_a_Page_121.jpg
e45395482c81cf73ec1786d0248d3f5d
40c84337e4dfa4ed70ce5961d15fb34f63d98baa
27588 F20101129_AAAJJK pol_a_Page_048.QC.jpg
eed6c2aaa6f445843102eacf7c6d52c3
db9c81a800cd4192b1a72f768ac3bc097925c2bc
26484 F20101129_AAAJIV pol_a_Page_029.QC.jpg
0e841b2602247d771d06e247292eaa86
3929f6d57fad93517a7060cc7419464557f8feb4
F20101129_AAAIGI pol_a_Page_055.jp2
d880036fe48bf7a0aa0cff27e29c62eb
34bb9dfdb5cee5cca206f644c0fbc5c030abe402
F20101129_AAAIFT pol_a_Page_050.tif
7c06099db75d02a6430544999fe78a87
8a95f07806f1b348466214acbf000b55e030c4c3
6759 F20101129_AAAJIW pol_a_Page_029thm.jpg
506f0f17165257770aa9778287ff5c1e
49cc9dd3fa6ecbf123bf6bb3dd6a0496f584fa63
40806 F20101129_AAAIFU pol_a_Page_044.pro
216e151a117c22a03eb3cc8f55ccca49
9acefa5bb9bf4755dd00e363262996becc912ce7
12831 F20101129_AAAJKA pol_a_Page_070.QC.jpg
bd3585298af9a60f4bff992720bca00b
13f0b0049dd650cb7c1c6ce4a507fff9a98df600
7474 F20101129_AAAJJL pol_a_Page_049thm.jpg
e57b6939b7e29b0ed571f20cd602af8d
a7993f885d831ed5c3d7f524a5875cc1427b179c
6429 F20101129_AAAJIX pol_a_Page_030thm.jpg
7ae7c4b9e03c87502d1b1f002b3c8d70
45d47a4cb3a8783ed6e389d9a9ef3ec4673a8925
F20101129_AAAIGJ pol_a_Page_118.tif
3366c3c2fb9a332e34496c94debe4691
757a17374ea3b88a5edaf0894b33d9cb43791114
59330 F20101129_AAAIFV pol_a_Page_078.jpg
2075dd8c062e55e02747e6936d07b9d9
ebfc6e0bf53739e5acbe3b1f3a7fe2e6deb2e4f7
17717 F20101129_AAAJKB pol_a_Page_071.QC.jpg
91d8f25b9288b5f703c363c667cd4152
aae3f20a0526fa1824164502003e24a3e3007b48
27063 F20101129_AAAJJM pol_a_Page_051.QC.jpg
c702dc24d39366507777386af3c29e01
151cf9e2a01cfcdc329f94fe1b015f86c7a8a4fb
25996 F20101129_AAAJIY pol_a_Page_033.QC.jpg
acb16a6f3780cb9a194f97cf5c30aa3b
1c441ef411d654d4a7405ed742fb6fd2bc9279ce
74481 F20101129_AAAIGK pol_a_Page_054.jpg
d3a0695512bc9a5dc45fa2f199617f8b
c66bc3645eaf5cb21d81b113f86958770c0c96d8
2166 F20101129_AAAIFW pol_a_Page_107.txt
1d4ca6b4c8ff9853bdb2beb1f6d09af9
13fa1200ffa48b9368976a2e971543b7ce46b2d6
4310 F20101129_AAAJKC pol_a_Page_072thm.jpg
eb80ecee85ac3263751ccbcef12e3586
abc86829fc24e6ffe47aee1384453458a3a9217f
5961 F20101129_AAAJJN pol_a_Page_054thm.jpg
ee0402a12f9076e39e1b52ada87892fe
b847f35b96f598cf1a9271b9b4191e770e3850f4
7454 F20101129_AAAJIZ pol_a_Page_034thm.jpg
58e8aab58395eceb8ce1e5dd936c1224
5b5fa3b74f458620307cb2be6f9dd74f801ac128
5158 F20101129_AAAIHA pol_a_Page_068thm.jpg
215da062e152def99ed81a7b825f8543
de07da461624cf66fdc3e58c636992a5170c6161
54740 F20101129_AAAIGL pol_a_Page_057.jpg
2e8bbf9786ce6d09570b0fed661169c9
a0e9160287fb7fe9fbb1e4d728b4be142157fca0
76064 F20101129_AAAIFX pol_a_Page_076.jpg
a94ab7badaeca607d5a40afb20960e3d
fc8aaf6c815bc8b053f841bb0dfe089c768a1246
5254 F20101129_AAAJKD pol_a_Page_073thm.jpg
b5859ba33b9dfee58f2b35dc24d8a664
b908eb2a386e88b631573bac55f00b298d309d0e
27636 F20101129_AAAJJO pol_a_Page_056.QC.jpg
b9d36d2ce47acb95b516a58346662589
1a3ffcb0bdbefae3e4b34dea47ed5135a5b3d4fd
2301 F20101129_AAAIHB pol_a_Page_103.txt
e0c32c5c3288b3a2549d79a34504ede4
6e83fb24b685e0a5991061ba22f680a1cb79c7c6
19936 F20101129_AAAIGM pol_a_Page_105.QC.jpg
eabe6f192979968bc0f8091c159653e1
d3cf5f93d50f75afe35847f9135a3ca258a5116a
43382 F20101129_AAAIFY pol_a_Page_104.pro
8384e473b3b1c82ff742da4d496a36ca
af32cb15cdb8e1a216c069539712e3286bae4ed0
5644 F20101129_AAAJKE pol_a_Page_074thm.jpg
b7fafdcc14aa3d7a461abdb9ed13cddf
d0ff1cfe2e77712cb76ffc275fac8af94d8ab56f
7277 F20101129_AAAJJP pol_a_Page_056thm.jpg
7cda1fa50a406a4be80bf6b976e047b2
80eae477686d44c6d613969e185d62e0a0d78509
F20101129_AAAIHC pol_a_Page_054.tif
1af30ed2f7636e7dee8d1b355cf82af9
b1463f5bf6760b07672ffcf7d0abde7f5d2a398f
59522 F20101129_AAAIGN pol_a_Page_116.pro
11730ee56b966da7a812401ad0ffdf5b
b2a5d66d3c036252d628b57aa9777bcd93d05994
F20101129_AAAIFZ pol_a_Page_027.tif
1fda355ce39d62ae4a87d8c4912556ce
41da466cbc33a9881d8e01ec808ac9cc0c84601c
6695 F20101129_AAAJKF pol_a_Page_075thm.jpg
f6d4fa1e47e5a95bbfd3cf51a18d5ec8
7c45dd413a4954cda391be63e9966cb72d619083
17850 F20101129_AAAJJQ pol_a_Page_057.QC.jpg
7f9d6dc33177d616d2963f2c6b8b03f6
7b7eaada47b57d3eca30678793561501feaaf036
1288 F20101129_AAAIHD pol_a_Page_121.txt
9794db84e142c4eeec0f6ddf7f54c821
3af819464370ed2663cf214208f2029c2316b898
65705 F20101129_AAAIGO pol_a_Page_118.pro
4c96b85d88365076c0a7cd705821b956
ee7f7eb775035b38962b20338d52790191c6eda4
19054 F20101129_AAAJKG pol_a_Page_077.QC.jpg
8b3fab3d7114623e5a8fc7c94fd7d7a1
27cf526799bc203e648a776714b2abf3280d596e
4817 F20101129_AAAJJR pol_a_Page_057thm.jpg
65c0ea8275d05bb263ef35c0dfb400c9
22d039f9b1bec610586284297b6c83054215bd67
7710 F20101129_AAAIHE pol_a_Page_020thm.jpg
f0846a5a33fb2dd47bab08a0e40216d1
0afe25c798e0cc1c19b169be7bc839c9cfc2ea77
25237 F20101129_AAAIGP pol_a_Page_114.QC.jpg
f435ba44dc158455245646121ba3233a
7cb95a32622558f38e102a22f630cad12a5d54ef
18132 F20101129_AAAJKH pol_a_Page_078.QC.jpg
3ddaf3fce91fc12fcac580e8972cac8f
432201fa89594d0a0901967ee2ee1cfeba19040f
7142 F20101129_AAAJJS pol_a_Page_058thm.jpg
10614a0e879886bcebfc979fd74a5794
0644b87e818827f4bb444246c30f25f1f1124b16
29211 F20101129_AAAIHF pol_a_Page_097.QC.jpg
171a8eabacaa8e5b9e54162b36f824e8
461541ed28b4576809d2c48ce84060b7a5338792
7166 F20101129_AAAIGQ pol_a_Page_026thm.jpg
3eea39b56a2fa782f2e52f68588c4574
a779a1b59ec091bf88ea8c1129cf98b1d8f747e6
17046 F20101129_AAAJKI pol_a_Page_079.QC.jpg
d24d12ba9e0757873dc4b838d8812c43
890b4a554b9c850afe9f09f0fdcb8796ce809c73
27122 F20101129_AAAJJT pol_a_Page_059.QC.jpg
0e97bcbc2b410bac4b9152fc9c1bad26
8150cf18a5922ee6c7efd13e60a82b16d149c532
94566 F20101129_AAAIHG pol_a_Page_024.jpg
8b1ac466aaae74ceab570b8763be799e
48af2c31db3e83e645cce28fdd515c8e2741c7ea
13052 F20101129_AAAIGR pol_a_Page_039.QC.jpg
093550c676ff679a1f5c738b8b8665eb
069f4448c467353b0e2b2a4149daf826aada547a
4800 F20101129_AAAJKJ pol_a_Page_079thm.jpg
f305391bdad1b78060e1a8a627d5cd5b
e8e954dee0d8848fb8e97aae7792cf0a9e8847d0
25252 F20101129_AAAJJU pol_a_Page_063.QC.jpg
af3324bb533dcfacf3a380800d8442b3
944905550584405d5cea608975ed34450f8d0730
30269 F20101129_AAAIHH pol_a_Page_069.pro
6a008ca0df3f5d74adf709ba510d18de
52dbef04f975fd6f4bcc51294184b9952f9c379c
7220 F20101129_AAAIGS pol_a_Page_081thm.jpg
5567faa7ed79282e1b043cb975c1cfa6
9f2d0b5575b57375962bb9876938ebd7bb5fb7ac
29971 F20101129_AAAJKK pol_a_Page_080.QC.jpg
737055f26dbc3b33d6e204570567ea74
467f9f3be8cf6aedca894efbdf5161db8cf559c4
5229 F20101129_AAAJJV pol_a_Page_064thm.jpg
8d1a9cbf4c71f130699b95c91c5cab7f
10c1d65e20f7d17a6b599ea366562bc5c0698204
950190 F20101129_AAAIHI pol_a_Page_065.jp2
78adc330026fd9166e43a3d0bcd660b6
916db68a6eb87476925021bf8b2a78c06d43a884
F20101129_AAAIGT pol_a_Page_090.jp2
ab0cca9ac9b3c20ab852593ef91b3775
23d18ed8f68527a9458859d3f298b161afb172d1
7051 F20101129_AAAJKL pol_a_Page_083thm.jpg
e2280b0ab6fb9079554e020901fbfaf2
3af7aadc7749ec0374c60a47442697d73b5805c8
21378 F20101129_AAAJJW pol_a_Page_065.QC.jpg
88dc38e3f8b95d1dc899ef96de2f17be
98fe8646b8c447476c85ba6a5022f74f96593c4c
586 F20101129_AAAIHJ pol_a_Page_011.txt
4a47e070e22c73378279a63f360c6b49
4ef186b35665848c7654d532f22f4bf0147be9aa
31099 F20101129_AAAIGU pol_a_Page_032.QC.jpg
532b7de18fcee2d808626b9b94d8e757
a72414a77dbf39f0366eb68b752a7a2a2cba6fc8
7558 F20101129_AAAJLA pol_a_Page_100thm.jpg
82f3292039474087af2f87f71fc5bba1
aadead07642109452b03e00d83c1230c1e8b4fc6
14079 F20101129_AAAJJX pol_a_Page_067.QC.jpg
b34ab58e37ec4b20cdecfc1783ac0c2d
45c1c7b9a5249c15babbe4cc1702fd1ce771ba94
F20101129_AAAIGV pol_a_Page_090.tif
a0234896fdbf20b13e8eec0f0c0af777
1b3c8ae81725ed5767c7e8c4f505ca653f6362d8
4258 F20101129_AAAJLB pol_a_Page_101thm.jpg
b1509c150c01530901ee01da2d837411
e0ade059ecdf4f3fd3759fefd54a21dba6dd1cbf
7004 F20101129_AAAJKM pol_a_Page_084thm.jpg
c9a68b31a09c50bf7c68b7889bf00b17
ff0c86f495c814468fe706955dcf07dfd69102a2
4110 F20101129_AAAJJY pol_a_Page_067thm.jpg
e2b30b55be4848b070f7a10053f5cb44
3e8b884a93bf2bea57e49aa11ed68ca736a02462
2730 F20101129_AAAIHK pol_a_Page_117thm.jpg
096cc4db2b431e1011619774f9415870
0c1a267fae4e90b3699bce29e926c117aa7c38e5
F20101129_AAAIGW pol_a_Page_080.jp2
d6643e02106152a337630e992928aed6
bf516c5b6ad44cadc109646cb1361772a64c4503
27241 F20101129_AAAJLC pol_a_Page_103.QC.jpg
65bbc6bf5a516507c9f43c18b3d0ca96
90f69fda298d3257a464e742cacb54ca5a746d8d
7025 F20101129_AAAJKN pol_a_Page_086thm.jpg
0ea38e369eb13bdec28bc5e2855b6e26
b813759e06be803626eb11ab330e13b9944bc730
4699 F20101129_AAAJJZ pol_a_Page_069thm.jpg
1802e99c92795ee6d7c0e11b14de714d
a1bea28ddd7791b959f478ee0f0b41bbac53d694
F20101129_AAAIHL pol_a_Page_079.tif
5b3d063927d3272d30d4c15f1e59d1cf
862991e0d243a90d3ff77106bc6df9a4929bd5cb
87978 F20101129_AAAIGX pol_a_Page_059.jpg
6d5a65b62a5136eaa5e5dd23a7ce3675
dfcb800a77d297e4b623e5037f8b2d60ece03f13
5300 F20101129_AAAIIA pol_a_Page_017thm.jpg
78296873f713e259f6bba4d6ef0805aa
fa524c7dc1a7ecb93e22b9ffd9637975d4570861
5724 F20101129_AAAJLD pol_a_Page_104thm.jpg
83cddafbecca07dada8c93fc57abc234
78a5884f4f8aaacbaf2a51aae7c6b8bc9cbfc0ff
23839 F20101129_AAAJKO pol_a_Page_087.QC.jpg
dfd4cf744b1511a76b2c1adfe6d86494
9e00eaf3f1d0601d7da36484fe79add423c3b439
F20101129_AAAIHM pol_a_Page_099.jp2
b500b630afd625b03e9e1f34e162c417
f1619442748e60a85de35207f3573135ca78042d
2667 F20101129_AAAIGY pol_a_Page_086.txt
402fc8f4b01aec568af6ab951ae51829
4bede79fd346030327db010f38468c315fff9b0a
F20101129_AAAIIB pol_a_Page_060.tif
fe31814aad00fd4caf0a5c9148b2b393
a80fb9eb59f796aa8471bc6dc2241bad36e326f0
5902 F20101129_AAAJLE pol_a_Page_106thm.jpg
5d37c3f34e21220ac5dfa87bbed2b6d9
859c88ff88e25f34c15d9673cf1ae540f5eba4de
6091 F20101129_AAAJKP pol_a_Page_087thm.jpg
3be25d85fd42617eb08d86f7342a5051
e2c93c84c6c322f83510f3e38f836d282ec6fc6d
16960 F20101129_AAAIHN pol_a_Page_102.QC.jpg
92de5f2e349a95eb38218bca043a3d93
eb420c6f5858ff221a43f3b7f1857b8cf978ab04
6568 F20101129_AAAIGZ pol_a_Page_119thm.jpg
e1bc8bb559120dd7ba988fbb5014cea0
15e74f8f779a1563ef0201133f79b89da8f60c5c
1827 F20101129_AAAIIC pol_a_Page_088.txt
1f04833e22a33576d775d945acdff840
0d7182e50ea39fe597a4b58f264226a4b8f40453
25503 F20101129_AAAJLF pol_a_Page_109.QC.jpg
6cb6617ec8afad3ac8f53d959c58ea59
9120fe15a66f13ad675ab0f03ec1ae1e99c54a54
24218 F20101129_AAAJKQ pol_a_Page_089.QC.jpg
e73b39141d6382cd61182e5fff9de6e3
733f2788c53adc7b62c97428a90f7822ffacf894
125012 F20101129_AAAIHO pol_a_Page_116.jp2
d8ae4d101afd4ddc1807342e18a4dae6
f5b2cfca8e0c6a87173d8b6e1a5a13a5c2691e0d
64083 F20101129_AAAIID pol_a_Page_038.jpg
e39f9d9f1be2a6a7d1cf979817c30358
f8af5ca8a5be706c52747ddb36ff57499e2bd2e7
19354 F20101129_AAAJLG pol_a_Page_110.QC.jpg
46f13286736d099b9f843b5a67e7c8b0
525cc8e8d25171f9b1cda82496e2e538d1ac0d9f
6492 F20101129_AAAJKR pol_a_Page_089thm.jpg
2765f7e38a632cff1cd781c2d1781106
13721682484110b02b744a640f0e9b7052f788d5
34213 F20101129_AAAIHP pol_a_Page_007.jpg
a4c74440848e18d8d9d8e619fee73d9d
3795fae524a8ecc79a3ec030543097067eaefed3
F20101129_AAAIIE pol_a_Page_012.tif
c9df8a04301017d2436556ca56a4d272
ef3863e8c50d04f10be0513308480bc36c081ef9
29741 F20101129_AAAJLH pol_a_Page_111.QC.jpg
9c82c09df31acaeb22c51739837929cf
e80728d129b081368fb3fd73efeb7e2d4ae644e8
29207 F20101129_AAAJKS pol_a_Page_090.QC.jpg
c20e0702fb5d8652bb0357ccf8f9036b
e2366d32178ad7197f2c22ce4a0f3ad66fed0f1c
5326 F20101129_AAAIHQ pol_a_Page_060thm.jpg
a097725c0b75d10278de79e9cc391526
edcd770616535eaeab0bb888b69a9e1a3a3c2a78
28877 F20101129_AAAIIF pol_a_Page_049.QC.jpg
c0a5533318565e214b7bc468b44689c3
46e8c3cd8966eea09f1eb192723ba1d1d0540d2e
6611 F20101129_AAAJLI pol_a_Page_114thm.jpg
ca6c4c23f40f808db0294f82c59b496d
33a623306c8201a26946c05513bf6ca192f45714
7498 F20101129_AAAJKT pol_a_Page_091thm.jpg
3f6b390717c4aa071ec80da453eb2814
9e7959b01ef51734cf5d92ed2a495e17fa8a7364
F20101129_AAAIHR pol_a_Page_026.tif
ec9eb7070902f67a563814c66136fbb4
6de218b287ac7a3ada6bb2058dd679a08c69dfa5
822536 F20101129_AAAIIG pol_a_Page_073.jp2
7b0ef7d98b1464f9b78118932521c31d
aec08d82725182a0fbd15f495101c23e0f03a339
26733 F20101129_AAAJLJ pol_a_Page_116.QC.jpg
c562d42d6b7c141f2dabebf0ae6d0723
645d5dcf424c82663cf87bd51619be5d3f8bc38d
22498 F20101129_AAAJKU pol_a_Page_093.QC.jpg
f62901c935cb4667a1ec769ec6f80a88
c4f73503cfca62b976d555e2996ec20b2e30e17e
59235 F20101129_AAAIHS pol_a_Page_070.jp2
26a6e4a0c1972988b96b5bd7aa29077f
653ddc3ac652047e553d7cc6c98ccfe7cdb380e5
2455 F20101129_AAAIIH pol_a_Page_013.txt
e5515f59137e7ea57b55cc771e5649de
4d2174f7b071460138deeb14c6524ec127f628d5
7047 F20101129_AAAJLK pol_a_Page_120thm.jpg
d53ace764d6c1bc20877cd6c3b513b7e
581e98a9a9515c26f01e779028b25bc78ccb3534
5933 F20101129_AAAJKV pol_a_Page_093thm.jpg
a3b11438128af9065aa3749dde899517
a055c0ed2cab88ebe6c72e226f7fc5966d124e01
1051967 F20101129_AAAIHT pol_a_Page_020.jp2
17377863978c23cf0a7c136837e5dc05
e015f7c24d02c4f173a13d23d7aacf665ef3a0ed
55747 F20101129_AAAIII pol_a_Page_048.pro
39d27e69dfad35d27cd878a1b1816738
01e3ce5b84888d80e431f21b84dce2730a22ed1b
4402 F20101129_AAAJLL pol_a_Page_121thm.jpg
6a7441841ebf51d47ce3783e564244ec
646205b8167860e0e69d9c02df85a9b218768f9b
28126 F20101129_AAAJKW pol_a_Page_094.QC.jpg
dc8dfdd5024921bd27ad5a675eacac98
9a4654e05dd9e31a66494fe99915e9ba1b6b74b6
7156 F20101129_AAAIHU pol_a_Page_085thm.jpg
75a40d492ce4b8cae38b6b5fe9997476
038bb931cee99426714457678ff2a2444bff72be
1051912 F20101129_AAAIIJ pol_a_Page_051.jp2
85aadae4b3d89c664652e82d774f3f01
1c5be51feb7ba2fc2a3ecbc1aaa8647b56583ae6
6561 F20101129_AAAJKX pol_a_Page_095thm.jpg
b962446e400d5616187fc525b1bdf807
5a32912771a2e95c7375c06626356e044285c68d
F20101129_AAAIHV pol_a_Page_097.tif
f8614410bdde8ea9569d4547f30054d2
8f4040892435302c3a4e3bfe32b81b6abb5e4ef0
63759 F20101129_AAAIIK pol_a_Page_050.pro
5a13cd6e8ee7dd1b3c2a6dee33796ba7
3353259b541a2c9e9a090094ba40601b4f5505fd
7475 F20101129_AAAJKY pol_a_Page_097thm.jpg
10b3c59a794c2496c5511195d63ec66b
f451c5bd16d5d4b5b0431824c8f7df1a4b782eb9
852627 F20101129_AAAIHW pol_a_Page_074.jp2
e04955eff0aef51ad4c7027c69d61d62
b1f9fcce5c8d0c81e06a330d53198a6ac69492cf
F20101129_AAAJKZ pol_a_Page_098thm.jpg
48ea12ccb281a1fe9a7db5586a79d83b
fdcb1832b756828825e6e834cfa6a822e6e46582
84926 F20101129_AAAIHX pol_a_Page_060.jp2
5ecb81acc9a0bca978a41a37f9c32ebf
8176e333ba607403275120db1132965c52a265ac
99638 F20101129_AAAIJA pol_a_Page_093.jp2
5bba58e6b12eb7fc3193c665a37fc6f8
40e65ca1aaaab9096471a6bef964e7c231fa1ba6
3451 F20101129_AAAIIL pol_a_Page_027thm.jpg
3a71c0233b00be29a9816eef5e1a905a
9e4f40622db3c1abadc4a979c649512b5bbf1cb5
62529 F20101129_AAAIHY pol_a_Page_112.jpg
82feba72cb763b7ccea66d0313f4a7fa
9bdb2f463a8c36958761a1e53b17259ab6c12ce1
F20101129_AAAIJB pol_a_Page_001.tif
300624bc88df09ab2ab449160a8e24f8
5ebb9bbed4cbb3cbfad0917f12986f8ff511f4f0
96908 F20101129_AAAIIM pol_a_Page_032.jpg
d8c6cf768a37b8d6e25a675fe84e3b36
9a773d2cb1be366776ca236dd4e330b30206c076
F20101129_AAAIHZ pol_a_Page_043.tif
9be3a8b8f031301bfe92dcd80ea68b46
b231e0aa141ca5acff8d1fda7e7e22c7476431aa
29001 F20101129_AAAIJC pol_a_Page_053.QC.jpg
372006aad71d689059d7e5c06224a9aa
55d7b0ae959d4c8172a7622f70d0077d8e769446
F20101129_AAAIIN pol_a_Page_053.tif
b05d0b352ce496b3fbd6b26e0433de8c
7ca2a1905acc5e9c6ecc5d0e66e692ea53e9d43e
F20101129_AAAIJD pol_a_Page_017.tif
45be9bc3f0027fe306646fd9f1601c5e
2d3f16d98eeeae1bcd57e492aedcd5eff1d4270d
F20101129_AAAIIO pol_a_Page_099.tif
c08758d0eb6ec288147962ee7e55c813
58a0449d3aed2fe723ee297eb1ab599c2ee94997
28597 F20101129_AAAIJE pol_a_Page_013.QC.jpg
91edd9608cb61f92d4998cb02aaab02c
17b471cc439fc5181254c9cad867e9c2a2a02d68
7207 F20101129_AAAIIP pol_a_Page_053thm.jpg
6af9121658258e591a26bd3ac079a98f
e29ae0cb0390dc97d3e2ad3f6be62f179d66e820
9641 F20101129_AAAIJF pol_a_Page_117.QC.jpg
39d8f3b2dc6eb7399d759c2666b0c247
006fc8b047a28fb3013f2a43763b44d3977190a8
26548 F20101129_AAAIIQ pol_a_Page_092.QC.jpg
a3b8737c36e93f53ded02980f3640673
514b5e7044bd37162a5cbc389cdd96e9b7a1f96d
45107 F20101129_AAAIJG pol_a_Page_122.pro
d6f2f983dd009371a54ebdf1b8163a73
8c04a3ea9cf3962b97ccced56d2d8a9ebe9c02a2
70241 F20101129_AAAIIR pol_a_Page_075.pro
0fc78ea3e4a95a6d8114e42ce70b3bcd
22f88f076e9564aa42685121be8de687629b4f56
77868 F20101129_AAAIJH pol_a_Page_063.jpg
1406d8e353cc06146e24df7e371852d1
ed40d167a81ab7605173b6e960db39a7d79f008c
836367 F20101129_AAAIIS pol_a_Page_064.jp2
e6ee15b07790127e0a47eee1b2a06b0e
3bdb76842946a970911558d052505e85b530eb7a
2484 F20101129_AAAIJI pol_a_Page_108.txt
0a6790882e2fdb83acb69fe64b5b95af
0b397f4fa1688f1ad45fd5fb0b65fca96e186e5d
F20101129_AAAIIT pol_a_Page_030.tif
e86d2f9b510d68457d71da360195ca7c
786b65989b86d3eed94d6973483117787b472c87
16810 F20101129_AAAIJJ pol_a_Page_043.QC.jpg
85bb528b7830dca7e9f15ababaa2c4c9
6e217cfb6a8e1a7db31bb11f6e4a6cc33fc72d0e
6718 F20101129_AAAIIU pol_a_Page_018thm.jpg
59c0cc6756bfe104f9e524ef40658f59
cfd1b8823694893082550101a0554b87aaad3752
17131 F20101129_AAAIJK pol_a_Page_082.QC.jpg
567f15b2a31e8111e4493ba66b584774
529d5996943d86eb97597fc8cf8a50868b1960d3
21958 F20101129_AAAIIV pol_a_Page_098.jpg
838898734bf7955327646cb9a2e66352
5ee0ed69d710074d9f991d59021a9283d807dbe4
9747 F20101129_AAAIJL pol_a_Page_002.jpg
401154cf440dd4be0499d8aaaf37ae3e
213e25f5c03d8f8677b8d38ebfa518c3f66ef94a
49727 F20101129_AAAIIW pol_a_Page_010.pro
39bd50e4ce0613a719d551bf1944b702
e4c65a5008b61b4f72830d2560a0c43e3bec9445
44820 F20101129_AAAIKA pol_a_Page_105.pro
ea8fcb75b96f73de1301e3cc1d1b81f6
60d2c5946ddf82d7cbf2c5c2f32bc0ad17c94aae
7318 F20101129_AAAIIX pol_a_Page_096thm.jpg
10194312bf636e4e60b6fc8be64642e3
e6b3293753e75a2a073cb4151d46d9b8b4625bc9
6398 F20101129_AAAIKB pol_a_Page_109thm.jpg
87cf0bd6e3b92737aeaad640e126ac46
b04db62ad93bea8716be6ded8691e6bb1d210ba5
7178 F20101129_AAAIJM pol_a_Page_015thm.jpg
1a62dec764df89e1e40556af869a57c2
d1e44eae463ec73baac648a74a90308ae9b707c3
7122 F20101129_AAAIIY pol_a_Page_098.QC.jpg
89c9d20f0e88e2b20f2937f7a584d913
1aa7c822590b5480d2c1ba28c3c118d86d128737
51455 F20101129_AAAIKC pol_a_Page_082.jpg
70f33af02f414c5d8919b611ea71e3c8
b59f1ebada2ad973cbf7ebe3c4b2247bc407e825
F20101129_AAAIJN pol_a_Page_025.tif
3e7a6d76382ebd0f474254193f90aed2
ce1fa9d214a69f3230f8239578b31d9a2c490c19
7462 F20101129_AAAIIZ pol_a_Page_012thm.jpg
54a891eab3b483637c1c834aa1ec378b
23e4c494de2e8547eb09a8c555059c50beb20a11
28163 F20101129_AAAIKD pol_a_Page_084.QC.jpg
8220500194f13ec774f1217c7365630a
419432df40e6974a2504dfdb0478ff1e3ef7afca
19875 F20101129_AAAIJO pol_a_Page_017.QC.jpg
9d525873fcf1996f5ac82a14ba8d93fa
6e7640473d15fbbc6ea3d4cd42b39f57d1e3b480
7469 F20101129_AAAIKE pol_a_Page_055thm.jpg
4b0e355791012f3b92234c6ec2d256bb
c51d9faac4224bfb405cb1539d9092801b1707f1
13378 F20101129_AAAIJP pol_a_Page_021.QC.jpg
e3c35715da2c67345304fb9cd05c6452
ca89130f1a7dbbb39af34e3127083114bc5a3034
1051983 F20101129_AAAIKF pol_a_Page_015.jp2
70c72f39d9c717308fbd3871e92e977f
38f29749c0f8319a6e24a267d373e071b5780d8b
F20101129_AAAIJQ pol_a_Page_003.tif
09307364b6cc75ccb9d56f3d8b2f62f0
776de0c000b259ed309327a5b29159aa6daf24c6
5868 F20101129_AAAIKG pol_a_Page_088thm.jpg
e5dcc3d4db27bf4feac563609f787d1f
6ad9bc9d9163b4a34928b87dbf17c56e7a23aec7
35621 F20101129_AAAIJR pol_a_Page_077.pro
f950e35d6ff15d278a4f64712fc89999
2a39b4973368160fcc5b553c9860c2f6fbb03c29
F20101129_AAAIKH pol_a_Page_037.tif
7e20ee95fa0dedfe6ebcdd89090b2ff8
83a6d822e518c4f64e477e072f41f317f32921c4
1051898 F20101129_AAAIJS pol_a_Page_056.jp2
561139ad8958e7fa5acb966f8d0e73fd
6240ccf3f20f71297414f95ae2cae13d0d6fe275
F20101129_AAAIKI pol_a_Page_098.tif
13d53361e873ffdfb85db7392c21408a
736d50c7e921c621f64a228e67373e013aa5a7ee
51847 F20101129_AAAIJT pol_a_Page_036.pro
3492cf16917afb61ebaf14b8fc7f2ee6
f98069d524e4f690a255c04cf371da177ece96c3
13681 F20101129_AAAIJU pol_a_Page_101.QC.jpg
90f0868718a981104ba5354e065abb27
75d4a1d4b61993044bfa0e10545612a065c1e522
F20101129_AAAIKJ pol_a_Page_006.tif
8716b236b7719ab39a6c1b009dd520b2
dbb6c063b2cb0c2c5d9c7de9bf7e50d313419146
2443 F20101129_AAAIJV pol_a_Page_091.txt
ae4de29cd2bc7d43722e71b1d5985c34
9d81dd4b7c101e5dd6088043940f09771bc1911b
98851 F20101129_AAAIKK pol_a_Page_012.jpg
7a2cfa9654122d841c2acc35ae7acedb
49de0bba66f61d27994fbf533bd092a608e23843
38433 F20101129_AAAIJW pol_a_Page_115.jp2
a9d5c62adcbdf709f290d862b7130bac
e6247e8c7b33f2c3ec3d2be6acb7a422ad2a0ef8
F20101129_AAAIKL pol_a_Page_110.tif
64ebd31cbe54a56785700a1ffb5d4914
504ee23328ef8c2d50993824a89c24ac6e76d694
2552 F20101129_AAAIJX pol_a_Page_018.txt
028d5ecc9494f85ab987b37de313a488
b04d5085a6b8d42a718d4d48e54b92d48cb77075
66076 F20101129_AAAILA pol_a_Page_085.pro
f5bf47bdaaaf60ee113551ed3f126b97
bdadce40e49cd42e1c862b699c1da29daca20654
26842 F20101129_AAAIKM pol_a_Page_031.QC.jpg
1fc192fd3e8ded9cf4523e1766e97d09
e2261fabde57f84cf69daa8e4877d60e0cb8102a
2017 F20101129_AAAIJY pol_a_Page_028.txt
8c3ee6dbeec7cc7f375bb7b59b43a192
757f8f7b94dde8a25b13050c024d12c2252c54cf
60377 F20101129_AAAILB pol_a_Page_097.pro
b36e5271cfcb85048d2e89ee3932f2ea
019de87cdeb716ee5a1d9ce43395060a990cbe0c
7090 F20101129_AAAIJZ pol_a_Page_103thm.jpg
56c53fbaf2c7fef950f818cd6c7ec92e
06cf2c0916b3a3cfb7bfa5a9ef5f788f163cfc02
71618 F20101129_AAAILC pol_a_Page_010.jpg
0d008f6ad0ca985aa712dddf71e6ed97
84e57fe2d9c99d78e0f9c2c0b839745dd727d659
57948 F20101129_AAAIKN pol_a_Page_059.pro
2a45e389aeaff9a53cb1f2acee259170
5a352121a9e0f53bb4113f6e329f582f46a8b750
F20101129_AAAILD pol_a_Page_101.tif
8fc33fb7bd097cc8141b6daa15146557
97e49d473462b9bd2148fa5846a2e05a966130af
3093 F20101129_AAAIKO pol_a_Page_003.QC.jpg
bdf50c3ef209bb1e547041773cccc4a7
a8ad408933149c0f6c59c09ace5ac9e17a695616
25228 F20101129_AAAILE pol_a_Page_113.QC.jpg
8117ad2b08fffb11a887e904fd1e1f4e
156ad31838c482f24b4863d67f38cec54f01d8f8
F20101129_AAAIKP pol_a_Page_067.tif
803a4ab62d252699b8594052e2370409
6c947ee32e1067f8df788857d4363cf547340416
881 F20101129_AAAILF pol_a_Page_003.pro
2e2b697a4ab0071b4d6670262f7d9bfe
27a4ad40341beffe5399ba709b6139369475f146
68707 F20101129_AAAIKQ pol_a_Page_036.jpg
c720fa82054dc28071b920f06ab75817
a15e825b5d5d2b84063f8466c88ef0320763bf5d
89894 F20101129_AAAILG pol_a_Page_075.jpg
ebfe10c07a65c21c04bd62a912d2dcc1
08daec67c6bde6ed3c093543675454be81b852d5
F20101129_AAAIKR pol_a_Page_092.tif
e0e86b7188d42029fcec2157123b0720
6733d669c75f4cdf11469be404541eca92029e3d
117723 F20101129_AAAILH pol_a_Page_062.jp2
b7fd3cdfc997ad427797d219b5a1b07e
e95c8189ef3c7a7261a3e462ad7a5d9e8a2586cf
90664 F20101129_AAAIKS pol_a_Page_015.jpg
9b1782dd2d8b33d1d672c9a5b11a8c72
abbac47d5d27119ef9044573094be66d1bd1fc78
1051966 F20101129_AAAILI pol_a_Page_032.jp2
cd5d38b0fc3d374b581229aa597e7b19
bd73649094f262ef1399464cf6adc96424ec7fb5
2326 F20101129_AAAIKT pol_a_Page_096.txt
fd8253668d25345597325870d18202ca
827d64d7cf32aa63fac6ca31a58979ce79a59fea
20872 F20101129_AAAILJ pol_a_Page_104.QC.jpg
c6a27576789671056b31a498d232c0fc
e0741098eb77879511a284d86c18b2e0c771c4d4
99116 F20101129_AAAIKU pol_a_Page_020.jpg
4d4d3030e3380bc3e18aa50b3663f096
31b1f7a3aa3b64fc0c7a61d3b0c7c3827a1d2134
556891 F20101129_AAAILK pol_a_Page_067.jp2
dd61b7e42cb870a74fd733e91217bbe7
1419af803a8d590dcd3fb9b245bd8c28136f6d86
25280 F20101129_AAAIKV pol_a_Page_070.pro
afff44297ff05730ec843a5762efce15
8224aab134859d04dcd98fb9c8f057199fcfe000
56158 F20101129_AAAILL pol_a_Page_114.pro
2842b59919ae760b5ac5cc8a0a2eb1e5
2d80cfdf29cd9145057a6db6d18a91eefde4130f
1160 F20101129_AAAIKW pol_a_Page_054.txt
5274c499f883d9f5caf3c7863aaf3f24
6cb62b052da042b7d0a7964b9b6a104e0d5ab068
58000 F20101129_AAAIMA pol_a_Page_096.pro
f488295850c5fa778361a91a864d2785
91d0c7d1f58ae012e198c54688b3290295ccb2a1
119033 F20101129_AAAILM pol_a_Page_047.jp2
5fdd7b0630422b3d5a6d1cb2be325409
9b0171e1523e188b7c8ad26184a0d8bade493ade
1034 F20101129_AAAIKX pol_a_Page_007.txt
052c0cc2b99c11f2260051b9c24027cb
9b47ad074c0481d186c547454ed8703f158a80cb
905 F20101129_AAAIMB pol_a_Page_021.txt
1b008b4f65eb65c237516b31037f1b35
49e29ce9ac14932601e952946e7e04e79bea8990
56330 F20101129_AAAILN pol_a_Page_084.pro
61dbdbecce6c83f84c93f13e06d268f6
07c0fc9ee0873940ed61cf82d844d806b3d9c238
2087 F20101129_AAAIKY pol_a_Page_095.txt
7fddc6e68f8f820862eb07d0aff3dd47
5807bb66650a605c72eea7b8f2e14958faea43e9
F20101129_AAAIMC pol_a_Page_047.tif
acad8519141c0eab420cf089e5c7fa11
edca548ac86a538656a9571b1513a36dbf7729a9
F20101129_AAAIKZ pol_a_Page_080.tif
f558f2de99476edea5917e02f21f4327
f4ab4a1045dc4ab00135c4c55e3490b970a92db8
27334 F20101129_AAAIMD pol_a_Page_045.QC.jpg
a74d7e78eddbcd2d988103e5b3390ab2
94023be644c4202508f8fb95ae8bac6a3788fab3
52515 F20101129_AAAILO pol_a_Page_004.jpg
a6efceb7b0eb7401a7ca0ef021a7208a
36bfda2db09fe83f3063d2de9b7b91df1c5201f1
31080 F20101129_AAAIME pol_a_Page_014.QC.jpg
4fdeee3fba22b4f62d8f767c66bcaaae
2ec1e12d9e164386ab063630f6eb6aa579803a04
80509 F20101129_AAAILP pol_a_Page_109.jpg
158c75c7ebb641ca704259b5d2584908
9a57b1f3f03a51bf9db16c67c9b3582be0fd7ef2
F20101129_AAAIMF pol_a_Page_107.tif
ee64b4312763d11e91d3d8ab707d9681
62ca4bc03872c5dd17d976f7d2f82f0a6a44e21b
F20101129_AAAILQ pol_a_Page_053.txt
125217984eda5037ba3cac851e8f14ee
f3b01f6ac09bbd29172ad1dfe43fafbde3503154
27512 F20101129_AAAIMG pol_a_Page_083.QC.jpg
82c93438340612c9fbd756296e8ce4a0
101cd4751c58b3752a0f4ad8422509495e5e8abb
F20101129_AAAILR pol_a_Page_075.tif
97db8c8b0cc600c3b6d5d26dd982421b
246c971a43cbd4bfb708d921a414c33b81a137fb
76859 F20101129_AAAIMH pol_a_Page_095.jpg
5720d7c1bc547f54ab12d61b135fbd6a
73735b126f6134f0e03df4b7fbe40e5ba1671f35
5046 F20101129_AAAILS pol_a_Page_002.jp2
f8979d3c7cc03f949105e1b9b99e05da
131cfc23f04863df61b7a9e63150f022134b1f34
105166 F20101129_AAAIMI pol_a_Page_010.jp2
38531a0f15a0da38a8e8a0bf9f4dd0f9
2521bb10b58376cdb1af67c27dbb36684128fcf1
61635 F20101129_AAAILT pol_a_Page_101.jp2
2341be08aff1f7afd54d5d703f22da15
b3830eefd5ceb700637766ee046d3c9d5d2e5bd7
2379 F20101129_AAAIMJ pol_a_Page_037.txt
f4b66c4591323b4a541f27fb75032937
c29e65e80b8d07bbb5ec64ff1a2aa9a6184c48e9
78853 F20101129_AAAILU pol_a_Page_062.jpg
c28ffafa8177803f939d7e687f40cf70
db6fb381060462645925b5d3f655ccabf63154fd
2440 F20101129_AAAIMK pol_a_Page_119.txt
4665a96203a2503adda4ee08877b85c8
250cbe8f5c8d18d16f9dd730a2119f011b41225d
60401 F20101129_AAAILV pol_a_Page_119.pro
759960a9a00916d02d3c5614846707b1
0645d9513bd0e8c125b86a5d456cfa8be4f3f58f
56544 F20101129_AAAIML pol_a_Page_042.pro
836a5655fbeac2b69161667ae53ead6b
65b55d213b8a659592517c47da53806e685bdb3d
7050 F20101129_AAAILW pol_a_Page_031thm.jpg
d381701309563f68c8a27ead8e4e7ab2
807615ad5ec03fee1efb460123d5c8a3288b02c1
89018 F20101129_AAAIMM pol_a_Page_120.jpg
492e62f778d116d54c814be8c642690a
58d5634ce58698578930fa0eca0c99031b87a1c0
4950 F20101129_AAAILX pol_a_Page_003.jp2
4858c179029380562d9e307ce67f0aa4
d5bb480f680033c8876b145abc44590d9cee29a4
20419 F20101129_AAAINA pol_a_Page_112.QC.jpg
b18f251c8d63e46cbd8756b9faab2b7f
1aaa9e1e97361965112d24ae7f2437c6ea52d087
6914 F20101129_AAAIMN pol_a_Page_063thm.jpg
7c4e89e641bab68bfb024e4795adab0b
a16f76186ac130404b3a1557ece0bd3185aed530
F20101129_AAAILY pol_a_Page_092thm.jpg
2ab1e2d1276887dfc1fcc948f41e9d1f
c70534fa0126549f8af8825a14169010c0b81abc
2933 F20101129_AAAINB pol_a_Page_075.txt
e2236f66ef56fa33e8179d056f2b18e4
9e6da24827b2b63d57a02e482edc2255cb4fa678
2381 F20101129_AAAIMO pol_a_Page_049.txt
a4386dc4b7867c913a217cb9c31935c1
921fb5f1c485c7c6df606a046e0f1e505af5f795
62368 F20101129_AAAILZ pol_a_Page_120.pro
c59f4dcfc1ee9c19d3645407d0461972
e06800a3410d4be7cd1be13d1ad58bfb5a23fb07
5261 F20101129_AAAINC pol_a_Page_110thm.jpg
887fbc7524c18292fe3323d4203e0099
7162c231a668c97f72d6a7bc90ed85aeece921f2
5642 F20101129_AAAIND pol_a_Page_122thm.jpg
9f712a098e753ae96444e39e33406261
77d699e673561db5da402cd1caa763a6cb00cc73
F20101129_AAAIMP pol_a_Page_108.jp2
da87ba5da24c48b51bf3981993520e48
60a687f73cb22ed2fa432f156198247833e9ac01
26878 F20101129_AAAINE pol_a_Page_086.QC.jpg
9773875687b94bb92ba33947b4e76c89
a11fb485957079ffdd1dd16a046028bf2e141e23
F20101129_AAAIMQ pol_a_Page_018.tif
23999145ef75a50a6440c93e8de2e1ff
95b43be63a3bd3fef909667e2b0f6a4607f267df
40955 F20101129_AAAINF pol_a_Page_112.pro
1a31fa53718d4d24971c3ee7335c2d22
317225cbca34005b73a8f80be146fe51f58e4d51
76565 F20101129_AAAIMR pol_a_Page_041.jpg
0fc5e7ad45f5ac86b6293ab4570ad2a6
49f61a36a1d1531c236046c4a511c95d10bdd55c
2118 F20101129_AAAING pol_a_Page_030.txt
f0f0ac251383239500e9f3a4f644c02b
3fd6a61f7fbdf8038cf29c877839c01b49040275
1347 F20101129_AAAIMS pol_a_Page_057.txt
82e6ed7adcc06cb86f5812f3eeb141f4
b3a5af916b1052a31e87bf58c76a49c4e0840885
1170 F20101129_AAAINH pol_a_Page_067.txt
04e7693c240d16aa49af80ac6db3824d
b340effadc4858c55c194e4f99037244236b33e8
2647 F20101129_AAAIMT pol_a_Page_118.txt
4c53cf4abf2e50bd4a9eba5809353b5f
7290ee3a654762ae7470cf435801db46b437b205
36457 F20101129_AAAINI pol_a_Page_027.jpg
127c4f69837d00fb77aba2f79c7ef771
b11f305c1cf48ddcfcc9207bcdc1d6f5ffe3f4c2
5036 F20101129_AAAIMU pol_a_Page_043thm.jpg
26db769c3d626e8b01bca96810edf091
018a4cc9b5e30a852c36a454e5026082857afdf0
3056 F20101129_AAAINJ pol_a_Page_002.QC.jpg
5765323a8224aec7cb5deb808366bc43
fa380cea0196316216c6e0fa5b67dbfdd16f1c01
5297 F20101129_AAAIMV pol_a_Page_105thm.jpg
2f26eeead46b6757aa2a601e2cad053f
9b970a545b3286cd02e4b9869a4f52ffbaa9aa04
1637 F20101129_AAAINK pol_a_Page_110.txt
83d4f98b06c79e59443836208bc3985f
60897622f219753e309a624b08c79893535466b8
27223 F20101129_AAAIMW pol_a_Page_118.QC.jpg
b0cdf6c04cc1e8aceda64a90dea74128
b1f89ee30e202ea9c436c038c919a8847c5445ad
2786 F20101129_AAAINL pol_a_Page_115thm.jpg
fd7d674c8d539a263e17cbe2baa44e9c
fdb5561fb1cc3ab460346864e232f059b5fd9876
20049 F20101129_AAAIMX pol_a_Page_074.QC.jpg
8f821edf7ec5338df089eccfc0e71359
cfafb3729b2f2baa7eaf5b54dae48d12662d0936
F20101129_AAAIOA pol_a_Page_083.jp2
07e6a46bbeb23c22ae7a8c7463593272
d00f28f0d6260cd591f70f03234854d865e492b6
68317 F20101129_AAAINM pol_a_Page_093.jpg
76dd4f3b151ce0be24dd9b7cfd3409b0
b971adb43b0cdd2993442450ea773005699ba2c7
94306 F20101129_AAAIMY pol_a_Page_097.jpg
6e6f55027e62fe984a443ef893ac9e23
5cc73a2adbbd1c7855ae5011da51bdaa939432ef
21838 F20101129_AAAIOB pol_a_Page_122.QC.jpg
bcfbbaf9e674b99343c7a83ddeed6cdf
9b800249790ac3bc4e465c7b858df66af8b8f974
F20101129_AAAINN pol_a_Page_081.tif
5ffd1eebc2bfe2ea9e58d5d431595861
9ce79fe05ca8d0eec70057d6c47d5d611ae5ca32
F20101129_AAAIMZ pol_a_Page_031.tif
f3018f770dd7e8ff4d4b9ce5095ccded
4312f8810bc6e1713aeaa8daaeb6c44f944093e5
98970 F20101129_AAAIOC pol_a_Page_014.jpg
6dc4f8840654b01b7f00dd95d34edbe6
2b2d1799e711d10b47ee2f54727c9f88785c5678
31775 F20101129_AAAINO pol_a_Page_020.QC.jpg
3d1a3488a6fc8c7978d4ca2054a8ff10
cd25efbcb27c28f4a1af9a88e7a0cfb9d0c27b28
2545 F20101129_AAAIOD pol_a_Page_015.txt
7c08a3cf98e0efe9e6dd1f27e59c51ee
01ac91d1eeed6c7d120cd401b6bb4670684e8979
7135 F20101129_AAAINP pol_a_Page_022thm.jpg
adb7a0ac2d18532e48fb5f0e8aac7173
180626385f57f2aba84081c0682c3b1c8a6becf9
F20101129_AAAIOE pol_a_Page_041.tif
3c8a43f8a6df5873bb7eab3287545b32
ab556961154979aa399d2fe5839c2fee094b768c
1050715 F20101129_AAAIOF pol_a_Page_030.jp2
69a69c73ab9ba8121f22d3571c054f69
720a151e98fd9e2a5efed7e6dcdb6013496c7f16
29367 F20101129_AAAINQ pol_a_Page_034.QC.jpg
4f59159088aa0e84d7488e3d84034b1b
b6ddb38d6466a5c5edae5a55790dfe2257802bca
86081 F20101129_AAAIOG pol_a_Page_018.jpg
d9ff521cf8963fcb6fb813201973acaf
bafe8b56fcf9e753aa76625199d122f8088a1b8b
25786 F20101129_AAAINR pol_a_Page_119.QC.jpg
165732c9b0e420b4fee334672f8c9636
7d4fcb7cdd8f6dcac5fadc0289d4674c0448220c
7513 F20101129_AAAIOH pol_a_Page_024thm.jpg
ef3b66abac8fe89d7503024dcb73439f
9da9c8d26e0ef39625b601bbc1880226e4722228
76970 F20101129_AAAINS pol_a_Page_044.jpg
d98dc25144a98f2ab47e90afe3fa8f7c
b97eda012b644eb6d6ac71c12fe516b61efa4fc8
7660 F20101129_AAAIOI pol_a_Page_050thm.jpg
f9f211bed5b233a6c9eea0400432cfbd
ee35e3fd8d7a5a0fdd32e1c96346904330c1f9e8
2446 F20101129_AAAINT pol_a_Page_022.txt
f1fddd164250a8ee3d3ad74330e13c96
2992efad53f7478ea2daf9715addb15a3ad7602a
2287 F20101129_AAAIOJ pol_a_Page_051.txt
65a7fca403d15415dfc709dcfadbbe9f
fbdc5704cc4737c535094fab9986beaa5f45c632
7125 F20101129_AAAINU pol_a_Page_045thm.jpg
8b605f53608cb3d2676df063018705a9
53664932544323b42dbb4a9905f49b2b81619f73
29127 F20101129_AAAIOK pol_a_Page_115.jpg
77182de24b0ac69aac91363a81732eb6
d48f846fc49e3731c3edae5499b28e1bedb1cf95
F20101129_AAAINV pol_a_Page_091.tif
ac4751c2759178c22663d6ab262f87c4
d9d52f64f7b71bc5b9d56d64e80e59985e4d8cf9
F20101129_AAAIOL pol_a_Page_007.tif
da0452138b3a4b078f2f0a31cfe40639
43022843b3cf4d340b3b64ca9f1030e49882bc38
30737 F20101129_AAAINW pol_a_Page_050.QC.jpg
f6578e05fb9132883ed73f2e9372bda4
9fe0c17f0915b94f587ab59eaaea76f924ce44d0
9621 F20101129_AAAIPA pol_a_Page_003.jpg
f620bc21e285a9002039657b36b36815
6d88bd9fac4b631ec471bc165a0fe80a9bcd53e5
7408 F20101129_AAAIOM pol_a_Page_108thm.jpg
474b2786a18b8ee292ae2a41dd68a4b7
eba8ac50be7df231218fca73a29089d363c00247
38363 F20101129_AAAINX pol_a_Page_117.jp2
19022cb29b8096058ddd87567a543607
9f2f63569ea1f314719d894b08971c2f3ca551e5
88789 F20101129_AAAIPB pol_a_Page_005.jpg
758d7db505fde949fc8314eb9314cd57
dc0d4c21afc22785697db85ea6a58031015d6835
29569 F20101129_AAAION pol_a_Page_085.QC.jpg
cf6cd48cc55a131c2d2bc659d85da5e5
daf496262df9d8fe0d71a849b6e7de1429845e87
70103 F20101129_AAAINY pol_a_Page_106.jpg
d929d9f8adbede46d331101ab9def7b1
fb703b6586dcf818b6c490361aa77b6fd6f12902
102018 F20101129_AAAIPC pol_a_Page_006.jpg
199b9196dbf6c6f5962bdf247c9f4505
bbec0dc48811416c6e29d32fe828ff7a70fc7d60
7339 F20101129_AAAIOO pol_a_Page_111thm.jpg
0eb9c7f4a0809c03dbb5933d2105c5f2
6300d6ea662770ea3733049c172c13a55db4acaf
2222 F20101129_AAAINZ pol_a_Page_104.txt
78adbe4b9634256c169c4fd725632115
35d69384451162dfdd51abb19d373df473f72875
26028 F20101129_AAAIPD pol_a_Page_008.jpg
c3a0b8d83a1d3f5677bf848e3d22b870
589e23921b4fa6202a7008efbd56162b0d9eae28
17481 F20101129_AAAIOP pol_a_Page_117.pro
44c0bcb52610831bb5f75de1a1d91035
6fb2f79fcc03127e82c25310485ebf9728582d83
50209 F20101129_AAAIPE pol_a_Page_009.jpg
127b2b6d8f66c35c73ea748d0b0c99de
1237a278d8d90e215b671e12ed12ba9e66cbc2ef
771750 F20101129_AAAIOQ pol_a_Page_110.jp2
08649f4d6b3088e5c2e98c790df99c09
b67c8f676760f36a5296ce32e8b4dd4983f7833d
25369 F20101129_AAAIPF pol_a_Page_011.jpg
ed149776f1b96a88ae3823a71c4b8a94
e195a5040325bab35010d5921610a716ccf56598
F20101129_AAAIOR pol_a_Page_115.tif
6e4d9b4d8c932c7695d5016d80e1ea82
76d9b9a90b74e1122a03d345e85f3b7d1f7774c9
92443 F20101129_AAAIPG pol_a_Page_013.jpg
5c19b05361e611790ce0fa3c51a0509d
16b35e427bcca1200fdba6de5c79f629ec0bd719
F20101129_AAAIOS pol_a_Page_066.jp2
ac6d06fa8dde7a7eb4477ecb57cb746d
9d0469985497274a672f5622d304c6533533987d
88856 F20101129_AAAIPH pol_a_Page_016.jpg
8dabea652d6e41477f4485937e9e62ce
b3b8d0be2c3503e3fadaedcd9e328e57972f5d2f
903275 F20101129_AAAIOT pol_a_Page_104.jp2
83599ebe5adfd7d7830fab2798d5d45c
b1330a5a8d72bad9673a3712391cb4362bac80a2
66024 F20101129_AAAIPI pol_a_Page_017.jpg
2b0989cd6ea34419a74d4b752155659e
01d4dc007756c6b4e9b0043d767775f79c782e79
92459 F20101129_AAAIOU pol_a_Page_034.jpg
4b5ae45b01144cc2b6aea9c7ca5e950e
aa91e89983fddb6f998e5c97fb16249fc94742aa
85292 F20101129_AAAIPJ pol_a_Page_019.jpg
8dbbf544d3ce0f0385d81c73f7a07ead
e4b53ea3976fa6464e508683344246022335b7bc
6720 F20101129_AAAIOV pol_a_Page_076thm.jpg
42fa041db3c58c56754b1ccf78baf5f8
fabce01deb6d79d6f9779ef937512164f7bdb565
41458 F20101129_AAAIPK pol_a_Page_021.jpg
a24f1ea92cc872469a5b29a7171a1fdb
8ae7705ce2d5abc3c2e3c52390499bfd077f63cb
139431 F20101129_AAAIOW UFE0021132_00001.mets
a54d7fa1ac200c3cccac16e57f6ca364
ffdb08d7ddb251e7c5bd18556ef1d6a4c4da8c77
88960 F20101129_AAAIPL pol_a_Page_022.jpg
ac51662eddde75601cf57d29c87599f5
018df3c6786f69776b40365547811c28b3d223f8
99362 F20101129_AAAIQA pol_a_Page_046.jpg
58da4491f3533411af7a77b56f7c346b
ca4153eae9ace4908e16d0798ec1b8bbc1c9ba0b
96901 F20101129_AAAIPM pol_a_Page_023.jpg
7834a57f46df144d18dcd9c1a27d07dd
670745c06ee52c1ff4e856b8b58b2c3df10fe476
78170 F20101129_AAAIQB pol_a_Page_047.jpg
5757647abead80471184b192070b467f
c6e5aa87e4abc2546697de4fb4617b861b582d25
97434 F20101129_AAAIPN pol_a_Page_025.jpg
542f1f17bc759ada68b34864ba92133f
2c2553e9846c311fc8245f683e7c30a1468941ec
22276 F20101129_AAAIOZ pol_a_Page_001.jpg
4dda041bfb90f17a7d2ecf0da71a6c1c
9e5586133b896ecc6bec161d31368ed28fb40b48
87757 F20101129_AAAIQC pol_a_Page_048.jpg
0cc1f6d45252e590b9cb10785176fa27
d19931e4dd4526e2c419a3d495779dce057e7e7c
89337 F20101129_AAAIPO pol_a_Page_026.jpg
5dcf8a61ccd44902723e96949661ba8a
98f28a37dfea8dc007c8306228be7e6f290e58ba
91823 F20101129_AAAIQD pol_a_Page_049.jpg
171ed1cd7d4bb5c2403613899bf0422d
a5dc46b4a52c6ec51ca60dc7d0abdcb708ced87d
79329 F20101129_AAAIPP pol_a_Page_028.jpg
ff95cdc7e121e3ce42e0d088d8ca66ce
f5e061621aeedd5eafdf8a54a2d02e491e6a2177
95381 F20101129_AAAIQE pol_a_Page_050.jpg
51accbe14800b94afafc958bf3823ec8
b61939cfed02c7241c27cf7020fb5893f3f798df
83240 F20101129_AAAIPQ pol_a_Page_029.jpg
d33e213f09a5876423004e6b1549812a
2d251991fc73c20158e82344d011b2bbcfa9c244
85040 F20101129_AAAIQF pol_a_Page_051.jpg
4f58b5c6859401874536ff274be18f65
7c9c623fc0ee9561f7edb558df9fb1bea7f52d10
75873 F20101129_AAAIPR pol_a_Page_030.jpg
b5d162a436ce7bda405a64595a1e7653
ed7beb91031b921c7771de4b8df5d1ec4860ea3e
97968 F20101129_AAAIQG pol_a_Page_052.jpg
bc3b92a5f9f9c83ef5110a758a43d6ba
52450940b25d84eb1587de5689780eb1ef94b9f9
91289 F20101129_AAAIQH pol_a_Page_053.jpg
246d1b78c3513a3338fe07e2a52ac83e
951b9346e9439e759e7300f73b078101eda7c4d3
82424 F20101129_AAAIPS pol_a_Page_033.jpg
fb6d67fbb4db0258f41e1cef3df051c0
03365a101ed35b0de9f7dc8db08d26fd8e325243
95935 F20101129_AAAIQI pol_a_Page_055.jpg
31fc9680a0adc7ad632c0e3f221ad288
08d0362730051a459144685fa86a5cb5e399cd53
100708 F20101129_AAAIPT pol_a_Page_035.jpg
ac0d9618b946194085885cffb4a8e7f6
136fa2d153d96ac9918d52a351e243fe122e0be5
88173 F20101129_AAAIQJ pol_a_Page_056.jpg
a2afabb8c88a10f7eac56891f8ae527b
aed2b27d77d3026840def0fa27d5493d86aec699
90324 F20101129_AAAIPU pol_a_Page_037.jpg
c481b7aec118a8cd6a38cd94a69e2b4c
2979d6fcd664256cfba49ce03ff88e67cbb0bf48
91983 F20101129_AAAIQK pol_a_Page_058.jpg
4b912838abcf77d4c03a71356056a811
3346b227ba07acf2907782b089685c8b7e8c313b
40820 F20101129_AAAIPV pol_a_Page_039.jpg
addcbf8a99dfd8da1b60acd9f0d6210a
2a5605ba9c12c79e8ae96d37dcb4f1cdf3dee906
58044 F20101129_AAAIQL pol_a_Page_060.jpg
a82f18614bf0e397701deb3f62a16076
a45fadd9e8d2754b84b2107d70f15e36615a724d
95587 F20101129_AAAIPW pol_a_Page_040.jpg
94bf4360dd6c5aa7cc9d3fef734dc1a3
cf6c4594866eb377f8c2e79a9fcc404f3ef0db01
97135 F20101129_AAAIRA pol_a_Page_080.jpg
679ff7fdfa7c187736a1d835ec1688e5
fb06058aff67d9e6284a31556c4d6939c6fbf796
86149 F20101129_AAAIQM pol_a_Page_061.jpg
6ebcd8baee5b82baba3b68eff601faf8
203a3a50a1ea282399077880cde2107e56ca70f3
86679 F20101129_AAAIPX pol_a_Page_042.jpg
a21ea75635073ad59a037e560421ed17
0e26ea31423c0754c4b4374178bfdcd8097c9311
86429 F20101129_AAAIRB pol_a_Page_081.jpg
9bf9b8025fb408e882cf896be4d7f77e
97cbfbfb21e88d983f4eef1d76a92c0a4ed74b86
62714 F20101129_AAAIQN pol_a_Page_064.jpg
0394ca3e2517fd0ac6239c7b9f6a807b
0d0f2765bdc7e8fc242673334147c7f4f3f28b65
55410 F20101129_AAAIPY pol_a_Page_043.jpg
84136ccadd526731e11767318ae11928
2f748fa80118a89f79e6f42bcd90f19b2d76ace3







MAINTAINING VERY LARGE SAMPLES USING THE GEOMETRIC FILE


By

ABHIJIT A. POL

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007


© 2007 Abhijit A. Pol


To my wonderful parents

ACKNOWLEDGMENTS

At the end of my dissertation, I would like to thank all those people who made this dissertation possible and an enjoyable experience for me.

First of all, I wish to express my sincere gratitude to my adviser, Chris Jermaine, for his patient guidance, encouragement, and excellent advice throughout this study. If I had access to a magical create-your-own-adviser tool, I still would not have ended up with anyone better than Chris. He always introduces me to interesting research problems. He is around whenever I have a question, but at the same time he encourages me to think on my own and to work on any problems that interest me.

I am also indebted to Alin Dobra for his support and encouragement. Alin is a constant source of enthusiasm. The only topic I have not discussed with him is the strategy of Gator football games.

I am grateful to my dissertation committee members, Tamer Kahveci, Joachim Hammer, and Ravindra Ahuja, for their support and encouragement.

I acknowledge the Department of Industrial and Systems Engineering, Ravindra Ahuja, and chair Donald Hearn for the financial support and advice I received during the initial years of my studies.

Finally, I would like to express my deepest gratitude for the constant support, understanding, and love that I received from my parents during the past years.

TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

    1.1 The Geometric File
    1.2 Biased Reservoir Sampling
    1.3 Sampling the Sample
    1.4 Index Structures for the Geometric File

2 RELATED WORK

    2.1 Related Work on Reservoir Sampling
    2.2 Biased Sampling Related Work

3 THE GEOMETRIC FILE

    3.1 Reservoir Sampling
    3.2 Sampling: Sometimes a Little Is Not Enough
    3.3 Reservoir for Very Large Samples
    3.4 The Geometric File
    3.5 Characterizing Subsample Decay
    3.6 Geometric File Organization
    3.7 Reservoir Sampling with a Geometric File
        3.7.1 Introducing the Required Randomness
        3.7.2 Handling the Variance
        3.7.3 Bounding the Variance
    3.8 Choosing Parameter Values
        3.8.1 Choosing a Value for Alpha
        3.8.2 Choosing a Value for Beta
    3.9 Why Reservoir Sampling with a Geometric File Is Correct
        3.9.1 Correctness of the Reservoir Sampling Algorithm with a Buffer
        3.9.2 Correctness of the Reservoir Sampling Algorithm with a Geometric File
    3.10 Multiple Geometric Files
    3.11 Reservoir Sampling with Multiple Geometric Files
        3.11.1 Consolidation and Merging
        3.11.2 How Can Correctness Be Maintained?
        3.11.3 Handling the Stacks in Multiple Geometric Files
    3.12 Speed-Up Analysis

4 BIASED RESERVOIR SAMPLING

    4.1 A Single-Pass Biased Sampling Algorithm
        4.1.1 Biased Reservoir Sampling
        4.1.2 So, What Can Go Wrong? (And a Simple Solution)
        4.1.3 Adjusting Weights of Existing Samples
    4.2 Worst-Case Analysis for the Biased Reservoir Sampling Algorithm
        4.2.1 The Proof for the Worst Case
        4.2.2 The Proof of Theorem 1: The Upper Bound on totalDist
    4.3 Biased Reservoir Sampling with the Geometric File
    4.4 Estimation Using a Biased Reservoir

5 SAMPLING THE GEOMETRIC FILE

    5.1 Why Might We Need to Sample from a Geometric File?
    5.2 Different Sampling Plans for the Geometric File
    5.3 Batch Sampling from a Geometric File
        5.3.1 A Naive Algorithm
        5.3.2 A Geometric File Structure-Based Algorithm
        5.3.3 Batch Sampling Multiple Geometric Files
    5.4 Online Sampling from a Geometric File
        5.4.1 A Naive Algorithm
        5.4.2 A Geometric File Structure-Based Algorithm
    5.5 Sampling a Biased Sample

6 INDEX STRUCTURES FOR THE GEOMETRIC FILE

    6.1 Why Index a Geometric File?
    6.2 Different Index Structures for the Geometric File
    6.3 A Segment-Based Index Structure
        6.3.1 Index Construction During Start-Up
        6.3.2 Maintaining the Index During Normal Operation
        6.3.3 Index Look-Up and Search
    6.4 A Subsample-Based Index Structure
        6.4.1 Index Construction and Maintenance
        6.4.2 Index Look-Up
    6.5 An LSM-Tree-Based Index Structure
        6.5.1 An LSM-Tree Index
        6.5.2 Index Maintenance and Look-Ups

7 BENCHMARKING

    7.1 Processing Insertions
        7.1.1 Experiments Performed
        7.1.2 Discussion of Experimental Results
    7.2 Biased Reservoir Sampling
        7.2.1 Experimental Setup
        7.2.2 Discussion
    7.3 Sampling from a Geometric File
        7.3.1 Experiments Performed
        7.3.2 Discussion of Experimental Results
    7.4 Index Structures for the Geometric File
        7.4.1 Experiments Performed
        7.4.2 Discussion

8 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES


Table page

1-1 Population: student records ......................................... 17

1-2 Random sample of size 4 ............................................. 17

1-3 Biased sample of size 4 ............................................. 17

7-1 Millions of records inserted in 10 hrs ............................. 110

7-2 Query timing results for 1 KB records, |R| = 10 million, and |B| = 50k ...... 113

7-3 Query timing results for 200-byte records, |R| = 50 million, and |B| = 250k .. 114









LIST OF FIGURES


Figure page

3-1 Decay of a subsample after multiple buffer flushes .................. 38

3-2 Basic structure of the geometric file ............................... 39

3-3 Building a geometric file ........................................... 43

3-4 Distributing new records to existing subsamples ..................... 44

3-5 Speeding up the processing of new samples using multiple geometric files ..... 54

4-1 Adjustment of r' to r'' ............................................. 69

7-1 Results of benchmarking experiments (processing insertions) ........ 101

7-2 Results of benchmarking experiments (sampling from a geometric file) ........ 102

7-3 Sum query estimation accuracy for zipf=0.2 ......................... 104

7-4 Sum query estimation accuracy for zipf=0.5 ......................... 105

7-5 Sum query estimation accuracy for zipf=0.8 ......................... 106

7-6 Sum query estimation accuracy for zipf=1.0 ......................... 107

7-7 Disk footprint for 1 KB record size ................................ 110

7-8 Disk footprint for 200 B record size ............................... 112









Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

MAINTAINING VERY LARGE SAMPLES USING THE GEOMETRIC FILE

By

Abhijit A. Pol

August 2007

Chair: Christopher M. Jermaine
Major: Computer Engineering

Sampling is one of the most fundamental data management tools available. It is one of the

most powerful methods for building a one-pass synopsis of a data set, especially in a streaming

environment where the assumption is that there is too much data to store all of it permanently.

However, most current research involving sampling considers the problem of how to use a

sample, and not how to compute one. The implicit assumption is that a "sample" is a small data

structure that is easily maintained as new data are encountered, even though simple statistical

arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary

to provide high accuracy. No existing work tackles the problem of maintaining very large,

disk-based samples in an online manner from streaming data.

We present a new data organization called the geometric file and online algorithms for

maintaining a very large, on-disk sample. The algorithms are designed for any environment where

a large sample must be maintained online in a single pass through a data set. The geometric file

organization meets the strict requirement that the sample always be a true, statistically random

sample (without replacement) of all of the data processed thus far.

We modify the classic reservoir sampling algorithm to compute a fixed-size sample in a

single pass over a data set, where the goal is to bias the sample using an arbitrary, user-defined

weighting function. We also describe how the geometric file can be used to perform a biased

reservoir sampling.









While a very large sample can be required to answer a difficult query, a huge sample may

often contain too much information. We therefore develop efficient techniques which allow a

geometric file to itself be sampled in order to produce smaller data objects.

Efficiently searching and discovering information from the geometric file is essential for

query processing. A natural way to support this is to build an index structure. We discuss three

secondary index structures and their maintenance as new records are inserted into a geometric file.









CHAPTER 1
INTRODUCTION

Despite the variety of alternatives for approximate query processing [1, 21, 30, 34, 39],

sampling is still one of the most powerful methods for building a one-pass synopsis of a data set,

especially in a streaming environment where the assumption is that there is too much data to store

all of it permanently. Sampling's many benefits include:

Sampling is the most widely-studied and best understood approximation technique cur-
rently available. Sampling has been studied for hundreds of years, and many fundamental
results describe the utility of random samples (such as the Central Limit Theorem, Cher-
noff, Hoeffding and Chebyshev bounds [16, 49]).

Sampling is the most versatile approximation technique available. Most data processing
algorithms can be used on a random sample of a data set rather than the original data with
little or no modification. For example, almost any data mining algorithm for building a
decision tree classifier can be run directly on a sample.

Sampling is the most widely-used approximation technique. Sampling is common in data
mining, statistics, and machine learning. The sheer number of recent papers from ICDE,
VLDB, and SIGMOD [2, 3, 8, 14, 15, 28, 32, 33, 35, 46, 51, 52] that use samples testifies to
sampling's popularity as a data management tool.


Given the obvious importance of random sampling, it is perhaps surprising that there has

been very little work in the data management community on how to actually perform random

sampling. The most well-known papers in this area are due to Olken and Rotem [25, 27], who

also offer the definitive survey of related work through the early 1990s [26]. However, this work

is relevant mostly for sampling from data stored in a database, and implicitly assumes that a

"sample" is a small data structure that is easily stored in main memory.

Such assumptions are sometimes overly restrictive. Consider the problem of approximate

query processing. Recent work has suggested the possibility of maintaining a sample of a large

database and then executing analytic queries over the sample rather than the original data as a

way to speed up processing [4, 31]. Given the most recent TPC-H benchmark results [17], it is

clear that processing standard report-style queries over a large, multi-terabyte data warehouse

may take hours or days. In such a situation, maintaining a fully materialized random sample









of the data (or "sample view" [43]) may be desirable. In order to save time and/or computer

resources, queries can then be evaluated over the sample rather than the original data, as long as

the user can tolerate some carefully controlled inaccuracy in the query results.

This particular application has two specific requirements that are addressed by the dis-

sertation. First, it may be necessary to use quite a large sample in order to achieve acceptable

accuracy; perhaps on the order of gigabytes in size. This is especially true if the sample will

be used to answer selective queries or aggregates over attributes with high variance (see Sec-

tion 3.2). Second, whatever the required sample size, it is often independent of the size of the

database, since estimation accuracy depends primarily on sample size.¹ In other words, the

required sample size will generally not grow as the database size increases, as long as other

factors such as query selectivity remain relatively constant. Thus, this application requires that

we be able to maintain a large, disk-based, fixed-size random sample of the archived data, even as

new data are added to the warehouse. This is precisely the problem we tackle in the dissertation.

For another example of a case where existing sampling methods can fall short, consider

stream-based data management tasks, such as network monitoring (for an example of such an

application, we point to the Gigascope project from AT&T Laboratories [18-20]). Given the

tremendous amount of data transported over today's computer networks, the only conceivable

way to facilitate ad-hoc, after-the-fact query processing over the set of packets that have passed

through a network router is to build some sort of statistical model for those packets. The most

obvious choice would be to produce a very large, statistically random sample of the packets that

have passed through the router. Again, maintaining such a sample is precisely the problem we

tackle in this dissertation. While other researchers have tackled the problem of maintaining an



1 The unimportance of database size for certain queries is due to the fact that the bias and vari-
ance of many sampling-based estimators are related far more to sample size than to the sampling
fraction (see Cochran [16] for a thorough treatment of finite population random sampling).









online sample targeted towards more recent data [7], no existing methods have considered how to

handle very large samples that exceed the available main memory.

In this dissertation we describe a new data organization called the geometric file and related

online algorithms for maintaining a very large, disk-based sample from a data stream. The

dissertation is divided into four parts. In the first part we describe the geometric file organization

and detail how geometric files can be used to maintain a very large simple random sample. In

the second part we propose a simple modification to the classical reservoir sampling algorithm

to compute a biased sample in a single pass over the data stream and describe how the geometric

file can be used to maintain a very large biased sample. In the third part we develop techniques

which allow a geometric file to itself be sampled in order to produce smaller sets of data objects.

Finally, in the fourth part, we discuss secondary index structures for the geometric file. Index

structures are useful to speed up search and discovery of required information from a huge

sample stored in a geometric file. The index structures must be maintained concurrently with

constant updates to the geometric file and at the same time provide efficient access to its records.

We now give an introduction to these four parts of the dissertation in subsequent sections.

1.1 The Geometric File

If one accepts the notion that being able to maintain a very large (but fixed size) random

sample from a data stream is an important problem, it is reasonable to ask: Is maintaining such

a sample difficult or costly using modern algorithms and hardware? Fortunately, modern storage

hardware gives us the capacity to inexpensively store very large samples that should suffice for

even difficult and emerging applications. A terabyte of commodity hard disk storage now costs

less than $1,000. Given current trends, we should see storage costs of $1,000 per petabyte by

the year 2020. However, even given such large storage capacities, it turns out that maintaining a

large sample is difficult using current technology. The problem is not purchasing the hardware to

store the sample; rather, the problem is actually getting the samples onto disk, so as to guarantee

the statistical randomness of the sample, in the face of data streams that may exceed tens of

gigabytes per minute in the case of a network monitoring application.









Current techniques suitable for maintaining samples from a data stream are based on

reservoir sampling [11, 38]. Reservoir sampling algorithms can be used to dynamically maintain

a fixed-size sample of N records from a stream, so that at any given instant, the N records in

the sample constitute a true random sample of all of the records that have been produced by the

stream. However, as we will discuss in this dissertation, the problem is that existing reservoir

techniques are suitable only when the sample is small enough to fit into main memory.

Given that there are limited techniques for maintaining very large samples, the problem

addressed in the first part of this dissertation is as follows:

Given a main memory buffer B large enough to hold |B| records, can we develop efficient

algorithms for dynamically maintaining a massive random sample containing exactly N records

from a data stream, where N > |B|?

Key design goals for the algorithms we develop are

1. The algorithms must be suitable for streaming data, or any similar environment where a
large sample must be maintained on-line in a single pass through a data set, with the strict
requirement that the sample always be a true, statistically random sample of fixed size N
(without replacement) from all of the data produced by the stream thus far.

2. When maintaining the sample, the fraction of I/O time devoted to reads should be close to
zero. Ideally, there would never be a need to read a block of samples from disk simply to
add one new sample and subsequently write the block out again.

3. The fraction of I/O time spent performing random I/Os should also be close to zero. Costly
random disk seeks should be few and far between. Almost all I/O should be sequential.

4. Finally, the amount of data written to disk should be bounded by the total size of all of the
records that are ever sampled.


The geometric file meets each of the requirements listed above. With memory large enough

to buffer |B| > 1 records, the geometric file can be used to maintain an online sample of arbitrary

size with an amortized cost of O(u × log|B| / |B|) random disk head movements for each newly

sampled record (see Section 3.12). The multiplier u can be made arbitrarily small by making use

of additional disk space. A rigorous benchmark of the geometric file demonstrates its superiority

over the obvious alternatives.









1.2 Biased Reservoir Sampling

In this part of the dissertation, we study the problem of how to compute a simple, fixed-size

random sample (without replacement) in a single pass over a data stream, where the goal is to

bias the sample using some arbitrary weighting function.

The need for biased sampling can easily be illustrated with an example population, given

in Table 1-1. This particular data set contains records describing graduate student salaries in

a university academic department, and our goal is to guess the total graduate student salary.

Imagine that a simple random sample of the data set is drawn, as shown in the Table 1-2. The

four sampled records are then used to guess that the total student salary is (520 + 700 + 580 +

600) × 12/4 = $7,200, which is considerably less than the true total of $9,545. The problem is that

we happened to miss most of the high-salary students who are generally more important when

computing the overall total.

Now, imagine that we weight each record, so that the probability of including any given

record with a salary of $700 or greater in the sample is (2) × (4/12) = 8/12, and the probability of including

a given record with a salary less than $700 is (1/2) × (4/12) = 4/24. Thus, our sample will tend to include

those records with higher values, that are more important to the overall sum. The resulting biased

sample is depicted in Table 1-3. The standard Horvitz-Thompson estimator [50] is then applied to

the sample (where each record is weighted according to the inverse of its sampling probability),

which gives us an estimate of (1200 + 1500 + 750) × (12/8) + (580) × (24/4) = $8,655. This

is obviously a better estimate than $7,200, and the fact that it is better than the original estimate is

not just accidental: if one chooses the weights carefully, it is easily possible to produce a sample

whose associated estimator has lower variance (and hence higher accuracy) than the simple,

uniform-probability sample. For instance, the variance of the estimator in the student salary

example is 2.533 x 106 under the uniform-probability sampling and it is 5.083 x 105 under the

biased sampling scheme.
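
The arithmetic of this example is easy to check. The short Python script below is an illustration added here (not part of the original experiments); it recomputes the uniform and Horvitz-Thompson estimates, following the sampled salary values used in the text:

    # Reproduces the arithmetic of the running example (Tables 1-1 to 1-3).
    salaries = [1200, 520, 1250, 1500, 700, 530, 750, 580, 605, 550, 760, 600]
    print(sum(salaries))                           # true total: 9545

    # Uniform sample {Tom, Ashley, Frank, Monica}, scaled up by 12/4:
    uniform = [520, 700, 580, 600]
    print(sum(uniform) * 12 / 4)                   # 7200.0

    # Biased sample: a record with salary >= 700 is included with
    # probability 2 x (4/12) = 8/12; any other record with (1/2) x (4/12) = 4/24.
    # Horvitz-Thompson weights each sampled value by the inverse of its
    # inclusion probability.
    high, low = [1200, 1500, 750], [580]
    print(sum(high) * 12 / 8 + sum(low) * 24 / 4)  # 8655.0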









Table 1-1. Population: student records

Rec # Name Class Salary ($/month)
1 James Junior 1200
2 Tom Freshman 520
3 Sandra Junior 1250
4 Jim Senior 1500
5 Ashley Sophomore 700
6 Jennifer Freshman 530
7 Robert Sophomore 750
8 Frank Freshman 580
9 Rachel Freshman 605
10 Tim Freshman 550
11 Maria Sophomore 760
12 Monica Freshman 600
Total Salary: 9545.00


Table 1-2. Random sample of size 4

Rec # Name Class Salary ($/month)
2 Tom Freshman 520
5 Ashley Sophomore 700
8 Frank Freshman 580
12 Monica Freshman 600


Other cases where a biased sample is preferable abound. For example, if the goal is to

monitor the packets flowing through a network, one may choose to weight more recent packets

more heavily, since they would tend to figure more prominently in most query workloads.

We propose a simple modification to the classic reservoir sampling algorithm [11, 38]

in order to derive a very simple algorithm that permits the sort of fixed-size, biased sampling

given in the example. Our method assumes the existence of an arbitrary, user-defined weighting

function f which takes as an argument a record ri, where f(ri) > 0 describes the record's utility


Table 1-3. Biased sample of size 4

Rec # Name Class Salary ($/month)
1 James Junior 1200
4 Jim Senior 1500
7 Robert Sophomore 750
11 Maria Sophomore 760









in subsequent query processing. We then compute (in a single pass) a biased sample Ri of the i

records produced by a data stream. Ri is fixed-size, and the probability of sampling the jth record

from the stream is proportional to f(rj) for all j < i. This is a fairly simple and yet powerful

definition of biased sampling, and is general enough to support many applications.
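
To make the setting concrete, the sketch below implements the well-known key-based method of Efraimidis and Spirakis for single-pass, fixed-size weighted sampling without replacement. It is not the algorithm developed in Chapter 4 (it realizes a related but different notion of weighted sampling), and f here is any user-supplied positive weighting function:

    import heapq
    import random

    def weighted_reservoir(stream, f, n):
        # Each record r draws the key u**(1/f(r)) with u uniform on (0, 1);
        # the n records with the largest keys form the sample.
        heap = []  # min-heap of (key, tiebreak, record)
        for i, r in enumerate(stream):
            key = random.random() ** (1.0 / f(r))
            if len(heap) < n:
                heapq.heappush(heap, (key, i, r))
            elif key > heap[0][0]:
                heapq.heapreplace(heap, (key, i, r))
        return [r for _, _, r in heap]

    # Example: bias the sample towards records with larger values.
    print(weighted_reservoir(range(1, 101), lambda r: float(r), 5))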

The key contributions of this part of the dissertation are as follows:

1. We present a modified version of the classic reservoir sampling algorithm that is ex-
ceedingly simple, and is applicable for biased sampling using any arbitrary user-defined
weighting function f.

2. In most cases, our algorithm is able to produce a correctly biased sample. However, given
certain pathological data sets and data orderings, this may not be the case. Our algorithm
adapts in this case and provides a correctly biased sample for a slightly modified bias
function f'. We analytically bound how far f' can be from f in such a pathological case,
and experimentally evaluate the practical significance of this difference.

3. We describe how to perform a biased reservoir sampling and maintain large biased samples
with the geometric file.

4. Finally, we derive the correlation (covariance) between the Bernoulli random variables gov-
erning the sampling of two records ri and rj using our algorithm. We use this covariance
to derive the variance of a Horvitz-Thompson estimator making use of a sample computed
using our algorithm.


1.3 Sampling The Sample

A geometric file is a simple random sample (without replacement) from a data stream.

In this part of the dissertation we develop techniques which allow a geometric file to itself be

sampled in order to produce smaller sets of data objects that are themselves random samples

(without replacement) from the original data stream. The goal of the algorithms described in

this part is to efficiently support further sampling of a geometric file by making use of its own

structure.

Small samples frequently do not provide enough accuracy, especially in the case when

the resulting statistical estimator has a very high variance. However, while in the general

case a very large sample can be required to answer a difficult query, a huge sample may often

contain too much information. For example, consider the problem of estimating the average









net worth of American households. In the general case, many millions of samples may be

needed to estimate the net worth of the average household accurately (due to a small ratio

between the average household's net worth and the standard deviation of this statistic across all

American households). However, if the same set of records held information about the size of

each household, only a few hundred records would be needed to obtain similar accuracy for an

estimate of the average size of an American household, since the ratio of average household size

to the standard deviation of household size across households in the United States is greater than 2.

Thus, to estimate the answer to these two queries, vastly different sample sizes are needed.

Since there is no single sample size that is optimal for answering all queries and the required

sample size can vary dramatically from query to query, this part of the dissertation considers the

problem of generating a sample of size N from a data stream using an existing geometric file

that contains a large sample of records from the stream, where N < |R|. We will consider two

specific problems. First, we consider the case where N is known beforehand. We will refer to a

sample retrieved in this manner as a batch sample. We will also consider the case where N is not

known beforehand, and we want to implement an iterative function GetNext. Each call to GetNext

results in an additional sampled record being returned to the caller, and so N consecutive calls

to GetNext results in a sample of size N. We will refer to a sample retrieved in this manner as an

online or sequential sample.
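
To make the GetNext contract concrete, the hypothetical sketch below implements it over a sample small enough to hold in memory, using an incremental Fisher-Yates shuffle so that after N calls the records returned form a uniform random sample of size N without replacement; Chapter 5 develops the non-trivial, disk-based version of this interface:

    import random

    class OnlineSampler:
        # Iterative GetNext over an in-memory list. Illustrative only.
        def __init__(self, records):
            self.records = list(records)
            self.i = 0  # number of records handed out so far

        def get_next(self):
            if self.i >= len(self.records):
                raise IndexError("sample exhausted")
            # Swap a random not-yet-returned record into position i.
            j = random.randrange(self.i, len(self.records))
            self.records[self.i], self.records[j] = self.records[j], self.records[self.i]
            self.i += 1
            return self.records[self.i - 1]

    s = OnlineSampler(range(1_000))
    print([s.get_next() for _ in range(10)])  # an online sample of size 10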

1.4 Index Structures For The Geometric File

A geometric file could easily contain a sample of size several gigabytes or even terabytes.

A huge sample like this may often contain too much information and it becomes expensive to

scan all the records of a sample to find those (most likely very few) records that match a given

condition. A natural way to speed up the search and discovery of those records from a geometric

file that have a particular value for a particular attribute is to build an index structure. In this part

of the dissertation we discuss and compare three different index structures for the geometric file.

In general, an index is a data structure that lets us find a record without having to look at

more than a small fraction of all possible records. An index is referred to as a primary index if it

determines the location of the indexed records in the file. An index is referred to as a secondary index

if it tells us the current location of records whose placement may have been decided by a primary index.

Thus, a secondary index is an index that is maintained for a data file, but not used to control

the current processing order of the file. In the case of a geometric file the physical location of a

sampled record is determined (randomly) by the insertion algorithms. We could therefore build

a secondary index structure on one or more attributes including the key attribute of the record.

Apart from providing an efficient access to the desired information in the file, the index must

be maintained as new records are inserted to the geometric file. For instance, we could build a

secondary index on an attribute when the new records are bulk inserted in the geometric file. At

this time we must determine how we merge the new secondary index with the existing indexes

built for the rest of the file. Furthermore, we must maintain the index as existing records are

being overwritten with newly inserted records and hence are deleted from the geometric file.

With these goals in mind, we discuss three secondary index structures for the geometric file:

(1) a segment-based index, (2) a subsample-based index, and (3) a Log-Structured Merge-Tree-

(LSM-) based index. The first two indexes are developed around the structure of the geometric

file. Multiple B+-tree indexes are maintained for each segment or subsample in a geometric file.

As new records are added to the file in units of a segment or subsample, a new B+-tree indexing

new records is created and added to the index structure. Also, an existing B+-tree is deleted from

the structure when all the records indexed by it are deleted from the file. The third index structure

makes use of the LSM-tree index [44], a disk-based data structure designed to provide low-

cost indexing in an environment with a high rate of inserts and deletes. We evaluate and compare

these three index structures experimentally by measuring build time and disk footprint as new

records are inserted in the geometric file. We also compare efficiency of these structures for point

and range queries.
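
As a minimal illustration of the segment-based idea, the sketch below keeps one small sorted index per segment, answers a point look-up by probing every live index, and drops an index wholesale when its segment is overwritten; in-memory sorted lists stand in here for the on-disk B+-trees:

    import bisect

    class SegmentIndex:
        def __init__(self):
            self.segments = {}  # segment id -> sorted list of (key, record id)

        def add_segment(self, seg_id, pairs):
            # Built in bulk when a buffer flush creates a new segment.
            self.segments[seg_id] = sorted(pairs)

        def drop_segment(self, seg_id):
            # Called when a later flush overwrites this segment's records.
            self.segments.pop(seg_id, None)

        def lookup(self, key):
            hits = []
            for pairs in self.segments.values():
                i = bisect.bisect_left(pairs, (key,))
                while i < len(pairs) and pairs[i][0] == key:
                    hits.append(pairs[i][1])
                    i += 1
            return hits

    idx = SegmentIndex()
    idx.add_segment(0, [(5, "r1"), (9, "r2")])
    idx.add_segment(1, [(5, "r3")])
    print(idx.lookup(5))  # ['r1', 'r3']
    idx.drop_segment(0)
    print(idx.lookup(5))  # ['r3']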

Dissertation organization and original publications: The rest of the dissertation is

organized as follows. We present the related work in Chapter 2. In Chapter 3 we present the

geometric file organization and show how this structure can be used to maintain a very large









simple random sample. In Chapter 4 we propose a single pass biased reservoir sampling

algorithm. In Chapter 5 we develop techniques that can be used to sample geometric files to

obtain a small size sample. In Chapter 6 we present secondary index structures for the geometric

file. In Chapter 7 we discuss the benchmarking results. The dissertation is concluded in Chapter 8.

Most of the work in the dissertation is either already published or is under review for

publication. The material from Chapter 3 is from the paper with Christopher Jermaine and

Subramanian Arumugam that was originally published in SIGMOD 2004 [36]. The work

presented in Chapter 4 is submitted to TKDE and is under review [47]. The material in Chapter 5

is part of a journal paper accepted at VLDBJ [48]. The results in Chapter 7 are taken from

the above three papers as well.









CHAPTER 2
RELATED WORK

In this chapter, we first review the literature on reservoir sampling algorithms. We then

present the summary of existing work on biased sampling.

2.1 Related Work on Reservoir Sampling

Sampling has a very long history in the data management literature, and research continues

unabated today [2, 3, 8, 14, 15, 28, 32, 33, 35, 51, 52]. However, most previous papers

(including the aforementioned references) are concerned with how to use a sample, and not with

how to actually store or maintain one. Most of these algorithms could be viewed as potential

users of a large sample maintained as a geometric file.

As mentioned in the introduction chapter, a series of papers by Olken and Rotem (including

two papers listed in the References section [25, 27]) probably constitute the most well-known

body of research detailing how to actually compute samples in a database environment. Olken

and Rotem give an excellent survey of work in this area [26]. However, most of this work is very

different from ours, in that it is concerned primarily with sampling from an existing database

file, where it is assumed that the data to be sampled from are all present on disk and indexed by

the database. Single pass sampling is generally not the goal, and when it is, management of the

sample itself as a disk-based object is not considered.

The algorithms in this dissertation are based on reservoir sampling, which was first de-

veloped in the 1960s [11, 38]. In his well-known paper [53], Vitter extends this early work by

describing how to decrease the number of random numbers required to perform the sampling.

Vitter's techniques could be used in conjunction with our own, but the focus of existing work

on reservoir sampling is again quite different from ours; management of the sample itself is not

considered, and the sample is implicitly assumed to be small and in-memory. However, if we re-

move the requirement that our sample of size N be maintained on-line so that it is always a valid

snapshot of the stream and must evolve over time, then sequential sampling techniques related

to reservoir sampling could be used to build (but not maintain) a large, on-disk sample (see

Vitter [54], for example).









Several data structures and algorithms have been proposed to speed up index inserts such

as the LSM-Tree [44], Buffer-Tree [6], and Y-Tree [12]. These papers consider the problem of

providing I/O efficient indexing for a database experiencing a very high record insertion rate

which is impossible to handle using a traditional B+-Tree indexing structure. In general these

methods buffer a large set of insertions and then scan the entire base relation, which is typically

organized as a B+-Tree, at once adding new data to the structure.

Any of the above methods could trivially be used to maintain a large random sample of a

data stream. Every time a sampling algorithm probabilistically selects a record for insertion, it

must overwrite, at random, an existing record of the reservoir. Once an evictee is determined,

we can attach its location as a position identifier (a number between 1 and |R|) with a new sample

record. This position field is then used to insert the new record into these index structures. While

performing the efficient batch inserts, if an index structure discovers that a record with the same

position identifier exists, it simply overwrites the old record with the newer one.

However, none of these methods can come close to the raw write speed of the disk, as

the geometric file can [13]. In a sense, the issue is that while the indexing provided by these

structures could be used to implement efficient, disk-based reservoir sampling, it is too heavy-

duty a solution. We would end up paying too much in terms of disk I/O to send a new record to

overwrite a specific, existing record chosen at the time the new record is inserted, when all one

really needs is to have a new record overwrite any random, existing record.

There has been much recent interest in approximate query processing over data streams

(a very small subset of these papers is listed in the References section [1, 21, 34]); even some

work on sampling from a data stream [7]. This work is very different from our own, in that most

existing approximation techniques try to operate in very small space. Instead, our focus is on

making use of today's very large and very inexpensive secondary storage to physically store the

largest snapshot possible of the stream.

Finally, we mention the U.C. Berkeley CONTROL project [37] (which resulted in the

development of online aggregation [33] and ripple joins [32]). This work does address issues









associated with randomization and sampling from a data management perspective. However,

the assumption underlying the CONTROL project is that all of the data are present and can

be archived by the system; online sampling is not considered. Our work is complementary to

the CONTROL project in that their algorithms could make use of our samples. For example,

a sample maintained as a geometric file could easily be used as input to a ripple join or online

aggregation.

2.2 Biased Sampling Related Work

Our biased sampling algorithm is based on the reservoir sampling algorithm, which was first

proposed in the 1960s [11, 38]. Recently, Gemulla et al. [29] extended the reservoir sampling

algorithm to handle deletions. In their algorithm, called "random pairing" (RP), every deletion

from the dataset is eventually compensated by a subsequent insertion. The RP algorithm keeps

track of uncompensated deletions and uses this information while performing the inserts. The

algorithm guards the bound on the sample size and at the same time utilizes the sample space

effectively to provide a stable sample. Another extension to the classic reservoir sampling

algorithm has been recently proposed by Brown and Haas for warehousing of sample data [10].

They propose hybrid reservoir sampling for independent and parallel uniform random sampling

of multiple streams. These algorithms can be used to maintain a warehouse of sampled data that

shadows the full-scale data warehouse. They have also provided methods for merging samples

from different streams to create a uniform random sample.

The problem of temporal biased sampling in a stream environment has also been considered.

Babcock et al. [7] presented a sliding-window approach that restricts the horizon of the sample

in order to bias it towards recent streaming records. However, this solution can potentially

lose the entire history of past stream data that is not part of the sliding window. The

work done by Aggarwal [5] addresses this limitation and presents a biased sampling method

that provides a temporal bias towards recent records while still keeping representation from the

stream's history. This work exploits some interesting properties of the class of memoryless bias

functions to present a single-pass biased sampling algorithm for this type of bias function. However,









since these techniques are tailored for a specific class of bias functions, one cannot adapt them

directly to arbitrary user-defined bias functions. On the other hand, one can perform temporal

biased sampling using our algorithm by simply attaching a temporal weight to each streaming

record.

Another piece of work on single-pass sampling with a nonuniform distribution is due

to Kolonko and Wasch [40]. They present a single-pass algorithm to sample a data stream of

unknown size (that is, not known beforehand) to obtain a sample of arbitrary size n such that

the probability of selecting a data item i depends on the individual item. The weight or fitness

of an item, which governs its probabilistic selection, is used as the parameter of an exponentially

distributed auxiliary value drawn for that item, and the items with the largest auxiliary values

determine the sample. Like the temporal biased sampling methods discussed above, this algorithm

cannot be directly adapted to arbitrary user-defined bias functions.

Surprisingly, the above three papers are the only pieces of work known to the authors on how

to perform single-pass biased sampling over large datasets or streaming data.

Another body of related work consists of the papers from the network usage area [22-24, 41].

These papers present techniques for estimating the total network traffic (or usage) based on the

a sample of the flow records produced by routers. Since these flows typically have heavy-tailed

distributions, the techniques presented in these papers make use of size-dependent sampling

schemes. In general, such schemes work by sampling all the records whose traffic is above a certain

threshold and sampling the rest with probability proportional to their traffic. Although such

techniques introduce a sampling bias in which size can be thought of as the weight of a record, there

are key differences between such techniques and the algorithm presented in this dissertation.

The goal of our algorithm is to obtain a fixed-size biased sample that complies with an arbitrary

user-defined bias function. The goal of the size-dependent sampling scheme is to obtain a

sample that will provide the best accuracy for estimating the total network traffic that follows a

specific distribution. The sample gathered by these schemes is not necessarily a fixed-size biased

sample. It only guarantees that the expected sample size is no larger than the expected sample









size of a random sample obtained with sampling probability 1/τ, where τ is the threshold used

by these algorithms. Thus, the threshold τ is carefully selected to control the sample size and, if

required, it is increased to honor the upper bound on the sample size.

The problem of implementing a fixed-size sampling design with desired and unequal inclusion

probabilities has been studied in statistics. The monograph Theory of Sample Surveys [50]

discusses several methods for such a sampling technique, which is of some practical importance

in survey sampling. This monograph begins by discussing two designs which mimic simple

random sampling without replacement with selection probabilities for a given draw that are not

the same for all the units. We first summarize these techniques.

Successive Sampling: Let the selection probabilities be p_1, p_2, ..., p_L such that p_i > 0 and

Σ_{i=1}^{L} p_i = 1, and let the desired sample size be N = 2. Then the design suggests that we draw r with

probability p_r, and then q with probability p_q/(1 − p_r). The inclusion probabilities can be expressed

in terms of the selection probabilities by the fact that r is included if it is drawn on the first draw,

or on the second draw not having been chosen on the first. Thus, the inclusion probability π_r

is given by π_r = p_r (1 + Σ_{q≠r} p_q/(1 − p_q)). Similarly, the value for the joint inclusion probability

π_{rq} can be deduced. The monograph suggests that the value of p_r be found using an iterative

computation method.

Fellegi's Method: This method is very much like successive sampling described above,

except that the selection probabilities are different for the second draw. The second draw

probabilities are chosen such that the marginal selection probabilities for the two draws are the

same. This feature makes this method suitable for rotating sampling, as in labor force sampling,

where a fixed proportion of the sample is replaced each month. The procedure is as follows: the

first draw is made with probability p_r = α_r, and then q is drawn with probability p′_q/(1 − p′_r),

where p′_1, ..., p′_L is another set of selection probabilities chosen so that

    Σ_{r≠q} α_r p′_q/(1 − p′_r) = α_q

where the α_i are specified positive numbers such that Σ_i α_i = 1.









The above two methods implement simple random sampling (SRS) without replacement

with successive draws. An alternative method for fixed-size SRS is to select units with

replacement, and then to reject the sample if there are duplicates. We discuss one such method

here, called Sampford's Method.

Sampford's Method: In this method we first draw r with probability α_r, and in the

remaining N − 1 draws, which are carried out with replacement, we use the selection probabilities

β_i = K α_i/(1 − N α_i), where K is the normalizing constant. If there are any duplicates in the

sample, we start again from the beginning and repeat the procedure until a sample with

no duplicates is obtained. The main drawback of this sampling design is that as N becomes large,

it becomes likely that duplicates will occur in each sampling round.
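
A sketch of Sampford's rejection loop, written from the description above, is given below; the alphas are the per-unit probabilities (so inclusion probabilities are roughly N times the alphas), and we assume N × α_i < 1 for every unit. Note that random.choices normalizes its weights, so the constant K never needs to be computed explicitly:

    import random

    def sampford(alphas, n, max_tries=100_000):
        # First draw uses the alphas; the remaining n - 1 draws, carried out
        # with replacement, use weights proportional to alpha/(1 - n*alpha).
        # The whole sample is rejected and redrawn if any unit repeats.
        units = list(range(len(alphas)))
        betas = [a / (1 - n * a) for a in alphas]  # assumes n * a < 1 for all a
        for _ in range(max_tries):
            sample = random.choices(units, weights=alphas, k=1) + \
                     random.choices(units, weights=betas, k=n - 1)
            if len(set(sample)) == n:
                return sample
        raise RuntimeError("too many rejections; is n too large?")

    alphas = [0.02, 0.04, 0.06, 0.08, 0.10, 0.10, 0.12, 0.14, 0.16, 0.18]
    print(sampford(alphas, 3))  # inclusion probabilities roughly 3 * alpha_i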









CHAPTER 3
THE GEOMETRIC FILE

In this chapter we give an introduction to the basic reservoir sampling algorithm that was

proposed to obtain an online random sample of a data stream. The algorithm assumes that

the sample maintained is small enough to fit in main memory in its entirety. We discuss and

motivate why very large sample sizes can be mandatory in common situations. We describe three

alternatives for maintaining very large, disk-based samples in a streaming environment. We then

introduce the geometric file organization and present algorithms for reservoir sampling with the

geometric file. We also describe how multiple geometric files can be maintained all-at-once to

achieve considerable speed up.

3.1 Reservoir Sampling

The classic algorithm for maintaining an online random sample of a data stream is known

as reservoir sampling [11, 38]. To maintain a reservoir sample R of target size |R|, the following

loop is used:

Algorithm 1 Reservoir Sampling
1: Add first |R| items from the stream directly to R
2: for int i = |R| + 1 to ∞ do
3:   Wait for a new record r to appear in the stream
4:   with probability |R|/i do
5:     Remove a randomly selected record from R
6:     Add r to R
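
Algorithm 1 translates almost line for line into code. The Python sketch below renders the in-memory algorithm (not the disk-based variants developed later in this chapter); replacing a record at a random index is equivalent to steps 5 and 6:

    import random

    def reservoir_sample(stream, size):
        # A direct rendering of Algorithm 1; `stream` is any iterable.
        R = []
        for i, r in enumerate(stream, start=1):
            if i <= size:
                R.append(r)                    # step 1: first |R| records
            elif random.random() < size / i:   # step 4: keep r with prob |R|/i
                R[random.randrange(size)] = r  # steps 5-6: evict a random record
        return R

    # A sample of 5 records from a stream of 10,000 integers:
    print(reservoir_sample(range(10_000), 5))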


A key benefit of the reservoir algorithm is that after each execution of the for loop, it can

be shown that the set R is a true, uniform random sample (without replacement) of the first i

records from the stream. Thus, at all times, the algorithm maintains an unbiased snapshot of all

of the data produced by the stream. The name "reservoir sampling" is an apt one. The sample R

serves as a reservoir that buffers certain records from the data stream. New records appearing in

the stream may be trapped by the reservoir, whose limited capacity then forces an existing record

to exit the reservoir.









Reservoir sampling can be very efficient, with time complexity less than linear in the size of

the stream. Variations on the algorithm allow it to "go to sleep" for a period of time during which

it only counts the number of records that have passed by [53]. After a certain number of records

have been seen, the algorithm "wakes up" and captures the next record from the stream.

Correctness of the reservoir sampling algorithm: The reservoir sampling process can

be viewed as a two-phase process: (1) adding the first |R| records to the reservoir, and (2) adding

subsequent records until the input is consumed. A reservoir algorithm should maintain the following

invariant in the second phase: after each record is processed, the reservoir should be a simple

random sample of size |R| of the records processed so far. Algorithm 1 maintains this invariant in

steps (2-6) as follows [11, 38]. The ith record processed (i > |R|) is added to the reservoir

with probability |R|/i by step 4. We need to show that for all other records processed thus far,

the inclusion probability is |R|/i. Let r_k be any record in the reservoir s.t. k ≠ i. Let R_i denote

the state of the reservoir just after addition of the ith record. Thus, we are interested in

Pr[r_k ∈ R_i]:

    Pr[r_k ∈ R_i] = Pr[r_k ∈ R_{i−1}] Pr[r_i ∈ R_i] Pr[r_k not expelled] + Pr[r_k ∈ R_{i−1}] Pr[r_i ∉ R_i]

                  = (|R|/(i−1)) [ (|R|/i)((|R|−1)/|R|) + (1 − |R|/i) ]

                  = (|R|/(i−1)) ((i−1)/i) = |R|/i

The correctness of the inclusion probability alone is not sufficient to prove the required

invariant. Consider the systematic sampling described in Chapter 8 of Cochran's book [16]. To

select a sample of |R| units, systematic sampling takes a unit at random from the first k units and

"every kth" unit thereafter. Although the inclusion probability in systematic sampling is the same

as in simple random sampling, the properties of a sample such as variance can be far different.

It is known that the variance of systematic sampling can be better or worse compared to a

simple random sampling depending on data heterogeneity and correlation coefficient between

pairs of sampled units.









We therefore also need to show that the pairwise probabilities Pr[r_k, r_l ∈ R_i] have the correct

values. All three-way inclusion probabilities must also be correct, as well as all four-way inclusion

probabilities, and so on. In other words, we need to show that for a set S of interest, Pr[S ⊆ R_i]

has the correct value, for all S ⊆ R.
The proof that reservoir sampling maintains the correct inclusion probability for any set

of interest is actually very similar to the univariate inclusion probability correctness discussed

above. We know that the univariate inclusion probability is Pr[r_k ∈ R_i] = |R|/i. For any arbitrary

value of |S| ≤ |R|, assume that we have the correct probabilities when we have seen i − 1 input

records, i.e., Pr[S ⊆ R_{i−1}] = C(|R|, |S|) / C(i−1, |S|), where C(·,·) denotes the binomial

coefficient. When the ith record is processed (i > |R|), we have

    Pr[S ⊆ R_i] = Pr[S ⊆ R_{i−1}] Pr[r_i ∈ R_i] Pr[none of S's records are expelled]
                  + Pr[S ⊆ R_{i−1}] Pr[r_i ∉ R_i]

                = Pr[S ⊆ R_{i−1}] [ (|R|/i)((|R| − |S|)/|R|) + (1 − |R|/i) ]

                = ( C(|R|, |S|) / C(i−1, |S|) ) ((i − |S|)/i)

                = C(|R|, |S|) / C(i, |S|)

which is the desired probability.

3.2 Sampling: Sometimes a Little is not Enough

One advantage of random sampling is that samples usually offer statistical guarantees on the

estimates they are used to produce. Typically, a sample can be used to produce an estimate for

a query result that is guaranteed to have error less than ε with a probability δ (see Cochran for a

nice introduction to sampling [16]). The δ value is known as the confidence of the estimate.

Very large samples are often required to provide accurate estimates with suitably high

confidence. The need for very large samples can be easily explained in the context of the

Central Limit Theorem (CLT) [27]. The CLT implies that if we use a random sample of size

N to estimate the mean μ of a set of numbers, the error of our estimate is usually normally









distributed with mean zero and variance σ²/N, where σ² is the variance of the set over which we

are performing our estimation. Since the "spread" of a normally distributed random variable is

proportional to the square root of the variance (also known as the standard deviation), the error

observed when using a random sample is governed by two factors:

1. The error is inversely proportional to the square root of the sample size.

2. The error is directly proportional to the standard deviation of the set over which we are
estimating the mean.


The significance of this observation is that the sample size required to produce an accurate

estimate can vary tremendously in practice, and grows quadratically with increasing standard

deviation. For example, say that we use a random sample of 100 students at a university to

estimate the average student age. Imagine that the average age is 20 with a standard deviation of

2 years. According to the CLT, our sample-based estimate will be accurate to within 2.5% with

confidence of around 98%, giving us an accurate guess as to the correct answer with only 100

sampled students.

Now, consider a second scenario. We want to use a second random sample to estimate the

average net worth of households in the United States, which is around $140,000, with a standard

deviation of at least $5,000,000. Because the standard deviation is so large, a quick calculation

shows we will need more than 12 million samples to achieve the same statistical guarantees as in

the first case.
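
Both sample sizes follow from a back-of-the-envelope CLT calculation, sketched below with an assumed two-sided normal critical value of z ≈ 2.33 for roughly 98% confidence; the result for net worth comes out on the order of the 12 million figure quoted above:

    # Required samples to estimate a mean mu with relative error eps:
    # N = (z * sigma / (eps * mu))**2, with z ~ 2.33 assumed here.
    z = 2.33

    def required_n(mu, sigma, eps):
        return (z * sigma / (eps * mu)) ** 2

    print(required_n(20, 2, 0.025))               # student ages: ~87 samples
    print(required_n(140_000, 5_000_000, 0.025))  # net worth: ~11 million samples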

Required sample sizes can be far larger when standard database operations like relational

selection and join are considered, because these operations can effectively magnify the variance

of our estimate. For example, the work on ripple joins [32] provides an excellent example of how

variance can be magnified by sampling over the relational join operator.

3.3 Reservoir for Very Large Samples

Reservoir sampling is very efficient if the sample is small enough to be stored in main

memory. However, efficiency is difficult if a large sample must be stored on disk. Obvious









extensions of the reservoir algorithm to on-disk samples all have serious drawbacks. We discuss

the obvious extensions now.

The virtual memory extension. The most obvious adaptation for very large sample sizes is

to simply treat the reservoir as if it were stored in virtual memory. The problem with this solution

is that every new sample that is added to the reservoir will overwrite a random, existing record on

disk, and so it will require two random disk I/Os: one to read in the block where the record will

be written, and one to re-write it with the new sample. This means we can sample only on the

order of 50 records per second at 10ms per random I/O per disk. Currently, a terabyte of storage

requires as few as five disks, giving us a sampling rate of only 5 x 50 = 250 records per second.

To put this in perspective, it would take months to sample enough 100 byte records to fill that

terabyte.
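
The arithmetic behind these figures is spelled out below under the stated assumptions (10 ms per random I/O, two random I/Os per sampled record, five disks, 100-byte records):

    seek = 0.010                     # assumed seconds per random I/O (10 ms)
    per_disk = 1 / (2 * seek)        # two random I/Os per record: ~50 rec/s/disk
    rate = 5 * per_disk              # five disks: ~250 records per second
    records = 10**12 // 100          # 100-byte records in one terabyte
    print(records / rate / 86_400)   # ~463 days of continuous sampling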

The massive rebuild extension. As an alternative, when new samples are selected from the

stream, they are not added to the on-disk reservoir immediately. Rather, we make use of all of

our available main memory to buffer new samples. At all times, the records stored in the buffer

B logically represent a set of samples that should have been used to replace on-disk samples in

order to preserve the correctness of the reservoir algorithm, but that have not yet been moved to

disk for performance reasons. When the buffer B fills, we simply scan the entire reservoir R, and

replace a random subset of the existing records with the new, buffered samples. The modified

algorithm is given as Algorithm 2. Count(B) refers to the current number of records in B. Note

that since the records contained in B logically represent records in the reservoir that have not yet

been added to disk, a newly-sampled record can either be assigned to replace an on-disk record,

or it can be assigned to replace a buffered record (this is decided in Step (7) of the algorithm).

Algorithm 2 Reservoir Sampling with a Buffer
1: for int i = 1 to ∞ do
2:   Wait for a new record r to appear in the stream
3:   if i ≤ |R| then
4:     Add r directly to R and continue
5:   else
6:     with probability |R|/i do
7:       with probability Count(B)/|R| do
8:         // new samples can overwrite buffered samples
9:         Replace a random record in B with r
10:      else do
11:        Add r to B
12:      if Count(B) == |B| then
13:        Scan the reservoir R and empty B in one pass
14:        B = ∅

In a realistic scenario, the ratio of the number of disk blocks to the number of records

buffered in main memory may approach or even exceed one. For example, a 1 TB database with

128 KB blocks will have 7.8 million blocks; and for such a relatively large database it is realistic

to expect that we have access to enough memory to buffer millions of records. As the number of

buffered records per block meets or exceeds one, most or all of the blocks on disk will contain

a record that has been randomly selected for replacement by line (9) of Algorithm 2, and so all

of the database blocks must be updated. Thus, it makes sense to rely on fast, sequential I/O to

update the entire file in a single pass. The drawback of this approach is that every time that the

buffer fills, we are effectively rebuilding the entire reservoir to process a set of buffered records

that are a small fraction of the existing reservoir size.
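
A compact rendering of Algorithm 2 appears below. It is a sketch in which Python lists stand in for the stream, the buffer, and the on-disk reservoir, and the flush overwrites a random subset of reservoir slots in place of the sequential scan of step 13:

    import random

    def flush(R, B):
        # Overwrite a random subset of reservoir slots with the buffered
        # samples (stands in for the one-pass scan of step 13).
        for pos, r in zip(random.sample(range(len(R)), len(B)), B):
            R[pos] = r
        B.clear()

    def buffered_reservoir(stream, R_size, B_cap):
        # Assumes B_cap <= R_size, mirroring |B| <= |R|.
        R, B = [], []
        for i, r in enumerate(stream, start=1):
            if i <= R_size:
                R.append(r)                            # lines 3-4
            elif random.random() < R_size / i:         # line 6
                if random.random() < len(B) / R_size:  # line 7
                    B[random.randrange(len(B))] = r    # line 9
                else:
                    B.append(r)                        # line 11
                if len(B) == B_cap:                    # line 12
                    flush(R, B)                        # lines 13-14
        flush(R, B)  # final partial flush so the function returns a snapshot
        return R

    print(len(buffered_reservoir(range(100_000), R_size=1_000, B_cap=100)))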

The localized overwrite extension. We will do better if we enforce a requirement that all

samples are stored in a random order on disk. If data are clustered randomly, then we can simply

write the buffer sequentially to disk at any arbitrary position. Because of the random clustering,

we can guarantee that wherever the buffer is written to disk, the new samples will overwrite a

random subset of the records in the reservoir and preserve the correctness of the algorithm. The

problem with this solution is that after the buffered samples are added, the data are no longer

clustered randomly and so a randomized overwrite cannot be used a second time. The data are

now clustered by insertion time, since the buffered samples were the most recently seen in the

data stream, and were written to a single position on disk. Any subsequent buffer flush will need

to overwrite portions of both the new and the old records to preserve the algorithm's correctness,

requiring an additional random disk head movement. With each subsequent flush, maintaining

randomness will become more costly, as data become more and more clustered by insertion time.









Eventually, this solution will deteriorate, unless we periodically re-randomize the entire reservoir.

Unfortunately, re-randomizing the entire reservoir is as costly as performing an external-memory

sort of the entire file containing samples, and requires taking the sample off-line.

3.4 The Geometric File

The three extensions to Algorithm 1 can be used to maintain a large, on-disk sample, but

all of them have drawbacks. In this section, we discuss a fourth algorithm and an associated

data organization called the geometric file to address these pitfalls. The geometric file is best

seen as an extension of the massive rebuild option given as Algorithm 2. Just like Algorithm

2, the geometric file makes use of a main-memory buffer that allows new samples selected by

the reservoir algorithm to be added to the on-disk reservoir in a lazy fashion. However, the

key difference between Algorithm 2 and the algorithms used by the geometric file is that the

geometric file makes use of a far more efficient algorithm for merging those new samples into the

reservoir.

Intuitive description: Except for Step (13) of Algorithm 2, the basic algorithm employed

by the geometric file is not much different. As far as Step (13) is concerned, the difference

between the geometric file and the massive rebuild extension is that the geometric file empties

the buffer more efficiently, in order to avoid scanning or periodically re-randomizing the entire

reservoir.

To accomplish this, the entire sample in main memory that is flushed into the reservoir is

viewed as a single subsample or a stratum [16], and the reservoir itself is viewed as a collection

of subsamples, each formed via a single buffer flush. Since the records in a subsample are a non-

random subset of the records in the reservoir (they are sampled from the stream during a specific

time period), each new subsample needs to overwrite a true, random subset of the records in the

reservoir in order to maintain the correctness of the reservoir sampling algorithm. If this can be

done efficiently, we can avoid rebuilding the entire reservoir in order to process a buffer flush.

At first glance, it may seem difficult to achieve the desired efficiency. The buffered records

that must be added to the reservoir will typically overwrite a subset of the records stored in each









of the existing subsamples during a buffer flush. Though we may be able to avoid rebuilding the

entire file, the fact that the buffer must over-write a subset of each on-disk subsample presents

a challenge when trying to maintain acceptable performance, because this naturally leads to

fragmentation (see the discussion of the localized overwrite extension in Section 3.3). For

example, if there are 100 on-disk subsamples, the buffer must be split 100 ways in order to write

to a portion of each of the 100 on-disk subsamples. This fragmented buffer then becomes a new

subsample, and subsequent buffer flushes that need to replace a random portion of this subsample

must somehow efficiently overwrite a random subset of the subsample's fragmented data.

The geometric file uses a careful, on-disk data organization in order to avoid such fragmen-

tation. The key observation behind the geometric file is that the number of records of a subsample

that are replaced with records from a buffered sample can be characterized with reasonable accu-

racy using a geometric series (hence the name geometric file). As buffered samples are added to

the reservoir via buffer flushes, we observe that each existing subsample loses approximately the

same fraction of its remaining records every time, where the fraction of records lost is governed

by the ratio of the size of a buffered sample to the overall size of the reservoir. By "loses", we

mean that the subsample has some of its records replaced in the reservoir with records from a

subsequent subsample. Thus, the size of a subsample decays approximately in an exponential

manner as buffered samples are added to the reservoir.

This exponential decay is used to great advantage in the geometric file, because it suggests

a way to organize the data in order to avoid problems with fragmentation. Each subsample is

partitioned into a set of segments of exponentially decreasing size. These segments are sized

so that every time a buffered sample is added to the reservoir, we expect that each existing

subsample loses exactly the set of records contained in its largest remaining segment. As a

result, each subsample loses one segment to the newly-created subsample every time the buffer is

emptied, and a geometric file can be organized into a fixed and unchanging set of segments that

are stored as contiguous runs of blocks on disk. Because the set of segments is fixed beforehand,

fragmentation and update performance are not problematic: in order to replace records in an









existing subsample with the records from a new buffer flush, a simple, efficient, sequential

overwrite of the existing subsample's largest segment generally suffices.

3.5 Characterizing Subsample Decay

To describe the geometric file in detail, we begin with an analogy between the samples in a

subsample S that are lost over time, and radioactive decay. Imagine that we have 100 grams of

Uranium at an initial point of time (Uo = 100), and a decay rate (1 a) = 0.1 with a retention rate

of a. On day one, the mass of Uranium decays to Uo x a = 90 grams, because the Uranium loses

Uo x (1 a) = 10 grams of its mass. We define n = Uo x (1 a) to be the mass of Uranium lost

on the very first day, giving n = 10 for our example.

On day two, (with U1 = 90) the Uranium further decays to U1 x a = 81 grams, this time

losing U1 x (1 a) = Uo x a x (1 a) n x a = 9 grams of its mass. On day three, it further

decays by n x a2 = 7.2 grams, and so on. The decay process is allowed to continue until we have

less than 3 grams of Uranium remaining.

Continuing with the Uranium analogy, three questions that are relevant to our problem of

maintaining very large samples from a data stream are

What is the amount of Uranium lost on any given day i?
How can the initial mass of Uranium, 100 grams, be expressed in terms of n and a?
How many days will it take before we are left with 3 grams or less of Uranium?


These questions can be answered using the following three simple observations related to geometric series:

Observation 1: Given a retention rate $\alpha < 1$ and $n$ as the first term of a geometric series, the $i$th term is given by $n \times \alpha^{i-1}$ for any $n \in \mathbb{R}$.

Observation 2: Given a retention rate $\alpha < 1$, it holds that $\sum_{i=1}^{\infty} n \times \alpha^{i-1} = \frac{n}{1-\alpha}$ for any $n \in \mathbb{R}$.

Observation 3: Given a retention rate $\alpha < 1$, define $f(j)$ as $\frac{n}{1-\alpha} \times \alpha^j$. From Observation 2, it follows that the largest $j$ such that $f(j) \geq \beta$ is $j = \left\lfloor \frac{\log\beta - \log n + \log(1-\alpha)}{\log\alpha} \right\rfloor$. We denote this floor by $T$.
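As a quick numeric check of the three observations against the Uranium analogy, the following throwaway Python script (not part of the dissertation's system) recomputes the answers to the three questions above:

import math

U0, alpha, beta = 100.0, 0.9, 3.0    # initial mass, retention rate, threshold
n = U0 * (1 - alpha)                 # mass lost on day one: 10 grams

# Observation 1: the mass lost on day i is n * alpha^(i-1)
assert abs(n * alpha ** (3 - 1) - 8.1) < 1e-9        # day three loses 8.1 g

# Observation 2: the losses n + n*alpha + ... sum back to U0 = n / (1 - alpha)
assert abs(n / (1 - alpha) - U0) < 1e-9

# Observation 3: days until at most beta grams remain
T = math.floor((math.log(beta) - math.log(n) + math.log(1 - alpha))
               / math.log(alpha))
print(T, U0 * alpha ** T)            # 33 days, with ~3.09 grams still left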









To relate this back to the task of reservoir sampling, imagine that our large, disk-based reservoir sample $R$ is maintained using a reservoir sampling algorithm in conjunction with a main memory buffer $B$ (as in Algorithm 2). Recall that the way reservoir sampling works is that new samples from the data stream are chosen to overwrite random samples currently in the reservoir. The buffer temporarily stores these new samples, delaying the overwrite of a random set of records that are already stored on disk. Once the buffer is full, all new samples are merged with $R$ by overwriting a random subset of the existing samples in $R$.

Consider some arbitrary subsample $S$ of $R$ (so $S \subseteq R$), with capacity $|S|$. Since the buffer $B$ represents the samples that have already over-written an equal number of records of $R$, a buffer flush overwrites exactly $|B|$ samples of $R$. Thus, on expectation the merge will overwrite $\frac{|S| \times |B|}{|R|}$ samples of $S$. If we define $(1-\alpha) = \frac{|B|}{|R|}$, then on expectation, $S$ should lose $|S| \times (1-\alpha)$ of its own records due to the buffer flush.¹ We refer to this loss as subsample decay.

We can roughly describe the expected decay of $S$ after repeated buffer merges using the three observations stated before. If the subsample retention rate is $\alpha = 1 - \frac{|B|}{|R|}$, then:

From Observation 1, it follows that the $i$th buffer merge, on expectation, removes $n \times \alpha^{i-1}$ samples from what remains of $S$.

From Observation 2, it follows that the initial size of a subsample is $|S| = \frac{n}{1-\alpha}$.

From Observation 3, it follows that the expected number of merges required until $S$ has $\beta$ or fewer samples left is $T$.


The net result of this is that it is possible to characterize the expected decay of any arbitrary subset of the records in our disk-based sample as new records are added to the sample through multiple emptyings of the buffer. If we view $S$ as being composed of $T$ on-disk "segments" of exponentially decreasing size, plus a special, single group of final segments of total size



1 Actually, this is only a fairly tight approximation to the expected rate of decay. It is not an
exact characterization because these expressions treat the emptying of the buffer into the reservoir
as a single, atomic event, rather than a set of individual record additions (See Section 3.7).









[Figure: a subsample partitioned into segment 0 ($n$ samples), segment 1 ($n\alpha$ samples), segment 2 ($n\alpha^2$ samples), and so on, with the segments numbered $T$ and after ($\beta$ samples total) stored in main memory; successive rows show the $\frac{n}{1-\alpha}$ samples on disk before the first buffer flush, and the shrinking remainder after the first, second, third, and later buffer flushes.]

Figure 3-1. Decay of a subsample after multiple buffer flushes.


$\beta$ that are buffered in main memory (subsequently referred to as the "beta segment"), then the $i$th buffer flush into $R$ will on expectation overwrite exactly one on-disk segment from $S$. $S$ loses an additional segment with every buffer flush until the subsample has only its beta segment remaining. At the point that only the subsample's beta segment remains, the samples contained therein can be replaced directly. The reason that the beta segment is buffered in main memory is that overwriting a segment requires at least one random disk head movement, which is costly. By storing the beta segment in main memory, we can reduce the number of disk head movements with little main-memory storage cost. The process is depicted in Figure 3-1.














[Figure: the on-disk layout, from low disk address to high disk address: a contiguous run holding all segment 0's, then all segment 1's, all segment 2's, and so on through the last on-disk segments; all smaller segments are buffered in main memory.]

Figure 3-2. Basic structure of the geometric file.









3.6 Geometric File Organization

This decay process suggests a file organization for efficiently maintaining very large random

samples from a data stream. Let a subsample S be the set of records that are loaded into our

disk-based reservoir sample R in a single emptying of the buffer. Since we know that the number

of records that remain in $S$ will on expectation decay over time as depicted in Figure 3-1, we can organize our large, disk-based sample as a set of decaying subsamples. At any point of time, the largest subsample was created by the most recent flushing of the buffer into $R$, and has not yet lost any segments. The second largest subsample was created by the second most recent buffer flush; it lost its largest segment in the most recent buffer flush. In general, the $i$th largest subsample was created by the $i$th most recent buffer flush, and it has had $i-1$ segments removed by subsequent buffer flushes. The overall file organization is depicted in Figure 3-2.

3.7 Reservoir Sampling With a Geometric File

Given this organization, processing a buffer flush becomes an easy task. The overall reservoir sampling algorithm for the geometric file organization is given as Algorithm 3. The terms $n$, $\alpha$, and $T$ carry the meaning discussed in Section 3.5. The process described by Algorithm 3 is depicted graphically in Figure 3-3. First, the file is filled with the initial data produced by the stream (a through c). To add the first records to the file, the buffer is allowed to fill with samples. The buffered records are then randomly grouped into segments, and the segments are written to disk to form the largest initial subsample (a). For the second initial subsample, the buffer is only allowed to fill to $|B| \times \alpha$ of its capacity before being written out (b). For the third initial subsample, the buffer fills to $|B| \times \alpha^2$ of its capacity before it is written (c). This is repeated until the reservoir has completely filled (as was shown in Figure 3-2). At this point, new samples must overwrite existing ones. To facilitate this, the buffer is again allowed to fill to capacity. Records are then randomly grouped into segments of appropriate size, and those segments overwrite the largest segment of each existing subsample (d). This process is then repeated indefinitely, as long as the stream produces new records (e and f).









This file organization has several significant benefits for use in maintaining a very large

sample from a data stream:

Performing a buffer flush requires absolutely no reads from disk.

Each buffer flush requires only T random disk head movements; all other disk I/Os are
sequential writes. To add the new samples from the buffer into the geometric file to create
a new subsample S, we need only seek to the position that will be occupied by each of S's
on-disk segments.

Even if segments are not block-aligned, only the first and last block in each over-written
segment must be read and then re-written (to preserve the records from adjacent segments).


Algorithm 3 Reservoir Sampling with a Geometric File
1: Set numSubsamples = 0
2: for int i = 1 to ∞ do
3:   Wait for a new record r to appear in the stream
4:   if i ≤ |R| then
5:     Add r to B
6:     if Count(B) == |B| × α^numSubsamples then
7:       Randomize the ordering of the records in B
8:       Set n = Count(B) × (1 − α)
9:       Partition B into segments of size n, nα, nα², and so on
10:      Flush the first T segments to the disk
11:      Store the group of remaining segments in main memory
12:      numSubsamples++
13:      B = ∅
14:  else
15:    with probability |R|/i do
16:      with probability Count(B)/|R| do
17:        Replace a random record in B with r
18:      else do
19:        Add r to B
20:      if Count(B) == |B| then
21:        Partition the buffer into segments of size n, nα, nα², and so on (see Section 3.7.1)
22:        for each segment sg_j from B do
23:          Overwrite the largest segment of the jth largest subsample of R with sg_j
24:        B = ∅
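To make the control flow concrete, the following is a minimal, self-contained Python sketch of Algorithm 3. It is an illustration under simplifying assumptions, not the dissertation's implementation: disk segments are modeled as plain in-memory lists, expected segment sizes are used directly, and the variance-handling stacks of Sections 3.7.1-3.7.2 are omitted.

import random

class GeometricFileSketch:
    # Toy model of Algorithm 3: "disk" segments are Python lists, and each
    # flush simply reclaims the largest remaining segment of every existing
    # subsample (the stack machinery that absorbs variance is omitted).
    def __init__(self, R_size, B_size, beta):
        assert B_size < R_size
        self.R_size, self.B_size, self.beta = R_size, B_size, beta
        self.alpha = 1.0 - B_size / R_size   # Lemma 1: 1 - alpha = |B|/|R|
        self.buffer = []                     # main-memory buffer B
        self.subsamples = []                 # each subsample = list of segments
        self.i = 0                           # records seen so far

    def _segment_sizes(self, total):
        # Sizes n, n*alpha, n*alpha^2, ... plus a final beta-sized remainder.
        sizes, size, remaining = [], total * (1.0 - self.alpha), total
        while remaining > self.beta:
            seg = min(remaining, max(1, round(size)))
            sizes.append(seg); remaining -= seg; size *= self.alpha
        sizes.append(remaining)              # the in-memory beta segment
        return sizes

    def _flush(self):
        random.shuffle(self.buffer)          # randomize before segmenting
        segments, pos = [], 0
        for size in self._segment_sizes(len(self.buffer)):
            segments.append(self.buffer[pos:pos + size]); pos += size
        for sub in self.subsamples:          # every existing subsample gives
            if sub: sub.pop(0)               # up its largest segment's space
        self.subsamples.insert(0, segments)  # the flush is the new largest
        self.buffer = []

    def insert(self, r):
        self.i += 1
        if self.i <= self.R_size:            # Steps (4)-(13): initial fill
            self.buffer.append(r)
            if len(self.buffer) >= self.B_size * self.alpha ** len(self.subsamples):
                self._flush()
        elif random.random() < self.R_size / self.i:    # Step (15)
            if random.random() < len(self.buffer) / self.R_size:
                self.buffer[random.randrange(len(self.buffer))] = r
            else:
                self.buffer.append(r)
            if len(self.buffer) >= self.B_size:          # Steps (20)-(24)
                self._flush()

For instance, GeometricFileSketch(R_size=10**5, B_size=10**3, beta=32) maintains a 100,000-record sample with α = 0.99; rounding edge cases in segment sizes are absorbed into the final beta segment.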


3.7.1 Introducing the Required Randomness

One issue that needs to be addressed is the partitioning of the buffer into segments in Algorithm 3, Step (21). In order to maintain the algorithm's correctness, when the buffer is flushed to disk it must overwrite a truly random subset of the records on disk. Thus, when performing the flush, we need to randomly choose records from the reservoir to replace. This implies that the on-disk subsamples (which on expectation are of size $\frac{n}{1-\alpha}$, $\frac{n\alpha}{1-\alpha}$, $\frac{n\alpha^2}{1-\alpha}$, and so on) will lose around $n$, $n\alpha$, $n\alpha^2$ records, and so on, respectively. However, while the number of records replaced in a

subsample S will on expectation be proportional to the size of S (and hence equal to the size of

S's largest on-disk segment) this replacement must be performed in a randomized fashion. The

situation can be illustrated as follows. Say we have a set of numbers, divided into three buckets,

as shown in Figure 3-4. Now, we want to add five additional numbers to our set, by randomly

replacing five existing numbers. While we do expect numbers to be replaced in a way that is

proportional to bucket size (Figure 3-4 (b)), this is not always what will happen (Figure 3-4 (c)).

Algorithm 4 Randomized Segmentation of the Buffer
1: for each subsample S_i in the reservoir R do
2:   Set N_i = Number of records in S_i
3:   Set M_i = 0
4: for each record r in the buffer B do
5:   Randomly choose a victim subsample S_i such that Pr[choosing S_i] = N_i / Σ_j N_j
6:   N_i − −; M_i + +


In order to correctly introduce this variance into the geometric file, we need to add a few additional steps to Algorithm 3. Before we add a new subsample to disk via a buffer flush in Step (21), we first perform a logical, randomized partitioning of the buffer into segments, as described by Algorithm 4. In Algorithm 4, each newly-sampled record is randomly assigned to replace a sample from an existing, on-disk subsample so that the probability of each subsample losing a record is proportional to its size. The result of Algorithm 4 is an array of $M$ values, where $M_i$ tells Step (21) of Algorithm 3 how many records should be assigned to overwrite the $i$th on-disk subsample.
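A minimal Python sketch of this partitioning step follows (an illustration assuming the subsample sizes are known in memory; it runs in time proportional to the buffer size times the number of subsamples, which suffices for exposition):

import random

def randomized_segmentation(subsample_sizes, buffer_count):
    # Algorithm 4 sketch: for each buffered record, pick a victim subsample
    # with probability proportional to its *remaining* size. Returns M,
    # where M[i] is the number of buffered records assigned to overwrite
    # records of the i-th on-disk subsample.
    N = list(subsample_sizes)            # remaining records per subsample
    M = [0] * len(N)
    total = sum(N)
    for _ in range(buffer_count):
        pick = random.randrange(total)   # uniform slot among remaining records
        for i, n_i in enumerate(N):      # locate the subsample owning the slot
            if pick < n_i:
                N[i] -= 1; M[i] += 1; total -= 1
                break
            pick -= n_i
    return M

# Example: three subsamples holding 1000, 900, and 810 records; assigning
# 100 new records gives E[M] of roughly (37, 33, 30).
print(randomized_segmentation([1000, 900, 810], 100))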

3.7.2 Handling the Variance

Of course, there is no guarantee that $M_1 = n$, $M_2 = n\alpha$, $M_3 = n\alpha^2$, and so on, so there is no guarantee that Algorithm 3 will overwrite exactly the number of records contained in each



















[Figure: six panels (a)-(f), each showing the file from low disk address to high disk address as new samples arrive: panels (a)-(c) show the initial subsamples being written as the reservoir fills, and panels (d)-(f) show subsequent buffer flushes of new samples overwriting the largest segment of each existing subsample.]

Figure 3-3. Building a geometric file.












[Figure: (a) five new samples randomly replace existing samples that are grouped into three buckets holding 1/5, 1/5, and 3/5 of the total; (b) most likely outcome: new samples distributed proportionally; (c) possible (though unlikely) outcome: new samples all distributed to the smallest bucket.]

Figure 3-4. Distributing new records to existing subsamples.


subsample's largest segment. To handle this problem, we associate a stack (or buffer¹) with each of the subsamples. The stack associated with a subsample will buffer any of a subsample's records that logically should not have been over-written during a buffer flush into the subsample (because $M_i$ for some buffer flush for that subsample was smaller than expected), but whose space had to be claimed by the buffer flush in order to write a new subsample to disk. If the size of the stack is positive, it means that the corresponding subsample is larger than expected, because it has had fewer of its records over-written than expected. We also allow a negative stack size. This simply means that some of the subsample's records should have been over-written but were not, because an $M_i$ value for that subsample was larger than expected. A stack size of $-k$ means that $k$ of the subsample's on-disk records logically are not part of the reservoir (even though they are physically present on disk), and should be ignored during query processing.



1 We use the term "stack" rather than "buffer" to clearly differentiate the extra storage associated with each subsample from the buffer B.









Making use of the set of stacks is fairly straightforward. Imagine that $n\alpha^{i-1}$ of a buffer's records are sent to overwrite a segment from an existing subsample $S_i$, but according to Algorithm 4, $M_i$ should have been. Then, there are two possible cases:

Case 1: $M_i$ is smaller than $n\alpha^{i-1}$ by some number of records $\epsilon$. In this case, $\epsilon$ records are removed from the segment that is about to be over-written and pushed onto $S_i$'s stack in order to buffer them. This is necessary because these records logically should not be over-written by the records that are going to be added to the disk, but they will be.

Case 2: $M_i$ is larger than $n\alpha^{i-1}$ by some number of records $\epsilon$. In this case, $\epsilon$ records are popped off of $S_i$'s stack to reflect the additional records that should have been removed from $S_i$, but were not.


These stack operations are performed just prior to Step (23) in Algorithm 3. Note that since the final group of segments from a subsample of total size $\beta$ is buffered in main memory, their maintenance does not require any stack operations. Once a subsample has lost all of its on-disk samples, overwrites of records in this set can be handled by simply replacing the records directly.
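To make the two cases concrete, here is a small Python sketch of the stack adjustment performed just before Step (23); the list-based record and stack representations are hypothetical stand-ins for the on-disk structures.

def adjust_stack(segment, stack, M_i):
    # Reconcile a physical segment overwrite with the logical count M_i
    # from Algorithm 4. `segment` is the list of records whose disk space
    # is about to be claimed; `stack` is the subsample's stack. Returns the
    # count of on-disk records that must be ignored during query processing
    # (a logically negative stack size).
    expected = len(segment)              # on expectation, n * alpha^(i-1)
    ignored = 0
    if M_i < expected:
        # Case 1: fewer logical replacements than physical ones; push the
        # epsilon surviving records onto the stack so they are not lost.
        eps = expected - M_i
        stack.extend(segment[:eps])
    elif M_i > expected:
        # Case 2: more logical replacements than physical ones; pop epsilon
        # records off the stack (they should have been over-written).
        eps = M_i - expected
        popped = min(eps, len(stack))
        del stack[len(stack) - popped:]
        ignored = eps - popped           # the stack has gone logically negative
    return ignored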

3.7.3 Bounding the Variance

Because the stacks associated with each subsample will be used with high frequency as

insertions are processed, each stack must be maintained with extreme efficiency. Writes should

be entirely sequential, with no random disk head movements. To assure this efficiency and avoid

any sort of online reorganization, it is desirable to pre-allocate space for each of the stacks on

disk.

To pre-allocate space for these stacks, we need to characterize how much overflow we can

expect from a given subsample, which will bound the growth of the subsample's stack. It is

important to have a good characterization of the expected stack growth. If we allocate too much

space for the stacks, then we allocate disk space for storage that is never used. If we allocate

too little space, then the top of one stack may grow up into the base of another. If a stack does

overflow, it can be handled by buffering the additional records temporarily in memory or moving

the stack to a new location on disk until the stack can again fit in its allocated space. This is not









a catastrophic event, but it increases the disk I/O associated with stack maintenance and leads to

fragmentation, and so it is an event that we would like to render very rare.

To avoid this, we observe that if the stack associated with a sub-sample S contains any

samples at a given moment, then S has had fewer of its own samples removed than expected.

Thus, our problem of bounding the growth of $S$'s stack is equivalent to bounding the difference between the expected and the observed number of samples that $S$ loses as $|B|$ new samples are added to the reservoir, over all possible values for $|B|$.

To bound this difference, we first note that after adding $|B|$ new samples into the reservoir, the probability that any existing sample in the reservoir has been over-written by a new sample is $1 - (1 - \frac{1}{|R|})^{|B|}$. During the addition of new records to the reservoir, we can view a subsample $S$ of initial size $|B|$ as a set of $|B|$ identical, independent Bernoulli trials (coin flips). The $i$th trial determines whether the $i$th sample was removed from $S$. Given this model, the number of samples remaining in $S$ after $|B|$ new samples have been added to the reservoir is binomially distributed with $|B|$ trials and $P = \Pr[s \in S \text{ remains}] = (1 - \frac{1}{|R|})^{|B|}$. Since we are interested in characterizing the variance in the number of samples removed from $S$ primarily when $|B|P$ is large, the binomial distribution can be approximated with very high accuracy using a normal distribution with mean $\mu = |B|P$ and standard deviation $\sigma = \sqrt{|B|P(1-P)}$ [42]. Simple arithmetic implies that the greatest variance is achieved when a subsample has on expectation lost 50% of its records to new samples ($P = 0.5$); at this point the standard deviation is $\sigma = 0.5\sqrt{|B|}$. Since we want to ensure that stack overruns are essentially impossible, we choose a stack size of $3\sqrt{|B|}$. This allows the amount of data remaining in a given subsample to be up to six standard deviations from the norm without a stack overflow, and is not too costly an additional overhead. A quick lookup in a standard table of normal probabilities tells us that this will yield only around a $10^{-9}$ probability that any given subsample overflows its stack. While achieving such a small probability may seem like overkill, it is important to remember that many thousands of subsamples may be created in all during the life of the geometric file, and we want to ensure that very few of them overflow their respective stacks. If 100,000 on-disk segments are replaced, then using a stack of size $3\sqrt{|B|}$ will yield a very reasonable probability that we experience no overflows of $(1 - 10^{-9})^{100{,}000}$, or 99.99%. In practice, the actual probability of experiencing no overflows will be even greater. This is due to the fact that the standard deviation in subsample size for most of a subsample's lifespan will be much less than $0.5\sqrt{|B|}$, due to the high percentage of its lifespan during which its associated $P$ is less than 0.5 as it slowly loses all of its samples.
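As a numeric sanity check (illustrative only; the buffer size is the running $10^7$-record example), the six-sigma argument can be reproduced directly:

import math

B = 10**7                       # buffer size in records (running example)
sigma_max = 0.5 * math.sqrt(B)  # worst-case std. dev., at P = 0.5
stack_size = 3 * math.sqrt(B)   # six standard deviations of slack

# One-sided tail beyond 6 sigma for a standard normal, via the
# complementary error function: ~1e-9, as quoted in the text.
p_overflow = 0.5 * math.erfc(6 / math.sqrt(2))
p_no_overflow = (1 - p_overflow) ** 100_000
print(f"{stack_size:.0f} records, P(overflow)={p_overflow:.1e}, "
      f"P(no overflow in 100k)={p_no_overflow:.4%}")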

3.8 Choosing Parameter Values

Given a specified file size and buffer size, two parameters associated with using the geometric file must be chosen: $\alpha$, which is the fraction of a subsample's records that remain after the addition of a new subsample, and $\beta$, which is the total size of a subsample's segments that are buffered in memory.

3.8.1 Choosing a Value for Alpha

In general, it is desirable to minimize $\alpha$. Decreasing $\alpha$ decreases the number of segments used to store each subsample. Fewer segments means fewer random disk head movements are required to write a new subsample to disk, since each segment requires around four disk seeks to write (one to read the location and one to write a new segment, and similarly two more considering the cost of subsequently adjusting the stack of the previous owner).

To illustrate the importance of minimizing $\alpha$, imagine that we have a 1GB buffer and a stream producing 100B records, and we want to maintain a 1TB sample. Assume that we use an $\alpha$ value of 0.99. Thus, each subsample is originally 1GB, and $|B| = 10^7$. From Observation 2 we know that $\frac{n}{1-\alpha}$ must be $10^7$, so we must use $n = 10^5$. If we choose $\beta = 320$ (so that $\beta$ is around the size of one 32KB disk block), then from Observation 3 we will require $\left\lfloor \frac{\log 320 - \log 10^5 + \log(1-0.99)}{\log 0.99} \right\rfloor = 1029$ segments to store the entire new subsample.
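These segment counts follow mechanically from Observation 3; a small helper function (hypothetical, using the parameters above) reproduces the figure, along with the α = 0.999 case considered next:

import math

def num_segments(B, alpha, beta):
    # Number of on-disk segments per subsample (Observation 3).
    n = B * (1 - alpha)
    return math.floor((math.log(beta) - math.log(n) + math.log(1 - alpha))
                      / math.log(alpha))

print(num_segments(10**7, 0.99, 320))    # 1029
print(num_segments(10**7, 0.999, 320))   # 10344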

Now, consider the situation if $\alpha = 0.999$. A similar computation shows that we will now require 10,344 segments to store the same 1GB subsample. This is an order-of-magnitude difference, with significant practical importance. With four disk seeks per segment, 1029 segments might mean that we spend around 40 seconds of disk time in random I/Os (at 10ms each), whereas 10,344 might mean that 400 seconds of disk time is spent on random disk I/Os. This is important when one considers that the time required to write 1GB to a disk sequentially is only around 25 seconds. While minimizing $\alpha$ is vital, it turns out that we do not have the freedom to choose $\alpha$. In fact, to guarantee that the sum of all existing subsamples is $|R|$, the choice of $\alpha$ is governed by the ratio of $|R|$ to the size of the buffer $|B|$:

Lemma 1. (The size of a geometric file is $|R|$) $\Leftrightarrow$ ($(1-\alpha) = \frac{|B|}{|R|}$).


Proof. In the proof (and consequently the Lemma) we ignore the fact that $|B| \times \alpha^{i-1}$ may not be integral; we also ignore the storage associated with auxiliary structures such as the stacks and the beta segments. In this case, the geometric file is simply a collection of subsamples of decaying size. We know that the largest subsample on disk is created by the most recent buffer flush and has $|B|$ records in it. From Observation 1, the size of the $i$th subsample of a file is $|B| \times \alpha^{i-1}$. It then follows from Observation 2 that the total size of all subsamples of a geometric file is $\sum_{i=1}^{\infty} |B| \times \alpha^{i-1} = \frac{|B|}{1-\alpha} = |R|$, and thus $(1-\alpha) = \frac{|B|}{|R|}$. $\square$


We will address this limitation in Section 3.10.

3.8.2 Choosing a Value for Beta

It turns out that the choice of $\beta$ is actually somewhat unimportant, with far less impact than $\alpha$. For example, if we allocate 32KB for holding our $\beta$ in-memory samples for each subsample, and $|B|/|R|$ is 0.01, then as described above, adding a new subsample requires that 1029 segments be written, which will require on the order of 1029 seeks. Redoing this calculation with 1MB allocated to buffer samples from each on-disk subsample, the number of on-disk segments is $\left\lfloor \frac{\log 10^4 - \log 10^5 + \log(1-0.99)}{\log 0.99} \right\rfloor$, or 687. By increasing the amount of main memory devoted to holding the smallest segments for each subsample by a factor of 32, we are able to reduce the number of disk head movements by less than a factor of two. Thus, we will not consider optimizing $\beta$. Rather, we will fix $\beta$ to hold a set of samples equivalent to the system block size, and search for a better way to increase performance.









3.9 Why Is Reservoir Sampling with a Geometric File Correct?

We discuss the correctness of the geometric file by answering the following questions:

1. Why is the classical reservoir sampling algorithm (presented as Algorithm 1) correct? That is, what is the invariant maintained by Algorithm 1?

2. Why is the obvious disk-based extension of Algorithm 1 (presented as Algorithm 2) correct? That is, how does Algorithm 2 maintain the invariant of Algorithm 1 via the use of a main memory buffer?

3. Why is the proposed geometric file based sampling technique in Algorithm 3 correct?


We have answered the first question in Section 3.1. We discuss the second and third

questions here.

3.9.1 Correctness of the Reservoir Sampling Algorithm with a Buffer

Algorithm 2 makes use of a main memory buffer of size $|B|$ to buffer new samples. The buffered samples logically represent a set of samples that should have been used to replace on-disk samples in order to preserve the correctness of the sampling algorithm, but that have not yet been moved to disk for performance reasons (that is, due to lazy writes).

It is not hard to see that the invariant maintained by Algorithm 1 is also maintained by Algorithm 2 in step (6). The new records are sampled with the same probability $|R|/i$. The only difference is that newly sampled records are added to the reservoir using steps (7-14) instead of the simple steps (5-6) of Algorithm 1. We now discuss why these steps are equivalent.

One straightforward way of keeping the sampled records in the buffer and doing lazy writes is as follows. Every time we decide to add a new sample to the buffer (i.e., with probability $|R|/i$), we also generate a random number between 1 and $|R|$ to decide its position in the reservoir. However, we store this position in a position array and thus avoid an immediate disk seek. If we happen to generate a position that is already in the position array, we overwrite the corresponding record in the buffer with the newly sampled record. If we would have flushed that record to disk using the classic algorithm (rather than buffering it), we would have replaced it with the newly sampled record. Thus we would obtain the same result. Once the buffer is full, we flush it in a single scan of the reservoir and overwrite the records as dictated by the sorted order of the position array. It is obvious that this process is equivalent to steps (5-6) of Algorithm 1 as far as correctness is concerned.

Logically, steps (7-14) of Algorithm 2 actually implement exactly this process. The probability that we will generate a random position between 1 and $|R|$ that is already in the position array of size $|B|$ is $|B|/|R|$. Step (7) of Algorithm 2 decides whether to overwrite a random buffered record with a newly sampled record. Once the buffer is full, step (13) performs a one-pass buffer-reservoir merge by generating sequential random positions in the reservoir on the fly.
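A compact Python sketch of this position-array scheme (a toy model, with a Python list standing in for the on-disk reservoir):

import random

def buffered_reservoir(stream, R_size, B_size):
    # Toy model of Algorithm 2 / Section 3.9.1: each buffered record carries
    # the reservoir slot it will overwrite; colliding slots overwrite the
    # buffered record, and a full buffer is flushed in one sorted pass.
    reservoir, buffer = [], {}                  # buffer maps slot -> record
    for i, r in enumerate(stream, start=1):
        if i <= R_size:
            reservoir.append(r)                 # still filling the reservoir
        elif random.random() < R_size / i:      # classic reservoir test
            slot = random.randrange(R_size)     # future position on "disk"
            buffer[slot] = r                    # may displace a buffered record
            if len(buffer) == B_size:
                for slot in sorted(buffer):     # sequential, sorted overwrite
                    reservoir[slot] = buffer[slot]
                buffer.clear()
    for slot, r in buffer.items():              # flush the partial buffer
        reservoir[slot] = r
    return reservoir

sample = buffered_reservoir(range(100_000), R_size=1_000, B_size=100)
print(len(sample))                              # always exactly 1000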

3.9.2 Correctness of the Reservoir Sampling Algorithm with a Geometric File

In Algorithm 2 we store the samples sequentially on the disk and overwrite them in a random order. Though correct, the algorithm demands almost a complete scan of the reservoir (to perform all random overwrites) for every buffer flush. We can do better if we instead force the samples to be stored in a random order on disk, so that they can be replaced via an overwrite using sequential I/Os. The localized overwrite extension discussed before uses this idea. Every time a buffer is flushed to the reservoir, it is randomized in main memory and written as a random cluster on the disk. We maintain the correctness of this technique by splitting the random cluster $N$ ways, where $N$ is the number of existing clusters on the disk, and by overwriting a random subset of each existing cluster. This avoids the problem of clustering by insertion time. However, the drawback of this technique is that the solution deteriorates because of fragmentation of clusters.

The geometric file overcomes the drawbacks of these two techniques and can be viewed as a combination of Algorithm 2 and the idea used in the localized overwrite extension. The correctness of the geometric file results directly from the correctness of these two techniques. In the case of the geometric file, the entire sample in main memory (referred to as a subsample) is randomized and flushed into the reservoir. Furthermore, each new subsample is split into exactly as many segments as there are existing subsamples on the disk. These segments then overwrite a random portion of each disk-based subsample. The only difference with the geometric file is that it organizes the records to be overwritten systematically on the disk, by making the observation that each existing subsample loses approximately the same fraction of its remaining records every time.

3.10 Multiple Geometric Files

The value of $\alpha$ can have a significant effect on geometric file performance. If $\alpha = 0.999$, we can expect to spend up to 95% of our time on random disk head movements. However, if we were instead able to choose $\alpha = 0.9$, then we would reduce the number of disk head movements by a factor of 100, and we would spend only a tiny fraction of the total processing time on seeks. Unfortunately, as things stand, we are not free to choose $\alpha$. According to Lemma 1, $\alpha$ is fixed by the ratio $|B|/|R|$. That is, for a fixed desired size of reservoir, we need a larger buffer to lower the value of $\alpha$.

However, there is a way to improve the situation. Given a buffer of fixed capacity $|B|$ and desired sample size $|R|$, we choose a smaller value $\alpha' < \alpha$, and then maintain more than one geometric file at the same time to achieve a large enough sample. Specifically, we need to maintain $m = \frac{1-\alpha'}{1-\alpha}$ geometric files at once. These files are identical to what we have described thus far, except that the parameter $\alpha'$ is used to compute the sizes of a subsample's on-disk segments, and the size of each file is $\frac{|R|}{m}$. The remainder of this Section describes the details of how multiple geometric files are used to achieve greater efficiency.
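As a concrete instance (using the running example of a 1GB buffer and a 1TB reservoir, so $1-\alpha = 0.001$, and assuming we pick $\alpha' = 0.9$ as in Section 3.12):

$$m = \frac{1-\alpha'}{1-\alpha} = \frac{0.1}{0.001} = 100$$

geometric files, each of size $|R|/m = 10$GB. Within each file the buffer-to-file ratio is then $\frac{|B|}{|R|/m} = 0.1 = 1-\alpha'$, which is exactly the relationship Lemma 1 requires of a single file of that size.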

3.11 Reservoir Sampling with Multiple Geometric Files

The reservoir sampling algorithm with multiple geometric files is similar to Algorithm 3. Each of the $m$ geometric files is still treated as a set of decaying subsamples, and each subsample is partitioned into a set of segments of exponentially decreasing size, just as is done in Algorithm 3, Steps (5)-(13). The only difference is that as each file is created, the parameter $\alpha'$ is used instead of $\alpha$ in Steps (6), (8)-(9), and each of the $m$ geometric files is filled one after another, in turn. Thus, each subsample of each geometric file will have segments of size $n$, $n\alpha'$, $n\alpha'^2$, and so on.









Algorithm 5 Randomized Segmentation of the Buffer for Multiple Geometric Files
1: for each S_ij, the ith subsample in the jth file do
2:   Set N_ij = Number of records in S_ij
3:   Set M_ij = 0
4: for each record r in the buffer B do
5:   Randomly choose a victim subsample S_ij such that Pr[choosing S_ij] = N_ij / Σ_kl N_kl
6:   N_ij − −; M_ij + +


However, processing additional records from the stream is somewhat different. As more and more records are produced by the stream, new samples are captured and added to the buffer exactly as in Algorithm 3, Steps (15)-(20), until the buffer is full. Once the buffer is full, its record order is randomized, just as with a single geometric file. Next the buffer is flushed to disk. This is where the algorithm is modified. Overwriting records on disk with records from the buffer is somewhat different, in two primary ways, as discussed next.

Partitioning the buffer: In Algorithm 4, the buffer is partitioned so that the size of each buffer segment is on expectation proportional to the current size of the subsamples in a single file. In the case of multiple geometric files, we partition the buffer just as in Algorithm 4; however, we randomly partition the buffer across all subsamples from all geometric files. The number of buffer segments after the partitioning is the same as the total number of subsamples in the entire reservoir, and the size of each buffer segment is on expectation proportional to the current size of each of the subsamples from one of the geometric files. This allows us to maintain the correctness of the reservoir sampling algorithm. The buffer partitioning steps in the case of multiple geometric files are given in Algorithm 5.

Merging buffer segments with multiple geometric files: This step requires quite a different approach compared to Algorithm 3's buffer merge. We discuss all the intricacies subsequently, but at a high level, the largest segment of each subsample from only one geometric file is over-written with samples from the buffer. This allows for considerable speedup, as we discuss in Section 3.12. At first, this would seem to compromise the correctness of the algorithm: logically, the buffered samples must over-write samples from every one of the geometric files (in fact, this is precisely why the buffer is partitioned across all geometric files, as








described in the previous bullet). However, the correctness can be maintained by making use of

some additional buffer space. In Sections 3.11.1 to 3.11.3, we describe in detail an algorithm that

is able to maintain the correctness of the sample.

3.11.1 Consolidation And Merging

As stated previously, the process of flushing a buffer to disk once it has been partitioned

must be altered. The first step in flushing the buffer to disk is the consolidation of the many small

buffer segments that result from partitioning the buffer across all files to form larger segments

that are then used to over-write segments in only a single geometric file. To form the largest

consolidated segment, we group the m buffer segments assigned to the largest subsample from

every file. The next largest consolidated segment is formed by grouping the m buffer segments

corresponding to the next largest subsample across every files, and so on.

Once the segments assigned to the various files have been consolidated, the resulting

segments are used to overwrite subsamples from a single geometric file using exactly the

algorithm from Section 3.4, subject to the constraint that the jth buffer merge overwrites

subsamples from the (j mod m)th geometric file.

3.11.2 How Can Correctness Be Maintained?

Logically, samples from the buffer have been partitioned so as to preserve the correctness

of the reservoir algorithm: each record has been assigned to a subsample with probability

proportional to the subsample's size. However, the fact that these partitions are then consolidated

and merged into a single subsample would seem to compromise algorithm correctness, since

the subsamples in the (j mod m)th geometric file are over-written with too many new samples.

Thus, this file physically loses many of its samples before it should. This results in a subsample

with fewer samples stored on disk than it should have in order to preserve the correctness of the

reservoir sampling algorithm.

Our remedy to this problem is to delay overwriting a subsample's largest segment until

the time that all (or most) of the records that will be over-written on disk are invalid, in the

sense that they have logically been "over-written" by having records from subsequent buffer
















[Figure, four panels over an array of m geometric files: (a) Initial configuration: each of the m geometric files has an additional dummy subsample that holds no data. (b) The jth new subsample is added by overwriting the dummy in the i = (j mod m)th geometric file. (c) Existing subsamples give their largest segment to reconstitute the dummy; the data in these segments are protected until the next time the dummy is over-written. (d) The next m − 1 buffer flushes write new subsamples to the other m − 1 geometric files, using the same process; the mth buffer flush again overwrites the dummy in the ith geometric file, and the process is repeated from step (c).]

Figure 3-5. Speeding up the processing of new samples using multiple geometric files.









flushes assigned to replace them. In order to accomplish this, we note that if we did not perform

consolidation and instead replaced a segment from each subsample with exactly those records

assigned to overwrite records from that subsample, then on expectation a subsample would

lose all of the records in its largest segment after m buffer flushes. Thus, if we somehow delay

overwriting the largest segment in each file for m buffer flushes, we could sidestep the problem

of losing too many records due to consolidation.

The way to accomplish this is to overwrite subsamples in a lazy manner. We merge the

buffer with the (j mod m)th geometric file, but we do not overwrite any of the valid samples

stored in the file until the next time we get to the file. We can achieve this by allocating enough

extra space in each geometric file to hold a complete, empty subsample in each geometric

file. This subsample is referred to as the dummy. The dummy never decays in size, and never

stores its own samples. Rather, it is used as a buffer that allows us to sidestep the problem of

a subsample decaying too quickly. When a new subsample is added to a geometric file, the new subsample overwrites segments of the dummy rather than overwriting the largest segment of any existing subsample. Thus, we have protected the segments of subsamples that contain valid data by overwriting the dummy's records instead.

When records are merged from the buffer into the dummy, the space previously owned by

the dummy is given up to allow storage of the file's newest subsample. After this flush, the largest

segment from each of the subsamples in the file is given up to reconstitute the new dummy.

Because the records in (new) dummy's segments will not be over-written until the next time that

this particular geometric file is written to, all of the data that is contained within it is protected.

Note that with a dummy subsample, we no longer have a problem with a subsample losing its samples too quickly. Instead, a subsample may have slightly too many samples present on disk at any given time, buffered by the file's dummy. These extra samples can easily be ignored during query processing. The only additional cost we incur with the dummy is that each of the geometric files on disk must have $|B|$ additional units of storage allocated. The use of a dummy subsample is illustrated in Figure 3-5.









3.11.3 Handling the Stacks in Multiple Geometric Files

One final issue that should be considered is maintenance of the stacks associated with each subsample of the (j mod m)th geometric file. Just as in the single file case, the purpose of the

stack associated with a subsample is to store samples that are still valid, but whose space must

be given up in order to store new samples from the buffer that have been flushed to disk. With

multiple geometric files, this does not change. It is possible that when the buffer is written to the

dummy subsample in a file, the dummy may still contain valid samples from a subsample in that

file. Specifically, one or more of the dummy's segments may contain valid samples from the last

subsample to own the segment. In that case, the valid samples are saved to that subsample's stack

before the dummy is over-written.

3.12 Speed-Up Analysis

The increase in speed achieved using multiple geometric files can be dramatic. The time

required to flush a set of new samples to disk as a new subsample is dominated by the need to

perform random disk head movements. For each subsample, we need two random movements

to overwrite its largest segment (one to read the location and one to write a new segment) and

then two more seeks for its stack adjustment; a total of around 40 ms/segment. The number of

segments required to write a new subsample to disk in the case of multiple geometric files (and

thus the number of random disk head movements required) is given by Lemma 2.

Lemma 2. Let $u = (\log(1/\alpha'))^{-1}$. Multiple geometric files can be used to maintain an online sample of arbitrary size with a cost of $O(u \times \log|B| / |B|)$ random disk head movements for each newly sampled record.


Proof. We know that for every buffer flush, $m$ segments in the buffer are grouped to form a consolidated segment. All such consolidated segments are then used to overwrite the largest on-disk segments of the subsamples stored in a single geometric file. From Observation 3, we know that the number of on-disk segments of a subsample (and thus the number of consolidated segments) is $\left\lfloor \frac{\log\beta - \log n + \log(1-\alpha')}{\log\alpha'} \right\rfloor$. Substituting $n = (1-\alpha') \times |B|$ and simplifying the expression (as well as ignoring the floor), we compute the number of segments to write as $\frac{1}{\log(1/\alpha')}(\log|B| - \log\beta)$. If we let $u = (\log(1/\alpha'))^{-1}$, the number of segments can be expressed as $u(\log|B| - \log\beta)$. Assuming a constant number $c$ of random seeks per segment written to the disk, the total number of random disk head movements required per record is $uc(\log|B| - \log\beta)/|B|$, which is $O(u \times \log|B| / |B|)$. $\square$

In the case of multiple geometric files we use additional space for $m$ dummy subsamples. Thus, the total storage required by all geometric files is $|R| + (m \times |B|)$. If we wish to maintain a 1TB reservoir of 100B samples with 1GB of memory, we can achieve $\alpha' = 0.9$ by using only 1.1TB of disk storage in total. For $\alpha' = 0.9$, we need to write fewer than 100 segments per 1GB buffer flush. At 40 ms/segment, this is only 4 seconds of random disk head movements to write 1GB of new samples to disk.

In order to test the relative ability of the geometric file to process a high-speed stream of

insertions, we have implemented and bench-marked five alternatives for maintaining a large

reservoir on disk: the three alternatives discussed in Section 3.3, the geometric file, and the

framework described in Section 3.10 for using multiple geometric files at once. We present these

benchmarking results in Chapter 7.









CHAPTER 4
BIASED RESERVOIR SAMPLING

Random sampling selects a subset of the items in a population so that statistical properties

of the population can be inferred by studying the sample rather than the entire population. In this

chapter we study the problem of how to compute a simple, fixed-size random sample (without

replacement) in a single pass over a data stream, where the goal is to bias the sample using some

arbitrary weighting function.

In this chapter we propose a simple modification to the classic reservoir sampling algorithm [11, 38] in order to derive a very simple algorithm that permits this sort of fixed-size, biased sampling. Our method assumes the existence of an arbitrary, user-defined

weighting function f which takes as an argument a record ri, where f(ri) > 0 describes the

record's utility in subsequent query processing. We then compute (in a single pass) a biased

sample Ri of the i records produced by a data stream. Ri is fixed-size, and the probability of

sampling the jth record from the stream is proportional to f(rj) for all j < i. This is a fairly

simple and yet powerful definition of biased sampling, and is general enough to support many

applications.

Of course, one straightforward way to sample according to a well-defined bias function would be to make a complete pass over the data set to compute the total weight of all the records, $\sum_{j=1}^{N} f(r_j)$. During a second pass, we can then choose the $i$th record of the data set with probability $\frac{|R| f(r_i)}{\sum_{j=1}^{N} f(r_j)}$ by flipping a biased coin in a Bernoulli fashion. However, there are two problems with this method. First, this algorithm requires two passes over the data set. This may not be practical for very large data sets and it may be infeasible in a streaming environment. Second, the resulting sample is not fixed-size, which may be undesirable for several reasons. The resources required to store the sample are not fixed, and most estimators over the resulting sample will have higher variance.

In most cases, our algorithm is able to produce a correctly biased sample. However, given

certain pathological data sets and data orderings, this may not be the case. Our algorithm adapts

in this case and provides a correctly biased sample for a slightly modified bias function f'. We









analytically bound how far $f'$ can be from $f$ in such a pathological case, and experimentally evaluate the practical significance of this difference. Finally, we derive the correlation (covariance) between the Bernoulli random variables governing the sampling of two records $r_i$ and $r_j$ using our algorithm. We use this covariance to derive the variance of a Horvitz-Thompson estimator making use of a sample computed using our algorithm.

The rest of the chapter is organized as follows. We describe a single-pass biased sampling algorithm. We also define a distance metric to evaluate the worst case deviation from the user-defined weighting function $f$. Finally, we derive a simple estimator for a biased reservoir. The
experiments performed to test our algorithms are presented in Chapter 7.

4.1 A Single-Pass Biased Sampling Algorithm

We introduced the classical reservoir sampling algorithm, which maintains an unbiased sample of a data stream, in the previous chapter. We will extend this algorithm to give our biased reservoir sampling algorithm and prove various properties and pathological cases for the same.

4.1.1 Biased Reservoir Sampling

It turns out that in most cases, one may produce a correctly biased sample by simply modifying the reservoir algorithm to maintain a running sum totalWeight over all observed $f(r_i)$. Then, incoming records are added to the reservoir so that the probability of sampling record $r_i$ is $|R| f(r_i)/totalWeight$. This basic version of the algorithm is given as Algorithm 6. It is possible to prove that this modified algorithm results in a correctly biased sample, provided that the "probability" from line (8) of Algorithm 6 does not exceed one.

Lemma 3. Let $R_i$ be the state of the biased sample just after the $i$th record in the stream has been processed. Using the biased sampling described in Algorithm 6, we are guaranteed that for each $R_i$ and for each record $r_j$ produced by the data stream such that $j \leq i$, we have $\Pr[r_j \in R_i] = \frac{|R| f(r_j)}{\sum_{k=1}^{i} f(r_k)}$.

Proof. We need to prove that when a new record $r_i$ appears in the stream, then for each record $r_j$ from the stream, $\Pr[r_j \in R_i] = \frac{|R| f(r_j)}{\sum_{l=1}^{i} f(r_l)}$. A new record produced by the stream is sampled with probability $\frac{|R| f(r_i)}{\sum_{l=1}^{i} f(r_l)}$ in Step (8) of the algorithm, and the probability requirement trivially holds for the new record. We now must prove this fact for $r_k$, for all $k < i$. Since $R_{i-1}$ is correct, we know that for $k < i$, $\Pr[r_k \in R_{i-1}] = \frac{|R| f(r_k)}{\sum_{l=1}^{i-1} f(r_l)}$. Then there are two cases to consider; either the new record $r_i$ is chosen for the reservoir, or it is not. If $r_i$ is not chosen, then $r_k$ remains in the reservoir for $k < i$. If $r_i$ is chosen, then $r_k$ remains in the reservoir if $r_k$ is not selected for expulsion from the reservoir (the chance of this happening if $r_i$ is chosen is $(|R|-1)/|R|$). Thus, the probability that a record $r_k$ is in $R_i$ is

$$\begin{aligned}
\Pr[r_k \in R_i] &= \Pr[r_k \in R_{i-1}]\,\Pr[r_i \in R_i]\,\frac{|R|-1}{|R|} + \Pr[r_k \in R_{i-1}]\left(1 - \Pr[r_i \in R_i]\right) \\
&= \Pr[r_k \in R_{i-1}]\left(1 - \frac{\Pr[r_i \in R_i]}{|R|}\right) \\
&= \frac{|R| f(r_k)}{\sum_{l=1}^{i-1} f(r_l)}\left(1 - \frac{f(r_i)}{\sum_{l=1}^{i} f(r_l)}\right) \\
&= \frac{|R| f(r_k)}{\sum_{l=1}^{i-1} f(r_l)} \cdot \frac{\sum_{l=1}^{i-1} f(r_l)}{\sum_{l=1}^{i} f(r_l)} = \frac{|R| f(r_k)}{\sum_{l=1}^{i} f(r_l)}.
\end{aligned}$$

This is the desired result and proves the statement of the lemma. $\square$


4.1.2 So, What Can Go Wrong? (And a Simple Solution)

This simple modification to the reservoir sampling algorithm will give us the desired biased sample as long as the probability $|R| f(r_i)/totalWeight$ never exceeds one. If this value does exceed one, then the correctness of the algorithm is not preserved. Unfortunately, we may very well see such meaningless probabilities, especially early on as the reservoir is








Algorithm 6 Biased Reservoir Sampling (A Simple Modification to Algorithm 1)
1: Set totalWeight = 0
2: for int i = 1 to ∞ do
3:   Wait for a new record r_i to appear in the stream
4:   totalWeight = totalWeight + f(r_i)
5:   if i ≤ |R| then
6:     Add r_i directly to R
7:   else
8:     with probability |R| × f(r_i)/totalWeight do
9:       Remove a randomly selected record from R
10:      Add r_i to R


initially filled with samples and the value of totalWeight is relatively small. Fortunately, after some time the situation will improve: as the number of records produced by the stream becomes very large, totalWeight grows accordingly, making it unlikely that any single record will have $|R| f(r_i)/totalWeight > 1$.
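A direct Python transcription of Algorithm 6 follows (a sketch; note that a record whose selection "probability" exceeds one is silently accepted here, which is precisely the flaw the next two subsections address):

import random

def biased_reservoir(stream, f, R_size):
    # Algorithm 6 sketch. A selection probability above one is silently
    # saturated, since random.random() < p always holds for p >= 1;
    # Sections 4.1.2-4.1.3 describe how to handle such records correctly.
    R, total_weight = [], 0.0
    for i, r in enumerate(stream, start=1):
        total_weight += f(r)
        if i <= R_size:
            R.append(r)                          # Step (6)
        elif random.random() < R_size * f(r) / total_weight:
            R[random.randrange(R_size)] = r      # Steps (9)-(10)
    return R

# Example: bias toward later records with f(r) = r + 1 over the stream 0..99999.
sample = biased_reservoir(range(100_000), lambda r: r + 1.0, R_size=1_000)
print(sum(sample) / len(sample))                 # around 66,000 rather than 50,000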

We define an overweight record to be a record $r_i$ for which $\frac{|R| f(r_i)}{\sum_{k=1}^{i} f(r_k)} > 1$. This is simply a record for which the selection "probability" exceeds one. There are two methods for handling such overweight records. The first solution, which we describe presently, is to use some additional buffer memory. Every time we encounter an overweight record, we do not process the record immediately and instead buffer the record in a priority queue. The queue is arranged so that at all times, it gives us access to the minimum-weight buffered record, which we term $r^{min}$. Every time that totalWeight is incremented, we check to see if $|R| f(r^{min})/totalWeight \leq 1$. If it is, we then remove the record from the queue and re-consider it for selection. The process is then repeated until the record at the head of the queue is found to be overweight, at which point the modified reservoir algorithm again proceeds normally.
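One way to realize the delayed-insertion queue is with Python's heapq module. The sketch below folds a delayed record's weight into totalWeight only when the record is finally processed, which is one reading of the scheme consistent with the worst-case construction below; the re-consideration step reuses the selection logic of Algorithm 6.

import heapq, random

def biased_reservoir_queued(stream, f, R_size):
    # Biased reservoir sampling with delayed insertion of overweight records.
    # A record's weight counts toward totalWeight only once the record is
    # actually processed; until then it waits in a min-heap keyed on weight.
    R, total_weight, pending, seq = [], 0.0, [], 0
    for r in stream:
        ready = [(f(r), r)]                       # records to process now
        while ready:
            w, rec = ready.pop()
            if len(R) >= R_size and R_size * w / (total_weight + w) > 1:
                heapq.heappush(pending, (w, seq, rec)); seq += 1
                continue                          # overweight: delay it
            total_weight += w                     # process the record
            if len(R) < R_size:
                R.append(rec)
            elif random.random() < R_size * w / total_weight:
                R[random.randrange(R_size)] = rec
            # totalWeight grew: the queue head may have become eligible
            if pending and R_size * pending[0][0] / (total_weight + pending[0][0]) <= 1:
                w2, _, r2 = heapq.heappop(pending)
                ready.append((w2, r2))
    return R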

An important factor to consider while determining the feasibility of maintaining such a queue in the general case is providing an upper bound on its size. This can be done by considering the worst possible ordering of the records input into the algorithm, subject to the constraint that the bias function is well-defined. In general, we describe the user-defined weighting function $f$ as being well-defined if $\frac{|R| f(r_i)}{\sum_{k=1}^{N} f(r_k)} \leq 1 \;\; \forall i = 1, 2, \ldots, N$.









It turns out that in the worst-case scenario we might have to buffer almost the entire data stream. We describe the case by construction. For a given arbitrary reservoir size $|R|$ and stream size $N$, we add the first $|R|$ records, all with the same weight $wt_1 = 1$, to the reservoir. Next, we set $f(r_{|R|+1}) = \sum_{k=1}^{|R|} f(r_k)/|R| + 1 = wt_1 + 1 = 2$. The inclusion probability of $r_{|R|+1}$ is $|R| f(r_{|R|+1})/\sum_{k=1}^{|R|+1} f(r_k) = 2|R|/(|R|+2) > 1$. Since $r_{|R|+1}$ is an overweight record, we buffer it. We construct the remaining records of the stream with $f(r_{|R|+2}) = \cdots = f(r_N) = f(r_{|R|+1}) = 2$ so as to have all of them overweight, and we must buffer them all. The priority queue thus contains $N - |R|$ records in it. Since $f(r_i) = 1 \;\forall i \leq |R|$, we have $|R| f(r_i)/\sum_{k=1}^{N} f(r_k) = |R|/[|R| + 2(N - |R|)] < 1$, and since $f(r_i) = 2 \;\forall i > |R|$ and $N > |R|$, we have $|R| f(r_i)/\sum_{k=1}^{N} f(r_k) = 2|R|/[|R| + 2(N - |R|)] < 1$. Thus, for a well-defined bias function $f$ and the constructed stream, the required queue size is $N - |R|$. We therefore conclude that for $N > |R|$, the size of the buffer required for delayed insertion of the overweight records is $O(N)$.

We stress that though this upper bound is quite poor (requiring that we buffer nearly the entire data stream!), it is in fact a worst-case scenario, and the approach will often be feasible in practice. This is because weights will often increase monotonically over time (as in the case

in practice. This is because weights will often increase monotonically over time (as in the case

where newer records tend to be more relevant for query processing than older ones). Still, given

the poor worst-case upper bound, a more robust solution is required, which we now describe.

4.1.3 Adjusting Weights of Existing Samples

Another, orthogonal method for handling overweight records (that can be applied when

the available buffer memory is exceeded) is to simply adjust the bias function and try to do the

best that we can. Specifically, when we encounter an overweight record, we simply bump up the

weights of all existing samples so as to ensure the inclusion probability of the current record is

exactly one. Of course, as a result of this we will not be able to ensure that the weight of each

record ri is exactly f(ri). We describe what we will be able to guarantee in the context of the true

weight of a record:









Definition 1. If $R_i$ is the biased sample of the first $i$ records produced by a data stream, the value $f'(r_j)$ is the true weight of a record $r_j$ if and only if $\Pr[r_j \in R_i] = \frac{|R| f'(r_j)}{\sum_{k=1}^{i} f'(r_k)}$.

What we will be able to guarantee is then twofold:


1. First, we will be able to guarantee that f'(rj) will be exactly f(rj) if (|R| f(ri))/totalWeight <
1 for all k > j.

2. We can also guarantee that we can compute the true weight for a given record to unbiased
any estimate made using our sample (see Section 4.4).


In other words, our biased sample can still be used to produce unbiased estimates that

are correct on expectation [16], but the sample might not be biased exactly as specified by the

user-defined function f, if the value of f(r) tends to fluctuate wildly. While this may seem like a

drawback, the number of records not sampled according to f will usually be small. Furthermore,

since the function used to measure the utility of a sample in biased sampling is usually the result

of an approximate answer to a difficult optimization problem [15] or the application of a heuristic

[52], having a small deviation from that function might not be of much concern.

We present a single-pass biased sampling algorithm that provides both guarantees outlined

above as Algorithm 7, and Lemma 4 proves the correctness of the algorithm.

Lemma 4. Let $R_i$ be the state of the biased sample just after the $i$th record in the stream has been processed. Using the biased sampling described in Algorithm 7, we are guaranteed that for each $R_i$ and for each record $r_j$ produced by the data stream such that $j \leq i$, we have $\Pr[r_j \in R_i] = \frac{|R| f'(r_j)}{\sum_{k=1}^{i} f'(r_k)}$.

Proof. We know that the probability of selecting the $i$th record for the reservoir is $|R| f(r_i)/totalWeight$. Then, there are two cases to explore: the first, when the reservoir is full and before we encounter an overweight record $r_l$, and the second after we encounter such an $r_l$.

Case (i): The proof of this case is very similar to the proof of Lemma 3. We simply use $f'$ instead of $f$ to prove the desired result.








Algorithm 7 Biased Reservoir Sampling (Adjusting Weights of Existing Samples)
1: Set totalWeight = 0
2: for int i = 1 to ∞ do
3:   Wait for a new record r_i to appear in the stream
4:   Set r_i.weight = f(r_i)
5:   totalWeight = totalWeight + f(r_i)
6:   if i ≤ |R| then
7:     Add r_i directly to R
8:   else
9:     if |R| × f(r_i)/totalWeight ≤ 1 then
10:      with probability |R| × f(r_i)/totalWeight do
11:        Remove a randomly selected record from R
12:        Add r_i to R
13:    else
14:      for each record r_j in R do
15:        r_j.weight = ((|R|−1) × f(r_i))/(totalWeight − f(r_i)) × r_j.weight
16:      totalWeight = |R| × f(r_i)
17:      Remove a randomly selected record from R
18:      Add r_i to R
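A quick check that the scaling constant in Steps (14)-(16) does what is claimed: after every existing weight is multiplied by $C = \frac{(|R|-1) f(r_i)}{totalWeight - f(r_i)}$, the new total weight is

$$C \times (totalWeight - f(r_i)) + f(r_i) = (|R|-1) f(r_i) + f(r_i) = |R| f(r_i),$$

so the inclusion probability of the overweight record, $|R| f(r_i)/totalWeight$, becomes exactly one.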


Case (ii): If $\frac{|R| f(r_i)}{totalWeight} > 1$, we scale the true weight of every existing sample so as to have $totalWeight = |R| f(r_i)$. This is done by first setting $C = \frac{(|R|-1) f(r_i)}{totalWeight - f(r_i)}$ and then scaling up $f'(r_k) = C \times f'(r_k) \;\forall k < i$. As a result of this linear scaling, we have

$$\Pr[r_j \in R_i] = \frac{|R| \times C \times f'(r_j)}{totalWeight} = \frac{|R| \times C \times f'(r_j)}{\sum_{k<i} C \times f'(r_k) + f(r_i)} = \frac{|R| f'(r_j)}{\sum_{k=1}^{i} f'(r_k)}. \qquad\square$$

An important factor to consider while determining the applicability of Algorithm 7 is the deviation of $f'$ from $f$. That is: how far off from the correct weighting can we be, in the worst case? When the stream has no overweight records, we expect $f'$ to be exactly equal to $f$, but it may be very far away under certain circumstances. To address this, we define a distance metric in Definition 2 and evaluate the worst case distance between $f'$ and $f$.









Definition 2. If $f$ is the user-defined bias function and $f'$ is the actual bias function, then the distance between these two functions is defined as $totalDist(f, f') = \sum_{i=1}^{N} dist(r_i)$, where

$$dist(r_i) = \left| \frac{f'(r_i)}{\sum_{k=1}^{N} f'(r_k)} - \frac{f(r_i)}{\sum_{k=1}^{N} f(r_k)} \right|.$$

For a data stream with no overweight records, $totalDist(f, f') = 0$ (the best case). The worst case distance is given by Theorem 1 and is analyzed and proved in Section 4.2.

Theorem 1. Given a set of streaming records $r_1, r_2, \ldots, r_N$ and a user-defined weighting function $f$, Algorithm 7 will sample with an actual bias function $f'$ where $totalDist(f, f')$ is upper bounded by

$$2\left( \frac{(|R|-1) f(r'_N)}{|R| f(r'_N) + \sum_{k=|R|+1}^{N-1} f(r'_k)} - \frac{\sum_{k=1}^{|R|} f(r'_k)}{\sum_{k=1}^{N} f(r'_k)} \right),$$

where $r'_1, r'_2, \ldots, r'_N$ is the permutation (reordering) of the streaming records such that $f(r'_1) \leq f(r'_2) \leq \cdots \leq f(r'_N)$.

According to this Theorem, the worst case occurs when the reservoir is initially filled (on startup) with the $|R|$ records having the smallest possible weights (that is, we have the smallest totalWeight when the reservoir is filled) and we encounter the record with the largest weight immediately thereafter. We evaluate the effect of this worst-possible ordering in Chapter 7.

4.2 Worst Case Analysis for Biased Reservoir Sampling Algorithm

Algorithm 7 computes a biased sample according to $f'$, where $f'$ is a "close" function to the user-defined weighting function $f$ according to the following distance metric:

$$totalDist(f, f') = \sum_{i=1}^{N} dist(r_i), \quad\text{where}\quad dist(r_i) = \left| \frac{f'(r_i)}{\sum_{k=1}^{N} f'(r_k)} - \frac{f(r_i)}{\sum_{k=1}^{N} f(r_k)} \right|.$$









The worst case for Algorithm 7 occurs when (1) the reservoir is initially filled with the $|R|$ records having the smallest possible weights, and (2) we encounter the record $r^{max}$ with the largest weight immediately thereafter. Theorem 1 presented an upper bound on $totalDist(f, f')$ in this worst case. In this section, we first provide the proof of this worst case for Algorithm 7 and then prove the upper bound on $totalDist(f, f')$ given by Theorem 1.

4.2.1 The Proof for the Worst Case

To prove the worst case for Algorithm 7, we first prove the following three propositions. These proofs lead us to the worst-case argument. If we denote the record with the highest weight in the stream as $r^{max}$, and use $r^{max}_i$ to denote the case where $r^{max}$ is located at position $i$ in the stream, then for any given random ordering of the streaming records $r_1, \ldots, r_{i-1}, r^{max}_i, \ldots, r_N$, we prove that:

1. Moving the record $r^{max}$ earlier in the range $r_{|R|+1} \ldots r_N$ can not decrease $totalDist(f, f')$.

2. When we are initially filling the reservoir, choosing the $|R|$ records with the smallest possible weights maximizes $totalDist(f, f')$.

3. Reordering any record that appears after $r^{max}$, in the range $r_{i+1} \ldots r_N$, can not increase $totalDist(f, f')$.

The proof of the first proposition, regarding moving $r^{max}$ earlier in the stream:

We prove this proposition by showing that if we move $r^{max}_i$ to $r^{max}_{i-1}$, $totalDist(f, f')$ can not decrease. If $r^{max}$ is not an overweight record, the claim trivially holds, as moving a non-overweight record does not change $totalDist(f, f')$. If $r^{max}$ is an overweight record, we prove that $totalDist(f, f')$ increases because of the move. We first compute $totalDist_1(f, f')$ for $r^{max}_i$ and then compute $totalDist_2(f, f')$ for $r^{max}_{i-1}$. We prove the claim by showing $totalDist_2(f, f') - totalDist_1(f, f') \geq 0$.

1. An Expression for totalDist_1(f, f') for r^max_i

We start with the totalDist formula

\[
totalDist_1(f, f') = \sum_{j=1}^{N} \left| \frac{f'(r_j)}{\sum_{k=1}^{N} f'(r_k)} - \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)} \right|
\]

Since r^max is the ith record of the stream, using the result of Lemma 5 (given below) we re-write the totalDist formula as

\[
totalDist_1(f, f') = \sum_{j=1}^{i-1} \left( \frac{f'(r_j)}{\sum_{k=1}^{N} f'(r_k)} - \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)} \right) + \sum_{j=i}^{N} \left( \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)} - \frac{f'(r_j)}{\sum_{k=1}^{N} f'(r_k)} \right)
\]

We know that \forall j < i, f'(r_j) = \frac{(|R|-1) f(r^{max}) f(r_j)}{\sum_{k=1}^{i-1} f(r_k)}, and \forall j \ge i, f'(r_j) = f(r_j). We also know that \sum_{k=1}^{N} f'(r_k) = |R| f(r^{max}) + \sum_{k=i+1}^{N} f(r_k). Therefore, the above equation simplifies to

\[
totalDist_1(f, f') = \frac{2 \left( \sum_{k=i}^{N} f(r_k) \right) \left( (|R|-1) f(r^{max}) - \sum_{k=1}^{i-1} f(r_k) \right)}{\sum_{k=1}^{N} f(r_k) \times \left( |R| f(r^{max}) + \sum_{k=i+1}^{N} f(r_k) \right)} \tag{4-1}
\]


2. An Expression for totalDist_2(f, f') for r^max_{i-1}

In this case, since r^max is the (i−1)th record of the stream, using the result of Lemma 5 we re-write the totalDist formula as

\[
totalDist_2(f, f') = \sum_{j=1}^{i-2} \left( \frac{f'(r_j)}{\sum_{k=1}^{N} f'(r_k)} - \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)} \right) + \sum_{j=i-1}^{N} \left( \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)} - \frac{f'(r_j)}{\sum_{k=1}^{N} f'(r_k)} \right)
\]

Using an argument similar to the case of totalDist_1(f, f'), the formula simplifies to

\[
totalDist_2(f, f') = \frac{2 \left( \sum_{k=i-1}^{N} f(r_k) \right) \left( (|R|-1) f(r^{max}) - \sum_{k=1}^{i-2} f(r_k) \right)}{\sum_{k=1}^{N} f(r_k) \times \left( |R| f(r^{max}) + f(r^{swap}) + \sum_{k=i+1}^{N} f(r_k) \right)} \tag{4-2}
\]

where r^swap denotes the record, originally at position i−1, that is displaced to position i by the move.

3. An Expression for totalDist_2(f, f') − totalDist_1(f, f')








Figure 4-1. Adjustment of r^max_i to r^max_{i-1}. [The original figure shows the stream ordering r_1, ..., r_{i-2}, r_{i-1}, r_i, r_{i+1}, ..., r_N, with regions 1 through 4 marking the spans of the stream affected and unaffected by the move.]

We obtain an expression for this difference by subtracting Equation (4-1) from Equation (4-2). Figure 4-1 shows the adjustment of r^max_i to r^max_{i-1}, with r^swap moving to position i. If we let

\[
Y = |R| f(r^{max}) + \sum_{k=i+1}^{N} f(r_k)
\]

then subtracting and placing both terms over a common denominator gives

\[
totalDist_2(..) - totalDist_1(..) = \frac{2 f(r^{swap})}{\sum_{k=1}^{N} f(r_k)} \times \frac{Y \left( Y + f(r^{swap}) \right) - (|R|-1) f(r^{max}) \sum_{k=1}^{N} f(r_k)}{Y \times \left( Y + f(r^{swap}) \right)}
\]

Since r^max is an overweight record, (|R|-1) f(r^{max}) > \sum_{k=1}^{i-1} f(r_k), and therefore

\[
Y + f(r^{swap}) = |R| f(r^{max}) + f(r^{swap}) + \sum_{k=i+1}^{N} f(r_k) > \sum_{k=1}^{N} f(r_k)
\]

Since we also have Y = |R| f(r^{max}) + \sum_{k=i+1}^{N} f(r_k) > (|R|-1) f(r^{max}), it follows that Y (Y + f(r^{swap})) > (|R|-1) f(r^{max}) \sum_{k=1}^{N} f(r_k), and hence

\[
totalDist_2(f, f') - totalDist_1(f, f') > 0
\]








Furthermore, we know that Algorithm 7 accepts the first |R| records of the stream with probability 1. No weight adjustments are triggered for the first |R| records, irrespective of their weights. Therefore, the earliest position at which r^max can appear in the stream is right after the reservoir is filled. This proves the proposition. We now turn to proving Lemma 5, which was used in the previous proof.
Lemma 5. If r^max appears as the ith record of the stream, then \forall j < i we have

\[
\frac{f'(r_j)}{\sum_{k=1}^{N} f'(r_k)} \ge \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)}
\]

and \forall j > i we have

\[
\frac{f(r_j)}{\sum_{k=1}^{N} f'(r_k)} \le \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)}
\]

Proof. When we encounter r^max as the ith record of the stream, we increase the weights of r_j \forall j < i by a factor of C = \frac{(|R|-1) f(r^{max})}{\sum_{k=1}^{i-1} f(r_k)} and adjust \sum_{k=1}^{N} f'(r_k) = |R| f(r^{max}) + \sum_{k=i+1}^{N} f(r_k). We also know that \forall j > i, f'(r_j) = f(r_j).

Part 1: \forall j < i we have

\[
\frac{f'(r_j)}{\sum_{k=1}^{N} f'(r_k)} - \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)} = \frac{C \times f(r_j)}{\sum_{k=1}^{i-1} C \times f(r_k) + \sum_{k=i}^{N} f(r_k)} - \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)}
\]

Since C > 1, we have

\[
\frac{C \times f(r_j)}{\sum_{k=1}^{i-1} C \times f(r_k) + \sum_{k=i}^{N} f(r_k)} \ge \frac{f(r_j)}{\sum_{k=1}^{i-1} f(r_k) + \sum_{k=i}^{N} f(r_k)}
\]

We can therefore conclude that

\[
\frac{f'(r_j)}{\sum_{k=1}^{N} f'(r_k)} - \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)} \ge 0
\]


This proves the first part of the lemma.

Part 2: \forall j > i we have

\[
\frac{f(r_j)}{\sum_{k=1}^{N} f'(r_k)} = \frac{f(r_j)}{|R| f(r^{max}) + \sum_{k=i+1}^{N} f(r_k)}
\]

Since \sum_{k=1}^{i} f(r_k) < |R| f(r^{max}), we have

\[
\frac{f(r_j)}{|R| f(r^{max}) + \sum_{k=i+1}^{N} f(r_k)} \le \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)}
\]

We can therefore conclude that

\[
\frac{f(r_j)}{\sum_{k=1}^{N} f'(r_k)} \le \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)}
\]


This proves the second part of the lemma.

The proof of the second proposition regarding the effect of the first |R| records:

We now turn our attention to the effect of the first |R| records of the stream on the worst-case distance. If r^max appears as the (|R|+1)th record in the worst case, then using the result of Lemma 5, \forall j \le |R| we know that

\[
dist(r_j) = \frac{f'(r_j)}{\sum_{k=1}^{N} f'(r_k)} - \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)} = \frac{(|R|-1) f(r^{max}) f(r_j)}{\sum_{k=1}^{|R|} f(r_k) \times \left( |R| f(r^{max}) + \sum_{k=|R|+2}^{N} f(r_k) \right)} - \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)}
\]

For a given set of records, \sum_{k=1}^{|R|} f(r_k) + \sum_{k=|R|+1}^{N} f(r_k) is a constant. Therefore, dist(r_j) increases as \sum_{k=1}^{|R|} f(r_k) decreases. In other words, dist(r_j) is maximal for the smallest possible \sum_{k=1}^{|R|} f(r_k). Thus, totalDist(f, f') is largest when the reservoir is initially filled with the |R| records having the smallest possible weights. This proves the claim in the second proposition.

The proof of the third proposition regarding reordering of records after r^max:

This is immediate: since r^max is the highest-weight record of the stream, no record after r^max can be an overweight record.

From the above three propositions, we can conclude that the worst case for Algorithm 7 occurs when (1) the reservoir is initially filled with the |R| records having the smallest possible weights and (2) we encounter the record r^max with the largest weight immediately thereafter.
4.2.2 The Proof of Theorem 1: The Upper Bound on totalDist
To derive the upper bound, we start with the totalDist formula and give its value in the worst case:

\[
totalDist_w(f, f') = \sum_{j=1}^{N} \left| \frac{f'(r_j)}{\sum_{k=1}^{N} f'(r_k)} - \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)} \right|
\]

We know that in the worst case r^max appears as the (|R|+1)th record in the stream. Using the result of Lemma 5, we re-write the totalDist formula as

\[
totalDist_w(f, f') = \sum_{j=1}^{|R|} \left( \frac{f'(r_j)}{\sum_{k=1}^{N} f'(r_k)} - \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)} \right) + \sum_{j=|R|+1}^{N} \left( \frac{f(r_j)}{\sum_{k=1}^{N} f(r_k)} - \frac{f'(r_j)}{\sum_{k=1}^{N} f'(r_k)} \right)
\]

We know that \forall j \le |R|, f'(r_j) = \frac{(|R|-1) f(r^{max}) f(r_j)}{\sum_{k=1}^{|R|} f(r_k)}, and \forall j > |R|, f'(r_j) = f(r_j). We also know that \sum_{k=1}^{N} f'(r_k) = |R| f(r^{max}) + \sum_{k=|R|+2}^{N} f(r_k). Therefore, similar to Equation (4-1), the above equation simplifies to

\[
totalDist_w(f, f') = \frac{2 \left( \sum_{k=|R|+1}^{N} f(r_k) \right) \left( (|R|-1) f(r^{max}) - \sum_{k=1}^{|R|} f(r_k) \right)}{\sum_{k=1}^{N} f(r_k) \times \left( |R| f(r^{max}) + \sum_{k=|R|+2}^{N} f(r_k) \right)} \tag{4-3}
\]

In the worst case the reservoir is initially filled with the |R| records having the smallest possible weights. If r_1, r_2, ..., r_N are the records in appearance order, then we define r'_1, r'_2, ..., r'_N as the permutation (reordering) of the records such that f(r'_1) \le f(r'_2) \le \cdots \le f(r'_N). The condition requiring the reservoir to be filled with the smallest possible weights, with the heaviest record encountered immediately thereafter, can then be written as

\[
\sum_{k=1}^{|R|} f(r_k) \ge \sum_{k=1}^{|R|} f(r'_k) \quad \text{and} \quad f(r^{max}) = f(r'_N) \tag{4-4}
\]

Since Equation (4-3) increases as \sum_{k=1}^{|R|} f(r_k) decreases, substituting the extreme values allowed by (4-4) into (4-3) yields

\[
totalDist_w(f, f') \le \frac{2 \left( \sum_{k=|R|+1}^{N} f(r'_k) \right) \left( (|R|-1) f(r'_N) - \sum_{k=1}^{|R|} f(r'_k) \right)}{\sum_{k=1}^{N} f(r'_k) \times \left( |R| f(r'_N) + \sum_{k=|R|+1}^{N-1} f(r'_k) \right)}
\]

which is the upper bound given by Theorem 1.












4.3 Biased Reservoir Sampling With The Geometric File


It is easy to use the biased reservoir sampling algorithm with a geometric file. To use the

geometric file for biased sampling, it is vital that we be able to compute the true weight of any

given record. To allow this, we will require that the following auxiliary information be stored:

Each record r will have its effective weight r.weight stored along with it in the geometric file on disk. Once totalWeight becomes large, we can expect that for each new record r, r.weight = f(r). However, for the initial records from the data stream, these two values will not necessarily be the same.

Each subsample S_i will have a weight multiplier M_i associated with it. Again, for subsamples containing records produced by the data stream after totalWeight becomes large, M_i will typically be one. For efficiency, the M_i values can be buffered in main memory. Along with the effective weight, the weight multiplier gives us the true weight for a given record r of S_i, which will be M_i × r.weight.


Algorithmic changes: Given that we need to store this auxiliary information, the algorithms

for sampling from a data stream using the geometric file will require three changes to support

biased sampling. These modifications are described now:

During start-up. To begin with, the reservoir is filled with the first |R| records from the stream. For each of these initial records, r.weight is set to one. Let totalWeight be the sum of f(r) over the first |R| records. When the reservoir is finished filling, M_i is set to totalWeight/|R| for every one of the initial subsamples. In this way, the true weight of each of the first |R| records produced by the data stream is set to be the mean value of f(r) for the first |R| records. Giving the first |R| records a uniform true weight is a necessary evil, since they will all be overwritten by subsequent buffer flushes with equal probability.

As subsequent records are produced by the data stream. Just as suggested by Algorithm 4, additional records produced by the stream are added to the buffer with probability (|R| × f(r_i))/totalWeight, so that at least initially, the true weight of the ith record is exactly f(r_i). The interesting case is when (|R| × f(r_i))/totalWeight > 1 when the ith record is produced by the data stream. In this case, we must scale the true weight of every existing record up so that (|R| × f(r_i))/totalWeight = 1. To accomplish this, we do the following:
1. For each on-disk subsample, M_j is set to M_j × ((|R| − 1) × f(r_i))/totalWeight.
2. For each sampled record r_j still in the buffer, r_j.weight is set to r_j.weight × ((|R| − 1) × f(r_i))/totalWeight.
3. Finally, totalWeight is set to |R| × f(r_i).









As the buffer fills. When the buffer fills and the jth subsample is to be created and written
to disk, Mj is set to 1.
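As a concrete illustration of this bookkeeping, the following C++ sketch shows how the three scaling steps might be applied as each new record arrives. All names are illustrative rather than taken from our implementation, and start-up handling, buffer flushes, and the eviction of overwritten records are omitted:

#include <cstddef>
#include <random>
#include <vector>

// A minimal sketch of the weight-multiplier bookkeeping described above.
struct Record    { double weight; };   // effective weight r.weight
struct Subsample { double M = 1.0; };  // per-subsample weight multiplier M_j

struct BiasedReservoir {
    std::size_t R = 0;               // reservoir capacity |R|
    double totalWeight = 0.0;        // sum of weights of the stream so far
    std::vector<Subsample> onDisk;   // one entry per on-disk subsample
    std::vector<Record> buffer;      // sampled records awaiting the next flush
    std::mt19937 gen{std::random_device{}()};

    void process(double f) {         // f = f(r_i) for the newly arrived record
        double c = ((R - 1) * f) / totalWeight;
        if (c > 1.0) {               // overweight: |R| f(r_i) exceeds the new totalWeight
            for (Subsample& s : onDisk) s.M *= c;       // step 1
            for (Record& r : buffer)    r.weight *= c;  // step 2
            totalWeight = R * f;     // step 3: c * totalWeight + f = |R| f(r_i)
            buffer.push_back({f});   // the record is accepted with probability one
        } else {
            totalWeight += f;
            std::bernoulli_distribution accept((R * f) / totalWeight);
            if (accept(gen)) buffer.push_back({f});
        }
    }
};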


4.4 Estimation Using a Biased Reservoir

The biased sampling algorithm presented gives a user the opportunity to make use of different weighting algorithms and estimators, depending upon the particular application domain. We discuss one such simple estimator, the standard Horvitz-Thompson estimator [50], for a sample computed using our algorithm. We derive the correlation (covariance) between the Bernoulli random variables governing the sampling of two records r_i and r_j using our algorithm, and use this covariance to derive the variance of a Horvitz-Thompson estimator. Combined with the Central Limit Theorem, the variance can then be used to provide bounds on the estimator's accuracy. The estimator is suitable for the SUM aggregate function (and, by extension, the AVERAGE and COUNT aggregates) over a single database table for which the reservoir is maintained. Though handling more complicated queries using the biased sample is beyond the scope of this work, it is straightforward to extend the analysis of this section to more complicated queries such as joins [32].

Imagine that we have the following single-table query, whose (unknown) answer is q:

SELECT SUM(g1(r))

FROM THE_TABLE AS r

WHERE g2(r)

Given such a query, let g(r) = g1(r) if g2(r) evaluates to true, and 0 otherwise. Let R_i be a state of the biased sample just after the ith record in the stream has been processed. Then the unbiased Horvitz-Thompson estimator for the query answer q can be written as

\[
\hat{q} = \sum_{r_j \in R_i} \frac{g(r_j)}{Pr[r_j \in R_i]}
\]

In the Horvitz-Thompson estimator, each record is weighted according to the inverse of its sampling probability.
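Since the inclusion probability of a sampled record r at time i is (|R| × M(r) × r.weight)/totalWeight, that is, the true weight of Section 4.3 over the current totalWeight, the estimator can be computed directly from the sample. The following C++ sketch, with illustrative struct, field, and function names (not the thesis code), makes this concrete:

#include <cstddef>
#include <functional>
#include <vector>

// A sketch of computing the Horvitz-Thompson sum estimate from the sample.
struct SampledRec {
    double weight;  // effective weight r.weight
    double M;       // weight multiplier of the record's subsample
    // ... the actual record attributes would follow
};

double estimateSum(const std::vector<SampledRec>& sample,
                   std::size_t R, double totalWeight,
                   const std::function<double(const SampledRec&)>& g1,
                   const std::function<bool(const SampledRec&)>& g2) {
    double qHat = 0.0;
    for (const SampledRec& r : sample) {
        if (!g2(r)) continue;                        // WHERE clause fails, so g(r) = 0
        double trueWeight = r.M * r.weight;          // f'(r) as in Section 4.3
        double pr = (R * trueWeight) / totalWeight;  // Pr[r is in the reservoir]
        qHat += g1(r) / pr;                          // inverse-probability weighting
    }
    return qHat;
}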

Next, we derive the variance of this estimator. To do this, we need a result similar to Lemma

3 that can be used to compute the probability Pr[{rj, rk} E Ri] under our biased sampling

scheme.








Lemma 6. Let R_i be a state of the biased sample just after the ith record in the stream has been processed. Using the biased sampling algorithm described in Algorithm 7, for each R_i and for each record pair {r_j, r_k} produced by the data stream where j < k \le i, we have

\[
Pr[\{r_j, r_k\} \in R_i] = \frac{|R| (|R|-1) f'(r_j) f'(r_k)}{\sum_{l=1}^{k-1} f'(r_l) \sum_{l=1}^{k} f'(r_l)} \times \prod_{l=k+1}^{i} \left( 1 - \frac{2 Pr[r_l \in R_l]}{|R|} \right)
\]

Proof. The proof is analogous to the proof of Lemma 3. There are two sub-cases to consider. If k = i, then the proof is relatively easy. In this case

\[
Pr[\{r_j, r_k\} \in R_i] = Pr[r_j \in R_{i-1}]\, Pr[r_i \in R_i]\, Pr[r_j \text{ not expelled}] = \left( \frac{|R| f'(r_j)}{\sum_{l=1}^{i-1} f'(r_l)} \right) \left( \frac{|R| f'(r_i)}{\sum_{l=1}^{i} f'(r_l)} \right) \left( \frac{|R|-1}{|R|} \right) = \frac{|R| (|R|-1) f'(r_j) f'(r_i)}{\sum_{l=1}^{i-1} f'(r_l) \sum_{l=1}^{i} f'(r_l)}
\]

If both k and j are less than i, then the proof becomes a bit more involved and the probability must be computed recursively. In this case, we have

\[
Pr[\{r_j, r_k\} \in R_i] = Pr[\{r_j, r_k\} \in R_{i-1}]\, Pr[r_i \in R_i]\, Pr[\{r_j, r_k\} \text{ not expelled}] + Pr[\{r_j, r_k\} \in R_{i-1}] \left( 1 - Pr[r_i \in R_i] \right)
\]
\[
= Pr[\{r_j, r_k\} \in R_{i-1}] \left( Pr[r_i \in R_i] \frac{|R|-2}{|R|} + 1 - Pr[r_i \in R_i] \right) = Pr[\{r_j, r_k\} \in R_{i-1}] \left( 1 - \frac{2 Pr[r_i \in R_i]}{|R|} \right)
\]

Unrolling this recursion until we reach R_k gives

\[
Pr[\{r_j, r_k\} \in R_i] = Pr[\{r_j, r_k\} \in R_k] \times \prod_{l=k+1}^{i} \left( 1 - \frac{2 Pr[r_l \in R_l]}{|R|} \right) = \frac{|R| (|R|-1) f'(r_j) f'(r_k)}{\sum_{l=1}^{k-1} f'(r_l) \sum_{l=1}^{k} f'(r_l)} \times \prod_{l=k+1}^{i} \left( 1 - \frac{2 Pr[r_l \in R_l]}{|R|} \right)
\]

which is the desired probability. □


This expression can then be used in conjunction with the next lemma to compute the variance of the natural estimator for q.

Lemma 7. The variance of \hat{q} is

\[
Var(\hat{q}) = \sum_{r_j} \frac{g^2(r_j)}{Pr[r_j \in R_i]} + \sum_{r_j, r_k : j < k} \frac{2\, Pr[\{r_j, r_k\} \in R_i]\, g(r_j) g(r_k)}{Pr[r_j \in R_i]\, Pr[r_k \in R_i]} - q^2
\]

Proof. Let X_j be the Bernoulli random variable governing whether r_j \in R_i, so that \hat{q} = \sum_{r_j} X_j\, g(r_j) / Pr[r_j \in R_i]. Then

\[
Var(\hat{q}) = E[\hat{q}^2] - (E[\hat{q}])^2 = E\left[ \left( \sum_{r_j} \frac{X_j\, g(r_j)}{Pr[r_j \in R_i]} \right)^2 \right] - q^2
\]
\[
= \sum_{r_j} \frac{E[X_j^2]\, g^2(r_j)}{Pr^2[r_j \in R_i]} + \sum_{r_j, r_k : j < k} \frac{2\, E[X_j X_k]\, g(r_j) g(r_k)}{Pr[r_j \in R_i]\, Pr[r_k \in R_i]} - q^2
\]
\[
= \sum_{r_j} \frac{g^2(r_j)}{Pr[r_j \in R_i]} + \sum_{r_j, r_k : j < k} \frac{2\, Pr[\{r_j, r_k\} \in R_i]\, g(r_j) g(r_k)}{Pr[r_j \in R_i]\, Pr[r_k \in R_i]} - q^2
\]

This proves the lemma. □


By using the result of Lemma 6 to compute Pr[{r_j, r_k} ∈ R_i], the variance of the estimator is then easily obtained for a specific query. In practice, the variance itself must be estimated by considering only the sampled records, as we typically do not have access to each and every r_j during query processing. The q² term and the two sums in the expression for the variance are thus computed over each r_j in the sample of the biased geometric file rather than over the entire reservoir.

There is one additional issue regarding biased sampling that is worth some additional discussion: how to efficiently compute the value Pr[{r_j, r_k} ∈ R_i] in order to estimate the variance during query evaluation. Computing Pr[{r_j, r_k} ∈ R_i] requires that we be able to compute two subexpressions for each sampled record pair:

\[
\frac{|R| (|R|-1) f'(r_j) f'(r_k)}{\sum_{l=1}^{k-1} f'(r_l) \sum_{l=1}^{k} f'(r_l)} \quad \text{and} \quad \prod_{l=k+1}^{i} \left( 1 - \frac{2 Pr[r_l \in R_l]}{|R|} \right)
\]
The first subexpression can be easily computed with the help of the running total totalWeight along with the weight multipliers associated with each subsample. When sample records are added to the reservoir, in addition to the attribute r_i.weight, we store two more attributes with each record: r_i.oldTotalWeight and r_i.oldM. The first attribute gets its value from the current value of totalWeight, whereas M(r_i) is stored in the second attribute. When a query is evaluated and we need to compute the first subexpression for a given record pair r_j and r_k, we compute the terms in its denominator as follows:

\[
\sum_{l=1}^{k} f'(r_l) = r_k.oldTotalWeight \times \frac{M(r_k)}{r_k.oldM}
\]
\[
\sum_{l=1}^{k-1} f'(r_l) = \sum_{l=1}^{k} f'(r_l) - f'(r_k) = \sum_{l=1}^{k} f'(r_l) - \left( r_k.weight \times M(r_k) \right)
\]

The second subexpression can also be easily computed if we maintain a running total subexp2Total for the sum \sum_l \log \left( 1 - \frac{2 Pr[r_l \in R_l]}{|R|} \right) at all times. When a new record is added to the reservoir, the current value of subexp2Total is stored as another attribute r_l.subexp2Val along with the record. When a query is evaluated, for a given record pair r_j and r_k we simply evaluate

\[
\prod_{l=k+1}^{i} \left( 1 - \frac{2 Pr[r_l \in R_l]}{|R|} \right) = e^{\,subexp2Total - r_k.subexp2Val}
\]









CHAPTER 5
SAMPLING THE GEOMETRIC FILE

A geometric file is a simple random sample (without replacement) from a data stream. In

this chapter we develop techniques which allow a geometric file to itself be sampled in order to

produce smaller sets of data objects that are themselves random samples (without replacement)

from the original data stream. The goal of the algorithms described in this chapter is to efficiently

support further sampling of a geometric file by making use of its own structure.

5.1 Why Might We Need To Sample From a Geometric File?

In Section 3.2, we argued that small samples frequently do not provide enough accuracy,

especially in the case when the resulting statistical estimator has a very high variance. However,

while in the general case a very large sample can be required to answer a difficult query, a

huge sample may often contain too much information. For example, reconsider the problem

of estimating the average net worth of American households as described in Section 3.2. In

the general case, many millions of samples may be needed to estimate the net worth of the

average household accurately (due to a small ratio between the average household's net worth

and the standard deviation of this statistic across all American households). However, if the same

set of records held information about the size of each household, only a few hundred records

would be needed to obtain similar accuracy for an estimate of the average size of an American

household, since the ratio of the average household size to the standard deviation of household size across households in the United States is greater than 2. Thus, to estimate the answers to these two queries, vastly different sample sizes are needed.
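To make the contrast concrete, a standard central-limit approximation (a textbook calculation, not a result from this work) puts the sample size needed to estimate a mean μ to within relative error ε at 95% confidence at roughly n ≈ (1.96 σ / (ε μ))². With σ/μ ≤ 1/2, as for household size, ε = 0.05 gives n ≈ (1.96 × 0.5/0.05)² ≈ 384 records; if instead σ/μ were 50, as can plausibly hold for net worth, the same accuracy would require n ≈ (1.96 × 50/0.05)² ≈ 3.8 million records.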

5.2 Different Sampling Plans for the Geometric File

Since there is no single sample size that is optimal for answering all queries and the required

sample size can vary dramatically from query to query, this chapter considers the problem of

generating a sample of size N from a data stream using an existing geometric file that contains a

large sample of records from the stream, where N < |R|. We will consider two specific problems.

First, we consider the case where N is known beforehand. We will refer to a sample retrieved

in this manner as a batch sample. Batch samples of fixed size have been suggested for use in









several approximate query processing applications [1, 21, 30, 34, 39]. In general, the drawback

of making use of a batch sample is that the accuracy of any estimator which makes use of the

sample is fixed at the time that the sample is taken, whereas the benefit of batch sampling is that

the sample can be drawn with very high efficiency.

We will also consider the case where N is not known beforehand, and we want to implement

an iterative function GetNext. Each call to GetNext results in an additional sampled record being

returned to the caller, and so N consecutive calls to GetNext result in a sample of size N. We will refer to a sample retrieved in this manner as an online or sequential sample. The drawback of online sampling compared to batch sampling is that it is generally less efficient to obtain a sample of size N using online methods. However, since the consumer of the sample can call GetNext repeatedly until an estimator with enough accuracy is obtained, online sampling is more flexible than batch sampling. An online sample retrieved from a geometric file can be useful for many applications, including online aggregation [32, 33]. In online aggregation, a database system tries to quickly gather enough information to approximate the answer to an aggregate query. As more and more information is gathered, the approximation quality is improved, and the online sampling procedure is halted when the user is happy with the approximation accuracy.

5.3 Batch Sampling From a Geometric File

5.3.1 A Naive Algorithm

The most obvious way to implement batch sampling is to make use of the reservoir sampling algorithm to draw a sample of size N from a geometric file of size |R| in a single pass. As the

following lemma asserts, the resulting sample is also a sample of size N from the original data

stream.


Lemma 8. The reservoir sampling algorithm over a geometric file produces a correct random

sample of the stream.


Proof. If S is the batch sample of size N retrieved from a geometric file R of size |R| using the reservoir sampling algorithm, then we know from the correctness of the reservoir sampling algorithm that:

\[
Pr[S \subseteq R] = \frac{\text{number of subsets of size } N \text{ in } R}{\text{number of such subsets in the data stream } D} = \binom{|R|}{N} \Big/ \binom{|D|}{N}
\]

Now, imagine that S \subseteq R. If we obtain a sample of size N from R using the reservoir algorithm, the probability that we choose precisely S is:

\[
Pr[S \text{ sampled from } R \mid S \subseteq R] = 1 \Big/ \binom{|R|}{N}
\]

Thus we have:

\[
Pr[S \text{ sampled from } R] = Pr[S \text{ sampled from } R \mid S \subseteq R] \times Pr[S \subseteq R] = \frac{1}{\binom{|R|}{N}} \times \frac{\binom{|R|}{N}}{\binom{|D|}{N}} = \frac{1}{\binom{|D|}{N}}
\]

This is precisely the probability we would expect if we sampled directly from the stream without replacement. □

Unfortunately, though it is very simple, the naive algorithm will be inefficient for drawing

a small sample from a large geometric file since it requires a full scan of the geometric file to

obtain a true random sample for any value of N. Since the geometric file may be gigabytes in

size, this can be problematic.

5.3.2 A Geometric File Structure-Based Algorithm

We can do better if we make use of the structure of a geometric file itself. The intuitive

outline of this approach is as follows. To obtain a batch sample of size N, we pre-calculate

how many records from each on-disk subsample will be included in the batch sample, and then

we read the appropriate number of records sequentially from the various segments of each

subsample. The process of choosing the number of records to select from each subsample is









analogous to Olken and Rotem's procedure for choosing the number of records to select from

each hash bucket when performing batched sampling from a hashed file [26]. Once the number

of sampled records from each segment has been determined, sampling those records can be done with an efficient sequential read, since within each on-disk segment all records are stored in a randomized order. The key algorithmic issue is how to calculate the contribution of each subsample. Since this contribution is a multivariate hypergeometric random variable, we can use an approach analogous to Algorithm 4, which is used to partition the buffer to form the segments of a subsample. In other words, we can view retrieving N samples from a geometric file as analogous to choosing N random records to overwrite when new records are added to the file.

The resulting algorithm can be described as follows. To start with, we partition the sample

space of N records into segments of varying size exactly as in Algorithm 4. We refer to these

segments of the sample space as sampling segments. The sampling segments are then filled with

samples from the disk using a series of sequential reads, analogous to the set of writes that are

used to add new samples to the geometric file. The largest sampling segment obtains all of its

records from the largest subsample, the next largest sampling segment obtains all of its records from the second largest subsample, and so on.

Algorithm 8 Batch Sampling a Geometric File
1: Set NS = Number of subsamples in a geometric file
2: for i = 1 to NS do
3: Set RecsInSubsam[i] = Size of ith subsample
4: Set RecsToRead[i] = 0
5: for i = 1 to N do
6: Choose j such that Pr[choosing j] = RecsInSubsam[j]/(|R| − i + 1)
7: RecsInSubsam[j] − −
8: RecsToRead[j] + +
9: for i = 1 to NS do
10: Append to the batch sample RecsToRead[i] records from the ith subsample


When using this algorithm, some care needs to be taken when N approaches the size of the geometric file. Specifically, when all disk segments of a subsample are returned to a corresponding sampling segment, we must also consider the subsample's in-memory buffered records and any records contained in its stack in order to obtain a sample of the desired size. The detailed algorithm is presented as Algorithm 8.

It is clear that this algorithm obtains the desired batch sample by scanning exactly N records, as opposed to the full scan required by reservoir sampling, at the cost of a few random disk seeks. Since the sampling process is analogous to the process of adding more samples to the file, it is just as efficient, requiring O(ω × log |B| / N) random disk head movements for each newly sampled record, as described in Lemma 2.
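To make the partitioning step concrete, the following C++ sketch generates the per-subsample counts used by Algorithm 8; it draws the multivariate hypergeometric counts via N sequential draws proportional to the remaining subsample sizes. Names are illustrative, and N ≤ |R| is assumed:

#include <cstddef>
#include <random>
#include <vector>

// A sketch of the partitioning step of Algorithm 8.
std::vector<std::size_t> partitionBatch(std::vector<std::size_t> recsInSubsam,
                                        std::size_t N, std::mt19937& gen) {
    std::vector<std::size_t> recsToRead(recsInSubsam.size(), 0);
    std::size_t remaining = 0;
    for (std::size_t n : recsInSubsam) remaining += n;   // initially |R|
    for (std::size_t i = 0; i < N; ++i) {
        // Choose subsample j with probability recsInSubsam[j] / remaining.
        std::uniform_int_distribution<std::size_t> pick(1, remaining);
        std::size_t target = pick(gen);
        std::size_t j = 0;
        while (target > recsInSubsam[j]) target -= recsInSubsam[j++];
        --recsInSubsam[j];               // one fewer record left in subsample j
        --remaining;
        ++recsToRead[j];                 // the batch sample takes one more from j
    }
    return recsToRead;  // then read recsToRead[j] records sequentially from each S_j
}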

5.3.3 Batch Sampling Multiple Geometric Files

The geometric file structure-based batch sampling algorithm can be extended to allow efficient batch sampling from multiple geometric files, in the same way that the insertion algorithm for new samples into the geometric file can be extended to allow insertions into multiple geometric files. The extension is fairly straightforward, with an additional first step in which we determine the number of records to be sampled from each geometric file. Once this number is determined, we execute Algorithm 8 on each file in order to obtain the desired batch sample.

5.4 Online Sampling From a Geometric File

5.4.1 A Naive Algorithm

One straightforward way of supporting online sampling from a geometric file is to implement the iterative function GetNext as follows. For every call to GetNext, we simply generate a random number i between 1 and the size of the file |R|, and then return the record at the ith position in the geometric file. Care must be taken to avoid choosing the same record of R more than once, in order to obtain a correct sample without replacement. For example, to sample N records from R, the numbers 0 through N − 1 could be hashed or randomized using a bijective pseudo-random function onto the domain 0 through |R| − 1, and the resulting N numbers used to generate the sample. To pick the next record to sample, we simply hash N.
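As one toy example of such a bijective pseudo-random function (our illustration; the text does not prescribe a particular construction): for gcd(a, |R|) = 1, the map p(n) = (a·n + b) mod |R| permutes {0, ..., |R| − 1}, so hashing 0, 1, 2, ... yields distinct positions and hence a sample without replacement. A minimal C++ sketch:

#include <cstdint>

// A linear permutation is a bijection on {0, ..., size-1} whenever
// gcd(a, size) == 1. Keep a and size below 2^32 so the 64-bit product
// cannot overflow.
struct LinearPermutation {
    std::uint64_t a, b, size;
    std::uint64_t operator()(std::uint64_t n) const {
        return (a * (n % size) + b) % size;
    }
};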

It is easy to see that a naive algorithm will give us a correct online sample of a geometric

file. However, we will use one disk seek per call to GetNext. Since each random I/O requires









around 10 milliseconds, the naive algorithm can only sample around 6,000 records from the

geometric file per minute per disk. This performance is unacceptable for most applications.

5.4.2 A Geometric File Structure-Based Algorithm

As in the case of batch sampling algorithm, we can make use of the structure of a geometric

file to efficiently support online sampling.

Instead of selecting a random record of a geometric file, we randomly pick a subsample

and choose its next available record as a return value of GetNext. This is analogous to the classic

online sampling algorithm for sampling from a hashed file [26], where first a hash bucket is

selected and then a record is chosen. Since the selection of a random record within a subsample

is sequential, we may reduce the number of costly disk seeks if we read the subsample in its

entirety, and buffer the subsample's records in memory. Using this basic methodology, we now

describe how a call to the GetNext will be processed:

We first randomly pick a subsample S_i, with the probability of selecting i proportional to the size of the ith subsample.

Next, we look for buffered records of S_i; if such records exist, we choose and return the first available record as the return value of GetNext. If no buffered records are found, we fetch and buffer a number of blocks of records from subsample S_i, and then return the first of the newly buffered records as the return value of GetNext.


Since the records from each subsample are read and buffered in memory sequentially, we are guaranteed to choose each record of the reservoir at most once, giving us the desired random sample without replacement. A proof of this is simple, and analogous to the proof of Lemma 3. However, thus far we have not considered a very important question: how many blocks of a

subsample Si should we fetch at the time of buffer refill? In general there are two extremes that

we may consider:

Fetch many. If we fetch a large number of blocks at the time of the buffer refill, we reduce
the overall time to sample N records for large N. This is due to the fact that by fetching
many blocks using a sequential read, we amortize the seek time over a large number of
blocks and at the same time we prepare ourselves for future calls to GetNext; once the
records are fetched from disk, the response time for subsequent calls to GetNext is almost
instantaneous (only in-memory computations are required). However, the drawback of this









approach is that the more records we fetch sequentially from the disk during a single call
to GetNext, the longer the response time will be for the particular call to GetNext during
which we fetch those blocks. This is particularly worrisome if we spend a lot of time to
fetch blocks which are never used (which will be the case if the user intends to draw only a
relatively small-sized sample.)

Fetch few. If we fetch a small number of blocks at the time of each buffer refill, we reduce the maximum response time for any given GetNext call. However, we then need more seeks to sample N records. This approach can be problematic if the user intends to draw a relatively large sample from the file.


In order to discuss such considerations more concretely, we note that the time required to process a GetNext call is proportional to the number of blocks fetched on the call, assuming that the cost to perform the required in-memory calculations is minimal. If b blocks are fetched during a particular call, we spend s + br time units on that particular call to GetNext, where s is the seek time and r is the time required to scan a block. Once these b blocks are fetched, we incur zero cost for the next bn calls to GetNext, where n is the blocking factor (the number of records per block). Thus, in the case where all b blocks are fetched at the first call to GetNext, we incur a total cost of s + br to sample bn records, and have a response time of s + br units at the first call to GetNext, with all subsequent calls having zero cost.

Now imagine that instead we split the b blocks into two chunks of size b/2 each, and read a chunk at a time. Thus, the first GetNext call will cost us s + br/2 time units. Once these bn/2 records are used up, we read the next chunk of blocks. The total cost in this scenario is 2s + br, with a response time of s + br/2 time units once at the start and once midway through. Note that although the maximum response time on any call to GetNext is reduced by half, we require more time to sample bn records. The question then becomes: how do we reconcile response time with overall sampling time to give the user optimal performance?

The systematic approach we take to answering this question is based on minimizing the average square sum of the response times over all GetNext calls. This idea is similar to the widely utilized sum-square-error or MSE criterion, which tries to keep the average error or "cost" from being too high, but also penalizes particularly poor individual errors or costs. However, one









problem we face using this strategy in the context of online sampling is that we do not know beforehand the value of N, the number of records to be sampled.

Algorithm 9 GetNext for Online Sampling
1: Set NS = Number of subsamples in a geometric file
2: for i = 1 to NS do
3: Set RecsInSubsam[i] = Size of ith subsample
4: Set BufferedSubsamSize[i] = 0
5: Randomly choose a subsample S_i such that Pr[choosing i] = RecsInSubsam[i] / Σ_j RecsInSubsam[j]
6: RecsInSubsam[i] − −
7: if BufferedSubsamSize[i] == 0 then
8: Set numRecs to the minimum of bs/r and RecsInSubsam[i]
9: Read and buffer numRecs records of S_i
10: BufferedSubsamSize[i] = numRecs
11: BufferedSubsamSize[i] − −
12: Return the next available buffered record of S_i


To address this issue, we use a simple heuristic. Every time we refill a buffer, we look at

the number of records already sampled from a subsample and assume that the user will ask for

the same number of samples as the algorithm progresses. This gives us the planning horizon for

which we can determine the number of blocks to be fetched. We also use the obvious constraint

that the total number of samples fetched from the subsample should not exceed the number of

records in a subsample. Given this, an analytic solution to the problem of minimizing the average

squared cost over all calls to GetNext is as follows:

If there are b records per block, then let N/b be the number of blocks in the planning horizon, and let X be the number of equal-size chunks that we read on every buffer refill. Our goal is to determine the value of X and the number of blocks in each chunk.

We know that the time to read a chunk is proportional to s + (N/b × r)/X, and thus the square sum of the response times of all GetNext calls is X(s + (N/b × r)/X)².

In order to derive a formula for the value of X that minimizes this, we simply differentiate it with respect to X and then solve for the zero.

\[
\frac{d}{dX} \left( X \left( s + \frac{(N/b)\, r}{X} \right)^2 \right) = \frac{d}{dX} \left( X s^2 + \frac{2 N s r}{b} + \frac{((N/b)\, r)^2}{X} \right) = s^2 - \frac{((N/b)\, r)^2}{X^2}
\]









Setting this to zero, we have X = Nr/(bs). Thus, we divide the N/b blocks into Nr/(bs) chunks and read s/r blocks (that is, bs/r records) from a subsample every time we refill the buffer. It turns out that when this solution is used, the number of blocks read at the time of a buffer refill depends only on the ratio of the seek time to the block scan time. Since this solution is independent of the planning horizon, we always read s/r blocks irrespective of the number of records sampled so far. Algorithm 9 gives the detailed online sampling algorithm.
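As an illustration with representative (not measured) numbers: for a disk with seek time s = 10 ms and per-block scan time r = 0.25 ms, every buffer refill reads s/r = 40 blocks (40b records), regardless of how large a sample the user ultimately draws.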

5.5 Sampling A Biased Sample

We end this chapter by noting that if a geometric file's sample is correctly biased, then the batch and online sampling algorithms we have given will also produce a correctly biased sample with no modification, as described by the following lemma.

Lemma 9. A simple, equal-probability random sample from a correctly biased geometric file will

be correctly biased if the sample stored by the geometric file is correctly biased.


Proof. In biased sampling, the probability of a record r being accepted into a geometric file is \frac{|R| \times f(r)}{totalWeight}, where f(r) is the weight of the record under consideration and totalWeight is the sum of the weights of all records from the stream so far.

Let Sample be the biased sample drawn from the geometric file. Then, for a record r in subsample S_i, we have

\[
Pr[r \in Sample] = Pr[\text{selecting } r \text{ from } S_i] \times Pr[\text{selecting } S_i] \times Pr[r \in S_i] = \frac{1}{|S_i|} \times \frac{|S_i|}{|R|} \times \frac{|R| f(r)}{totalWeight} = \frac{f(r)}{totalWeight}
\]



We examine the various algorithms for producing smaller samples from a large, disk-based geometric file in Chapter 7 of this dissertation.









CHAPTER 6
INDEX STRUCTURES FOR THE GEOMETRIC FILE

Efficiently searching and discovering required information from a sample stored in a geo-

metric file is essential to speed up query processing. A natural way to support this functionality

is to build an index structure for the geometric file. In this chapter we discuss three secondary

index structures for the geometric file. The goal is to maintain the index structures as new records

are inserted to the geometric file and at the same time provide efficient access to the desired

information in the file.

6.1 Why Index a Geometric File?

A geometric file may contain a sample that is several gigabytes or even terabytes in size. For certain queries, a huge sample like this may contain too much information, and it becomes expensive to scan all the records of the sample to find those (most likely very few) records that match a given condition. For example, consider a geometric file that maintains a temporally-biased sample of the daily transactions at a large retail store like Wal-Mart. The records feature all of the attributes that are necessary to capture the details of a transaction, such as: StoreID, Location, TransTotal, CustomerID, PaymentMethod, and so on. Consider answering the following SQL query, which returns all Florida customers who made transactions during this calendar year:

SELECT CustomerName, StoreName, TransTotal

FROM Transaction

WHERE StoreState = 'FL' AND TransDate > 1/1/2007

If a sample of the database were stored in a geometric file, a naive way to guess the answer to this query would be to scan the entire geometric file, examining all its records and testing the condition in the WHERE clause on each. The file might have only a few thousand records that satisfy the above criteria, while we would have to scan several billion samples to answer the query. It would be more efficient if we had some way of obtaining only the records from the current year, and then testing each of them to see if they are from the state of Florida. It would be even more efficient if we could directly obtain the few thousand tuples that satisfy both of the conditions of the WHERE clause.

A natural way to speed up the search for and discovery of those records from a geometric file that have a particular value for a particular attribute(s) is to build an index structure. In general, an index is a data structure that lets us find a record without having to look at more than a small fraction of all possible records. Thus, in our example, we could use an index built on either StoreState or TransDate (or both) to quickly access a specific set of records and test them for the conditions in the WHERE clause. In this chapter we focus on building such an index structure for the geometric file.

6.2 Different Index Structures for the Geometric File

An index is referred to as a primary index if it actually forces a specific location for each record in the file, whereas an index is referred to as a secondary index if it merely tells us the current location of a record. Thus, a secondary index is an index that is maintained for a data file, but not used to control the current processing order of the file. In the case of a geometric file, the physical location of a sampled record is determined (randomly) by the insertion algorithm. We therefore consider how to build a secondary index structure on one or more attributes in a geometric file.

Apart from providing efficient access to the desired information in the file, a key consideration is that the index for the geometric file must be maintained as new records are inserted. For instance, we could build a secondary index on an attribute when new records are bulk inserted into the geometric file. We must then determine how to merge the new secondary index with the existing indexes built for the rest of the file. Furthermore, we must maintain the index as existing records are overwritten by newly inserted records and hence are deleted from the geometric file.

With these goals in mind, we discuss three secondary index structures for the geometric file:

(1) a segment-based index, (2) a subsample-based index, and (3) a Log-Structured Merge-Tree-

(LSM-) based index. The first two indexes are developed around the structure of the geometric

file. Multiple B+-tree indexes [9] are maintained for each segment or subsample in a geometric









file. As new records are added to the file in units of a segment or subsample, a new B+-tree that

indexes the new records is created and added to the index structure. Also, an existing B+-tree

is deleted from the structure when all of the records indexed by it are deleted from the file. The

third index structure makes use of the LSM-tree index [44], a disk-based data structure designed

to provide low-cost indexing in an environment with a high rate of inserts and deletes.

In the subsequent sections we discuss construction, maintenance, and querying of these three

types of indexes.

6.3 A Segment-Based Index Structure

The geometric file is a collection of subsamples of exponentially decreasing size. Each

subsample is further divided into a number of segments of exponentially decreasing size. At

every buffer flush the buffered records are divided into different segments, which are then used to overwrite the largest segment of each on-disk subsample. This structure of the geometric file suggests a simple way to construct and maintain an index structure for the file. We could create a B+-Tree index for each segment of each subsample of a geometric file and maintain these indexes as new segments overwrite existing segments. We construct the index structure during start-up, as the reservoir is filled with the first |R| records, and maintain it as subsequent records are produced by the data stream.

We detail construction and maintenance of a segment-based index structure in this section.

6.3.1 Index Construction During Start-up

The geometric file makes use of steps (4)-(13) of Algorithm 3 from Chapter 3 during start-

up to fill the reservoir. Every time the buffer accumulates the desired number of records, it is

segmented and flushed to disk. We build a B+-tree index for each segment just before it is written out to disk. For each buffered record of a segment we construct an index record. An

index record is comprised of the value of the attribute on which the index is getting built (the key

value) and the position of the buffered record on the disk. The position is stored as a number pair:

a page number and offset within a page. The index records are then used to create an index using

the bulk insertion algorithm for a B+-Tree. We use a simple array-based data structure to keep









track of the B+-Trees for each segment in the geometric file. Each array entry simply stores the

position of a B+-Tree root node.

Rather than maintaining a file for each B+-Tree created, we organize multiple B+-Trees in a single disk file. We refer to this single file as the index file. The index file, in a sense, is similar to the log-structured file system proposed by Ousterhout [45]. In a log-structured file system, as files are modified, the contents are written out to the disk as logs in a sequential stream. This allows writes in full-cylinder units, with only track-to-track seeks. Thus the disk operates at nearly its full bandwidth. The index file enjoys similar performance benefits. Every time a B+-Tree is created for a memory-resident segment, it is written to the index file in a sequential stream at the next available position. The array maintaining all B+-Tree root nodes is augmented with the starting disk position of each B+-Tree.
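The bookkeeping just described can be summarized by two small structures, sketched below in C++ with illustrative field names: one index record per sampled record, and one array entry per segment-level B+-Tree.

#include <cstdint>
#include <vector>

// A sketch of the segment-based index bookkeeping; names are illustrative.
struct IndexRecord {
    std::uint64_t key;     // value of the indexed attribute (type depends on the attribute)
    std::uint64_t page;    // disk page holding the sampled record
    std::uint32_t offset;  // offset of the record within that page
};

struct BTreeEntry {
    std::uint64_t rootNode;  // position of this B+-Tree's root node
    std::uint64_t filePos;   // where this tree starts in the index file
};

std::vector<BTreeEntry> btrees;  // one entry per segment of the geometric file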

Finally, we do not index segments that are never flushed to disk. These segments are typically very small (the size of a disk block), and it is efficient to search them using a sequential memory scan when the geometric file is queried.

6.3.2 Maintaining Index During Normal Operation

Maintaining a segment-based index structure is exceedingly simple. During normal

operation as a new subsample and its segments are formed, we build a B+-Tree index for each

in-memory segment just like we did during the start-up. The only difference is that the B+-Trees

are written to the disk in slightly different manner. As an in-memory segment overwrites the

on-disk segment, a B+-Tree for an in-memory segment overwrites the B+-Tree for the on-disk

segment. We update the B+-Tree array entry for the root node of the new B+-Tree that is added

to the index structure. Thus, the index maintenance for records newly inserted into the geometric

file and the records that are deleted from the file is handled at the same time.

The algorithm used to construct and maintain a segment-based index structure is given as

Algorithm 10.









Algorithm 10 Construction and Maintenance of a Segment-Based Index Structure
1: Set n = |B| × (1 − α)
2: Set totSegsInSubsam = ⌈(log n + log(1 − α)) / log α⌉
3: Set totSubsamInR = 0
4: Set totSegsInR = 0
5: Set numRecs = 0
6: while numRecs < |R| do
7: numRecs += |B| × α^totSubsamInR
8: totSegsInR += totSegsInSubsam − totSubsamInR
9: totSubsamInR++
10: Set BTree array of size totSegsInR
11: for int i = 1 to ∞ do
12: if Buffer B is partitioned then
13: for each segment sg_j in B do
14: Build a B+-Tree BT_j
15: if the reservoir is not yet full then
16: Flush BT_j to disk at the next available spot in the index file
17: else
18: Overwrite the B+-Tree for the largest segment of the jth largest subsample of R with BT_j
19: Record BT_j's root and its disk position in the BTree array


6.3.3 Index Look-Up and Search

A segment-based index structure is a collection of B+-Trees, one for each segment of the geometric file. Any index-based search involves looking up all of the B+-Tree indexes. We use the existing B+-Tree point-query and range-query algorithms and re-run them for each entry in the B+-Tree array. The algorithm returns all index records that satisfy the search criteria. We sort the valid index records by their page number attribute. We then retrieve the actual records from the geometric file and return them as the query result.

We expect a segment-based index structure to be a compact structure, as there is exactly one index record present in the index structure for each record in the geometric file, and the index structure is maintained as records are deleted from the file.

6.4 A Subsample-Based Index Structure

Although compact, the segment-based index structure suffers from having too many small indexes. The requirement that we perform a look-up using every single one of a large number of B+-Trees can easily degrade the performance of an index-based search. A geometric file could easily have many thousands of segments in it; even with two disk seeks per B+-Tree to retrieve an index record, a simple point query may require thousands of disk seeks to return the query results. An alternative to a segment-based index structure is to build a B+-Tree index for each subsample of the geometric file. We refer to this approach as a subsample-based index structure.

6.4.1 Index Construction and Maintenance

Every time the buffer accumulates the desired number of samples for a new subsample, we build a single B+-Tree index for all the buffered records. As in the case of a segment-based index structure, we construct an index record for each buffered record and then bulk insert them all to create a B+-Tree index. The structure of the index record for a subsample-based index structure is the same as that of a segment-based index structure, except that we add an attribute recording the segment number to which the buffered record belongs. As discussed subsequently, we use the segment number associated with the index record to determine whether it is stale. We remember the B+-Tree added to the structure by keeping track of its root node in an array structure.

As in the case of a segment-based index structure, we arrange the B+-Tree indexes on disk in a single index file. However, we need a slightly different approach, because during start-up subsamples are flushed to the geometric file until the reservoir is full; thereafter, subsamples of the same size |B| are added to the reservoir. Since each B+-Tree will index no more than |B| records, we can bound the size of a B+-Tree index. We use this bound to pre-allocate a fixed-size slot on disk for each B+-Tree. Furthermore, for every buffer flush after the reservoir is full, exactly one subsample is added to the file and the smallest subsample of the file decays completely, keeping the number of subsamples in a geometric file constant. We use this information to lay out the subsample-based B+-Trees on disk and maintain them as new records are sampled from the data stream.

Thus, if totSubsamples is the total number of subsamples in R, we first allocate totSubsamples fixed-size slots in the index file. Initially all the slots are empty. During start-up, as each new B+-Tree is built, we seek to the next available slot and write out the B+-Tree in a sequential manner. When the reservoir is full, we have used each of the slots exactly once. During normal operation, every time the buffer is full, the slot corresponding to the smallest subsample in the reservoir (which is about to decay completely) is used to write out the newly built B+-Tree. Thus, during normal operation, B+-Tree slots are used in round-robin fashion.

The algorithm used to construct and maintain a subsample-based index structure is given as Algorithm 11.

Algorithm 11 Construction and Maintenance of a Subsample-Based Index Structure
1: Set totSubsamInR = ⌈(log |R| − log |B| + log(1 − α)) / log α⌉
2: Set BTree allTrees[totSubsamInR]
3: Set btIndex = 0
4: for int i = 1 to ∞ do
5: if Buffer B is partitioned then
6: Create index records for every segment in B, tagging each with its segment number
7: allTrees[btIndex].BuildBTree(all index records of B)
8: btIndex++
9: if btIndex ≥ totSubsamInR then
10: btIndex = btIndex mod totSubsamInR


6.4.2 Index Look-Up

In the subsample-based index structure, after every buffer flush exactly one B+-Tree is created and written to the disk, making insertions into the index structure very efficient. However, most of the deletions are deferred until a subsample decays completely. Thus, although every subsample loses its records to the new subsample, B+-Tree records are deleted from the index structure only when the entire B+-Tree is to be deleted. In other words, at any given time, all B+-Trees except the one most recently inserted contain stale records that must be ignored during a search.

A search on a subsample-based index structure involves looking up all B+-Tree indexes, one for each subsample in the geometric file. We modify the existing B+-Tree point-query and range-query algorithms and run them for each entry in the B+-Tree array of the index structure. The modification is required to ignore the stale records in the B+-Trees. As mentioned before, the subsample corresponding to a B+-Tree may lose its segments, but the index records are not deleted from the index tree until the subsample completely decays (when the entire tree is deleted). We refer to an index record as a stale record if it belongs to a segment of a subsample that has already been overwritten (lost).

Recall that we have recorded a segment number in an additional field along with each index record. For a given subsample, we keep track of which of its segments have decayed so far and use this information to ignore the index records that are stale. We return all valid index records that satisfy the search criteria. We first sort these index records by their page number attribute and then retrieve the actual records from the geometric file and return them as the query result.

Although the subsample-based index structure maintains and must search far fewer B+-Trees compared to the segment-based index structure, we expect a somewhat longer search time per B+-Tree due to its larger size and the lazy deletion policy.

6.5 A LSM-Tree-Based Index Structure

An alternative to the segment-based and subsample-based index structures is to build a single index structure for the entire geometric file, and maintain it as new records are inserted into the file. Thus, we design a third index structure that makes use of the LSM-tree index [44]. The LSM-Tree is a disk-based data structure designed to provide low-cost indexing in an environment with a high rate of inserts and deletes.

6.5.1 An LSM-Tree Index

An LSM-tree is composed of two or more tree-like component data structures. The smallest component of the index always resides entirely in main memory (referred to as the C0 tree), and all other, larger components reside on disk (referred to as C1, C2, ..., Cj). A schematic picture of an LSM-tree of two components is depicted in Figure 2.1 of the original LSM-Tree paper [44]. Although the C1 (and higher) components are disk-resident, the most frequently referenced nodes (in general, nodes at higher levels) of these trees are buffered in main memory for performance reasons.

LSM-Tree insertions and deletions: Index records are first inserted into the memory-resident C0 component, after which they migrate to the C1 component that is stored on disk. Insertion into the C0 component has no I/O cost associated with it. However, its size is limited by the size of the available memory. Thus, we must efficiently migrate part of the C0 component to the disk-resident C1 component.

Whenever the C0 component reaches a threshold size, an ongoing rolling merge process removes some records (a contiguous segment) from the C0 component and merges them into the C1 component on disk. The rolling merge process is depicted pictorially in Figure 2.2 of the original LSM-Tree paper [44]. The rolling merge is repeated for migration between higher components of an LSM-Tree in a similar manner. Thus, there is a certain amount of delay before records in the C0 component migrate out to the disk-resident C1 and higher components. Deletions are performed concurrently in batch fashion, similar to inserts.

The disk-resident components of an LSM-tree are comparable to a B+-tree structure, but are optimized for sequential disk access, with nodes 100% full. Lower levels of the tree are packed together in contiguous, multi-page disk blocks for better I/O performance during the rolling merge.

6.5.2 Index Maintenance and Look-Ups

As in the case of the previously proposed index structures, every time the buffer is filled and partitioned into segments, we create an index record for each buffered record and bulk insert them all into an LSM-tree index. The index record is comprised of five fields: (1) the key value, (2) the disk page number of the record, (3) an offset within the page, (4) the segment number to which the record belongs, and (5) the subsample number to which the record belongs. The segment and subsample numbers are recorded with each index record to determine its staleness. Every time a record is migrated from a lower component to a higher, disk-based component, the rolling merge additionally identifies stale records and removes them from the tree structure. We refer to an index record as a stale record if it indexes a record either from a subsample that has decayed completely, or from a segment of a subsample that has been overwritten.
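A staleness test based on these two fields is straightforward; the following C++ sketch (with illustrative names, and assuming segments are numbered in the order in which they are overwritten) shows the check applied during the rolling merge and during look-ups:

#include <cstdint>
#include <vector>

// decayedSubsam[s] marks a completely decayed subsample; segmentsLost[s]
// counts how many of subsample s's segments have been overwritten so far.
struct LSMIndexRecord {
    std::uint64_t key;
    std::uint64_t page;
    std::uint32_t offset;
    std::uint32_t segment;    // segment number within the subsample
    std::uint32_t subsample;  // subsample number within the geometric file
};

bool isStale(const LSMIndexRecord& rec,
             const std::vector<bool>& decayedSubsam,
             const std::vector<std::uint32_t>& segmentsLost) {
    if (decayedSubsam[rec.subsample]) return true;     // whole subsample gone
    return rec.segment < segmentsLost[rec.subsample];  // its segment overwritten
}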

We use the existing LSM-Tree point-query and range-query algorithms to perform index look-ups. As in the case of the previously proposed index structures, we sort the valid index records by their page number attribute and retrieve the actual records from the geometric file as the query result.

In Chapter 7, we evaluate and compare the three index structures suggested in this chapter

experimentally by measuring build time and disk footprint as new records are inserted into the

geometric file. We also compare the efficiency of these structures for point and range queries.









CHAPTER 7
BENCHMARKING

In this chapter, we detail three sets of benchmarking experiments. In the first set of experi-

ments, we attempt to measure the ability of the geometric file to process a high-speed stream of

data records. In the second set of experiments, we examine the various algorithms for producing

smaller samples from a large, disk-based geometric file. Finally, in the third set of experiments,

we compare the three index structures for the geometric file for build time, disk space, and index

look-up speed.

7.1 Processing Insertions

In order to test the relative ability of the geometric file to process a high-speed stream of

insertions, we have implemented and benchmarked five alternatives for maintaining a large

reservoir on disk: the three alternatives discussed in Section 3.3, the geometric file, and the

framework described in Section 3.10 for using multiple geometric files at once. In the remainder of this section, we refer to these alternatives as the virtual memory, scan, local overwrite, geo file, and multiple geo files options. An α value of 0.9 was used for the multiple geo files option.

All implementation was performed in C++. Benchmarking was performed using a set of

Linux workstations, each equipped with 2.4 GHz Intel Xeon Processors. 15,000 RPM, 80GB

Seagate SCSI hard disks were used to store each of the reservoirs. Benchmarking of these disks

showed a sustained read/write rate of 35-50 MB/second, and an "across the disk" random data

access time of around 10ms.

7.1.1 Experiments Performed

The following three experiments were performed:

Insertion experiment 1: The task in this experiment was to maintain a 50GB reservoir holding

a sample of 1 billion, 50B records from a synthetic data stream. Each of the five alternatives was

allowed 600MB of buffer memory to work with when maintaining the reservoir. For the scan,

local overwrite, geo file, and multiple geo files options, 100MB was used as an LRU buffer for

disk reads/writes, and 500MB was used to buffer newly sampled records before processing. The

virtual memory option used all 600MB as an LRU buffer. In the experiment, a continual stream









of records was selected to be inserted into the reservoir (as many as each of the five options

could handle). The goal was to test how many new records could be added to the reservoir in

20 hours, while at the same time expelling existing records from the reservoir as is required by

the reservoir algorithm. The number of new samples processed by each of the five options (that

is, the number of records added to disk) is plotted as a function of time in Figure 7-1 (a). By

"number of samples processed" we mean the number of records that are actually inserted into the

reservoir, and not the number of records that have passed through the data stream.

Insertion experiment 2: This experiment is identical to Experiment 1, except that the 50GB

sample was composed of 50 million, 1KB records. Results are plotted in Figure 7-1 (b). Thus, we

test the effect of record size on the five options.

Insertion experiment 3: This experiment is identical to Experiment 1, except that the amount of

buffer memory is reduced to 150MB for each of the five options. The virtual memory option used

all 150MB for an LRU buffer, and the four other options allocated 100MB to the LRU buffer and

50MB to the buffer for new samples. Results are plotted in Figure 7-1 (c). This experiment tests

the effect of a constrained amount of main memory.

7.1.2 Discussion of Experimental Results

All three experiments suggest that the multiple geo files option is superior to the other options. In Experiments 1 and 2, the multiple geo files option was able to write new samples to disk almost at the maximum sustained speed of the hard disk, at around 40 MB/sec.

It is worthwhile to point out a few specific findings. Each of the five options writes the first 50GB of data from the stream more or less directly to disk, as the reservoir is large enough to hold all of the data as long as the total is less than 50GB. However, Figures 7-1 (a) and (b) show that only the multiple geo files option avoids a substantial decline in performance after the reservoir fills (at least in Experiments 1 and 2); this decline is why the scan and virtual memory options plateau after the amount of data inserted reaches 50GB. There is something of a decline in performance in all of the methods once the reservoir fills in Experiment 3 (with restricted buffer memory), but it is far less severe for the multiple geo files option than for the other options.














[Figure 7-1 appears here. Panels: (a) 50-byte records, 600MB buffer space; (b) 1KB records, 600MB buffer space; (c) 50-byte records, 150MB buffer space. Each panel plots the volume of new samples processed against the time elapsed (0 hrs to 20 hrs) for the multiple geo files, geo file, local overwrite, scan, and virtual memory options.]

Figure 7-1. Results of benchmarking experiments (Processing insertions).

[Figure 7-2 appears here. Panels: (a) Batch Sampling; (b) Batch Sampling (Multiple Geo Files); (c) Online Sampling; (d) Online Sampling (Multiple Geo Files); (e) Variance Plots; (f) Variance Plots (Multiple Geo Files). Each panel compares the naive algorithm against the geometric file structure based algorithm as a function of thousands of records sampled (a-d) or thousands of calls to GetNext (e, f).]

Figure 7-2. Results of benchmarking experiments (Sampling from a geometric file).









Furthermore, this degeneration in performance could probably be reduced by using a smaller value for α.

As expected, the local overwrite option performs very well early on, especially in the

first two experiments (see Section 3.3 for a discussion of why this is expected). Even with

limited buffer memory in Experiment 3, it uniformly outperforms a single geometric file.

Furthermore, with enough buffer memory in Experiments 1 and 2, the local overwrite option

is competitive with the multiple geo files option early on. However, fragmentation becomes a

problem and performance decreases over time. Unless offline re-randomization of the file is

possible periodically, this degradation probably precludes long-term use of the local overwrite

option.

It is interesting that, as demonstrated by Experiment 3 (and explained in Section 3.8), a single geometric file is very sensitive to the ratio of the size of the reservoir to the amount of memory available for buffering new records from the stream. The geo file option performs well in Experiments 1 and 2, when this ratio is 100 (50GB of reservoir to 500MB of new-sample buffer), but rather poorly in Experiment 3, when the ratio is 1000 (50GB to 50MB).

Finally, we point out the general unusability of the scan and virtual memory options. Scan generally outperformed virtual memory, but both did poorly. Except in Experiment 1, with its large memory and small record size, more than 97% of the processing of records from the stream under these two options occurs in the first half hour, as the reservoir fills. In the 19.5 hours or so after the reservoir first fills, only a tiny fraction of additional processing occurs, due to the inefficiency of the two options.

7.2 Biased Reservoir Sampling

In Section 4.1 we gave an upper bound for the distance between the actual bias function f' computed using our reservoir algorithm and the desired, user-defined bias function f. While useful, this bound does not tell the entire story. In the end, what the user of a biased sampling algorithm cares about is not how close the computed bias function is to the user-specified one, but rather what effect any deviation has on the particular estimation task that is to be performed.










[Figure 7-3 appears here: the observed variance of the SUM-query estimate is plotted against the correlation factor (0 to 1) for biased sampling without skewed records, unbiased reservoir sampling, and biased sampling under the worst-case ordering.]

Figure 7-3. Sum query estimation accuracy for zipf=0.2.


Perhaps the easiest way to detail the practical effect of a pathological data ordering is through experimentation.

In this section we present experimental results evaluating the practical significance of a worst-case data ordering. Specifically, we design a set of experiments to compute the error (variance) one would expect when sampling for the answer to a SUM query in the following three scenarios:

1. When a biased sample is computed using our reservoir algorithm with the data ordered so
as to produce no overweight records.

2. When an unbiased sample is computed using the classical reservoir sampling algorithm.

3. When a biased sample is computed using our reservoir algorithm, with records arranged so
as to produce the bias function furthest from the user-specified one, as described by
Theorem 1.


By examining the results, it should become clear exactly what sort of practical effect on the

accuracy of an estimator one might expect due to a pathological ordering.

7.2.1 Experimental Setup

In our experiments, we evaluated a SUM query over a set of synthetic data streams having

various statistical properties. In each experiment, every record has two attributes: A and B.










[Figure 7-4 appears here: as in Figure 7-3, the observed variance of the SUM-query estimate is plotted against the correlation factor for the three sampling schemes.]

Figure 7-4. Sum query estimation accuracy for zipf=0.5.


Attribute B is the attribute that is actually aggregated by the SUM query. Each set is generated

so that attributes A and B both have a certain amount of Zipfian skew, specified by the parameter

zipf. In each case, the bias function f is defined so as to minimize the variance for a SUM query

evaluated over attribute A.

In addition to the parameter zipf, each data set also has a second parameter which we term

the correlation factor. This is the probability that attribute A has the same value as attribute B. If

the correlation factor is 1, then A and B are identical, and since the bias function is defined so as

to minimize the variance of a query over A, the bias function also minimizes the variance of an

estimate over the actual query attribute B. Thus, a correlation factor of 1 provides for a perfect

bias function. As the correlation factor decreases, the quality of the bias function for a query over

attribute B declines, because the chance increases that a record deemed important by looking at

attribute A is, in fact, one that should not be included in the sample. This models the case where one can only guess at the correct bias function beforehand; for example, when queries with an arbitrary relational selection predicate may be issued. A small correlation factor corresponds to the case when the guessed-at bias function is actually very incorrect.
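As an illustration, the following sketch shows one way to generate such a stream; the domain size and the inverse-CDF Zipf sampler are our own assumptions, and only the roles of zipf and the correlation factor follow the setup described above.

    #include <algorithm>
    #include <cmath>
    #include <random>
    #include <utility>
    #include <vector>

    // Sketch of the synthetic stream generator: attribute B is Zipfian, and
    // with probability corr attribute A equals B; otherwise A is drawn
    // independently (so P(A == B) is only approximately corr).
    struct StreamGen {
        std::mt19937_64 rng{42};
        std::vector<double> cdf;  // CDF of a Zipf(zipf) law over 1..domain
        double corr;              // correlation factor

        StreamGen(int domain, double zipf, double corr) : corr(corr) {
            double norm = 0.0;
            for (int v = 1; v <= domain; ++v) norm += 1.0 / std::pow(v, zipf);
            double acc = 0.0;
            for (int v = 1; v <= domain; ++v) {
                acc += (1.0 / std::pow(v, zipf)) / norm;
                cdf.push_back(acc);
            }
        }

        int zipfDraw() {  // inverse-CDF sampling
            double u = std::uniform_real_distribution<double>(0.0, 1.0)(rng);
            return int(std::lower_bound(cdf.begin(), cdf.end(), u) - cdf.begin()) + 1;
        }

        std::pair<int, int> nextRecord() {  // returns (A, B)
            int b = zipfDraw();
            int a = std::bernoulli_distribution(corr)(rng) ? b : zipfDraw();
            return {a, b};
        }
    };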










[Figure 7-5 appears here: the observed variance of the SUM-query estimate is plotted against the correlation factor for the three sampling schemes.]

Figure 7-5. Sum query estimation accuracy for zipf=0.8.


By testing each of the three different scenarios described in the previous subsection over a

set of data sets created by varying zipf as well as the correlation factor, we can see the effect of

data skew and of bias function quality on the relative quality of the estimator produced by each of

the three scenarios.

For each experiment, we generate a data stream of one million records and obtain a sample

of size 1000. For each of the three scenarios and each of the data sets that we test, we repeat the

sampling process 1000 times over the same data stream in Monte-Carlo fashion. The variance

of the corresponding estimator is reported as the observed variance of the 1000 estimates. The

observed Monte-Carlo variances are depicted in Figures 7-3, 7-4, 7-5, and 7-6.
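The measurement itself is a standard Monte-Carlo variance computation; a minimal sketch is below, where estimateOnce is a hypothetical callback that re-runs the sampling process over the stream and returns one SUM estimate.

    #include <functional>
    #include <vector>

    // Observed variance of repeated estimates, as used to produce the plots:
    // run the sampling process 'trials' times over the same stream and report
    // the sample variance of the resulting estimates.
    double observedVariance(const std::function<double()>& estimateOnce,
                            int trials = 1000) {
        std::vector<double> est(trials);
        double mean = 0.0;
        for (int t = 0; t < trials; ++t) {
            est[t] = estimateOnce();  // one full sampling pass + one estimate
            mean += est[t];
        }
        mean /= trials;
        double var = 0.0;
        for (double e : est) var += (e - mean) * (e - mean);
        return var / (trials - 1);
    }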

7.2.2 Discussion

It is possible to draw a couple of conclusions based on the experimental results. Most significant is that biased sampling under the pathological record ordering shows qualitative performance similar to biased sampling without any overweight records. Even though in the pathological case the sample might not be biased exactly as specified by the user-defined function f, the number of records not sampled according to f is usually small, and the resulting estimator typically suffers from an increase in variance of around a factor of ten or less. This demonstrates that even for very skewed data sets, it is difficult even for an adversary to come up with a data ordering that can significantly alter the quality of the user-defined bias function.










[Figure 7-6 appears here: the observed variance of the SUM-query estimate is plotted against the correlation factor for the three sampling schemes.]

Figure 7-6. Sum query estimation accuracy for zipf=1.



We also observe that for a low zipf parameter and a low correlation factor, unbiased

sampling outperforms biased sampling. In other words, it is actually preferable not to bias in

this case. This is because the low zipf value assigns relatively uniform values to attribute B,

rendering an optimal biased scheme little different from uniform sampling. Furthermore, as the

correlation factor decreases, the weighting scheme used by both biased sampling schemes becomes

less accurate, hence the higher variance. As the weighting scheme becomes very inaccurate, it

is better not to bias at all. Not surprisingly, there are more cases where the biased scheme under

the pathological ordering is actually worse than the unbiased scheme. However, as the correlation

factor increases and the bias scheme becomes more accurate, it quickly becomes preferable to

bias.

7.3 Sampling From a Geometric File

We have also implemented and benchmarked the four techniques, discussed in Chapter 5, for sampling from geometric files. Specifically, we have compared the naive batch sampling and online sampling algorithms against the geometric file structure based batch sampling and online sampling algorithms. We have also tested these four techniques with the framework that makes

use of multiple geometric files. All of the algorithms were implemented on top of the geometric

file prototype that was benchmarked in the previous sections.

7.3.1 Experiments Performed

To compare the various options, we used the following setup. We first initialize a geometric file by sampling and adding records from a synthesized data stream to the file for a period of several hours. This ensures a realistic scenario for testing: the reservoir in the file to be tested has been filled, a reasonable portion of each initial subsample has been overwritten, some of the smaller initial subsamples have been removed from the file, and a number of new subsamples have been created. The parameters used in building the geometric file are the same as those described in Experiment 2 of the previous section (a 50GB file with 50 million, 1KB records). Given such a file, the following set of experiments were performed:

Sampling experiment 1: The goal of this experiment was to compare the two options for

obtaining a batch sample from a geometric file: the naive algorithm and the geometric file

structure based algorithm. For both algorithms, we plot the time to perform the sampling as a

function of the desired sample size. Figure 7-2 (a) depicts the plot for a single geometric file;

Figure 7-2 (b) shows an analogous plot for the multiple geometric files option.

Sampling experiment 2: This experiment is analogous to Sampling Experiment 1, except that

online sampling is performed via multiple successive calls to GetNext. The number of records

sampled with multiple calls to GetNext versus the elapsed time is plotted in Figure 7-2 (c)

for both the naive algorithm and the more advanced, geometric file structure based algorithm

designed to increase the sampling rate and even out the response times. The analogous plot for the multiple geometric files case is shown in Figure 7-2 (d). We also plot the variance in response times over all calls to GetNext as a function of the number of calls to GetNext in Figures 7-2 (e) and 7-2 (f) (the first is for a single geometric file; the second is with multiple files). Taken

together, these plots show the trade-off between overall processing time and the potential for

waiting for a long time in order to obtain a single sample.









7.3.2 Discussion of Experimental Results

Not surprisingly, these results suggest that the geometric file structure based sampling methods are superior to the more obvious naive algorithms, in both the batch and online cases. As expected, the naive batch sampling algorithm took almost constant time to obtain a batch sample of any size, since it requires a scan of the entire geometric file to retrieve any batch sample. The geometric file structure based algorithm can produce a small batch sample very quickly, and its total sampling time increases linearly with sample size. The time required for the geometric file structure based algorithm is well below the time required by the naive approach even when 1/10 of the file is sampled. In the case of online sampling, the geometric file structure based algorithm clearly outperformed the naive approach; this is not surprising, since the naive approach must expend one disk seek per sample. For both batch and online sampling, the multiple geometric files framework showed results analogous to the single geometric file case.

As expected, and as demonstrated by the variance plots, the variance of the online naive approach is smaller than that of the geometric file structure based algorithm. However, although the response times of the structure based approach have a somewhat larger variance (less than 10 times larger for 100k samples), the approach executed an order of magnitude faster (more than 100 times faster for 100k samples) than the naive approach for any number of records sampled, justifying our approach of minimizing the average sum of the squared response times. In other words, we got enough added speed for a small enough added variance in response time to make the trade-off acceptable. As more and more samples are obtained, the variance of the structure based algorithm approaches the variance of the naive algorithm, making the trade-off even more reasonable for large intended sample sizes.

Finally, we point out that both of the geometric file structure based algorithms, in the batch and online cases, were able to read sample records from disk almost at the maximum sustained speed of the hard disk, at around 45 MB/sec. This is comparable to the rate of a sequential read from disk, the best we can hope for.

















[Figure 7-7 appears here: the disk footprint of each of the three index structures is plotted against the time elapsed in hours (0 to 12).]

Figure 7-7. Disk footprint for 1KB record size

Table 7-1. Millions of records inserted in 10 hrs

                                    No Index   Subsample-Based   Segment-Based   LSM-Tree
1KB records; |R| = 10 million;
|B| = 50k                           13700      12550             10960           9680
200B records; |R| = 50 million;
|B| = 250k                          12810      7230              8030            2930



7.4 Index Structures For The Geometric File

In Chapter 6 we introduced three index structures for the geometric file: the segment-based, the subsample-based, and the LSM-tree-based index structures. In this section, we experimentally evaluate and compare these three index structures by measuring build time and disk footprint as new records are inserted into the geometric file. We also compare the efficiency of these structures for point and range queries. All of the index structures were implemented on top of the geometric file prototype that was benchmarked in the previous sections.



7.4.1 Experiments Performed

In order to compare the three index structures, we used the following setup. We initialize a geometric file by sampling and adding records from a synthesized data stream to the file for a period of ten hours. As the geometric file is initialized, we build the index structures as discussed in Chapter 6. The ten hours of insertion into the geometric file ensures that a reasonable number of insertions and deletions are performed on each index structure. Given such a file, we collected the following three pieces of information for each of the three index structures under consideration.

Build time: With concurrent updates to the index structure, we record how many records can be inserted into the geometric file by the end of the insertion window (10 hrs). This allows us to compare the build times of the three index structures.

Disk footprint: We observe the disk footprint of the three proposed index structures by recording the total disk space used by each structure every time a buffer's worth of index records is bulk inserted into the index file.

Index look-up time: After the ten hours of insertion, once the index structures are built for the geometric file, we query them to look up records with a specific key or range of key values. For each index structure, a point query and range queries with different selectivities are executed. The point query returns exactly one record as output. The range queries are designed to return approximately 10, 100, or 1000 records as the output set. For each selectivity we execute the query 100 times and report the average look-up time. Further, processing a query involves an index look-up followed by one or more seeks in the geometric file to access the output records. We therefore report the index look-up time, the geometric file access time, and the total query processing time.
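The timing harness for this methodology is straightforward; the sketch below is an assumption-laden outline (runIndexLookup and fetchFromFile are hypothetical hooks into the prototype, with the first returning the hit list consumed by the second), showing how the reported index and file times are split.

    #include <chrono>
    #include <utility>

    // Average per-query times over a number of repetitions, split into the
    // index look-up phase and the geometric file access phase.
    template <class IndexFn, class FileFn>
    std::pair<double, double> averageQueryTime(IndexFn runIndexLookup,
                                               FileFn fetchFromFile,
                                               int repetitions = 100) {
        using clk = std::chrono::steady_clock;
        std::chrono::duration<double> indexTime{0}, fileTime{0};
        for (int i = 0; i < repetitions; ++i) {
            auto t0 = clk::now();
            auto hits = runIndexLookup();  // index look-up
            auto t1 = clk::now();
            fetchFromFile(hits);           // seeks into the geometric file
            auto t2 = clk::now();
            indexTime += t1 - t0;
            fileTime  += t2 - t1;
        }
        return {indexTime.count() / repetitions, fileTime.count() / repetitions};
    }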

With these metrics in mind, we performed the following two sets of experiments:

Indexing experiment 1: The task in this experiment was to maintain a 10GB reservoir

holding a sample of 10 million, 1KB records from a synthetic data stream. Each of the three

alternatives was allowed 500MB of buffer memory to work with when maintaining the reservoir.

The results for index build time are shown in Table 7-1, the disk space used by the three index structures is plotted in Figure 7-7, and the index look-up speeds are tabulated in Table 7-2.

Indexing experiment 2: This experiment is identical to Experiment 1, except that the 10GB sample was composed of 50 million, 200B records.








[Figure 7-8 appears here: the disk footprints of the subsample-based, segment-based, and LSM-tree-based index structures are plotted against the time elapsed in hours (0 to 12).]

Figure 7-8. Disk footprint for 200B record size


For this experiment, the results for index build time are shown in Table 7-1, the disk space used by the three index structures is plotted in Figure 7-8, and the index look-up speeds are tabulated in Table 7-3. Thus, we test the effect of record size on the three index structures.



7.4.2 Discussion

It is possible to draw a few conclusions based on the experimental results. The subsample-based index structure shows the best build time, the segment-based index structure has the most compact disk footprint, and the LSM-tree-based index structure has the best index look-up performance.

Table 7-1 shows the millions of records inserted into the geometric file after ten hours of insertions and concurrent updates to the index structure. For comparison, we present the number of records inserted into a geometric file when no index structure is maintained (the "no index" column). It is clear that the subsample-based index structure performs the best on insertions, with performance comparable to the "no index" option; the difference reflects the cost of concurrently maintaining the index structure. The segment-based index structure does the next best. It is slower than the subsample-based index structure because of the higher number of seeks performed during start-up. Recall that during start-up the segment-based index must write a B+-tree for each segment.









Table 7-2. Query timing results for 1KB records, |R| = 10 million, and |B| = 50k

Scheme            Selectivity     Index Time   File Time   Total Time
Segment-Based     Point Query     38.2890      0.0226      38.3116
                  ~10 recs        40.2477      0.1803      40.2480
                  ~100 recs       43.2856      0.8766      44.1622
                  ~1000 recs      45.6276      6.2571      51.8847
Subsample-Based   Point Query     0.87551      0.02382     0.89937
                  ~10 recs        1.12740      0.15867     1.28607
                  ~100 recs       1.74911      1.10544     2.85455
                  ~1000 recs      2.09980      5.96637     8.06617
LSM-Tree          Point Query     0.00012      0.01996     0.02008
                  ~10 recs        0.00015      0.01263     0.01278
                  ~100 recs       0.00019      0.79358     0.79377
                  ~1000 recs      0.00056      5.82210     5.82266


Once the reservoir is initialized, both the segment-based and the subsample-based index structures perform an equal number of disk seeks. Finally, the LSM-tree-based index structure is the slowest amongst the three. The LSM-tree maintains the index by processing insertions and deletions more aggressively than the other two options, demanding more rolling merges and more disk seeks per buffer flush.

Table 7-1 also shows the insertion figures for the smaller, 200B record size. Not surprisingly, all three index structures show similar insertion patterns, but since they have to process a larger number of records, the insertion rates are slower than in the case of the 1KB record size. We also observed and plotted the disk footprint of the three index structures (Figure 7-7 and Figure 7-8). As expected, all three index structures initially grow fairly quickly. The segment-based and the subsample-based index structures stabilize soon after the reservoir is filled, whereas the LSM-tree-based structure stabilizes a little later, when the removal of stale records by the rolling merges stabilizes.



The subsample-based index structure has the largest footprint (almost 1/5th of the geometric file size). This is expected, as stale index records are removed from the B+-trees only when the entire subsample decays.








Table 7-3. Query timing results for 200B records, |R| = 50 million, and |B| = 250k

Scheme            Selectivity     Index Time   File Time   Total Time
Segment-Based     Point Query     6.2488       0.0338      6.2826
                  ~10 recs        9.6186       0.1267      9.7453
                  ~100 recs       12.9885      0.9288      13.9173
                  ~1000 recs      17.6891      5.9754      23.6645
Subsample-Based   Point Query     2.50717      0.0156      2.5227
                  ~10 recs        4.92744      0.1763      5.1037
                  ~100 recs       7.2387       0.8637      8.1024
                  ~1000 recs      9.9837       6.1363      16.1200
LSM-Tree          Point Query     0.00505      0.0174      0.0224
                  ~10 recs        0.00967      0.1565      0.1661
                  ~100 recs       0.01440      0.8343      0.8487
                  ~1000 recs      0.05987      4.9961      5.0559

On the other hand, the segment-based index structure has the smallest footprint, since at every buffer flush all stale records are removed from the index structure. This results in a very compact index structure. The disk space usage of the LSM-tree-based index structure lies between these two. Although at every rolling merge stale records are removed from the part of the index structure that is merging, not all of the stale records in the structure are removed at once. As soon as the rate of removal of stale records stabilizes, the disk footprint also becomes stable.

Finally, we compared the index look-up speed of the three index structures. We report index look-up and geometric file access times for queries of different selectivities. As expected, the geometric file access time remains constant irrespective of the index structure option and increases linearly as the query produces more output tuples. The index look-up time varied for the three index structures. The segment-based index structure (the slowest) was orders of magnitude slower than the LSM-tree-based index structure (the fastest). This is mainly because the segment-based index structure requires index look-ups in several thousand B+-trees for a query of any selectivity, whereas the LSM-tree-based structure uses a single LSM-tree, requiring a small, constant number of seeks. The performance of the subsample-based index structure lies in











between these two structures. This is expected as the structure maintains many fewer B+-Trees

than the segment-based index but far more than the LSM-Tree-based structure.

In general, the subsample-based index structure gives the best build time with reasonable index look-up speed, at the cost of a slightly larger disk footprint. The LSM-tree-based index structure makes use of a reasonable amount of disk space and gives the best query performance, at the cost of a slower insertion rate (build time). The segment-based index structure gives a comparable build time and has the most compact disk footprint, but suffers considerably when it comes to index look-ups.









CHAPTER 8
CONCLUSION

Random sampling is a ubiquitous data management tool, but relatively little research from the data management community has been concerned with how to actually compute and maintain a sample. In this dissertation we have considered the problem of random sampling from a data stream, where the sample to be maintained is very large and must reside on secondary storage. We have developed the geometric file organization, which can be used to maintain an online sample of arbitrary size with an amortized cost of O(ω × log|B| / |B|) random disk head movements for each newly sampled record. The multiplier ω can be made very small by making use of a small amount of additional disk space.

We have presented a modified version of the classic reservoir sampling algorithm that

is exceedingly simple, and is applicable for biased sampling using any arbitrary user-defined

weighting function f. Our algorithm computes, in a single pass, a biased sample Ri (without

replacement) of the i records produced by a data stream.

We have also discussed certain pathological cases where our algorithm can provide a

correctly biased sample for a slightly modified bias function f'. We have analytically bound

how far f' can be from f in such a pathological case. We have also experimentally evaluated the

practical significance of this difference.

We have also derived the variance of a Horvitz-Thompson estimator making use of a sample computed using our algorithm. Combined with the Central Limit Theorem, the variance can then be used to provide bounds on the estimator's accuracy. The estimator is suitable for the SUM aggregate function (and, by extension, the AVERAGE and COUNT aggregates) over a single database table for which the reservoir is maintained.
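For reference, a standard textbook statement of the Horvitz-Thompson estimator and its variance (see, e.g., [16, 50]) is, in LaTeX notation,

    \hat{T} = \sum_{r_i \in R} \frac{y_i}{\pi_i}, \qquad
    \operatorname{Var}(\hat{T}) = \sum_{i}\sum_{j}
        (\pi_{ij} - \pi_i \pi_j)\,\frac{y_i}{\pi_i}\,\frac{y_j}{\pi_j},

where y_i is the aggregated attribute value of record r_i, \pi_i is its inclusion probability (induced by the bias function), and \pi_{ij} is the joint inclusion probability of r_i and r_j, with \pi_{ii} = \pi_i. The derivation in Chapter 4 supplies the \pi_i and \pi_{ij} induced by our biased reservoir algorithm, which is what makes this variance computable.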

We have developed efficient techniques which allow a geometric file to itself be sampled in order to produce smaller data objects. We considered two sampling techniques: (1) batch sampling, where the sample size is known beforehand, and (2) online sampling, which implements an iterative function GetNext to retrieve one sample at a time. The goal of these algorithms is to efficiently support further sampling of a geometric file by making use of its own structure.









Efficiently searching and discovering information from the geometric file is essential for query processing. A natural way to support this functionality is to build an index structure. We discussed three secondary index structures and their maintenance as new records are inserted into a geometric file. The segment-based and the subsample-based index structures are designed around the structure of the geometric file. The third, the LSM-tree-based index structure, makes use of the LSM-tree, an efficient structure for handling bulk insertions and deletions. We compared these structures for build time, disk space used, and index look-up time.









REFERENCES


[1] Das, A., Gehrke, J., Riedewald, M.: Approximate join processing over data streams. In: ACM
SIGMOD International Conference on Management of Data (2003)

[2] Acharya, S., Gibbons, P., Poosala, V.: Congressional samples for approximate answering of
group-by queries. In: ACM SIGMOD International Conference on Management of Data
(2000)

[3] Acharya, S., Gibbons, P., Poosala, V., Ramaswamy, S.: Join synopses for approximate query
answering. In: ACM SIGMOD International Conference on Management of Data (1999)

[4] Acharya, S., P.B. Gibbons, V.P., Ramaswamy, S.: The aqua approximate query answering
system. In: ACM SIGMOD International Conference on Management of Data (1999)

[5] Aggarwal, C.C.: On biased reservoir sampling in the presence of stream evolution. In:
VLDB'2006: Proceedings of the 32nd international conference on Very large data bases, pp.
607-618. VLDB Endowment (2006)

[6] Arge, L.: The buffer tree: A new technique for optimal i/o-algorithms. In: International
Workshop on Algorithms and Data Structures (1995)

[7] Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming
data. In: SODA'02: Proceedings of the thirteenth annual ACM-SIAM symposium on
Discrete algorithms, pp. 633-634. Society for Industrial and Applied Mathematics (2002)

[8] Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query
processing. In: ACM SIGMOD International Conference on Management of Data (2003)

[9] Bayer, R., McCreight, E.M.: Organization and maintenance of large ordered indexes. In:
SIGFIDET Workshop, pp. 107-141 (1970)

[10] Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: ICDE '06:
Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), p. 6.
IEEE Computer Society, Washington, DC, USA (2006)

[11] Fan, C.T., Muller, M.E., Rezucha, I.: Development of sampling plans by using sequential
(item by item) techniques and digital computers. In: Journal of the American Statistical
Association, pp. 57: 387-402 (1962)

[12] Jermaine, C., Datta, A., Omiecinski, E.: A novel index supporting high volume data
warehouse insertion. In: International Conference on Very Large Data Bases (1999)

[13] Jermaine, C., Omiecinski, E., Yee, W.G.: The partitioned exponential file for database
storage management. In: International Conference on Very Large Data Bases (1999)

[14] Chaudhuri, S., Das, G., Datar, M., Motwani, R., Narasayya, V.: Overcoming limitations of
sampling for aggregation queries. In: ICDE (2001)









[15] Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach for approx-
imate answering of aggregate queries. In: ACM SIGMOD International Conference on
Management of Data (2001)

[16] Cochran, W.: Sampling Techniques. Wiley and Sons (1977)

[17] Council, T.P.: TPC-H Benchmark. http://www.tpc.org (2004)

[18] Cranor, C., Gao, Y., Johnson, T., Shkapenyuk, V., Spatscheck, O.: Gigascope: High-
performance network monitoring with an SQL interface. In: ACM SIGMOD International
Conference on Management of Data (2002)

[19] Cranor, C., Johnson, T., Spatscheck, O., Shkapenyuk, V.: Gigascope: A stream database for
network applications. In: ACM SIGMOD International Conference on Management of Data
(2003)

[20] Cranor, C., Johnson, T., Spatscheck, O., Shkapenyuk, V.: The gigascope stream database.
In: IEEE Data Engineering Bulletin, pp. 26(1): 27-32 (2003)

[21] Dobra, A., Garofalakis, M., Gehrke, J., Rastogi, R.: Processing complex aggregate queries
over data streams. In: ACM SIGMOD International Conference on Management of Data
(2002)

[22] Duffield, N., Lund, C., Thorup, M.: Charging from sampled network usage. In: IMW '01:
Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement, pp. 245-256.
ACM Press, New York, NY, USA (2001)

[23] Estan, C., Naughton, J.F.: End-biased samples for join cardinality estimation. In: ICDE '06:
Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), p. 20.
IEEE Computer Society, Washington, DC, USA (2006)

[24] Estan, C., Varghese, G.: New directions in traffic measurement and accounting: Focusing on
the elephants, ignoring the mice. ACM Trans. Comput. Syst. 21(3), 270-313 (2003)

[25] Olken, F., Rotem, D.: Random sampling from B+ trees. In: International Conference on Very
Large Data Bases (1989)

[26] Olken, F., Rotem, D.: Random sampling from database files: A survey. In: International
Working Conference on Scientific and Statistical Database Management (1990)

[27] Olken, F., Rotem, D., Xu, P.: Random sampling from hash files. In: ACM SIGMOD
International Conference on Management of Data (1990)

[28] Ganguly, S., Gibbons, P., Matias, Y, Silberschatz, A.: Bifocal sampling for skew-resistant
join size estimation. In: ACM SIGMOD International Conference on Management of Data
(1996)









[29] Gemulla, R., Lehner, W., Haas, P.J.: A dip in the reservoir: maintaining sample synopses of
evolving datasets. In: VLDB'2006: Proceedings of the 32nd international conference on
Very large data bases, pp. 595-606. VLDB Endowment (2006)

[30] Gunopulos, D., Kollios, G., Tsotras, V., Domeniconi, C.: Approximating multi-dimensional
aggregate range queries over real attributes. In: ACM SIGMOD International Conference
on Management of Data (2000)

[31] Haas, P.: The need for speed: Speeding up db2 using sampling. In: IDUG Solutions Journal
(2003)

[32] Haas, P.J., Hellerstein, J.M.: Ripple joins for Online Aggregation. In: ACM SIGMOD
International Conference on Management of Data, pp. 287-298 (1999)

[33] Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online Aggregation. In: ACM SIGMOD
International Conference on Management of Data, pp. 171-182 (1997)

[34] Gehrke, J., Korn, F., Srivastava, D.: On computing correlated aggregates over continual data
streams. In: ACM SIGMOD International Conference on Management of Data (2001)

[35] Jermaine, C.: Robust estimation with sampling and approximate pre-aggregation. In:
International Conference on Very Large Data Bases (2003)

[36] Jermaine, C., Pol, A., Arumugam, S.: Online maintenance of very large random samples.
In: ACM SIGMOD International Conference on Management of Data, pp. 299-310 (2004)

[37] Hellerstein, J.M., Avnur, R., Raman, V.: Informix under CONTROL: Online query
processing. In: Data Mining and Knowledge Discovery, pp. 4(4): 281-314 (2000)

[38] Jones, T.G.: A note on sampling from a tape file. In: Communications of the ACM, p. 5: 343
(1962)

[39] Vitter, J.S., Wang, M.: Approximate computation of multidimensional aggregates of sparse
data using wavelets. In: ACM SIGMOD International Conference on Management of Data
(1999)

[40] Kolonko, M., Wasch, D.: Sequential reservoir sampling with a nonuniform distribution.
ACM Trans. Math. Softw. 32(2), 257-273 (2006). DOI http://doi.acm.org/10.1145/
1141885.1141891

[41] Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: VLDB
Conference (2002)

[42] Johnson, N.L., Kotz, S.: Discrete Distributions. Houghton Mifflin (1969)

[43] Olken, F.: Random Sampling from Databases. In: Ph.D. Dissertation (1993)

[44] O'Neil, P., Cheng, E., Gawlick, D., O'Neil, E.: The log-structured merge-tree. In: Acta
Informatica, pp. 33: 351-385 (1996)









[45] Ousterhout, J.K., Douglis, F.: Beating the i/o bottleneck: A case for log-structured file
systems. Operating Systems Review 23(1), 11-28 (1989)

[46] Gibbons, P.B., Matias, Y., Poosala, V.: Fast incremental maintenance of approximate
histograms. In: ACM Transactions on Database Systems, pp. 27(3): 261-298 (2002)

[47] Pol, A., Jermaine, C.: Biased reservoir sampling. IEEE Transactions on Knowledge and
Data Engineering

[48] Pol, A., Jermaine, C., Arumugam, S.: Maintaining very large random samples using the
geometric file. VLDBJ (2007)

[49] Shao, J.: Mathematical Statistics. Springer-Verlag (1999)

[50] Thompson, M.E.: Theory of Sample Surveys. Chapman and Hall (1997)

[51] Toivonen, H.: Sampling large databases for association rules. In: International Conference
on Very Large Data Bases (1996)

[52] Ganti, V., Lee, M.-L., Ramakrishnan, R.: Icicles: Self-tuning samples for approximate query
answering. In: International Conference on Very Large Data Bases (2000)

[53] Vitter, J.: Random sampling with a reservoir. In: ACM Transactions on Mathematical
Software (1985)

[54] Vitter, J.: An efficient algorithm for sequential random sampling. In: ACM Transactions on
Mathematical Software, pp. 13(1): 58-67 (1987)









BIOGRAPHICAL SKETCH

Abhijit Pol was born and brought up in the state of Maharashtra in India. He received his Bachelor of Engineering from Government College of Engineering, Pune (COEP), University of Pune, one of the oldest and most prestigious engineering colleges in India, in 1999. Abhijit majored in mechanical engineering and obtained a distinguished record, ranking second in the university merit ranking. He was then employed in the Research and Development department of Kirloskar Oil Engines Ltd. for one year.

Abhijit received his first Master of Science degree from the University of Florida in 2002, majoring in industrial and systems engineering. Abhijit then worked as a researcher in the Department of Computer and Information Science and Engineering at the University of Florida. He received his second Master of Science degree and his Doctor of Philosophy (Ph.D.) in computer engineering in 2007.

During his studies at the University of Florida, Abhijit coauthored a textbook titled "Developing Web-Enabled Decision Support Systems." He taught the Web-DSS course several times in the Department of Industrial and Systems Engineering at the University of Florida. He presented several tutorials at workshops and conferences on the need for and the importance of teaching DSS material, and he also taught at two instructor-training workshops on DSS development.

Abhijit's research focus is in the area of databases, with special interests in approximate

query processing, physical database design, and data streams. He has presented research papers

at several prestigious database conferences and performed research at the Microsoft Research

Lab. He is now a Senior Software Engineer in the Strategic Data Solutions group at Yahoo! Inc.





PAGE 1

1

PAGE 2

2

PAGE 3

3

PAGE 4

AttheendofmydissertationIwouldliketothankallthosepeoplewhomadethisdisserta-tionpossibleandanenjoyableexperienceforme.FirstofallIwishtoexpressmysinceregratitudetomyadviserChrisJermaineforhispatientguidance,encouragement,andexcellentadvicethroughoutthisstudy.IfIwouldhaveaccesstomagictoolcreate-your-own-adviser,IstillwouldnothaveendedupwithanyonebetterthanChris.Healwaysintroducesmetointerestingresearchproblems.HeisaroundwheneverIhaveaquestion,butatthesametimeencouragesmetothinkonmyownandworkonanyproblemsthatinterestme.IamalsoindebtedtoAlinDobraforhissupportandencouragement.Alinisaconstantsourceofenthusiasm.TheonlytopicIhavenotdiscussedwithhimisstrategiesofGatorfootballgames.IamgratefultomydissertationcommitteemembersTamerKahveci,JoachimHammer,andRavindraAhujafortheirsupportandtheirencouragement.IacknowledgetheDepartmentofIndustrialandSystemsEngineering,RavindraAhuja,andchairDonaldHearnforthenancialsupportandadviceIreceivedduringinitialyearsofmystudies.Finally,Iwouldliketoexpressmydeepestgratitudefortheconstantsupport,understanding,andlovethatIreceivedfrommyparentsduringthepastyears. 4

PAGE 5

page ACKNOWLEDGMENTS .................................... 4 LISTOFTABLES ....................................... 8 LISTOFFIGURES ....................................... 9 ABSTRACT ........................................... 10 CHAPTER 1INTRODUCTION .................................... 12 1.1TheGeometricFile ................................. 14 1.2BiasedReservoirSampling ............................. 16 1.3SamplingTheSample ................................ 18 1.4IndexStructuresForTheGeometricFile ...................... 19 2RELATEDWORK .................................... 22 2.1RelatedWorkonReservoirSampling ........................ 22 2.2BiasedSamplingRelatedWork ........................... 24 3THEGEOMETRICFILE ................................. 28 3.1ReservoirSampling ................................. 28 3.2Sampling:SometimesaLittleisnotEnough .................... 30 3.3ReservoirforVeryLargeSamples ......................... 31 3.4TheGeometricFile ................................. 34 3.5CharacterizingSubsampleDecay .......................... 36 3.6GeometricFileOrganization ............................ 40 3.7ReservoirSamplingWithaGeometricFile ..................... 40 3.7.1IntroducingtheRequiredRandomness ................... 41 3.7.2HandlingtheVariance ............................ 42 3.7.3BoundingtheVariance ........................... 45 3.8ChoosingParameterValues ............................. 47 3.8.1ChoosingaValueforAlpha ......................... 47 3.8.2ChoosingaValueforBeta .......................... 48 3.9WhyReservoirSamplingwithaGeometricFileisCorrect? ............ 49 3.9.1CorrectnessoftheReservoirSamplingAlgorithmwithaBuffer ...... 49 3.9.2CorrectnessoftheReservoirSamplingAlgorithmwithaGeometricFile 50 3.10MultipleGeometricFiles .............................. 51 3.11ReservoirSamplingwithMultipleGeometricFiles ................ 51 3.11.1ConsolidationAndMerging ......................... 53 3.11.2HowCanCorrectnessBeMaintained? ................... 53 3.11.3HandlingtheStacksinMultipleGeometricFiles .............. 56 5

PAGE 6

................................. 56 4BIASEDRESERVOIRSAMPLING ........................... 58 4.1ASingle-PassBiasedSamplingAlgorithm ..................... 59 4.1.1BiasedReservoirSampling ......................... 59 4.1.2So,WhatCanGoWrong?(AndaSimpleSolution) ............ 60 4.1.3AdjustingWeightsofExistingSamples ................... 62 4.2WorstCaseAnalysisforBiasedReservoirSamplingAlgorithm .......... 65 4.2.1TheProoffortheWorstCase ........................ 66 4.2.2TheProofofTheorem 1 :TheUpperBoundontotalDist 73 4.3BiasedReservoirSamplingWithTheGeometricFile ............... 75 4.4EstimationUsingaBiasedReservoir ........................ 76 5SAMPLINGTHEGEOMETRICFILE ......................... 80 5.1WhyMightWeNeedToSampleFromaGeometricFile? ............. 80 5.2DifferentSamplingPlansfortheGeometricFile .................. 80 5.3BatchSamplingFromaGeometricFile ...................... 81 5.3.1ANaiveAlgorithm ............................. 81 5.3.2AGeometricFileStructure-BasedAlgorithm ............... 82 5.3.3BatchSamplingMultipleGeometricFiles ................. 84 5.4OnlineSamplingFromaGeometricFile ...................... 84 5.4.1ANaiveAlgorithm ............................. 84 5.4.2AGeometricFileStructure-BasedAlgorithm ............... 85 5.5SamplingABiasedSample ............................. 88 6INDEXSTRUCTURESFORTHEGEOMETRICFILE ................ 89 6.1WhyIndexaGeometricFile? ............................ 89 6.2DifferentIndexStructuresfortheGeometricFile ................. 90 6.3ASegment-BasedIndexStructure ......................... 91 6.3.1IndexConstructionDuringStart-up ..................... 91 6.3.2MaintainingIndexDuringNormalOperation ................ 92 6.3.3IndexLook-UpandSearch ......................... 93 6.4ASubsample-BasedIndexStructure ........................ 93 6.4.1IndexConstructionandMaintenance .................... 94 6.4.2IndexLook-Up ............................... 95 6.5ALSM-Tree-BasedIndexStructure ........................ 96 6.5.1AnLSM-TreeIndex ............................. 96 6.5.2IndexMaintenanceandLook-Ups ..................... 97 7BENCHMARKING .................................... 99 7.1ProcessingInsertions ................................ 99 7.1.1ExperimentsPerformed ........................... 99 7.1.2DiscussionofExperimentalResults ..................... 100 6

PAGE 7

............................. 103 7.2.1ExperimentalSetup ............................. 104 7.2.2Discussion .................................. 106 7.3SamplingFromaGeometricFile .......................... 107 7.3.1ExperimentsPerformed ........................... 108 7.3.2DiscussionofExperimentalResults ..................... 109 7.4IndexStructuresForTheGeometricFile ...................... 110 7.4.1ExperimentsPerformed ........................... 110 7.4.2Discussion .................................. 112 8CONCLUSION ...................................... 116 REFERENCES ......................................... 118 BIOGRAPHICALSKETCH .................................. 122 7

PAGE 8

Table page 1-1Population:studentrecords ................................ 17 1-2Randomsampleofthesize=4 ............................... 17 1-3Biasedsampleofthesize=4 ................................ 17 7-1Millionsofrecordsinsertedin10hrs ........................... 110 7-2Querytimingresultsfor1krecord,jRj=10million,andjBj=50k 113 7-3Querytimingresultsfor200bytesrecord,jRj=50million,andjBj=250k 114 8

PAGE 9

Figure page 3-1Decayofasubsampleaftermultiplebufferushes. ................... 38 3-2Basicstructureofthegeometricle. ........................... 39 3-3Buildingageometricle. ................................. 43 3-4Distributingnewrecordstoexistingsubsamples. ..................... 44 3-5Speedinguptheprocessingofnewsamplesusingmultiplegeometricles. ....... 54 4-1Adjustmentofrmaxitormaxi1 69 7-1Resultsofbenchmarkingexperiments(Processinginsertions). ............. 101 7-2Resultsofbenchmarkingexperiments(Samplingfromageometricle). ........ 102 7-3Sumqueryestimationaccuracyforzipf=0.2. ....................... 104 7-4Sumqueryestimationaccuracyforzipf=0.5. ....................... 105 7-5Sumqueryestimationaccuracyforzipf=0.8. ....................... 106 7-6Sumqueryestimationaccuracyforzipf=1. ........................ 107 7-7Diskfootprintfor1KBrecordsize ............................ 110 7-8Diskfootprintfor200Brecordsize ............................ 112 9

PAGE 10

Samplingisoneofthemostfundamentaldatamanagementtoolsavailable.Itisoneofthemostpowerfulmethodsforbuildingaone-passsynopsisofadataset,especiallyinastreamingenvironmentwheretheassumptionisthatthereistoomuchdatatostoreallofitpermanently.However,mostcurrentresearchinvolvingsamplingconsiderstheproblemofhowtouseasample,andnothowtocomputeone.Theimplicitassumptionisthatasampleisasmalldatastructurethatiseasilymaintainedasnewdataareencountered,eventhoughsimplestatisticalargumentsdemonstratethatverylargesamplesofgigabytesorterabytesinsizecanbenecessarytoprovidehighaccuracy.Noexistingworktacklestheproblemofmaintainingverylarge,disk-basedsamplesinanonlinemannerfromstreamingdata. Wepresentanewdataorganizationcalledthegeometricleandonlinealgorithmsformain-tainingaverylarge,on-disksamples.Thealgorithmsaredesignedforanyenvironmentwherealargesamplemustbemaintainedonlineinasinglepassthroughadataset.Thegeometricleorganizationmeetsthestrictrequirementthatthesamplealwaysbeatrue,statisticallyrandomsample(withoutreplacement)ofallofthedataprocessedthusfar. Wemodifytheclassicreservoirsamplingalgorithmtocomputeaxed-sizesampleinasinglepassoveradataset,wherethegoalistobiasthesampleusinganarbitrary,user-denedweightingfunction.Wealsodescribehowthegeometriclecanbeusedtoperformabiasedreservoirsampling. 10

PAGE 11

Efcientlysearchinganddiscoveringinformationfromthegeometricleisessentialforqueryprocessing.Anaturalwaytosupportthisistobuildanindexstructure.Wediscussthreesecondaryindexstructuresandtheirmaintenanceasnewrecordsareinsertedtoageometricle. 11

PAGE 12

Despitethevarietyofalternativesforapproximatequeryprocessing[ 1 21 30 34 39 ],samplingisstilloneofthemostpowerfulmethodsforbuildingaone-passsynopsisofadataset,especiallyinastreamingenvironmentwheretheassumptionisthatthereistoomuchdatatostoreallofitpermanently.Sampling'smanybenetsinclude: 16 49 ]). 2 3 8 14 15 28 32 33 35 46 51 52 ]thatusesamplestestifytosampling'spopularityasadatamanagementtool. Giventheobviousimportanceofrandomsampling,itisperhapssurprisingthattherehasbeenverylittleworkinthedatamanagementcommunityonhowtoactuallyperformrandomsampling.Themostwell-knownpapersinthisareaareduetoOlkenandRotem[ 25 27 ],whoalsoofferthedenitivesurveyofrelatedworkthroughtheearly1990s[ 26 ].However,thisworkisrelevantmostlyforsamplingfromdatastoredinadatabase,andimplicitlyassumesthatasampleisasmalldatastructurethatiseasilystoredinmainmemory. Suchassumptionsaresometimesoverlyrestrictive.Considertheproblemofapproximatequeryprocessing.Recentworkhassuggestedthepossibilityofmaintainingasampleofalargedatabaseandthenexecutinganalyticqueriesoverthesampleratherthantheoriginaldataasawaytospeedupprocessing[ 4 31 ].GiventhemostrecentTPC-Hbenchmarkresults[ 17 ],itisclearthatprocessingstandardreport-stylequeriesoveralarge,multi-terabytedatawarehousemaytakehoursordays.Insuchasituation,maintainingafullymaterializedrandomsample 12

PAGE 13

43 ])maybedesirable.Inordertosavetimeand/orcomputerresources,queriescanthenbeevaluatedoverthesampleratherthantheoriginaldata,aslongastheusercantoleratesomecarefullycontrolledinaccuracyinthequeryresults. Thisparticularapplicationhastwospecicrequirementsthatareaddressedbythedis-sertation.First,itmaybenecessarytousequitealargesampleinordertoachieveacceptableaccuracy;perhapsontheorderofgigabytesinsize.Thisisespeciallytrueifthesamplewillbeusedtoanswerselectivequeriesoraggregatesoverattributeswithhighvariance(seeSec-tion 3.2 ).Second,whatevertherequiredsamplesize,itisoftenindependentofthesizeofthedatabase,sinceestimationaccuracydependsprimarilyonsamplesize Foranotherexampleofacasewhereexistingsamplingmethodscanfallshort,considerstream-baseddatamanagementtasks,suchasnetworkmonitoring(foranexampleofsuchanapplication,wepointtotheGigascopeprojectfromAT&TLaboratories[ 18 20 ]).Giventhetremendousamountofdatatransportedovertoday'scomputernetworks,theonlyconceivablewaytofacilitatead-hoc,after-the-factqueryprocessingoverthesetofpacketsthathavepassedthroughanetworkrouteristobuildsomesortofstatisticalmodelforthosepackets.Themostobviouschoicewouldbetoproduceaverylarge,statisticallyrandomsampleofthepacketsthathavepassedthroughtherouter.Again,maintainingsuchasampleispreciselytheproblemwetackleinthisdissertation.Whileotherresearchershavetackledtheproblemofmaintainingan 16 ]forathoroughtreatmentofnitepopulationrandomsampling). 13

PAGE 14

7 ],noexistingmethodshaveconsideredhowtohandleverylargesamplesthatexceedtheavailablemainmemory. Inthisdissertationwedescribeanewdataorganizationcalledthegeometricleandrelatedonlinealgorithmsformaintainingaverylarge,disk-basedsamplefromadatastream.Thedissertationisdividedintofourparts.Intherstpartwedescribethegeometricleorganizationanddetailhowgeometriclescanbeusedtomaintainaverylargesimplerandomsample.Inthesecondpartweproposeasimplemodicationtotheclassicalreservoirsamplingalgorithmtocomputeabiasedsampleinasinglepassoverthedatastreamanddescribehowthegeometriclecanbeusedtomaintainaverylargebiasedsample.Inthethirdpartwedeveloptechniqueswhichallowageometricletoitselfbesampledinordertoproducesmallersetsofdataobjects.Finally,inthefourthpart,wediscusssecondaryindexstructuresforthegeometricle.Indexstructuresareusefultospeedupsearchanddiscoveryofrequiredinformationfromahugesamplestoredinageometricle.Theindexstructuresmustbemaintainedconcurrentlywithconstantupdatestothegeometricleandatthesametimeprovideefcientaccesstoitsrecords. Wenowgiveanintroductiontothesefourpartsofthedissertationinsubsequentsections. 14

PAGE 15

11 38 ].Reservoirsamplingalgorithmscanbeusedtodynamicallymaintainaxed-sizesampleofNrecordsfromastream,sothatatanygiveninstant,theNrecordsinthesampleconstituteatruerandomsampleofalloftherecordsthathavebeenproducedbythestream.However,aswewilldiscussinthisdissertation,theproblemisthatexistingreservoirtechniquesaresuitableonlywhenthesampleissmallenoughtotintomainmemory. Giventhattherearelimitedtechniquesformaintainingverylargesamples,theproblemaddressedintherstpartofthisdissertationisasfollows: 1. Thealgorithmsmustbesuitableforstreamingdata,oranysimilarenvironmentwherealargesamplemustbemaintainedon-lineinasinglepassthroughadataset,withthestrictrequirementthatthesamplealwaysbeatrue,statisticallyrandomsampleofxedsizeN(withoutreplacement)fromallofthedataproducedbythestreamthusfar. 2. Whenmaintainingthesample,thefractionofI/Otimedevotedtoreadsshouldbeclosetozero.Ideally,therewouldneverbeaneedtoreadablockofsamplesfromdisksimplytoaddonenewsampleandsubsequentlywritetheblockoutagain. 3. ThefractionI/OoftimespentperformingrandomI/Osshouldalsobeclosetozero.Costlyrandomdiskseeksshouldbefewandfarbetween.AlmostallI/Oshouldbesequential. 4. Finally,theamountofdatawrittentodiskshouldbeboundedbythetotalsizeofalloftherecordsthatareeversampled. Thegeometriclemeetseachoftherequirementslistedabove.WithmemorylargeenoughtobufferjBj>1records,thegeometriclecanbeusedtomaintainanonlinesampleofarbitrarysizewithanamortizedcostofO(!logjBj=jBj)randomdiskheadmovementsforeachnewlysampledrecord(seeSection 3.12 ).Themultiplier!canbemadearbitrarilysmallbymakinguseofadditionaldiskspace.Arigorousbenchmarkofthegeometricledemonstratesitssuperiorityovertheobviousalternatives. 15

PAGE 16

The need for biased sampling can easily be illustrated with an example population, given in Table 1-1. This particular data set contains records describing graduate student salaries in a university academic department, and our goal is to guess the total graduate student salary. Imagine that a simple random sample of the data set is drawn, as shown in Table 1-2. The four sampled records are then used to guess that the total student salary is (520 + 700 + 580 + 600) × 12/4 = $7200, which is considerably less than the true total of $9545. The problem is that we happened to miss most of the high-salary students, who are generally more important when computing the overall total.

Now, imagine that we weight each record, so that the probability of including any given record with a salary of $700 or greater in the sample is (2)(4/12), and the probability of including a given record with a salary less than $700 is (1/2)(4/12). Thus, our sample will tend to include those records with higher values, which are more important to the overall sum. The resulting biased sample is depicted in Table 1-3. The standard Horvitz-Thompson estimator [50] is then applied to the sample (where each record is weighted according to the inverse of its sampling probability), which gives us an estimate of (1200 + 1500 + 750)(12/8) + (580)(24/4) = $8655. This is obviously a better estimate than $7200, and the fact that it is better than the original estimate is not just accidental: if one chooses the weights carefully, it is easily possible to produce a sample whose associated estimator has lower variance (and hence higher accuracy) than the simple, uniform-probability sample. For instance, the variance of the estimator in the student salary example is 2.533 × 10^6 under uniform-probability sampling and it is 5.083 × 10^5 under the biased sampling scheme.
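Written out, the biased estimate above is a Horvitz-Thompson sum, with each sampled value divided by its inclusion probability (the notation here is ours, not the dissertation's):

\[
\hat{t} \;=\; \sum_{i \in \text{sample}} \frac{y_i}{\pi_i}
\;=\; (1200 + 1500 + 750)\cdot\frac{12}{8} \;+\; 580\cdot\frac{24}{4} \;=\; 8655,
\]

since high-salary records are included with probability π = (2)(4/12) = 8/12 and low-salary records with probability π = (1/2)(4/12) = 4/24.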

Population: student records

Rec#  Name      Class      Salary ($/month)
1     James     Junior     1200
2     Tom       Freshman   520
3     Sandra    Junior     1250
4     Jim       Senior     1500
5     Ashley    Sophomore  700
6     Jennifer  Freshman   530
7     Robert    Sophomore  750
8     Frank     Freshman   580
9     Rachel    Freshman   605
10    Tim       Freshman   550
11    Maria     Sophomore  760
12    Monica    Freshman   600
Total Salary: 9545.00

Table 1-2. Random sample of size 4

Rec#  Name    Class      Salary ($/month)
2     Tom     Freshman   520
5     Ashley  Sophomore  700
8     Frank   Freshman   580
12    Monica  Freshman   600

Other cases where a biased sample is preferable abound. For example, if the goal is to monitor the packets flowing through a network, one may choose to weight more recent packets more heavily, since they would tend to figure more prominently in most query workloads.

We propose a simple modification to the classic reservoir sampling algorithm [11, 38] in order to derive a very simple algorithm that permits the sort of fixed-size, biased sampling given in the example. Our method assumes the existence of an arbitrary, user-defined weighting function f which takes as an argument a record ri, where f(ri) > 0 describes the record's utility

Table 1-3. Biased sample of size 4

Rec#  Name    Class      Salary ($/month)
1     James   Junior     1200
4     Jim     Senior     1500
7     Robert  Sophomore  750
11    Maria   Sophomore  760

The key contributions of this part of the dissertation are as follows:

1. We present a modified version of the classic reservoir sampling algorithm that is exceedingly simple, and is applicable for biased sampling using any arbitrary user-defined weighting function f.

2. In most cases, our algorithm is able to produce a correctly biased sample. However, given certain pathological data sets and data orderings, this may not be the case. Our algorithm adapts in this case and provides a correctly biased sample for a slightly modified bias function f′. We analytically bound how far f′ can be from f in such a pathological case, and experimentally evaluate the practical significance of this difference.

3. We describe how to perform a biased reservoir sampling and maintain large biased samples with the geometric file.

4. Finally, we derive the correlation (covariance) between the Bernoulli random variables governing the sampling of two records ri and rj using our algorithm. We use this covariance to derive the variance of a Horvitz-Thompson estimator making use of a sample computed using our algorithm.

Small samples frequently do not provide enough accuracy, especially in the case when the resulting statistical estimator has a very high variance. However, while in the general case a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. For example, consider the problem of estimating the average

Since there is no single sample size that is optimal for answering all queries, and the required sample size can vary dramatically from query to query, this part of the dissertation considers the problem of generating a sample of size N from a data stream using an existing geometric file that contains a large sample of records from the stream, where N ≤ R. We will consider two specific problems. First, we consider the case where N is known beforehand. We will refer to a sample retrieved in this manner as a batch sample. We will also consider the case where N is not known beforehand, and we want to implement an iterative function GetNext. Each call to GetNext results in an additional sampled record being returned to the caller, and so N consecutive calls to GetNext result in a sample of size N. We will refer to a sample retrieved in this manner as an online or sequential sample.

In general, an index is a data structure that lets us find a record without having to look at more than a small fraction of all possible records. An index is referred to as a primary index if it

With these goals in mind, we discuss three secondary index structures for the geometric file: (1) a segment-based index, (2) a subsample-based index, and (3) a Log-Structured Merge-Tree- (LSM-) based index. The first two indexes are developed around the structure of the geometric file. Multiple B+-tree indexes are maintained for each segment or subsample in a geometric file. As new records are added to the file in units of a segment or subsample, a new B+-tree indexing the new records is created and added to the index structure. Also, an existing B+-tree is deleted from the structure when all the records indexed by it are deleted from the file. The third index structure makes use of the LSM-tree index [44], a disk-based data structure designed to provide low-cost indexing in an environment with a high rate of inserts and deletes. We evaluate and compare these three index structures experimentally by measuring build time and disk footprint as new records are inserted in the geometric file. We also compare the efficiency of these structures for point and range queries.

Related work is surveyed in Chapter 2. In Chapter 3 we present the geometric file organization and show how this structure can be used to maintain a very large

sample. In Chapter 4 we propose a single-pass biased reservoir sampling algorithm. In Chapter 5 we develop techniques that can be used to sample geometric files to obtain a smaller sample. In Chapter 6 we present secondary index structures for the geometric file. In Chapter 7 we discuss the benchmarking results. The dissertation is concluded in Chapter 8.

Most of the work in the dissertation is either already published or is under review for publication. The material from Chapter 3 is from the paper with Christopher Jermaine and Subramanian Arumugam that was originally published in SIGMOD 2004 [36]. The work presented in Chapter 4 has been submitted to TKDE and is under review [47]. The material in Chapter 5 is part of a journal paper accepted at VLDBJ [48]. The results in Chapter 7 are taken from the above three papers as well.

In this chapter, we first review the literature on reservoir sampling algorithms. We then present a summary of existing work on biased sampling.

A large body of work has considered the use of sampling in data management applications [2, 3, 8, 14, 15, 28, 32, 33, 35, 51, 52]. However, most previous papers (including the aforementioned references) are concerned with how to use a sample, and not with how to actually store or maintain one. Most of these algorithms could be viewed as potential users of a large sample maintained as a geometric file.

As mentioned in the introduction chapter, a series of papers by Olken and Rotem (including two papers listed in the References section [25, 27]) probably constitute the most well-known body of research detailing how to actually compute samples in a database environment. Olken and Rotem give an excellent survey of work in this area [26]. However, most of this work is very different from ours, in that it is concerned primarily with sampling from an existing database file, where it is assumed that the data to be sampled from are all present on disk and indexed by the database. Single-pass sampling is generally not the goal, and when it is, management of the sample itself as a disk-based object is not considered.

The algorithms in this dissertation are based on reservoir sampling, which was first developed in the 1960s [11, 38]. In his well-known paper [53], Vitter extends this early work by describing how to decrease the number of random numbers required to perform the sampling. Vitter's techniques could be used in conjunction with our own, but the focus of existing work on reservoir sampling is again quite different from ours; management of the sample itself is not considered, and the sample is implicitly assumed to be small and in-memory. However, if we remove the requirement that our sample of size N be maintained online so that it is always a valid snapshot of the stream and must evolve over time, then sequential sampling techniques related to reservoir sampling could be used to build (but not maintain) a large, on-disk sample (see Vitter [54], for example).

Also relevant is the work on index structures designed for high insertion rates, such as the LSM-Tree [44], Buffer-Tree [6], and Y-Tree [12]. These papers consider the problem of providing I/O-efficient indexing for a database experiencing a very high record insertion rate, which is impossible to handle using a traditional B+-tree indexing structure. In general, these methods buffer a large set of insertions and then scan the entire base relation, which is typically organized as a B+-tree, all at once, adding the new data to the structure.

Any of the above methods could trivially be used to maintain a large random sample of a data stream. Every time a sampling algorithm probabilistically selects a record for insertion, it must overwrite, at random, an existing record of the reservoir. Once an evictee is determined, we can attach its location as a position identifier (a number between 1 and R) to the new sample record. This position field is then used to insert the new record into these index structures. While performing the efficient batch inserts, if an index structure discovers that a record with the same position identifier exists, it simply overwrites the old record with the newer one.

However, none of these methods can come close to the raw write speed of the disk, as the geometric file can [13]. In a sense, the issue is that while the indexing provided by these structures could be used to implement efficient, disk-based reservoir sampling, it is too heavy-duty a solution. We would end up paying too much in terms of disk I/O to send a new record to overwrite a specific, existing record chosen at the time the new record is inserted, when all one really needs is to have a new record overwrite any random, existing record.

There has been much recent interest in approximate query processing over data streams (a very small subset of these papers is listed in the References section [1, 21, 34]), and even some work on sampling from a data stream [7]. This work is very different from our own, in that most existing approximation techniques try to operate in very small space. Instead, our focus is on making use of today's very large and very inexpensive secondary storage to physically store the largest snapshot possible of the stream.

Finally, we mention the U.C. Berkeley CONTROL project [37] (which resulted in the development of online aggregation [33] and ripple joins [32]). This work does address issues

[11, 38]. Recently, Gemulla et al. [29] extended the reservoir sampling algorithm to handle deletions. In their algorithm, called random pairing (RP), every deletion from the data set is eventually compensated by a subsequent insertion. The RP algorithm keeps track of uncompensated deletions and uses this information while performing the inserts. The algorithm guards the bound on the sample size and at the same time utilizes the sample space effectively to provide a stable sample. Another extension to the classic reservoir sampling algorithm has recently been proposed by Brown and Haas for warehousing of sample data [10]. They propose hybrid reservoir sampling for independent and parallel uniform random sampling of multiple streams. These algorithms can be used to maintain a warehouse of sampled data that shadows the full-scale data warehouse. They have also provided methods for merging samples from different streams to create a uniform random sample.

The problem of temporally biased sampling in a stream environment has also been considered. Babcock et al. [7] presented the sliding window approach, with a restricted horizon of the sample, to bias the sample towards the recent streaming records. However, this solution has the potential to completely lose the entire history of past stream data that is not a part of the sliding window. The work done by Aggarwal [5] addresses this limitation and presents a biased sampling method so that we can have temporal bias for recent records while keeping representation from the stream history. This work exploits some interesting properties of the class of memory-less bias functions to present a single-pass biased sampling algorithm for these types of bias functions. However,

Another piece of work on single-pass sampling with a nonuniform distribution is due to Kolonko and Wasch [40]. They present a single-pass algorithm to sample a data stream of unknown size (that is, not known beforehand) to obtain a sample of arbitrary size n such that the probability of selecting a data item i depends on the individual item. The weight or fitness of the item that is used for its probabilistic selection is derived using exponentially distributed auxiliary values, with the weight as the parameter of the exponential distribution; the largest auxiliary values determine the sample. Like the temporal biased sampling methods discussed above, this algorithm cannot be directly adapted for arbitrary user-defined bias functions.

Surprisingly, the above three papers are the only pieces of work known to the authors on how to perform a single-pass biased sampling over large data sets or streaming data.

Another body of related work is the papers from the network usage area [22, 24, 41]. These papers present techniques for estimating the total network traffic (or usage) based on a sample of the flow records produced by routers. Since these flows typically have heavy-tailed distributions, the techniques presented in these papers make use of a size-dependent sampling scheme. In general, such schemes work by sampling all the records whose traffic is above a certain threshold and sampling the rest with probability proportional to their traffic. Although such techniques introduce sampling bias, where size can be thought of as the weight of a record, there are key differences between such techniques and the algorithm presented in this dissertation. The goal of our algorithm is to obtain a fixed-size biased sample that complies with an arbitrary user-defined bias function. The goal of the size-dependent sampling scheme is to obtain a sample that will provide the best accuracy for estimating the total network traffic that follows a specific distribution. The sample gathered by these schemes is not necessarily a fixed-size biased sample. It only guarantees that the expected sample size is no larger than the expected sample

The problem of implementing a fixed-size sampling design with desired and unequal inclusion probabilities has been studied in statistics. The monograph Theory of Sample Surveys [50] discusses several methods for such a sampling technique, which is of some practical importance in survey sampling. This monograph begins by discussing two designs which mimic simple random sampling without replacement with selection probabilities for a given draw that are not the same for all the units. We first summarize these techniques.

with fixed size is to select units with replacement, and then to reject the sample if there are duplicates. We discuss one such method here, called Sampford's method.

In this chapter we give an introduction to the basic reservoir sampling algorithm that was proposed to obtain an online random sample of a data stream. The algorithm assumes that the sample maintained is small enough to fit in main memory in its entirety. We discuss and motivate why very large sample sizes can be mandatory in common situations. We describe three alternatives for maintaining very large, disk-based samples in a streaming environment. We then introduce the geometric file organization and present algorithms for reservoir sampling with the geometric file. We also describe how multiple geometric files can be maintained all at once to achieve considerable speedup.

The classic algorithm for this problem is reservoir sampling [11, 38]. To maintain a reservoir sample R of target size |R|, a simple loop is used (given as Algorithm 1).
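In outline, the loop keeps the first |R| records and thereafter accepts the i-th record with probability |R|/i, overwriting a uniformly chosen resident. A minimal C++ sketch of this textbook loop (our illustration; the Record type and stream handling are placeholders, and this is not a reproduction of the dissertation's Algorithm 1):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Minimal sketch of classic reservoir sampling [11, 38].
struct Record { /* application-specific fields */ };

class Reservoir {
public:
    explicit Reservoir(std::size_t capacity) : capacity_(capacity), seen_(0) {}

    void process(const Record& r) {
        ++seen_;
        if (sample_.size() < capacity_) {
            sample_.push_back(r);            // the first |R| records are always kept
        } else {
            // Accept the i-th record with probability |R|/i: draw a position in
            // [1, i]; positions 1..|R| both accept the record and name the victim.
            std::uniform_int_distribution<std::size_t> d(1, seen_);
            std::size_t pos = d(gen_);
            if (pos <= capacity_) sample_[pos - 1] = r;
        }
    }

private:
    std::size_t capacity_, seen_;
    std::vector<Record> sample_;
    std::mt19937_64 gen_{std::random_device{}()};
};
```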

[53]. After a certain number of records have been seen, the algorithm wakes up and captures the next record from the stream.

Algorithm 1 maintains this invariant in steps (2-6) as follows [11, 38]. The i-th record processed (i > |R|) is added to the reservoir with probability |R|/i by step 4. We need to show that for all other records processed thus far, the inclusion probability is also |R|/i. Let rk be any record in the reservoir s.t. k ≠ i, and let Ri denote the state of the reservoir just after addition of the i-th record. We are interested in Pr[rk ∈ Ri]:

\[
\Pr[r_k \in R_i] \;=\; \frac{R}{i-1}\left(1 - \frac{R}{i}\cdot\frac{1}{R}\right)
\;=\; \frac{R}{i-1}\cdot\frac{i-1}{i} \;=\; \frac{R}{i}
\]

A related classical design is systematic sampling [16]. To select a sample of |R| units, systematic sampling takes a unit at random from the first k units and every k-th unit thereafter. Although the inclusion probability in systematic sampling is the same as in simple random sampling, the properties of a sample such as variance can be far different. It is known that the variance of systematic sampling can be better or worse compared to simple random sampling, depending on data heterogeneity and the correlation coefficient between pairs of sampled units.

The proof that reservoir sampling maintains the correct inclusion probability for any set of interest is actually very similar to the univariate inclusion probability correctness discussed above. We know that the univariate inclusion probability is Pr[rk ∈ Ri] = R/i. For any arbitrary set S with |S| ≤ |R|, assume that we have the correct probability when we have seen i−1 input records, i.e., Pr[S ⊆ R_{i−1}] = C(|R|, |S|)/C(i−1, |S|). When the i-th record is processed (i > |R|), S survives intact unless the new record is accepted and overwrites a member of S, which happens with probability (|R|/i)(|S|/|R|) = |S|/i. Thus

\[
\Pr[S \subseteq R_i] \;=\; \frac{\binom{|R|}{|S|}}{\binom{i-1}{|S|}}\left(1-\frac{|S|}{i}\right)
\;=\; \frac{\binom{|R|}{|S|}}{\binom{i-1}{|S|}}\cdot\frac{i-|S|}{i}
\;=\; \frac{\binom{|R|}{|S|}}{\binom{i}{|S|}}
\]

Estimates computed from a sample are typically reported with an associated probabilistic guarantee ([16]); this probability is known as the confidence of the estimate.

Very large samples are often required to provide accurate estimates with suitably high confidence. The need for very large samples can be easily explained in the context of the Central Limit Theorem (CLT) [27]. The CLT implies that if we use a random sample of size N to estimate the mean of a set of numbers, the error of our estimate is usually normally

distributed. Two properties of this error are important:

1. The error is inversely proportional to the square root of the sample size.

2. The error is directly proportional to the standard deviation of the set over which we are estimating the mean.

The significance of this observation is that the sample size required to produce an accurate estimate can vary tremendously in practice, and grows quadratically with increasing standard deviation. For example, say that we use a random sample of 100 students at a university to estimate the average student age. Imagine that the average age is 20 with a standard deviation of 2 years. According to the CLT, our sample-based estimate will be accurate to within 2.5% with confidence of around 98%, giving us an accurate guess as to the correct answer with only 100 sampled students.

Now, consider a second scenario. We want to use a second random sample to estimate the average net worth of households in the United States, which is around $140,000, with a standard deviation of at least $5,000,000. Because the standard deviation is so large, a quick calculation shows we will need more than 12 million samples to achieve the same statistical guarantees as in the first case.

Required sample sizes can be far larger when standard database operations like relational selection and join are considered, because these operations can effectively magnify the variance of our estimate. For example, the work on ripple joins [32] provides an excellent example of how variance can be magnified by sampling over the relational join operator.
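The "quick calculation" follows the usual CLT sample-size form. As a sketch (the confidence constant z ≈ 2.33 for 98% two-sided confidence is our assumption; the dissertation does not state it):

\[
n \;\gtrsim\; \left(\frac{z\,\sigma}{\varepsilon\,\mu}\right)^{2}
\;=\; \left(\frac{2.33 \times 5{,}000{,}000}{0.025 \times 140{,}000}\right)^{2}
\;\approx\; 1.1\times 10^{7},
\]

which, depending on the exact constant used, is consistent with the "more than 12 million" figure above. The same formula with μ = 20 and σ = 2 gives n ≈ 87, confirming that roughly 100 students suffice in the first scenario.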

The resulting method is given as Algorithm 2. Count(B) refers to the current number of records in B. Note that since the records contained in B logically represent records in the reservoir that have not yet been added to disk, a newly-sampled record can either be assigned to replace an on-disk record, or it can be assigned to replace a buffered record (this is decided in Step (7) of the algorithm).

In a realistic scenario, the ratio of the number of disk blocks to the number of records buffered in main memory may approach or even exceed one. For example, a 1TB database with 128KB blocks will have 7.8 million blocks, and for such a relatively large database it is realistic to expect that we have access to enough memory to buffer millions of records. As the number of buffered records per block meets or exceeds one, most or all of the blocks on disk will contain

a record to be overwritten by some record buffered by Algorithm 2, and so all of the database blocks must be updated. Thus, it makes sense to rely on fast, sequential I/O to update the entire file in a single pass. The drawback of this approach is that every time the buffer fills, we are effectively rebuilding the entire reservoir to process a set of buffered records that are a small fraction of the existing reservoir size.

All three of these extensions of Algorithm 1 can be used to maintain a large, on-disk sample, but all of them have drawbacks. In this section, we discuss a fourth algorithm and an associated data organization called the geometric file to address these pitfalls. The geometric file is best seen as an extension of the massive rebuild option given as Algorithm 2. Just like Algorithm 2, the geometric file makes use of a main-memory buffer that allows new samples selected by the reservoir algorithm to be added to the on-disk reservoir in a lazy fashion. However, the key difference between Algorithm 2 and the algorithms used by the geometric file is that the geometric file makes use of a far more efficient algorithm for merging those new samples into the reservoir.

Compared with Algorithm 2, the basic algorithm employed by the geometric file is not much different. As far as Step (13) is concerned, the difference between the geometric file and the massive rebuild extension is that the geometric file empties the buffer more efficiently, in order to avoid scanning or periodically re-randomizing the entire reservoir.

To accomplish this, the entire sample in main memory that is flushed into the reservoir is viewed as a single subsample or stratum [16], and the reservoir itself is viewed as a collection of subsamples, each formed via a single buffer flush. Since the records in a subsample are a non-random subset of the records in the reservoir (they are sampled from the stream during a specific time period), each new subsample needs to overwrite a true, random subset of the records in the reservoir in order to maintain the correctness of the reservoir sampling algorithm. If this can be done efficiently, we can avoid rebuilding the entire reservoir in order to process a buffer flush.

At first glance, it may seem difficult to achieve the desired efficiency. The buffered records that must be added to the reservoir will typically overwrite a subset of the records stored in each

(see Section 3.3). For example, if there are 100 on-disk subsamples, the buffer must be split 100 ways in order to write to a portion of each of the 100 on-disk subsamples. This fragmented buffer then becomes a new subsample, and subsequent buffer flushes that need to replace a random portion of this subsample must somehow efficiently overwrite a random subset of the subsample's fragmented data.

The geometric file uses a careful, on-disk data organization in order to avoid such fragmentation. The key observation behind the geometric file is that the number of records of a subsample that are replaced with records from a buffered sample can be characterized with reasonable accuracy using a geometric series (hence the name geometric file). As buffered samples are added to the reservoir via buffer flushes, we observe that each existing subsample loses approximately the same fraction of its remaining records every time, where the fraction of records lost is governed by the ratio of the size of a buffered sample to the overall size of the reservoir. By loses, we mean that the subsample has some of its records replaced in the reservoir with records from a subsequent subsample. Thus, the size of a subsample decays approximately in an exponential manner as buffered samples are added to the reservoir.

This exponential decay is used to great advantage in the geometric file, because it suggests a way to organize the data in order to avoid problems with fragmentation. Each subsample is partitioned into a set of segments of exponentially decreasing size. These segments are sized so that every time a buffered sample is added to the reservoir, we expect that each existing subsample loses exactly the set of records contained in its largest remaining segment. As a result, each subsample loses one segment to the newly-created subsample every time the buffer is emptied, and a geometric file can be organized into a fixed and unchanging set of segments that are stored as contiguous runs of blocks on disk. Because the set of segments is fixed beforehand, fragmentation and update performance are not problematic: in order to replace records in an

On day two (with U1 = 90), the Uranium further decays to U2 = 81 grams, this time losing U1(1−α) = U0(1−α)α = nα = 9 grams of its mass. On day three, it further decays by nα² = 8.1 grams to 72.9 grams, and so on. The decay process is allowed to continue until we have less than β grams of Uranium remaining.

Continuing with the Uranium analogy, three questions that are relevant to our problem of maintaining very large samples from a data stream are: how much mass is lost on a given day, for how many days the process continues before less than β grams remain, and how much mass is lost in total.

These questions can be answered using three simple observations related to geometric series; in particular, for a series with first term n and ratio c, the number of days before less than β grams remain is ⌊log(β/n)/log c⌋. We denote this floor by ℓ.
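The arithmetic is easy to check numerically. The following small C++ program (our own illustration, using the Uranium numbers n = 10 grams, c = 0.9, β = 1 gram) compares the closed-form count against direct enumeration:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Geometric decay as in the Uranium analogy: terms n, nc, nc^2, ...
// The number of terms that are at least beta is floor(log(beta/n)/log(c)) + 1.
int main() {
    double n = 10.0, c = 0.9, beta = 1.0;
    int closedForm =
        static_cast<int>(std::floor(std::log(beta / n) / std::log(c))) + 1;
    std::vector<double> segments;
    for (double s = n; s >= beta; s *= c) segments.push_back(s);
    std::printf("closed form: %d, enumerated: %zu\n", closedForm, segments.size());
    return 0;   // both print 22 for these parameters
}
```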

(Algorithm 2). Recall that the way reservoir sampling works is that new samples from the data stream are chosen to overwrite random samples currently in the reservoir. The buffer temporarily stores these new samples, delaying the overwrite of a random set of records that are already stored on disk. Once the buffer is full, all new samples are merged with R by overwriting a random subset of the existing samples in R.

Consider some arbitrary subsample S of R (so S ⊆ R), with capacity |S|. Since the buffer B represents samples that have already (logically) overwritten an equal number of records of R, a buffer flush overwrites exactly |B| samples of R. Thus, on expectation the merge will overwrite |S||B|/|R| samples of S. If we define |B|/|R| = 1−α, then on expectation, S should lose |S|(1−α) of its own records due to the buffer flush.

We can roughly describe the expected decay of S after repeated buffer merges using the three observations stated before. If the subsample retention rate is α = 1 − |B|/|R|, then after the k-th subsequent buffer flush, S is expected to retain |S|α^k of its records.

The net result of this is that it is possible to characterize the expected decay of any arbitrary subset of the records in our disk-based sample as new records are added to the sample through multiple emptyings of the buffer. We therefore view S as being composed of on-disk segments of exponentially decreasing size, plus a special, single group of final segments of total size β that is buffered in main memory (see Section 3.7).

Figure 3-1. Decay of a subsample after multiple buffer flushes.

Figure 3-2. Basic structure of the geometric file.

As depicted in Figure 3-1, we can organize our large, disk-based sample as a set of decaying subsamples. At any point of time, the largest subsample was created by the most recent flushing of the buffer into R, and has not yet lost any segments. The second largest subsample was created by the second most recent buffer flush; it lost its largest segment in the most recent buffer flush. In general, the i-th largest subsample was created by the i-th most recent buffer flush, and it has had i−1 segments removed by subsequent buffer flushes. The overall file organization is depicted in Figure 3-2.

The process of building the file is given as Algorithm 3. The terms n, α, and β carry the meaning discussed in Section 3.5. The process described by Algorithm 3 is depicted graphically in Figure 3-3. First, the file is filled with the initial data produced by the stream (a through c). To add the first records to the file, the buffer is allowed to fill with samples. The buffered records are then randomly grouped into segments, and the segments are written to disk to form the largest initial subsample (a). For the second initial subsample, the buffer is only allowed to fill to α|B| of its capacity before being written out (b). For the third initial subsample, the buffer fills to α²|B| of its capacity before it is written (c). This is repeated until the reservoir has completely filled (as was shown in Figure 3-2). At this point, new samples must overwrite existing ones. To facilitate this, the buffer is again allowed to fill to capacity. Records are then randomly grouped into segments of appropriate size, and those segments overwrite the largest segment of each existing subsample (d). This process is then repeated indefinitely, as long as the stream produces new records (e and f).

(see Section 3.7.1).

We now consider the buffer flush performed by Algorithm 3, Step (21). In order to maintain the algorithm's correctness, when the buffer is flushed

in Figure 3-4. Now, we want to add five additional numbers to our set by randomly replacing five existing numbers. While we do expect numbers to be replaced in a way that is proportional to bucket size (Figure 3-4(b)), this is not always what will happen (Figure 3-4(c)).

This randomization is performed before Algorithm 3 adds a new subsample to disk via a buffer flush in Step (21): we first perform a logical, randomized partitioning of the buffer into segments, described by Algorithm 4. In Algorithm 4, each newly-sampled record is randomly assigned to replace a sample from an existing, on-disk subsample, so that the probability of each subsample losing a record is proportional to its size. The result of Algorithm 4 is an array of Mi values, where Mi tells Step (21) of Algorithm 3 how many records should be assigned to overwrite the i-th on-disk subsample. (A sketch of this assignment is given below.)
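A minimal C++ sketch of the proportional assignment behind Algorithm 4 (our illustration, not the dissertation's code): each buffered record picks a victim subsample with probability proportional to that subsample's current size.

```cpp
#include <random>
#include <vector>

// Distribute bufferCount buffered records across on-disk subsamples so
// that each subsample is hit with probability proportional to its size.
// Returns M, where M[i] is the number of records aimed at subsample i.
std::vector<int> partitionBuffer(const std::vector<double>& subsampleSizes,
                                 int bufferCount, std::mt19937_64& gen) {
    std::vector<int> M(subsampleSizes.size(), 0);
    std::discrete_distribution<std::size_t> pick(subsampleSizes.begin(),
                                                 subsampleSizes.end());
    for (int r = 0; r < bufferCount; ++r) ++M[pick(gen)];
    return M;
}
```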

Figure 3-3. Building a geometric file.

Figure 3-4. Distributing new records to existing subsamples.

However, there is no guarantee that a buffer flush will supply exactly the number of records contained in each subsample's largest segment. To handle this problem, we associate a stack (or buffer

area) with each subsample, which absorbs the difference between the number of records actually supplied and what, according to Algorithm 4, Mi should have been. Then, there are two possible cases: the flush supplies too few records for a subsample's largest segment, and the shortfall is drawn from the subsample's stack; or it supplies too many, and the excess records are pushed onto the stack.

These stack operations are performed just prior to Step (23) in Algorithm 3. Note that since the final group of segments from a subsample, of total size β, is buffered in main memory, its maintenance does not require any stack operations. Once a subsample has lost all of its on-disk samples, overwrites of records in this set can be handled by simply replacing the records directly.

To pre-allocate space for these stacks, we need to characterize how much overflow we can expect from a given subsample, which will bound the growth of the subsample's stack. It is important to have a good characterization of the expected stack growth. If we allocate too much space for the stacks, then we allocate disk space for storage that is never used. If we allocate too little space, then the top of one stack may grow up into the base of another. If a stack does overflow, it can be handled by buffering the additional records temporarily in memory or moving the stack to a new location on disk until the stack can again fit in its allocated space. This is not

To avoid this, we observe that if the stack associated with a subsample S contains any samples at a given moment, then S has had fewer of its own samples removed than expected. Thus, our problem of bounding the growth of S's stack is equivalent to bounding the difference between the expected and the observed number of samples that S loses as |B| new samples are added to the reservoir, over all possible values of |B|.

To bound this difference, we first note that after adding |B| new samples into the reservoir, the probability that any existing sample in the reservoir has been overwritten by a new sample is 1 − (1 − 1/|R|)^{|B|} [42]. Simple arithmetic implies that the greatest variance is achieved when a subsample has on expectation lost 50% of its records to new samples (P = 0.5); at this point the standard deviation is 0.5√|S|.

To illustrate the importance of minimizing the number of segments, imagine that we have a 1GB buffer and a stream producing 100B records, and we want to maintain a 1TB sample. Assume that we use an α value of 0.99. Thus, each subsample is originally 1GB, and |B| = 10^7. From Observation 2, we need ⌊log(β/n)/log 0.99⌋ = 1029 segments to store the entire new subsample.

Now, consider the situation if α = 0.999. A similar computation shows that we will now require 10,344 segments to store the same 1GB subsample. This is an order-of-magnitude difference, with significant practical importance. With four disk seeks per segment, 1029 segments might mean that we spend around 40 seconds of disk time in random I/Os (at 10ms

α is fixed by the relationship α = (|R| − |B|)/|R|. We will address this limitation in Section 3.10.

One could instead try to increase β: raising β by a factor of 32 reduces the number of segments to ⌊log(32β/n)/log 0.99⌋, or 687. That is, by increasing the amount of main memory devoted to holding the smallest segments for each subsample by a factor of 32, we are able to reduce the number of disk head movements by less than a factor of two. Thus, we will not consider optimizing β. Rather, we will fix β to hold a set of samples equivalent to the system block size, and search for a better way to increase performance.

1. Why is the classical reservoir sampling algorithm (presented as Algorithm 1) correct? That is, what is the invariant maintained by Algorithm 1?

2. Why is the obvious disk-based extension of Algorithm 1 (presented as Algorithm 2) correct? That is, how does Algorithm 2 maintain the invariant of Algorithm 1 via the use of a main memory buffer?

3. Why is the proposed geometric file based sampling technique of Algorithm 3 correct?

We have answered the first question in Section 3.1. We discuss the second and third questions here.

1 asfarascorrectnessisconcerned. Logically,steps(7-14)ofAlgorithm 2 actuallyimplementexactlythisprocess.Theprobabilitythatwewillgeneratearandompositionbetween1andjRjthatisalreadyinthepositionarrayofsizejBjisjBj=R.Step(7)ofAlgorithm 2 decideswhethertooverwritearandombufferedrecordwithanewlysampledrecord.Oncethebufferisfull,step(13)performsaonepassbuffer-reservoirmergingbygeneratingsequentialrandompositionsinthereservoironthey. 2 westorethesamplessequentiallyonthediskandoverwritetheminarandomorder.Thoughcorrect,thealgorithmdemandsalmostacompletescanofthereservoir(toperformallrandomoverwrites)foreverybufferush.WecandobetterifweinsteadforcethesamplestobestoredinarandomorderondisksothattheycanbereplacedviaanoverwriteusingsequentialI/Os.Thelocalizedoverwriteextensiondiscussedbeforeusethisidea.Everytimeabufferisushedtothereservoiritisrandomizedinmainmemoryandwrittenasarandomclusteronthedisk.WemaintainthecorrectnessofthistechniquebysplittingtherandomclusterinN-wayswhereNisthenumberofexistingclustersonthediskandbyoverwritingrandomsubsetofeachexistingcluster.Thisavoidstheproblemofclusteringbyinsertiontime.However,thedrawbackofthistechniqueisthatthesolutiondeterioratesbecauseoffragmentationofclusters. ThegeometricleovercomesthedrawbacksofthesetwotechniquesandcanbeviewedasacombinationofAlgorithm 2 andtheideausedinthelocalizedoverwriteextension.ThecorrectnessoftheGeometricleisresultsdirectlyfromthecorrectnessofthesetwotechniques.Incaseofthegeometricletheentiresampleinthemainmemory(referredtoasasubsample)israndomizedandushedintothereservoir.Furthermore,eachnewsubsampleissplitintoexactlythosemanysegmentsasthenumberofexistingsubsamplesonthedisk.Thesesegmentsthenoverwritearandomportionofeachdisk-basedsubsample.Theonlydifferencewiththe 50

PAGE 51

1 ,isxedbytheratiojBj=jRj.Thatis,foraxeddesiredsizeofreservoirweneedalargerbuffertolowerthevalueof. However,thereisawaytoimprovethesituation.GivenabufferofxedcapacityjBjanddesiredsamplesizejRj,wechooseasmallervalue0<,andthenmaintainmorethanonegeometricleatthesametimetoachievealargeenoughsample.Specically,weneedtomaintainm=(10) (1)geometriclesatonce.Theselesareidenticaltowhatwehavedescribedthusfar,exceptthattheparameter0isusedtocomputethesizesofasubsample'son-disksegmentsandsizeofeachleisjRj 3 .Eachofthemgeometriclesisstilltreatedasasetofdecayingsubsamples,andeachsubsampleispartitionedintoasetofsegmentsofexponentiallydecreasingsize,justasisdoneinAlgorithm 3 ,Steps(5)-(13).Theonlydifferenceisthataseachleiscreated,theparameter0isusedinsteadofinSteps(6),(8)-(9),andeachofthemgeometriclesislledafteroneanother,inturn.Thus,eachsubsampleofeachgeometriclewillhavesegmentsofsizen;n0;n02andsoon. 51

PAGE 52

3 Steps(15)-(20)untilbufferisfull.Oncethebufferisfull,itsrecordorderisthenrandomized,justasisinasinglegeometricle.Nextthebufferisushedtodisk.Thisiswherethealgorithmismodied.Overwritingrecordsondiskwithrecordsfromthebufferissomewhatdifferent,intwoprimaryways,asdiscussednext. 4 ,thebufferispartitionedsothatthesizeofeachbuffersegmentisonexpectationproportionaltothecurrentsizeofsubsamplesinasinglele.Incaseofmultiplegeometricles,wepartitionthebufferjustlikeinAlgorithm 4 ;however,werandomlypartitionthebufferacrossallsubsamplesfromallgeometricles.Thenumberofbuffersegmentsafterthepartitioningisthesameasthetotalnumberofsubsamplesintheentirereservoir,andthesizeofeachbuffersegmentisonexpectationproportionaltothecurrentsizeofeachofthesubsamplesfromoneofthegeometricles.Thisallowsustomaintainthecorrectnessofthereservoirsamplingalgorithm.ThebufferpartitioningstepsincaseofmultiplegeometriclesaregiveninAlgorithm 5 3 'sbuffermergealgorithm.Wediscussalltheintricaciessubsequently,butathigh-level,thelargestsegmentofeachsubsamplefromonlyonegeometricleisover-writtenwithsamplesfromthebuffer.Thisallowsforconsiderablespeedup,aswediscussinSection 3.12 .Atrst,thiswouldseemtocompromisethecorrectnessofthealgorithm:logically,thebufferedsamplesmustover-writesamplesfromeveryoneofthegeometricles(infact,thisispreciselywhythebufferispartitionedacrossallgeometricles,as 52

PAGE 53
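As a concrete illustration (our own numbers): if the buffer-to-reservoir ratio forces α = 0.999 but we want each file to behave as if α′ = 0.9, the number of files required is

\[
m \;=\; \frac{1-\alpha'}{1-\alpha} \;=\; \frac{1-0.9}{1-0.999} \;=\; 100,
\]

each of size |R|/100 and flushed to in turn.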

3.11.1 to 3.11.3 ,wedescribeindetailanalgorithmthatisabletomaintainthecorrectnessofthesample. Oncethesegmentsassignedtothevariousleshavebeenconsolidated,theresultingsegmentsareusedtooverwritesubsamplesfromasinglegeometricleusingexactlythealgorithmfromSection 3.4 ,subjecttotheconstraintthatthejthbuffermergeoverwritessubsamplesfromthe(jmodm)thgeometricle. Ourremedytothisproblemistodelayoverwritingasubsample'slargestsegmentuntilthetimethatall(ormost)oftherecordsthatwillbeover-writtenondiskareinvalid,inthesensethattheyhavelogicallybeenover-writtenbyhavingrecordsfromsubsequentbuffer 53

PAGE 54

Figure 3-5. Speeding up the processing of new samples using multiple geometric files.

The way to accomplish this is to overwrite subsamples in a lazy manner. We merge the buffer with the (j mod m)-th geometric file, but we do not overwrite any of the valid samples stored in the file until the next time we get to the file. We can achieve this by allocating enough extra space in each geometric file to hold a complete, empty subsample. This subsample is referred to as the dummy. The dummy never decays in size, and never stores its own samples. Rather, it is used as a buffer that allows us to sidestep the problem of a subsample decaying too quickly. When a new subsample is added to a geometric file, the new subsample overwrites segments of the dummy rather than overwriting the largest segment of any existing subsamples. Thus, we have protected segments of subsamples that contain valid data by overwriting the dummy's records instead.

When records are merged from the buffer into the dummy, the space previously owned by the dummy is given up to allow storage of the file's newest subsample. After this flush, the largest segment from each of the subsamples in the file is given up to reconstitute the new dummy. Because the records in the (new) dummy's segments will not be overwritten until the next time that this particular geometric file is written to, all of the data contained within it is protected.

Note that with a dummy subsample, we no longer have a problem with a subsample losing its samples too quickly. Instead, a subsample may have slightly too many samples present on disk at any given time, buffered by the file's dummy. These extra samples can easily be ignored during query processing. The only additional cost we incur with the dummy is that each of the geometric files on disk must have |B| additional units of storage allocated. The use of a dummy subsample is illustrated in Figure 3-5.

The amortized maintenance cost is formalized as Lemma 2.

Proof. Each subsample is written as ⌊log(β/n)/log α′⌋ on-disk segments. Substituting n = (1−α′)|B| and simplifying the expression (as well as

dropping the floor) yields (log β − log |B|)/log α′. If we let ω = (log(1/α′))^{-1}, the number of segments can be expressed as ω(log |B| − log β). Assuming a constant number c of random seeks per segment written to the disk, the total number of random disk head movements required per record is ωc((log |B| − log β)/|B|), which is O(ω log |B|/|B|).

In the case of multiple geometric files we use additional space for m dummy subsamples. Thus, the total storage required by all geometric files is |R| + m|B|. If we wish to maintain a 1TB reservoir of 100B samples with 1GB of memory, we can achieve α′ = 0.9 by using only 1.1TB of disk storage in total. For α′ = 0.9, we need to write fewer than 100 segments per 1GB buffer flush. At 40ms/segment, this is only 4 seconds of random disk head movements to write 1GB of new samples to disk.

In order to test the relative ability of the geometric file to process a high-speed stream of insertions, we have implemented and benchmarked five alternatives for maintaining a large reservoir on disk: the three alternatives discussed in Section 3.3, the geometric file, and the framework described in Section 3.10 for using multiple geometric files at once. We present these benchmarking results in Chapter 7.

In this chapter we propose a simple modification to the classic reservoir sampling algorithm [11, 38] in order to derive a very simple algorithm that permits the sort of fixed-size, biased sampling given in the example. Our method assumes the existence of an arbitrary, user-defined weighting function f which takes as an argument a record ri, where f(ri) > 0 describes the record's utility in subsequent query processing. We then compute (in a single pass) a biased sample Ri of the i records produced by a data stream. Ri is fixed-size, and the probability of sampling the j-th record from the stream is proportional to f(rj) for all j ≤ i. This is a fairly simple and yet powerful definition of biased sampling, and is general enough to support many applications.

Of course, one straightforward way to sample according to a well-defined bias function would be to make a complete pass over the data set to compute the total weight of all the records, Σ_{j=1}^{N} f(rj). During a second pass, we can then choose the i-th record of the data set with probability |R|f(ri)/Σ_{j=1}^{N} f(rj). Our algorithm instead maintains the total weight seen so far as a running sum, so that only a single pass is needed.

In most cases, our algorithm is able to produce a correctly biased sample. However, given certain pathological data sets and data orderings, this may not be the case. Our algorithm adapts in this case and provides a correctly biased sample for a slightly modified bias function f′. We
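A minimal C++ sketch of this single-pass idea (our rendering, not the dissertation's Algorithm 6; it ignores the "overweight" complication handled later by Algorithm 7 and assumes |R|f(ri)/totalWeight ≤ 1):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Single-pass biased reservoir sampling: record i is accepted with
// probability |R| * f(r_i) / totalWeight, where totalWeight is the
// running sum of all weights seen so far; the victim is uniform.
struct Record { double payload; };

class BiasedReservoir {
public:
    BiasedReservoir(std::size_t cap, double (*weight)(const Record&))
        : cap_(cap), weight_(weight) {}

    void process(const Record& r) {
        totalWeight_ += weight_(r);
        if (sample_.size() < cap_) { sample_.push_back(r); return; }
        double p = cap_ * weight_(r) / totalWeight_;   // acceptance probability
        if (coin_(gen_) < p) {
            std::uniform_int_distribution<std::size_t> victim(0, cap_ - 1);
            sample_[victim(gen_)] = r;                 // evict a random resident
        }
    }

private:
    std::size_t cap_;
    double (*weight_)(const Record&);
    double totalWeight_ = 0.0;
    std::vector<Record> sample_;
    std::mt19937_64 gen_{std::random_device{}()};
    std::uniform_real_distribution<double> coin_{0.0, 1.0};
};
```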

The rest of the chapter is organized as follows. We describe a single-pass biased sampling algorithm. We also define a distance metric to evaluate the worst case deviation from the user-defined weighting function f. Finally, we derive a simple estimator for a biased reservoir. The experiments performed to test our algorithms are presented in Chapter 7.

The modified algorithm is given as Algorithm 6. It is possible to prove that this modified algorithm results in a correctly biased sample, provided that the probability from line (8) of Algorithm 6 does not exceed one.

Lemma 3. Using Algorithm 6, we are guaranteed that for each Ri and for each record rj produced by the data stream such that j ≤ i, we have Pr[rj ∈ Ri] = |R|f(rj)/Σ_{k=1}^{i} f(rk).

Proof.


We define an overweight record to be a record ri for which |R|f(ri)/totalWeight > 1, i.e., a record whose nominal acceptance probability from line (8) exceeds one.

An important factor to consider while determining the feasibility of maintaining such a queue in the general case is providing an upper bound on its size. This can be done by considering the worst possible ordering of the records input into the algorithm, subject to the constraint that the bias function is well-defined. In general, we describe the user-defined weighting function f as being well-defined if |R|f(ri)

We stress that though this upper bound is quite poor (requiring that we buffer the entire data stream!), it is in fact a worst-case scenario, and the approach will often be feasible in practice. This is because weights will often increase monotonically over time (as in the case where newer records tend to be more relevant for query processing than older ones). Still, given the poor worst-case upper bound, a more robust solution is required, which we now describe.

1. First, we will be able to guarantee that f′(rj) will be exactly f(rj) if |R|f(rk)/totalWeight ≤ 1 for all k > j.

2. We can also guarantee that we can compute the true weight for a given record, in order to unbias any estimate made using our sample (see Section 4.4).

In other words, our biased sample can still be used to produce unbiased estimates that are correct on expectation [16], but the sample might not be biased exactly as specified by the user-defined function f, if the value of f(r) tends to fluctuate wildly. While this may seem like a drawback, the number of records not sampled according to f will usually be small. Furthermore, since the function used to measure the utility of a sample in biased sampling is usually the result of an approximate answer to a difficult optimization problem [15] or the application of a heuristic [52], having a small deviation from that function might not be of much concern.

We present a single-pass biased sampling algorithm that provides both guarantees outlined above as Algorithm 7, and Lemma 4 proves the correctness of the algorithm.

Lemma 4. Using Algorithm 7, we are guaranteed that for each Ri and for each record rj produced by the data stream such that j ≤ i, we have Pr[rj ∈ Ri] = |R|f′(rj)/Σ_{k=1}^{i} f′(rk).

Proof. The proof is similar to that of Lemma 3. We simply use f′ instead of f to prove the desired result.

An important question regarding Algorithm 7 is the deviation of f′ from f. That is: how far off from the correct weighting can we be, in the worst case? When the stream has no overweight records, we expect f′ to be exactly equal to f, but it may be very far away under certain circumstances. To address this, we define a distance metric in Definition 2 and evaluate the worst case distance between f′ and f.

The bound is given as Theorem 1 and is analyzed and proved in the Appendix.

Theorem 1. Algorithm 7 will sample with an actual bias function f′ where totalDist(f, f′) is upper bounded by an expression involving Σ_{k=|R|}^{N} f(r′_k) and Σ_{k=1}^{|R|−1} f(r′_k), where r′_1, …, r′_N denotes the reordering of the stream's records by increasing weight.

In other words, Algorithm 7 computes a biased sample according to f′, where f′ is a function close to the user-defined weighting function f according to the distance metric totalDist of Definition 2.

The worst case for Algorithm 7 occurs when (1) the reservoir is initially filled with the |R| records having the smallest possible weights and (2) we encounter the record rmax with the largest weight immediately thereafter. Theorem 1 presented an upper bound on totalDist(f, f′) in this worst case. In this section, we first provide the proof of this worst case for Algorithm 7 and then prove the upper bound on totalDist(f, f′) given by Theorem 1.

To establish the worst case for Algorithm 7, we first prove the following three propositions. These proofs lead us to the worst-case argument. If we denote the record with the highest weight in the stream as rmax, and use rmax_i to denote the case where rmax is located at position i in the stream, then for any given random ordering of the streaming records r1, …, r_{i−1}, rmax_i, …, rN, we prove that:

1. Moving the record rmax_i earlier in the range r_{|R|} … r_N cannot decrease totalDist(f, f′).

2. When we are initially filling the reservoir, choosing |R| records with the smallest possible weights maximizes totalDist(f, f′).

3. Reordering of any record that appears after rmax_i in the range r_{i+1} … r_N cannot increase totalDist(f, f′).

Using Lemma 5 (given below), we rewrite the totalDist formula.

Again using Lemma 5, we rewrite the totalDist formula.

Figure 4-1. Adjustment of rmax_i to rmax_{i−1}.

We subtract one instance of the totalDist equation from the other as follows. Figure 4-1 shows the adjustment of rmax_i to rmax_{i−1}; we denote the record that is swapped with rmax as rswap. The equation then simplifies to an expression with denominator [|R|f(rmax) + Σ_{k=i+1}^{N} f(rk)].

where Y = 2f(rswap).

Algorithm 7 accepts the first |R| records of the stream with probability 1. No weight adjustments are triggered for the first |R| records, irrespective of their weights. Therefore, the earliest position rmax can appear in the stream is right after the reservoir is filled. This proves the proposition. We now turn to proving Lemma 5, which was used in the previous proof.

Proof.


From the above three propositions, we can conclude that the worst case for Algorithm 7 occurs when (1) the reservoir is initially filled with the |R| records having the smallest possible weights and (2) we encounter the record rmax with the largest weight immediately thereafter.

Proof of Theorem 1: the upper bound on totalDist. Using Lemma 5, we rewrite the totalDist formula.

Using Equation (4), the above equation simplifies further.

In the worst case the reservoir is initially filled with the |R| records having the smallest possible weights. If r1, r2, …, rN are the records in appearance order, then we define r′1, r′2, …, r′N as the permutation (reordering) of the records such that f(r′1) ≤ f(r′2) ≤ ⋯ ≤ f(r′N). The condition requiring the reservoir to be filled with the smallest possible weights can then be written as

When an overweight record ri is encountered, the algorithm adjusts its bookkeeping as follows (a sketch of this rescaling is given after the list):

1. For each on-disk subsample, Mj is set to |R| Mj f(ri)/totalWeight.

2. For each sampled record still in the buffer, rj.weight is set to |R| rj.weight f(ri)/totalWeight.

3. Finally, totalWeight is set to |R|f(ri).
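A C++ sketch of this rescaling (our illustration; M, weight, and totalWeight play the roles described in the list above):

```cpp
#include <vector>

// Rescaling performed when an overweight record r_i arrives, i.e. when
// R * f(r_i) > totalWeight. After the adjustment, the new record's
// acceptance probability is exactly one.
void adjustForOverweight(double fi, double R,
                         std::vector<double>& M,          // per-subsample multipliers
                         std::vector<double>& bufWeight,  // weights of buffered records
                         double& totalWeight) {
    double scale = R * fi / totalWeight;
    for (double& m : M) m *= scale;
    for (double& w : bufWeight) w *= scale;
    totalWeight = R * fi;    // now R * f(r_i) / totalWeight == 1
}
```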

In this section we derive the variance of the standard Horvitz-Thompson estimator [50] for a sample computed using our algorithm. We derive the correlation (covariance) between the Bernoulli random variables governing the sampling of two records ri and rj using our algorithm, and use this covariance to derive the variance of a Horvitz-Thompson estimator. Combined with the Central Limit Theorem, the variance can then be used to provide bounds on the estimator's accuracy. The estimator is suitable for the SUM aggregate function (and, by extension, the AVERAGE and COUNT aggregates) over a single database table for which the reservoir is maintained. Though handling more complicated queries using the biased sample is beyond the scope of the paper, it is straightforward to extend the analysis of this section to more complicated queries such as joins [32].

Imagine that we have the following single-table query, whose (unknown) answer is q:

SELECT SUM(…) FROM TABLE AS r

Next, we derive the variance of this estimator. To do this, we need a result similar to Lemma 3 that can be used to compute the probability Pr[{rj, rk} ⊆ Ri] under our biased sampling scheme.
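The natural estimator for q takes the Horvitz-Thompson form; as a sketch (g(r) is our placeholder for the lost expression inside SUM), using the inclusion probability of Lemma 4:

\[
\hat{q} \;=\; \sum_{r_j \in R_i} \frac{g(r_j)}{\Pr[r_j \in R_i]}
\;=\; \sum_{r_j \in R_i} g(r_j)\,\frac{\sum_{k=1}^{i} f'(r_k)}{|R|\,f'(r_j)} .
\]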

Lemma 6. Using Algorithm 7, for each Ri and for each record pair {rj, rk} produced by the data stream where j < k ≤ i, the joint inclusion probability Pr[{rj, rk} ⊆ Ri] can be computed in closed form.

This expression can then be used in conjunction with the next lemma to compute the variance of the natural estimator for q.

By using the result of Lemma 6 to compute Pr[{rj, rk} ⊆ Ri], the variance of the estimator is then easily obtained for a specific query. In practice, the variance itself must be estimated by considering only the sampled records, as we typically do not have access to each and every rj during query processing. The q² term and the two sums in the expression of the variance are thus computed over each rj in the sample of the biased geometric file rather than over the entire reservoir.

There is one additional issue regarding biased sampling that is worth some additional discussion: how to efficiently compute the value Pr[{rj, rk} ⊆ Ri] in order to estimate

The first subexpressions can be easily computed with the help of the running total totalWeight along with the weight multipliers associated with each subsample. When sample records are added to the reservoir, like the attribute ri.weight, we store two more attributes with each record: ri.oldTotalWeight and ri.oldM. The first attribute gets its value from the current value of totalWeight, whereas M(ri) is stored in the second attribute. When a query is evaluated and we need to compute the first subexpressions for a given record pair rj and rk, we compute the terms in its denominator as follows: Σ_{l=1}^{k} f′(rl) = rk.oldTotalWeight × M(rk)/rk.oldM.

A geometric file is a simple random sample (without replacement) from a data stream. In this chapter we develop techniques which allow a geometric file to itself be sampled in order to produce smaller sets of data objects that are themselves random samples (without replacement) from the original data stream. The goal of the algorithms described in this chapter is to efficiently support further sampling of a geometric file by making use of its own structure.

In Section 3.2, we argued that small samples frequently do not provide enough accuracy, especially in the case when the resulting statistical estimator has a very high variance. However, while in the general case a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. For example, reconsider the problem of estimating the average net worth of American households as described in Section 3.2. In the general case, many millions of samples may be needed to estimate the net worth of the average household accurately (due to a small ratio between the average household's net worth and the standard deviation of this statistic across all American households). However, if the same set of records held information about the size of each household, only a few hundred records would be needed to obtain similar accuracy for an estimate of the average size of an American household, since the ratio of average household size to the standard deviation of household size across households in the United States is greater than 2. Thus, to estimate the answer to these two queries, vastly different sample sizes are needed.

Batch samples are useful in many applications [1, 21, 30, 34, 39]. In general, the drawback of making use of a batch sample is that the accuracy of any estimator which makes use of the sample is fixed at the time that the sample is taken, whereas the benefit of batch sampling is that the sample can be drawn with very high efficiency.

We will also consider the case where N is not known beforehand, and we want to implement an iterative function GetNext. Each call to GetNext results in an additional sampled record being returned to the caller, and so N consecutive calls to GetNext result in a sample of size N. We will refer to a sample retrieved in this manner as an online or sequential sample. The drawback of online sampling compared to batch sampling is that it is generally less efficient to obtain a sample of size N using online methods. However, since the consumer of the sample can call GetNext repeatedly until an estimator with enough accuracy is obtained, online sampling is more flexible than batch sampling. An online sample retrieved from a geometric file can be useful for many applications, including online aggregation [32, 33]. In online aggregation, a database system tries to quickly gather enough information so as to approximate the answer to an aggregate query. As more and more information is gathered, the approximation quality is improved, and the online sampling procedure is halted when the user is happy with the approximation accuracy.

5.3.1 A Naive Algorithm

Proof.

The naive algorithm returns each of the C(|D|, N) possible size-N subsets with probability 1/C(|D|, N), so each record of the file is included with probability N/|D|, as required.

Unfortunately, though it is very simple, the naive algorithm will be inefficient for drawing a small sample from a large geometric file, since it requires a full scan of the geometric file to obtain a true random sample for any value of N. Since the geometric file may be gigabytes in size, this can be problematic.

The idea parallels classic stratified sampling techniques [26]. Once the number of sampled records from each segment has been determined, sampling those records can be done with an efficient sequential read, since within each on-disk segment all records are stored in a randomized order. The key algorithmic issue is how to calculate the contribution of each subsample. Since this contribution is a multivariate hypergeometric random variable, we can use an approach analogous to Algorithm 4, which is used to partition the buffer to form the segments of a subsample. In other words, we can view retrieving N samples from a geometric file as analogous to choosing N random records to overwrite when new records are added to the file.

The resulting algorithm can be described as follows. To start with, we partition the sample space of N records into segments of varying size exactly as in Algorithm 4. We refer to these segments of the sample space as sampling segments. The sampling segments are then filled with samples from the disk using a series of sequential reads, analogous to the set of writes that are used to add new samples to the geometric file. The largest sampling segment obtains all of its records from the largest subsample, the next largest sampling segment obtains all its records from the second largest subsample, and so on.

When using this algorithm, some care needs to be taken when N approaches the size of a geometric file. Specifically, when all disk segments of a subsample are returned to a corresponding sampling segment, we must also consider the subsample's in-memory buffered

The full procedure is given as Algorithm 8. It is clear that this algorithm obtains the desired batch sample by scanning exactly N records, as against the full scan required by the naive algorithm, at the cost of a few random disk seeks. Since the sampling process is analogous to the process of adding more samples to the file, it is just as efficient, requiring O(ω log|B|/N) random disk head movements for each newly sampled record, as described in Lemma 2.

When the sample is spread over multiple geometric files, we simply run Algorithm 8 on each file in order to obtain the desired batch sample.

5.4.1 A Naive Algorithm

It is easy to see that a naive algorithm will give us a correct online sample of a geometric file. However, we will use one disk seek per call to GetNext. Since each random I/O requires

Instead of selecting a random record of the geometric file, we randomly pick a subsample and choose its next available record as the return value of GetNext. This is analogous to the classic online sampling algorithm for sampling from a hashed file [26], where first a hash bucket is selected and then a record is chosen. Since the selection of a random record within a subsample is sequential, we may reduce the number of costly disk seeks if we read the subsample in its entirety and buffer the subsample's records in memory. Using this basic methodology, a call to GetNext picks a subsample, serves the next record from that subsample's in-memory buffer, and refills the buffer from disk when it runs dry; a sketch is given below.

Since the records from each subsample are read and buffered in memory sequentially, we are guaranteed to choose each record of the reservoir at most once, giving us the desired random sample without replacement. A proof of this is simple, and analogous to the proof of Lemma 3. However, thus far we have not considered a very important question: how many blocks of a subsample Si should we fetch at the time of a buffer refill? In general there are two extremes that we may consider: fetch all of the subsample's blocks at once, or fetch only a small chunk at a time.
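A C++ sketch of the buffered GetNext idea (ours; "disk" is simulated with in-memory vectors, and in the real file refill() would be a sequential read of the next chunk of blocks):

```cpp
#include <cstddef>
#include <deque>
#include <random>
#include <vector>

// Pick a subsample with probability proportional to its remaining
// records, serve one buffered record, refill its buffer when dry.
struct Record { int payload; };

class OnlineSampler {
public:
    OnlineSampler(std::vector<std::vector<Record>> disk, std::size_t chunk)
        : disk_(std::move(disk)), next_(disk_.size(), 0),
          buf_(disk_.size()), chunk_(chunk) {}

    bool getNext(Record& out) {
        std::vector<double> remaining(disk_.size());
        double total = 0.0;
        for (std::size_t s = 0; s < disk_.size(); ++s) {
            remaining[s] = double(disk_[s].size() - next_[s] + buf_[s].size());
            total += remaining[s];
        }
        if (total == 0.0) return false;                 // reservoir exhausted
        std::discrete_distribution<std::size_t> pick(remaining.begin(),
                                                     remaining.end());
        std::size_t s = pick(gen_);
        if (buf_[s].empty()) refill(s);                 // sequential chunk read
        out = buf_[s].front();
        buf_[s].pop_front();
        return true;
    }

private:
    void refill(std::size_t s) {
        for (std::size_t i = 0; i < chunk_ && next_[s] < disk_[s].size(); ++i)
            buf_[s].push_back(disk_[s][next_[s]++]);
    }
    std::vector<std::vector<Record>> disk_;   // records of each subsample
    std::vector<std::size_t> next_;           // next unread record per subsample
    std::vector<std::deque<Record>> buf_;     // in-memory buffers
    std::size_t chunk_;                       // records fetched per refill
    std::mt19937_64 gen_{std::random_device{}()};
};
```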

In order to discuss such considerations more concretely, we note that the time required to process a GetNext call is proportional to the number of blocks fetched on the call, assuming that the cost to perform the required in-memory calculations is minimal. If b blocks are fetched during a particular call, we spend s + br time units on that particular call to GetNext, where s is the seek time and r is the time required to scan a block. Once these b blocks are fetched, we incur zero cost for the next bn calls to GetNext, where n is the blocking factor (number of records per block). Thus, in the case where all blocks are fetched at the first call to GetNext, we incur a total cost of s + br to sample bn records, and have a response time of s + br units at the first call to GetNext, with all subsequent calls having zero cost.

Now imagine that instead we split the b blocks into two chunks of size b/2 each, and read a chunk at a time. Thus, the first GetNext call will cost us s + br/2 time units. Once these bn/2 records are used up, we read the next chunk of blocks. The total cost in this scenario is 2s + br, with a response time of s + br/2 time units once at the starting point and once midway through. Note that although the maximum response time on any call to GetNext is reduced by half, we required more time to sample bn records. The question then becomes: how do we reconcile response time with overall sampling time to give the user optimal performance?

The systematic approach we take to answering this question is based on minimizing the average square sum of response times over all GetNext calls. This idea is similar to the widely utilized sum-square-error or MSE criterion, which tries to keep the average error or cost from being too high, but also penalizes particularly poor individual errors or costs. However, one

To minimize this quantity over the number of chunks X used to read the N requested records (total scan time (N/n)r), we differentiate:

\[
\frac{d}{dX}\!\left(X\Big(s+\frac{(N/n)\,r}{X}\Big)^{2}\right)
= \frac{d}{dX}\!\left(Xs^{2} + \frac{2Nsr}{n} + \frac{((N/n)r)^{2}}{X}\right)
= s^{2} - \frac{((N/n)r)^{2}}{X^{2}}
\]

Setting the derivative to zero gives X = (N/n)(r/s); that is, the sum of squared response times is minimized when the chunk size is chosen so that the scan time per chunk equals the seek time s.

Algorithm 9 gives the detailed online sampling algorithm.

Proof. Let Sample be the biased sample of the geometric file. Then we have

Pr[i ∈ Sample] = Pr[selecting i from Si] × Pr[selecting Si] × Pr[i ∈ Si] = (1/|R|) × |R|f(r)/totalWeight = f(r)/totalWeight.

The experiments performed to test these algorithms are presented in Chapter 7 of this dissertation.

Efficiently searching for and discovering required information from a sample stored in a geometric file is essential to speed up query processing. A natural way to support this functionality is to build an index structure for the geometric file. In this chapter we discuss three secondary index structures for the geometric file. The goal is to maintain the index structures as new records are inserted into the geometric file, and at the same time provide efficient access to the desired information in the file.

For example, consider the query:

SELECT … FROM Transaction
WHERE StoreState = 'FL' AND TransDate > 1/1/2007

A natural way to speed up the search and discovery of those records from a geometric file that have a particular value for a particular attribute (or attributes) is to build an index structure. In general, an index is a data structure that lets us find a record without having to look at more than a small fraction of all possible records. Thus, in our example, we could use an index built on either StoreState or TransDate (or both) to quickly access the specific set of records and test them for the conditions in the WHERE clause. In this chapter we focus on building such an index structure for the geometric file.

Apart from providing efficient access to the desired information in the file, a key consideration is that the index for the geometric file must be maintained as new records are inserted. For instance, we could build a secondary index on an attribute when the new records are bulk inserted into the geometric file. We must then determine how we merge the new secondary index with the existing indexes built for the rest of the file. Furthermore, we must maintain the index as existing records are being overwritten with newly inserted records and hence are deleted from the geometric file.

With these goals in mind, we discuss three secondary index structures for the geometric file: (1) a segment-based index, (2) a subsample-based index, and (3) a Log-Structured Merge-Tree- (LSM-) based index. The first two indexes are developed around the structure of the geometric file. Multiple B+-tree indexes [9] are maintained for each segment or subsample in a geometric

44 ]-adisk-baseddatastructuredesignedtoprovidelow-costindexinginanenvironmentwithahighrateofinsertsanddeletes. Inthesubsequentsectionswediscussconstruction,maintenance,andqueryingofthesethreetypesofindexes. Wedetailconstructionandmaintenanceofasegment-basedindexstructureinthissection. 3 fromChapter 3 duringstart-uptollthereservoir.Everytimethebufferaccumulatesthedesirednumberofrecords,itissegmentedandushedtothedisk.WebuildaB+-treeindexforeachsegmentjustbeforetheyarewrittenouttothedisk.Foreachbufferedrecordofasegmentweconstructanindexrecord.Anindexrecordiscomprisedofthevalueoftheattributeonwhichtheindexisgettingbuilt(thekeyvalue)andthepositionofthebufferedrecordonthedisk.Thepositionisstoredasanumberpair:apagenumberandoffsetwithinapage.TheindexrecordsarethenusedtocreateanindexusingthebulkinsertionalgorithmforaB+-Tree.Weuseasimplearray-baseddatastructuretokeep 91

PAGE 92
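A sketch of the bookkeeping involved (the types are our own illustration; the dissertation does not give code):

```cpp
#include <cstdint>
#include <vector>

// Segment-based index bookkeeping: each on-disk segment gets its own
// bulk-loaded B+-tree. An IndexRecord pairs the key value with the
// data record's on-disk position (page number, offset within page).
struct IndexRecord {
    int64_t  key;        // indexed attribute value
    uint32_t pageNo;     // page holding the data record
    uint16_t offset;     // record's offset within the page
};

struct SegmentIndex {
    uint64_t rootPos;    // starting position of this B+-tree in the index file
    // ... per-tree metadata, e.g., which segment the tree covers
};

// Array-based structure holding one entry per live segment's B+-tree.
std::vector<SegmentIndex> segmentIndexes;
```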

Rather than maintaining a file for each B+-tree created, we organize multiple B+-trees in a single disk file. We refer to this single file as the index file. The index file, in a sense, is similar to the log-structured file system proposed by Ousterhout [45]. In a log-structured file system, as files are modified, the contents are written out to the disk as logs in a sequential stream. This allows writes in full-cylinder units, with only track-to-track seeks. Thus the disk operates at nearly its full bandwidth. The index file enjoys similar performance benefits. Every time a B+-tree is created for a memory-resident segment, it is written to the index file in a sequential stream at the next available position. The array maintaining all B+-tree root nodes is augmented with the starting disk position of the B+-tree.

Finally, we do not index segments that are never flushed to the disk. These segments are typically very small (the size of a disk block) and it is efficient to search them using a sequential memory scan when the geometric file is queried.

The algorithm used to construct and maintain a segment-based index structure is given as Algorithm 10.

A query must consult one B+-tree per live on-disk segment, on the order of ⌊log(β/n)/log c⌋ trees per subsample.

We expect a segment-based index structure to be compact, as there is exactly one index record present in the index structure for each record in the geometric file, and the index structure is maintained as records are deleted from the file.

Asinthecaseofasegment-basedindexstructure,wearrangetheB+-Treeindexesondiskinasingleindexle.However,weneedaslightlydifferentapproach,becauseduringthestart-upsubsamplesareushedtothegeometricle,untilthereservoirisfull.ThereaftersubsamplesofthesamesizejBjareaddedtothereservoir.SinceeachB+-TreewillindexnomorethanjBjrecords,wecanboundthesizeofaB+-Treeindex.Weusethisboundtopre-allocateax-sizedslotondiskforeachB+-Tree.Furthermore,foreverybufferushafterthereservoirisfull,exactlyonesubsampleisaddedtotheleandthesmallestsubsampleoftheledecayscompletely,keepingthenumberofsubsamplesinageometricleconstant.Weusethisinformationtolayoutthesubsample-basedB+-Treesondiskandmaintainthemasnewrecordsaresampledfromthedatastream. Thus,iftotSubsamplesisthetotalnumbersubsamplesinR,werstallocatexed-sizetotSubsamplesslotsintheindexle.Initiallyalltheslotsareempty.Duringstart-up,asanewB+-Treeisbuilt,weseektothenextavailableslotandwriteouttheB+-Treeinasequential 94


The algorithm used to construct and maintain a subsample-based index structure is given as Algorithm 11.

A search on the subsample-based index structure involves looking up all B+-tree indexes, one for each subsample in the geometric file. We modify the existing B+-tree-based point query and range query algorithms and run them for each entry in the B+-tree array of the index structure. The modification is required to ignore the stale records in the B+-trees. As mentioned before, the subsample corresponding to a B+-tree may lose its segments, but the index records are not removed until the entire subsample decays.
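A minimal sketch of the query-side bookkeeping described here and detailed in the next paragraph: matches from each per-subsample B+-tree are filtered against a record of which segments have decayed, and the survivors are sorted by page number so the geometric file can be read in a single forward pass. The Hit layout and the decayed-segment bitmap are assumptions for illustration.

    // Sketch only: drop stale index records, then order hits for sequential I/O.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Hit { int64_t key; uint32_t page; uint16_t offset; uint32_t segment; };

    std::vector<Hit> filterAndOrder(std::vector<Hit> hits,
                                    const std::vector<bool>& segmentDecayed) {
        hits.erase(std::remove_if(hits.begin(), hits.end(),
                       [&](const Hit& h) { return segmentDecayed[h.segment]; }),
                   hits.end());
        std::sort(hits.begin(), hits.end(),
                  [](const Hit& a, const Hit& b) { return a.page < b.page; });
        return hits;
    }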


Recall that we have recorded a segment number in an additional field along with each index record. For a given subsample, we keep track of which of its segments have decayed so far and use this information to ignore the index records that are stale. We return all valid index records that satisfy the search criteria: we first sort these index records by their page number attribute, and then retrieve the actual records from the geometric file and return them as the query result.

Although the subsample-based index structure maintains and must search far fewer B+-trees compared to the segment-based index structure, we expect a reasonable search time per B+-tree due to each tree's small, bounded size, even with the lazy deletion policy.

The third index structure we propose is based on the Log-Structured Merge-Tree (LSM-Tree) [44], a disk-based data structure designed to provide low-cost indexing in an environment with a high rate of inserts and deletes. An LSM-Tree consists of a memory-resident C0 component and one or more disk-resident components C1, C2, and so on [44]. Although the C1 (and higher) components are disk-resident, the most frequently referenced nodes (in general, nodes at higher levels) of these trees are buffered in main memory for performance reasons.


Whenever the C0 component reaches a threshold size, an ongoing rolling merge process removes some records (a contiguous segment) from the C0 component and merges them into the C1 component on disk. The rolling merge process is depicted pictorially in Figure 2.2 of the original LSM-Tree paper [44]. The rolling merge is repeated for migration between higher components of an LSM-Tree in a similar manner. Thus, there is a certain amount of delay before records in the C0 component migrate out to the disk-resident C1 and higher components. Deletions are performed concurrently in batch fashion, similar to inserts.

The disk-resident components of an LSM-tree are comparable to a B+-tree structure, but are optimized for sequential disk access, with nodes 100% full. Lower levels of the tree are packed together in contiguous, multi-page disk blocks for better I/O performance during the rolling merge.

We use the existing LSM-Tree-based point query and range query algorithms to perform index look-ups. As in the case of the previously proposed index structures, we sort the valid index records by page number before retrieving the actual records from the geometric file.
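The data flow of one rolling-merge step can be sketched as below. This is a deliberately simplified illustration: C0 is modeled as an in-memory sorted map and C1 as a sorted vector, whereas the real disk-resident components are multi-page tree structures.

    // Sketch only: migrate a contiguous batch of entries from C0 into C1.
    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <map>
    #include <vector>

    using C0 = std::map<int64_t, int64_t>;                   // memory-resident component
    using C1 = std::vector<std::pair<int64_t, int64_t>>;     // disk-resident component (modeled)

    void rollingMergeStep(C0& c0, C1& c1, size_t batch) {
        C1 moving;                                           // contiguous segment leaving C0
        auto it = c0.begin();
        for (size_t i = 0; i < batch && it != c0.end(); ++i)
            moving.push_back(*it++);
        c0.erase(c0.begin(), it);

        C1 merged;
        merged.reserve(c1.size() + moving.size());
        std::merge(c1.begin(), c1.end(), moving.begin(), moving.end(),
                   std::back_inserter(merged));              // both inputs are sorted
        c1.swap(merged);                                     // new, fully packed sorted run
    }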


In Chapter 7, we experimentally evaluate and compare the three index structures suggested in this chapter by measuring build time and disk footprint as new records are inserted into the geometric file. We also compare the efficiency of these structures for point and range queries.


In this chapter, we detail three sets of benchmarking experiments. In the first set of experiments, we attempt to measure the ability of the geometric file to process a high-speed stream of data records. In the second set of experiments, we examine the various algorithms for producing smaller samples from a large, disk-based geometric file. Finally, in the third set of experiments, we compare the three index structures for the geometric file on build time, disk space, and index look-up speed.

In the first set of experiments we compare five alternatives for maintaining a very large sample: a virtual-memory-based reservoir, a scan-based approach, the local overwrite strategy described in Section 3.3, the geometric file, and the framework described in Section 3.10 for using multiple geometric files at once. In the remainder of this section, we refer to these alternatives as the virtual memory, scan, local overwrite, geofile, and multiple geofiles options. An α value of 0.9 was used for the multiple geofiles option.

All implementation was performed in C++. Benchmarking was performed using a set of Linux workstations, each equipped with 2.4GHz Intel Xeon processors. 15,000-RPM, 80GB Seagate SCSI hard disks were used to store each of the reservoirs. Benchmarking of these disks showed a sustained read/write rate of 35-50MB/second, and an across-the-disk random data access time of around 10ms.


The results of the first experiment are plotted in Figure 7-1(a). By number of samples processed we mean the number of records that are actually inserted into the reservoir, and not the number of records that have passed through the data stream. The results of the second experiment are plotted in Figure 7-1(b); here we vary the record size, testing its effect on the five options. The results of the third experiment are plotted in Figure 7-1(c); this experiment tests the effect of a constrained amount of main memory.

It is worthwhile to point out a few specific findings. Each of the five options writes the first 50GB of data from the stream more or less directly to disk, as the reservoir is large enough to hold all of the data as long as the total is less than 50GB. However, Figures 7-1(a) and (b) show that only the multiple geofiles option does not have much of a decline in performance after the reservoir fills (at least in Experiments 1 and 2). This is why the scan and virtual memory options plateau after the amount of data inserted reaches 50GB. There is something of a decline in performance in all of the methods once the reservoir fills in Experiment 3 (with restricted buffer memory), but it is far less severe for the multiple geofiles option than for the other options.


Figure 7-1. Results of benchmarking experiments (processing insertions).


Figure 7-2. Results of benchmarking experiments (sampling from a geometric file).


As expected, the local overwrite option performs very well early on, especially in the first two experiments (see Section 3.3 for a discussion of why this is expected). Even with limited buffer memory in Experiment 3, it uniformly outperforms a single geometric file. Furthermore, with enough buffer memory in Experiments 1 and 2, the local overwrite option is competitive with the multiple geofiles option early on. However, fragmentation becomes a problem and performance decreases over time. Unless offline re-randomization of the file is possible periodically, this degradation probably precludes long-term use of the local overwrite option.

It is interesting that, as demonstrated by Experiment 3 (and explained in Section 3.8), a single geometric file is very sensitive to the ratio of the size of the reservoir to the amount of available memory for buffering new records from the stream. The geofile option performs well in Experiments 1 and 2, when this ratio is 100, but rather poorly in Experiment 3, when the ratio is 1000.

Finally, we point out the general unusability of the scan and virtual memory options. The scan option generally outperformed virtual memory, but both did poorly. Except in Experiment 1, with large memory and small record size, with these two options more than 97% of the processing of records from the stream occurs in the first half hour, as the reservoir fills. In the 19.5 hours or so after the reservoir first fills, only a tiny fraction of additional processing occurs, due to the inefficiency of the two options.

In Section 4.1 we gave an upper bound for the distance between the actual bias function f' computed using our reservoir algorithm and the desired, user-defined bias function f. While useful, this bound does not tell the entire story. In the end, what a user of a biased sampling algorithm is interested in is not how close the bias function that is actually computed is to the user-specified one; instead, the key question is what sort of effect any deviation has on the particular estimation task that is to be performed.


Figure 7-3. Sum query estimation accuracy for zipf=0.2.

Perhaps the easiest way to detail the practical effect of a pathological data ordering is through experimentation. In this section we present experimental results evaluating the practical significance of a worst-case data ordering. Specifically, we design a set of experiments to compute the error (variance) one would expect when sampling for the answer to a SUM query in the following three scenarios:

1. When a biased sample is computed using our reservoir algorithm, with the data ordered so as to produce no overweight records.

2. When an unbiased sample is computed using the classical reservoir sampling algorithm.

3. When a biased sample is computed using our reservoir algorithm, with records arranged so as to produce the bias function furthest from the user-specified one, as described by Theorem 1.

By examining the results, it should become clear exactly what sort of practical effect on the accuracy of an estimator one might expect due to a pathological ordering.


Figure 7-4. Sum query estimation accuracy for zipf=0.5.

Each data set contains two attributes, A and B, where attribute B is the attribute that is actually aggregated by the SUM query. Each set is generated so that attributes A and B both have a certain amount of Zipfian skew, specified by the parameter zipf. In each case, the bias function f is defined so as to minimize the variance for a SUM query evaluated over attribute A.

In addition to the parameter zipf, each data set also has a second parameter, which we term the correlation factor. This is the probability that attribute A has the same value as attribute B. If the correlation factor is 1, then A and B are identical, and since the bias function is defined so as to minimize the variance of a query over A, the bias function also minimizes the variance of an estimate over the actual query attribute B. Thus, a correlation factor of 1 provides a perfect bias function. As the correlation factor decreases, the quality of the bias function for a query over attribute B declines, because the chance increases that a record deemed important by looking at attribute A is, in fact, one that should not be included in the sample. This models the case where one can only guess at the correct bias function beforehand; for example, when queries with an arbitrary relational selection predicate may be issued. A small correlation factor corresponds to the case when the guessed-at bias function is actually very incorrect.
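A sketch of the synthetic data generator implied by this description follows; the zipf() helper is a naive inverse-CDF Zipf sampler (cached for the first (n, s) it is called with), and all names and parameters are illustrative rather than the dissertation's code.

    // Sketch only: draw one record whose attributes A and B are Zipfian, with
    // B equal to A with probability corr (the correlation factor).
    #include <algorithm>
    #include <cmath>
    #include <random>
    #include <utility>
    #include <vector>

    std::mt19937_64 rng(42);

    int zipf(int n, double s) {                       // naive Zipf(n, s) sampler
        static std::vector<double> cdf;               // cached for the first (n, s) used
        if (cdf.empty()) {
            double z = 0;
            for (int k = 1; k <= n; ++k) z += 1.0 / std::pow(k, s);
            double acc = 0;
            for (int k = 1; k <= n; ++k) {
                acc += 1.0 / std::pow(k, s) / z;
                cdf.push_back(acc);
            }
        }
        double u = std::uniform_real_distribution<double>(0, 1)(rng);
        return int(std::lower_bound(cdf.begin(), cdf.end(), u) - cdf.begin()) + 1;
    }

    std::pair<int, int> makeRecord(int n, double s, double corr) {
        int a = zipf(n, s);
        bool same = std::bernoulli_distribution(corr)(rng);
        return { a, same ? a : zipf(n, s) };          // B == A with probability corr
    }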


Figure 7-5. Sum query estimation accuracy for zipf=0.8.

By testing each of the three scenarios described in the previous subsection over a set of data sets created by varying zipf as well as the correlation factor, we can see the effect of data skew and of bias function quality on the relative quality of the estimator produced by each of the three scenarios.

For each experiment, we generate a data stream of one million records and obtain a sample of size 1000. For each of the three scenarios and each of the data sets that we test, we repeat the sampling process 1000 times over the same data stream in Monte Carlo fashion. The variance of the corresponding estimator is reported as the observed variance of the 1000 estimates. The observed Monte Carlo variances are depicted in Figures 7-3, 7-4, 7-5, and 7-6.
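The Monte Carlo protocol itself is simple enough to state as code. In this sketch, estimateSum() is a hypothetical hook standing in for one complete run of a sampling scheme plus the resulting SUM estimate; the function repeats it and returns the observed (sample) variance of the estimates.

    // Sketch only: observed variance of an estimator over repeated trials.
    #include <functional>
    #include <vector>

    double observedVariance(const std::function<double()>& estimateSum, int trials) {
        std::vector<double> est(trials);
        double mean = 0;
        for (int t = 0; t < trials; ++t) { est[t] = estimateSum(); mean += est[t]; }
        mean /= trials;
        double var = 0;
        for (double e : est) var += (e - mean) * (e - mean);
        return var / (trials - 1);                    // assumes trials > 1
    }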


Figure 7-6. Sum query estimation accuracy for zipf=1.

The results show that even for very skewed data sets, it is difficult for even an adversary to come up with a data ordering that can significantly alter the quality of the user-defined bias function.

We also observe that for a low zipf parameter and a low correlation factor, unbiased sampling outperforms biased sampling. In other words, it is actually preferable not to bias in this case. This is because the low zipf value assigns relatively uniform values to attribute B, rendering an optimal biased scheme little different from uniform sampling. Furthermore, as the correlation factor decreases, the weighting scheme used by both biased sampling schemes becomes less accurate, hence the higher variance. As the weighting scheme becomes very inaccurate, it is better not to bias at all. Not surprisingly, there are more cases where the biased scheme under the pathological ordering is actually worse than the unbiased scheme. However, as the correlation factor increases and the bias scheme becomes more accurate, it quickly becomes preferable to bias.

In the next set of experiments, we benchmark the algorithms of Chapter 5 for producing smaller samples from a geometric file. Specifically, we have compared the naive batch sampling and online sampling algorithms against the geometric file structure-based batch sampling and online sampling algorithms.


Figure 7-2(a) depicts the results for a single geometric file; Figure 7-2(b) shows an analogous plot for the multiple geometric files option.

For online sampling, response times are plotted in Figure 7-2(c) for both the naive algorithm and the more advanced, geometric file structure-based algorithm designed to increase the sampling rate and even out the response times. The analogous plot for the multiple geometric file case is shown in Figure 7-2(d). We also plot the variance in response times over all calls to GetNext as a function of the number of calls to GetNext in Figures 7-2(e) and 7-2(f) (the first is for a single geometric file; the second is with multiple files). Taken together, these plots show the trade-off between overall processing time and the potential for waiting a long time in order to obtain a single sample.


As expected, and then demonstrated by the variance plots, the variance of the online naive approach is smaller than that of the geometric file structure-based algorithm. However, at the cost of this slightly larger variance in the response times (less than 10 times greater for 100k samples), the structure-based approach executed orders of magnitude faster (more than 100 times faster for 100k samples) than the naive approach for any number of records sampled, justifying our approach of minimizing the average square sum of the response times. In other words, we got enough added speed for a small enough added variance in response time to make the trade-off acceptable. As more and more samples are obtained, the variance of the structure-based algorithm approaches the variance of the naive algorithm, making the trade-off even more reasonable for large intended sample sizes.

Finally, we point out that both of the geometric file structure-based algorithms, in the batch and online cases, were able to read sample records from disk at almost the maximum sustained speed of the hard disk, at around 45MB/sec. This is comparable to the rate of a sequential read from disk, the best we can hope for.


Figure 7-7. Disk footprint for 1KB record size.

Table 7-1. Millions of records inserted in 10 hrs

    Record size   No index   Subsample-based   Segment-based   LSM-Tree
    1KB           13700      12550             10960           9680
    200B          12810      7230              8030            2930

In Chapter 6 we introduced three index structures for the geometric file: the segment-based, the subsample-based, and the LSM-tree-based index structure. In this section, we experimentally evaluate and compare these three index structures by measuring build time and disk footprint as new records are inserted into the geometric file. We also compare the efficiency of these structures for point and range queries. All of the index structures were implemented on top of the geometric file prototype that was benchmarked in the previous sections.


In these experiments, records are inserted into a geometric file continuously for ten hours while each index structure is maintained as described in Chapter 6. The ten hours of insertion into the geometric file ensures that a reasonable number of insertions and deletions are performed on an index structure. Given such a file, we collected the following three pieces of information for each of the three index structures under consideration: the number of records inserted (the build rate), the disk footprint of the index, and the index look-up speed.

With these metrics in mind we performed the following two sets of experiments. In the first set of experiments we use a 1KB record size: the insertion figures are shown in Table 7-1, the disk space used by the three index structures is plotted in Figure 7-7, and the index look-up speed is tabulated in Table 7-2.


Figure 7-8. Disk footprint for 200B record size.

In the second set of experiments we use the smaller 200B record size: the insertion speeds are shown in Table 7-1, the disk space used by the three index structures is plotted in Figure 7-8, and the index look-up speed is tabulated in Table 7-3. Thus, we test the effect of record size on the three index structures.

Table 7-1 shows the millions of records inserted into the geometric file after ten hours of insertions and concurrent updates to the index structure. For comparison we present the number of records inserted into a geometric file when no index structure is maintained (the no index column). It is clear that the subsample-based index structure performs the best on insertions, with performance comparable to the no index option. This difference reflects the cost of concurrently maintaining the index structure. The segment-based index structure does the next best. It is slower than the subsample-based index structure because of the higher number of seeks performed during start-up. Recall that during start-up the segment-based index must write a B+-tree for each segment.


Table 7-2. Query timing results for 1KB records, |R| = 10 million, and |B| = 50k.
All times are in seconds; in each group the first row is a point query and the
remaining rows are range queries of increasing selectivity.

                      Index time   File time   Total time
    Segment-based
      Point query     38.2890      0.0226      38.3116
      Range           40.2477      0.1803      40.2480
      Range           43.2856      0.8766      44.1622
      Range           45.6276      6.2571      51.8847
    Subsample-based
      Point query     0.87551      0.02382     0.89937
      Range           1.12740      0.15867     1.28607
      Range           1.74911      1.10544     2.85455
      Range           2.09980      5.96637     8.06617
    LSM-Tree
      Point query     0.00012      0.01996     0.02008
      Range           0.00015      0.01263     0.01278
      Range           0.00019      0.79358     0.79377
      Range           0.00056      5.82210     5.82266

Once the reservoir is initialized, both the segment-based and the subsample-based index structures perform an equal number of disk seeks. Finally, the LSM-tree-based index structure is the slowest of the three. The LSM-tree maintains the index by processing insertions and deletions more aggressively than the other two options, demanding more rolling merges and more disk seeks per buffer flush.

Table 7-1 also shows the insertion figures for the smaller, 200B record size. Not surprisingly, all three index structures show similar insertion patterns, but since they have to process a larger number of records, the insertion rates are slower than in the case of the 1KB record size. We also observed and plotted the disk footprint sizes for the three index structures (Figure 7-7 and Figure 7-8). As expected, all three index structures initially grow fairly quickly. The segment-based and the subsample-based index structures stabilize soon after the reservoir is filled, whereas the LSM-Tree-based structure stabilizes a little later, when the removal of stale records by the rolling merges stabilizes.

The subsample-based index structure has the largest footprint (almost 1/5th of the geometric file size). This is expected, as stale index records are removed from the B+-trees only when the entire subsample decays.


Table 7-3. Query timing results for 200-byte records, |R| = 50 million, and |B| = 250k.
All times are in seconds; in each group the first row is a point query and the
remaining rows are range queries of increasing selectivity.

                      Index time   File time   Total time
    Segment-based
      Point query     6.2488       0.0338      6.2826
      Range           9.6186       0.1267      9.7453
      Range           12.9885      0.9288      13.9173
      Range           17.6891      5.9754      23.6645
    Subsample-based
      Point query     2.50717      0.0156      2.5227
      Range           4.92744      0.1763      5.1037
      Range           7.2387       0.8637      8.1024
      Range           9.9837       6.1363      16.1200
    LSM-Tree
      Point query     0.00505      0.0174      0.0224
      Range           0.00967      0.1565      0.1661
      Range           0.01440      0.8343      0.8487
      Range           0.05987      4.9961      5.0559

On the other hand, the segment-based index structure has the smallest footprint, as at every buffer flush all stale records are removed from the index structure. This results in a very compact index structure. The disk space usage of the LSM-Tree-based index structure lies between these two index structures. Although at every rolling merge stale records are removed from the part of the index structure that is merging, not all of the stale records in the structure are removed at once. As soon as the rate of removal of stale records stabilizes, the disk footprint also becomes stable.

Finally, we compared the index look-up speed of these three index structures. We report index look-up and geometric file access times for queries of different selectivities. As expected, the geometric file access time remains constant irrespective of the index structure option and increases linearly as the query produces more output tuples. The index look-up time varied for the three index structures. The segment-based index structure (the slowest) was an order of magnitude slower than the LSM-Tree-based index structure (the fastest). This is mainly because the segment-based index structure requires index look-ups in several thousand B+-trees for any selectivity, whereas the LSM-Tree-based structure uses a single LSM-Tree, requiring a small, constant number of seeks. The performance of the subsample-based index structure lies in between these two extremes.


In general, the subsample-based index structure gives the best build time with reasonable index look-up speed, at the cost of a slightly larger disk footprint. The LSM-Tree-based index structure makes use of a reasonable amount of disk space and gives the best query performance, at the cost of a slow insertion rate (build time). The segment-based index structure gives comparable build time and has the most compact disk footprint, but suffers considerably when it comes to index look-ups.


Random sampling is a ubiquitous data management tool, but relatively little research from the data management community has been concerned with how to actually compute and maintain a sample. In this dissertation we have considered the problem of random sampling from a data stream, where the sample to be maintained is very large and must reside on secondary storage. We have developed the geometric file organization, which can be used to maintain an online sample of arbitrary size with an amortized cost of O(ω log|B|/|B|) random disk head movements for each newly sampled record. The multiplier ω can be made very small by making use of a small amount of additional disk space.

We have presented a modified version of the classic reservoir sampling algorithm that is exceedingly simple and is applicable to biased sampling using any arbitrary user-defined weighting function f. Our algorithm computes, in a single pass, a biased sample Ri (without replacement) of the i records produced by a data stream.

We have also discussed certain pathological cases where our algorithm can provide a correctly biased sample only for a slightly modified bias function f'. We have analytically bounded how far f' can be from f in such a pathological case. We have also experimentally evaluated the practical significance of this difference.

We have also derived the variance of a Horvitz-Thompson estimator making use of a sample computed using our algorithm. Combined with the Central Limit Theorem, the variance can then be used to provide bounds on the estimator's accuracy. The estimator is suitable for the SUM aggregate function (and, by extension, the AVERAGE and COUNT aggregates) over a single database table for which the reservoir is maintained.

We have developed efficient techniques which allow a geometric file to itself be sampled in order to produce smaller data objects. We considered two sampling techniques: (1) batch sampling, where the sample size is known beforehand, and (2) online sampling, which implements an iterative function GetNext to retrieve one sample record at a time. The goal of these algorithms was to efficiently support further sampling of a geometric file by making use of its own structure.


[1] Das, A., Gehrke, J., Riedewald, M.: Approximate join processing over data streams. In: ACM SIGMOD International Conference on Management of Data (2003)

[2] Acharya, S., Gibbons, P., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: ACM SIGMOD International Conference on Management of Data (2000)

[3] Acharya, S., Gibbons, P., Poosala, V., Ramaswamy, S.: Join synopses for approximate query answering. In: ACM SIGMOD International Conference on Management of Data (1999)

[4] Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: The Aqua approximate query answering system. In: ACM SIGMOD International Conference on Management of Data (1999)

[5] Aggarwal, C.C.: On biased reservoir sampling in the presence of stream evolution. In: Proceedings of the 32nd International Conference on Very Large Data Bases, p. 607. VLDB Endowment (2006)

[6] Arge, L.: The buffer tree: A new technique for optimal I/O-algorithms. In: International Workshop on Algorithms and Data Structures (1995)

[7] Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: SODA '02: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, p. 633. Society for Industrial and Applied Mathematics (2002)

[8] Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: ACM SIGMOD International Conference on Management of Data (2003)

[9] Bayer, R., McCreight, E.M.: Organization and maintenance of large ordered indexes. In: SIGFIDET Workshop, p. 107 (1970)

[10] Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: ICDE '06: Proceedings of the 22nd International Conference on Data Engineering, p. 6. IEEE Computer Society, Washington, DC, USA (2006)

[11] Fan, C., Muller, M., Rezucha, I.: Development of sampling plans by using sequential (item by item) selection techniques and digital computers. Journal of the American Statistical Association 57, 387 (1962)

[12] Jermaine, C., Datta, A., Omiecinski, E.: A novel index supporting high volume data warehouse insertion. In: International Conference on Very Large Data Bases (1999)

[13] Jermaine, C., Omiecinski, E., Yee, W.: The partitioned exponential file for database storage management. In: International Conference on Very Large Data Bases (1999)

[14] Chaudhuri, S., Das, G., Datar, M., Motwani, R., Narasayya, V.: Overcoming limitations of sampling for aggregation queries. In: ICDE (2001)


[15] Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach for approximate answering of aggregate queries. In: ACM SIGMOD International Conference on Management of Data (2001)

[16] Cochran, W.: Sampling Techniques. Wiley and Sons (1977)

[17] Transaction Processing Performance Council: TPC-H Benchmark. http://www.tpc.org (2004)

[18] Cranor, C., Gao, Y., Johnson, T., Shkapenyuk, V., Spatscheck, O.: Gigascope: High-performance network monitoring with an SQL interface. In: ACM SIGMOD International Conference on Management of Data (2002)

[19] Cranor, C., Johnson, T., Spatscheck, O., Shkapenyuk, V.: Gigascope: A stream database for network applications. In: ACM SIGMOD International Conference on Management of Data (2003)

[20] Cranor, C., Johnson, T., Spatscheck, O., Shkapenyuk, V.: The Gigascope stream database. IEEE Data Engineering Bulletin 26(1), 27 (2003)

[21] Dobra, A., Garofalakis, M., Gehrke, J., Rastogi, R.: Processing complex aggregate queries over data streams. In: ACM SIGMOD International Conference on Management of Data (2002)

[22] Duffield, N., Lund, C., Thorup, M.: Charging from sampled network usage. In: IMW '01: Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement, p. 245. ACM Press, New York, NY, USA (2001)

[23] Estan, C., Naughton, J.F.: End-biased samples for join cardinality estimation. In: ICDE '06: Proceedings of the 22nd International Conference on Data Engineering, p. 20. IEEE Computer Society, Washington, DC, USA (2006)

[24] Estan, C., Varghese, G.: New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst. 21(3), 270 (2003)

[25] Olken, F., Rotem, D.: Random sampling from B+ trees. In: International Conference on Very Large Data Bases (1989)

[26] Olken, F., Rotem, D.: Random sampling from database files - a survey. In: International Working Conference on Scientific and Statistical Database Management (1990)

[27] Olken, F., Rotem, D., Xu, P.: Random sampling from hash files. In: ACM SIGMOD International Conference on Management of Data (1990)

[28] Ganguly, S., Gibbons, P., Matias, Y., Silberschatz, A.: Bifocal sampling for skew-resistant join size estimation. In: ACM SIGMOD International Conference on Management of Data (1996)


[29] Gemulla, R., Lehner, W., Haas, P.J.: A dip in the reservoir: Maintaining sample synopses of evolving data sets. In: Proceedings of the 32nd International Conference on Very Large Data Bases, p. 595. VLDB Endowment (2006)

[30] Gunopulos, D., Kollios, G., Tsotras, V., Domeniconi, C.: Approximating multi-dimensional aggregate range queries over real attributes. In: ACM SIGMOD International Conference on Management of Data (2000)

[31] Haas, P.: The need for speed: Speeding up DB2 using sampling. IDUG Solutions Journal (2003)

[32] Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. In: ACM SIGMOD International Conference on Management of Data, pp. 287-298 (1999)

[33] Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: ACM SIGMOD International Conference on Management of Data, p. 171 (1997)

[34] Gehrke, J., Korn, F., Srivastava, D.: On computing correlated aggregates over continual data streams. In: ACM SIGMOD International Conference on Management of Data (2001)

[35] Jermaine, C.: Robust estimation with sampling and approximate pre-aggregation. In: International Conference on Very Large Data Bases (2003)

[36] Jermaine, C., Pol, A., Arumugam, S.: Online maintenance of very large random samples. In: ACM SIGMOD International Conference on Management of Data, p. 299 (2004)

[37] Hellerstein, J.M., Avnur, R., Raman, V.: Informix under CONTROL: Online query processing. Data Mining and Knowledge Discovery 4(4), 281 (2000)

[38] Jones, T.: A note on sampling from a tape file. Communications of the ACM 5, 343 (1964)

[39] Vitter, J.S., Wang, M.: Approximate computation of multidimensional aggregates of sparse data using wavelets. In: ACM SIGMOD International Conference on Management of Data (1999)

[40] Kolonko, M., Wasch, D.: Sequential reservoir sampling with a nonuniform distribution. ACM Trans. Math. Softw. 32(2), 257 (2006). DOI http://doi.acm.org/10.1145/1141885.1141891

[41] Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: VLDB Conference (2002)

[42] Johnson, N.L., Kotz, S.: Discrete Distributions. Houghton Mifflin (1969)

[43] Olken, F.: Random Sampling from Databases. Ph.D. dissertation (1993)

[44] O'Neil, P., Cheng, E., Gawlick, D., O'Neil, E.: The log-structured merge-tree. Acta Informatica 33, 351 (1996)


[45] Ousterhout, J.K., Douglis, F.: Beating the I/O bottleneck: A case for log-structured file systems. Operating Systems Review 23(1), 11 (1989)

[46] Gibbons, P.B., Matias, Y., Poosala, V.: Fast incremental maintenance of approximate histograms. ACM Transactions on Database Systems 27(3), 261 (2002)

[47] Pol, A., Jermaine, C.: Biased reservoir sampling. IEEE Transactions on Knowledge and Data Engineering

[48] Pol, A., Jermaine, C., Arumugam, S.: Maintaining very large random samples using the geometric file. VLDB Journal (2007)

[49] Shao, J.: Mathematical Statistics. Springer-Verlag (1999)

[50] Thompson, M.E.: Theory of Sample Surveys. Chapman and Hall (1997)

[51] Toivonen, H.: Sampling large databases for association rules. In: International Conference on Very Large Data Bases (1996)

[52] Ganti, V., Lee, M.-L., Ramakrishnan, R.: ICICLES: Self-tuning samples for approximate query answering. In: International Conference on Very Large Data Bases (2000)

[53] Vitter, J.: Random sampling with a reservoir. ACM Transactions on Mathematical Software 11(1), 37 (1985)

[54] Vitter, J.: An efficient algorithm for sequential random sampling. ACM Transactions on Mathematical Software 13(1), 58 (1987)


Abhijit Pol was born and brought up in the state of Maharashtra in India. He received his Bachelor of Engineering from Government College of Engineering Pune (COEP), University of Pune, one of the most prestigious and oldest engineering colleges in India, in 1999. Abhijit majored in mechanical engineering and obtained a distinguished record; he ranked second in the university merit ranking. He was employed in the Research and Development department of Kirloskar Oil Engines Ltd for one year. Abhijit received his first Master of Science from the University of Florida in 2002, majoring in industrial and systems engineering. Abhijit then worked as a researcher in the Department of Computer and Information Science and Engineering at the University of Florida. He received his second Master of Science and his Doctor of Philosophy (Ph.D.) in computer engineering in 2007.

During his studies at the University of Florida, Abhijit coauthored a textbook titled Developing Web-Enabled Decision Support Systems. He taught the Web-DSS course several times in the Department of Industrial and Systems Engineering at the University of Florida. He presented several tutorials at workshops and conferences on the need and importance of teaching DSS material, and he also taught at two instructor-training workshops on DSS development.

Abhijit's research focus is in the area of databases, with special interests in approximate query processing, physical database design, and data streams. He has presented research papers at several prestigious database conferences and performed research at the Microsoft Research Lab. He is now a Senior Software Engineer in the Strategic Data Solutions group at Yahoo! Inc.