
Sampling-Based Randomization Techniques for Approximate Query Processing

Permanent Link: http://ufdc.ufl.edu/UFE0021217/00001

Material Information

Title: Sampling-Based Randomization Techniques for Approximate Query Processing
Physical Description: 1 online resource (165 p.)
Language: english
Creator: Joshi, Shantanu Sharad
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2007

Subjects

Subjects / Keywords: approximation, databases, indexing, querying, sampling
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: The past two decades have seen a significant amount of research directed towards data warehousing and the efficient processing of analytic queries. This is a daunting task due to the massive sizes of data warehouses and the complexity of analytical queries. Standard published benchmark results such as TPC-H show that many typical queries can require several minutes to execute even on sophisticated hardware. This cost is especially hard to justify for ad hoc, exploratory data analysis. One way to speed up such exploratory queries is to rely on approximate results. This approach is especially promising if approximate answers and their error bounds can be computed in a small fraction of the time required to execute the query to completion. Random samples can be used effectively to perform such estimation. However, two important problems must be addressed before random samples can be used for estimation. First, retrieving random samples from a database is generally very expensive, so index structures must be designed that permit efficient random sampling under arbitrary selection predicates. Second, approximate computation of arbitrary queries generally requires complex statistical machinery, so reliable sampling-based estimators must be developed for different types of analytic queries. My research addresses these two problems by making the following contributions: (a) a novel file organization and index structure called the ACE Tree, which permits efficient random sampling from an arbitrary range query; (b) sampling-based estimators for aggregate queries that have a correlated subquery where the inner and outer queries are related by the SQL EXISTS, NOT EXISTS, IN, or NOT IN clause; and (c) a stratified sampling technique for estimating the result of aggregate queries with highly selective predicates.
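
As a minimal sketch of the estimation idea outlined in the abstract (not the dissertation's ACE Tree or its estimators), the following Python fragment computes an approximate SUM and a CLT-style 95% error bound from a uniform random sample. The function approximate_sum and the toy table are hypothetical names introduced here for illustration only.

import math
import random

def approximate_sum(table, predicate, sample_size, z=1.96):
    """Estimate SUM(value) over rows satisfying predicate from a uniform sample.

    Returns (estimate, half_width): the true sum lies within
    estimate +/- half_width with roughly 95% confidence (z = 1.96),
    assuming the sample is large enough for the normal approximation.
    """
    n = len(table)
    sample = random.sample(table, sample_size)  # uniform, without replacement
    # Per-row contribution: the row's value if the predicate holds, else 0.
    contrib = [row["value"] if predicate(row) else 0.0 for row in sample]
    mean = sum(contrib) / sample_size
    var = sum((c - mean) ** 2 for c in contrib) / (sample_size - 1)
    estimate = n * mean  # scale the sample mean up to the whole table
    # Standard error of the scaled estimate, with finite-population correction.
    fpc = (n - sample_size) / (n - 1)
    se = n * math.sqrt(var / sample_size * fpc)
    return estimate, z * se

# Hypothetical usage, mimicking: SELECT SUM(value) FROM t WHERE key BETWEEN 100 AND 200
table = [{"key": random.randint(0, 1000), "value": random.random()}
         for _ in range(100000)]
est, half = approximate_sum(table, lambda r: 100 <= r["key"] <= 200, sample_size=2000)
print("approx sum = %.1f +/- %.1f" % (est, half))

Note that this sketch sidesteps the dissertation's first problem by sampling an in-memory list; over an on-disk relation, each uniform draw would normally cost a random I/O, which is precisely the retrieval cost an index such as the ACE Tree is designed to avoid.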
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Shantanu Sharad Joshi.
Thesis: Thesis (Ph.D.)--University of Florida, 2007.
Local: Adviser: Jermaine, Christopher.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2007
System ID: UFE0021217:00001


1609 F20101210_AACBFD joshi_s_Page_003.pro
bfa4765d09ea2549c13d7039c26c22fc
4df93d0b7c3ae1e150d105a89e8939118c6714e5
F20101210_AACBEO joshi_s_Page_149.tif
9c8b0ad575e38e2b41b33a0c5c0a390f
d67c58cce94d746ba7c1a9a02ad84e7293636ef1
125054 F20101210_AACAZI joshi_s_Page_150.jp2
e6a251ba40a671c56ab0ceaf048a8d19
34130b0a1ea254e257f201d1c84a958448f9ec3e
1051924 F20101210_AACAYV joshi_s_Page_134.jp2
e9953d78da6004a0785eb53e2327c7d3
06649d27c94150537e0bbbbc48202d5d63a646d9
52580 F20101210_AACBFE joshi_s_Page_004.pro
32184f61419b526efbaf599fa34b9f66
c15f3d5ade1b9dd18005a63b0e83b52117290cfa
F20101210_AACBEP joshi_s_Page_150.tif
abfcbf6e752b92414ad6357c06ef5e08
283850f441fd33cb032672931a31dfea6d55f1a7
110676 F20101210_AACAZJ joshi_s_Page_151.jp2
8f8af4c2abb0720d2aa52c1cf57bf192
a3339388b85f8d8cdb863e4e3735ac62b7204ade
55664 F20101210_AACAYW joshi_s_Page_135.jp2
6c37427e8ac1d7dad6996fcc74d0aad4
2d8ba2ca4189b64a3b9506543e55ff43d8a3c241
54553 F20101210_AACBFF joshi_s_Page_005.pro
089e979caf0366a0a0b1cf1a46d34936
36509db96e0867189f5911be4c1d1c79c6da133b
F20101210_AACBEQ joshi_s_Page_152.tif
c7fdd5ac869555ac6a4bc845cc4b223d
1b06e8131cc0999435afda1dad67088a8855db8b
F20101210_AACAZK joshi_s_Page_152.jp2
47256e417a7e0f2167832d79366ace8e
4882a5e6f526b070e3c41f2cfe22ad20662a099c
509374 F20101210_AACAYX joshi_s_Page_138.jp2
14bf51d6ca9be295eed918ad47be074e
6cdc26cd24b2941d200b2cdfcee84bc7ccdb2e34
32286 F20101210_AACBFG joshi_s_Page_007.pro
6ef0563463044e9dea8d7478071bdd93
df93acf33c265a49002ee93becfeb8d8aa38001a
F20101210_AACBER joshi_s_Page_154.tif
b5276eef30ca2276ceb2203235a94e7d
12eba49b2a38f640b8c960d46280d920875d32ec
65514 F20101210_AACAZL joshi_s_Page_153.jp2
786da6de90f3a2674fcdab350991a058
914b5453ba93dc6d20dd031d20d1fe45e9d98462
1051936 F20101210_AACAYY joshi_s_Page_139.jp2
b5cf242b75a7c571dea2ec8fa27b5144
b1351b10560c38cedbf422b28a983953cbbdf729
77864 F20101210_AACBFH joshi_s_Page_008.pro
6af7f794abfa3151880fc1a6b90932a2
153692dafbba501bb36160c06358008c6b8f3461
F20101210_AACBES joshi_s_Page_155.tif
2dbee7ba999da9f2db8d01bd6da9460f
667fb49d55adff12adb5e2cd537eea452433c940
70130 F20101210_AACAZM joshi_s_Page_154.jp2
b80b2fdda33c511972415c0b500b8ac9
c28d5af8f28175f54ece8bbcf69cbca4efc28da0
41682 F20101210_AACAYZ joshi_s_Page_140.jp2
1918dac6d00f991557b4ff78c9c18db7
2bbc5b905e69820c582393f81ba86e7efd443c0b
66944 F20101210_AACBFI joshi_s_Page_009.pro
c6872078dda0af351808a9242c7ce073
d0864ba1eee99db751a18637d4c4e174f9374db8
F20101210_AACBET joshi_s_Page_156.tif
fbf6b270851d1f9e0d4dbf9c89654195
ec9eac85eca093b19c7e8ea6194e8035e1024065
64384 F20101210_AACAZN joshi_s_Page_155.jp2
8d063d95a4ceed388eb3f4823e2ab6a6
1d27844d585978854bf92c062320184142b9688f
34033 F20101210_AACBFJ joshi_s_Page_010.pro
4bec23ce3085ffc1e6aca3c4342f442b
e5e7e2317828ab80cb2421a59b21801fb90856f7
F20101210_AACBEU joshi_s_Page_158.tif
8fb9a1c605a12cbbe834d187d64abb86
d889ba83e7c0879bfc6c09cf4df9efbf68de3cbf
401864 F20101210_AACAZO joshi_s_Page_156.jp2
67b8d3cb7c84579b76490209f3b19d28
b19b079d98f86f888af4c934a1c9698d2dc27a45
F20101210_AACBEV joshi_s_Page_159.tif
118512037c1f99370bd7e4d08c5894a0
f081cd0d6932eb987a7f9f36755b66239eefc641
109565 F20101210_AACAZP joshi_s_Page_157.jp2
623fef760ef17719b7a870bd43ff8430
1c90d6ebc00a16378a18b0375bc5a66022b05534
51839 F20101210_AACBFK joshi_s_Page_011.pro
32dc9d21d87c8c6173aee0ca2ed139f8
3463f85bc569ec5941230cb739cf19e474fc57ff
F20101210_AACBEW joshi_s_Page_160.tif
55abc9c337852c9367f616088bf0bbff
621ee7e02ee731dd10e78f252069726357acb6b6
124262 F20101210_AACAZQ joshi_s_Page_158.jp2
973aae5f49e2ab3e4ee83b06bf587e31
46bbe8bd5d8697316f37aee4ab6343f7717cca92
8624 F20101210_AACBFL joshi_s_Page_012.pro
9ec62af291b542046b6bd63b3e937fd6
0c41fb25a5e664beaa024f6eb725faba236eb279
F20101210_AACBEX joshi_s_Page_161.tif
c07d6f70fe8dd89e08cb79c4facb6d26
c4272dee2c18fb8643953495713a7998541993d7
129337 F20101210_AACAZR joshi_s_Page_159.jp2
45dc74cc59f92f393e6ef41b86c45504
816b2a94c23f3968b407ed28656219db249d08fe
9296 F20101210_AACBGA joshi_s_Page_032.pro
c8f650ba007aed34cefd71b91b08ab5a
656fb15cf8af8a5dbde9c8e6070c91ae1f51add9
52690 F20101210_AACBFM joshi_s_Page_014.pro
59e6224e5169e5eefd572b6911acca80
e33a7fd6c7b54ea43c7fe9807dd9e093e7ba1c38
F20101210_AACBEY joshi_s_Page_162.tif
f12ab25c3d50ba86b5ae68d9b5498b64
40f24af3e359e494250ddd892c2393718de67c57
119610 F20101210_AACAZS joshi_s_Page_160.jp2
7f7f190370095a34eaa495f2a601cc2e
a9eb4150d52cc1964f8fea7492366d5f6b9e3dbf
59205 F20101210_AACBGB joshi_s_Page_034.pro
c17312a93e389d6fea955137c54668d5
14bce796dfd746d2efac75a927d9644bdd9d5dd3
61265 F20101210_AACBFN joshi_s_Page_015.pro
9c1dd4dfbee9a74c98f1b089277daf08
a1f784a32839f0d9b1d57500802e7fcbd05cea3e
F20101210_AACBEZ joshi_s_Page_164.tif
d1c6d382964316b1344bb1da103cda0b
2a115cc78e592578b4c33108896afce33c90eb24
121531 F20101210_AACAZT joshi_s_Page_161.jp2
a60adff3eb51269a564c16f4f4a10b1e
c2029f763dc38bbc8212c682f01d0f19f5255f5e
51074 F20101210_AACBGC joshi_s_Page_036.pro
25a263c738f174704eef4929e136f999
fc683875aae0015defb514ac7d19e46e2681eb12
55712 F20101210_AACBFO joshi_s_Page_016.pro
ac18f72037a17b9b1bdfddb5b53646d8
da7e640ef91808f7354f901a36e40de9e9df4eee
119607 F20101210_AACAZU joshi_s_Page_162.jp2
23a53f8b2231c84969837d5fb29bf4b1
523f829ca2aa3eaceae9e182919a0424cabc0ba5
53097 F20101210_AACBGD joshi_s_Page_037.pro
05a1d5455714d6a227ebb6f3a63ca941
b26909053bac1ddb2282c96087eafef4b131ee98
33505 F20101210_AACBFP joshi_s_Page_017.pro
467067dce7756d91f30fa95ee632546b
519d72d3fd38d0cd553ccc7323291a41e42877fb
115224 F20101210_AACAZV joshi_s_Page_163.jp2
ff8c67897b2a8f4943786612ea09d43e
8502321cfd85f2d2a6024d0a226930737d8cca5f
55840 F20101210_AACBGE joshi_s_Page_038.pro
b7f5b09c15acc7cd511a74270f3e0729
6814722199e3e6936c338a1beb62ce06f50a777b
54916 F20101210_AACBFQ joshi_s_Page_019.pro
cd30bdba127b683d41e3cf1ab9d87c9b
f861b044657a5bb500d26b295b7a8c2fe888eb96
61527 F20101210_AACAZW joshi_s_Page_164.jp2
460723b04f1257c7a7ef5e19d8a8bea0
1f6046e9b6863ab62dc7ea7b1c5da32f4a75f992
49413 F20101210_AACBGF joshi_s_Page_039.pro
55d89ffae520ad8a75359acb33b34f85
846aecaaeddff3defcc18a506ed0d146cda3d1f0
59068 F20101210_AACBFR joshi_s_Page_021.pro
d09d14eb2c2baf532ce2a097caaf532d
e45c2ea17f125dce35a275ba48000474e3bad88f
49285 F20101210_AACAZX joshi_s_Page_165.jp2
6e23eaab2144141a3b44e19115fe5fe8
2aa52d11508db37621c1fa149cae0240e145cf45
32370 F20101210_AACBGG joshi_s_Page_040.pro
da805ea8f3e65243f82d60b19c986953
aab95fe7f714df6f035d80c1fc4a7b96faaf9376
58778 F20101210_AACBFS joshi_s_Page_022.pro
a6f948483495a96e4f8a10dda20dd8bc
9e673c8a007819886ccfe256039c91900903f933
F20101210_AACAZY joshi_s_Page_002.tif
2197ffbce6b68069eb539141ec7f9ce3
52e7eae64446bab8a547fd63da0376a5a127b2f7
42429 F20101210_AACBGH joshi_s_Page_041.pro
be4007071a5ee356a4fb08b27e8d530f
0990b2b38582e84d4cbb1cfe16dadfdb34909347
60280 F20101210_AACBFT joshi_s_Page_024.pro
930e926a7fbdb75c858e2de3a6512feb
01b1c6b1b2fb418609bff2fe41bed3f2da193524
F20101210_AACAZZ joshi_s_Page_003.tif
0430459a9a1a77f5aa48053369641bfe
1579181ea5e3e1272b52ab16c5dd5c8e29e586c4
30157 F20101210_AACBGI joshi_s_Page_043.pro
d778e39879e4c31837993ec309d18511
0acf826fb0d9b1a71f59e749f7bc4ce551c2d061
56704 F20101210_AACBFU joshi_s_Page_025.pro
1f5af74ae55bb128593b004b6ed37ef0
8cd241f4b2c21e1a1d220e526314a0612307cfba
54237 F20101210_AACBGJ joshi_s_Page_044.pro
a34eeb1ea6642eb92cba7ce22e48993d
0656644a2445026fa61dbc5c17a9f3b0a9887d04
59742 F20101210_AACBFV joshi_s_Page_026.pro
2149b44292b921561b104a6b0b8e93c6
af3e6e06c5331d1963fb54e23919e3bdd779a742
30068 F20101210_AACBGK joshi_s_Page_045.pro
62832f5a9c5faa13c08acd9b26a7a1f7
01a270aeec10d4d512371fe983ec633ddba85152
57485 F20101210_AACBFW joshi_s_Page_028.pro
0effe03f6e3bb4c4414d2c1f79fe873c
7e9ab225a38006682925f5dfbd072cfacaba15df
58467 F20101210_AACBFX joshi_s_Page_029.pro
d238ceb5734172d74761fe6662ca57dc
bc171f280aa7e50fc63d8967b06a6f70a24ea7fb
58468 F20101210_AACBHA joshi_s_Page_065.pro
2ad59ce9bf739506286d97d508229e05
a564739ea57a6277478dad1783f36d3451e9e3e8
59890 F20101210_AACBGL joshi_s_Page_046.pro
fa2e4a86b9e969bf918895a96dd43c8b
849c77f6533cb815761b965dde64d6065e8f82e4
58916 F20101210_AACBFY joshi_s_Page_030.pro
34d3b48be1be2cace893cee43db5be4f
5960e24f807753e9312cf63a2f1c59b177b44a8b
55973 F20101210_AACBHB joshi_s_Page_066.pro
5d92173231649ca38165bc45f4329056
7f42a3d2ff088ddc4a47cad66fb1d9d2be4f1d49
42407 F20101210_AACBGM joshi_s_Page_047.pro
464b926dd9ad8814bb459581ff4808a1
79221eab10f054a1035e7fa542702936f86cfa87
60751 F20101210_AACBFZ joshi_s_Page_031.pro
080246d27ab1999fe1498cc665d3a349
f6dc16f2835d180fa91439bf07d8fb9b89f67549
40666 F20101210_AACBHC joshi_s_Page_067.pro
793a422de4ae05d6b767b633ca6c68f7
c92289db1a75fedddca445ff291dea4ddfcfcca0
39402 F20101210_AACBGN joshi_s_Page_048.pro
fae06c2e2b19fdde744eb6720fc5385d
2f1657fc4a35574e523c2041bb308f4ea1613163
35597 F20101210_AACBHD joshi_s_Page_068.pro
9b115aa7c3b3870599f3bea215dc321d
3920d194bde9d7a981fe4d97739a6a4d92e827b2
45181 F20101210_AACBGO joshi_s_Page_049.pro
5a89f356a1d1344b91d4b905a9c25b41
055bc5f6bf99586752158572447b96e0859ec8fc
38967 F20101210_AACBHE joshi_s_Page_069.pro
a9a851ec5de5c2a4d4d426fb6bc5f1a2
d5b39a5b31e1a15704642fc5304d3f5919dbd5d7
49934 F20101210_AACBGP joshi_s_Page_050.pro
5853c7b674d59ad6166a845cb5807228
96fc54c10370439615fa1ec29d669d3cd20a3e33
60331 F20101210_AACBHF joshi_s_Page_070.pro
cc891251d04f1de3189470049574415b
d76ff08a7fff32abdc330485505342147e071174
54668 F20101210_AACBGQ joshi_s_Page_053.pro
691d3bd804ffd7dabaa02bb18e107937
d428a3129ba72a24891ab64ae5db6f1d845fff9d
60394 F20101210_AACBHG joshi_s_Page_071.pro
0ee2f73218353068267a201a437e38ba
1cc3e78972cbacd61e4ba80dd830a0ca506c1853
60338 F20101210_AACBGR joshi_s_Page_055.pro
4bcc39f889d75dee315b92c67fa03cdf
7e5b11ade946704d84ea9ea7156c72c5752b09f7
14335 F20101210_AACBHH joshi_s_Page_072.pro
f0318a74a781af43a16a3f2c426b1c3f
8f0ff33598724a03ade40874b991a109a2b17d09
35756 F20101210_AACBGS joshi_s_Page_056.pro
3f66582cc2d6b0605f1919b1bdc0faf7
9fa58b0a8b6f26447a7f608efbe3060a907d02a6
43558 F20101210_AACBHI joshi_s_Page_073.pro
4bf53b8fbdf2028397441080b2680cf0
5aa5036b3ca0a589953e96144e9e34d6b9503db3
34936 F20101210_AACBGT joshi_s_Page_057.pro
97cf5dff50faa54bd73143820a9aa24a
e42f115c73bcb8cfb0522049b4d199d6b6e43579
41083 F20101210_AACBHJ joshi_s_Page_074.pro
fe1f2c106d8a5cf59f073fe9b921fa41
2cced72d4954b6c5c7edb92a7406a080e57696b4
28752 F20101210_AACBGU joshi_s_Page_058.pro
7f3c345a1e2dc21f13ad13b47b031946
795df2d7af6fdcf3aca769d9bef96852bba55a81
37482 F20101210_AACBHK joshi_s_Page_075.pro
a1696a1aae15d7cbe349d3e7780f7f57
57a496031981686b5c5b83675d95c6b1037aeef8
52340 F20101210_AACBGV joshi_s_Page_059.pro
54ef49a3d1017be1b03d752b54e81556
615ea07144a0e43a279b0d8cf4d223b8deb3d65b
58738 F20101210_AACBHL joshi_s_Page_076.pro
578247d26b4c438d8cd11bee854d2214
e9761e064577429eb01fd8295b113c8bbeb4648e
36885 F20101210_AACBGW joshi_s_Page_060.pro
4c058c69ce43587a0ef5465f4073e21f
39be7943311b006c35bfdd016f1482d7c1c27155
48338 F20101210_AACBIA joshi_s_Page_098.pro
397247540755b23c3074b6c8dd0481ef
a383295af96b91a6802d91677f9be11ec0c28bd4
29715 F20101210_AACBGX joshi_s_Page_061.pro
7831494e58c5e2f3215918b180be3386
68cc87d41e4c48d21fbf10c62a25b51ea032dcd3
52867 F20101210_AACBHM joshi_s_Page_077.pro
a8e66f510d70392636a26013461e35ef
5518ba2123f53a654832fca47b01f02d1097138b
39207 F20101210_AACBGY joshi_s_Page_063.pro
b94f2c50758d430ffd9c54bdc6dff5b7
d09c8c50464c45d6dc05044db23e7d1568bc960a
59155 F20101210_AACBIB joshi_s_Page_099.pro
11367f6d4ca0f00a68d67106ba3bd5fb
a4b857c37dcdc658f0d6bfb5b4d9902781a38bf3
50984 F20101210_AACBHN joshi_s_Page_079.pro
3d64f8c7b30d94e1800d5cfa2c6c9edf
000eabf60cb463dbe1e20ccfaa62b19b6a7d70c8
18352 F20101210_AACBGZ joshi_s_Page_064.pro
42002565bb13dea4ee2308bb0a592af5
34cbd8fba3206ad877a1b8dd10aa1c8c42ce0756
49378 F20101210_AACBIC joshi_s_Page_102.pro
5d93397d7512fc1f126a8a3ea4e2ee5f
1e9632a5faa02e8f4ac5f0b18b12c9156878011e
56354 F20101210_AACBHO joshi_s_Page_081.pro
37a551917c1da524f7ee25ee274b60bc
faa760be590e166debec4024354d676e12d3a361
36613 F20101210_AACBID joshi_s_Page_103.pro
de185d5ea7dd4dded0714a807fd1a23d
cb9039ed5a2b1f12c19a79d827d532625f6935b0
44993 F20101210_AACBHP joshi_s_Page_082.pro
e6c53f2730b990044628a6fa716072b9
586f396a468cea4108c1133ea78c4341a89e3893
47748 F20101210_AACBIE joshi_s_Page_104.pro
c1d0e966f5f4a53a0ea2a732c05acbe1
7b6bff1681e0974b18e1f8db25c0119c05a55fd5
23615 F20101210_AACBHQ joshi_s_Page_083.pro
c8fc8bae12d695506784aa6260ae3614
40fa740d8d9373454ea79a007f8acdc34d4b076f
37806 F20101210_AACBIF joshi_s_Page_105.pro
e30e986ba099b0bf5422d69f4a229095
d899e354ea8957d20c0a3889ca65e643473f30bd
21810 F20101210_AACBHR joshi_s_Page_086.pro
9f992cff2f10b2ec526b5f4946ce1713
725b0158ba500e31358933f3c3366427b3997a30
31928 F20101210_AACBIG joshi_s_Page_107.pro
368b961b7070511fad5acd92d64474e1
8a4c2cca6006ab6bc36db25d480d1ca9b45921c7
27833 F20101210_AACBHS joshi_s_Page_087.pro
79d54062801a90fdf1f8a7280580aa25
b4ccf0e3097bf3333b18611b2a3d9f8d94114192
39756 F20101210_AACBIH joshi_s_Page_108.pro
8481e8536741ed19fa3bce3498fab159
751282ef901a71bbcd9b0aa28922d42a7d6b8084
24112 F20101210_AACBHT joshi_s_Page_088.pro
0057313ba9bc307fe74c33bd005f43a1
39c98b0850a26f41fe40f46fc1c372b6caac4a06
51099 F20101210_AACBII joshi_s_Page_109.pro
63995d8b0bcc25cb9bed0e91c433af6c
92e934b998fa7e04b658f16d16c2b6a8e1585a81
36803 F20101210_AACBHU joshi_s_Page_089.pro
b179ad0f6f90c95c60d92f2c22e63b0b
1ec849db4402a9f83392babfccefe63644b1295a
60418 F20101210_AACBIJ joshi_s_Page_110.pro
935f060eaec367daadf9e00e1868c168
0d7b46cf3296039c3c6bd335f2992520565989c1
57439 F20101210_AACBHV joshi_s_Page_091.pro
69857a60e3cbb62fe71290ff1721929a
062510853a77157ee73a829b09cc1af5e90fe125
57397 F20101210_AACBIK joshi_s_Page_111.pro
cdf28f658dfaa01494132e7a77840324
581b8e471352d780c9836f90c5473888e1e80d27
39003 F20101210_AACBHW joshi_s_Page_093.pro
11b612f1475d91186907a58b1e24dc4d
f79a17f75f8ea3d6d3acb5e3814dbf35a96b3784
55786 F20101210_AACBIL joshi_s_Page_112.pro
2f20c622839324c8c9bffbdafd083fa0
42a1411187f902a284bfb572c0751a80cc31faae
46101 F20101210_AACBHX joshi_s_Page_094.pro
cac10854433156de19ec17b8ab6f1932
b587c9392cdab1f60a491a52910984b82ecdcfd1
52703 F20101210_AACBJA joshi_s_Page_133.pro
ce09df546a66c55c9d55544b9fced7b2
a91c9d368c39a8671e10b05d410792fd6b5f8ef2
64217 F20101210_AACBIM joshi_s_Page_114.pro
5c8199991a9a09196d0176f37be0bde1
c5d9e9d11db47e67b133b7b619583f27333ec609
50246 F20101210_AACBHY joshi_s_Page_095.pro
f5639fe65c59c02bc1bac2efc5edfdbe
a0c2a2abcaa3e1865475d0364d149ecc354bd816
53341 F20101210_AACBJB joshi_s_Page_134.pro
5fe236da6cfaaf4fe3eee1824d239deb
135c742b11d12c772f1926deb2725e3b64e1dc28
41986 F20101210_AACBHZ joshi_s_Page_097.pro
c0ca22cbfb0d4136a9e05f69da06e142
99ddccd49fed8b443503947125fd9e918d946d1a
26137 F20101210_AACBJC joshi_s_Page_135.pro
bb32a067ee79183eb9eea35b0c9df1b3
68da7599a05cafa8fb8a1f31b0a43de1fb4b363a
63985 F20101210_AACBIN joshi_s_Page_115.pro
ccc012d63741128f2c402a77272357c9
486e880356094c81a23f12350f3ed0842ececbff
24889 F20101210_AACBJD joshi_s_Page_136.pro
6cf70c4dd1f9812debb010e42774a2ac
972427ef47f2d944952dd0e4499bb2c84e9b1cd2
61945 F20101210_AACBIO joshi_s_Page_117.pro
c3a02e091e51ebdbb07eac97cd0d92ed
16fb40167aa67062f30e3f2c4f011ce592faa76e
54998 F20101210_AACBJE joshi_s_Page_137.pro
0cf79fd4ee94808131183fedf711b638
0e039e086e788105c5ac7da230d5a893992c2f62
58358 F20101210_AACBIP joshi_s_Page_118.pro
0d73591b16aca27633535b8168dba3c6
6bd56cc36d38ff123795c8cbfa5b01e54946f445
F20101210_AACBJF joshi_s_Page_138.pro
ccf23a88d0890faa82be443a6792b0bf
c36a3595f96ed8961d0e9706792042046f8bf99d
54747 F20101210_AACBIQ joshi_s_Page_119.pro
d26a2ff92cd68ca7f7bd8dbd23d62b2e
864cbbf2f0fcbfc43399a4caa722f8ae98a34918
49224 F20101210_AACBJG joshi_s_Page_139.pro
0d57b1dafd6349833d9256091889196b
f72c5352f69688e3dba75e8af6586031e46df555
39329 F20101210_AACBIR joshi_s_Page_120.pro
c7d548becdc6e1005184d2ecd7f70ea0
264e551a4f93261e31e2ed3665c157e9804c09fc
19656 F20101210_AACBJH joshi_s_Page_140.pro
6d71bbb6afa8b39d3be1aff2d04d2c96
08dbfd7d249d0060542c137d064de74edc59ad34
49797 F20101210_AACBIS joshi_s_Page_121.pro
c3baf0edcb9d6ff0462d02e4e4fe4d1b
115589b0962411e159a4fdf81bfd7a4b9645abc9
31502 F20101210_AACBJI joshi_s_Page_141.pro
0cf552dbe38fa071f5250076a7ee3dac
f6e50f22f52c569608f022c9c5ba0c5fb1de660c
38473 F20101210_AACBIT joshi_s_Page_124.pro
ab9787c1ae888158c9a6220c163f65ab
0a0007c8610e80486c85ff25e10dbcd3683ce4e4
38678 F20101210_AACBJJ joshi_s_Page_142.pro
7f43652b28caeda2ac486b534388e548
0ce301d5bb6c9f88d2dd9a3e8158e301eec60876
34631 F20101210_AACBIU joshi_s_Page_125.pro
2f6f01ab8017eda4c882e64ea8df306c
3f5951b17ee4268a77acc2bf373ed4ee550b4bcc
59769 F20101210_AACBJK joshi_s_Page_144.pro
77c10b3e7b0a25511b28002ccb94e2e9
43dfaa8d2b7df05552ad3679349e8e0089bdbbf6
56948 F20101210_AACBIV joshi_s_Page_126.pro
c757418bfb1f742a575c079f8ada23fb
5cd8edfb397e2f1c258b4a33733ca40525894023
60343 F20101210_AACBJL joshi_s_Page_145.pro
5479f21aa10ef82f45ceaa064d771756
cc557df972c4e7fa778f3b306cfd82cff42eb826
47107 F20101210_AACBIW joshi_s_Page_127.pro
e1ab75bac1c429a8bcd0d3752467f378
88e546689852134826212a841115c0a9c47fc737
29251 F20101210_AACBKA joshi_s_Page_164.pro
f7f2844cb23abc4097d65483f9256ebe
4ed9ed8cc9d68dbcb995ec2c949467b6f66a31a0
50499 F20101210_AACBJM joshi_s_Page_146.pro
2cd90e313377ad745a32798b99ef427a
b02d83f326a559d307e7a6f9d00c3a33aa33ce70
67821 F20101210_AACBIX joshi_s_Page_128.pro
a03d69a76efcebea2ce61c33722f33f6
bfd9be7924e6c0b474d3f61fbdc947befeccb03c
21456 F20101210_AACBKB joshi_s_Page_165.pro
b006cd2a387768f6fff578afba39487b
1f0a2a002a6a6e3765b0f377d4c98f60276f52bc
62363 F20101210_AACBJN joshi_s_Page_148.pro
ca79827ab40390dcc6f02d3f23e8a924
9b47ab1fce5c349cc0547a1706b2fb494b0ee530
55531 F20101210_AACBIY joshi_s_Page_130.pro
13142cd0bbced1f32e4d8d0981b7a5f8
fe693fa324691ace2ddae1ea44537163d69b70bf
464 F20101210_AACBKC joshi_s_Page_001.txt
2b194c05dcadcb81290a054a1cafa171
b2b1714195c7ed04ec634b90b0c773affb2f83ad
47042 F20101210_AACBIZ joshi_s_Page_131.pro
01dd742f0c085e93bb07cd957be5c12c
56430383a0d8a1bb4dab89ed2d6c48201fedf5bc
81 F20101210_AACBKD joshi_s_Page_002.txt
708d5cb11bd2f0a374b97fa88a164972
d4125793789fc39d45bae3966741e70deba66962
61532 F20101210_AACBJO joshi_s_Page_149.pro
8509750dd623385e07ad41c8de15fd12
99e9c5fad410f4e488e76dfd24f9ebe00ceb8487
120 F20101210_AACBKE joshi_s_Page_003.txt
143d91a19aec4e1168bec09233da4f63
ed7ee3fbf32f08b5e765c9d3a0612491c527c8ea
59913 F20101210_AACBJP joshi_s_Page_150.pro
54f420f1533c9ada4c38c7313a964645
410ceccb0cc577ffcb061d60f81898ea2de5d569
2119 F20101210_AACBKF joshi_s_Page_004.txt
e7bb4760d969a95b6517b89152be5708
48e287332de768e0fe20c9138be2543eeabe8a93
53635 F20101210_AACBJQ joshi_s_Page_151.pro
d2e4f03d61cf1870a0f6400c0c031a98
f0f0eecc9d47bfcadb609b1402ea93059ac5c637
2621 F20101210_AACBKG joshi_s_Page_005.txt
c88ab11d4c04cb3238e3391d463c6910
9336e3d24b72da147bb12200ae8b82ce4735f00a
30246 F20101210_AACBJR joshi_s_Page_153.pro
0122f4e258cddc5f4a4b59413c594284
a7e6cbdfce0faacd620d4267596497c10eb98aa0
3240 F20101210_AACBKH joshi_s_Page_006.txt
8d96aa6299a62729cbbd1ee35ce96ec7
5667330cbfc8975c85eab2f32189b6189a7458ee
31810 F20101210_AACBJS joshi_s_Page_154.pro
a211d51d5019a5ce5e8f59daf2458dd1
b72a65c5a502d224748091b8a708cab6d5609ad6
1617 F20101210_AACBKI joshi_s_Page_007.txt
a69ecdf6543b836a7a4e679b24b391f5
fd098f11f64c0e9e7aa3fcee717d46ce8db376e3
25873 F20101210_AACBJT joshi_s_Page_155.pro
68349ec3e1992ee3b2514344fc573738
1edbff7d3c436063ba3a2f921274c1462bead759
F20101210_AACAHH joshi_s_Page_026.jp2
d9aa5b43f671b6a4d9f92025fdb6e3bc
fd5c84dbccab6d15616165833277c00752154f37
2860 F20101210_AACBKJ joshi_s_Page_009.txt
4cc3dd403b8d64682c1ae911971e243a
0def2d2871936bfc0658a9dfc696e3d932ac6922
15266 F20101210_AACBJU joshi_s_Page_156.pro
e56980ba1ef45ce3931591a57cf522d9
9d52d3d40c4aa6bfc3cfdb4362c7e3a29f24e9ac
57458 F20101210_AACAHI joshi_s_Page_113.pro
f7f3571575b16729873c6bc22206bad2
01bee6f1b5738ff4765d797e609fd1bef749f658
1460 F20101210_AACBKK joshi_s_Page_010.txt
6622b9b4b9e91f74314210ecab85d0be
e0651408fb43fa83f2348a4a47da05fe8a9ba18e
60387 F20101210_AACBJV joshi_s_Page_158.pro
7a57e58feaafbf46cc65c6b1b784d881
19321cf673529dfa39ee01693485ca95cd050c1f
F20101210_AACAHJ joshi_s_Page_069.tif
477198a95effdbba532e4defb319e4e6
2339b76d8f8979d4bd71713edb810acb9a12bd82
2226 F20101210_AACBKL joshi_s_Page_011.txt
43ea2c08aea0d120b1ba31be4c70e5b1
0d50047cdc8844f83cc6a231effc0ed5b1d9a547
63701 F20101210_AACBJW joshi_s_Page_159.pro
483d957b602f01935fb2d18cb6c3ac94
c347a08cd65baee144231c45c754b3f2002f3de9
F20101210_AACAHK joshi_s_Page_132.tif
b5c3358ea11f47e672197f45b71e27fa
0c0d7dc5850f78213d32503694f7eb3585e9e1a3
349 F20101210_AACBKM joshi_s_Page_012.txt
5e1e103bb8370097d9ab2a865aaadcfe
33abf91b3886391cd584f16e398fe5b6ec41690d
58651 F20101210_AACBJX joshi_s_Page_161.pro
87c7dd39ebf22f8175f1ee41f75cd92f
e8ce5f80efddca36d957b9a232194073b351cfaf
2070 F20101210_AACBLA joshi_s_Page_033.txt
40e05645330a75abb2ba2ac342777b0e
fcc204c8d0daf9c11d8c007f1ef42f23991996c9
14159 F20101210_AACAHL joshi_s_Page_018.QC.jpg
626ef2bd9ce843c6cbb9a5a876d976fc
1f0048cfbc155877b7e39a20aa74fb5d4cf3e188
2076 F20101210_AACBKN joshi_s_Page_013.txt
0265c91b9cbefc0ae45db86d569236eb
8c9fece3fccc1999fe253b4de9509e1dd558651d
57248 F20101210_AACBJY joshi_s_Page_162.pro
83392871242fb06b9ede4c88cc3d21c7
f1516ed454065ee5c7bf9cb275b1566bf43de86a
2361 F20101210_AACBLB joshi_s_Page_034.txt
67adbb64eefced5fe04588c1c2eb0a25
564f298018efe885aceeaca3d30cc33a45396ed8
2031 F20101210_AACAHM joshi_s_Page_036.txt
f323938867886b90d5c7250af3e44d44
fe7c60495874bb5f29580d1976b55597419cb370
2111 F20101210_AACBKO joshi_s_Page_014.txt
ac2760ccacfaa46514e3f533eac03348
6ed7051d35bf19efb2042199a832d87dd4653121
55317 F20101210_AACBJZ joshi_s_Page_163.pro
963fc3e47f128033e0d19c5dd0f3a740
65a25e9da077b84a7d790ecfe9670d2b2099d1a5
F20101210_AACAIA joshi_s_Page_108.tif
93a63ebf2d858ce6924f6f646c75dd7d
379af36f4d0f7ea34552c7f335b907fa76617d98
2246 F20101210_AACBLC joshi_s_Page_035.txt
49adfd1412f8bacc4d5755a2422090f7
ac0fbf15d95c0a922447fd1aedd8a856fb4f32c5
58765 F20101210_AACAIB joshi_s_Page_160.pro
9c051421b0e3384ebbcb89d16e7fb6c0
68eb4fdf03f12597c90593171dc7b5717a3f727c
2117 F20101210_AACBLD joshi_s_Page_037.txt
da15dc6d8ab6fa8aea7e472dfe337102
b74c094c1c004543abba6eaa22d05cfdabd8f773
1567 F20101210_AACAHN joshi_s_Page_057.txt
b59f92e7004619289e25aa52656f232d
9917fc5413c0a0063d3444b2a9843529e6dbe268
2407 F20101210_AACBKP joshi_s_Page_015.txt
8292cd40137ad4bd622849c4658562d5
c161e2151794b4b060972b8ab57328cd34f2931c
2115 F20101210_AACAIC joshi_s_Page_147.txt
f27fc88d124ea4beb23f62294218f51d
6abf7f7ca25a28c338462fb2a3066826dcfb90a1
2277 F20101210_AACBLE joshi_s_Page_038.txt
4ec2ed62cf99138a50aca97bab18d7ca
62a348b3bd11d8ce9d9ccfa3b760bdc254f552b0
1424 F20101210_AACBKQ joshi_s_Page_017.txt
9bd7397f66d35307d2ccca767e8fbb25
4f77d110ad8278bd157b0c73a62524e7086b468f
F20101210_AACAID joshi_s_Page_152.QC.jpg
04eeeb75f484244ca410c88cfe7240b0
b0631560821f08adaa22b2a6aa0e8dd06fbb41ff
1364 F20101210_AACBLF joshi_s_Page_040.txt
da483136b4691fe7dc569fff65c02e46
e46ef5bbdea696cfc334a2c7999e192ab28db952
28259 F20101210_AACAHO joshi_s_Page_054.pro
71dd94c19f9164d7fbf9e290c9450ac8
fe1012761130a9d1f797ea785382812ba38aac4f
1409 F20101210_AACBKR joshi_s_Page_018.txt
eefc0a6bb1006113e67b779acde9b8b6
7f8f477efb28f51339ac6606fc8e5ee0a5edbfc1
1665 F20101210_AACAIE joshi_s_Page_089.txt
701f01c741f0df71d6063029ad423ea1
9ce69ee7fccbfb2761e6445785af0f3e670a3fe2
1748 F20101210_AACBLG joshi_s_Page_041.txt
7b12c4577d114f6a017fc6c60c2c2fe7
1c59948a3a6dc09f5d95e291df2d6e853bfd2a3f
F20101210_AACAHP joshi_s_Page_070.tif
ab83fbe591063536ea403888811f278d
45e602790ce9abf3701cb7857de633c06ed47739
2331 F20101210_AACBKS joshi_s_Page_021.txt
1778d99e7014e94c7faf4791261a8959
ac3fa76481425162cc1a44428fe6e597984c8b5c
87034 F20101210_AACAIF joshi_s_Page_067.jp2
f1e72ad14f337a2379561c541ca1fefa
44337296f6662e01df1d49fd77117e7263510c5f
1659 F20101210_AACBLH joshi_s_Page_042.txt
38c5bd60b2ff88985e4c9559b44684e6
567b59b54669b2e4f61f5aa15669ad82a5d37fc5
1887 F20101210_AACAHQ joshi_s_Page_067.txt
5b8c343185a3a1f3ed73d218a96fbbd0
1247b9823b4fab0fe7012d763a371a73495e893c
2314 F20101210_AACBKT joshi_s_Page_022.txt
10d0e7a148f3ad3b9afed3c09bb05670
a1825af323c6dbf10f5fea52812dcbfc6a1345fe
F20101210_AACAIG joshi_s_Page_099.jp2
3125f9fd51ca08121e46a6af34531841
b017e4111bd129b3023468c14a7a91c0aec8ac3b
1343 F20101210_AACBLI joshi_s_Page_043.txt
49a54ac4d995ee7e075d1e3fb9c5079d
5a90b259c73c130cfae1fd773ed1bc31c03f6c64
37544 F20101210_AACAHR joshi_s_Page_090.pro
b86a8f34c25894838e5c42bc1e5a24ca
23fce16bc7309d771dc6efa019f46075c8bc0e77
2467 F20101210_AACBKU joshi_s_Page_023.txt
ccf133073f5531b9849e2ff8c2b25e28
a44015dfbc363229052c7d897d49886f00d00ab9
35418 F20101210_AACAIH joshi_s_Page_084.pro
2180290316a82f87e61f4d2749032153
f037884ca12c922d146dbd31c3d1a30120790056
2141 F20101210_AACBLJ joshi_s_Page_044.txt
2a090ff9a093f77387875f8f8c111bee
4c4a14ab3759d7459342e9daa953091e7714b549
1592 F20101210_AACAHS joshi_s_Page_048.txt
2b7b4178cea9e8f6429ee9d07f041f58
aa678fb2e0b7f8fddd3c8e6d1bf3bdc3310fa0da
2366 F20101210_AACBKV joshi_s_Page_024.txt
7a2b6274384e7dc232a37ac73a2e36b1
6879e3f450f70b1b8bda7fccebadad489d11a059
49753 F20101210_AACAII joshi_s_Page_122.pro
71cf11f8dbe42a537e9c263c2a9d5519
a166e4a8d8f642226b6008234e015d50d30f5e2d
1231 F20101210_AACBLK joshi_s_Page_045.txt
7f89857659f1c7101f44b6d4afaa14ba
01959ca8cad367991f85c1c98393141afa4b70d4
F20101210_AACAHT joshi_s_Page_133.tif
4d55d87cab363ce4d74452657ea2a98b
0c1063f5cb06e4416812f84fda807c4ced45f103
2237 F20101210_AACBKW joshi_s_Page_025.txt
6710a58e3f606f3913cbd6290d55bae4
2f3fd901c5fe625c57851f479351977a0ceab264
F20101210_AACAIJ joshi_s_Page_027.jp2
d0431eff5a77b6d848d1a8fb929eba5d
164346e8f1761dfdda1cdf7aac1a7637920bfc45
2395 F20101210_AACBLL joshi_s_Page_046.txt
6c9b777d0d44575795a2d246449d6086
fabee71df7cb98571d0261c4e01017f35287bdfd
F20101210_AACAHU joshi_s_Page_022.tif
def16a3d7c4ea29e1269a31537fc96cd
4b69f7a74f458badd7857a300e2bfd97a88f75cb
901 F20101210_AACBMA joshi_s_Page_064.txt
8b25850ab0a6d28e49dbdc939f104123
edfdcc33ddd8464ff6814f3de6c413eb8e9cbb09
2347 F20101210_AACBKX joshi_s_Page_026.txt
7645ec32bba3a6a1864f1baeafc279a8
d4df90d5e3ca42b5b9df30b1c55e3d3fa3cf8ee4
24022 F20101210_AACAIK joshi_s_Page_079.QC.jpg
aac889179421d97574380392268ad855
8d8311cb45379399f1d63f0dd0d04a45271f8c98
1785 F20101210_AACBLM joshi_s_Page_047.txt
49579c622226bf8060698a5663fa376b
892915fa8e29b7ff5072680116618881148b4f12
25712 F20101210_AACAHV joshi_s_Page_134.QC.jpg
e657aa95b8bf01c342972138659a29f7
056bc2035caa6ded98a924be8041cd4e2c46aa24
2338 F20101210_AACBMB joshi_s_Page_065.txt
e9d3c75552d4bdcdcf9d5b6eda226a48
d98d11314d8db138e820c9d935db3b7d08117736
2091 F20101210_AACBLN joshi_s_Page_049.txt
3287c06284aa3774ee70813c8a36c58d
136a499a6538c7289c3d91d386e2643fecdc41e9
2292 F20101210_AACBKY joshi_s_Page_027.txt
959c2aeff75fb9e48467d1ec0bcd9e4f
9706f952d70a0ff76a013b112dbe399fed22daa8
F20101210_AACAIL joshi_s_Page_117.jp2
e6834c5021d0056f9ba8112626f61805
96e0da491388f111b9a393bf0697ab0c4414764d
72372 F20101210_AACAHW joshi_s_Page_087.jp2
692c201e01a21e2f8fb27ddc0c8a5c25
596434f11019f74cf583f305733909f02f7ae886
1685 F20101210_AACBMC joshi_s_Page_068.txt
7150f16580c7c4d09c47a6f718400f97
c52d0c12c858d7b78844cfe7f0717e8288b11031
2006 F20101210_AACBLO joshi_s_Page_050.txt
360ea1c1e4cf007ff9a1b46d9310fddd
35a803871134743ed6c87be47bc4593faf09d288
2176 F20101210_AACAJA joshi_s_Page_106.txt
3d6b0b8c312f9f962593380c702822cc
75a449cec2cce68debbce82bb6ce355a197473ca
25769 F20101210_AACAIM joshi_s_Page_137.QC.jpg
5a633cbaf224321c92340d1a0380e855
5695a5c2c8beb5424734c78332bf18c3cda43e4d
60170 F20101210_AACAHX joshi_s_Page_123.pro
5b097356d4924b98c8abb11545291d6d
2bb36500c76c517965789327d7cb5638201d1ee4
373 F20101210_AACBKZ joshi_s_Page_032.txt
a2ee841c797ff04b58c0d62579b5f266
3244cf2ef36ae45cb0fe3ad2fe0c796fe708703a
1755 F20101210_AACBMD joshi_s_Page_069.txt
f7b3495ee9dd15bc36fa09b4eea00c16
54a5bfef4f0478d477dad2b2126cb249aae03df8
2224 F20101210_AACBLP joshi_s_Page_051.txt
5faee5e8dfd633dacdab23c93f49c0f2
3d721dbc3cbcc165beca56baa2f9133210e9cd6d
6336 F20101210_AACAJB joshi_s_Page_033thm.jpg
7e17ed6f04980181ec3ee331b1deb3f4
733e82c16fe08e362935949033a3f46e1ba7ac29
38400 F20101210_AACAIN joshi_s_Page_136.jpg
95b04d8505a84bb2157e147e208e16d6
36b4f6ba8e33c5a25338b12ee81c167fca80802a
21887 F20101210_AACAHY joshi_s_Page_082.QC.jpg
6e7f88b877978a8351683eb705241969
501807abf527e5c561b27198c3ccd6abffad6746
2391 F20101210_AACBME joshi_s_Page_070.txt
faea5eb6f38c4993e67395b1ccef7db9
aca3433cacc873c4b6e3dda695b43ee11d063d22
34381 F20101210_AACAJC joshi_s_Page_018.pro
7b8ac751eb49d1e0c15b03784d65f806
74e5c4fa937aa13b6704e0aaa1fbb1d824f0f85e
66980 F20101210_AACAHZ joshi_s_Page_101.jpg
68f4b8b51dfd170c383136f053007dd2
02486d6f67b83a6a58cda2a96eba53056902db55
2369 F20101210_AACBMF joshi_s_Page_071.txt
6908e12225fb7860641aedde892deebc
ca0eb0f238cb821a9b50d9a09cdb99fcedc5919c
2160 F20101210_AACBLQ joshi_s_Page_052.txt
f8753dd3ca08b74905b58f0851f10e6f
1b30284a7eae9561d9acce18997503a5b8e62675
52848 F20101210_AACAJD joshi_s_Page_129.pro
548e56deef1cc73c4c7d40f5a2e6a2ef
3368c30c8db76d6521b9ba1db55642d38c70c274
65496 F20101210_AACAIO joshi_s_Page_098.jpg
49c563dfc4ecee8e77cd73cbafc9ecc4
8b06d044d6b9aba6a6a2d1a4522d566f9ca3fabf
574 F20101210_AACBMG joshi_s_Page_072.txt
6a264fd8988ca660bf8292f74714ccd3
9ad8356fa6c49673961fbd87b96e757872e6687f
2169 F20101210_AACBLR joshi_s_Page_053.txt
3a63236cdfc4cb0db772924355738c97
57893be9a9666180639ed67ed3fc04ad2a279dec
2651 F20101210_AACAJE joshi_s_Page_115.txt
d0cda916de900c3a3539e1314976fe65
812f956bc63597fe1ede23964a52a86217783c48
55689 F20101210_AACAIP joshi_s_Page_035.pro
fb57f8a9a0c27a79d1c6fb82bf55dd9b
b7acab1a69372b8c524330781a136aa05319c203
1854 F20101210_AACBMH joshi_s_Page_073.txt
5e99153c8041eb8d48bda7734fd0cc0d
f17e349896b340733cf54bccb19a29c3f08aa2c2
1361 F20101210_AACBLS joshi_s_Page_054.txt
58e964c253c897b2ab097ac9be1ebe1e
556b6ac26cba41e6ae77ad418b03f48bdada4293
1016392 F20101210_AACAJF joshi_s_Page_131.jp2
1c690a6f6b06ad773a3fd88c9715b0a8
d13832ad398aa189816875a6e800369e764a7e2a
6436 F20101210_AACAIQ joshi_s_Page_078thm.jpg
1e373d971ff4d79bb113bfcc719955b7
661ea9083350db5f9a63a362c35483d0bf1b73c7
1701 F20101210_AACBMI joshi_s_Page_074.txt
12fe6e90bb6a58ae0548039e6153076a
d85a605de9f6389e2bd9f13b93ef68df66bc4b97
1619 F20101210_AACBLT joshi_s_Page_056.txt
c830e01d1284e96984c71473aae9879c
8a066f4d80a629430c2b5f552705dd7e0dd5e878
20942 F20101210_AACAJG joshi_s_Page_098.QC.jpg
bf36ffcbe24a03ffa0e64e9eb07871f9
3595fec1d86547443f88469dcd37f5a82e6bd90f
F20101210_AACAIR joshi_s_Page_013.tif
639d179144f44b1b3661c1524442f57c
6c34bd0356c120bc69da0d1fc7783ed0fd294e54
1588 F20101210_AACBMJ joshi_s_Page_075.txt
5d54c40c179216d3ddc39dbf46c5acc1
445497510daea22718ea0c8c745b2733afa574e6
1492 F20101210_AACBLU joshi_s_Page_058.txt
024a7900e369312c3340ab56a6a638a7
343733a3b1a9f819dca107390d6d49274d10f9b0
2427 F20101210_AACAJH joshi_s_Page_020.txt
9f4b16ffaf5130cc385fe3546c254731
d3f4c0138f36640669dd024d6cfe3f6f29c8bc05
58110 F20101210_AACAIS joshi_s_Page_027.pro
4c6006b41109a81531dd9f3624b9ce2a
74d9eecf7216343dda49053092b5c36f2298d8a2
2316 F20101210_AACBMK joshi_s_Page_076.txt
af02dd5b29f25e6a88fd1b36de2c91be
119e9769c83e6594c12c2f86f47e040a40884f18
2156 F20101210_AACBLV joshi_s_Page_059.txt
461aba9295d069795a655ddd3037e7d2
e1c867a467c4d3882cb5cb3f314cbdf5ed8f3f27
F20101210_AACAJI joshi_s_Page_118.tif
4476a94976829c30cc2b094e1cd5d2ad
31022324949dc3db41152299b306323d49dd827a
1051932 F20101210_AACAIT joshi_s_Page_025.jp2
bec786b827b33d661b436ab911ace8f8
df66b2a2408847c52a6e872392fe8201878dd3f7
2120 F20101210_AACBML joshi_s_Page_077.txt
aef51d43ba8cb78247be12e80260187b
3c143a0c0b9a334edd6bcc8acaaeddd3000053a7
1806 F20101210_AACBLW joshi_s_Page_060.txt
4519e71550f6571af18d3323763dedbf
06802ea2fc6ecb4d0351ef2e93d63190b1af2e9b
5124 F20101210_AACAJJ joshi_s_Page_093thm.jpg
b13f1bcc1cc34be30aeda880db6ff4f0
23800cd9bab4878e043e98c1a3c432971fff9a67
F20101210_AACAIU joshi_s_Page_022.jp2
2ac5b547d0639c4c147af5f1601227c8
e14be2ffe635bee15f3de9b3721644b4a2556e69
2002 F20101210_AACBNA joshi_s_Page_096.txt
291f1bf70378c8ec3af50dcd005e3492
a4e473e98852b95a79104a91405167bf62b86982
F20101210_AACBMM joshi_s_Page_078.txt
48fe8b2e499699c2bd6d62d74d7d6de9
e1244d49d04d3a90046de931595b91f5169b553b
1325 F20101210_AACBLX joshi_s_Page_061.txt
7a12fc31e5ba2031137a38ff1b0a061c
3b649cff577126b258185b64714b657bbcf2855b
26884 F20101210_AACAJK joshi_s_Page_065.QC.jpg
741a4915d096ddc1ba05cd5bdb35e175
50462472021324761899fa5472491b56cf077929
2116 F20101210_AACAIV joshi_s_Page_079.txt
73502e4a5ab0779a2843392843b29237
91d10aee8bfb4613809b1e64d280585643f307d9
2329 F20101210_AACBNB joshi_s_Page_099.txt
48755892c9cac98613e12deb856bec5f
47bdec316fe2b39f260304887b4c18619d7e1369
1993 F20101210_AACBMN joshi_s_Page_080.txt
81cc32be314c2354e78afcef429cb392
ab76b3bacaabbe6ff0662c1f075f550634436433
2195 F20101210_AACBLY joshi_s_Page_062.txt
8b043bdf4a7a7021bc72d352476a4809
dab5756db083d2ed4fbabe622a98bf1fa818b864
14597 F20101210_AACAJL joshi_s_Page_061.QC.jpg
ab64349a219b876c4c41200c6a4dbe0c
6ded5e5940b0e04a848ce7bb444966edc6b6214e
6753 F20101210_AACAIW joshi_s_Page_144thm.jpg
b976921fd0905ba3d34f9d1f2c220aa7
fea55212473f1d49701fea5ec9af3a29aac11340
2243 F20101210_AACBNC joshi_s_Page_100.txt
3edc8814bfcc9613782a5416743aa1c9
9ad7cda2999c472502b6a3bb69a119c19e6e2de6
2248 F20101210_AACBMO joshi_s_Page_081.txt
e4221f8c265f1837a909a2400e45ac15
2bee86fe9ee3d75ef6ce76ae155133c71115ba0e
1723 F20101210_AACBLZ joshi_s_Page_063.txt
ca42578a17502e02be652de3c0727161
f62298382758e114ead6ec218fc1e88e88be2ada
96068 F20101210_AACAJM joshi_s_Page_055.jpg
4a85cbbd82e35f8a2125d90aaad2417d
1ffa0238871d5496c94b33e0da1af176fdcd0e91
F20101210_AACAIX joshi_s_Page_101.tif
9a4c6580ae222104f1ff6be6b7c7c119
59cd774b09e9385b3423662b6b29a2bfb9b2b105
6775 F20101210_AACAKA joshi_s_Page_032.QC.jpg
8a953a028d31b44a0850225826724699
f0a46017dc3359806f969a828a6a187368a2d97b
2059 F20101210_AACBND joshi_s_Page_101.txt
0484932d2519abebc469da8bffe17a1a
5a2e24aeffd0844cb4c88e710c00cf85d409e5e6
1983 F20101210_AACBMP joshi_s_Page_082.txt
38691aee61fa9c07fca54c7c262babad
33b129c07253971fbf1e43b0901436975c8ed0a9
22179 F20101210_AACAJN joshi_s_Page_042.QC.jpg
7d14a45e206f102e50e3618d34354031
01963eedefe83bde0a98193ad756b5c72743926b
75219 F20101210_AACAIY joshi_s_Page_127.jpg
2d3898f3629c41d05da7034fbc880aff
4d9d48d00399f9752c8b415d6fd8c16ae89b48c2
F20101210_AACAKB joshi_s_Page_036.tif
20f62e8ce5283dffe98be3b7ecd5b570
1577665f0f9a32a824ce4d9fe2d5bdde974c62ee
2100 F20101210_AACBNE joshi_s_Page_102.txt
c0e6e04c8217d15329c96cdeefc9815b
9777cb59a731343323079ef9b5e28d6442c0e646
1318 F20101210_AACBMQ joshi_s_Page_083.txt
3c9b666c61beeb91b5b0221c75eeae7a
b5b36f55895b7c667cd07e60002e980beead2bd6
11206 F20101210_AACAJO joshi_s_Page_143.QC.jpg
3c8134a234bc61527c144ad416b23a3f
c1dbcfee0effac9e05183d50fbf084091a49b8cd
3022 F20101210_AACAIZ joshi_s_Page_140thm.jpg
9c5241467c4575a9f343c07b8f2904fc
9ffd051680b889fad48a64f9ea0ca52fd1817068
56600 F20101210_AACAKC joshi_s_Page_084.jpg
54bfc341a9b1e7184aaecf86eea83a34
afe881df46e210bb3834007cd281edec9682e6bc
1547 F20101210_AACBNF joshi_s_Page_103.txt
ad291ecf82ef1d21ab3a2de4ce018f02
6ada2cac1899840c349bb0484ec142b2c7629554
F20101210_AACAKD joshi_s_Page_163.tif
dcc06d3062944e2d1bb81f0d82ade307
a20160a6ef99e402efb8792f1ce7419d8c505228
1933 F20101210_AACBNG joshi_s_Page_104.txt
25f93dcda1dc598388710e5c4514383c
8b9f466eb22c6ab5b86d5facf988a05e394fff33
1452 F20101210_AACBMR joshi_s_Page_084.txt
f838aba7fa9401623d3c231184649ef6
eec3a0b21f7b662e2d03e7e9e52dc7d14ae7a5e7
89090 F20101210_AACAJP joshi_s_Page_006.jpg
dc24e78e0a2209274e5dc8d4622c9b3d
fe274233a99b9335fd85e66f7e60a2c044830e1b
51429 F20101210_AACAKE joshi_s_Page_157.pro
dc574223a8c5b9dac98878c4adeabff6
e93e2cbb6bed31ecbd8fa5fb544e02438ecdb8c4
1702 F20101210_AACBNH joshi_s_Page_108.txt
2892a8aa343ca0ee379cfd3b14d541db
bf6868491d560f8f68fb0c8a34a857a5ba4f8172
1750 F20101210_AACBMS joshi_s_Page_085.txt
b4776f65f9b750e8d3aee1aa02a5a418
db1f26f1e704366ea025a4172d332a6f9d286d28
2344 F20101210_AACAJQ joshi_s_Page_152.txt
7c8b30d88d1026b7823cc8e76c833ece
9aa34de8910f00a66000f1c1af52238c5536c6f8
47479 F20101210_AACAKF joshi_s_Page_080.pro
02a11eae1eb90114e3b44525ca3c5e68
f19a2f0ac9af2b99d8d5679ecc0fe13434e4aea7
2082 F20101210_AACBNI joshi_s_Page_109.txt
45c2fb9721842e7d7670674d98035b5e
c3c60b353a6d8e62429d8d3cac7fd4cad78a56d1
1334 F20101210_AACBMT joshi_s_Page_086.txt
e663edae5ccf15c3603596ad4b4f4432
cca12f43e9171867a91265c2b9cc856da3fcc70b
905854 F20101210_AACAJR joshi_s_Page_063.jp2
104718353eefe683119e360b83812347
648346a3623e37fe9173606310b34ac82b8572a0
89309 F20101210_AACAKG joshi_s_Page_103.jp2
e4058872fd98331ffd33eb92c8d9f434
0b254613446b95e2518c85bee881fa67bc1f4537
2388 F20101210_AACBNJ joshi_s_Page_110.txt
ef2de7e5f82c2e733db622dbe69b1010
d2f2e8635e134fbf45dbeb5bfe4f6948c8349b04
1300 F20101210_AACBMU joshi_s_Page_087.txt
682fbe485fcb78b647212296f7288ce6
b5b83c76807abe91bbf0718e676227470647a628
6477 F20101210_AACAJS joshi_s_Page_115thm.jpg
98a09f06c8684fce92611dbe5a3ce4fa
56ffef85a02b60f6d14e546bdf440eff7a76b2d0
1500 F20101210_AACAKH joshi_s_Page_003thm.jpg
2530136bead771e5497961f429fdc472
9669f52ec8bd531cdf8f60fba33cc5a15a05fd03
2262 F20101210_AACBNK joshi_s_Page_111.txt
367cc8455538d617da0d49d7ca8f0b2c
83fc2a44d76b2829a56c32bc51b373c32d86d10f
1094 F20101210_AACBMV joshi_s_Page_088.txt
77000d52471f86199ababe48773bdfe0
2c56f57fd72a0b2625cc6172f637f149d9765fa2
7105 F20101210_AACAJT joshi_s_Page_070thm.jpg
52d1b29dbe9fbfab9e21ccb245e24d9f
76d1249766439007e0119f7830c5cdb9cf722fcf
5706 F20101210_AACAKI joshi_s_Page_097thm.jpg
1c192f30f544748ac536b0c5392791a3
6a23b8032036ac6d64b7f82bfcc151f92088056d
2266 F20101210_AACBNL joshi_s_Page_112.txt
fc5e446c664f400b76aa648c1a7f4365
41f184c87ce09fb67f4117a96a5412ea31b13fca
1976 F20101210_AACBMW joshi_s_Page_090.txt
346b33599768546e9ba194b7d1545d30
e81f78311a1eb4d568edb98db7abaac337b98d72
36982 F20101210_AACAJU joshi_s_Page_085.pro
2743c58402821b8485ab8314ce309e96
7dc06ac98e113cbc9a171997ade8228928f3648f
828051 F20101210_AACAKJ joshi_s_Page_090.jp2
0cc37d095588662079e4ffd587aa09ae
cdf0ff0e2d9f65b3d1d7d30ad5f6d56efe31f04b
1530 F20101210_AACBOA joshi_s_Page_132.txt
a5062a5ef62ae6d76671f72eab65053f
3d2252bcdf0c7991bee3d63b5f0cbb4a74abd95c
2396 F20101210_AACBNM joshi_s_Page_116.txt
41aacc7e5618bae2ddfaa3a7e74db9a2
5df1ea7ba9da01218ae2053570b60e1c8549a6de
F20101210_AACBMX joshi_s_Page_091.txt
21c1dc51eaa9dabae909b277f6af0322
d07a365cda88e64dfec2bd97f23f18d97cf6548b
6258 F20101210_AACAJV joshi_s_Page_080thm.jpg
036ad7ae9046df93b737a1c40d8ffc04
db26738ec9470f65f06387b3c0b89a889e2d4cd8
1051985 F20101210_AACAKK joshi_s_Page_028.jp2
5cf32354efbb6aa3c058a44963c36680
98d611bf7194a736d18361581dfdbd32b2c34f4f
2096 F20101210_AACBOB joshi_s_Page_133.txt
bb72b86d68c28b5da42e085e3f8ac65a
be488387f8c0ab9654031d1b93337da6d7f34628
2431 F20101210_AACBNN joshi_s_Page_117.txt
bd485ce737d7cdf8d3be55a0fc93c33d
352c31416628d52326ae0f20f3916d34c6bcd260
1943 F20101210_AACBMY joshi_s_Page_094.txt
29a3504e8a3fa6e058bca78bf9d8f78f
64bf5931ab322b99229cf8c99c8a6b95ee522416
20196 F20101210_AACAJW joshi_s_Page_043.QC.jpg
aa3378e010728b34326306260e22f64c
1935d98c6bdc6eb13fb0bade3f0e10b044e11493
F20101210_AACAKL joshi_s_Page_075.tif
3ddd75aa094c187880909d06ffb48028
c4c8e99242c342ae73b4486ea02f20054500f40d
2142 F20101210_AACBOC joshi_s_Page_134.txt
7055acbda8af84f8c1a4d137de5b17fc
46a00777eb4a610a82322712d2f115f5e4907008
2336 F20101210_AACBNO joshi_s_Page_118.txt
774b9fdf1b7122e3293efd900f9dc9be
00a64bfd1932fd0b6766758581fa4c05ad73a6b0
1995 F20101210_AACBMZ joshi_s_Page_095.txt
e6910cfa95efc2991f9a253e200823c6
1bd91fcd2190a1308f94b4ae8cf93f362d32ccf8
96519 F20101210_AACAJX joshi_s_Page_009.jpg
90aa52473993698f5e4e5b3309acf576
c9aad9fc0b885fa0bf8c1ebcfbe13cddcdab9cf4
27571 F20101210_AACALA joshi_s_Page_072.jpg
3aee1b3a4438d6a8bba750739ef21cff
1aa86f7bd821e73276a7b750b31a1e2071130572
72187 F20101210_AACAKM joshi_s_Page_082.jpg
f11ab03e86bb439c014119f1456bfbe3
8e251e8ec89f6edeaf5f5edaf8ebae3e3d69a68c
1191 F20101210_AACBOD joshi_s_Page_135.txt
6689c3af8f091c5ddca1f76c14e8af46
8164d162c8cedd00e570a16cce9fd57a88fac1b1
2204 F20101210_AACBNP joshi_s_Page_119.txt
c35d1fa82dffca4769cf007b4bc658a3
4536cb2d0acafd8ace730407b85bcd98814a04ec
78647 F20101210_AACAJY joshi_s_Page_133.jpg
1201e49cee261b9e43bac62c35d1887d
4b7344def321c22e56e43ddd5009118c38e1423c
56891 F20101210_AACALB joshi_s_Page_125.jpg
ea04c260852effd72d9532e6fec74fce
4ca0b86c5e79053736a0d0810c88da85e4d70e3f
847185 F20101210_AACAKN joshi_s_Page_120.jp2
088d3c63d9ac6927e9e2744d2ce36075
aec5bcf2d07f6724858387bc6fb11a97252bca51
1228 F20101210_AACBOE joshi_s_Page_136.txt
13674aa4fa7d8fe30a8d8280f25e4d38
68faa4762c72837794f164252b3b2465249ab9d4
1566 F20101210_AACBNQ joshi_s_Page_120.txt
54bde9684812413310bbcbb7245959f9
d289ee56edd8f8f9dddf5f7ddb9a576f96706c50
F20101210_AACAJZ joshi_s_Page_113.txt
081ca82dd69dd5cc34a118758458b1d2
dc04cb6a5af87db01eb662941e74062c1ca1fe08
F20101210_AACALC joshi_s_Page_100.tif
4af615af2807accfa177d2f18ef2ca23
e36dd8f8ffbe02e99966520194d37edd29b5b0c1
1051935 F20101210_AACAKO joshi_s_Page_127.jp2
aebfc1a25da38ccc789c147b8f5f52d4
3b8d5a60570893e3721b9158e6b0f4dbc1c77503
2103 F20101210_AACBOF joshi_s_Page_139.txt
689b30d3a28c72ae83a72a8071467079
53042ad72ba18eb7baa10d001f023466124d14b3
2078 F20101210_AACBNR joshi_s_Page_121.txt
063c099e0e9a0785a919f2f45995eaf7
44c7d67c28cb7a55b78e12d88213cd3c0ebe5f52
46570 F20101210_AACALD joshi_s_Page_154.jpg
9fcd8332e7278902d548d9ddba058417
e20524b6d5db265d4e499e55def0f7e002260a23
28021 F20101210_AACAKP joshi_s_Page_116.QC.jpg
0359732786e0734aa325d4cde27b433e
431b1684eed5ad71a163dcbb4dce426c13cafab8
828 F20101210_AACBOG joshi_s_Page_140.txt
051d115dcdc4ca6a984c5b6788258271
2a7ea42700495f356704da78e1a9a64a9339ad5e
78617 F20101210_AACALE joshi_s_Page_077.jpg
574641326119580885515c8a2c322d7c
8c791c0051fd84315b36b849986c514a2948c6d7
2428 F20101210_AACBOH joshi_s_Page_144.txt
94d558f3c9d39d97fd821ae36626288e
ba3c3f770b901a40a1c6b8fd2bed08fbb7a0ab86
2069 F20101210_AACBNS joshi_s_Page_122.txt
7ee2b9b3e4d1fadc3225fdec046c4ea9
31a85a6bde455f704d79c85555bb1c5b87ee8b20
2394 F20101210_AACALF joshi_s_Page_031.txt
41e4376cbeb8785e57063c1f74f32bd1
8b70b60dd171c3112e325e2ba2d35b761c3c3a6a
F20101210_AACAKQ joshi_s_Page_032.tif
07bbd487ce94f3c2e612d9e862589a8d
6c45f594fa79a5dbcbbc9f7caa28d7b0993659f9
2374 F20101210_AACBOI joshi_s_Page_145.txt
b479965f992ccc57753300274828b907
6c31342dd88e51600334c9d7a6eda40df2922627
2399 F20101210_AACBNT joshi_s_Page_123.txt
d6761f9e00c6030489a05d28d287076c
7795e861a34f83e86f82afd3706cec21025765c6
122202 F20101210_AACALG joshi_s_Page_113.jp2
59dab1efe06145e5ebedae811554757d
37bf3ce8b5546256c7a3fd0ba719452bc7ce2730
36230 F20101210_AACAKR joshi_s_Page_132.pro
5864162ee072158abbbfd0583e479745
606b83f037f7587a86b5b7b4f5a3fab94b0f24d7
2081 F20101210_AACBOJ joshi_s_Page_146.txt
d9ff06716d14b81a81e11c181d2a03b1
6911b717ba085274f6e66fbcb5fe84dc4501ed03
1683 F20101210_AACBNU joshi_s_Page_124.txt
7d9f49a22e1b763d8652490894745f0b
8fa0daa1a959e08964463dc931795c5f09fe398d
70485 F20101210_AACALH joshi_s_Page_146.jpg
3758ad28f63a2a759042726f8c148d5d
ea9dae8678594e88ab8029fad270c9bb169daf25
1051959 F20101210_AACAKS joshi_s_Page_119.jp2
e7702f4bb1e35a1fe39d200f0c12ab96
16aded971deb96eb1593a09a8d4bd50fbec304d5
2921 F20101210_AACBOK joshi_s_Page_149.txt
7c8b8de01d70bd1807fcfb0c3abac3b5
a40d256f6c030d12ea49b2865ca19f8da627c08d
1623 F20101210_AACBNV joshi_s_Page_125.txt
696177968c9e1bca7457c0a62708963c
d81170d25e12f8fce6f549ecc125e6725194030d
7076 F20101210_AACALI joshi_s_Page_145thm.jpg
f75d0cc983f85dfca407775e145933f0
aab0ac43de8fd3da4eb1791332d362ef436e5718
2216 F20101210_AACAKT joshi_s_Page_066.txt
4ab54658023c11d892821beca8fdc0e8
9ccd1ccc7f39c54f46fe250df80e5ec0c6a9c878
2171 F20101210_AACBOL joshi_s_Page_151.txt
ab2e264f7f4855298c70b264ee65eace
0a0e36551b42b237c1e09f2f0a9f38a1d615fb73
2240 F20101210_AACBNW joshi_s_Page_126.txt
6d651b88e7948008312687e29864e92d
cc6dddc7c59334aa9ccb6f1a4ab30318fefb8a2f
86983 F20101210_AACALJ joshi_s_Page_126.jpg
395cd27957afc9c1e56cfd4d6e0c78ec
14e04e9b0e41d2e8ab88aa7eef8fd6ed3ff34806
3172 F20101210_AACAKU joshi_s_Page_008.txt
a95b0de0de82fb6d9d6dffe49c9f0c1f
d1ecbd823781fa2fcce8545b88123ea2d172be95
10613 F20101210_AACBPA joshi_s_Page_140.QC.jpg
5fce7e83d446bb7c2d3071c1939a02e7
c52a91a0855428bc5051723b2496b819e2155248
1347 F20101210_AACBOM joshi_s_Page_154.txt
de2d17b09b58f449e4353d588fabe9f9
62eaa8fc74eecb1ef02b98b65a03b4d35b1b7d70
2032 F20101210_AACBNX joshi_s_Page_127.txt
45c24e142613f3d8edfbddc7390e3be8
0f80d6594095c02b6560ea44ff8c54671e0f2941
28079 F20101210_AACALK joshi_s_Page_110.QC.jpg
b2a9a6e2123af68555000429c8f3c9e9
b2869a0e8fd0688c5b9590f228b8718bab17a350
2293 F20101210_AACAKV joshi_s_Page_162.txt
df8e2e1352abad6639b63fe0fbc840e5
1f64472370263e63ca0a37313581ee1dfa17843f
26122 F20101210_AACBPB joshi_s_Page_009.QC.jpg
4a97d2213c65e7bb18e192e3a281d8dd
4eba43cc3179e9cfe345091672091b013873e5e2
F20101210_AACBON joshi_s_Page_155.txt
65c77c582e3a33d59f35c2042fa3cbae
899d1ec0274317257b09d061cae0ba0c8cee1319
2674 F20101210_AACBNY joshi_s_Page_128.txt
8ddb28644d72f873397a7d37d0112dcd
8442d0dadf82230c89445bbbc7a48d4fd485da3e
1513 F20101210_AACALL joshi_s_Page_141.txt
dc5983a9d2aaca2cc30676be9bc60b8b
262c1000a5434110a6a52c1a4e085182bf11b2ee
60808 F20101210_AACAKW joshi_s_Page_116.pro
df9daeb19676e1ea3b3fe1b8b6f0a3b8
1c37ee92c77fc70da27c5bbd36e8d2f1edc1dba0
6724 F20101210_AACBPC joshi_s_Page_126thm.jpg
88aa174462eafe6e87354a6730a919c0
2ab8548b7da82e6790d752fac7a95f4f339f6455
929 F20101210_AACBOO joshi_s_Page_156.txt
c3cde823b5aa991e4902bd0ebe27d34c
34bf0310d608d193bce92f77260b97d487c71ad6
1963 F20101210_AACBNZ joshi_s_Page_131.txt
37ee79b32cd3527ec7b8618459a219be
c6ba07fe52fcad14ea0cab355b9e21aedee1ed72
1093 F20101210_AACAMA joshi_s_Page_143.txt
125ad21afd738047875a88607d9f241f
6f7047f2efc3cc57db4f4ce09896eb42886ab21d
56247 F20101210_AACALM joshi_s_Page_051.pro
84b0d3b7f78f3c96435a0e168425b652
34ff4824b818dff4309a1a4fc816fa0a37ace9a2
76400 F20101210_AACAKX joshi_s_Page_005.jpg
8848cf9f9938fb3d045abbdb5388223e
f5691243b4d485341d6d0db96bf55658efb1eb10
18270 F20101210_AACBPD joshi_s_Page_090.QC.jpg
337d26227dfe40e74db9bf3fb5f87f77
3363493085ce67909bed888a6ad32e1b54af3617
F20101210_AACBOP joshi_s_Page_157.txt
2deb492b3100cf14a6b352988eb7bc10
5b8e6fa293b0b4b4422321e5c0b0fe37f378d790
7221 F20101210_AACAMB joshi_s_Page_001.QC.jpg
5e9d72726ece7eb3992297638ed0f8ac
1cdca08501ff54a8e36628f6e451d54fbd2f838a
6589 F20101210_AACALN joshi_s_Page_053thm.jpg
70dcceca39da1e96b1aed134929bf013
ccc90eafc6f457cfd52bfdc95ba2b247a7b23e43
F20101210_AACAKY joshi_s_Page_124.tif
5164fb2b96e140772575a881b864c5ff
a881bfad88112c29df3fba2c5364df7f3adf90aa
28369 F20101210_AACBPE joshi_s_Page_128.QC.jpg
01075857fe29a6add1c4781e069f376e
30dfe84d48f98412c411037e1c1774416ed87a3a
2425 F20101210_AACBOQ joshi_s_Page_158.txt
25b924b5f0b1e135a2ca62140bfa713b
fae98a72a60df14d121691125ea9b024c2ee11d8
2271 F20101210_AACAMC joshi_s_Page_028.txt
7dc811595fec755b8afd3dcb6e55bc32
715a83e8ca91a3ba545ebdb219301ac208213af9
60717 F20101210_AACALO joshi_s_Page_085.jpg
a09aeb7584120f8a93c978d24aaed2af
cab5c592d1f68046d333162e8fb9e0b745021477
3136 F20101210_AACAKZ joshi_s_Page_156thm.jpg
ca3c1222ce79dfb22c90773affcf4ccd
3e77483a2291fe6b7c1da6b848c6e591116cff11
9023 F20101210_AACBPF joshi_s_Page_072.QC.jpg
c2b27e50cae69637b862849aaa7810bc
55b9e1f2e13f97c9775a7038155e75b674c83e28
2549 F20101210_AACBOR joshi_s_Page_159.txt
409738a1e7bcc67ee5b9707706227ef0
a14e14878ac63bad8394ebac55cada00f3ff93a5
1051923 F20101210_AACAMD joshi_s_Page_110.jp2
2a6230e97b1a2063511e78004d94c92d
15edd92008ce4c2ab8cd380dbc7f3194bb58d603
25333 F20101210_AACALP joshi_s_Page_091.QC.jpg
7e569e700e3f77475e7e526534d45319
94643e3d92104ab2a9f36db8d8ebec886f9d1574
18832 F20101210_AACBPG joshi_s_Page_103.QC.jpg
00ef73436e2ded978fd799e5b5a6733d
c248c2a3c065dc8c48467287a040db2ecec1541f
2352 F20101210_AACBOS joshi_s_Page_160.txt
0f02134a4209022f7e447c1ec6beb1dc
8248bda66af8230776f85fdf58f255c6f38a4d16
6047 F20101210_AACAME joshi_s_Page_147thm.jpg
77d90edf6fe1c1f60bae3aa62edebaa1
c26c07489ae6c45f78c5071b5ba4b4e4b25e080a
76257 F20101210_AACALQ joshi_s_Page_096.jpg
b8994b693c2d723c4e21a5d35e0e5209
8964561a9522eadc7ee2e05bd3c5da78d4263368
6193 F20101210_AACBPH joshi_s_Page_100thm.jpg
d2d9cd1207c4f39ad9c1ecb79b7be496
4ce60418aecd28dc6ef274fda19e7976d82c4466
2233 F20101210_AACAMF joshi_s_Page_016.txt
6a769dcfd097e580b9b2ed5192864ea3
af63625f7b06ebae7cba65f422e8d21023d30ade
5300 F20101210_AACBPI joshi_s_Page_068thm.jpg
8ba9ce07648670147e926a72fc4baf8b
763907e6242f45c6711c3ed74c535aab3e3d37a7
2353 F20101210_AACBOT joshi_s_Page_161.txt
1799dee9b259dcc3fbd83090433be97d
c542da52a1714edaac0300f47be9e2003e80e181
F20101210_AACAMG joshi_s_Page_109.tif
d2295353367711e701084985edbed810
885201f8c53a6f9a9973330698f936c6d561626d
6165 F20101210_AACALR joshi_s_Page_094thm.jpg
1ed25be0b25cfcd222ca78fad8fe7387
ec4c1b01bebf036650373f27724017f9d103e666
6773 F20101210_AACBPJ joshi_s_Page_076thm.jpg
ce49e140329c1292030b35b2c3d268d9
a2e54556cfa1629143395b2939157c3e28598f1a
2228 F20101210_AACBOU joshi_s_Page_163.txt
abb5e6cb5ea06f491d5da95aaa4b60ff
94dc2e105a9b955cbef08c26aea920224e32acb7
F20101210_AACAMH joshi_s_Page_106.tif
16d9dc24fbebdf9c42f6609f1e4a1f23
9248b179d544f82746f842de00c01af248c01df5
F20101210_AACALS joshi_s_Page_031.tif
faddbac5a7be2b2cff18b060277517ce
05fd34bf1c6bab6fc178288a18a389af800c3012
20675 F20101210_AACBPK joshi_s_Page_062.QC.jpg
97bf70a7b4c1afc492801704a9cfed0e
653b59220e39e79981d610d05e9f0e4ddbf9c859
1195 F20101210_AACBOV joshi_s_Page_164.txt
c125a3186d5916bdda157963afd3bc52
a8c4de1773e2315f165e7cd1c3c301d7ab85b8bf
2927 F20101210_AACAMI joshi_s_Page_148.txt
35b2856b649e9402a2a0498105f741c0
cf5f49e25666221124d41ab62806bb92f07f88cc







SAMPLING-BASED RANDOMIZATION TECHNIQUES FOR APPROXIMATE
QUERY PROCESSING


















By
SHANTANU JOSHI


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007































© 2007 SHANTANU JOSHI



































To my parents, Dr Sharad Joshi and Dr Hemangi Joshi










ACKNOWLEDGMENTS

Firstly, my sincerest gratitude goes to my advisor, Professor Chris Jermaine, for his

invaluable guidance and support throughout my PhD research work. During the initial

several months of my graduate work, Chris was extremely patient and always led me

towards the right direction whenever I would waver. His acute insight into the research

problems we worked on set an excellent example and provided me immense motivation to

work on them. He has akr- 1-< emphasized the importance of high-quality technical writing

and has spent several painstaking hours reading and correcting my technical manuscripts.

He has been the best mentor I could have hoped for and I shall always remain indebted to

him for shaping my career and more importantly, my thinking.

I am also very thankful to Professor Alin Dobra for his guidance during my graduate

study. His enthusiasm and constant willingness to help have always amazed me.

Thanks are also due to Professor Joachim Hammer for his support during the very

early days of my graduate study. I take this opportunity to thank Professors Tamer

Kahveci and Gary Koehler for taking the time to serve on my committee and for their

helpful suggestions.

It was a pleasure working with Subi Arumugam and Abhijit Pol on various collaborative

research projects. Several interesting technical discussions with Mingxi Wu, Fei Xu, Florin

Rusu, Laukik Chitnis and Seema Degwekar provided a stimulating work environment in

the Database Center.

This work would not have been possible without the constant encouragement and

support of my family. My parents, Dr Sharad Joshi and Dr Hemangi Joshi, always

encouraged me to focus on my goals and pursue them against all odds. My brother,

Dr Abhijit Joshi has always placed trust in my abilities and has been an ideal example to

follow since my childhood. My loving sister-in-law, Dr Hetal Joshi has been supportive

since the time I decided to pursue computer science.











TABLE OF CONTENTS


page

ACKNOWLEDGMENTS ......... 4

LIST OF TABLES ......... 8

LIST OF FIGURES ......... 9

ABSTRACT ......... 11

CHAPTER

1 INTRODUCTION ......... 13

    1.1 Approximate Query Processing (AQP): A Different Paradigm ......... 13
    1.2 Building an AQP System Afresh ......... 14
        1.2.1 Sampling Vs Precomputed Synopses ......... 15
        1.2.2 Architectural Changes ......... 16
    1.3 Contributions in This Thesis ......... 18

2 RELATED WORK ......... 19

    2.1 Sampling-based Estimation ......... 19
    2.2 Estimation Using Non-sampling Precomputed Synopses ......... 28
    2.3 Analytic Query Processing Using Non-standard Data Models ......... 30

3 MATERIALIZED SAMPLE VIEWS FOR DATABASE APPROXIMATION ......... 33

    3.1 Introduction ......... 33
    3.2 Existing Sampling Techniques ......... 35
        3.2.1 Randomly Permuted Files ......... 35
        3.2.2 Sampling from Indices ......... 36
        3.2.3 Block-based Random Sampling ......... 37
    3.3 Overview of Our Approach ......... 38
        3.3.1 ACE Tree Leaf Nodes ......... 38
        3.3.2 ACE Tree Structure ......... 39
        3.3.3 Example Query Execution in ACE Tree ......... 40
        3.3.4 Choice of Binary Versus k-Ary Tree ......... 42
    3.4 Properties of the ACE Tree ......... 43
        3.4.1 Combinability ......... 43
        3.4.2 Appendability ......... 44
        3.4.3 Exponentiality ......... 44
    3.5 Construction of the ACE Tree ......... 45
        3.5.1 Design Goals ......... 45
        3.5.2 Construction ......... 46
        3.5.3 Construction Phase 1 ......... 46
        3.5.4 Construction Phase 2 ......... 48
        3.5.5 Combinability/Appendability Revisited ......... 51
        3.5.6 Page Alignment ......... 51
    3.6 Query Algorithm ......... 52
        3.6.1 Goals ......... 53
        3.6.2 Algorithm Overview ......... 53
        3.6.3 Data Structures ......... 55
        3.6.4 Actual Algorithm ......... 55
        3.6.5 Algorithm Analysis ......... 57
    3.7 Multi-Dimensional ACE Trees ......... 59
    3.8 Benchmarking ......... 60
        3.8.1 Overview ......... 61
        3.8.2 Discussion of Experimental Results ......... 66
    3.9 Conclusion and Discussion ......... 70

4 SAMPLING-BASED ESTIMATORS FOR SUBSET-BASED QUERIES ......... 73

    4.1 Introduction ......... 73
    4.2 The Concurrent Estimator ......... 78
    4.3 Unbiased Estimator ......... 80
        4.3.1 High-Level Description ......... 80
        4.3.2 The Unbiased Estimator In Depth ......... 81
        4.3.3 Why Is the Estimator Unbiased? ......... 85
        4.3.4 Computing the Variance of the Estimator ......... 87
        4.3.5 Is This Good? ......... 89
    4.4 Developing a Biased Estimator ......... 91
    4.5 Details of Our Approach ......... 92
        4.5.1 Choice of Model and Model Parameters ......... 92
        4.5.2 Estimation of Model Parameters ......... 95
        4.5.3 Generating Populations From the Model ......... 100
        4.5.4 Constructing the Estimator ......... 102
    4.6 Experiments ......... 103
        4.6.1 Experimental Setup ......... 103
            4.6.1.1 Synthetic data sets ......... 104
            4.6.1.2 Real-life data sets ......... 106
        4.6.2 Results ......... 109
        4.6.3 Discussion ......... 111
    4.7 Related Work ......... 118
    4.8 Conclusion ......... 119

5 SAMPLING-BASED ESTIMATION OF LOW SELECTIVITY QUERIES ......... 121

    5.1 Introduction ......... 121
    5.2 Background ......... 124
        5.2.1 Stratification ......... 124
        5.2.2 "Optimal" Allocation and Why It's Not ......... 126
    5.3 Overview of Our Solution ......... 128
    5.4 Defining Xe ......... 129
        5.4.1 Overview ......... 129
        5.4.2 Defining X,.,z ......... 130
        5.4.3 Defining Xc/ ......... 132
        5.4.4 Combining The Two ......... 135
        5.4.5 Limiting the Number of Domain Values ......... 137
    5.5 Updating Priors Using The Pilot ......... 139
    5.6 Putting It All Together ......... 141
        5.6.1 Minimizing the Variance ......... 141
        5.6.2 Computing the Final Sampling Allocation ......... 142
    5.7 Experiments ......... 143
        5.7.1 Goals ......... 143
        5.7.2 Experimental Setup ......... 144
        5.7.3 Results ......... 147
        5.7.4 Discussion ......... 147
    5.8 Related Work ......... 151
    5.9 Conclusion ......... 153

6 CONCLUSION ......... 154

APPENDIX

EM ALGORITHM DERIVATION ......... 155

REFERENCES ......... 157

BIOGRAPHICAL SKETCH ......... 165










LIST OF TABLES


Table page

4-1 Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for
    24 synthetically generated data sets. The table shows errors for three different
    sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows
    the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling
    estimator and B - Model-based biased estimator. ......... 112

4-2 Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for
    24 synthetically generated data sets. The table shows errors for three different
    sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows
    the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling
    estimator and B - Model-based biased estimator. ......... 113

4-3 Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for
    18 synthetically generated data sets. The table shows errors for three different
    sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows
    the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling
    estimator and B - Model-based biased estimator. ......... 114

4-4 Observed standard error as a percentage of the total aggregate value of all records
    in the database for 8 queries over 3 real-life data sets. The table shows errors
    for three different sampling fractions: 1%, 5% and 10%, and for each of these
    fractions, it shows the error for the three estimators: U - Unbiased estimator,
    C - Concurrent sampling estimator and B - Model-based biased estimator. ......... 115

5-1 Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage
    (for 1000 query runs) for a Simple Random Sampling estimator for the KDD
    Cup data set. Results are shown for varying sample sizes and for three different
    query selectivities: 0.01%, 0.1% and 1%. ......... 146

5-2 Average running time of Neyman and Bayes-Neyman estimators over three real-world
    datasets. ......... 147

5-3 Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage
    (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator
    for the three data sets. Results are shown for 20 strata and for varying number
    of records in pilot sample per stratum (PS), and sample sizes (SS) for three different
    query selectivities: 0.01%, 0.1% and 1%. ......... 148

5-4 Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage
    (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator
    for the three data sets. Results are shown for 200 strata with varying number
    of records in pilot sample per stratum (PS), and sample sizes (SS) for three different
    query selectivities: 0.01%, 0.1% and 1%. ......... 149










LIST OF FIGURES


Figure page

1-1 Simplified architecture of a DBMS. ......... 17

3-1 Structure of a leaf node of the ACE tree. ......... 39

3-2 Structure of the ACE tree. ......... 40

3-3 Random samples from section 1 of L3. ......... 41

3-4 Combining samples from L3 and L5. ......... 42

3-5 Combining two sections of leaf nodes of the ACE tree. ......... 43

3-6 Appending two sections of leaf nodes of the ACE tree. ......... 45

3-7 Choosing keys for internal nodes. ......... 47

3-8 Exponentiality property of ACE tree. ......... 48

3-9 Phase 2 of tree construction. ......... 49

3-10 Execution runs of query answering algorithm with (a) 1 contributing section,
     (b) 6 contributing sections, (c) 7 contributing sections and (d) 16 contributing
     sections. ......... 54

3-11 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly
     permuted file, with a one-dimensional selection predicate accepting 0.25% of
     the database records. The graph shows the percentage of database records retrieved
     by all three sampling techniques versus time plotted as a percentage of the time
     required to scan the relation. ......... 60

3-12 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly
     permuted file, with a one-dimensional selection predicate accepting 2.5% of the
     database records. The graph shows the percentage of database records retrieved
     by all three sampling techniques versus time plotted as a percentage of the time
     required to scan the relation. ......... 61

3-13 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly
     permuted file, with a one-dimensional selection predicate accepting 25% of the
     database records. The graph shows the percentage of database records retrieved
     by all three sampling techniques versus time plotted as a percentage of the time
     required to scan the relation. ......... 62

3-14 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly
     permuted file, with a one-dimensional selection predicate accepting 2.5% of the
     database records. The graph is an extension of Figure 3-12 and shows results
     till all three sampling techniques return all the records matching the query
     predicate. ......... 63

3-15 Number of records needed to be buffered by the ACE Tree for queries with (a)
     0.25% and (b) 2.5% selectivity. The graphs show the number of records buffered
     as a fraction of the total database records versus time plotted as a percentage
     of the time required to scan the relation. ......... 64

3-16 Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly
     permuted file, with a spatial selection predicate accepting 0.25% of the database
     tuples. ......... 67

3-17 Sampling rate of an ACE tree vs. rate for an R-tree, and scan of a randomly
     permuted file with a spatial selection predicate accepting 2.5% of the database
     tuples. ......... 68

3-18 Sampling rate of an ACE tree vs. rate for an R-tree, and scan of a randomly
     permuted file with a spatial selection predicate accepting 25% of the database
     tuples. ......... 69

4-1 Sampling from a superpopulation. ......... 90

4-2 Six distributions used to generate, for each e in EMP, the number of records s in
    SALE for which f(e, s) evaluates to true. ......... 105

5-1 Beta distribution with parameters α = β = 0.5. ......... 131









Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

SAMPLING-BASED RANDOMIZATION TECHNIQUES FOR APPROXIMATE
QUERY PROCESSING

By
SHANTANU JOSHI

August 2007


Major: Computer Engineering

The past couple of decades have seen a significant amount of research directed

towards data warehousing and efficient processing of analytic queries. This is a daunting

task due to massive sizes of data warehouses and the nature of complex, analytical queries.

This is evident from standard, published benchmarking results such as TPC-H, which

show that many typical queries can require several minutes to execute despite using

sophisticated hardware equipment. This can seem expensive especially for ad-hoc, data

exploratory analysis. One direction to speed up execution of such exploratory queries is

to rely on approximate results. This approach can be especially promising if approximate

answers and their error bounds are computed in a small fraction of the time required to

execute the query to completion. Random samples can be used effectively to perform

such an estimation. Two important problems have to be addressed before using random

samples for estimation. The first problem is that retrieval of random samples from

a database is generally very expensive and hence index structures are required to be

designed which can permit efficient random sampling from arbitrary selection predicates.

Secondly, approximate computation of arbitrary queries generally requires complex

statistical machinery and reliable sampling-based estimators have to be developed for

different types of analytic queries. My research addresses the two problems described

above by making the following contributions: (a) A novel file organization and index

structure called the ACE Tree which permits efficient random sampling from an arbitrary










range query. (b) Sampling-based estimators for aggregate queries which have a correlated

subquery where the inner and outer queries are related by the SQL EXISTS, NOT

EXISTS, IN or NOT IN clause. (c) A stratified sampling technique for estimating the

result of aggregate queries having highly selective predicates.










CHAPTER 1
INTRODUCTION

The last couple of decades have seen an explosive growth of electronic data. It is not

unusual for data management systems to support several terabytes or even petabytes of

data. Such massive volumes of data have led to the evolution of "data warehouses", which

are systems capable of supporting storage and efficient retrieval of large amounts of data.

Data Warehouses are typically used for applications such as online analytical processing

among others. Such applications process queries and expect results in a manner that is

different from traditional transaction processing. For example, a typical query by a sales

manager on a sales data warehouse might be:

"Return average salary of all employees at locations whose sales have increased by

at least 10% over the past 3 years."

The result of such a query could be used to make high-level decisions such as whether

or not to hire more employees at the locations of interest. Such queries are typical in

a data warehousing environment in that their evaluation requires complex analytical

processing over huge amounts of data. Traditional transactional processing methods may

be unacceptably slow to answer such complex queries.

1.1 Approximate Query Processing (AQP): A Different Paradigm

The nature of analytical queries and their associated applications provides an

opportunity to provide results which may not be exact. Since computation of exact

results may require an unreasonable amount of time due to massive volumes of data,

approximation may be attractive if the approximate results can be computed in a fraction

of the time it would take to compute the exact results. Moreover, providing approximate

results can be useful to quickly explore the whole data at a high level. This technique of

providing fast but approximate results has been termed "Approximate Query Processing"

in the literature.










In addition to computing an approximate answer, it is also important to provide

metrics about the accuracy of the answer. One way to express the accuracy is in terms

of error bounds with certain probabilistic guarantees of the form, "The estimated answer

is 2.45 × 10^5, and with 95% confidence the true answer lies within ±1.18 × 10^3 of the

estimate." Here, the error bounds are expressed as an interval and the accuracy guarantee

is provided at 95% confidence.
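
Bounds of this general form typically come from a normal (central-limit-theorem) approximation. As a sketch of one standard derivation, and not necessarily the exact machinery behind any particular system's guarantee, the reported interval is

    \[ \hat{\theta} \;\pm\; z_{\alpha/2}\,\frac{\hat{\sigma}}{\sqrt{n}} \]

where θ̂ is the estimate, σ̂ is the sample standard deviation, n is the number of records sampled so far, and z_{α/2} is the normal quantile (roughly 1.96 at 95% confidence). The interval therefore narrows in proportion to the square root of the sample size.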

A promising approach for aggregation queries in Approximate Query Processing

(AQP) has been proposed by Haas and Hellerstein [63] called Online Aggregation (OLA).

They propose an interactive interface for data exploration and analysis where records are

retrieved in a random order. Using these random samples, running estimates and error

bounds are computed and immediately displayed to the user. As time progresses, the

size of the random sample keeps growing and so the estimate is continuously refined. At

a predetermined time interval, the refined estimate along with its improved accuracy is

displayed to the user. If at any point of time during the execution the user is satisfied with

the accuracy of the answer, she can terminate further execution. The system also gives an

overall progress indicator based on the fraction of records that have been sampled thus

far. Thus, OLA provides an interface where the user is given a rough estimate of the result

very quickly.
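
To make the refinement loop concrete, the following is a minimal sketch of how such a running estimate and its error bound might be maintained as randomly ordered records stream in. It is not the actual OLA implementation of Haas and Hellerstein; the generator interface, the normal (CLT) approximation, and the omission of the finite-population correction are simplifying assumptions made here for illustration.

    import math
    import random

    def running_average(value_stream, z=1.96):
        """Yield (estimate, half_width) after every sampled value.

        Maintains a running mean of the values seen so far, plus a
        CLT-based confidence half-width (z = 1.96 for roughly 95%);
        both refine as more randomly ordered samples arrive.
        """
        n = 0
        total = 0.0
        total_sq = 0.0
        for v in value_stream:
            n += 1
            total += v
            total_sq += v * v
            mean = total / n
            if n > 1:
                # unbiased sample variance recovered from the running sums
                var = max(total_sq - n * mean * mean, 0.0) / (n - 1)
                half_width = z * math.sqrt(var / n)
            else:
                half_width = float("inf")
            yield mean, half_width

    # Hypothetical usage: refine an AVG(salary) estimate on the fly.
    stream = (random.gauss(50000, 8000) for _ in range(10000))
    for i, (est, hw) in enumerate(running_average(stream)):
        if (i + 1) % 2500 == 0:
            print(f"{i + 1} samples: {est:,.0f} +/- {hw:,.0f}")

The user-facing behavior described above falls out directly: each pass through the loop produces a fresher estimate with a narrower interval, and the consumer can stop iterating the moment the interval is tight enough.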

1.2 Building an AQP System Afresh

The OLA system described above presents an intuitive interface for approximate

answering of aggregate queries. However, to support the functionality proposed by

the system, fundamental changes need to be incorporated in several components of a

traditional database management system. In this section, we first examine why sampling

is a good approach for AQP, and then present an overview of the changes needed in the

architecture of a database management system to support sampling-based AQP.










1.2.1 Sampling Vs Precomputed Synopses

We now discuss two techniques that can be used to support fast but approximate

answering of queries. One intuitive technique is using some compact information about

records for answering queries. Such information is typically called a database statistic and

it is actually summary information about the actual records of the database. Commonly

used database statistics are wavelets, histograms and sketches. Such statistics, also known

as synopses, are orders of magnitude smaller in size than the actual data. Hence it is much

faster and efficient to access synopses as compared to reading the entire data. However,

such synopses are precomputed and static. If a query is issued which requires some

synopses that are not already available, then they would have to be computed by scanning

the dataset, possibly multiple times before answering the query.

The second approach to AQP is to use samples of database records to answer queries.

Query execution is extremely fast since the number of records in the sample is a small

fraction of the total number of records in the database. The answer is then extrapolated

or "scaled-up" to the size of the entire database. Since the answer is computed by

processing very few records of the database, it is an approximation of the true answer.
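
As a concrete illustration of the scale-up step, the sketch below extrapolates SUM and COUNT aggregates from a simple random sample. The table size, sample, and predicate names are hypothetical, and the finite-population correction is ignored for brevity.

    def scaled_sum(sample_values, table_size):
        """Scale a sample SUM up to the full table.

        Each sampled record stands in for table_size / len(sample_values)
        records, so the sample total is multiplied by that factor.
        """
        return (table_size / len(sample_values)) * sum(sample_values)

    def scaled_count(sample_records, predicate, table_size):
        """Scale the number of predicate matches in the sample up to the table."""
        matches = sum(1 for r in sample_records if predicate(r))
        return (table_size / len(sample_records)) * matches

    # e.g., a 1% sample (10,000 rows) of a 1,000,000-row table:
    # est_total_sales = scaled_sum(sampled_sales, 1_000_000)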

For the work in this thesis, we propose to use sampling [25, 109] in order to support

AQP. We make this choice due to the following important advantages of sampling over

precomputed synopses. The accuracy of an estimate computed by using samples can be

easily improved by obtaining more samples to answer the query. On the other hand, if the

estimate computed by using synopses is not sufficiently accurate, a new synopsis providing

greater accuracy would have to be built. Since this would require scanning the dataset, it is

impractical. Secondly, sampling is very amenable to scalability. Even for extremely large

datasets of the order of hundreds of gigabytes, it is generally possible to accommodate a

small sample in main memory and use efficient in-memory algorithms to process it. If this

is not possible, disk-based samples and algorithms have also been proposed [76] and are

equally effective as their in-memory counterparts. This is an important benefit of sampling










as compared to histograms, which become unwieldy as the number of attributes of the

records in the dataset increases.

Thirdly, since real records (although very few) are used in a sample, it is possible

to answer any statistical query including arbitrarily complex functions in relational

selection and join predicates. This is a very important advantage of sampling as opposed

to synopses such as sketches, which are not suitable for answering arbitrary queries.

Finally, unlike precomputed synopses, there is no requirement of maintenance and

updates for on-the-fly sampling as data are updated.

1.2.2 Architectural Changes

In order to support sampling-based AQP in a database management system, major

changes need to be incorporated in the architecture of the system. The reason for this is

that traditional database management systems were not designed to work with random

samples or to support computation of approximate results. In this section, we briefly

describe some of the most critical changes that are required in the architecture of a DBMS

to support sampling-based AQP.

Figure 1-1 depicts the various components from a simplified architecture of a DBMS.

The four components that require major changes in order to support sampling-based AQP

are as follows:

* Index/file/record manager The use of traditional index structures like B+-Trees
is not appropriate to obtain random samples. This is because such index structures
order records based on record search key values which is actually the opposite of
obtaining records in a random order. Hence, for AQP it is important to provide
physical structures or file organizations which support efficient retrieval of random
samples.

* Execution engine The execution engine needs to be revamped completely so that
it can use the random samples returned by the lower level to execute the query on
them. Further, the result of the query needs to be scaled up appropriately for the
size of the entire database. This component would also need to be able to compute
accuracy guarantees for the approximate answer.











[Figure 1-1 shows queries and updates flowing from the User Interface to the
Query Compiler; the resulting query plan is passed to the Execution Engine,
which issues index, file and record requests to the Index/File/Record Manager;
that manager sends page commands to the Buffer Manager, which in turn reads
and writes pages through the Storage Manager.]

Figure 1-1. Simplified architecture of a DBMS


* Query compiler The query compiler has to be modified so that it can chalk out
a different strategy of execution for various types of queries like relational joins,
subset-based queries or queries with a GROUP-BY clause. Moreover, optimization
of queries needs to be done very differently from traditional query optimizers which
create the most efficient query plan to run a query to completion. For AQP, queries
should be optimized so that the first few result tuples are output as quickly as
possible.

* User interface There is tremendous scope of providing an intuitive user interface
for an online AQP system. In addition to the UI being able to provide accuracy
guarantees to the estimate, it would be very intuitive to provide a visualization of
the intermediate results as and when they become available so that the user can
continue to explore the query or decide to modify or terminate it. Current database
management systems provide user interfaces with very limited functionality.










1.3 Contributions in This Thesis


These tasks involve significant research and implementation issues. Since many of the

problems have never been tackled in the literature, there are several challenging tasks to

be addressed.

For the scope of my research, I choose to address the following three problems. The

motivation and our solutions to each of these research problems are described separately in

the following chapters of this thesis.

* We present a primary index structure which can support efficient retrieval of random
samples from an arbitrary range query. This requires a specialized file organization
and an efficient algorithm to actually retrieve the desired random samples from the
index. This work falls in the scope of the Index/file/record manager component
described earlier.

* We present our solution to support execution of queries which have a nested
sub-query where the inner query is correlated to the outer query, in an approximate
query processing framework. This work falls in the purview of the execution engine of
the system.

* Finally, we also present a technique to support efficient execution of queries which
have predicates with low selectivities, such as GROUP BY queries with many
different groups. This work also falls in the scope of the query execution engine.









CHAPTER 2
RELATED WORK

This chapter presents previous work in the data management and statistics literature

related to estimation using sampling as well as non-sampling-based precomputed

synopses structures. Finally, it describes work related to OLAP query processing using

non-relational data models like data cubes.

2.1 Sampling-based Estimation

Sampling has a long history in the data management literature. Some of the

pioneering work in this field has been done by Olken and Rotem [96, 98-101] and

Antoshenkov [9], though the idea of using a survey sample for estimation in statistics

literature goes back much earlier than these works. Most of the work by Olken and Rotem

describes how to perform simple random sampling from databases. Estimation for several

types of database tasks has been attempted with random samples. The rest of this section

presents important works on sampling-based estimation of various database tasks.

Some of the initial work on estimating selectivity of join queries is due to Hou et al.

[67, 68]. They present unbiased and consistent estimators for estimating the join size and

also provide an algorithm for cluster sampling. In [64] they propose unbiased estimators

for COUNT aggregate queries over arbitrary relational algebra expressions. However,

computation of variance of their estimators is very complex [67]. They also do not provide

any bounds on the number of random samples required for estimation.

Adaptive sampling has been used for estimation of selectivity of predicates in

relational selection and join operations [83, 84, 86] and for approximating the size of a

relational projection operation [94]. Adaptive sampling has also been used in [85], to

estimate transitive closures of database relations. The authors point out the benefits and

generality of using sampling for selectivity estimation over parametric methods which

make assumptions about an underlying probability distribution for the data as well as

over non-parametric methods which require storing and maintaining synopses about the










underlying data. The algorithms consider the query result as a collection of results from

several disjoint subqueries. Subqueries are sampled randomly and their result sizes are

computed. The estimate of the actual query result size is then obtained from the results

of the various subqueries. The sampling of subqueries is continued until either the sum

of the subquery sizes is sufficiently large or the number of samples taken is sufficiently

large. The method requires that the maximum size of a subquery be known. Since this

is generally not available, the authors use an upper bound for the maximum subquery

size in their method. Haas and Swami [59] observe that using a loose upper bound for

the maximum subquery size can lead to sampling more subqueries than necessary,

potentially increasing the cost of sampling significantly.

Double sampling or two-phase sampling has been used in [66] for estimating the

result of a COUNT query with a guaranteed error bound at a certain confidence level.

The error bound is guaranteed by performing sampling in two steps. In the first step a

small pilot sample is used to obtain preliminary information about the input relation. This

information is then used to compute the size of the sample for the second step such that

the estimator is guaranteed to produce an estimate with the desired error bound.
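A minimal sketch of this two-phase idea (our own simplification under CLT
assumptions, not the estimator of [66]): the pilot sample supplies a variance
estimate, from which the required second-phase sample size follows.

import math
import random

def second_phase_size(population, pilot_size, target_error, z=1.96):
    # Phase 1: a small pilot sample yields a preliminary variance estimate.
    pilot = random.sample(population, pilot_size)
    mean = sum(pilot) / pilot_size
    var = sum((x - mean) ** 2 for x in pilot) / (pilot_size - 1)
    # Phase 2: choose n so that z * sqrt(var / n) <= target_error.
    return max(pilot_size, math.ceil((z / target_error) ** 2 * var))

population = [random.gauss(50, 10) for _ in range(100_000)]
print(second_phase_size(population, pilot_size=100, target_error=0.5))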

As Haas and Swami [59] point out, the drawback of using double sampling is that

there is no theoretical guidance for choosing the size of the pilot sample. This could

lead to an unpredictably imprecise estimate if the pilot sample size is too small or an

unnecessarily high sampling cost if the pilot sample size is too large. In their work [59],

Haas and Swami present sequential sampling techniques which provide an estimate of the
result size and also bound the estimation error with a prespecified probability. They

present two algorithms in the paper to estimate the size of a query result. Although both

algorithms have been proven to be asymptotically correct and efficient, the first algorithm

suffers from the problem of undercoverage. This means that in practice the probability

with which it estimates the query result within the computed error bound is less than

the specified confidence level of the algorithm. This problem is addressed by the second










algorithm which organizes groups of equal-sized result sets into a single stratum and then

performs stratified sampling over the different strata. However, their algorithms do not

perform very well when estimating the size of joins between a skewed and a non-skewed

relation.

Ling and Sun [82] point out that general sampling-based estimation methods have a

high cost of execution since they make an overly restrictive assumption of no knowledge

about the overall characteristics of the data. In particular, they note that estimation of

the overall mean and variance of the data not only incurs cost but also introduces error in

estimation. The authors instead suggest an alternative approach of actually keeping track

of these characteristics in the database at a minimal overhead.

A detailed study about the cost of sampling-based methods to estimate join query

sizes appears in [58]. The paper systematically analyses the factors which influence the

cost of a sampling-based method to estimate join selectivities. Based on their analysis,

their findings can be summarized as follows: (a) When the measure of precision of the

estimate is absolute, the cost of sampling increases with the number of relations involved

in the join as well as the sizes of the relations themselves. (b) When the measure of

precision of the estimate is relative, the cost of using sampling increases with the sizes

of the relations, but decreases as the number of input relations increases. (c) When the

distribution of the join attribute values is uniform or highly skewed for all input relations,

the cost of sampling tends to be low, while it is high when only some of the input relations

have a skewed join attribute value distribution. (d) The presence of tuples in a relation

which do not join with any other tuples from other relations always increases the cost of

sampling.

Haas et al. [56, 57] study and compare the performance of new as well as previous

sampling-based procedures for estimating the selectivity of queries with joins. In particular

they identify estimators which have a minimum variance after a fixed number of sampling

steps have been performed. They note that use of indexes on input relations can further










reduce variance of the selectivity estimate. The authors also show how their estimation

methods can be used to estimate the cost of implementing a given join query plan without

making any assumptions about the underlying data or requiring storage and maintenance

of summary statistics about the data.

Ganguly et al. [35] describe how to estimate the size of a join in the presence of skew

in the data by using a technique called bifocal sampling. This technique classifies tuples

of each input relation into two groups, sparse and dense, based on the number of tuples

with the same value for the join attribute. Every combination of these groups is then

subject to different estimation procedures. Each of these estimation procedures requires a

sample size larger than a certain value (in terms of the total number of tuples in the input

relation) to provide an estimate within a small constant factor of the true join size. In

order to guarantee estimates with the specified accuracy, bifocal sampling also requires the

total join size and the join sizes from sparse-sparse subjoins to be greater than a certain

threshold.

Gibbons and Matias [40] introduce two sampling-based summary statistics called
concise samples and counting samples and present techniques for their fast and incremental

maintenance. Although the paper describes summary statistics rather than on-the-fly

sampling techniques, the summary statistics are created from random samples of the

underlying data and are actually defined to describe characteristics of a random sample

of the data. Since summary statistics of a random sample require much less memory
than the sample itself, the paper describes how information from a much larger

sample can be stored in a given amount of memory by storing sample statistics instead

of using the memory to store actual random samples. Thus, the authors claim that since
their summary statistics capture information from a larger sample, the accuracy of
approximate answers can be boosted.
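A minimal sketch of the underlying idea (our simplification, not the authors'
incremental maintenance algorithms): a uniform sample in which repeated values
collapse into (value, count) pairs, so that over skewed data the same memory can
hold a larger effective sample.

import random
from collections import Counter

def concise_sample(data, sample_size):
    # Draw a uniform sample (with replacement, for simplicity) and store
    # it concisely as value -> count pairs.
    return Counter(random.choices(data, k=sample_size))

# Skewed data: half the records share a single value.
data = [1] * 50_000 + [random.randint(2, 1_000) for _ in range(50_000)]
cs = concise_sample(data, 1000)
print(sum(cs.values()), "sampled records stored as", len(cs), "pairs")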

Chaudhuri, Motwani and Narasayya [22, 24] present a detailed study of the problem

of efficiently sampling the output of a join operation without actually computing the










entire join. They prove a negative result that it is not possible to generate a sample of

the join result of two relations by merely joining samples of the relations involved in

the join. Based on this result, they propose a biased sampling strategy which samples

tuples from one relation in the proportion with which their matching tuples appear in

the other relation. The intuition behind this approach is that the resulting biased sample

is more likely to reflect the structure of the actual join result between the two relations.

Information about the frequency of the various join attribute values is assumed to be

available in the form of some synopsis structures like histograms.

There has also been work to estimate the actual result of an aggregate query which

involves a relational join operation on its input relations. In fact, Hellerstein, Haas and
Wang [63] propose a system called Online Aggregation (OLA) that can support online

execution of analytic-style aggregation queries. They propose the system to have a visual

interface which displays the current estimate of the aggregate query along with error

bounds at a certain confidence level. Then, as time progresses, the system continually

refines the estimate and at the same time shrinks the width of the error bounds. The user

who is presented with such a visual interface, has at all times an option to terminate

further execution of the query in case the error bound width is satisfactory for the given

confidence level. The authors propose the use of random sampling from input relations to

provide estimates in OLA. Further, they describe some of the key changes that would

be required in a DBMS to support OLA. In [51], Haas describes statistical techniques

for computing error bounds in OLA. The work on OLA eventually grew into the UC

Berkeley CONTROL project. In their article [62], Hellerstein et al. describe various issues

in providing interactive data analysis and possible approaches to address those issues.

Haas and Hellerstein [53, 54] propose a family of join algorithms called ripple joins

to perform relational joins in an OLA framework. Ripple joins were designed to minimize

the time until an acceptably precise estimate of the query result is made available, as

opposed to minimizing the time to completion of the query as in a traditional DBMS. For










a two-table join, the algorithm retrieves a certain number of random tuples from both

relations at each sampling step; these new tuples are joined with previously seen tuples

and with each other. The running result of the .I__-oegate query is updated with these

newly retrieved tuples. The paper also describes how a statistically meaningful confidence

interval of the estimated result can be computed based on the Central Limit Theorem

(CLT).
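The following minimal sketch (our own illustration, not the ripple-join estimator
itself) shows the kind of running, CLT-based interval such a system maintains as
randomly ordered tuples stream in.

import math
import random

def online_avg(stream, z=1.96, report_every=1000):
    n = 0
    total = total_sq = 0.0
    for x in stream:
        n += 1
        total += x
        total_sq += x * x
        if n >= 2 and n % report_every == 0:
            mean = total / n
            var = (total_sq - n * mean * mean) / (n - 1)
            # Report the running estimate with a shrinking error bound.
            yield n, mean, z * math.sqrt(var / n)

stream = (random.gauss(100, 15) for _ in range(10_000))
for n, est, hw in online_avg(stream):
    print("after %5d tuples: AVG ~ %.2f +/- %.2f" % (n, est, hw))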

Luo et al. [87] present an online parallel hash ripple join algorithm to speed up

the execution of the ripple join especially when the join selectivity is low and also when

the user wishes to continue execution till completion. The algorithm is assumed to be

executed at a fixed set of processor nodes. At each node, a hash table is maintained for

every relation. Moreover, every bucket in each hash table could have some tuples stored

in memory and some others stored on disk. The join algorithm proceeds in two phases; in

the first phase tuples from both relations are retrieved in a random order and distributed

to the processor nodes so that each node would perform roughly the same amount of work

for executing the join. By using multiple threads at each node, production of join tuples

from the in-memory hash table buckets begins even as tuples are being distributed to

the various processors. The second phase begins after redistribution from the first phase

is complete. In this phase, a new in-memory hash table is created which uses a hashing

function different from the function used in phase 1. The tuples in the disk-resident

buckets of the hash table of phase 1 are then hashed according to the hashing function

of phase 2 and joined. The algorithm provides a considerable speed-up factor over the

one-node ripple join, provided its memory requirements are met.

Jermaine et al. [73, 74] point out that the drawback of both the ripple join algorithms

described above is that the statistical guarantees provided by the estimator are valid

only as long as the output of the join can be accommodated in main memory. In order

to counteract this problem, they propose the Sort-Merge-Shrink join algorithm as a

generalization of the ripple join which can provide error guarantees throughout execution,










even if it operates from disk. The algorithm proceeds in three phases. In the sort phase,

the two input relations are read in parallel and sorted into runs. Each pair of runs is

subject to an in-memory hash ripple join and provides a corresponding estimate of the

join result. The merge and shrink phases execute concurrently where in the merge phase,

tuples are retrieved from the various sorted runs of both relations and joined with each

other. Since the sorted runs lose tuples which are pulled by the merge phase, the

shrinking phase takes these tuples into account and updates the estimator accordingly.

The authors provide a detailed statistical analysis of the estimator as well as computation of

error bounds.

Estimation using sampling of the number of distinct values in a column has been

studied by Haas et al. [48]. They provide an overview of the estimators used in the

database and statistics literature and also develop several new sampling-based estimators

for the distinct value estimation problem. They propose a new hybrid sampling estimator

which explicitly adapts to different levels of data skew. Their hybrid estimator performs

a Chi-square test to detect skew in the distribution of the attribute value. If the data

appears to be skewed, then Shlosser's estimator is used while if the test does not detect

skew, a smoothed-jackknife estimator (which is a modification of the conventional

jackknife estimator) is used. The authors attribute the dearth of work on sampling-based

estimation of the number of distinct values to the inherent difficulty of the problem while

noting that it is a much harder problem than estimating the selectivity of a join.

Haas and Stokes [50] present a detailed study of the problem of estimating the

number of classes in a finite population. This is equivalent to the database problem of

estimating the number of distinct values in a relation. The authors make recommendations

about which statistical estimator is appropriate subject to constraints and finally claim

from empirical results that a hybrid estimator which adapts according to data skew is the

most superior estimator.










There has also been work by Charikar et al. [16] which establishes a negative result

stating that no sampling-based estimator for estimating the number of distinct values

can guarantee small error across all input distributions unless it examines a large fraction

of the input data. They also present a Guaranteed Error Estimator (GEE) whose error

is provably no worse than their negative result. Since the GEE is a general estimator

providing optimal error over all distributions, the authors note that its accuracy may be

lower than some previous estimators on specific distributions. Hence, they propose an

estimator called the Adaptive Estimator (AE) which is similar in spirit to Haas et al.'s

hybrid estimator [50], but unlike the latter, is not composed of two distinct estimators.

Rather the AE considers the contribution of data items having high and low frequencies in

a single unified estimator.

In the AQUA system [41] for approximate answering of queries, Acharya et al.

[6] propose using synopses for estimating the result of relational join queries involving

foreign-key joins rather than using random samples from the base relations. These

synopses are actually precomputed samples from a small set of distinguished joins and

are called join synopses in the paper. The idea of join synopses is that by precomputing
samples from a small set of distinguished joins, these samples can be used for estimating

the result of many other joins. The concept is applicable in a k-way join where each join

involves a primary and foreign key of the participating relations. The paper describes

that if workload information is available, it can be used to design an optimal allocation

for the join synopses that minimizes the overall error in the approximate answers over the

workload.

Acharya et al. [5] propose using a mix of uniform and biased samples for approximately

answering queries with a GROUP-BY clause. Their sampling technique, called
congressional sampling, relies on using precomputed samples which are a hybrid union of uniform

and biased samples. They assume that the selectivity of the query predicate is not so low

that their precomputed sample completely misses one or more groups from the result of










the GROUP-BY query. Based on this assumption, they devise a sampling plan for the

different groups such that the expected minimum number of tuples satisfying the query

predicate in any group, is maximized. The authors also present one-pass algorithms [4] for

constructing the congressional samples.

Ganti et al. [37] describe a biased sampling approach which they call ICICLES to

obtain random samples which are tuned to a particular workload. Thus, if a tuple is

chosen by many queries in a workload, it has a higher probability of being selected in the

self-tuning sample as compared to tuples which are chosen by fewer queries. Since this is

a non-uniform sample, traditional sampling-based estimators must be adapted for these

samples. The paper describes modified estimators for the common aggregation operations.

It also describes how the self-tuning samples are tuned in the presence of a dynamically

changing workload.

Chaudhuri et al. [18] note that uniform random sampling to estimate aggregate
queries is ineffective when the distribution of the aggregate attribute is skewed or when

the query predicate has a low selectivity. They propose using a combination of two

methods to address this problem. Their first approach is to index separately those

attribute values which contribute significantly to the query result. This method is called

Outlier Indexing in the paper. The second approach proposed in the paper is to exploit

workload information to perform weighted sampling. According to this technique, records

which satisfied many queries in the workload are sampled more often than records that satisfied

fewer queries.

Chaudhuri, Das and Narasayya [19, 20] describe how workload information can

be used to precompute a sample that minimizes the error for the given workload. The

problem of selection of the sample is framed as an optimization problem so that the error

in estimation of the workload queries using the resulting sample is minimized. When the

actual incoming queries are identical to queries in the workload, this approach gives a

solution with minimal error across all queries. The paper also describes how the choice of










the sample can be tuned to achieve effective estimates when the actual queries are similar

but not identical to the workload.

Babcock, Chaudhuri and Das [10] note that a uniformly random sample can lead

to inaccurate answers for many queries. They observe that for such queries, estimation

using an appropriately biased sample can lead to more accurate answers as compared

to estimation using uniformly random samples. Based on this idea, the paper describes

a technique called small group sampling which is designed to approximately answer

aggregation queries having a GROUP-BY clause. The distinctive feature of this technique

as compared to previous biased sampling techniques like congressional sampling is that

a new biased sample is chosen for every GROUP-BY query, such that it maximizes

the accuracy of estimating the query rather than trying to devise a biased sample that

maximizes the accuracy over an entire workload of queries. According to this technique,

larger groups from the output of the GROUP-BY queries are sampled uniformly while the

small groups are sampled at a higher rate to ensure that they are adequately represented.

The group samples are obtained on a per-query basis from an overall sample which is

computed in a pre-processing phase.
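A minimal sketch of this intuition (our own simplification; the actual technique
chooses samples dynamically on a per-query basis): small groups are sampled at a
much higher rate than large ones, and each group's aggregate is scaled up by its
own sampling rate.

import random
from collections import defaultdict

def group_samples(rows, large_rate=0.01, small_rate=0.5, threshold=1000):
    # rows: iterable of (group_key, value) pairs.
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    sampled = {}
    for key, values in groups.items():
        rate = large_rate if len(values) >= threshold else small_rate
        k = max(1, int(rate * len(values)))
        sampled[key] = (random.sample(values, k), len(values))
    return sampled

rows = [("big", random.random()) for _ in range(100_000)]
rows += [("tiny", random.random()) for _ in range(20)]
for key, (sample, size) in group_samples(rows).items():
    est = sum(sample) * size / len(sample)  # scale the sample SUM to the group
    print(key, "estimated SUM ~ %.1f" % est)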

In fact, database sampling has been recognized as an important enough problem

that ISO has been working to develop a standard interface for sampling from relational

database systems [55], and significant research efforts are directed at providing sampling

from database systems by vendors such as IBM [52].

2.2 Estimation Using Non-sampling Precomputed Synopses

Estimation in databases using a non-sampling technique was first proposed by Rowe

[106, 107]. The technique proposed is called antisampling and involves creation of a special
auxiliary structure called a database abstract. The abstract considers the distribution of

several attributes and groups of attributes. Correlations between different attributes can

also be characterized as statistics. This technique was found to be faster than random

sampling, but required domain knowledge about the various attributes.










Classic work on histogram-based estimation of predicate selectivity is by Selinger et

al. [110] and Piatetsky-Shapiro and Connell [102]. Selectivity estimation of queries with

multidimensional predicates using histograms was presented by Muralikrishna and DeWitt

[92]. They show that the maximum error in estimation can be controlled more effectively

by choosing equi-depth histograms as opposed to equi-width histograms.

Ioannidis [70] describes how serial histograms are optimal for aggregate queries

involving arbitrary join trees with equality predicates. Ioannidis and Poosala [71] have also

studied how histograms can be used to approximately answer non-aggregate queries which
have a set-based result.

Several histogram construction schemes [42, 45, 72] have been proposed in the

literature. Jagadish et al. [72] describe techniques for constructing histograms which can

minimize a given error metric where the error is introduced because of approximation

of values in a bucket by a single value associated with the bucket. They also describe

techniques for augmenting histograms with additional information so that they can be

used to provide accuracy guarantees of the estimated results.

Construction of approximate histograms by considering only a random sample of

the data set was investigated by Chaudhuri et al. [23]. Their technique uses an adaptive

sampling approach to determine the sample size that would be sufficient to generate

approximate histograms which can guarantee pre-specified error bounds in estimation.

They also extend their work to consider duplicate values in the domain of the attribute for

which a histogram is to be constructed.

The problem of estimation of the number of distinct value combinations of a set of

attributes has been studied by Yu et al. [121]. Due to the inherent difficulty of developing

a good, sampling-based estimation solution to the problem, they propose using additional

information about the data in the form of histograms, indexes or data cubes.

In a recent paper [28], Dobra presents a study of when histograms are best suited for

approximation. The paper considers the long-standing assumption that histograms are










most effective only when all elements in a bucket have the same frequency and actually

extends it to a less restrictive assumption that histograms are well-suited when elements

within a bucket are randomly arranged even though they might have different frequencies.

Wavelets have a long history as mathematical tools for hierarchical decomposition

of functions in signal and image processing. Vitter and his collaborators have also

studied how wavelets can be applied to selectivity estimation of queries [89] and also
for computing aggregates over data cubes [118, 119]. Chakrabarti et al. [15] present
techniques for approximate computation of results for aggregate as well as non-aggregate

queries using Haar wavelets.

One more summary structure that has been proposed for approximating the size of

joins is sketches. Sketches are small-space summaries of data suited for data streams. A

sketch generally consists of multiple counters corresponding to random variables which

enable them to provide approximate answers with error guarantees for a priori decided

queries. Some of the earliest work on sketches was presented by Alon, Gibbons, Matias

and Szegedy [7, 8]. Sketching techniques with improved error guarantees and faster update

times have been proposed as Fast-Count sketches [117]. A statistical analysis of various

sketching techniques along with recommendations on their use for estimating join sizes

appears in [108].
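As a concrete illustration of the sketch idea, here is a minimal Python rendering
of the classic AMS construction of Alon et al. [7, 8] as we understand it (our
simplification: seeded pseudo-random signs stand in for the 4-wise independent hash
families, and we take a median of per-counter estimates rather than the papers'
median of means).

import random
import statistics
from collections import Counter

def sign_hash(seed, x):
    # Pseudo-random +/-1 value, deterministic in (seed, x) within a run.
    return 1 if random.Random(hash((seed, x))).random() < 0.5 else -1

def ams_self_join_size(stream, num_counters=64):
    counters = [0] * num_counters
    for x in stream:                       # a single pass over the data
        for c in range(num_counters):
            counters[c] += sign_hash(c, x)
    # Each squared counter estimates the self-join size (second moment).
    return statistics.median(z * z for z in counters)

data = [random.randint(1, 100) for _ in range(10_000)]
true_f2 = sum(v * v for v in Counter(data).values())
print(ams_self_join_size(data), "vs true", true_f2)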

2.3 Analytic Query Processing Using Non-standard Data Models

A data model for OLAP applications called the data cube was proposed by Gray et al.

[44] for processing of analytic-style aggregation queries over data warehouses. The paper

describes a generalization of the SQL GROUP BY operator to multiple dimensions by

introducing the data cube operator. This operator treats each of the possible aggregation

attributes as a dimension of a high dimensional space. The aggregate of a particular

set of attribute values is considered as a point in this space. Since the cube holds

precomputed aggregate values over all dimensions, it can be used to quickly compute

results to GROUP-BY queries over multiple dimensions. The data cube is precomputed










and can require a significant amount of space for storage of the precomputed aggregates

along the different dimensions. A more serious drawback of the data cube approach is

that it can be used to efficiently answer only those queries whose grouping hierarchy
conforms to the hierarchy on which the data cube is built. Moreover, complex queries

which have been addressed in this thesis, such as queries having correlated subqueries, are

not amenable to efficient processing with the data cube model.

Due to potentially large sizes of data cubes for high dimensions, researchers have

studied techniques to discover semantic relationships in a data cube. This approach

reduces the number of precomputed aggregates grouped by different attributes if
their aggregate values are identical. The quotient cube [79] and quotient cube tree [80]

structures are such compressed representations of the data cube which preserve semantic

relationships while also allowing processing of point and range queries.

Another approach that has been employed in shrinking the data cube while at the

same time preserving all the information in it is the Dwarf [113, 114] structure. Dwarf

identifies and eliminates redundancies in prefixes and suffixes of the values along different

dimensions of a data cube. The paper shows that by eliminating prefix as well as suffix

redundancies, both dense as well as sparse data cubes can be compressed effectively.

The paper also shows improved cube construction time, query response time as well as

update time as compared to cube trees [105]. Although the Dwarf structure improves the

performance of the data cube model, it still suffers from the inherent drawback of the data

cube model: it is not suitable to efficiently answer arbitrarily complex queries such as

queries with correlated subqueries.

Recently, a new column-oriented architecture for database systems called C-store was

proposed by Stonebraker et al. [115]. The system has been designed for an environment
that has a much higher number of database reads than writes, such as a data

warehousing environment. C-store logically splits attributes of a relational table into

projections which are collections of attributes, and stores them on disk such that all values










of any attribute are stored adjacent to each other. The paper presents experimental results

which show that C-store executes several select-project-join and group-by queries over the

TPC-H benchmark much faster than commercial row-oriented or column-oriented systems.

At the time of the paper [115], the system was still under development.










CHAPTER 3
MATERIALIZED SAMPLE VIEWS FOR DATABASE APPROXIMATION

3.1 Introduction

With ever-increasing database sizes, randomization and randomized algorithms [91]

have become vital data management tools. In particular, random sampling is one of the

most important sources of randomness for such algorithms. Scores of algorithms that are

useful over large data repositories either require a randomized input ordering for data (i.e.,

an online random sample), or else they operate over samples of the data to increase the

speed of the algorithm.

Although applications requiring randomization abound in the data management

literature, we specifically consider online aggregation [54, 62, 63] in this thesis. In online
aggregation, database records are processed one-at-a-time, and used to keep the user
informed of the current "best guess" as to the eventual answer to the query. If the records
are input into the online aggregation algorithm in a randomized order, then it becomes

possible to give probabilistic guarantees on the relationship of the current guess to the

eventual answer to the query.

Despite the obvious importance of random sampling in a database environment and

dozens of recent papers on the subject (approximately 20 papers from recent SIGMOD

and VLDB conferences are concerned with database sampling), there has been relatively

little work towards actually supporting random sampling with physical database file

organizations. The classic work in this area (by Olken and his co-authors [98, 99, 101])

suffers from a key drawback: each record sampled from a database file requires a random

disk I/O. At a current rate of around 100 random disk I/Os per second per disk, this

means that it is possible to retrieve only 6,000 samples per minute. If the goal is fast

approximate query processing or speeding up a data mining algorithm, this is clearly

unacceptable.










The Materialized Sample View


In this chapter, we propose to use the materialized sample view1 as a convenient

abstraction for allowing efficient random sampling from a database. For example, consider

the following database schema:

SALE (DAY, CUST, PART, SUPP)

Imagine that we want to support fast, random sampling from this table, and most of

our queries include a temporal range predicate on the DAY attribute. This is exactly the

interface provided by a materialized sample view. A materialized sample view can he

specified with the following SQL-like query:

CREATE MATERIALIZED SAMPLE VIEW MySam

AS SELECT * FROM SALE

INDEX ON DAY

In general, the range attribute or attributes referenced in the INDEX ON clause can be

spatial, temporal, or otherwise, depending on the requirements of the application.
While the materialized sample view is a straightforward concept, efficient implementation
is difficult. The primary technical contribution of this thesis is a novel index structure
called the ACE Tree (named for its three key properties of Appendability, Combinability,
and Exponentiality; see Section 3.4) which can be used to efficiently implement a
materialized sample view. Such a view, stored as an ACE Tree, has the following
characteristics:


* It is possible to efficiently sample (without replacement) from any arbitrary range
query over the indexed attribute, at a rate that is far faster than is possible using
techniques proposed by Olken [96] or by scanning a randomly permuted file. In
general, the view can produce samples from a predicate involving any attribute
having a natural ordering, and a straightforward extension of the ACE Tree can be
used for sampling from multi-dimensional predicates.

* The resulting sample is online, which means that new samples are returned
continuously as time progresses, and in a manner such that at all times, the set
of samples returned is a true random sample of all of the records in the view that




1 This term was originally used in Olken's PhD thesis [96] in a slightly different context,
where the goal was to maintain a fixed-size sample of a database; in contrast, as we describe
subsequently, our materialized sample view is a structure allowing online sampling.










match the range query. This is vital for important applications like online aggregation
and data mining.

* Finally, the sample view is created efficiently, requiring only two external sorts of the
records in the view, and with only a very small space overhead beyond the storage
required for the data records.

We note that while the materialized sample view is a logical concept, the actual file

organization used to implement such a view can be referred to as a sample index, since it is
a primary index structure used to efficiently retrieve random samples.

3.2 Existing Sampling Techniques

In this section, we discuss three simple techniques that can be used to create

materialized sample views to support random sampling from a relational selection

predicate.

3.2.1 Randomly Permuted Files

One option for creating a materialized sample view is to randomly shuffle or permute

the records in the view. To sample from a relational selection predicate over the view,

we scan it sequentially from beginning to end and accept those records that satisfy the

predicate while rejecting the rest. This method has the advantage that it is very simple,

and using a fast external sorting algorithm, permuting the records can be very efficient.

Furthermore, since the process of scanning the file can make use of the fast, sequential I/O

provided by modern hard disks, a materialized view organized as a randomly permuted file

can be very useful for answering queries that are not very selective.

However, the main problem with such a materialized view is that the fraction of

useful samples retrieved by it is directly proportional to the selectivity of the selection

predicate. For example, if the selectivity of the query is 10%, then on average only 10% of

the random samples obtained by such a view can be used to answer the query. Hence for

moderate to low selectivity queries, most of the random samples retrieved by such a view

will not be useful for answering queries. Thus, the performance of such a view quickly

degrades as selectivity of the selection predicates decreases.
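A minimal sketch of this scheme (our own illustration): the permuted file is
scanned sequentially, and at every point the accepted records form a true random
sample of the rows matching the predicate.

import random

def online_sample_from_permuted(records, predicate):
    # records are stored in a random (permuted) order; in a real system
    # this loop would be a fast sequential scan of the file.
    for rec in records:
        if predicate(rec):
            yield rec

table = list(range(100_000))
random.shuffle(table)  # one-time permutation when the view is created
sample = []
for rec in online_sample_from_permuted(table, lambda r: 30_000 <= r <= 65_000):
    sample.append(rec)
    if len(sample) == 100:
        break
print(sample[:5])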










3.2.2 Sampling from Indices

The second approach to creating a materialized sample view is to use one of the

standard indexing structures like a hashing scheme or a tree-based index structure

to organize the records in the view. In order to produce random samples from such

a materialized view, we can employ iterative or batch sampling techniques [9, 96,

99-101] that sample directly from a relational selection predicate, thus avoiding the

aforementioned problem of obtaining too few relevant records in the sample. Olken [96]

presents a comprehensive analysis and comparison of many such techniques. In this

Section we discuss the technique of sampling from a materialized view organized as

a ranked B+-Tree, since it has been proven to be the most efficient existing iterative

sampling technique in terms of number of disk accesses. A ranked B+-Tree is a regular

B+-Tree whose internal nodes have been augmented with information which permits one

to find the ith record in the file.

Let us assume that the relation SALE presented in the Introduction is stored as a

ranked B+-Tree file indexed on the attribute DAY and we want to retrieve a random

sample of records whose DAY attribute value falls between 11-28-2004 and 03-02-2005.

This translates to the following SQL query:

SELECT * FROM SALE

WHERE SALE.DAY BETWEEN '11-28-2004' AND '03-02-2005'

Algorithm 1 can then be used to obtain a random sample of relevant records
from the ranked B+-Tree file.


The drawback of the above algorithm is that whenever a leaf page is accessed, the

algorithm retrieves only that record whose rank matches with the rank being searched for.

Hence for every record which resides on a page that is not currently buffered, the retrieval

time is the same as the time required for a random disk I/O. Thus, as long as there are

unbuffered leaf pages containing candidate records, the rate of record retrieval is very slow.









Algorithm 1: Sampling from a Ranked B+-Tree

Algorithm SampleRankedB+Tree (Value v1, Value v2)
1. Find the rank r1 of the record which has the smallest
   DAY value greater than v1.
2. Find the rank r2 of the record which has the largest
   DAY value smaller than v2.
3. While sample size < desired sample size:
   3.a Generate a uniformly distributed random number
       i between r1 and r2.
   3.b If i has been generated previously, discard it and
       generate the next random number.
   3.c Using the rank information in the internal nodes,
       retrieve the record whose rank is i.


3.2.3 Block-based Random Sampling

While the classic algorithms of Olken and Antoshenkov sample records one-at-a-time,

it is possible to sample from an indexing structure such as a B+-Tree, and make use of

entire blocks of records [21, 55]. The number of records per block is typically on the order

of 100 to 1000, leading to a speedup of two or three orders of magnitude in the number of

records retrieved over time if all of the records in each block are consumed, rather than a

single record.

However, there are two problems with this approach. First, if the structure is used to

estimate the answer to some aggregate query, then the confidence bounds associated with

any estimate provided after N samples have been retrieved from a range predicate using a

B+-Tree (or some other index structure) may be much wider than the confidence bounds

that would have been obtained had all N samples been independent. In the extreme case

where the values on each block of records are closely correlated with one another, all of

the N samples may be no better than a single sample. Second, any algorithm which makes

use of such a sample must be aware of the block-based method used to sample the index,

and adjust its estimates accordingly, thus adding complexity to the query result estimating

process. For algorithms such as Bradley's K-means algorithm [11], it is not clear whether

or not such samples are even appropriate.










3.3 Overview of Our Approach

We propose an entirely different strategy for implementing a materialized sample

view. Our strategy uses a new data structure called the ACE Tree to index the records

in the sample view. At the highest level, the ACE Tree partitions a data set into a

large number of different random samples such that each is a random sample without

replacement from one particular range query. When an application asks to sample from

some arbitrary range query, the ACE Tree and its associated algorithms filter and combine

these samples so that very quickly, a large and random subset of the records satisfying the

range query is returned. The sampling algorithm of the ACE Tree is an online algorithm,

which means that as time progresses, a larger and larger sample is produced by the

structure. At all times, the set of records retrieved is a true random sample of all the

database records matching the range selection predicate.

3.3.1 ACE Tree Leaf Nodes

The ACE Tree stores records in a large set of leaf nodes on disk. Every leaf node has

two components:


1. A set of h ranges, where a range is a pair of key values in the domain of the key
attribute and h is the height of the ACE Tree. Unlike a B+-Tree, each leaf node
in the ACE Tree stores records falling in several different ranges. The ith range
associated with leaf node L is denoted by L.Ri. The h different ranges associated
with a leaf node are hierarchical, that is L.R1 ⊃ L.R2 ⊃ ... ⊃ L.Rh. The first
range in any leaf node, L.R1, always contains a uniform random sample of all records
of the database, thus corresponding to the range (-∞, ∞). The hth range in any leaf
node is the smallest among all the ranges in that leaf node.

2. A set of h associated sections. The ith section of leaf node L is denoted by L.Si. The
section L.Si contains a random subset of all the database records with key values in
the range L.Ri.

Figure 3-1 depicts an example leaf node in the ACE Tree with attribute range values

written above each section and section numbers marked below. Records within each

section are shown as circles.










R1: 0-100    R2: 0-50    R3: 0-25    R4: 0-12
S1           S2          S3          S4

Figure 3-1. Structure of a leaf node of the ACE tree.


3.3.2 ACE Tree Structure

Logically, the ACE Tree is a disk-based binary tree data structure with internal nodes

used to index leaf nodes, and leaf nodes used to store the actual data. Since the internal

nodes in a binary tree are much smaller than disk pages, they are packed and stored

together in disk-page-sized units [27]. Each internal node has the following components:

1. A range R of key values associated with the node.

2. A key value k that splits R and partitions the data on the left and right of the node.

3. Pointers ptrl and ptrr, that point to the left and right children of the node.

4. Counts cntl and cntr, that give the number of database records falling in the
ranges associated with the left and right child nodes. These values can be used,
for example, during evaluation of online aggregation queries which require the size of
the population from which we are sampling [54].

Figure 3-2 shows the logical structure of the ACE Tree. Ii,j refers to the jth internal

node at level i. The root node is labeled with a range I1,1.R = [0-100], signifying that

all records in the data set have key values within this range. The key of the root node

partitions I1,1.R into I2,1.R = [0-50] and I2,2.R = [51-100]. Similarly each internal node

divides the range of its descendants with its own key.

The ranges associated with each section of a leaf node are determined by the ranges

associated with each internal node on the path from the root node to the leaf. For

example, if we consider the path from the root node down to leaf node L4, the ranges that

we encounter along the path are 0-100, 0-50, 26-50 and 38-50. Thus for L4, L4.S1 has a

random sample of records in the range 0-100, L4.S2 has a random sample in the range









[Figure omitted: a binary tree with root I1,1 (range 0-100) over internal nodes and
leaf nodes L1 through L8; the path from the root to L4 passes through the ranges
0-100, 0-50, 26-50, and 38-50, which become the ranges of sections L4.S1 through
L4.S4.]

Figure 3-2. Structure of the ACE tree.


0-50, L4.S3 has a random sample in the range 26-50, while L4.S4 has a random sample in

the range 38-50.
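The following minimal Python sketch (assumed field names; the thesis does not
prescribe a concrete implementation) records these logical node layouts:

from dataclasses import dataclass, field

@dataclass
class InternalNode:
    r: tuple              # (low, high) key range covered by this node
    key: float            # split point partitioning the range
    left: object = None   # ptrl: left child (internal node or leaf)
    right: object = None  # ptrr: right child
    cnt_l: int = 0        # cntl: records under the left child's range
    cnt_r: int = 0        # cntr: records under the right child's range

@dataclass
class LeafNode:
    ranges: list = field(default_factory=list)    # L.R1 ⊃ L.R2 ⊃ ... ⊃ L.Rh
    sections: list = field(default_factory=list)  # L.Si: sample within L.Ri

# Leaf L4 from Figure 3-2, with its records elided:
L4 = LeafNode(ranges=[(0, 100), (0, 50), (26, 50), (38, 50)],
              sections=[[], [], [], []])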

3.3.3 Example Query Execution in ACE Tree

In the following discussion, we demonstrate how the ACE Tree efficiently retrieves a

large random sample of records for any given range query. The query algorithm is formally

described in Section 3.6.

Let Q = [30-65] be our example query postulated over the ACE Tree depicted in

Figure 3-2. The query algorithm starts at I1,1, the root node. Since I2,1.R overlaps Q, the

algorithm decides to explore the left child node labeled I2,1 in Figure 3-2. At this point

the two range values associated with the left and right children of I2,1 are 0-25 and 26-50.

Since the left child range has no overlap with the query range, the algorithm chooses to

explore the right child next. At this child node (I3,2), the algorithm picks leaf node L3 to

be the first leaf node retrieved by the index. Records from section 1 of L3 (which totally

encompasses Q) are filtered for Q and returned immediately to the consumer of the sample










as a random sample from the range [30-65], while records from sections 2, 3 and 4 are

stored in memory. Figure 3-3 shows the random sample from section 1 of L3, which
can be used directly for answering query Q.

[Figure omitted: leaf node L3 with sections L3.S1 through L3.S4 covering the ranges
0-100, 0-50, 26-50, and 26-37; section 1 supplies records for query Q.]

Figure 3-3. Random samples from section 1 of L3.


Next, the algorithm again starts at the root node and now chooses to explore the
right child node I2,2. After performing range comparisons, it explores the left child
of I2,2, which is I3,3, since I3,4.R has no overlap with Q. The algorithm next visits
the left child node of I3,3, which is leaf node L5. This is the second leaf node to be
retrieved. As depicted in Figure 3-4, since L5.R1 encompasses Q, the records of L5.S1
are filtered and returned immediately to the user as two additional samples from Q.
Furthermore, section 2 records are combined with section 2 records of L3 to obtain a
random sample of records in the range 0-100. These are again filtered and returned,
giving four more samples from Q. Section 3 records are also combined with section 3
records of L3 to obtain a sample of records in the range 26-75. Since this range also
encompasses Q, the records are again filtered and returned, adding four more records
to our sample. Finally, section 4 records are stored in memory for later use.

Note that after retrieving just two leaf nodes in our small example, the algorithm

obtains eleven randomly selected records from the query range. However, in a real index,

this number would be many times greater. Thus, the ACE Tree supports "fast first"










sampling from a range predicate: a large number of samples are returned very quickly. We

contrast this with a sample taken from a B+-Tree having a similar structure to the ACE

Tree depicted in Figure 3-2. The B+-Tree sampling algorithm would need to pre-select

which nodes to explore. Since four leaf nodes in the tree are needed to span the query

range, there is a reasonably high likelihood that the first four samples taken would need

to access all four leaf nodes. As the ACE Tree Query Algorithm progresses, it goes on to

retrieve the rest of the leaf nodes in the order L4, L6, L1, L7, L2, L8.

[Figure omitted: section L5.S1 (range 0-100) is filtered directly, while L3.S2 is
combined with L5.S2 (ranges 0-50 and 51-100) and L3.S3 with L5.S3 (ranges 26-50 and
51-75); each combination is filtered to the query range 30-65.]

Figure 3-4. Combining samples from L3 and L5.


3.3.4 Choice of Binary Versus k-Ary Tree

The ACE Tree as described above can also be implemented as a k-ary tree instead of

a binary tree. For example, for a ternary tree, each internal node can have two (instead

of one) keys and three (instead of two) children. If the height of the tree was h, every

leaf node would still have h ranges and h sections associated with them. Like a standard

complete k-ary tree, the number of leaf nodes will be k^(h-1). However, the big difference

would be the manner in which a query is executed using a k-ary ACE Tree as opposed to

a binary ACE Tree. The query algorithm will always start at the root node and traverse










down to a leaf. However, at every internal node it will alternate between the k children in

a round-robin fashion. Moreover, since the data space would be divided into k equal parts

at each level, the query algorithm might have to make k traversals and hence access k leaf

nodes before it can combine sections that can be used to answer the query. This would

mean that the query algorithm will have to wait longer (than a binary ACE Tree) before

it can combine leaf node sections and thus return useful random samples. Since the goal

of the ACE Tree is to support "fast first" sampling, use of a binary tree instead of a k-ary

tree seems to be a better choice to implement the ACE Tree.

3.4 Properties of the ACE Tree

In this Section we describe the three important properties of the ACE Tree which

facilitate the efficient retrieval of random samples from any range query, and will be

instrumental in ensuring the performance of the algorithm described in Section 3.6.

3.4.1 Combinability







[Figure omitted: leaf node L1 with sections L1.S1 through L1.S4 (ranges 0-100, 0-50,
0-25, and 0-12) and leaf node L3; the second sections of L1 and L3 are each filtered
to the query range 3-47 and their samples combined.]

Figure 3-5. Combining two sections of leaf nodes of the ACE tree.










The various samples produced from processing a set of leaf nodes are combinable.

For example, consider the two leaf nodes L1 and L3, and the query "Compute a random

sample of the records in the query range Q1 = [3 to 47]". As depicted in Figure 3-5, first

we read leaf node L1 and filter the second section in order to produce a random sample

of size n1 from Q1 which is returned to the user. Next we read leaf node L3, and filter its

second section L3.S2 to produce a random sample of size n2 from Q1 which is also returned

to the user. At this point, the two sets returned to the user constitute a single random

sample from Q1 of size n1 + n2. This means that as more and more nodes are read from

disk, the records contained in them can be combined to obtain an ever-increasing random

sample from any range query.

3.4.2 Appendability

The ith sections from two leaf nodes are appendable. That is, given two leaf nodes Lj

and Lk, Lj.Si ∪ Lk.Si is always a true random sample of all records of the database with

key values within the range Lj.Ri ∪ Lk.Ri. For example, reconsider the query, "Compute

a random sample of the records in the query range Q1 = [3 to 47]". As depicted in Figure

3-6, we can append the third section from node L3 to the third section from node L1 and

filter the result to produce yet another random sample from Q1. This means that sections

are never wasted.
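A minimal sketch (our own toy example over integer keys) of both properties in
action:

import random

def filter_to_query(section, q_lo, q_hi):
    # Keep only the sampled records that fall inside the query range.
    return [r for r in section if q_lo <= r <= q_hi]

# Combinability: L1.S2 and L3.S2 each sample the range 0-50, which covers
# Q1 = [3, 47]; their filtered results pool into one larger random sample.
L1_S2 = random.sample(range(0, 51), 10)
L3_S2 = random.sample(range(0, 51), 10)
combined = filter_to_query(L1_S2, 3, 47) + filter_to_query(L3_S2, 3, 47)

# Appendability: the ith sections of two sibling leaves, with ranges 0-25
# and 26-50, append into a sample of the union range 0-50, which can then
# be filtered for Q1 as well.
L1_S3 = random.sample(range(0, 26), 5)
L3_S3 = random.sample(range(26, 51), 5)
appended = filter_to_query(L1_S3 + L3_S3, 3, 47)
print(len(combined), len(appended))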

3.4.3 Exponentiality

The ranges in a leaf node are exponential. The number of database records that fall
in L.Ri is twice the number of records that fall in L.Ri+1. This allows the ACE Tree
to maintain the invariant that for any query Q' over a relation R such that at least hp
database records fall in Q', and with |R|/2^(k+1) <= |σQ'(R)| <= |R|/2^k for any
k <= h - 1, there exists a pair of leaf nodes Li and Lj, where at least one-half of the
database records falling in Li.Rk+2 ∪ Lj.Rk+2 are also in Q'. Here p is the average
number of records in each section, and h is the height of the tree or equivalently, the
total number of sections in any leaf node.









[Figure omitted: the third sections of leaf nodes L1 (ranges 0-100, 0-50, 0-25, 0-12)
and L3 are appended and then filtered to yield combined samples for the query range
3-47.]

Figure 3-6. Appending two sections of leaf nodes of the ACE tree.


While the formal statement of the exponentiality property is a bit complicated, the
net result is simple: there is always a pair of leaf nodes whose sections can be appended
to form a set which can be filtered to quickly obtain a sample from any range query Q'.
As an illustration, consider query Q over the ACE Tree of Figure 3-2. Note that the
number of database records falling in Q is greater than one-fourth, but less than half the
database size. The exponentiality property assures us that Q can be totally covered by
appending sections of two different leaf nodes. In our example, this means that Q can be
covered by appending section 3 of nodes L4 and L6. If RC = L4.R3 ∪ L6.R3, then by the
invariant given above we can claim that |σQ(R)| >= (1/2) × |σRC(R)|.

3.5 Construction of the ACE Tree

In this Section, we present an I/O efficient, bulk construction algorithm for the ACE

Tree.

3.5.1 Design Goals

The algorithm for building an ACE Tree index is designed with the following goals in

mind:










1. Since the ACE Tree may index enormous amounts of data, construction of the tree
should rely on efficient, external memory algorithms, requiring as few passes through
the data set as possible.

2. In the resulting data structure, the data which are placed in each leaf node section
must constitute a true random sample (without replacement) of all database records
lying within the range associated with that section.

3. Finally, the tree must be constructed in such a way as to have the exponentiality,
combinability, and appendability properties necessary for supporting the ACE Tree
query algorithms.

3.5.2 Construction
The construction of the ACE Tree proceeds in two distinct phases. Each phase
comprises two read/write passes through the data set (that is, constructing an
ACE-Tree from scratch requires two external sorts of a large database table). The two
phases are as follows:

1. During Phase 1, the data set is sorted based on the record key values. This sorted
order of records is used to provide the split points associated with each internal node
in the tree.

2. During Phase 2, the data are organized into leaf nodes based on those key values.
Disk blocks corresponding to groups of internal nodes can easily be constructed at
the same time as the final pass through the data writes the leaf nodes to disk.

3.5.3 Construction Phase 1

The primary task of Phase 1 is to assign split points to each internal node of the tree.

To achieve this, the construction algorithm first sorts the data set based upon keys of the

records, as depicted in Figure 3-7.

After the dataset is sorted, the median record for the entire data set is determined

(this value is 50 in our example). This record's key will be used as the key associated with

the root of the ACE Tree, and will determine L.R2 for every leaf node in the tree. We

denote this key value by I1,1.k, since the value serves as the key of the first internal node

in level 1 of the tree.

After determining the key value associated with the root node, the medians of each of

the two halves of the data set partitioned by I1,1.k are chosen as keys for the two internal

nodes at the next level: I2,1.k and I2,2.k, respectively. In the example of Figure 3-7, these










[Figure omitted: the sorted data set 3, 7, 10, 12, 15, 18, 22, 25, 29, 33, 36, 37, 41,
47, 50, 50, 53, 58, 60, 62, 69, 72, 74, 75, 77, 81, 84, 88, 89, 92, 98; the median
record supplies I1,1.k, and the medians of the two resulting halves supply I2,1.k and
I2,2.k.]

Figure 3-7. Choosing keys for internal nodes.


values are 25 and 75. I2,1.k and I2,2.k, along with I1,1.k, will determine L.R3 for every
leaf node in the tree. The process is then repeated recursively until enough medians2
have been obtained to provide every internal node with a key value. Note that at the same
time that these various key values are determined, the values cntl and cntr can also be
determined.
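A minimal sketch (our own rendering; internal nodes are numbered by level and
position as in the text) of this recursive, median-based key assignment:

def assign_keys(sorted_keys, levels):
    # Return {(level, position): split_key} for a tree with `levels`
    # levels of internal nodes.
    keys = {}
    def recurse(lo, hi, level, pos):
        if level > levels or lo >= hi:
            return
        mid = (lo + hi) // 2                      # median of this segment
        keys[(level, pos)] = sorted_keys[mid]
        recurse(lo, mid, level + 1, 2 * pos - 1)  # left child
        recurse(mid, hi, level + 1, 2 * pos)      # right child
    recurse(0, len(sorted_keys), 1, 1)
    return keys

data = [3, 7, 10, 12, 15, 18, 22, 25, 29, 33, 36, 37, 41, 47, 50, 50,
        53, 58, 60, 62, 69, 72, 74, 75, 77, 81, 84, 88, 89, 92, 98]
print(assign_keys(data, levels=3))  # {(1, 1): 50, (2, 1): 25, (2, 2): 75, ...}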

This simple strategy for choosing the various key values in the tree ensures that

the exponentiality property will hold. If the data space between Ii+1,2j-1.k and Ii,j.k
corresponds to some leaf node range L.Rm, then the data space between Ii+1,2j-1.k and
Ii+2,4j-2.k will correspond to some range L.Rm+1. Since Ii+2,4j-2.k is the midpoint of




2 We choose a value for the height of the tree in such a manner that the expected size
of a leaf node (see Sec. V.F) does not exceed one logical disk block. Choosing a node
size that corresponds to the block size is done for the same reason it is done in most
traditional indexing structures: typically, the system disk block size has already been
carefully chosen by a DBA to balance speed of sequential access (which demands a larger
block size) with the cost of accessing more data than is needed (which demands a smaller
block size).










the data space between Ii+1,2j-1.k and Ii,j.k, we know that two times as many database
records should fall in L.Rm compared with L.Rm+1.

The following example also shows how the invariant described in Section 3.4.3 is

guaranteed by adopting the aforementioned strategy of assigning key values to internal

nodes. Consider the ACE Tree of Figure 3-2. Figure 3-8 shows the keys of the internal

nodes as medians of the dataset R. We also consider two example queries, Q1 and Q2 such

that the number of database records falling in Q2 is greater than one-fourth but less than

one-half of the database size, while the number of database records falling in Q1 is more

than half the database size.





[Figure omitted: the sorted key space marked with the internal-node keys 12, 25, 37,
50, 62, 75, and 88, with the example queries Q1 and Q2 spanning portions of it.]

Figure 3-8. Exponentiality property of ACE tree.


Q1 can be answered by appending section 2 of (for example) L4 and L5 (refer to
Figure 3-2). Let RC1 = L4.R2 ∪ L5.R2. Then all the database records fall in RC1.
Moreover, since |σQ1(R)| >= |R|/2, we have |σQ1(R)| >= (1/2) × |σRC1(R)|. Similarly,
Q2 can be answered by appending section 3 of (for example) L4 and L6. If RC2 =
L4.R3 ∪ L6.R3, then half the database records fall in RC2. Also, since |σQ2(R)| > |R|/4,
we have |σQ2(R)| >= (1/2) × |σRC2(R)|. This can be generalized to obtain the invariant
stated in Section 3.4.3.

3.5.4 Construction Phase 2

The objective of Phase 2 is to construct leaf nodes with appropriate sections and

populate them with records. This can be achieved by the following three steps:










Figure 3-9. Phase 2 of tree construction. (a) Records assigned section numbers.
(b) Records assigned leaf numbers. (c) Records organized into leaf nodes.


1. Assign a uniformly generated random number between 1 and h to each record as its
section number.


2. Associate an additional random number with the record that will be used to identify
the leaf node to which the record will be assigned.

3. Finally, re-organize the file by performing an external sort to group records in a given
leaf node and a given section together.

Figure 3-9(a) depicts our example data set after we have assigned each record a
randomly generated section number, assuming four sections in each leaf node.
In Step 2, the algorithm assigns one more randomly generated number to each
record, which will identify the leaf node to which the record will be assigned. We assume
for our example that the number of leaf nodes is 2^(h-1) = 2^3 = 8. The number to identify
the leaf node is assigned as follows.

1. First, the section number of the record is checked. We denote this value as s.










2. We then start at the root of the tree and traverse down by comparing the record key
with s - 1 key values. After the comparisons, if we arrive at an internal node Ii,j,
then we randomly assign the record to one of the leaves in the subtree rooted at Ii,j.

From the example of Figure 3-9(a), the first record, having key value 3, has been
assigned to section 1. Since this record can be randomly assigned to any leaf from 1
through 8, we assign it to leaf 7.

The next record of Figure 3-9(a) has been assigned to section number 2. Referring
back to Figure 3-7, we see that the key of the root node is 50. Since the key of the record
is 7, which is less than 50, the record will be assigned to a leaf node in the left subtree of
the root. Hence we assign a leaf node between 1 and 4 to this record. In our example, we
randomly choose leaf node 3.

For the next record having key value 10, we see that the section number assigned is 3.

To assign a leaf node to this record, we initially compare its key with the key of the root

node. Referring to Figure 3-7, we see that 10 is smaller than 50; hence we then compare it

with 25 which is the key of the left child node of the root. Since the record key is smaller

than 25, we assign the record to some leaf node in the left subtree of the node with key 25

by assigning to it a random number between 1 and 2.
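The following sketch (hypothetical names, reusing the keys dictionary from the Phase 1 sketch above) illustrates the two random assignments for a single record; the subsequent external sort then simply orders records by the (leaf, section) pair produced here:

import random

# Sketch of Phase 2, Steps 1 and 2: give a record a uniform section
# number, then descend s - 1 levels from the root and pick a leaf
# uniformly at random from the subtree reached.
def assign_section_and_leaf(record_key, keys, height):
    num_leaves = 2 ** (height - 1)
    s = random.randint(1, height)                 # Step 1: section number

    level, pos = 1, 1                             # Step 2: s - 1 comparisons
    for _ in range(s - 1):
        if record_key < keys[(level, pos)]:
            level, pos = level + 1, 2 * pos - 1   # descend to the left child
        else:
            level, pos = level + 1, 2 * pos       # descend to the right child

    # The subtree rooted at node (level, pos) owns a contiguous run of
    # leaves; choose one of them uniformly at random.
    per_subtree = num_leaves // (2 ** (level - 1))
    first_leaf = (pos - 1) * per_subtree + 1
    return s, random.randint(first_leaf, first_leaf + per_subtree - 1)

For instance, for the record with key 10 and section number 3, the two comparisons against 50 and 25 lead to the leftmost level-3 subtree, and the function returns a leaf number of 1 or 2, matching the example above.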

The section number and leaf node identifiers for each record are written in a small

amount of temporary disk space associated with each record. Once all records have been

assigned to leaf nodes and sections, the dataset is re-organized into leaf nodes using a

two-pass external sorting algorithm as follows:

* Records are sorted in ascending order of their leaf node number.

* Records with the same leaf node number are arranged in ascending order of their
section number.

The re-organized data set is depicted in Figure 3-9(c).










3.5.5 Combinability/Appendability Revisited

In Phase 2 of the tree construction, we observe that all records belonging to some
section s are segregated based upon the result of the comparison of their key with the
appropriate medians, and are then randomly assigned a leaf node number from the feasible
ones. Thus, if records from section s of all leaf nodes are merged together, we will obtain
all of the section s records. This ensures the appendability property of the ACE Tree.

Also note that the probability of assignment of one record to a section is unaffected
by the probability of assignment of some other record to that section. Since this results
in each section having a random subset of the database records, it is possible to merge
a sample of the records from one section that match a range query with a sample of
records from a different section that match the same query. This will produce a larger
random sample of records falling in the range of the query, thus ensuring the combinability
property.

3.5.6 Page Alignment

In Phase 2 of the construction algorithm, section numbers and leaf node numbers
are randomly generated. Hence we can only predict, on expectation, the number of records
that will fall in each section of each leaf node. As a result, section sizes within each leaf
node can differ, and the size of a leaf node itself is variable and will generally not be equal
to the size of a disk page. Thus, when the leaf nodes are written out to disk, a single leaf
node may span multiple disk pages or may be contained within a single disk page.

This situation could be avoided if we fix the size of each section a priori. However,
this poses a serious problem. Consider two leaf node sections Li.Sj and Li+1.Sj. We
can force these two sections to contain the same number of records by ensuring that the
set of records assigned to section j in Phase 2 of the construction algorithm has equal
representation from Li.Rj and Li+1.Rj. However, this means that the set of records
assigned to section j is no longer random. If we fix the section size and force a set number










of records to fall in each section, we invalidate the appendability and combinability

properties of the structure. Thus, we are forced to accept a variable section size.
In order to implement variable section size, we can adopt one of the following two
schemes:


1. Enforce fixed-sized leaf nodes and allow variable-sized sections within the leaf nodes.

2. Allow variable-sized leaf nodes along with variable-sized sections.

If we choose the fixed-sized leaf node, variable-sized section scheme, leaf node size is
fixed in advance. However, section size is allowed to vary. This allows full sections to grow
further by claiming any available space within the leaf node. The leaf node size chosen
should be large enough to prevent any leaf node from becoming completely filled up, which
prevents the partitioning of any leaf node across two disk pages. The major drawback of
this scheme is that the average leaf node space utilization will be very low. Assuming a
reasonable set of ACE Tree parameters, a quick calculation shows that if we want to be
99% sure that no leaf node gets filled up, the average leaf node space utilization will be
less than 15%.

The variable-sized leaf node, variable-sized section scheme does not impose a size

limit on either the leaf node or the section. It allows leaf nodes to grow beyond disk page

boundaries, if space is required. The important advantage of this scheme is that it is

space-efficient. The main drawback of this approach is that leaf nodes may span multiple

disk pages, and hence all such pages must be accessed in order to retrieve such a leaf node.

Given that most of the cost associated with reading an arbitrary leaf page is associated

with the disk head movement needed to move the disk arm to the appropriate cylinder,

this does not pose too much of a problem. Hence we use this scheme for the construction

of leaf nodes of the ACE Tree.

3.6 Query Algorithm

In this Section, we describe in detail the algorithm used to answer range queries using

the ACE Tree.










3.6.1 Goals

The algorithm has been designed to meet the primary goal of returning records relevant
to the query first when sampling from the index structure, which means it attempts to be
greedy on the number of records relevant for the query in the early stages of execution.
In order to meet this goal, the query answering algorithm identifies the leaf nodes which
contain the maximum number of sections relevant for the query. A section Li1.Sj is
relevant for a range query Q if Li1.Rj ∩ Q ≠ ∅ and Li1.Rj ∪ Li2.Rj ∪ ... ∪ Lin.Rj ⊇ Q,
where Li1, ..., Lin are some leaf nodes in the tree.

The query algorithm prioritizes retrieval of leaf nodes so as to:

* Facilitate the combination of sections so as to maximize n in the above formulation,
and

* Maximize the number of relevant sections in each leaf node L retrieved, such that
L.Sj ∩ Q ≠ ∅ for j = (c + 1), ..., h, where L.Rc is the smallest range in L that
encompasses Q.

3.6.2 Algorithm Overview

At a high level, the query answering algorithm retrieves the leaf nodes relevant to

answering a query via a series of stabs or traversals, accessing one leaf node per stab.

Each stab begins at the root node and traverses down to a leaf. The distinctive feature of

the algorithm is that at each internal node that is traversed during a stab, the algorithm

chooses to access the child node that was not chosen the last time the node was traversed.

For example, imagine that for a given internal node I, the algorithm chooses to traverse

to the left child of I during a stab. The next time that I is accessed during a stab, the

algorithm will choose to traverse to the right child node. This can be seen in Figure

3-10, when we compare the paths taken by Stab 1 and Stab 2. The algorithm chooses to

traverse to the left child of the root node during the first stab, while during the second

stab it chooses to traverse to the right child of the root node.

The advantage of retrieving leaf nodes in this back and forth sequence is that it allows

us to quickly retrieve a set of leaf nodes with the most disparate sections possible in a













Figure 3-10. Execution runs of query answering algorithm with (a) 1 contributing section,
(b) 6 contributing sections, (c) 7 contributing sections and (d) 16
contributing sections.



given number of stabs. The reason that we want a non-homogeneous set of nodes is that

nodes from very distant portions of a query range will tend to have sections covering large

ranges that do not overlap. This allows us to append sections of newly retrieved leaf nodes

with the corresponding sections of previously retrieved leaf nodes. The samples obtained

can then be filtered and immediately returned.










This order of retrieval is implemented by associating a bit with each internal node

that indicates whether the next child node to be retrieved should be the left node or the

right node. The value of this bit is flipped every time the node is accessed. Figure 3-10

illustrates the choices made by the algorithm at each internal node during four separate

stabs. Note that when the algorithm reaches an internal node where the range associated

with one of the child nodes has no overlap with the query range, the algorithm always

picks the child node that has overlap with the query, irrespective of the value of the

indicator bit. The only exception to this is when all leaf nodes of the subtree rooted at

an internal node which overlaps the query range have been accessed. In such a case, the

internal node which overlaps the query range is not chosen and is never accessed again.

3.6.3 Data Structures
In addition to the structure of the internal and leaf nodes of the ACE Tree, the query
algorithm uses and updates the following two memory resident data structures:

1. A lookup table T, to store internal node information in the form of a pair of values
(next = [LEFT]| [RIGHT], done = [TRUE] |[FALSE]). The first value indicates whether
the next node to be retrieved should be the left child or right child. The second value
is TRUE if all leaf nodes in the subtree rooted at the current node have already been
accessed, else it is FALSE.

2. An array buckets[h] to hold sections of all the leaf nodes which have been accessed so
far and whose records could not be used to answer the query. h is the height of the
ACE Tree.

3.6.4 Actual Algorithm

We now present the algorithms used for answering queries using the ACE Tree.

Algorithm 2 simply calls Algorithm 3, which is the main tree traversal algorithm, called
Shuttle(). Each traversal or stab begins at the root node and proceeds down to a leaf
node. In each invocation of Shuttle(), a recursive call is made to either its left or right

child with the recursion ending when it reaches a leaf node. At this point, the sections in

the leaf node are combined with previously retrieved sections so that they can be used to

answer the query. The algorithm for combining sections is described in Algorithm 4. This









Algorithm 2: Query Answering Algorithm
Algorithm Answer (Query Q)
Let root be the root of the ACE Tree
While (!T.lookup(root).done)
    Shuttle(Q, root);

Algorithm 3: ACE Tree traversal algorithm
Algorithm Shuttle (Query Q, Node curr_node)
If (curr_node is an internal node)
    left_node = curr_node.get_left_node();
    right_node = curr_node.get_right_node();
    If (left_node is done AND right_node is done)
        Mark curr_node as done
    Else if (left_node is done)          // only the right subtree remains
        Shuttle(Q, right_node);
    Else if (right_node is done)         // only the left subtree remains
        Shuttle(Q, left_node);
    Else                                 // both children are not done
        If (Q overlaps only with left_node.R)
            Shuttle(Q, left_node);
        Else if (Q overlaps only with right_node.R)
            Shuttle(Q, right_node);
        Else                             // Q overlaps both sides or none
            If (next node is LEFT)
                Shuttle(Q, left_node);
                Set next node to RIGHT;
            Else                         // next node is RIGHT
                Shuttle(Q, right_node);
                Set next node to LEFT;
Else                                     // curr_node is a leaf node
    Combine_Tuples(Q, curr_node);
    Mark curr_node as done


algorithm determines the sections that are required to be combined with every new section

s that is retrieved, and then searches for them in the array buckets[]. If all sections are

found, it combines them with s and removes them from buckets[]. If it does not find all

the required sections in buckets[], it stores s in buckets[].









Algorithm 4: Algorithm for combining sections

Algorithm Combine_Tuples(Query Q, LeafNode node)
For each section s in node do
    Store the section numbers required to be
        combined with s to span Q, in a list list
    flag = true
    For each section number i in list do
        If buckets[] does not have section i
            flag = false
    If (flag == true)
        Combine all sections from list with s,
        use the records to answer Q, and
        remove the used sections from buckets[]
    Else
        Store s in the appropriate bucket
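To make the control flow of Algorithms 2 and 3 concrete, here is a compact, self-contained sketch of the shuttle traversal (hypothetical class and field names; it models only the flip-bit, overlap and done-marking logic over a complete binary tree, not the disk I/O or the section bookkeeping of Algorithm 4):

# Minimal sketch of the Shuttle traversal. Each internal node keeps a
# flip bit ("nxt") and a "done" flag; each stab walks from the root to
# one leaf, alternating children on successive visits to a node.
class Node:
    def __init__(self, lo, hi, depth, max_depth):
        self.range = (lo, hi)
        self.done = False
        self.nxt = "L"                         # flip bit
        self.left = self.right = None
        if depth < max_depth:
            mid = (lo + hi) // 2
            self.left = Node(lo, mid, depth + 1, max_depth)
            self.right = Node(mid, hi, depth + 1, max_depth)

def overlaps(node, q):
    return node.range[0] < q[1] and q[0] < node.range[1]

def shuttle(q, node, visit_leaf):
    if node.left is None:                      # leaf: hand it to the caller
        visit_leaf(node)
        node.done = True
        return
    l, r = node.left, node.right
    if l.done:                                 # only the right subtree remains
        shuttle(q, r, visit_leaf)
    elif r.done:                               # only the left subtree remains
        shuttle(q, l, visit_leaf)
    elif overlaps(l, q) and not overlaps(r, q):
        shuttle(q, l, visit_leaf)              # always prefer the overlapping side
    elif overlaps(r, q) and not overlaps(l, q):
        shuttle(q, r, visit_leaf)
    elif node.nxt == "L":                      # Q overlaps both sides or none
        node.nxt = "R"
        shuttle(q, l, visit_leaf)
    else:
        node.nxt = "L"
        shuttle(q, r, visit_leaf)
    node.done = l.done and r.done

root = Node(0, 100, 1, 4)                      # 8 leaves, as in Figure 3-2
while not root.done:                           # Algorithm 2's driver loop
    shuttle((40, 60), root, lambda leaf: print("retrieved leaf", leaf.range))

Each call to shuttle() retrieves exactly one previously unvisited leaf, so the loop performs one stab per leaf node, alternating sides of the tree exactly as in Figure 3-10.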


3.6.5 Algorithm Analysis

We now present a lower bound on the expected performance of the ACE Tree index

for sampling from a relational selection predicate. For simplicity, our analysis assumes that

the number of leaf nodes in the tree is a power of 2.
Lemma 1. Efficiency of the ACE Tree for query evaluation.

* Let n be the total number of leaf nodes in an ACE Tree used to sample from some
arbitrary range query, Q

* Let p be the largest power of 2 no greater than m

* Let μ be the mean section size in the tree

* Let γ be the fraction of database records falling in Q

* Let N be the size of the sample from Q that has been obtained after m ACE Tree leaf
nodes have been retrieved from disk

If m is not too large (that is, if m < 2γn + 2), then:

E[N] \geq \frac{\mu}{4} \, p \log_2 \frac{p}{2}

where E[N] denotes the expected value of N (the mean value of N after an infinite number
of trials).










Proof. Let I_{i,j} and I_{i,j+1} be the two internal nodes in the ACE Tree where R' =
I_{i,j}.R ∪ I_{i,j+1}.R covers Q and i is maximized. As long as the shuttle algorithm has not
retrieved all the children of I_{i,j} and I_{i,j+1} (this is the case as long as m < 2γn + 2), when
the mth leaf node has been processed, the expected number of new samples obtained is:

N_m = \sum_{k=1}^{\lceil \log_2 m \rceil} \sum_{l=1}^{2^{k-1}} \frac{\mu \, \gamma_{k,l}}{2^{k-1}}

where the outer summation is over each of the h - i contributing sections of the leaf
nodes, starting with section number i up to section number h, while γ_{k,l} represents the
fraction of records of the 2^{k-1} combined sections that satisfy Q. By the exponentiality
property, γ_{k,l} ≥ 1/2 for every k, so:

N_m \geq \frac{\mu}{2} \log_2 m

Thus after m leaf nodes have been obtained, the total number of expected samples is given
by:

E[N] \geq \sum_{k=1}^{m} \frac{\mu}{2} \log_2 k \geq \frac{\mu}{4} \, m \log_2 \frac{m}{2}

If m is a power of 2, the result is proven. □

Lemma 2. The expected number of records μ in any leaf node section is given by:

E[\mu] = \frac{|R|}{h \, 2^{h-1}}

where |R| is the total number of database records, h is the height of the ACE Tree and 2^{h-1}
is the number of leaf nodes in the ACE Tree.










Proof. The probability of assigning a record to any section i, i ≤ h, is 1/h. Given that
the record is assigned to section i, it can be assigned to only one of 2^{i-1} leaf node groups
after comparing with the appropriate medians. Since each group would have 2^{h-1}/2^{i-1}
candidate leaf nodes, the probability that a randomly chosen record lands in section i of
some particular leaf node L_j (accounting for the 1/h section choice, the 1/2^{i-1} chance
that its key falls in L_j's group, and the uniform choice of leaf within the group) is:

\frac{1}{h} \times \frac{1}{2^{i-1}} \times \frac{2^{i-1}}{2^{h-1}} = \frac{1}{h \, 2^{h-1}}

Multiplying this probability by the total number of database records |R| gives the expected
section size E[\mu] = |R|/(h \, 2^{h-1}). This completes the proof of the lemma. □
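As a quick sanity check against the running example (the numbers here come from Figures 3-7 and 3-9, not new assumptions): the example tree has h = 4 sections per leaf and 2^{h-1} = 8 leaf nodes built over a 31-record data set, so the lemma predicts E[μ] = 31/(4 × 8) ≈ 0.97, i.e., roughly one record per section on expectation, which is what Figure 3-9(c) shows.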

3.7 Multi-Dimensional ACE Trees

The ACE Tree can be easily extended to support queries that include multi-dimensional

predicates. The change needed to incorporate this extension is to use a k-d binary tree

instead of the regular binary tree for the ACE Tree. Let a1, ..., ak be the k key attributes

for the k-d ACE Tree. To construct such a tree, the root node would be the median of all

the a1 values in the database. Thus the root partitions the dataset based on a1. At the

next step, we need to assign values for level 2 internal nodes of the tree. For each of the

resulting partitions of the dataset, we calculate the median of all the a2 values. These two

medians are assigned to the two internal nodes at level 2 respectively, and we recursively

partition the two halves based on a2. This process is continued until we finish level k.

At level k + 1, we again consider a1 for choosing the medians. We would then assign a

randomly generated section number to every record. The strategy for assigning a leaf node

number to the records would also be similar to the one described in Section 3.5.4 except

that the appropriate key attribute is used while performing comparisons with the internal

nodes. Finally, the dataset is sorted into leaf nodes as in Figure 3-9(c).
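As a small illustration of the attribute-cycling rule just described, the following sketch (a hypothetical helper, using 1-based attribute and level numbering) selects the key attribute used to choose medians at each level of a k-d ACE Tree:

# Sketch: which of the k key attributes partitions the data at a given
# level of a k-d ACE Tree. Levels 1..k split on a1..ak, and level k+1
# wraps back around to a1, as described above.
def split_attribute(level, k):
    return (level - 1) % k + 1

# For k = 2 (the DAY/AMOUNT example of Section 3.8), levels 1, 2, 3, 4
# split on a1, a2, a1, a2 respectively.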

Query answering with the k-d ACE Tree can use the Shuttle algorithm described

earlier with a few minor modifications. Whenever a section is retrieved by the algorithm,

only records which satisfy all predicates in the query should be returned.










Figure 3-11. Sampling rate of an ACE Tree vs. rate for a B+-Tree and scan of a randomly
permuted file, with a one-dimensional selection predicate accepting 0.25% of
the database records. The graph shows the percentage of database records
retrieved by all three sampling techniques versus time plotted as a percentage
of the time required to scan the relation.


Also, the mth sections of two leaf nodes can be combined only if they match in all m
dimensions. The mth sections of two leaf nodes can be appended only if they match in the
first m - 1 dimensions and form a contiguous interval over the mth dimension.

3.8 Benchmarking

In this Section, we describe a set of experiments designed to test the ability of the

ACE Tree to quickly provide an online random sample from a relational selection predicate

as well as to demonstrate that the memory requirement of the ACE Tree is reasonable.

We performed two sets of experiments. The first set is designed to test the utility of the

ACE Tree for use with one-dimensional data, where the ACE Tree is compared with

a simple sequential file scan as well as Antoshenkov's algorithm for sampling from a

ranked B+-Tree. In the second set, we compare a multi-dimensional ACE Tree with the











I I I I I I I I


0.35 F


0.25 F


ACE Tree
Randomly permuted file



B+ Tree

0.5 1 1 .5 2 2.5 3 3.5 4
% of time required to scan the relation


0.1


0.05


Figure 3-12.


Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly
permuted file, with a one dimensional selection predicate accepting 2.5' of
the database records. The graph shows the percentage of database records
retrieved by all three sampling techniques versus time plotted as a percentage
of the time required to scan the relation


sequential file scan as well as with the obvious extension of Antoshenkov's algorithm to a

two-dimensional R-Tree.

3.8.1 Overview

All experiments were performed on a Linux workstation having 1GB of RAM, 2.4GHz
clock speed and two 80GB, 15,000 RPM Seagate SCSI disks. 64KB data pages were
used.


Experiment 1. For the first set of experiments, we consider the problem of sampling

from a range query of the form:

SELECT * FROM SALE

WHERE SALE.DAY >= i AND SALE.DAY <= j

We implemented and tested the following three random-order record retrieval
algorithms for sampling the range query:



















1.5-



E 1 Ranomlypermuted file
ACE Tree

0.5-
B+ Tree


0.5 1 1 .5 2 2.5 3 3.5 4
% of time required to scan the relation

Figure :3-1:3. Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly
permuted file, with a one dimensional selection predicate accepting 25' of
the database records. The graph shows the percentage of database records
retrieved by all three sampling techniques versus time plotted as a percentage
of the time required to scan the relation


1. ACE Tree query algorithm: The ACE Tree was implemented exactly as described in
this thesis. In order to use the ACE Tree to aid in sampling from the SALE relation, a
materialized sample view for the relation was created, using SALE.DAY as the indexed
attribute.

2. Random sampling from a B+-Tree: Antoshenkov's algorithm for sampling from a
ranked B+-Tree was implemented as described in Algorithm 1. The B+-Tree used
in the experiment was a primary index on the SALE relation (that is, the underlying
data were actually stored within the tree), and was constructed using the standard
B+-Tree bulk construction algorithm.

3. Sampling from a randomly permuted file: This is the standard sampling technique
used in previous work on online aggregation, as described in Section 3.2.1 of this
chapter. The SALE relation was randomly permuted by assigning a random key
value k to each record. All of the records from SALE were then sorted in ascending
order of each k value using a two-phase, multi-way merge sort (TPMMS) (see
Garcia-Molina et al. [38]). As the sorted records are written back to disk in the final
pass of the TPMMS, k is removed from the file.












Figure 3-14. Sampling rate of an ACE Tree vs. rate for a B+-Tree and scan of a randomly
permuted file, with a one-dimensional selection predicate accepting 2.5% of
the database records. The graph is an extension of Figure 3-12 and shows
results until all three sampling techniques return all the records matching the
query predicate.


To sample from a range predicate using a randomly permuted file, the file is scanned from
front to back and all records matching the range predicate are immediately returned.

For the first set of experiments, we synthetically generated the SALE relation to be

20GB in size with 100B records, resulting in around 200 million records in the relation.

We began the first set of experiments by sampling from 10 different range selection

predicates over SALE using the three sampling techniques described above. 0.25% of the
records from SALE satisfied each range selection predicate. For each of the three random

sampling algorithms, we recorded the total number of random samples retrieved by the

algorithm at each time instant. The average number of random samples obtained for each

of the ten queries was then calculated. This average is plotted as a percentage of the total

number of records in SALE along the Y-axis in Figure 3-11. On the X-axis, we have plotted

the elapsed time as a percentage of the time required to scan the entire relation. We chose


















Figure 3-15. Number of records needed to be buffered by the ACE Tree for queries with
(a) 0.25% and (b) 2.5% selectivity. The graphs show the number of records
buffered as a fraction of the total database records versus time plotted as a
percentage of the time required to scan the relation.










this metric considering the linear scan as the baseline record retrieval method. The test
was then repeated with two more sets of selection predicates that are satisfied by 2.5% and
25% of SALE's records, respectively. The results are plotted in Figure 3-12 and Figure
3-13. For all three figures, results are shown for the first 15 seconds of execution,
corresponding to approximately 4% of the time required to scan the relation. We show an
additional graph in Figure 3-14 for the 2.5% selectivity case, where we plot results until all
the three record retrieval algorithms return all the records matching the query predicate.

Finally, we provide experimental results to indicate the number of records that
need to be buffered by the ACE Tree query algorithm for two different query
selectivities. Figure 3-15(a) shows the minimum, maximum and the average number of
records stored for ten different queries having a selectivity of 0.25%, while Figure 3-15(b)
shows similar results for queries having selectivity 2.5%.

Experiment 2. For the second set of experiments, we add an additional attribute
AMOUNT to the SALE relation and test the following two-dimensional range query:

SELECT * FROM SALE

WHERE SALE.DAY >= d1 AND SALE.DAY <= d2

AND SALE.AMOUNT >= a1 AND SALE.AMOUNT <= a2

To generate the SALE relation, each (DAY, AMOUNT) pair in each record is generated by
sampling from a bivariate uniform distribution.

In this experiment, we again test the three random sampling options given above:


1. ACE Tree query algorithm: The ACE Tree for multi-dimensional data (a k-d ACE
Tree) was implemented exactly as described in Section 3.7. It was used to create a
materialized sample view over the DAY and AMOUNT attributes.

2. Random sampling from an R-Tree: Antoshenkov's algorithm for sampling from a
ranked B+-Tree was extended in the obvious fashion for sampling from an R-Tree [46].
Just as in the case of the B+-Tree, the R-Tree is created as a primary index, and
the data from the SALE relation are actually stored in the leaf nodes of the tree. The
R-Tree was constructed in bulk using the well-known Sort-Tile-Recursive [81] bulk
construction algorithm.










3. Sampling from a randomly permuted file: We implemented this random sampling
technique in a similar manner as Experiment 1.

In this experiment, the SALE relation was generated so as to be about 16 GB in

size. Each record in the relation was 100B in size, resulting in approximately 160 million

records.

Just as in the first experiment, we began by sampling from 10 different range selection

predicates over SALE using the three sampling techniques described above. 0.25% of
the records from SALE satisfied each range selection predicate. For all the three random

sampling algorithms, we recorded the total number of random samples retrieved by the

algorithm at each time instant. The average number of random samples obtained for each

of the ten queries is then computed. This average is plotted as a percentage of the total

number of records in SALE along the Y-axis in Figure 3-16. On the X-axis, we have plotted

the elapsed time as a percentage of the time required to scan the entire relation. The

test was then repeated with two more selection predicates that are satisfied by 2.5% and
25% of the SALE relation's records, respectively. The results are plotted in Figure 3-17 and
Figure 3-18, respectively.

3.8.2 Discussion of Experimental Results

There are several important observations that can be made from the experimental

results. Irrespective of the selectivity of the query, we observed that the ACE Tree clearly

provides a much faster sampling rate during the first few seconds of query execution

compared with the other approaches. This advantage tends to degrade over time, but since

sampling is often performed only as long as more samples are needed to achieve a desired

accuracy, the fact that the ACE Tree can provide a large, online random
sample almost immediately indicates its practical utility.

Another observation indicating the utility of the ACE Tree is that while it was the

top performer over the three query selectivities tested, the best alternative to the ACE

Tree generally changed depending on the query selectivity. For highly selective queries,












Figure 3-16. Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly
permuted file, with a spatial selection predicate accepting 0.25% of the
database tuples.



the randomly-permuted file is almost useless due to the fact that the chance that any

given record is accepted by the relational selection predicate is very low. On the other

hand, the B+-Tree (and the R-Tree over multi-dimensional data) performs relatively

well for highly selective queries. The reason for this is that during the sampling, if the

query range is small, then all the leaf pages of the B+-Tree (or R-Tree) containing records

that match the query predicate are retrieved very quickly. Once all of the relevant pages

are in the buffer, the sampling algorithm does not have to access the disk to satisfy

subsequent sample requests and the rate of record retrieval increases rapidly. However,

for less selective queries, the randomly-permuted file works well since it can make use of

an efficient, sequential disk scan to retrieve records. As long as a relatively large fraction

of the records retrieved match the selection predicate, the amount of waste incurred by

scanning unwanted records as well is small compared to the additional efficiency gained by

the sequential scan. On the other hand, when the range associated with a query having














Figure 3-17. Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly
permuted file, with a spatial selection predicate accepting 2.5% of the
database tuples.


high selectivity is very large, the time required to load all of the relevant B+-Tree (or

R-Tree) pages into memory using random disk I/Os is prohibitive. Even if the query is run

long enough that all of the relevant pages are touched, for a query with high selectivity,

the buffer manager cannot be expected to buffer all the B+-Tree (or R-Tree) pages that

contain records matching the query predicate. This is the reason that the curve for the

B+-Tree in Figure 3-13 or for the R-Tree in Figure 3-18, never leaves the y7-axis for the

time range plotted.

The net result of this is that if an ACE Tree were not used, it would probably

be necessary to use both a B+-Tree and a randomly-permuted file in order to ensure

satisfactory performance in the general case. Again, this is a point which seems to strongly

favor use of the ACE Tree.

An observation we make from Figure 3-14 is that if all the three record retrieval

algorithms are allowed to run to completion, we find that the ACE Tree is not the first to











Figure 3-18. Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly
permuted file, with a spatial selection predicate accepting 25% of the database
tuples.


complete execution. Thus, there is generally a crossover point beyond which the sampling

rate of an alternative random sampling technique is higher than the sampling rate of the

ACE Tree. However, the important point is that such a transition always occurs very late
in the query execution, by which time the ACE Tree has already retrieved almost 100% of

the possible random samples. We found this trend for all the different query selectivities

we tested with single dimensional as well as multi-dimensional ACE Trees. Thus, we

emphasize that the existence of such a crossover point in no way belittles the utility of the

ACE Tree since in practical applications where random samples are used, the number of

random samples required is very small. Since the ACE Tree provides the desired number

of random samples (and many more) much faster than the other two methods, it still

emerges as the top performer among the three methods for obtaining random samples.

Finally, Figure 3-15 shows the memory requirement of the ACE Tree to store records

that match the query predicate but cannot be used as yet to answer the query. The










fluctuations in the number of records buffered by the query algorithm at different times
during the query execution are as expected. This is because the amount of buffer space

required by the query algorithm can vary as newly retrieved leaf node sections are either

buffered (thus requiring more buffer space) or can be appended with already buffered

sections (thus releasing buffer space). We also note from Figure 3-15 that the ACE Tree

has a reasonable memory requirement since a very small fraction of the total number of

records is buffered by it.

3.9 Conclusion and Discussion

In this chapter we have presented the idea of a "sample view," which is an indexed,

materialized view of an underlying database relation. The sample view facilitates efficient

random sampling of records satisfying a relational range predicate. In the chapter we

describe the ACE Tree which is a new indexing structure that we use to index the sample

view. We have shown experimentally that with the ACE Tree index, the sample view can

be used to provide an online random sample with much greater efficiency than the obvious

alternatives. For applications like online ..-:-o negation or data mining that require a random

ordering of input records, this makes the ACE Tree a natural choice for random sampling.

This is not to say that the ACE Tree is without any drawbacks. One obvious concern

is that the ACE Tree is a primary file organization as well as an index, and hence it

requires that the data be stored within the ACE Tree structure. This means that if the

data are stored within an ACE Tree, then without replication of the data elsewhere

it is not possible to cluster the data in another way at the same time. This may be a

drawback for some applications. For example, it might be desirable to organize the data

as a B+-Tree if non-sampling-based range queries are asked frequently as well, and this is

precluded by the ACE Tree. This is certainly a valid concern. However, we still feel that

the ACE Tree will be one important weapon in a data analyst's arsenal. Applications like

online aggregation (where the database is used primarily or exclusively for sampling-based
analysis) already require that the data be clustered in a randomized fashion; in such a









situation, it is not possible to apply traditional structures like a B+-Tree anyway, and

so there is no additional cost associated with the use of an ACE Tree as the primary

file organization. Even if the primary purpose of the database is a more traditional or

widespread application such as OLAP, we note that it is becoming increasingly common

for analysts to subsample the database and apply various analytic techniques (such as

data mining) to the subsample. If such a sample were to be materialized anyway, then

organizing the subsample itself as an ACE Tree in order to facilitate efficient online

analysis would be a natural choice.

Another potential drawback of the ACE Tree as it has been described in this chapter,

is that it is not an incrementally updateable structure. The ACE Tree is relatively

efficient to construct in bulk: it requires two external sorts of the underlying data to

build from scratch. The difficulty is that as new data are added, there is not an easy

way to update the structure without rebuilding it from scratch. Thus, one potential

area for future work is to add the ability to handle incremental inserts to the sample

view (assuming that the ACE Tree is most useful in a data warehousing environment,

deletes are far less important). However, we note that even without the ability to

incrementally update an ACE-Tree, it is still easily usable in a dynamic environment if a

standard method such as a differential file [111] is applied. Specifically, one could maintain

the differential file as a randomly permuted file or even a second ACE Tree, and when a

relational selection query is posed, in order to draw a random sample from the query one

selects the next sample from either the primary ACE Tree or the differential file with an

appropriate hypergeometric probability (for an idea of how this could be done, see the

recent paper of Brown and Haas [12] for a discussion of how to draw a single sample from

multiple data set partitions). Thus, we argue that the lack of an algorithm to update the

ACE tree incrementally may not be a tremendous drawback.

Finally, we close the chapter by asserting that the importance of having indexing

methods that can handle insertions incrementally is often overstated in the research










literature. In practice, most incrementally-updateable structures such as B+-Trees
cannot be updated incrementally in a data warehousing environment due to performance

considerations anyway [93]. Such structures still require on the order of one random

I/O per update, rendering it impossible to efficiently process bulk updates consisting of

millions of records without simply rebuilding the structure from scratch. Thus, we feel

that the drawbacks associated with the ACE Tree do not prevent its utility in many

real-world situations.









CHAPTER 4
SAMPLING-BASED ESTIMATORS FOR SUBSET-BASED QUERIES

4.1 Introduction

Sampling is well-established as a method for dealing with very large volumes of

data, when it is simply not practical or desirable to perform the computation over the

entire data set. Sampling has several advantages compared to other widely-studied

approximation methodologies from the data management literature such as wavelets [88],

histograms [92] and sketches [29]. Not the least of those is generality: it is very easy to

efficiently draw a sample from a large data set in a single pass using reservoir techniques

[34]. Then, once the sample has been drawn it is possible to guess, with greater or lesser

accuracy, the answer to virtually any statistical query over those sets. Samples can easily

handle many different database queries, including complex functions in relational selection

and join predicates. The same cannot be said of the other approximation methods, which

generally require more knowledge of the query during synopsis construction, such as the

attribute that will appear in the SELECT clause of the SQL query corresponding to the

desired statistical calculation.

However, one class of aggregate queries that remains difficult or impossible to answer
with samples is the class of so-called "subset-based" queries, which can generally be written in SQL in

the form:

SELECT SUM (f1(r))

FROM R as r

WHERE f2(r) AND NOT EXISTS

(SELECT * FROM S AS s

WHERE f3(r, s))

Note that the function f2 can be incorporated into f1 if we have f1 evaluate to zero when
f2 is not true; thus, in the remainder of the chapter we will ignore f2. An example of such










a query is: "Find the total salary of all employees who have not made a sale in the past year:"



SELECT SUM (e.SAL)

FROM EMP AS e

WHERE NOT EXISTS

(SELECT * FROM SALE AS s

WHERE s.EID = e.EID)

A general solution to this problem would greatly extend the class of database queries

queries that are amenable to being answered via random sampling. For example, there is

a very close relationship between such queries and those obtained by removing the NOT

in the subquery. Using the terminology introduced later in this chapter, all records from
EMP with i matching records in SALE are called "class i" records. The only difference between NOT

EXISTS and EXISTS is that the former query computes a sum over all class 0 records,

whereas the latter query computes a sum over all class i > 0 records. Since any reasonable

estimator for NOT EXISTS will likely have to compute an estimated sum over each class, a

solution for NOT EXISTS should immediately -II__- -a solution for EXISTS. Also, nested

queries having an IN (or NOT IN) clause can be easily rewritten as a nested query having

the EXISTS (or NOT EXISTS) clause. For example, the query "Find the total salary of all

employees who have not made a sale in the past year" given above can also be written as:

SELECT SUM (e.SAL)

FROM EMP as e

WHERE e.EID NOT IN

(SELECT s.EID FROM SALE AS s)

Furthermore, a solution to the problem of sampling for subset queries would allow

sampling-based aggregates over SQL DISTINCT queries, which can easily be re-written as

subset queries. For example:

SELECT SUM (DISTINCT e.SAL)










FROM EMP AS e

is equivalent to:

SELECT SUM (e.SAL)

FROM EMP AS e

WHERE NOT EXISTS

(SELECT * FROM EMP AS e2

WHERE id(e) < id(e2)

AND e.SAL = e2.SAL)

In this query, id is a function that returns the row identifier for the record in question.

Some work has considered the problem of sampling for counts of distinct attribute values

[17, 49], but aggregates over DISTINCT queries remain an open problem. Similarly, it
is possible to write an aggregate query where records with identical values may appear
more than once in the data, but should be considered no more than once by the aggregate

function as a subset-based SQL query. For example:

SELECT SUM (e.SAL)

FROM EMP AS e

WHERE NOT EXISTS

(SELECT * FROM EMP AS e2

WHERE id(e) < id(e2)

AND identical(e, e2))

In this query, the function identical returns true if the two records contain identical

values for all of their attributes. This would be very useful in computations where the

same data object may be seen at many sites in a distributed environment (packets in an

IP network, for example). Previous work has considered how to perform sampling in such

a distributed system [12, 77], but not how to deal with the duplicate data problem.

Unfortunately, it turns out that handling subset queries using sampling is exceedingly

difficult, due to the fact that the subquery in a subset query is not asking for a mean or a










sum, tasks for which sampling is particularly well-suited. Rather, the subquery is asking

whether we will ever see a match for each tuple from the outer relation. By looking at an

individual tuple, this is very hard to guess: either we have seen a match already on our

sample (in which case we are assured that the inner relation has a match), or we have not,

in which case we may have almost no way to guess whether we will ever see a match. For

example, imagine that employee Joe does not have a sale in a 10% sample of the SALE
relation. How can we guess whether or not he has a sale in the remaining 90%?

There is little relevant work in the statistical literature to suggest how to tackle

subset queries, because such queries ask a simultaneous question linking two populations

(database tables EMP and SALE in our example), which is an uncommon question in

traditional applications of finite population sampling. Outside of the work on sampling for

the number of distinct values [17, 49, 50] and one method that requires an index on the

inner relation [75], there is also little relevant work in the data management literature;

we presume this is due to the difficulty of the problem; researchers have considered the

difficulty of the more limited problem of sampling for distinct values in some detail [17].

Our Contributions

In this chapter, we consider the problem of developing sampling-based statistical

estimators for such queries. In the remainder of this chapter, we assume without-replacement

sampling, though our methods could easily be extended to other sampling plans. Given

the difficulty of the problem, it is perhaps not surprising that significant statistical and

mathematical machinery is required for a satisfactory solution.

Our first contribution is to develop an unbiased estimator, which is the traditional

first step when searching for a good statistical estimator. An unbiased estimator is one

that is correct on expectation; that is, if an unbiased estimator is run an infinite number

of times, then the average over all of the trials would be exactly the same as the correct

answer to the query. The reason that an unbiased estimator is the natural first choice is










that if the estimator has low variance1, then the fact that it is correct on average implies

that it will .ll.k--i--s be very close to the correct answer.

Unfortunately, it turns out that the unbiased estimator we develop often has high

variance, which we prove analytically and demonstrate experimentally. Since it is easy to

argue that our unbiased estimator is the only unbiased estimator for a certain subclass of

subset-based queries (see the Related Work section of this chapter), it is perhaps doubtful

that a better unbiased estimator exists.

Thus, we also propose a novel, biased estimator that makes use of a statistical

technique called "superpopulation modeling." Superpopulation modeling is an example
of a so-called Bayesian statistical technique [39]. Bayesian methods generally make use of

mild and reasonable distributional assumptions about the data in order to greatly increase

estimation accuracy, and have become very popular in statistics in the last few decades.

Using this method in the context of answering subset-based queries presents a number of

significant technical challenges whose solutions are detailed in this chapter, including:

* The definition of an appropriate generative statistical model for the problem of
sampling for subset-based queries.

* The derivation of a unique Expectation Maximization algorithm [26] to learn the
model from the database samples.

* The development of algorithms for efficiently generating many new random data sets
from the model, without actually having to materialize them.

Through an extensive set of experiments, we show that the resulting biased Bayesian

estimator has excellent accuracy on a wide variety of data. The biased estimator also has

the desirable property that it provides something closely related to classical confidence

bounds, that can be used to give the user an idea of the accuracy of the associated

estimate.




1 Variance is the statistical measure of the random variability of an estimator.









4.2 The Concurrent Estimator

With a little effort, it is not hard to imagine several possible sampling-based

estimators for subset queries. In this section, we discuss one very simple (and sometimes

unusable) sample-based estimator. This estimator has previously been studied in detail

[75], but we present it here because it forms the basis for the unbiased estimator described
in the next section.

We begin our description with an even simpler estimation problem. Given a
one-attribute relation R(A) consisting of n_R records, imagine that our goal is to estimate
the sum over attribute A of all the records in R. A simple, sample-based estimator would
be as follows. We obtain a random sample R', of size n_R', of all the records of R, compute
total = Σ_{r∈R'} r.A, and then scale up total to output total × n_R/n_R' as the estimate for
the final sum. Not only is this estimator extremely simple to understand, but it is also
unbiased, consistent, and its variance reduces monotonically with increasing sample size.

We can extend this simple idea to define an estimator for the NOT EXISTS query

considered in the introduction. We start by obtaining random samples EMP' and SALE' of

sizes n_EMP' and n_SALE', respectively, from the relations EMP and SALE. We then evaluate the
NOT EXISTS query over the samples of the two relations. We compare every record in EMP'
with every record in SALE', and if we do not find a matching record (that is, one for which
f3 evaluates to true), then we add its f1 value to the estimated total. Lastly, we scale up
the estimated total by a factor of n_EMP/n_EMP' to obtain the final estimate, which we term M:





M = \frac{n_{EMP}}{n_{EMP'}} \times \sum_{e \in EMP'} f_1(e) \times (1 - \min(1, cnt(e, SALE')))

In this expression, cnt(e, SALE') = \sum_{s \in SALE'} I(f_3(e, s)), where I is the standard
indicator function, returning 1 if the boolean argument evaluates to true, and 0 otherwise.

The algorithm can be slightly modified to accommodate for growing samples of the










relations, and has been described in detail in [75], where it is called the "concurrent

estimator" since it samples both relations concurrently.

Unfortunately, on expectation, the estimator is often severely biased, meaning that

it is, on average, incorrect. The reason for this bias is fairly intuitive. The algorithm

compares a record from EMP' with all records from SALE', and if it does not find a matching

record in SALE', it classifies the record as having no match in the entire SALE relation.

Clearly, this classification may be incorrect for certain records in EMP, since although they

might have no matching record in SALE', it is possible that they may match with some

record from the part of SALE that was not included in the sample. As a result, M typically
overestimates the answer to the NOT EXISTS query. In fact, the bias of M is:

Bias(M) = \sum_{e \in EMP} f_1(e) \times \Big( \phi(n_{SALE}, n_{SALE'}, cnt(e, SALE)) - (1 - \min(1, cnt(e, SALE))) \Big)

In this expression, \phi denotes the hypergeometric2 probability that a sample of size n_SALE'
will contain none of the cnt(e, SALE) matching records of e.

The solution that was employed previously to counteract this bias requires an index

such as a B+-Tree on the entire SALE relation, in order to estimate and correct for

Bias(M). Unfortunately, the requirement for an index severely limits the applicability

of the method. If an index on the "join" attribute in the inner relation is not available,

the method cannot be used. In a streaming environment where it is not feasible to store

SALE in its entirety, an index is not practical. The requirement of an index also precludes

use of the concurrent estimator for a non-equality predicate in the inner subquery or




2 The hypergeometric probability distribution models the distribution of the number
of red balls that will be obtained in a sample without replacement of n' balls from an urn
containing r red balls and n - r non-red balls.










for non-database environments where sampling might be useful, such as in a distributed

system.

In the remainder of this chapter, we consider the development of sampling-based

estimators for this problem that require nothing but samples from the relations themselves.

Our first estimator makes use of a provably unbiased estimator Bias(M~) for Bias(M~).

Taken together, M~ Bias(M~). is then an unbiased estimator for the final query answer.

The second estimator we consider is quite different in character, making use of B li-, -i Ia

statistical techniques.

4.3 Unbiased Estimator

4.3.1 High-Level Description

In order to develop an unbiased estimator for Bias(M), it is useful to first re-write
the formula for Bias(M) in a slightly different fashion. We subsequently refer to the set
of records in EMP that have i matches in SALE as "class i" records. Denote the sum of the
aggregate function over all records of class i by t_i, so t_i = Σ_{e∈EMP} f_1(e) × I(cnt(e, SALE) = i)
(note that the final answer to the NOT EXISTS query is the quantity t_0). Given that the
probability that a record with i matches in SALE happens to have no matches in SALE' is
\phi(n_{SALE}, n_{SALE'}, i), we can re-write the expression for the bias of M as:

Bias(M) = \sum_{i=1}^{m} \phi(n_{SALE}, n_{SALE'}, i) \times t_i    (4-1)

The above equation computes the bias of M since it computes the expected sum
over the aggregate attribute of all records of EMP which are incorrectly classified as class 0
records by M.

Let m be the maximum number of matching records in SALE for any record of EMP.

Equation 4-1 suggests an unbiased estimator for Bias(M) because it turns out that
it is easy to generate an unbiased estimate for t_m: since no records other than those
with m matches in SALE can have m matches in SALE', we can simply count the sum










of the aggregate function f_1 over all such records in our sample, and scale up the total
accordingly. The scale-up would also be done to account for the fact that we use SALE' and
not SALE to count matches. Once we have an estimate for t_m, it is possible to estimate
t_{m-1}. How? Note that records with m - 1 matches in SALE' must be a member of either
class m or class m - 1. Using our unbiased estimate for t_m, it is possible to guess the
total aggregate sum for those records with m - 1 matches in SALE' that in reality have m
matches in SALE. By subtracting this from the sum for those records with m - 1 matches
in SALE' and scaling up accordingly, we can obtain an unbiased estimate for t_{m-1}. In a
similar fashion, each unbiased estimate for t_i leads to an unbiased estimate for t_{i-1}. By
using this recursive relationship, it is possible to guess in an unbiased fashion the value
for each t_i in the expression for Bias(M). This leads to an unbiased estimator for the
Bias(M) quantity, which can be subtracted from M to provide an unbiased guess for the
query result.

4.3.2 The Unbiased Estimator In Depth

We now formalize the above ideas to develop an unbiased estimator for each t_k
that can be used in conjunction with Equation 4-1 to develop an unbiased estimator for
Bias(M). We use the following additional notation for this section and the remainder of
this chapter:

* \alpha_{k,i} is a 0/1 (non-random) variable which evaluates to 1 if the ith tuple of EMP has k
matches in SALE and evaluates to 0 otherwise.

* s_k is the sum of f_1 over all records of EMP' having k matching records in SALE':
s_k = Σ_{e_i∈EMP'} I(cnt(e_i, SALE') = k) × f_1(e_i).

* \alpha_0 is n_EMP'/n_EMP, the sampling fraction of EMP.

* X_i is a random variable which governs whether or not the ith record of EMP appears in
EMP'.

* h(k; n_{SALE}, n_{SALE'}, i) is the hypergeometric probability that out of the i interesting
records in a population of size n_SALE, exactly k will appear in a random sample of size
n_SALE'. For compactness of representation we will refer to this probability as h(k; i) in
the remainder of the thesis, since our sampling fraction never changes. A short
computational sketch of h is given below.
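As a computational aside, h(k; i) is just a standard hypergeometric probability mass; a minimal sketch using scipy (an implementation choice for illustration, not part of the original development) is:

from scipy.stats import hypergeom

# h(k; i): probability that exactly k of the i matching records appear in
# a without-replacement sample of size n_sale_prime drawn from a
# population of n_sale records.
def h(k, i, n_sale, n_sale_prime):
    return hypergeom.pmf(k, n_sale, i, n_sale_prime)

# The phi(n_SALE, n_SALE', i) of Equation 4-1 is the special case of
# drawing none of the i matches:
def phi(i, n_sale, n_sale_prime):
    return h(0, i, n_sale, n_sale_prime)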










We begin by noting that if we consider only those records from EMP which appear in
the sample EMP', an unbiased estimator for t_k over EMP' can be expressed as follows:

\hat{t}_k = \frac{1}{\alpha_0} \sum_{i=1}^{n_{EMP}} X_i \times \alpha_{k,i} \times f_1(e_i)    (4-2)

Unfortunately, this estimator relies on being able to evaluate \alpha_{k,i} for an arbitrary record,
which is impossible without scanning the inner relation in its entirety. However, with
a little cleverness, it is possible to remove this requirement. We have seen earlier that
a record e can have k matches in the sample SALE' provided it has i ≥ k matches in
SALE. This implies that records from all classes i where i ≥ k can contribute to s_k.
The contribution of a class i record towards the expected value of s_k is obtained by
simply multiplying the probability that it will have k matches in SALE' with its aggregate
attribute value. Thus a generic expression to compute the contribution of any arbitrary
record e_j from EMP' towards the expected value of s_k can be written as \sum_{i=k}^{m} \alpha_{i,j} \times h(k; i) \times
f_1(e_j). Then, the following random variable has an expected value that is equivalent to the
expected value of s_k:

\hat{s}_k = \sum_{j=1}^{n_{EMP}} X_j \times \sum_{i=k}^{m} \alpha_{i,j} \times h(k; i) \times f_1(e_j)    (4-3)
The fact that E [sk] = E [S] (prOVen in Section 4.3.3) is significant, because there is a

simple algebraic relationship between the various s^ variables and the various t^ variables.

Thus, we can express one set in terms of the other, and then replace each skr With Sk, in

order to derive an unbiased estimator for each t^. The benefit of doing this is that since Sk,

is defined as the sum of fl over all records of EMP' having k matching records in SALE', it

can be directly evaluated from the samples EMP' and SALE'.










To derive the relationship between ŝ and t̂, we start with an expression for ŝ_{m-r} using
Equation 4-3:

\hat{s}_{m-r} = \sum_{j=1}^{n_{EMP}} Y_j \sum_{i=m-r}^{m} \alpha_{i,j} \, h(m-r; i) \, f_1(e_j)

        = \sum_{i=m-r}^{m} h(m-r; i) \sum_{j=1}^{n_{EMP}} Y_j \, \alpha_{i,j} \, f_1(e_j)

        = \sum_{i=0}^{r} h(m-r; m-r+i) \sum_{j=1}^{n_{EMP}} Y_j \, \alpha_{m-r+i,j} \, f_1(e_j)

        = \alpha_0 \sum_{i=0}^{r} h(m-r; m-r+i) \, \hat{t}_{m-r+i}    (4-4)

By re-arranging the terms we get the following important recursive relationship:

\hat{t}_{m-r} = \frac{\hat{s}_{m-r} - \alpha_0 \sum_{i=1}^{r} h(m-r; m-r+i) \, \hat{t}_{m-r+i}}{\alpha_0 \, h(m-r; m-r)}    (4-5)

For the base case we obtain:

\hat{t}_m = \frac{\hat{s}_m}{\alpha_0 \, h(m; m)} = a_m \times \hat{s}_m    (4-6)

where a_m = 1/(α_0 h(m; m)).

By replacing ŝ_{m-r} in the above equations with s_{m-r}, which is readily observable from
the data and has the same expected value, we can obtain a simple recursive algorithm for
computing an unbiased estimator for any t_i. Before presenting the recursive algorithm, we
note that we can re-write Equation 4-5 for t̂_i by replacing ŝ with s, changing the
summation variable from i to k, and substituting m − r by i:

\hat{t}_i = \frac{s_i - \alpha_0 \sum_{k=1}^{m-i} h(i; i+k) \, \hat{t}_{i+k}}{\alpha_0 \, h(i; i)}










The following pseudo-code then gives the algorithm for computing an unbiased
estimator for any t_i.

Function GetEstTi(int i)
1   if (i == m)
2     return s_m / (α_0 × h(m; m))
3   else
4     returnval = s_i
5     for (int k = 1; k <= m − i; k++)
6       returnval −= α_0 × h(i; i + k) × GetEstTi(i + k)
7     returnval /= α_0 × h(i; i)
8     return returnval
9 }
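For concreteness, the following is a minimal C++ rendering of this computation, assuming the observed sums s_0, ..., s_m have already been computed by joining EMP' against SALE', and that a function such as hyperg above (with the sample sizes bound) supplies h(k; i). Working from t_m down to t_0 evaluates each estimate exactly once, which is the O(m^2) dynamic-programming formulation mentioned again in Section 4.6.2. The function name estimateAllTi is our own.

    #include <functional>
    #include <vector>

    // s[k]   : observed sum of f1 over EMP' records with exactly k matches in SALE'
    // h(k,i) : the hypergeometric probability h(k; i)
    // alpha0 : sampling fraction of EMP
    std::vector<double> estimateAllTi(const std::vector<double>& s, double alpha0,
                                      int m, std::function<double(int,int)> h) {
        std::vector<double> tHat(m + 1);
        tHat[m] = s[m] / (alpha0 * h(m, m));      // base case, Equation 4-6
        for (int i = m - 1; i >= 0; --i) {        // recursion, Equation 4-5
            double val = s[i];
            for (int k = 1; k <= m - i; ++k)
                val -= alpha0 * h(i, i + k) * tHat[i + k];
            tHat[i] = val / (alpha0 * h(i, i));
        }
        return tHat;
    }

Equation 4-7 below then amounts to one further loop: since p(n_SALE, n_SALE', i) is just the hypergeometric probability h(0; i), the bias estimate is the sum of h(0, i) × tHat[i] for i from 1 to m.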


Recall from Equation 4-1 that the bias of M~ was expressed as a linear combination
of various t_i terms. Using GetEstTi to estimate each of the t_i terms, we can write an
estimator for the bias of M~ as:

\widehat{Bias}(\tilde{M}) = \sum_{i=1}^{m} p(n_{SALE}, n_{SALE'}, i) \times GetEstTi(i)    (4-7)


In the following two subsections, we present a formal analysis of the statistical

properties of our estimator.



3 Note the h(m; m) probability in line 2 of the GetEstTi function. If the sample size
from SALE is not at least as large as m, then h(m; m) = 0 and GetEstTi is undefined.
This means that our estimator is undefined if the sample is not at least as large as the
largest number of matches for any record from EMP in SALE. The fact that the estimator is
undefined in this case is not surprising, since it means that our estimator does not conflict
with known results regarding the existence of an unbiased estimator for the distinct value
problem. See the Related Work section for more details.










4.3.3 Why Is the Estimator Unbiased?

According to Equation 4-7, the estimator for the bias of M~ is composed of a sum of
m different estimators. Hence by the linearity of expectation, the expected value of the
estimator can be written as:

E[\widehat{Bias}(\tilde{M})] = \sum_{i=1}^{m} p(n_{SALE}, n_{SALE'}, i) \times E[GetEstTi(i)]    (4-8)

The above relation suggests that in order to prove that the sample-based estimator
of Equation 4-7 is unbiased, it would suffice to prove that each of the individual GetEstTi
estimators is unbiased. We use mathematical induction to prove the correctness of the
various estimators on expectation.

As a preliminary step for the proof of unbiasedness, we first derive the expected
values of the s_k estimators used by GetEstTi. To do this, we introduce a zero/one random
variable H_{j,k} that evaluates to 1 if e_j has k matches in SALE' and 0 otherwise. The
expected value of this variable is simply the probability that it evaluates to 1, giving us
E[H_{j,k}] = h(k; cnt(e_j, SALE)). With this:

E[s_k] = E\left[\sum_{j=1}^{n_{EMP}} Y_j \, H_{j,k} \, f_1(e_j)\right]
       = \alpha_0 \sum_{j=1}^{n_{EMP}} \sum_{i=k}^{m} \alpha_{i,j} \times h(k; i) \times f_1(e_j)    (4-9)
j= 1 i= k

We are now ready to present a formal proof of unbiasedness of GetEstTi.

Theorem 1. The expected value of GetEstTi(i) is t_i = \sum_{j=1}^{n_{EMP}} \alpha_{i,j} \, f_1(e_j).

Proof. Using Equation 4-5, the recursive GetEstTi estimator can be re-written as:

GetEstTi(i) = \frac{s_i - \alpha_0 \sum_{k=1}^{m-i} h(i; i+k) \, GetEstTi(i+k)}{\alpha_0 \, h(i; i)}    (4-10)









We first prove unbiasedness for the base case, GetEstTi(m). Setting i = m in the
above relation and taking the expectation:

E[GetEstTi(m)] = \frac{E[s_m]}{\alpha_0 \, h(m; m)}

Replacing E[s_m] using Equation 4-9:

E[GetEstTi(m)] = \frac{\alpha_0 \sum_{j=1}^{n_{EMP}} \alpha_{m,j} \, h(m; m) \, f_1(e_j)}{\alpha_0 \, h(m; m)}
               = \sum_{j=1}^{n_{EMP}} \alpha_{m,j} \, f_1(e_j)

which is exactly the value of t_m.

By induction, we can now assume that all estimators GetEstTi(i + k) for 1 ≤ k ≤ m − i
are unbiased, and we use this to prove that the estimator GetEstTi(i) is unbiased. Taking
the expectation on both sides of Equation 4-10:

E[GetEstTi(i)] = E\left[\frac{s_i - \alpha_0 \sum_{k=1}^{m-i} h(i; i+k) \, GetEstTi(i+k)}{\alpha_0 \, h(i; i)}\right]

By the linearity of expectation:

= \frac{E[s_i] - \alpha_0 \sum_{k=1}^{m-i} h(i; i+k) \, E[GetEstTi(i+k)]}{\alpha_0 \, h(i; i)}

Replacing the values of E[GetEstTi(i + k)] and E[s_i]:

= \frac{1}{\alpha_0 \, h(i; i)} \left( \alpha_0 \sum_{j=1}^{n_{EMP}} \sum_{k=i}^{m} \alpha_{k,j} \, h(i; k) \, f_1(e_j)
  - \alpha_0 \sum_{k=1}^{m-i} h(i; i+k) \sum_{j=1}^{n_{EMP}} \alpha_{i+k,j} \, f_1(e_j) \right)

For the second term in the parentheses, replacing i + k by p and changing the limits of
summation for the inner sum accordingly:

= \frac{1}{h(i; i)} \left( \sum_{j=1}^{n_{EMP}} \sum_{k=i}^{m} \alpha_{k,j} \, h(i; k) \, f_1(e_j)
  - \sum_{p=i+1}^{m} h(i; p) \sum_{j=1}^{n_{EMP}} \alpha_{p,j} \, f_1(e_j) \right)    (4-11)

We notice that the limits of summation of the inner sum of the first term are from i to m.
Splitting this term into two pieces, one containing only the single index k = i and the
other with limits of summation from i + 1 to m, the second piece cancels exactly with the
second term of Equation 4-11, leaving:

= \frac{1}{h(i; i)} \sum_{j=1}^{n_{EMP}} \alpha_{i,j} \, h(i; i) \, f_1(e_j)
= \sum_{j=1}^{n_{EMP}} \alpha_{i,j} \, f_1(e_j)    (4-12)




4.3.4 Computing the Variance of the Estimator

The unbiasedness of our estimator for Bias(M~) means that it may be useful. However, the
accuracy of any estimator depends on its variance as well as its bias. We now investigate
the variance of our unbiased estimator.

We have seen that \widehat{Bias}(\tilde{M}) is a linear combination of the various GetEstTi results, with
p(n_SALE, n_SALE', i) as the coefficient of GetEstTi(i). In order to derive an expression for the
variance of the estimator and gain insight about the potential values it can take, we first
express the estimator as a linear combination of s_i terms:

\widehat{Bias}(\tilde{M}) = \sum_{i=1}^{m} b_i \times s_i    (4-13)









The next step in deriving the variance is being able to compute the various b_i values.
Intuitively, the b_i terms come from the linear relationship between the t̂_i and s_i terms.
The following algorithm shows how we can actually compute the b_i values.

Function ComputeBis(m)
1   // Let table[m][m] be a 2-dimensional array with all elements initialized to zero
2   for (int row = 0; row < m; row++) {
3     for (int term = 1; term <= row; term++) {
4       factor = −h(m − row; m − row + term) / h(m − row; m − row)
5       prow = row − term
6       for (int pcol = 0; pcol <= prow; pcol++)
7         table[row][pcol] += factor × table[prow][pcol]
8     }
9     table[row][row] = 1 / h(m − row; m − row)
10  }
11  for (int row = 0; row < m; row++)
12    for (int col = 0; col <= row; col++)
13      b_{m−col} += (1/α_0) × h(0; m − row) × table[row][col]

(Recall that p(n_SALE, n_SALE', i) is just the hypergeometric probability h(0; i), which is
why h(0; m − row) appears in line 13; row r of the table holds the coefficients expressing
t̂_{m−r} in terms of the s values.)


With this, the variance of this estimator can then be written as:

Var\left(\widehat{Bias}(\tilde{M})\right) = Var\left(\sum_{i=1}^{m} b_i \, s_i\right)    (4-14)










Note that the s_i values are not independent random variables, since if an EMP' record
has i matches in SALE', then it cannot simultaneously have j ≠ i matches in SALE'. Hence
we have:

Var\left(\widehat{Bias}(\tilde{M})\right) = \sum_{i=1}^{m} b_i^2 \, Var(s_i)
  + 2 \sum_{i=1}^{m} \sum_{j<i} b_i \, b_j \, Cov(s_i, s_j)    (4-15)

The Var and Cov terms can be computed by using the standard formulas:

Var(s_i) = E[s_i^2] - E^2[s_i],    Cov(s_i, s_j) = E[s_i s_j] - E[s_i] \, E[s_j]    (4-16)

To evaluate Var(s_i) and Cov(s_i, s_j), E[s_i^2] and E[s_i s_j] can be computed as follows:

E[s_i s_j] = E\left[\left(\sum_{k=1}^{n_{EMP}} Y_k \, f_1(e_k) \, H_{k,i}\right)
                   \left(\sum_{r=1}^{n_{EMP}} Y_r \, f_1(e_r) \, H_{r,j}\right)\right]
           = \sum_{k=1}^{n_{EMP}} \sum_{r=1}^{n_{EMP}} E[Y_k Y_r] \, E[H_{k,i} H_{r,j}] \, f_1(e_k) \, f_1(e_r)    (4-17)

The above expression can be evaluated using the following rules:

* if k ≠ r (that is, e_k and e_r are two different tuples), then E[H_{k,i} H_{r,j}] =
h(i; cnt(e_k, SALE)) × h(j; cnt(e_r, SALE)), if we assume that no record s exists in
SALE where f_3(e_k, s) and f_3(e_r, s) are both true

* if i = j (that is, we are computing E[s_i^2]) and k = r, then E[H_{k,i} H_{r,j}] =
h(i; cnt(e_k, SALE))

* if i ≠ j (that is, we are computing E[s_i s_j]) and k = r, then E[H_{k,i} H_{r,j}] = 0, since a
record cannot have two different numbers of matches in a sample

* if k = r, then E[Y_k Y_r] = α_0

* if k ≠ r, then E[Y_k Y_r] ≈ α_0^2
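To make the rules concrete, the following sketch evaluates E[s_i s_j] exactly as Equation 4-17 prescribes. It assumes, unrealistically (as discussed in the next subsection), that cnt(e, SALE) is known for every record of EMP, and it uses the approximation E[Y_k Y_r] ≈ α_0^2 for k ≠ r; it is O(n^2) in the size of EMP, so it is illustrative only.

    #include <functional>
    #include <vector>

    // cnt[k] : number of SALE matches of the k-th EMP record (assumed known here)
    // f1[k]  : aggregate attribute value of the k-th EMP record
    double expectedSiSj(int i, int j, double alpha0,
                        const std::vector<int>& cnt, const std::vector<double>& f1,
                        std::function<double(int,int)> h) {
        double sum = 0.0;
        const std::size_t n = cnt.size();
        for (std::size_t k = 0; k < n; ++k) {
            // k == r terms: E[Yk Yk] = alpha0, and E[Hk,i Hk,j] = 0 unless i == j.
            if (i == j) sum += alpha0 * h(i, cnt[k]) * f1[k] * f1[k];
            for (std::size_t r = 0; r < n; ++r) {
                if (r == k) continue;
                // k != r terms, assuming no two EMP records share a SALE match.
                sum += alpha0 * alpha0 * h(i, cnt[k]) * h(j, cnt[r]) * f1[k] * f1[r];
            }
        }
        return sum;
    }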

4.3.5 Is This Good?

At this point, we now have a simple, unbiased estimator for the answer to a

subset-based query, as well as a formal analysis of the statistical properties of the











[Figure 4-1 depicts the superpopulation model: an i.i.d. process generates an infinite
superpopulation; random sampling from it produces the actual population of records
1, 2, 3, ..., N; and a second, hypothetical sampling step under the desired sampling design
produces the sample.]

Figure 4-1. Sampling from a superpopulation


estimator. However, there are two problems related to the variance that may limit the
utility of the estimator.

First, in order to evaluate the hypergeometric probabilities needed to compute or
estimate the variance, we need the value of cnt(e, SALE) for an arbitrary record e of
EMP. This information is generally unavailable during sampling, and it seems difficult
or impossible to obtain a good estimate for the appropriate probability without having
this information. This means that in practice, it will be difficult or impossible to tell
a user how accurate the resulting estimate is likely to be. We have experimented with
general-purpose methods such as the bootstrap [31] to estimate this variance, but have
found that these methods often do an extremely poor job in practice.

Second, the variance of the estimator itself may be huge. The b_i coefficients are
composed of sums, products and ratios of hypergeometric probabilities, which can
result in huge values. Particularly worrisome is the h(i; i) value in the denominator
used by GetEstTi. Such probabilities can be tiny; including such a small value in the
denominator of an expression results in a very large value that may "pump up" the
variance accordingly.










4.4 Developing a Biased Estimator

In light of these problems, in this section we describe a biased estimator that is
often far more accurate than the unbiased one, and that also provides the user with an idea
of the estimation accuracy. Just like the unbiased estimator M~ − \widehat{Bias}(\tilde{M}) from the
previous section, our biased estimator will be nothing more than a weighted sum over the
observed s_k values. However, the weights will be chosen so as to minimize the expected or
mean-squared error of the resulting estimator.

To develop our biased estimator, we make use of the "superpopulation modeling"
approach from statistics [78]. One simple way to think of a superpopulation is that it
is an infinitely large set of records from which the original data set has been obtained
by random sampling. Because the superpopulation is infinite, it is specified using a
parametric distribution, which is usually referred to as the prior distribution.

Using a superpopulation method, we imagine that the following two-step process is used to
produce our sample (a minimal simulation of this process is sketched after the two steps):

1. Draw a large sample of size N from an imaginary infinite superpopulation, where N is
the data set size.

2. Draw a sample of size n < N without replacement from the large sample of size N
obtained in Step 1, where n is the desired sample size.
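The two-step process is trivial to simulate; the following C++17 sketch assumes only that the prior can be sampled through a caller-supplied drawFromPrior function (a name of our own invention), with each record abstracted as a single value.

    #include <algorithm>
    #include <functional>
    #include <iterator>
    #include <random>
    #include <vector>

    // Step 1: draw N i.i.d. values from the (conceptually infinite) prior F.
    // Step 2: draw n of those N values without replacement (std::sample).
    std::vector<double> twoStepSample(std::size_t N, std::size_t n, std::mt19937& gen,
            const std::function<double(std::mt19937&)>& drawFromPrior) {
        std::vector<double> population(N);
        for (double& v : population) v = drawFromPrior(gen);

        std::vector<double> sample;
        sample.reserve(n);
        std::sample(population.begin(), population.end(),
                    std::back_inserter(sample), n, gen);
        return sample;
    }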

By characterizing the superpopulation, it is possible to design an estimator that tends
to perform well on any data set and sample obtained using the process above.

The following steps outline a road-map of our superpopulation-based approach for
obtaining a high-quality biased estimator for a subset-based query. We describe each step
in detail in the next section.

1. Postulate a superpopulation model F for our data set (F is the prior distribution;
we use the notation p_F to denote the probability density function (PDF) of F). In
general, F is parameterized on a parameter set Θ.

2. Infer the most likely values of the parameter set Θ from EMP' and SALE'. Since we
do not have the complete data, but rather a random sample of the data, this is a
difficult problem. We make use of an Expectation-Maximization (EM) algorithm to
learn the model parameters.

3. Use F(Θ) to generate d different populations P_1, ..., P_d, where each P_i = (EMP_i, SALE_i).
Note that if the data set in question is large, this may be very expensive. We show
that for our problem it is not necessary to generate the actual populations; it is
enough to obtain certain sufficient statistics for each of them, which can be done
efficiently.

4. Sample from each P_i to obtain d sample pairs of the form S_i = (EMP_i', SALE_i'). Again,
this can be done without actually materializing the samples.

5. Let q(P_i) be the query answer over the ith data set. Construct a weighted estimator
W that minimizes \sum_i (q(P_i) - W(S_i))^2.

6. Use W on the original samples EMP' and SALE' to obtain the final estimate to the NOT
EXISTS query. The MSE of this estimate can generally be assumed to be the MSE
over all of the populations generated: \sum_{i=1}^{d} (1/d) \times (q(P_i) - W(S_i))^2.

4.5 Details of Our Approach

In this section, we discuss in detail each of the steps, outlined above, of our approach
for obtaining an optimal weighted estimator for the NOT EXISTS query.

4.5.1 Choice of Model and Model Parameters

The first task is to define a generative model and an associated probability density
function for the two relations EMP and SALE. While this may seem like a
daunting task (and a potentially impossible one given all of the intricacies of modeling
real-life data), it is made easy by the fact that we only need to define a model that can
realistically reproduce those characteristics of EMP and SALE that may affect the bias or
variance of an estimator for a subset-based query. From the material in Section 4.3 of the
thesis, for a given record e from EMP, we know that these three characteristics are:

1. f_1(e)

2. cnt(e, SALE), which is the number of SALE records s for which f_3(e, s) is true

3. cnt(e, e', SALE) where e' ≠ e, which is the number of SALE records s for which
f_3(e, s) ∧ f_3(e', s) is true

To simplify our task, we will actually ignore the third characteristic and define a
model such that this count is always zero for any given record pair. While this may










introduce some inaccuracy into our method, it still captures a large number of real-life
situations. For example, if f_3 consists of an equality check on a foreign key from SALE
into EMP (which is arguably the most common example of such a subset-based query), then
two records from EMP can never match with the same record from SALE and this count is
always zero.

Given that our model needs to be able to generate instances of EMP and SALE that
realistically model the first two aspects given above, we choose the parameter set Θ = {p,
μ, σ²} where:

* p is a vector of probabilities, where p_i represents the probability that any arbitrary
record of EMP belongs to class i.

* μ is a vector of means, where μ_i represents the mean aggregate value of all records
belonging to class i.

* σ² is the variance of f_1(e) over all records e ∈ EMP.

Then, given these parameters, EMP and SALE are generated using our model as follows:





Procedure GenData
1   For rec = 1 to n_EMP do
2     Randomly generate k between 0 and m such that for any i,
      0 ≤ i ≤ m, Pr[k = i] = p_i
3     Generate a value for f_1(e) by sampling from N(μ_k, σ²)
4     Add the resulting e to EMP
5     For j = 1 to k do
6       Generate a record s where f_3(e, s) is true
7       Add s to SALE
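A runnable C++ rendering of GenData might look as follows. Representing each SALE record by the id of the EMP record it matches is our own simplification; it suffices because, under the model's assumptions, a SALE record matches exactly one EMP record.

    #include <cmath>
    #include <random>
    #include <vector>

    // p holds p_0..p_m; mu holds mu_0..mu_m; sigma2 is the shared variance.
    void genData(const std::vector<double>& p, const std::vector<double>& mu,
                 double sigma2, std::size_t nEmp, std::mt19937& gen,
                 std::vector<double>& empF1, std::vector<std::size_t>& saleMatch) {
        std::discrete_distribution<int> classOf(p.begin(), p.end());
        const double sigma = std::sqrt(sigma2);
        for (std::size_t rec = 0; rec < nEmp; ++rec) {
            int k = classOf(gen);                          // steps 1-2: pick a class
            std::normal_distribution<double> fOne(mu[k], sigma);
            empF1.push_back(fOne(gen));                    // steps 3-4: draw f1(e)
            for (int j = 0; j < k; ++j)                    // steps 5-7: k matches
                saleMatch.push_back(rec);
        }
    }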



In step (3), N(μ_k, σ²) denotes a normally distributed random variable with the specified mean
and variance. We use a normal random variable because we are interested in sums over
classes in EMP; due to the central limit theorem (CLT), these sums will be normally
distributed for a large database. Thus, using a normal random variable does not result
in loss of generality. Also, note that in step (6), according to our earlier assumption,
the generated record s matches only e; that is, f_3(e', s) is false for every e' ≠ e.

In our actual model, the various μ_i values are not assumed to be independent; rather,
we assume a linear relationship between them to limit the degrees of freedom
of the model and thus avoid the problem of overfitting (see "Dealing with Over-fitting"
in Section 4.5.2). In our model the various μ_i values are related as μ_i = s × i + μ_0, where s and μ_0 are the
only two parameters that need to be learned to determine all the μ_i. Also, in order to
avoid overfitting, we assume that σ² is the variance of f_1(e) over all records, rather than
modeling and learning variance values for all the individual classes separately.

We now define the density function for the superpopulation model corresponding to
the GenData algorithm. For a given EMP record e, if f_1(e) = v and cnt(e, SALE) = k, the
probability density for e given a parameter set Θ is given by:

p(e|Θ) = p(v, k|Θ) = p_k \, f_N(v; \mu_k, \sigma^2)    (4-18)

Where it is convenient, we will use the notation p(v, k|Θ) for values v and k and
p(e|Θ) for record e interchangeably. In this expression, f_N is the PDF for the normal
distribution evaluated at v and is given by:

f_N(v; \mu_k, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(v-\mu_k)^2}{2\sigma^2}}    (4-19)

Then if we consider a given data set {EMP, SALE}, the probability density of the data
set is simply the product of the densities of all the individual records:

p(EMP, SALE|Θ) = \prod_{e \in EMP} p(e|Θ)    (4-20)

A Note on the Generality of the Model. As described, our model is extremely
general, making almost no assumptions about the data other than the fact that f_1(e)
values are normally distributed. This is actually an inconsequential assumption anyway,
since we are interested in sums over f_1(e) values, which will be normally distributed
whatever the distribution of f_1(e), due to the CLT.

On one hand, this generality can be seen as a benefit of the approach: it makes use of
very few assumptions about the data. Most significant is the lack of any sort of restriction
on the probability vector p. The result is that the number of records from SALE matching
a certain record from EMP is multinomially distributed. On the other hand, a Bayesian
argument [39] can be made that such extreme freedom is actually a poor choice, and that
in "real life" an analyst will have some sort of idea what the various p_i values look like,
and a more restrictive distribution providing fewer degrees of freedom should be used.
For example, a negative binomial distribution has been assumed for the distinct value
estimation problem [90]. Such background knowledge could certainly improve the accuracy
of the method.

Though we eschew any such restrictions in the remainder of the thesis (except for an
assumption of a linear relationship among the μ_i values; see "Dealing with Over-fitting"
in the next section), we note that it would be very easy to incorporate such knowledge
into our method. The only change needed is that the EM algorithm described in the next
section would need to be modified to incorporate any constraints induced on the various
parameters by additional distributional assumptions.

4.5.2 Estimation of Model Parameters

Now that we have defined our superpopulation model, we need access to the
parameter set Θ that was used to create our particular instances of EMP and SALE, in
order to develop an estimator that performs well for the resulting superpopulation.
However, we have several difficulties. First, we do not know Θ; since EMP and SALE are in
reality not sampled from any parametric distribution, Θ does not even exist. We could
compute a maximum-likelihood estimate (MLE)4 to choose a Θ that optimally fits EMP
and SALE, but then we have an even bigger problem: we do not even have access to EMP
and SALE; we only have access to samples from them. Thus, we need a way to infer Θ by
looking only at the samples EMP' and SALE'.

It turns out that we can still make use of an MLE. Since EMP' may be treated as a
set of independent, identically distributed samples from F, if we simply replace EMP with
EMP' as an argument to p_F, then by choosing Θ so as to maximize p_F, we will still produce
exactly the same estimate for Θ on expectation that we would have if EMP were used
instead. Thus, we can essentially ignore the distinction between EMP and EMP'. However,
the same argument does not hold for SALE, because without access to all of SALE, we
cannot compute k = cnt(e, SALE) for arbitrary e in order to apply an MLE.

To handle this, we will modify our PDF slightly to also take into account the
sampling from SALE. This can easily be done by modifying the function p(v, k|Θ). To
simplify the modification, we ignore the fact that the number of records s from SALE'
where f_3(e_1, s) is true may be correlated with the number of records from SALE' where
f_3(e_2, s) is true for arbitrary records e_1 and e_2 from EMP; that is, we assume that we are
looking for matches of a record e in its own private sample from SALE, and that all
of these samplings are independent. With this, if f_1(e) = v and cnt(e, SALE) = k and
cnt(e, SALE') = k', then:

p(v, k, k'|Θ) = p(v, k|Θ) \, h(k'; k)    (4-21)

In this expression, h is the hypergeometric probability of seeing k' matches for e in
SALE', given that there were k matches in SALE.

4 An MLE is a standard statistical estimator for unknown model parameters when a
sample is available; the MLE simply chooses Θ so as to maximize the value of the PDF of
the sample.










Since the portion of SALE that is not in SALE' is hidden from us due to the sampling, we
do not know k, and we have a classic example of an MLE problem with hidden or missing
data. There are several methods in the literature for solving such a problem; the one
that we employ is the Expectation-Maximization (EM) algorithm.

The EM algorithm [26] is a general method of finding the maximum-likelihood
estimate of the parameters of an underlying distribution from a given data set when the
data is incomplete or has missing values. EM starts out with an initial assignment of
values for the unknown parameters and, at each step, recomputes new values for each of
the parameters via a set of update rules. EM continues this process until the likelihood
stops increasing any further. Since cnt(e, SALE) is unknown, the likelihood function
marginalizes over its possible values:

L(Θ | {EMP', SALE'}) = \prod_{e \in EMP'} \sum_{k=0}^{m} p(f_1(e), k, cnt(e, SALE') | Θ)

We present the derivation of our EM implementation in the Appendix, while here we
give only the algorithm. In this algorithm, f(i|Θ, e) denotes the posterior probability of
record e belonging to class i. This is the probability that, given the current set of values for
Θ, record e belongs to class i.

Procedure EM(Θ)
1   Initialize all parameters of Θ; L_prev = −9999
2   while (true) {
3     Compute L(Θ) from the sample and assign it to L_curr
4     if ((L_curr − L_prev)/L_prev < 0.01) break
5     Compute the posterior probabilities f(i|Θ, e) for each e ∈ EMP' and each class i
6     Recompute all parameters of Θ by using the following update rules:
7       p_i = (1/n_EMP') × Σ_{e ∈ EMP'} f(i|Θ, e)
8       σ² = (1/n_EMP') × Σ_{e ∈ EMP'} Σ_{i=0}^{m} f(i|Θ, e) × (f_1(e) − μ_i)²
9       Update s and μ_0 (and hence each μ_i = s × i + μ_0) as derived in the Appendix
10    L_prev = L_curr
11  }
12  Return values in Θ as the final parameters of the model


Every iteration of the EM algorithm performs an expectation (E) step and a
maximization (M) step. In our algorithm, the E-step is contained in step (5), where
for each record e of EMP' a set of probability values f(i|Θ, e), 0 ≤ i ≤ m, is computed
under the current model parameters Θ. The posterior probability f(i|Θ, e) is computed
as described in the Appendix. Intuitively, the posterior probability for record e and class
i is a ratio of two quantities: (1) the probability that e belongs to class i according to the
density function of the model, and (2) the sum of the probabilities that it belongs to each of
the classes 0 through m, also according to the model density function.
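Under the density of Equations 4-18 and 4-21, this ratio can be computed directly; the following sketch (our own illustration, not the Appendix derivation) takes v = f_1(e) and kPrime = cnt(e, SALE'), with h supplying the hypergeometric probability h(kPrime; i).

    #include <cmath>
    #include <functional>
    #include <vector>

    // Posterior f(i | Theta, e) for classes 0..m, where
    // p(v, i, kPrime | Theta) = p_i * fN(v; mu_i, sigma2) * h(kPrime; i).
    std::vector<double> posterior(double v, int kPrime,
                                  const std::vector<double>& p,
                                  const std::vector<double>& mu, double sigma2,
                                  std::function<double(int,int)> h) {
        const double kPi = 3.141592653589793;
        const int m = static_cast<int>(p.size()) - 1;
        std::vector<double> post(m + 1, 0.0);
        double norm = 0.0;
        for (int i = 0; i <= m; ++i) {
            double fN = std::exp(-(v - mu[i]) * (v - mu[i]) / (2.0 * sigma2))
                        / std::sqrt(2.0 * kPi * sigma2);
            post[i] = p[i] * fN * h(kPrime, i);   // zero whenever i < kPrime
            norm += post[i];
        }
        for (double& q : post) q /= norm;          // the ratio described above
        return post;
    }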

The M-step (which corresponds to steps (6)-(9) of our algorithm) updates the
parameters of our model in such a way that the expected value of the likelihood function
associated with the model is maximized with respect to the posterior probabilities. Details
of how we obtain the various update rules are explained in the Appendix.

The observant reader may note that the EM algorithm assumes that the parameter
m is known before the process has begun. This is potentially a problem, since m will
typically be unknown. Fortunately, knowing the exact value for m is not vital, particularly
if m is overestimated (in which case the class probabilities associated with the class i
records for large i will end up being zero, if the EM algorithm functions correctly). As a
rough estimate for m, we take the record from EMP' with the largest number of matches in
SALE' and scale up its number of matches by n_SALE/n_SALE'. Particularly if several records
with m matches in SALE are expected to appear in EMP', this estimate for m will be quite
conservative.










Dealing with Over-fitting. The superpopulation model has a total of 2(m + 1) + 1
parameters within Θ. Since the number of degrees of freedom of the model is so large, the
model has tremendous leeway when choosing parameter values. This potentially leads to
a well-known drawback of learned models: over-fitting the training data, where the model
is tailored to be excessively well-suited to the training data at the cost of generality.

Several techniques have been proposed to address the over-fitting problem [30]. We
use the following two methods in our approach:

* Limiting the number of degrees of freedom of the model.

* Using multiple models and combining them to develop our final estimator.

To use the first technique, we restrict our generative model so that the mean
aggregate value of all records of any class i is not independent of the mean value of
other classes. Rather, we use a simple linear regression model μ_i = s × i + μ_0; s and μ_0 are
the two parameters of the linear regression model and can be learned easily. This means
that once we have learned the two parameters s and μ_0, the μ_i values for all other classes
can be determined directly by the above relation and will not be learned separately. As
mentioned previously, it would also be possible to place distributional constraints upon the
vector p in order to reduce the degrees of freedom even more, though we choose not to do
this in our implementation.

Our second strategy to tackle the over-fitting problem is to learn multiple models
rather than working with a single model. These models differ from each other only in that
they are learned using our EM algorithm with different initial random settings for their
parameters. When generating populations from the models learned via EM (as described
in the next subsection), we then rotate through the various models in round-robin fashion.


Are we not done yet? Once the model has been learned, a simple estimator is
immediately available to us: we could return p_0 × μ_0 × n_EMP, since this will be the expected
query result over an arbitrary database sampled from the model. This is equivalent to
first determining a class of databases that the database in question has been randomly
selected from, and then returning the average query result over all of those databases. If
multiple models are learned in order to alleviate the over-fitting problem, then we can use
the average of this expression over all of those models.

While this estimator is certainly reasonable, the concern is twofold. First, if there is
high variability in the possible populations that could be produced by the model or models
(corresponding to uncertainty in the correctness of the model), then simply taking the
average over all of these populations can be expected to result in an answer with high variance.
A related concern is that this is not very robust to errors in the model-learning process:
an error in the model will lead directly to an error in the estimate.

Thus, in the next few subsections we detail a process that attempts to simultaneously
perform well on any and all of the databases that could be sampled from the model,
rather than simply returning the mean answer over all potential databases. The method
samples a large number of ((EMP_i, SALE_i), (EMP_i', SALE_i')) combinations from the model,
and then attempts to construct an estimator that can accurately infer the query answer
over precisely the (EMP_i, SALE_i) that has been sampled, by looking at (EMP_i', SALE_i').

4.5.3 Generating Populations From the Model

Once we know the parameter set Θ, the next task is to generate many instances
of P_i = (EMP_i, SALE_i) and S_i = (EMP_i', SALE_i') in order to optimize our biased estimator
over these population-sample pairs. The difficulty is that in practice, EMP and SALE can
have billions of records in them. Hence, it would not be feasible to actually materialize
each (P_i, S_i) pair. The good news is that for our problem it is not necessary to actually
generate the populations if we can generate statistics associated with the pair that are
sufficient to optimize our biased estimator.

Computing sufficient statistics for EMP and SALE. For each P_i, we must generate the
following statistics:

* The number of records of EMP belonging to each class (we use n_i to denote this).

* The mean over f_1 for all records belonging to each class.










The first set of statistics is easy to generate if we notice that the number of records
belonging to each class simply follows a multinomial distribution with n_EMP trials, where each
multinomial bucket probability is given by the vector p. A single, vector-valued sample
from an appropriately-distributed multinomial distribution can then give us each n_i.

The next set of statistics can be computed by relying on the CLT. According to
the generative model, the aggregate attribute value of records of the superpopulation
belonging to class i has mean μ_i and variance σ². Since the population is an i.i.d.
random sample from the superpopulation, the mean aggregate value of records belonging
to class i follows a normal distribution with mean μ_i and variance σ²/n_i.
Thus t_i, which is the sum over the aggregate attribute of all records of class i, can then be
obtained by drawing a trial from the normal distribution N(μ_i, σ²/n_i) and multiplying it
by n_i. A sketch of both draws is given below.
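The following sketch implements the multinomial draw as a chain of binomials (a standard equivalence) and the per-class sums t_i as single normal trials, so that no population records are ever materialized. The function name populationStats is our own.

    #include <cmath>
    #include <random>
    #include <vector>

    // Sufficient statistics for one generated population: class sizes n_i and
    // per-class sums t_i, drawn directly from their sampling distributions.
    void populationStats(const std::vector<double>& p, const std::vector<double>& mu,
                         double sigma2, long long nEmp, std::mt19937& gen,
                         std::vector<long long>& n, std::vector<double>& t) {
        const std::size_t classes = p.size();
        n.assign(classes, 0);
        t.assign(classes, 0.0);
        long long left = nEmp;      // records not yet assigned to a class
        double pLeft = 1.0;         // probability mass not yet consumed
        for (std::size_t i = 0; i < classes && left > 0; ++i) {
            std::binomial_distribution<long long> bin(left, std::min(1.0, p[i] / pLeft));
            n[i] = bin(gen);        // one marginal of the multinomial draw
            left -= n[i];
            pLeft -= p[i];
            if (n[i] > 0) {
                // Class mean is normal with variance sigma2 / n_i (CLT);
                // scaling by n_i yields the class sum t_i.
                std::normal_distribution<double> classMean(mu[i], std::sqrt(sigma2 / n[i]));
                t[i] = classMean(gen) * n[i];
            }
        }
    }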


Computing sufficient statistics for EMP' and SALE'. For each S_i, we must generate the
following statistics:

* The number of sampled records from each class of EMP; this is denoted by n_i'.

* The number of sampled records from the ith class of EMP that have j matches in
SALE', for each i and j. We denote this by n_{i,j}'.

* The mean over f_1 corresponding to each n_{i,j}'.

The first set of statistics can be produced by repeatedly sampling from a hypergeometric
distribution. To compute n_0', we sample from a hypergeometric distribution with
parameters n_EMP, n_EMP', and n_0 (these parameters are the population size, the sample
size, and the size of the subpopulation of interest, respectively). To compute n_1', we sample
from a hypergeometric distribution with parameters n_EMP − n_0, n_EMP' − n_0', and n_1. n_2' is
sampled from a hypergeometric distribution with parameters n_EMP − (n_0 + n_1), n_EMP' − (n_0' + n_1'),
and n_2. This process is repeated for each n_i'.

Once each n_i' is generated, each n_{i,j}' is generated. In order to speed the process of
generating each n_{i,j}', we assume that the expected value of each n_i' is small compared
to n_SALE, so that there is little difference between sampling with and without replacement.
Thus, we can assume that each n_{i,j}', j ≤ i, is binomially distributed, which in turn means
that all n_{i,j}' are multinomially distributed, where the probability that any class i record
will have j matches in the sample SALE' is a hypergeometric probability denoted by h(j; i).
A single trial over a multinomial random variable having probabilities of h(j; i) for j from
0 to i will then give us each n_{i,j}' for a given i.

Finally, again using a CLT-based argument, the mean over f_1 for all of the records
corresponding to each n_{i,j}' is generated by a single trial over a normal random variable
N(μ_i, σ²/n_{i,j}').
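Since the C++ standard library provides no hypergeometric distribution, the sketch below draws one value by simulating the sequential urn process directly; chaining calls as described above then yields each n_i'. The function name hypergDraw is our own.

    #include <random>

    // One draw from Hypergeometric(nPop, nSamp, nSub): the number of
    // "interesting" records landing in a size-nSamp sample drawn without
    // replacement from nPop records, nSub of which are interesting.
    // O(nSamp) per draw, acceptable for generating summary statistics.
    long long hypergDraw(long long nPop, long long nSamp, long long nSub,
                         std::mt19937& gen) {
        long long hits = 0;
        for (long long d = 0; d < nSamp; ++d) {
            std::bernoulli_distribution takeInteresting(
                static_cast<double>(nSub) / static_cast<double>(nPop));
            if (takeInteresting(gen)) { ++hits; --nSub; }
            --nPop;
        }
        return hits;
    }

For example, n_0' = hypergDraw(n_EMP, n_EMP', n_0, gen), then n_1' = hypergDraw(n_EMP − n_0, n_EMP' − n_0', n_1, gen), and so on for each class.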

4.5.4 Constructing the Estimator

We have seen in the previous subsection that once a model has been learned, it can be
used to generate statistics for any number of population/sample pairs.

Recall from Section 4.4 that the jth population generated and the sample from that
population are P_j = (EMP_j, SALE_j) and S_j = (EMP_j', SALE_j'), respectively. Let s_{i,j} be the
value of s_i computed over S_j; that is, it is the sum of f_1 over all tuples in EMP_j' that have i
matches in SALE_j'. Our goal in all of this is to construct a weighted estimator:

W(S_j) = \sum_{i=0}^{m} w_i \times s_{i,j}    (4-22)

that minimizes:

SSE = \sum_{j} \left( W(S_j) - q(P_j) \right)^2    (4-23)

where q(P_j) is the answer to the NOT EXISTS query over the jth population.

W should be optimized by choosing each w_i so as to minimize the SSE (sum-squared
error) given above. In order to compute these weights, we evaluate the partial derivative of
the SSE with respect to each of the unknown weights. For example, by taking the partial
derivative of the SSE with respect to w_0, we obtain:

\frac{\partial SSE}{\partial w_0} = \sum_{j} 2 \left( \sum_{i=0}^{m} w_i \, s_{i,j} - q(P_j) \right) s_{0,j}



If we differentiate with respect to each w_i and set the resulting m + 1 expressions to
zero, we obtain m + 1 linear equations in the m + 1 unknown weights. These equations can
be represented in matrix form as A w = c, where for 0 ≤ k, i ≤ m:

A[k][i] = \sum_{j} s_{k,j} \, s_{i,j},    c[k] = \sum_{j} q(P_j) \, s_{k,j}

The optimal weights can then be easily obtained by using a linear equation solver to
solve the above system of equations.
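A small Gaussian-elimination solver is all that is needed for the (m+1)×(m+1) system; the following sketch assumes A and c have been filled in as defined above and that the system is non-singular. A library solver would serve equally well; this is our own illustration, not the thesis implementation.

    #include <cmath>
    #include <utility>
    #include <vector>

    // Solves A w = c by Gaussian elimination with partial pivoting.
    std::vector<double> solveWeights(std::vector<std::vector<double>> A,
                                     std::vector<double> c) {
        const std::size_t n = c.size();
        for (std::size_t col = 0; col < n; ++col) {
            std::size_t piv = col;                       // choose the pivot row
            for (std::size_t r = col + 1; r < n; ++r)
                if (std::abs(A[r][col]) > std::abs(A[piv][col])) piv = r;
            std::swap(A[col], A[piv]);
            std::swap(c[col], c[piv]);
            for (std::size_t r = col + 1; r < n; ++r) {  // eliminate below pivot
                double f = A[r][col] / A[col][col];
                for (std::size_t cc = col; cc < n; ++cc) A[r][cc] -= f * A[col][cc];
                c[r] -= f * c[col];
            }
        }
        std::vector<double> w(n);
        for (std::size_t r = n; r-- > 0; ) {             // back substitution
            double acc = c[r];
            for (std::size_t cc = r + 1; cc < n; ++cc) acc -= A[r][cc] * w[cc];
            w[r] = acc / A[r][r];
        }
        return w;
    }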

Once W has been derived, it is then applied to the original samples EMP' and SALE'
in order to estimate the answer to the query. By dividing the SSE obtained via the
minimization problem described above by the number of data sets generated, we can also
obtain a reasonable estimate of the mean-squared error of W.

4.6 Experiments

In this section we describe results of the experiments we performed to test our

estimators. Our experiments are designed to test the accuracy of our estimators and the

running time of the biased estimator, over a wide variety of data sets.

4.6.1 Experimental Setup

In this subsection, we describe the properties of the various data sets we use to test

our estimators. We generate 66 synthetic data sets and use three real-life data sets for

conducting our experiments. All our experiments were performed on a Linux workstation

having 1 GB of RAM and a 2.4 GHz clock speed and all software was implemented using

the C++ programming language.










4.6.1.1 Synthetic data sets

In each data set, we have two relations, EMP (EID, AGE, SAL) and SALE (SALEID,

EID, AMOUNT) of size 10 million and 50 million records, respectively. We evaluate the

following SQL query over each data set:

SELECT SUM (e.SAL)

FROM EMP as e

WHERE NOT EXISTS

(SELECT * FROM SALE AS s

WHERE s.EID = e.EID)

Two important data set properties that affect the query result are:

1. The distribution of the number of matching records in SALE for each record of EMP

2. The distribution of e.SAL values of all records of EMP

Based on these two important properties, we synthetically generated data sets so
that the distribution of the number of matching records for all EMP records follows a
discretized Gamma distribution. The Gamma distribution was chosen because it produces
positive numbers and is very flexible, allowing a long tail to the right. This means that it
is possible to create data sets for which most records in EMP have very few matches, but
some have a large number. We chose values of 1, 2 and 5 for the Gamma distribution's
shift parameter and values of 0.5 and 1 for the scale parameter. Based on these different
values for the shift and scale parameters, we obtained six possible data sets: 1: (shift =
1, scale = 0.5); 2: (shift = 2, scale = 0.5); 3: (shift = 5, scale = 0.5); 4: (shift = 1, scale
= 1); 5: (shift = 2, scale = 1); and 6: (shift = 5, scale = 1). For these six data sets, the
fraction of EMP records having no matches in SALE (and thus contributing to the query
answer) was .86, .59, .052, .63, .27, and .0037, respectively. A plot of the probability that
an arbitrary tuple from EMP has m matches in SALE for each of the six data sets is given as
Figure 4-2. This shows the wide variety of data set characteristics we tested.











[Figure 4-2 plots, for each of the six data sets, the probability of each possible number of
matches per record from SALE (0 through 6).]

Figure 4-2. Six distributions used to generate, for each e in EMP, the number of records s in
SALE for which f_3(e, s) evaluates to true.


We also varied the distribution of the e.SAL values such that the distribution can be

one of the following:

a. Normally distributed with a mean of 100 and standard deviation of 10

b. Normally distributed with a mean of 100 and standard deviation of 200, with only

the absolute values considered

c. Zipfian distributed with a skew parameter of 0.5

d. Zipfian distributed with a skew parameter of 1.0

We doubled the number of data sets by further providing a linear positive correlation

or no correlation between the e.SAL value of a record and the number of matching

records it has in SALE. We thus obtained 48 different data sets considering all possible

combinations of the distribution of matching records and the distribution of e.SAL values.

We also tested our estimator on 18 additional synthetic data sets that were
deliberately designed to have properties that violate the assumptions of the superpopulation
model of our biased estimator, so as to see how robust this estimator is to inaccuracies in
the parametric model. From Section 4.5.1, the three specific assumptions we made for our
superpopulation model were:

1. cnt(e, e', SALE) = 0 when e' ≠ e. Thus, the number of SALE records s for which
f_3(e, s) ∧ f_3(e', s) is true is zero. In other words, different records from EMP do not
"share" matching records in SALE.

2. There exists a linear relationship between the mean aggregate values of the different
classes of EMP records, given by μ_i = s × i + μ_0, where s is the slope of the straight line
connecting the various μ_i values.

3. The variance of the aggregate attribute values of records of any class is approximately
equal to the single model parameter σ².

For each of these three cases, we generate six different data sets using the six different
sets of gamma parameters described earlier. Thus we obtain 18 more data sets, where the
first six sets violate assumption 1, the next six sets violate assumption 2, and the last six
sets violate assumption 3. For each of these 18 data sets, the aggregate attribute value is
normally distributed with a mean of 100 and standard deviation of 200, except for the last
six sets, where different values of standard deviation are chosen for records from different
classes.

In order to violate assumption 1, we no longer assume a primary key-foreign key
relationship between EMP and SALE. To generate a data set violating this assumption, a set
s_1 of 100 records from EMP is selected. Let max be the largest number of matches
in SALE for any record from s_1. Then an associated set s_2 of max records is added to SALE
such that all records in s_1 have their matching records in s_2. Assumption 2 was violated
using μ_i = s × j + μ_0, where j ≠ i (in fact, the j value for a given i is randomly selected
from 1...m). Assumption 3 was violated by assuming different values for the variance
of records from different classes. We randomly chose these values from the range (100,
15000).

4.6.1.2 Real-life data sets

The three real-life data sets we use in our experiments are from the Internet Movie
Database (IMDB) [1], the Synoptic Cloud Reports [3] obtained from the Oak Ridge
National Laboratory, and the network connections data set from the 1999 KDDCup
event.

The IMDB database contains several relations with information about movies, actors
and production studios. For our experiments, we use the two relations MovieBusiness
and MovieGoofs. MovieBusiness contains information about box-office revenues of movies
while MovieGoofs contains records that describe unintended mistakes or goofs in various
movies. The following schema shows the relevant attributes of the two relations for the
queries we tested in our experiments.

MovieBusiness (MovieName, NumAdmissions)

MovieGoofs (GoofId, MovieName)

MovieName is the primary key of MovieBusiness and a foreign key of MovieGoofs.

We tested the following three SQL queries on the two relations of the IMDB data set.

Q1: SELECT SUM (b.NumAdmissions)
    FROM MovieBusiness as b
    WHERE NOT EXISTS
        (SELECT * FROM MovieGoofs AS g
         WHERE g.MovieName = b.MovieName)

Q2: SELECT SUM (b.NumAdmissions)
    FROM MovieBusiness as b
    WHERE NOT EXISTS
        (SELECT * FROM MovieBusiness AS b2
         WHERE id(b) < id(b2)
         AND b.NumAdmissions = b2.NumAdmissions)

Q3: SELECT COUNT (*)
    FROM MovieBusiness as b
    WHERE NOT EXISTS
        (SELECT * FROM MovieGoofs AS g
         WHERE g.MovieName = b.MovieName)

The second real-life data set we use is the Synoptic Cloud Report (SCR) data set. It
contains weather reports for a 10-year period obtained from measuring stations on land
as well as water. We use weather reports for the months of December 1981 and November
1991 from measuring stations on land. Specifically, the two relations and their relevant
schema used in our experiments are:

DEC81 (Id, Latitude, CloudAmount)

NOV91 (Id, Latitude, CloudAmount)

Here, Id is the key in both relations. We tested the following two SQL queries on
the relations DEC81 and NOV91.

Q4: SELECT SUM (D81.CloudAmount)
    FROM DEC81 as D81
    WHERE NOT EXISTS
        (SELECT * FROM NOV91 AS N91
         WHERE N91.Latitude = D81.Latitude)

Q5: SELECT COUNT (*)
    FROM DEC81 as D81
    WHERE NOT EXISTS
        (SELECT * FROM NOV91 AS N91
         WHERE N91.Latitude = D81.Latitude)

The KDDCup data set contains information about various network connections that
can potentially be used for intrusion detection. This data set has 42 integer, real-valued,
and categorical attributes. We tested our estimator on this data set by estimating the
total number of source bytes of connections that were anomalously different from
the rest of the network connections. That is, we summed the total number of source
bytes created by outlier connections. Our definition of anomalously different records is
those records whose distance from all other records in the data set is greater than some
predefined threshold. For our experiments, we use a simple distance function that uses
Euclidean distance for numerical attributes and a 0/1 distance for categorical attributes;
a sketch of one plausible implementation follows.
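The record layout below, and the choice to add the 0/1 categorical penalties outside the Euclidean square root, are our own assumptions; the thesis does not spell out these details.

    #include <cmath>
    #include <string>
    #include <vector>

    // A network connection record: numeric and categorical attributes.
    struct Conn {
        std::vector<double> num;        // numeric attributes
        std::vector<std::string> cat;   // categorical attributes
    };

    // Euclidean distance over numeric attributes, plus a 0/1 penalty
    // for each differing categorical attribute.
    double dist(const Conn& a, const Conn& b) {
        double d2 = 0.0;
        for (std::size_t i = 0; i < a.num.size(); ++i)
            d2 += (a.num[i] - b.num[i]) * (a.num[i] - b.num[i]);
        double d = std::sqrt(d2);
        for (std::size_t i = 0; i < a.cat.size(); ++i)
            d += (a.cat[i] == b.cat[i]) ? 0.0 : 1.0;
        return d;
    }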

We execute the following query on the KDDCup data set for our experiments.

SELECT SUM (kc1.SourceBytes)
FROM KDDCup as kc1
WHERE NOT EXISTS
    (SELECT * FROM KDDCup AS kc2
     WHERE d(kc1, kc2) < threshold)

By choosing different values for threshold, we can control the selectivity of the above
query. For our experiments, we define Q6, Q7 and Q8 as three variants of the above query
with different values of threshold, so that Q6 has a selectivity of around 2%, Q7 has a
selectivity of 1.7%, while Q8 has a selectivity of 0.1%.

4.6.2 Results

We ran our experiments on 1%, 5% and 10% random samples of the data sets (both
relations in each data set were sampled independently, without replacement, at the same
rate). Both the biased estimator and the unbiased estimator were run ten times on each
of the test cases. For comparison, we also analytically compute the standard error for the
concurrent estimator described in Section 4.2. Results from the first 48 synthetic data sets
are given in Tables 4-1 and 4-2, while results from the next 18 synthetic data sets (which
specifically violate the model assumptions) are presented in Table 4-3. Real-life data set
results are shown in Table 4-4. For each of the test cases, we give the square root of the
observed mean-squared error (that is, the standard error) for the biased, unbiased, as well
as concurrent estimator. Because an absolute value for the standard error lacks any
sort of scale and thus would not be informative, we give the standard error as a percentage
of the total aggregate value of all records in the database. For example, for the synthetic
data sets, we give the standard error as a percentage of the answer to the query:

SELECT SUM (e.SAL)
FROM EMP as e

Thus, if the estimation method simply returned zero every time, its error would vary
between 0% and 100%, depending on the selectivity of the subquery. If the method is
also able to estimate with high accuracy which of the constituent records should not be
counted in the aggregate total, then the error can be reduced to an arbitrarily small level.

Although our error metric is different from the relative error (which takes the ratio
of the absolute error to the true query answer), the value of the relative error can readily
be computed from the error value given, by dividing by the ratio of the query answer
to the total aggregate value of all records in the outer relation. For all eight cases of
data set 1, the query answer is approximately 90% of the total answer; hence, the relative
error is about 1.1 times the error reported in Table 4-1. Similarly, for the rest of the data
sets, the factors are: data set 2: 1.7; data set 3: 19; data set 4: 1.5; data set 5: 3.7; and
data set 6: 270. For the IMDB and SCR data sets, the factors are between 1 and 5.5, while
for the KDDCup the factors range from 2 (for the high selectivity query) to 40 (for the
very low selectivity query).

When we tested the queries, we also recorded the number of times (out of ten) that
the answer given by the biased estimator was within ±2 estimated standard errors of the
real answer to the query, and found that for almost all the test cases this number was ten,
while only for a couple of test cases was this number found to be nine out of ten.

Finally, we measured the computation time required by the biased estimator to
initially learn the generative model, then compute weights for the various components of
the estimator, and finally provide an estimate of the query result. We observed that
for the synthetic data sets (which consist of 10 million and 50 million records in the two
relations), the maximum observed running time of the biased estimator was between 3 and 4
seconds for a 10% sample from each. The vast majority of this time is spent in the EM
learning algorithm, which requires O(m × |EMP'| × i) time, where m is the maximum
possible number of matches for a record in EMP with records in SALE, and i is the number
of iterations required for EM convergence. We speed up our implementation by sub-sampling
EMP' and using the subsample in the EM algorithm rather than using EMP' directly. The
justification for this is that the EM can be quite expensive with a large EMP', and the
accuracy of the modeling step is much more closely related to the size of SALE'. We use a
subsample of size 500 in our experiments.

In comparison, computation for the unbiased estimator is almost instantaneous,
requiring a small fraction of a second. In our test data, the most costly operation for
the unbiased estimator is running the "join" between EMP' and SALE'; that is, searching
for matches for each record from EMP' in SALE'. Given summary statistics describing this
matching, the core GetEstTi routine itself can be implemented as a dynamic programming
algorithm that takes time O(m'²), where m' is the maximum number of matches for any
record from EMP' in SALE'.

4.6.3 Discussion

One of the most obvious results from Table 4-1 is that the unbiased estimator has
uniformly small error only on those eight tests performed using synthetic data set 1,
where the number of matches for each record e ∈ EMP is generated using a Gamma
distribution with parameters (shift = 1, scale = 0.5). In this particular data set, only a
very small number of the records are excluded by the NOT EXISTS clause, since 86% of the
records in EMP do not have a match in SALE. Furthermore, only a very small number of the
records have a large number of matches. Both of these characteristics tend to stabilize the
variance of the unbiased estimator, making it a fine choice.

For all the other data sets, the unbiased estimator does very poorly in most of the
cases. For synthetic data, the estimator's worst performance is on data set 6, in which
less than one percent of the records are accepted by the NOT EXISTS clause and several
records from EMP have more than 15 matching records in SALE. In this case, the unbiased
estimator is unusable, and the results were particularly poor with correlation between
the number of matches and the aggregate value that is summed. For example, in the

















Data set type                 1% Sample               5% Sample               10% Sample
Gamma  Corre-  Val.       U       C       B       U       C       B       U       C       B
       lated?  Dist.     (%)     (%)     (%)     (%)     (%)     (%)     (%)     (%)     (%)
1      No      a.       7.39   13.32   38.30    2.39   12.62    3.88    1.09   11.89    1.46
1      No      b.       6.69   13.45   37.87    3.04   12.63    5.92    1.08   11.93    1.38
1      No      c.       6.89   12.92   22.59    5.23   12.04    8.18    3.79   11.23    7.09
1      No      d.      16.65    6.32   68.37   15.94    6.19   29.34    9.56    5.94   19.72
1      Yes     a.      11.90   20.90   34.50    4.59   19.94    2.26    3.15   18.68    1.42
1      Yes     b.      13.50   17.80   36.30    4.07   16.37    5.12    1.75   15.50    2.18
1      Yes     c.       7.70   15.06   21.14    5.69   14.06    7.84    3.98   13.13    6.21
1      Yes     d.      18.05    1.04   66.94   16.26    0.52   25.35   12.98    0.41   15.33
2      No      a.      11.79   40.12    6.09    8.10   37.98    3.55    2.43   35.44    3.37
2      No      b.      13.65   39.48    5.00    6.82   37.86    4.83    2.54   35.51    4.03
2      No      c.     179.87   39.20   14.75    6.35   37.00    8.34    4.54   34.44    7.12
2      No      d.      31.60   20.45   43.43   10.24   19.26   12.88    9.99   17.08    6.25
2      Yes     a.      24.70   65.60   21.39   19.83   62.00   18.45    4.78   57.51   13.70
2      Yes     b.      19.34   54.27   12.99   12.61   51.19   12.28    3.46   47.72    7.48
2      Yes     c.     220.14   46.60   23.01   12.19   44.01   12.01    5.10   40.88    5.10
2      Yes     d.      52.63   39.08   39.45   19.62   36.75    5.32    9.20   33.19    2.25
3      No      a.     234.60   92.75   18.61   59.67   84.91   12.22   33.00   76.00    6.28
3      No      b.     315.97   93.29   19.42   70.32   84.68   11.68   34.78   76.05    5.84
3      No      c.     188.17   91.50   20.53   46.14   84.01   18.50   24.92   75.07   15.80
3      No      d.     139.27   72.67   14.24   63.56   67.36   12.18    6.79   59.83    5.33
3      Yes     a.     753.73  189.70   42.19  220.00  172.10   28.99  115.25  151.85   17.02
3      Yes     b.     421.00  146.70   30.93  151.00  133.50   21.05   74.50  118.40   11.99
3      Yes     c.     240.20  119.80   28.28   74.66  109.50   25.99   42.57   97.22   21.86
3      Yes     d.      47.95  144.63   33.85   18.52  130.93   28.69    3.63  114.00   18.63
Table 4-1. Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for 24
synthetically generated data sets. The table shows errors for three different
sampling fractions (1%, 5% and 10%) and, for each of these fractions, the error
for the three estimators: U = unbiased estimator, C = concurrent sampling
estimator, and B = model-based biased estimator.

















Data set type                 1% Sample               5% Sample               10% Sample
Gamma  Corre-  Val.       U       C       B       U       C       B       U       C       B
       lated?  Dist.     (%)     (%)     (%)     (%)     (%)     (%)     (%)     (%)     (%)
4      No      a.     153.70   36.20   14.52   37.17   33.90    4.73   24.47   31.20    0.89
4      No      b.     226.00   37.00   18.56   50.32   33.95    5.27   42.87   31.11    1.33
4      No      c.     242.70   35.20   11.10   19.40   32.85    3.62   17.03   30.04    3.59
4      No      d.     146.37   16.56   45.16   23.60   14.85   21.26    8.85   12.62   16.61
4      Yes     a.     418.70   64.50   10.85  116.55   59.94    2.71   27.55   54.52    1.64
4      Yes     b.     327.02   52.06    8.62   75.95   48.42    3.92   45.62   44.12    2.83
4      Yes     c.     359.60   43.40   13.90   30.19   40.39    7.17   27.21   36.80    5.16
4      Yes     d.      1.1e3   37.53   40.29   54.33   33.99   10.66   18.94   29.32    5.68
5      No      a.     236.00   72.04   13.19   46.18   66.08   12.07   38.30   59.60    6.15
5      No      b.     395.00   72.30   11.78   55.78   66.09   11.73   42.73   59.55    5.37
5      No      c.     167.70   71.10    7.70  120.81   65.20    1.99   62.70   58.50    1.15
5      No      d.     135.65   51.87   13.58   77.12   48.29    4.30   24.14   42.21    4.16
5      Yes     a.     862.00   71.79   31.25  203.81   64.90    7.21   57.22   57.00    2.93
5      Yes     b.     650.80   56.60   28.64  129.75   51.46    6.75   74.16   43.90    1.86
5      Yes     c.     298.70   92.30   11.47  189.70   84.22    4.06   69.63   74.80    2.53
5      Yes     d.          2  105.24   10.84  178.61   95.07    9.38  145.78   81.86    3.04
6      No      a.      7.1e3   95.13   19.30   6.2e3   79.49    9.82   4.1e3   63.33    6.09
6      No      b.      1.9e4   95.20   18.40   2.1e3   79.58    9.47   6.6e2   63.40    5.74
6      No      c.      1.9e4   94.32   13.03   1.2e3   78.60    5.96   9.6e2   62.74    1.71
6      No      d.      4.7e4   76.71    7.54   2.0e2   66.87    8.42   68.87   54.96    3.97
6      Yes     a.      5.4e4   307.0   62.00   1.0e4  249.30   30.90   5.7e3  119.00   18.78
6      Yes     b.      4.2e4   214.0   42.70   1.9e4  174.25   21.12   7.0e3  135.00   12.88
6      Yes     c.      3.2e4   156.3   22.70   2.0e3  128.10   10.87   8.7e2  100.12    3.05
6      Yes     d.      1.3e5   234.4   29.78   2.9e3  192.46   28.25   2.4e3  148.28   12.79
Table 4-2. Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for 24
synthetically generated data sets. The table shows errors for three different
sampling fractions (1%, 5% and 10%) and, for each of these fractions, the error
for the three estimators: U = unbiased estimator, C = concurrent sampling
estimator, and B = model-based biased estimator.









Data set type             1% Sample               5% Sample               10% Sample
Gamma  Vio-           U       C       B       U       C       B       U       C       B
       lates         (%)     (%)     (%)     (%)     (%)     (%)     (%)     (%)     (%)
1      (1)          8.83   13.37   62.60    3.12   12.47   15.24    1.19   11.75    4.62
2      (1)         24.66   39.33   34.39    8.14   37.89    2.74    3.41   35.60    2.48
3      (1)         94.11   92.31   21.14   72.94   84.82   16.76   20.27   75.78   13.05
4      (1)         22.30   36.67   37.99   12.72   34.07    7.96    6.34   31.12    2.95
5      (1)        231.50   72.60    6.76  123.30   66.14    6.37   85.68   59.48    4.35
6      (1)       1366.80   95.96    9.99   1.2e3   78.64    5.85   700.0   62.62    1.88
1      (2)         14.18   21.70  100.70    4.42   21.09   26.34    2.69   20.20   12.44
2      (2)         21.62   72.24   59.94   14.25   67.50    7.56    6.25   62.90    4.47
3      (2)         886.2  220.20   45.73   136.0  201.90   31.73   79.75  180.10   25.76
4      (2)         462.0   95.80  106.80  269.19   88.74   22.18   81.03   82.43   11.52
5      (2)        247.60   205.0   18.84   233.0  187.00   17.69   88.55  168.30    9.78
6      (2)       6891.00   369.0   42.30  5988.0  310.00   40.90 1924.00  246.57   19.77
1      (3)         14.70   21.14   61.86    6.24   20.20   10.15    1.13   19.13    2.67
2      (3)         26.15   66.73   29.10   22.49   62.25   20.25    5.38   57.69   17.35
3      (3)        920.10  185.30   41.86  147.60  167.20   30.12   65.63  146.88   27.20
4      (3)         2.3e5   64.42   35.96  714.00   60.54   16.87  150.80   54.77    9.24
5      (3)       1350.30  143.00   33.59  856.00  127.76   29.58  306.70  113.14   10.08
6      (3)         2.2e5  264.02   38.37 4519.10  212.80   34.92 2530.00  162.70   21.96
Table 4-3. Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for 18
synthetically generated data sets. The table shows errors for three different
sampling fractions (1%, 5% and 10%) and, for each of these fractions, the error
for the three estimators: U = unbiased estimator, C = concurrent sampling
estimator, and B = model-based biased estimator.


correlated case with a 1% sample, most of the relative standard errors were more than
10,000%. Such very poor results are found sporadically throughout most of the data
sets, though the results were somewhat erratic. The reason that the observed errors
associated with the unbiased estimator are highly variable is the very long tail of the
error distribution. Under many circumstances, most of the answers computed using
the unbiased estimator are very good, but there is still a small (though non-negligible)
probability of getting a ridiculous estimate whose error is hundreds of times the sum of
the aggregate value over the entire EMP relation. Unfortunately, it is interesting to note









                          1% Sample               5% Sample               10% Sample
Data Set  Query       U       C       B       U       C       B       U       C       B
                     (%)     (%)     (%)     (%)     (%)     (%)     (%)     (%)     (%)
IMDB      Q1       9.6e3   27.67   70.88   3.3e3   17.51   33.44   4.1e2   13.71   14.14
IMDB      Q2       1.2e2   75.12   65.10   91.26   62.86   31.97   49.82   52.69    9.31
IMDB      Q3       1.0e4   25.21   18.47   3.5e3   16.58   14.38   4.7e2   12.71    1.92
SCR       Q4       1.4e4   65.22   10.31   5.0e3   44.97    6.84   8.2e2   23.27    4.41
SCR       Q5       1.2e4   59.06    9.42   4.6e3   41.62    7.51   7.8e2   24.07    3.95
KDDCup    Q6     1.10e10   60.47   12.39   7.4e4   54.92   10.96   7.6e3   42.08    2.10
KDDCup    Q7     6.5e147   41.30   11.24  5.8e83   26.54    4.32  9.3e36   17.04    3.28
KDDCup    Q8     7.3e210   15.24    8.46 3.6e172   10.80    1.56 2.3e120    6.35    0.98
Table 4-4. Observed standard error as a percentage of the total aggregate value of all
records in the database for 8 queries over 3 real-life data sets. The table shows
errors for three different sampling fractions (1%, 5% and 10%) and, for each of
these fractions, the error for the three estimators: U = unbiased estimator,
C = concurrent sampling estimator, and B = model-based biased estimator.


that the unbiased estimator's worst performance overall was observed on Q8 over the
KDDCup data, where the error was astronomically high: larger than 10^100.

In comparison, the biased estimator generally did a very good job predicting the final
query result, and in most cases with a 5% or 10% sampling fraction the observed standard
error was less than 10% of the total aggregate value found in EMP. In other words, if the
total value of SUM (e.SAL) with no NOT EXISTS clause is x, then for just about any query
tested, the standard error was less than x/10, and it was frequently much smaller. This is

actually quite impressive when one considers the difficulty of the problem. The primary

drawback associated with the biased estimator is its complexity (requiring non-trivial and

substantial statistically-oriented computations) and the fact that a significant amount

of computation is required, most of it associated with running the EM algorithm to

completion. By comparison, the unbiased estimate can be calculated via an almost trivial

recursive routine that relies on the calculation of simple hypergeometric probabilities.

One case where the biased estimator had questionable qualitative performance was

with the 16 tests associated with data sets 3 and 6. The problem in this case was that










the EM algorithm tended to overestimate the model parameter p0, which is actually very
small in these two data sets (.052 and .0037, respectively). This results in an error that
hovers at 10% of the total aggregate value of e.SAL (even for a 5% sample) when the real
answer is only 5% of this total for data set 3, or less than 1% of this total for data set 6.
We stress

that guessing that only a few percent of the tuples in EMP have no matches in SALE from

a small sample with limited information is an extremely difficult estimation problem,

and we conjecture that without additional information (such as prior knowledge that the

distribution represented by p is a discretized gamma distribution) it will be very difficult

to achieve better results.

Results from the synthetic data sets which specifically violate the assumptions of the

superpopulation model are shown in Table 4-3. The first six rows in the table show results

for data sets in which more than one EMP record can match with a given record from SALE.

The results show that violating this assumption of the model in the actual data set did

not affect the accuracy of the biased estimator significantly. The next set of six rows in the

table show results for data sets in which there is no linear relationship between the mean

aggregate values of the different classes of EMP records. The results show that the biased

estimator is about twice as inaccurate over these data sets as compared to corresponding

data sets which do not have a strict violation of the assumption. The last six rows in

the table show results over data sets in which the variances of the aggregate values of

records from different classes are significantly different. Results show that these data sets

affect the accuracy of the biased estimator as much as the data sets which violate the

"linear relationship of mean values" assumption. However, the results are certainly not

poor when these assumptions are violated, and the method still seems to have qualitative

performance that may be acceptable for many applications, particularly with a larger

sample size.

The results from the eight queries over the three real-life data sets are depicted in

Table 4-4. The key difference in the characteristics of the real-life data sets compared










to the synthetically-generated data sets is the number of matching records in the inner

relation for a given record from the outer relation of the NOT EXISTS query. For the

KDDCup data set, the maximum number of matching records in the inner relation is as

high as 2500, while for the IMDB and SCR data sets this number is about 200 and 90

respectively. Due to this, none of the cases which are favorable for the use of the unbiased

estimator (as described above) are observed in the real-life data sets. On the other hand,

it can be seen from Table 4-4 that the accuracy of the biased estimator is generally quite

good over the real data.

We also note that the standard error of the biased estimator over the learned

superpopulation seems to be a reasonable surrogate for the standard error of the biased

estimator in practice. For most biased estimators, it is reasonable to use the standard

error of the biased estimator in the same way that one would use the standard deviation

of an unbiased estimator when constructing confidence bounds (see Sarndal et al. [109],

Section 5.2). According to the Vysochanskii-Petunin inequality [120], any unbiased

uni-modal estimator will be within three standard deviations of the correct answer 95% of
the time, and according to the more aggressive central limit theorem, an estimator will be
within two standard deviations of the correct answer 95% of the time. We observed that in
almost all of the tests, ten out of ten of the errors for the biased estimator were actually

within two predicted standard errors of zero. This seems to be strong evidence for the

utility of the bounds computed using the predicted standard error of the biased estimator.

We finally remark on the time required for the execution of the biased estimator. The

biased estimator performs several computations including learning the model parameters,

generating sufficient statistics for several population-sample pairs and then solving a

system of equations to compute weights for the various components of the estimator. As

discussed previously, this took no longer than four seconds for the largest samples tested.

If this is not fast enough, we point out that it may be possible to speed this up even more,

though this is beyond the scope of the thesis. While we used the traditional EM algorithm










in our implementation, we note that EM can be made faster by using incremental variants
[69, 95, 116] of the EM algorithm. These variants of the EM algorithm typically achieve
faster convergence by implementing the Expectation and/or the Maximization step of
the EM algorithm partially.

4.7 Related Work

Estimation via sampling has a long history in databases. One of the oldest and best
known works is Frank Olken's PhD thesis [97]. Other classic efforts at sampling-based
estimation over database data are the adaptive sampling of Lipton and Naughton [83, 84]
for join query selectivity estimation, and the sampling techniques of Hou et al. [64, 65] for
aggregate queries. More recent well-known work on sampling is that on online aggregation
by Haas, Hellerstein, and their colleagues [47, 60, 61].

The sampling-based database estimation problem that is closest to the one studied
in this chapter is that of sampling for the number of distinct values in a database. As
discussed in the introduction to this chapter, a solution to the problem of estimation over
subset-based queries is a solution to the problem of estimating the number of distinct
values in a database, since the latter problem can be written as a NOT EXISTS query. The
classic paper in distinct value estimation is due to Haas et al. [49]. For a survey of the
state-of-the-art work on this problem in databases through the year 2000, we refer the
reader to the Introduction of the paper by Charikar et al. on the topic [17]. The paper
of Bunge and Fitzpatrick [13] provides a survey of work in the statistics area, current
through the early 1990's. Work in statistics continues on this problem to this day. In
fact, a recent paper from statistics by Mingoti [90] on the distinct value problem provided
inspiration for our use of superpopulation techniques.

Though the problems of distinct value estimation and subset-based aggregate
estimation are related, we note that the problem of estimating the number of distinct
values is a very restricted version of the problem we study in this thesis, and it is not
immediately clear how arbitrary solutions to the distinct value problem can be generalized
to handle subset-based queries. The most obvious difficulty in extending such methods
to subset-based queries is the fact that a NOT EXISTS or related clause results in a
complicated statistic summarizing two populations (the two tables that are queried over).
Nonetheless, links between the problems do exist. For example, though our own unbiased
estimator was not directly inspired by Goodman's estimator [43]5 and it takes a very
different form, it is easy to argue that our unbiased estimator must be a generalization of
Goodman's estimator. The reasoning is straightforward: Goodman's estimator is proven to
be the only unbiased estimator for distinct value queries, and our own unbiased estimator
is unbiased for distinct value queries. Therefore, they must be equivalent when used on
this particular problem.

4.8 Conclusion

This chapter has presented two sampling-based estimators for the answer to a
subset-based query, where the answer to a SUM aggregate query (and by trivial extension,
AVERAGE and COUNT) is restricted to consider only those tuples that satisfy a NOT EXISTS
or related clause. The first estimator is provably unbiased, while the second makes use of
superpopulation methods and was found to be much more accurate.

As discussed in Section 4.5.1 of the thesis, one of the most controversial decisions
made in the development of the latter estimator was our choice of a very general prior
distribution. To a statistician from the so-called "Bayesian" school [39], this may be
seen as a poor choice, and a Bayesian statistician may argue that a more descriptive prior
distribution, if appropriate, would increase the accuracy of the method. This is certainly
true, if the selected distribution were a good match for the actual data distribution. In
our work, however, we have consciously chosen generality and its associated drawbacks in
place of specificity. Our experimental results seem to argue that for a variety of different



5 Goodman's estimator is one of the earliest statistical estimators for distinct value
queries.










data distributions, the resulting estimator still has high accuracy. Still, this represents
an intriguing question for future work: can a different prior distribution be chosen that
is appropriate for use in real-world data sets, and which results in a more accurate
estimator?

Finally, we note that the model-based method outlined in the latter half of this

chapter was designed specifically to address the problem of estimating the answer to a

nested SQL query with a single table in the inner query and a single table in the outer

query linked by a NOT EXISTS predicate. As is, our model is not directly applicable to

arbitrarily complex nested queries. For example, nested queries may include multiple

relations in the outer as well as the inner query. One could imagine sampling all of the

input relations, and then using any result tuples that are discovered as part of the inner
or outer subqueries as input into an estimator such as the one studied in this chapter.
However, this may be dangerous, and our superpopulation model is not directly applicable.
The problem is that if there is a join in the inner (or outer) query, then the tuples
produced via joining samples from the input relations are not i.i.d. samples from the join
[47]. This means that the join itself must be modeled, which is a problem for future work.
Another problem for future work is arbitrary levels of nesting. An inner query may itself
be linked with another inner query via a NOT EXISTS or similar clause.









CHAPTER 5
SAMPLING-BASED ESTIMATION OF LOW SELECTIVITY QUERIES

5.1 Introduction

The specific problem that we consider in this chapter is sampling-based approximation

of the answer to highly selective aggregate queries, those having a relational selection
predicate that accepts only a very small percentage of the data set. Again, we consider

sampling because it is the most versatile of the approximation methods: a single sample

can be used to handle virtually any relational selection predicate or any join condition.

Samples generally do not require prior knowledge of what queries will be asked, unlike

other methods such as sketches [8]. We consider very selective queries because they are the

one class of queries that are hardest to handle approximately without workload knowledge:

if a query references only a few tuples from the data set, then it is very hard to make sure

that a synopsis structure (such as a sample) will contain the information needed to answer

the query.

The most natural method for handling highly selective queries using sampling is to

make use of stratification [25]. In order to answer an aggregate query over a relation,

one could first (offline) partition the relation's tuples into various subsets so that similar

tuples are grouped together, the assumption being that the relational selection predicate

associated with a given query will tend to favor certain strata. Even if a given query is

very selective, at least one or two of the strata will have a relatively heavy concentration

of tuples that will contribute to the query answer. When the query is processed, those

"important" strata can be sampled first and more heavily than the others. This is

illustrated with the following example:


Example 1: The relation MOVIE(MovieYear, Sales) is partitioned into two strata
as follows:

    R1: MovieYear < 1975        R2: MovieYear >= 1975
    r1: (1961, 30)              r3: (1983, 60)
    r2: (1972, 50)              r4: (1977, 40)
                                r5: (1997, 25)
                                r6: (1992, 100)
                                r7: (2004, 100)

The following query Q is then issued:

SELECT SUM (Sales)
FROM MOVIE
WHERE MovieYear < 1980

Since all movies in R1 were released before 1975, all the records in the stratum R1
match Q. Hence, we decide to obtain a biased sample that includes as many records from
R1 as the sample size permits, and we sample from R2 only if the desired sample size is not
met. For a sample size of 4, this results in an estimate whose variance (or error) is 2400.
Drawing a sample from the population as a whole results in an estimate whose variance is
2575.
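
The variance figures above can be checked directly. What follows is a minimal sketch
(in Python with NumPy, which is our choice and not part of the thesis; all names are
ours) that plugs the toy MOVIE relation into the standard variance formula for a
without-replacement stratified SUM estimator.

```python
import numpy as np

def stratum_variance(values, n):
    """Variance contribution of one stratum sampled without replacement:
    N * (N - n) * S^2 / n, where S^2 is the per-stratum sample variance."""
    N = len(values)
    if n >= N:                        # stratum fully sampled: no sampling error
        return 0.0
    s2 = np.var(values, ddof=1)
    return N * (N - n) * s2 / n

# f() values: Sales if MovieYear < 1980, else 0
r1 = [30.0, 50.0]                     # stratum R1 (both records match)
r2 = [0.0, 40.0, 0.0, 0.0, 0.0]       # stratum R2 (only r4 matches)

# Biased plan: take both R1 records, then 2 of the 5 records from R2
print(stratum_variance(r1, 2) + stratum_variance(r2, 2))   # -> 2400.0

# Simple random sampling of 4 records from the whole relation
print(stratum_variance(r1 + r2, 4))                        # -> 2575.0
```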


While stratification may be very useful, it is not a new idea. It has been studied in
statistics for decades, and it has been suggested previously as a way to make approximate
aggregate query processing more accurate [18-20]. However, in the context of databases,
researchers have previously considered only half of the problem: how to divide the
database into strata. This may actually be the easy and less important half of the
problem, since even the relatively naive partitioning strategy we use in our experiments
can give excellent results. The equally fundamental problem we consider in this paper is:
how to allocate samples to strata when actually processing the query. More specifically,
given a budget of n samples, how does one choose how to "spend" those samples on the
various strata in order to achieve the greatest accuracy?

The classic allocation method from statistics is the Neyman allocation, and it is the
one advocated previously in the database literature [19]. The key difficulty with applying
the Neyman allocation in practice is that it requires extensive knowledge of certain
statistical characteristics of each stratum, with respect to the incoming query. In practice










this knowledge can only be guessed at by taking a pilot sample. As we show in this paper,

if the guess is poor, then the resulting sampling plan can be disastrous. This results in a

classic chicken-and-egg problem: we want to sample in order to avoid scanning all of the

data, but in order to sample properly, we have to collect statistics that require scanning

all of the data! The result is that the classic Neyman allocation is unusable in many

situations, as we will demonstrate experimentally in the paper.

Our Contributions

In this thesis, we develop an alternative to the classic Neyman allocation that we
call the Bayes-Neyman allocation. While this is a very general method and its utility
is not limited to the context of database management, the Bayes-Neyman allocation is
particularly relevant to database sampling because it is designed to be robust when only a
few of the data records in the data set are relevant to estimating a quantity over the data,
as is the case when a query has a restrictive relational selection predicate. The specific
contributions of our work are as follows:

* The Bayes-Neyman allocation explicitly takes into account the error that might be
incurred when developing the sampling plan, in order to maximize the expected accuracy
of the resulting estimate.

* The Bayes-Neyman allocation makes use of novel Bayesian techniques from statistics
[14] that allow us to take into account any prior expectation (such as the expected
efficacy of the stratification) in a principled fashion.

* We carefully evaluate our methods experimentally, and show that if one is very
careful in developing a sampling plan, even a naive partitioning of samples to strata
that uses no workload information can show dramatic accuracy for very selective
queries.

* Our methods are very general. They can be used with any partitioning (such as those
proposed by Chaudhuri et al. [18-20]), or even in cases where the partitioning is not
user-defined and is imposed by the problem domain (for example, when the various
strata are different data sources in a distributed environment). Our methods can
also be extended to more complicated relational operations such as joins, though this
problem is beyond the scope of the paper.









5.2 Background

This section presents some preliminaries and background about stratified sampling,

and discusses the problems associated with using stratified sampling in a database setting

to estimate results of arbitrary queries.

5.2.1 Stratification

A general example of a SUM aggregate query over a single relation can be written as
follows:

SELECT SUM (f1(r))

FROM R As r

WHERE f2(r)

Note that if we define a function f() where

    f(r) = \begin{cases} f_1(r) & \text{if } f_2(r) \text{ is true} \\ 0 & \text{if } f_2(r) \text{ is false} \end{cases}

the above query can be simply re-written as,

SELECT SUM ( f(r))

FROM R As r

If the relational selection predicate f2(r) selects a very small fraction of records from

the relation R, then the query is said to be a low selectivity query.

Assume that relation R is partitioned into L disjoint strata such that Ri represents
the ith stratum. Then, we have R = R1 U R2 U ... U RL. We denote the size of the ith
stratum by Ni, and thus we have |Ri| = Ni. Let R'i, where |R'i| = ni, be the survey sample
(without replacement) from the ith stratum. The sizes of all the strata are known from
strata construction time, while the sizes of the survey samples from each of the strata
(the ni values) can be determined by using some sampling allocation scheme subject to
the constraint that the ni values sum to n, where n is the pre-determined total sample
size from R. The problem of determining an optimal sample allocation is the central
focus of this paper.










If we execute the above query on each of the R'i, the result of the query over the
sample of stratum i can be written as

    y_i = \sum_{r \in R'_i} f(r)

The unbiased stratified sampling estimator for the query result, expressed in terms of
the y_i values, is

    \hat{Y} = \sum_{i=1}^{L} \frac{N_i}{n_i} y_i    (5-1)

The true variance of the records in stratum i can be computed as

    \sigma_i^2 = \frac{1}{N_i - 1} \sum_{r \in R_i} \Big( f(r) - \frac{1}{N_i} \sum_{r' \in R_i} f(r') \Big)^2

Thus, the true variance (or error) of the estimator Ŷ is given by

    \sigma^2 = \sum_{i=1}^{L} \frac{N_i (N_i - n_i)}{n_i} \sigma_i^2    (5-2)

In practice, it is not feasible to know the true stratum variances for an arbitrary
query. Hence, a sample-based estimate for the variance of stratum i can be computed as

    \hat{\sigma}_i^2 = \frac{1}{n_i - 1} \sum_{r \in R'_i} \Big( f(r) - \frac{y_i}{n_i} \Big)^2    (5-3)

Then, an unbiased estimator for the variance of Ŷ can be obtained from Equation 5-2
by simply replacing all the σi² terms with their corresponding unbiased estimators σ̂i².
Central-Limit-Theorem-based confidence bounds [112] for Ŷ can then be computed
as Ŷ ± z_α σ̂, where z_α is the z-score for the desired confidence level. If desired, more
conservative confidence bounds from the literature (such as Chebyshev-based [112]) can
also be used.
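
To make the estimator concrete, here is a minimal sketch (Python with NumPy; the code
and the two strata in the usage example are our own hypothetical illustration, not thesis
code) that computes Ŷ of Equation 5-1 together with the CLT-based bounds just
described, using the per-stratum sample variances of Equation 5-3 in place of the true
σi² values.

```python
import numpy as np

def stratified_sum(stratum_sizes, samples, z=1.96):
    """stratum_sizes: the N_i values; samples: one array of sampled f() values
    per stratum; returns (estimate, half-width of the confidence bound)."""
    y_hat, var_hat = 0.0, 0.0
    for N, s in zip(stratum_sizes, samples):
        n = len(s)
        y_hat += (N / n) * s.sum()                   # (N_i / n_i) * y_i
        var_hat += N * (N - n) * s.var(ddof=1) / n   # Equation 5-2, with each
                                                     # sigma_i^2 replaced by its
                                                     # sample-based estimate
    return y_hat, z * np.sqrt(var_hat)

rng = np.random.default_rng(0)
samples = [rng.normal(10, 2, 50), rng.normal(0, 1, 50)]   # hypothetical samples
est, half = stratified_sum([1000, 5000], samples)
print(f"estimate: {est:.1f} +/- {half:.1f}")
```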

Finally, we note that aggregate queries like COUNT and AVG can also be handled by
stratified sampling estimators like the one described above by using ratios of two different
estimates. Aggregate queries with a GROUP BY clause can also be answered by using
stratification. A GROUP BY query can be considered as executing several simple queries in
parallel, one for each group. Joins can also be handled using methods similar to those
proposed by Haas and Hellerstein [54], though that is beyond the scope of the paper.

5.2.2 "Optimal" Allocation and Why It's Not

The problem of determining the ni values for all the strata for a predetermined

sample size n is the sample allocation problem. The key constraint on the values of the

sample sizes is that their sum should equal the total sample size. Besides this constraint,

there is freedom in the choice of the ni values, and hence a natural choice is to minimize

the error of Ŷ of Equation 5-1. Since Ŷ is unbiased, minimizing its error is equivalent
to minimizing its variance. An optimization problem can be formulated for the choice
of ni values so that the variance σ² is minimized; solving this problem leads to the

well-known Neyman allocation [25] from statistics. Specifically, the Neyman allocation
states that the variance of a stratified sampling estimator is minimized when the sample
size ni is proportional to the size of the stratum, Ni, and to the standard deviation of the
f() values in the stratum, σi. That is,

    n_i = n \cdot \frac{N_i \sigma_i}{\sum_{k=1}^{L} N_k \sigma_k}    (5-4)

The problem we face in a database setting is that the strata variance values σi² are not
known for an arbitrary query. The stratum variance σi² depends on: (a) the function to be
aggregated, f1(), and (b) the relational selection predicate f2(). Since these functions can
vary from one query to another, it is not feasible to compute beforehand exact values of
the various σi² terms for an arbitrary query. This means that the optimal ni values cannot
be computed in the absence of exact σi² values.
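
For reference, a minimal sketch (Python with NumPy; our own illustration rather than
thesis code) of Equation 5-4 follows. It assumes the σi values are somehow known, which,
as discussed next, is exactly what fails in practice.

```python
import numpy as np

def neyman_allocation(N, sigma, n):
    """N: stratum sizes; sigma: per-stratum standard deviations of the f()
    values; n: total sample budget. Returns integer per-stratum sample sizes."""
    w = N * sigma                       # n_i is proportional to N_i * sigma_i
    if w.sum() == 0:                    # degenerate guard: spread evenly
        return np.full(len(N), n // len(N))
    return np.rint(n * w / w.sum()).astype(int)

# two equally-sized strata; the second is far more variable
print(neyman_allocation(np.array([10000, 10000]),
                        np.array([30.0, 70.0]), 1000))    # -> [300 700]
```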

It is possible to obtain rough estimates for the strata variances by doing a pilot run of
the query on very small pilot samples from each stratum, which is the standard method.
However, as the following example shows, a major drawback of this approach is that
the variance estimates calculated from such pilot sampling can be arbitrarily erroneous,
leading to an extremely poor allocation scheme and even more severe problems.





















































Example 2: Imagine that we have a relation R partitioned into two strata R1 and R2
such that |R1| = 10000 and |R2| = 10000. Let Q be a query identical to the query presented
in Section 5.2.1. The number of records from R1 accepted by f2() is 10, while the number of
records from R2 accepted by f2() is 1000. Further, let f1(r) ~ N(1000, 100) for all r in R1 and
f1(r) ~ N(10, 100) for all r in R2, where N(μ, σ) denotes a normal distribution with mean μ
and variance σ².

We use a pilot sample of 100 records to estimate the variance of the f() values in
each stratum. These estimates are σ̂1² and σ̂2². If the desired sample size is n = 1000, the
estimated variances can be used with Equation 5-4 to obtain an estimate for the optimal
sampling allocation as follows:

    n_1 = 1000 \cdot \frac{\hat{\sigma}_1}{\hat{\sigma}_1 + \hat{\sigma}_2}, \qquad n_2 = 1000 \cdot \frac{\hat{\sigma}_2}{\hat{\sigma}_1 + \hat{\sigma}_2}

We then ask the question: how accurate will the resulting sampling plan be? To
answer this question, we perform a simple experiment in which we repeat the above
process 1000 times. For each iteration, we record the squared error of the estimate
produced by the computed sampling plan. The average of all these squared errors gives us
an approximation of the mean-squared error (MSE) of the estimator. For each iteration,
we also compute the estimated variance of the result (using Equation 5-2), since this
variance would be used to report confidence bounds to the user. We then compute the
average estimated variance across the 1000 iterations. Finally, we use the true variances of
both strata to obtain an optimal sample allocation, and repeat the above experiment using
the optimal allocation. We summarize the results in the following table.

    True query result        20150
    Avg. observed bias       10200
    Avg. estimated MSE       0.76 million
    Avg. observed MSE        100 million
    MSE of true optimal      58.6 million










Overall, the results using the pilot sampling are disastrous. Specifically:

* The pilot-sampling-based allocation provides an average estimated error to the user
that is more than 2 orders of magnitude smaller than the true error: 0.76 million versus
100 million. Since the estimated error is typically used to compute confidence bounds, the
resulting confidence bounds will be much narrower than what they should be in reality.
Hence, the user would be provided with a dangerously optimistic picture of the error of
the estimator.

* Second, the non-optimal allocation leads to an estimate that has a heavy bias. This is
due to the fact that the allocation often directs the stratified sampling to ignore the first
stratum. For roughly 90% of the 1000 iterations, the pilot sample fails to discover
any matching records in R1. Hence, the pilot-sample-based variance is naively guessed to
be zero. When this value is used with the Neyman allocation, no samples are allocated
to R1, while all 1000 samples are allocated to R2. The outcome is that the query result is
usually underestimated, because R1 actually contains records accepted by f2(). This
failure mode is illustrated by the simulation sketch below.

* Finally, by using a truly optimal sampling allocation to estimate the query result,
it is possible to achieve an error that is around half the error obtained by a non-optimal
allocation. The additional error incurred due to the poor allocation represents a wasted
opportunity to provide a much more accurate estimate.
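
The failure mode above is easy to reproduce. Below is a minimal Monte Carlo sketch
(Python with NumPy; the data generation, the pilot size per stratum, and all names are
our own reading of Example 2, not thesis code): whenever the pilot misses every matching
record in R1, the estimated variance for R1 is zero, the Neyman allocation assigns R1 no
samples, and the query result is underestimated.

```python
import numpy as np
rng = np.random.default_rng(42)

N1 = N2 = 10000
fvals1 = np.zeros(N1); fvals1[:10]   = rng.normal(1000, 100, 10)   # 10 matches
fvals2 = np.zeros(N2); fvals2[:1000] = rng.normal(10, 100, 1000)   # 1000 matches
true_total = fvals1.sum() + fvals2.sum()

errors = []
for _ in range(1000):
    # pilot: 100 records per stratum, used only to estimate the variances
    v1 = rng.choice(fvals1, 100, replace=False).var(ddof=1)
    v2 = rng.choice(fvals2, 100, replace=False).var(ddof=1)
    w = np.array([np.sqrt(v1), np.sqrt(v2)])      # Neyman weights N_i * sigma_i
    alloc = (np.rint(1000 * w / w.sum()).astype(int)
             if w.sum() > 0 else np.array([500, 500]))
    est = 0.0
    for fvals, ni in zip((fvals1, fvals2), alloc):
        if ni > 0:
            est += (len(fvals) / ni) * rng.choice(fvals, ni, replace=False).sum()
        # ni == 0: the stratum contributes nothing -- the source of the bias
    errors.append(est - true_total)

errors = np.array(errors)
print("avg. bias:", errors.mean(), "  observed MSE:", (errors ** 2).mean())
```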

5.3 Overview of Our Solution

The fundamental problem we face is that the natural estimator for σi² serves us
extremely poorly when we are trying to figure out how to allocate samples to strata.
Human intuition tells us that it is foolish to simply assume that σi² is zero in this case,
even though our estimate σ̂i² will be zero. This is because as human beings we know that
there will often be a number of records matching the given f2() in a stratum, and we will
simply be unlucky enough to miss them in our pilot sample.

To remedy these problems, we propose a novel Bayesian approach [14] called the
Bayes-Neyman allocation that can incorporate such intuition into the process in a
principled fashion. In general, Bayesian methods formally model such prior intuition
or belief as a probability distribution. Such methods then refine the distribution by
incorporating additional information, in our case information from the pilot sample, to
obtain an overall improved probability distribution.

At the highest level, the proposed Bayes-Neyman allocation works as follows:










1. First, in Bayesian fashion, we represent our belief in the possible variances of
the f() values in each stratum as a prior probability distribution. Let the vector
Σ = (σ1², σ2², ..., σL²) denote one possible set of strata variances. We define a probability
distribution over all of the possible Σ values to represent this prior belief. Let XΣ be
a random variable with exactly this probability distribution. Thus, sampling from XΣ
(that is, performing a random trial over XΣ) gives us one possible value for the vector
Σ, where those variance vectors that we feel are more "correct" are more likely to be
sampled.

2. Second, we take a pilot sample from the database and use the result of the pilot
sample to update the distribution of XΣ in order to make it more accurate.

3. Third, we sample a large number of possible Σ values from the resulting XΣ in
Monte-Carlo fashion. This gives us a large number of possible alternative values for
the strata variances.

4. Finally, we construct a sampling plan for estimating the answer to our query whose
average error (variance) is minimized over all of the Σ values that were sampled
from XΣ. This gives us a sampling plan whose expected error over the possible set of
databases described by the distribution of XΣ is minimized. This plan is then used to
perform the actual stratified sampling.

The three key technical questions that must be addressed when adopting this
approach are:

1. First, how is the random variable XΣ defined?

2. Second, how can the distribution of XΣ be updated to take into account any
information that is gathered via the pilot sample?

3. Third, how can a set of samples from the updated XΣ be used to produce an optimal
sampling plan?

The next three sections outline our answer to these three questions.

5.4 Defining XΣ

In this section, we consider the nature of XΣ itself, and how to sample from it.

5.4.1 Overview

At the highest level, the process of producing a single sample Σ from XΣ will be
further subdivided into three steps:









1. First, we sample from a random variable Xcnt to obtain a vector (cnt1, cnt2, ..., cntL),
where this vector tells us how many tuples from each stratum are accepted by the
relational selection predicate f2().

2. Second, we sample from a random variable XΣ' that gives us the vector Σ' =
((μ1,1, μ2,1), (μ1,2, μ2,2), ..., (μ1,L, μ2,L)). The ith pair (μ1,i, μ2,i) is the mean (that is, μ1) and
second moment (that is, μ2)1 over all of the f1() values in stratum i for those cnti
tuples that are accepted by f2().

3. Third, once these two samples have been obtained, it is then a simple mathematical
task to use the outputs of Xcnt and XΣ' to compute the output of XΣ.

We now consider each of these three steps in detail.

5.4.2 Defining Xcnt

Using terminology common in Bayesian statistics, each entry in Xcnt is generated by
sampling from a binomial distribution with a Beta prior distribution [33]. This means
that we view the probability pi that an arbitrary tuple from stratum i will be accepted
by the relational selection predicate f2() as being the result of a random sample from
the Beta distribution, which produces a result from 0 to 1. Since we view each tuple as a
separate and independent application of f2(), the number of tuples from stratum i that are
accepted by f2() is then binomially distributed,2 with the binomial distribution taking the
value pi as input, along with the stratum size Ni.

The Beta distribution is chosen as the prior distribution because it is a canonical

"conjugate prior" distribution for the binomial distribution. The fact that it is a conjugate

prior means that its domain is precisely equal to the parameter space for the Binomial

distribution, in this case, the range 0 to 1, which is the valid range for pi.



1 Recall that the second moment of a random variable X is the expected value of X²:
μ2 = E[X²].

2 The binomial distribution models the case where n balls are thrown at a bucket and
each ball has a probability p of falling in the bucket. A binomially distributed sample
returns the number of balls that happened to land in the bucket.










[Figure: the Beta(0.5, 0.5) density plotted against query selectivity values from 0 to 1.]

Figure 5-1. Beta distribution with parameters α = β = 0.5.


Given this setup, the first task is to choose the set of Beta parameters that control
the distribution of each pi so as to match the reality of what a typical value of pi will be
for each stratum. The Beta distribution is a parametric distribution and requires two input
parameters, α and β. Depending on the parameters that are selected, the Beta can take
a large variety of shapes and skews. Choosing α and β for the ith stratum is equivalent
to supplying our "intuition" to the method, stating what our initial belief is regarding the
probability that an arbitrary record will be accepted by f2().

There are two possibilities for setting these initial parameters. The first possibility
is to use workload information. We could monitor all previously-observed queries over
each and every stratum, where we observe that for query i and stratum j the probability
that a given record was accepted by f2() was pij. Then, assuming that the pij values are all
samples from our generative Beta prior, we simply estimate α and β from this set using
any standard method. An estimate for the Beta parameters based upon the principle of
Maximum Likelihood Estimation can easily be derived [112].

A second method is to simply assume that the stratification we choose usually works
well. In this case, most strata will either have a very low or a very high percentage of their
records accepted by f2(). Choosing α = β = 0.5 results in a U-shaped distribution that
matches this intuition exactly, and is a common choice for a Beta prior. The resulting
Beta is illustrated in Figure 5-1. In practice we find that this produces excellent results.










We stress that though the initial choice of α and β for each stratum is important,
it is only important to the extent that it informs us what is going on in the case that we
have very little information available in the pilot sample (such as when the pilot is very
small). If the pilot sample contains a great deal of information, the update step described
in Section 5.5 will update α and β as needed to take into account the information present
in the pilot sample.

Producing the Vector of Counts

Given the above setup, the GetCounts algorithm can be used to produce the vector of
counts (cnt1, cnt2, ..., cntL).



Algorithm GetCounts(α, β, N) {

1  // Let α = (α1, α2, ..., αL) be the first parameters of the Beta
   // distributions of all strata

2  // Let β = (β1, β2, ..., βL) be the second parameters of the Beta
   // distributions of all strata

3  // Let N = (N1, N2, ..., NL) be a vector of all strata sizes

4  // Let cnt = (cnt1, cnt2, ..., cntL) be a vector of counts for all strata

5  for (int i = 1; i <= L; i++) {

6      pi <- Beta(αi, βi)

7      cnti <- Binomial(Ni, pi)

8  }

9  return cnt
}
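
A minimal executable version of GetCounts (Python with NumPy; our own sketch, with
names chosen to mirror the pseudocode) looks as follows. NumPy broadcasts the Beta
and binomial draws across all L strata at once.

```python
import numpy as np
rng = np.random.default_rng(0)

def get_counts(alpha, beta, N):
    p = rng.beta(alpha, beta)      # one selectivity draw per stratum (line 6)
    return rng.binomial(N, p)      # matching-tuple count per stratum (line 7)

alpha = np.full(3, 0.5)            # the U-shaped alpha = beta = 0.5 prior
beta  = np.full(3, 0.5)
print(get_counts(alpha, beta, np.array([10000, 20000, 5000])))
```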



5.4.3 Defining XΣ'

In the previous subsection, we described how to obtain counts for the number of
records that satisfy the selection predicate f2(). However, in order to obtain a sample
from XΣ, it is not enough to merely know these counts. We actually need to know the f1()
values of all the records that satisfy f2() in stratum i, since these values are needed to be
able to compute (μ1,i, μ2,i), as is required to sample from XΣ'.

To do this, we use the following method. For the ith stratum, let D be the vector
of all possible distinct values from the range of the function f1(). We then associate a
probability pj with the jth distinct value. pj indicates the likelihood of the jth distinct
value from the stratum (that is, D[j]) being assigned to an arbitrary tuple that has been
accepted by f2(). Then, let V denote a vector of |D| counts, where if V[j] = k then it
means that the jth distinct value from the stratum has been assigned to k tuples that
were accepted by f2(). Thus, the V[j] values sum to cnti. Since we assume that each
application of f1() is independent on a per-tuple basis, V can be obtained by sampling
from a multinomial distribution3 with two arguments: the probability vector consisting
of all the pj values, and the number of trials given by cnti (that is, the number of tuples
accepted by f2()). Then, the resulting vector V along with the distinct value vector D can
be used to compute the pair (μ1,i, μ2,i).
This technique poses two important questions that need to be answered:

* Is it always feasible to consider all the values in the range of f1()? That is, can we
always materialize D?

* How do we assign probabilities to all of the values in D? That is, how do we decide
the value of each pj?

The answer to the first question is simple: it is certainly not feasible to always
consider all possible values in the range of f1(), for obvious computational and storage
reasons. However, for the moment, we assume that it is feasible, and consider the more
general case in Section 5.4.5.



3 The multinomial distribution models the case where cnt balls are thrown at d
buckets so that the probability of an arbitrary ball falling in bucket j is pj; a sample from
the multinomial assigns bj balls to bucket j such that the bj values sum to cnt.










In answering the second question, we develop a methodology analogous to the way we
choose the pi parameter when dealing with the ith stratum for Xcnt. As described above, the
number of times that each distinct f1() value is selected follows a multinomial distribution.
We know from Bayesian statistics that the standard conjugate prior for a multinomial
distribution is the Dirichlet distribution [33], just as the Beta distribution is the standard
conjugate prior for a binomial distribution. The Dirichlet is the multi-dimensional
generalization of the Beta. A k-dimensional Dirichlet distribution makes use of the
parameter vector Θ = {θ1, θ2, ..., θk}. Just as in the case of the Beta prior used by
Xcnt, the Dirichlet prior requires an initial set of parameters that represent our initial
belief. Since we typically have no knowledge about how likely it is that a given f1() value
will be selected by f2(), the simplest initial assumption to make is that all values are
equally likely. In the case of the Dirichlet distribution, using θi = 1 for all i is the typical
zero-knowledge prior [33]. Given Θ, it is then a simple matter to sample from XΣ', as we
describe formally in the next subsection. We note that although this initial parameter
choice may be inaccurate, in Bayesian fashion the parameters will be made more accurate
based upon the information present in the pilot sample. Section 5.5 provides details of
how the update is accomplished.

Producing the Vector Σ'

We now present an algorithm GetMoments to obtain the vector Σ'. We assume that
we have all the θi values corresponding to the parameters of the Dirichlet, along with
counts of the number of records that are accepted by f2() in each stratum. These count
values are the values in the vector cnt obtained according to Algorithm GetCounts in
Section 5.4.2.



Algorithm GetMoments(θ1, ..., θL, D) {

1  // Let θi denote the Dirichlet parameters for stratum i

2  // Let D be an array of all distinct values from the range of f1()

3  // Let Σ' = (Σ'1, ..., Σ'L) be a vector of moments of all strata

4  for (int i = 1; i <= L; i++) {

5      p <- Dirichlet(θi)

6      μ1 = μ2 = 0

7      // Let V be an array of counts for each domain value

8      V <- Multinomial(cnti, p)

9      for (int j = 1; j <= |D|; j++) {

10         μ1 += V[j] * D[j]

11         μ2 += V[j] * (D[j])^2

12     }

13     μ1 /= cnti

14     μ2 /= cnti

15     (μ1,i, μ2,i) = (μ1, μ2)

16     Σ'i = (μ1,i, μ2,i)

17 }

18 return Σ'
}
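
The same loop body is easy to express with NumPy's built-in Dirichlet and multinomial
samplers; the following sketch (our own, for a single stratum with a hypothetical
distinct-value vector D) mirrors lines 5 through 15 of GetMoments.

```python
import numpy as np
rng = np.random.default_rng(0)

def get_moments(theta, D, cnt):
    p = rng.dirichlet(theta)        # per-value selection probabilities (line 5)
    V = rng.multinomial(cnt, p)     # occurrences of each distinct value (line 8)
    mu1 = (V * D).sum() / cnt       # first moment of the accepted f1() values
    mu2 = (V * D ** 2).sum() / cnt  # second moment
    return mu1, mu2

D = np.array([10.0, 20.0, 50.0])    # hypothetical distinct f1() values
print(get_moments(np.ones(len(D)), D, cnt=120))
```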



5.4.4 Combining The Two

Once a sample from Xcnt and a sample from XΣ' have been obtained, it is then a simple
matter to put them together to obtain a sample from XΣ. Recall that the variance of a
variable X is defined as follows:

    \sigma^2[X] = E[X^2] - E^2[X]

where E[·] denotes the expected value of the random variable. For the ith stratum, after
sampling from Xcnt and XΣ' we know three things:

1. The size of the stratum Ni.

2. The number of records accepted by f2(), which is cnti.

3. The first and second moments (μ1,i, μ2,i) of f1() applied to those tuples that were
accepted by f2().










Thus, the variance σi² of f() applied to all tuples in the ith stratum can be computed as:

    \sigma_i^2 = \Big[ \frac{cnt_i}{N_i} \mu_{2,i} + \frac{N_i - cnt_i}{N_i} \cdot 0 \Big] - \Big[ \frac{cnt_i}{N_i} \mu_{1,i} + \frac{N_i - cnt_i}{N_i} \cdot 0 \Big]^2
               = \frac{cnt_i}{N_i} \mu_{2,i} - \Big( \frac{cnt_i}{N_i} \mu_{1,i} \Big)^2    (5-5)

The two zeros in the above derivation come from the fact that both the first moment (or
mean) and the second moment of f() over every tuple not accepted by f2() are zero.
This computation is repeated for each possible i in order to obtain the desired sample
from XΣ.

The algorithm GetSigma describes how the variances can be computed using the

above technique.



Algorithm GetSigma(cnt, Σ', N) {

1  // Let cnt = (cnt1, ..., cntL) be a vector of counts of records
   // accepted by f2() for all strata

2  // Let Σ' = (Σ'1, ..., Σ'L) be a vector of moments of all strata

3  // Let N = (N1, N2, ..., NL) be a vector of all strata sizes

4  // Let Σ be a vector of variances for all strata

5  for (int i = 1; i <= L; i++) {

6      (μ1, μ2) = Σ'i

7      Σi = (cnti / Ni) * μ2 - ((cnti / Ni) * μ1)^2    // Equation 5-5

8  }

9  return Σ

10 }
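
Since Equation 5-5 does all the work, the combine step reduces to two lines; the sketch
below (our own Python, with hypothetical input values) is the body of GetSigma for one
stratum.

```python
def combine_variance(cnt, mu1, mu2, N):
    """Equation 5-5: the N - cnt tuples rejected by f2() contribute zero to
    both moments, so only the accepted fraction enters each term."""
    frac = cnt / N
    return frac * mu2 - (frac * mu1) ** 2

print(combine_variance(cnt=120, mu1=25.0, mu2=900.0, N=10000))
```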









5.4.5 Limiting the Number of Domain Values

As mentioned in Section 5.4.3, the one remaining problem regarding how to sample
from XΣ' is the problem of having a very large (or even unknown) range for the function
f1(). In this case, dealing with the vectors D and V may be impossible, for both storage
and computational reasons.

The simple solution to this problem is to break the range of f1() into a number of
buckets and make use of a histogram over the range, rather than using the range itself. In
this case, D is generalized to be an array of histogram buckets, where each entry in D has
summary information for a group of distinct f1() values. Each entry in D has the following
four specific pieces of information:

1. low and high, which are the lower and upper bounds for the f1() values that are
found in this particular bucket.

2. μ1, which is the mean of the distinct f1() values that are found in this particular
bucket. That is, if A is the set of distinct values from low to high, then
μ1 = (1/|A|) Σ_{a in A} a.

3. μ2, which is the second moment of the distinct f1() values that are found in this
particular bucket. That is, μ2 = (1/|A|) Σ_{a in A} a².

Given |D|, there are two possible ways to construct the histogram. In the case where
the queries that will be asked request a simple sum over one of the attributes from the
underlying relation R (that is, f1() does not encode any function other than a simple
relational projection), then it is possible to construct D offline by using any histogram
construction scheme [42, 45, 72] over the attribute that is to be queried. In the case that
multiple attributes might be queried, one histogram can be constructed for each attribute.
This is the method that we test experimentally.

Another appropriate method is to construct D on-the-fly by making use of the pilot
sample that is used to compute the sampling plan. This has the advantage that any
arbitrary f1() can be handled at run time. Again, any appropriate histogram construction
scheme can be used, but rather than constructing D offline using the entire relation R,
f1() is applied to each record r in the pilot sample from Ri (whether or not r is accepted
by f2()) and the histogram is constructed over the resulting set of distinct values.
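
As an illustration, here is a minimal sketch (Python with NumPy; equi-width buckets are
our own arbitrary choice, since any histogram construction scheme works here) of building
the bucketized D, with each bucket storing the four pieces of information listed above.

```python
import numpy as np

def build_hist(values, num_buckets):
    edges = np.linspace(values.min(), values.max() + 1e-9, num_buckets + 1)
    D = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        a = np.unique(values[(values >= lo) & (values < hi)])  # distinct values
        if len(a) > 0:
            D.append({"low": lo, "high": hi,
                      "mu1": a.mean(),           # mean of the distinct values
                      "mu2": (a ** 2).mean()})   # their second moment
    return D

print(build_hist(np.array([1.0, 2.0, 2.0, 7.5, 9.0]), num_buckets=3))
```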

Whatever method is used to construct D, the function GetMoments from Section
5.4.3 must be modified so as to handle the modified D. The following is an appropriately
modified GetMoments; we call it GetMomentsFromHist.



Algorithm GetMomentsFromHist(θ1, ..., θL, D) {

1  // Let θi denote the vector of Dirichlet parameters for stratum i

2  // Let D be an array of histogram buckets

3  // Let Σ' = (Σ'1, ..., Σ'L) be a vector of moments of all strata

4  for (int i = 1; i <= L; i++) {

5      p <- Dirichlet(θi)

6      μ1 = μ2 = 0

7      // Let V be an array of counts for each bucket

8      V <- Multinomial(cnti, p)

9      for (int j = 1; j <= |D|; j++) {

10         μ1 += V[j] * D[j].μ1

11         μ2 += V[j] * D[j].μ2

12     }

13     μ1 /= cnti

14     μ2 /= cnti

15     (μ1,i, μ2,i) = (μ1, μ2)

16     Σ'i = (μ1,i, μ2,i)

17 }

18 return Σ'
}










5.5 Updating Priors Using The Pilot

In Section 5.4, we described how we assign initial values to the parameters of the two
prior distributions, the Beta and the Dirichlet distributions. In this section, we explain
how these initial values can be refined by using information from a pilot sample to obtain
corresponding posterior distributions. Updating these priors using the pilot sample in the
proposed Bayes-Neyman approach is analogous to using the pilot sample to estimate the
stratum variances using the classic Neyman allocation. The update rules described in this
section are fairly straightforward applications of the standard Bayesian update rules [14].

The Beta distribution has two parameters, α and β. Let Rpilot denote the pilot sample
and let s denote the number of records in it that are accepted by the predicate f2(). Thus,
|Rpilot| - s will be the number of records that fail to be accepted by the query.

Then, the following update rules can be used to directly update the α and β
parameters of the Beta distribution:

    \alpha = \alpha + s
    \beta = \beta + (|R_{pilot}| - s)


The Dirichlet distribution is updated similarly. Recall that this distribution uses a
vector of parameters, Θ = {θ1, θ2, ..., θk}, where k is the number of dimensions.

To update the parameter vector Θ, we can use the same pilot sample that was used
to update the Beta, as follows. We initialize to zero all elements of an array count of size k.
These elements denote counts of the number of times that different values from the range
of f1() appear in the pilot sample and are accepted by f2().

The following update rule can be used to update all the different parameters of the
Dirichlet distribution:

    \theta_j = \theta_j + count[j]

Algorithm UpdatePriors describes exactly how pilot sampling is used to update the

parameters of the prior Beta and Dirichlet distributions for the ith stratum.











Algorithm UpdatePriors(α, β, θ, D, Rpilot) {

1  // Let α, β be the parameters of the Beta distribution for the
   // stratum to be updated

2  // Let θ = (θ1, ..., θk) be the parameters of the Dirichlet
   // distribution for the stratum

3  // Let D be an array of histogram buckets for the stratum

4  // Let Rpilot be a pilot sample from the stratum

5  // Let count be an array of counts for each histogram bucket for
   // the stratum

6  for (int j = 1; j <= |D|; j++)

7      count[j] = 0

8  s = 0

9  for (int r = 1; r <= |Rpilot|; r++) {

10     rec = Rpilot[r]

11     if (f2(rec)) {

12         s++

13         val = f1(rec)

14         pos = FindPositionInArray(D, val)

15         count[pos]++

16     }

17 }

18 α = α + s

19 β = β + (|Rpilot| - s)

20 for (int j = 1; j <= |D|; j++)

21     θj = θj + count[j]

22 }











Algorithm FindPositionInArray(D, val) {

1  // Let D be an array of histogram buckets

2  // Let val be a scalar value

3  for (int j = 1; j <= |D|; j++)

4      if (D[j].low <= val && val < D[j].high)

5          return j

6  }
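
Both conjugate updates fit in a few lines. The sketch below (our own Python; bucket_of,
f1, and f2 are hypothetical stand-ins for the bucket lookup and the query functions)
applies the update rules of this section for one stratum.

```python
import numpy as np

def update_priors(alpha, beta, theta, bucket_of, pilot, f1, f2):
    s = 0
    for rec in pilot:
        if f2(rec):                          # record accepted by the predicate
            s += 1
            theta[bucket_of(f1(rec))] += 1   # theta_j = theta_j + count_j
    return alpha + s, beta + (len(pilot) - s), theta

a, b, th = update_priors(0.5, 0.5, np.ones(4),
                         bucket_of=lambda v: int(v) % 4,
                         pilot=list(range(20)),
                         f1=lambda r: 1.5 * r,
                         f2=lambda r: r % 5 == 0)
print(a, b, th)
```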


5.6 Putting It All Together

In this section, we consider how the random variable XΣ can be used to produce
an alternative allocation to the classical Neyman, and give the complete algorithm for
computing our allocation.

5.6.1 Minimizing the Variance

In general, the goal of any sampling plan should be to minimize the variance σ² of
the resulting stratified sampling estimator. The formula for σ² in the classic allocation
problem is given as Equation 5-2 of the thesis. Our situation differs from the classic setup
only in that (in Bayesian fashion) we now use XΣ to implicitly define a distribution over
the per-stratum variance values (σ1², σ2², ..., σL²). Thus, we cannot minimize σ² directly,
because under the Bayesian regime σ² is now a random variable.

Instead, it makes sense to minimize the expected value or average of σ², which (using
Equation 5-2) can be computed as:

    E[\sigma^2] = E\Big[ \sum_{i=1}^{L} \frac{N_i(N_i - n_i)}{n_i} \sigma_i^2 \Big]

Using the linearity of expectation, we have:

    E[\sigma^2] = \sum_{i=1}^{L} \frac{N_i(N_i - n_i)}{n_i} E[\sigma_i^2]

All of the machinery from the last two sections allows us to sample possible
variance vectors from XΣ. Assume that we sample v of these vectors, where v is a suitably
large number, and the samples are denoted by Σ1, Σ2, ..., Σv. Then (1/v) Σ_{j=1}^{v} Σj[i] is an
unbiased estimate of E[σi²]. Plugging this estimate into the previous equation, we have:

    E[\sigma^2] \approx \sum_{i=1}^{L} \frac{N_i(N_i - n_i)}{n_i} \cdot \frac{1}{v}\sum_{j=1}^{v} \Sigma_j[i]

We now wish to minimize this value subject to the constraint that the ni values sum
to n. Notice that the resulting optimization problem has exactly the same structure as the
optimization problem solved by the Neyman allocation, with the exception that σi² has
been replaced by (1/v) Σj Σj[i]. Thus the resulting optimal solution is nearly identical,
with σi replaced as appropriate:

    n_i = n \cdot \frac{N_i \sqrt{\frac{1}{v}\sum_{j=1}^{v} \Sigma_j[i]}}{\sum_{k=1}^{L} N_k \sqrt{\frac{1}{v}\sum_{j=1}^{v} \Sigma_j[k]}}


5.6.2 Computing the Final Sampling Allocation

Algorithm GetBayesNeymanAllocation describes exactly how an optimal sampling

allocation can be obtained using our technique.



Algorithm GetBayesNeymanAllocation(α, β, θ, D, Rpilot, N, n, v) {

1  // Let α = (α1, ..., αL) be the first parameters of the Beta
   // distributions of all strata

2  // Let β = (β1, ..., βL) be the second parameters of the Beta
   // distributions of all strata

3  // Let θ = (θ1, ..., θL) be the set of parameters of the
   // Dirichlet distributions of all strata

4  // Let D = (D1, D2, ..., DL) be the arrays of histogram
   // buckets for all strata

5  // Let Rpilot = (Rpilot1, ..., RpilotL) be the pilot samples
   // from all strata

6  // Let v be the total number of iterations of re-sampling

7  for (int j = 1; j <= L; j++)

8      UpdatePriors(αj, βj, θj, Dj, Rpilotj)

9  // Let cnt = (cnt1, cnt2, ..., cntL) be a vector of counts for
   // all strata

10 // Let Σ' = (Σ'1, Σ'2, ..., Σ'L) be a vector of moments for all strata

11 // Let Σ and Σtemp be vectors of variances of size L

12 for (int i = 1; i <= v; i++) {

13     cnt = GetCounts(α, β, N)

14     Σ' = GetMomentsFromHist(θ1, ..., θL, D)

15     Σtemp = GetSigma(cnt, Σ', N)

16     for (int j = 1; j <= L; j++)

17         Σ[j] += Σtemp[j]

18 }

19 denom = 0

20 for (int j = 1; j <= L; j++) {

21     Σ[j] /= v

22     denom += N[j] * sqrt(Σ[j])

23 }

24 for (int j = 1; j <= L; j++)

25     nj = (n * N[j] * sqrt(Σ[j])) / denom

26 }
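
The tail of the algorithm, the part that turns the v Monte Carlo draws into an allocation,
is summarized by the sketch below (our own Python with NumPy; the random draws
standing in for samples of XΣ are hypothetical). It averages the sampled variance vectors
and plugs the square roots of the averages into the Neyman formula in place of σi.

```python
import numpy as np

def bayes_neyman_allocation(variance_draws, N, n):
    """variance_draws: a (v, L) array of variance vectors sampled from X_Sigma;
    N: stratum sizes; n: total sample budget."""
    avg_var = variance_draws.mean(axis=0)      # (1/v) * sum_j Sigma_j[i]
    w = N * np.sqrt(avg_var)                   # N_i * sqrt(E[sigma_i^2])
    return np.rint(n * w / w.sum()).astype(int)

draws = np.abs(np.random.default_rng(1).normal(100, 20, size=(500, 3)))
print(bayes_neyman_allocation(draws, N=np.array([10000.0, 20000.0, 5000.0]),
                              n=1000))
```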



5.7 Experiments

5.7.1 Goals

The specific goals of our experimental evaluation are as follows:










* To compare the width of the confidence bounds produced using both the classic
Neyman allocation and the proposed Bayes-Neyman allocation in realistic scenarios, in
order to see which can produce tighter bounds.

* To test the reliability of the confidence bounds produced by the two methods. That
is, we wish to ask: if bounds are reported to the user as 95% bounds, is the chance
that they contain the answer actually 95%?

* Third, we wish to compare both methods against simple random sampling as a sanity
check, to see if there is a significant improvement in bound width.

* Finally, we wish to compare the computation time required for the two estimators.

5.7.2 Experimental Setup

Data Sets Used. We use three different data sets in our experimental evaluation:

* The first is a synthetic data set called the GMM data set, and is produced using a
Gaussian (normal) mixture model. The GMM data set has three numerical and three
categorical attributes. Since the underlying normal variables only produce numerical
data, the three categorical attributes (having seven possible values each) are
produced by mapping the ranges of three of the dimensions to discrete values. This
data set has 5 million records.

* The second is the Person data set. This is a 13-attribute real-life data set obtained
from the 1990 Census and contains family and income information. This data set is
publicly available [2] and has a single relation with over 9.5 million records. The data
has twelve numerical attributes and one categorical attribute with 29 categories.

* The third is the KDD data set, which is the data set from the 1999 KDD Cup event.
This data set has 42 attributes with status information regarding various network
connections for intrusion detection. This data set consists of around 5 million records
with integer, real-valued, as well as categorical attributes.

Queries Tested. For each data set, we test queries of the form:

SELECT SUM (f1(r))

FROM R As r

WHERE f2(r)

f1() and f2() vary depending upon the data set. For the GMM data set, f1() projects
one of the three different numerical attributes (each query projects a random attribute).
For the Person data set, either the TotalIncome attribute or the WageIncome attribute is









projected by each query. For the KDD data set, either the src_bytes or the dst_bytes
attribute is projected.

For each of the data sets, three different classes of selection predicates encoded by f2()
are used. Each class has a different selectivity. The three selectivity classes for f2() have
selectivities of (0.01% to 0.001%), (0.1% to 0.01%), and (1.0% to 0.1%), respectively.

For the GMM data set, f2() is constructed by rolling a three-faced die to decide how
many attributes will be included in the conjunction computed by f2(). The appropriate
number of attributes are then randomly selected from among the six GMM attributes. If
a categorical attribute is chosen as one of the attributes in f2(), then the attribute will be
checked with either an equality or inequality condition over a randomly-selected domain
value. If a numerical attribute is chosen, then a range predicate is constructed. For a given
numerical attribute, assume that low and high are the known minimum and maximum
attribute values. The range is constructed using low' = low + v1 * (high - low) and
high' = low' + v2 * (high - low'), where v1 and v2 are randomly chosen real values from
the range [0, 1]. For each selectivity class, 50 different queries are generated by repeating
the query-generation process until enough queries falling in the appropriate selectivity
range have been generated.

The f2() functions for the other two data sets are constructed similarly.
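
For concreteness, here is a minimal sketch (our own Python; not part of the experimental
harness) of the range-predicate construction just described for a numerical attribute.

```python
import random

def random_range(low, high, rng):
    """Draw a random subrange of [low, high] using two uniform values v1, v2."""
    v1, v2 = rng.random(), rng.random()
    lo = low + v1 * (high - low)
    hi = lo + v2 * (high - lo)
    return lo, hi

print(random_range(1900.0, 2007.0, random.Random(0)))
```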

Stratification Tested. For each of the various data sets, a simple nearest-neighbor
classification algorithm is used to perform the stratification. In order to partition a data
set into L strata, L records are first chosen randomly from the data to serve as "seeds" for
each of the strata, and all of the other records are added to the stratum whose seed is closest
to the data point. For numerical attributes, the L2 norm is used as the distance function.
For categorical attributes, we compute the distance using the support from the database
for the attribute values [36]. Since each data set has both numerical and categorical data,
the actual distance function used is the sum of the two "sub" distance functions. Note
that it would be possible to use a much more sophisticated stratification, but actually









performing the stratification is not the point of this thesis; our goal is to study how to
best use the stratification.


Sample   Sel     Bandwidth (%)              Coverage
Size     (%)     GMM / Person / KDD         GMM / Person / KDD
50K      0.01    3.277 / 2.289 / 2.140      918 / 892 / 921
50K      0.1     1.776 / 0.514 / 1.520      926 / 912 / 988
50K      1       0.587 / 0.184 / 0.210      947 / 944 / 942
100K     0.01    2.626 / 2.108 / 1.48       922 / 941 / 937
100K     0.1     1.273 / 0.351 / 0.910      939 / 948 / 940
100K     1       0.415 / 0.128 / 0.120      948 / 952 / 946
500K     0.01    2.192 / 1.740 / 0.820      923 / 943 / 940
500K     0.1     0.551 / 0.132 / 0.630      946 / 947 / 942
500K     1       0.178 / 0.087 / 0.070      946 / 947 / 948

Table 5-1. Bandwidth (as a ratio of error bounds width to the true query answer) and
Coverage (for 1000 query runs) for a Simple Random Sampling estimator over
the three data sets. Results are shown for varying sample sizes and for three
different query selectivities: 0.01%, 0.1%, and 1%.
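
A minimal sketch (our own Python with NumPy) of the seed-based partitioning described
above, restricted to numerical attributes with the L2 norm for simplicity; the actual
stratification in the experiments also adds the support-based distance for categorical
attributes.

```python
import numpy as np

def stratify(data, L, rng):
    """Assign each record to the stratum of its nearest randomly-chosen seed."""
    seeds = data[rng.choice(len(data), L, replace=False)]
    dists = np.linalg.norm(data[:, None, :] - seeds[None, :, :], axis=2)
    return dists.argmin(axis=1)                 # stratum id per record

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))               # hypothetical numerical records
print(np.bincount(stratify(data, L=20, rng=rng), minlength=20))
```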


In our experiments, we test L = 1, L = 20, and L = 200. Note that if L = 1,
then there is actually no stratification performed, and so this case is equivalent to simple
random sampling without replacement and will serve as a sanity check in our experiments.

Tests Run. For the Neyman allocation and our Bayes-Neyman allocation, our test suite
consists of 54 different test cases for each data set, plus nine more tests using L = 1.
These test cases are obtained by assigning three different values to the following four
parameters:

* Number of strata: We use L = 1, L = 20, and L = 200; as described above, L = 1 is
also equivalent to simple random sampling without replacement.

* Pilot sample size: This is the number of records we obtain from each stratum in
order to perform the allocation. We choose values of 5, 20, and 100 records.

* Sample size: This is the total sample size that has to be allocated. We use 50,000,
100,000, and 500,000 samples in our tests.

* Query selectivity: As described above, we test query selectivities of 0.01%, 0.1%, and
1%.









                     Average Running Time (sec.)
                     Neyman     Bayes-Neyman
Gaussian Mixture     1.5        2.4
Person               2.3        3.1
KDD Cup              2.1        2.8

Table 5-2. Average running time of the Neyman and Bayes-Neyman estimators over three
real-world data sets.


Each of the 50 queries for each (data set, selectivity) combination is re-run 20

times using 20 different (pilot sample, sample) combinations. Thus, for each (data set,

selectivity) combination we obtain results for 1000 query runs in all.

5.7.3 Results

Table 5-1 shows the results for the nine cases where L = 1, that is, where no stratification is performed. We report two numbers: the bandwidth and the coverage. The bandwidth is the ratio of the width of the 95% confidence bounds computed as the result of using the allocation to the true query answer. The coverage is the number of times out of the 1000 trials that the true answer is actually contained in the 95% confidence bounds reported by the estimator. Naturally, one would expect this number to be close to 950 if the bounds are in fact reliable. Tables 5-3 and 5-4 show the results for the 54 different test cases where a stratification is actually performed. For each of the 54 test cases and both of the sampling plans used (the Neyman allocation and the Bayes-Neyman allocation) we again report the bandwidth and the coverage.
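Both metrics are simple to compute from the per-run confidence bounds; the following sketch (with names of our choosing) shows the computation for one (data set, selectivity) combination:

    def bandwidth_and_coverage(bounds, true_answer):
        """bounds: list of (low, high) 95% confidence intervals, one per
        query run. Returns the mean bandwidth (interval width divided by
        the true answer) and the coverage count."""
        bandwidth = sum(hi - lo for lo, hi in bounds) / (len(bounds) * true_answer)
        coverage = sum(1 for lo, hi in bounds if lo <= true_answer <= hi)
        return bandwidth, coverage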

Finally, Table 5-2 shows the average running times for the two stratified sampling estimators on all the three data sets. There is generally around a 50% hit in terms of running time when using the Bayes-Neyman allocation compared to the Neyman allocation.

5.7.4 Discussion

There are quite a large number of results presented, and discussing all of the

intricacies present in all of our findings is beyond the scope of the thesis. However,

taken as a whole, our experiments clearly show two things. First, for the type of selective














Table 5-3. Bandwidth (as a ratio of error bounds width to the true query answer) and
           Coverage (for 1000 query runs) for the Neyman estimator and the
           Bayes-Neyman estimator for the three data sets. Results are shown for 20
           strata and for varying number of records in pilot sample per stratum (PS),
           and sample sizes (SS) for three different query selectivities: 0.01%, 0.1%
           and 1%. All value triples are GMM / Person / KDD.

NS  PS   SS    Sel(%)  Bandwidth                          Coverage
                       Neyman          Bayes-Neyman       Neyman         Bayes-Neyman
20  5    50K   0.01    0.00/0.00/0.00  2.90/0.19/1.12     0/0/0          935/882/927
               0.1     0.03/0.01/0.02  1.27/0.02/0.80     3/49/23        929/939/938
               1       0.05/0.02/0.14  0.39/0.01/0.09     11/247/155     940/950/945
         100K  0.01    0.00/0.00/0.00  2.77/0.16/1.08     0/0/0          936/961/930
               0.1     0.02/0.01/0.01  0.90/0.02/0.73     3/53/28        941/941/938
               1       0.05/0.01/0.03  0.28/0.01/0.08     24/306/170     941/947/947
         500K  0.01    0.01/0.00/0.00  2.05/0.06/0.87     3/0/4          938/948/932
               0.1     0.01/0.00/0.01  0.37/0.01/0.55     10/62/51       954/954/941
               1       0.03/0.01/0.02  0.12/0.00/0.04     38/316/184     957/955/945
    20   50K   0.01    0.06/0.00/0.04  2.72/0.22/1.06     14/0/5         942/941/938
               0.1     0.17/0.03/0.09  1.21/0.03/0.81     106/61/88      908/938/944
               1       0.21/0.05/0.27  0.34/0.01/0.09     404/692/561    948/948/947
         100K  0.01    0.01/0.00/0.01  2.58/0.16/0.91     23/0/6         941/937/941
               0.1     0.11/0.02/0.06  0.85/0.02/0.74     165/66/107     934/954/939
               1       0.14/0.03/0.09  0.25/0.01/0.06     431/728/612    954/962/953
         500K  0.01    0.01/0.00/0.01  1.93/0.07/0.62     30/0/21        946/943/944
               0.1     0.01/0.01/0.01  0.34/0.01/0.51     230/145/245    942/952/945
               1       0.04/0.01/0.03  0.09/0.00/0.02     447/751/746    943/961/950
    100  50K   0.01    0.15/0.04/0.08  2.33/0.19/0.82     24/58/20       938/922/938
               0.1     0.26/0.10/0.16  1.09/0.02/0.58     436/204/172    929/949/942
               1       0.47/0.18/0.34  0.32/0.01/0.05     870/891/866    932/962/951
         100K  0.01    0.12/0.03/0.06  2.26/0.16/0.57     29/59/41       935/945/940
               0.1     0.18/0.05/0.11  0.81/0.02/0.40     435/249/355    927/957/942
               1       0.31/0.08/0.02  0.22/0.01/0.04     895/928/914    948/968/943
         500K  0.01    0.01/0.01/0.01  1.72/0.07/0.33     45/66/50       939/952/947
               0.1     0.06/0.02/0.04  0.31/0.01/0.28     474/297/412    954/954/952
               1       0.06/0.02/0.06  0.08/0.00/0.02     926/935/942    950/970/949














Table 5-4. Bandwidth (as a ratio of error bounds width to the true query answer) and
           Coverage (for 1000 query runs) for the Neyman estimator and the
           Bayes-Neyman estimator for the three data sets. Results are shown for 200
           strata with varying number of records in pilot sample per stratum (PS),
           and sample sizes (SS) for three different query selectivities: 0.01%, 0.1%
           and 1%. All value triples are GMM / Person / KDD.

NS   PS   SS    Sel(%)  Bandwidth                          Coverage
                        Neyman          Bayes-Neyman       Neyman         Bayes-Neyman
200  5    50K   0.01    0.00/0.00/0.00  1.73/0.18/0.91     0/0/0          933/931/924
                0.1     0.00/0.02/0.01  0.97/0.02/0.76     0/56/27        933/953/936
                1       0.05/0.02/0.03  0.26/0.01/0.09     19/162/149     940/960/940
          100K  0.01    0.00/0.01/0.01  1.57/0.13/0.75     0/43/28        936/916/930
                0.1     0.01/0.01/0.01  0.72/0.02/0.64     7/60/41        938/958/936
                1       0.03/0.01/0.01  0.19/0.00/0.08     34/365/212     945/955/947
          500K  0.01    0.01/0.00/0.00  1.20/0.08/0.52     5/45/34        940/939/938
                0.1     0.02/0.01/0.00  0.28/0.01/0.44     22/89/76       946/946/944
                1       0.02/0.01/0.01  0.07/0.00/0.06     45/372/336     954/954/951
     20   50K   0.01    0.05/0.03/0.04  1.59/0.18/0.85     19/51/21       943/931/934
                0.1     0.11/0.03/0.07  0.75/0.02/0.72     91/70/94       943/953/939
                1       0.09/0.04/0.09  0.18/0.01/0.07     345/627/580    958/962/945
          100K  0.01    0.01/0.01/0.03  1.35/0.14/0.67     22/66/45       948/948/941
                0.1     0.02/0.02/0.04  0.54/0.01/0.54     131/135/128    935/955/949
                1       0.05/0.02/0.05  0.12/0.00/0.06     488/702/643    945/955/952
          500K  0.01    0.01/0.00/0.01  1.04/0.06/0.42     49/83/72       941/954/947
                0.1     0.01/0.00/0.02  0.20/0.00/0.35     210/209/282    955/945/950
                1       0.04/0.01/0.01  0.03/0.00/0.03     617/830/869    948/958/953
     100  50K   0.01    0.08/0.03/0.06  1.35/0.14/0.54     28/56/39       939/938/939
                0.1     0.20/0.05/0.09  0.56/0.02/0.40     313/357/243    949/949/942
                1       0.10/0.01/0.15  0.14/0.01/0.03     543/823/874    948/948/951
          100K  0.01    0.07/0.02/0.04  1.11/0.12/0.39     47/77/53       938/935/947
                0.1     0.08/0.03/0.06  0.40/0.01/0.28     533/456/427    948/948/951
                1       0.06/0.06/0.08  0.09/0.01/0.02     918/912/930    959/956/952
          500K  0.01    0.01/0.00/0.02  0.89/0.05/0.21     63/91/104      946/936/937
                0.1     0.02/0.01/0.02  0.10/0.00/0.13     580/540/607    945/945/948
                1       0.04/0.03/0.05  0.01/0.00/0.01     936/920/941    960/953/950










queries we concentrate on in our work, the classic Neyman allocation is generally useless. As expected, the allocation tends to ignore strata with relevant records, resulting in "95% confidence bounds" that are generally accurate nowhere close to 95% of the time. Out of 162 different tests over the three data sets, the Neyman allocation produced confidence bounds that had greater than 90% coverage only eleven times, even though 95% bounds were specified. In 15 out of the 162 tests, the "95% confidence bounds" actually contained the answer 0 out of 1000 times!

Second, the allocation produced by the proposed Bayes-Neyman method tends to be remarkably useful - that is, the bounds produced are both accurate and tight. In only 7 of the 162 tests was the coverage of the bounds produced by the Bayes-Neyman allocation found to be less than 93%, and coverage was often remarkably close to 95%. Furthermore, in the few cases where the classic Neyman bounds were actually worthwhile, the Bayes-Neyman bounds were far superior in terms of having a tighter bandwidth. Even if one looks only at the cases where the Neyman bounds were not ridiculous (where "ridiculous" bounds are arbitrarily defined to be those that had a coverage of less than 20%), the Bayes-Neyman bounds were actually tighter than the Neyman bounds 35 out of 70 times. In other words, there were many cases where the Neyman allocation produced bounds that had coverage rates of only around 20%, whereas the Bayes-Neyman allocations produced bounds that were actually tighter, and still had coverage rates very close to the user-specified 95%.

There are a few other interesting findings. Not surprisingly, increasing the number

of strata generally gives tighter error bounds for fixed pilot and sample sizes because it

tends to increase the homogeneity of the records in each stratum. However, in practice

there is a cost associated with increasing the number of strata and so this cannot

be done arbitrarily. Specifically, more strata may translate to more I/Os required to

actually perform the sampling. One might typically store the records within a stratum in

randomized order on disk. Thus, to sample from a given stratum requires only a sequential










scan, but each additional stratum requires a random disk I/O. In addition, it is more

difficult and more costly to maintain a large number of strata.

We also find that by using a larger pilot sample, estimation accuracy generally

increases. This is intuitive since a larger pilot sample contains more information about

the stratum, thus helping to make a better sampling allocation plan and providing a more

accurate estimate. However, a large pilot sample incurs a greater cost to actually perform

the pilot sampling. Explicitly studying this trade-off is an interesting avenue for future

work.

Finally, we point out that even the rudimentary stratification that we tested in these experiments is remarkably successful - if the correct sampling allocation is used. Consider the case of a 500K record sample. For a query selectivity of 0.01%, only around 50 records in the sample will be accepted by the selection predicate encoded in f2(). This is why the bandwidth for the simple random sample estimator with no stratification (L = 1) is so great: for the Person data set it is 1.74 and for the KDD data set it is 0.82. The bounds are so wide that they are essentially useless. However, if the Bayes-Neyman allocation is used over 200 strata and a pilot sample of size 100, the bandwidths shrink to 0.05 and 0.21, respectively. These are far tighter. In the case of the Person data set the bandwidth shrinks by nearly two orders of magnitude. For the KDD data set the reduction is more modest (a factor of four) due to the high dimensionality of the data, which tends to render the stratification less effective. Still, this suggests that perhaps the real issue to consider when stratifying in a database environment is not how to perform the stratification, but how to use the stratification in an effective manner.

5.8 Related Work

Broadly speaking, it is possible to divide the related prior research into two categories - those works from the statistics literature, and those from the data management literature.










The idea of applying Bayesian and/or superpopulation (model-based) methods to the allocation problem has a long history in statistics, and seems to have been studied with particular intensity in the 1970's. Given the number of papers on this topic, it is not feasible to reference all of them, though a small number are listed in the References section of the thesis [32, 103, 104]. At a high level, the essential difference between this work and that prior work is the specificity of our work with respect to database queries. Sampling from a database is very unique in that the distribution of values that are aggregated is typically ill-suited to traditional parametric models. Due to the inclusion of the selection predicate encoded by f2(), the distribution of the f1() values that are aggregated tends to have a large "stovepipe" located at zero corresponding to those records that are not accepted by f2(), with a more well-behaved distribution of values located elsewhere corresponding to those f1() values for records that were accepted by f2(). The Bayes-Neyman allocation scheme proposed in this thesis explicitly allows for such a situation via its use of a two-stage model where first a certain number of records are accepted by f2() (modeled via the random variable X_cnt) and then the f1() values for those accepted records are produced (modeled by X_val). This is quite different from the general-purpose methods described in the statistics literature, which typically attach a well-behaved, standard distribution to the mean and/or variance of each stratum [32, 104].
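As an illustration of this two-stage shape, the following sketch draws one stratum's aggregated values under such a model. The Gaussian choice for the accepted values and all parameter names are our own illustrative assumptions, not the thesis's fitted model (which is developed earlier in this chapter).

    import random

    def draw_stratum_values(n, p_accept, mu, sigma):
        """Two-stage sketch: each record is first accepted or rejected by
        the predicate f2() (the X_cnt stage); accepted records then produce
        an f1() value (the X_val stage), drawn here from an illustrative
        Gaussian. Rejected records form the "stovepipe" of zeros."""
        values = []
        for _ in range(n):
            if random.random() < p_accept:              # stage 1: f2() accepts
                values.append(random.gauss(mu, sigma))  # stage 2: f1() value
            else:
                values.append(0.0)                      # rejected: zero
        return values

    # e.g., a 0.1%-selectivity stratum of 10,000 records
    vals = draw_stratum_values(10000, 0.001, mu=100.0, sigma=15.0)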

Sampling for the answer to database queries has also been studied extensively [63, 67, 96]. In particular, Chaudhuri and his co-authors have explicitly studied the idea of stratification for approximating database queries [18-20]. However, there is a key difference between that work and our own: these existing papers focus on how to break the data into strata, and not on how to sample the strata in a robust fashion. In that sense, our work is completely orthogonal to Chaudhuri et al.'s prior work and our sampling plans could easily be used in conjunction with the workload-based stratifications that their methods can construct.










5.9 Conclusion

In this chapter, we have considered the problem of stratification for developing robust estimates for the answer to very selective aggregate queries. While the obvious problem to consider when stratifying is how to break the data into subsets, the more significant challenge may lie in developing a sampling plan at run time that actually uses the strata in a robust fashion. We have shown that the traditional Neyman sampling allocation can give disastrous results when it is used in conjunction with mildly to very selective queries. We have developed a unique Bayesian method for developing robust sampling plans. Our plans explicitly minimize the expected variance of the final estimator over the space of possible strata variances. We have shown that even when the resulting allocation is used with a very naive nearest-neighbor stratification, the increase in accuracy compared to simple random sampling is considerable. Even more significant is the fact that for highly selective queries, our sampling plans give results that are reliable, in the sense that the associated confidence bounds have near perfect coverage.










CHAPTER 6
CONCLUSION

In this research work, we have studied and described the problem of efficient answering of complex queries on large data warehouses. Our approach for addressing the problem relies on approximation. We present sampling-based techniques which can be used to compute, very quickly, approximate answers along with error guarantees for long-running queries. The first part of this study addresses the problem of efficiently obtaining random samples of records satisfying arbitrary range selection predicates. The second part of the study develops statistical, sampling-based estimators for the specific class of queries that have a nested, correlated subquery. The problem addressed in this work is actually a generalization of the important problem of estimating the number of distinct values in a database. The third and final part of this study addresses the problem of estimating the result to queries having highly selective predicates. Since a uniform random sample is not likely to contain any records satisfying the selection predicate, our approach uses stratified sampling and develops stratified sampling plans to correctly identify high-density strata for arbitrary queries.









APPENDIX
EM ALGORITHM DERIVATION

Let Y_e be the information about record e \in EMP that can be observed, i.e., v = f_1(e) and k' = cnt(e, SALE'). Let X_e be the information about record e that includes Y_e as well as the relevant data that cannot be observed, i.e., k = cnt(e, SALE).

Then let:

    f(X_e = (Y_e, k) \mid \Theta) = p_k \cdot \frac{1}{\sigma\sqrt{2\pi}} e^{-(v-\mu_k)^2/2\sigma^2} \cdot h(k'; k)

Also, let:

    g(Y_e \mid \Theta) = \sum_i p_i \cdot \frac{1}{\sigma\sqrt{2\pi}} e^{-(v-\mu_i)^2/2\sigma^2} \cdot h(k'; i)

We then compute the posterior probability that e belongs to class i as:

    p(i \mid \Theta, e) = \frac{f(X_e = (Y_e, i) \mid \Theta)}{g(Y_e \mid \Theta)}

Then the logarithm of the expected probability that we would observe EMP' and SALE' is:

    E = \sum_{e \in EMP} \sum_i \log\big(f(X_e = (Y_e, i) \mid \Theta)\big) \cdot p(i \mid \Theta', e)
      = \sum_{e \in EMP} \sum_i p(i \mid \Theta', e) \cdot \Big(\log(p_i) - \log(\sigma) - \log(\sqrt{2\pi}) - \frac{(v-\mu_i)^2}{2\sigma^2} + \log(h(k'; i))\Big)

To find the unknown parameters \mu_i, \sigma and p_i, we maximize E for the given set of posterior probabilities at that step. We do this by taking partial derivatives of E w.r.t. each of these parameters and setting the result to zero:

    \frac{\partial E}{\partial \mu_1} = \sum_{e \in EMP} p(1 \mid \Theta', e) \cdot \frac{v - \mu_1}{\sigma^2}

Setting this expression to zero gives:

    \mu_1 = \frac{\sum_{e \in EMP} p(1 \mid \Theta', e) \cdot v}{\sum_{e \in EMP} p(1 \mid \Theta', e)}

We can obtain \mu_2, \ldots, \mu_m in a similar manner.

By taking the partial derivative of E w.r.t. \sigma^2 and setting to zero we get:

    \sigma^2 = \frac{\sum_{e \in EMP} \sum_i p(i \mid \Theta', e) \cdot (v - \mu_i)^2}{\sum_{e \in EMP} \sum_i p(i \mid \Theta', e)}

Finally, to evaluate the p_i's, we also consider the additional constraint that \sum_i p_i = 1. We can find the values of the p_i's that maximize E subject to this constraint by using the method of Lagrangian multipliers to obtain:

    p_i = \frac{\sum_{e \in EMP} p(i \mid \Theta', e)}{\sum_{e \in EMP} \sum_l p(l \mid \Theta', e)}

This completes the derivation of the update rules given in Section 4.5.2.
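For concreteness, the following is a minimal Python sketch of one EM iteration implementing the update rules just derived. The record representation, the callable h, and the absence of any handling for degenerate classes are illustrative assumptions of ours; Section 4.5.2 gives the authoritative update rules.

    import math

    def em_step(records, mu, sigma2, p, h):
        """One EM iteration. records: (v, k_prime) pairs; mu, p: per-class
        parameter lists; sigma2: the shared variance; h(k_prime, i): the
        probability of observing k_prime sampled matches given class i."""
        m = len(p)
        # E-step: posterior p(i | Theta', e); the common 1/(sigma*sqrt(2*pi))
        # factor cancels in the normalization and is omitted.
        posts = []
        for v, kp in records:
            w = [p[i] * math.exp(-(v - mu[i]) ** 2 / (2.0 * sigma2)) * h(kp, i)
                 for i in range(m)]
            z = sum(w)
            posts.append([wi / z for wi in w])
        # M-step: the update rules derived above
        new_mu = [sum(q[i] * v for q, (v, _) in zip(posts, records))
                  / sum(q[i] for q in posts) for i in range(m)]
        new_sigma2 = sum(q[i] * (v - new_mu[i]) ** 2
                         for q, (v, _) in zip(posts, records)
                         for i in range(m)) / len(records)
        new_p = [sum(q[i] for q in posts) / len(records) for i in range(m)]
        return new_mu, new_sigma2, new_p

Iterating em_step until the parameters stabilize yields the fitted model used by the estimator.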









REFERENCES

1. IMDB dataset. http://www.imdb.com

2. Person data set. http://usa.ipums.org/usa

3. Synoptic cloud report dataset.
http://cdiac.ornl.gov/epubs/ndp/ndp026b/nd06~t

4. Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate
answering of group-by queries. In: Tech. Report, Bell Laboratories, Murray Hill, New
Jersey (1999)

5. Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate
answering of group-by queries. In: SIGMOD, pp. 487-498 (2000)

6. Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: Join synopses for
approximate query answering. In: SIGMOD, pp. 275-286 (1999)

7. Alon, N., Gibbons, P.B., Matias, Y., Szegedy, M.: Tracking join and self-join sizes in
limited storage. In: PODS, pp. 10-20 (1999)

8. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the
frequency moments. In: STOC, pp. 20-29 (1996)

9. Antoshenkov, G.: Random sampling from pseudo-ranked b+ trees. In: VLDB, pp.
375-382 (1992)

10. Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate
query processing. In: SIGMOD, pp. 539-550 (2003)

11. Bradley, P.S., Fayyad, U.M., Reina, C.: Scaling clustering algorithms to large
databases. In: KDD, pp. 9-15 (1998)

12. Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: ICDE, p. 6
(2006)

13. Bunge, J., Fitzpatrick, M.: Estimating the number of species: A review. Journal of
the American Statistical Association 88, 364-373 (1993)

14. Carlin, B., Louis, T.: Bayes and Empirical Bayes Methods for Data Analysis.
Chapman and Hall (1996)

15. Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query
processing using wavelets. The VLDB Journal 10(2-3), 199-223 (2001)

16. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.: Towards estimation error
guarantees for distinct values. In: PODS, pp. 268-279 (2000)










17. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.: Towards estimation error
guarantees for distinct values. In: PODS, pp. 268-279 (2000)

18. Chaudhuri, S., Das, G., Datar, M., Motwani, R., Narasayya, V.R.: Overcoming
limitations of sampling for aggregation queries. In: ICDE, pp. 534-542 (2001)

19. Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach for
approximate answering of aggregate queries. In: SIGMOD, pp. 295-306 (2001)

20. Chaudhuri, S., Das, G., Narasayya, V.: Optimized stratified sampling for
approximate query processing. ACM TODS, To Appear (2007)

21. Chaudhuri, S., Das, G., Srivastava, U.: Effective use of block-level sampling in
statistics estimation. In: SIGMOD, pp. 287-298 (2004)

22. Chaudhuri, S., Motwani, R.: On sampling and relational operators. IEEE Data Eng.
Bull. 22(4), 41-46 (1999)

23. Chaudhuri, S., Motwani, R., Narasayya, V.: Random sampling for histogram
construction: how much is enough? SIGMOD Rec. 27(2), 436-447 (1998)

24. Chaudhuri, S., Motwani, R., Narasayya, V.: On random sampling over joins. In:
SIGMOD, pp. 263-274 (1999)

25. Cochran, W.: Sampling Techniques. Wiley and Sons (1977)

26. Dempster, A., Laird, N., Rubin, D.: Maximum-likelihood from incomplete data via
the EM algorithm. J. Royal Statist. Soc. Ser. B. 39 (1977)

27. Diwan, A.A., Rane, S., Seshadri, S., Sudarshan, S.: Clustering techniques for
minimizing external path length. In: VLDB, pp. 342-353 (1996)

28. Dobra, A.: Histograms revisited: when are histograms the best approximation
method for aggregates over joins? In: PODS, pp. 228-237 (2005)

29. Dobra, A., Garofalakis, M., Gehrke, J., Rastogi, R.: Processing complex aggregate
queries over data streams. In: SIGMOD Conference, pp. 61-72 (2002)

30. Domingos, P.: Bayesian averaging of classifiers and the overfitting problem. In: 17th
International Conf. on Machine Learning (2000)

31. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman & Hall/CRC
(1998)

32. Ericson, W.A.: Optimum stratified sampling using prior information. JASA 60(311),
750-771 (1965)

33. Evans, M., Hastings, N., Peacock, B.: Statistical Distributions. Wiley and Sons
(2000)










34. Fan, C., Muller, M., Rezucha, I.: Development of sampling plans by using sequential
(item by item) selection techniques and digital computers. Journal of the American
Statistical Association 57, 387-402 (1962)

35. Ganguly, S., Gibbons, P., Matias, Y., Silberschatz, A.: Bifocal sampling for
skew-resistant join size estimation. In: SIGMOD, pp. 271-281 (1996)

36. Ganti, V., Gehrke, J., Ramakrishnan, R.: Cactus: clustering categorical data using
summaries. In: KDD, pp. 73-83 (1999)

37. Ganti, V., Lee, M.L., Ramakrishnan, R.: ICICLES: self-tuning samples for
approximate query answering. In: VLDB, pp. 176-187 (2000)

38. Garcia-Molina, H., Widom, J., Ullman, J.D.: Database System Implementation.
Prentice-Hall, Inc. (1999)

39. Gelman, A., Carlin, J., Stern, H., Rubin, D.: Bayesian Data Analysis, Second
Edition. Chapman & Hall/CRC (2003)

40. Gibbons, P.B., Matias, Y.: New sampling-based summary statistics for improving
approximate query answers. In: SIGMOD, pp. 331-342 (1998)

41. Gibbons, P.B., Matias, Y., Poosala, V.: Aqua project white paper. In: Technical
Report, Bell Laboratories, Murray Hill, New Jersey, pp. 275-286 (1999)

42. Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.: Optimal and approximate
computation of summary statistics for range aggregates. In: PODS (2001)

43. Goodman, L.: On the estimation of the number of classes in a population. Annals of
Mathematical Statistics 20, 572-579 (1949)

44. Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data cube: A relational
aggregation operator generalizing group-by, cross-tab, and sub-totals. In: ICDE,
pp. 152-159 (1996)

45. Guha, S., K~oudas, N., Srivastava, D.: Fast algorithms for hierarchical range
histogram construction. In: PODS, pp. 180-187 (2002)

46. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMOD
Conference, pp. 47-57 (1984)

47. Haas, P., Hellerstein, J.: Ripple joins for online aggregation. In: SIGMOD
Conference, pp. 287-298 (1999)

48. Haas, P., Naughton, J., Seshadri, S., Stokes, L.: Sampling-based estimation of the
number of distinct values of an attribute. In: 21st International Conference on Very
Large Databases, pp. 311-322 (1995)

49. Haas, P., Naughton, J., Seshadri, S., Stokes, L.: Sampling-based estimation of the
number of distinct values of an attribute. In: VLDB, pp. 311-322 (1995)









50. Haas, P., Stokes, L.: Estimating the number of classes in a finite population. Journal
of the American Statistical Association 93, 1475-1487 (1998)

51. Haas, P.J.: Large-sample and deterministic confidence intervals for online
aggregation. In: Statistical and Scientific Database Management, pp. 51-63 (1997)

52. Haas, P.J.: The need for speed: Speeding up DB2 using sampling. IDUG Solutions
Journal 10, 32-34 (2003)

53. Haas, P.J., Hellerstein, J.: Join algorithms for online aggregation. IBM Research
Report RJ 10126 (1998)

54. Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. In: SIGMOD, pp.
287-298 (1999)

55. Haas, P.J., Koenig, C.: A bi-level Bernoulli scheme for database sampling. In:
SIGMOD, pp. 275-286 (2004)

56. Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Fixed-precision estimation of
join selectivity. In: PODS, pp. 190-201 (1993)

57. Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Selectivity and cost estimation
for joins based on random sampling. J. Comput. Syst. Sci. 52(3), 550-569 (1996)

58. Haas, P.J., Naughton, J.F., Swami, A.N.: On the relative cost of sampling for join
selectivity estimation. In: PODS, pp. 14-24 (1994)

59. Haas, P.J., Swami, A.N.: Sequential sampling procedures for query size estimation.
In: SIGMOD, pp. 341-350 (1992)

60. Hellerstein, J., Avnur, R., Chou, A., Hidber, C., Olston, C., Raman, V., Roth, T.,
Haas, P.: Interactive data analysis: The CONTROL project. IEEE Computer 32(8),
51-59 (1999)

61. Hellerstein, J., Haas, P., Wang, H.: Online aggregation. In: SIGMOD Conference, pp.
171-182 (1997)

62. Hellerstein, J.M., Avnur, R., Chou, A., Hidber, C., Olston, C., Raman, V., Roth, T.,
Haas, P.J.: Interactive data analysis: The CONTROL project. In: IEEE Computer 32(8),
pp. 51-59 (1999)

63. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: SIGMOD, pp.
171-182 (1997)

64. Hou, W.C., Ozsoyoglu, G.: Statistical estimators for aggregate relational algebra
queries. ACM Trans. Database Syst. 16(4), 600-654 (1991)

65. Hou, W.C., Ozsoyoglu, G.: Processing time-constrained aggregate queries in CASE-DB.
ACM Trans. Database Syst. 18(2), 224-261 (1993)









66. Hou, W.C., Ozsoyoglu, G., Dogdu, E.: Error-constrained COUNT query evaluation in
relational databases. SIGMOD Rec. 20(2), 278-287 (1991)

67. Hou, W.C., Ozsoyoglu, G., Taneja, B.K.: Statistical estimators for relational algebra
expressions. In: PODS, pp. 276-287 (1988)

68. Hou, W.C., Ozsoyoglu, G., Taneja, B.K.: Processing aggregate relational queries with
hard time constraints. In: SIGMOD, pp. 68-77 (1989)

69. Huang, H., Bi, L., Song, H., Lu, Y.: A variational em algorithm for large databases.
In: International Conference on Machine Learning and Cybernetics, pp. 3048-3052
(2005)

70. Ioannidis, Y.E.: Universality of serial histograms. In: VLDB, pp. 256-267 (1993)

71. Ioannidis, Y.E., Poosala, V.: Histogram-based approximation of set-valued
query-answers. In: VLDB (1999)

72. Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.:
Optimal histograms with quality guarantees. In: VLDB, pp. 275-286 (1998)

73. Jermaine, C., Dobra, A., Arumugam, S., Joshi, S., Pol, A.: A disk-based join with
probabilistic guarantees. In: SIGMOD, pp. 563-574 (2005)

74. Jermaine, C., Dobra, A., Arumugam, S., Joshi, S., Pol, A.: The sort-merge-shrink
join. ACM Trans. Database Syst. 31(4), 1382-1416 (2006)

75. Jermaine, C., Dobra, A., Pol, A., Joshi, S.: Online estimation for subset-based SQL
queries. In: 31st International conference on Very large data bases, pp. 745-756
(2005)

76. Jermaine, C., Pol, A., Arumugam, S.: Online maintenance of very large random
samples. In: SIGMOD, pp. 299-310. ACM Press, New York, NY, USA (2004)

77. Kempe, D., Dobra, A., Gehrke, J.: Gossip-based computation of aggregate
information. In: FOCS, pp. 482-491 (2003)

78. Krewski, D., Platek, R., Rao, J.: Current Topics in Survey Sampling. Academic Press
(1981)

79. Lakshmanan, L.V.S., Pei, J., Han, J.: Quotient cube: How to summarize the
semantics of a data cube. In: VLDB, pp. 778-789 (2002)

80. Lakshmanan, L.V.S., Pei, J., Zhao, Y.: Qc-trees: An efficient summary structure for
semantic olap. In: SIGMOD, pp. 64-75 (2003)

81. Leutenegger, S.T., Edgington, J.M., Lopez, M.A.: STR: A simple and efficient
algorithm for R-tree packing. In: ICDE, pp. 497-506 (1997)









82. Ling, Y., Sun, W.: A supplement to sampling-based methods for query size
estimation in a database system. SIGMOD Rec. 21(4), 12-15 (1992)

83. Lipton, R., Naughton, J.: Query size estimation by adaptive sampling. In: PODS,
pp. 40-46 (1990)

84. Lipton, R., Naughton, J., Schneider, D.: Practical selectivity estimation through
adaptive sampling. In: SIGMOD Conference, pp. 1-11 (1990)

85. Lipton, R.J., Naughton, J.F.: Estimating the size of generalized transitive closures.
In: VLDB, pp. 165-171 (1989)

86. Lipton, R.J., Naughton, J.F.: Query size estimation by adaptive sampling. J.
Comput. Syst. Sci. 51(1), 18-25 (1995)

87. Luo, G., Ellmann, C.J., Haas, P.J., Naughton, J.F.: A scalable hash ripple join
algorithm. In: SIGMOD, pp. 252-262 (2002)

88. Matias, Y., Vitter, J., Wang, M.: Wavelet-based histograms for selectivity estimation.
In: SIGMOD Conference, pp. 448-459 (1998)

89. Matias, Y., Vitter, J.S., Wang, M.: Wavelet-based histograms for selectivity
estimation. SIGMOD Record 27(2), 448-459 (1998)

90. Mingoti, S.: Bayesian estimator for the total number of distinct species when quadrat
sampling is used. Journal of Applied Statistics 26(4), 469-483 (1999)

91. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press,
New York (1995)

92. Muralikrishna, M., DeWitt, D.: Equi-depth histograms for estimating selectivity
factors for multi-dimensional queries. In: SIGMOD Conference, pp. 28-36 (1988)

93. Muth, P., O'Neil, P.E., Pick, A., Weikum, G.: Design, implementation, and
performance of the LHAM log-structured history data access method. In: VLDB, pp.
452-463 (1998)

94. Naughton, J.F., Seshadri, S.: On estimating the size of projections. In: ICDT:
Proceedings of the third international conference on Database theory, pp. 499-513
(1990)

95. Neal, R., Hinton, G.: A view of the EM algorithm that justifies incremental, sparse,
and other variants. In: Learning in Graphical Models (1998)

96. Olken, F.: Random sampling from databases. In: Ph.D. Dissertation (1993)

97. Olken, F.: Random sampling from databases. Tech. Rep. LBL-32883, Lawrence
Berkeley National Laboratory (1993)









98. Olken, F., Rotem, D.: Simple random sampling from relational databases. In: VLDB,
pp. 160-169 (1986)

99. Olken, F., Rotem, D.: Random sampling from B+ trees. In: VLDB, pp. 269-277
(1989)

100. Olken, F., Rotem, D.: Sampling from spatial databases. In: ICDE, pp. 199-208
(1993)

101. Olken, F., Rotem, D., Xu, P.: Random sampling from hash files. In: SIGMOD, pp.
375-386 (1990)

102. Piatetsky-Shapiro, G., Connell, C.: Accurate estimation of the number of tuples
satisfying a condition. In: SIGMOD, pp. 256-276 (1984)

103. Rao, T.J.: On the allocation of sample size in stratified sampling. Annals of the
Institute of Statistical Mathematics 20, 159-166 (1968)

104. Rao, T.J.: Optimum allocation of sample size and prior distributions: A review.
International Statistical Review 45(2), 173-179 (1977)

105. Roussopoulos, N., Kotidis, Y., Roussopoulos, M.: Cubetree: organization of and bulk
incremental updates on the data cube. In: SIGMOD, pp. 89-99 (1997)

106. Rowe, N.C.: Top-down statistical estimation on a database. SIGMOD Record 13(4),
135-145 (1983)

107. Rowe, N.C.: Antisampling for estimation: an overview. IEEE Trans. Softw. Eng.
11(10), 1081-1091 (1985)

108. Rusu, F., Dobra, A.: Statistical analysis of sketch estimators. In: To Appear,
SIGMOD (2007)

109. Sarndal, C., Swensson, B., Wretman, J.: Model Assisted Survey Sampling. Springer,
New York (1992)

110. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access
path selection in a relational database management system. In: SIGMOD, pp. 23-34
(1979)

111. Severance, D.G., Lohman, G.M.: Differential files: Their application to the
maintenance of large databases. ACM Trans. Database Syst. 1(3), 256-267 (1976)

112. Shao, J.: Mathematical Statistics. Springer-Verlag (1999)

113. Sismanis, Y., Deligiannakis, A., Roussopoulos, N., Kotidis, Y.: Dwarf: Shrinking the
petacube. In: SIGMOD, pp. 464-475 (2002)

114. Sismanis, Y., Roussopoulos, N.: The polynomial complexity of fully materialized
coalesced cubes. In: VLDB, pp. 540-551 (2004)










115. Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M.,
Lau, E., Lin, A., Madden, S., O'Neil, E., O'Neil, P., Rasin, A., Tran, N., Zdonik, S.:
C-store: a column-oriented DBMS. In: VLDB, pp. 553-564 (2005)

116. Thiesson, B., Meek, C., Heckerman, D.: Accelerating EM for large databases. Mach.
Learn. 45(3), 279-299 (2001)

117. Thorup, M., Zhang, Y.: Tabulation based 4-universal hashing with applications to
second moment estimation. In: SODA, pp. 615-624 (2004)

118. Vitter, J.S., Wang, M.: Approximate computation of multidimensional aggregates of
sparse data using wavelets. SIGMOD Rec. 28(2), 193-204 (1999)

119. Vitter, J.S., Wang, M., Iyer, B.: Data cube approximation and histograms via
wavelets. In: CIKM, pp. 96-104 (1998)

120. Vysochanskii, D., Petunin, Y.: Justification of the 3-sigma rule for unimodal
distributions. Theory of Probability and Mathematical Statistics 21, 25-36 (1980)

121. Yu, X., Zuzarte, C., Sevcik, K.C.: Towards estimating the number of distinct value
combinations for a set of attributes. In: CIKM, pp. 656-663 (2005)









BIOGRAPHICAL SKETCH

Shantanu Joshi received his Bachelor of Engineering in Computer Science from the University of Mumbai, India in 2000. After a brief stint of one year at Patni Computer Systems in Mumbai, he joined the graduate school at the University of Florida in fall 2001, where he received his Master of Science (MS) in 2003 from the Department of Computer and Information Science and Engineering.

In the summer of 2006, he was a research intern at the Data Management, Exploration and Mining Group at Microsoft Research, where he worked with Nicolas Bruno and Surajit Chaudhuri.

Shantanu will receive a Ph.D. in Computer Science in August 2007 from the University of Florida and will then join the Database Server Manageability group at Oracle Corporation as a member of technical staff.





PAGE 1

1

PAGE 2

2

PAGE 3

3

PAGE 4

Firstly,mysincerestgratitudegoestomyadvisor,ProfessorChrisJermaineforhisinvaluableguidanceandsupportthroughoutmyPhDresearchwork.Duringtheinitialseveralmonthsofmygraduatework,ChriswasextremelypatientandalwaysledmetowardstherightdirectionwheneverIwouldwaver.Hisacuteinsightintotheresearchproblemsweworkedonsetanexcellentexampleandprovidedmeimmensemotivationtoworkonthem.Hehasalwaysemphasizedtheimportanceofhigh-qualitytechnicalwritingandhasspentseveralpainstakinghoursreadingandcorrectingmytechnicalmanuscripts.HehasbeenthebestmentorIcouldhavehopedforandIshallalwaysremainindebtedtohimforshapingmycareerandmoreimportantly,mythinking.IamalsoverythankfultoProfessorAlinDobraforhisguidanceduringmygraduatestudy.Hisenthusiasmandconstantwillingnesstohelphasalwaysamazedme.ThanksarealsoduetoProfessorJoachimHammerforhissupportduringtheveryearlydaysofmygraduatestudy.ItakethisopportunitytothankProfessorsTamerKahveciandGaryKoehlerfortakingthetimetoserveonmycommitteeandfortheirhelpfulsuggestions.ItwasapleasureworkingwithSubiArumugamandAbhijitPolonvariouscollaborativeresearchprojects.SeveralinterestingtechnicaldiscussionswithMingxiWu,FeiXu,FlorinRusu,LaukikChitnisandSeemaDegwekarprovidedastimulatingworkenvironmentintheDatabaseCenter.Thisworkwouldnothavebeenpossiblewithouttheconstantencouragementandsupportofmyfamily.Myparents,DrSharadJoshiandDrHemangiJoshialwaysencouragedmetofocusonmygoalsandpursuethemagainstallodds.Mybrother,DrAbhijitJoshihasalwaysplacedtrustinmyabilitiesandhasbeenanidealexampletofollowsincemychildhood.Mylovingsister-in-law,DrHetalJoshihasbeensupportivesincethetimeIdecidedtopursuecomputerscience. 4

PAGE 5

page ACKNOWLEDGMENTS ................................. 4 LISTOFTABLES ..................................... 8 LISTOFFIGURES .................................... 9 ABSTRACT ........................................ 11 CHAPTER 1INTRODUCTION .................................. 13 1.1ApproximateQueryProcessing(AQP)-ADierentParadigm ....... 13 1.2BuildinganAQPSystemAfresh ........................ 14 1.2.1SamplingVsPrecomputedSynopses .................. 15 1.2.2ArchitecturalChanges ......................... 16 1.3ContributionsinThisThesis .......................... 18 2RELATEDWORK .................................. 19 2.1Sampling-basedEstimation .......................... 19 2.2EstimationUsingNon-samplingPrecomputedSynopses ........... 28 2.3AnalyticQueryProcessingUsingNon-standardDataModels ........ 30 3MATERIALIZEDSAMPLEVIEWSFORDATABASEAPPROXIMATION .. 33 3.1Introduction ................................... 33 3.2ExistingSamplingTechniques ......................... 35 3.2.1RandomlyPermutedFiles ....................... 35 3.2.2SamplingfromIndices ......................... 36 3.2.3Block-basedRandomSampling ..................... 37 3.3OverviewofOurApproach ........................... 38 3.3.1ACETreeLeafNodes .......................... 38 3.3.2ACETreeStructure ........................... 39 3.3.3ExampleQueryExecutioninACETree ................ 40 3.3.4ChoiceofBinaryVersusk-AryTree .................. 42 3.4PropertiesoftheACETree .......................... 43 3.4.1Combinability .............................. 43 3.4.2Appendability .............................. 44 3.4.3Exponentiality .............................. 44 3.5ConstructionoftheACETree ......................... 45 3.5.1DesignGoals ............................... 45 3.5.2Construction ............................... 46 3.5.3ConstructionPhase1 .......................... 46 5

PAGE 6

.......................... 48 3.5.5Combinability/AppendabilityRevisited ................ 51 3.5.6PageAlignment ............................. 51 3.6QueryAlgorithm ................................ 52 3.6.1Goals ................................... 53 3.6.2AlgorithmOverview ........................... 53 3.6.3DataStructures ............................. 55 3.6.4ActualAlgorithm ............................ 55 3.6.5AlgorithmAnalysis ........................... 57 3.7Multi-DimensionalACETrees ......................... 59 3.8Benchmarking .................................. 60 3.8.1Overview ................................. 61 3.8.2DiscussionofExperimentalResults .................. 66 3.9ConclusionandDiscussion ........................... 70 4SAMPLING-BASEDESTIMATORSFORSUBSET-BASEDQUERIES .... 73 4.1Introduction ................................... 73 4.2TheConcurrentEstimator ........................... 78 4.3UnbiasedEstimator ............................... 80 4.3.1High-LevelDescription ......................... 80 4.3.2TheUnbiasedEstimatorInDepth ................... 81 4.3.3WhyIstheEstimatorUnbiased? .................... 85 4.3.4ComputingtheVarianceoftheEstimator ............... 87 4.3.5IsThisGood? .............................. 89 4.4DevelopingaBiasedEstimator ........................ 91 4.5DetailsofOurApproach ............................ 92 4.5.1ChoiceofModelandModelParameters ................ 92 4.5.2EstimationofModelParameters .................... 95 4.5.3GeneratingPopulationsFromtheModel ............... 100 4.5.4ConstructingtheEstimator ....................... 102 4.6Experiments ................................... 103 4.6.1ExperimentalSetup ........................... 103 4.6.1.1Syntheticdatasets ...................... 104 4.6.1.2Real-lifedatasets ....................... 106 4.6.2Results .................................. 109 4.6.3Discussion ................................ 111 4.7RelatedWork .................................. 118 4.8Conclusion .................................... 119 5SAMPLING-BASEDESTIMATIONOFLOWSELECTIVITYQUERIES ... 121 5.1Introduction ................................... 121 5.2Background ................................... 124 5.2.1Stratication ............................... 124 5.2.2\Optimal"AllocationandWhyIt'sNot ................ 126 6

PAGE 7

........................... 128 5.4DeningX 129 5.4.1Overview ................................. 129 5.4.2DeningXcnt 130 5.4.3DeningX0 132 5.4.4CombiningTheTwo .......................... 135 5.4.5LimitingtheNumberofDomainValues ................ 137 5.5UpdatingPriorsUsingThePilot ....................... 139 5.6PuttingItAllTogether ............................. 141 5.6.1MinimizingtheVariance ........................ 141 5.6.2ComputingtheFinalSamplingAllocation .............. 142 5.7Experiments ................................... 143 5.7.1Goals ................................... 143 5.7.2ExperimentalSetup ........................... 144 5.7.3Results .................................. 147 5.7.4Discussion ................................ 147 5.8RelatedWork .................................. 151 5.9Conclusion .................................... 153 6CONCLUSION .................................... 154 APPENDIX EMALGORITHMDERIVATION ......................... 155 REFERENCES ....................................... 157 BIOGRAPHICALSKETCH ................................ 165 7

PAGE 8

Table page 4-1ObservedstandarderrorasapercentageofSUM(e.SAL)overalle2EMPfor24syntheticallygenerateddatasets.Thetableshowserrorsforthreedierentsamplingfractions:1%,5%and10%andforeachofthesefractions,itshowstheerrorforthethreeestimators:U-Unbiasedestimator,C-ConcurrentsamplingestimatorandB-Model-basedbiasedestimator. ................. 112 4-2ObservedstandarderrorasapercentageofSUM(e.SAL)overalle2EMPfor24syntheticallygenerateddatasets.Thetableshowserrorsforthreedierentsamplingfractions:1%,5%and10%andforeachofthesefractions,itshowstheerrorforthethreeestimators:U-Unbiasedestimator,C-ConcurrentsamplingestimatorandB-Model-basedbiasedestimator. ................. 113 4-3ObservedstandarderrorasapercentageofSUM(e.SAL)overalle2EMPfor18syntheticallygenerateddatasets.Thetableshowserrorsforthreedierentsamplingfractions:1%,5%and10%andforeachofthesefractions,itshowstheerrorforthethreeestimators:U-Unbiasedestimator,C-ConcurrentsamplingestimatorandB-Model-basedbiasedestimator. ................. 114 4-4Observedstandarderrorasapercentageofthetotalaggregatevalueofallrecordsinthedatabasefor8queriesover3real-lifedatasets.Thetableshowserrorsforthreedierentsamplingfractions:1%,5%and10%andforeachofthesefractions,itshowstheerrorforthethreeestimators:U-Unbiasedestimator,C-ConcurrentsamplingestimatorandB-Model-basedbiasedestimator. ... 115 5-1Bandwidth(asaratiooferrorboundswidthtothetruequeryanswer)andCoverage(for1000queryruns)foraSimpleRandomSamplingestimatorfortheKDDCupdataset.Resultsareshownforvaryingsamplesizesandforthreedierentqueryselectivities-0.01%,0.1%and1%. ...................... 146 5-2AveragerunningtimeofNeymanandBayes-Neymanestimatorsoverthreereal-worlddatasets. ........................................ 147 5-3Bandwidth(asaratiooferrorboundswidthtothetruequeryanswer)andCoverage(for1000queryruns)fortheNeymanestimatorandtheBayes-Neymanestimatorforthethreedatasets.Resultsareshownfor20strataandforvaryingnumberofrecordsinpilotsampleperstratum(PS),andsamplesizes(SS)forthreedierentqueryselectivities-0.01%,0.1%and1%. ...................... 148 5-4Bandwidth(asaratiooferrorboundswidthtothetruequeryanswer)andCoverage(for1000queryruns)fortheNeymanestimatorandtheBayes-Neymanestimatorforthethreedatasets.Resultsareshownfor200stratawithvaryingnumberofrecordsinpilotsampleperstratum(PS),andsamplesizes(SS)forthreedierentqueryselectivities-0.01%,0.1%and1%. ...................... 149 8

PAGE 9

Figure page 1-1SimpliedarchitectureofaDBMS ......................... 17 3-1StructureofaleafnodeoftheACEtree. ...................... 39 3-2StructureoftheACEtree. .............................. 40 3-3Randomsamplesfromsection1ofL3. ....................... 41 3-4CombiningsamplesfromL3andL5. ........................ 42 3-5CombiningtwosectionsofleafnodesoftheACEtree. .............. 43 3-6AppendingtwosectionsofleafnodesoftheACEtree. .............. 45 3-7Choosingkeysforinternalnodes. .......................... 47 3-8ExponentialitypropertyofACEtree. ........................ 48 3-9Phase2oftreeconstruction. ............................. 49 3-10Executionrunsofqueryansweringalgorithmwith(a)1contributingsection,(b)6contributingsections,(c)7contributingsectionsand(d)16contributingsections. ........................................ 54 3-11SamplingrateofanACEtreevs.rateforaB+treeandscanofarandomlypermutedle,withaonedimensionalselectionpredicateaccepting0.25%ofthedatabaserecords.Thegraphshowsthepercentageofdatabaserecordsretrievedbyallthreesamplingtechniquesversustimeplottedasapercentageofthetimerequiredtoscantherelation ............................. 60 3-12SamplingrateofanACEtreevs.rateforaB+treeandscanofarandomlypermutedle,withaonedimensionalselectionpredicateaccepting2.5%ofthedatabaserecords.Thegraphshowsthepercentageofdatabaserecordsretrievedbyallthreesamplingtechniquesversustimeplottedasapercentageofthetimerequiredtoscantherelation ............................. 61 3-13SamplingrateofanACEtreevs.rateforaB+treeandscanofarandomlypermutedle,withaonedimensionalselectionpredicateaccepting25%ofthedatabaserecords.Thegraphshowsthepercentageofdatabaserecordsretrievedbyallthreesamplingtechniquesversustimeplottedasapercentageofthetimerequiredtoscantherelation ............................. 62 3-14SamplingrateofanACEtreevs.rateforaB+treeandscanofarandomlypermutedle,withaonedimensionalselectionpredicateaccepting2.5%ofthedatabaserecords.ThegraphisanextensionofFigure 3-12 andshowsresultstillallthreesamplingtechniquesreturnalltherecordsmatchingthequerypredicate. 63 9

PAGE 10

...................... 64 3-16SamplingrateofanACETreevs.rateforanR-Treeandscanofarandomlypermutedle,withaspatialselectionpredicateaccepting0.25%ofthedatabasetuples. ......................................... 67 3-17SamplingrateofanACEtreevs.rateforanR-tree,andscanofarandomlypermutedlewithaspatialselectionpredicateaccepting2.5%ofthedatabasetuples. ......................................... 68 3-18SamplingrateofanACEtreevs.rateforanR-tree,andscanofarandomlypermutedlewithaspatialselectionpredicateaccepting25%ofthedatabasetuples. ......................................... 69 4-1Samplingfromasuperpopulation .......................... 90 4-2SixdistributionsusedtogenerateforeacheinEMPthenumberofrecordssinSALEforwhichf3(e;s)evaluatestotrue. ...................... 105 5-1Betadistributionwithparameters==0:5. .................. 131 10

PAGE 11

11

PAGE 12

12

PAGE 13

13

PAGE 14

63 ]calledOnlineaggregation(OLA).Theyproposeaninteractiveinterfacefordataexplorationandanalysiswhererecordsareretrievedinarandomorder.Usingtheserandomsamples,runningestimatesanderrorboundsarecomputedandimmediatelydisplayedtotheuser.Astimeprogresses,thesizeoftherandomsamplekeepsgrowingandsotheestimateiscontinuouslyrened.Atapredeterminedtimeinterval,therenedestimatealongwithitsimprovedaccuracyisdisplayedtotheuser.Ifatanypointoftimeduringtheexecutiontheuserissatisedwiththeaccuracyoftheanswer,shecanterminatefurtherexecution.Thesystemalsogivesanoverallprogressindicatorbasedonthefractionofrecordsthathavebeensampledthusfar.Thus,OLAprovidesaninterfacewheretheuserisgivenaroughestimateoftheresultveryquickly. 14

PAGE 15

25 109 ]inordertosupportAQP.Wemakethischoiceduetothefollowingimportantadvantagesofsamplingoverprecomputedsynopses.Theaccuracyofanestimatecomputedbyusingsamplescanbeeasilyimprovedbyobtainingmoresamplestoanswerthequery.Ontheotherhand,iftheestimatecomputedbyusingsynopsesisnotsucientlyaccurate,anewsynopsisprovidinggreateraccuracywouldhavetobebuilt.Sincethiswouldrequirescanningthedatasetitisimpractical.Secondly,samplingisveryamenabletoscalability.Evenforextremelylargedatasetsoftheorderofhundredsofgigabytes,itisgenerallypossibletoaccomodateasmallsampleinmainmemoryanduseecientin-memoryalgorithmstoprocessit.Ifthisisnotpossible,disk-basedsamplesandalgorithmshavealsobeenproposed[ 76 ]andareequallyeectiveastheirin-memorycounterparts.Thisisanimportantbenetofsampling 15

PAGE 16

1-1 depictsthevariouscomponentsfromasimpliedarchitectureofaDBMS.Thefourcomponentsthatrequiremajorchangesinordertosupportsampling-basedAQPareasfollows:Index/le/recordmanager-TheuseoftraditionalindexstructureslikeB+-Treesisnotappropriatetoobtainrandomsamples.Thisisbecausesuchindexstructuresorderrecordsbasedonrecordsearchkeyvalueswhichisactuallytheoppositeofobtainingrecordsinarandomorder.Hence,forAQPitisimportanttoprovidephysicalstructuresorleorganizationswhichsupportecientretrievalofrandomsamples.Executionengine-Theexecutionengineneedstoberevampedcompletelysothatitcanusetherandomsamplesreturnedbythelowerleveltoexecutethequeryonthem.Further,theresultofthequeryneedstobescaledupappropriatelyforthesizeoftheentiredatabase.Thiscomponentwouldalsoneedtobeabletocomputeaccuracyguaranteesfortheapproximateanswer. 16

PAGE 17

SimpliedarchitectureofaDBMS 17

PAGE 18

18

PAGE 19

96 98 { 101 ]andAntoshenkov[ 9 ],thoughtheideaofusingasurveysampleforestimationinstatisticsliteraturegoesbackmuchearlierthantheseworks.MostoftheworkbyOlkenandRotemdescribeshowtoperformsimplerandomsamplingfromdatabases.Estimationforseveraltypesofdatabasetaskshasbeenattemptedwithrandomsamples.Therestofthissectionpresentsimportantworksonsampling-basedestimationofmajordatabasetasks.SomeoftheinitialworkonestimatingselectivityofjoinqueriesisduetoHouetal.[ 67 68 ].Theypresentunbiasedandconsistentestimatorsforestimatingthejoinsizeandalsoprovideanalgorithmforclustersampling.In[ 64 ]theyproposeunbiasedestimatorsforCOUNTaggregatequeriesoverarbitraryrelationalalgebraexpressions.However,computationofvarianceoftheirestimatorsisverycomplex[ 67 ].Theyalsodonotprovideanyboundsonthenumberofrandomsamplesrequiredforestimation.Adaptivesamplinghasbeenusedforestimationofselectivityofpredicatesinrelationalselectionandjoinoperations[ 83 84 86 ]andforapproximatingthesizeofarelationalprojectionoperation[ 94 ].Adaptivesamplinghasalsobeenusedin[ 85 ],toestimatetransitiveclosuresofdatabaserelations.Theauthorspointoutthebenetsandgeneralityofusingsamplingforselectivityestimationoverparametricmethodswhichmakeassumptionsaboutanunderlyingprobabilitydistributionforthedataaswellasovernon-parametricmethodswhichrequirestoringandmaintainingsynopsesaboutthe 19

PAGE 20

59 ]observethatusingalooseupperboundforthemaximumsubquerysizecanleadtosamplingmoresubqueriesthannecessary,andpotentiallyincreasingthecostofsamplingsignicantly.Doublesamplingortwo-phasesamplinghasbeenusedin[ 66 ]forestimatingtheresultofaCOUNTquerywithaguaranteederrorboundatacertaincondencelevel.Theerrorboundisguaranteedbyperformingsamplingintwosteps.Intherststepasmallpilotsampleisusedtoobtainpreliminaryinformationabouttheinputrelation.Thisinformationisthenusedtocomputethesizeofthesampleforthesecondstepsuchthattheestimatorisguaranteedtoproduceanestimatewiththedesirederrorbound.AsHaasandSwami[ 59 ]pointout,thedrawbackofusingdoublesamplingisthatthereisnotheoreticalguidanceforchoosingthesizeofthepilotsample.Thiscouldleadtoanunpredictablyimpreciseestimateifthepilotsamplesizeistoosmalloranunnecessarilyhighsamplingcostifthepilotsamplesizeistoolarge.Intheirwork[ 59 ],HaasandSwamipresentsequentialsamplingtechniqueswhichprovideanestimateoftheresultsizeandalsoboundstheerrorinestimationwithaprespeciedprobability.Theypresenttwoalgorithmsinthepapertoestimatethesizeofaqueryresult.Althoughbothalgorithmshavebeenproventobeasymptoticallycorrectandecient,therstalgorithmsuersfromtheproblemofundercoverage.Thismeansthatinpracticetheprobabilitywithwhichitestimatesthequeryresultwithinthecomputederrorboundislessthanthespeciedcondencelevelofthealgorithm.Thisproblemisaddressedbythesecond 20

PAGE 21

82 ]pointoutthatgeneralsampling-basedestimationmethodshaveahighcostofexecutionsincetheymakeanoverlyrestrictiveassumptionofnoknowledgeabouttheoverallcharacteristicsofthedata.Inparticular,theynotethatestimationoftheoverallmeanandvarianceofthedatanotonlyincurscostbutalsointroduceserrorinestimation.Theauthorsrathersuggestanalternativeapproachofactuallykeepingtrackofthesecharacteristicsinthedatabaseataminimaloverhead.Adetailedstudyaboutthecostofsampling-basedmethodstoestimatejoinquerysizesappearsin[ 58 ].Thepapersystematicallyanalysesthefactorswhichinuencethecostofasampling-basedmethodtoestimatejoinselectivities.Basedontheiranalysis,theirndingscanbesummarizedasfollows:(a)Whenthemeasureofprecisionoftheestimateisabsolute,thecostofsamplingincreaseswiththenumberofrelationsinvolvedinthejoinaswellasthesizesoftherelationsthemselves.(b)Whenthemeasureofprecisionoftheestimateisrelative,thecostofusingsamplingincreaseswiththesizesoftherelations,butdecreasesasthenumberofinputrelationsincrease.(c)Whenthedistributionofthejoinattributevaluesisuniformorhighlyskewedforallinputrelations,thecostofsamplingtendstobelow,whileitishighwhenonlysomeoftheinputrelationshaveaskewedjoinattributevaluedistribution.(d)Thepresenceoftuplesinarelationwhichdonotjoinwithanyothertuplesfromotherrelationsalwaysincreasesthecostofsampling.Haasetal.[ 56 57 ]studyandcomparetheperformanceofnewaswellasprevioussampling-basedproceduresforestimatingtheselectivityofquerieswithjoins.Inparticulartheyidentifyestimatorswhichhaveaminimumvarianceafteraxednumberofsamplingstepshavebeenperformed.Theynotethatuseofindexesoninputrelationscanfurther 21

PAGE 22

35 ]describehowtoestimatethesizeofajoininthepresenceofskewinthedatabyusingatechniquecalledbifocalsampling.Thistechniqueclassiestuplesofeachinputrelationintotwogroups,sparseanddense,basedonthenumberoftupleswiththesamevalueforthejoinattribute.Everycombinationofthesegroupsisthensubjecttodierentestimationprocedures.Eachoftheseestimationproceduresrequireasamplesizelargerthanacertainvalue(intermsofthetotalnumberoftuplesintheinputrelation)toprovideanestimatewithinasmallconstantfactorofthetruejoinsize.Inordertoguaranteeestimateswiththespeciedaccuracy,bifocalsamplingalsorequiresthetotaljoinsizeandthejoinsizesfromsparse-sparsesubjoinstobegreaterthanacertainthreshold.GibbonsandMatias[ 40 ]introducetwosampling-basedsummarystatisticscalledcon-cisesamplesandcountingsamplesandpresenttechniquesfortheirfastandincrementalmaintenance.Althoughthepaperdescribessummarystatisticsratherthanon-the-ysamplingtechniques,thesummarystatisticsarecreatedfromrandomsamplesoftheunderlyingdataandareactuallydenedtodescribecharacteristicsofarandomsampleofthedata.Sincesummarystatisticsofarandomsamplerequiremuchlesseramountofmemorythanthesampleitself,thepaperdescribeshowinformationfromamuchlargersamplecanbestoredinagivenamountofmemorybystoringsamplestatisticsinsteadofusingthememorytostoreactualrandomsamples.Thus,theauthorsclaimthatsinceinformationfromalargersamplecanbestoredbytheirsummarystatisticstheaccuracyofapproximateanswerscanbeboosted.Chaudhuri,MotwaniandNarasayya[ 22 24 ]presentadetailedstudyoftheproblemofecientlysamplingtheoutputofajoinoperationwithoutactuallycomputingthe 22

PAGE 23

63 ]proposeasystemcalledOnlineAggregation(OLA)thatcansupportonlineexecutionofanalytic-styleaggregationqueries.Theyproposethesystemtohaveavisualinterfacewhichdisplaysthecurrentestimateoftheaggregatequeryalongwitherrorboundsatacertaincondencelevel.Then,astimeprogresses,thesystemcontinuallyrenestheestimateandatthesametimeshrinksthewidthoftheerrorbounds.Theuserwhoispresentedwithsuchavisualinterface,hasatalltimes,anoptiontoterminatefurtherexecutionofthequeryincasetheerrorboundwidthissatisfactoryforthegivencondencelevel.TheauthorsproposetheuserandomsamplingfrominputrelationstoprovideestimatesinOLA.Further,theydescribesomeofthekeychangesthatwouldberequiredinaDBMStosupportOLA.In[ 51 ],HaasdescribesstatisticaltechniquesforcomputingerrorboundsinOLA.TheworkonOLAeventuallygrewintotheUCBerkeleyCONTROLproject.Intheirarticle[ 62 ],Hellersteinetal.describevariousissuesinprovidinginteractivedataanalysisandpossibleapproachestoaddressthoseissues.HaasandHellerstein[ 53 54 ]proposeafamilyofjoinalgorithmscalledripplejoinstoperformrelationaljoinsinanOLAframework.Ripplejoinsweredesignedtominimizethetimeuntilanacceptablypreciseestimateofthequeryresultismadeavailable,asopposedtominimizingthetimetocompletionofthequeryasinatraditionalDBMS.For 23

PAGE 24

87 ]presentanonlineparallelhashripplejoinalgorithmtospeeduptheexecutionoftheripplejoinespeciallywhenthejoinselectivityislowandalsowhentheuserwishestocontinueexecutiontillcompletion.Thealgorithmisassumedtobeexecutedataxedsetofprocessornodes.Ateachnode,ahashtableismaintainedforeveryrelation.Moreovereverybucketineachhashtablecouldhavesometuplesstoredinmemoryandsomeothersstoredondisk.Thejoinalgorithmproceedsintwophases;intherstphasetuplesfrombothrelationsareretrievedinarandomorderanddistributedtotheprocessornodessothateachnodewouldperformroughlythesameamountofworkforexecutingthejoin.Byusingmultiplethreadsateachnode,productionofjointuplesfromthein-memoryhashtablebucketsbeginsevenastuplesarebeingdistributedtothevariousprocessors.Thesecondphasebeginsafterredistributionfromtherstphaseiscomplete.Inthisphase,anewin-memoryhashtableiscreatedwhichusesahashingfunctiondierentfromthefunctionusedinphase1.Thetuplesinthedisk-residentbucketsofthehashtableofphase1arethenhashedaccordingtothehashingfunctionofphase2andjoined.Thealgorithmprovidesaconsiderablespeed-upfactorovertheone-noderipplejoin,provideditsmemoryrequirementsaremet.Jermaineetal.[ 73 74 ]pointoutthatthedrawbackofboththeripplejoinalgorithmsdescribedaboveisthatthestatisticalguaranteesprovidedbytheestimatorarevalidonlyaslongastheoutputofthejoincanbeaccomodatedinmainmemory.Inordertocounteractthisproblem,theyproposetheSort-Merge-Shrinkjoinalgorithmasageneralizationoftheripplejoinwhichcanprovideerrorguaranteesthroughoutexecution, 24

PAGE 25

48 ].Theyprovideanoverviewoftheestimatorsusedinthedatabaseandstatisticsliteratureandalsodevelopseveralnewsampling-basedestimatorsforthedistinctvalueestimationproblem.Theyproposeanewhybridsamplingestimatorwhichexplicitlyadaptstodierentlevelsofdataskew.TheirhybridestimatorperformsaChi-squaretesttodetectskewinthedistributionoftheattributevalue.Ifthedataappearstobeskewed,thenShlosser'sestimatorisusedwhileifthetestdoesnotdetectskew,asmoothed-jackknifeestimator(whichisamodicationoftheconventionaljackknifeestimator)isused.Theauthorsattributeadearthofworkforsampling-basedestimationofthenumberofdistinctvaluestotheinherentdicultyoftheproblemwhilenotingthatitisamuchharderproblemthanestimatingtheselectivityofajoin.HaasandStokes[ 50 ]presentadetailedstudyoftheproblemofestimatingthenumberofclassesinanitepopulation.Thisisequivalenttothedatabaseproblemofestimatingthenumberofdistinctvaluesinarelation.Theauthorsmakerecommendationsaboutwhichstatisticalestimatorisappropriatesubjecttoconstraintsandnallyclaimfromempiricalresultsthatahybridestimatorwhichadaptsaccordingtodataskewisthemostsuperiorestimator. 25


[16], which establishes a negative result stating that no sampling-based estimator for estimating the number of distinct values can guarantee small error across all input distributions unless it examines a large fraction of the input data. They also present a Guaranteed Error Estimator (GEE) whose error is provably no worse than their negative result. Since the GEE is a general estimator providing optimal error over all distributions, the authors note that its accuracy may be lower than some previous estimators on specific distributions. Hence, they propose an estimator called the Adaptive Estimator (AE) which is similar in spirit to Haas et al.'s hybrid estimator [50], but unlike the latter, is not composed of two distinct estimators. Rather, the AE considers the contribution of data items having high and low frequencies in a single unified estimator. In the AQUA system [41] for approximate answering of queries, Acharya et al. [6] propose using synopses for estimating the result of relational join queries involving foreign-key joins rather than using random samples from the base relations. These synopses are actually precomputed samples from a small set of distinguished joins and are called join synopses in the paper. The idea of join synopses is that by precomputing samples from a small set of distinguished joins, these samples can be used for estimating the result of many other joins. The concept is applicable in a k-way join where each join involves a primary and foreign key of the participating relations. The paper describes that if workload information is available, it can be used to design an optimal allocation for the join synopses that minimizes the overall error in the approximate answers over the workload. Acharya et al. [5] propose using a mix of uniform and biased samples for approximately answering queries with a GROUP-BY clause. Their sampling technique, called congressional sampling, relies on using precomputed samples which are a hybrid union of uniform and biased samples. They assume that the selectivity of the query predicate is not so low that their precomputed sample completely misses one or more groups from the result of the query.


[4] for constructing the congressional samples. Ganti et al. [37] describe a biased sampling approach which they call ICICLES to obtain random samples which are tuned to a particular workload. Thus, if a tuple is chosen by many queries in a workload, it has a higher probability of being selected in the self-tuning sample as compared to tuples which are chosen by fewer queries. Since this is a non-uniform sample, traditional sampling-based estimators must be adapted for these samples. The paper describes modified estimators for the common aggregation operations. It also describes how the self-tuning samples are tuned in the presence of a dynamically changing workload. Chaudhuri et al. [18] note that uniform random sampling to estimate aggregate queries is ineffective when the distribution of the aggregate attribute is skewed or when the query predicate has a low selectivity. They propose using a combination of two methods to address this problem. Their first approach is to index separately those attribute values which contribute significantly to the query result. This method is called Outlier Indexing in the paper. The second approach proposed in the paper is to exploit workload information to perform weighted sampling. According to this technique, records which satisfied many queries in the workload are sampled more than records that satisfied fewer queries. Chaudhuri, Das and Narasayya [19, 20] describe how workload information can be used to precompute a sample that minimizes the error for the given workload. The problem of selection of the sample is framed as an optimization problem so that the error in estimation of the workload queries using the resulting sample is minimized. When the actual incoming queries are identical to queries in the workload, this approach gives a solution with minimal error across all queries. The paper also describes how the choice of


[10] note that a uniformly random sample can lead to inaccurate answers for many queries. They observe that for such queries, estimation using an appropriately biased sample can lead to more accurate answers as compared to estimation using uniformly random samples. Based on this idea, the paper describes a technique called small group sampling which is designed to approximately answer aggregation queries having a GROUP-BY clause. The distinctive feature of this technique as compared to previous biased sampling techniques like congressional sampling is that a new biased sample is chosen for every GROUP-BY query, such that it maximizes the accuracy of estimating the query rather than trying to devise a biased sample that maximizes the accuracy over an entire workload of queries. According to this technique, larger groups from the output of the GROUP-BY queries are sampled uniformly while the small groups are sampled at a higher rate to ensure that they are adequately represented. The group samples are obtained on a per-query basis from an overall sample which is computed in a pre-processing phase. In fact, database sampling has been recognized as an important enough problem that ISO has been working to develop a standard interface for sampling from relational database systems [55], and significant research efforts are directed at providing sampling from database systems by vendors such as IBM [52].

[106, 107]. The technique proposed is called antisampling and involves creation of a special auxiliary structure called a database abstract. The abstract considers the distribution of several attributes and groups of attributes. Correlations between different attributes can also be characterized as statistics. This technique was found to be faster than random sampling, but required domain knowledge about the various attributes.


[110] and Piatetsky-Shapiro and Connell [102]. Selectivity estimation of queries with multidimensional predicates using histograms was presented by Muralikrishna and DeWitt [92]. They show that the maximum error in estimation can be controlled more effectively by choosing equi-depth histograms as opposed to equi-width histograms. Ioannidis [70] describes how serial histograms are optimal for aggregate queries involving arbitrary join trees with equality predicates. Ioannidis and Poosala [71] have also studied how histograms can be used to approximately answer non-aggregate queries which have a set-based result. Several histogram construction schemes [42, 45, 72] have been proposed in the literature. Jagadish et al. [72] describe techniques for constructing histograms which can minimize a given error metric, where the error is introduced because of approximation of values in a bucket by a single value associated with the bucket. They also describe techniques for augmenting histograms with additional information so that they can be used to provide accuracy guarantees on the estimated results. Construction of approximate histograms by considering only a random sample of the dataset was investigated by Chaudhuri et al. [23]. Their technique uses an adaptive sampling approach to determine the sample size that would be sufficient to generate approximate histograms which can guarantee pre-specified error bounds in estimation. They also extend their work to consider duplicate values in the domain of the attribute for which a histogram is to be constructed. The problem of estimation of the number of distinct value combinations of a set of attributes has been studied by Yu et al. [121]. Due to the inherent difficulty of developing a good, sampling-based estimation solution to the problem, they propose using additional information about the data in the form of histograms, indexes or data cubes. In a recent paper [28], Dobra presents a study of when histograms are best suited for approximation. The paper considers the long-standing assumption that histograms are


[89] and also for computing aggregates over data cubes [118, 119]. Chakrabarti et al. [15] present techniques for approximate computation of results for aggregate as well as non-aggregate queries using Haar wavelets. One more summary structure that has been proposed for approximating the size of joins is sketches. Sketches are small-space summaries of data suited for data streams. A sketch generally consists of multiple counters corresponding to random variables, which enable them to provide approximate answers with error guarantees for a priori decided queries. Some of the earliest work on sketches was presented by Alon, Gibbons, Matias and Szegedy [7, 8]. Sketching techniques with improved error guarantees and faster update times have been proposed as Fast-Count sketches [117]. A statistical analysis of various sketching techniques, along with recommendations on their use for estimating join sizes, appears in [108].

[44] for processing of analytic-style aggregation queries over data warehouses. The paper describes a generalization of the SQL GROUP BY operator to multiple dimensions by introducing the data cube operator. This operator treats each of the possible aggregation attributes as a dimension of a high-dimensional space. The aggregate of a particular set of attribute values is considered as a point in this space. Since the cube holds precomputed aggregate values over all dimensions, it can be used to quickly compute results to GROUP-BY queries over multiple dimensions. The data cube is precomputed


[79] and quotient cube tree [80] structures are such compressed representations of the data cube which preserve semantic relationships while also allowing processing of point and range queries. Another approach that has been employed in shrinking the data cube while at the same time preserving all the information in it is the Dwarf [113, 114] structure. Dwarf identifies and eliminates redundancies in prefixes and suffixes of the values along different dimensions of a data cube. The paper shows that by eliminating prefix as well as suffix redundancies, both dense as well as sparse data cubes can be compressed effectively. The paper also shows improved cube construction time, query response time, as well as update time as compared to cube trees [105]. Although the Dwarf structure improves the performance of the data cube model, it still suffers from the inherent drawback of the data cube model: it is not suitable to efficiently answer arbitrarily complex queries such as queries with correlated subqueries. Recently, a new column-oriented architecture for database systems called C-Store was proposed by Stonebraker et al. [115]. The system has been designed for an environment that has a much higher number of database reads as opposed to writes, such as a data warehousing environment. C-Store logically splits attributes of a relational table into projections, which are collections of attributes, and stores them on disk such that all values


[115], the system was still under development.


[91] have become vital data management tools. In particular, random sampling is one of the most important sources of randomness for such algorithms. Scores of algorithms that are useful over large data repositories either require a randomized input ordering for data (i.e., an online random sample), or else they operate over samples of the data to increase the speed of the algorithm. Although applications requiring randomization abound in the data management literature, we specifically consider online aggregation [54, 62, 63] in this thesis. In online aggregation, database records are processed one-at-a-time, and used to keep the user informed of the current "best guess" as to the eventual answer to the query. If the records are input into the online aggregation algorithm in a randomized order, then it becomes possible to give probabilistic guarantees on the relationship of the current guess to the eventual answer to the query. Despite the obvious importance of random sampling in a database environment and dozens of recent papers on the subject (approximately 20 papers from recent SIGMOD and VLDB conferences are concerned with database sampling), there has been relatively little work towards actually supporting random sampling with physical database file organizations. The classic work in this area (by Olken and his co-authors [98, 99, 101]) suffers from a key drawback: each record sampled from a database file requires a random disk I/O. At a current rate of around 100 random disk I/Os per second per disk, this means that it is possible to retrieve only 6,000 samples per minute. If the goal is fast approximate query processing or speeding up a data mining algorithm, this is clearly unacceptable.


[96] or by scanning a randomly permuted file. In general, the view can produce samples from a predicate involving any attribute having a natural ordering, and a straightforward extension of the ACE Tree can be used for sampling from multi-dimensional predicates. The resulting sample is online, which means that new samples are returned continuously as time progresses, and in a manner such that at all times, the set of samples returned is a true random sample of all of the records in the view.

[96] in a slightly different context, where the goal was to maintain a fixed-size sample of a database; in contrast, as we describe subsequently, our materialized sample view is a structure allowing online sampling.


[9, 96, 99-101] that sample directly from a relational selection predicate, thus avoiding the aforementioned problem of obtaining too few relevant records in the sample. Olken [96] presents a comprehensive analysis and comparison of many such techniques. In this Section we discuss the technique of sampling from a materialized view organized as a ranked B+-Tree, since it has been proven to be the most efficient existing iterative sampling technique in terms of number of disk accesses. A ranked B+-Tree is a regular B+-Tree whose internal nodes have been augmented with information which permits one to find the ith record in the file. Let us assume that the relation SALE presented in the Introduction is stored as a ranked B+-Tree file indexed on the attribute DAY, and we want to retrieve a random sample of records whose DAY attribute value falls between 11-28-2004 and 03-02-2005. This translates to the following SQL query:

SELECT * FROM SALE
WHERE DAY >= '11-28-2004' AND DAY <= '03-02-2005'


1. Find the rank r1 of the record which has the smallest DAY value falling within the query range.
2. Find the rank r2 of the record which has the largest DAY value falling within the query range.
3. While the sample size is smaller than required, pick a random rank between r1 and r2 and retrieve the corresponding record from the tree.
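For concreteness, a minimal Python sketch of this iterative sampling procedure is given below. It is an illustration only: an in-memory sorted list stands in for the on-disk ranked B+-Tree (so each indexed access models one random-disk-I/O rank lookup), and the identifiers ranked_sample and keys are ours, not part of any published implementation.

import bisect
import random

def ranked_sample(keys, lo, hi, n):
    # keys must be sorted; keys[r] models fetching the record of rank r.
    r1 = bisect.bisect_left(keys, lo)        # rank of smallest key >= lo
    r2 = bisect.bisect_right(keys, hi) - 1   # rank of largest key <= hi
    if r1 > r2:
        return []                            # no record falls in the range
    # Each iteration is one random probe into the leaf level of the tree.
    return [keys[random.randint(r1, r2)] for _ in range(n)]

# Example: draw 5 samples (with replacement) from keys in [30, 65].
print(ranked_sample(sorted(random.sample(range(1000), 200)), 30, 65, 5))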

Figure 3-1 depicts an example leaf node in the ACE Tree with attribute range values written above each section and section numbers marked below. Records within each section are shown as circles.


Figure 3-1. Structure of a leaf node of the ACE tree.

[27]. Each internal node has the following components:
1. A range R of key values associated with the node.
2. A key value k that splits R and partitions the data on the left and right of the node.
3. Pointers ptrl and ptrr, that point to the left and right children of the node.
4. Counts cntl and cntr, that give the number of database records falling in the ranges associated with the left and right child nodes. These values can be used, for example, during evaluation of online aggregation queries which require the size of the population from which we are sampling [54].
Figure 3-2 shows the logical structure of the ACE Tree. I_{i,j} refers to the jth internal node at level i. The root node is labeled with a range I_{1,1}.R = [0-100], signifying that all records in the dataset have key values within this range. The key of the root node partitions I_{1,1}.R into I_{2,1}.R = [0-50] and I_{2,2}.R = [51-100]. Similarly, each internal node divides the range of its descendants with its own key. The ranges associated with each section of a leaf node are determined by the ranges associated with each internal node on the path from the root node to the leaf. For example, if we consider the path from the root node down to leaf node L4, the ranges that we encounter along the path are 0-100, 0-50, 26-50 and 38-50. Thus for L4, L4.S1 has a random sample of records in the range 0-100, L4.S2 has a random sample in the range


Figure 3-2. Structure of the ACE tree.

0-50, L4.S3 has a random sample in the range 26-50, while L4.S4 has a random sample in the range 38-50.

Let Q = [30-65] be our example query postulated over the ACE Tree depicted in Figure 3-2. The query algorithm starts at I_{1,1}, the root node. Since I_{2,1}.R overlaps Q, the algorithm decides to explore the left child node labeled I_{2,1} in Figure 3-2. At this point, the two range values associated with the left and right children of I_{2,1} are 0-25 and 26-50. Since the left child range has no overlap with the query range, the algorithm chooses to explore the right child next. At this child node (I_{3,2}), the algorithm picks leaf node L3 to be the first leaf node retrieved by the index. Records from section 1 of L3 (which totally encompasses Q) are filtered for Q and returned immediately to the consumer of the sample.


Figure 3-3 shows the one random sample from section 1 of L3 which can be used directly for answering query Q.

Figure 3-3. Random samples from section 1 of L3.

Next, the algorithm again starts at the root node and now chooses to explore the right child node I_{2,2}. After performing range comparisons, it explores the left child of I_{2,2}, which is I_{3,3}, since I_{3,4}.R has no overlap with Q. The algorithm chooses to visit the left child node of I_{3,3} next, which is leaf node L5. This is the second leaf node to be retrieved. As depicted in Figure 3-4, since L5.R1 encompasses Q, the records of L5.S1 are filtered and returned immediately to the user as two additional samples from R. Furthermore, section 2 records are combined with section 2 records of L3 to obtain a random sample of records in the range 0-100. These are again filtered and returned, giving four more samples from Q. Section 3 records are also combined with section 3 records of L3 to obtain a sample of records in the range 26-75. Since this range also encompasses R, the records are again filtered and returned, adding four more records to our sample. Finally, section 4 records are stored in memory for later use. Note that after retrieving just two leaf nodes in our small example, the algorithm obtains eleven randomly selected records from the query range. However, in a real index, this number would be many times greater. Thus, the ACE Tree supports "fast first" sampling.


Compare this with sampling via a ranked B+-Tree over the data of Figure 3-2. The B+-Tree sampling algorithm would need to pre-select which nodes to explore. Since four leaf nodes in the tree are needed to span the query range, there is a reasonably high likelihood that the first four samples taken would need to access all four leaf nodes. As the ACE Tree Query Algorithm progresses, it goes on to retrieve the rest of the leaf nodes in the order L4, L6, L1, L7, L2, L8.

Figure 3-4. Combining samples from L3 and L5.


Figure 3-5. Combining two sections of leaf nodes of the ACE tree.


As depicted in Figure 3-5, first we read leaf node L1 and filter the second section in order to produce a random sample of size n1 from Ql, which is returned to the user. Next we read leaf node L3, and filter its second section L3.S2 to produce a random sample of size n2 from Ql, which is also returned to the user. At this point, the two sets returned to the user constitute a single random sample from Ql of size n1 + n2. This means that as more and more nodes are read from disk, the records contained in them can be combined to obtain an ever-increasing random sample from any range query.

As shown in Figure 3-6, we can append the third section from node L3 to the third section from node L1 and filter the result to produce yet another random sample from Ql. This means that sections are never wasted.
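The combining step can be made concrete with a small Python sketch. The representation is a deliberate simplification: each section is assumed to be materialized as a list of keys together with the (lo, hi) range it covers, and all names are illustrative rather than taken from our implementation.

def combine_and_filter(sec_a, range_a, sec_b, range_b, q):
    # sec_a/sec_b: lists of sampled keys; range_a/range_b: (lo, hi) ranges
    # covered by each section; q: (lo, hi) query range. The concatenation is
    # a valid random sample of q only when the section ranges cover q.
    lo = min(range_a[0], range_b[0])
    hi = max(range_a[1], range_b[1])
    if lo > q[0] or hi < q[1]:
        raise ValueError("the two sections do not span the query range")
    return [k for k in sec_a + sec_b if q[0] <= k <= q[1]]

# Example mirroring the text: two section-2 samples over the range 0-50
# combine into one larger sample of a query contained in 0-50.
print(combine_and_filter([3, 41, 18], (0, 50), [27, 9, 35], (0, 50), (10, 40)))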


Figure 3-6. Appending two sections of leaf nodes of the ACE tree.

While the formal statement of the exponentiality property is a bit complicated, the net result is simple: there is always a pair of leaf nodes whose sections can be appended to form a set which can be filtered to quickly obtain a sample from any range query Q0. As an illustration, consider query Q over the ACE Tree of Figure 3-2. Note that the number of database records falling in Q is greater than one-fourth, but less than half the database size. The exponentiality property assures us that Q can be totally covered by appending sections of two different leaf nodes. In our example, this means that Q can be covered by appending section 3 of nodes L4 and L6. If $RC = L4.R3 \cup L6.R3$, then by the invariant given above we can claim that $|\sigma_Q(R)| \geq \frac{1}{2}|\sigma_{RC}(R)|$.


This process is depicted in Figure 3-7. After the dataset is sorted, the median record for the entire dataset is determined (this value is 50 in our example). This record's key will be used as the key associated with the root of the ACE Tree, and will determine L.R2 for every leaf node in the tree. We denote this key value by I_{1,1}.k, since the value serves as the key of the first internal node in level 1 of the tree. After determining the key value associated with the root node, the medians of each of the two halves of the dataset partitioned by I_{1,1}.k are chosen as keys for the two internal nodes at the next level: I_{2,1}.k and I_{2,2}.k, respectively. In the example of Figure 3-7, these


Figure 3-7. Choosing keys for internal nodes.

values are 25 and 75. I_{2,1}.k and I_{2,2}.k, along with I_{1,1}.k, will determine L.R3 for every leaf node in the tree. The process is then repeated recursively until enough medians


Consider again the ACE Tree of Figure 3-2. Figure 3-8 shows the keys of the internal nodes as medians of the dataset R. We also consider two example queries, Q1 and Q2, such that the number of database records falling in Q2 is greater than one-fourth but less than one-half of the database size, while the number of database records falling in Q1 is more than half the database size.

Figure 3-8. Exponentiality property of ACE tree.

(Figure 3-2). Let $RC1 = L4.R2 \cup L8.R2$. Then all the database records fall in RC1. Moreover, since $|\sigma_{Q1}(R)| \geq |R|/2$, we have $|\sigma_{Q1}(R)| \geq \frac{1}{2}|\sigma_{RC1}(R)|$. Similarly, Q2 can be answered by appending section 3 of (for example) L4 and L6. If $RC2 = L4.R3 \cup L6.R3$, then half the database records fall in RC2. Also, since $|\sigma_{Q2}(R)| \geq |R|/4$, we have $|\sigma_{Q2}(R)| \geq \frac{1}{2}|\sigma_{RC2}(R)|$. This can be generalized to obtain the invariant stated in Section 3.4.3.


Figure 3-9. Phase 2 of tree construction.

1. Assign a uniformly generated random number between 1 and h to each record as its section number.
2. Associate an additional random number with the record that will be used to identify the leaf node to which the record will be assigned.
3. Finally, re-organize the file by performing an external sort to group records in a given leaf node and a given section together.
Figure 3-9(a) depicts our example dataset after we have assigned each record a randomly generated section number, assuming four sections in each leaf node. In Step 2, the algorithm assigns one more randomly generated number to each record, which will identify the leaf node to which the record will be assigned. We assume for our example that the number of leaf nodes is $2^{h-1} = 2^3 = 8$. The number to identify the leaf node is assigned as follows.
1. First, the section number of the record is checked. We denote this value as s.


Referring to Figure 3-7, we see that the key of the root node is 50. Since the key of the record is 7, which is less than 50, the record will be assigned to a leaf node in the left subtree of the root. Hence we assign a leaf node between 1 and 4 to this record. In our example, we randomly choose the leaf node 3. For the next record, having key value 10, we see that the section number assigned is 3. To assign a leaf node to this record, we initially compare its key with the key of the root node. Referring to Figure 3-7, we see that 10 is smaller than 50; hence we then compare it with 25, which is the key of the left child node of the root. Since the record key is smaller than 25, we assign the record to some leaf node in the left subtree of the node with key 25 by assigning to it a random number between 1 and 2. The section number and leaf node identifiers for each record are written in a small amount of temporary disk space associated with each record. Once all records have been assigned to leaf nodes and sections, the dataset is re-organized into leaf nodes using a two-pass external sorting algorithm as follows: records are sorted in ascending order of their leaf node number; records with the same leaf node number are arranged in ascending order of their section number. The re-organized dataset is depicted in Figure 3-9(c).
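The following Python sketch illustrates this phase of construction under simplifying assumptions: the dataset fits in memory (so the two-pass external sort becomes an ordinary sort), internal_keys[d] holds the $2^d$ median keys at depth d as in Figure 3-7, and all identifiers are ours rather than the thesis's.

import random

def assign(records, internal_keys, h):
    # Returns (section, leaf, record) triples ready for the final sort.
    out = []
    for rec in records:
        s = random.randint(1, h)            # section number
        node = 0                            # index of current internal node
        for d in range(s - 1):              # first s-1 branches follow the key,
            node = 2 * node + (1 if rec > internal_keys[d][node] else 0)
        for d in range(s - 1, h - 1):       # remaining branches are random
            node = 2 * node + random.randint(0, 1)
        out.append((s, node, rec))          # leaf number in [0, 2**(h-1) - 1]
    # the external sort, modeled in memory: by leaf node, then by section
    return sorted(out, key=lambda t: (t[1], t[0]))

This matches the example in the text: a record with key 7 and section 2 is compared once against the root key 50 and then lands in a random leaf of the left subtree, while a record with key 10 and section 3 follows two key comparisons before its leaf is chosen at random.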


This is illustrated in Figure 3-10, when we compare the paths taken by Stab 1 and Stab 2. The algorithm chooses to traverse to the left child of the root node during the first stab, while during the second stab it chooses to traverse to the right child of the root node. The advantage of retrieving leaf nodes in this back-and-forth sequence is that it allows us to quickly retrieve a set of leaf nodes with the most disparate sections possible in a


Figure 3-10. Execution runs of query answering algorithm with (a) 1 contributing section, (b) 6 contributing sections, (c) 7 contributing sections and (d) 16 contributing sections.

given number of stabs. The reason that we want a non-homogeneous set of nodes is that nodes from very distant portions of a query range will tend to have sections covering large ranges that do not overlap. This allows us to append sections of newly retrieved leaf nodes with the corresponding sections of previously retrieved leaf nodes. The samples obtained can then be filtered and immediately returned.


Figure 3-10 illustrates the choices made by the algorithm at each internal node during four separate stabs. Note that when the algorithm reaches an internal node where the range associated with one of the child nodes has no overlap with the query range, the algorithm always picks the child node that has overlap with the query, irrespective of the value of the indicator bit. The only exception to this is when all leaf nodes of the subtree rooted at an internal node which overlaps the query range have been accessed. In such a case, the internal node which overlaps the query range is not chosen and is never accessed again.


Let root be the root of the ACE Tree
While (!T.lookup(root).done)
  Stab(Q, root)

Procedure Stab (Query Q, Node curr_node)
If (curr_node is an internal node)
  left_node = curr_node->get_left_node();
  right_node = curr_node->get_right_node();
  If (left_node is done AND right_node is done)
    Mark curr_node as done
  Else if (right_node is not done AND left_node is done)
    Stab(Q, right_node);
  Else if (left_node is not done AND right_node is done)
    Stab(Q, left_node);
  Else if (both children are not done)
    If (Q overlaps only with left_node.R)
      Stab(Q, left_node);
    Else if (Q overlaps only with right_node.R)
      Stab(Q, right_node);
    Else // Q overlaps both sides or none
      If (next node is LEFT)
        Stab(Q, left_node);
        Set next node to RIGHT;
      If (next node is RIGHT)
        Stab(Q, right_node);
        Set next node to LEFT;
Else // curr_node is a leaf node
  Combine_Tuples(Q, curr_node);
  Mark curr_node as done

The Combine_Tuples algorithm determines the sections that are required to be combined with every new section s that is retrieved, and then searches for them in the array buckets[]. If all sections are found, it combines them with s and removes them from buckets[]. If it does not find all the required sections in buckets[], it stores s in buckets[].


Procedure Combine_Tuples (Query Q, LeafNode node)
For each section s in node do
  Store the section numbers required to be
    combined with s to span Q, in a list list
  flag = true
  For each section number i in list
    If buckets[] does not have section i
      flag = false
  If (flag)
    Combine all sections from list with s
  Else
    Store s in the appropriate bucket
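A Python rendering of the same bookkeeping may be easier to follow. It is a sketch under stated assumptions: each section is keyed by the range it covers, a caller-supplied needed_partners function reports which other range-keys must be appended to span Q, and the dictionary buckets is an illustrative stand-in for the buckets[] array of the pseudocode.

buckets = {}

def combine_tuples(query, leaf_sections, needed_partners, emit):
    # leaf_sections: {range_key: [records]} for one retrieved leaf node.
    for rkey, recs in leaf_sections.items():
        partners = needed_partners(rkey)
        if all(p in buckets for p in partners):
            # all partner sections are on hand: append, filter, and return
            combined = recs + [r for p in partners for r in buckets.pop(p)]
            emit([r for r in combined if query[0] <= r <= query[1]])
        else:
            buckets[rkey] = recs   # hold the section until partners arrive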


Construction of the k-d ACE Tree then proceeds as in Section 3.5.4, except that the appropriate key attribute is used while performing comparisons with the internal nodes. Finally, the dataset is sorted into leaf nodes as in Figure 3-9(c). Query answering with the k-d ACE Tree can use the Shuttle algorithm described earlier with a few minor modifications. Whenever a section is retrieved by the algorithm, only records which satisfy all predicates in the query should be returned. Also, the mth


Figure 3-11. Sampling rate of an ACE tree vs. rate for a B+-tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 0.25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.

sections of two leaf nodes can be combined only if they match in all m dimensions. The nth sections of two leaf nodes can be appended only if they match in the first n-1 dimensions and form a contiguous interval over the nth dimension.


Figure 3-12. Sampling rate of an ACE tree vs. rate for a B+-tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.

We compare against a sequential file scan as well as with the obvious extension of Antoshenkov's algorithm to a two-dimensional R-Tree.


Figure 3-13. Sampling rate of an ACE tree vs. rate for a B+-tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.

1. ACE Tree Query Algorithm: The ACE Tree was implemented exactly as described in this thesis. In order to use the ACE Tree to aid in sampling from the SALE relation, a materialized sample view for the relation was created, using SALE.DAY as the indexed attribute.
2. Random sampling from a B+-Tree: Antoshenkov's algorithm for sampling from a ranked B+-Tree was implemented as described in Algorithm 1. The B+-Tree used in the experiment was a primary index on the SALE relation (that is, the underlying data were actually stored within the tree), and was constructed using the standard B+-Tree bulk construction algorithm.
3. Sampling from a randomly permuted file: We implemented this random sampling technique as described in Section 3.2.1 of this chapter. This is the standard sampling technique used in previous work on online aggregation. The SALE relation was randomly permuted by assigning a random key value k to each record. All of the records from SALE were then sorted in ascending order of each k value using a two-phase, multi-way merge sort (TPMMS) (see Garcia-Molina et al. [38]). As the sorted records are written back to disk in the final pass of the TPMMS, k is removed from the file. To sample from a range predicate using a randomly permuted file, the


Figure 3-14. Sampling rate of an ACE tree vs. rate for a B+-tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph is an extension of Figure 3-12 and shows results until all three sampling techniques return all the records matching the query predicate.

file is scanned from front to back and all records matching the range predicate are immediately returned.

For the first set of experiments, we synthetically generated the SALE relation to be 20 GB in size with 100-byte records, resulting in around 200 million records in the relation. We began the first set of experiments by sampling from 10 different range selection predicates over SALE using the three sampling techniques described above. 0.25% of the records from MySam satisfied each range selection predicate. For each of the three random sampling algorithms, we recorded the total number of random samples retrieved by the algorithm at each time instant. The average number of random samples obtained for each of the ten queries was then calculated. This average is plotted as a percentage of the total number of records in SALE along the Y-axis in Figure 3-11. On the X-axis, we have plotted the elapsed time as a percentage of the time required to scan the entire relation. We chose


Figure 3-15. Number of records needed to be buffered by the ACE Tree for queries with (a) 0.25% and (b) 2.5% selectivity. The graphs show the number of records buffered as a fraction of the total database records versus time plotted as a percentage of the time required to scan the relation.


Figure 3-12 and Figure 3-13. For all the three figures, results are shown for the first 15 seconds of execution, corresponding to approximately 4% of the time required to scan the relation. We show an additional graph in Figure 3-14 for the 2.5% selectivity case, where we plot results until all the three record retrieval algorithms return all the records matching the query predicate. Finally, we provide experimental results to indicate the number of records that are needed to be buffered by the ACE Tree query algorithm for two different query selectivities. Figure 3-15(a) shows the minimum, maximum and the average number of records stored for ten different queries having a selectivity of 0.25%, while Figure 3-15(b) shows similar results for queries having selectivity 2.5%.

Experiment 2. For the second set of experiments, we add an additional attribute AMOUNT to the SALE relation and test the following two-dimensional range query:

1. Sampling with a two-dimensional ACE Tree, constructed as described in Section 3.7. It was used to create a materialized sample view over the DAY and AMOUNT attributes.
2. Random sampling from an R-Tree: Antoshenkov's algorithm for sampling from a ranked B+-Tree was extended in the obvious fashion for sampling from an R-Tree [46]. Just as in the case of the B+-Tree, the R-Tree is created as a primary index, and the data from the SALE relation are actually stored in the leaf nodes of the tree. The R-Tree was constructed in bulk using the well-known Sort-Tile-Recursive [81] bulk construction algorithm.


The average number of random samples obtained is plotted along the Y-axis in Figure 3-16. On the X-axis, we have plotted the elapsed time as a percentage of the time required to scan the entire relation. The test was then repeated with two more selection predicates that are satisfied by 2.5% and 25% of the SALE relation's records, respectively. The results are plotted in Figure 3-17 and Figure 3-18, respectively.


Figure 3-16. Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 0.25% of the database tuples.

the randomly-permuted file is almost useless due to the fact that the chance that any given record is accepted by the relational selection predicate is very low. On the other hand, the B+-Tree (and the R-Tree over multi-dimensional data) performs relatively well for highly selective queries. The reason for this is that during the sampling, if the query range is small, then all the leaf pages of the B+-Tree (or R-Tree) containing records that match the query predicate are retrieved very quickly. Once all of the relevant pages are in the buffer, the sampling algorithm does not have to access the disk to satisfy subsequent sample requests, and the rate of record retrieval increases rapidly. However, for less selective queries, the randomly-permuted file works well since it can make use of an efficient, sequential disk scan to retrieve records. As long as a relatively large fraction of the records retrieved match the selection predicate, the amount of waste incurred by scanning unwanted records as well is small compared to the additional efficiency gained by the sequential scan. On the other hand, when the range associated with a query having


Figure 3-17. Sampling rate of an ACE tree vs. rate for an R-tree, and scan of a randomly permuted file with a spatial selection predicate accepting 2.5% of the database tuples.

high selectivity is very large, the time required to load all of the relevant B+-Tree (or R-Tree) pages into memory using random disk I/Os is prohibitive. Even if the query is run long enough that all of the relevant pages are touched, for a query with high selectivity the buffer manager cannot be expected to buffer all the B+-Tree (or R-Tree) pages that contain records matching the query predicate. This is the reason that the curve for the B+-Tree in Figure 3-13, or for the R-Tree in Figure 3-18, never leaves the y-axis for the time range plotted. The net result of this is that if an ACE Tree were not used, it would probably be necessary to use both a B+-Tree and a randomly-permuted file in order to ensure satisfactory performance in the general case. Again, this is a point which seems to strongly favor use of the ACE Tree. An observation we make from Figure 3-14 is that if all the three record retrieval algorithms are allowed to run to completion, we find that the ACE Tree is not the first to


Figure 3-18. Sampling rate of an ACE tree vs. rate for an R-tree, and scan of a randomly permuted file with a spatial selection predicate accepting 25% of the database tuples.

complete execution. Thus, there is generally a crossover point beyond which the sampling rate of an alternative random sampling technique is higher than the sampling rate of the ACE Tree. However, the important point is that such a transition always occurs very late in the query execution, by which time the ACE Tree has already retrieved almost 90% of the possible random samples. We found this trend for all the different query selectivities we tested, with single-dimensional as well as multi-dimensional ACE Trees. Thus, we emphasize that the existence of such a crossover point in no way belittles the utility of the ACE Tree, since in practical applications where random samples are used, the number of random samples required is very small. Since the ACE Tree provides the desired number of random samples (and many more) much faster than the other two methods, it still emerges as the top performer among the three methods for obtaining random samples. Finally, Figure 3-15 shows the memory requirement of the ACE Tree to store records that match the query predicate but cannot be used as yet to answer the query. The


It can be seen from Figure 3-15 that the ACE Tree has a reasonable memory requirement, since a very small fraction of the total number of records is buffered by it.


[111] is applied. Specifically, one could maintain the differential file as a randomly permuted file or even a second ACE Tree, and when a relational selection query is posed, in order to draw a random sample from the query one selects the next sample from either the primary ACE Tree or the differential file with an appropriate hypergeometric probability (for an idea of how this could be done, see the recent paper of Brown and Haas [12] for a discussion of how to draw a single sample from multiple dataset partitions). Thus, we argue that the lack of an algorithm to update the ACE tree incrementally may not be a tremendous drawback. Finally, we close the chapter by asserting that the importance of having indexing methods that can handle insertions incrementally is often overstated in the research


[93]. Such structures still require on the order of one random I/O per update, rendering it impossible to efficiently process bulk updates consisting of millions of records without simply rebuilding the structure from scratch. Thus, we feel that the drawbacks associated with the ACE Tree do not prevent its utility in many real-world situations.


[88], histograms [92] and sketches [29]. Not the least of those is generality: it is very easy to efficiently draw a sample from a large dataset in a single pass using reservoir techniques [34]. Then, once the sample has been drawn, it is possible to guess, with greater or lesser accuracy, the answer to virtually any statistical query over those sets. Samples can easily handle many different database queries, including complex functions in relational selection and join predicates. The same cannot be said of the other approximation methods, which generally require more knowledge of the query during synopsis construction, such as the attribute that will appear in the SELECT clause of the SQL query corresponding to the desired statistical calculation. However, one class of aggregate queries that remain difficult or impossible to answer with samples are the so-called "subset" queries, which can generally be written in SQL in the form:

SELECT SUM(f1(r))
FROM R AS r
WHERE f2(r) AND NOT EXISTS
  (SELECT * FROM S AS s WHERE f3(r,s))

Note that the function f2 can be incorporated into f1 if we have f1 evaluate to zero if f2 is not true; thus, in the remainder of the chapter we will ignore f2. An example of such


[17, 49], but aggregates over DISTINCT queries remain an open problem. Similarly, it is possible to write an aggregate query, where records with identical values may appear more than once in the data but should be considered no more than once by the aggregate function, as a subset-based SQL query. For example:

SELECT SUM(e.SAL)
FROM EMP AS e
WHERE NOT EXISTS
  (SELECT * FROM EMP AS e2 WHERE id(e)

[17, 49, 50] and one method that requires an index on the inner relation [75], there is also little relevant work in the data management literature; we presume this is due to the difficulty of the problem; researchers have considered the difficulty of the more limited problem of sampling for distinct values in some detail [17].

Our Contributions


[39]. Bayesian methods generally make use of mild and reasonable distributional assumptions about the data in order to greatly increase estimation accuracy, and have become very popular in statistics in the last few decades. Using this method in the context of answering subset-based queries presents a number of significant technical challenges whose solutions are detailed in this chapter, including:
- The definition of an appropriate generative statistical model for the problem of sampling for subset-based queries.
- The derivation of a unique Expectation Maximization algorithm [26] to learn the model from the database samples.
- The development of algorithms for efficiently generating many new random datasets from the model, without actually having to materialize them.
Through an extensive set of experiments, we show that the resulting biased Bayesian estimator has excellent accuracy on a wide variety of data. The biased estimator also has the desirable property that it provides something closely related to classical confidence bounds, which can be used to give the user an idea of the accuracy of the associated estimate.


[75], but we present it here because it forms the basis for the unbiased estimator described in the next section. We begin our description with an even simpler estimation problem. Given a one-attribute relation R(A) consisting of $n_R$ records, imagine that our goal is to estimate the sum over attribute A of all the records in R. A simple, sample-based estimator would be as follows. We obtain a random sample $R'$ of size $n_{R'}$ of all the records of R, compute $total = \sum_{r \in R'} r.A$, and then scale up total to output $total \times n_R / n_{R'}$ as the estimate for the final sum. Not only is this estimator extremely simple to understand, but it is also unbiased, consistent, and its variance reduces monotonically with increasing sample size. We can extend this simple idea to define an estimator for the NOT EXISTS query considered in the introduction. We start by obtaining random samples $EMP'$ and $SALE'$ of sizes $n_{EMP'}$ and $n_{SALE'}$, respectively, from the relations EMP and SALE. We then evaluate the NOT EXISTS query over the samples of the two relations. We compare every record in $EMP'$ with every record in $SALE'$, and if we do not find a matching record (that is, one for which f3 evaluates to true), then we add its f1 value to the estimated total. Lastly, we scale up the estimated total by a factor of $n_{EMP}/n_{EMP'}$ to obtain the final estimate, which we term M:

$$M = \frac{n_{EMP}}{n_{EMP'}} \sum_{e \in EMP'} f_1(e)\, I(\mathrm{cnt}(e, SALE') = 0)$$
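A direct Python sketch of M follows; it assumes the samples are materialized as Python lists and that f1 and f3 are supplied as functions, with all names illustrative.

def concurrent_estimator(emp_sample, sale_sample, n_emp, f1, f3):
    # Sum f1 over sampled EMP records with no match in the SALE sample,
    # then scale up by the inverse EMP sampling fraction.
    total = sum(f1(e) for e in emp_sample
                if not any(f3(e, s) for s in sale_sample))
    return total * n_emp / len(emp_sample)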


This estimator was first proposed in [75], where it is called the "concurrent estimator" since it samples both relations concurrently. Unfortunately, on expectation, the estimator is often severely biased, meaning that it is, on average, incorrect. The reason for this bias is fairly intuitive. The algorithm compares a record from EMP with all records from $SALE'$, and if it does not find a matching record in $SALE'$, it classifies the record as having no match in the entire SALE relation. Clearly, this classification may be incorrect for certain records in EMP, since although they might have no matching record in $SALE'$, it is possible that they may match with some record from the part of SALE that was not included in the sample. As a result, M typically overestimates the answer to the NOT EXISTS query. In fact, the bias of M is:


4.3.1 High-Level Description

In order to develop an unbiased estimator for Bias(M), it is useful to first re-write the formula for Bias(M) in a slightly different fashion. We subsequently refer to the set of records in EMP that have i matches in SALE as "class i records". Denote the sum of the aggregate function over all records of class i by $t_i$, so $t_i = \sum_{e \in EMP} f_1(e)\, I(\mathrm{cnt}(e, SALE) = i)$ (note that the final answer to the NOT EXISTS query is the quantity $t_0$). Given that the probability that a record with i matches in SALE happens to have no matches in $SALE'$ is $\varphi(n_{SALE}, n_{SALE'}, i)$, we can re-write the expression for the bias of M as:

$$\mathrm{Bias}(M) = \sum_{i=1}^{m} t_i\, \varphi(n_{SALE}, n_{SALE'}, i) \quad (4\text{-}1)$$

The above equation computes the bias of M since it computes the expected sum over the aggregate attribute of all records of EMP which are incorrectly classified as class 0 records by M. Let m be the maximum number of matching records in SALE for any record of EMP. Equation 4-1 suggests an unbiased estimator for Bias(M) because it turns out that it is easy to generate an unbiased estimate for $t_m$: since no records other than those with m matches in SALE can have m matches in $SALE'$, we can simply count the sum


Equation 4-1 to develop an unbiased estimator for Bias(M). We use the following additional notation for this section and the remainder of this chapter:
- $\delta_{k,i}$ is a 0/1 (non-random) variable which evaluates to 1 if the ith tuple of EMP has k matches in SALE and evaluates to 0 otherwise.
- $s_k$ is the sum of f1 over all records of $EMP'$ having k matching records in $SALE'$: $s_k = \sum_{i=1}^{n_{EMP'}} I(\mathrm{cnt}(e_i, SALE') = k)\, f_1(e_i)$.
- $\rho'$ is $n_{EMP'}/n_{EMP}$, the sampling fraction of EMP.
- $Y_i$ is a random variable which governs whether or not the ith record of EMP appears in $EMP'$.
- $h(k; n_{SALE}, n_{SALE'}, i)$ is the hypergeometric probability that out of the i interesting records in a population of size $n_{SALE}$, exactly k will appear in a random sample of size $n_{SALE'}$. For compactness of representation, we will refer to this probability as $h(k, i)$ in the remainder of the thesis, since our sampling fraction never changes.
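Note that $h(k, i)$ is an ordinary hypergeometric probability, so it can be computed directly; a one-line sketch, assuming SciPy is available:

from scipy.stats import hypergeom

def h(k, n_sale, n_sale_samp, i):
    # Of i "interesting" records in a population of n_sale, the chance that
    # exactly k land in a without-replacement sample of size n_sale_samp.
    return hypergeom.pmf(k, n_sale, i, n_sale_samp)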


$$\hat{t}_k = \frac{1}{\rho'} \sum_{j=1}^{n_{EMP}} Y_j\, \delta_{k,j}\, f_1(e_j) \quad (4\text{-}2)$$

$$\hat{s}_k = \sum_{j=1}^{n_{EMP}} \sum_{i=k}^{m} Y_j\, \delta_{i,j}\, h(k, i)\, f_1(e_j) \quad (4\text{-}3)$$

The fact that $E[\hat{s}_k] = E[s_k]$ (proven in Section 4.3.3) is significant, because there is a simple algebraic relationship between the various $\hat{s}$ variables and the various $\hat{t}$ variables. Thus, we can express one set in terms of the other, and then replace each $\hat{s}_k$ with $s_k$ in order to derive an unbiased estimator for each $\hat{t}$. The benefit of doing this is that since $s_k$ is defined as the sum of f1 over all records of $EMP'$ having k matching records in $SALE'$, it can be directly evaluated from the samples $EMP'$ and $SALE'$.


Equation 4-3:

$$\hat{s}_{m-r} = \sum_{j=1}^{n_{EMP}} \sum_{i=m-r}^{m} Y_j\, \delta_{i,j}\, h(m-r, i)\, f_1(e_j) = \sum_{i=m-r}^{m} h(m-r, i) \sum_{j=1}^{n_{EMP}} Y_j\, \delta_{i,j}\, f_1(e_j) = \sum_{i=0}^{r} h(m-r, m-r+i) \sum_{j=1}^{n_{EMP}} Y_j\, \delta_{m-r+i,j}\, f_1(e_j) = \sum_{i=0}^{r} h(m-r, m-r+i)\, \rho'\, \hat{t}_{m-r+i}$$

By re-arranging the terms, we get the following important recursive relationship:

$$\hat{t}_{m-r} = \frac{\hat{s}_{m-r} - \rho' \sum_{i=1}^{r} h(m-r, m-r+i)\, \hat{t}_{m-r+i}}{\rho'\, h(m-r, m-r)} \quad (4\text{-}5)$$

For the base case we obtain:

$$\hat{t}_m = a_m \hat{s}_m$$

where $a_m = 1/(\rho' h(m,m))$. By replacing $\hat{s}_{m-r}$ in the above equations with $s_{m-r}$, which is readily observable from the data and has the same expected value, we can obtain a simple recursive algorithm for computing an unbiased estimator for any $t_i$. Before presenting the recursive algorithm, we note that we can re-write Equation 4-5 for $\hat{t}_i$ by replacing $\hat{s}$ with $s$, by changing the summation variable from i to k, and by actually substituting $m-r$ by i:

$$\hat{t}_i = \frac{s_i - \rho' \sum_{k=1}^{m-i} h(i, i+k)\, \hat{t}_{i+k}}{\rho'\, h(i, i)}$$


Recall from Equation 4-1 that the bias of M was expressed as a linear combination of various $t_i$ terms. Using GetEstTi to estimate each of the $t_i$ terms, we can write an estimator for the bias of M as:

$$\widehat{\mathrm{Bias}}(M) = \sum_{i=1}^{m} \varphi(n_{SALE}, n_{SALE'}, i)\, \mathrm{GetEstTi}(i) \quad (4\text{-}7)$$

In the following two subsections, we present a formal analysis of the statistical properties of our estimator.
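The recursion of Equations 4-5 through 4-7 translates almost literally into code. The sketch below is illustrative rather than a transcription of GetEstTi: it assumes s is a list of the observed $s_k$ values, rho is the EMP sampling fraction $\rho'$, and h is a two-argument function h(k, i) such as a partial application of the hypergeometric sketch above.

from functools import lru_cache

def make_get_est_ti(s, rho, h, m):
    @lru_cache(maxsize=None)
    def get_est_ti(i):
        # Equation 4-5, with the observable s_i standing in for s-hat_i;
        # for i = m the sum is empty, which is the base case of Equation 4-6.
        correction = sum(h(i, i + k) * get_est_ti(i + k)
                         for k in range(1, m - i + 1))
        return (s[i] - rho * correction) / (rho * h(i, i))
    return get_est_ti

def est_bias_of_m(s, rho, h, phi, m):
    # Equation 4-7: phi(i) is the probability that a class-i record shows
    # no matches in the SALE sample.
    get_est_ti = make_get_est_ti(s, rho, h, m)
    return sum(phi(i) * get_est_ti(i) for i in range(1, m + 1))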


As given in Equation 4-7, the estimator for the bias of M is composed of a sum of m different estimators. Hence, by the linearity of expectation, the expected value of the estimator can be written as:

$$E[\widehat{\mathrm{Bias}}(M)] = \sum_{i=1}^{m} \varphi(n_{SALE}, n_{SALE'}, i)\, E[\mathrm{GetEstTi}(i)]$$

Thus, to prove that the estimator of Equation 4-7 is unbiased, it would suffice to prove that each of the individual GetEstTi estimators is unbiased. We use mathematical induction to prove the correctness of the various estimators on expectation. As a preliminary step for the proof of unbiasedness, we first derive the expected values of the $s_i$ estimator used by GetEstTi. To do this, we introduce a zero/one random variable $H_{j,k}$ that evaluates to 1 if $e_j$ has k matches in $SALE'$ and 0 otherwise. The expected value of this variable is simply the probability that it evaluates to 1, giving us $E[H_{j,k}] = h(k, \mathrm{cnt}(e_j, SALE))$. With this:

$$E[s_k] = \sum_{j=1}^{n_{EMP}} \rho'\, h(k, \mathrm{cnt}(e_j, SALE))\, f_1(e_j) \quad (4\text{-}9)$$

We are now ready to present a formal proof of unbiasedness of the GetEstTi.

Proof. Using Equation 4-5, the recursive GetEstTi estimator can be re-written as:


Equation 4-9: $E[\mathrm{GetEstTi}(m)] = t_m$.


We notice that the limits of summation of the inner sum of the first term are from i to m. Splitting this term into two terms such that one term has limits of summation from i to i, while the other has limits from i+1 to m:


The above expression can be evaluated using the following rules:
- if $k \neq r$ (that is, $e_k$ and $e_r$ are two different tuples) then $E[H_{k,i} H_{r,j}] \approx h(i, \mathrm{cnt}(e_k, SALE))\, h(j, \mathrm{cnt}(e_r, SALE))$, if we assume that no record s exists in SALE where $f_3(e_k, s) = f_3(e_r, s) = \mathrm{true}$
- if $i = j$ (that is, we are computing $E[s_i^2]$) and $k = r$, then $E[H_{k,i} H_{r,j}] = h(i, \mathrm{cnt}(e_k, SALE))$
- if $i \neq j$ (that is, we are computing $E[s_i s_j]$) and $k = r$, then $E[H_{k,i} H_{r,j}] = 0$, since a record cannot have two different numbers of matches in a sample
- if $k = r$, then $E[Y_k Y_r] = \rho'$
- if $k \neq r$, then $E[Y_k Y_r] \approx \rho'^2$


Figure 4-1. Sampling from a superpopulation.

However, there are two problems related to the variance that may limit the utility of the estimator. First, in order to evaluate the hypergeometric probabilities needed to compute or estimate the variance, we need the value of cnt(e, SALE) for an arbitrary record e of EMP. This information is generally unavailable during sampling, and it seems difficult or impossible to obtain a good estimate for the appropriate probability without having this information. This means that in practice, it will be difficult or impossible to tell a user how accurate the resulting estimate is likely to be. We have experimented with general-purpose methods such as the bootstrap [31] to estimate this variance, but have found that these methods often do an extremely poor job in practice. Second, the variance of the estimator itself may be huge. The $b_i$ coefficients are composed of sums, products and ratios of hypergeometric probabilities, which can result in huge values. Particularly worrisome is the $h(i, i)$ value in the denominator used by GetEstTi. Such probabilities can be tiny; including such a small value in the denominator of an expression results in a very large value that may "pump up" the variance accordingly.


[78]. One simple way to think of a superpopulation is that it is an infinitely large set of records from which the original dataset has been obtained by random sampling. Because the superpopulation is infinite, it is specified using a parametric distribution, which is usually referred to as the prior distribution. Using a superpopulation method, we imagine the following two-step process is used to produce our sample:
1. Draw a large sample of size N from an imaginary infinite superpopulation, where N is the dataset size.
2. Draw a sample of size n from the resulting dataset of size N.

As discussed in Section 4.3 of the thesis, for a given record e from EMP, we know that these three characteristics are:
1. f1(e)
2. cnt(e, SALE), which is the number of SALE records s for which f3(e, s) is true
3. cnt(e, e', SALE) where $e' \neq e$, which is the number of SALE records s for which $f_3(e, s) \wedge f_3(e', s)$ is true
To simplify our task, we will actually ignore the third characteristic and define a model such that this count is always zero for any given record pair. While this may


(described fully in Section 4.5.1). In our model, the various $\mu_i$ values are related as $\mu_i = \beta_s i + \beta_0$, where $\beta_s$ and $\beta_0$ are the only two parameters that need to be learned to determine all the $\mu_i$. Also, in order to avoid over-fitting, we assume that $\sigma^2$ is the variance of f1(e) over all records, rather than modeling and learning variance values for all the individual classes separately. We now define the density function for the superpopulation model corresponding to the GenData algorithm. For a given EMP record e, if $f_1(e) = v$ and $\mathrm{cnt}(e, SALE) = k$, the probability density for e given a parameter set $\Theta$ is given by:


[39] can be made that such extreme freedom is actually a poor choice, and that in "real life" an analyst will have some sort of idea what the various $p_i$ values look like, and a more restrictive distribution providing fewer degrees of freedom should be used. For example, a negative binomial distribution has been assumed for the distinct value estimation problem [90]. Such background knowledge could certainly improve the accuracy of the method. Though we eschew any such restrictions in the remainder of the thesis (except for an assumption of a linear relationship among the $\mu_i$ values; see "Dealing with Over-fitting" in the next section), we note that it would be very easy to incorporate such knowledge into our method. The only change needed is that the EM algorithm described in the next section would need to be modified to incorporate any constraints induced on the various parameters by additional distributional assumptions.


[26] is a general method of finding the maximum-likelihood estimate of the parameters of an underlying distribution from a given dataset when the data is incomplete or has missing values. EM starts out with an initial assignment of values for the unknown parameters and, at each step, recomputes new values for each of the parameters via a set of update rules. EM continues this process until the likelihood stops increasing any further. Since cnt(e, SALE) is unknown, the likelihood function is:

$$L(\Theta \mid \{EMP', SALE'\}) = \prod_{e \in EMP'} \sum_{k=1}^{m} p(f_1(e), k, \mathrm{cnt}(e, SALE') \mid \Theta)$$

We present the derivation of our EM implementation in the Appendix, while here we give only the algorithm. In this algorithm, $\tilde{p}(i \mid \Theta, e)$ denotes the posterior probability for record e belonging to class i. This is the probability that, given the current set of values for $\Theta$, record e belongs to class i.

Procedure EM()
1  Initialize all parameters of $\Theta$; $L_{prev} = -9999$
2  while (true) {
3    Compute $L(\Theta)$ from the sample and assign it to $L_{curr}$
4    if ($(L_{curr} - L_{prev})/L_{prev} < 0.01$) break
5    Compute posterior probabilities for each $e \in EMP'$ and each k
6    Recompute all parameters of $\Theta$ by using the following update rules:
7    $\mu_i = \sum_{e \in EMP'} \tilde{p}(i \mid \Theta', e) f_1(e) \,/\, \sum_{e \in EMP'} \tilde{p}(i \mid \Theta', e)$
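A skeleton of this loop in Python may clarify the control flow. It is a sketch only: the model-specific pieces (the per-record density and the M-step update rules) are left as caller-supplied functions, and all identifiers are illustrative rather than taken from our implementation.

import math

def em(f1_vals, cnt_vals, density, m_step, init_theta, m, tol=0.01):
    # density(v, k, c, theta): density of (f1 = v, class = k) for a record
    # with c matches in SALE'; m_step(theta, post, f1_vals): the update rules.
    theta, l_prev = init_theta, None
    while True:
        rows = [[density(v, k, c, theta) for k in range(m + 1)]
                for v, c in zip(f1_vals, cnt_vals)]
        l_curr = sum(math.log(sum(r)) for r in rows)   # log-likelihood
        if l_prev is not None and (l_curr - l_prev) < tol * abs(l_prev):
            break
        post = [[p / sum(r) for p in r] for r in rows] # posteriors p~(k|theta,e)
        theta = m_step(theta, post, f1_vals)           # M-step
        l_prev = l_curr
    return theta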


[30]. We use the following two methods in our approach:
- Limiting the number of degrees of freedom of the model.
- Using multiple models and combining them to develop our final estimator.
To use the first technique, we restrict our generative model so that the mean aggregate value of all records of any class i is not independent of the mean value of other classes. Rather, we use a simple linear regression model $\mu_i = \beta_s i + \beta_0$. $\beta_s$ and $\beta_0$ are the two parameters of the linear regression model and can be learned easily. This means that once we have learned the two parameters $\beta_s$ and $\beta_0$, the $\mu_i$ values for all other classes can be determined directly by the above relation and will not be learned separately. As mentioned previously, it would also be possible to place distributional constraints upon the vector p in order to reduce the degrees of freedom even more, though we choose not to do this in our implementation. Our second strategy to tackle the over-fitting problem is to learn multiple models rather than working with a single model. These models differ from each other only in that they are learned using our EM algorithm with different initial random settings for their parameters. When generating populations from the models learned via EM (as described in the next subsection), we then rotate through the various models in round-robin fashion.

Are we not done yet? Once the model has been learned, a simple estimator is immediately available to us: we could return $p_0 \mu_0 n_{EMP}$, since this will be the expected query result over an arbitrary database sampled from the model. This is equivalent to first determining a class of databases that the database in question has been randomly


Recall from Section 4.4 that the jth population generated and the sample from that population are $P_j = (EMP_j, SALE_j)$ and $S_j = (EMP'_j, SALE'_j)$, respectively. Let $s_{ij}$ be the value of $s_i$ computed over $S_j$; that is, it is the sum of f1 over all tuples in $EMP'_j$ that have i matches in $SALE'_j$. Our goal in all of this is to construct a weighted estimator:

$$W = \sum_{i=0}^{m} w_i s_i$$


$$\frac{\partial\, \mathrm{SSE}}{\partial w_0} = \sum_j 2 \left( \sum_{i=0}^{m} w_i s_{ij} - q(P_j) \right) s_{0j}$$

If we differentiate with respect to each $w_i$ and set the resulting m+1 expressions to zero, we obtain m+1 linear equations in the m+1 unknown weights. These equations can be represented in the following matrix form:

$$\begin{bmatrix} \sum_j s_{0j}^2 & \sum_j s_{0j} s_{1j} & \cdots & \sum_j s_{0j} s_{mj} \\ \sum_j s_{0j} s_{1j} & \sum_j s_{1j}^2 & \cdots & \sum_j s_{1j} s_{mj} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_j s_{0j} s_{mj} & \sum_j s_{1j} s_{mj} & \cdots & \sum_j s_{mj}^2 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix} = \begin{bmatrix} \sum_j s_{0j}\, q(P_j) \\ \sum_j s_{1j}\, q(P_j) \\ \vdots \\ \sum_j s_{mj}\, q(P_j) \end{bmatrix}$$

The optimal weights can then be easily obtained by using a linear equation solver to solve the above system of equations. Once W has been derived, it is then applied to the original samples $EMP'$ and $SALE'$ in order to estimate the answer to the query. By dividing the SSE obtained via the minimization problem described above by the number of datasets generated, we can also obtain a reasonable estimate of the mean-squared error of W.
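Since these are just the normal equations of a least-squares problem, the weights can be computed in a few lines; a sketch assuming NumPy is available, with S the matrix of $s_{ij}$ values and q the vector of $q(P_j)$ values:

import numpy as np

def optimal_weights(S, q):
    # S: (num_populations x (m+1)) matrix with S[j, i] = s_ij; q: q(P_j).
    A = S.T @ S          # left-hand matrix of the normal equations
    b = S.T @ q          # right-hand vector
    w, *_ = np.linalg.lstsq(A, b, rcond=None)   # robust to singular A
    return w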


1. The distribution of the number of matching records in SALE for each record of EMP
2. The distribution of e.SAL values of all records of EMP
Based on these two important properties, we synthetically generated datasets so that the distribution of the number of matching records for all EMP records follows a discretized Gamma distribution. The Gamma distribution was chosen because it produces positive numbers and is very flexible, allowing a long tail to the right. This means that it is possible to create datasets for which most records in EMP have very few matches, but some have a large number. We chose values of 1, 2 and 5 for the Gamma distribution's shift parameter and values of 0.5 and 1 for the scale parameter. Based on these different values for the shift and scale parameters, we obtained six possible datasets: 1: (shift=1, scale=0.5); 2: (shift=2, scale=0.5); 3: (shift=5, scale=0.5); 4: (shift=1, scale=1); 5: (shift=2, scale=1); and 6: (shift=5, scale=1). For these six datasets, the fractions of EMP records having no matches in SALE (and thus contributing to the query answer) were .86, .59, .052, .63, .27, and .0037, respectively. A plot of the probability that an arbitrary tuple from EMP has m matches in SALE for each of the six datasets is given as Figure 4-2. This shows the wide variety of dataset characteristics we tested.
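As an illustration of the data generation, the following sketch draws match counts from a discretized Gamma distribution, under our reading that the "shift" parameter of the text is the usual Gamma shape parameter and that discretization rounds down; NumPy is assumed, and all names are illustrative.

import numpy as np

def match_counts(n_emp, shift, scale, rng=np.random.default_rng(0)):
    # Number of SALE matches for each of n_emp EMP records.
    return np.floor(rng.gamma(shape=shift, scale=scale, size=n_emp)).astype(int)

# Dataset 1 from the text: shift=1, scale=0.5 -- most records get 0 matches.
counts = match_counts(1_000_000, shift=1, scale=0.5)
print((counts == 0).mean())   # roughly .86, matching the reported fraction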


Figure 4-2. Six distributions used to generate, for each e in EMP, the number of records s in SALE for which f3(e, s) evaluates to true.

We also varied the distribution of the e.SAL values such that the distribution can be one of the following:

As described in Section 4.5.1, the three specific assumptions we made for our superpopulation model were:


2. There exists a linear relationship between the mean aggregate values of the different classes of EMP records, given by $\mu_i = \beta_s i + \beta_0$, where $\beta_s$ is the slope of the straight line connecting the various $\mu_i$ values.
3. The variance of the aggregate attribute values of records of any class is approximately equal to the single model parameter $\sigma^2$.
For each of these three cases, we generate six different datasets using the six different sets of gamma parameters described earlier. Thus we obtain 18 more datasets, where the first six sets violate assumption 1, the next six sets violate assumption 2 and the last six sets violate assumption 3. For each of these 18 datasets, the aggregate attribute value is normally distributed with a mean of 100 and standard deviation of 200, except for the last six sets where different values of standard deviation are chosen for records from different classes. In order to violate assumption 1, we no longer assume a primary key-foreign key relationship between EMP and SALE. To generate a dataset violating this assumption, a set s1 of records of size 100 from EMP is selected. Let max be the largest number of matches in SALE for any record from s1. Then an associated set s2 of max records is added to SALE such that all records in s1 have their matching records in s2. Assumption 2 was violated using $\mu_i = \beta_s j + \beta_0$, where $j \neq i$ (in fact, the j value for a given i is randomly selected from 1...m). Assumption 3 was violated by assuming different values for the variance of records from different classes. We randomly chose these values from the range (100, 15000).

[1], the Synoptic Cloud Reports [3] obtained from the Oak Ridge


Section 4.2. Results from the first 48 synthetic datasets are given in Tables 4-1 and 4-2, while results from the next 18 synthetic datasets (which specifically violate the model assumptions) are presented in Table 4-3. Real-life dataset results are shown in Table 4-4. For each of the test cases, we give the square root of the observed mean-squared error (that is, the standard error) for the biased, unbiased as well as concurrent estimator. Because having an absolute value for the standard error lacks any sort of scale and thus would not be informative, we give the standard error as a percentage of the total aggregate value of all records in the database. For example, for the synthetic datasets, we give the standard error as a percentage of the answer to the query:

SELECT SUM(e.SAL)


Table 4-1. Similarly for the rest of the datasets, the factors are: dataset 2: 1.7; dataset 3: 19; dataset 4: 1.5; dataset 5: 3.7; and dataset 6: 270. For the IMDB and SCR datasets, the factors are between 1 and 5.5, while for the KDD Cup the factors range from 2 (for the high selectivity query) to 40 (for the very low selectivity query). When we tested the queries, we also recorded the number of times (out of ten) that the answer given by the biased estimator was within 2 estimated standard errors of the real answer to the query, and found that for almost all the test cases this number was ten, while only for a couple of test cases this number was found to be nine out of ten. Finally, we measured the computation time required by the biased estimator to initially learn the generative model, then compute weights for the various components of the estimator, and to finally provide an estimate of the query result. We observed that for the synthetic datasets (which consist of 10 million and 50 million records in the two relations) the maximum observed running time of the biased estimator was between 3 and 4 seconds for a 10% sample from each. The vast majority of this time is spent in the EM learning algorithm, which requires O(m|EMP'|i) time, where m is the maximum possible number of matches for a record in EMP with records in SALE, and i is the number


Table 4-1 is that the unbiased estimator has uniformly small error only on those eight tests performed using synthetic dataset 1, where the number of matches for each record $e \in EMP$ is generated using a Gamma distribution with parameters (shift=1, scale=0.5). In this particular dataset, only a very small number of the records are excluded by the NOT EXISTS clause, since 86% of the records in EMP do not have a match in SALE. Furthermore, only a very small number of the records have a large number of matches. Both of these characteristics tend to stabilize the variance of the unbiased estimator, making it a fine choice. For all the other datasets, the unbiased estimator does very poorly for most of the cases. For synthetic data, the estimator's worst performance is for dataset 6, in which less than one percent of the records are accepted by the NOT EXISTS clause and several records from EMP have more than 15 matching records in SALE. In this case, the unbiased estimator is unusable, and the results were particularly poor with correlation between the number of matches and the aggregate value that is summed. For example, in the


                              1% Sample                5% Sample                10% Sample
Gamma  Corred?  Val. Dist.   U(%)     C(%)     B(%)    U(%)     C(%)     B(%)    U(%)     C(%)     B(%)
1      No       a.           7.39    13.32    38.30    2.39    12.62     3.88    1.09    11.89     1.46
1      No       b.           6.69    13.45    37.87    3.04    12.63     5.92    1.08    11.93     1.38
1      No       c.           6.89    12.92    22.59    5.23    12.04     8.18    3.79    11.23     7.09
1      No       d.          16.65     6.32    68.37   15.94     6.19    29.34    9.56     5.94    19.72
1      Yes      a.          11.90    20.90    34.50    4.59    19.94     2.26    3.15    18.68     1.42
1      Yes      b.          13.50    17.80    36.30    4.07    16.37     5.12    1.75    15.50     2.18
1      Yes      c.           7.70    15.06    21.14    5.69    14.06     7.84    3.98    13.13     6.21
1      Yes      d.          18.05     1.04    66.94   16.26     0.52    25.35   12.98     0.41    15.33
2      No       a.          11.79    40.12     6.09    8.10    37.98     3.55    2.43    35.44     3.37
2      No       b.          13.65    39.48     5.00    6.82    37.86     4.83    2.54    35.51     4.03
2      No       c.         179.87    39.20    14.75    6.35    37.00     8.34    4.54    34.44     7.12
2      No       d.          31.60    20.45    43.43   10.24    19.26    12.88    9.99    17.08     6.25
2      Yes      a.          24.70    65.60    21.39   19.83    62.00    18.45    4.78    57.51    13.70
2      Yes      b.          19.34    54.27    12.99   12.61    51.19    12.28    3.46    47.72     7.48
2      Yes      c.         220.14    46.60    23.01   12.19    44.01    12.01    5.10    40.88     5.10
2      Yes      d.          52.61    39.08    39.45   19.62    36.75     5.32    9.20    33.19     2.25
3      No       a.         234.60    92.75    18.61   59.67    84.91    12.22   33.00    76.00     6.28
3      No       b.         315.97    93.29    19.42   70.32    84.68    11.68   34.78    76.05     5.84
3      No       c.         188.17    91.50    20.53   46.14    84.01    18.50   24.92    75.07    15.80
3      No       d.         139.27    72.67    14.24   63.56    67.36    12.18    6.79    59.83     5.33
3      Yes      a.         753.73   189.70    42.19  220.00   172.10    28.99  115.25   151.85    17.02
3      Yes      b.         421.00   146.70    30.93  151.00   133.50    21.05   74.50   118.40    11.99
3      Yes      c.         240.20   119.80    28.28   74.66   109.50    25.99   42.57    97.22    21.86
3      Yes      d.          47.95   144.61    33.85   18.52   130.93    28.69    3.63   114.00    18.63

Table 4-1. Observed standard error as a percentage of SUM(e.SAL) over all $e \in EMP$ for 24 synthetically generated datasets. The table shows errors for three different sampling fractions (1%, 5% and 10%), and for each of these fractions it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator.


                              1% Sample                5% Sample                10% Sample
Gamma  Corred?  Val. Dist.   U(%)     C(%)     B(%)    U(%)     C(%)     B(%)    U(%)     C(%)     B(%)
4      No       a.         153.70    36.20    14.52   37.17    33.90     4.73   24.47    31.20     0.89
4      No       b.         226.00    37.00    18.56   50.32    33.95     5.27   42.87    31.11     1.33
4      No       c.         242.70    35.20    11.10   19.40    32.85     3.62   17.03    30.04     3.59
4      No       d.         146.37    16.56    45.16   23.60    14.85    21.26    8.85    12.62    16.61
4      Yes      a.         418.70    64.50    10.85  116.55    59.94     2.71   27.55    54.52     1.64
4      Yes      b.         327.02    52.06     8.62   75.95    48.42     3.92   45.62    44.12     2.83
4      Yes      c.         359.60    43.40    13.90   30.19    40.39     7.17   27.21    36.80     5.16
4      Yes      d.          1.1e3    37.53    40.29   54.33    33.99    10.66   18.94    29.32     5.68
5      No       a.         236.00    72.04    13.19   46.18    66.08    12.07   38.30    59.60     6.15
5      No       b.         395.00    72.30    11.78   55.78    66.09    11.73   42.73    59.55     5.37
5      No       c.         167.70    71.10     7.70  120.81    65.20     1.99   62.70    58.50     1.15
5      No       d.         135.65    51.87    13.58   77.12    48.29     4.30   24.14    42.21     4.16
5      Yes      a.         862.00    71.79    31.25  203.81    64.90     7.21   57.22    57.00     2.93
5      Yes      b.         650.80    56.60    28.64  129.75    51.46     6.75   74.16    43.90     1.86
5      Yes      c.         298.70    92.30    11.47  189.70    84.22     4.06   69.63    74.80     2.53
5      Yes      d.         283.26   105.24    10.84  178.61    95.07     9.38  145.78    81.86     3.04
6      No       a.          7.1e3    95.13    19.30   6.2e3    79.49     9.82   4.1e3    63.33     6.09
6      No       b.          1.9e4    95.20    18.40   2.1e3    79.58     9.47   6.6e2    63.40     5.74
6      No       c.          1.9e4    94.32    13.03   1.2e3    78.60     5.96   9.6e2    62.74     1.71
6      No       d.          4.7e4    76.71     7.54   2.0e2    66.87     8.42   68.87    54.96     3.97
6      Yes      a.          5.4e4    307.0    62.00   3.0e4   249.30    30.90   5.7e3   119.00    18.78
6      Yes      b.          4.2e4    214.0    42.70   1.9e4   174.25    21.12   7.0e3   135.00    12.88
6      Yes      c.          3.2e4    156.3    22.70   2.0e3   128.10    10.87   8.7e2   100.12     3.05
6      Yes      d.          1.3e5    234.4    29.78   2.9e3   192.46    28.25   2.4e3   148.28    12.79

Table 4-2. Observed standard error as a percentage of SUM(e.SAL) over all $e \in EMP$ for 24 synthetically generated datasets. The table shows errors for three different sampling fractions (1%, 5% and 10%), and for each of these fractions it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator.

Gamma  Violates  U(1%)     C(1%)   B(1%)    U(5%)    C(5%)   B(5%)    U(10%)   C(10%)  B(10%)
1      (1)       8.83      13.37   62.60    3.12     12.47   15.24    1.19     11.75   4.62
2      (1)       24.66     39.33   34.39    8.14     37.89   2.74     3.41     35.60   2.48
3      (1)       94.11     92.31   21.14    72.94    84.82   16.76    20.27    75.78   13.05
4      (1)       22.30     36.67   37.99    12.72    34.07   7.96     6.34     31.12   2.95
5      (1)       231.50    72.60   6.76     123.30   66.14   6.37     85.68    59.48   4.35
6      (1)       1366.80   95.96   9.99     1.2e3    78.64   5.85     700.0    62.62   1.88
1      (2)       14.18     21.70   100.70   4.42     21.09   26.34    2.69     20.20   12.44
2      (2)       21.62     72.24   59.94    14.25    67.50   7.56     6.25     62.90   4.47
3      (2)       886.2     220.20  45.73    136.0    201.90  31.73    79.75    180.10  25.76
4      (2)       462.0     95.80   106.80   269.19   88.74   22.18    81.03    82.43   11.52
5      (2)       247.60    205.0   18.84    233.0    187.00  17.69    88.55    168.30  9.78
6      (2)       6891.00   369.0   42.30    5988.0   310.00  40.90    1924.00  246.57  19.77
1      (3)       14.70     21.14   61.86    6.24     20.20   10.15    1.13     19.13   2.67
2      (3)       26.15     66.73   29.10    22.49    62.25   20.25    5.38     57.69   17.35
3      (3)       920.10    185.30  41.86    147.60   167.20  30.12    65.63    146.88  27.20
4      (3)       2.3e5     64.42   35.96    714.00   60.54   16.87    150.80   54.77   9.24
5      (3)       1350.30   143.00  33.59    856.00   127.76  29.58    306.70   113.14  10.08
6      (3)       2.2e5     264.02  38.37    4519.10  212.80  34.92    2530.00  162.70  21.96

Table 4-3. Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 18 synthetically generated data sets. The table shows errors for three different sampling fractions (1%, 5% and 10%) and, for each of these fractions, the error for the three estimators: U (unbiased estimator), C (concurrent sampling estimator) and B (model-based biased estimator).

… correlated case with a 1% sample, most of the relative standard errors were more than 40000%. Such very poor results are found sporadically throughout most of the data sets, though the results were somewhat erratic. The reason that the observed errors associated with the unbiased estimator are highly variable is the very long tail of the error distribution. Under many circumstances, most of the answers computed using the unbiased estimator are very good, but there is still a small (though non-negligible) probability of getting a ridiculous estimate whose error is hundreds of times the sum of the aggregate values over the entire EMP relation. In fact, it is interesting to note

that the unbiased estimator's worst performance overall was observed on Q8 over the KDDCup data, where the error was astronomically high: larger than 10^100. In comparison, the biased estimator generally did a very good job of predicting the final query result, and in most cases with a 5% or 10% sampling fraction the observed standard error was less than 10% of the total aggregate value found in EMP. In other words, if the total value of SUM(e.SAL) with no NOT EXISTS clause is x, then for just about any query tested, the standard error was less than x/10, and it was frequently much smaller. This is actually quite impressive when one considers the difficulty of the problem. The primary drawback associated with the biased estimator is its complexity (requiring non-trivial and substantial statistically-oriented computations) and the fact that a significant amount of computation is required, most of it associated with running the EM algorithm to completion. By comparison, the unbiased estimate can be calculated via an almost trivial recursive routine that relies on the calculation of simple hypergeometric probabilities.

One case where the biased estimator had questionable qualitative performance was with the 16 tests associated with data sets 3 and 6. The problem in this case was that …

Data set   Observed standard errors (%)
IMDB       27.67   70.88   3.3e3     17.51   33.44   4.1e2     13.71   14.14
IMDB       75.12   65.10   91.26     62.86   31.97   49.82     52.69   9.31
IMDB       25.21   18.47   3.5e3     16.58   14.38   4.7e2     12.71   1.92
SCR        65.22   10.31   5.0e3     44.97   6.84    8.2e2     23.27   4.41
SCR        59.06   9.42    4.6e3     41.62   7.51    7.8e2     24.07   3.95
KDDCup     60.47   12.39   7.4e4     54.92   10.96   7.6e3     42.08   2.10
KDDCup     41.30   11.24   5.8e83    26.54   4.32    9.3e36    17.04   3.28
KDDCup     15.24   8.46    3.6e172   10.80   1.56    2.3e120   6.35    0.98

Table 4-4. Observed standard error as a percentage of the total aggregate value of all records in the database for 8 queries over 3 real-life data sets. The table shows errors for three different sampling fractions (1%, 5% and 10%) and, for each of these fractions, the error for the three estimators: U (unbiased estimator), C (concurrent sampling estimator) and B (model-based biased estimator).

The first six rows in Table 4-3 show results for data sets in which more than one EMP record can match with a given record from SALE. The results show that violating this assumption of the model in the actual data set did not affect the accuracy of the biased estimator significantly. The next set of six rows in the table shows results for data sets in which there is no linear relationship between the mean aggregate values of the different classes of EMP records. The results show that the biased estimator is about twice as inaccurate over these data sets as compared to corresponding data sets which do not have a strict violation of the assumption. The last six rows in the table show results over data sets in which the variances of the aggregate values of records from different classes are significantly different. The results show that these data sets affect the accuracy of the biased estimator as much as the data sets which violate the "linear relationship of mean values" assumption. However, the results are certainly not poor when these assumptions are violated, and the method still seems to have qualitative performance that may be acceptable for many applications, particularly with a larger sample size.

The results from the eight queries over the three real-life data sets are depicted in Table 4-4. The key difference in the characteristics of the real-life data sets compared …

It is clear from Table 4-4 that the accuracy of the biased estimator is generally quite good over the real data. We also note that the standard error of the biased estimator over the learned superpopulation seems to be a reasonable surrogate for the standard error of the biased estimator in practice. For most biased estimators, it is reasonable to use the standard error of the biased estimator in the same way that one would use the standard deviation of an unbiased estimator when constructing confidence bounds (see Sarndal et al. [109], Section 5.2). According to the Vysochanskii-Petunin inequality [120], any unbiased uni-modal estimator will be within three standard deviations of the correct answer 95% of the time, and according to the more aggressive central limit theorem, an estimator will be within two standard deviations of the correct answer 95% of the time. We observed that in almost all of the tests, ten out of ten of the errors for the biased estimator were actually within two predicted standard errors of zero. This seems to be strong evidence for the utility of the bounds computed using the predicted standard error of the biased estimator.

We finally remark on the time required for the execution of the biased estimator. The biased estimator performs several computations, including learning the model parameters, generating sufficient statistics for several population-sample pairs, and then solving a system of equations to compute weights for the various components of the estimator. As discussed previously, this took no longer than four seconds for the largest samples tested. If this is not fast enough, we point out that it may be possible to speed this up even more, though this is beyond the scope of the thesis. While we used the traditional EM algorithm

to learn the model, one could instead use one of the faster variants [69, 95, 116] of the EM algorithm. These variants of the EM algorithm typically achieve faster convergence time by partially implementing the Expectation and/or the Maximization step of the EM algorithm.

… [97]. Other classic efforts at sampling-based estimation over database data are the adaptive sampling of Lipton and Naughton [83, 84] for join query selectivity estimation, and the sampling techniques of Hou et al. [64, 65] for aggregate queries. More recent well-known work on sampling is that on online aggregation by Haas, Hellerstein, and their colleagues [47, 60, 61].

The sampling-based database estimation problem that is closest to the one studied in this chapter is that of sampling for the number of distinct values in a database. As discussed in the introduction to this chapter, a solution to the problem of estimation over subset-based queries is a solution to the problem of estimating the number of distinct values in a database, since the latter problem can be written as a NOT EXISTS query. The classic paper in distinct value estimation is due to Haas et al. [49]. For a survey of the state-of-the-art work on this problem in databases through the year 2000, we refer the reader to the Introduction of the paper by Charikar et al. on the topic [17]. The paper of Bunge and Fitzpatrick [13] provides a survey of work in the statistics area, current through the early 1990's. Work in statistics continues on this problem to this day. In fact, a recent paper from statistics by Mingoti [90] on the distinct value problem provided inspiration for our use of superpopulation techniques. Though the problems of distinct value estimation and subset-based aggregate estimation are related, we note that the problem of estimating the number of distinct values is a very restricted version of the problem we study in this thesis, and it is not immediately clear how arbitrary solutions to the distinct value problem can be generalized …
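To make this reduction concrete, the following query counts the distinct values of an attribute att by accepting only those records for which no record with a smaller key shares the same att value. The relation and column names here are our own illustration (they are not taken from the experiments), and id is assumed to be a unique key:

SELECT COUNT(*)
FROM R AS r1
WHERE NOT EXISTS (SELECT *
                  FROM R AS r2
                  WHERE r2.att = r1.att
                  AND r2.id < r1.id)

Any estimator that can approximate such NOT EXISTS aggregates can therefore also be used for distinct-value estimation.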

… [43]. As discussed in Section 4.5.1 of the thesis, one of the most controversial decisions made in the development of the latter estimator was our choice of a very general prior distribution. To a statistician from the so-called "Bayesian" school [39], this may be seen as a poor choice, and a Bayesian statistician may argue that a more descriptive prior distribution, if appropriate, would increase the accuracy of the method. This would certainly be true if the selected distribution were a good match for the actual data distribution. In our work, however, we have consciously chosen generality and its associated drawbacks in place of specificity. Our experimental results seem to argue that for a variety of different …

… [47]. This means that the join itself must be modeled, which is a problem for future work. Another problem for future work is arbitrary levels of nesting: an inner query may itself be linked with another inner query via a NOT EXISTS or similar clause.

… [8]. We consider very selective queries because they are the one class of queries that is hardest to handle approximately without workload knowledge: if a query references only a few tuples from the data set, then it is very hard to make sure that a synopsis structure (such as a sample) will contain the information needed to answer the query.

The most natural method for handling highly selective queries using sampling is to make use of stratification [25]. In order to answer an aggregate query over a relation, one could first (offline) partition the relation's tuples into various subsets so that similar tuples are grouped together, the assumption being that the relational selection predicate associated with a given query will tend to favor certain strata. Even if a given query is very selective, at least one or two of the strata will have a relatively heavy concentration of tuples that will contribute to the query answer. When the query is processed, those "important" strata can be sampled first and more heavily than the others. This is illustrated with the following example:

Example 1: The relation MOVIE(MovieYear, Sales) is partitioned into two strata as follows: …

The query Q is then issued:

SELECT SUM(Sales)
…

While stratification may be very useful, it is not a new idea. It has been studied in statistics for decades, and it has been suggested previously as a way to make approximate aggregate query processing more accurate [18-20]. However, in the context of databases, researchers have previously considered only half of the problem: how to divide the database into strata. This may actually be the easier and less important half of the problem, since even the relatively naive partitioning strategy we use in our experiments can give excellent results. The equally fundamental problem we consider in this paper is how to allocate samples to strata when actually answering the query. More specifically, given a budget of n samples, how does one choose how to "spend" those samples on the various strata in order to achieve the greatest accuracy?

The classic allocation method from statistics is the Neyman allocation, and it is the one advocated previously in the database literature [19]. The key difficulty with applying the Neyman allocation in practice is that it requires extensive knowledge of certain statistical characteristics of each stratum with respect to the incoming query. In practice …

… [14] that allow us to take into account any prior expectation (such as the expected efficacy of the stratification) in a principled fashion. We carefully evaluate our methods experimentally, and show that if one is very careful in developing a sampling plan, even a naive partitioning of samples to strata that uses no workload information can achieve dramatic accuracy for very selective queries. Our methods are very general. They can be used with any partitioning (such as those proposed by Chaudhuri et al. [18-20]), or even in cases where the partitioning is not user-defined and is imposed by the problem domain (for example, when the various "strata" are different data sources in a distributed environment). Our methods can also be extended to more complicated relational operations such as joins, though this problem is beyond the scope of the paper.

$\hat{Y} = \sum_{i=1}^{L} \frac{N_i}{n_i} \sum_{j=1}^{n_i} f(r_{ij})$

An unbiased estimator for each within-stratum variance is given by the usual sample variance,

$\hat{\sigma}_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} \left( f(r_{ij}) - \bar{y}_i \right)^2$

where $\bar{y}_i$ is the mean of the sampled f() values from stratum i. An estimate $\hat{\sigma}^2$ of the variance of $\hat{Y}$ is then obtained from Equation 5-2 by simply replacing all of the $\sigma_i^2$ terms with their corresponding unbiased estimators $\hat{\sigma}_i^2$. Central-Limit-Theorem-based confidence bounds [112] for $\hat{Y}$ can then be computed as $\hat{Y} \pm z_p \hat{\sigma}$, where $z_p$ is the z-score for the desired confidence level. If desired, more conservative confidence bounds from the literature (such as Chebyshev-based bounds [112]) can also be used. Finally, we note that aggregate queries like COUNT and AVG can also be handled by stratified sampling estimators like the one described above by using ratios of two different estimates. Aggregate queries with a GROUP BY clause can also be answered by using

… [54], though that is beyond the scope of the paper.

… Equation 5-1. Since $\hat{Y}$ is unbiased, minimizing its error is equivalent to minimizing its variance. An optimization problem can be formulated for the choice of the n_i values so that the variance σ² is minimized; solving this problem leads to the well-known Neyman allocation [25] from statistics. Specifically, the Neyman allocation states that the variance of a stratified sampling estimator is minimized when the sample size n_i is proportional to the size of the stratum, N_i, and to the variance of the f() values in the stratum, σ_i². That is,

$n_i = n \cdot \frac{N_i \sigma_i^2}{\sum_{j=1}^{L} N_j \sigma_j^2}$

… the example of Section 5.2.1. The number of records from R1 accepted by f2() is 10, while the number of records from R2 accepted by f2() is 1000. Further, let f1(r) ∼ N(1000, 100) for all r ∈ R1 and f1(r) ∼ N(10, 100) for all r ∈ R2, where N(μ, σ) denotes a normal distribution with mean μ and variance σ². We use a pilot sample of 100 records to estimate the variance of the f() values in each stratum. These estimates are σ̂₁² and σ̂₂². If the desired sample size is n = 1000, the estimated variances can be used with Equation 5-4 to obtain an estimate for the optimal sampling allocation as follows:

$n_1 = 1000 \cdot \frac{\hat{\sigma}_1^2}{\hat{\sigma}_1^2 + \hat{\sigma}_2^2} \qquad n_2 = 1000 \cdot \frac{\hat{\sigma}_2^2}{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}$

… (Equation 5-2), since this variance would be used to report confidence bounds to the user. We then compute the average estimated variance across the 1000 iterations. Finally, we use the true variances of both strata to obtain an optimal sample allocation, and repeat the above experiment using the optimal allocation. We summarize the results in the following table:

True query result      20150
Avg. observed bias     10200
Avg. estimated MSE     0.76 million
Avg. observed MSE      100 million
MSE of true optimal    58.6 million
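For reference, the mechanics of the pilot-driven allocation in this example can be sketched in a few lines of Python. The sketch is ours, not the thesis's implementation, and it follows Cochran's classic statement of the Neyman allocation [25], which weights stratum i by N_i times its estimated standard deviation:

||||||||||||||||||||||||||||
import random
import statistics

def neyman_allocation(strata, n, pilot_size, f, rng=random.Random(0)):
    # strata: list of lists of records; f: the aggregated function f()
    # Returns a per-stratum allocation of the n samples based on a pilot
    # sample drawn from each stratum. Rounding may leave the total a few
    # samples off n.
    weights = []
    for stratum in strata:
        pilot = rng.sample(stratum, min(pilot_size, len(stratum)))
        sigma_hat = statistics.pstdev(f(r) for r in pilot)  # estimate of sigma_i
        weights.append(len(stratum) * sigma_hat)            # N_i * estimated sigma_i
    total = sum(weights)
    if total == 0.0:                  # pilot saw no variability at all
        return [n // len(strata)] * len(strata)
    return [round(n * w / total) for w in weights]
||||||||||||||||||||||||||||

The example above shows the danger of this scheme: when the pilot sample misses the handful of records that f2() accepts in a stratum, the stratum's estimated variability can be wildly wrong, and the resulting allocation starves exactly the stratum that matters most.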

… [14] called the Bayes-Neyman allocation that can incorporate such intuition into the process in a principled fashion. In general, Bayesian methods formally model such prior intuition or belief as a probability distribution. Such methods then refine the distribution by incorporating additional information (in our case, information from the pilot sample) to obtain an overall improved probability distribution. At the highest level, the proposed Bayes-Neyman allocation works as follows: …

… [33]. This means that we view the probability p_i that an arbitrary tuple from stratum i will be accepted by the relational selection predicate f2() as being the result of a random sample from the Beta distribution, which produces a result from 0 to 1. Since we view each tuple as a separate and independent application of f2(), the number of tuples from stratum i that are accepted by f2() is then binomially distributed …

Figure 5-1. Beta distribution with parameters α = β = 0.5.

Given this setup, the first task is to choose the set of Beta parameters that control the distribution of each p_i so as to match the reality of what a typical value of p_i will be for each stratum. The Beta distribution is a parametric distribution and requires two input parameters, α and β. Depending on the parameters that are selected, the Beta can take a large variety of shapes and skews. Choosing α and β for the ith stratum is equivalent to supplying our "intuition" to the method, stating what our initial belief is regarding the probability that an arbitrary record will be accepted by f2(). There are two possibilities for setting those initial parameters.

The first possibility is to use workload information. We could monitor all previously observed queries over each and every stratum, where we observe that for query i and stratum j the probability that a given record was accepted by f2() was p_ij. Then, assuming that {p_ij ∀ i, j} are all samples from our generative Beta prior, we simply estimate α and β from this set using any standard method. An estimate for the Beta parameters based upon the principle of Maximum Likelihood Estimation can easily be derived [112].

A second method is to simply assume that the stratification we choose usually works well. In this case, most strata will either have a very low or a very high percentage of their records accepted by f2(). Choosing α = β = 0.5 results in a U-shaped distribution that matches this intuition exactly, and is a common choice for a Beta prior. The resulting Beta is illustrated in Figure 5-1. In practice we find that this produces excellent results.
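A minimal sketch of this generative view (our illustration, assuming numpy; the stratum size and random seed are arbitrary):

||||||||||||||||||||||||||||
import numpy as np

rng = np.random.default_rng(42)

# The zero-workload prior alpha = beta = 0.5 is the U-shaped Beta of
# Figure 5-1: a stratum is believed to be either mostly accepted or
# mostly rejected by f2().
alpha, beta = 0.5, 0.5

N_i = 10_000                      # tuples in stratum i
p_i = rng.beta(alpha, beta)       # acceptance probability drawn from the prior
cnt_i = rng.binomial(N_i, p_i)    # number of stratum-i tuples accepted by f2()
||||||||||||||||||||||||||||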

Section 5.5 will update α and β as needed to take into account the information present in the pilot sample.

Producing the Vector of Counts

…

… [33], just as the Beta distribution is the standard conjugate prior for a binomial distribution. The Dirichlet is the multi-dimensional generalization of the Beta. A k-dimensional Dirichlet distribution makes use of the parameter vector α = ⟨α₁, α₂, …, α_k⟩. Just as in the case of the Beta prior used by Xcnt, the Dirichlet prior requires an initial set of parameters that represent our initial belief. Since we typically have no knowledge about how likely it is that a given f1() value will be selected by f2(), the simplest initial assumption to make is that all values are equally likely. In the case of the Dirichlet distribution, using α_i = 1 for all i is the typical zero-knowledge prior [33]. Given α, it is then a simple matter to sample from X′, as we describe formally in the next subsection. We note that although this initial parameter choice may be inaccurate, in Bayesian fashion the parameters will be made more accurate based upon the information present in the pilot sample. Section 5.5 provides details of how the update is accomplished.

Producing the Vector μ′

… Section 5.4.2.

||||||||||||||||||||||||||||
Algorithm GetMoments(α₁, …, α_L, D) {
  // Let α_i denote the Dirichlet parameters for stratum i
  // Let D be an array of all distinct values from the range of f1()
  …

… Section 5.4.3, the one remaining problem regarding how to sample from X′ is the problem of having a very large (or even unknown) range for the function f1(). In this case, dealing with the vectors D and V may be impossible, for both storage and computational reasons. The simple solution to this problem is to break the range of f1() into a number of buckets and make use of a histogram over the range, rather than using the range itself. In this case, D is generalized to be an array of histogram buckets, where each entry in D has summary information for a group of distinct f1() values. Each entry in D has the following four specific pieces of information:

1. low and high, which are the lower and upper bounds for the f1() values that are found in this particular bucket.
2. μ₁, which is the mean of the f1() values that are found in this particular bucket. That is, if A is the set of distinct values from low to high, then $\mu_1 = \frac{1}{|A|} \sum_{a \in A} a$.
…

One appropriate method for constructing D is to use a standard histogram construction technique [42, 45, 72] over the attribute that is to be queried. In the case that multiple attributes might be queried, one histogram can be constructed for each attribute. This is the method that we test experimentally. Another appropriate method is to construct D on-the-fly by making use of the pilot sample that is used to compute the sampling plan. This has the advantage that any arbitrary f1() can be handled at runtime. Again, any appropriate histogram construction scheme can be used, but rather than constructing D offline using the entire relation R, f1() …

The GetMoments algorithm of Section 5.4.3 must be modified so as to handle the modified D. The following is an appropriately modified GetMoments; we call it GetMomentsFromHist.

||||||||||||||||||||||||||||
Algorithm GetMomentsFromHist(α₁, …, α_L, D) {
  // Let α_i denote the vector of Dirichlet parameters for stratum i
  // Let D be an array of histogram buckets
  // Let μ′ = ⟨μ′₁, …, μ′_L⟩ be a vector of moments of all strata
  for (int i = 1; i <= L; i++) {
    p ← Dirichlet(α_i)
    μ₁ = μ₂ = 0
    // Let V be an array of counts for each bucket
    V ← Multinomial(cnt_i, p)
    for (int j = 1; j <= |D|; j++) {
      μ₁ += V[j] · D[j].μ₁
      μ₂ += V[j] · D[j].μ₂
    }
    μ₁ /= cnt_i
    μ₂ /= cnt_i
    μ′_i ← (μ₁, μ₂)
  }
  return μ′
}
||||||||||||||||||||||||||||
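The following Python sketch mirrors GetMomentsFromHist (our translation, not code from the thesis); it assumes numpy and represents each bucket of D as a dict carrying the bucket's μ₁ and μ₂ summaries under the keys "mu1" and "mu2":

||||||||||||||||||||||||||||
import numpy as np

def get_moments_from_hist(alphas, cnts, D, rng=np.random.default_rng()):
    # alphas[i]: Dirichlet parameter vector for stratum i (one entry per bucket)
    # cnts[i]:   cnt_i, the number of stratum-i tuples accepted by f2()
    # D[j]:      histogram bucket j, a dict with "mu1" and "mu2" summaries
    moments = []
    for alpha_i, cnt_i in zip(alphas, cnts):
        if cnt_i == 0:
            moments.append((0.0, 0.0))
            continue
        p = rng.dirichlet(alpha_i)       # bucket probabilities from the prior
        V = rng.multinomial(cnt_i, p)    # accepted tuples spread over buckets
        mu1 = sum(v * b["mu1"] for v, b in zip(V, D)) / cnt_i
        mu2 = sum(v * b["mu2"] for v, b in zip(V, D)) / cnt_i
        moments.append((mu1, mu2))
    return moments                       # the vector mu'
||||||||||||||||||||||||||||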

In Section 5.4, we described how we assign initial values to the parameters of the two prior distributions: the Beta and the Dirichlet distributions. In this section, we explain how these initial values can be refined by using information from a pilot sample to obtain corresponding posterior distributions. Updating these priors using the pilot sample in the proposed Bayes-Neyman approach is analogous to using the pilot sample to estimate the stratum variances in the classic Neyman allocation. The update rules described in this section are fairly straightforward applications of the standard Bayesian update rules [14].

The Beta distribution has two parameters, α and β. Let R_pilot denote the pilot sample and let s denote the number of records that are accepted by the predicate f2(). Thus, |R_pilot| − s will be the number of records that fail to be accepted by the query. Then, the following update rules can be used to directly update the α and β parameters of the Beta distribution:

α ← α + s
β ← β + (|R_pilot| − s)

The Dirichlet distribution is updated similarly. Recall that this distribution uses a vector of parameters, α = ⟨α₁, α₂, …, α_k⟩, where k is the number of dimensions. To update the parameter vector, we can use the same pilot sample that was used to update the Beta, as follows. We initialize to zero all elements of an array count of size k. These elements denote counts of the number of times that different values from the range of f1() appear in the pilot sample and are accepted by f2(). The following update rule can be used to update all the different parameters of the Dirichlet distribution:

α_i ← α_i + count_i

Algorithm UpdatePriors describes exactly how pilot sampling is used to update the parameters of the prior Beta and Dirichlet distributions for the ith stratum.
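In code, the two conjugate updates above are one-liners; the following is a minimal sketch (ours) for a single stratum:

||||||||||||||||||||||||||||
def update_beta(alpha, beta, pilot_size, s):
    # s: number of pilot records accepted by f2()
    return alpha + s, beta + (pilot_size - s)

def update_dirichlet(alpha_vec, count):
    # count[j]: accepted pilot records whose f1() value falls in bucket j
    return [a + c for a, c in zip(alpha_vec, count)]
||||||||||||||||||||||||||||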

… Equation 5-2 of the thesis. Our situation differs from the classic setup only in that (in Bayesian fashion) we now use X to implicitly define a distribution over the per-stratum variance values ⟨σ₁, σ₂, …, σ_L⟩. Thus, we cannot minimize σ² directly, because under the Bayesian regime σ² is now a random variable. Instead, it makes sense to minimize the expected value or average of σ², which (using Equation 5-2) can be computed as:

$E[\sigma^2] = E\left[\sum_{i=1}^{L} \frac{N_i(N_i - n_i)}{n_i} \sigma_i^2\right]$
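Because the expectation distributes across the sum, $E[\sigma^2] = \sum_{i=1}^{L} \frac{N_i(N_i - n_i)}{n_i} E[\sigma_i^2]$, and the same Lagrange-multiplier argument that produces the classic Neyman allocation then allocates $n_i$ in proportion to $N_i\sqrt{E[\sigma_i^2]}$. The per-stratum expectations can be approximated by Monte Carlo over the two-stage generative model. The sketch below is our own assembly of the pieces, not the thesis's pseudocode: it assumes numpy, reuses the bucket representation of D from the GetMomentsFromHist sketch, and assumes each bucket's mu2 field stores the mean of the squared f1() values in that bucket.

||||||||||||||||||||||||||||
import numpy as np

def expected_stratum_variances(beta_params, alphas, D, N, trials=200,
                               rng=np.random.default_rng()):
    # Monte Carlo estimate of E[sigma_i^2] for every stratum: draw the
    # acceptance count from the Beta/Binomial model (Xcnt), spread it over
    # the buckets via the Dirichlet/Multinomial model (X'), and compute the
    # variance of the f() values, which are zero for rejected tuples.
    L = len(N)
    est = np.zeros(L)
    for i in range(L):
        a, b = beta_params[i]
        total = 0.0
        for _ in range(trials):
            k = rng.binomial(N[i], rng.beta(a, b))   # accepted count
            if k == 0:
                continue                             # all f() values are zero
            V = rng.multinomial(k, rng.dirichlet(alphas[i]))
            m1 = sum(v * d["mu1"] for v, d in zip(V, D)) / N[i]
            m2 = sum(v * d["mu2"] for v, d in zip(V, D)) / N[i]
            total += m2 - m1 * m1
        est[i] = total / trials
    return est

def bayes_neyman_allocation(n, N, exp_var):
    # Allocate n samples proportionally to N_i * sqrt(E[sigma_i^2]).
    w = np.asarray(N) * np.sqrt(exp_var)
    return np.rint(n * w / w.sum()).astype(int)
||||||||||||||||||||||||||||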

5.7.1 Goals

The specific goals of our experimental evaluation are as follows: …

… [2] and has a single relation with over 9.5 million records. The data has twelve numerical attributes and one categorical attribute with 29 categories. The third is the KDD data set, which is the data set from the 1999 KDD Cup event. This data set has 42 attributes with status information regarding various network connections for intrusion detection. This data set consists of around 5 million records with integer, real-valued, as well as categorical attributes.

Queries Tested. For each data set, we test queries of the form:

SELECT SUM(f1(r))
FROM R AS r
WHERE f2(r)

f1() and f2() vary depending upon the data set. For the GMM data set, f1() projects one of the three different numerical attributes (each query projects a random attribute). For the Person data set, either the TotalIncome attribute or the WageIncome attribute are

projected. For the KDD data set, either the src_bytes or the dst_bytes attributes are projected. For each of the data sets, three different classes of selection predicates encoded by f2() are used. Each class has a different selectivity. The three selectivity classes for f2() have selectivities of (0.01% ± 0.001%), (0.1% ± 0.01%), and (1.0% ± 0.1%), respectively. For the GMM data set, f2() is constructed by rolling a three-faced die to decide how many attributes will be included in the conjunction computed by f2(). The appropriate number of attributes are then randomly selected from among the six GMM attributes. If a categorical attribute is chosen as one of the attributes in f2(), then the attribute will be checked with either an equality or inequality condition over a randomly-selected domain value. If a numerical attribute is chosen, then a range predicate is constructed. For a given numerical attribute, assume that low and high are the known minimum and maximum attribute values. The range is constructed using q_low = low + v₁·(high − low) and q_high = q_low + v₂·(high − q_low), where v₁ and v₂ are randomly chosen real values from the range [0, 1]. For each selectivity class, 50 different queries are generated by repeating the query-generation process until enough queries falling in the appropriate selectivity range have been generated. The f2() functions for the other two data sets are constructed similarly.

Stratification Tested. For each of the various data sets, a simple nearest-neighbor classification algorithm is used to perform the stratification. In order to partition a data set into L strata, L records are first chosen randomly from the data to serve as "seeds" for each of the strata, and all of the other records are added to the stratum whose seed is closest to the data point. For numerical attributes, the L2 norm is used as the distance function. For categorical attributes, we compute the distance using the support from the database for the attribute values [36]. Since each data set has both numerical and categorical data, the actual distance function used is the sum of the two "sub" distance functions. Note that it would be possible to use a much more sophisticated stratification, but actually performing the stratification is not the point of this thesis; our goal is to study how best to use the stratification. A minimal sketch of the seeding procedure follows.
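The sketch below is our own illustration of the seeding procedure; the dist argument is assumed to combine the L2 norm on the numerical attributes with the support-based categorical distance of [36]:

||||||||||||||||||||||||||||
import random

def stratify(records, L, dist, rng=random.Random(0)):
    # Choose L random seed records, then assign every record to the
    # stratum of its nearest seed.
    seeds = rng.sample(records, L)
    strata = [[] for _ in range(L)]
    for r in records:
        nearest = min(range(L), key=lambda i: dist(r, seeds[i]))
        strata[nearest].append(r)
    return strata
||||||||||||||||||||||||||||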

In our experiments, we test L = 1, L = 20, and L = 200. Note that if L = 1 then there is actually no stratification performed, and so this case is equivalent to simple random sampling without replacement and will serve as a sanity check in our experiments.

Tests Run. For the Neyman allocation and our Bayes-Neyman allocation, our test suite consists of 54 different test cases for each data set, plus nine more tests using L = 1. These test cases are obtained by assigning three different values to the following four parameters:

- Number of strata: We use L = 1, L = 20, and L = 200; as described above, L = 1 is equivalent to simple random sampling without replacement.
- Pilot sample size: This is the number of records we obtain from each stratum in order to perform the allocation. We choose values of 5, 20 and 100 records.
- Sample size: This is the total sample size that has to be allocated. We use 50,000, 100,000 and 500,000 samples in our tests.
- Query selectivity: As described above, we test query selectivities of 0.01%, 0.1% and 1%.

Size   Sel (%)   Bandwidth (GMM/Person/KDD)   Coverage (GMM/Person/KDD)
50K    0.01      3.277/2.289/2.140            918/892/921
50K    0.1       1.776/0.514/1.520            926/912/988
50K    1         0.587/0.184/0.210            947/944/942
100K   0.01      2.626/2.108/1.48             922/941/937
100K   0.1       1.273/0.351/0.910            939/948/940
100K   1         0.415/0.128/0.120            948/952/946
500K   0.01      2.192/1.740/0.820            923/943/940
500K   0.1       0.551/0.132/0.630            946/947/942
500K   1         0.178/0.087/0.070            946/947/948

Table 5-1. Bandwidth (as a ratio of error bound width to the true query answer) and coverage (for 1000 query runs) for a simple random sampling estimator over the three data sets. Results are shown for varying sample sizes and for three different query selectivities: 0.01%, 0.1% and 1%.

Each of the 50 queries for each (data set, selectivity) combination is re-run 20 times using 20 different (pilot sample, sample) combinations. Thus, for each (data set, selectivity) combination we obtain results for 1000 query runs in all.

Table 5-1 shows the results for the nine cases where L = 1, that is, where no stratification is performed. We report two numbers: the bandwidth and the coverage. The bandwidth is the ratio of the width of the 95% confidence bounds computed as the result of using the allocation to the true query answer. The coverage is the number of times out of the 1000 trials that the true answer is actually contained in the 95% confidence bounds reported by the estimator. Naturally, one would expect this number to be close to 950 if the bounds are in fact reliable. Tables 5-3 and 5-4 show the results for the 54 different test cases where a stratification is actually performed. For each of the 54 test cases and both of the sampling plans used (the Neyman allocation and the Bayes-Neyman allocation) we again report the bandwidth and the coverage. Finally, Table 5-2 shows the average running times for the two stratified sampling estimators on all three data sets. There is generally around a 50% hit in terms of running time when using the Bayes-Neyman allocation compared to the Neyman allocation.

Data set           Neyman   Bayes-Neyman
Gaussian Mixture   1.5      2.4
Person             2.3      3.1
KDDCup             2.1      2.8

Table 5-2. Average running time of the Neyman and Bayes-Neyman estimators over three real-world data sets.
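Both reported metrics are straightforward to compute from the raw runs; the following sketch (ours) makes the two definitions precise:

||||||||||||||||||||||||||||
def bandwidth_and_coverage(truth, intervals):
    # intervals: one (lo, hi) 95% confidence interval per query run
    bandwidth = sum((hi - lo) / truth for lo, hi in intervals) / len(intervals)
    coverage = sum(1 for lo, hi in intervals if lo <= truth <= hi)
    return bandwidth, coverage   # coverage is out of len(intervals) runs
||||||||||||||||||||||||||||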

NS   PS    SS     Sel    Bandwidth Neyman   Bandwidth Bayes-Neyman   Coverage Neyman   Coverage Bayes-Neyman
                         (GMM/Person/KDD)   (GMM/Person/KDD)         (GMM/Person/KDD)  (GMM/Person/KDD)
20   5     50K    0.01   0.00/0.00/0.00     2.90/0.19/1.12           0/0/0             935/882/927
20   5     50K    0.1    0.03/0.01/0.02     1.27/0.02/0.80           3/49/23           929/939/938
20   5     50K    1      0.05/0.02/0.14     0.39/0.01/0.09           11/247/155        940/950/945
20   5     100K   0.01   0.00/0.00/0.00     2.77/0.16/1.08           0/0/0             936/961/930
20   5     100K   0.1    0.02/0.01/0.01     0.90/0.02/0.73           3/53/28           941/941/938
20   5     100K   1      0.05/0.01/0.03     0.28/0.01/0.08           24/306/170        941/947/947
20   5     500K   0.01   0.01/0.00/0.00     2.05/0.06/0.87           3/0/4             938/948/932
20   5     500K   0.1    0.01/0.00/0.01     0.37/0.01/0.55           10/62/51          954/954/941
20   5     500K   1      0.03/0.01/0.02     0.12/0.00/0.04           38/316/184        957/955/945
20   20    50K    0.01   0.06/0.00/0.04     2.72/0.22/1.06           14/0/5            942/941/938
20   20    50K    0.1    0.17/0.03/0.09     1.21/0.03/0.81           106/61/88         908/938/944
20   20    50K    1      0.21/0.05/0.27     0.34/0.01/0.09           404/692/561       948/948/947
20   20    100K   0.01   0.01/0.00/0.01     2.58/0.16/0.91           23/0/6            941/937/941
20   20    100K   0.1    0.11/0.02/0.06     0.85/0.02/0.74           165/66/107        934/954/939
20   20    100K   1      0.14/0.03/0.09     0.25/0.01/0.06           431/728/612       954/962/953
20   20    500K   0.01   0.01/0.00/0.01     1.93/0.07/0.62           30/0/21           946/943/944
20   20    500K   0.1    0.01/0.01/0.01     0.34/0.01/0.51           230/145/245       942/952/945
20   20    500K   1      0.04/0.01/0.03     0.09/0.00/0.02           447/751/746       943/961/950
20   100   50K    0.01   0.15/0.04/0.08     2.33/0.19/0.82           24/58/20          938/922/938
20   100   50K    0.1    0.26/0.10/0.16     1.09/0.02/0.58           436/204/172       929/949/942
20   100   50K    1      0.47/0.18/0.34     0.32/0.01/0.05           870/891/866       932/962/951
20   100   100K   0.01   0.12/0.03/0.06     2.26/0.16/0.57           29/59/41          935/945/940
20   100   100K   0.1    0.18/0.05/0.11     0.81/0.02/0.40           435/249/355       927/957/942
20   100   100K   1      0.31/0.08/0.02     0.22/0.01/0.04           895/928/914       948/968/943
20   100   500K   0.01   0.01/0.01/0.01     1.72/0.07/0.33           45/66/50          939/952/947
20   100   500K   0.1    0.06/0.02/0.04     0.31/0.01/0.28           474/297/412       954/954/952
20   100   500K   1      0.06/0.02/0.06     0.08/0.00/0.02           926/935/942       950/970/949

Table 5-3. Bandwidth (as a ratio of error bound width to the true query answer) and coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 20 strata and for varying numbers of records in the pilot sample per stratum (PS) and sample sizes (SS), for three different query selectivities: 0.01%, 0.1% and 1%.

NS    PS    SS     Sel    Bandwidth Neyman   Bandwidth Bayes-Neyman   Coverage Neyman   Coverage Bayes-Neyman
                          (GMM/Person/KDD)   (GMM/Person/KDD)         (GMM/Person/KDD)  (GMM/Person/KDD)
200   5     50K    0.01   0.00/0.00/0.00     1.73/0.18/0.91           0/0/0             933/931/924
200   5     50K    0.1    0.00/0.02/0.01     0.97/0.02/0.76           0/56/27           933/953/936
200   5     50K    1      0.05/0.02/0.03     0.26/0.01/0.09           19/162/149        940/960/940
200   5     100K   0.01   0.00/0.01/0.01     1.57/0.13/0.75           0/43/28           936/916/930
200   5     100K   0.1    0.01/0.01/0.01     0.72/0.02/0.64           7/60/41           938/958/936
200   5     100K   1      0.03/0.01/0.01     0.19/0.00/0.08           34/365/212        945/955/947
200   5     500K   0.01   0.01/0.00/0.00     1.20/0.08/0.52           5/45/34           940/939/938
200   5     500K   0.1    0.02/0.01/0.00     0.28/0.01/0.44           22/89/76          946/946/944
200   5     500K   1      0.02/0.01/0.01     0.07/0.00/0.06           45/372/336        954/954/951
200   20    50K    0.01   0.05/0.03/0.04     1.59/0.18/0.85           19/51/21          943/931/934
200   20    50K    0.1    0.11/0.03/0.07     0.75/0.02/0.72           91/70/94          943/953/939
200   20    50K    1      0.09/0.04/0.09     0.18/0.01/0.07           345/627/580       958/962/945
200   20    100K   0.01   0.01/0.01/0.03     1.35/0.14/0.67           22/66/45          948/948/941
200   20    100K   0.1    0.02/0.02/0.04     0.54/0.01/0.54           131/135/128       935/955/949
200   20    100K   1      0.05/0.02/0.05     0.12/0.00/0.06           488/702/643       945/955/952
200   20    500K   0.01   0.01/0.00/0.01     1.04/0.06/0.42           49/83/72          941/954/947
200   20    500K   0.1    0.01/0.00/0.02     0.20/0.00/0.35           210/209/282       955/945/950
200   20    500K   1      0.04/0.01/0.01     0.03/0.00/0.03           617/830/869       948/958/953
200   100   50K    0.01   0.08/0.03/0.06     1.35/0.14/0.54           28/56/39          939/938/939
200   100   50K    0.1    0.20/0.05/0.09     0.56/0.02/0.40           313/357/243       949/949/942
200   100   50K    1      0.10/0.01/0.15     0.14/0.01/0.03           543/823/874       948/948/951
200   100   100K   0.01   0.07/0.02/0.04     1.11/0.12/0.39           47/77/53          938/935/947
200   100   100K   0.1    0.08/0.03/0.06     0.40/0.01/0.28           533/456/427       948/948/951
200   100   100K   1      0.06/0.06/0.08     0.09/0.01/0.02           918/912/930       959/956/952
200   100   500K   0.01   0.01/0.00/0.02     0.89/0.05/0.21           63/91/104         946/936/937
200   100   500K   0.1    0.02/0.01/0.02     0.10/0.00/0.13           580/540/607       945/945/948
200   100   500K   1      0.04/0.03/0.05     0.01/0.00/0.01           936/920/941       960/953/950

Table 5-4. Bandwidth (as a ratio of error bound width to the true query answer) and coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 200 strata with varying numbers of records in the pilot sample per stratum (PS) and sample sizes (SS), for three different query selectivities: 0.01%, 0.1% and 1%.

… [32, 103, 104]. At a high level, the biggest difference between this work and that prior work is the specificity of our work with respect to database queries. Sampling from a database is unique in that the distribution of values that are aggregated is typically ill-suited to traditional parametric models. Due to the inclusion of the selection predicate encoded by f2(), the distribution of the f() values that are aggregated tends to have a large "stovepipe" located at zero, corresponding to those records that are not accepted by f2(), with a more well-behaved distribution of values located elsewhere, corresponding to the f1() values for records that were accepted by f2(). The Bayes-Neyman allocation scheme proposed in this thesis explicitly allows for such a situation via its use of a two-stage model where first a certain number of records are accepted by f2() (modeled via the random variable Xcnt) and then the f1() values for those accepted records are produced (modeled by X′). This is quite different from the general-purpose methods described in the statistics literature, which typically attach a well-behaved, standard distribution to the mean and/or variance of each stratum [32, 104].

Sampling for the answer to database queries has also been studied extensively [63, 67, 96]. In particular, Chaudhuri and his co-authors have explicitly studied the idea of stratification for approximating database queries [18-20]. However, there is a key difference between that work and our own: these existing papers focus on how to break the data into strata, and not on how to sample the strata in a robust fashion. In that sense, our work is completely orthogonal to Chaudhuri et al.'s prior work, and our sampling plans could easily be used in conjunction with the workload-based stratifications that their methods can construct.


REFERENCES

1. IMDB dataset. http://www.imdb.com
2. Person dataset. http://usa.ipums.org/usa
3. Synoptic cloud report dataset. http://cdiac.ornl.gov/epubs/ndp/ndp026b/ndp026b.htm
4. Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: Tech. Report, Bell Laboratories, Murray Hill, New Jersey (1999)
5. Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: SIGMOD, pp. 487-498 (2000)
6. Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: Join synopses for approximate query answering. In: SIGMOD, pp. 275-286 (1999)
7. Alon, N., Gibbons, P.B., Matias, Y., Szegedy, M.: Tracking join and self-join sizes in limited storage. In: PODS, pp. 10-20 (1999)
8. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. In: STOC, pp. 20-29 (1996)
9. Antoshenkov, G.: Random sampling from pseudo-ranked B+ trees. In: VLDB, pp. 375-382 (1992)
10. Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: SIGMOD, pp. 539-550 (2003)
11. Bradley, P.S., Fayyad, U.M., Reina, C.: Scaling clustering algorithms to large databases. In: KDD, pp. 9-15 (1998)
12. Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: ICDE, p. 6 (2006)
13. Bunge, J., Fitzpatrick, M.: Estimating the number of species: A review. Journal of the American Statistical Association 88, 364-373 (1993)
14. Carlin, B., Louis, T.: Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall (1996)
15. Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. The VLDB Journal 10(2-3), 199-223 (2001)
16. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.: Towards estimation error guarantees for distinct values. In: PODS, pp. 268-279 (2000)

17. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.: Towards estimation error guarantees for distinct values. In: PODS, pp. 268-279 (2000)
18. Chaudhuri, S., Das, G., Datar, M., Motwani, R., Narasayya, V.R.: Overcoming limitations of sampling for aggregation queries. In: ICDE, pp. 534-542 (2001)
19. Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach for approximate answering of aggregate queries. In: SIGMOD, pp. 295-306 (2001)
20. Chaudhuri, S., Das, G., Narasayya, V.: Optimized stratified sampling for approximate query processing. ACM TODS, To Appear (2007)
21. Chaudhuri, S., Das, G., Srivastava, U.: Effective use of block-level sampling in statistics estimation. In: SIGMOD, pp. 287-298 (2004)
22. Chaudhuri, S., Motwani, R.: On sampling and relational operators. IEEE Data Eng. Bull. 22(4), 41-46 (1999)
23. Chaudhuri, S., Motwani, R., Narasayya, V.: Random sampling for histogram construction: how much is enough? SIGMOD Rec. 27(2), 436-447 (1998)
24. Chaudhuri, S., Motwani, R., Narasayya, V.: On random sampling over joins. In: SIGMOD, pp. 263-274 (1999)
25. Cochran, W.: Sampling Techniques. Wiley and Sons (1977)
26. Dempster, A., Laird, N., Rubin, D.: Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B. 39 (1977)
27. Diwan, A.A., Rane, S., Seshadri, S., Sudarshan, S.: Clustering techniques for minimizing external path length. In: VLDB, pp. 342-353 (1996)
28. Dobra, A.: Histograms revisited: when are histograms the best approximation method for aggregates over joins? In: PODS, pp. 228-237 (2005)
29. Dobra, A., Garofalakis, M., Gehrke, J., Rastogi, R.: Processing complex aggregate queries over data streams. In: SIGMOD Conference, pp. 61-72 (2002)
30. Domingos, P.: Bayesian averaging of classifiers and the overfitting problem. In: 17th International Conf. on Machine Learning (2000)
31. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman & Hall/CRC (1998)
32. Ericson, W.A.: Optimum stratified sampling using prior information. JASA 60(311), 750-771 (1965)
33. Evans, M., Hastings, N., Peacock, B.: Statistical Distributions. Wiley and Sons (2000)

34. Fan, C., Muller, M., Rezucha, I.: Development of sampling plans by using sequential (item by item) selection techniques and digital computers. Journal of the American Statistical Association 57, 387-402 (1962)
35. Ganguly, S., Gibbons, P., Matias, Y., Silberschatz, A.: Bifocal sampling for skew-resistant join size estimation. In: SIGMOD, pp. 271-281 (1996)
36. Ganti, V., Gehrke, J., Ramakrishnan, R.: CACTUS: clustering categorical data using summaries. In: KDD, pp. 73-83 (1999)
37. Ganti, V., Lee, M.L., Ramakrishnan, R.: ICICLES: self-tuning samples for approximate query answering. In: VLDB, pp. 176-187 (2000)
38. Garcia-Molina, H., Widom, J., Ullman, J.D.: Database System Implementation. Prentice-Hall, Inc. (1999)
39. Gelman, A., Carlin, J., Stern, H., Rubin, D.: Bayesian Data Analysis, Second Edition. Chapman & Hall/CRC (2003)
40. Gibbons, P.B., Matias, Y.: New sampling-based summary statistics for improving approximate query answers. In: SIGMOD, pp. 331-342 (1998)
41. Gibbons, P.B., Matias, Y., Poosala, V.: Aqua project white paper. In: Technical Report, Bell Laboratories, Murray Hill, New Jersey, pp. 275-286 (1999)
42. Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.: Optimal and approximate computation of summary statistics for range aggregates. In: PODS (2001)
43. Goodman, L.: On the estimation of the number of classes in a population. Annals of Mathematical Statistics 20, 572-579 (1949)
44. Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In: ICDE, pp. 152-159 (1996)
45. Guha, S., Koudas, N., Srivastava, D.: Fast algorithms for hierarchical range histogram construction. In: PODS, pp. 180-187 (2002)
46. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMOD Conference, pp. 47-57 (1984)
47. Haas, P., Hellerstein, J.: Ripple joins for online aggregation. In: SIGMOD Conference, pp. 287-298 (1999)
48. Haas, P., Naughton, J., Seshadri, S., Stokes, L.: Sampling-based estimation of the number of distinct values of an attribute. In: 21st International Conference on Very Large Databases, pp. 311-322 (1995)
49. Haas, P., Naughton, J., Seshadri, S., Stokes, L.: Sampling-based estimation of the number of distinct values of an attribute. In: VLDB, pp. 311-322 (1995)

50. Haas, P., Stokes, L.: Estimating the number of classes in a finite population. Journal of the American Statistical Association 93, 1475-1487 (1998)
51. Haas, P.J.: Large-sample and deterministic confidence intervals for online aggregation. In: Statistical and Scientific Database Management, pp. 51-63 (1997)
52. Haas, P.J.: The need for speed: Speeding up DB2 using sampling. IDUG Solutions Journal 10, 32-34 (2003)
53. Haas, P.J., Hellerstein, J.: Join algorithms for online aggregation. IBM Research Report RJ10126 (1998)
54. Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. In: SIGMOD, pp. 287-298 (1999)
55. Haas, P.J., Koenig, C.: A bi-level Bernoulli scheme for database sampling. In: SIGMOD, pp. 275-286 (2004)
56. Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Fixed-precision estimation of join selectivity. In: PODS, pp. 190-201 (1993)
57. Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Selectivity and cost estimation for joins based on random sampling. J. Comput. Syst. Sci. 52(3), 550-569 (1996)
58. Haas, P.J., Naughton, J.F., Swami, A.N.: On the relative cost of sampling for join selectivity estimation. In: PODS, pp. 14-24 (1994)
59. Haas, P.J., Swami, A.N.: Sequential sampling procedures for query size estimation. In: SIGMOD, pp. 341-350 (1992)
60. Hellerstein, J., Avnur, R., Chou, A., Hidber, C., Olston, C., Raman, V., Roth, T., Haas, P.: Interactive data analysis: The CONTROL project. IEEE Computer 32(8), 51-59 (1999)
61. Hellerstein, J., Haas, P., Wang, H.: Online aggregation. In: SIGMOD Conference, pp. 171-182 (1997)
62. Hellerstein, J.M., Avnur, R., Chou, A., Hidber, C., Olston, C., Raman, V., Roth, T., Haas, P.J.: Interactive data analysis: The CONTROL project. In: IEEE Computer 32(8), pp. 51-59 (1999)
63. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: SIGMOD, pp. 171-182 (1997)
64. Hou, W.C., Ozsoyoglu, G.: Statistical estimators for aggregate relational algebra queries. ACM Trans. Database Syst. 16(4), 600-654 (1991)
65. Hou, W.C., Ozsoyoglu, G.: Processing time-constrained aggregate queries in CASE-DB. ACM Trans. Database Syst. 18(2), 224-261 (1993)

66. Hou, W.C., Ozsoyoglu, G., Dogdu, E.: Error-constrained COUNT query evaluation in relational databases. SIGMOD Rec. 20(2), 278-287 (1991)
67. Hou, W.C., Ozsoyoglu, G., Taneja, B.K.: Statistical estimators for relational algebra expressions. In: PODS, pp. 276-287 (1988)
68. Hou, W.C., Ozsoyoglu, G., Taneja, B.K.: Processing aggregate relational queries with hard time constraints. In: SIGMOD, pp. 68-77 (1989)
69. Huang, H., Bi, L., Song, H., Lu, Y.: A variational EM algorithm for large databases. In: International Conference on Machine Learning and Cybernetics, pp. 3048-3052 (2005)
70. Ioannidis, Y.E.: Universality of serial histograms. In: VLDB, pp. 256-267 (1993)
71. Ioannidis, Y.E., Poosala, V.: Histogram-based approximation of set-valued query-answers. In: VLDB (1999)
72. Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.: Optimal histograms with quality guarantees. In: VLDB, pp. 275-286 (1998)
73. Jermaine, C., Dobra, A., Arumugam, S., Joshi, S., Pol, A.: A disk-based join with probabilistic guarantees. In: SIGMOD, pp. 563-574 (2005)
74. Jermaine, C., Dobra, A., Arumugam, S., Joshi, S., Pol, A.: The sort-merge-shrink join. ACM Trans. Database Syst. 31(4), 1382-1416 (2006)
75. Jermaine, C., Dobra, A., Pol, A., Joshi, S.: Online estimation for subset-based SQL queries. In: 31st International Conference on Very Large Databases, pp. 745-756 (2005)
76. Jermaine, C., Pol, A., Arumugam, S.: Online maintenance of very large random samples. In: SIGMOD, pp. 299-310. ACM Press, New York, NY, USA (2004)
77. Kempe, D., Dobra, A., Gehrke, J.: Gossip-based computation of aggregate information. In: FOCS, pp. 482-491 (2003)
78. Krewski, D., Platek, R., Rao, J.: Current Topics in Survey Sampling. Academic Press (1981)
79. Lakshmanan, L.V.S., Pei, J., Han, J.: Quotient cube: How to summarize the semantics of a data cube. In: VLDB, pp. 778-789 (2002)
80. Lakshmanan, L.V.S., Pei, J., Zhao, Y.: QC-trees: An efficient summary structure for semantic OLAP. In: SIGMOD, pp. 64-75 (2003)
81. Leutenegger, S.T., Edgington, J.M., Lopez, M.A.: STR: A simple and efficient algorithm for R-tree packing. In: ICDE, pp. 497-506 (1997)

82. Ling, Y., Sun, W.: A supplement to sampling-based methods for query size estimation in a database system. SIGMOD Rec. 21(4), 12-15 (1992)
83. Lipton, R., Naughton, J.: Query size estimation by adaptive sampling. In: PODS, pp. 40-46 (1990)
84. Lipton, R., Naughton, J., Schneider, D.: Practical selectivity estimation through adaptive sampling. In: SIGMOD Conference, pp. 1-11 (1990)
85. Lipton, R.J., Naughton, J.F.: Estimating the size of generalized transitive closures. In: VLDB, pp. 165-171 (1989)
86. Lipton, R.J., Naughton, J.F.: Query size estimation by adaptive sampling. J. Comput. Syst. Sci. 51(1), 18-25 (1995)
87. Luo, G., Ellmann, C.J., Haas, P.J., Naughton, J.F.: A scalable hash ripple join algorithm. In: SIGMOD, pp. 252-262 (2002)
88. Matias, Y., Vitter, J., Wang, M.: Wavelet-based histograms for selectivity estimation. In: SIGMOD Conference, pp. 448-459 (1998)
89. Matias, Y., Vitter, J.S., Wang, M.: Wavelet-based histograms for selectivity estimation. SIGMOD Record 27(2), 448-459 (1998)
90. Mingoti, S.: Bayesian estimator for the total number of distinct species when quadrat sampling is used. Journal of Applied Statistics 26(4), 469-483 (1999)
91. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, New York (1995)
92. Muralikrishna, M., DeWitt, D.: Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In: SIGMOD Conference, pp. 28-36 (1988)
93. Muth, P., O'Neil, P.E., Pick, A., Weikum, G.: Design, implementation, and performance of the LHAM log-structured history data access method. In: VLDB, pp. 452-463 (1998)
94. Naughton, J.F., Seshadri, S.: On estimating the size of projections. In: ICDT: Proceedings of the Third International Conference on Database Theory, pp. 499-513 (1990)
95. Neal, R., Hinton, G.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models (1998)
96. Olken, F.: Random sampling from databases. In: Ph.D. Dissertation (1993)
97. Olken, F.: Random sampling from databases. Tech. Rep. LBL-32883, Lawrence Berkeley National Laboratory (1993)

98. Olken, F., Rotem, D.: Simple random sampling from relational databases. In: VLDB, pp. 160-169 (1986)
99. Olken, F., Rotem, D.: Random sampling from B+ trees. In: VLDB, pp. 269-277 (1989)
100. Olken, F., Rotem, D.: Sampling from spatial databases. In: ICDE, pp. 199-208 (1993)
101. Olken, F., Rotem, D., Xu, P.: Random sampling from hash files. In: SIGMOD, pp. 375-386 (1990)
102. Piatetsky-Shapiro, G., Connell, C.: Accurate estimation of the number of tuples satisfying a condition. In: SIGMOD, pp. 256-276 (1984)
103. Rao, T.J.: On the allocation of sample size in stratified sampling. Annals of the Institute of Statistical Mathematics 20, 159-166 (1968)
104. Rao, T.J.: Optimum allocation of sample size and prior distributions: A review. International Statistical Review 45(2), 173-179 (1977)
105. Roussopoulos, N., Kotidis, Y., Roussopoulos, M.: Cubetree: organization of and bulk incremental updates on the data cube. In: SIGMOD, pp. 89-99 (1997)
106. Rowe, N.C.: Top-down statistical estimation on a database. SIGMOD Record 13(4), 135-145 (1983)
107. Rowe, N.C.: Antisampling for estimation: an overview. IEEE Trans. Softw. Eng. 11(10), 1081-1091 (1985)
108. Rusu, F., Dobra, A.: Statistical analysis of sketch estimators. In: To Appear, SIGMOD (2007)
109. Sarndal, C., Swensson, B., Wretman, J.: Model Assisted Survey Sampling. Springer, New York (1992)
110. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD, pp. 23-34 (1979)
111. Severance, D.G., Lohman, G.M.: Differential files: Their application to the maintenance of large databases. ACM Trans. Database Syst. 1(3), 256-267 (1976)
112. Shao, J.: Mathematical Statistics. Springer-Verlag (1999)
113. Sismanis, Y., Deligiannakis, A., Roussopoulos, N., Kotidis, Y.: Dwarf: Shrinking the petacube. In: SIGMOD, pp. 464-475 (2002)
114. Sismanis, Y., Roussopoulos, N.: The polynomial complexity of fully materialized coalesced cubes. In: VLDB, pp. 540-551 (2004)

115. Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E., O'Neil, P., Rasin, A., Tran, N., Zdonik, S.: C-Store: a column-oriented DBMS. In: VLDB, pp. 553-564 (2005)
116. Thiesson, B., Meek, C., Heckerman, D.: Accelerating EM for large databases. Mach. Learn. 45(3), 279-299 (2001)
117. Thorup, M., Zhang, Y.: Tabulation based 4-universal hashing with applications to second moment estimation. In: SODA, pp. 615-624 (2004)
118. Vitter, J.S., Wang, M.: Approximate computation of multidimensional aggregates of sparse data using wavelets. SIGMOD Rec. 28(2), 193-204 (1999)
119. Vitter, J.S., Wang, M., Iyer, B.: Data cube approximation and histograms via wavelets. In: CIKM, pp. 96-104 (1998)
120. Vysochanskii, D., Petunin, Y.: Justification of the 3-sigma rule for unimodal distributions. Theory of Probability and Mathematical Statistics 21, 25-36 (1980)
121. Yu, X., Zuzarte, C., Sevcik, K.C.: Towards estimating the number of distinct value combinations for a set of attributes. In: CIKM, pp. 656-663 (2005)

BIOGRAPHICAL SKETCH

Shantanu Joshi received his Bachelor of Engineering in Computer Science from the University of Mumbai, India, in 2000. After a brief stint of one year at Patni Computer Systems in Mumbai, he joined the graduate school at the University of Florida in fall 2001, where he received his Master of Science (MS) in 2003 from the Department of Computer and Information Science and Engineering. In the summer of 2006, he was a research intern at the Data Management, Exploration and Mining Group at Microsoft Research, where he worked with Nicolas Bruno and Surajit Chaudhuri. Shantanu will receive a Ph.D. in Computer Science in August 2007 from the University of Florida and will then join the Database Server Manageability group at Oracle Corporation as a member of technical staff.