<?xml version="1.0" encoding="UTF-8"?>
<REPORT xmlns="http://www.fcla.edu/dls/md/daitss/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.fcla.edu/dls/md/daitss/ daitssReport.xsd">
  <INGEST IEID="E20101210_AAAAQZ" INGEST_TIME="2010-12-11T03:07:06Z" PACKAGE="UFE0021217_00001">
    <AGREEMENT_INFO ACCOUNT="UF" PROJECT="UFDC"/>
    <FILES>
      <!-- First file record, shown in full; element/attribute placement is
           reconstructed from the flattened source text. -->
      <FILE SIZE="7002" DFID="F20101210_AACBVE" ORIGIN="DEPOSITOR"
            PATH="joshi_s_Page_117thm.jpg" GLOBAL="false" PRESERVATION="BIT">
        <MESSAGE_DIGEST ALGORITHM="MD5">28f0458f73ba0793188dc842baf4d52b</MESSAGE_DIGEST>
        <MESSAGE_DIGEST ALGORITHM="SHA-1">a4e5553b7423157073c36c319c16fabdc7a6fd41</MESSAGE_DIGEST>
      </FILE>
      <!-- Several hundred further FILE records follow the same pattern,
           covering the joshi_s page images and derivatives (.jpg, .QC.jpg,
           thm.jpg, .jp2, .tif) of package UFE0021217_00001, each with SIZE,
           DFID, PATH, and MD5/SHA-1 digests. The repetitive checksum listing
           is omitted here. -->
    </FILES>
  </INGEST>
</REPORT>
c29d73b2d2b37b6d82769bf09a776c8f9b277b0f 23380 F20101210_AACBZR joshi_s_Page_160.QC.jpg bb291363b93c7717e24ed29d847d228e 74b1c1a5ba904664b58fe7a231b0ec5106f30223 F20101210_AACBCJ joshi_s_Page_079.tif 67159701ea8a461d92f5ac24c31e31d9 68b954df1cb01ce72dfd50981e03154f1420d7cf F20101210_AACAXD joshi_s_Page_076.jp2 0a9b05fc6dc1536d31069631556feed9 43c4f236cc7ca09b1700e08bf1d105068af1c2e7 F20101210_AACBBV joshi_s_Page_061.tif b3c5e0ea5ba064b13fd18ec30f560874 bd7f02e993eabf4f7306468676a14801b89d864f 1051977 F20101210_AACAWP joshi_s_Page_059.jp2 3727fee1719f43b9fd7f3154bf2954eb efa42ed6fe6d60c02bc9e4cd66d2904b4de8d50f 6274 F20101210_AACBZS joshi_s_Page_160thm.jpg b246efd6ebe76dda1f1a3d2378d4edef b95a3c9c37e3d5e06d8986525fb0a7ed949680aa F20101210_AACBCK joshi_s_Page_080.tif f70d8dfcf4c2ef9e5443aa8975d2da62 1147b24b83fb694237c1a4edb230a198c3370ee7 F20101210_AACAXE joshi_s_Page_077.jp2 9d17e7a1816b668ecd8123e36229de06 488f6b2f9b51cb3258cfedcf7468843f72c730d5 F20101210_AACBBW joshi_s_Page_062.tif 5c76d721963a61e3f9c8aadf69727fbf a407661e81eeeb681ecc2db1120454affbd30f97 69058 F20101210_AACAWQ joshi_s_Page_061.jp2 f38384f35710c9dff5b4140bf98a4127 4b81247bec7f6b18c64b79aecfe4cc3440144a4d 6127 F20101210_AACBZT joshi_s_Page_162thm.jpg 09f3a86b0335bdff2518b15ffd9d9430 659b9bad334621eb3fd76fde7b7b5d071727cde5 F20101210_AACBDA joshi_s_Page_096.tif fdb74b1ecdbe78bb4b3801115cf150a2 f3c41ed3b2221cc4f4b939685043bed14f5205dc F20101210_AACBCL joshi_s_Page_081.tif 5213ae8fd9be19908c33728e96c5e241 e172ede543891ee30d6303d6a0681fef6b1d9625 F20101210_AACAXF joshi_s_Page_078.jp2 fd5488ae067ef3911853acfab4032f14 9577277c075626f654a64e472ca22049f835d558 F20101210_AACBBX joshi_s_Page_063.tif 379aeabec4684bbf6e0cabd26334a276 c7625127eef3ff0e7c625739951ed17e68766013 1051861 F20101210_AACAWR joshi_s_Page_062.jp2 119986e31abcc9bafd19f909800eb5c0 d72aa26bb349fe9139e1d5d24aae0319d8c032df 6329 F20101210_AACBZU joshi_s_Page_163thm.jpg dd051b31b83252a4349106898562e2d6 8922e47775109a25917796191ddfd0c0b72bb7be 
F20101210_AACBDB joshi_s_Page_097.tif f488c7a669c88d6a5ead5cf452402a42 848a88c18cf81ce94bbe2027e2a22a2902680b20 F20101210_AACBCM joshi_s_Page_082.tif e2e63f40c1cfa7686d6207a068e281ba e612525e1830eeec7714bc2139834575f9db5d8a 1051940 F20101210_AACAXG joshi_s_Page_079.jp2 95f28ebd1570af6f7997f88e4b478792 38cc48ede2f3b9914b2d1696e8ffc4581d17775e F20101210_AACBBY joshi_s_Page_064.tif 9e0f870de1d466b43f5d9012b95e0c60 eea5e69bac347570b78ff8f13b40895034f89fba 55035 F20101210_AACAWS joshi_s_Page_064.jp2 9e7252f75775eec990380ed204c5d944 73f84d7b019b1c8621b9b0bbc12c315287908426 11470 F20101210_AACBZV joshi_s_Page_165.QC.jpg b4020b0dd941b5f636266c8b6be280ec c07a3499fe6bf6fcd7480e7ae0da9190e40ba424 F20101210_AACBDC joshi_s_Page_098.tif 938c476ea2c4b3ea6ccb7ee80a18de82 1325e66a38ea098ac40eb178f09fbcb5e1596825 F20101210_AACBCN joshi_s_Page_083.tif 2687f1425a1431386e9837f7a2556cca c4fc2d1bf2c23c35039d2716bd702b39a2c6fa70 1049949 F20101210_AACAXH joshi_s_Page_080.jp2 7bdaee6c9fee19c8cbfa3422823d4087 a5da887ce552d6df3c5a44055005231297ac774c F20101210_AACBBZ joshi_s_Page_065.tif 03bc3a81df798559541178fee2f606ac ce92ee654a78de107abf7eca73b085afb0a6d781 1051969 F20101210_AACAWT joshi_s_Page_065.jp2 bc91c1004f30d3ae43dfb8972f95ea49 f94a3abff4c74928fdaa7f2bbb760beb290d97a6 3221 F20101210_AACBZW joshi_s_Page_165thm.jpg cbceb848568d1f016b4a9e73d0081a4a 1cc37574fc2ccae0a536b2fe35b7a40e143c9ce6 F20101210_AACBDD joshi_s_Page_099.tif d30f003f69aebe135390ac79b469035f 7d63d4d19736853d27287931df8995079749635e F20101210_AACBCO joshi_s_Page_084.tif 6bd380203e867cc1a517dadc2ce90471 a7a1f6b5dbf74c5a50c32060b1bbf6764c848940 F20101210_AACAXI joshi_s_Page_081.jp2 0e225fd6e61bf82d298d236ce12a74bc 025768e0411794f1185a8feef33c8004a12befcc 1051898 F20101210_AACAWU joshi_s_Page_066.jp2 e48cc9753bb91aaab8fa959b7a079df7 558274ab8e235764c296f378a14b8d99c2d3e44c F20101210_AACBCP joshi_s_Page_085.tif 853c923a634fbe20d77012609e93a2c4 7c3435dc6b6b1bd23363eb658a9df254c43ddbb6 962143 F20101210_AACAXJ 
joshi_s_Page_082.jp2 fd2d4f518905a37b1a8fc3795d1762a4 9a68800dd3b17516942959d063840737f31fc22d 817824 F20101210_AACAWV joshi_s_Page_068.jp2 76f5325dcf175bce4971db8ade0bcb40 b0f549fa3711cf9a7fafd877d229ed15e69946f0 F20101210_AACBDE joshi_s_Page_102.tif 035b5c63efadd29afffde332a4ba6bad bab1d1d7ddbb1635be3626067923f5f9e98d40b1 F20101210_AACBCQ joshi_s_Page_086.tif 8f4dde0c9b6ca17b4343056e491429db ac0cfd7dc45cf25c514d762139b16d5945eb63a7 601714 F20101210_AACAXK joshi_s_Page_083.jp2 3268fb5a126edb6558c34e428746dd01 c28be1c33dea470819dd266f7fb39c2d2fbe5e04 881390 F20101210_AACAWW joshi_s_Page_069.jp2 9a900fae598ce7c36b581171a1d779cf e0e95bd1e5398e5b8e7076115628e4902eaae591 F20101210_AACBDF joshi_s_Page_103.tif 0eae575184149d05d684163af5b09d5b 25807d291d2b186226f0a80afddda6c2904ce445 F20101210_AACBCR joshi_s_Page_087.tif e6dc7900d333067d284f62210a00367a ff59f15890ce129d9f5e7417edd5792dc4deec49 741612 F20101210_AACAXL joshi_s_Page_084.jp2 53f44c7bc71bf8581dfeffee4074d293 a3f3bd458afc7fd63f88eb89aeb87d51c236f916 F20101210_AACAWX joshi_s_Page_070.jp2 24979cd906c84b5c795ed4e2517c3e6c 886d6588be338d62f0ff62c1c01dce929b9cdbc6 F20101210_AACBDG joshi_s_Page_104.tif 3f2ffeb97d45b2c0b958fdfdbf1f68b2 661cde2691085995c7ec38eb4f2d0739c4e3d5bb 1051974 F20101210_AACAYA joshi_s_Page_102.jp2 237dce7b1685588150c9356464a6e976 ed907ba695104c3a7a99c050cf5bc3e8fc949572 F20101210_AACBCS joshi_s_Page_088.tif 4fe9880724fa3d0ca8ae358ab74b6ddc 84fccc57fbb4b732ffd247d152af32e698d6cd13 860992 F20101210_AACAXM joshi_s_Page_085.jp2 ba679d748ac6247e5af8faf8a4536907 e6dc01349dde671435a8d009b6a36b507aee28c1 F20101210_AACAWY joshi_s_Page_071.jp2 2446b4ee4fbcc3a9aa9931e703e8941f 7aa82da1a44a8a4d19e2c778b7b5116d5bc21641 F20101210_AACBDH joshi_s_Page_107.tif 304677ef448cbbc52c1ec0d037560d9b 140298f3cb79bf9e5a065320c802e5d84a535bba 1020905 F20101210_AACAYB joshi_s_Page_104.jp2 fc7fb3ea013dad2c255cccb02a10393b 7d2242c2abe76fb855f30c64c4c692336453e3fa F20101210_AACBCT joshi_s_Page_089.tif 
144a80341d42503c2b6df1ece289d129 421851cd1296c2a325d6019537e020fd8b24e194 571038 F20101210_AACAXN joshi_s_Page_086.jp2 8d5887cac9a42b9240a68b6a532f625d be58257f28f87cca279328265259826b1edb8a38 324407 F20101210_AACAWZ joshi_s_Page_072.jp2 7b61628ad9d26ecd6e994832014db637 5609b8a146547ba9675f2f3283fc9674f353bf8d 881241 F20101210_AACAYC joshi_s_Page_105.jp2 e93d56840cf05a4a7fb43fb1a1bb0e9f 880d9b47426e9541bd8605649bdf5c87657d667c F20101210_AACBCU joshi_s_Page_090.tif 1fe7d151afbdb5ce7c99f5b4dc9c49d9 bef662e6f35700262ca7bfc89275dfe56db58b4d 55121 F20101210_AACAXO joshi_s_Page_088.jp2 277e45fc57e7b5daa4c925b4ca7f355f c8e739ba6fea1e379f4954dea0f6781e7a0fbe2e F20101210_AACBDI joshi_s_Page_111.tif a54e442567c9ebf91f341a7361475e2b f282f12a5e16ece6011ebb760e4e4d90af96cafd 1051921 F20101210_AACAYD joshi_s_Page_106.jp2 a6a5b520b72e269a3732d30797a4b0a4 ce47e376b33c529e9ed29b333b8fa9392481721e F20101210_AACBCV joshi_s_Page_091.tif 84db2561c8066e3f1369958f64ce5f5d d2ba44ab64801347a3bbf0af7d5e91e072ebfc2b 80706 F20101210_AACAXP joshi_s_Page_089.jp2 627568efd42b942deeb2f9ea9ac96e99 e8f04534a09851372a1ac82d02a880e5265aee97 F20101210_AACBDJ joshi_s_Page_112.tif 3dc4088d737543876faf3e4856c4a5bb 5a14ef483a370f4b7b5d4329da4cacb203ca125c 79669 F20101210_AACAYE joshi_s_Page_107.jp2 61b7aacb95127f6366a2e8b0100655a9 42f156f931b3d58325cbad1790d7280652ebb19b F20101210_AACBCW joshi_s_Page_092.tif 6260554d36c83f03b6e2e82124c8ae06 d6b0db13497a2a3b98d18e96d0cea63111e119ff 1051958 F20101210_AACAXQ joshi_s_Page_091.jp2 0af44132d6be3f9cbd6d69388d89c9c4 43eaa1015b1330950de03ef79c5e550e36933b93 F20101210_AACBDK joshi_s_Page_113.tif e21b4ecf7203c1910dc0a278e945b260 21efed36b1e4827fc9bb6ab23a36aac1ddb28205 88373 F20101210_AACAYF joshi_s_Page_108.jp2 ef7acbfbd11a76eaf3f1803e0f81cb58 aeb16c95fab41bef23d529a238735bc1476bdafc F20101210_AACBCX joshi_s_Page_093.tif f363224e745ab0e4c86cd0d02a9c4d1d 214329c5e18122fd313582aa5f5b2c868012dc26 F20101210_AACAXR joshi_s_Page_092.jp2 4da4538cc4aef32ea466180158d86c9c 
daf89767c7e598df0aa9cfc2c986de28d3e2ee80 F20101210_AACBEA joshi_s_Page_135.tif ca8bf91ef4245719bd945503a5abfcf4 06d16af5b8480e41acf20a0eb835c87aed801950 F20101210_AACBDL joshi_s_Page_114.tif 69343ba2dcc02e3a9d23df71db467ed4 a66cf95f4419a487b26dcfb74fbff6c0f611bbaf F20101210_AACAYG joshi_s_Page_109.jp2 46120db4af081fbdf643d4de6f67f583 bf7535a948018ad7e38f7ed319ef98b559a4108d F20101210_AACBCY joshi_s_Page_094.tif 84f04c7a5be1a2c716e072a53743a0ea 7254354603a9fbe02e6f21530f48208d2f161b72 83694 F20101210_AACAXS joshi_s_Page_093.jp2 9a21e06537a6014dd822b22374a3095f d231d6311d1298e80f8f9d35cc24a7f8d9ed8c63 F20101210_AACBEB joshi_s_Page_136.tif 80521fc41c68a7efb8d8df33f4c0576c 221dc4bff37eacca516f1ed49d7db3dc087963ee F20101210_AACBDM joshi_s_Page_115.tif b9ae4e6587de46a169ed69a2ecbced6e 9ad382d9dc70485c34706bce237bce813f4d8eff 118366 F20101210_AACAYH joshi_s_Page_112.jp2 54943879d782c1cae11a18d6fe6fa66e bd811412215337fe584a4dcb286aa4a16d77927f F20101210_AACBCZ joshi_s_Page_095.tif 130650fc6ec356105bbd50c8b641c9f7 1dba34519bd3077cbc277404f3ae8f834174b923 1024716 F20101210_AACAXT joshi_s_Page_094.jp2 296adbd12bc98f7dbd218b773fedc8bf f44f3f6a6f8757566049e31d9f071c3093825165 F20101210_AACBEC joshi_s_Page_137.tif 81316f7418d6023ffe3f8d12dcfadc9e 578b250a515d658ca429ced3cb499d9ad62e5b89 F20101210_AACBDN joshi_s_Page_117.tif 3f408371963add0963656e8b6f15595a 82d95fc096670975cf35a8e7a558aa7d56615b4f 136877 F20101210_AACAYI joshi_s_Page_114.jp2 86d77fff4817acc9ec25b0f011594465 d171dcc04aa2a5614d827edfe00d7d7137bd4352 1051971 F20101210_AACAXU joshi_s_Page_095.jp2 b02d8eaadfe6ce2c6f35ac59044203f3 5241f145e4eac17cfb2ec040810a11cda0613200 F20101210_AACBED joshi_s_Page_138.tif 7168533be27900c79420bae12dbb2b30 7fdd5f2cec4b843a5c40c4c3a80cca6a6c502f8f F20101210_AACBDO joshi_s_Page_119.tif f7453e6260192350c03853fdb2ab3473 caa926d64f5b572179cfbd7d3d07942a309f112e 134754 F20101210_AACAYJ joshi_s_Page_115.jp2 b1e3c0bfbf1e975a0d8962509641d861 23a624e90aa6e9bef277b6ea7b2d528775b5b3cc 
F20101210_AACAXV joshi_s_Page_096.jp2 c25551408774eca924da7d7bfea17a81 64bd542fe2b14ea4db7abe977b3de091694c5486 F20101210_AACBEE joshi_s_Page_139.tif c329ed6f1a7edfaa715f8a7255c00d83 929850d5c8ba0a555040c7b6d497b43a3a46374d F20101210_AACBDP joshi_s_Page_120.tif 1113266d9186142bc03f342d83888462 2c5906c2281fb0b47b088c9ac4598f4fa596ea2c F20101210_AACAYK joshi_s_Page_116.jp2 13c0e6809af8dcc253a0731e7e8eaec7 9d2ead9d34273e58496f522335a42feb9dab1712 918089 F20101210_AACAXW joshi_s_Page_097.jp2 f5aa6525ebb3d57fe5daed8fab324511 003c2e0c3eb8674c8ff40ac2d3dc45241dbb5653 F20101210_AACBEF joshi_s_Page_140.tif eb59b445f3b05866ad5d3d571e489b3c 5be401f609d297efddbbbdc7919e6f53d7af23b5 F20101210_AACBDQ joshi_s_Page_121.tif 1362e5b328efbc9772c63efd1b3c9539 8ad40a229ee8b53d08d8eae97e9c0b3b7464f4ca F20101210_AACAYL joshi_s_Page_121.jp2 3f0217a05b2dcb73c40e32f8109c98df 5c86f0af38501006486761088843eea4d0bf3ce4 102260 F20101210_AACAXX joshi_s_Page_098.jp2 609a1f1ad208ed8752410c2557ee1987 b8d09d920041af26c7d1a235ecb48a57a7f8ce4d F20101210_AACBEG joshi_s_Page_141.tif 273a1ff11d33af9565908b8ac0995e7e 8de680d589c500208984929a9a7f6df72a9d0bf1 F20101210_AACBDR joshi_s_Page_122.tif e66926b6096ad4e72f7c422c898b0f32 6f3bcc58bda864f320686d9857559c3119a452df F20101210_AACAYM joshi_s_Page_122.jp2 85aad96b5c8f8ccd73fc391d2d71ada4 c5d10d1f7651b0537f90d5c4f7068c671a931cce 118556 F20101210_AACAXY joshi_s_Page_100.jp2 03aac02b1203ecfb4c2fe6f7661bf992 778d7268937c24ab0085b313de243a6207dbdf0c F20101210_AACBEH joshi_s_Page_142.tif 169354116aba8c4a70622f24d3ffca86 8a9aea4164de01d95eb9e6b9c839373a8fcd756c 712294 F20101210_AACAZA joshi_s_Page_141.jp2 e9bc55f91472382677e082c34b125374 d8ae8e93d683f88415aae731cb83f7e1110a750b F20101210_AACBDS joshi_s_Page_125.tif b1f7786a98e5425778ce553df5ce0716 660b6a220e113ad828da1a1505f9a5febb64a86b 1051949 F20101210_AACAYN joshi_s_Page_123.jp2 059116139417bc819316fe52a1f6419c 22d8f944157264ab67f68eab3df86b6fa5abdbcb 106635 F20101210_AACAXZ joshi_s_Page_101.jp2 
c9776be01a8ffd31b09657878420aede c5ebed831f6c4fb277289632e8f17bd072867e79 F20101210_AACBEI joshi_s_Page_143.tif d6a54ec447932183b65661ff8ca3b224 9b56b82ebde91ad002ffa45f97f9da1671a8b017 84212 F20101210_AACAZB joshi_s_Page_142.jp2 aeb3ff4f57489fb0ed9c71998d593758 d09cffece1ad9c71e93062826c5378fb12298dd7 F20101210_AACBDT joshi_s_Page_126.tif 0f19ec6b3b2ff716ef49bcf43b614bd9 dadc8f873dd7a2d086b3c60f90f0c123a47e7aaa 81249 F20101210_AACAYO joshi_s_Page_124.jp2 ad92608526ccb5eb34905e4f77765647 6d8b48a80cccca709bc325c7977cb4415e3792e6 44980 F20101210_AACAZC joshi_s_Page_143.jp2 1eabd646d6216ea49763511e240ccf74 35ee09ea561722dab73f6ae732a59d73466f209d F20101210_AACBDU joshi_s_Page_127.tif 753bde9bf322e1f2243c4a9660150366 01e55880e6e53bd87bd204314710e9c12bb14e86 833273 F20101210_AACAYP joshi_s_Page_125.jp2 8a0a0e4f9301e4c670258ecea2064dcc 69950c4b794f88995a9c64cae38dc18e593a53a6 F20101210_AACBEJ joshi_s_Page_144.tif deff14bb3eb5e3181e5d6ea68a1c72eb 04a9eb072cd5863779931f5e928ec4ea33fc179e 1051972 F20101210_AACAZD joshi_s_Page_144.jp2 9e79fc7ec969dce082395054455a6433 506fdaec64de8b54ac2747f8c4fcfc9182e5a953 F20101210_AACBDV joshi_s_Page_128.tif 342b6d2637af3b1ca7eb317ec6a03135 2b7ed6bb20d8517b94f725fffba0da668c3a0925 F20101210_AACAYQ joshi_s_Page_128.jp2 890a9cd3091258712caec283401809b8 ed4bd192f27aced6641142308196baa0ab872b65 F20101210_AACBEK joshi_s_Page_145.tif 149f34b4389d03d978e95c075b1cecc8 f2f09f4dc57a99a2411051cb4e67b8dd3eb03ad1 F20101210_AACAZE joshi_s_Page_145.jp2 038861d49797c0b59c89f84045f7a6e1 169c95649185a54e15bd1fcb77082c4492a25f09 F20101210_AACBDW joshi_s_Page_129.tif 95636e7b20e293dc054a529c69856215 9a7b42d7e1eb3c2fa1104d7d84e72ea3b5b5eb17 107439 F20101210_AACAYR joshi_s_Page_129.jp2 0eba431e973f5ad90fe507835153fb76 ba848d73ddda4f176bb349688584fcdb345a2bd9 F20101210_AACBFA joshi_s_Page_165.tif 553616772c04b5b77c785482638cc8d1 df04aef07938a7cff135c24a1ec3644594e192b4 F20101210_AACBEL joshi_s_Page_146.tif 2c2777d92d13074abe87c4366859b1b5 
2075b03aa3df84206f9c400afb5c64cb87d48d8b 1051914 F20101210_AACAZF joshi_s_Page_147.jp2 2b221ae52cb6cabdbb3e6dbf5e94dabb b36d9e8e701981a0472a6c9daaebd8aa1a894559 F20101210_AACBDX joshi_s_Page_130.tif ae0213ff1b8f2095f5f917c48c5ab3df cd315193992d6575310314f7da640631692eb0e6 1051976 F20101210_AACAYS joshi_s_Page_130.jp2 e5282197a262e1b12062344bc06b8036 d0b5f768356e555ab34d1b1601dc85ed9d7f06d9 8029 F20101210_AACBFB joshi_s_Page_001.pro 77f6a65495d41e7aed4f012794e1a657 f40385cc0e19a5103c3a1c7e55bc72de933e65b8 F20101210_AACBEM joshi_s_Page_147.tif 7433d0414c30cd02d7695c71bf8845a3 71b9634d096b4aa537774f349dad5fb8eb2db67e 133818 F20101210_AACAZG joshi_s_Page_148.jp2 910e18b90fb4faae70847a2967c8e2de 8e8d5520d03c6097e08ab53eacd0caabf28d81a0 F20101210_AACBDY joshi_s_Page_131.tif 404a4fdbbbef87c28501cdcc8bafc65b bceb6f8a466afb2c075fdd1b3043fb03ccf95a20 745029 F20101210_AACAYT joshi_s_Page_132.jp2 cc82a098df87bb1b303d2d9bd0057855 4da7dbade2daf671ad37e2c299da36fe8e92307f 803 F20101210_AACBFC joshi_s_Page_002.pro 7a607ed3f6e464713fca8759584bd55d cd08fe43d207d03378d4c724c06cfb9ab865dcf8 F20101210_AACBEN joshi_s_Page_148.tif 40d6a5a6d35fb152565aca1898412c38 998b1beea52899f2747d717f10d782bc03896604 135404 F20101210_AACAZH joshi_s_Page_149.jp2 0d8bbca58ee951973fad725ed8331978 d6fead3ab9e491778eded5ebf80e4195ee80c345 F20101210_AACBDZ joshi_s_Page_134.tif 8271b57165f6341984d7aa97782d4d12 bcaff9e1162a2c73941a49e44aa8127b2ab5dcda 1051873 F20101210_AACAYU joshi_s_Page_133.jp2 5373738da3fc012425b6a29dfbbd493f 9cd51a43f61defe748eeb9f0d2b4f03dd56bc541 1609 F20101210_AACBFD joshi_s_Page_003.pro bfa4765d09ea2549c13d7039c26c22fc 4df93d0b7c3ae1e150d105a89e8939118c6714e5 F20101210_AACBEO joshi_s_Page_149.tif 9c8b0ad575e38e2b41b33a0c5c0a390f d67c58cce94d746ba7c1a9a02ad84e7293636ef1 125054 F20101210_AACAZI joshi_s_Page_150.jp2 e6a251ba40a671c56ab0ceaf048a8d19 34130b0a1ea254e257f201d1c84a958448f9ec3e 1051924 F20101210_AACAYV joshi_s_Page_134.jp2 e9953d78da6004a0785eb53e2327c7d3 
06649d27c94150537e0bbbbc48202d5d63a646d9 52580 F20101210_AACBFE joshi_s_Page_004.pro 32184f61419b526efbaf599fa34b9f66 c15f3d5ade1b9dd18005a63b0e83b52117290cfa F20101210_AACBEP joshi_s_Page_150.tif abfcbf6e752b92414ad6357c06ef5e08 283850f441fd33cb032672931a31dfea6d55f1a7 110676 F20101210_AACAZJ joshi_s_Page_151.jp2 8f8af4c2abb0720d2aa52c1cf57bf192 a3339388b85f8d8cdb863e4e3735ac62b7204ade 55664 F20101210_AACAYW joshi_s_Page_135.jp2 6c37427e8ac1d7dad6996fcc74d0aad4 2d8ba2ca4189b64a3b9506543e55ff43d8a3c241 54553 F20101210_AACBFF joshi_s_Page_005.pro 089e979caf0366a0a0b1cf1a46d34936 36509db96e0867189f5911be4c1d1c79c6da133b F20101210_AACBEQ joshi_s_Page_152.tif c7fdd5ac869555ac6a4bc845cc4b223d 1b06e8131cc0999435afda1dad67088a8855db8b F20101210_AACAZK joshi_s_Page_152.jp2 47256e417a7e0f2167832d79366ace8e 4882a5e6f526b070e3c41f2cfe22ad20662a099c 509374 F20101210_AACAYX joshi_s_Page_138.jp2 14bf51d6ca9be295eed918ad47be074e 6cdc26cd24b2941d200b2cdfcee84bc7ccdb2e34 32286 F20101210_AACBFG joshi_s_Page_007.pro 6ef0563463044e9dea8d7478071bdd93 df93acf33c265a49002ee93becfeb8d8aa38001a F20101210_AACBER joshi_s_Page_154.tif b5276eef30ca2276ceb2203235a94e7d 12eba49b2a38f640b8c960d46280d920875d32ec 65514 F20101210_AACAZL joshi_s_Page_153.jp2 786da6de90f3a2674fcdab350991a058 914b5453ba93dc6d20dd031d20d1fe45e9d98462 1051936 F20101210_AACAYY joshi_s_Page_139.jp2 b5cf242b75a7c571dea2ec8fa27b5144 b1351b10560c38cedbf422b28a983953cbbdf729 77864 F20101210_AACBFH joshi_s_Page_008.pro 6af7f794abfa3151880fc1a6b90932a2 153692dafbba501bb36160c06358008c6b8f3461 F20101210_AACBES joshi_s_Page_155.tif 2dbee7ba999da9f2db8d01bd6da9460f 667fb49d55adff12adb5e2cd537eea452433c940 70130 F20101210_AACAZM joshi_s_Page_154.jp2 b80b2fdda33c511972415c0b500b8ac9 c28d5af8f28175f54ece8bbcf69cbca4efc28da0 41682 F20101210_AACAYZ joshi_s_Page_140.jp2 1918dac6d00f991557b4ff78c9c18db7 2bbc5b905e69820c582393f81ba86e7efd443c0b 66944 F20101210_AACBFI joshi_s_Page_009.pro c6872078dda0af351808a9242c7ce073 
d0864ba1eee99db751a18637d4c4e174f9374db8 F20101210_AACBET joshi_s_Page_156.tif fbf6b270851d1f9e0d4dbf9c89654195 ec9eac85eca093b19c7e8ea6194e8035e1024065 64384 F20101210_AACAZN joshi_s_Page_155.jp2 8d063d95a4ceed388eb3f4823e2ab6a6 1d27844d585978854bf92c062320184142b9688f 34033 F20101210_AACBFJ joshi_s_Page_010.pro 4bec23ce3085ffc1e6aca3c4342f442b e5e7e2317828ab80cb2421a59b21801fb90856f7 F20101210_AACBEU joshi_s_Page_158.tif 8fb9a1c605a12cbbe834d187d64abb86 d889ba83e7c0879bfc6c09cf4df9efbf68de3cbf 401864 F20101210_AACAZO joshi_s_Page_156.jp2 67b8d3cb7c84579b76490209f3b19d28 b19b079d98f86f888af4c934a1c9698d2dc27a45 F20101210_AACBEV joshi_s_Page_159.tif 118512037c1f99370bd7e4d08c5894a0 f081cd0d6932eb987a7f9f36755b66239eefc641 109565 F20101210_AACAZP joshi_s_Page_157.jp2 623fef760ef17719b7a870bd43ff8430 1c90d6ebc00a16378a18b0375bc5a66022b05534 51839 F20101210_AACBFK joshi_s_Page_011.pro 32dc9d21d87c8c6173aee0ca2ed139f8 3463f85bc569ec5941230cb739cf19e474fc57ff F20101210_AACBEW joshi_s_Page_160.tif 55abc9c337852c9367f616088bf0bbff 621ee7e02ee731dd10e78f252069726357acb6b6 124262 F20101210_AACAZQ joshi_s_Page_158.jp2 973aae5f49e2ab3e4ee83b06bf587e31 46bbe8bd5d8697316f37aee4ab6343f7717cca92 8624 F20101210_AACBFL joshi_s_Page_012.pro 9ec62af291b542046b6bd63b3e937fd6 0c41fb25a5e664beaa024f6eb725faba236eb279 F20101210_AACBEX joshi_s_Page_161.tif c07d6f70fe8dd89e08cb79c4facb6d26 c4272dee2c18fb8643953495713a7998541993d7 129337 F20101210_AACAZR joshi_s_Page_159.jp2 45dc74cc59f92f393e6ef41b86c45504 816b2a94c23f3968b407ed28656219db249d08fe 9296 F20101210_AACBGA joshi_s_Page_032.pro c8f650ba007aed34cefd71b91b08ab5a 656fb15cf8af8a5dbde9c8e6070c91ae1f51add9 52690 F20101210_AACBFM joshi_s_Page_014.pro 59e6224e5169e5eefd572b6911acca80 e33a7fd6c7b54ea43c7fe9807dd9e093e7ba1c38 F20101210_AACBEY joshi_s_Page_162.tif f12ab25c3d50ba86b5ae68d9b5498b64 40f24af3e359e494250ddd892c2393718de67c57 119610 F20101210_AACAZS joshi_s_Page_160.jp2 7f7f190370095a34eaa495f2a601cc2e 
a9eb4150d52cc1964f8fea7492366d5f6b9e3dbf 59205 F20101210_AACBGB joshi_s_Page_034.pro c17312a93e389d6fea955137c54668d5 14bce796dfd746d2efac75a927d9644bdd9d5dd3 61265 F20101210_AACBFN joshi_s_Page_015.pro 9c1dd4dfbee9a74c98f1b089277daf08 a1f784a32839f0d9b1d57500802e7fcbd05cea3e F20101210_AACBEZ joshi_s_Page_164.tif d1c6d382964316b1344bb1da103cda0b 2a115cc78e592578b4c33108896afce33c90eb24 121531 F20101210_AACAZT joshi_s_Page_161.jp2 a60adff3eb51269a564c16f4f4a10b1e c2029f763dc38bbc8212c682f01d0f19f5255f5e 51074 F20101210_AACBGC joshi_s_Page_036.pro 25a263c738f174704eef4929e136f999 fc683875aae0015defb514ac7d19e46e2681eb12 55712 F20101210_AACBFO joshi_s_Page_016.pro ac18f72037a17b9b1bdfddb5b53646d8 da7e640ef91808f7354f901a36e40de9e9df4eee 119607 F20101210_AACAZU joshi_s_Page_162.jp2 23a53f8b2231c84969837d5fb29bf4b1 523f829ca2aa3eaceae9e182919a0424cabc0ba5 53097 F20101210_AACBGD joshi_s_Page_037.pro 05a1d5455714d6a227ebb6f3a63ca941 b26909053bac1ddb2282c96087eafef4b131ee98 33505 F20101210_AACBFP joshi_s_Page_017.pro 467067dce7756d91f30fa95ee632546b 519d72d3fd38d0cd553ccc7323291a41e42877fb 115224 F20101210_AACAZV joshi_s_Page_163.jp2 ff8c67897b2a8f4943786612ea09d43e 8502321cfd85f2d2a6024d0a226930737d8cca5f 55840 F20101210_AACBGE joshi_s_Page_038.pro b7f5b09c15acc7cd511a74270f3e0729 6814722199e3e6936c338a1beb62ce06f50a777b 54916 F20101210_AACBFQ joshi_s_Page_019.pro cd30bdba127b683d41e3cf1ab9d87c9b f861b044657a5bb500d26b295b7a8c2fe888eb96 61527 F20101210_AACAZW joshi_s_Page_164.jp2 460723b04f1257c7a7ef5e19d8a8bea0 1f6046e9b6863ab62dc7ea7b1c5da32f4a75f992 49413 F20101210_AACBGF joshi_s_Page_039.pro 55d89ffae520ad8a75359acb33b34f85 846aecaaeddff3defcc18a506ed0d146cda3d1f0 59068 F20101210_AACBFR joshi_s_Page_021.pro d09d14eb2c2baf532ce2a097caaf532d e45c2ea17f125dce35a275ba48000474e3bad88f 49285 F20101210_AACAZX joshi_s_Page_165.jp2 6e23eaab2144141a3b44e19115fe5fe8 2aa52d11508db37621c1fa149cae0240e145cf45 32370 F20101210_AACBGG joshi_s_Page_040.pro 
da805ea8f3e65243f82d60b19c986953 aab95fe7f714df6f035d80c1fc4a7b96faaf9376 58778 F20101210_AACBFS joshi_s_Page_022.pro a6f948483495a96e4f8a10dda20dd8bc 9e673c8a007819886ccfe256039c91900903f933 F20101210_AACAZY joshi_s_Page_002.tif 2197ffbce6b68069eb539141ec7f9ce3 52e7eae64446bab8a547fd63da0376a5a127b2f7 42429 F20101210_AACBGH joshi_s_Page_041.pro be4007071a5ee356a4fb08b27e8d530f 0990b2b38582e84d4cbb1cfe16dadfdb34909347 60280 F20101210_AACBFT joshi_s_Page_024.pro 930e926a7fbdb75c858e2de3a6512feb 01b1c6b1b2fb418609bff2fe41bed3f2da193524 F20101210_AACAZZ joshi_s_Page_003.tif 0430459a9a1a77f5aa48053369641bfe 1579181ea5e3e1272b52ab16c5dd5c8e29e586c4 30157 F20101210_AACBGI joshi_s_Page_043.pro d778e39879e4c31837993ec309d18511 0acf826fb0d9b1a71f59e749f7bc4ce551c2d061 56704 F20101210_AACBFU joshi_s_Page_025.pro 1f5af74ae55bb128593b004b6ed37ef0 8cd241f4b2c21e1a1d220e526314a0612307cfba 54237 F20101210_AACBGJ joshi_s_Page_044.pro a34eeb1ea6642eb92cba7ce22e48993d 0656644a2445026fa61dbc5c17a9f3b0a9887d04 59742 F20101210_AACBFV joshi_s_Page_026.pro 2149b44292b921561b104a6b0b8e93c6 af3e6e06c5331d1963fb54e23919e3bdd779a742 30068 F20101210_AACBGK joshi_s_Page_045.pro 62832f5a9c5faa13c08acd9b26a7a1f7 01a270aeec10d4d512371fe983ec633ddba85152 57485 F20101210_AACBFW joshi_s_Page_028.pro 0effe03f6e3bb4c4414d2c1f79fe873c 7e9ab225a38006682925f5dfbd072cfacaba15df 58467 F20101210_AACBFX joshi_s_Page_029.pro d238ceb5734172d74761fe6662ca57dc bc171f280aa7e50fc63d8967b06a6f70a24ea7fb 58468 F20101210_AACBHA joshi_s_Page_065.pro 2ad59ce9bf739506286d97d508229e05 a564739ea57a6277478dad1783f36d3451e9e3e8 59890 F20101210_AACBGL joshi_s_Page_046.pro fa2e4a86b9e969bf918895a96dd43c8b 849c77f6533cb815761b965dde64d6065e8f82e4 58916 F20101210_AACBFY joshi_s_Page_030.pro 34d3b48be1be2cace893cee43db5be4f 5960e24f807753e9312cf63a2f1c59b177b44a8b 55973 F20101210_AACBHB joshi_s_Page_066.pro 5d92173231649ca38165bc45f4329056 7f42a3d2ff088ddc4a47cad66fb1d9d2be4f1d49 42407 F20101210_AACBGM joshi_s_Page_047.pro 
464b926dd9ad8814bb459581ff4808a1 79221eab10f054a1035e7fa542702936f86cfa87 60751 F20101210_AACBFZ joshi_s_Page_031.pro 080246d27ab1999fe1498cc665d3a349 f6dc16f2835d180fa91439bf07d8fb9b89f67549 40666 F20101210_AACBHC joshi_s_Page_067.pro 793a422de4ae05d6b767b633ca6c68f7 c92289db1a75fedddca445ff291dea4ddfcfcca0 39402 F20101210_AACBGN joshi_s_Page_048.pro fae06c2e2b19fdde744eb6720fc5385d 2f1657fc4a35574e523c2041bb308f4ea1613163 35597 F20101210_AACBHD joshi_s_Page_068.pro 9b115aa7c3b3870599f3bea215dc321d 3920d194bde9d7a981fe4d97739a6a4d92e827b2 45181 F20101210_AACBGO joshi_s_Page_049.pro 5a89f356a1d1344b91d4b905a9c25b41 055bc5f6bf99586752158572447b96e0859ec8fc 38967 F20101210_AACBHE joshi_s_Page_069.pro a9a851ec5de5c2a4d4d426fb6bc5f1a2 d5b39a5b31e1a15704642fc5304d3f5919dbd5d7 49934 F20101210_AACBGP joshi_s_Page_050.pro 5853c7b674d59ad6166a845cb5807228 96fc54c10370439615fa1ec29d669d3cd20a3e33 60331 F20101210_AACBHF joshi_s_Page_070.pro cc891251d04f1de3189470049574415b d76ff08a7fff32abdc330485505342147e071174 54668 F20101210_AACBGQ joshi_s_Page_053.pro 691d3bd804ffd7dabaa02bb18e107937 d428a3129ba72a24891ab64ae5db6f1d845fff9d 60394 F20101210_AACBHG joshi_s_Page_071.pro 0ee2f73218353068267a201a437e38ba 1cc3e78972cbacd61e4ba80dd830a0ca506c1853 60338 F20101210_AACBGR joshi_s_Page_055.pro 4bcc39f889d75dee315b92c67fa03cdf 7e5b11ade946704d84ea9ea7156c72c5752b09f7 14335 F20101210_AACBHH joshi_s_Page_072.pro f0318a74a781af43a16a3f2c426b1c3f 8f0ff33598724a03ade40874b991a109a2b17d09 35756 F20101210_AACBGS joshi_s_Page_056.pro 3f66582cc2d6b0605f1919b1bdc0faf7 9fa58b0a8b6f26447a7f608efbe3060a907d02a6 43558 F20101210_AACBHI joshi_s_Page_073.pro 4bf53b8fbdf2028397441080b2680cf0 5aa5036b3ca0a589953e96144e9e34d6b9503db3 34936 F20101210_AACBGT joshi_s_Page_057.pro 97cf5dff50faa54bd73143820a9aa24a e42f115c73bcb8cfb0522049b4d199d6b6e43579 41083 F20101210_AACBHJ joshi_s_Page_074.pro fe1f2c106d8a5cf59f073fe9b921fa41 2cced72d4954b6c5c7edb92a7406a080e57696b4 28752 F20101210_AACBGU 
joshi_s_Page_058.pro 7f3c345a1e2dc21f13ad13b47b031946 795df2d7af6fdcf3aca769d9bef96852bba55a81 37482 F20101210_AACBHK joshi_s_Page_075.pro a1696a1aae15d7cbe349d3e7780f7f57 57a496031981686b5c5b83675d95c6b1037aeef8 52340 F20101210_AACBGV joshi_s_Page_059.pro 54ef49a3d1017be1b03d752b54e81556 615ea07144a0e43a279b0d8cf4d223b8deb3d65b 58738 F20101210_AACBHL joshi_s_Page_076.pro 578247d26b4c438d8cd11bee854d2214 e9761e064577429eb01fd8295b113c8bbeb4648e 36885 F20101210_AACBGW joshi_s_Page_060.pro 4c058c69ce43587a0ef5465f4073e21f 39be7943311b006c35bfdd016f1482d7c1c27155 48338 F20101210_AACBIA joshi_s_Page_098.pro 397247540755b23c3074b6c8dd0481ef a383295af96b91a6802d91677f9be11ec0c28bd4 29715 F20101210_AACBGX joshi_s_Page_061.pro 7831494e58c5e2f3215918b180be3386 68cc87d41e4c48d21fbf10c62a25b51ea032dcd3 52867 F20101210_AACBHM joshi_s_Page_077.pro a8e66f510d70392636a26013461e35ef 5518ba2123f53a654832fca47b01f02d1097138b 39207 F20101210_AACBGY joshi_s_Page_063.pro b94f2c50758d430ffd9c54bdc6dff5b7 d09c8c50464c45d6dc05044db23e7d1568bc960a 59155 F20101210_AACBIB joshi_s_Page_099.pro 11367f6d4ca0f00a68d67106ba3bd5fb a4b857c37dcdc658f0d6bfb5b4d9902781a38bf3 50984 F20101210_AACBHN joshi_s_Page_079.pro 3d64f8c7b30d94e1800d5cfa2c6c9edf 000eabf60cb463dbe1e20ccfaa62b19b6a7d70c8 18352 F20101210_AACBGZ joshi_s_Page_064.pro 42002565bb13dea4ee2308bb0a592af5 34cbd8fba3206ad877a1b8dd10aa1c8c42ce0756 49378 F20101210_AACBIC joshi_s_Page_102.pro 5d93397d7512fc1f126a8a3ea4e2ee5f 1e9632a5faa02e8f4ac5f0b18b12c9156878011e 56354 F20101210_AACBHO joshi_s_Page_081.pro 37a551917c1da524f7ee25ee274b60bc faa760be590e166debec4024354d676e12d3a361 36613 F20101210_AACBID joshi_s_Page_103.pro de185d5ea7dd4dded0714a807fd1a23d cb9039ed5a2b1f12c19a79d827d532625f6935b0 44993 F20101210_AACBHP joshi_s_Page_082.pro e6c53f2730b990044628a6fa716072b9 586f396a468cea4108c1133ea78c4341a89e3893 47748 F20101210_AACBIE joshi_s_Page_104.pro c1d0e966f5f4a53a0ea2a732c05acbe1 7b6bff1681e0974b18e1f8db25c0119c05a55fd5 23615 
[Remainder of the DAITSS ingest report file inventory for package UFE0021217_00001 (depositor path prefix joshi_s): repeated per-file entries — DFID, file size, path (.jpg, .jp2, .tif, .pro, .txt, .QC.jpg, thm.jpg), and MD5/SHA1 message digests — omitted here; no narrative content is present in this portion of the report.]
F20101210_AACAMM joshi_s_Page_098.txt 0cf5b54823a5f9581e0214ef1472c6bb 4d729d7e2d9c917bb89531d61058e8ebcb125432 6095 F20101210_AACBQD joshi_s_Page_035thm.jpg da754200aa43180590e6e6a2cc97066c 24f1db65f8fe58c2d857cfe7b6c3b3fd01e58083 7045 F20101210_AACBPP joshi_s_Page_008thm.jpg d7dbcae98e5d2a02797d927b185b91d9 ca6ddf0ccbe09063c2efcff0455778aad74fcbf4 F20101210_AACALY joshi_s_Page_093.txt 62fa569466cab28809a8f6b0ba5f39af d3dc96abb38a46634307750ccb0bfa6c77060f26 F20101210_AACANB joshi_s_Page_110.tif 0752f6b4fdea79036e9ea9af6e50dbaa fd2bc43522b5b877d5c4df5b1f37c9fd39cb8464 F20101210_AACAMN joshi_s_Page_040.tif 12c45c3744fd2a9566bf715a11865ee3 65f95e325b233c7e301faea7318963a1f6ee1f80 4569 F20101210_AACBQE joshi_s_Page_107thm.jpg a45a413de30f98db837b8115ec7e8f8d 7d2301e0eb89ccd2dc61e72249da4e7842b15eb3 5698 F20101210_AACBPQ joshi_s_Page_013thm.jpg 42f56d5c62cfe66fcdc12cc5d36cf7fb 93051d2487f8f976c034cf74be2723df204835df 55202 F20101210_AACALZ joshi_s_Page_092.pro c92ad63806aafb3177c5ae0d01aa8b2e e2fd5738c330bc2e3bd71917248a9bec9fab6998 1242 F20101210_AACANC joshi_s_Page_153.txt 548e28ac3ad8b1cb2c8b04985394eea1 0c682c470c93d38f6de807c50594bb8266c1ef16 F20101210_AACAMO joshi_s_Page_049.tif 4503942172854d44bb899a1bfad9592d aeb49c5684b918f4c3389f3394a2440c245c5894 24966 F20101210_AACBQF joshi_s_Page_106.QC.jpg 4e658f591d1e562f7415277569334f3e a95abf7db3d55364d54eb87b67529322d56084f3 5936 F20101210_AACBPR joshi_s_Page_042thm.jpg d706307340ac669000fca02135bec32c 183e8efb399d5758fbcdcfd11000d3c02ea332eb F20101210_AACAND joshi_s_Page_157.tif ceb7c890297a85f46d3c0de3c02066d1 e8de245fc5e484a3a2d7b6398f2775cdef5929c4 F20101210_AACAMP joshi_s_Page_137.jp2 1a86aa5a3c39c14dad218d4ac23d200e d9efad298b2d2ee5004cebd617c60dabf0a5119e 19562 F20101210_AACBQG joshi_s_Page_069.QC.jpg 9bc98748816283e0702b586a06653283 a5c07cd1fe9176acf8f844770ae6fa5b0b902199 6548 F20101210_AACBPS joshi_s_Page_016thm.jpg e4972ee5f1d1705876947360b46afe2a 334ddc4cfd002025ed246f87442cd8ab4673bad7 22319 
F20101210_AACANE joshi_s_Page_011.QC.jpg 5f78eabdd5e9f150d26cc9910f860c17 460a7e1d9a65b622db551378c23590a7c68e059c 2201 F20101210_AACAMQ joshi_s_Page_129.txt 26a24245a4794ad2cd2d3c160050285e c9984bccb4a055269d5cf38e18cc9214ad3861c2 5880 F20101210_AACBQH joshi_s_Page_112thm.jpg f0306c091c82919a92a5b27c1a352ef3 9be912efe211f1893ce844a50e5e4470913389d4 17855 F20101210_AACBPT joshi_s_Page_093.QC.jpg 2ae6a5db7b08aa6c2ff222a43b795f61 7ad9b0218ab0c32fb70660f24c26701a3f00c9a7 65695 F20101210_AACANF joshi_s_Page_006.pro ee496914b21ec1c565725491597f1b76 0ac3d40ec69b3ff43b08823a538dc8322ab16bee 106067 F20101210_AACAMR joshi_s_Page_146.jp2 de5c2879461d751e4039c0d273d19aef 40f9a080f91c2fd796ed80c915e8cb6ccacd732a 22888 F20101210_AACBQI joshi_s_Page_004.QC.jpg f335c238145429147d105f54f3560edd 68acac169ca2f4f62d0d9a431eadccbaca542d09 77972 F20101210_AACANG joshi_s_Page_160.jpg 2a70b8c3c15624835ec0ff6825a0f0a0 40b224079c33b428e6520720e43cb4f736e87c7b 17921 F20101210_AACBQJ joshi_s_Page_142.QC.jpg e4159ca5e86817bd027ed398f79c7aa1 c0b253b96256348f9e5b8dd21cc237f59d6f2347 26894 F20101210_AACBPU joshi_s_Page_025.QC.jpg c996313ee9cfb1c2cf7364696300cd89 6e89d1d19fecd5d92ba9e35f86d2594d7353e590 2340 F20101210_AACANH joshi_s_Page_030.txt a7719d303e1136aad1d69f4ecd92040d cf5215550381de59c14f3cad0c7f1d4a8eb35784 F20101210_AACAMS joshi_s_Page_151.tif 07ba1deceef7ab787e82acba734f5d55 61c026d331e2b28aaab18562f3336c6e5429ca43 5007 F20101210_AACBQK joshi_s_Page_132thm.jpg 16934cba0a68786c59014713f862c9c0 397daf245527812d789876fc5fc1ade55ac8f72d 6964 F20101210_AACBPV joshi_s_Page_022thm.jpg 5d32de98f10d11967874222c13058d7c 9015ce585fd498383ccfa430b0b0fae68d015c97 14075 F20101210_AACANI joshi_s_Page_155.QC.jpg 3c6b842d371956d6e011ac3d07150293 de3a08e3ea46085ce87930ee5980517adbfa4330 51013 F20101210_AACAMT joshi_s_Page_101.pro a1e2d57ecdc5f1630e4fbaf86f3c97f3 74228036aad082472149db32239f4f37e4155e7b 3927 F20101210_AACBQL joshi_s_Page_088thm.jpg 384e4bcf7bdc122ab72946eb51daa61e 
c9ad87113ba4d7d02bf4f6b6906239541e2bfcf5 23233 F20101210_AACBPW joshi_s_Page_112.QC.jpg cb8d4fc689a31d7c3a17e859bad154b6 9adbcd52ddcddce48cce257edcf7b925bacc71ed 2517 F20101210_AACANJ joshi_s_Page_114.txt 5d29a79541fbb02fcd7f55860af29fd7 755edd47cd9d7c0d2cf1fbe769175704b21d273d 2220 F20101210_AACAMU joshi_s_Page_130.txt 30c68a0e82015ffc1ff2ebeed0ede42d 42f77dccc44f21018d451f385a1c28b40fa94a9e 4553 F20101210_AACBRA joshi_s_Page_087thm.jpg 61078674dad01c21c2601e7a933d3c5f 42f1c152f9bba83dc06ef4398bed37906d531e9a 3537 F20101210_AACBQM joshi_s_Page_164thm.jpg e79c7e48e23d14072805fd62f4a6d946 71cb181d0276ec99a3f1a60506cbe822091a5f7a 5068 F20101210_AACBPX joshi_s_Page_067thm.jpg 2095b0315e1fd0eddd513b6c3f7a3b37 65864c2dbe785b7513dba2f11408648a0feff43d 1816 F20101210_AACANK joshi_s_Page_097.txt 0606f046e24a60f18b47dd94315bc892 04987b5a8f9d8f471fb92199f02f7684bc83faeb 57804 F20101210_AACAMV joshi_s_Page_136.jp2 a0e7cc62d3b89be3ba859abfaecdb4d5 c9cb86ad2cacf6849ba392105379097afaa86354 24981 F20101210_AACBRB joshi_s_Page_016.QC.jpg 23d551a68f471b7fbd7fce88fb1a1059 68f76a8e9593433e079fc3cd9419cd948b384466 6140 F20101210_AACBQN joshi_s_Page_096thm.jpg cc77cb1d4a42e98368d01004066dbea0 c8e4af3fcc10baa992a46fb0c5cbf9fbc0935dc1 4628 F20101210_AACBPY joshi_s_Page_057thm.jpg e6766115720e6f026a54a9b2f4fadf27 9c941484b8b7fa0f10fb803342f964f711f4e952 6229 F20101210_AACANL joshi_s_Page_095thm.jpg f1429fa732b9716fc83d2a77fdd71f2c 162456f6ebf0a89cf60d89f4b291e7f1f4472118 5750 F20101210_AACAMW joshi_s_Page_011thm.jpg f9c31f4f7bf66f8159a13c83e83ed016 0d68e51bf38bf9a1aef638d2498bdf289b1b3dc8 2655 F20101210_AACBRC joshi_s_Page_072thm.jpg e7a660e932ce6fc779f9ad445372afd3 2dc7503d2f87b022bb58d537b72122fbf23dd1dc 23901 F20101210_AACBQO joshi_s_Page_113.QC.jpg ff2e98580de4ac488963171a78304540 29598293f48491de5f87eb30407c3d7b369ea285 20025 F20101210_AACBPZ joshi_s_Page_054.QC.jpg 4d6a347311035f1360628b296cea9042 062bc00f39aaff436a0e9669fb2531bca8ea5f6c 53080 F20101210_AACAOA joshi_s_Page_010.jpg 
78c1d84e3bce486b63c3ed1262a008fd a6a1d055e8b3603712e66c3610a103254b1af72c 6261 F20101210_AACANM joshi_s_Page_102thm.jpg d702508c8978a6c41e4418b870ff28bb 8175774be115efc6250b2303d6993c328aa1b431 50260 F20101210_AACAMX joshi_s_Page_033.pro 22aba5160b288e1cc4ae7b0c172320c0 90f424cbb415cffdbef0e619fc1edc0511fa81c6 5644 F20101210_AACBRD joshi_s_Page_082thm.jpg 575f3bacbea7a8a75f538facbe1aa204 6499c3952dafbf789b4207a36fe57706f9a2c464 5513 F20101210_AACBQP joshi_s_Page_098thm.jpg 6869d86ffa1ef66d8d1cc6a71bdaaaca f6b89e81bfc75a3e356f4a050a7b71aca27def72 F20101210_AACAOB joshi_s_Page_029.tif 2de26e47a5ccde2b6ebdfdb2080b3712 bc7070b804a8fcde74d41fe207c60a4a4a233342 F20101210_AACANN joshi_s_Page_111.jp2 3bb6364cdce45370b53935fe2b37016b a5690e474f574246186d3ebd250f1c98d858c17b 1712 F20101210_AACAMY joshi_s_Page_142.txt 2fefd1d2a291e91e67387b35e0ca8814 59319f4197cc596e329f1c9bb4d5cc6ac1227422 27740 F20101210_AACBRE joshi_s_Page_111.QC.jpg 55cefd4c59bd243040086e6629995b56 f810b1a4d2ddb04bbea3051284e483c5b6f8fd02 6151 F20101210_AACBQQ joshi_s_Page_161thm.jpg a9a896289def5c187648e0d871864372 50c074a0ff83b2a99259bd4b1081ccc6f3aa4c99 F20101210_AACAOC joshi_s_Page_011.tif 87eb9b97f33b5641e4e5691b28942851 a621a50b0f265e6bf7c9ebeb5f135d71e813334b 2311 F20101210_AACANO joshi_s_Page_029.txt 2bd1908d2f407cb46ee319aab1bfdfe8 ecbf89ce4c892b8fddd8ee0a29518813ab07d1c4 80222 F20101210_AACAMZ joshi_s_Page_057.jp2 541ede0035fc756bdcd6c3d07b7d677a 792afb3f5d7a468733924103b6fdce80497670d8 12982 F20101210_AACBRF joshi_s_Page_136.QC.jpg e719569d50fcbe27c09959eff831318a ed44238a3329a21fd40732b845eb5747058abeed 1371 F20101210_AACBQR joshi_s_Page_002thm.jpg 2717f210b74df17af9672163119915fd 139849a48f6a3b16ed19d6001a9baed4ddd4b3e1 2350 F20101210_AACAOD joshi_s_Page_150.txt feee04a4c10617c73791a061784a477c 69da1686b08758483b5acb8a2a461ead9d287661 50190 F20101210_AACANP joshi_s_Page_013.pro dfd9ff35b75c44add09dbd03083a5bc8 c2ee2090f4fc3f9d944d4d1b576ab07c509baac5 5408 F20101210_AACBRG 
joshi_s_Page_125thm.jpg b0662f4dec23e723788c232c291696f6 904c2f1d796dd10db01489f2a2a8e21a903afde3 7066 F20101210_AACBQS joshi_s_Page_128thm.jpg f12ead7e4ebb906c8fb9ffef8fea9430 923ca648da2f0bdd6d31e515122dcce3162bc2fc 5088 F20101210_AACAOE joshi_s_Page_074thm.jpg a9b1bd7cbcbda69b52066fef6c7afe69 a49d97e2eb2a7ab176eec082626ca83ac6c8e765 1051945 F20101210_AACANQ joshi_s_Page_033.jp2 82aa82029509dc5a2f74a9457a3becd4 ad1f3b24c8e47d3f303e8994ed9f8a52e25613d5 27818 F20101210_AACBRH joshi_s_Page_024.QC.jpg e23d0cfc94fb2bf918ee10914589a391 8f823793957197e7e0e92573a0dbc57d5eec2c22 F20101210_AACBQT joshi_s_Page_031.QC.jpg 6877d173f7b01788d5fcba689c46a89c f810989ed7342b5dd3c9a68590a4261a224303ce 27482 F20101210_AACAOF joshi_s_Page_029.QC.jpg c83344f08216164fbc47d163f558f645 d3de6ed4d4c1091e8a265beb66063bca7188b24b 20727 F20101210_AACANR joshi_s_Page_146.QC.jpg f5afd1d3b7d1345e35f3bcdade80e3cd e16adcdaa4701cfad6dd6d56667ab545cfd8942f 5157 F20101210_AACBRI joshi_s_Page_103thm.jpg f7b15d1b25ddcc8b4629bc2fe914dc8d d4a559b0a54d34e11c86cd05975033c86718b14f 4959 F20101210_AACBQU joshi_s_Page_089thm.jpg c45bfdf34ef49a1c02bc7dd8515003ea a393e244c35f0b4c9462d82470c63944a5cab557 2238 F20101210_AACAOG joshi_s_Page_092.txt 32cfac6e68f2794b36a71172682e7374 d79de0ebe98e45b77b847f444ea5d815a79d8093 49584 F20101210_AACANS joshi_s_Page_147.pro b945904e1667cc04ada497779b1456c9 dad6ca0da9817be7c205c969fb0335f83784f00b 27485 F20101210_AACBRJ joshi_s_Page_027.QC.jpg f89ce8498e5fbffc39ec76128c934053 cb91954f641ff7865b8d8df3f0e0cb6e168c3fc3 F20101210_AACAOH joshi_s_Page_153.tif a181c8e94185ca3abb8dc26498f1815b 213b5887d3f32c7293a4305649c30f8d410113ee 13500 F20101210_AACBRK joshi_s_Page_135.QC.jpg 283b293b6048f2ddbab1911fc298e72d 20e72487ada1c56e4e1f717ed2392ebe2815de2c 7068 F20101210_AACBQV joshi_s_Page_031thm.jpg a854b3494aaadf4fdc1991330797c4df af87725f31c0bbcc7fe3d09e73f97c7526b042fc F20101210_AACAOI joshi_s_Page_123.tif f3bf68a5920f009fe39a56bd050a0773 f54fb50503fe341831539808c6564bf8d995d5ec 
1836 F20101210_AACANT joshi_s_Page_105.txt d3403437a8f9b3506f7fc078222c61d3 713d627fbcc8557904f0e90d029f58e8ff1fef5c 13692 F20101210_AACBRL joshi_s_Page_007.QC.jpg 0470ef870c89ffa4ab624a0a5fa3082c 9f38e45ead96aaa8034e79438e1df53c7915a69d 3928 F20101210_AACBQW joshi_s_Page_136thm.jpg 743daeb2647056a9dec5e937e852c4c1 77961b383070f3c94ff5837734652e04a9e805b6 62749 F20101210_AACAOJ joshi_s_Page_023.pro 68cc98565fa6072968df579cc816e720 c163ce151b5c76d8be4b18eb32563733e8866c6a 58389 F20101210_AACANU joshi_s_Page_067.jpg d7bd8d4478ebb9642a23e0dd02072211 56eac78de7d361d11e6b5ba2cc0535e05cdcb5d6 16759 F20101210_AACBSA joshi_s_Page_141.QC.jpg 5de46f6fe5cbe1b7660c72e1e535b65a e860a3ba6146139f3fc749fd308d71ce4dbb02da 16854 F20101210_AACBRM joshi_s_Page_107.QC.jpg 2fe623b7f541a7a98bca03c538837208 b1377f2ebe53af167e824a7d5ba7ee1711095868 18556 F20101210_AACBQX joshi_s_Page_125.QC.jpg e2818feeae9af3dc43590e72a59214c0 b6acc764e002669ff638fdfcec6e00b2b524ab60 53681 F20101210_AACAOK joshi_s_Page_106.pro cd68f5362746ccebbef8556afbcac045 403498ff787f984bcf398b5d8dc92b2d3e256560 1463 F20101210_AACANV joshi_s_Page_107.txt 34aad97913520be51802850672bcabf4 d5eb27c0506041920dd8eee75038eb81172d603a 6947 F20101210_AACBSB joshi_s_Page_099thm.jpg c57e1abb1464929db1a3fb954965c2ec 8c2d577ac463c05bf06e99b67013b240675d9447 6595 F20101210_AACBRN joshi_s_Page_065thm.jpg 7071d2a484b0350dd7c4387643b0690a a2b5de33442311984c6502bd88ed49c54475dae7 5506 F20101210_AACBQY joshi_s_Page_146thm.jpg 3fcd51e4a038e594b50ab590bbc508fa 9f3a52a38ea6f506e8b80936edf4fb88a5d40bc8 F20101210_AACAOL joshi_s_Page_116.tif f8cf217e9c13fce45cd94460510a2abd 6b3ee29b0b5dd787675dfd167d49f6460bc58e6b 51362 F20101210_AACANW joshi_s_Page_078.pro d062d9a61e5863fb411616fce28d49cb cc4c614beb77bdb57581234aeeb89116411112e2 4899 F20101210_AACBSC joshi_s_Page_120thm.jpg 76a0b4180192be34689bb17d6cab4ccd 003fe182d8b143a7dcd94b9468f9b7445dd9674f 5932 F20101210_AACBRO joshi_s_Page_052thm.jpg cec35fb9812eed483a3cad0cc81b2934 
7f2283e8f206aba187d75787fd0473863ea702bd 24855 F20101210_AACBQZ joshi_s_Page_144.QC.jpg 64f698cb9036ec0f4589021aeb3e3cc7 73f79ec3248f0cdc1ece6215e2d299790921079b 1106 F20101210_AACAOM joshi_s_Page_138.txt a4be08e5d6b3f3e1fcfb4026abcad957 22c52d3ab2a969c7a72d872d614fb58de1d43c99 86722 F20101210_AACANX joshi_s_Page_021.jpg 65e56f9cb5f47d7225b6e346ad8e5cbb cd32c77e665edc5feae7a4a9e4c262c65241f146 F20101210_AACAPA joshi_s_Page_038.tif b22ef01208f50fb430c961b170728ba3 efc5d788217a6c4a4a1773c525b9907d921f4884 22624 F20101210_AACBSD joshi_s_Page_151.QC.jpg e32c5b24f3dd3db26257e50f27ea1b84 30a03dfaad058310270d49c9064ab704744ef747 4827 F20101210_AACBRP joshi_s_Page_124thm.jpg 04fe5f0b2f65218d2a50e1504910c7b5 4ac37fa8715ddc1e25c78cb5a984c92ed2b7e77d 87293 F20101210_AACAON joshi_s_Page_028.jpg 39ef44745b461620e19568ad60dee072 3bd25c7b53790ad744ec690feed00e60ab13b774 2392 F20101210_AACANY joshi_s_Page_055.txt e8e6441107f7759aa567e082506553f9 70e305c3424cefb172520dbfd5ac1fa3ca5351f1 21751 F20101210_AACAPB joshi_s_Page_143.pro 6db193f9bf5c785d2cd9888ebf3b1951 614e72945b682105ad7c3ab90c5a46cf14a79489 6616 F20101210_AACBSE joshi_s_Page_152thm.jpg edc9a1dc108c3aec06acbb1a3526a8ed 5debc3418ae4e7aced4efec46722c10f180e690c 5447 F20101210_AACBRQ joshi_s_Page_085thm.jpg 7794c1e875833f6ce9a4f83ad9552a81 c2f98334ac8b23388f6b8a12a9c84215e44c163c 78466 F20101210_AACAOO joshi_s_Page_078.jpg 0205495669a038a926f04fd831971348 f3cb5a4c5fdc8a1e8e5f06d19ceb32c15adfa5bc F20101210_AACANZ joshi_s_Page_001.tif 7ec3a467f5edade950d68ede597a369d b53b36d5a3b4f16d9062586eaf0f2d3157847350 48129 F20101210_AACAPC joshi_s_Page_062.pro 6ad0cd4820805f91d2957ce7527216e3 20ce592be41b68f2159615356a1cff15abf31050 25426 F20101210_AACBSF joshi_s_Page_081.QC.jpg 26f13f9b6c428906e4286a6b9026fe60 64889bd292c887ae24f4bbf9014a87ae78843a16 14219 F20101210_AACBRR joshi_s_Page_083.QC.jpg d3fdcc86206836e24299a05f22c98aec 83fe73badc92675f8715d347e36681ed25adaf4a 1051944 F20101210_AACAOP joshi_s_Page_126.jp2 
9eb6c322e458e89bbf275a87b8555f3e 351cc5f502c5e4eda7855e1eefde404efabaec67 68368 F20101210_AACAPD joshi_s_Page_013.jpg 4e2c85a1663df4477a13f02aa98e5dd2 a244659bf7ccccab2f4b56e626a2da51af720493 6804 F20101210_AACBSG joshi_s_Page_111thm.jpg 4df417fd6314a5d0e05101c7a3121e86 3eefc402884f1eafedf380e44a0a5249f29ad449 27534 F20101210_AACBRS joshi_s_Page_028.QC.jpg 7da996da003ccab842943ce3ed6f3c81 573e280d82005ded61a9dffb6efbdcddca51d69e 53545 F20101210_AACAOQ joshi_s_Page_052.pro 703871ccc9a0b8fd50ebf76686f64057 2d293e80de3b442ff5a419f4d05078a7fcadef8d F20101210_AACAPE joshi_s_Page_076.tif 8eb3cb586df7c9aa7e9d65a5a24654a4 7bb25c19c4970802f9ec3ed66aec8099072d025d 19237 F20101210_AACBSH joshi_s_Page_075.QC.jpg f1e73eba854e8f6c8c9fa90c2a18ac2a e9e6eb581d8434185f58a86feba1b1ed912cfb7f 3324 F20101210_AACBRT joshi_s_Page_143thm.jpg bf564b0acd5fa3c09caf5176732219ee 1dda603dc2a6975790b74d48167d6a956731d268 59556 F20101210_AACAOR joshi_s_Page_152.pro a9f809c4897a4e75f52e67a9672bd1ef 91865fbf550a5cede02b5f67522b948540467585 214675 F20101210_AACAPF joshi_s_Page_032.jp2 b3aa3c3d673724e956f82434dbae8a13 285e43833d053d563d184cecc73d3468dd1ae178 6404 F20101210_AACBSI joshi_s_Page_133thm.jpg 1a2b453e2d77c8e699eee4a260b220b0 8f648596282ae901cc7b489441dbf16fcab502ee 5721 F20101210_AACBRU joshi_s_Page_041thm.jpg 6e65942b3526734b664f59e4bb5b05f7 6091938327f5c9b2942527900c052055d2aa22f9 6925 F20101210_AACAOS joshi_s_Page_066thm.jpg 0c944399c12d1f3cc97b49c1b3bb9e39 0e00c2c3ded5167829f6e35c51039345dc208e19 70215 F20101210_AACAPG joshi_s_Page_011.jpg aa3fa8aa13a52a4fcc8ced4c2ea6ee51 0b5f9f5ab76a42d7884e8d8af06a764d2d3a27a3 F20101210_AACBSJ joshi_s_Page_109thm.jpg a67af9486e2b5b8929857bdf547712a4 16df86862741dcd68e839b483126787a3d6510ff 5664 F20101210_AACBRV joshi_s_Page_101thm.jpg 5a55570c07134ed2b9da8bd4b7829bf3 457da96535632b9a670fde34ed8107c0e9eabcb5 2265 F20101210_AACAOT joshi_s_Page_019.txt eca6100444554100039bf3090b125c52 f58d0859458898de9d67d2e5aad1cb790d058475 25185 F20101210_AACAPH 
joshi_s_Page_059.QC.jpg 2d06bf30d7f5ff14f5881cf6950efa06 02ccf79a009fcebc6179bf1c3c04ab07a211a909 7053 F20101210_AACBSK joshi_s_Page_015thm.jpg 829dd4eb079a186497f31263d07ffc7e 0942fbd793a6e0ea337931e4178d2fdf8d2742f7 37124 F20101210_AACAPI joshi_s_Page_042.pro 5a62a015b8b0c2c637719ec59fe1adbc 8bd374fa2f1744eace9d01bccde7b6f95a2d7395 28741 F20101210_AACBSL joshi_s_Page_023.QC.jpg ff807b394d1f3c240ca9b839c3c97c92 db6f0e3bfbd81da598a41e9fb92f744ddafd1932 14036 F20101210_AACBRW joshi_s_Page_153.QC.jpg f14a05807cd8e525be5b9a1c7c020626 9108cda6ae80925479749a28b5e88051227b828f 2018 F20101210_AACAOU joshi_s_Page_039.txt 789465851389807aa9fa8043b61637a6 119e54de1d924f1fe5dcef4733e5e3927cbbc7dd 26091 F20101210_AACAPJ joshi_s_Page_123.QC.jpg 4136260b6f97faf158e1cb0d06167134 8f491c6538917a478b9363cc026b2d9a9c526275 F20101210_AACBTA joshi_s_Page_049thm.jpg 95ba54ba26f909bf2c03d07b4484a760 13b207ff01ca48b60abb0a2793bcf13a3de0e581 23850 F20101210_AACBSM joshi_s_Page_050.QC.jpg 13fd169730bdefd045565579b954d89a 3686566825057d1246e694b94ca34e9160b71034 19187 F20101210_AACBRX joshi_s_Page_120.QC.jpg 787b73bd5aa67c5aa46dc18c214bdf0e 5589f2fce0ceaa5448dd0dd70f1cf54c767ce3d0 F20101210_AACAOV joshi_s_Page_118.jp2 dffcc39b19be55e2c3fa5ef00061891f 3fd2a8d879a4d0fd55a2fcf7581196b874630ba1 71442 F20101210_AACAPK joshi_s_Page_048.jpg cd10cd3684e07d71697c74ebfa8aa499 05a9537aa48d92a46b620cf7de84d4e0ea81b962 23030 F20101210_AACBTB joshi_s_Page_162.QC.jpg bdf5cfee113d852fadac6f23f2935454 c25fc88cc103c6320103df5134d147e7d6bdaa9b 14929 F20101210_AACBSN joshi_s_Page_154.QC.jpg 2a24e9a359f9db40adebffb59b59c2a7 28735dbd675910f78469733b80636760dc50df94 27559 F20101210_AACBRY joshi_s_Page_145.QC.jpg c76b78fa3c45ca91ac77664b51539f04 a732386c59b853162453cc74d5f9a94834ab9d17 3740 F20101210_AACAOW joshi_s_Page_153thm.jpg af2dbacb6e27b6ab724ed3cd8c41a898 f9268d5313f53465a7c625655cb10b33ceb2a39e 189492 F20101210_AACAPL UFE0021217_00001.mets 1102ce4d6da7687481a3fdd7f4f5bc38 
516818336c639777ff8605bf52fbb40c52e694e7 5722 F20101210_AACBTC joshi_s_Page_151thm.jpg 575a8fcbbfcc30388f778664ce1afb4e 6ca32bc95f65b2d89b082a102fa1d9d20bd60901 6996 F20101210_AACBSO joshi_s_Page_023thm.jpg 644dde18b06d916517d05165cb30ef6a ab948758d80ca0119113bf9294ff9dd8249d330b 28898 F20101210_AACBRZ joshi_s_Page_070.QC.jpg ebf03953d1fd2e0ef8bb6c543941a3bf 83ed27d3b7f353e636560a38568f9e826aad1cd5 2183 F20101210_AACAOX joshi_s_Page_137.txt d2a1587bafb9dd13d41c8954e0356371 5d5bc23b76c7ccd556cd569a59acd009c99dd284 84797 F20101210_AACAQA joshi_s_Page_019.jpg dd46515a9cb7f3bdb29337635e752333 491828bd7d41d4fe40f823f927eb3090c4e0d171 21707 F20101210_AACBTD joshi_s_Page_041.QC.jpg 2d741a838a7c29b5d65b7aefefa8f888 de48ca199771fc8240960a7fcc3a1f3b74b79b0d 27795 F20101210_AACBSP joshi_s_Page_030.QC.jpg bad3548230230f9b183a3258d6fe5fb6 3decdfbc84c8e07a0f642de743c609422b1c46ab 54032 F20101210_AACAOY joshi_s_Page_060.jpg c11c59e3e0999f074820b4a66a1d8fea 8d4586346e763444a0b11ccb020142c34fde1003 92330 F20101210_AACAQB joshi_s_Page_020.jpg 1aab519938d2c59fffbcc41ab0ba5b40 330c5951d1e6fafe6e9eab4ffc620ccb63506e5f 6375 F20101210_AACBTE joshi_s_Page_037thm.jpg fece6cc5a18a16859f399accd3e8fe94 4e95f70ab6a51c2fa984e26644845da3d7ab7758 25210 F20101210_AACBSQ joshi_s_Page_034.QC.jpg 32a9ad159b239d95f8f896d072cd0507 93a8486262a7dead88fbf17afbf451027804ada4 61933 F20101210_AACAOZ joshi_s_Page_020.pro 4fd9c0c730962b013d3bba972fe9d166 516047a9152beace064391f58a06a1f5fee511bb 87831 F20101210_AACAQC joshi_s_Page_022.jpg 0403a2450b65ee60a5347a7f502d175a b611d3bc6591017f1442184998db2d9f5bfe5d65 23478 F20101210_AACAPO joshi_s_Page_001.jpg 24fe9e89813a71f263de541e5d438596 b6d40ba71b4b56d3960b8ac213c161673b54a9e1 4249 F20101210_AACBTF joshi_s_Page_056thm.jpg f128fd3a1b167611021968b4d3d057d8 00f3b0868e84f93a96f14612313303b93ea13b9e 6412 F20101210_AACBSR joshi_s_Page_077thm.jpg e3a500849b786e33f3484a3aeec753b9 4697d0b7481200ab8e1c7ebff5786b251d385fa7 92343 F20101210_AACAQD joshi_s_Page_023.jpg 
c9645dfd6f2ed27ffbf56fc82b009719 7fd20c372e688fc8f8b08739079eb6270461dafc 10033 F20101210_AACAPP joshi_s_Page_002.jpg a618ebadb5b98f4566b5c21ba6d2487c 94bca1b4436c553799cda6f6d489962a2ec01633 25421 F20101210_AACBTG joshi_s_Page_149.QC.jpg a0114a0463af82d635e1e0737fbf3a5d f68ceb14158373ecf1daf5343762c0a2b0235b9b 23265 F20101210_AACBSS joshi_s_Page_104.QC.jpg b00813712e4e48b4da3526d9815c091e c1f3f85cdcf8a1df7e8c589ced81e568500113c8 88692 F20101210_AACAQE joshi_s_Page_024.jpg 409a883091e10495e703eb4a340380ec 8f7475c1eaf4bba87d679cd7a8a92344d8ffa599 10474 F20101210_AACAPQ joshi_s_Page_003.jpg 8fcafb91038a5c92449d62721427750c d492e802d922946b69121d5b4d0837269747f857 16403 F20101210_AACBTH joshi_s_Page_060.QC.jpg 0441eef651770d39bed39da0f89619cd 2e006d2197f626f7a68d2c94c4a2a10e400cbe87 22582 F20101210_AACBST joshi_s_Page_163.QC.jpg 6674879085fdb1e409c87eadcdd43e3e 91892ccdca9f90ae456e3ad81b42968eee9dcf06 85635 F20101210_AACAQF joshi_s_Page_025.jpg fde8a8c749e6ea407742bdbea350e18b 620a8cd8c7d671e0cad3f4ec979a2f8c7c66b715 72216 F20101210_AACAPR joshi_s_Page_004.jpg 8a980dc16e64346eafcb4b0dd21eef3b d4d6ea64c86a26db1be47fbb45d5ef2e5376cbbe 6420 F20101210_AACBTI joshi_s_Page_091thm.jpg ee7a9217c6714cd956eea9cc6d7bdf12 cc9da86482cc7391eba7db59b89b70554fdc20eb 7179 F20101210_AACBSU joshi_s_Page_020thm.jpg bcb28cf3f29c4c1f17a9144ba50c742f 1333cae74ab9219faf82bdf319fa1b9fca1044c5 88717 F20101210_AACAQG joshi_s_Page_026.jpg a03b8e238581a7d486ec932b223dfde1 17d456d0b057761a8a5a4c1406fd41e3050ae1f5 49909 F20101210_AACAPS joshi_s_Page_007.jpg 499b93577551dc562ec3caf3d08f4238 96079b19ac30431da141a77b8e791cc934861d6d 17659 F20101210_AACBTJ joshi_s_Page_132.QC.jpg 0675690db6e8f3b7830afe9149b43f48 68bb7b9b55988bd1977c6a19bdb426c0f2f72f18 5363 F20101210_AACBSV joshi_s_Page_043thm.jpg 6c18cbec0b27bb669a1812bbd0c8a5ac a303f21d82d91f9c516dda43c189f980731fb8e1 87684 F20101210_AACAQH joshi_s_Page_027.jpg d4af41f4159cc946b6d7a907473c570b 4e80b0cf326aea67f30829e9e14a35dd25045509 116117 
F20101210_AACAPT joshi_s_Page_008.jpg 532317cd130cd22557dd67f19fc8335e e790152b3bd510773a07371b29096a2ae2a1bc32 23280 F20101210_AACBTK joshi_s_Page_122.QC.jpg 93e68d4c9bb68e4a936588fbc71fc72a 7a33092b58811a675f797df99cbe1ec31d3a6d87 4284 F20101210_AACBSW joshi_s_Page_083thm.jpg 9c92957199c0fba01cb97c022d859545 e5067c01f32f2f72c9a97d1fab45e033721c99e8 87975 F20101210_AACAQI joshi_s_Page_029.jpg 80c2a609365b50b665c03b466d6a1c36 d7d118f2f6b8ff14c91ccac547dc11ab69196093 18255 F20101210_AACAPU joshi_s_Page_012.jpg 99d5b7f997493df64f555b0a064c544c f4d7aeb62f053b0ac712682ea585258b823d4ab5 13057 F20101210_AACBTL joshi_s_Page_138.QC.jpg d6fdcc75c94662b1b73716e53edb2e50 fd0702487817a6e93c04a024eab81156e64246fc 86891 F20101210_AACAQJ joshi_s_Page_030.jpg 3c63d1181ade34a373ec6e439e3a7119 8299cb2e63998304fe0c54369b4db0ddbc9a084d 6806 F20101210_AACBUA joshi_s_Page_110thm.jpg c804d1e6f1a3ce470d3059fb7afece83 ca9086d0844b9536f82955fc19d29bcb51bab7c9 24940 F20101210_AACBTM joshi_s_Page_077.QC.jpg 1a07d8e0b93f3c46c01103c88695155c 5892074ce468308cbbc57a22dca73986a45e9c7d 23653 F20101210_AACBSX joshi_s_Page_147.QC.jpg bfab9de9e06dc2ecc8244c68e69bb0c9 4fd1e07f43d2ada4dcc19c08bfef64af5b28601c 89740 F20101210_AACAQK joshi_s_Page_031.jpg b52524c671cb4faf2bf6407178800e44 371143cc869d945ad9ab9e3d8aab9e38dd2e2e5f 79386 F20101210_AACAPV joshi_s_Page_014.jpg c1fa2a6879cb206b649424dc967ef5b9 21415b0afc5e1edb3e0a35a2aeb4f2fb1c49c620 6618 F20101210_AACBUB joshi_s_Page_025thm.jpg 943c50aba7262d2c63b47aa59ce82e40 931a34ce66d615b55a4cb236142a080dcc5381ff 22309 F20101210_AACBTN joshi_s_Page_052.QC.jpg b845697c4fc8ef4780335431749f1060 d1bfd9fdfd19ee9de071a6c063aa8f51f5d2a49b 26466 F20101210_AACBSY joshi_s_Page_115.QC.jpg 7574adf6944335e48ff6b90bc4eaa8b2 35dd010f46f435d1ead8b354e67839fa3ff695fb 19836 F20101210_AACAQL joshi_s_Page_032.jpg 4b3ef547a5d356f55bc7888c0effb70d 05d53fd044d8d6398f92000d12e505c8c2d5b756 92822 F20101210_AACAPW joshi_s_Page_015.jpg a1cf114af99f7de04dd30c508d212964 
9539282ba31d992e958b912fc5e4b8d5f37d7ea3 6904 F20101210_AACBUC joshi_s_Page_030thm.jpg 02f8ea263df515d4f4836fbdf5e822c6 2c5df6f94a003be94932620b315b6c04df48ca5b 5965 F20101210_AACBTO joshi_s_Page_113thm.jpg 6961083b4773f252226876f4bd6b5d23 f415582cd28385d223467eb3f23e9e836189fbb5 6469 F20101210_AACBSZ joshi_s_Page_134thm.jpg b816d188acde7ef4c71057cd9c88e56e b07c68440fd8c9712420ae402d8719e3e29d85cc 76913 F20101210_AACARA joshi_s_Page_047.jpg fce6da3a6e71541599c4e4f1cfaef712 38a85af72d027f070eb68e1876b406e083e83544 80017 F20101210_AACAQM joshi_s_Page_033.jpg a569e2e5648b2ed1fae888fe93d5125a ece9b31f6bdecb7a415772992a7748b811ae528c 83704 F20101210_AACAPX joshi_s_Page_016.jpg 9d34dd009ce58826b65b270ae9766e96 d4a1054dfe1d41d1bfcf6389577c31c70ddf230f 23773 F20101210_AACBUD joshi_s_Page_096.QC.jpg f80c23f649c8b5600efe0a453ec2c940 ca5c2ba053c6be40ec03e38ec587562907cfb704 5638 F20101210_AACBTP joshi_s_Page_129thm.jpg ecab245cc2e6b5ef7cbf6fd66cf9d0d9 9eab198ebd31dcd43eb0f0478436cd03910387c2 70886 F20101210_AACARB joshi_s_Page_049.jpg 799f7bf23d78d5d9070302a2d2b9802f 7e6de4afb316cca7749070cec9deedb12b6115d5 91360 F20101210_AACAQN joshi_s_Page_034.jpg 686b8e34fc510e9c50bf4a84e61973ed c96a42b4e93df1f55c6da72cf1cc90d84f31f327 48790 F20101210_AACAPY joshi_s_Page_017.jpg 260c740474d00e3828e2889f7a8d6540 d8cb6b271eb542eb3171d7b3e44ff373ac15ed92 26124 F20101210_AACBUE joshi_s_Page_044.QC.jpg aa7db9e1c95132fdcf707f674bb59f1d 68f6b4df1e6f5d0278eaf1272b9b0def9ce13c50 5089 F20101210_AACBTQ joshi_s_Page_105thm.jpg 5daecd30a2d9a15d63b68ca4bad8931e 14225f89a103778d2249bd4f9feeb1a9473036a5 76457 F20101210_AACARC joshi_s_Page_050.jpg dd90ed2fe98089d83fc8fbaa1fb3d20e 5383853f09f1b48f44fbff02caa7a61874576267 73084 F20101210_AACAQO joshi_s_Page_035.jpg 445607da8a487408aee212a7686e7ce8 0c81c4cf81ac7cc16f7e54272d64a58525565134 48721 F20101210_AACAPZ joshi_s_Page_018.jpg 439ca683e600f7c6a4313169194f81d0 72e6cebf4490125b3798ee876e5de32cf8c9aaef 5491 F20101210_AACBUF joshi_s_Page_063thm.jpg 
SAMPLING-BASED RANDOMIZATION TECHNIQUES FOR APPROXIMATE QUERY PROCESSING

By

SHANTANU JOSHI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007

© 2007 SHANTANU JOSHI

To my parents, Dr Sharad Joshi and Dr Hemangi Joshi

ACKNOWLEDGMENTS

Firstly, my sincerest gratitude goes to my
advisor, Professor Chris Jermaine, for his invaluable guidance and support throughout my PhD research work. During the initial several months of my graduate work, Chris was extremely patient and always led me towards the right direction whenever I would waver. His acute insight into the research problems we worked on set an excellent example and provided me immense motivation to work on them. He has always emphasized the importance of high-quality technical writing and has spent several painstaking hours reading and correcting my technical manuscripts. He has been the best mentor I could have hoped for and I shall always remain indebted to him for shaping my career and, more importantly, my thinking. I am also very thankful to Professor Alin Dobra for his guidance during my graduate study. His enthusiasm and constant willingness to help has always amazed me. Thanks are also due to Professor Joachim Hammer for his support during the very early days of my graduate study. I take this opportunity to thank Professors Tamer Kahveci and Gary Koehler for taking the time to serve on my committee and for their helpful suggestions. It was a pleasure working with Subi Arumugam and Abhijit Pol on various collaborative research projects. Several interesting technical discussions with Mingxi Wu, Fei Xu, Florin Rusu, Laukik Chitnis and Seema Degwekar provided a stimulating work environment in the Database Center. This work would not have been possible without the constant encouragement and support of my family. My parents, Dr Sharad Joshi and Dr Hemangi Joshi, always encouraged me to focus on my goals and pursue them against all odds. My brother, Dr Abhijit Joshi, has always placed trust in my abilities and has been an ideal example to follow since my childhood. My loving sister-in-law, Dr Hetal Joshi, has been supportive since the time I decided to pursue computer science.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
1.1 Approximate Query Processing (AQP): A Different Paradigm
1.2 Building an AQP System Afresh
1.2.1 Sampling Vs Precomputed Synopses
1.2.2 Architectural Changes
1.3 Contributions in This Thesis

2 RELATED WORK
2.1 Sampling-based Estimation
2.2 Estimation Using Non-sampling Precomputed Synopses
2.3 Analytic Query Processing Using Non-standard Data Models

3 MATERIALIZED SAMPLE VIEWS FOR DATABASE APPROXIMATION
3.1 Introduction
3.2 Existing Sampling Techniques
3.2.1 Randomly Permuted Files
3.2.2 Sampling from Indices
3.2.3 Block-based Random Sampling
3.3 Overview of Our Approach
3.3.1 ACE Tree Leaf Nodes
3.3.2 ACE Tree Structure
3.3.3 Example Query Execution in ACE Tree
3.3.4 Choice of Binary Versus k-Ary Tree
3.4 Properties of the ACE Tree
3.4.1 Combinability
3.4.2 Appendability
3.4.3 Exponentiality
3.5 Construction of the ACE Tree
3.5.1 Design Goals
3.5.2 Construction
3.5.3 Construction Phase 1
3.5.4 Construction Phase 2
3.5.5 Combinability/Appendability Revisited
3.5.6 Page Alignment
3.6 Query Algorithm
3.6.1 Goals
3.6.2 Algorithm Overview
3.6.3 Data Structures
3.6.4 Actual Algorithm
3.6.5 Algorithm Analysis
3.7 Multi-Dimensional ACE Trees
3.8 Benchmarking
3.8.1 Overview
3.8.2 Discussion of Experimental Results
3.9 Conclusion and Discussion

4 SAMPLING-BASED ESTIMATORS FOR SUBSET-BASED QUERIES
4.1 Introduction
4.2 The Concurrent Estimator
4.3 Unbiased Estimator
4.3.1 High-Level Description
4.3.2 The Unbiased Estimator In Depth
4.3.3 Why Is the Estimator Unbiased?
4.3.4 Computing the Variance of the Estimator
4.3.5 Is This Good?
4.4 Developing a Biased Estimator
4.5 Details of Our Approach
4.5.1 Choice of Model and Model Parameters
4.5.2 Estimation of Model Parameters
4.5.3 Generating Populations From the Model
4.5.4 Constructing the Estimator
4.6 Experiments
4.6.1 Experimental Setup
4.6.1.1 Synthetic data sets
4.6.1.2 Real-life data sets
4.6.2 Results
4.6.3 Discussion
4.7 Related Work
4.8 Conclusion

5 SAMPLING-BASED ESTIMATION OF LOW SELECTIVITY QUERIES
5.1 Introduction
5.2 Background
5.2.1 Stratification
5.2.2 "Optimal" Allocation and Why It's Not
5.3 Overview of Our Solution
5.4 Defining Xe
5.4.1 Overview
5.4.2 Defining X,.,z
5.4.3 Defining Xc/
5.4.4 Combining The Two
5.4.5 Limiting the Number of Domain Values
5.5 Updating Priors Using The Pilot
5.6 Putting It All Together
5.6.1 Minimizing the Variance
5.6.2 Computing the Final Sampling Allocation
5.7 Experiments
5.7.1 Goals
5.7.2 Experimental Setup
5.7.3 Results
5.7.4 Discussion
5.8 Related Work
5.9 Conclusion

6 CONCLUSION

APPENDIX: EM ALGORITHM DERIVATION
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

4-1 Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 24 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U (Unbiased estimator), C (Concurrent sampling estimator) and B (Model-based biased estimator).

4-2 Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 24 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U (Unbiased estimator), C (Concurrent sampling estimator) and B (Model-based biased estimator).

4-3 Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 18 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U (Unbiased estimator), C (Concurrent sampling estimator) and B (Model-based biased estimator).

4-4 Observed standard error as a percentage of the total aggregate value of all records in the database for 8 queries over 3 real-life data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U (Unbiased estimator), C (Concurrent sampling estimator) and B (Model-based biased estimator).
5-1 Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for a Simple Random Sampling estimator for the KDD Cup data set. Results are shown for varying sample sizes and for three different query selectivities: 0.01%, 0.1% and 1%.

5-2 Average running time of the Neyman and Bayes-Neyman estimators over three real-world datasets.

5-3 Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 20 strata and for varying number of records in the pilot sample per stratum (PS), and sample sizes (SS), for three different query selectivities: 0.01%, 0.1% and 1%.

5-4 Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 200 strata with varying number of records in the pilot sample per stratum (PS), and sample sizes (SS), for three different query selectivities: 0.01%, 0.1% and 1%.

LIST OF FIGURES

1-1 Simplified architecture of a DBMS
3-1 Structure of a leaf node of the ACE tree.
3-2 Structure of the ACE tree.
3-3 Random samples from section 1 of L3.
3-4 Combining samples from L3 and L5.
3-5 Combining two sections of leaf nodes of the ACE tree.
3-6 Appending two sections of leaf nodes of the ACE tree.
3-7 Choosing keys for internal nodes.
3-8 Exponentiality property of ACE tree.
3-9 Phase 2 of tree construction.
3-10 Execution runs of query answering algorithm with (a) 1 contributing section, (b) 6 contributing sections, (c) 7 contributing sections and (d) 16 contributing sections.
3-11 Sampling rate of an ACE tree vs.
rate for a B+ tree and scan of a randomly permuted file, with a one dimensional selection predicate accepting 0.25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.

3-12 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one dimensional selection predicate accepting 2.5% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.

3-13 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one dimensional selection predicate accepting 25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.

3-14 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one dimensional selection predicate accepting 2.5% of the database records. The graph is an extension of Figure 3-12 and shows results till all three sampling techniques return all the records matching the query predicate.

3-15 Number of records needed to be buffered by the ACE Tree for queries with (a) 0.25% and (b) 2.5% selectivity. The graphs show the number of records buffered as a fraction of the total database records versus time plotted as a percentage of the time required to scan the relation.

3-16 Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 0.25% of the database tuples.

3-17 Sampling rate of an ACE tree vs.
rate for an R-tree, and scan of a randomly permuted file, with a spatial selection predicate accepting 2.5% of the database tuples.

3-18 Sampling rate of an ACE tree vs. rate for an R-tree, and scan of a randomly permuted file, with a spatial selection predicate accepting 25% of the database tuples.

4-1 Sampling from a superpopulation.

4-2 Six distributions used to generate for each e in EMP the number of records s in SALE for which f(e, s) evaluates to true.

5-1 Beta distribution with parameters α = β = 0.5.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SAMPLING-BASED RANDOMIZATION TECHNIQUES FOR APPROXIMATE QUERY PROCESSING

By SHANTANU JOSHI

August 2007

Major: Computer Engineering

The past couple of decades have seen a significant amount of research directed towards data warehousing and efficient processing of analytic queries. This is a daunting task due to the massive sizes of data warehouses and the nature of complex, analytical queries. This is evident from standard, published benchmarking results such as TPC-H, which show that many typical queries can require several minutes to execute despite using sophisticated hardware equipment. This can seem expensive, especially for ad-hoc, exploratory data analysis. One direction to speed up execution of such exploratory queries is to rely on approximate results. This approach can be especially promising if approximate answers and their error bounds are computed in a small fraction of the time required to execute the query to completion. Random samples can be used effectively to perform such an estimation. Two important problems have to be addressed before using random samples for estimation.
The first problem is that retrieval of random samples from a database is generally very expensive, and hence index structures need to be designed which can permit efficient random sampling under arbitrary selection predicates. Secondly, approximate computation of arbitrary queries generally requires complex statistical machinery, and reliable sampling-based estimators have to be developed for different types of analytic queries. My research addresses the two problems described above by making the following contributions: (a) A novel file organization and index structure called the ACE Tree which permits efficient random sampling from an arbitrary range query. (b) Sampling-based estimators for aggregate queries which have a correlated subquery where the inner and outer queries are related by the SQL EXISTS, NOT EXISTS, IN or NOT IN clause. (c) A stratified sampling technique for estimating the result of aggregate queries having highly selective predicates.

CHAPTER 1
INTRODUCTION

The last couple of decades have seen an explosive growth of electronic data. It is not unusual for data management systems to support several terabytes or even petabytes of data. Such massive volumes of data have led to the evolution of "data warehouses", which are systems capable of supporting storage and efficient retrieval of large amounts of data. Data warehouses are typically used for applications such as online analytical processing, among others. Such applications process queries and expect results in a manner that is different from traditional transaction processing. For example, a typical query by a sales manager on a sales data warehouse might be: "Return the average salary of all employees at locations whose sales have increased by at least 10% over the past 3 years." The result of such a query could be used to make high-level decisions such as whether or not to hire more employees at the locations of interest.
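The sales manager's query above is stated only in English; as a concrete illustration, here is one way it might be rendered in SQL and run end to end. The schema (EMP and SALES tables, their column names, and the toy rows) is entirely invented for this sketch and is not taken from the thesis.

```python
import sqlite3

# Hypothetical schema and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE emp   (name TEXT, salary REAL, loc TEXT);
CREATE TABLE sales (loc TEXT, year INTEGER, amount REAL);
INSERT INTO emp VALUES ('a', 50000, 'NYC'), ('b', 70000, 'NYC'),
                       ('c', 40000, 'SF');
-- NYC sales grew 100 -> 150 (+50%); SF sales shrank 100 -> 90.
INSERT INTO sales VALUES ('NYC', 2004, 100), ('NYC', 2007, 150),
                         ('SF', 2004, 100), ('SF', 2007, 90);
""")
# Average salary at locations whose sales grew by at least 10% over 3 years.
cur.execute("""
SELECT AVG(e.salary)
FROM emp e
WHERE e.loc IN (
    SELECT s1.loc
    FROM sales s1 JOIN sales s2 ON s1.loc = s2.loc
    WHERE s1.year = 2004 AND s2.year = 2007
      AND s2.amount >= 1.1 * s1.amount)
""")
avg_salary = cur.fetchone()[0]
print(avg_salary)  # 60000.0 for the toy data above (only NYC qualifies)
```

Executing this query exactly requires joining and aggregating over the full tables, which is precisely the cost that the approximation techniques in this thesis aim to avoid on warehouse-scale data.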
Such queries are typical in a data warehousing environment in that their evaluation requires complex analytical processing over huge amounts of data. Traditional transactional processing methods may be unacceptably slow to answer such complex queries.

1.1 Approximate Query Processing (AQP): A Different Paradigm

The nature of analytical queries and their associated applications provides an opportunity to return results which may not be exact. Since computation of exact results may require an unreasonable amount of time due to massive volumes of data, approximation may be attractive if the approximate results can be computed in a fraction of the time it would take to compute the exact results. Moreover, providing approximate results can be useful for quickly exploring the whole data at a high level. This technique of providing fast but approximate results has been termed "Approximate Query Processing" in the literature. In addition to computing an approximate answer, it is also important to provide metrics about the accuracy of the answer. One way to express the accuracy is in terms of error bounds with certain probabilistic guarantees of the form, "The estimated answer is 2.45 x 10^5, and with 95% confidence the true answer lies within ±1.18 x 10^3 of the estimate." Here, the error bounds are expressed as an interval and the accuracy guarantee is provided at 95% confidence. A promising approach for aggregation queries in Approximate Query Processing (AQP), called Online Aggregation (OLA), has been proposed by Haas and Hellerstein [63]. They propose an interactive interface for data exploration and analysis where records are retrieved in a random order. Using these random samples, running estimates and error bounds are computed and immediately displayed to the user. As time progresses, the size of the random sample keeps growing and so the estimate is continuously refined.
At a predetermined time interval, the refined estimate along with its improved accuracy is displayed to the user. If at any point of time during the execution the user is satisfied with the accuracy of the answer, she can terminate further execution. The system also gives an overall progress indicator based on the fraction of records that have been sampled thus far. Thus, OLA provides an interface where the user is given a rough estimate of the result very quickly.

1.2 Building an AQP System Afresh

The OLA system described above presents an intuitive interface for approximate answering of aggregate queries. However, to support the functionality proposed by the system, fundamental changes need to be incorporated in several components of a traditional database management system. In this section, we first examine why sampling is a good approach for AQP, and then present an overview of the changes needed in the architecture of a database management system to support sampling-based AQP.

1.2.1 Sampling Vs Precomputed Synopses

We now discuss two techniques that can be used to support fast but approximate answering of queries. One intuitive technique is using some compact information about records for answering queries. Such information is typically called a database statistic, and it is actually summary information about the actual records of the database. Commonly used database statistics are wavelets, histograms and sketches. Such statistics, also known as synopses, are orders of magnitude smaller in size than the actual data. Hence it is much faster and more efficient to access synopses than to read the entire data. However, such synopses are precomputed and static. If a query is issued which requires some synopses that are not already available, then they would have to be computed by scanning the dataset, possibly multiple times, before answering the query. The second approach to AQP is using samples of database records to answer queries.
Query execution is extremely fast since the number of records in the sample is a small fraction of the total number of records in the database. The answer is then extrapolated, or "scaled up", to the size of the entire database. Since the answer is computed by processing very few records of the database, it is an approximation of the true answer. For the work in this thesis, we propose to use sampling [25, 109] in order to support AQP. We make this choice due to the following important advantages of sampling over precomputed synopses. The accuracy of an estimate computed by using samples can be easily improved by obtaining more samples to answer the query. On the other hand, if the estimate computed by using synopses is not sufficiently accurate, a new synopsis providing greater accuracy would have to be built; since this would require scanning the dataset, it is impractical. Secondly, sampling is very amenable to scalability. Even for extremely large datasets of the order of hundreds of gigabytes, it is generally possible to accommodate a small sample in main memory and use efficient in-memory algorithms to process it. If this is not possible, disk-based samples and algorithms have also been proposed [76] and are equally effective as their in-memory counterparts. This is an important benefit of sampling as compared to histograms, which become unwieldy as the number of attributes of the records in the dataset increases. Thirdly, since real records (although very few) are used in a sample, it is possible to answer any statistical query, including arbitrarily complex functions in relational selection and join predicates. This is a very important advantage of sampling as opposed to synopses such as sketches, which are not suitable for answering arbitrary queries. Finally, unlike precomputed synopses, there is no requirement of maintenance and updates for on-the-fly sampling as data are updated.
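The OLA-style loop described above — consume records in random order, scale the running sample mean up to the full table, and report a probabilistic error bound that shrinks as the sample grows — can be sketched in a few lines. This is a minimal illustration of the idea using a CLT-based confidence interval, not the actual OLA implementation; the function name and batch parameter are invented here.

```python
import random
import statistics

def online_sum_estimate(table, predicate, batch=100, z=1.96):
    """Yield a sequence of (estimate, half_width) pairs for the query
    SUM(v) over rows matching `predicate`, OLA-style: rows arrive in
    random order and the estimate is refined after each batch."""
    N = len(table)
    rows = random.sample(table, N)  # simulate randomly ordered retrieval
    seen = []                       # per-row contributions observed so far
    for i in range(0, N, batch):
        # Non-matching rows contribute 0 to the SUM.
        seen.extend(v if predicate(v) else 0.0 for v in rows[i:i + batch])
        n = len(seen)
        mean = statistics.fmean(seen)
        est = N * mean              # scale the sample mean up to the table
        sd = statistics.stdev(seen) if n > 1 else 0.0
        half = z * N * sd / n ** 0.5  # CLT-based ~95% error bound
        yield est, half
```

A user watching this stream sees a rough estimate after the first batch and a steadily narrowing interval; once every row has been consumed, the "estimate" equals the exact answer, mirroring OLA's run-to-completion behavior.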
1.2.2 Architectural Changes

In order to support sampling-based AQP in a database management system, major changes need to be incorporated in the architecture of the system. The reason for this is that traditional database management systems were not designed to work with random samples or to support computation of approximate results. In this section, we briefly describe some of the most critical changes that are required in the architecture of a DBMS to support sampling-based AQP. Figure 1-1 depicts the various components of a simplified architecture of a DBMS. The four components that require significant changes in order to support sampling-based AQP are as follows:

* Index/file/record manager: The use of traditional index structures like B+-Trees is not appropriate for obtaining random samples. This is because such index structures order records based on record search key values, which is actually the opposite of obtaining records in a random order. Hence, for AQP it is important to provide physical structures or file organizations which support efficient retrieval of random samples.

* Execution engine: The execution engine needs to be revamped completely so that it can use the random samples returned by the lower level to execute the query on them. Further, the result of the query needs to be scaled up appropriately for the size of the entire database. This component would also need to be able to compute accuracy guarantees for the approximate answer.

Figure 1-1. Simplified architecture of a DBMS

* Query compiler: The query compiler has to be modified so that it can chalk out a different strategy of execution for various types of queries like relational joins, subset-based queries or queries with a GROUP BY clause.
Moreover, optimization of queries needs to be done very differently from traditional query optimizers, which create the most efficient query plan to run a query to completion. For AQP, queries should be optimized so that the first few result tuples are output as quickly as possible.

* User interface: There is tremendous scope for providing an intuitive user interface for an online AQP system. In addition to the UI being able to provide accuracy guarantees for the estimate, it would be very intuitive to provide a visualization of the intermediate results as and when they become available, so that the user can continue to explore the query or decide to modify or terminate it. Current database management systems provide user interfaces with very limited functionality.

1.3 Contributions in This Thesis

These tasks involve significant research and implementation issues. Since many of the problems have never been tackled in the literature, there are several challenging tasks to be addressed. For the scope of my research, I choose to address the following three problems. The motivation and our solutions for each of these research problems are described separately in the following chapters of this thesis.

* We present a primary index structure which can support efficient retrieval of random samples from an arbitrary range query. This requires a specialized file organization and an efficient algorithm to actually retrieve the desired random samples from the index. This work falls in the scope of the index/file/record manager component described earlier.

* We present our solution to support execution of queries which have a nested subquery where the inner query is correlated to the outer query, in an approximate query processing framework. This work falls in the purview of the execution engine of the system.
* Finally, we also present a technique to support efficient execution of queries which have predicates with low selectivities, such as GROUP BY queries with many different groups. This work also falls in the scope of the query execution engine.

CHAPTER 2
RELATED WORK

This chapter presents previous work in the data management and statistics literature related to estimation using sampling as well as non-sampling-based precomputed synopsis structures. Finally, it describes work related to OLAP query processing using non-relational data models like data cubes.

2.1 Sampling-based Estimation

Sampling has a long history in the data management literature. Some of the pioneering work in this field has been done by Olken and Rotem [96, 98-101] and Antoshenkov [9], though the idea of using a survey sample for estimation in the statistics literature goes back much earlier than these works. Most of the work by Olken and Rotem describes how to perform simple random sampling from databases. Estimation for several types of database tasks has been attempted with random samples. The rest of this section presents important works on sampling-based estimation for individual database tasks. Some of the initial work on estimating the selectivity of join queries is due to Hou et al. [67, 68]. They present unbiased and consistent estimators for estimating the join size and also provide an algorithm for cluster sampling. In [64] they propose unbiased estimators for COUNT aggregate queries over arbitrary relational algebra expressions. However, computation of the variance of their estimators is very complex [67]. They also do not provide any bounds on the number of random samples required for estimation. Adaptive sampling has been used for estimation of the selectivity of predicates in relational selection and join operations [83, 84, 86] and for approximating the size of a relational projection operation [94]. Adaptive sampling has also been used in [85] to estimate transitive closures of database relations.
The authors point out the benefits and generality of using sampling for selectivity estimation over parametric methods, which make assumptions about an underlying probability distribution for the data, as well as over non-parametric methods, which require storing and maintaining synopses about the underlying data. The algorithms consider the query result as a collection of results from several disjoint subqueries. Subqueries are sampled randomly and their result sizes are computed. The estimate of the actual query result size is then obtained from the results of the various subqueries. The sampling of subqueries is continued until either the sum of the subquery sizes is sufficiently large or the number of samples taken is sufficiently large. The method requires that the maximum size of a subquery be known. Since this is generally not available, the authors use an upper bound for the maximum subquery size in their method. Haas and Swami [59] observe that using a loose upper bound for the maximum subquery size can lead to sampling more subqueries than necessary, potentially increasing the cost of sampling significantly. Double sampling, or two-phase sampling, has been used in [66] for estimating the result of a COUNT query with a guaranteed error bound at a certain confidence level. The error bound is guaranteed by performing sampling in two steps. In the first step, a small pilot sample is used to obtain preliminary information about the input relation. This information is then used to compute the size of the sample for the second step such that the estimator is guaranteed to produce an estimate with the desired error bound. As Haas and Swami [59] point out, the drawback of using double sampling is that there is no theoretical guidance for choosing the size of the pilot sample. This could lead to an unpredictably imprecise estimate if the pilot sample size is too small, or an unnecessarily high sampling cost if the pilot sample size is too large.
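The two-phase idea — use a cheap pilot to estimate variability, then size the second phase to hit a target error bound — can be sketched as follows for a COUNT query. This is a simplified illustration of the general scheme, not the exact formulas of [66]; the function name, the Bernoulli variance estimate, and the CLT-style bound are all assumptions of this sketch.

```python
import math
import random

def two_phase_count_sample_size(population, pilot_size,
                                target_halfwidth, z=1.96):
    """Phase 1: draw a pilot sample of 0/1 predicate indicators and
    estimate their variance. Phase 2: choose a sample size n so that a
    CLT-style bound z * N * sqrt(var / n) on the scaled-up COUNT
    estimate meets the requested half-width."""
    N = len(population)
    pilot = random.sample(population, pilot_size)
    p = sum(pilot) / pilot_size        # fraction of rows matching predicate
    var = p * (1.0 - p)                # Bernoulli variance estimate
    # Solve z * N * sqrt(var / n) <= target_halfwidth for n.
    n = math.ceil((z * N) ** 2 * var / target_halfwidth ** 2)
    return min(max(n, pilot_size), N)  # never shrink below the pilot
```

The drawback Haas and Swami note is visible here: if `pilot_size` is too small, `var` is itself a noisy estimate and the computed `n` may be far too small or far too large.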
In their work [59], Haas and Swami present sequential sampling techniques which provide an estimate of the result size and also bound the error in estimation with a prespecified probability. They present two algorithms in the paper to estimate the size of a query result. Although both algorithms have been proven to be asymptotically correct and efficient, the first algorithm suffers from the problem of undercoverage. This means that in practice the probability with which it estimates the query result within the computed error bound is less than the specified confidence level of the algorithm. This problem is addressed by the second algorithm, which organizes groups of equal-sized result sets into a single stratum and then performs stratified sampling over the different strata. However, their algorithms do not perform very well when estimating the size of joins between a skewed and a non-skewed relation. Ling and Sun [82] point out that general sampling-based estimation methods have a high cost of execution since they make an overly restrictive assumption of no knowledge about the overall characteristics of the data. In particular, they note that estimation of the overall mean and variance of the data not only incurs cost but also introduces error in estimation. The authors instead suggest an alternative approach of actually keeping track of these characteristics in the database at a minimal overhead. A detailed study of the cost of sampling-based methods to estimate join query sizes appears in [58]. The paper systematically analyses the factors which influence the cost of a sampling-based method to estimate join selectivities. Based on their analysis, their findings can be summarized as follows: (a) When the measure of precision of the estimate is absolute, the cost of sampling increases with the number of relations involved in the join as well as the sizes of the relations themselves.
(b) When the measure of precision of the estimate is relative, the cost of using sampling increases with the sizes of the relations, but decreases as the number of input relations increases. (c) When the distribution of the join attribute values is uniform or highly skewed for all input relations, the cost of sampling tends to be low, while it is high when only some of the input relations have a skewed join attribute value distribution. (d) The presence of tuples in a relation which do not join with any tuples from other relations always increases the cost of sampling. Haas et al. [56, 57] study and compare the performance of new as well as previous sampling-based procedures for estimating the selectivity of queries with joins. In particular, they identify estimators which have a minimum variance after a fixed number of sampling steps have been performed. They note that the use of indexes on input relations can further reduce the variance of the selectivity estimate. The authors also show how their estimation methods can be used to estimate the cost of implementing a given join query plan without making any assumptions about the underlying data or requiring storage and maintenance of summary statistics about the data. Ganguly et al. [35] describe how to estimate the size of a join in the presence of skew in the data by using a technique called bifocal sampling. This technique classifies the tuples of each input relation into two groups, sparse and dense, based on the number of tuples with the same value for the join attribute. Every combination of these groups is then subject to a different estimation procedure. Each of these estimation procedures requires a sample size larger than a certain value (in terms of the total number of tuples in the input relation) to provide an estimate within a small constant factor of the true join size.
In order to guarantee estimates with the specified accuracy, bifocal sampling also requires the total join size and the join sizes from sparse-sparse subjoins to be greater than a certain threshold. Gibbons and Matias [40] introduce two sampling-based summary statistics called concise samples and counting samples and present techniques for their fast and incremental maintenance. Although the paper describes summary statistics rather than on-the-fly sampling techniques, the summary statistics are created from random samples of the underlying data and are actually defined to describe characteristics of a random sample of the data. Since summary statistics of a random sample require much less memory than the sample itself, the paper describes how information from a much larger sample can be stored in a given amount of memory by storing sample statistics instead of using the memory to store actual random samples. Thus, the authors claim that since information from a larger sample can be stored by their summary statistics, the accuracy of approximate answers can be boosted. Chaudhuri, Motwani and Narasayya [22, 24] present a detailed study of the problem of efficiently sampling the output of a join operation without actually computing the entire join. They prove a negative result that it is not possible to generate a sample of the join result of two relations by merely joining samples of the relations involved in the join. Based on this result, they propose a biased sampling strategy which samples tuples from one relation in proportion to the frequency with which their matching tuples appear in the other relation. The intuition behind this approach is that the resulting biased sample is more likely to reflect the structure of the actual join result between the two relations. Information about the frequency of the various join attribute values is assumed to be available in the form of synopsis structures such as histograms.
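The biased join-sampling strategy just described can be sketched as follows. This is an illustrative rendering, not the cited authors' algorithm: it assumes the per-key frequencies of the second relation are available (here computed directly, standing in for a histogram or other synopsis), and it assumes the join result is nonempty; all names are hypothetical.

```python
import random

def biased_join_sample(R, S, key, k, seed=0):
    """Sample k tuples of the join R ⋈ S without computing the full join.

    Tuples of R are drawn with probability proportional to how often
    their join key appears in S, and each drawn R-tuple is paired with a
    uniformly chosen matching S-tuple. Each join-result tuple is then
    produced with equal probability 1 / (sum of weights).
    """
    rng = random.Random(seed)
    s_by_key = {}
    for s in S:
        s_by_key.setdefault(key(s), []).append(s)
    # weight of r = number of S-tuples it joins with (from the synopsis)
    weights = [len(s_by_key.get(key(r), [])) for r in R]
    out = []
    while len(out) < k:
        r = rng.choices(R, weights=weights, k=1)[0]
        out.append((r, rng.choice(s_by_key[key(r)])))
    return out
```

An R-tuple with weight w is drawn with probability w / W and then paired with one of its w partners uniformly, so every join tuple has probability 1 / W: the output is a uniform (with-replacement) sample of the join result.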
There has also been work to estimate the actual result of an aggregate query which involves a relational join operation on its input relations. In fact, Haas, Hellerstein and Wang [63] propose a system called Online Aggregation (OLA) that can support online execution of analytic aggregation queries. They propose that the system have a visual interface which displays the current estimate of the aggregate query along with error bounds at a certain confidence level. Then, as time progresses, the system continually refines the estimate and at the same time shrinks the width of the error bounds. The user who is presented with such a visual interface has, at all times, an option to terminate further execution of the query once the error bound width is satisfactory for the given confidence level. The authors propose the use of random sampling from input relations to provide estimates in OLA. Further, they describe some of the key changes that would be required in a DBMS to support OLA. In [51], Haas describes statistical techniques for computing error bounds in OLA. The work on OLA eventually grew into the UC Berkeley CONTROL project. In their article [62], Hellerstein et al. describe various issues in providing interactive data analysis and possible approaches to address those issues. Haas and Hellerstein [53, 54] propose a family of join algorithms called ripple joins to perform relational joins in an OLA framework. Ripple joins were designed to minimize the time until an acceptably precise estimate of the query result is made available, as opposed to minimizing the time to completion of the query as in a traditional DBMS. For a two-table join, the algorithm retrieves a certain number of random tuples from both relations at each sampling step; these new tuples are joined with previously seen tuples and with each other. The running result of the aggregate query is updated with these newly retrieved tuples.
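The two-table ripple join step described above can be sketched as follows for a COUNT(*) join query. This is a minimal in-memory illustration with hypothetical names, not the authors' implementation: shuffled lists stand in for random tuple retrieval from relations on disk.

```python
import random

def ripple_join_count(R, S, join_pred, steps, per_step=1, seed=0):
    """Square ripple join sketch estimating |R ⋈ S| for a COUNT(*) query.

    At each step, per_step new random tuples are drawn from each input;
    new tuples are joined with previously seen tuples and with each
    other, and the running join count is scaled up to an estimate of
    the full join size.
    """
    rng = random.Random(seed)
    R, S = list(R), list(S)
    rng.shuffle(R)          # random retrieval order stands in for
    rng.shuffle(S)          # sampling tuples from a stored relation
    seen_r, seen_s, hits = [], [], 0
    for step in range(steps):
        new_r = R[step * per_step:(step + 1) * per_step]
        new_s = S[step * per_step:(step + 1) * per_step]
        # join new R-tuples with all S-tuples seen so far (incl. new ones)
        for r in new_r:
            hits += sum(1 for s in seen_s + new_s if join_pred(r, s))
        # join new S-tuples with previously seen R-tuples only
        for s in new_s:
            hits += sum(1 for r in seen_r if join_pred(r, s))
        seen_r += new_r
        seen_s += new_s
    scale = (len(R) * len(S)) / (len(seen_r) * len(seen_s))
    return hits * scale     # running estimate of the join size
```

Each pair of sampled tuples is compared exactly once, so after enough steps the sampled cross product is covered completely and the estimate becomes exact.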
The paper also describes how a statistically meaningful confidence interval for the estimated result can be computed based on the Central Limit Theorem (CLT). Luo et al. [87] present an online parallel hash ripple join algorithm to speed up the execution of the ripple join, especially when the join selectivity is low and also when the user wishes to continue execution until completion. The algorithm is assumed to be executed at a fixed set of processor nodes. At each node, a hash table is maintained for every relation. Moreover, every bucket in each hash table may have some tuples stored in memory and others stored on disk. The join algorithm proceeds in two phases; in the first phase, tuples from both relations are retrieved in a random order and distributed to the processor nodes so that each node performs roughly the same amount of work for executing the join. By using multiple threads at each node, production of join tuples from the in-memory hash table buckets begins even as tuples are being distributed to the various processors. The second phase begins after redistribution from the first phase is complete. In this phase, a new in-memory hash table is created which uses a hashing function different from the function used in phase 1. The tuples in the disk-resident buckets of the hash table of phase 1 are then hashed according to the hashing function of phase 2 and joined. The algorithm provides a considerable speedup over the one-node ripple join, provided its memory requirements are met. Jermaine et al. [73, 74] point out that the drawback of both the ripple join algorithms described above is that the statistical guarantees provided by the estimator are valid only as long as the output of the join can be accommodated in main memory. In order to counteract this problem, they propose the Sort-Merge-Shrink join algorithm as a generalization of the ripple join which can provide error guarantees throughout execution, even if it operates from disk.
The algorithm proceeds in three phases. In the sort phase, the two input relations are read in parallel and sorted into runs. Each pair of runs is subject to an in-memory hash ripple join and provides a corresponding estimate of the join result. The merge and shrink phases execute concurrently: in the merge phase, tuples are retrieved from the various sorted runs of both relations and joined with each other. Since the sorted runs shrink as tuples are pulled from them by the merge phase, the shrink phase takes these tuples into account and updates the estimator accordingly. The authors provide a detailed statistical analysis of the estimator as well as a computation of error bounds. Estimation using sampling of the number of distinct values in a column has been studied by Haas et al. [48]. They provide an overview of the estimators used in the database and statistics literature and also develop several new sampling-based estimators for the distinct value estimation problem. They propose a new hybrid sampling estimator which explicitly adapts to different levels of data skew. Their hybrid estimator performs a Chi-square test to detect skew in the distribution of the attribute values. If the data appears to be skewed, then Shlosser's estimator is used, while if the test does not detect skew, a smoothed jackknife estimator (a modification of the conventional jackknife estimator) is used. The authors attribute the dearth of work on sampling-based estimation of the number of distinct values to the inherent difficulty of the problem, noting that it is a much harder problem than estimating the selectivity of a join. Haas and Stokes [50] present a detailed study of the problem of estimating the number of classes in a finite population. This is equivalent to the database problem of estimating the number of distinct values in a relation.
The authors make recommendations about which statistical estimator is appropriate subject to various constraints and conclude from empirical results that a hybrid estimator which adapts according to data skew performs best. There has also been work by Charikar et al. [16] which establishes a negative result stating that no sampling-based estimator for the number of distinct values can guarantee small error across all input distributions unless it examines a large fraction of the input data. They also present a Guaranteed Error Estimator (GEE) whose error is provably no worse than their negative result. Since the GEE is a general estimator providing optimal error over all distributions, the authors note that its accuracy may be lower than that of some previous estimators on specific distributions. Hence, they propose an estimator called the Adaptive Estimator (AE) which is similar in spirit to Haas et al.'s hybrid estimator [50] but, unlike the latter, is not composed of two distinct estimators. Rather, the AE considers the contributions of data items having high and low frequencies in a single unified estimator. In the AQUA system [41] for approximate answering of queries, Acharya et al. [6] propose using synopses for estimating the result of relational join queries involving foreign-key joins rather than using random samples from the base relations. These synopses are actually precomputed samples from a small set of distinguished joins and are called join synopses in the paper. The idea of join synopses is that by precomputing samples from a small set of distinguished joins, these samples can be used for estimating the result of many other joins. The concept is applicable in a k-way join where each join involves a primary and foreign key of the participating relations.
The paper describes how, if workload information is available, it can be used to design an optimal allocation for the join synopses that minimizes the overall error in the approximate answers over the workload. Acharya et al. [5] propose using a mix of uniform and biased samples for approximately answering queries with a GROUP-BY clause. Their sampling technique, called congressional sampling, relies on precomputed samples which are a hybrid union of uniform and biased samples. They assume that the selectivity of the query predicate is not so low that their precomputed sample completely misses one or more groups from the result of the GROUP-BY query. Based on this assumption, they devise a sampling plan for the different groups such that the expected minimum number of tuples satisfying the query predicate in any group is maximized. The authors also present one-pass algorithms [4] for constructing the congressional samples. Ganti et al. [37] describe a biased sampling approach which they call ICICLES to obtain random samples which are tuned to a particular workload. Thus, if a tuple is chosen by many queries in a workload, it has a higher probability of being selected in the self-tuning sample as compared to tuples which are chosen by fewer queries. Since this is a non-uniform sample, traditional sampling-based estimators must be adapted for these samples. The paper describes modified estimators for the common aggregation operations. It also describes how the self-tuning samples are tuned in the presence of a dynamically changing workload. Chaudhuri et al. [18] note that uniform random sampling to estimate aggregate queries is ineffective when the distribution of the aggregate attribute is skewed or when the query predicate has a low selectivity. They propose using a combination of two methods to address this problem. Their first approach is to index separately those attribute values which contribute significantly to the query result.
This method is called Outlier Indexing in the paper. The second approach proposed in the paper is to exploit workload information to perform weighted sampling. According to this technique, records which satisfied many queries in the workload are sampled more than records that satisfied fewer queries. Chaudhuri, Das and Narasayya [19, 20] describe how workload information can be used to precompute a sample that minimizes the error for the given workload. The problem of selecting the sample is framed as an optimization problem so that the error in estimation of the workload queries using the resulting sample is minimized. When the actual incoming queries are identical to queries in the workload, this approach gives a solution with minimal error across all queries. The paper also describes how the choice of the sample can be tuned to achieve effective estimates when the actual queries are similar but not identical to the workload. Babcock, Chaudhuri and Das [10] note that a uniformly random sample can lead to inaccurate answers for many queries. They observe that for such queries, estimation using an appropriately biased sample can lead to more accurate answers as compared to estimation using uniformly random samples. Based on this idea, the paper describes a technique called small group sampling which is designed to approximately answer aggregation queries having a GROUP-BY clause. The distinctive feature of this technique as compared to previous biased sampling techniques like congressional sampling is that a new biased sample is chosen for every GROUP-BY query, such that it maximizes the accuracy of estimating that query, rather than trying to devise a biased sample that maximizes the accuracy over an entire workload of queries. According to this technique, larger groups from the output of the GROUP-BY queries are sampled uniformly while the small groups are sampled at a higher rate to ensure that they are adequately represented.
The group samples are obtained on a per-query basis from an overall sample which is computed in a preprocessing phase. In fact, database sampling has been recognized as an important enough problem that ISO has been working to develop a standard interface for sampling from relational database systems [55], and significant research efforts are directed at providing sampling from database systems by vendors such as IBM [52].

2.2 Estimation Using Non-sampling Precomputed Synopses

Estimation in databases using a non-sampling technique was first proposed by Rowe [106, 107]. The proposed technique is called anti-sampling and involves the creation of a special auxiliary structure called a database abstract. The abstract considers the distribution of several attributes and groups of attributes. Correlations between different attributes can also be characterized as statistics. This technique was found to be faster than random sampling, but required domain knowledge about the various attributes. Classic work on histogram-based estimation of predicate selectivity is by Selinger et al. [110] and Piatetsky-Shapiro and Connell [102]. Selectivity estimation of queries with multidimensional predicates using histograms was presented by Muralikrishna and DeWitt [92]. They show that the maximum error in estimation can be controlled more effectively by choosing equi-depth histograms as opposed to equi-width histograms. Ioannidis [70] describes how serial histograms are optimal for aggregate queries involving arbitrary join trees with equality predicates. Ioannidis and Poosala [71] have also studied how histograms can be used to approximately answer non-aggregate queries which have a set-based result. Several histogram construction schemes [42, 45, 72] have been proposed in the literature. Jagadish et al.
[72] describe techniques for constructing histograms which can minimize a given error metric, where the error is introduced because of the approximation of values in a bucket by a single value associated with the bucket. They also describe techniques for augmenting histograms with additional information so that they can be used to provide accuracy guarantees for the estimated results. Construction of approximate histograms by considering only a random sample of the data set was investigated by Chaudhuri et al. [23]. Their technique uses an adaptive sampling approach to determine the sample size that would be sufficient to generate approximate histograms which can guarantee pre-specified error bounds in estimation. They also extend their work to consider duplicate values in the domain of the attribute for which a histogram is to be constructed. The problem of estimating the number of distinct value combinations of a set of attributes has been studied by Yu et al. [121]. Due to the inherent difficulty of developing a good, sampling-based estimation solution to the problem, they propose using additional information about the data in the form of histograms, indexes or data cubes. In a recent paper [28], Dobra presents a study of when histograms are best suited for approximation. The paper considers the long-standing assumption that histograms are most effective only when all elements in a bucket have the same frequency, and extends it to a less restrictive assumption that histograms are well-suited when elements within a bucket are randomly arranged even though they might have different frequencies. Wavelets have a long history as mathematical tools for hierarchical decomposition of functions in signal and image processing. Vitter and his collaborators have studied how wavelets can be applied to selectivity estimation of queries [89] and also to computing aggregates over data cubes [118, 119]. Chakrabarti et al.
[15] present techniques for approximate computation of results for aggregate as well as non-aggregate queries using Haar wavelets. One more summary structure that has been proposed for approximating the size of joins is the sketch. Sketches are small-space summaries of data suited for data streams. A sketch generally consists of multiple counters corresponding to random variables which enable it to provide approximate answers with error guarantees for a priori decided queries. Some of the earliest work on sketches was presented by Alon, Gibbons, Matias and Szegedy [7, 8]. Sketching techniques with improved error guarantees and faster update times have been proposed as FastCount sketches [117]. A statistical analysis of various sketching techniques, along with recommendations on their use for estimating join sizes, appears in [108].

2.3 Analytic Query Processing Using Nonstandard Data Models

A data model for OLAP applications called the data cube was proposed by Gray et al. [44] for processing of analytic-style aggregation queries over data warehouses. The paper describes a generalization of the SQL GROUP BY operator to multiple dimensions by introducing the data cube operator. This operator treats each of the possible aggregation attributes as a dimension of a high-dimensional space. The aggregate of a particular set of attribute values is considered as a point in this space. Since the cube holds precomputed aggregate values over all dimensions, it can be used to quickly compute results to GROUP-BY queries over multiple dimensions. The data cube is precomputed and can require a significant amount of space for storage of the precomputed aggregates along the different dimensions. A more serious drawback of the data cube approach is that it can be used to efficiently answer only those queries whose grouping hierarchy conforms to the hierarchy on which the data cube is built.
Moreover, complex queries of the kind addressed in this thesis, such as queries having correlated subqueries, are not amenable to efficient processing with the data cube model. Due to the potentially large sizes of data cubes for high dimensions, researchers have studied techniques to discover semantic relationships in a data cube. This approach reduces the number of precomputed aggregates grouped by different attributes if their aggregate values are identical. The quotient cube [79] and quotient cube tree [80] structures are such compressed representations of the data cube which preserve semantic relationships while also allowing processing of point and range queries. Another approach that has been employed in shrinking the data cube while at the same time preserving all the information in it is the Dwarf [113, 114] structure. Dwarf identifies and eliminates redundancies in prefixes and suffixes of the values along different dimensions of a data cube. The paper shows that by eliminating prefix as well as suffix redundancies, both dense and sparse data cubes can be compressed effectively. The paper also shows improved cube construction time, query response time and update time as compared to cube trees [105]. Although the Dwarf structure improves the performance of the data cube model, it still suffers from the inherent drawback of the data cube model: it is not suitable for efficiently answering arbitrarily complex queries such as queries with correlated subqueries. Recently, a new column-oriented architecture for database systems called C-Store was proposed by Stonebraker et al. [115]. The system has been designed for an environment that has a much higher number of database reads as opposed to writes, such as a data warehousing environment. C-Store logically splits the attributes of a relational table into projections, which are collections of attributes, and stores them on disk such that all values of any attribute are stored adjacent to each other.
The paper presents experimental results which show that C-Store executes several select-project-join and group-by queries over the TPC-H benchmark much faster than commercial row-oriented or column-oriented systems. At the time of the paper [115], the system was still under development.

CHAPTER 3
MATERIALIZED SAMPLE VIEWS FOR DATABASE APPROXIMATION

3.1 Introduction

With ever-increasing database sizes, randomization and randomized algorithms [91] have become vital data management tools. In particular, random sampling is one of the most important sources of randomness for such algorithms. Scores of algorithms that are useful over large data repositories either require a randomized input ordering for data (i.e., an online random sample), or else they operate over samples of the data to increase the speed of the algorithm. Although applications requiring randomization abound in the data management literature, we specifically consider online aggregation [54, 62, 63] in this thesis. In online aggregation, database records are processed one-at-a-time and used to keep the user informed of the current "best guess" as to the eventual answer to the query. If the records are input into the online aggregation algorithm in a randomized order, then it becomes possible to give probabilistic guarantees on the relationship of the current guess to the eventual answer to the query. Despite the obvious importance of random sampling in a database environment and dozens of recent papers on the subject (approximately 20 papers from recent SIGMOD and VLDB conferences are concerned with database sampling), there has been relatively little work towards actually supporting random sampling with physical database file organizations. The classic work in this area (by Olken and his coauthors [98, 99, 101]) suffers from a key drawback: each record sampled from a database file requires a random disk I/O.
At a current rate of around 100 random disk I/Os per second per disk, this means that it is possible to retrieve only 6,000 samples per minute. If the goal is fast approximate query processing or speeding up a data mining algorithm, this is clearly unacceptable.

The Materialized Sample View

In this chapter, we propose to use the materialized sample view 1 as a convenient abstraction for allowing efficient random sampling from a database. For example, consider the following database schema:

SALE (DAY, CUST, PART, SUPP)

Imagine that we want to support fast, random sampling from this table, and most of our queries include a temporal range predicate on the DAY attribute. This is exactly the interface provided by a materialized sample view. A materialized sample view can be specified with the following SQL-like query:

CREATE MATERIALIZED SAMPLE VIEW MySam AS
SELECT * FROM SALE
INDEX ON DAY

In general, the range attribute or attributes referenced in the INDEX ON clause can be spatial, temporal, or otherwise, depending on the requirements of the application. While the materialized sample view is a straightforward concept, efficient implementation is difficult. The primary technical contribution of this thesis is a novel index structure called the ACE Tree (see Section 3.4) which can be used to efficiently implement a materialized sample view. Such a view, stored as an ACE Tree, has the following characteristics:

* It is possible to efficiently sample (without replacement) from any arbitrary range query over the indexed attribute, at a rate that is far faster than is possible using techniques proposed by Olken [96] or by scanning a randomly permuted file. In general, the view can produce samples from a predicate involving any attribute having a natural ordering, and a straightforward extension of the ACE Tree can be used for sampling from multidimensional predicates.
* The resulting sample is online, which means that new samples are returned continuously as time progresses, and in a manner such that at all times, the set of samples returned is a true random sample of all of the records in the view that match the range query. This is vital for important applications like online aggregation and data mining.

* Finally, the sample view is created efficiently, requiring only two external sorts of the records in the view, and with only a very small space overhead beyond the storage required for the data records.

1 This term was originally used in Olken's PhD thesis [96] in a slightly different context, where the goal was to maintain a fixed-size sample of a database; in contrast, as we describe subsequently, our materialized sample view is a structure allowing online sampling.

We note that while the materialized sample view is a logical concept, the actual file organization used to implement such a view can be referred to as a sample index, since it is a primary index structure to efficiently retrieve random samples.

3.2 Existing Sampling Techniques

In this section, we discuss three simple techniques that can be used to create materialized sample views to support random sampling from a relational selection predicate.

3.2.1 Randomly Permuted Files

One option for creating a materialized sample view is to randomly shuffle or permute the records in the view. To sample from a relational selection predicate over the view, we scan it sequentially from beginning to end and accept those records that satisfy the predicate while rejecting the rest. This method has the advantage that it is very simple, and using a fast external sorting algorithm, permuting the records can be very efficient. Furthermore, since the process of scanning the file can make use of the fast, sequential I/O provided by modern hard disks, a materialized view organized as a randomly permuted file can be very useful for answering queries that are not very selective.
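The scan-and-reject procedure over a randomly permuted file can be sketched as follows; the names are illustrative, and an in-memory shuffle stands in for the external permutation of a file on disk. The generator form makes the sample online: at every point in the scan, the records yielded so far form a random sample of all matching records.

```python
import random

def permute_file(records, seed=0):
    """One-time preprocessing: store the view in random order. In a real
    system this would be an external shuffle of the file on disk."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled

def sample_by_scan(permuted, predicate):
    """Scan the permuted file sequentially with fast sequential I/O,
    accepting records that satisfy the selection predicate and
    rejecting the rest."""
    for record in permuted:
        if predicate(record):
            yield record
```

Note that for a predicate with selectivity p, only a fraction p of the scanned records are accepted, which is precisely the weakness of this organization for selective queries.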
However, the main problem with such a materialized view is that the fraction of useful samples it retrieves is directly proportional to the selectivity of the selection predicate. For example, if the selectivity of the query is 10%, then on average only 10% of the random samples obtained by such a view can be used to answer the query. Hence, for moderate- to low-selectivity queries, most of the random samples retrieved by such a view will not be useful for answering queries. Thus, the performance of such a view quickly degrades as the selectivity of the selection predicates decreases.

3.2.2 Sampling from Indices

The second approach to creating a materialized sample view is to use one of the standard indexing structures like a hashing scheme or a tree-based index structure to organize the records in the view. In order to produce random samples from such a materialized view, we can employ iterative or batch sampling techniques [9, 96, 99-101] that sample directly from a relational selection predicate, thus avoiding the aforementioned problem of obtaining too few relevant records in the sample. Olken [96] presents a comprehensive analysis and comparison of many such techniques. In this section, we discuss the technique of sampling from a materialized view organized as a ranked B+Tree, since it has been proven to be the most efficient existing iterative sampling technique in terms of the number of disk accesses. A ranked B+Tree is a regular B+Tree whose internal nodes have been augmented with information which permits one to find the ith record in the file. Let us assume that the relation SALE presented in the Introduction is stored as a ranked B+Tree file indexed on the attribute DAY, and we want to retrieve a random sample of records whose DAY attribute value falls between 11-28-2004 and 03-02-2005.
This translates to the following SQL query:

SELECT * FROM SALE
WHERE SALE.DAY BETWEEN '11-28-2004' AND '03-02-2005'

Algorithm 1 below can then be used to obtain a random sample of relevant records from the ranked B+Tree file. The drawback of this algorithm is that whenever a leaf page is accessed, the algorithm retrieves only that record whose rank matches the rank being searched for. Hence, for every record which resides on a page that is not currently buffered, the retrieval time is the same as the time required for a random disk I/O. Thus, as long as there are unbuffered leaf pages containing candidate records, the rate of record retrieval is very slow.

Algorithm 1: Sampling from a Ranked B+Tree
Algorithm SampleRankedB+Tree (Value v1, Value v2)
1. Find the rank r1 of the record which has the smallest DAY value greater than v1.
2. Find the rank r2 of the record which has the largest DAY value smaller than v2.
3. While sample size < desired sample size:
   3.a Generate a uniformly distributed random number i between r1 and r2.
   3.b If i has been generated previously, discard it and generate the next random number.
   3.c Using the rank information in the internal nodes, retrieve the record whose rank is i.

3.2.3 Block-based Random Sampling

While the classic algorithms of Olken and Antoshenkov sample records one-at-a-time, it is possible to sample from an indexing structure such as a B+Tree and make use of entire blocks of records [21, 55]. The number of records per block is typically on the order of 100 to 1000, leading to a speedup of two or three orders of magnitude in the number of records retrieved over time if all of the records in each block are consumed, rather than a single record. However, there are two problems with this approach.
First, if the structure is used to estimate the answer to some aggregate query, then the confidence bounds associated with any estimate provided after N samples have been retrieved from a range predicate using a B+-Tree (or some other index structure) may be much wider than the confidence bounds that would have been obtained had all N samples been independent. In the extreme case where the values on each block of records are closely correlated with one another, all of the N samples may be no better than a single sample. Second, any algorithm which makes use of such a sample must be aware of the block-based method used to sample the index and adjust its estimates accordingly, thus adding complexity to the query result estimation process. For algorithms such as Bradley's k-means algorithm [11], it is not clear whether or not such samples are even appropriate.

3.3 Overview of Our Approach

We propose an entirely different strategy for implementing a materialized sample view. Our strategy uses a new data structure called the ACE Tree to index the records in the sample view. At the highest level, the ACE Tree partitions a data set into a large number of different random samples such that each is a random sample without replacement from one particular range query. When an application asks to sample from some arbitrary range query, the ACE Tree and its associated algorithms filter and combine these samples so that very quickly, a large and random subset of the records satisfying the range query is returned. The sampling algorithm of the ACE Tree is an online algorithm, which means that as time progresses, a larger and larger sample is produced by the structure. At all times, the set of records retrieved is a true random sample of all the database records matching the range selection predicate.

3.3.1 ACE Tree Leaf Nodes

The ACE Tree stores records in a large set of leaf nodes on disk. Every leaf node has two components:

1. A set of h ranges,
where a range is a pair of key values in the domain of the key attribute, and h is the height of the ACE Tree. Unlike a B+-Tree, each leaf node in the ACE Tree stores records falling in several different ranges. The ith range associated with leaf node L is denoted by L.Ri. The h different ranges associated with a leaf node are hierarchical, that is, L.R1 ⊃ L.R2 ⊃ ... ⊃ L.Rh. The first range in any leaf node, L.R1, corresponds to the range (-∞, ∞); thus its section always contains a uniform random sample of all records of the database. The hth range in any leaf node is the smallest among all the ranges in that leaf node.

2. A set of h associated sections. The ith section of leaf node L is denoted by L.Si. The section L.Si contains a random subset of all the database records with key values in the range L.Ri.

Figure 3-1 depicts an example leaf node in the ACE Tree, with attribute range values written above each section and section numbers marked below; records within each section are shown as circles. In the example, R1: 0 to ∞, R2: 0-50, R3: 0-25, and R4: 0-12, over sections S1 through S4. Figure 3-1. Structure of a leaf node of the ACE tree.

3.3.2 ACE Tree Structure

Logically, the ACE Tree is a disk-based binary tree data structure with internal nodes used to index leaf nodes, and leaf nodes used to store the actual data. Since the internal nodes in a binary tree are much smaller than disk pages, they are packed and stored together in disk-page-sized units [27]. Each internal node has the following components:

1. A range R of key values associated with the node.
2. A key value k that splits R and partitions the data on the left and right of the node.
3. Pointers ptr_l and ptr_r that point to the left and right children of the node.
4. Counts cnt_l and cnt_r that give the number of database records falling in the ranges associated with the left and right child nodes. These values can be used, for example, during evaluation of online aggregation queries which require the size of the population from which we are sampling [54].
Figure 3-2 shows the logical structure of the ACE Tree; Ii,j refers to the jth internal node at level i. The root node is labeled with a range I1,1.R = [0-100], signifying that all records in the data set have key values within this range. The key of the root node partitions I1,1.R into I2,1.R = [0-50] and I2,2.R = [51-100]. Similarly, each internal node divides the range of its descendants with its own key. The ranges associated with each section of a leaf node are determined by the ranges associated with each internal node on the path from the root node to the leaf. For example, if we consider the path from the root node down to leaf node L4, the ranges that we encounter along the path are 0-100, 0-50, 26-50, and 38-50. Thus for L4, L4.S1 has a random sample of records in the range 0-100, L4.S2 has a random sample in the range 0-50, L4.S3 has a random sample in the range 26-50, while L4.S4 has a random sample in the range 38-50. Figure 3-2. Structure of the ACE tree.

3.3.3 Example Query Execution in ACE Tree

In the following discussion, we demonstrate how the ACE Tree efficiently retrieves a large random sample of records for any given range query. The query algorithm is formally described in Section 3.6. Let Q = [30-65] be our example query posed over the ACE Tree depicted in Figure 3-2. The query algorithm starts at I1,1, the root node. Since I2,1.R overlaps Q, the algorithm decides to explore the left child node labeled I2,1 in Figure 3-2. At this point, the two range values associated with the left and right children of I2,1 are 0-25 and 26-50. Since the left child range has no overlap with the query range, the algorithm chooses to explore the right child next. At this child node (I3,2), the algorithm picks leaf node L3 to be the first leaf node retrieved by the index.
Records from section 1 of L3 (which totally encompasses Q) are filtered for Q and returned immediately to the consumer of the sample as a random sample from the range [30-65], while records from sections 2, 3, and 4 are stored in memory. Figure 3-3 shows the one random sample from section 1 of L3 which can be used directly for answering query Q. Figure 3-3. Random samples from section 1 of L3.

Next, the algorithm again starts at the root node and now chooses to explore the right child node I2,2. After performing range comparisons, it explores the left child of I2,2, which is I3,3, since I3,4.R has no overlap with Q. The algorithm chooses to visit the left child node of I3,3 next, which is leaf node L5. This is the second leaf node to be retrieved. As depicted in Figure 3-4, since L5.R1 encompasses Q, the records of L5.S1 are filtered and returned immediately to the user as two additional samples from Q. Furthermore, section 2 records are combined with section 2 records of L3 to obtain a random sample of records in the range 0-100. These are again filtered and returned, giving four more samples from Q. Section 3 records are also combined with section 3 records of L3 to obtain a sample of records in the range 26-75. Since this range also encompasses Q, the records are again filtered and returned, adding four more records to our sample. Finally, section 4 records are stored in memory for later use.

Note that after retrieving just two leaf nodes in our small example, the algorithm obtains eleven randomly selected records from the query range. However, in a real index, this number would be many times greater. Thus, the ACE Tree supports "fast first" sampling from a range predicate: a large number of samples are returned very quickly. We contrast this with a sample taken from a B+-Tree having a similar structure to the ACE Tree depicted in Figure 3-2. The B+-Tree sampling algorithm would need to pre-select which nodes to explore.
Since four leaf nodes in the tree are needed to span the query range, there is a reasonably high likelihood that the first four samples taken would need to access all four leaf nodes. As the ACE Tree query algorithm progresses, it goes on to retrieve the rest of the leaf nodes in the order L4, L6, L1, L7, L2, L8. Figure 3-4. Combining samples from L3 and L5.

3.3.4 Choice of Binary Versus k-Ary Tree

The ACE Tree as described above can also be implemented as a k-ary tree instead of a binary tree. For example, for a ternary tree, each internal node can have two (instead of one) keys and three (instead of two) children. If the height of the tree were h, every leaf node would still have h ranges and h sections associated with it. Like a standard complete k-ary tree, the number of leaf nodes would be k^h. However, the big difference would be the manner in which a query is executed using a k-ary ACE Tree as opposed to a binary ACE Tree. The query algorithm will always start at the root node and traverse down to a leaf. However, at every internal node it will alternate between the k children in a round-robin fashion. Moreover, since the data space would be divided into k equal parts at each level, the query algorithm might have to make k traversals, and hence access k leaf nodes, before it can combine sections that can be used to answer the query. This would mean that the query algorithm will have to wait longer (than with a binary ACE Tree) before it can combine leaf node sections and thus return useful random samples. Since the goal of the ACE Tree is to support "fast first" sampling, use of a binary tree instead of a k-ary tree seems to be the better choice to implement the ACE Tree.
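To make the leaf layout of Section 3.3.1 concrete, the following toy sketch models a leaf as h in-memory sections of sampled keys and combines sections of two leaves to answer a range query. The lists, the ranges (taken from the Figure 3-2 example), and the query below are assumptions for illustration only; this is not the on-disk implementation.

```python
import random

def build_leaf(population, ranges, per_section):
    """A leaf is h sections; section i holds a without-replacement random
    sample of the keys falling in the leaf's ith (hierarchical) range."""
    leaf = []
    for lo, hi in ranges:
        in_range = [k for k in population if lo <= k <= hi]
        leaf.append(random.sample(in_range, min(per_section, len(in_range))))
    return leaf

def filter_for_query(records, lo, hi):
    """Keep only the sampled keys that satisfy the range predicate."""
    return [k for k in records if lo <= k <= hi]

population = list(range(0, 101))  # key domain 0-100, as in the running example
l1 = build_leaf(population, [(0, 100), (0, 50), (0, 25), (0, 12)], 4)
l3 = build_leaf(population, [(0, 100), (0, 50), (26, 50), (38, 50)], 4)

# Filtered samples from the two leaves' section 2 (both cover range 0-50)
# merge into one larger random sample of the query range [3, 47].
combined = filter_for_query(l1[1], 3, 47) + filter_for_query(l3[1], 3, 47)

# Section 3 of L1 (range 0-25) appended to section 3 of L3 (range 26-50)
# samples the union range 0-50, which is then filtered for the query.
appended = filter_for_query(l1[2] + l3[2], 3, 47)
```

This previews the combinability and appendability properties that are formalized in the next section.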
3.4 Properties of the ACE Tree

In this section we describe the three important properties of the ACE Tree which facilitate the efficient retrieval of random samples from any range query, and which will be instrumental in ensuring the performance of the algorithm described in Section 3.6.

3.4.1 Combinability

The various samples produced from processing a set of leaf nodes are combinable. For example, consider the two leaf nodes L1 and L3, and the query "Compute a random sample of the records in the query range Q1 = [3 to 47]". As depicted in Figure 3-5, first we read leaf node L1 and filter its second section in order to produce a random sample of size n1 from Q1, which is returned to the user. Next we read leaf node L3 and filter its second section L3.S2 to produce a random sample of size n2 from Q1, which is also returned to the user. At this point, the two sets returned to the user constitute a single random sample from Q1 of size n1 + n2. This means that as more and more nodes are read from disk, the records contained in them can be combined to obtain an ever-increasing random sample from any range query. Figure 3-5. Combining two sections of leaf nodes of the ACE tree.

3.4.2 Appendability

The ith sections from two leaf nodes are appendable. That is, given two leaf nodes Lj and Lk, Lj.Si ∪ Lk.Si is always a true random sample of all records of the database with key values within the range Lj.Ri ∪ Lk.Ri. For example, reconsider the query, "Compute a random sample of the records in the query range Q1 = [3 to 47]". As depicted in Figure 3-6, we can append the third section from node L3 to the third section from node L1 and filter the result to produce yet another random sample from Q1. This means that sections are never wasted.

3.4.3 Exponentiality

The ranges in a leaf node are exponential: the number of database records that fall in L.R_i is twice the number of records that fall in L.R_{i+1}.
This allows the ACE Tree to maintain the invariant that for any query Q' over a relation R such that at least hp database records fall in Q', and with |R|/2^{k+1} <= |σ_{Q'}(R)| <= |R|/2^k for some k <= h - 1, there exists a pair of leaf nodes Li and Lj where at least one-half of the database records falling in Li.R_{k+2} ∪ Lj.R_{k+2} are also in Q'. Here p is the average number of records in each section, and h is the height of the tree or, equivalently, the total number of sections in any leaf node. Figure 3-6. Appending two sections of leaf nodes of the ACE tree.

While the formal statement of the exponentiality property is a bit complicated, the net result is simple: there is always a pair of leaf nodes whose sections can be appended to form a set which can be filtered to quickly obtain a sample from any range query Q'. As an illustration, consider query Q over the ACE Tree of Figure 3-2. Note that the number of database records falling in Q is greater than one-fourth, but less than half, the database size. The exponentiality property assures us that Q can be totally covered by appending sections of two different leaf nodes. In our example, this means that Q can be covered by appending section 3 of nodes L4 and L6. If RC = L4.R3 ∪ L6.R3, then by the invariant given above we can claim that |σ_Q(R)| >= (1/2) × |σ_RC(R)|.

3.5 Construction of the ACE Tree

In this section, we present an I/O-efficient, bulk construction algorithm for the ACE Tree.

3.5.1 Design Goals

The algorithm for building an ACE Tree index is designed with the following goals in mind:

1. Since the ACE Tree may index enormous amounts of data, construction of the tree should rely on efficient, external memory algorithms, requiring as few passes through the data set as possible.

2. In the resulting data structure, the data which are placed in each leaf node section must constitute a true random sample (without replacement) of all database records lying within the range associated with that section.

3.
Finally, the tree must be constructed in such a way as to have the exponentiality, combinability, and appendability properties necessary for supporting the ACE Tree query algorithms.

3.5.2 Construction

The construction of the ACE Tree proceeds in two distinct phases. Each phase comprises two read/write passes through the data set (that is, constructing an ACE Tree from scratch requires two external sorts of a large database table). The two phases are as follows:

1. During Phase 1, the data set is sorted based on the record key values. This sorted order of records is used to provide the split points associated with each internal node in the tree.

2. During Phase 2, the data are organized into leaf nodes based on those key values. Disk blocks corresponding to groups of internal nodes can easily be constructed at the same time as the final pass through the data writes the leaf nodes to disk.

3.5.3 Construction Phase 1

The primary task of Phase 1 is to assign split points to each internal node of the tree. To achieve this, the construction algorithm first sorts the data set based upon the keys of the records, as depicted in Figure 3-7. After the data set is sorted, the median record for the entire data set is determined (this value is 50 in our example). This record's key will be used as the key associated with the root of the ACE Tree, and will determine L.R2 for every leaf node in the tree. We denote this key value by I1,1.k, since the value serves as the key of the first internal node in level 1 of the tree. After determining the key value associated with the root node, the medians of each of the two halves of the data set partitioned by I1,1.k are chosen as keys for the two internal nodes at the next level: I2,1.k and I2,2.k, respectively.
In the example of Figure 3-7, these values are 25 and 75. Figure 3-7. Choosing keys for internal nodes. I2,1.k and I2,2.k, along with I1,1.k, will determine L.R3 for every leaf node in the tree. The process is then repeated recursively until enough medians(2) have been obtained to provide every internal node with a key value. Note that at the same time that these various key values are determined, the values cnt_l and cnt_r can also be determined.

This simple strategy for choosing the various key values in the tree ensures that the exponentiality property will hold. If the data space between two adjacent split points at levels i and i+1 corresponds to some leaf node range L.R_m, then the data space obtained by further splitting it at the level-(i+2) key will correspond to some range L.R_{m+1}. Since the level-(i+2) key is the midpoint of that data space, we know that twice as many database records should fall in L.R_m as in L.R_{m+1}.

(2) We choose a value for the height of the tree in such a manner that the expected size of a leaf node (see Sec. V.F) does not exceed one logical disk block. Choosing a node size that corresponds to the block size is done for the same reason it is done in most traditional indexing structures: typically, the system disk block size has already been carefully chosen by a DBA to balance the speed of sequential access (which demands a larger block size) with the cost of accessing more data than is needed (which demands a smaller block size).

The following example also shows how the invariant described in Section 3.4.3 is guaranteed by adopting the aforementioned strategy of assigning key values to internal nodes. Consider the ACE Tree of Figure 3-2.
Figure 3-8 shows the keys of the internal nodes as medians of the data set R. We also consider two example queries, Q1 and Q2, such that the number of database records falling in Q2 is greater than one-fourth but less than one-half of the database size, while the number of database records falling in Q1 is more than half the database size. Figure 3-8. Exponentiality property of ACE tree.

Q1 can be answered by appending section 2 of (for example) L4 and L5 (refer to Figure 3-2). Let RC1 = L4.R2 ∪ L5.R2. Then all the database records fall in RC1. Moreover, since |σ_Q1(R)| >= |R|/2, we have |σ_Q1(R)| >= (1/2) × |σ_RC1(R)|. Similarly, Q2 can be answered by appending section 3 of (for example) L4 and L6. If RC2 = L4.R3 ∪ L6.R3, then half the database records fall in RC2. Also, since |σ_Q2(R)| > |R|/4, we have |σ_Q2(R)| >= (1/2) × |σ_RC2(R)|. This can be generalized to obtain the invariant stated in Section 3.4.3.

3.5.4 Construction Phase 2

The objective of Phase 2 is to construct leaf nodes with appropriate sections and populate them with records. This can be achieved by the following three steps. Figure 3-9 depicts the process: (a) records are assigned section numbers, (b) records are assigned leaf numbers, and (c) records are organized into leaf nodes. Figure 3-9. Phase 2 of tree construction.

1. Assign a uniformly generated random number between 1 and h to each record as its section number.

2.
Associate an additional random number with the record that will be used to identify the leaf node to which the record will be assigned.

3. Finally, reorganize the file by performing an external sort to group records in a given leaf node and a given section together.

Figure 3-9(a) depicts our example data set after we have assigned each record a randomly generated section number, assuming four sections in each leaf node. In Step 2, the algorithm assigns one more randomly generated number to each record, which will identify the leaf node to which the record will be assigned. We assume for our example that the number of leaf nodes is 2^{h-1} = 2^3 = 8. The number identifying the leaf node is assigned as follows:

1. First, the section number of the record is checked. We denote this value by s.

2. We then start at the root of the tree and traverse down by comparing the record key with s - 1 key values. After the comparisons, if we arrive at an internal node I_{s,j}, then we assign the record to one of the leaves in the subtree rooted at I_{s,j}.

In the example of Figure 3-9(a), the first record, having key value 3, has been assigned to section 1. Since this record can be randomly assigned to any leaf from 1 through 8, we assign it to leaf 7. The next record of Figure 3-9(a) has been assigned to section number 2. Referring back to Figure 3-7, we see that the key of the root node is 50. Since the key of the record is 7, which is less than 50, the record will be assigned to a leaf node in the left subtree of the root. Hence we assign a leaf node between 1 and 4 to this record. In our example, we randomly choose leaf node 3. For the next record, having key value 10, we see that the section number assigned is 3. To assign a leaf node to this record, we initially compare its key with the key of the root node. Referring to Figure 3-7, we see that 10 is smaller than 50; hence we then compare it with 25, which is the key of the left child node of the root.
Since the record key is smaller than 25, we assign the record to some leaf node in the left subtree of the node with key 25 by assigning to it a random number between 1 and 2. The section number and leaf node identifiers for each record are written in a small amount of temporary disk space associated with each record. Once all records have been assigned to leaf nodes and sections, the data set is reorganized into leaf nodes using a two-pass external sorting algorithm, as follows:

* Records are sorted in ascending order of their leaf node number.
* Records with the same leaf node number are arranged in ascending order of their section number.

The reorganized data set is depicted in Figure 3-9(c).

3.5.5 Combinability/Appendability Revisited

In Phase 2 of the tree construction, we observe that all records belonging to some section s are segregated based upon the result of the comparison of their keys with the appropriate medians, and are then randomly assigned a leaf node number from the feasible ones. Thus, if records from section s of all leaf nodes are merged together, we will obtain all of the section s records. This ensures the appendability property of the ACE Tree. Also note that the probability of assignment of one record to a section is unaffected by the probability of assignment of some other record to that section. Since this results in each section having a random subset of the database records, it is possible to merge a sample of the records from one section that match a range query with a sample of records from a different section that match the same query. This will produce a larger random sample of records falling in the range of the query, thus ensuring the combinability property.

3.5.6 Page Alignment

In Phase 2 of the construction algorithm, section numbers and leaf node numbers are randomly generated. Hence we can only predict, in expectation, the number of records that will fall in each section of each leaf node.
As a result, section sizes within each leaf node can differ, and the size of a leaf node itself is variable and will generally not be equal to the size of a disk page. Thus, when the leaf nodes are written out to disk, a single leaf node may span multiple disk pages or may be contained within a single disk page. This situation could be avoided if we fixed the size of each section a priori. However, this poses a serious problem. Consider two leaf node sections Li.Sj and L_{i+1}.Sj. We can force these two sections to contain the same number of records by ensuring that the set of records assigned to section j in Phase 2 of the construction algorithm has equal representation from Li.Rj and L_{i+1}.Rj. However, this means that the set of records assigned to section j is no longer random. If we fix the section size and force a set number of records to fall in each section, we invalidate the appendability and combinability properties of the structure. Thus, we are forced to accept a variable section size. In order to implement variable section sizes, we can adopt one of the following two schemes:

1. Enforce fixed-sized leaf nodes and allow variable-sized sections within the leaf nodes.
2. Allow variable-sized leaf nodes along with variable-sized sections.

If we choose the fixed-sized leaf node, variable-sized section scheme, the leaf node size is fixed in advance. However, the section size is allowed to vary. This allows full sections to grow further by claiming any available space within the leaf node. The leaf node size chosen should be large enough to prevent any leaf node from becoming completely filled up, which prevents the partitioning of any leaf node across two disk pages. The major drawback of this scheme is that the average leaf node space utilization will be very low. Assuming a reasonable set of ACE Tree parameters, a quick calculation shows that if we want to be 99% sure that no leaf node gets filled up, the average leaf node space utilization will be less than 15%.
The variable-sized leaf node, variable-sized section scheme does not impose a size limit on either the leaf node or the section. It allows leaf nodes to grow beyond disk page boundaries if space is required. The important advantage of this scheme is that it is space-efficient. The main drawback of this approach is that leaf nodes may span multiple disk pages, and hence all such pages must be accessed in order to retrieve such a leaf node. Given that most of the cost associated with reading an arbitrary leaf page comes from the disk head movement needed to move the disk arm to the appropriate cylinder, this does not pose too much of a problem. Hence we use this scheme for the construction of leaf nodes of the ACE Tree.

3.6 Query Algorithm

In this section, we describe in detail the algorithm used to answer range queries using the ACE Tree.

3.6.1 Goals

The algorithm has been designed to meet the primary goal of achieving "fast first" sampling from the index structure, which means it attempts to be greedy with respect to the number of records relevant to the query in the early stages of execution. In order to meet this goal, the query answering algorithm identifies the leaf nodes which contain the maximum number of sections relevant to the query. A section L_{i1}.Sj is relevant for a range query Q if L_{i1}.Rj ∩ Q ≠ ∅ and L_{i1}.Rj ∪ L_{i2}.Rj ∪ ... ∪ L_{in}.Rj ⊇ Q, where L_{i1}, ..., L_{in} are some leaf nodes in the tree. The query algorithm prioritizes retrieval of leaf nodes so as to:

* Facilitate the combination of sections so as to maximize n in the above formulation, and
* Maximize the number of relevant sections in each leaf node L retrieved, such that L.Sj ∩ Q ≠ ∅ for j = (c + 1), ..., h, where L.Rc is the smallest range in L that encompasses Q.

3.6.2 Algorithm Overview

At a high level, the query answering algorithm retrieves the leaf nodes relevant to answering a query via a series of stabs, or traversals, accessing one leaf node per stab. Each stab begins at the root node and traverses down to a leaf.
The distinctive feature of the algorithm is that at each internal node that is traversed during a stab, the algorithm chooses to access the child node that was not chosen the last time the node was traversed. For example, imagine that for a given internal node I, the algorithm chooses to traverse to the left child of I during a stab. The next time that I is accessed during a stab, the algorithm will choose to traverse to the right child node. This can be seen in Figure 3-10, when we compare the paths taken by Stab 1 and Stab 2. The algorithm chooses to traverse to the left child of the root node during the first stab, while during the second stab it chooses to traverse to the right child of the root node. The advantage of retrieving leaf nodes in this back-and-forth sequence is that it allows us to quickly retrieve a set of leaf nodes with the most disparate sections possible in a given number of stabs. Figure 3-10. Execution runs of the query answering algorithm with (a) 1 contributing section, (b) 6 contributing sections, (c) 7 contributing sections, and (d) 16 contributing sections.

The reason that we want a non-homogeneous set of nodes is that nodes from very distant portions of a query range will tend to have sections covering large ranges that do not overlap. This allows us to append sections of newly retrieved leaf nodes with the corresponding sections of previously retrieved leaf nodes. The samples obtained can then be filtered and immediately returned.
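The back-and-forth stab order can be sketched in miniature as follows. The four-leaf tree is invented for the example, and the query-overlap and done-subtree checks of the full algorithm are omitted so that only the alternation itself is shown.

```python
# Minimal sketch (not the on-disk implementation): each internal node
# remembers which child it sent the previous stab to, and sends the next
# stab the other way.

class Node:
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right
        self.next_is_left = True   # the per-node indicator bit

def stab(node):
    """One root-to-leaf traversal; returns the label of the leaf reached."""
    while node.left is not None:                      # still an internal node
        child = node.left if node.next_is_left else node.right
        node.next_is_left = not node.next_is_left     # flip the bit
        node = child
    return node.label

# Complete binary tree over leaves L1..L4.
leaves = [Node(f"L{i}") for i in range(1, 5)]
mid = [Node("I2,1", leaves[0], leaves[1]), Node("I2,2", leaves[2], leaves[3])]
root = Node("I1,1", mid[0], mid[1])

order = [stab(root) for _ in range(4)]
```

Four stabs visit the leaves in the order L1, L3, L2, L4: consecutive stabs land in opposite halves of the key space, which is exactly the disparate-leaf behavior described above.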
This order of retrieval is implemented by associating a bit with each internal node that indicates whether the next child node to be retrieved should be the left node or the right node. The value of this bit is flipped every time the node is accessed. Figure 3-10 illustrates the choices made by the algorithm at each internal node during four separate stabs. Note that when the algorithm reaches an internal node where the range associated with one of the child nodes has no overlap with the query range, the algorithm always picks the child node that has overlap with the query, irrespective of the value of the indicator bit. The only exception to this is when all leaf nodes of the subtree rooted at an internal node which overlaps the query range have been accessed. In such a case, the internal node which overlaps the query range is not chosen and is never accessed again.

3.6.3 Data Structures

In addition to the structure of the internal and leaf nodes of the ACE Tree, the query algorithm uses and updates the following two memory-resident data structures:

1. A lookup table T, to store internal node information in the form of a pair of values (next = LEFT|RIGHT, done = TRUE|FALSE). The first value indicates whether the next node to be retrieved should be the left child or the right child. The second value is TRUE if all leaf nodes in the subtree rooted at the current node have already been accessed; otherwise it is FALSE.

2. An array buckets[h] to hold sections of all the leaf nodes which have been accessed so far and whose records could not yet be used to answer the query. Here h is the height of the ACE Tree.

3.6.4 Actual Algorithm

We now present the algorithms used for answering queries with the ACE Tree. Algorithm 2 simply calls Algorithm 3, the main tree traversal algorithm, called Shuttle(). Each traversal, or stab, begins at the root node and proceeds down to a leaf node.
In each invocation of Shuttle(), a recursive call is made to either the left or the right child, with the recursion ending when it reaches a leaf node. At this point, the sections in the leaf node are combined with previously retrieved sections so that they can be used to answer the query. The procedure for combining sections is described in Algorithm 4.

Algorithm 2: Query Answering Algorithm
Algorithm Answer (Query Q)
  Let root be the root of the ACE Tree
  While (!T.lookup(root).done)
    T.lookup(root).done = Shuttle(Q, root);

Algorithm 3: ACE Tree traversal algorithm
Algorithm Shuttle (Query Q, Node curr_node)
  If (curr_node is an internal node)
    left_node = curr_node.get_left_node();
    right_node = curr_node.get_right_node();
    If (left_node is done AND right_node is done)
      Mark curr_node as done
    Else if (left_node is done)
      Shuttle(Q, right_node);
    Else if (right_node is done)
      Shuttle(Q, left_node);
    Else // neither child is done
      If (Q overlaps only with left_node.R)
        Shuttle(Q, left_node);
      Else if (Q overlaps only with right_node.R)
        Shuttle(Q, right_node);
      Else // Q overlaps both sides or none
        If (next node is LEFT)
          Shuttle(Q, left_node);
          Set next node to RIGHT;
        Else // next node is RIGHT
          Shuttle(Q, right_node);
          Set next node to LEFT;
  Else // curr_node is a leaf node
    Combine_Tuples(Q, curr_node);
    Mark curr_node as done

Algorithm 4 determines the sections that are required to be combined with every new section s that is retrieved, and then searches for them in the array buckets[]. If all sections are found, it combines them with s and removes them from buckets[]. If it does not find all the required sections in buckets[], it stores s in buckets[].
Algorithm 4: Algorithm for combining sections
Algorithm Combine_Tuples(Query Q, LeafNode node)
  For each section s in node do
    Store the section numbers required to be combined with s to span Q in a list list
    flag = true
    For each section number i in list do
      If buckets[] does not have section i
        flag = false
    If (flag == true)
      Combine all sections from list with s and use the records to answer Q
    Else
      Store s in the appropriate bucket

3.6.5 Algorithm Analysis

We now present a lower bound on the expected performance of the ACE Tree index for sampling from a relational selection predicate. For simplicity, our analysis assumes that the number of leaf nodes in the tree is a power of 2.

Lemma 1. Efficiency of the ACE Tree for query evaluation.
* Let n be the total number of leaf nodes in an ACE Tree used to sample from some arbitrary range query Q
* Let p be the largest power of 2 no greater than n
* Let μ be the mean section size in the tree
* Let σ be the fraction of database records falling in Q
* Let N be the size of the sample from Q that has been obtained after m ACE Tree leaf nodes have been retrieved from disk

If m is not too large (that is, if m < 2σn + 2), then E[N] ≥ (σμm/4) log₂ m, where E[N] denotes the expected value of N (the mean value of N after an infinite number of trials).

Proof. Let I(i,j) and I(i,j+1) be the two internal nodes in the ACE Tree where R = I(i,j).R ∪ I(i,j+1).R covers Q and i is maximized. As long as the shuttle algorithm has not retrieved all the children of I(i,j) and I(i,j+1) (this is the case as long as m < 2σn + 2), when the mth leaf node has been processed, the expected number of new samples obtained is

N_m = Σ_{k=1}^{⌈log₂ m⌉} σμ C_k,

where the summation is over the contributing sections of the leaf nodes, starting with section number i up to section number h, and C_k represents the fraction of records of the 2^{k−1} combined sections that satisfy Q. By the exponentiality property, C_k
≥ 1/2 for every k, so N_m ≥ (σμ/2) log₂ m. Thus after m leaf nodes have been obtained, the total number of expected samples is given by

E[N] = Σ_{j=1}^{m} N_j ≥ Σ_{j=1}^{m} (σμ/2) log₂ j ≥ (σμm/4) log₂ m.

If m is a power of 2, the result is proven. □

Lemma 2. The expected number of records μ in any leaf node section is given by

E[μ] = R / (h · 2^{h−1}),

where R is the total number of database records, h is the height of the ACE Tree, and 2^{h−1} is the number of leaf nodes in the ACE Tree.

Proof. The probability of assigning a record to any section i, i ≤ h, is 1/h. Given that the record is assigned to section i, it can be assigned to only one of 2^{i−1} leaf node groups after comparing with the appropriate medians. Since each group contains 2^{h−1}/2^{i−1} candidate leaf nodes, the probability that the record is assigned to some particular leaf node L_j is

(1/h) × (1/2^{i−1}) × (2^{i−1}/2^{h−1}) = 1/(h · 2^{h−1}).

This completes the proof of the lemma. □

3.7 Multi-Dimensional ACE Trees

The ACE Tree can be easily extended to support queries that include multi-dimensional predicates. The change needed to incorporate this extension is to use a k-d binary tree instead of the regular binary tree for the ACE Tree. Let a1, ..., ak be the k key attributes for the k-d ACE Tree. To construct such a tree, the root node would hold the median of all the a1 values in the database. Thus the root partitions the dataset based on a1. At the next step, we need to assign values to the level-2 internal nodes of the tree. For each of the resulting partitions of the dataset, we calculate the median of all the a2 values. These two medians are assigned to the two internal nodes at level 2 respectively, and we recursively partition the two halves based on a2. This process is continued until we finish level k. At level k + 1, we again consider a1 for choosing the medians. We would then assign a randomly generated section number to every record.
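The level-by-level median assignment just described can be sketched as a short recursive routine. This is an illustrative simplification (the function name kd_medians and the in-memory list representation are assumptions, not the thesis's construction code): the split attribute cycles through a1..ak with tree depth, and each level splits every partition at the median of the current attribute.

```python
import statistics

def kd_medians(records, k, depth=0, max_depth=3):
    """Return a nested (median, attr_index, left, right) tree; leaves are
    the remaining record partitions. records is a list of k-tuples."""
    if depth == max_depth or len(records) < 2:
        return records                       # a leaf-node partition
    attr = depth % k                         # level k+1 cycles back to a1
    med = statistics.median(r[attr] for r in records)
    left = [r for r in records if r[attr] <= med]
    right = [r for r in records if r[attr] > med]
    return (med, attr,
            kd_medians(left, k, depth + 1, max_depth),
            kd_medians(right, k, depth + 1, max_depth))
```

For example, four (DAY, AMOUNT) records with k = 2 are split at the root on the median DAY value, with the two halves then split on AMOUNT at the next level.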
The strategy for assigning a leaf node number to the records would also be similar to the one described in Section 3.5.4, except that the appropriate key attribute is used while performing comparisons with the internal nodes. Finally, the dataset is sorted into leaf nodes as in Figure 3-9(c). Query answering with the k-d ACE Tree can use the Shuttle algorithm described earlier with a few minor modifications. Whenever a section is retrieved by the algorithm, only records which satisfy all predicates in the query should be returned.

Figure 3-11. Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 0.25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.

Also, the sections of two leaf nodes can be combined only if they match in all k dimensions. The nth sections of two leaf nodes can be appended only if they match in the first n − 1 dimensions and form a contiguous interval over the nth dimension.

3.8 Benchmarking

In this section, we describe a set of experiments designed to test the ability of the ACE Tree to quickly provide an online random sample from a relational selection predicate, as well as to demonstrate that the memory requirement of the ACE Tree is reasonable. We performed two sets of experiments. The first set is designed to test the utility of the ACE Tree for use with one-dimensional data, where the ACE Tree is compared with a simple sequential file scan as well as Antoshenkov's algorithm for sampling from a ranked B+-Tree.
In the second set, we compare a multi-dimensional ACE Tree with the sequential file scan as well as with the obvious extension of Antoshenkov's algorithm to a two-dimensional R-Tree.

Figure 3-12. Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.

3.8.1 Overview

All experiments were performed on a Linux workstation having 1GB of RAM, a 2.4GHz clock speed, and two 80GB, 15,000 RPM Seagate SCSI disks. 64KB data pages were used.

Experiment 1. For the first set of experiments, we consider the problem of sampling from a range query of the form:

SELECT * FROM SALE
WHERE SALE.DAY >= i AND SALE.DAY <= j

We implemented and tested the following three random-order record retrieval algorithms for sampling the range query:

Figure 3-13. Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.

1. ACE Tree Query Algorithm: The ACE Tree was implemented exactly as described in this thesis. In order to use the ACE Tree to aid in sampling from the SALE relation, a materialized sample view for the relation was created, using SALE.DAY as the indexed attribute.

2.
Random sampling from a B+-Tree: Antoshenkov's algorithm for sampling from a ranked B+-Tree was implemented as described in Algorithm 1. The B+-Tree used in the experiment was a primary index on the SALE relation (that is, the underlying data were actually stored within the tree), and was constructed using the standard B+-Tree bulk construction algorithm.

3. Sampling from a randomly permuted file: the technique used in previous work on online aggregation. The SALE relation was randomly permuted by assigning a random key value k to each record. All of the records from SALE were then sorted in ascending order of the k values using a two-phase, multiway merge sort (TPMMS) (see Garcia-Molina et al. [38]). As the sorted records are written back to disk in the final pass of the TPMMS, k is removed from the file. To sample from a range predicate using a randomly permuted file, the file is scanned from front to back and all records matching the range predicate are immediately returned.

Figure 3-14. Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph is an extension of Figure 3-12 and shows results until all three sampling techniques return all the records matching the query predicate.

For the first set of experiments, we synthetically generated the SALE relation to be 20GB in size with 100B records, resulting in around 200 million records in the relation. We began the first set of experiments by sampling from 10 different range selection predicates over SALE using the three sampling techniques described above. 0.25% of the records from SALE satisfied each range selection predicate. For each of the three random sampling algorithms, we recorded the total number of random samples retrieved by the algorithm at each time instant.
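The randomly-permuted-file technique (option 3 above) can be sketched in a few lines. This is an illustrative in-memory stand-in, not the experimental code: the sort on the random key replaces the external TPMMS, and the function names are assumptions.

```python
import random

def permute(records, seed=42):
    """Assign a random key k to each record, sort on k, then drop k."""
    rng = random.Random(seed)
    keyed = [(rng.random(), r) for r in records]   # assign random key k
    keyed.sort()                                   # external sort in practice
    return [r for _, r in keyed]                   # k removed on write-back

def sample_range(permuted, lo, hi):
    """Scan front to back; matches come out in random order."""
    for r in permuted:                             # sequential scan
        if lo <= r <= hi:
            yield r
```

Because the file is already in random order, each matching record encountered by the scan is immediately a valid random sample from the predicate.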
The average number of random samples obtained for each of the ten queries was then calculated. This average is plotted as a percentage of the total number of records in SALE along the y-axis in Figure 3-11. On the x-axis, we have plotted the elapsed time as a percentage of the time required to scan the entire relation. We chose this metric considering the linear scan as the baseline record retrieval method.

Figure 3-15. Number of records needed to be buffered by the ACE Tree for queries with (a) 0.25% and (b) 2.5% selectivity. The graphs show the number of records buffered as a fraction of the total database records versus time plotted as a percentage of the time required to scan the relation. Each plot shows the average, maximum and minimum across 10 queries.

The test was then repeated with two more sets of selection predicates that are satisfied by 2.5% and 25% of SALE's records, respectively. The results are plotted in Figure 3-12 and Figure 3-13. For all three figures, results are shown for the first 15 seconds of execution, corresponding to approximately 4% of the time required to scan the relation. We show an additional graph in Figure 3-14 for the 2.5% selectivity case, where we plot results until all three record retrieval algorithms return all the records matching the query predicate. Finally, we provide experimental results to indicate the number of records that need to be buffered by the ACE Tree query algorithm for two different query selectivities. Figure 3-15(a) shows the minimum, maximum and average number of records stored for ten different queries having a selectivity of 0.25%, while Figure 3-15(b) shows similar results for queries having selectivity 2.5%.

Experiment 2.
For the second set of experiments, we add an additional attribute AMOUNT to the SALE relation and test the following two-dimensional range query:

SELECT * FROM SALE
WHERE SALE.DAY >= d1 AND SALE.DAY <= d2
AND SALE.AMOUNT >= a1 AND SALE.AMOUNT <= a2

To generate the SALE relation, the (DAY, AMOUNT) pair in each record is generated by sampling from a bivariate uniform distribution. In this experiment, we again test the three random sampling options given above:

1. ACE Tree query algorithm: The ACE Tree for multi-dimensional data (a k-d ACE Tree) was implemented exactly as described in Section 3.7. It was used to create a materialized sample view over the DAY and AMOUNT attributes.

2. Random sampling from an R-Tree: Antoshenkov's algorithm for sampling from a ranked B+-Tree was extended in the obvious fashion for sampling from an R-Tree [46]. Just as in the case of the B+-Tree, the R-Tree is created as a primary index, and the data from the SALE relation are actually stored in the leaf nodes of the tree. The R-Tree was constructed in bulk using the well-known Sort-Tile-Recursive [81] bulk construction algorithm.

3. Sampling from a randomly permuted file: We implemented this random sampling technique in the same manner as in Experiment 1.

In this experiment, the SALE relation was generated so as to be about 16GB in size. Each record in the relation was 100B in size, resulting in approximately 160 million records. Just as in the first experiment, we began by sampling from 10 different range selection predicates over SALE using the three sampling techniques described above. 0.25% of the records from SALE satisfied each range selection predicate. For all three random sampling algorithms, we recorded the total number of random samples retrieved by the algorithm at each time instant. The average number of random samples obtained for each of the ten queries is then computed. This average is plotted as a percentage of the total number of records in SALE along the y-axis in Figure 3-16.
On the x-axis, we have plotted the elapsed time as a percentage of the time required to scan the entire relation. The test was then repeated with two more selection predicates that are satisfied by 2.5% and 25% of the SALE relation's records, respectively. The results are plotted in Figure 3-17 and Figure 3-18, respectively.

Figure 3-16. Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 0.25% of the database tuples.

3.8.2 Discussion of Experimental Results

There are several important observations that can be made from the experimental results. Irrespective of the selectivity of the query, we observed that the ACE Tree clearly provides a much faster sampling rate during the first few seconds of query execution compared with the other approaches. This advantage tends to degrade over time, but since sampling is often performed only as long as more samples are needed to achieve a desired accuracy, the fact that the ACE Tree can provide a large, online random sample almost immediately indicates its practical utility. Another observation indicating the utility of the ACE Tree is that while it was the top performer over the three query selectivities tested, the best alternative to the ACE Tree generally changed depending on the query selectivity. For highly selective queries, the randomly-permuted file is almost useless due to the fact that the chance that any given record is accepted by the relational selection predicate is very low. On the other hand, the B+-Tree (and the R-Tree over multi-dimensional data) performs relatively well for highly selective queries.
The reason for this is that during the sampling, if the query range is small, then all the leaf pages of the B+-Tree (or R-Tree) containing records that match the query predicate are retrieved very quickly. Once all of the relevant pages are in the buffer, the sampling algorithm does not have to access the disk to satisfy subsequent sample requests, and the rate of record retrieval increases rapidly. However, for less selective queries, the randomly-permuted file works well since it can make use of an efficient, sequential disk scan to retrieve records. As long as a relatively large fraction of the retrieved records match the selection predicate, the amount of waste incurred by scanning unwanted records is small compared to the additional efficiency gained by the sequential scan.

Figure 3-17. Sampling rate of an ACE tree vs. rate for an R-tree, and scan of a randomly permuted file with a spatial selection predicate accepting 2.5% of the database tuples.

On the other hand, when the range associated with a query having high selectivity is very large, the time required to load all of the relevant B+-Tree (or R-Tree) pages into memory using random disk I/Os is prohibitive. Even if the query is run long enough that all of the relevant pages are touched, for a query with high selectivity, the buffer manager cannot be expected to buffer all the B+-Tree (or R-Tree) pages that contain records matching the query predicate. This is the reason that the curve for the B+-Tree in Figure 3-13, and for the R-Tree in Figure 3-18, never leaves the y-axis for the time range plotted. The net result is that if an ACE Tree were not used, it would probably be necessary to use both a B+-Tree and a randomly-permuted file in order to ensure satisfactory performance in the general case. Again, this is a point which seems to strongly favor use of the ACE Tree.
An observation we make from Figure 3-14 is that if all three record retrieval algorithms are allowed to run to completion, the ACE Tree is not the first to complete execution. Thus, there is generally a crossover point beyond which the sampling rate of an alternative random sampling technique is higher than the sampling rate of the ACE Tree.

Figure 3-18. Sampling rate of an ACE tree vs. rate for an R-tree, and scan of a randomly permuted file with a spatial selection predicate accepting 25% of the database tuples.

However, the important point is that such a transition always occurs very late in the query execution, by which time the ACE Tree has already retrieved almost all of the possible random samples. We found this trend for all the different query selectivities we tested, with single-dimensional as well as multi-dimensional ACE Trees. Thus, we emphasize that the existence of such a crossover point in no way belittles the utility of the ACE Tree, since in practical applications where random samples are used, the number of random samples required is very small. Since the ACE Tree provides the desired number of random samples (and many more) much faster than the other two methods, it still emerges as the top performer among the three methods for obtaining random samples. Finally, Figure 3-15 shows the memory requirement of the ACE Tree to store records that match the query predicate but cannot yet be used to answer the query. The fluctuation in the number of records buffered by the query algorithm at different times during the query execution is as expected.
This is because the amount of buffer space required by the query algorithm can vary as newly retrieved leaf node sections are either buffered (thus requiring more buffer space) or appended to already buffered sections (thus releasing buffer space). We also note from Figure 3-15 that the ACE Tree has a reasonable memory requirement, since only a very small fraction of the total number of records is buffered by it.

3.9 Conclusion and Discussion

In this chapter we have presented the idea of a sample view, which is an indexed, materialized view of an underlying database relation. The sample view facilitates efficient random sampling of records satisfying a relational range predicate. In the chapter we describe the ACE Tree, a new indexing structure that we use to index the sample view. We have shown experimentally that with the ACE Tree index, the sample view can be used to provide an online random sample with much greater efficiency than the obvious alternatives. For applications like online aggregation or data mining that require a random ordering of input records, this makes the ACE Tree a natural choice for random sampling. This is not to say that the ACE Tree is without any drawbacks. One obvious concern is that the ACE Tree is a primary file organization as well as an index, and hence it requires that the data be stored within the ACE Tree structure. This means that if the data are stored within an ACE Tree, then without replication of the data elsewhere it is not possible to cluster the data in another way at the same time. This may be a drawback for some applications. For example, it might be desirable to organize the data as a B+-Tree if non-sampling-based range queries are asked frequently as well, and this is precluded by the ACE Tree. This is certainly a valid concern. However, we still feel that the ACE Tree will be one important weapon in a data analyst's arsenal.
Applications like online aggregation (where the database is used primarily or exclusively for sampling-based analysis) already require that the data be clustered in a randomized fashion; in such a situation, it is not possible to apply traditional structures like a B+-Tree anyway, and so there is no additional cost associated with the use of an ACE Tree as the primary file organization. Even if the primary purpose of the database is a more traditional or widespread application such as OLAP, we note that it is becoming increasingly common for analysts to subsample the database and apply various analytic techniques (such as data mining) to the subsample; if such a sample were to be materialized anyway, then organizing the subsample itself as an ACE Tree in order to facilitate efficient online analysis would be a natural choice. Another potential drawback of the ACE Tree, as it has been described in this chapter, is that it is not an incrementally updateable structure. The ACE Tree is relatively efficient to construct in bulk: it requires two external sorts of the underlying data to build from scratch. The difficulty is that as new data are added, there is no easy way to update the structure without rebuilding it from scratch. Thus, one potential area for future work is to add the ability to handle incremental inserts to the sample view (assuming that the ACE Tree is most useful in a data warehousing environment, deletes are far less important). However, we note that even without the ability to incrementally update an ACE Tree, it is still easily usable in a dynamic environment if a standard method such as a differential file [111] is applied.
Specifically, one could maintain the differential file as a randomly permuted file or even a second ACE Tree, and when a relational selection query is posed, in order to draw a random sample from the query one selects the next sample from either the primary ACE Tree or the differential file with an appropriate hypergeometric probability (for an idea of how this could be done, see the recent paper of Brown and Haas [12] for a discussion of how to draw a single sample from multiple data set partitions). Thus, we argue that the lack of an algorithm to update the ACE Tree incrementally may not be a tremendous drawback. Finally, we close the chapter by asserting that the importance of having indexing methods that can handle insertions incrementally is often overstated in the research literature. In practice, most incrementally-updateable structures such as B+-Trees cannot be updated incrementally in a data warehousing environment due to performance considerations anyway [93]. Such structures still require on the order of one random I/O per update, rendering it impossible to efficiently process bulk updates consisting of millions of records without simply rebuilding the structure from scratch. Thus, we feel that the drawbacks associated with the ACE Tree do not prevent its utility in many real-world situations.

CHAPTER 4
SAMPLING-BASED ESTIMATORS FOR SUBSET-BASED QUERIES

4.1 Introduction

Sampling is well-established as a method for dealing with very large volumes of data, when it is simply not practical or desirable to perform the computation over the entire data set. Sampling has several advantages compared to other widely-studied approximation methodologies from the data management literature such as wavelets [88], histograms [92] and sketches [29]. Not the least of those is generality: it is very easy to efficiently draw a sample from a large data set in a single pass using reservoir techniques [34].
Then, once the sample has been drawn it is possible to guess, with greater or lesser accuracy, the answer to virtually any statistical query over those sets. Samples can easily handle many different database queries, including complex functions in relational selection and join predicates. The same cannot be said of the other approximation methods, which generally require more knowledge of the query during synopsis construction, such as the attribute that will appear in the SELECT clause of the SQL query corresponding to the desired statistical calculation. However, one class of aggregate queries that remains difficult or impossible to answer with samples is the so-called "subset-based" queries, which can generally be written in SQL in the form:

SELECT SUM (f1(r))
FROM R AS r
WHERE f2(r) AND NOT EXISTS
  (SELECT * FROM S AS s WHERE f3(r, s))

Note that the function f2 can be incorporated into f1 if we have f1 evaluate to zero whenever f2 is not true; thus, in the remainder of the chapter we will ignore f2. An example of such a query is: "Find the total salary of all employees who have not made a sale in the past":

SELECT SUM (e.SAL)
FROM EMP AS e
WHERE NOT EXISTS
  (SELECT * FROM SALE AS s WHERE s.EID = e.EID)

A general solution to this problem would greatly extend the class of database aggregate queries that are amenable to being answered via random sampling. For example, there is a very close relationship between such queries and those obtained by removing the NOT in the subquery. Using the terminology introduced later in this chapter, all records from EMP with i matching records in SALE are called "class-i" records. The only difference between NOT EXISTS and EXISTS is that the former query computes a sum over all class-0 records, whereas the latter query computes a sum over all class-i records with i > 0. Since any reasonable estimator for NOT EXISTS will likely have to compute an estimated sum over each class, a solution for NOT EXISTS should immediately suggest a solution for EXISTS.
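The class-i terminology can be made concrete with a toy example on assumed data (the records and the helper name record_class are illustrative, not from the thesis): an EMP record is class i if exactly i records in SALE match it.

```python
emp = [("alice",), ("bob",), ("carol",)]       # EMP records keyed by EID
sale = [("bob",), ("bob",), ("carol",)]        # SALE records keyed by EID

def record_class(e, sale):
    # Class of e = number of SALE records matching e's join attribute
    return sum(1 for s in sale if s[0] == e[0])

classes = {e[0]: record_class(e, sale) for e in emp}
```

Here NOT EXISTS would sum over the class-0 records (alice), while EXISTS would sum over the class-i records with i > 0 (bob and carol).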
Also, nested queries having an IN (or NOT IN) clause can be easily rewritten as a nested query having an EXISTS (or NOT EXISTS) clause. For example, the query "Find the total salary of all employees who have not made a sale in the past" given above can also be written as:

SELECT SUM (e.SAL)
FROM EMP AS e
WHERE e.EID NOT IN (SELECT s.EID FROM SALE AS s)

Furthermore, a solution to the problem of sampling for subset queries would allow sampling-based aggregates over SQL DISTINCT queries, which can easily be rewritten as subset queries. For example:

SELECT SUM (DISTINCT e.SAL) FROM EMP AS e

is equivalent to:

SELECT SUM (e.SAL)
FROM EMP AS e
WHERE NOT EXISTS
  (SELECT * FROM EMP AS e2
   WHERE id(e) < id(e2) AND e.SAL = e2.SAL)

In this query, id is a function that returns the row identifier for the record in question. Some work has considered the problem of sampling for counts of distinct attribute values [17, 49], but aggregates over DISTINCT queries remain an open problem. Similarly, it is possible to write an aggregate query where records with identical values may appear more than once in the data, but should be considered no more than once by the aggregate function, as a subset-based SQL query. For example:

SELECT SUM (e.SAL)
FROM EMP AS e
WHERE NOT EXISTS
  (SELECT * FROM EMP AS e2
   WHERE id(e) < id(e2) AND identical(e, e2))

In this query, the function identical returns true if the two records contain identical values for all of their attributes. This would be very useful in computations where the same data object may be seen at many sites in a distributed environment (packets in an IP network, for example). Previous work has considered how to perform sampling in such a distributed system [12, 77], but not how to deal with the duplicate data problem.
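The DISTINCT-to-subset-query rewrite above can be checked on a toy table (the data are assumed, purely for illustration): summing the distinct salaries equals a NOT EXISTS-style sum that keeps, for each salary value, only the record with the highest row identifier.

```python
emps = [(1, 50), (2, 60), (3, 50)]        # (id, sal); salary 50 appears twice

# SELECT SUM(DISTINCT e.SAL) FROM EMP AS e
sum_distinct = sum(set(sal for _, sal in emps))

# Rewritten form: keep e unless some e2 with a larger id has the same salary
not_exists_sum = sum(
    sal for eid, sal in emps
    if not any(eid < eid2 and sal == sal2 for eid2, sal2 in emps)
)
```

Both expressions count the duplicated salary exactly once, so the two query forms agree.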
Unfortunately, it turns out that handling subset queries using sampling is exceedingly difficult, due to the fact that the subquery in a subset query is not asking for a mean or a sum, tasks for which sampling is particularly well-suited. Rather, the subquery is asking whether we will ever see a match for each tuple from the outer relation. By looking at an individual tuple, this is very hard to guess: either we have seen a match already in our sample (in which case we are assured that the inner relation has a match), or we have not, in which case we may have almost no way to guess whether we will ever see a match. For example, imagine that employee Joe does not have a sale in a 10% sample of the SALE relation. How can we guess whether or not he has a sale in the remaining 90%? There is little relevant work in the statistical literature to suggest how to tackle subset queries, because such queries ask a simultaneous question linking two populations (database tables EMP and SALE in our example), which is an uncommon question in traditional applications of finite population sampling. Outside of the work on sampling for the number of distinct values [17, 49, 50] and one method that requires an index on the inner relation [75], there is also little relevant work in the data management literature; we presume this is due to the difficulty of the problem, since researchers have considered the difficulty of the more limited problem of sampling for distinct values in some detail [17].

Our Contributions

In this chapter, we consider the problem of developing sampling-based statistical estimators for such queries. In the remainder of this chapter, we assume without-replacement sampling, though our methods could easily be extended to other sampling plans. Given the difficulty of the problem, it is perhaps not surprising that significant statistical and mathematical machinery is required for a satisfactory solution.
Our first contribution is to develop an unbiased estimator, which is the traditional first step when searching for a good statistical estimator. An unbiased estimator is one that is correct on expectation; that is, if an unbiased estimator is run an infinite number of times, then the average over all of the trials would be exactly the same as the correct answer to the query. The reason that an unbiased estimator is the natural first choice is that if the estimator has low variance,1 then the fact that it is correct on average implies that it will always be very close to the correct answer. Unfortunately, it turns out that the unbiased estimator we develop often has high variance, which we prove analytically and demonstrate experimentally. Since it is easy to argue that our unbiased estimator is the only unbiased estimator for a certain subclass of subset-based queries (see the Related Work section of this chapter), it is perhaps doubtful that a better unbiased estimator exists. Thus, we also propose a novel, biased estimator that makes use of a statistical technique called "superpopulation modeling". Superpopulation modeling is an example of a so-called Bayesian statistical technique [39]. Bayesian methods generally make use of mild and reasonable distributional assumptions about the data in order to greatly increase estimation accuracy, and have become very popular in statistics in the last few decades. Using this method in the context of answering subset-based queries presents a number of significant technical challenges whose solutions are detailed in this chapter, including:

* The definition of an appropriate generative statistical model for the problem of sampling for subset-based queries.
* The derivation of a unique Expectation Maximization algorithm [26] to learn the model from the database samples.
* The development of algorithms for efficiently generating many new random data sets from the model, without actually having to materialize them.
Through an extensive set of experiments, we show that the resulting biased Bayesian estimator has excellent accuracy on a wide variety of data. The biased estimator also has the desirable property that it provides something closely related to classical confidence bounds, which can be used to give the user an idea of the accuracy of the associated estimate.

¹ Variance is the statistical measure of the random variability of an estimator.

4.2 The Concurrent Estimator

With a little effort, it is not hard to imagine several possible sampling-based estimators for subset queries. In this section, we discuss one very simple (and sometimes unusable) sample-based estimator. This estimator has previously been studied in detail [75], but we present it here because it forms the basis for the unbiased estimator described in the next section.

We begin our description with an even simpler estimation problem. Given a one-attribute relation R(A) consisting of n_R records, imagine that our goal is to estimate the sum over attribute A of all the records in R. A simple, sample-based estimator would be as follows. We obtain a random sample R' of size n_R' of the records of R, compute total = Σ_{r∈R'} r.A, and then scale up total to output total × n_R/n_R' as the estimate for the final sum. Not only is this estimator extremely simple to understand, but it is also unbiased and consistent, and its variance decreases monotonically with increasing sample size.

We can extend this simple idea to define an estimator for the NOT EXISTS query considered in the introduction. We start by obtaining random samples EMP' and SALE' of sizes n_EMP' and n_SALE', respectively, from the relations EMP and SALE. We then evaluate the NOT EXISTS query over the samples of the two relations. We compare every record in EMP' with every record in SALE', and if we do not find a matching record (that is, one for which β evaluates to true), then we add its f₁ value to the estimated total.
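The simple scale-up estimator just described can be sketched in a few lines of Python. The function and variable names below are illustrative (they do not come from the thesis), and the relation is modeled as a plain list of attribute values:

```python
import random

def estimate_sum(R, n_prime):
    """Estimate sum(r.A for r in R) from a without-replacement sample.

    R is modeled as a list of numeric A-values; n_prime is the sample size.
    """
    sample = random.sample(R, n_prime)   # without-replacement sample R'
    total = sum(sample)                  # sum of A over the sample
    return total * len(R) / n_prime      # scale up by n_R / n_R'
```

When the sample is the whole relation, the scale factor is 1 and the estimate is exact; for smaller samples the estimate is unbiased over the randomness of the sample.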
Lastly, we scale up the estimated total by a factor of n_EMP/n_EMP' to obtain the final estimate, which we term M:

M = (n_EMP/n_EMP') × Σ_{e∈EMP'} f₁(e) × (1 − min(1, cnt(e, SALE')))

In this expression, cnt(e, SALE') = Σ_{s∈SALE'} I(β(e, s)), where I is the standard indicator function, returning 1 if the boolean argument evaluates to true, and 0 otherwise. The algorithm can be slightly modified to accommodate growing samples of the relations, and has been described in detail in [75], where it is called the "concurrent estimator" since it samples both relations concurrently.

Unfortunately, in expectation, the estimator is often severely biased, meaning that it is, on average, incorrect. The reason for this bias is fairly intuitive. The algorithm compares a record from EMP' with all records from SALE', and if it does not find a matching record in SALE', it classifies the record as having no match in the entire SALE relation. Clearly, this classification may be incorrect for certain records in EMP, since although they might have no matching record in SALE', it is possible that they may match with some record from the part of SALE that was not included in the sample. As a result, M typically overestimates the answer to the NOT EXISTS query. In fact, the bias of M is:

Bias(M) = Σ_{e∈EMP} f₁(e) × min(1, cnt(e, SALE)) × φ(n_SALE, n_SALE', cnt(e, SALE))

In this expression, φ denotes the hypergeometric probability² that a sample of size n_SALE' will contain none of the cnt(e, SALE) matching records of e. The solution that was employed previously to counteract this bias requires an index such as a B+-Tree on the entire SALE relation, in order to estimate and correct for Bias(M). Unfortunately, the requirement for an index severely limits the applicability of the method. If an index on the "join" attribute in the inner relation is not available, the method cannot be used. In a streaming environment where it is not feasible to store SALE in its entirety, an index is not practical.
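For concreteness, the concurrent estimator M can be sketched as follows, with relations modeled as Python lists and the predicate β and aggregate f₁ passed in as functions. All names here are our own, not from the thesis:

```python
import random

def concurrent_estimate(emp, sale, n_emp_s, n_sale_s, beta, f1):
    """Naive concurrent estimator M for the NOT EXISTS aggregate.

    emp, sale: lists of records; beta(e, s) is the join predicate;
    f1(e) is the aggregate function; n_emp_s, n_sale_s are sample sizes.
    """
    emp_s = random.sample(emp, n_emp_s)     # EMP'
    sale_s = random.sample(sale, n_sale_s)  # SALE'
    total = 0.0
    for e in emp_s:
        cnt = sum(1 for s in sale_s if beta(e, s))  # cnt(e, SALE')
        if cnt == 0:            # no match observed in the sample
            total += f1(e)
    return total * len(emp) / n_emp_s       # scale up by n_EMP / n_EMP'
```

With full samples the estimate is exact; with partial samples of SALE it tends to overestimate, which is exactly the bias discussed above.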
The requirement of an index also precludes use of the concurrent estimator for a non-equality predicate in the inner subquery, or for non-database environments where sampling might be useful, such as in a distributed system.

² The hypergeometric probability distribution models the distribution of the number of red balls that will be obtained in a sample without replacement of n' balls from an urn containing r red balls and n − r non-red balls.

In the remainder of this chapter, we consider the development of sampling-based estimators for this problem that require nothing but samples from the relations themselves. Our first estimator makes use of a provably unbiased estimator B̂ias(M) for Bias(M). Taken together, M − B̂ias(M) is then an unbiased estimator for the final query answer. The second estimator we consider is quite different in character, making use of Bayesian statistical techniques.

4.3 Unbiased Estimator

4.3.1 High-Level Description

In order to develop an unbiased estimator for Bias(M), it is useful to first rewrite the formula for Bias(M) in a slightly different fashion. We subsequently refer to the set of records in EMP that have i matches in SALE as "class i records". Denote the sum of the aggregate function over all records of class i by t_i, so t_i = Σ_{e∈EMP} f₁(e) × I(cnt(e, SALE) = i) (note that the final answer to the NOT EXISTS query is the quantity t_0). Given that the probability that a record with i matches in SALE happens to have no matches in SALE' is φ(n_SALE, n_SALE', i), we can rewrite the expression for the bias of M as:

Bias(M) = Σ_{i=1}^{m} t_i × φ(n_SALE, n_SALE', i)    (4-1)

The above equation computes the bias of M since it computes the expected sum over the aggregate attribute of all records of EMP which are incorrectly classified as class 0 records by M. Let m be the maximum number of matching records in SALE for any record of EMP.
Equation 4-1 suggests an unbiased estimator for Bias(M), because it turns out that it is easy to generate an unbiased estimate for t_m: since no records other than those with m matches in SALE can have m matches in SALE', we can simply count the sum of the aggregate function f₁ over all such records in our sample, and scale up the total accordingly. The scale-up must also account for the fact that we use SALE' and not SALE to count matches. Once we have an estimate for t_m, it is possible to estimate t_{m−1}. How? Note that records with m−1 matches in SALE' must be members of either class m or class m−1. Using our unbiased estimate for t_m, it is possible to guess the total aggregate sum for those records with m−1 matches in SALE' that in reality have m matches in SALE. By subtracting this from the sum for those records with m−1 matches in SALE' and scaling up accordingly, we can obtain an unbiased estimate for t_{m−1}. In a similar fashion, each unbiased estimate for t_i leads to an unbiased estimate for t_{i−1}. By using this recursive relationship, it is possible to guess in an unbiased fashion the value of each t_i in the expression for Bias(M). This leads to an unbiased estimator for the Bias(M) quantity, which can be subtracted from M to provide an unbiased guess for the query result.

4.3.2 The Unbiased Estimator In Depth

We now formalize the above ideas to develop an unbiased estimator for each t_k that can be used in conjunction with Equation 4-1 to develop an unbiased estimator for Bias(M). We use the following additional notation for this section and the remainder of this chapter:

* a_{k,i} is a 0/1 (non-random) variable which evaluates to 1 if the ith tuple of EMP has k matches in SALE and evaluates to 0 otherwise.
* s_k is the sum of f₁ over all records of EMP' having k matching records in SALE': s_k = Σ_{e_i∈EMP'} I(cnt(e_i, SALE') = k) × f₁(e_i).
* α₀ is n_EMP'/n_EMP, the sampling fraction of EMP.
* Y_i is a random variable which governs whether or not the ith record of EMP appears in EMP'.
* h(k; n_SALE, n_SALE', i) is the hypergeometric probability that out of the i interesting records in a population of size n_SALE, exactly k will appear in a random sample of size n_SALE'. For compactness of representation we will refer to this probability as h(k; i) in the remainder of the thesis, since our sampling fraction never changes.

We begin by noting that if we consider only those records from EMP which appear in the sample EMP', an unbiased estimator for t_k over EMP' can be expressed as follows:

t̂_k = (1/α₀) Σ_{i=1}^{n_EMP} a_{k,i} × Y_i × f₁(e_i)    (4-2)

Unfortunately, this estimator relies on being able to evaluate a_{k,i} for an arbitrary record, which is impossible without scanning the inner relation in its entirety. However, with a little cleverness, it is possible to remove this requirement. We have seen earlier that a record e can have k matches in the sample SALE' provided it has i ≥ k matches in SALE. This implies that records from all classes i where i ≥ k can contribute to s_k. The contribution of a class i record towards the expected value of s_k is obtained by simply multiplying the probability that it will have k matches in SALE' with its aggregate attribute value. Thus a generic expression to compute the contribution of any arbitrary record e_j from EMP' towards the expected value of s_k can be written as Σ_{i=k}^{m} a_{i,j} × h(k; i) × f₁(e_j). Then, the following random variable has an expected value that is equivalent to the expected value of s_k:

ŝ_k = Σ_{j=1}^{n_EMP} Σ_{i=k}^{m} Y_j × a_{i,j} × h(k; i) × f₁(e_j)    (4-3)

The fact that E[ŝ_k] = E[s_k] (proven in Section 4.3.3) is significant, because there is a simple algebraic relationship between the various ŝ_k variables and the various t̂_k variables. Thus, we can express one set in terms of the other, and then replace each ŝ_k with s_k, in order to derive an unbiased estimator for each t_k.
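Since the probability h(k; i) is used throughout the rest of the chapter, a small helper that evaluates it may be useful. This is a sketch using Python's math.comb, with argument names of our own choosing; note that φ(n_SALE, n_SALE', i) is simply h(0; i):

```python
from math import comb

def h(k, i, n_sale, n_sale_s):
    """Hypergeometric probability that exactly k of the i matching records
    land in a without-replacement sample of size n_sale_s drawn from a
    SALE relation of size n_sale."""
    if k > i or k > n_sale_s or n_sale_s - k > n_sale - i:
        return 0.0  # impossible configurations have probability zero
    return comb(i, k) * comb(n_sale - i, n_sale_s - k) / comb(n_sale, n_sale_s)
```

For example, h(0; 1) with n_SALE = 2 and n_SALE' = 1 is 1/2: a single matching record is missed by a half-size sample half of the time.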
The benefit of doing this is that since s_k is defined as the sum of f₁ over all records of EMP' having k matching records in SALE', it can be directly evaluated from the samples EMP' and SALE'. To derive the relationship between ŝ_k and t̂_k, we start with an expression for ŝ_{m−r} using Equation 4-3:

ŝ_{m−r} = Σ_{j=1}^{n_EMP} Σ_{i=m−r}^{m} Y_j × a_{i,j} × h(m−r; i) × f₁(e_j)
        = Σ_{i=0}^{r} h(m−r; m−r+i) × Σ_{j=1}^{n_EMP} Y_j × a_{m−r+i,j} × f₁(e_j)
        = Σ_{i=0}^{r} h(m−r; m−r+i) × α₀ × t̂_{m−r+i}    (4-4)

By rearranging the terms we get the following important recursive relationship:

t̂_{m−r} = (ŝ_{m−r} − α₀ Σ_{i=1}^{r} h(m−r; m−r+i) × t̂_{m−r+i}) / (α₀ h(m−r; m−r))    (4-5)

For the base case we obtain:

t̂_m = a_m × ŝ_m    (4-6)

where a_m = 1/(α₀ h(m; m)). By replacing ŝ_{m−r} in the above equations with s_{m−r}, which is readily observable from the data and has the same expected value, we can obtain a simple recursive algorithm for computing an unbiased estimator for any t_i. Before presenting the recursive algorithm, we note that we can rewrite Equation 4-5 for t̂_i by replacing ŝ with s, changing the summation variable from i to k, and substituting m−r by i:

t̂_i = (s_i − α₀ Σ_{k=1}^{m−i} h(i; i+k) × t̂_{i+k}) / (α₀ h(i; i))

The following pseudocode then gives the algorithm for computing an unbiased estimator for any t_i.

Function GetEstTi(int i) {
1   if (i == m)
2     return s_m/(α₀ h(m; m))
3   else {
4     returnval = s_i
5     for(int k = 1; k <= m − i; k++)
6       returnval −= α₀ h(i; i + k) × GetEstTi(i + k)
7     returnval /= α₀ h(i; i)
8     return returnval
9   }
}

Recall from Equation 4-1 that the bias of M was expressed as a linear combination of the various t_i terms. Using GetEstTi to estimate each of the t_i terms, we can write an estimator for the bias of M as:

B̂ias(M) = Σ_{i=1}^{m} φ(n_SALE, n_SALE', i) × GetEstTi(i)    (4-7)

In the following two subsections, we present a formal analysis of the statistical properties of our estimator.³

³ Note the h(m; m) probability in line 2 of the GetEstTi function. If the sample size from SALE is not at least as large as m, then h(m; m) = 0 and GetEstTi is undefined.
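The GetEstTi pseudocode above translates directly into a runnable sketch. The helper hyp re-implements the hypergeometric probability h(k; i) so that the fragment stands alone, s is assumed to be an array with s[k] holding the observed value of s_k, and alpha0 is the sampling fraction of EMP; these parameter names are our own:

```python
from math import comb

def hyp(k, i, n_sale, n_sale_s):
    # P(exactly k of the i matching records appear in a sample of size n_sale_s)
    if k > i or k > n_sale_s or n_sale_s - k > n_sale - i:
        return 0.0
    return comb(i, k) * comb(n_sale - i, n_sale_s - k) / comb(n_sale, n_sale_s)

def get_est_ti(i, m, s, alpha0, n_sale, n_sale_s):
    """Unbiased estimator for t_i, following the GetEstTi pseudocode.

    s[k] is the observed sum of f1 over EMP' records with k matches in SALE'.
    """
    if i == m:  # base case: only class-m records can show m matches
        return s[m] / (alpha0 * hyp(m, m, n_sale, n_sale_s))
    val = s[i]
    for k in range(1, m - i + 1):  # subtract contributions of higher classes
        val -= alpha0 * hyp(i, i + k, n_sale, n_sale_s) * \
               get_est_ti(i + k, m, s, alpha0, n_sale, n_sale_s)
    return val / (alpha0 * hyp(i, i, n_sale, n_sale_s))
```

A useful sanity check: when both relations are sampled in full (alpha0 = 1 and n_sale_s = n_sale), h(k; i) is 1 exactly when k = i, and the estimator returns the observed s_i, which is then t_i itself.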
This means that our estimator is undefined if the sample is not at least as large as the largest number of matches for any record from EMP in SALE. The fact that the estimator is undefined in this case is not surprising, since it means that our estimator does not conflict with known results regarding the existence of an unbiased estimator for the distinct value problem. See the Related Work section for more details.

4.3.3 Why Is the Estimator Unbiased?

According to Equation 4-7, the estimator for the bias of M is composed of a sum of m different estimators. Hence, by the linearity of expectation, the expected value of the estimator can be written as:

E[B̂ias(M)] = Σ_{i=1}^{m} φ(n_SALE, n_SALE', i) × E[GetEstTi(i)]    (4-8)

The above relation suggests that in order to prove that the sample-based estimator of Equation 4-7 is unbiased, it suffices to prove that each of the individual GetEstTi estimators is unbiased. We use mathematical induction to prove the correctness of the various estimators in expectation. As a preliminary step for the proof of unbiasedness, we first derive the expected value of the s_k quantity used by GetEstTi. To do this, we introduce a zero/one random variable H_{j,k} that evaluates to 1 if e_j has k matches in SALE' and 0 otherwise. The expected value of this variable is simply the probability that it evaluates to 1, giving us E[H_{j,k}] = h(k; cnt(e_j, SALE)). With this:

E[s_k] = E[Σ_{j=1}^{n_EMP} Y_j × H_{j,k} × f₁(e_j)]
       = α₀ Σ_{j=1}^{n_EMP} Σ_{i=k}^{m} a_{i,j} × h(k; i) × f₁(e_j)    (4-9)

We are now ready to present a formal proof of unbiasedness of GetEstTi.

Theorem 1. The expected value of GetEstTi(i) is Σ_{j=1}^{n_EMP} a_{i,j} f₁(e_j).

Proof. Using Equation 4-5, the recursive GetEstTi estimator can be rewritten as:

GetEstTi(i) = (s_i − α₀ Σ_{k=1}^{m−i} h(i; i+k) × GetEstTi(i+k)) / (α₀ h(i; i))    (4-10)

We first prove unbiasedness for the base case: GetEstTi(m).
Setting i = m in the above relation and taking the expectation:

E[GetEstTi(m)] = E[s_m] / (α₀ h(m; m))

Replacing E[s_m] using Equation 4-9:

E[GetEstTi(m)] = (α₀ Σ_{j=1}^{n_EMP} a_{m,j} h(m; m) f₁(e_j)) / (α₀ h(m; m)) = Σ_{j=1}^{n_EMP} a_{m,j} f₁(e_j)

which is exactly the value of t_m. By induction, we can now assume that all estimators GetEstTi(i + k) for 1 ≤ k ≤ m − i are unbiased, and we use this to prove that the estimator GetEstTi(i) is unbiased. Taking the expectation on both sides of Equation 4-10:

E[GetEstTi(i)] = E[(s_i − α₀ Σ_{k=1}^{m−i} h(i; i+k) × GetEstTi(i+k)) / (α₀ h(i; i))]

By the linearity of expectation:

E[GetEstTi(i)] = (E[s_i] − α₀ Σ_{k=1}^{m−i} h(i; i+k) × E[GetEstTi(i+k)]) / (α₀ h(i; i))

Replacing the values of E[GetEstTi(i+k)] and E[s_i]:

E[GetEstTi(i)] = (1/(α₀ h(i; i))) × (α₀ Σ_{j=1}^{n_EMP} Σ_{k=i}^{m} a_{k,j} h(i; k) f₁(e_j) − α₀ Σ_{j=1}^{n_EMP} Σ_{k=1}^{m−i} a_{i+k,j} h(i; i+k) f₁(e_j))    (4-11)

For the second term in the parentheses, replacing i + k by p and changing the limits of summation for the inner sum accordingly:

E[GetEstTi(i)] = (1/(α₀ h(i; i))) × (α₀ Σ_{j=1}^{n_EMP} Σ_{k=i}^{m} a_{k,j} h(i; k) f₁(e_j) − α₀ Σ_{j=1}^{n_EMP} Σ_{p=i+1}^{m} a_{p,j} h(i; p) f₁(e_j))

We notice that the limits of summation of the inner sum of the first term are from i to m. Splitting this term into two terms such that one term has limits of summation from i to i while the other has limits from i+1 to m, the second piece cancels with the second term above, leaving:

E[GetEstTi(i)] = Σ_{j=1}^{n_EMP} a_{i,j} f₁(e_j)    (4-12)

4.3.4 Computing the Variance of the Estimator

The unbiasedness of B̂ias(M) means that it may be useful. However, the accuracy of any estimator depends on its variance as well as its bias. We now investigate the variance of our unbiased estimator. We have seen that B̂ias(M) is a linear combination of the various GetEstTi results, with φ(n_SALE, n_SALE', i) as the coefficient of GetEstTi(i). In order to derive an expression for the variance of the estimator and gain insight about the potential values it can take, we first express the estimator as a linear combination of the s_i terms:

B̂ias(M) = Σ_{i=1}^{m} b_i × s_i    (4-13)

The next step in deriving the variance is being able to compute the various b_i values. Intuitively, the b_i terms can be thought of as coming from the linear relationship between the t̂_i and s_i terms.
The following algorithm shows how we can compute the b_i values.

Function ComputeBis(m) {
1   // Let table[m][m] be a 2-dimensional array with all elements initialized to zero
2   for(int row = 0; row < m; row++) {
3     for(int term = 1; term <= row; term++) {
4       factor = h(m − row; m − row + term)/h(m − row; m − row)
5       prow = row − term
6       for(int pcol = 0; pcol <= prow; pcol++)
7         table[row][pcol] += factor × table[prow][pcol]
8     }
9     table[row][row] = 1/h(m − row; m − row)
10  }
11  for(int row = 0; row < m; row++)
12    for(int col = 0; col <= row; col++)
13      b_{m−col} += h(0; m − row) × table[row][col]
}

With this, the variance of the estimator can then be written as:

Var(B̂ias(M)) = Var(Σ_{i=1}^{m} b_i × s_i)    (4-14)

Note that the s_i values are not independent random variables, since if an EMP' record has i matches in SALE', then it cannot have j ≠ i matches in SALE'. Hence we have:

Var(B̂ias(M)) = Σ_{i=1}^{m} b_i² Var(s_i) + 2 Σ_{i=1}^{m} Σ_{j>i} b_i b_j Cov(s_i, s_j)    (4-15)

The Var and Cov terms can be computed by using the standard formulas:

Var(s_i) = E[s_i²] − E²[s_i]
Cov(s_i, s_j) = E[s_i s_j] − E[s_i]E[s_j]    (4-16)

To evaluate Var(s_i) and Cov(s_i, s_j), the quantities E[s_i²] and E[s_i s_j] can be computed as follows:

E[s_i s_j] = E[(Σ_{k=1}^{n_EMP} Y_k H_{k,i} f₁(e_k)) × (Σ_{r=1}^{n_EMP} Y_r H_{r,j} f₁(e_r))]
           = Σ_{k=1}^{n_EMP} Σ_{r=1}^{n_EMP} E[Y_k Y_r H_{k,i} H_{r,j}] × f₁(e_k) f₁(e_r)    (4-17)

The above expression can be evaluated using the following rules:

* if k ≠ r (that is, e_k and e_r are two different tuples), then E[H_{k,i} H_{r,j}] = h(i; cnt(e_k, SALE)) × h(j; cnt(e_r, SALE)), if we assume that no record s exists in SALE where β(e_k, s) = β(e_r, s) = true
* if i = j (that is, we are computing E[s_i²]) and k = r, then E[H_{k,i} H_{r,j}] = h(i; cnt(e_k, SALE))
* if i ≠ j (that is, we are computing E[s_i s_j]) and k = r, then E[H_{k,i} H_{r,j}] = 0, since a record cannot have two different numbers of matches in a sample
* if k = r, then E[Y_k Y_r] = α₀
* if k ≠ r, then E[Y_k Y_r] ≈ α₀²

4.3.5 Is This Good?
At this point, we have a simple, unbiased estimator for the answer to a subset-based query, as well as a formal analysis of the statistical properties of the estimator.

[Figure 4-1. Sampling from a superpopulation: in step 1, an actual population of N records is drawn from the superpopulation model F(Θ); in step 2, a hypothetical sample is drawn from the actual population under the desired sampling design.]

However, there are two problems related to the variance that may limit the utility of the estimator. First, in order to evaluate the hypergeometric probabilities needed to compute or estimate the variance, we need the value of cnt(e, SALE) for an arbitrary record e of EMP. This information is generally unavailable during sampling, and it seems difficult or impossible to obtain a good estimate for the appropriate probability without having this information. This means that in practice, it will be difficult or impossible to tell a user how accurate the resulting estimate is likely to be. We have experimented with general-purpose methods such as the bootstrap [31] to estimate this variance, but have found that these methods often do an extremely poor job in practice. Second, the variance of the estimator itself may be huge. The b_i coefficients are composed of sums, products, and ratios of hypergeometric probabilities, which can result in huge values. Particularly worrisome is the h(i; i) value in the denominator used by GetEstTi. Such probabilities can be tiny; including such a small value in the denominator of an expression results in a very large value that may "pump up" the variance accordingly.

4.4 Developing a Biased Estimator

In light of these problems, in this section we describe a biased estimator that is often far more accurate than the unbiased one, and that also provides the user with an idea of the estimation accuracy. Just like the unbiased estimator M − B̂ias(M) from the previous section, our biased estimator will be nothing more than a weighted sum over the observed s_k values.
However, the weights will be chosen so as to minimize the expected (mean-squared) error of the resulting estimator. To develop our biased estimator, we make use of the "superpopulation modeling" approach from statistics [78]. One simple way to think of a superpopulation is that it is an infinitely large set of records from which the original data set has been obtained by random sampling. Because the superpopulation is infinite, it is specified using a parametric distribution, which is usually referred to as the prior distribution. Using a superpopulation method, we imagine that the following two-step process is used to produce our sample:

1. Draw a large sample of size N from an imaginary, infinite superpopulation, where N is the data set size.
2. Draw a sample of size n < N without replacement from the large sample of size N obtained in Step 1, where n is the desired sample size.

By characterizing the superpopulation, it is possible to design an estimator that tends to perform well on any data set and sample obtained using the process above. The following steps outline a roadmap of our superpopulation-based approach for obtaining a high-quality biased estimator for a subset-based query. We describe each step in detail in the next section.

1. Postulate a superpopulation model F for our data set (F is the prior distribution; we use the notation p_F to denote the probability density function (PDF) of F). In general, F is parameterized on a parameter set Θ.
2. Infer the most likely values of the parameter set Θ from EMP' and SALE'. Since we do not have the complete data, but rather a random sample of the data, this is a difficult problem. We make use of an Expectation-Maximization (EM) algorithm to learn the model parameters.
3. Use F(Θ) to generate d different populations P₁, ..., P_d, where each P_i = (EMP_i, SALE_i). Note that if the data set in question is large, this may be very expensive.
We show that for our problem it is not necessary to generate the actual populations; it is enough to obtain certain sufficient statistics for each of them, which can be done efficiently.

4. Sample from each P_i to obtain d sample pairs of the form S_i = (EMP_i', SALE_i'). Again, this can be done without actually materializing the samples.
5. Let q(P_i) be the query answer over the ith data set. Construct a weighted estimator W that minimizes Σ_i (q(P_i) − W(S_i))².
6. Use W on the original samples EMP' and SALE' to obtain the final estimate of the NOT EXISTS query. The MSE of this estimate can generally be assumed to be the MSE over all of the populations generated: Σ_i (1/d) × (q(P_i) − W(S_i))².

4.5 Details of Our Approach

In this section, we discuss in detail each of the steps outlined above of our approach to obtaining an optimal weighted estimator for the NOT EXISTS query.

4.5.1 Choice of Model and Model Parameters

The first task is to define a generative model and an associated probability density function for the two relations EMP and SALE. While this may seem like a daunting task (and a potentially impossible one, given all of the intricacies of modeling real-life data), it is made easy by the fact that we only need to define a model that can realistically reproduce those characteristics of EMP and SALE that may affect the bias or variance of an estimator for a subset-based query. From the material in Section 4.3 of the thesis, for a given record e from EMP, we know that these three characteristics are:

1. f₁(e)
2. cnt(e, SALE), which is the number of SALE records s for which β(e, s) is true
3. cnt(e, e', SALE) where e' ≠ e, which is the number of SALE records s for which β(e, s) ∧ β(e', s) is true

To simplify our task, we will actually ignore the third characteristic and define a model such that this count is always zero for any given record pair. While this may introduce some inaccuracy into our method, it still captures a large number of real-life situations.
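Step 5 of the roadmap is an ordinary least-squares problem once W is restricted to a weighted sum over per-sample statistics (such as the observed s_0, ..., s_m for each generated population). The following sketch is our own construction rather than code from the thesis; it solves the normal equations for the weight vector with a small Gaussian elimination:

```python
def fit_weights(S, q):
    """Least-squares weights w minimizing sum_i (q[i] - w . S[i])^2.

    S is a d x p matrix of per-sample statistics (one row per generated
    population); q[i] is the true query answer on population i.
    """
    d, p = len(S), len(S[0])
    # Normal equations: A w = b, with A = S^T S and b = S^T q
    A = [[sum(S[i][a] * S[i][b_] for i in range(d)) for b_ in range(p)]
         for a in range(p)]
    b = [sum(S[i][a] * q[i] for i in range(d)) for a in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    w = [0.0] * p
    for r in range(p - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, p))) / A[r][r]
    return w
```

Given the fitted weights, the final estimate of step 6 is simply the dot product of w with the statistics observed on the original samples EMP' and SALE'.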
For example, if β consists of an equality check on a foreign key from SALE into EMP (which is arguably the most common example of such a subset-based query), then two records from EMP can never match with the same record from SALE, and this count is always zero. Given that our model needs to be able to generate instances of EMP and SALE that realistically model the first two aspects given above, we choose the parameter set Θ = {p, μ, σ²} where:

* p is a vector of probabilities, where p_i represents the probability that any arbitrary record of EMP belongs to class i.
* μ is a vector of means, where μ_i represents the mean aggregate value of all records belonging to class i.
* σ² is the variance of f₁(e) over all records e ∈ EMP.

Given these parameters, EMP and SALE are generated using our model as follows:

Procedure GenData
1 For rec = 1 to n_EMP do
2   Randomly generate k between 0 and m such that for any i, 0 ≤ i ≤ m, Pr[k = i] = p_i
3   Generate a value for f₁(e) by sampling from N(μ_k, σ²)
4   Add the resulting e to EMP
5   For j = 1 to k do
6     Generate a record s where β(e, s) is true
7     Add s to SALE

In step (3), N is a normally distributed random variable with the specified mean and variance. We use a normal random variable because we are interested in sums over classes in EMP; due to the central limit theorem (CLT), these sums will be normally distributed for a large database. Thus, using a normal random variable does not result in a loss of generality. Also, note that in step (6), in accordance with our earlier assumption, the record s generated for e matches no other record of EMP. In our actual model, the various μ_i values are not assumed to be independent; rather, we assume a linear relationship between them to limit the degrees of freedom of the model and thus avoid the problem of overfitting (see Section 4.5.1). In our model the various μ_i values are related as μ_i = s × i + μ₀, where s and μ₀ are the only two parameters that need to be learned to determine all of the μ_i.
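The GenData procedure can be sketched as follows. We fold in the linear model for the class means described just above (mu_i = s_slope * i + mu0); all identifiers are illustrative, and SALE records are represented simply by the id of the EMP record they match, so that each generated s matches only its generating e:

```python
import random

def gen_data(n_emp, m, p, s_slope, mu0, sigma2):
    """Generate (EMP, SALE) from the superpopulation model.

    p[i] is the probability that a record belongs to class i (i matches
    in SALE); class means follow mu_i = s_slope * i + mu0; sigma2 is the
    shared variance of f1(e).
    """
    emp, sale = [], []
    for rec in range(n_emp):
        k = random.choices(range(m + 1), weights=p)[0]   # pick a class
        f1 = random.gauss(s_slope * k + mu0, sigma2 ** 0.5)
        emp.append((rec, f1, k))
        for _ in range(k):       # emit exactly k matching SALE records
            sale.append(rec)     # s matches only its generating e
    return emp, sale
```

A database generated this way has, by construction, cnt(e, e', SALE) = 0 for every record pair, matching the simplifying assumption made above.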
Also, in order to avoid overfitting, we assume that σ² is the variance of f₁(e) over all records, rather than modeling and learning variance values for all of the individual classes separately. We now define the density function for the superpopulation model corresponding to the GenData algorithm. For a given EMP record e, if f₁(e) = v and cnt(e, SALE) = k, the probability density for e given a parameter set Θ is:

p(e|Θ) = p(v, k|Θ) = p_k × f_N(v; μ_k, σ²)    (4-18)

Where it is convenient, we will use the notation p(v, k|Θ) for values v and k and p(e|Θ) for record e interchangeably. In this expression, f_N is the PDF of the normal distribution evaluated at v, given by:

f_N(v; μ, σ²) = (1/√(2πσ²)) × exp(−(v − μ)²/(2σ²))    (4-19)

Then, if we consider a given data set {EMP, SALE}, the probability density of the data set is simply the product of the densities of all of the individual records:

p(EMP, SALE|Θ) = Π_{e∈EMP} p(e|Θ)    (4-20)

A Note on the Generality of the Model. As described, our model is extremely general, making almost no assumptions about the data other than the fact that f₁(e) values are normally distributed. This is actually an inconsequential assumption anyway, since we are interested in sums over f₁(e) values, which will be normally distributed whatever the distribution of f₁(e), due to the CLT. On one hand, this generality can be seen as a benefit of the approach: it makes use of very few assumptions about the data. Most significant is the lack of any sort of restriction on the probability vector p. The result is that the number of records from SALE matching a certain record from EMP is multinomially distributed. On the other hand, a Bayesian argument [39] can be made that such extreme freedom is actually a poor choice, and that in "real life" an analyst will have some sort of idea what the various p_i values look like, so that a more restrictive distribution providing fewer degrees of freedom should be used. For example, a negative binomial distribution has been assumed for the distinct value estimation problem [90].
Such background knowledge could certainly improve the accuracy of the method. Though we eschew any such restrictions in the remainder of the thesis (except for the assumption of a linear relationship among the μ_i values; see "Dealing with Overfitting" in the next section), we note that it would be very easy to incorporate such knowledge into our method. The only change needed is that the EM algorithm described in the next section would need to be modified to incorporate any constraints induced on the various parameters by the additional distributional assumptions.

4.5.2 Estimation of Model Parameters

Now that we have defined our superpopulation model, we need access to the parameter set Θ that was used to create our particular instances of EMP and SALE in order to develop an estimator that performs well for the resulting superpopulation. However, we have several difficulties. First, we do not know Θ; since EMP and SALE are in reality not sampled from any parametric distribution, Θ does not even exist. We could compute a maximum-likelihood estimate (MLE)⁴ to choose a Θ that optimally fits EMP and SALE, but then we have an even bigger problem: we do not even have access to EMP and SALE; we only have access to samples from them. Thus, we need a way to infer Θ by looking only at the samples EMP' and SALE'. It turns out that we can still make use of an MLE. Since EMP' may be treated as a set of independent, identically distributed samples from F, if we simply replace EMP with EMP' as an argument to p_F, then by choosing Θ so as to maximize p_F, we will still produce exactly the same estimate for Θ in expectation that we would have if EMP were used instead. Thus, we can essentially ignore the distinction between EMP and EMP'. However, the same argument does not hold for SALE, because without access to all of SALE, we cannot compute k = cnt(e, SALE) for arbitrary e in order to apply an MLE.
To handle this, we will modify our PDF slightly to also take into account the sampling from SALE. This can easily be done by modifying the function p(v, k|Θ). To simplify the modification, we ignore the fact that the number of records s from SALE' where β(e₁, s) is true may be correlated with the number of records from SALE' where β(e₂, s) is true for arbitrary records e₁ and e₂ from EMP; that is, we assume that we are looking for matches of a record e in its own private sample from SALE and that all of these samplings are independent. With this, if f₁(e) = v and cnt(e, SALE) = k and cnt(e, SALE') = k', then:

p(v, k, k'|Θ) = p(v, k|Θ) × h(k'; k)    (4-21)

In this expression, h is the hypergeometric probability of seeing k' matches for e in SALE', given that there were k matches in SALE.

⁴ An MLE is a standard statistical estimator for unknown model parameters when a sample is available; the MLE simply chooses Θ so as to maximize the value of the PDF of the sample.

Since the portion of SALE that is not in SALE' is hidden from us due to the sampling, we do not know k, and we have a classic example of an MLE problem with hidden or missing data. There are several methods in the literature for solving such a problem; the one that we employ is the Expectation-Maximization (EM) algorithm. The EM algorithm [26] is a general method for finding the maximum-likelihood estimate of the parameters of an underlying distribution from a given data set when the data is incomplete or has missing values. EM starts out with an initial assignment of values for the unknown parameters and, at each step, recomputes new values for each of the parameters via a set of update rules. EM continues this process until the likelihood stops increasing. Since cnt(e, SALE) is unknown, the likelihood function is:

L(Θ|{EMP', SALE'}) = Π_{e∈EMP'} Σ_{k=0}^{m} p(f₁(e), k, cnt(e, SALE')|Θ)

We present the derivation of our EM implementation in the Appendix; here we give only the algorithm.
In this algorithm, f(i|Θ, e) denotes the posterior probability that record e belongs to class i; this is the probability that, given the current set of values for Θ, record e belongs to class i.

Procedure EM(Θ)
1  Initialize all parameters of Θ; L_prev = −9999
2  while (true) {
3    Compute L(Θ) from the sample and assign it to L_curr
4    if ((L_curr − L_prev)/L_prev < 0.01) break
5    Compute the posterior probabilities f(i|Θ, e) for each e ∈ EMP' and each class i
6    Recompute all parameters of Θ by using the following update rules:
7      p_i = (1/n_EMP') Σ_{e∈EMP'} f(i|Θ, e)
8      μ_i = Σ_{e∈EMP'} f(i|Θ, e) × f₁(e) / Σ_{e∈EMP'} f(i|Θ, e)
9      σ² = (1/n_EMP') Σ_{e∈EMP'} Σ_{i=0}^{m} f(i|Θ, e) × (f₁(e) − μ_i)²
10     L_prev = L_curr
11 }
12 Return the values in Θ as the final parameters of the model

Every iteration of the EM algorithm performs an expectation (E) step and a maximization (M) step. In our algorithm, the E-step is contained in step (5), where for each record e of EMP', a set of posterior probabilities f(i|Θ, e), 0 ≤ i ≤ m, is computed under the current model parameters Θ. The posterior probability f(i|Θ, e) is computed as described in the Appendix. Intuitively, the posterior probability for record e and class i is a ratio of two quantities: (1) the probability that e belongs to class i according to the density function of the model, and (2) the sum of the probabilities that it belongs to each of the classes 0 through m, also according to the model density function. The M-step (which corresponds to steps (6) through (9) of our algorithm) updates the parameters of our model in such a way that the expected value of the likelihood function associated with the model is maximized with respect to the posterior probabilities. Details of how we obtain the various update rules are explained in the Appendix. The observant reader may note that the EM algorithm assumes that the parameter m is known before the process begins. This is potentially a problem, since m will typically be unknown.
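The E-step posterior described above, namely the ratio of p_i × f_N(f₁(e); μ_i, σ²) × h(k'; i) to its sum over all classes, can be sketched as follows. This is our reading of Equation 4-21 and the surrounding text; the exact derivation in the thesis Appendix may differ in details, and all identifier names are our own:

```python
from math import comb, exp, pi, sqrt

def normal_pdf(v, mu, sigma2):
    # f_N(v; mu, sigma^2)
    return exp(-(v - mu) ** 2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

def hyp(kp, k, n_sale, n_sale_s):
    # P(kp of the k matches appear in a sample of size n_sale_s)
    if kp > k or kp > n_sale_s or n_sale_s - kp > n_sale - k:
        return 0.0
    return comb(k, kp) * comb(n_sale - k, n_sale_s - kp) / comb(n_sale, n_sale_s)

def posterior(v, kp, p, mu, sigma2, n_sale, n_sale_s):
    """E-step: posterior f(i | Theta, e) that a record e with f1(e) = v
    and kp observed matches in SALE' belongs to class i, for each i."""
    m = len(p) - 1
    w = [p[i] * normal_pdf(v, mu[i], sigma2) * hyp(kp, i, n_sale, n_sale_s)
         for i in range(m + 1)]
    tot = sum(w)
    return [x / tot for x in w]
```

A quick sanity check: if SALE is sampled in full (n_sale_s = n_sale), then h(k'; i) is nonzero only for i = k', so the posterior concentrates all of its mass on the observed class.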
Fortunately, knowing the exact value of m is not vital, particularly if m is overestimated (in which case the class probabilities associated with the class-i records for large i will end up being zero, if the EM algorithm functions correctly). As a rough estimate for m, we take the record from EMP' with the largest number of matches in SALE' and scale up its number of matches by n_SALE / n_SALE'. Particularly if several records with m matches in SALE are expected to appear in EMP', this estimate for m will be quite conservative.

Dealing with Overfitting. The superpopulation model has a total of 2(m + 1) + 1 parameters within θ. Since the number of degrees of freedom of the model is so large, the model has tremendous leeway when choosing parameter values. This potentially leads to a well-known drawback of learned models, overfitting the training data, where the model is tailored to be excessively well-suited to the training data at the cost of generality. Several techniques have been proposed to address the overfitting problem [30]. We use the following two methods in our approach:

* Limiting the number of degrees of freedom of the model.
* Using multiple models and combining them to develop our final estimator.

To use the first technique, we restrict our generative model so that the mean aggregate value of all records of any class i is not independent of the mean values of the other classes. Rather, we use a simple linear regression model μ_i = a × i + μ_0. a and μ_0 are the two parameters of the linear regression model and can be learned easily. This means that once we have learned the two parameters a and μ_0, the μ_i values for all of the other classes can be determined directly from the above relation and need not be learned separately. As mentioned previously, it would also be possible to place distributional constraints upon the vector p in order to reduce the degrees of freedom even more, though we choose not to do this in our implementation.
Our second strategy for tackling the overfitting problem is to learn multiple models rather than working with a single model. These models differ from each other only in that they are learned using our EM algorithm with different initial random settings for their parameters. When generating populations from the models learned via EM (as described in the next subsection), we then rotate through the various models in round-robin fashion.

Are we not done yet? Once the model has been learned, a simple estimator is immediately available to us: we could return p_0 × μ_0 × n_EMP, since this is the expected query result over an arbitrary database sampled from the model. This is equivalent to first determining a class of databases from which the database in question has been randomly selected, and then returning the average query result over all of those databases. If multiple models are learned in order to alleviate the overfitting problem, then we can use the average of this expression over all of those models. While this estimator is certainly reasonable, the concern is twofold. First, if there is high variability in the possible populations that could be produced by the model or models (corresponding to uncertainty in the correctness of the model), then simply taking the average over all of these populations can be expected to result in an answer with high variance. A related concern is that this approach is not very robust to errors in the model-learning process: an error in the model will lead directly to an error in the estimate. Thus, in the next few subsections we detail a process that attempts to simultaneously perform well on (EMP, SALE) and on all of the databases that could be sampled from the model, rather than simply returning the mean answer over all potential databases.
The method samples a large number of ((EMP_i, SALE_i), (EMP'_i, SALE'_i)) combinations from the model, and then attempts to construct an estimator that can accurately infer the query answer over precisely the (EMP_i, SALE_i) that has been sampled by looking at (EMP'_i, SALE'_i).

4.5.3 Generating Populations From the Model

Once we know the parameter set θ, the next task is to generate many instances of P_i = (EMP_i, SALE_i) and S_i = (EMP'_i, SALE'_i) in order to optimize our biased estimator over these population-sample pairs. The difficulty is that in practice, EMP and SALE can have billions of records in them. Hence, it would not be feasible to actually materialize each (P_i, S_i) pair. The good news is that for our problem it is not necessary to actually generate the populations, if we can generate statistics associated with each pair that are sufficient to optimize our biased estimator.

Computing sufficient statistics for EMP and SALE. For each P_i, we must generate the following statistics:

* The number of records of EMP belonging to each class (we use n_i to denote this).
* The mean over f1 for all records belonging to each class.

The first set of statistics is easy to generate if we notice that the number of records belonging to each class simply follows a multinomial distribution with n_EMP trials, where each multinomial bucket probability is given by the vector p. A single, vector-valued sample from an appropriately-distributed multinomial distribution can then give us each n_i. The next set of statistics can be computed by relying on the CLT. According to the generative model, the aggregate attribute value of records of the superpopulation belonging to class i has mean μ_i and variance σ². Since the population is an i.i.d.
random sample from the superpopulation, the mean aggregate value of the records belonging to class i follows a normal distribution with mean μ_i and variance σ²/n_i. Thus t_i, the sum over the aggregate attribute of all records of class i, can be obtained by drawing a trial from the normal distribution N(μ_i, σ²/n_i) and multiplying it by n_i.

Computing sufficient statistics for EMP' and SALE'. For each S_i, we must generate the following statistics:

* The number of sampled records from each class of EMP; this is denoted by n'_i.
* The number of sampled records from the ith class of EMP that have j matches in SALE', for each i and j. We denote this by n'_ij.
* The mean over f1 corresponding to each n'_ij.

The first set of statistics can be produced by repeatedly sampling from a hypergeometric distribution. To compute n'_0, we sample from a hypergeometric distribution with parameters n_EMP, n_EMP', and n_0 (these parameters are the population size, the sample size, and the size of the subpopulation of interest, respectively). To compute n'_1, we sample from a hypergeometric distribution with parameters (n_EMP - n_0), (n_EMP' - n'_0), and n_1. n'_2 is sampled from a hypergeometric distribution with parameters (n_EMP - n_0 - n_1), (n_EMP' - n'_0 - n'_1), and n_2. This process is repeated for each n'_i. Once each n'_i is generated, each n'_ij is generated. In order to speed the process of generating each n'_ij, we assume that the expected value of each n_i is small compared to n_SALE, so that there is little difference between sampling with and without replacement. Thus, we can assume that each n'_ij, j ≤ i, is binomially distributed, which in turn means that all n'_ij are multinomially distributed, where the probability that any class-i record will have j matches in the sample SALE' is a hypergeometric probability denoted by h(j; i). A single trial over a multinomial random variable having probabilities h(j; i) for j from 0 to i will then give us each n'_ij for a given i.
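The statistics above can be drawn without materializing any population. The following Python sketch uses numpy's multinomial, hypergeometric, and normal samplers; the function names and the shared-variance parameterization are our own illustrative choices:

```python
import numpy as np
from scipy.stats import hypergeom

def population_stats(p, mu, sigma2, n_emp, rng):
    # Class sizes n_i: one vector-valued multinomial draw (n_EMP trials,
    # bucket probabilities given by the vector p).
    n = rng.multinomial(n_emp, p)
    # CLT step: the class-i mean is ~ N(mu_i, sigma2 / n_i); the class
    # total t_i is that sampled mean scaled back up by n_i.
    safe_n = np.maximum(n, 1)
    t = n * rng.normal(mu, np.sqrt(sigma2 / safe_n))
    return n, t

def sample_stats(n, n_emp_samp, n_sale, n_sale_samp, rng):
    # Cascading hypergeometric draws give the per-class sample counts n'_i.
    pop_left, samp_left = int(n.sum()), n_emp_samp
    n_prime = np.zeros(len(n), dtype=int)
    for i, ni in enumerate(n):
        if samp_left > 0:
            n_prime[i] = rng.hypergeometric(ni, pop_left - ni, samp_left)
        pop_left -= int(ni)
        samp_left -= int(n_prime[i])
    # For each class i, one multinomial trial with probabilities h(j; i)
    # splits n'_i into the counts n'_ij of records keeping j matches.
    n_prime_ij = []
    for i, npi in enumerate(n_prime):
        probs = hypergeom.pmf(np.arange(i + 1), n_sale, i, n_sale_samp)
        n_prime_ij.append(rng.multinomial(int(npi), probs / probs.sum()))
    return n_prime, n_prime_ij
```

Because each hypergeometric draw conditions on what the earlier classes consumed, the n'_i always sum to exactly the sample size n_EMP', and each multinomial trial splits n'_i exactly.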
Finally, again using a CLT-based argument, the mean over f1 for all of the records corresponding to each n'_ij is generated by a single trial over a normal random variable N(μ_i, σ²/n'_ij).

4.5.4 Constructing the Estimator

We have seen in the previous subsection that once a model has been learned, it can be used to generate statistics for any number of population/sample pairs. Recall from Section 4.4 that the jth population generated and the sample from that population are P_j = (EMP_j, SALE_j) and S_j = (EMP'_j, SALE'_j), respectively. Let s_ij be the value of s_i computed over S_j; that is, it is the sum of f1 over all tuples in EMP'_j that have i matches in SALE'_j. Our goal in all of this is to construct a weighted estimator:

W(S_j) = Σ_{i=0}^{m} w_i s_ij

that minimizes:

SSE = Σ_j (W(S_j) - q(P_j))²     (4-23)

where q(P_j) is the answer to the NOT EXISTS query over the jth population. W should be optimized by choosing each w_i so as to minimize the SSE (sum-squared-error) given above. In order to compute these weights, we evaluate the partial derivative of the SSE with respect to each of the unknown weights. For example, by taking the partial derivative of the SSE with respect to w_0, we obtain:

∂SSE/∂w_0 = Σ_j 2 (Σ_i w_i s_ij - q(P_j)) s_0j

If we differentiate with respect to each w_i and set the resulting m + 1 expressions to zero, we obtain m + 1 linear equations in the m + 1 unknown weights: for each k, Σ_i (Σ_j s_ij s_kj) w_i = Σ_j q(P_j) s_kj. These equations can be represented in matrix form, and the optimal weights can then be easily obtained by using a linear equation solver to solve the resulting system. Once W has been derived, it is applied to the original samples EMP' and SALE' in order to estimate the answer to the query. By dividing the SSE obtained via the minimization problem described above by the number of data sets generated, we can also obtain a reasonable estimate of the mean-squared error of W.

4.6 Experiments

In this section we describe the results of the experiments we performed to test our estimators.
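Before turning to the experiments, the weight-optimization step of Section 4.5.4 can be sketched concretely. Here S is a hypothetical (m+1) × J matrix whose entry S[i, j] holds s_ij for the jth generated population, and q holds the true answers q(P_j); the normal equations are solved with numpy:

```python
import numpy as np

def optimize_weights(S, q):
    # Normal equations from setting each dSSE/dw_i to zero:
    # A[i, k] = sum_j s_ij * s_kj  and  b[i] = sum_j s_ij * q(P_j).
    A = S @ S.T
    b = S @ q
    # lstsq handles the (rare) singular case more gracefully than solve.
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def apply_estimator(w, s_observed):
    # W applied to the per-stratum sums computed from the real sample.
    return float(w @ s_observed)
```

When the J generated populations make A nonsingular, the recovered w is exactly the SSE-minimizing weight vector.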
Our experiments are designed to test the accuracy of our estimators and the running time of the biased estimator over a wide variety of data sets.

4.6.1 Experimental Setup

In this subsection, we describe the properties of the various data sets we use to test our estimators. We generate 66 synthetic data sets and use three real-life data sets for conducting our experiments. All of our experiments were performed on a Linux workstation with 1 GB of RAM and a 2.4 GHz clock speed, and all software was implemented using the C++ programming language.

4.6.1.1 Synthetic data sets

In each data set, we have two relations, EMP (EID, AGE, SAL) and SALE (SALEID, EID, AMOUNT), of size 10 million and 50 million records, respectively. We evaluate the following SQL query over each data set:

SELECT SUM (e.SAL)
FROM EMP AS e
WHERE NOT EXISTS (SELECT * FROM SALE AS s
                  WHERE s.EID = e.EID)

Two important data set properties that affect the query result are:

1. The distribution of the number of matching records in SALE for each record of EMP
2. The distribution of the e.SAL values of all records of EMP

Based on these two properties, we synthetically generated data sets so that the distribution of the number of matching records for all EMP records follows a discretized Gamma distribution. The Gamma distribution was chosen because it produces positive numbers and is very flexible, allowing a long tail to the right. This means that it is possible to create data sets for which most records in EMP have very few matches, but some have a large number. We chose values of 1, 2 and 5 for the Gamma distribution's shift parameter and values of 0.5 and 1 for the scale parameter. Based on these different values for the shift and scale parameters, we obtained six possible data sets: 1: (shift = 1, scale = 0.5); 2: (shift = 2, scale = 0.5); 3: (shift = 5, scale = 0.5); 4: (shift = 1, scale = 1); 5: (shift = 2, scale = 1); and 6: (shift = 5, scale = 1).
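Such a column of match counts can be reproduced with numpy's Gamma sampler (whose shape argument plays the role of the shift parameter above), discretizing by truncation; this is a sketch of how such data could be generated, not the authors' exact generator:

```python
import numpy as np

def generate_match_counts(n_emp, shift, scale, rng):
    # Discretized Gamma: draw a continuous Gamma(shift, scale) value per
    # EMP record and truncate it to an integer number of SALE matches.
    return np.floor(rng.gamma(shift, scale, size=n_emp)).astype(np.int64)
```

Under truncation, the zero-match fraction for (shift = 1, scale = 0.5) is P(Gamma < 1) = 1 - e^(-2) ≈ 0.86, which agrees with the fraction reported for data set 1.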
For these six data sets, the fractions of EMP records having no matches in SALE (and thus contributing to the query answer) were .86, .59, .052, .63, .27, and .0037, respectively. A plot of the probability that an arbitrary tuple from EMP has m matches in SALE for each of the six data sets is given as Figure 4-2. This shows the wide variety of data set characteristics we tested.

[Figure 4-2. Six distributions used to generate, for each e in EMP, the number of records s in SALE for which f3(s, e) evaluates to true.]

We also varied the distribution of the e.SAL values so that the distribution is one of the following:

a. Normally distributed with a mean of 100 and standard deviation of 10
b. Normally distributed with a mean of 100 and standard deviation of 200, with only the absolute values considered
c. Zipfian distributed with a skew parameter of 0.5
d. Zipfian distributed with a skew parameter of 1.0

We doubled the number of data sets by further providing either a linear positive correlation or no correlation between the e.SAL value of a record and the number of matching records it has in SALE. We thus obtained 48 different data sets considering all possible combinations of the distribution of matching records and the distribution of e.SAL values. We also tested our estimator on 18 additional synthetic data sets that were deliberately designed to have properties that violate the assumptions of the superpopulation model of our biased estimator, so as to see how robust this estimator is to inaccuracies in the parametric model. From Section 4.5.1, the three specific assumptions we made for our superpopulation model were:

1. cnt(e, e', SALE) = 0 when e' ≠ e. Thus, the number of SALE records s for which f3(s, e) ∧ f3(s, e') is true is zero. In other words, different records from EMP do not share matching records in SALE.

2.
There exists a linear relationship between the mean aggregate values of the different classes of EMP records, given by μ_i = a × i + μ_0, where a is the slope of the straight line connecting the various μ_i values.

3. The variance of the aggregate attribute values of the records of any class is approximately equal to the single model parameter σ².

For each of these three cases, we generate six different data sets using the six different sets of Gamma parameters described earlier. Thus we obtain 18 more data sets, where the first six sets violate assumption 1, the next six sets violate assumption 2, and the last six sets violate assumption 3. For each of these 18 data sets, the aggregate attribute value is normally distributed with a mean of 100 and standard deviation of 200, except for the last six sets, where different values of the standard deviation are chosen for records from different classes. In order to violate assumption 1, we no longer assume a primary key-foreign key relationship between EMP and SALE. To generate a data set violating this assumption, a set s1 of 100 records from EMP is selected. Let max be the largest number of matches in SALE for any record from s1. Then an associated set s2 of max records is added to SALE such that all records in s1 have their matching records in s2. Assumption 2 was violated by using μ_i = a × j + μ_0, where j ≠ i (in fact, the j value for a given i is randomly selected from 1...m). Assumption 3 was violated by assuming different values for the variance of records from different classes. We randomly chose these values from the range (100, 15000).

4.6.1.2 Real-life data sets

The three real-life data sets we use in our experiments are from the Internet Movie Database (IMDB) [1], the Synoptic Cloud Reports [3] obtained from the Oak Ridge National Laboratory, and the network connections data set from the 1999 KDDCup event. The IMDB database contains several relations with information about movies, actors and production studios.
For our experiments, we use the two relations MovieBusiness and MovieGoofs. MovieBusiness contains information about the box-office revenues of movies, while MovieGoofs contains records that describe unintended mistakes or goofs in various movies. The following schema shows the relevant attributes of the two relations for the queries we tested in our experiments.

MovieBusiness (MovieName, NumAdmissions)
MovieGoofs (GoofId, MovieName)

MovieName is the primary key of MovieBusiness and a foreign key of MovieGoofs. We tested the following three SQL queries on the two relations of the IMDB data set.

Q1: SELECT SUM (b.NumAdmissions)
    FROM MovieBusiness AS b
    WHERE NOT EXISTS (SELECT * FROM MovieGoofs AS g
                      WHERE g.MovieName = b.MovieName)

Q2: SELECT SUM (b.NumAdmissions)
    FROM MovieBusiness AS b
    WHERE NOT EXISTS (SELECT * FROM MovieBusiness AS b2
                      WHERE id(b) < id(b2)
                      AND b.NumAdmissions = b2.NumAdmissions)

Q3: SELECT COUNT (*)
    FROM MovieBusiness AS b
    WHERE NOT EXISTS (SELECT * FROM MovieGoofs AS g
                      WHERE g.MovieName = b.MovieName)

The second real-life data set we use is the Synoptic Cloud Report (SCR) data set. It contains weather reports for a 10-year period obtained from measuring stations on land as well as on water. We use weather reports for the months of December 1981 and November 1991 from measuring stations on land. Specifically, the two relations and the relevant portions of their schemas used in our experiments are:

DEC81 (Id, Latitude, CloudAmount)
NOV91 (Id, Latitude, CloudAmount)

Here, Id is the key of both relations. We tested the following two SQL queries on the relations DEC81 and NOV91.

Q4: SELECT SUM (D81.CloudAmount)
    FROM DEC81 AS D81
    WHERE NOT EXISTS (SELECT * FROM NOV91 AS N91
                      WHERE N91.Latitude = D81.Latitude)

Q5: SELECT COUNT (*)
    FROM DEC81 AS D81
    WHERE NOT EXISTS (SELECT * FROM NOV91 AS N91
                      WHERE N91.Latitude = D81.Latitude)

The KDDCup data set contains information about various network connections that can potentially be used for intrusion detection.
This data set has 42 integer, real-valued, and categorical attributes. We tested our estimator on this data set by estimating the total number of source bytes of connections that were anomalously different from the rest of the network connections. That is, we summed the total number of source bytes created by outlier connections. Our definition of anomalously different records is those records whose distance from all other records in the data set is greater than some predefined threshold. For our experiments, we use a simple distance function that uses the Euclidean distance for numerical attributes and a 0/1 distance for categorical attributes. We execute the following query on the KDDCup data set for our experiments.

SELECT SUM (kc1.SourceBytes)
FROM KDDCup AS kc1
WHERE NOT EXISTS (SELECT * FROM KDDCup AS kc2
                  WHERE d(kc1, kc2) < threshold)

By choosing different values for the threshold, we can control the selectivity of the above query. For our experiments, we define Q6, Q7 and Q8 as three variants of the above query with different threshold values, so that Q6 has a selectivity of around 2%, Q7 has a selectivity of 1.7%, and Q8 has a selectivity of 0.1%.

4.6.2 Results

We ran our experiments on 1%, 5% and 10% random samples of the data sets (both relations in each data set were sampled independently, without replacement, at the same rate). Both the biased estimator and the unbiased estimator were run ten times on each of the test cases. For comparison, we also analytically computed the standard error for the concurrent estimator described in Section 4.2. Results from the first 48 synthetic data sets are given in Tables 4-1 and 4-2, while results from the next 18 synthetic data sets (which specifically violate the model assumptions) are presented in Table 4-3. Real-life data set results are shown in Table 4-4.
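The mixed-type distance d used in the outlier query above can be sketched as follows; treating each categorical mismatch as contributing a squared distance of 1 inside the square root is our own assumption about how the two pieces combine:

```python
import math

def d(r1, r2):
    # Euclidean over numeric attributes, 0/1 per categorical attribute;
    # a categorical mismatch adds 1 under the square root (assumed form).
    total = 0.0
    for a, b in zip(r1, r2):
        if isinstance(a, str) or isinstance(b, str):
            total += 0.0 if a == b else 1.0   # 0/1 categorical distance
        else:
            total += (a - b) ** 2             # squared numeric difference
    return math.sqrt(total)
```

For example, d((1.0, 'tcp'), (4.0, 'udp')) = sqrt(3² + 1) = sqrt(10).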
For each of the test cases, we give the square root of the observed mean-squared error (that is, the standard error) for the biased, unbiased, and concurrent estimators. Because an absolute value for the standard error lacks any sort of scale and thus would not be informative, we give the standard error as a percentage of the total aggregate value of all records in the database. For example, for the synthetic data sets, we give the standard error as a percentage of the answer to the query:

SELECT SUM (e.SAL)
FROM EMP AS e

Thus, if the estimation method simply returned zero every time, its error would vary between 0% and 100% depending on the selectivity of the subquery. If the method is also able to estimate with high accuracy which of the constituent records should not be counted in the aggregate total, then the error can be reduced to an arbitrarily small level. Although our error metric is different from the relative error (which takes the ratio of the absolute error to the true query answer), the value of the relative error can be readily computed from the error value given, by dividing by the ratio of the query answer to the total aggregate value of all records in the outer relation. For all eight cases of data set 1, the query answer is approximately 90% of the total answer. Hence, the relative error is about 1.1 times the error reported in Table 4-1. Similarly, for the rest of the data sets, the factors are: data set 2: 1.7; data set 3: 19; data set 4: 1.5; data set 5: 3.7; and data set 6: 270. For the IMDB and SCR data sets, the factors are between 1 and 5.5, while for the KDDCup data the factors range from 2 (for the high selectivity query) to 40 (for the very low selectivity query).
When we tested the queries, we also recorded the number of times (out of ten) that the answer given by the biased estimator was within ±2 estimated standard errors of the real answer to the query, and found that for almost all of the test cases this number was ten, while for only a couple of test cases it was nine out of ten. Finally, we measured the computation time required by the biased estimator to initially learn the generative model, then compute weights for the various components of the estimator, and finally provide an estimate of the query result. We observed that for the synthetic data sets (which consist of 10 million and 50 million records in the two relations) the maximum observed running time of the biased estimator was between 3 and 4 seconds for a 10% sample from each. The vast majority of this time is spent in the EM learning algorithm, which requires O(m × |EMP'| × i) time, where m is the maximum possible number of matches for a record in EMP with records in SALE, and i is the number of iterations required for EM convergence. We speed up our implementation by subsampling EMP' and using the subsample in the EM algorithm rather than using EMP' directly. The justification for this is that the EM can be quite expensive with a large EMP', and the accuracy of the modeling step is much more closely related to the size of SALE'. We use a subsample of size 500 in our experiments. In comparison, the computation for the unbiased estimator is almost instantaneous, requiring a small fraction of a second. In our test data, the most costly operation for the unbiased estimator is running the "join" between EMP' and SALE'; that is, searching for matches for each record from EMP' in SALE'. Given summary statistics describing this join, the core GetEst routine itself can be implemented as a dynamic programming algorithm that takes time O(m'²), where m' is the maximum number of matches for any record from EMP' in SALE'.
4.6.3 Discussion

One of the most obvious results from Table 4-1 is that the unbiased estimator has uniformly small error only on the eight tests performed using synthetic data set 1, where the number of matches for each record e ∈ EMP is generated using a Gamma distribution with parameters (shift = 1, scale = 0.5). In this particular data set, only a very small number of the records are excluded by the NOT EXISTS clause, since 86% of the records in EMP do not have a match in SALE. Furthermore, only a very small number of the records have a large number of matches. Both of these characteristics tend to stabilize the variance of the unbiased estimator, making it a fine choice. For all the other data sets, the unbiased estimator does very poorly in most of the cases. For synthetic data, the estimator's worst performance is on data set 6, in which less than one percent of the records are accepted by the NOT EXISTS clause and several records from EMP have more than 15 matching records in SALE. In this case, the unbiased estimator is unusable, and the results were particularly poor with correlation between the number of matches and the aggregate value that is summed. For example, in the correlated case with a 1% sample, most of the relative standard errors were more than 10,000%. Such very poor results are found sporadically throughout most of the data sets, though the results were somewhat erratic. The reason that the observed errors associated with the unbiased estimator are highly variable is the very long tail of the error distribution.

                            1% sample              5% sample              10% sample
Gamma  Corr.?  Dist.     U       C      B       U       C      B       U       C      B
1      No      a.      7.39   13.32  38.30    2.39   12.63   3.88    1.09   11.89   1.46
1      No      b.      6.69   13.45  37.87    3.04   12.63   5.92    1.08   11.93   1.38
1      No      c.      6.89   12.92  22.59    5.23   12.04   8.18    3.79   11.23   7.09
1      No      d.     16.65    6.32  68.37   15.94    6.19  29.34    9.56    5.94  19.72
1      Yes     a.     11.90   20.90  34.50    4.59   19.94   2.26    3.15   18.63   1.42
1      Yes     b.     13.50   17.80  36.30    4.07   16.37   5.12    1.75   15.50   2.18
1      Yes     c.      7.70   15.06  21.14    5.69   14.06   7.84    3.98   13.13   6.21
1      Yes     d.     18.05    1.04  66.94   16.26    0.52  25.35   12.98    0.41  15.33
2      No      a.     11.79   40.12   6.09    8.10   37.98   3.55    2.43   35.44   3.37
2      No      b.     13.65   39.48   5.00    6.82   37.86   4.83    2.54   35.51   4.03
2      No      c.     17.87   39.20  14.75    6.35   37.00   8.34    4.54   34.44   7.12
2      No      d.     31.60   20.45  43.43   10.24   19.26  12.88    9.99   17.08   6.25
2      Yes     a.     24.70   65.60  21.39   19.83   62.00  18.45    4.78   57.51  13.70
2      Yes     b.     19.34   54.27  12.99   12.61   51.19  12.28    3.46   47.72   7.48
2      Yes     c.     20.14   46.60  23.01   12.19   44.01  12.01    5.10   40.88   5.10
2      Yes     d.     52.63   39.08  39.45   19.62   36.75   5.32    9.20   33.19   2.25
3      No      a.    234.60   92.75  18.61   59.67   84.91  12.22   33.00   76.00   6.28
3      No      b.    315.97   93.29  19.42   70.32   84.68  11.68   34.78   76.05   5.84
3      No      c.    188.17   91.50  20.53   46.14   84.01  18.50   24.92   75.07  15.80
3      No      d.    139.27   72.67  14.24   63.56   67.36  12.18    6.79   59.83   5.33
3      Yes     a.    753.73  189.70  42.19  220.00  172.10  28.99  115.25  151.85  17.02
3      Yes     b.    421.00  146.70  30.93  151.00  133.50  21.05   74.50  118.40  11.99
3      Yes     c.    240.20  119.80  28.28   74.66  109.50  25.99   42.57   97.22  21.86
3      Yes     d.     47.95  144.63  33.85   18.52  130.93  28.69    3.63  114.00  18.63

Table 4-1. Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for 24 synthetically generated data sets. The table shows errors for three different sampling fractions, 1%, 5% and 10%, and for each of these fractions it shows the error for the three estimators: U = unbiased estimator, C = concurrent sampling estimator, and B = model-based biased estimator.

                            1% sample              5% sample              10% sample
Gamma  Corr.?  Dist.     U       C      B       U       C      B       U       C      B
4      No      a.    153.70   36.20  14.52   37.17   33.90   4.73   24.47   31.20   0.89
4      No      b.    226.00   37.00  18.56   50.32   33.95   5.27   42.87   31.11   1.33
4      No      c.    242.70   35.20  11.10   19.40   32.85   3.62   17.03   30.04   3.59
4      No      d.    146.37   16.56  45.16   23.60   14.85  21.26    8.85   12.62  16.61
4      Yes     a.    418.70   64.50  10.85  116.55   59.94   2.71   27.55   54.52   1.64
4      Yes     b.    327.02   52.06   8.63   75.95   48.42   3.92   45.63   44.12   2.83
4      Yes     c.    359.60   43.40  13.90   30.19   40.39   7.17   27.21   36.80   5.16
4      Yes     d.     1.1e3   37.53  40.29   54.33   33.99  10.66   18.94   29.32   5.68
5      No      a.    236.00   72.04  13.19   46.18   66.08  12.07   38.30   59.60   6.15
5      No      b.    395.00   72.30  11.78   55.78   66.09  11.73   42.73   59.55   5.37
5      No      c.    167.70   71.10   7.70  120.81   65.20   1.99   62.70   58.50   1.15
5      No      d.    135.65   51.87  13.58   77.12   48.29   4.30   24.14   42.21   4.16
5      Yes     a.    862.00   71.79  31.25  203.81   64.90   7.21   57.22   57.00   2.93
5      Yes     b.    650.80   56.60  28.64  129.75   51.46   6.75   74.16   43.90   1.86
5      Yes     c.    298.70   92.30  11.47  189.70   84.22   4.06   69.63   74.80   2.53
5      Yes     d.       …    105.24  10.84  178.61   95.07   9.38  145.78   81.86   3.04
6      No      a.     7.1e3   95.13  19.30   6.2e3   79.49   9.82   4.1e3   63.33   6.09
6      No      b.     1.9e4   95.20  18.40   2.1e3   79.58   9.47   6.6e2   63.40   5.74
6      No      c.     1.9e4   94.32  13.03   1.2e3   78.60   5.96   9.6e2   62.74   1.71
6      No      d.     4.7e4   76.71   7.54   2.0e2   66.87   8.42   68.87   54.96   3.97
6      Yes     a.     5.4e4  307.0   62.00   1.0e4  249.30  30.90   5.7e3  119.00  18.78
6      Yes     b.     4.2e4  214.0   42.70   1.9e4  174.25  21.12   7.0e3  135.00  12.88
6      Yes     c.     3.2e4  156.3   22.70   2.0e3  128.10  10.87   8.7e2  100.12   3.05
6      Yes     d.     1.3e5  234.4   29.78   2.9e3  192.46  28.25   2.4e3  148.28  12.79

Table 4-2. Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for 24 synthetically generated data sets. The table shows errors for three different sampling fractions, 1%, 5% and 10%, and for each of these fractions it shows the error for the three estimators: U = unbiased estimator, C = concurrent sampling estimator, and B = model-based biased estimator.

                          1% sample                5% sample               10% sample
Gamma  Violates       U        C      B        U        C      B        U        C      B
1      (1)          8.83    13.37  62.60     3.12    12.47  15.24     1.19    11.75   4.63
2      (1)         24.66    39.33  34.39     8.14    37.89   2.74     3.41    35.60   2.48
3      (1)         94.11    92.31  21.14    72.94    84.82  16.76    20.27    75.78  13.05
4      (1)         22.30    36.67  37.99    12.72    34.07   7.96     6.34    31.12   2.95
5      (1)        231.50    72.60   6.76   123.30    66.14   6.37    85.68    59.48   4.35
6      (1)       1366.80    95.96   9.99    1.2e3    78.64   5.85   700.0     62.62   1.88
1      (2)         14.18    21.70 100.70     4.42    21.09  26.34     2.69    20.20  12.44
2      (2)         21.63    72.24  59.94    14.25    67.50   7.56     6.25    62.90   4.47
3      (2)         886.2   220.20  45.73   136.0    201.90  31.73    79.75   180.10  25.76
4      (2)         462.0    95.80 106.80   269.19    88.74  22.18    81.03    82.43  11.52
5      (2)        247.60   205.0   18.84   233.0    187.00  17.69    88.55   168.30   9.78
6      (2)       6891.00   369.0   42.30  5988.0    310.00  40.90  1924.00   246.57  19.77
1      (3)         14.70    21.14  61.86     6.24    20.20  10.15     1.13    19.13   2.67
2      (3)         26.15    66.73  29.10    22.49    62.25  20.25     5.38    57.69  17.35
3      (3)        920.10   185.30  41.86   147.60   167.20  30.12    65.63   146.88  27.20
4      (3)         2.3e5    64.42  35.96   714.00    60.54  16.87   150.80    54.77   9.24
5      (3)       1350.30   143.00  33.59   856.00   127.76  29.58   306.70   113.14  10.08
6      (3)         2.2e5   264.02  38.37  4519.10   212.80  34.92  2530.00   162.70  21.96

Table 4-3. Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for 18 synthetically generated data sets. The table shows errors for three different sampling fractions, 1%, 5% and 10%, and for each of these fractions it shows the error for the three estimators: U = unbiased estimator, C = concurrent sampling estimator, and B = model-based biased estimator.
Under many circumstances, most of the answers computed using the unbiased estimator are very good, but there is still a small (though non-negligible) probability of getting a ridiculous estimate whose error is hundreds of times the sum over the aggregate value of the entire EMP relation. Unfortunately, it is interesting to note that the unbiased estimator's worst performance overall was observed on Q8 over the KDDCup data, where the error was astronomically high: larger than 10^100.

                              1% sample                 5% sample                 10% sample
Data set   Query        U        C      B         U        C      B         U        C      B
IMDB       Q1        9.6e3    27.67  70.88     3.3e3    17.51  33.44     4.1e2    13.71  14.14
IMDB       Q2        1.2e2    75.12  65.10     91.26    62.86  31.97     49.82    52.69   9.31
IMDB       Q3        1.0e4    25.21  18.47     3.5e3    16.58  14.38     4.7e2    12.71   1.92
SCR        Q4        1.4e4    65.22  10.31     5.0e3    44.97   6.84     8.2e2    23.27   4.41
SCR        Q5        1.2e4    59.06   9.42     4.6e3    41.62   7.51     7.8e2    24.07   3.95
KDDCup     Q6       1.1e10    60.47  12.39     7.4e4    54.92  10.96     7.6e3    42.08   2.10
KDDCup     Q7      6.5e147    41.30  11.24    5.8e83    26.54   4.32    9.3e36    17.04   3.28
KDDCup     Q8      7.3e210    15.24   8.46   3.6e172    10.80   1.56   2.3e120    6.35    0.98

Table 4-4. Observed standard error as a percentage of the total aggregate value of all records in the database for 8 queries over 3 real-life data sets. The table shows errors for three different sampling fractions, 1%, 5% and 10%, and for each of these fractions it shows the error for the three estimators: U = unbiased estimator, C = concurrent sampling estimator, and B = model-based biased estimator.

In comparison, the biased estimator generally did a very good job of predicting the final query result, and in most cases with a 5% or 10% sampling fraction the observed standard error was less than 10% of the total aggregate value found in EMP. In other words, if the total value of SUM (e.SAL) with no NOT EXISTS clause is x, then for just about any query tested, the standard error was less than x/10, and it was frequently much smaller.
This is actually quite impressive when one considers the difficulty of the problem. The primary drawback associated with the biased estimator is its complexity: it requires nontrivial, statistically-oriented computations and a significant amount of work, most of it associated with running the EM algorithm to completion. By comparison, the unbiased estimate can be calculated via an almost trivial recursive routine that relies on the calculation of simple hypergeometric probabilities. One case where the biased estimator had questionable qualitative performance was with the tests associated with data sets 3 and 6. The problem in this case was that the EM algorithm tended to overestimate p0 in Θ, which is actually very small in these two data sets (0.052 and 0.0037, respectively). This results in an error that hovers at 10% of the total aggregate value of e.SAL (even for a 5% sample), when the real answer is only 5% of this total for data set 3, or less than 1% of this total for data set 6. We stress that guessing that only a few percent of the tuples in EMP have no matches in SALE from a small sample with limited information is an extremely difficult estimation problem, and we conjecture that without additional information (such as prior knowledge that the distribution represented by p is a discretized gamma distribution) it will be very difficult to achieve better results. Results from the synthetic data sets which specifically violate the assumptions of the superpopulation model are shown in Table 4-3. The first six rows in the table show results for data sets in which more than one EMP record can match with a given record from SALE. The results show that violating this assumption of the model in the actual data set did not significantly affect the accuracy of the biased estimator.
The next set of six rows in the table shows results for data sets in which there is no linear relationship between the mean aggregate values of the different classes of EMP records. The results show that the biased estimator is about twice as inaccurate over these data sets as compared to the corresponding data sets which do not have a strict violation of the assumption. The last six rows in the table show results over data sets in which the variances of the aggregate values of records from different classes are significantly different. Results show that these data sets affect the accuracy of the biased estimator as much as the data sets which violate the "linear relationship of mean values" assumption. However, the results are certainly not poor when these assumptions are violated, and the method still seems to have qualitative performance that may be acceptable for many applications, particularly with a larger sample size. The results from the eight queries over the three real-life data sets are depicted in Table 4-4. The key difference in the characteristics of the real-life data sets compared to the synthetically generated data sets is the number of matching records in the inner relation for a given record from the outer relation of the NOT EXISTS query. For the KDDCup data set, the maximum number of matching records in the inner relation is as high as 2500, while for the IMDB and SCR data sets this number is about 200 and 90, respectively. Due to this, none of the cases which are favorable for the use of the unbiased estimator (as described above) are observed in the real-life data sets. On the other hand, it can be seen from Table 4-4 that the accuracy of the biased estimator is generally quite good over the real data. We also note that the standard error of the biased estimator over the learned superpopulation seems to be a reasonable surrogate for the standard error of the biased estimator in practice.
For most biased estimators, it is reasonable to use the standard error of the biased estimator in the same way that one would use the standard deviation of an unbiased estimator when constructing confidence bounds (see Sarndal et al. [109], Section 5.2). According to the Vysochanskii-Petunin inequality [120], any unbiased unimodal estimator will be within three standard deviations of the correct answer 95% of the time, and according to the more aggressive central limit theorem, an estimator will be within two standard deviations of the correct answer 95% of the time. We observed that in almost all of the tests, the errors for the biased estimator were actually within two predicted standard errors of zero. This seems to be strong evidence for the utility of the bounds computed using the predicted standard error of the biased estimator. We finally remark on the time required for the execution of the biased estimator. The biased estimator performs several computations, including learning the model parameters, generating sufficient statistics for several population-sample pairs, and then solving a system of equations to compute weights for the various components of the estimator. As discussed previously, this took no longer than four seconds for the largest samples tested. If this is not fast enough, we point out that it may be possible to speed this up even more, though this is beyond the scope of the thesis. While we used the traditional EM algorithm in our implementation, we note that EM can be made faster by using incremental variants [69, 95, 116] of the EM algorithm. These variants typically achieve faster convergence by implementing the Expectation and/or the Maximization step of the EM algorithm only partially.

4.7 Related Work

Estimation via sampling has a long history in databases. One of the oldest and best known works is Frank Olken's PhD thesis [97].
Other classic efforts at sampling-based estimation over database data are the adaptive sampling of Lipton and Naughton [83, 84] for join query selectivity estimation, and the sampling techniques of Hou et al. [64, 65] for aggregate queries. More recent well-known work on sampling is that on online aggregation by Haas, Hellerstein, and their colleagues [47, 60, 61]. The sampling-based database estimation problem that is closest to the one studied in this chapter is that of sampling for the number of distinct values in a database. As discussed in the introduction to this chapter, a solution to the problem of estimation over subset-based queries is a solution to the problem of estimating the number of distinct values in a database, since the latter problem can be written as a NOT EXISTS query. The classic paper in distinct value estimation is due to Haas et al. [49]. For a survey of the state-of-the-art work on this problem in databases through the year 2000, we refer the reader to the Introduction of the paper by Charikar et al. on the topic [17]. The paper of Bunge and Fitzpatrick [13] provides a survey of work in the statistics area, current through the early 1990's. Work in statistics continues on this problem to this day. In fact, a recent paper from statistics by Mingoti [90] on the distinct value problem provided inspiration for our use of superpopulation techniques. Though the problems of distinct value estimation and subset-based aggregate estimation are related, we note that the problem of estimating the number of distinct values is a very restricted version of the problem we study in this thesis, and it is not immediately clear how arbitrary solutions to the distinct value problem can be generalized to handle subset-based queries. The most obvious difficulty in extending such methods to subset-based queries is the fact that a NOT EXISTS or related clause results in a complicated statistic summarizing two populations (the two tables that are queried over).
Nonetheless, links between the problems do exist. For example, though our own unbiased estimator was not directly inspired by Goodman's estimator [43]5 and it takes a very different form, it is easy to argue that our unbiased estimator must be a generalization of Goodman's estimator. The reasoning is straightforward: Goodman's estimator is proven to be the only unbiased estimator for distinct value queries, and our own unbiased estimator is unbiased for distinct value queries. Therefore, they must be equivalent when used on this particular problem.

4.8 Conclusion

This chapter has presented two sampling-based estimators for the answer to a subset-based query, where the answer to a SUM aggregate query (and by trivial extension, AVERAGE and COUNT) is restricted to consider only those tuples that satisfy a NOT EXISTS or related clause. The first estimator is provably unbiased, while the second makes use of superpopulation methods and was found to be much more accurate. As discussed in Section 4.5.1 of the thesis, one of the most controversial decisions made in the development of the latter estimator was our choice of a very general prior distribution. To a statistician from the so-called "Bayesian" school [39], this may be seen as a poor choice, and a Bayesian statistician may argue that a more descriptive prior distribution, if appropriate, would increase the accuracy of the method. This is certainly true, if the selected distribution were a good match for the actual data distribution. In our work, however, we have consciously chosen generality and its associated drawbacks in place of specificity. Our experimental results seem to argue that for a variety of different data distributions, the resulting estimator still has high accuracy.

5 Goodman's estimator is one of the earliest statistical estimators for distinct value queries.
Still, this represents an intriguing question for future work: can a different prior distribution be chosen that is appropriate for use in real-world data sets, and which results in a more accurate estimator? Finally, we note that the model-based method outlined in the latter half of this chapter was designed specifically to address the problem of estimating the answer to a nested SQL query with a single table in the inner query and a single table in the outer query, linked by a NOT EXISTS predicate. As is, our model is not directly applicable to arbitrarily complex nested queries. For example, nested queries may include multiple relations in the outer as well as the inner query. One could imagine sampling all of the input relations, and then using any result tuples that are discovered as part of the inner or outer subqueries as input into an estimator such as the one studied in this chapter. However, this may be dangerous, and our superpopulation model is not directly applicable. The problem is that if there is a join in the inner (or outer) query, then the tuples produced via joining samples from the input relations are not i.i.d. samples from the join [47]. This means that the join itself must be modeled, which is a problem for future work. Another problem for future work is arbitrary levels of nesting: an inner query may itself be linked with another inner query via a NOT EXISTS or similar clause.

CHAPTER 5
SAMPLING-BASED ESTIMATION OF LOW SELECTIVITY QUERIES

5.1 Introduction

The specific problem that we consider in this chapter is sampling-based approximation of the answer to highly selective aggregate queries: those having a relational selection predicate that accepts only a very small percentage of the data set. Again, we consider sampling because it is the most versatile of the approximation methods: a single sample can be used to handle virtually any relational selection predicate or any join condition.
Samples generally do not require prior knowledge of what queries will be asked, unlike other methods such as sketches [8]. We consider very selective queries because they are the one class of queries that are hardest to handle approximately without workload knowledge: if a query references only a few tuples from the data set, then it is very hard to make sure that a synopsis structure (such as a sample) will contain the information needed to answer the query. The most natural method for handling highly selective queries using sampling is to make use of stratification [25]. In order to answer an aggregate query over a relation, one could first (offline) partition the relation's tuples into various subsets so that similar tuples are grouped together, the assumption being that the relational selection predicate associated with a given query will tend to favor certain strata. Even if a given query is very selective, at least one or two of the strata will have a relatively heavy concentration of tuples that will contribute to the query answer. When the query is processed, those "important" strata can be sampled first and more heavily than the others. This is illustrated with the following example:

Example 1: The relation MOVIE(MovieYear, Sales) is partitioned into two strata as follows:

    R1: MovieYear < 1975        R2: MovieYear > 1975
    r1: (1961, 30)              r3: (1983, 60)
    r2: (1972, 50)              r4: (1977, 40)
                                r5: (1997, 25)
                                r6: (1992, 100)
                                r7: (2004, 100)

The query Q is then issued:

    SELECT SUM (Sales)
    FROM MOVIE
    WHERE MovieYear < 1980

Since all movies in R1 were released before 1975, all the records in the stratum R1 match Q. Hence, we decide to obtain a biased sample that includes as many records from R1 as the sample size permits, and we sample from R2 only if the desired sample size is not met. For a sample size of 4, this results in an estimate whose variance (or error) is 2400. Drawing a sample from the population as a whole results in an estimate whose variance is 2575.
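The two variance figures in Example 1 can be checked against the textbook finite-population variance formula for a stratified (or simple random) sampling estimator of a SUM. The following sketch is our own illustration in Python (the function name is ours); f() is Sales when the predicate accepts a record and 0 otherwise:

```python
def strat_var(strata, n_alloc):
    """Exact variance of the stratified SUM estimator for a sampling plan:
    Var = sum_i N_i^2 * (1 - n_i/N_i) * S_i^2 / n_i, where S_i^2 is the
    (N_i - 1)-denominator variance of the f() values in stratum i."""
    total = 0.0
    for vals, n_i in zip(strata, n_alloc):
        N_i = len(vals)
        mu = sum(vals) / N_i
        S2 = sum((v - mu) ** 2 for v in vals) / (N_i - 1) if N_i > 1 else 0.0
        if n_i > 0:
            total += N_i ** 2 * (1 - n_i / N_i) * S2 / n_i
    return total

# f() values for Example 1: the predicate MovieYear < 1980 accepts r1, r2, r4
R1_f = [30.0, 50.0]                   # both R1 records match
R2_f = [40.0, 0.0, 0.0, 0.0, 0.0]     # only r4 matches in R2

print(strat_var([R1_f, R2_f], [2, 2]))   # biased plan: all of R1 plus 2 from R2
print(strat_var([R1_f + R2_f], [4]))     # uniform sample of 4 from everything
```

Running it reproduces (up to floating-point rounding) the 2400 and 2575 figures quoted above.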
While stratification may be very useful, it is not a new idea. It has been studied in statistics for decades, and it has been proposed previously as a way to make approximate aggregate query processing more accurate [18-20]. However, in the context of databases, researchers have previously considered only half of the problem: how to divide the database into strata. This may actually be the easy and less important half of the problem, since even the relatively naive partitioning strategy we use in our experiments can give excellent results. The equally fundamental problem we consider in this paper is: how to allocate samples to strata when processing the query. More specifically, given a budget of n samples, how does one choose how to "spend" those samples on the various strata in order to achieve the greatest accuracy? The classic allocation method from statistics is the Neyman allocation, and it is the one advocated previously in the database literature [19]. The key difficulty with applying the Neyman allocation in practice is that it requires extensive knowledge of certain statistical characteristics of each stratum, with respect to the incoming query. In practice this knowledge can only be guessed at by taking a pilot sample. As we show in this paper, if the guess is poor, then the resulting sampling plan can be disastrous. This results in a classic chicken-and-egg problem: we want to sample in order to avoid scanning all of the data, but in order to sample properly, we have to collect statistics that require scanning all of the data! The result is that the classic Neyman allocation is unusable in many situations, as we will demonstrate experimentally in the paper.

Our Contributions

In this thesis, we develop an alternative to the classic Neyman allocation that we call the Bayes-Neyman allocation.
While this is a very general method and its utility is not limited to the context of database management, the Bayes-Neyman allocation is particularly relevant to database sampling because it is designed to be robust when only a few of the records in the data set are relevant to estimating a quantity over the data, as is the case when a query has a restrictive relational selection predicate. The specific contributions of our work are as follows:

* The Bayes-Neyman allocation explicitly takes into account the error that might be incurred when developing the sampling plan, so as to maximize the expected accuracy of the resulting estimate.

* The Bayes-Neyman allocation makes use of novel Bayesian techniques from statistics [14] that allow us to take into account any prior expectation (such as the expected efficacy of the stratification) in a principled fashion.

* We carefully evaluate our methods experimentally, and show that if one is very careful in developing a sampling plan, even a naive partitioning of samples to strata that uses no workload information can show dramatic accuracy for very selective queries.

* Our methods are very general. They can be used with any partitioning (such as those proposed by Chaudhuri et al. [18-20]), or even in cases where the partitioning is not user-defined and is imposed by the problem domain (for example, when the various strata are different data sources in a distributed environment). Our methods can also be extended to more complicated relational operations such as joins, though this problem is beyond the scope of the paper.

5.2 Background

This section presents some preliminaries and background about stratified sampling, and discusses the problems associated with using stratified sampling in a database setting to estimate results of arbitrary queries.
5.2.1 Stratification

A general example of a SUM aggregate query over a single relation can be written as follows:

    SELECT SUM (f1(r))
    FROM R AS r
    WHERE f2(r)

Note that if we define a function f() where

    f(r) = f1(r)  if f2(r) is true
    f(r) = 0      if f2(r) is false

the above query can be simply rewritten as:

    SELECT SUM (f(r))
    FROM R AS r

If the relational selection predicate f2(r) selects a very small fraction of records from the relation R, then the query is said to be a low selectivity query. Assume that relation R is partitioned into L disjoint strata such that Ri represents the ith stratum. Then, we have R = R1 ∪ R2 ∪ ... ∪ RL. We denote the size of the ith stratum by Ni, and thus we have |Ri| = Ni. Let R'i (where |R'i| = ni) be the survey sample (without replacement) from the ith stratum. The sizes of all the strata are known from strata construction time, while the sizes of the survey samples from each of the strata (the ni values) can be determined by using some sampling allocation scheme, subject to the constraint Σi ni = n, where n is the predetermined total sample size from R. The problem of determining an optimal sample allocation is the central focus of this paper. If we execute the above query on each of the R'i, the result of the query over the sample of stratum i can be written as

    ȳi = (1/ni) Σ_{r ∈ R'i} f(r)

The unbiased stratified sampling estimator for the query result, expressed in terms of the ȳi values, is

    Y = Σ_{i=1..L} Ni ȳi                                        (5-1)

The true variance of the f() values of the records in stratum i can be computed as

    σi² = (1/(Ni - 1)) Σ_{r ∈ Ri} (f(r) - μi)²,  where μi = (1/Ni) Σ_{r ∈ Ri} f(r)

Thus, the true variance (or error) of the estimator Y is given by

    σ² = Σ_{i=1..L} Ni² (1 - ni/Ni) σi²/ni                      (5-2)

In practice, it is not feasible to know the true stratum variances for an arbitrary query. Hence, a sample-based estimate for the variance of stratum i can be computed as

    σ̂i² = (1/(ni - 1)) Σ_{r ∈ R'i} (f(r) - ȳi)²

Then, an unbiased estimator for the variance of Y can be obtained from Equation 5-2 by simply replacing all the σi² terms with their corresponding unbiased estimators σ̂i².
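As a concrete illustration, the estimator of Equation 5-1 and its plug-in variance estimate can be computed as follows. This is our own Python sketch (the function name and the 95% z-score of 1.96 are illustrative choices, not from the text):

```python
import math

def stratified_estimate(strata_sizes, samples, z=1.96):
    """Stratified SUM estimate Y (Eq. 5-1), its estimated variance (Eq. 5-2
    with each sigma_i^2 replaced by its sample estimate), and a CLT bound."""
    Y, V = 0.0, 0.0
    for N, s in zip(strata_sizes, samples):   # samples[i]: observed f() values
        n = len(s)
        ybar = sum(s) / n
        Y += N * ybar                         # N_i times the sample mean
        var_hat = sum((v - ybar) ** 2 for v in s) / (n - 1) if n > 1 else 0.0
        V += N ** 2 * (1 - n / N) * var_hat / n
    half = z * math.sqrt(V)
    return Y, V, (Y - half, Y + half)
```

For instance, with two strata of sizes 10 and 20 and constant samples [1, 1] and [2, 2], the estimate is 10*1 + 20*2 = 50 with zero estimated variance.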
Central-limit-theorem-based confidence bounds [112] for Y can then be computed as Y ± zα σ̂, where σ̂ is the square root of the estimated variance of Y and zα is the z-score for the desired confidence level. If desired, more conservative confidence bounds from the literature (such as Chebyshev-based bounds [112]) can also be used. Finally, we note that aggregate queries like COUNT and AVG can also be handled by stratified sampling estimators like the one described above by using ratios of two different estimates. Aggregate queries with a GROUP BY clause can also be answered by using stratification: a GROUP BY query can be considered as executing several simple queries in parallel, one for each group. Joins can also be handled using methods similar to those proposed by Haas and Hellerstein [54], though that is beyond the scope of the paper.

5.2.2 "Optimal" Allocation and Why It's Not

The problem of determining the ni values for all the strata for a predetermined sample size n is the sample allocation problem. The key constraint on the values of the sample sizes is that their sum should equal the total sample size. Besides this constraint, there is freedom in the choice of the ni values, and hence a natural choice is to minimize the error of Y of Equation 5-1. Since Y is unbiased, minimizing its error is equivalent to minimizing its variance. An optimization problem can be formulated for the choice of ni values so that the variance σ² is minimized; solving the problem leads to the well-known Neyman allocation [25] from statistics. Specifically, the Neyman allocation states that the variance of a stratified sampling estimator is minimized when the sample size ni is proportional to the size of the stratum, Ni, and to the standard deviation of the f() values in the stratum, σi. That is,

    ni = n Ni σi / Σ_{j=1..L} Nj σj                             (5-4)

The problem we face in a database setting is that the strata variance values σi² are not known for an arbitrary query. The stratum variance σi² depends on: (a) the function to be aggregated, f1(), and (b) the relational selection predicate, f2().
Since these functions can vary from one query to another, it is not feasible to compute beforehand exact values of the various σi² terms for an arbitrary query. This means that the optimal ni values cannot be computed in the absence of exact σi² values. It is possible to obtain rough estimates for the strata variances by doing a pilot run of the query on very small pilot samples from each stratum, which is the standard method. However, as the following example shows, a major drawback of this approach is that the variance estimates calculated from such pilot sampling can be arbitrarily erroneous, leading to an extremely poor allocation scheme and even more severe problems.

    True query result       20150
    Avg. observed bias      10200
    Avg. estimated MSE      0.76 million
    Avg. observed MSE       100 million
    MSE of true optimal     58.6 million

Example 2: Imagine that we have a relation R partitioned into two strata R1 and R2 such that |R1| = 10000 and |R2| = 10000. Let Q be a query identical to the query presented in Section 5.2.1. The number of records from R1 accepted by f2() is 10, while the number of records from R2 accepted by f2() is 1000. Further, let f1(r) ~ N(1000, 100) ∀r ∈ R1 and f1(r) ~ N(10, 100) ∀r ∈ R2, where N(μ, σ) denotes a normal distribution with mean μ and variance σ². We use a pilot sample of 100 records to estimate the variance of the f() values in each stratum. These estimates are σ̂1² and σ̂2². If the desired sample size is n = 1000, the estimated variances can be used with Equation 5-4 to obtain an estimate for the optimal sampling allocation as follows:

    n1 = 1000 σ̂1 / (σ̂1 + σ̂2)        n2 = 1000 σ̂2 / (σ̂1 + σ̂2)

We then ask the question: how accurate will the resulting sampling plan be? To answer this question, we perform a simple experiment in which we repeat the above process 1000 times. For each iteration, we record the squared error of the estimate produced by the computed sampling plan. The average of all these squared errors gives us an approximation of the mean-squared error (MSE) of the estimator.
For each iteration, we also compute the estimated variance of the result (using Equation 5-2), since this variance would be used to report confidence bounds to the user. We then compute the average estimated variance across the 1000 iterations. Finally, we use the true variances of both strata to obtain an optimal sample allocation, and repeat the above experiment using the optimal allocation. The results are summarized in the table above. Overall, the results using the pilot sampling are disastrous. Specifically:

* The pilot-sampling-based allocation provides an average estimated error to the user that is more than 2 orders of magnitude smaller than the true error: 0.76 million versus 100 million. Since the estimated error is typically used to compute confidence bounds, the resulting confidence bounds will be much narrower than what they should be in reality. Hence, the user would be provided with a dangerously optimistic picture of the error of the estimator.

* Second, the non-optimal allocation leads to an estimate that has a heavy bias. This is due to the fact that the allocation often directs the stratified sampling to ignore the first stratum. For roughly 90% of the 1000 iterations, the pilot sample fails to discover any matching records in R1. Hence, the pilot-sample-based variance is naively guessed to be zero. When this value is used with the Neyman allocation, no samples are allocated to R1, while all 1000 samples are allocated to R2. The outcome is that the query result is usually underestimated, because R1 actually contains records accepted by f2().

* Finally, by using a truly optimal sampling allocation to estimate the query result, it is possible to achieve an error that is around half the error obtained by a non-optimal allocation. The additional error incurred due to the poor allocation represents a wasted opportunity to provide a much more accurate estimate.
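The failure mode described in the second bullet is easy to reproduce. The following simulation is our own sketch of Example 2 in Python (all names are ours); the probability that a 100-record pilot misses all 10 matching records in R1 is roughly (1 - 100/10000)^10 ≈ 0.90:

```python
import random
import statistics

random.seed(0)

# Example 2 setup: in R1, 10 of 10,000 records match f2() with f1 ~ N(1000, 100^2);
# in R2, 1,000 of 10,000 match with f1 ~ N(10, 100^2); non-matches contribute 0.
R1 = [random.gauss(1000.0, 100.0) for _ in range(10)] + [0.0] * 9990
R2 = [random.gauss(10.0, 100.0) for _ in range(1000)] + [0.0] * 9000

def neyman(n, sizes, sigmas):
    # Equation 5-4: n_i proportional to N_i * sigma_i (rounded; a sketch only)
    tot = sum(N * s for N, s in zip(sizes, sigmas))
    return [round(n * N * s / tot) if tot > 0 else 0 for N, s in zip(sizes, sigmas)]

starved, trials = 0, 500
for _ in range(trials):
    pilot1, pilot2 = random.sample(R1, 100), random.sample(R2, 100)
    alloc = neyman(1000, [10000, 10000],
                   [statistics.pstdev(pilot1), statistics.pstdev(pilot2)])
    if alloc[0] == 0:   # pilot saw no R1 match, so sigma_hat_1 = 0 and n_1 = 0
        starved += 1

print(starved / trials)   # typically close to 0.9
```

When the pilot does see a match in R1, the estimated standard deviation jumps to roughly 100, and the allocation changes drastically, which is exactly the instability the example illustrates.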
5.3 Overview of Our Solution

The fundamental problem we face is that the natural estimator for σi² serves us extremely poorly when we are trying to figure out how to allocate samples to strata. Human intuition tells us that it is foolish to simply assume that σi² is zero in this case, even though our estimate σ̂i² will be zero. This is because, as human beings, we know that there will often be a number of records matching the given f2() in a stratum, and we will simply be unlucky enough to miss them in our pilot sample. To remedy these problems, we propose a novel Bayesian approach [14] called the Bayes-Neyman allocation that can incorporate such intuition into the process in a principled fashion. In general, Bayesian methods formally model such prior intuition or belief as a probability distribution. Such methods then refine the distribution by incorporating additional information (in our case, information from the pilot sample) to obtain an overall improved probability distribution. At the highest level, the proposed Bayes-Neyman allocation works as follows:

1. First, in Bayesian fashion, we represent our belief in the possible variances of the f() values in each stratum as a prior probability distribution. Let the vector Σ = (σ1², σ2², ..., σL²) denote one possible set of strata variances. We define a probability distribution over all of the possible Σ values to represent this prior belief. Let XΣ be a random variable with exactly this probability distribution. Thus, sampling from XΣ (that is, performing a random trial over XΣ) gives us one possible value for the vector Σ, where those variance vectors that we feel are more "correct" are more likely to be sampled.

2. Second, we take a pilot sample from the database and use the result of the pilot sample to update the distribution of XΣ in order to make it more accurate.

3. Third, we sample a large number of possible Σ values from the resulting XΣ in Monte Carlo fashion. This gives us a large number of possible alternative values for Σ.

4.
Finally, we construct a sampling plan for estimating the answer to our query whose average error (variance) is minimized over all of the Σ values that were sampled from XΣ. This gives us a sampling plan whose expected error over the possible set of databases described by the distribution of XΣ is minimized. This plan is then used to perform the actual stratified sampling.

The three key technical questions that must be addressed when adopting this approach are:

1. First, how is the random variable XΣ defined?

2. Second, how can the distribution of XΣ be updated to take into account any information that is gathered via the pilot sample?

3. Third, how can a set of samples from the updated XΣ be used to produce an optimal sampling plan?

The next three sections outline our answer to these three questions.

5.4 Defining XΣ

In this section, we consider the nature of XΣ itself, and how to sample from it.

5.4.1 Overview

At the highest level, the process of producing a single sample Σ from XΣ will be further subdivided into three steps:

1. First, we sample from a random variable Xcnt to obtain a vector (cnt1, cnt2, ..., cntL), where this vector tells us how many tuples from each stratum are accepted by the relational selection predicate f2().

2. Second, we sample from a random variable XΣ' that gives us the vector Σ' = ((μ1,1, μ2,1), (μ1,2, μ2,2), ..., (μ1,L, μ2,L)). The ith pair (μ1,i, μ2,i) is the mean (that is, μ1) and second moment (that is, μ2)1 over all of the f1() values in stratum i for those cnti tuples that are accepted by f2().

3. Third, once these two samples have been obtained, it is then a simple mathematical task to use the outputs of Xcnt and XΣ' to compute the output of XΣ.

We now consider each of these three steps in detail.

5.4.2 Defining Xcnt

Using terminology common in Bayesian statistics, each entry in Xcnt is generated by sampling from a binomial distribution with a Beta prior distribution [33].
This means that we view the probability pi that an arbitrary tuple from stratum i will be accepted by the relational selection predicate f2() as being the result of a random sample from the Beta distribution, which produces a result from 0 to 1. Since we view each tuple as a separate and independent application of f2(), the number of tuples from stratum i that are accepted by f2() is then binomially distributed,2 with the binomial distribution taking the value pi as input, along with the stratum size Ni. The Beta distribution is chosen as the prior distribution because it is a canonical "conjugate prior" distribution for the binomial distribution. The fact that it is a conjugate prior means that its domain is precisely equal to the parameter space for the binomial distribution; in this case, the range 0 to 1, which is the valid range for pi.

1 Recall that the second moment of a random variable X is the expected value of X²: μ2 = E[X²].

2 The binomial distribution models the case where n balls are thrown at a bucket and each ball has a probability p of falling in the bucket. A binomially distributed sample returns the number of balls that happened to land in the bucket.

Figure 5-1. Beta distribution with parameters α = β = 0.5. (Density plotted against query selectivity; plot omitted here.)

Given this setup, the first task is to choose the set of Beta parameters that control the distribution of each pi so as to match the reality of what a typical value of pi will be for each stratum. The Beta distribution is a parametric distribution and requires two input parameters, α and β. Depending on the parameters that are selected, the Beta can take a large variety of shapes and skews. Choosing α and β for the ith stratum is equivalent to supplying our "intuition" to the method, stating what our initial belief is regarding the probability that an arbitrary record will be accepted by f2(). There are two possibilities for setting those initial parameters.
The first possibility is to use workload information. We could monitor all previously observed queries over each and every stratum, where we observe that for query i and stratum j, the probability that a given record was accepted by f2() was pij. Then, assuming that {pij ∀i, j} are all samples from our generative Beta prior, we simply estimate α and β from this set using any standard method. An estimate for the Beta parameters based upon the principle of Maximum Likelihood Estimation can easily be derived [112]. A second method is to simply assume that the stratification we choose usually works well. In this case, most strata will either have a very low or a very high percentage of their records accepted by f2(). Choosing α = β = 0.5 results in a U-shaped distribution that matches this intuition exactly, and is a common choice for a Beta prior. The resulting Beta is illustrated in Figure 5-1. In practice we find that this produces excellent results. We stress that though the initial choice of α and β for each stratum is important, it is only important to the extent that it informs us what is going on in the case that we have very little information available in the pilot sample (such as when the pilot is very small). If the pilot sample contains a great deal of information, the update step described in Section 5.5 will update α and β as needed to take into account the information present in the pilot sample.
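For the workload-based option, one simple "standard method" is a method-of-moments fit, shown below in place of the MLE that the text mentions (the function name and the choice of method-of-moments are our own illustration):

```python
def beta_method_of_moments(ps):
    """Fit Beta(alpha, beta) to observed per-query selectivities ps by
    matching the sample mean and variance; requires 0 < var < mean*(1-mean)."""
    m = sum(ps) / len(ps)
    v = sum((p - m) ** 2 for p in ps) / len(ps)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common
```

For selectivities symmetric around 0.5 the fit returns α = β, and when the observed selectivities cluster near 0 and 1 it produces α = β < 1, recovering exactly the U-shaped prior described above.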
5.4.2 Producing the Vector of Counts

Given the above setup, the GetCounts algorithm can be used to produce the vector of counts (cnt_1, cnt_2, ..., cnt_L).

Algorithm GetCounts(α, β, N) {
1  // Let α = (α_1, α_2, ..., α_L) be the parameters of the Beta
   // distributions of all strata
2  // Let β = (β_1, β_2, ..., β_L) be the parameters of the Beta
   // distributions of all strata
3  // Let N = (N_1, N_2, ..., N_L) be a vector of all strata sizes
4  // Let cnt = (cnt_1, ..., cnt_L) be a vector of counts for all strata
5  for(int i = 1; i <= L; i++) {
6    p_i <- Beta(α_i, β_i)
7    cnt_i <- Binomial(N_i, p_i)
8  }
9  return cnt
}

5.4.3 Defining XΣ'

In the previous subsection, we described how to obtain counts for the number of records that satisfy the selection predicate f2(). However, in order to obtain a sample from XΣ, it is not enough to merely know these counts. We actually need to know the f1() values of all the records that satisfy f2() in stratum i, since these values are needed to be able to compute (μ1_i, μ2_i), as is required to sample from XΣ'. To do this, we use the following method. For the ith stratum, let D be the vector of all possible distinct values from the range of the function f1(). We then associate a probability p_j with the jth distinct value; p_j indicates the likelihood of the jth distinct value from the stratum (that is, D[j]) being assigned to an arbitrary tuple that has been accepted by f2(). Then, let V denote a vector of |D| counts, where V[j] = k means that the jth distinct value from the stratum has been assigned to k tuples that were accepted by f2(). Thus, Σ_j V[j] = cnt_i. Since we assume that each application of f1() is independent on a per-tuple basis, V can be obtained by sampling from a multinomial distribution³ with two arguments: the probability vector consisting of all the p_j values, and the number of trials given by cnt_i (that is, the number of tuples accepted by f2()).
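The multinomial draw of V can be sketched as follows. The function name is ours, and because the Python standard library has no multinomial sampler, the draw is carried out as cnt independent categorical draws, which is exactly the per-tuple independence assumption in the text.

```python
import random

def sample_counts_vector(distinct_probs, cnt, rng=random):
    """Draw V ~ Multinomial(cnt, (p_1, ..., p_d)): assign each of the cnt
    accepted tuples to one distinct value, so that sum(V) == cnt."""
    v = [0] * len(distinct_probs)
    for _ in range(cnt):
        u, acc = rng.random(), 0.0
        for j, pj in enumerate(distinct_probs):
            acc += pj
            if u < acc:
                v[j] += 1
                break
        else:
            v[-1] += 1  # guard against floating-point round-off in acc
    return v
```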
Then, the resulting vector V along with the distinct value vector D can be used to compute the pair (μ1_i, μ2_i). This technique poses two important questions that need to be answered:

* Is it always feasible to consider all the values in the range of f1()? That is, can we always materialize D?
* How do we assign probabilities to all of the values in D? That is, how do we decide the value of each p_j?

The answer to the first question is simple: it is certainly not feasible to always consider all possible values in the range of f1(), for obvious computational and storage reasons. However, for the moment, we assume that it is feasible, and consider the more general case in Section 5.4.5.

3 The multinomial distribution models the case where cnt balls are thrown at d buckets so that the probability of an arbitrary ball falling in bucket j is p_j. A sample from the multinomial assigns b_j balls to bucket j such that Σ_j b_j = cnt.

In answering the second question, we develop a methodology analogous to the way we choose the p_i parameter when dealing with the ith stratum for X_cnt. As described above, the number of times that each distinct f1() value is selected follows a multinomial distribution. We know from Bayesian statistics that the standard conjugate prior for a multinomial distribution is the Dirichlet distribution [33], just as the Beta distribution is the standard conjugate prior for a binomial distribution. The Dirichlet is the multidimensional generalization of the Beta. A k-dimensional Dirichlet distribution makes use of the parameter vector θ = {θ_1, θ_2, ..., θ_k}. Just as in the case of the Beta prior used by X_cnt, the Dirichlet prior requires an initial set of parameters that represent our initial belief. Since we typically have no knowledge about how likely it is that a given f1() value will be selected by f2(), the simplest initial assumption to make is that all values are equally likely.
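A probability vector can be drawn from the Dirichlet prior with only the standard library, using the classical construction of normalizing independent Gamma draws; the function name is ours.

```python
import random

def sample_dirichlet(theta, rng=random):
    """Draw one probability vector from Dirichlet(theta) by normalizing
    independent Gamma(theta_j, 1) draws -- the standard construction."""
    g = [rng.gammavariate(t, 1.0) for t in theta]
    s = sum(g)
    return [x / s for x in g]
```

With the zero-knowledge prior θ_j = 1 for all j, this is a uniform draw from the probability simplex.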
In the case of the Dirichlet distribution, using θ_i = 1 for all i is the typical zero-knowledge prior [33]. Given θ, it is then a simple matter to sample from XΣ', as we describe formally in the next subsection. We note that although this initial parameter choice may be inaccurate, in Bayesian fashion the parameters will be made more accurate based upon the information present in the pilot sample. Section 5.5 provides details of how the update is accomplished.

Producing the Vector E'

We now present an algorithm GetMoments to obtain the vector E'. We assume that we have all the θ_i values corresponding to the parameters of the Dirichlet, along with counts of the number of records that are accepted by f2() in each stratum. These count values are the values in the vector cnt obtained according to Algorithm GetCounts in Section 5.4.2.

Algorithm GetMoments(θ_1, ..., θ_L, D) {
1  // Let θ_i denote the Dirichlet parameters for stratum i
2  // Let D be an array of all distinct values from the range of f1()
3  // Let E' = (E'_1, ..., E'_L) be a vector of moments of all strata
4  for(int i = 1; i <= L; i++) {
5    p <- Dirichlet(θ_i)
6    μ1 = μ2 = 0
7    // Let V be an array of counts for each domain value
8    V <- Multinomial(cnt_i, p)
9    for(int j = 1; j <= |D|; j++) {
10     μ1 += V[j] * D[j]
11     μ2 += V[j] * (D[j])²
12   }
13   μ1 /= cnt_i
14   μ2 /= cnt_i
15   // Let E'_i denote the pair (μ1_i, μ2_i)
16   E'_i = (μ1, μ2)
17 }
18 return E'
}

5.4.4 Combining the Two

Once a sample from X_cnt and a sample from XΣ' have been obtained, it is then a simple matter to put them together to obtain a sample from XΣ. Recall that the variance of a random variable X is defined as follows: σ²[X] = E[X²] − E²[X], where E[·] denotes the expected value of the random variable. For the ith stratum, after sampling from X_cnt and XΣ' we know three things:

1. The size of the stratum, N_i.
2. The number of records accepted by f2(), which is cnt_i.
3. The first and second moments (μ1_i, μ2_i) of f1() applied to those tuples that were accepted by f2().
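These three quantities are exactly what the variance computation that follows consumes. As a sketch, a single Monte Carlo draw of (cnt_i, μ1_i, μ2_i) for one stratum, chaining the GetCounts and GetMoments steps above, might look like this (function and variable names are ours; only standard-library samplers are used):

```python
import random

def draw_stratum_state(alpha, beta, theta, distinct_vals, n_stratum, rng=random):
    """One draw of (cnt_i, mu1_i, mu2_i) for a single stratum."""
    # Stage 1 (GetCounts): p_i ~ Beta(alpha, beta), cnt_i ~ Binomial(N_i, p_i);
    # the binomial is simulated as a sum of Bernoulli trials.
    p = rng.betavariate(alpha, beta)
    cnt = sum(1 for _ in range(n_stratum) if rng.random() < p)
    if cnt == 0:
        return 0, 0.0, 0.0  # no accepted tuples: both moments are zero
    # Stage 2 (GetMoments): value probabilities ~ Dirichlet(theta),
    # built from normalized Gamma draws.
    g = [rng.gammavariate(t, 1.0) for t in theta]
    s = sum(g)
    probs = [x / s for x in g]
    # V ~ Multinomial(cnt, probs): assign each accepted tuple a distinct value.
    v = [0] * len(probs)
    for _ in range(cnt):
        u, acc = rng.random(), 0.0
        for j, pj in enumerate(probs):
            acc += pj
            if u < acc:
                v[j] += 1
                break
        else:
            v[-1] += 1
    mu1 = sum(vj * d for vj, d in zip(v, distinct_vals)) / cnt
    mu2 = sum(vj * d * d for vj, d in zip(v, distinct_vals)) / cnt
    return cnt, mu1, mu2
```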
Thus, the variance σ_i² of f1() applied to all tuples in the ith stratum can be computed as:

σ_i² = [ (cnt_i/N_i) · μ2_i + ((N_i − cnt_i)/N_i) · 0 ] − [ (cnt_i/N_i) · μ1_i + ((N_i − cnt_i)/N_i) · 0 ]²
     = (cnt_i/N_i) · μ2_i − ((cnt_i/N_i) · μ1_i)²                                          (5-5)

The two zeros in the above derivation come from the fact that both the first moment (or mean) and the second moment of f1() over every tuple not accepted by f2() are zero. This computation is repeated for each possible i in order to obtain the desired sample from XΣ. The algorithm GetSigma describes how the variances can be computed using the above technique.

Algorithm GetSigma(cnt, E', N) {
1  // Let cnt = (cnt_1, ..., cnt_L) be a vector of counts of records
   // accepted by f2() for all strata
2  // Let E' = (E'_1, ..., E'_L) be a vector of moments of all strata
3  // Let N = (N_1, N_2, ..., N_L) be a vector of all strata sizes
4  // Let Σ be a vector of variances for all strata
5  for(int i = 1; i <= L; i++) {
6    // Apply Equation 5-5, where (μ1_i, μ2_i) = E'_i
7    Σ_i = (cnt_i/N_i) · μ2_i − ((cnt_i/N_i) · μ1_i)²
8  }
9  return Σ
10 }

5.4.5 Limiting the Number of Domain Values

As mentioned in Section 5.4.3, the one remaining problem regarding how to sample from XΣ' is the problem of having a very large (or even unknown) range for the function f1(). In this case, dealing with the vectors D and V may be impossible, for both storage and computational reasons. The simple solution to this problem is to break the range of f1() into a number of buckets and make use of a histogram over the range, rather than using the range itself. In this case, D is generalized to be an array of histogram buckets, where each entry in D has summary information for a group of distinct f1() values. Each entry in D has the following four specific pieces of information:

1. low and high, which are the lower and upper bounds for the f1() values that are found in this particular bucket.
2. μ1, which is the mean of the f1() values that are found in this particular bucket. That is, if A is the set of distinct values from low to high, then μ1 = (1/|A|) Σ_{a∈A} a.
3. μ2, which is the second moment of the f1() values that are found in this particular bucket.
That is, μ2 = (1/|A|) Σ_{a∈A} a². Given D, there are two possible ways to construct the histogram. In the case where the queries that will be asked request a simple sum over one of the attributes from the underlying relation R (that is, f1() does not encode any function other than a simple relational projection), then it is possible to construct D offline by using any histogram construction scheme [42, 45, 72] over the attribute that is to be queried. In the case that multiple attributes might be queried, one histogram can be constructed for each attribute. This is the method that we test experimentally. Another appropriate method is to construct D on the fly by making use of the pilot sample that is used to compute the sampling plan. This has the advantage that any arbitrary f1() can be handled at run time. Again, any appropriate histogram construction scheme can be used, but rather than constructing D offline using the entire relation R, f1() is applied to each r ∈ R_pilot (whether or not r is accepted by f2()) and the histogram is constructed over the resulting set of distinct values. Whatever method is used to construct D, the function GetMoments from Section 5.4.3 must be modified so as to handle the modified D. The following is an appropriately modified GetMoments; we call it GetMomentsFromHist.
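Independently of the modified algorithm, the construction of the bucket array D itself can be sketched as below. Equi-width bucketing is our own simplifying assumption (the text allows any histogram construction scheme), the function name is ours, and here μ1 and μ2 are computed over the values that fall into each bucket.

```python
def build_histogram(values, n_buckets):
    """Build an array of buckets, each carrying (low, high, mu1, mu2),
    as the generalized D described in the text."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0  # degenerate case: all values equal
    groups = [[] for _ in range(n_buckets)]
    for v in values:
        j = min(int((v - lo) / width), n_buckets - 1)
        groups[j].append(v)
    out = []
    for j, vs in enumerate(groups):
        vs = vs or [0.0]  # empty bucket: moments default to zero
        out.append({"low": lo + j * width,
                    "high": lo + (j + 1) * width,
                    "mu1": sum(v for v in vs) / len(vs),
                    "mu2": sum(v * v for v in vs) / len(vs)})
    return out
```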
Algorithm GetMomentsFromHist(θ_1, ..., θ_L, D) {
1  // Let θ_i denote the vector of Dirichlet parameters for stratum i
2  // Let D be an array of histogram buckets
3  // Let E' = (E'_1, ..., E'_L) be a vector of moments of all strata
4  for(int i = 1; i <= L; i++) {
5    p <- Dirichlet(θ_i)
6    μ1 = μ2 = 0
7    // Let V be an array of counts for each bucket
8    V <- Multinomial(cnt_i, p)
9    for(int j = 1; j <= |D|; j++) {
10     μ1 += V[j] * D[j].μ1
11     μ2 += V[j] * D[j].μ2
12   }
13   μ1 /= cnt_i
14   μ2 /= cnt_i
15   // Let E'_i denote the pair (μ1_i, μ2_i)
16   E'_i = (μ1, μ2)
17 }
18 return E'
}

5.5 Updating Priors Using the Pilot

In Section 5.4, we described how we assign initial values to the parameters of the two prior distributions, the Beta and the Dirichlet. In this section, we explain how these initial values can be refined by using information from a pilot sample to obtain corresponding posterior distributions. Updating these priors using the pilot sample in the proposed Bayes-Neyman approach is analogous to using the pilot sample to estimate the stratum variances in the classic Neyman allocation. The update rules described in this section are fairly straightforward applications of the standard Bayesian update rules [14]. The Beta distribution has two parameters, α and β. Let R_pilot denote the pilot sample and let s denote the number of its records that are accepted by the predicate f2(). Thus, |R_pilot| − s will be the number of records that fail to be accepted by the query. Then, the following update rules can be used to directly update the α and β parameters of the Beta distribution:

α = α + s
β = β + (|R_pilot| − s)

The Dirichlet distribution is updated similarly. Recall that this distribution uses a vector of parameters θ = {θ_1, θ_2, ..., θ_k}, where k is the number of dimensions. To update the parameter vector θ, we can use the same pilot sample that was used to update the Beta, as follows. We initialize to zero all elements of an array count of size k.
These elements denote counts of the number of times that the different values from the range of f1() appear in the pilot sample and are accepted by f2(). The following update rule can be used to update all the different parameters of the Dirichlet distribution:

θ_j = θ_j + count[j]

Algorithm UpdatePriors describes exactly how pilot sampling is used to update the parameters of the prior Beta and Dirichlet distributions for the ith stratum.

Algorithm UpdatePriors(α, β, θ, D, Rpilot) {
1  // Let α, β be the parameters of the Beta distribution for the
   // stratum to be updated
2  // Let θ = (θ_1, ..., θ_k) be the parameters of the Dirichlet
   // distribution for the stratum
3  // Let D be an array of histogram buckets for the stratum
4  // Let Rpilot be a pilot sample from the stratum
5  // Let count be an array of counts for each histogram bucket for
   // the stratum
6  for(int j = 1; j <= |D|; j++)
7    count[j] = 0
8  s = 0
9  for(int r = 1; r <= |Rpilot|; r++) {
10   rec = Rpilot[r]
11   if(f2(rec)) {
12     s++
13     val = f1(rec)
14     pos = FindPositionInArray(D, val)
15     count[pos]++
16   }
17 }
18 α = α + s
19 β = β + (|Rpilot| − s)
20 for(int j = 1; j <= |D|; j++)
21   θ_j = θ_j + count[j]
22 }

Algorithm FindPositionInArray(D, val) {
1  // Let D be an array of histogram buckets
2  // Let val be a scalar value
3  for(int j = 1; j <= |D|; j++)
4    if(D[j].low <= val && val < D[j].high)
5      return j
6 }

5.6 Putting It All Together

In this section, we consider how the random variable XΣ can be used to produce an alternative allocation to the classic Neyman, and give the complete algorithm for computing our allocation.

5.6.1 Minimizing the Variance

In general, the goal of any sampling plan should be to minimize the variance σ² of the resulting stratified sampling estimator. The formula for σ² in the classic allocation problem is given as Equation 5-2 of the thesis. Our situation differs from the classic setup only in that (in Bayesian fashion) we now use XΣ to implicitly define a distribution over the per-stratum variance values (σ_1², σ_2², ..., σ_L²).
Thus, we cannot minimize σ² directly, because under the Bayesian regime σ² is now a random variable. Instead, it makes sense to minimize the expected value (or average) of σ², which (using Equation 5-2) can be computed as:

E[σ²] = E[ Σ_{i=1..L} (N_i(N_i − n_i)/n_i) · σ_i² ]

Using the linearity of expectation, we have:

E[σ²] = Σ_{i=1..L} (N_i(N_i − n_i)/n_i) · E[σ_i²]

All of the machinery from the last two sections allows us to sample possible variance vectors from XΣ. Assume that we sample v of these vectors, where v is a suitably large number, and that the samples are denoted by Σ_1, Σ_2, ..., Σ_v. Then (1/v) Σ_{j=1..v} Σ_j[i] is an unbiased estimate of E[σ_i²]. Plugging this estimate into the previous equation, we have:

E[σ²] ≈ Σ_{i=1..L} (N_i(N_i − n_i)/n_i) · (1/v) Σ_{j=1..v} Σ_j[i]

We now wish to minimize this value subject to the constraint that Σ_i n_i = n. Notice that the resulting optimization problem has exactly the same structure as the optimization problem solved by the Neyman allocation, with the exception that σ_i² has been replaced by (1/v) Σ_{j=1..v} Σ_j[i]. The resulting optimal solution is then nearly identical, with σ_i replaced as appropriate:

n_i = n · N_i · sqrt((1/v) Σ_{j=1..v} Σ_j[i]) / Σ_{k=1..L} N_k · sqrt((1/v) Σ_{j=1..v} Σ_j[k])

5.6.2 Computing the Final Sampling Allocation

Algorithm GetBayesNeymanAllocation describes exactly how an optimal sampling allocation can be obtained using our technique.
Algorithm GetBayesNeymanAllocation(α, β, θ, D, Rpilot, N, n, v) {
1  // Let α = (α_1, ..., α_L) be the parameters of the Beta
   // distributions of all strata
2  // Let β = (β_1, ..., β_L) be the parameters of the Beta
   // distributions of all strata
3  // Let θ = (θ_1, ..., θ_L) be the set of parameters of the
   // Dirichlet distributions of all strata
4  // Let D = (D_1, D_2, ..., D_L) be an array of histogram
   // buckets for all strata
5  // Let Rpilot = (Rpilot_1, ..., Rpilot_L) be the pilot samples
   // from all strata
6  // Let v be the total number of iterations of resampling
7  for(int j = 1; j <= L; j++)
8    UpdatePriors(α_j, β_j, θ_j, D_j, Rpilot_j)
9  // Let cnt = (cnt_1, cnt_2, ..., cnt_L) be a vector of counts for
   // all strata
10 // Let E' = (E'_1, E'_2, ..., E'_L) be a vector of moments for all strata
11 // Let Σ and Σtemp be vectors of variances of size L
12 for(int i = 1; i <= v; i++) {
13   cnt = GetCounts(α, β, N)
14   E' = GetMomentsFromHist(θ_1, ..., θ_L, D)
15   Σtemp = GetSigma(cnt, E', N)
16   for(int j = 1; j <= L; j++)
17     Σ[j] += Σtemp[j]
18 }
19 denom = 0
20 for(int j = 1; j <= L; j++) {
21   Σ[j] /= v
22   denom += N[j] * sqrt(Σ[j])
23 }
24 for(int j = 1; j <= L; j++)
25   n_j = (n * N[j] * sqrt(Σ[j])) / denom
26 }

5.7 Experiments

5.7.1 Goals

The specific goals of our experimental evaluation are as follows:

* To compare the width of the confidence bounds produced using both the classic Neyman allocation and the proposed Bayes-Neyman allocation in realistic scenarios, in order to see which can produce tighter bounds.
* To test the reliability of the confidence bounds produced by the two methods. That is, we wish to ask: if bounds are reported to the user as 95% bounds, is the chance that they contain the answer actually 95%?
* Third, we wish to compare both methods against simple random sampling as a sanity check, to see if there is a significant improvement in bound width.
* Finally, we wish to compare the computation time required for the two estimators.

5.7.2 Experimental Setup

Data Sets Used.
We use three different data sets in our experimental evaluation:

* The first is a synthetic data set called the GMM data set, and is produced using a Gaussian (normal) mixture model. The GMM data set has three numerical and three categorical attributes. Since the underlying normal variables only produce numerical data, the three categorical attributes (having seven possible values each) are produced by mapping the ranges of three of the dimensions to discrete values. This data set has 5 million records.
* The second is the Person data set. This is a 13-attribute real-life data set obtained from the 1990 Census, containing family and income information. This data set is publicly available [2] and has a single relation with over 9.5 million records. The data has twelve numerical attributes and one categorical attribute with 29 categories.
* The third is the KDD data set, which is the data set from the 1999 KDD Cup event. This data set has 42 attributes with status information regarding various network connections for intrusion detection. This data set consists of around 5 million records with integer, real-valued, as well as categorical attributes.

Queries Tested. For each data set, we test queries of the form:

SELECT SUM(f1(r))
FROM R AS r
WHERE f2(r)

f1() and f2() vary depending upon the data set. For the GMM data set, f1() projects one of the three different numerical attributes (each query projects a random attribute). For the Person data set, either the TotalIncome attribute or the WageIncome attribute is projected by each query. For the KDD data set, either the src_bytes or the dst_bytes attribute is projected. For each of the data sets, three different classes of selection predicates encoded by f2() are used. Each class has a different selectivity. The three selectivity classes for f2() have selectivities of (0.01%, 0.001%), (0.1%, 0.01%), and (1.0%, 0.1%), respectively.
For the GMM data set, f2() is constructed by rolling a three-faced die to decide how many attributes will be included in the conjunction computed by f2(). The appropriate number of attributes are then randomly selected from among the six GMM attributes. If a categorical attribute is chosen as one of the attributes in f2(), then the attribute will be checked with either an equality or inequality condition over a randomly selected domain value. If a numerical attribute is chosen, then a range predicate is constructed. For a given numerical attribute, assume that low and high are the known minimum and maximum attribute values. The range is constructed using low' = low + v1 × (high − low) and high' = low' + v2 × (high − low'), where v1 and v2 are randomly chosen real values from the range [0, 1]. For each selectivity class, 50 different queries are generated by repeating the query-generation process until enough queries falling in the appropriate selectivity range have been generated. The f2() functions for the other two data sets are constructed similarly.

Stratification Tested. For each of the various data sets, a simple nearest-neighbor classification algorithm is used to perform the stratification. In order to partition a data set into L strata, L records are first chosen randomly from the data to serve as "seeds" for each of the strata, and all of the other records are added to the stratum whose seed is closest to the data point. For numerical attributes, the L2 norm is used as the distance function. For categorical attributes, we compute the distance using the support from the database for the attribute values [36]. Since each data set has both numerical and categorical data, the actual distance function used is the sum of the two "sub" distance functions.
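The seed-based nearest-neighbor stratification just described can be sketched as follows; function names and the pluggable distance argument are ours.

```python
import random

def stratify(records, n_strata, distance, rng=random):
    """Pick n_strata random seed records, then assign every record to the
    stratum whose seed is closest under the supplied distance function."""
    seeds = rng.sample(records, n_strata)
    strata = [[] for _ in range(n_strata)]
    for r in records:
        i = min(range(n_strata), key=lambda j: distance(r, seeds[j]))
        strata[i].append(r)
    return strata
```

For purely numerical records, the distance argument would be the L2 norm; the text's categorical "sub" distance would be added on top for mixed data.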
Table 5-1. Bandwidth (as a ratio of error bound width to the true query answer) and coverage (over 1000 query runs) for a simple random sampling estimator on the three data sets. Results are shown for varying sample sizes and for three different query selectivities: 0.01%, 0.1%, and 1%.

Sample Size  Sel (%)  Bandwidth (GMM/Person/KDD)  Coverage (GMM/Person/KDD)
50K          0.01     3.277 /2.289 /2.140         918 /892 /921
50K          0.1      1.776 /0.514 /1.520         926 /912 /988
50K          1        0.587 /0.184 /0.210         947 /944 /942
100K         0.01     2.626 /2.108 /1.48          922 /941 /937
100K         0.1      1.273 /0.351 /0.910         939 /948 /940
100K         1        0.415 /0.128 /0.120         948 /952 /946
500K         0.01     2.192 /1.740 /0.820         923 /943 /940
500K         0.1      0.551 /0.132 /0.630         946 /947 /942
500K         1        0.178 /0.087 /0.070         946 /947 /948

Note that it would be possible to use a much more sophisticated stratification, but actually performing the stratification is not the point of this thesis; our goal is to study how to best use the stratification. In our experiments, we test L = 1, L = 20, and L = 200. Note that if L = 1 then there is actually no stratification performed, and so this case is equivalent to simple random sampling without replacement and will serve as a sanity check in our experiments.

Tests Run. For the Neyman allocation and our Bayes-Neyman allocation, our test suite consists of 54 different test cases for each data set, plus nine more tests using L = 1. These test cases are obtained by varying the following four parameters:

* Number of strata. We use L = 1, L = 20, and L = 200; as described above, L = 1 is equivalent to simple random sampling without replacement.
* Pilot sample size. This is the number of records we obtain from each stratum in order to perform the allocation. We choose values of 5, 20, and 100 records.
* Sample size. This is the total sample size that has to be allocated. We use 50,000, 100,000, and 500,000 samples in our tests.
* Query selectivity. As described above, we test query selectivities of 0.01%, 0.1%, and 1%.
Table 5-2. Average running time (sec.) of the Neyman and Bayes-Neyman estimators over the three data sets.

Data set          Neyman  Bayes-Neyman
Gaussian Mixture  1.5     2.4
Person            2.3     3.1
KDD Cup           2.1     2.8

Each of the 50 queries for each (data set, selectivity) combination is rerun 20 times using 20 different (pilot sample, sample) combinations. Thus, for each (data set, selectivity) combination we obtain results for 1000 query runs in all.

5.7.3 Results

Table 5-1 shows the results for the nine cases where L = 1; that is, where no stratification is performed. We report two numbers: the bandwidth and the coverage. The bandwidth is the ratio of the width of the 95% confidence bounds computed as the result of using the allocation to the true query answer. The coverage is the number of times out of the 1000 trials that the true answer is actually contained in the 95% confidence bounds reported by the estimator. Naturally, one would expect this number to be close to 950 if the bounds are in fact reliable. Tables 5-3 and 5-4 show the results for the 54 different test cases where a stratification is actually performed. For each of the 54 test cases and both of the sampling plans used (the Neyman allocation and the Bayes-Neyman allocation), we again report the bandwidth and the coverage. Finally, Table 5-2 shows the average running times for the two stratified sampling estimators on all three data sets. There is generally around a 50% hit in terms of running time when using the Bayes-Neyman allocation compared to the Neyman allocation.

5.7.4 Discussion

There are quite a large number of results presented, and discussing all of the intricacies present in all of our findings is beyond the scope of the thesis. However, taken as a whole, our experiments clearly show two things.
NS  PS   SS    Sel(%)  Neyman BW (GMM/Person/KDD)  Bayes-Neyman BW   Neyman Cov      Bayes-Neyman Cov
20  5    50K   0.01    0.00/0.00/0.00              2.90/0.19/1.12    0/0/0           935/882/927
20  5    50K   0.1     0.03/0.01/0.02              1.27/0.02/0.80    3/49/23         929/939/938
20  5    50K   1       0.05/0.02/0.14              0.39/0.01/0.09    11/247/155      940/950/945
20  5    100K  0.01    0.00/0.00/0.00              2.77/0.16/1.08    0/0/0           936/961/930
20  5    100K  0.1     0.02/0.01/0.01              0.90/0.02/0.73    3/53/28         941/941/938
20  5    100K  1       0.05/0.01/0.03              0.28/0.01/0.08    24/306/170      941/947/947
20  5    500K  0.01    0.01/0.00/0.00              2.05/0.06/0.87    3/0/4           938/948/932
20  5    500K  0.1     0.01/0.00/0.01              0.37/0.01/0.55    10/62/51        954/954/941
20  5    500K  1       0.03/0.01/0.02              0.12/0.00/0.04    38/316/184      957/955/945
20  20   50K   0.01    0.06/0.00/0.04              2.72/0.22/1.06    14/0/5          942/941/938
20  20   50K   0.1     0.17/0.03/0.09              1.21/0.03/0.81    106/61/88       908/938/944
20  20   50K   1       0.21/0.05/0.27              0.34/0.01/0.09    404/692/561     948/948/947
20  20   100K  0.01    0.01/0.00/0.01              2.58/0.16/0.91    23/0/6          941/937/941
20  20   100K  0.1     0.11/0.02/0.06              0.85/0.02/0.74    165/66/107      934/954/939
20  20   100K  1       0.14/0.03/0.09              0.25/0.01/0.06    431/728/612     954/962/953
20  20   500K  0.01    0.01/0.00/0.01              1.93/0.07/0.62    30/0/21         946/943/944
20  20   500K  0.1     0.01/0.01/0.01              0.34/0.01/0.51    230/145/245     942/952/945
20  20   500K  1       0.04/0.01/0.03              0.09/0.00/0.02    447/751/746     943/961/950
20  100  50K   0.01    0.15/0.04/0.08              2.33/0.19/0.82    24/58/20        938/922/938
20  100  50K   0.1     0.26/0.10/0.16              1.09/0.02/0.58    436/204/172     929/949/942
20  100  50K   1       0.47/0.18/0.34              0.32/0.01/0.05    870/891/866     932/962/951
20  100  100K  0.01    0.12/0.03/0.06              2.26/0.16/0.57    29/59/41        935/945/940
20  100  100K  0.1     0.18/0.05/0.11              0.81/0.02/0.40    435/249/355     927/957/942
20  100  100K  1       0.31/0.08/0.02              0.22/0.01/0.04    895/928/914     948/968/943
20  100  500K  0.01    0.01/0.01/0.01              1.72/0.07/0.33    45/66/50        939/952/947
20  100  500K  0.1     0.06/0.02/0.04              0.31/0.01/0.28    474/297/412     954/954/952
20  100  500K  1       0.06/0.02/0.06              0.08/0.00/0.02    926/935/942     950/970/949

Table 5-3.
Bandwidth (as a ratio of error bound width to the true query answer) and coverage (over 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator on the three data sets. Results are shown for 20 strata (NS) with a varying number of pilot-sample records per stratum (PS) and sample sizes (SS), for three different query selectivities: 0.01%, 0.1%, and 1%.

NS   PS   SS    Sel(%)  Neyman BW (GMM/Person/KDD)  Bayes-Neyman BW   Neyman Cov      Bayes-Neyman Cov
200  5    50K   0.01    0.00/0.00/0.00              1.73/0.18/0.91    0/0/0           933/931/924
200  5    50K   0.1     0.00/0.02/0.01              0.97/0.02/0.76    0/56/27         933/953/936
200  5    50K   1       0.05/0.02/0.03              0.26/0.01/0.09    19/162/149      940/960/940
200  5    100K  0.01    0.00/0.01/0.01              1.57/0.13/0.75    0/43/28         936/916/930
200  5    100K  0.1     0.01/0.01/0.01              0.72/0.02/0.64    7/60/41         938/958/936
200  5    100K  1       0.03/0.01/0.01              0.19/0.00/0.08    34/365/212      945/955/947
200  5    500K  0.01    0.01/0.00/0.00              1.20/0.08/0.52    5/45/34         940/939/938
200  5    500K  0.1     0.02/0.01/0.00              0.28/0.01/0.44    22/89/76        946/946/944
200  5    500K  1       0.02/0.01/0.01              0.07/0.00/0.06    45/372/336      954/954/951
200  20   50K   0.01    0.05/0.03/0.04              1.59/0.18/0.85    19/51/21        943/931/934
200  20   50K   0.1     0.11/0.03/0.07              0.75/0.02/0.72    91/70/94        943/953/939
200  20   50K   1       0.09/0.04/0.09              0.18/0.01/0.07    345/627/580     958/962/945
200  20   100K  0.01    0.01/0.01/0.03              1.35/0.14/0.67    22/66/45        948/948/941
200  20   100K  0.1     0.02/0.02/0.04              0.54/0.01/0.54    131/135/128     935/955/949
200  20   100K  1       0.05/0.02/0.05              0.12/0.00/0.06    488/702/643     945/955/952
200  20   500K  0.01    0.01/0.00/0.01              1.04/0.06/0.42    49/83/72        941/954/947
200  20   500K  0.1     0.01/0.00/0.02              0.20/0.00/0.35    210/209/282     955/945/950
200  20   500K  1       0.04/0.01/0.01              0.03/0.00/0.03    617/830/869     948/958/953
200  100  50K   0.01    0.08/0.03/0.06              1.35/0.14/0.54    28/56/39        939/938/939
200  100  50K   0.1     0.20/0.05/0.09              0.56/0.02/0.40    313/357/243     949/949/942
200  100  50K   1       0.10/0.01/0.15              0.14/0.01/0.03    543/823/874     948/948/951
200  100  100K  0.01    0.07/0.02/0.04              1.11/0.12/0.39    47/77/53        938/935/947
200  100  100K  0.1     0.08/0.03/0.06              0.40/0.01/0.28    533/456/427     948/948
/951
200  100  100K  1       0.06/0.06/0.08              0.09/0.01/0.02    918/912/930     959/956/952
200  100  500K  0.01    0.01/0.00/0.02              0.89/0.05/0.21    63/91/104       946/936/937
200  100  500K  0.1     0.02/0.01/0.02              0.10/0.00/0.13    580/540/607     945/945/948
200  100  500K  1       0.04/0.03/0.05              0.01/0.00/0.01    936/920/941     960/953/950

Table 5-4. Bandwidth (as a ratio of error bound width to the true query answer) and coverage (over 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator on the three data sets. Results are shown for 200 strata (NS) with a varying number of pilot-sample records per stratum (PS) and sample sizes (SS), for three different query selectivities: 0.01%, 0.1%, and 1%.

First, for the type of selective queries we concentrate on in our work, the classic Neyman allocation is generally useless. As expected, the allocation tends to ignore strata with relevant records, resulting in "95% confidence bounds" that are accurate nowhere close to 95% of the time. Out of 162 different tests over the three data sets, the Neyman allocation produced confidence bounds that had greater than 90% coverage only eleven times, even though 95% bounds were specified. In 15 out of the 162 tests, the "95% confidence bounds" actually contained the answer 0 out of 1000 times! Second, the allocation produced by the proposed Bayes-Neyman method tends to be remarkably useful; that is, the bounds produced are both accurate and tight. In only 7 of the 162 tests was the coverage of the bounds produced by the Bayes-Neyman allocation found to be less than 93%, and coverage was often remarkably close to 95%. Furthermore, in the few cases where the classic Neyman bounds were actually worthwhile, the Bayes-Neyman bounds were far superior in terms of having a tighter bandwidth. Even if one looks only at the cases where the Neyman bounds were not ridiculous (where "ridiculous" bounds are arbitrarily defined to be those that had a coverage of less than 20%), the Bayes-Neyman bounds were actually tighter than the Neyman bounds 35 out of 70 times.
In other words, there were many cases where the Neyman allocation produced bounds that had coverage rates of only around 20%, whereas the Bayes-Neyman allocations produced bounds that were actually tighter, and still had coverage rates very close to the user-specified 95%. There are a few other interesting findings. Not surprisingly, increasing the number of strata generally gives tighter error bounds for fixed pilot and sample sizes, because it tends to increase the homogeneity of the records in each stratum. However, in practice there is a cost associated with increasing the number of strata, and so this cannot be done arbitrarily. Specifically, more strata may translate to more I/Os required to actually perform the sampling. One might typically store the records within a stratum in randomized order on disk. Thus, sampling from a given stratum requires only a sequential scan, but each additional stratum requires a random disk I/O. In addition, it is more difficult and more costly to maintain a large number of strata. We also find that using a larger pilot sample generally increases estimation accuracy. This is intuitive, since a larger pilot sample contains more information about the stratum, thus helping to make a better sampling allocation plan and providing a more accurate estimate. However, a large pilot sample incurs a greater cost to actually perform the pilot sampling. Explicitly studying this tradeoff is an interesting avenue for future work. Finally, we point out that even the rudimentary stratification that we tested in these experiments is remarkably successful if the correct sampling allocation is used. Consider the case of a 500K-record sample. For a query selectivity of 0.01%, only around 50 records in the sample will be accepted by the selection predicate encoded in f2(). This is why the bandwidth for the simple random sampling estimator with no stratification (L = 1) is so great: for the Person data set it is 1.74 and for the KDD data set it is 0.82.
The bounds are so wide that they are essentially useless. However, if the Bayes-Neyman allocation is used over 200 strata and a pilot sample of size 100, the bandwidths shrink to 0.05 and 0.21, respectively. These are far tighter. In the case of the Person data set, the bandwidth shrinks by nearly two orders of magnitude. For the KDD data set the reduction is more modest (a factor of four) due to the high dimensionality of the data, which tends to render the stratification less effective. Still, this suggests that perhaps the real issue to consider when stratifying in a database environment is not how to perform the stratification, but how to use the stratification in an effective manner.

5.8 Related Work

Broadly speaking, it is possible to divide the related prior research into two categories: works from the statistics literature, and works from the data management literature. The idea of applying Bayesian and/or superpopulation (model-based) methods to the allocation problem has a long history in statistics, and seems to have been studied with particular intensity in the 1970s. Given the number of papers on this topic, it is not feasible to reference all of them, though a small number are listed in the References section of the thesis [32, 103, 104]. At a high level, the key difference between this work and that prior work is the specificity of our work with respect to database queries. Sampling from a database is very unique in that the distribution of values that are aggregated is typically ill-suited to traditional parametric models. Due to the inclusion of the selection predicate encoded by f2(), the distribution of the f1() values that are aggregated tends to have a large "stovepipe" located at zero, corresponding to those records that are not accepted by f2(), with a more well-behaved distribution of values located elsewhere, corresponding to the f1() values for records that were accepted by f2().
The Bayes-Neyman allocation scheme proposed in this thesis explicitly allows for such a situation via its use of a two-stage model, where first a certain number of records are accepted by f2() (modeled via the random variable Xcnt) and then the f1() values for those accepted records are produced (modeled by Xval). This is quite different from the general-purpose methods described in the statistics literature, which typically attach a well-behaved, standard distribution to the mean and/or variance of each stratum [32, 104]. Sampling for the answer to database queries has also been studied extensively [63, 67, 96]. In particular, Chaudhuri and his coauthors have explicitly studied the idea of stratification for approximating database queries [18-20]. However, there is a key difference between that work and our own: these existing papers focus on how to break the data into strata, and not on how to sample the strata in a robust fashion. In that sense, our work is completely orthogonal to Chaudhuri et al.'s prior work, and our sampling plans could easily be used in conjunction with the workload-based stratifications that their methods can construct.

5.9 Conclusion

In this chapter, we have considered the problem of stratification for developing robust estimates for the answer to very selective aggregate queries. While the obvious problem to consider when stratifying is how to break the data into subsets, the more significant challenge may lie in developing a sampling plan at run time that actually uses the strata in a robust fashion. We have shown that the traditional Neyman sampling allocation can give disastrous results when it is used in conjunction with mildly to very selective queries. We have developed a unique Bayesian method for developing robust sampling plans. Our plans explicitly minimize the expected variance of the final estimator over the space of possible strata variances.
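The thesis arrives at its allocation through the Bayesian machinery of Sections 5.4 through 5.6. Purely as an illustration of the idea of minimizing expected variance over possible strata variances (this is a simplified stand-in, not the thesis's exact procedure, and all names here are hypothetical), one can average Monte Carlo draws of each stratum's unknown variance and then allocate in closed form:

```python
import math
import random

def bayes_allocation(strata_sizes, variance_samples, n):
    """Allocate n samples to minimize the *expected* estimator variance
    E[ sum_h N_h^2 * sigma_h^2 / n_h ], where the expectation is taken over
    Monte Carlo samples of each stratum's unknown variance (e.g. drawn from
    a posterior updated with the pilot).  For this objective the optimum is
    n_h proportional to N_h * sqrt(E[sigma_h^2])."""
    exp_var = [sum(vs) / len(vs) for vs in variance_samples]
    weights = [N * math.sqrt(v) for N, v in zip(strata_sizes, exp_var)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Stratum 2's pilot was uninformative, so its variance samples are spread
# out; averaging over them keeps the stratum in the plan instead of
# starving it the way a point estimate of zero variance would.
random.seed(7)
samples = [[4.0] * 50,                                       # confident
           [random.uniform(0.0, 40.0) for _ in range(50)]]   # uncertain
alloc = bayes_allocation([1000, 1000], samples, 100)
print(alloc)  # stratum 2 gets a nontrivial share despite a weak pilot
```

The key contrast with the classical allocation is that uncertainty about a stratum's variance increases (rather than eliminates) the samples it receives.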
We have shown that even when the resulting allocation is used with a very naive nearest-neighbor stratification, the increase in accuracy compared to simple random sampling is considerable. Even more significant is the fact that for highly selective queries, our sampling plans give results that are trustworthy, in the sense that the associated confidence bounds have near-perfect coverage.

CHAPTER 6
CONCLUSION

In this research work, we have studied and described the problem of efficient answering of complex queries on large data warehouses. Our approach for addressing the problem relies on approximation. We present sampling-based techniques which can be used to compute, very quickly, approximate answers along with error guarantees for long-running queries. The first part of this study addresses the problem of efficiently obtaining random samples of records satisfying arbitrary range selection predicates. The second part of the study develops statistical, sampling-based estimators for the specific class of queries that have a nested, correlated subquery. The problem addressed in this work is actually a generalization of the important problem of estimating the number of distinct values in a database. The third and final part of this study addresses the problem of estimating the result to queries having highly selective predicates. Since a uniform random sample is not likely to contain any records satisfying the selection predicate, our approach uses stratified sampling and develops stratified sampling plans to correctly identify high-density strata for arbitrary queries.

APPENDIX
EM ALGORITHM DERIVATION

Let Y_e be the information about record e ∈ EMP that can be observed, i.e., v = f1(e) and k' = cnt(e, SALE'). Let X_e be the information about record e that includes Y_e as well as the relevant data that cannot be observed, i.e., k = cnt(e, SALE).
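The derivation below produces standard mixture-model update rules. As a rough, self-contained illustration (the count factor h(k'; i) is dropped for brevity, and the data are synthetic rather than EMP'), one EM iteration over the observed f1() values can be sketched as:

```python
import math
import random

def em_step(vals, mu, sigma2, p):
    """One EM iteration for a shared-variance Gaussian mixture, using the
    update rules derived in the appendix (with the count factor h(k'; i)
    omitted, which reduces the model to a plain mixture over the values v)."""
    m = len(mu)
    # E-step: posterior p(i | Theta', e) for every observed value v
    resp = []
    for v in vals:
        w = [p[i] * math.exp(-(v - mu[i]) ** 2 / (2 * sigma2)) for i in range(m)]
        s = sum(w)
        resp.append([wi / s for wi in w])
    # M-step: the closed-form maximizers of E derived below
    n = len(vals)
    new_mu = [sum(r[i] * v for r, v in zip(resp, vals)) /
              sum(r[i] for r in resp) for i in range(m)]
    new_sigma2 = sum(r[i] * (v - new_mu[i]) ** 2
                     for r, v in zip(resp, vals) for i in range(m)) / n
    new_p = [sum(r[i] for r in resp) / n for i in range(m)]
    return new_mu, new_sigma2, new_p

random.seed(1)
vals = [random.gauss(0, 1) for _ in range(300)] + \
       [random.gauss(10, 1) for _ in range(300)]
mu, sigma2, p = [-1.0, 8.0], 4.0, [0.5, 0.5]
for _ in range(30):
    mu, sigma2, p = em_step(vals, mu, sigma2, p)
print(mu, sigma2, p)  # means near 0 and 10, weights near 0.5 each
```

Each call performs the E-step (the posterior class probabilities) followed by the M-step (the closed-form updates for the means, the shared variance, and the class weights), matching the structure of the derivation that follows.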
Then let:

f(X_e = (Y_e, k) | Θ) = (p_k / (√(2π) σ)) e^{-(v - μ_k)² / (2σ²)} × h(k'; k)

Also, let:

g(Y_e | Θ) = Σ_{i=0}^{m} (p_i / (√(2π) σ)) e^{-(v - μ_i)² / (2σ²)} × h(k'; i)

We then compute the posterior probability that e belongs to class i as:

p(i | Θ', e) = f(X_e = (Y_e, i) | Θ') / g(Y_e | Θ')

Then the logarithm of the expected probability that we would observe EMP' and SALE' is:

E = Σ_{e ∈ EMP'} Σ_{i=0}^{m} log(f(X_e = (Y_e, i) | Θ)) p(i | Θ', e)
  = Σ_{e ∈ EMP'} Σ_{i=0}^{m} p(i | Θ', e) × (log(p_i) − log(σ) − log(√(2π)) − (v − μ_i)²/(2σ²) + log(h(k'; i)))

To find the unknown parameters μ_i, σ and p_i, we maximize E for the given set of posterior probabilities at that step. We do this by taking partial derivatives of E w.r.t. each of these parameters and setting the result to zero:

∂E/∂μ_1 = Σ_{e ∈ EMP'} p(1 | Θ', e) (v − μ_1) / σ²

Setting this expression to zero gives:

μ_1 = (Σ_{e ∈ EMP'} p(1 | Θ', e) v) / (Σ_{e ∈ EMP'} p(1 | Θ', e))

We can obtain μ_2, ..., μ_m in a similar manner. By taking the partial derivative of E w.r.t. σ² and setting to zero we get:

σ² = (Σ_{e ∈ EMP'} Σ_{i=0}^{m} p(i | Θ', e) (v − μ_i)²) / |EMP'|

Finally, to evaluate the p_i's, we also consider the additional constraint that Σ_{i=0}^{m} p_i = 1. We can find the values of the p_i's that maximize E subject to this constraint by using the method of Lagrangian multipliers, to obtain:

p_i = (Σ_{e ∈ EMP'} p(i | Θ', e)) / |EMP'|

This completes the derivation of the update rules given in Section 4.5.2.

REFERENCES

1. IMDB data set. http://www.imdb.com
2. Person data set. http://usa.ipums.org/usa
3. Synoptic cloud report data set. http://cdiac.ornl.gov/epubs/ndp/ndp026b/nd06~t
4. Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate answering of group-by queries. Tech. Report, Bell Laboratories, Murray Hill, New Jersey (1999)
5. Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: SIGMOD, pp. 487-498 (2000)
6. Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: Join synopses for approximate query answering. In: SIGMOD, pp. 275-286 (1999)
7. Alon, N., Gibbons, P.B., Matias, Y., Szegedy, M.: Tracking join and self-join sizes in limited storage. In: PODS, pp. 10-20 (1999)
8.
Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. In: STOC, pp. 20-29 (1996)
9. Antoshenkov, G.: Random sampling from pseudo-ranked B+ trees. In: VLDB, pp. 375-382 (1992)
10. Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: SIGMOD, pp. 539-550 (2003)
11. Bradley, P.S., Fayyad, U.M., Reina, C.: Scaling clustering algorithms to large databases. In: KDD, pp. 9-15 (1998)
12. Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: ICDE, p. 6 (2006)
13. Bunge, J., Fitzpatrick, M.: Estimating the number of species: A review. Journal of the American Statistical Association 88, 364-373 (1993)
14. Carlin, B., Louis, T.: Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall (1996)
15. Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. The VLDB Journal 10(2-3), 199-223 (2001)
16. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.: Towards estimation error guarantees for distinct values. In: PODS, pp. 268-279 (2000)
17. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.: Towards estimation error guarantees for distinct values. In: PODS, pp. 268-279 (2000)
18. Chaudhuri, S., Das, G., Datar, M., Motwani, R., Narasayya, V.R.: Overcoming limitations of sampling for aggregation queries. In: ICDE, pp. 534-542 (2001)
19. Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach for approximate answering of aggregate queries. In: SIGMOD, pp. 295-306 (2001)
20. Chaudhuri, S., Das, G., Narasayya, V.: Optimized stratified sampling for approximate query processing. ACM TODS, to appear (2007)
21. Chaudhuri, S., Das, G., Srivastava, U.: Effective use of block-level sampling in statistics estimation. In: SIGMOD, pp. 287-298 (2004)
22. Chaudhuri, S., Motwani, R.: On sampling and relational operators. IEEE Data Eng.
Bull. 22(4), 41-46 (1999)
23. Chaudhuri, S., Motwani, R., Narasayya, V.: Random sampling for histogram construction: how much is enough? SIGMOD Rec. 27(2), 436-447 (1998)
24. Chaudhuri, S., Motwani, R., Narasayya, V.: On random sampling over joins. In: SIGMOD, pp. 263-274 (1999)
25. Cochran, W.: Sampling Techniques. Wiley and Sons (1977)
26. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B 39 (1977)
27. Diwan, A.A., Rane, S., Seshadri, S., Sudarshan, S.: Clustering techniques for minimizing external path length. In: VLDB, pp. 342-353 (1996)
28. Dobra, A.: Histograms revisited: when are histograms the best approximation method for aggregates over joins? In: PODS, pp. 228-237 (2005)
29. Dobra, A., Garofalakis, M., Gehrke, J., Rastogi, R.: Processing complex aggregate queries over data streams. In: SIGMOD Conference, pp. 61-72 (2002)
30. Domingos, P.: Bayesian averaging of classifiers and the overfitting problem. In: 17th International Conf. on Machine Learning (2000)
31. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman & Hall/CRC (1998)
32. Ericson, W.A.: Optimum stratified sampling using prior information. JASA 60(311), 750-771 (1965)
33. Evans, M., Hastings, N., Peacock, B.: Statistical Distributions. Wiley and Sons (2000)
34. Fan, C., Muller, M., Rezucha, I.: Development of sampling plans by using sequential (item by item) selection techniques and digital computers. Journal of the American Statistical Association 57, 387-402 (1962)
35. Ganguly, S., Gibbons, P., Matias, Y., Silberschatz, A.: Bifocal sampling for skew-resistant join size estimation. In: SIGMOD, pp. 271-281 (1996)
36. Ganti, V., Gehrke, J., Ramakrishnan, R.: Cactus: clustering categorical data using summaries. In: KDD, pp. 73-83 (1999)
37. Ganti, V., Lee, M.L., Ramakrishnan, R.: ICICLES: self-tuning samples for approximate query answering. In: VLDB, pp.
176-187 (2000)
38. Garcia-Molina, H., Widom, J., Ullman, J.D.: Database System Implementation. Prentice-Hall, Inc. (1999)
39. Gelman, A., Carlin, J., Stern, H., Rubin, D.: Bayesian Data Analysis, Second Edition. Chapman & Hall/CRC (2003)
40. Gibbons, P.B., Matias, Y.: New sampling-based summary statistics for improving approximate query answers. In: SIGMOD, pp. 331-342 (1998)
41. Gibbons, P.B., Matias, Y., Poosala, V.: Aqua project white paper. In: Technical Report, Bell Laboratories, Murray Hill, New Jersey, pp. 275-286 (1999)
42. Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.: Optimal and approximate computation of summary statistics for range aggregates. In: PODS (2001)
43. Goodman, L.: On the estimation of the number of classes in a population. Annals of Mathematical Statistics 20, 572-579 (1949)
44. Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In: ICDE, pp. 152-159 (1996)
45. Guha, S., Koudas, N., Srivastava, D.: Fast algorithms for hierarchical range histogram construction. In: PODS, pp. 180-187 (2002)
46. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMOD Conference, pp. 47-57 (1984)
47. Haas, P., Hellerstein, J.: Ripple joins for online aggregation. In: SIGMOD Conference, pp. 287-298 (1999)
48. Haas, P., Naughton, J., Seshadri, S., Stokes, L.: Sampling-based estimation of the number of distinct values of an attribute. In: 21st International Conference on Very Large Databases, pp. 311-322 (1995)
49. Haas, P., Naughton, J., Seshadri, S., Stokes, L.: Sampling-based estimation of the number of distinct values of an attribute. In: VLDB, pp. 311-322 (1995)
50. Haas, P., Stokes, L.: Estimating the number of classes in a finite population. Journal of the American Statistical Association 93, 1475-1487 (1998)
51. Haas, P.J.: Large-sample and deterministic confidence intervals for online aggregation.
In: Statistical and Scientific Database Management, pp. 51-63 (1997)
52. Haas, P.J.: The need for speed: Speeding up DB2 using sampling. IDUG Solutions Journal 10, 32-34 (2003)
53. Haas, P.J., Hellerstein, J.: Join algorithms for online aggregation. IBM Research Report RJ 10126 (1998)
54. Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. In: SIGMOD, pp. 287-298 (1999)
55. Haas, P.J., Koenig, C.: A bi-level Bernoulli scheme for database sampling. In: SIGMOD, pp. 275-286 (2004)
56. Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Fixed-precision estimation of join selectivity. In: PODS, pp. 190-201 (1993)
57. Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Selectivity and cost estimation for joins based on random sampling. J. Comput. Syst. Sci. 52(3), 550-569 (1996)
58. Haas, P.J., Naughton, J.F., Swami, A.N.: On the relative cost of sampling for join selectivity estimation. In: PODS, pp. 14-24 (1994)
59. Haas, P.J., Swami, A.N.: Sequential sampling procedures for query size estimation. In: SIGMOD, pp. 341-350 (1992)
60. Hellerstein, J., Avnur, R., Chou, A., Hidber, C., Olston, C., Raman, V., Roth, T., Haas, P.: Interactive data analysis: The CONTROL project. IEEE Computer 32(8), 51-59 (1999)
61. Hellerstein, J., Haas, P., Wang, H.: Online aggregation. In: SIGMOD Conference, pp. 171-182 (1997)
62. Hellerstein, J.M., Avnur, R., Chou, A., Hidber, C., Olston, C., Raman, V., Roth, T., Haas, P.J.: Interactive data analysis: The CONTROL project. In: IEEE Computer 32(8), pp. 51-59 (1999)
63. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: SIGMOD, pp. 171-182 (1997)
64. Hou, W.C., Ozsoyoglu, G.: Statistical estimators for aggregate relational algebra queries. ACM Trans. Database Syst. 16(4), 600-654 (1991)
65. Hou, W.C., Ozsoyoglu, G.: Processing time-constrained aggregate queries in CASE-DB. ACM Trans. Database Syst. 18(2), 224-261 (1993)
66.
Hou, W.C., Ozsoyoglu, G., Dogdu, E.: Error-constrained COUNT query evaluation in relational databases. SIGMOD Rec. 20(2), 278-287 (1991)
67. Hou, W.C., Ozsoyoglu, G., Taneja, B.K.: Statistical estimators for relational algebra expressions. In: PODS, pp. 276-287 (1988)
68. Hou, W.C., Ozsoyoglu, G., Taneja, B.K.: Processing aggregate relational queries with hard time constraints. In: SIGMOD, pp. 68-77 (1989)
69. Huang, H., Bi, L., Song, H., Lu, Y.: A variational EM algorithm for large databases. In: International Conference on Machine Learning and Cybernetics, pp. 3048-3052 (2005)
70. Ioannidis, Y.E.: Universality of serial histograms. In: VLDB, pp. 256-267 (1993)
71. Ioannidis, Y.E., Poosala, V.: Histogram-based approximation of set-valued query answers. In: VLDB (1999)
72. Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.: Optimal histograms with quality guarantees. In: VLDB, pp. 275-286 (1998)
73. Jermaine, C., Dobra, A., Arumugam, S., Joshi, S., Pol, A.: A disk-based join with probabilistic guarantees. In: SIGMOD, pp. 563-574 (2005)
74. Jermaine, C., Dobra, A., Arumugam, S., Joshi, S., Pol, A.: The sort-merge-shrink join. ACM Trans. Database Syst. 31(4), 1382-1416 (2006)
75. Jermaine, C., Dobra, A., Pol, A., Joshi, S.: Online estimation for subset-based SQL queries. In: 31st International Conference on Very Large Data Bases, pp. 745-756 (2005)
76. Jermaine, C., Pol, A., Arumugam, S.: Online maintenance of very large random samples. In: SIGMOD, pp. 299-310. ACM Press, New York, NY, USA (2004)
77. Kempe, D., Dobra, A., Gehrke, J.: Gossip-based computation of aggregate information. In: FOCS, pp. 482-491 (2003)
78. Krewski, D., Platek, R., Rao, J.: Current Topics in Survey Sampling. Academic Press (1981)
79. Lakshmanan, L.V.S., Pei, J., Han, J.: Quotient cube: How to summarize the semantics of a data cube. In: VLDB, pp. 778-789 (2002)
80. Lakshmanan, L.V.S., Pei, J., Zhao, Y.: QC-trees: An efficient summary structure for semantic OLAP.
In: SIGMOD, pp. 64-75 (2003)
81. Leutenegger, S.T., Edgington, J.M., Lopez, M.A.: STR: A simple and efficient algorithm for R-tree packing. In: ICDE, pp. 497-506 (1997)
82. Ling, Y., Sun, W.: A supplement to sampling-based methods for query size estimation in a database system. SIGMOD Rec. 21(4), 12-15 (1992)
83. Lipton, R., Naughton, J.: Query size estimation by adaptive sampling. In: PODS, pp. 40-46 (1990)
84. Lipton, R., Naughton, J., Schneider, D.: Practical selectivity estimation through adaptive sampling. In: SIGMOD Conference, pp. 1-11 (1990)
85. Lipton, R.J., Naughton, J.F.: Estimating the size of generalized transitive closures. In: VLDB, pp. 165-171 (1989)
86. Lipton, R.J., Naughton, J.F.: Query size estimation by adaptive sampling. J. Comput. Syst. Sci. 51(1), 18-25 (1995)
87. Luo, G., Ellmann, C.J., Haas, P.J., Naughton, J.F.: A scalable hash ripple join algorithm. In: SIGMOD, pp. 252-262 (2002)
88. Matias, Y., Vitter, J., Wang, M.: Wavelet-based histograms for selectivity estimation. In: SIGMOD Conference, pp. 448-459 (1998)
89. Matias, Y., Vitter, J.S., Wang, M.: Wavelet-based histograms for selectivity estimation. SIGMOD Record 27(2), 448-459 (1998)
90. Mingoti, S.: Bayesian estimator for the total number of distinct species when quadrat sampling is used. Journal of Applied Statistics 26(4), 469-483 (1999)
91. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, New York (1995)
92. Muralikrishna, M., DeWitt, D.: Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In: SIGMOD Conference, pp. 28-36 (1988)
93. Muth, P., O'Neil, P.E., Pick, A., Weikum, G.: Design, implementation, and performance of the LHAM log-structured history data access method. In: VLDB, pp. 452-463 (1998)
94. Naughton, J.F., Seshadri, S.: On estimating the size of projections. In: ICDT: Proceedings of the Third International Conference on Database Theory, pp. 499-513 (1990)
95.
Neal, R., Hinton, G.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models (1998)
96. Olken, F.: Random sampling from databases. Ph.D. Dissertation (1993)
97. Olken, F.: Random sampling from databases. Tech. Rep., Lawrence Berkeley National Laboratory (1993)
98. Olken, F., Rotem, D.: Simple random sampling from relational databases. In: VLDB, pp. 160-169 (1986)
99. Olken, F., Rotem, D.: Random sampling from B+ trees. In: VLDB, pp. 269-277 (1989)
100. Olken, F., Rotem, D.: Sampling from spatial databases. In: ICDE, pp. 199-208 (1993)
101. Olken, F., Rotem, D., Xu, P.: Random sampling from hash files. In: SIGMOD, pp. 375-386 (1990)
102. Piatetsky-Shapiro, G., Connell, C.: Accurate estimation of the number of tuples satisfying a condition. In: SIGMOD, pp. 256-276 (1984)
103. Rao, T.J.: On the allocation of sample size in stratified sampling. Annals of the Institute of Statistical Mathematics 20, 159-166 (1968)
104. Rao, T.J.: Optimum allocation of sample size and prior distributions: A review. International Statistical Review 45(2), 173-179 (1977)
105. Roussopoulos, N., Kotidis, Y., Roussopoulos, M.: Cubetree: organization of and bulk incremental updates on the data cube. In: SIGMOD, pp. 89-99 (1997)
106. Rowe, N.C.: Top-down statistical estimation on a database. SIGMOD Record 13(4), 135-145 (1983)
107. Rowe, N.C.: Antisampling for estimation: an overview. IEEE Trans. Softw. Eng. 11(10), 1081-1091 (1985)
108. Rusu, F., Dobra, A.: Statistical analysis of sketch estimators. In: To Appear, SIGMOD (2007)
109. Sarndal, C., Swensson, B., Wretman, J.: Model Assisted Survey Sampling. Springer, New York (1992)
110. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD, pp. 23-34 (1979)
111.
Severance, D.G., Lohman, G.M.: Differential files: Their application to the maintenance of large databases. ACM Trans. Database Syst. 1(3), 256-267 (1976)
112. Shao, J.: Mathematical Statistics. Springer-Verlag (1999)
113. Sismanis, Y., Deligiannakis, A., Roussopoulos, N., Kotidis, Y.: Dwarf: Shrinking the petacube. In: SIGMOD, pp. 464-475 (2002)
114. Sismanis, Y., Roussopoulos, N.: The polynomial complexity of fully materialized coalesced cubes. In: VLDB, pp. 540-551 (2004)
115. Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E., O'Neil, P., Rasin, A., Tran, N., Zdonik, S.: C-store: a column-oriented DBMS. In: VLDB, pp. 553-564 (2005)
116. Thiesson, B., Meek, C., Heckerman, D.: Accelerating EM for large databases. Mach. Learn. 45(3), 279-299 (2001)
117. Thorup, M., Zhang, Y.: Tabulation based 4-universal hashing with applications to second moment estimation. In: SODA, pp. 615-624 (2004)
118. Vitter, J.S., Wang, M.: Approximate computation of multidimensional aggregates of sparse data using wavelets. SIGMOD Rec. 28(2), 193-204 (1999)
119. Vitter, J.S., Wang, M., Iyer, B.: Data cube approximation and histograms via wavelets. In: CIKM, pp. 96-104 (1998)
120. Vysochanskii, D., Petunin, Y.: Justification of the 3-sigma rule for unimodal distributions. Theory of Probability and Mathematical Statistics 21, 25-36 (1980)
121. Yu, X., Zuzarte, C., Sevcik, K.C.: Towards estimating the number of distinct value combinations for a set of attributes. In: CIKM, pp. 656-663 (2005)

BIOGRAPHICAL SKETCH

Shantanu Joshi received his Bachelor of Engineering in Computer Science from the University of Mumbai, India, in 2000. After a brief stint of one year at Patni Computer Systems in Mumbai, he joined the graduate school at the University of Florida in fall 2001, where he received his Master of Science (MS) in 2003 from the Department of Computer and Information Science and Engineering.
In the summer of 2006, he was a research intern at the Data Management, Exploration and Mining Group at Microsoft Research, where he worked with Nicolas Bruno and Surajit Chaudhuri. Shantanu will receive a Ph.D. in computer science in August 2007 from the University of Florida and will then join the Database Server Manageability group at Oracle Corporation as a member of technical staff.

ACKNOWLEDGMENTS

Firstly, my sincerest gratitude goes to my advisor, Professor Chris Jermaine, for his invaluable guidance and support throughout my Ph.D. research work. During the initial several months of my graduate work, Chris was extremely patient and always led me towards the right direction whenever I would waver. His acute insight into the research problems we worked on set an excellent example and provided me immense motivation to work on them. He has always emphasized the importance of high-quality technical writing and has spent several painstaking hours reading and correcting my technical manuscripts. He has been the best mentor I could have hoped for, and I shall always remain indebted to him for shaping my career and, more importantly, my thinking. I am also very thankful to Professor Alin Dobra for his guidance during my graduate study. His enthusiasm and constant willingness to help has always amazed me. Thanks are also due to Professor Joachim Hammer for his support during the very early days of my graduate study. I take this opportunity to thank Professors Tamer Kahveci and Gary Koehler for taking the time to serve on my committee and for their helpful suggestions. It was a pleasure working with Subi Arumugam and Abhijit Pol on various collaborative research projects. Several interesting technical discussions with Mingxi Wu, Fei Xu, Florin Rusu, Laukik Chitnis and Seema Degwekar provided a stimulating work environment in the Database Center. This work would not have been possible without the constant encouragement and support of my family. My parents, Dr. Sharad Joshi and Dr. Hemangi Joshi, always encouraged me to focus on my goals and pursue them against all odds. My brother, Dr. Abhijit Joshi, has always placed trust in my abilities and has been an ideal example to follow since my childhood. My loving sister-in-law, Dr. Hetal Joshi, has been supportive
since the time I decided to pursue computer science.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
  1.1 Approximate Query Processing (AQP) - A Different Paradigm
  1.2 Building an AQP System Afresh
    1.2.1 Sampling Vs Precomputed Synopses
    1.2.2 Architectural Changes
  1.3 Contributions in This Thesis
2 RELATED WORK
  2.1 Sampling-based Estimation
  2.2 Estimation Using Non-sampling Precomputed Synopses
  2.3 Analytic Query Processing Using Non-standard Data Models
3 MATERIALIZED SAMPLE VIEWS FOR DATABASE APPROXIMATION
  3.1 Introduction
  3.2 Existing Sampling Techniques
    3.2.1 Randomly Permuted Files
    3.2.2 Sampling from Indices
    3.2.3 Block-based Random Sampling
  3.3 Overview of Our Approach
    3.3.1 ACE Tree Leaf Nodes
    3.3.2 ACE Tree Structure
    3.3.3 Example Query Execution in ACE Tree
    3.3.4 Choice of Binary Versus k-Ary Tree
  3.4 Properties of the ACE Tree
    3.4.1 Combinability
    3.4.2 Appendability
    3.4.3 Exponentiality
  3.5 Construction of the ACE Tree
    3.5.1 Design Goals
    3.5.2 Construction
    3.5.3 Construction Phase 1
    3.5.4 Construction Phase 2
    3.5.5 Combinability/Appendability Revisited
    3.5.6 Page Alignment
  3.6 Query Algorithm
    3.6.1 Goals
    3.6.2 Algorithm Overview
    3.6.3 Data Structures
    3.6.4 Actual Algorithm
    3.6.5 Algorithm Analysis
  3.7 Multi-Dimensional ACE Trees
  3.8 Benchmarking
    3.8.1 Overview
    3.8.2 Discussion of Experimental Results
  3.9 Conclusion and Discussion
4 SAMPLING-BASED ESTIMATORS FOR SUBSET-BASED QUERIES
  4.1 Introduction
  4.2 The Concurrent Estimator
  4.3 Unbiased Estimator
    4.3.1 High Level Description
    4.3.2 The Unbiased Estimator In Depth
    4.3.3 Why Is the Estimator Unbiased?
    4.3.4 Computing the Variance of the Estimator
    4.3.5 Is This Good?
  4.4 Developing a Biased Estimator
  4.5 Details of Our Approach
    4.5.1 Choice of Model and Model Parameters
    4.5.2 Estimation of Model Parameters
    4.5.3 Generating Populations From the Model
    4.5.4 Constructing the Estimator
  4.6 Experiments
    4.6.1 Experimental Setup
      4.6.1.1 Synthetic data sets
      4.6.1.2 Real-life data sets
    4.6.2 Results
    4.6.3 Discussion
  4.7 Related Work
  4.8 Conclusion
5 SAMPLING-BASED ESTIMATION OF LOW-SELECTIVITY QUERIES
  5.1 Introduction
  5.2 Background
    5.2.1 Stratification
    5.2.2 "Optimal" Allocation and Why It's Not
  5.4 Defining X
    5.4.1 Overview
    5.4.2 Defining Xcnt
    5.4.3 Defining Xval
    5.4.4 Combining The Two
    5.4.5 Limiting the Number of Domain Values
  5.5 Updating Priors Using The Pilot
  5.6 Putting It All Together
    5.6.1 Minimizing the Variance
    5.6.2 Computing the Final Sampling Allocation
  5.7 Experiments
    5.7.1 Goals
    5.7.2 Experimental Setup
    5.7.3 Results
    5.7.4 Discussion
  5.8 Related Work
  5.9 Conclusion
6 CONCLUSION
APPENDIX: EM ALGORITHM DERIVATION
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

4-1 Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 24 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator.

4-2 Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 24 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator.
4-3 Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 18 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator.

4-4 Observed standard error as a percentage of the total aggregate value of all records in the database for 8 queries over 3 real-life data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator.

5-1 Bandwidth (as a ratio of error bound width to the true query answer) and Coverage (for 1000 query runs) for a Simple Random Sampling estimator for the KDD Cup data set. Results are shown for varying sample sizes and for three different query selectivities: 0.01%, 0.1% and 1%.

5-2 Average running time of Neyman and Bayes-Neyman estimators over three real-world data sets.

5-3 Bandwidth (as a ratio of error bound width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 20 strata and for varying number of records in the pilot sample per stratum (PS), and sample sizes (SS), for three different query selectivities: 0.01%, 0.1% and 1%.

5-4 Bandwidth (as a ratio of error bound width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 200 strata with varying number of records in the pilot sample per stratum (PS), and sample sizes (SS), for three different query selectivities: 0.01%, 0.1% and 1%.

LIST OF FIGURES

1-1 Simplified architecture of a DBMS
3-1 Structure of a leaf node of the ACE tree.
3-2 Structure of the ACE tree.
3-3 Random samples from section 1 of L3.
3-4 Combining samples from L3 and L5.
3-5 Combining two sections of leaf nodes of the ACE tree.
3-6 Appending two sections of leaf nodes of the ACE tree.
3-7 Choosing keys for internal nodes.
3-8 Exponentiality property of ACE tree.
3-9 Phase 2 of tree construction.
3-10 Execution runs of query answering algorithm with (a) 1 contributing section, (b) 6 contributing sections, (c) 7 contributing sections and (d) 16 contributing sections.
3-11 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 0.25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.
3-12 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.
3-13 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.
3-14 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph is an extension of Figure 3-12 and shows results till all three sampling techniques return all the records matching the query predicate.
3-16 Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 0.25% of the database tuples.
3-17 Sampling rate of an ACE tree vs. rate for an R-tree, and scan of a randomly permuted file, with a spatial selection predicate accepting 2.5% of the database tuples.
3-18 Sampling rate of an ACE tree vs. rate for an R-tree, and scan of a randomly permuted file, with a spatial selection predicate accepting 25% of the database tuples.
4-1 Sampling from a superpopulation
4-2 Six distributions used to generate, for each e in EMP, the number of records s in SALE for which f3(e, s) evaluates to true.
5-1 Beta distribution with parameters α = β = 0.5.

… [63] called Online aggregation (OLA). They propose an interactive interface for data exploration and analysis where records are retrieved in a random order. Using these random samples, running estimates and error bounds are computed and immediately displayed to the user. As time progresses, the size of the random sample keeps growing and so the estimate is continuously refined. At a predetermined time interval, the refined estimate, along with its improved accuracy, is displayed to the user. If at any point of time during the execution the user is satisfied with the accuracy of the answer, she can terminate further execution. The system also gives an overall progress indicator based on the fraction of records that have been sampled thus far. Thus, OLA provides an interface where the user is given a rough estimate of the result very quickly.
… [25, 109] in order to support AQP. We make this choice due to the following important advantages of sampling over precomputed synopses. The accuracy of an estimate computed by using samples can be easily improved by obtaining more samples to answer the query. On the other hand, if the estimate computed by using synopses is not sufficiently accurate, a new synopsis providing greater accuracy would have to be built; since this would require scanning the data set, it is impractical. Secondly, sampling is very amenable to scalability. Even for extremely large data sets of the order of hundreds of gigabytes, it is generally possible to accommodate a small sample in main memory and use efficient in-memory algorithms to process it. If this is not possible, disk-based samples and algorithms have also been proposed [76] and are equally effective as their in-memory counterparts. This is an important benefit of sampling. … Figure 1-1 depicts the various components from a simplified architecture of a DBMS. The four components that require major changes in order to support sampling-based AQP are as follows. Index/file/record manager: The use of traditional index structures like B+-Trees is not appropriate for obtaining random samples, because such index structures order records based on record search key values, which is actually the opposite of obtaining records in a random order. Hence, for AQP it is important to provide physical structures or file organizations which support efficient retrieval of random samples. Execution engine: The execution engine needs to be revamped completely so that it can use the random samples returned by the lower level to execute the query on them. Further, the result of the query needs to be scaled up appropriately for the size of the entire database. This component would also need to be able to compute accuracy guarantees for the approximate answer.
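The scale-up step just described can be made concrete with a small sketch (Python here; this code is not from the thesis, and the function name and the in-memory list standing in for the database are illustrative assumptions): the mean of a uniform random sample is expanded by the population size to estimate a SUM query, and a CLT-based confidence interval supplies the accuracy guarantee.

```python
import math
import random
import statistics

def estimate_sum(sample, population_size, z=1.96):
    """Scale a uniform random sample up to an estimate of the population SUM,
    together with a CLT-based error bound (confidence-interval half-width)."""
    n = len(sample)
    estimate = population_size * statistics.fmean(sample)
    s2 = statistics.variance(sample) if n > 1 else 0.0
    # Variance of the scaled-up estimator is roughly N^2 * s^2 / n
    # (ignoring the finite-population correction for simplicity).
    half_width = z * population_size * math.sqrt(s2 / n)
    return estimate, half_width

rng = random.Random(7)
population = [rng.uniform(0.0, 100.0) for _ in range(100_000)]  # stand-in table
sample = rng.sample(population, 1_000)
est, hw = estimate_sum(sample, len(population))
```

As more samples arrive, the same function can simply be re-run on the growing sample; the half-width shrinks roughly as 1/sqrt(n), which is the behavior OLA exposes to the user.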
Figure 1-1. Simplified architecture of a DBMS.

… Olken and Rotem [96, 98-101] and Antoshenkov [9], though the idea of using a survey sample for estimation in the statistics literature goes back much earlier than these works. Most of the work by Olken and Rotem describes how to perform simple random sampling from databases. Estimation for several types of database tasks has been attempted with random samples. The rest of this section presents important works on sampling-based estimation of major database tasks. Some of the initial work on estimating the selectivity of join queries is due to Hou et al. [67, 68]. They present unbiased and consistent estimators for estimating the join size and also provide an algorithm for cluster sampling. In [64] they propose unbiased estimators for COUNT aggregate queries over arbitrary relational algebra expressions. However, computation of the variance of their estimators is very complex [67]. They also do not provide any bounds on the number of random samples required for estimation. Adaptive sampling has been used for estimation of the selectivity of predicates in relational selection and join operations [83, 84, 86] and for approximating the size of a relational projection operation [94]. Adaptive sampling has also been used in [85] to estimate transitive closures of database relations. The authors point out the benefits and generality of using sampling for selectivity estimation over parametric methods, which make assumptions about an underlying probability distribution for the data, as well as over nonparametric methods, which require storing and maintaining synopses about the … Haas and Swami [59] observe that using a loose upper bound for the maximum subquery size can lead to sampling more subqueries than necessary, potentially increasing the cost of sampling significantly. Double sampling, or two-phase sampling, has been used in [66]
for estimating the result of a COUNT query with a guaranteed error bound at a certain confidence level. The error bound is guaranteed by performing sampling in two steps. In the first step a small pilot sample is used to obtain preliminary information about the input relation. This information is then used to compute the size of the sample for the second step, such that the estimator is guaranteed to produce an estimate with the desired error bound. As Haas and Swami [59] point out, the drawback of using double sampling is that there is no theoretical guidance for choosing the size of the pilot sample. This could lead to an unpredictably imprecise estimate if the pilot sample size is too small, or an unnecessarily high sampling cost if the pilot sample size is too large. In their work [59], Haas and Swami present sequential sampling techniques which provide an estimate of the result size and also bound the error in estimation with a prespecified probability. They present two algorithms in the paper to estimate the size of a query result. Although both algorithms have been proven to be asymptotically correct and efficient, the first algorithm suffers from the problem of undercoverage. This means that in practice the probability with which it estimates the query result within the computed error bound is less than the specified confidence level of the algorithm. This problem is addressed by the second algorithm. … [82] point out that general sampling-based estimation methods have a high cost of execution, since they make an overly restrictive assumption of no knowledge about the overall characteristics of the data. In particular, they note that estimation of the overall mean and variance of the data not only incurs cost but also introduces error in estimation. The authors instead suggest an alternative approach of actually keeping track of these characteristics in the database at a minimal overhead. A detailed study of the cost of sampling-based methods to estimate join query sizes appears in [58].
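The two-step procedure can be sketched as follows. This is a hedged illustration, not the algorithm of [66]: the helper name, the Bernoulli-variance formula used to size the second phase of a COUNT estimate, and the in-memory list standing in for the relation are all assumptions of this sketch.

```python
import math
import random

def two_phase_count(table, pred, pilot_size, half_width, z=1.96, rng=random):
    """Two-phase (double) sampling for a COUNT query: a pilot sample gives a
    rough selectivity p, which fixes the phase-two sample size needed for the
    scaled-up COUNT to meet the requested CLT error half-width."""
    N = len(table)
    pilot = rng.sample(table, pilot_size)
    p = sum(map(pred, pilot)) / pilot_size            # pilot selectivity
    # Solve  z * N * sqrt(p * (1 - p) / n) <= half_width  for n.
    n = math.ceil((z * N / half_width) ** 2 * p * (1 - p))
    n = min(N, max(n, pilot_size))                    # clamp to sensible range
    phase2 = rng.sample(table, n)
    return N * sum(map(pred, phase2)) / n, n

rng = random.Random(42)
table = list(range(100_000))
est, n = two_phase_count(table, lambda r: r % 10 == 0, 500, 2_000, rng=rng)
```

Note how the pilot-size choice criticized by Haas and Swami shows up directly: a tiny pilot makes the estimate of p (and hence n) unreliable, while an oversized pilot wastes I/O before the real sampling even starts.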
The paper systematically analyses the factors which influence the cost of a sampling-based method to estimate join selectivities. Based on their analysis, the findings can be summarized as follows: (a) When the measure of precision of the estimate is absolute, the cost of sampling increases with the number of relations involved in the join as well as with the sizes of the relations themselves. (b) When the measure of precision of the estimate is relative, the cost of using sampling increases with the sizes of the relations, but decreases as the number of input relations increases. (c) When the distribution of the join attribute values is uniform or highly skewed for all input relations, the cost of sampling tends to be low, while it is high when only some of the input relations have a skewed join attribute value distribution. (d) The presence of tuples in a relation which do not join with any tuples from other relations always increases the cost of sampling. Haas et al. [56, 57] study and compare the performance of new as well as previous sampling-based procedures for estimating the selectivity of queries with joins. In particular, they identify estimators which have minimum variance after a fixed number of sampling steps have been performed. They note that the use of indexes on input relations can further … [35] describe how to estimate the size of a join in the presence of skew in the data by using a technique called bifocal sampling. This technique classifies the tuples of each input relation into two groups, sparse and dense, based on the number of tuples with the same value for the join attribute. Every combination of these groups is then subject to different estimation procedures. Each of these estimation procedures requires a sample size larger than a certain value (in terms of the total number of tuples in the input relation) to provide an estimate within a small constant factor of the true join size. In order to guarantee estimates with the specified accuracy, bifocal sampling also requires the total join size and the join sizes from sparse-sparse subjoins to be greater than a certain threshold. Gibbons and Matias [40]
introduce two sampling-based summary statistics called concise samples and counting samples, and present techniques for their fast and incremental maintenance. Although the paper describes summary statistics rather than on-the-fly sampling techniques, the summary statistics are created from random samples of the underlying data and are actually defined to describe characteristics of a random sample of the data. Since summary statistics of a random sample require a much smaller amount of memory than the sample itself, the paper describes how information from a much larger sample can be stored in a given amount of memory by storing sample statistics instead of using the memory to store actual random samples. Thus, the authors claim that since information from a larger sample can be stored by their summary statistics, the accuracy of approximate answers can be boosted. Chaudhuri, Motwani and Narasayya [22, 24] present a detailed study of the problem of efficiently sampling the output of a join operation without actually computing the join. … [63] propose a system called Online Aggregation (OLA) that can support online execution of analytic-style aggregation queries. They propose the system to have a visual interface which displays the current estimate of the aggregate query along with error bounds at a certain confidence level. Then, as time progresses, the system continually refines the estimate and at the same time shrinks the width of the error bounds. The user, who is presented with such a visual interface, has at all times an option to terminate further execution of the query in case the error bound width is satisfactory for the given confidence level. The authors propose the use of random sampling from input relations to provide estimates in OLA. Further, they describe some of the key changes that would be required in a DBMS to support OLA. In [51], Haas describes statistical techniques for computing error bounds in OLA. The work on OLA eventually grew into the UC Berkeley CONTROL project. In their article [62], Hellerstein et al. describe various issues in providing interactive data analysis and possible approaches to address those issues. Haas and Hellerstein [53, 54]
propose a family of join algorithms called ripple joins to perform relational joins in an OLA framework. Ripple joins were designed to minimize the time until an acceptably precise estimate of the query result is made available, as opposed to minimizing the time to completion of the query, as in a traditional DBMS. For … [87] present an online parallel hash ripple join algorithm to speed up the execution of the ripple join, especially when the join selectivity is low and also when the user wishes to continue execution till completion. The algorithm is assumed to be executed at a fixed set of processor nodes. At each node, a hash table is maintained for every relation. Moreover, every bucket in each hash table could have some tuples stored in memory and some others stored on disk. The join algorithm proceeds in two phases; in the first phase, tuples from both relations are retrieved in a random order and distributed to the processor nodes so that each node would perform roughly the same amount of work for executing the join. By using multiple threads at each node, production of join tuples from the in-memory hash table buckets begins even as tuples are being distributed to the various processors. The second phase begins after redistribution from the first phase is complete. In this phase, a new in-memory hash table is created which uses a hashing function different from the function used in phase 1. The tuples in the disk-resident buckets of the hash table of phase 1 are then hashed according to the hashing function of phase 2 and joined. The algorithm provides a considerable speedup factor over the one-node ripple join, provided its memory requirements are met. Jermaine et al. [73, 74] point out that the drawback of both the ripple join algorithms described above is that the statistical guarantees provided by the estimator are valid only as long as the output of the join can be accommodated in main memory. In order to counteract this problem, they propose the Sort-Merge-Shrink join algorithm as a generalization of the ripple join which can provide error guarantees throughout execution. … [48]
They provide an overview of the estimators used in the database and statistics literature and also develop several new sampling-based estimators for the distinct-value estimation problem. They propose a new hybrid sampling estimator which explicitly adapts to different levels of data skew. Their hybrid estimator performs a Chi-square test to detect skew in the distribution of the attribute values. If the data appears to be skewed, then Shlosser's estimator is used, while if the test does not detect skew, a smoothed jackknife estimator (which is a modification of the conventional jackknife estimator) is used. The authors attribute the dearth of work on sampling-based estimation of the number of distinct values to the inherent difficulty of the problem, while noting that it is a much harder problem than estimating the selectivity of a join. Haas and Stokes [50] present a detailed study of the problem of estimating the number of classes in a finite population. This is equivalent to the database problem of estimating the number of distinct values in a relation. The authors make recommendations about which statistical estimator is appropriate subject to constraints, and finally claim from empirical results that a hybrid estimator which adapts according to data skew is the most superior estimator.
… [16], which establishes a negative result stating that no sampling-based estimator for estimating the number of distinct values can guarantee small error across all input distributions unless it examines a large fraction of the input data. They also present a Guaranteed Error Estimator (GEE) whose error is provably no worse than their negative result. Since the GEE is a general estimator providing optimal error over all distributions, the authors note that its accuracy may be lower than that of some previous estimators on specific distributions. Hence, they propose an estimator called the Adaptive Estimator (AE), which is similar in spirit to Haas et al.'s hybrid estimator [50] but, unlike the latter, is not composed of two distinct estimators. Rather, the AE considers the contribution of data items having high and low frequencies in a single unified estimator. In the AQUA system [41] for approximate answering of queries, Acharya et al. [6] propose using synopses for estimating the result of relational join queries involving foreign-key joins, rather than using random samples from the base relations. These synopses are actually precomputed samples from a small set of distinguished joins and are called join synopses in the paper. The idea of join synopses is that by precomputing samples from a small set of distinguished joins, these samples can be used for estimating the result of many other joins. The concept is applicable in a k-way join where each join involves a primary and foreign key of the participating relations. The paper describes that if workload information is available, it can be used to design an optimal allocation for the join synopses that minimizes the overall error in the approximate answers over the workload. Acharya et al. [5] propose using a mix of uniform and biased samples for approximately answering queries with a GROUP BY clause. Their sampling technique, called congressional sampling, relies on using precomputed samples which are a hybrid union of uniform and biased samples. They assume that the selectivity of the query predicate is not so low that their precomputed sample completely misses one or more groups from the result of … [4] for constructing the congressional samples. Ganti et al. [37]
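As a rough sketch of the GEE's commonly reported form (the precise estimator and its analysis are in [16]; the sqrt(N/n) scale factor on singleton values used here is a summary of that work and should be checked against the paper), where f_j counts the attribute values that appear exactly j times in the sample:

```python
import math
from collections import Counter

def gee_distinct_estimate(sample, population_size):
    """GEE-style sketch: scale the number of values seen exactly once by
    sqrt(N/n), and add in the values seen two or more times, which are
    assumed to be reliably 'discovered' by the sample."""
    n = len(sample)
    # f_j = number of distinct values appearing exactly j times in the sample
    freq = Counter(Counter(sample).values())
    f1 = freq.get(1, 0)
    rest = sum(count for j, count in freq.items() if j >= 2)
    return math.sqrt(population_size / n) * f1 + rest
```

For example, a sample [1, 1, 2, 3] of size 4 from a table of 16 rows has f1 = 2 singletons and one repeated value, giving an estimate of sqrt(16/4) * 2 + 1 = 5 distinct values.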
describe a biased sampling approach, which they call ICICLES, to obtain random samples which are tuned to a particular workload. Thus, if a tuple is chosen by many queries in a workload, it has a higher probability of being selected in the self-tuning sample as compared to tuples which are chosen by fewer queries. Since this is a non-uniform sample, traditional sampling-based estimators must be adapted for these samples. The paper describes modified estimators for the common aggregation operations. It also describes how the self-tuning samples are tuned in the presence of a dynamically changing workload. Chaudhuri et al. [18] note that uniform random sampling to estimate aggregate queries is ineffective when the distribution of the aggregate attribute is skewed or when the query predicate has a low selectivity. They propose using a combination of two methods to address this problem. Their first approach is to separately index those attribute values which contribute significantly to the query result; this method is called Outlier Indexing in the paper. The second approach proposed in the paper is to exploit workload information to perform weighted sampling. According to this technique, records which satisfied many queries in the workload are sampled more than records that satisfied fewer queries. Chaudhuri, Das and Narasayya [19, 20] describe how workload information can be used to precompute a sample that minimizes the error for the given workload. The problem of selection of the sample is framed as an optimization problem, so that the error in estimation of the workload queries using the resulting sample is minimized. When the actual incoming queries are identical to queries in the workload, this approach gives a solution with minimal error across all queries. The paper also describes how the choice of … [10]
note that a uniformly random sample can lead to inaccurate answers for many queries. They observe that for such queries, estimation using an appropriately biased sample can lead to more accurate answers as compared to estimation using uniformly random samples. Based on this idea, the paper describes a technique called small group sampling, which is designed to approximately answer aggregation queries having a GROUP BY clause. The distinctive feature of this technique, as compared to previous biased sampling techniques like congressional sampling, is that a new biased sample is chosen for every GROUP BY query, such that it maximizes the accuracy of estimating that query, rather than trying to devise a biased sample that maximizes the accuracy over an entire workload of queries. According to this technique, larger groups from the output of the GROUP BY queries are sampled uniformly, while the small groups are sampled at a higher rate to ensure that they are adequately represented. The group samples are obtained on a per-query basis from an overall sample which is computed in a preprocessing phase. In fact, database sampling has been recognized as an important enough problem that ISO has been working to develop a standard interface for sampling from relational database systems [55], and significant research efforts are directed at providing sampling from database systems by vendors such as IBM [52]. … [106, 107]. The technique proposed is called antisampling and involves the creation of a special auxiliary structure called a database abstract. The abstract considers the distribution of several attributes and groups of attributes. Correlations between different attributes can also be characterized as statistics. This technique was found to be faster than random sampling, but required domain knowledge about the various attributes.
… [110] and Piatetsky-Shapiro and Connell [102]. Selectivity estimation of queries with multidimensional predicates using histograms was presented by Muralikrishna and DeWitt [92]. They show that the maximum error in estimation can be controlled more effectively by choosing equi-depth histograms as opposed to equi-width histograms. Ioannidis [70] describes how serial histograms are optimal for aggregate queries involving arbitrary join trees with equality predicates. Ioannidis and Poosala [71] have also studied how histograms can be used to approximately answer non-aggregate queries which have a set-based result. Several histogram construction schemes [42, 45, 72] have been proposed in the literature. Jagadish et al. [72] describe techniques for constructing histograms which can minimize a given error metric, where the error is introduced because of the approximation of the values in a bucket by a single value associated with the bucket. They also describe techniques for augmenting histograms with additional information so that they can be used to provide accuracy guarantees for the estimated results. Construction of approximate histograms by considering only a random sample of the data set was investigated by Chaudhuri et al. [23]. Their technique uses an adaptive sampling approach to determine the sample size that would be sufficient to generate approximate histograms which can guarantee prespecified error bounds in estimation. They also extend their work to consider duplicate values in the domain of the attribute for which a histogram is to be constructed. The problem of estimating the number of distinct value combinations of a set of attributes has been studied by Yu et al. [121]. Due to the inherent difficulty of developing a good sampling-based estimation solution to the problem, they propose using additional information about the data in the form of histograms, indexes or data cubes. In a recent paper [28], Dobra presents a study of when histograms are best suited for approximation. The paper considers the long-standing assumption that histograms are … [89] and also for computing aggregates over data cubes [118, 119]. Chakrabarti et al. [15]
present techniques for approximate computation of results for aggregate as well as non-aggregate queries using Haar wavelets. One more summary structure that has been proposed for approximating the size of joins is the sketch. Sketches are small-space summaries of data suited for data streams. A sketch generally consists of multiple counters corresponding to random variables, which enable them to provide approximate answers with error guarantees for a priori decided queries. Some of the earliest work on sketches was presented by Alon, Gibbons, Matias and Szegedy [7, 8]. Sketching techniques with improved error guarantees and faster update times have been proposed as Fast-Count sketches [117]. A statistical analysis of various sketching techniques, along with recommendations on their use for estimating join sizes, appears in [108]. … [44] for processing of analytic-style aggregation queries over data warehouses. The paper describes a generalization of the SQL GROUP BY operator to multiple dimensions by introducing the data cube operator. This operator treats each of the possible aggregation attributes as a dimension of a high-dimensional space. The aggregate of a particular set of attribute values is considered as a point in this space. Since the cube holds precomputed aggregate values over all dimensions, it can be used to quickly compute results to GROUP BY queries over multiple dimensions. The data cube is precomputed … [79] and quotient cube tree [80] structures are such compressed representations of the data cube which preserve semantic relationships while also allowing processing of point and range queries. Another approach that has been employed for shrinking the data cube while at the same time preserving all the information in it is the Dwarf [113, 114] structure. Dwarf identifies and eliminates redundancies in prefixes and suffixes of the values along the different dimensions of a data cube. The paper shows that by eliminating prefix as well as suffix redundancies, both dense as well as sparse data cubes can be compressed effectively. The paper also shows improved cube construction time, query response time as well as update time as compared to cube trees [105]
Although the Dwarf structure improves the performance of the data cube model, it still suffers from the inherent drawback of the data cube model: it is not suitable for efficiently answering arbitrarily complex queries, such as queries with correlated subqueries. Recently, a new column-oriented architecture for database systems called C-Store was proposed by Stonebraker et al. [115]. The system has been designed for an environment that has a much higher number of database reads as opposed to writes, such as a data warehousing environment. C-Store logically splits the attributes of a relational table into projections, which are collections of attributes, and stores them on disk such that all values … [115], the system was still under development. … Randomized algorithms [91] have become vital data management tools. In particular, random sampling is one of the most important sources of randomness for such algorithms. Scores of algorithms that are useful over large data repositories either require a randomized input ordering for data (i.e., an online random sample), or else they operate over samples of the data to increase the speed of the algorithm. Although applications requiring randomization abound in the data management literature, we specifically consider online aggregation [54, 62, 63] in this thesis. In online aggregation, database records are processed one at a time, and used to keep the user informed of the current "best guess" as to the eventual answer to the query. If the records are input into the online aggregation algorithm in a randomized order, then it becomes possible to give probabilistic guarantees on the relationship of the current guess to the eventual answer to the query. Despite the obvious importance of random sampling in a database environment and dozens of recent papers on the subject (approximately 20 papers from recent SIGMOD and VLDB conferences are concerned with database sampling), there has been relatively little work towards actually supporting random sampling with physical database file organizations. The classic work in this area (by Olken and his coauthors [98, 99, 101]
suffers from a key drawback: each record sampled from a database file requires a random disk I/O. At a current rate of around 100 random disk I/Os per second per disk, this means that it is possible to retrieve only 6,000 samples per minute. If the goal is fast approximate query processing or speeding up a data mining algorithm, this is clearly unacceptable. … [96] or by scanning a randomly permuted file. In general, the view can produce samples from a predicate involving any attribute having a natural ordering, and a straightforward extension of the ACE Tree can be used for sampling from multidimensional predicates. The resulting sample is online, which means that new samples are returned continuously as time progresses, and in a manner such that at all times, the set of samples returned is a true random sample of all of the records in the view that … [96] in a slightly different context, where the goal was to maintain a fixed-size sample of the database; in contrast, as we describe subsequently, our materialized sample view is a structure allowing online sampling. … There are iterative techniques [9, 96, 99-101] that sample directly from a relational selection predicate, thus avoiding the aforementioned problem of obtaining too few relevant records in the sample. Olken [96] presents a comprehensive analysis and comparison of many such techniques. In this section we discuss the technique of sampling from a materialized view organized as a ranked B+-Tree, since it has been proven to be the most efficient existing iterative sampling technique in terms of the number of disk accesses. A ranked B+-Tree is a regular B+-Tree whose internal nodes have been augmented with information which permits one to find the ith record in the file. Let us assume that the relation SALE presented in the Introduction is stored as a ranked B+-Tree file indexed on the attribute DAY, and we want to retrieve a random sample of records whose DAY attribute value falls between 11-28-2004 and 03-02-2005. This translates to the following SQL query: …

1. Find the rank r1 of the record which has the smallest …
2. Find the rank r2 of the record which has the largest …
3. While the sample size …

… Figure 3-1 depicts an example leaf node in the ACE Tree, with attribute range values written above each section and section numbers marked below. Records within each section are shown as circles.
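Since the SQL query and the tails of the three steps above are lost in this copy, here is one plausible reading of the ranked-B+-Tree procedure, sketched over a sorted in-memory list that stands in for the rank-augmented tree (function and variable names are illustrative, not from the thesis):

```python
import bisect
import random

def range_sample(sorted_keys, low, high, k, rng=random):
    """Iterative sampling from a ranked B+-Tree, sketched over a sorted list:
    ranks r1 and r2 bracket the records matching the range predicate, and
    each iteration fetches the record at a uniformly chosen rank (with
    replacement, so every draw is an independent uniform sample)."""
    r1 = bisect.bisect_left(sorted_keys, low)        # smallest rank with key >= low
    r2 = bisect.bisect_right(sorted_keys, high) - 1  # largest rank with key <= high
    if r1 > r2:
        return []                                    # no record matches the predicate
    # In the real structure, each indexed fetch below is one root-to-leaf
    # traversal, i.e. one random disk I/O per sampled record.
    return [sorted_keys[rng.randint(r1, r2)] for _ in range(k)]

keys = list(range(0, 1_000, 3))                      # stand-in for the DAY index
picks = range_sample(keys, 100, 200, 10, rng=random.Random(3))
```

The per-draw random I/O in the last step is exactly the drawback the ACE Tree is designed to avoid.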
Figure 3-1. Structure of a leaf node of the ACE tree.

… [27]. Each internal node has the following components:
1. A range R of key values associated with the node.
2. A key value k that splits R and partitions the data on the left and right of the node.
3. Pointers ptr_l and ptr_r, that point to the left and right children of the node.
4. Counts cnt_l and cnt_r, that give the number of database records falling in the ranges associated with the left and right child nodes. These values can be used, for example, during evaluation of online aggregation queries which require the size of the population from which we are sampling [54].

Figure 3-2 shows the logical structure of the ACE Tree. I_i,j refers to the jth internal node at level i. The root node is labeled with a range I_1,1.R = [0-100], signifying that all records in the data set have key values within this range. The key of the root node partitions I_1,1.R into I_2,1.R = [0-50] and I_2,2.R = [51-100]. Similarly, each internal node divides the range of its descendants with its own key. The ranges associated with each section of a leaf node are determined by the ranges associated with each internal node on the path from the root node to the leaf. For example, if we consider the path from the root node down to leaf node L4, the ranges that we encounter along the path are 0-100, 0-50, 26-50 and 38-50. Thus for L4, L4.S1 has a random sample of records in the range 0-100, L4.S2 has a random sample in the range 0-50, L4.S3 has a random sample in the range 26-50, while L4.S4 has a random sample in the range 38-50.

Figure 3-2. Structure of the ACE tree.

… Section 3.6. Let Q = [30-65] be our example query, postulated over the ACE Tree depicted in Figure 3-2. The query algorithm starts at I_1,1, the root node. Since I_2,1.R overlaps Q, the algorithm decides to explore the left child node, labeled I_2,1 in Figure 3-2. At this point the two range values associated with the left and right children of I_2,1 are 0-25 and 26-50. Since the left child range has no overlap with the query range, the algorithm chooses to explore the right child next. At this child node (I_3,2), the algorithm picks leaf node L3 to be the first leaf node retrieved by the index. Records from section 1 of L3 (which totally encompasses Q) are filtered for Q and returned immediately to the consumer of the sample. … Figure 3-3 shows the one random sample from section 1 of L3, which can be used directly for answering query Q.
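The four components of an internal node, and the section/range layout of a leaf, might be declared as follows. This is a sketch only; the thesis gives no code, and the field names simply mirror the notation above (R, k, ptr_l/ptr_r as left/right, cnt_l/cnt_r).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class LeafNode:
    # sections[i] holds a random sample drawn from ranges[i]; the ranges
    # shrink from the whole key space down to the leaf's own range,
    # one per level of the tree.
    ranges: List[Tuple[int, int]]
    sections: List[list]

@dataclass
class InternalNode:
    R: Tuple[int, int]                       # range of key values for this node
    k: int                                   # key that splits R between children
    left: Optional[object] = None            # ptr_l (internal or leaf node)
    right: Optional[object] = None           # ptr_r
    cnt_l: int = 0                           # records under the left child
    cnt_r: int = 0                           # records under the right child

# The top of the example tree from the text: [0-100] split at 50, then 25/75.
root = InternalNode(R=(0, 100), k=50)
root.left = InternalNode(R=(0, 50), k=25)
root.right = InternalNode(R=(51, 100), k=75)
```

A leaf such as L4 would then carry ranges [(0, 100), (0, 50), (26, 50), (38, 50)], one sample section per range, exactly as in the walkthrough above.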
Figure 3-3. Random samples from section 1 of L3.

Next, the algorithm again starts at the root node and now chooses to explore the right child node, I_2,2. After performing range comparisons, it explores the left child of I_2,2, which is I_3,3, since I_3,4.R has no overlap with Q. The algorithm chooses to visit the left child node of I_3,3 next, which is leaf node L5. This is the second leaf node to be retrieved. As depicted in Figure 3-4, since L5.R1 encompasses Q, the records of L5.S1 are filtered and returned immediately to the user as two additional samples from R. Furthermore, section 2 records are combined with the section 2 records of L3 to obtain a random sample of records in the range 0-100. These are again filtered and returned, giving four more samples from Q. Section 3 records are also combined with the section 3 records of L3 to obtain a sample of records in the range 26-75. Since this range also encompasses Q, the records are again filtered and returned, adding four more records to our sample. Finally, section 4 records are stored in memory for later use. Note that after retrieving just two leaf nodes in our small example, the algorithm obtains eleven randomly selected records from the query range. However, in a real index, this number would be many times greater. Thus, the ACE Tree supports "fast first" … Figure 3-2. The B+-Tree sampling algorithm would need to preselect which nodes to explore. Since four leaf nodes in the tree are needed to span the query range, there is a reasonably high likelihood that the first four samples taken would need to access all four leaf nodes. As the ACE Tree query algorithm progresses, it goes on to retrieve the rest of the leaf nodes in the order L4, L6, L1, L7, L2, L8.

Figure 3-4. Combining samples from L3 and L5.

… Section 3.6.

Figure 3-5. Combining two sections of leaf nodes of the ACE tree.

… Figure 3-5, first we read leaf node L1 and filter the second section in order to produce a random sample of size n1 from Ql, which is returned to the user. Next we read leaf node L3, and filter its second section L3.S2 to produce a random sample of size n2 from Ql, which is also returned to the user. At this point, the two sets returned to the user constitute a single random sample from Ql of size n1 + n2. This means that as more and more nodes are read from disk, the records contained in them can be combined to obtain an ever-increasing random sample from any range query.
… Figure 3-6, we can append the third section from node L3 to the third section from node L1 and filter the result to produce yet another random sample from Ql. This means that sections are never wasted.

Figure 3-6. Appending two sections of leaf nodes of the ACE tree.

While the formal statement of the exponentiality property is a bit complicated, the net result is simple: there is always a pair of leaf nodes whose sections can be appended to form a set which can be filtered to quickly obtain a sample from any range query Q'. As an illustration, consider query Q over the ACE Tree of Figure 3-2. Note that the number of database records falling in Q is greater than one-fourth, but less than half, the database size. The exponentiality property assures us that Q can be totally covered by appending sections of two different leaf nodes. In our example, this means that Q can be covered by appending section 3 of nodes L4 and L6. If RC = L4.R3 ∪ L6.R3, then by the invariant given above we can claim that |Q(R)| >= (1/2)|RC(R)|. … Figure 3-7. After the data set is sorted, the median record for the entire data set is determined (this value is 50 in our example). This record's key will be used as the key associated with the root of the ACE Tree, and will determine L.R2 for every leaf node in the tree. We denote this key value by I_1,1.k, since the value serves as the key of the first internal node in level 1 of the tree. After determining the key value associated with the root node, the medians of each of the two halves of the data set partitioned by I_1,1.k are chosen as keys for the two internal nodes at the next level: I_2,1.k and I_2,2.k, respectively. In the example of Figure 3-7, these values are 25 and 75.

Figure 3-7. Choosing keys for internal nodes.

I_2,1.k and I_2,2.k, along with I_1,1.k, will determine L.R3 for every leaf node in the tree. The process is then repeated recursively until enough medians … Figure 3-2. Figure 3-8 shows the keys of the internal nodes as medians of the data set R. We also consider two example queries, Q1 and Q2, such that the number of database records falling in Q2 is greater than one-fourth but less than one-half of the database size, while the number of database records falling in Q1 is more than half the database size.

Figure 3-8. Exponentiality property of ACE tree.
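The recursive median selection just described can be sketched as follows (illustrative code, not from the thesis; it assumes the data set is already sorted and simply records one list of keys per tree level):

```python
def choose_keys(sorted_keys, levels):
    """Phase-1 sketch: pick internal-node keys as recursive medians of the
    sorted data set, level by level. The root key is the overall median;
    the next level takes the medians of the two halves, and so on."""
    keys_by_level = []
    segments = [sorted_keys]
    for _ in range(levels):
        level_keys, next_segments = [], []
        for seg in segments:
            mid = len(seg) // 2
            level_keys.append(seg[mid])                   # median of this segment
            next_segments += [seg[:mid], seg[mid + 1:]]   # recurse on both halves
        keys_by_level.append(level_keys)
        segments = next_segments
    return keys_by_level
```

On the keys 1 through 7, two levels of this procedure yield 4 at the root and 2 and 6 at the next level, mirroring how 50, then 25 and 75, are chosen in the example of Figure 3-7.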
… (Figure 3-2). Let RC1 = L4.R2 ∪ L8.R2. Then all the database records fall in RC1. Moreover, since |Q1(R)| >= |R|/2, we have |Q1(R)| >= (1/2)|RC1(R)|. Similarly, Q2 can be answered by appending section 3 of (for example) L4 and L6. If RC2 = L4.R3 ∪ L6.R3, then half the database records fall in RC2. Also, since |Q2(R)| >= |R|/4, we have |Q2(R)| >= (1/2)|RC2(R)|. This can be generalized to obtain the invariant stated in Section 3.4.3.

Figure 3-9. Phase 2 of tree construction.

1. Assign a uniformly generated random number between 1 and h to each record as its section number.
2. Associate an additional random number with the record that will be used to identify the leaf node to which the record will be assigned.
3. Finally, reorganize the file by performing an external sort to group records in a given leaf node and a given section together.

Figure 3-9(a) depicts our example data set after we have assigned each record a randomly generated section number, assuming four sections in each leaf node. In Step 2, the algorithm assigns one more randomly generated number to each record, which will identify the leaf node to which the record will be assigned. We assume for our example that the number of leaf nodes is 2^(h-1) = 2^3 = 8. The number to identify the leaf node is assigned as follows.

1. First, the section number of the record is checked. We denote this value as s. …
[...] Referring to Figure 3-7, we see that the key of the root node is 50. Since the key of the record is 7, which is less than 50, the record will be assigned to a leaf node in the left subtree of the root. Hence we assign a leaf node between 1 and 4 to this record. In our example, we randomly choose leaf node 3. For the next record, having key value 10, we see that the section number assigned is 3. To assign a leaf node to this record, we initially compare its key with the key of the root node. Referring to Figure 3-7, we see that 10 is smaller than 50; hence we then compare it with 25, which is the key of the left child node of the root. Since the record key is smaller than 25, we assign the record to some leaf node in the left subtree of the node with key 25 by assigning to it a random number between 1 and 2. The section number and leaf node identifiers for each record are written in a small amount of temporary disk space associated with each record. Once all records have been assigned to leaf nodes and sections, the data set is reorganized into leaf nodes using a two-pass external sorting algorithm as follows: records are sorted in ascending order of their leaf node number, and records with the same leaf node number are arranged in ascending order of their section number. The reorganized data set is depicted in Figure 3-9(c).

[...] Figure 3-10, when we compare the paths taken by Stab 1 and Stab 2. The algorithm chooses to traverse to the left child of the root node during the first stab, while during the second stab it chooses to traverse to the right child of the root node. The advantage of retrieving leaf nodes in this back-and-forth sequence is that it allows us to quickly retrieve a set of leaf nodes with the most disparate sections possible in a given number of stabs.

[Figure: Execution runs of the query answering algorithm with (a) 1 contributing section, (b) 6 contributing sections, (c) 7 contributing sections and (d) 16 contributing sections.]

The reason that we want a non-homogeneous set of nodes is that nodes from very distant portions of a query range will tend to have sections covering large ranges that do not overlap. This allows us to append sections of newly retrieved leaf nodes with the corresponding sections of previously retrieved leaf nodes. The samples obtained can then be filtered and immediately returned.
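The Phase-2 leaf-node assignment walk described earlier (compare the record's key against median keys, halving the range of candidate leaf nodes at each level, then pick a leaf uniformly from the surviving range) can be sketched as follows. This is a sketch under assumptions: the `median_keys` layout and the helper name are hypothetical, and a record with section number s is assumed to descend s-1 internal levels, as in the thesis examples.

```python
import random

def assign_leaf(key, median_keys, h, section):
    """Walk section-1 levels of median keys; each comparison halves the
    range [lo, hi] of candidate leaf-node ids, then a leaf is drawn
    uniformly from the surviving range.  median_keys[level][pos] is the
    key of the internal node at that level (level 0 is the root)."""
    lo, hi = 1, 2 ** (h - 1)                 # candidate leaf-node ids
    pos = 0
    for level in range(section - 1):
        mid = (lo + hi) // 2
        if key < median_keys[level][pos]:
            hi, pos = mid, 2 * pos           # descend into the left subtree
        else:
            lo, pos = mid + 1, 2 * pos + 1   # descend into the right subtree
    return random.randint(lo, hi)
```

With h = 4 and median keys [[50], [25, 75], ...], a record with key 10 and section 3 is compared with 50 and then 25, ending with a leaf drawn from {1, 2}, matching the example above.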
Figure 3-10 illustrates the choices made by the algorithm at each internal node during four separate stabs. Note that when the algorithm reaches an internal node where the range associated with one of the child nodes has no overlap with the query range, the algorithm always picks the child node that has overlap with the query, irrespective of the value of the indicator bit. The only exception to this is when all leaf nodes of the subtree rooted at an internal node which overlaps the query range have been accessed. In such a case, the internal node which overlaps the query range is not chosen and is never accessed again.

    Let root be the root of the ACE Tree
    While (root is not done)
        Lookup(root)

    Lookup(Node curr_node)
        If (curr_node is an internal node)
            left_node = curr_node->get_left_node();
            right_node = curr_node->get_right_node();
            If (left_node is done AND right_node is done)
                Mark curr_node as done
            Else if (left_node is done AND right_node is not done)
                Lookup(right_node);
            Else if (right_node is done AND left_node is not done)
                Lookup(left_node);
            Else if (both children are not done)
                If (Q overlaps only with left_node.R)
                    Lookup(left_node);
                Else if (Q overlaps only with right_node.R)
                    Lookup(right_node);
                Else  // Q overlaps both sides or none
                    If (next node is LEFT)
                        Lookup(left_node);
                        Set next node to RIGHT;
                    If (next node is RIGHT)
                        Lookup(right_node);
                        Set next node to LEFT;
        Else  // curr_node is a leaf node
            Combine_Tuples(Q, curr_node);
            Mark curr_node as done

The Combine_Tuples algorithm determines the sections that are required to be combined with every new section s that is retrieved, and then searches for them in the array buckets[]. If all sections are found, it combines them with s and removes them from buckets[]. If it does not find all the required sections in buckets[], it stores s in buckets[].
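The buckets[] bookkeeping just described can be sketched as follows (a hypothetical sketch: sections are represented by their section numbers, and `required` maps each section to the section numbers needed to span the query):

```python
def combine_tuples(new_sections, required, buckets):
    """For each newly retrieved section s, check whether every section
    number required to span the query is already buffered in `buckets`.
    If so, combine them with s (removing them from the buffer);
    otherwise buffer s itself and wait for its partners."""
    combined_groups = []
    for s in new_sections:
        needed = required[s]
        if all(i in buckets for i in needed):
            group = [s] + [buckets.pop(i) for i in needed]
            combined_groups.append(sorted(group))
        else:
            buckets[s] = s        # store s until its partners arrive
    return combined_groups
```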
    Combine_Tuples(Query Q, LeafNode node)
        For each section s in node do
            Store the section numbers required to be combined
                with s to span Q, in a list list
            flag = true
            For each section number i in list
                If buckets[] does not have section i
                    flag = false
            If (flag)
                Combine all sections from list with s
            Else
                Store s in the appropriate bucket

[...] Section 3.5.4, except that the appropriate key attribute is used while performing comparisons with the internal nodes. Finally, the data set is sorted into leaf nodes as in Figure 3-9(c). Query answering with the kd-ACE Tree can use the Shuttle algorithm described earlier with a few minor modifications. Whenever a section is retrieved by the algorithm, only records which satisfy all predicates in the query should be returned. Also, the mth sections of two leaf nodes can be combined only if they match in all m dimensions. The nth sections of two leaf nodes can be appended only if they match in the first n-1 dimensions and form a contiguous interval over the nth dimension.

[Figure: Sampling rate of an ACE tree vs. rate for a B+-tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 0.25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.]

[Figure: Sampling rate of an ACE tree vs. rate for a B+-tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.]

[...] sequential file scan, as well as with the obvious extension of Antoshenkov's algorithm to a two-dimensional R-Tree.
[Figure: Sampling rate of an ACE tree vs. rate for a B+-tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.]

1. ACE Tree Query Algorithm: The ACE Tree was implemented exactly as described in this thesis. In order to use the ACE Tree to aid in sampling from the SALE relation, a materialized sample view for the relation was created, using SALE.DAY as the indexed attribute.
2. Random sampling from a B+-Tree: Antoshenkov's algorithm for sampling from a ranked B+-Tree was implemented as described in Algorithm 1. The B+-Tree used in the experiment was a primary index on the SALE relation (that is, the underlying data were actually stored within the tree), and was constructed using the standard B+-Tree bulk construction algorithm.
3. Sampling from a randomly permuted file: We implemented this random sampling technique as described in Section 3.2.1 of this chapter. This is the standard sampling technique used in previous work on online aggregation. The SALE relation was randomly permuted by assigning a random key value k to each record. All of the records from SALE were then sorted in ascending order of each k value using a two-phase, multi-way merge sort (TPMMS) (see Garcia-Molina et al. [38]). As the sorted records are written back to disk in the final pass of the TPMMS, k is removed from the file. To sample from a range predicate using a randomly permuted file, the

[Figure: Sampling rate of an ACE tree vs. rate for a B+-tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph is an extension of Figure 3-12 and shows results until all three sampling techniques return all the records matching the query predicate.]
file is scanned from front to back and all records matching the range predicate are immediately returned. For the first set of experiments, we synthetically generated the SALE relation to be 20 GB in size with 100-byte records, resulting in around 200 million records in the relation. We began the first set of experiments by sampling from 10 different range selection predicates over SALE using the three sampling techniques described above. 0.25% of the records from MySam satisfied each range selection predicate. For each of the three random sampling algorithms, we recorded the total number of random samples retrieved by the algorithm at each time instant. The average number of random samples obtained for each of the ten queries was then calculated. This average is plotted as a percentage of the total number of records in SALE along the Y-axis in Figure 3-11. On the X-axis, we have plotted the elapsed time as a percentage of the time required to scan the entire relation. We chose [...]

Figure 3-15. Number of records needed to be buffered by the ACE Tree for queries with (a) 0.25% and (b) 2.5% selectivity. The graphs show the number of records buffered as a fraction of the total database records versus time, plotted as a percentage of the time required to scan the relation.
[...] Figure 3-12 and Figure 3-13. For all the three figures, results are shown for the first 15 seconds of execution, corresponding to approximately 4% of the time required to scan the relation. We show an additional graph in Figure 3-14 for the 2.5% selectivity case, where we plot results until all the three record retrieval algorithms return all the records matching the query predicate. Finally, we provide experimental results to indicate the number of records that need to be buffered by the ACE Tree query algorithm for two different query selectivities. Figure 3-15(a) shows the minimum, maximum and average number of records stored for ten different queries having a selectivity of 0.25%, while Figure 3-15(b) shows similar results for queries having selectivity 2.5%.

Experiment 2. For the second set of experiments, we add an additional attribute AMOUNT to the SALE relation and test the following two-dimensional range query: [...]

[...] Section 3.7. It was used to create a materialized sample view over the DAY and AMOUNT attributes.
2. Random sampling from an R-Tree: Antoshenkov's algorithm for sampling from a ranked B+-Tree was extended in the obvious fashion for sampling from an R-Tree [46]. Just as in the case of the B+-Tree, the R-Tree is created as a primary index, and the data from the SALE relation are actually stored in the leaf nodes of the tree. The R-Tree was constructed in bulk using the well-known Sort-Tile-Recursive [81] bulk construction algorithm.

[...] Figure 3-16. On the X-axis, we have plotted the elapsed time as a percentage of the time required to scan the entire relation. The test was then repeated with two more selection predicates that are satisfied by 2.5% and 25% of the SALE relation's records, respectively. The results are plotted in Figure 3-17 and Figure 3-18, respectively.

[Figure: Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 0.25% of the database tuples.]
the randomly permuted file is almost useless, due to the fact that the chance that any given record is accepted by the relational selection predicate is very low. On the other hand, the B+-Tree (and the R-Tree over multidimensional data) performs relatively well for highly selective queries. The reason for this is that during the sampling, if the query range is small, then all the leaf pages of the B+-Tree (or R-Tree) containing records that match the query predicate are retrieved very quickly. Once all of the relevant pages are in the buffer, the sampling algorithm does not have to access the disk to satisfy subsequent sample requests, and the rate of record retrieval increases rapidly. However, for less selective queries, the randomly permuted file works well, since it can make use of an efficient, sequential disk scan to retrieve records. As long as a relatively large fraction of the records retrieved match the selection predicate, the amount of waste incurred by scanning unwanted records as well is small compared to the additional efficiency gained by the sequential scan. On the other hand, when the range associated with a query having

[Figure: Sampling rate of an ACE tree vs. rate for an R-tree, and scan of a randomly permuted file with a spatial selection predicate accepting 2.5% of the database tuples.]
high selectivity is very large, the time required to load all of the relevant B+-Tree (or R-Tree) pages into memory using random disk I/Os is prohibitive. Even if the query is run long enough that all of the relevant pages are touched, for a query with high selectivity the buffer manager cannot be expected to buffer all the B+-Tree (or R-Tree) pages that contain records matching the query predicate. This is the reason that the curve for the B+-Tree in Figure 3-13, or for the R-Tree in Figure 3-18, never leaves the y-axis for the time range plotted. The net result of this is that if an ACE Tree were not used, it would probably be necessary to use both a B+-Tree and a randomly permuted file in order to ensure satisfactory performance in the general case. Again, this is a point which seems to strongly favor use of the ACE Tree.

An observation we make from Figure 3-14 is that if all the three record retrieval algorithms are allowed to run to completion, we find that the ACE Tree is not the first to complete execution.

[Figure: Sampling rate of an ACE tree vs. rate for an R-tree, and scan of a randomly permuted file with a spatial selection predicate accepting 25% of the database tuples.]

Thus, there is generally a crossover point beyond which the sampling rate of an alternative random sampling technique is higher than the sampling rate of the ACE Tree. However, the important point is that such a transition always occurs very late in the query execution, by which time the ACE Tree has already retrieved almost 90% of the possible random samples. We found this trend for all the different query selectivities we tested, with single-dimensional as well as multidimensional ACE Trees. Thus, we emphasize that the existence of such a crossover point in no way belittles the utility of the ACE Tree, since in practical applications where random samples are used, the number of random samples required is very small. Since the ACE Tree provides the desired number of random samples (and many more) much faster than the other two methods, it still emerges as the top performer among the three methods for obtaining random samples.

Finally, Figure 3-15 shows the memory requirement of the ACE Tree to store records that match the query predicate but cannot be used as yet to answer the query. The results in Figure 3-15 show that the ACE Tree has a reasonable memory requirement, since a very small fraction of the total number of records is buffered by it.

[...] [111] is applied. Specifically, one could maintain the differential file as a randomly permuted file, or even a second ACE Tree, and when a relational selection query is posed, in order to draw a random sample from the query one selects the next sample from either the primary ACE Tree or the differential file with an appropriate hypergeometric probability (for an idea of how this could be done, see the recent paper of Brown and Haas [12] for a discussion of how to draw a single sample from multiple data set partitions). Thus, we argue that the lack of an algorithm to update the ACE tree incrementally may not be a tremendous drawback. Finally, we close the chapter by asserting that the importance of having indexing methods that can handle insertions incrementally is often overstated in the research

[...] [93]. Such structures still require on the order of one random I/O per update, rendering it impossible to efficiently process bulk updates consisting of millions of records without simply rebuilding the structure from scratch. Thus, we feel that the drawbacks associated with the ACE Tree do not prevent its utility in many real-world situations.
[...] [88], histograms [92] and sketches [29]. Not the least of those is generality: it is very easy to efficiently draw a sample from a large data set in a single pass using reservoir techniques [34]. Then, once the sample has been drawn, it is possible to guess, with greater or lesser accuracy, the answer to virtually any statistical query over those sets. Samples can easily handle many different database queries, including complex functions in relational selection and join predicates. The same cannot be said of the other approximation methods, which generally require more knowledge of the query during synopsis construction, such as the attribute that will appear in the SELECT clause of the SQL query corresponding to the desired statistical calculation. However, one class of aggregate queries that remain difficult or impossible to answer with samples are the so-called "subset" queries, which can generally be written in SQL in the form:

SELECT SUM(f1(r))
FROM R AS r
WHERE f2(r) AND NOT EXISTS
    (SELECT * FROM S AS s WHERE f3(r,s))

Note that the function f2 can be incorporated into f1 if we have f1 evaluate to zero if f2 is not true; thus, in the remainder of the chapter we will ignore f2. An example of such [...]

[...] [17, 49], but aggregates over DISTINCT queries remain an open problem. Similarly, it is possible to write an aggregate query where records with identical values may appear more than once in the data, but should be considered no more than once by the aggregate function, as a subset-based SQL query. For example:

SELECT SUM(e.SAL)
FROM EMP AS e
WHERE NOT EXISTS
    (SELECT * FROM EMP AS e2 WHERE id(e) [...]

[...] [17, 49, 50] and one method that requires an index on the inner relation [75], there is also little relevant work in the data management literature; we presume this is due to the difficulty of the problem. Researchers have considered the difficulty of the more limited problem of sampling for distinct values in some detail [17].

Our Contributions

[...] [39]. Bayesian methods generally make use of mild and reasonable distributional assumptions about the data in order to greatly increase estimation accuracy, and have become very popular in statistics in the last few decades. Using this method in the context of answering subset-based queries presents a number of significant technical challenges whose solutions are detailed in this chapter, including:
- The definition of an appropriate generative statistical model for the problem of sampling for subset-based queries.
- The derivation of a unique Expectation Maximization algorithm [26] to learn the model from the database samples.
- The development of algorithms for efficiently generating many new random data sets from the model, without actually having to materialize them.
Through an extensive set of experiments, we show that the resulting biased Bayesian estimator has excellent accuracy on a wide variety of data. The biased estimator also has the desirable property that it provides something closely related to classical confidence bounds, which can be used to give the user an idea of the accuracy of the associated estimate.

[...] [75], but we present it here because it forms the basis for the unbiased estimator described in the next section. We begin our description with an even simpler estimation problem. Given a one-attribute relation R(A) consisting of nR records, imagine that our goal is to estimate the sum over attribute A of all the records in R. A simple, sample-based estimator would be as follows. We obtain a random sample R′ of size nR′ of all the records of R, compute total = Σ_{r∈R′} r.A, and then scale up total to output total × nR/nR′ as the estimate for the final sum. Not only is this estimator extremely simple to understand, but it is also unbiased, consistent, and its variance reduces monotonically with increasing sample size.

We can extend this simple idea to define an estimator for the NOT EXISTS query considered in the introduction. We start by obtaining random samples EMP′ and SALE′ of sizes nEMP′ and nSALE′, respectively, from the relations EMP and SALE. We then evaluate the NOT EXISTS query over the samples of the two relations. We compare every record in EMP′ with every record in SALE′, and if we do not find a matching record (that is, one for which f3 evaluates to true), then we add its f1 value to the estimated total. Lastly, we scale up the estimated total by a factor of nEMP/nEMP′ to obtain the final estimate, which we term M:

M = (nEMP / nEMP′) × Σ_{e∈EMP′} f1(e) × I(cnt(e, SALE′) = 0)

[...] [75], where it is called the "concurrent estimator" since it samples both relations concurrently. Unfortunately, on expectation, the estimator is often severely biased, meaning that it is, on average, incorrect. The reason for this bias is fairly intuitive. The algorithm compares a record from EMP with all records from SALE′, and if it does not find a matching record in SALE′, it classifies the record as having no match in the entire SALE relation. Clearly, this classification may be incorrect for certain records in EMP, since although they might have no matching record in SALE′, it is possible that they may match with some record from the part of SALE that was not included in the sample. As a result, M typically overestimates the answer to the NOT EXISTS query. In fact, the bias of M is: [...]

4.3.1 High-Level Description

In order to develop an unbiased estimator for Bias(M), it is useful to first rewrite the formula for Bias(M) in a slightly different fashion. We subsequently refer to the set of records in EMP that have i matches in SALE as "class i records". Denote the sum of the aggregate function over all records of class i by ti, so ti = Σ_{e∈EMP} f1(e) I(cnt(e, SALE) = i) (note that the final answer to the NOT EXISTS query is the quantity t0). Given that the probability that a record with i matches in SALE happens to have no matches in SALE′ is φ(nSALE, nSALE′, i), we can rewrite the expression for the bias of M as:

Bias(M) = Σ_{i=1}^{m} φ(nSALE, nSALE′, i) × ti    (4-1)

The above equation computes the bias of M since it computes the expected sum over the aggregate attribute of all records of EMP which are incorrectly classified as class 0 records by M. Let m be the maximum number of matching records in SALE for any record of EMP. Equation 4-1 suggests an unbiased estimator for Bias(M) because it turns out that it is easy to generate an unbiased estimate for tm: since no records other than those with m matches in SALE can have m matches in SALE′, we can simply count the sum [...]
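The concurrent estimator M described above can be sketched in a few lines (a hypothetical sketch; relation samples are plain Python lists and f1/f3 are passed in as functions):

```python
def concurrent_estimator(emp_sample, sale_sample, n_emp, f1, f3):
    """Sum f1 over sampled EMP records that have no f3-match in the
    SALE sample, then scale up by n_emp / |emp_sample|.  As discussed
    above, this tends to overestimate the true NOT EXISTS answer,
    because a record may have matches only in the unsampled part of
    SALE."""
    total = sum(f1(e) for e in emp_sample
                if not any(f3(e, s) for s in sale_sample))
    return total * n_emp / len(emp_sample)
```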
[...] Equation 4-1 to develop an unbiased estimator for Bias(M). We use the following additional notation for this section and the remainder of this chapter:
- δ_{k,i} is a 0/1 (non-random) variable which evaluates to 1 if the ith tuple of EMP has k matches in SALE, and evaluates to 0 otherwise.
- s_k is the sum of f1 over all records of EMP′ having k matching records in SALE′: s_k = Σ_{i=1}^{nEMP′} I(cnt(e_i, SALE′) = k) f1(e_i).
- ρ0 is nEMP′/nEMP, the sampling fraction of EMP.
- Y_i is a random variable which governs whether or not the ith record of EMP appears in EMP′.
- h(k; nSALE, nSALE′, i) is the hypergeometric probability that out of the i interesting records in a population of size nSALE, exactly k will appear in a random sample of size nSALE′. For compactness of representation, we will refer to this probability as h(k, i) in the remainder of the thesis, since our sampling fraction never changes.

t̂_k = (1/ρ0) Σ_{j=1}^{nEMP} Y_j δ_{k,j} f1(e_j)    (4-2)

ŝ_k = Σ_{j=1}^{nEMP} Σ_{i=k}^{m} Y_j δ_{i,j} h(k, i) f1(e_j)    (4-3)

The fact that E[ŝ_k] = E[s_k] (proven in Section 4.3.3) is significant, because there is a simple algebraic relationship between the various ŝ variables and the various t̂ variables. Thus, we can express one set in terms of the other, and then replace each ŝ_k with s_k in order to derive an unbiased estimator for each t̂. The benefit of doing this is that since s_k is defined as the sum of f1 over all records of EMP′ having k matching records in SALE′, it can be directly evaluated from the samples EMP′ and SALE′.
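The probability h(k; nSALE, nSALE′, i) defined above is a standard hypergeometric mass function and can be computed directly (a small sketch using Python's math.comb):

```python
from math import comb

def h(k, n_sale, n_sale_prime, i):
    """Probability that exactly k of the i 'interesting' records in a
    population of size n_sale land in a uniform random sample of size
    n_sale_prime (the quantity written h(k, i) in the text)."""
    if k < 0 or k > i or k > n_sale_prime or (i - k) > (n_sale - n_sale_prime):
        return 0.0
    return (comb(i, k) * comb(n_sale - i, n_sale_prime - k)
            / comb(n_sale, n_sale_prime))
```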
[...] Equation 4-3:

ŝ_{m-r} = Σ_{j=1}^{nEMP} Σ_{i=m-r}^{m} Y_j δ_{i,j} h(m-r, i) f1(e_j)
        = Σ_{i=m-r}^{m} h(m-r, i) Σ_{j=1}^{nEMP} Y_j δ_{i,j} f1(e_j)
        = Σ_{i=0}^{r} h(m-r, m-r+i) Σ_{j=1}^{nEMP} Y_j δ_{m-r+i,j} f1(e_j)
        = Σ_{i=0}^{r} h(m-r, m-r+i) ρ0 t̂_{m-r+i}    (4-4)

By rearranging the terms, we get the following important recursive relationship:

t̂_{m-r} = (ŝ_{m-r} - ρ0 Σ_{i=1}^{r} h(m-r, m-r+i) t̂_{m-r+i}) / (ρ0 h(m-r, m-r))    (4-5)

For the base case we obtain:

t̂_m = a_m ŝ_m, where a_m = 1/(ρ0 h(m, m))    (4-6)

By replacing ŝ_{m-r} in the above equations with s_{m-r}, which is readily observable from the data and has the same expected value, we can obtain a simple recursive algorithm for computing an unbiased estimator for any t_i. Before presenting the recursive algorithm, we note that we can rewrite Equation 4-5 for t̂_i by replacing ŝ with s, changing the summation variable from i to k, and substituting m-r by i:

t̂_i = (s_i - ρ0 Σ_{k=1}^{m-i} h(i, i+k) t̂_{i+k}) / (ρ0 h(i, i))

[...] Equation 4-1 that the bias of M was expressed as a linear combination of various t_i terms. Using GetEstTi to estimate each of the t_i terms, we can write an estimator for the bias of M as:

[...]    (4-7)

In the following two subsections, we present a formal analysis of the statistical properties of our estimator.

[...] Equation 4-7, the estimator for the bias of M is composed of a sum of m different estimators. Hence, by the linearity of expectation, the expected value of the estimator can be written as: [...] 4-7 is unbiased, it would suffice to prove that each of the individual GetEstTi estimators is unbiased. We use mathematical induction to prove the correctness of the various estimators on expectation. As a preliminary step for the proof of unbiasedness, we first derive the expected values of the s_i estimator used by GetEstTi. To do this, we introduce a zero/one random variable H_{j,k} that evaluates to 1 if e_j has k matches in SALE′ and 0 otherwise. The expected value of this variable is simply the probability that it evaluates to 1, giving us E[H_{j,k}] = h(k, cnt(e_j, SALE)). With this:

[...]    (4-9)

We are now ready to present a formal proof of unbiasedness of the GetEstTi.

Proof.
[...] Equation 4-5, the recursive GetEstTi estimator can be rewritten as: [...]

[...] Equation 4-9: E[GetEstTi(m)] = [...]

[...]    (4-11)

We notice that the limits of summation of the inner sum of the first term are from i to m. Splitting this term into two terms, such that one term has limits of summation from i to i while the other has limits from i+1 to m:

[...]    (4-12)

[...]    (4-17)

The above expression can be evaluated using the following rules:
- If k ≠ r (that is, e_k and e_r are two different tuples), then E[H_{k,i} H_{r,j}] ≈ h(i, cnt(e_k, SALE)) h(j, cnt(e_r, SALE)), if we assume that no record s exists in SALE where f3(e_k, s) = f3(e_r, s) = true.
- If i = j (that is, we are computing E[s_i²]) and k = r, then E[H_{k,i} H_{r,j}] = h(i, cnt(e_k, SALE)).
- If i ≠ j (that is, we are computing E[s_i s_j]) and k = r, then E[H_{k,i} H_{r,j}] = 0, since a record cannot have two different numbers of matches in a sample.
- If k = r, then E[Y_k Y_r] = ρ0.
- If k ≠ r, then E[Y_k Y_r] ≈ ρ0².

[Figure: Sampling from a superpopulation.]

[...] estimator. However, there are two problems related to the variance that may limit the utility of the estimator. First, in order to evaluate the hypergeometric probabilities needed to compute or estimate the variance, we need the value of cnt(e, SALE) for an arbitrary record e of EMP. This information is generally unavailable during sampling, and it seems difficult or impossible to obtain a good estimate for the appropriate probability without having this information. This means that in practice, it will be difficult or impossible to tell a user how accurate the resulting estimate is likely to be. We have experimented with general-purpose methods such as the bootstrap [31] to estimate this variance, but have found that these methods often do an extremely poor job in practice. Second, the variance of the estimator itself may be huge. The b_i coefficients are composed of sums, products and ratios of hypergeometric probabilities, which can result in huge values. Particularly worrisome is the h(i, i) value in the denominator used by GetEstTi. Such probabilities can be tiny; including such a small value in the denominator of an expression results in a very large value that may "pump up" the variance accordingly.
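The GetEstTi recursion described earlier (replace ŝ_i by the observable s_i and unwind Equation 4-5 from the base case) can be sketched as follows; the function and argument names are hypothetical:

```python
def get_est_ti(i, s, m, rho0, h):
    """Unbiased estimate of t_i from the observable sums s[0..m]:
    base case  t_m = s[m] / (rho0 * h(m, m));
    recursion  t_i = (s[i] - rho0 * sum_k h(i, i+k) * t_{i+k})
                     / (rho0 * h(i, i)).
    Here h(a, b) is the hypergeometric probability h(k, i) from the
    text, and rho0 is the EMP sampling fraction."""
    if i == m:
        return s[m] / (rho0 * h(m, m))
    tail = sum(h(i, i + k) * get_est_ti(i + k, s, m, rho0, h)
               for k in range(1, m - i + 1))
    return (s[i] - rho0 * tail) / (rho0 * h(i, i))
```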
[...] [78]. One simple way to think of a superpopulation is that it is an infinitely large set of records from which the original data set has been obtained by random sampling. Because the superpopulation is infinite, it is specified using a parametric distribution, which is usually referred to as the prior distribution. Using a superpopulation method, we imagine the following two-step process is used to produce our sample:
1. Draw a large sample of size N from an imaginary, infinite superpopulation, where N is the data set size.
2. Draw a sample of size n [...]

[...] Section 4.3 of the thesis, for a given record e from EMP, we know that these three characteristics are:
1. f1(e)
2. cnt(e, SALE), which is the number of SALE records s for which f3(e, s) is true
3. cnt(e, e′, SALE) where e′ ≠ e, which is the number of SALE records s for which f3(e, s) ∧ f3(e′, s) is true
To simplify our task, we will actually ignore the third characteristic and define a model such that this count is always zero for any given record pair. While this may [...]

[...] Section 4.5.1). In our model, the various μ_i values are related as μ_i = θ_s i + θ_0, where θ_s and θ_0 are the only two parameters that need to be learned to determine all the μ_i. Also, in order to avoid overfitting, we assume that σ² is the variance of f1(e) over all records, rather than modeling and learning variance values of all the individual classes separately. We now define the density function for the superpopulation model corresponding to the GenData algorithm. For a given EMP record e, if f1(e) = v and cnt(e, SALE) = k, the probability density for e given a parameter set Θ is given by: [...]

[...] [39] can be made that such extreme freedom is actually a poor choice, and that in "real life", an analyst will have some sort of idea what the various p_i values look like, and a more restrictive distribution providing fewer degrees of freedom should be used. For example, a negative binomial distribution has been assumed for the distinct value estimation problem [90]. Such background knowledge could certainly improve the accuracy of the method. Though we eschew any such restrictions in the remainder of the thesis (except for an assumption of a linear relationship among the μ_i values; see "Dealing with Overfitting" in the next section), we note that it would be very easy to incorporate such knowledge into our method. The only change needed is that the EM algorithm described in the next section would need to be modified to incorporate any constraints induced on the various parameters by additional distributional assumptions.

[...] The EM algorithm [26] is a general method of finding the maximum likelihood estimate of the parameters of an underlying distribution from a given data set when the data is incomplete or has missing values. EM starts out with an initial assignment of values for the unknown parameters and, at each step, recomputes new values for each of the parameters via a set of update rules. EM continues this process until the likelihood stops increasing any further. Since cnt(e, SALE) is unknown, the likelihood function is:

L(Θ | {EMP′, SALE′}) = Π_{e∈EMP′} Σ_{k=1}^{m} p(f1(e), k, cnt(e, SALE′) | Θ)

We present the derivation of our EM implementation in the Appendix, while here we give only the algorithm. In this algorithm, p̃(i | Θ, e) denotes the posterior probability of record e belonging to class i. This is the probability that, given the current set of values for Θ, record e belongs to class i.

    Procedure EM()
    1  Initialize all parameters of Θ; Lprev = -9999
    2  while (true) {
    3      Compute L(Θ) from the sample and assign it to Lcurr
    4      if ((Lcurr - Lprev)/Lprev < 0.01) break
    5      Compute posterior probabilities for each e ∈ EMP′ and each k
    6      Recompute all parameters of Θ by using the following update rules:
    7      μ_i = Σ_{e∈EMP′} p̃(i | Θ, e) f1(e) [...]

[...] [30]. We use the following two methods in our approach:
- Limiting the number of degrees of freedom of the model.
- Using multiple models and combining them to develop our final estimator.
To use the first technique, we restrict our generative model so that the mean aggregate value of all records of any class i is not independent of the mean value of other classes. Rather, we use a simple linear regression model μ_i = θ_s i + θ_0. θ_s and θ_0 are the two parameters of the linear regression model and can be learned easily. This means that once we have learned the two parameters θ_s and θ_0, the μ_i values for all other classes can be determined directly by the above relation and will not be learned separately. As mentioned previously, it would also be possible to place distributional constraints upon the vector p in order to reduce the degrees of freedom even more, though we choose not to do this in our implementation.

Our second strategy to tackle the overfitting problem is to learn multiple models rather than working with a single model. These models differ from each other only in that they are learned using our EM algorithm with different initial random settings for their parameters. When generating populations from the models learned via EM (as described in the next subsection), we then rotate through the various models in round-robin fashion.

Are we not done yet? Once the model has been learned, a simple estimator is immediately available to us: we could return p_0 μ_0 nEMP, since this will be the expected query result over an arbitrary database sampled from the model. This is equivalent to first determining a class of databases that the database in question has been randomly [...]

[...] Section 4.4 that the jth population generated and the sample from that population are P_j = (EMP_j, SALE_j) and S_j = (EMP′_j, SALE′_j), respectively. Let s_ij be the value of s_i computed over S_j; that is, it is the sum of f1 over all tuples in EMP′_j that have i matches in SALE′_j. Our goal in all of this is to construct a weighted estimator:
[...]

∂SSE/∂w_0 = Σ_j 2 (Σ_{i=0}^{m} w_i s_ij - q(P_j)) s_0j

If we differentiate with respect to each w_i and set the resulting m+1 expressions to zero, we obtain m+1 linear equations in the m+1 unknown weights. These equations can be represented in the following matrix form:

[ Σ_j s_0j²       Σ_j s_0j s_1j   ...   Σ_j s_0j s_mj ] [ w_0 ]   [ Σ_j s_0j q(P_j) ]
[ Σ_j s_0j s_1j   Σ_j s_1j²       ...   Σ_j s_1j s_mj ] [ w_1 ] = [ Σ_j s_1j q(P_j) ]
[ ...             ...             ...   ...           ] [ ... ]   [ ...             ]
[ Σ_j s_0j s_mj   Σ_j s_1j s_mj   ...   Σ_j s_mj²     ] [ w_m ]   [ Σ_j s_mj q(P_j) ]

The optimal weights can then be easily obtained by using a linear equation solver to solve the above system of equations. Once W has been derived, it is then applied to the original samples EMP′ and SALE′ in order to estimate the answer to the query. By dividing the SSE obtained via the minimization problem described above by the number of data sets generated, we can also obtain a reasonable estimate of the mean squared error of W.

[...]
1. The distribution of the number of matching records in SALE for each record of EMP
2. The distribution of e.SAL values of all records of EMP
Based on these two important properties, we synthetically generated data sets so that the distribution of the number of matching records for all EMP records follows a discretized Gamma distribution. The Gamma distribution was chosen because it produces positive numbers and is very flexible, allowing a long tail to the right. This means that it is possible to create data sets for which most records in EMP have very few matches, but some have a large number. We chose values of 1, 2 and 5 for the Gamma distribution's shift parameter, and values of 0.5 and 1 for the scale parameter. Based on these different values for the shift and scale parameters, we obtained six possible data sets: 1: (shift=1, scale=0.5); 2: (shift=2, scale=0.5); 3: (shift=5, scale=0.5); 4: (shift=1, scale=1); 5: (shift=2, scale=1); and 6: (shift=5, scale=1). For these six data sets, the fractions of EMP records having no matches in SALE (and thus contributing to the query answer) were .86, .59, .052, .63, .27, and .0037, respectively. A plot of the probability that an arbitrary tuple from EMP has m matches in SALE for each of the six data sets is given as Figure 4-2. This shows the wide variety of data set characteristics we tested.

[Figure: Six distributions used to generate, for each e in EMP, the number of records s in SALE for which f3(e, s) evaluates to true.]
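The normal-equations solve described above can be sketched in plain Python (hypothetical names; a real implementation would likely call an existing linear algebra routine):

```python
def optimal_weights(S, q):
    """Least-squares weights for the combined estimator: S[j][a] holds
    s_aj for generated population j, and q[j] = q(P_j).  Builds the
    (m+1)x(m+1) normal equations A w = b with A[a][b] = sum_j s_aj*s_bj
    and b[a] = sum_j s_aj*q(P_j), then solves them by Gaussian
    elimination with partial pivoting."""
    n = len(S[0])
    A = [[sum(row[a] * row[b] for row in S) for b in range(n)] for a in range(n)]
    b = [sum(row[a] * qj for row, qj in zip(S, q)) for a in range(n)]
    for col in range(n):                      # forward elimination
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n                             # back substitution
    for r in range(n - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]
    return w
```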
We also varied the distribution of the e.SAL values such that the distribution can be one of the following: [...]

[...] Section 4.5.1, the three specific assumptions we made for our superpopulation model were:
[...]
2. There exists a linear relationship between the mean aggregate values of the different classes of EMP records, given by μ_i = θ_s i + θ_0, where θ_s is the slope of the straight line connecting the various μ_i values.
3. The variance of the aggregate attribute values of records of any class is approximately equal to the single model parameter σ².
For each of these three cases, we generate six different data sets using the six different sets of gamma parameters described earlier. Thus we obtain 18 more data sets, where the first six sets violate assumption 1, the next six sets violate assumption 2, and the last six sets violate assumption 3. For each of these 18 data sets, the aggregate attribute value is normally distributed with a mean of 100 and standard deviation of 200, except for the last six sets, where different values of standard deviation are chosen for records from different classes. In order to violate assumption 1, we no longer assume a primary key - foreign key relationship between EMP and SALE. To generate a data set violating this assumption, a set s1 of records of size 100 from EMP is selected. Let max be the largest number of matches in SALE for any record from s1. Then an associated set s2 of max records is added to SALE such that all records in s1 have their matching records in s2. Assumption 2 was violated using μ_i = θ_s j + θ_0, where j ≠ i (in fact, the j value for a given i is randomly selected from 1...m). Assumption 3 was violated by assuming different values for the variance of records from different classes. We randomly chose these values from the range (100, 15000).
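The synthetic match-count generation described in this section can be sketched as follows (a hypothetical sketch: the thesis does not specify the discretization of the Gamma draws, so rounding down is assumed here):

```python
import random

def gen_match_counts(n, shape, scale):
    """Draw, for each of n EMP records, the number of matching SALE
    records from a discretized Gamma(shape, scale) distribution.
    Small shape and scale values make most records have zero matches,
    while the Gamma's long right tail still allows a few records with
    many matches."""
    return [int(random.gammavariate(shape, scale)) for _ in range(n)]
```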
1 ],theSynopticCloudReports[ 3 ]obtainedfromtheOakRidge 106 PAGE 108 108 PAGE 109 4.2 .Resultsfromtherst48syntheticdatasetsaregivenininTables 41 and 42 whileresultsfromthenext18syntheticdatasets(whichspecicallyviolatethemodelassumptions)arepresentedinTable 43 .ReallifedatasetresultsareshowninTable 44 .Foreachofthetestcases,wegivethesquarerootoftheobservedmeansquarederror(thatis,thestandarderror)forthebiased,unbiasedaswellasconcurrentestimator.Becausehavinganabsolutevalueforthestandarderrorlacksanysortofscaleandthuswouldnotbeinformative,wegivethestandarderrorasapercentageofthetotalaggregatevalueofallrecordsinthedatabase.Forexample,forthesyntheticdatasets,wegivethestandarderrorasapercentageoftheanswertothequery:SELECTSUM(e.SAL) PAGE 110 41 .Similarlyfortherestofthedatasets,thefactorsare:dataset2:1.7;dataset3:19;dataset4:1.5;dataset5:3.7anddataset6:270.FortheIMDBandSCRdatasets,thefactorsarebetween1and5.5whilefortheKDDCupthefactorsrangefrom2(forthehighselectivityquery)to40(fortheverylowselectivityquery).Whenwetestedthequeries,wealsorecordedthenumberoftimes(outoften)thattheanswergivenbythebiasedestimatorwaswithin2estimatedstandarderrorsoftherealanswertothequeryandfoundthatforalmostallthetestcasesthisnumberwastenwhileonlyforacoupleoftestcasesthisnumberwasfoundtobenineoutoften.Finally,wemeasuredthecomputationtimerequiredbythebiasedestimatortoinitiallylearnthegenerativemodel,thencomputeweightsforthevariouscomponentsoftheestimator,andtonallyprovideanestimateofthequeryresult.Weobservedthatforthesyntheticdatasets(whichconsistsof10millionand50millionrecordsinthetworelations)themaximumobservedrunningtimeofbiasedestimatorwasbetween3and4secondsfora10%samplefromeach.ThevastmajorityofthistimeisspentintheEMlearningalgorithm,whichrequiresO(mjEMP0ji)time,wheremisthemaximumpossiblenumberofmatchesforarecordinEMPwithrecordsinSALES,andiisthenumber 110 PAGE 111 41 
is that the unbiased estimator has uniformly smaller error only on those eight tests performed using synthetic data set 1, where the number of matches for each record e ∈ EMP is generated using a Gamma distribution with parameters (shift = 1, scale = 0.5). In this particular data set, only a very small number of the records are excluded by the NOT EXISTS clause, since 86% of the records in EMP do not have a match in SALE. Furthermore, only a very small number of the records have a large number of matches. Both of these characteristics tend to stabilize the variance of the unbiased estimator, making it a fine choice. For all the other data sets, the unbiased estimator does very poorly in most of the cases. For synthetic data, the estimator's worst performance is on data set 6, in which less than one percent of the records are accepted by the NOT EXISTS clause and several records from EMP have more than 15 matching records in SALE. In this case, the unbiased estimator is unusable, and the results were particularly poor with correlation between the number of matches and the aggregate value that is summed. For example, in the correlated case with a 1% sample, most of the relative standard errors were more than 40,000%. Such very poor results are found sporadically throughout most of the data sets, though the results were somewhat erratic. The reason that the observed errors associated with the unbiased estimator are highly variable is the very long tail of the error distribution. Under many circumstances, most of the answers computed using the unbiased estimator are very good, but there is still a small (though non-negligible) probability of getting a ridiculous estimate whose error is hundreds of times the sum of the aggregate value over the entire EMP relation. Unfortunately, it is interesting to note

Gamma  Corr.?  Dist.   1% Sample: U(%) C(%) B(%)   5% Sample: U(%) C(%) B(%)   10% Sample: U(%) C(%) B(%)
1  No  a.   7.39 13.32 38.30   2.39 12.62 3.88   1.09 11.89 1.46
1  No  b.   6.69 13.45 37.87   3.04 12.63 5.92   1.08 11.93 1.38
1  No  c.   6.89 12.92 22.59   5.23 12.04 8.18   3.79 11.23 7.09
1  No  d.   16.65 6.32 68.37   15.94 6.19 29.34   9.56 5.94 19.72
1  Yes a.   11.90 20.90 34.50   4.59 19.94 2.26   3.15 18.68 1.42
1  Yes b.   13.50 17.80 36.30   4.07 16.37 5.12   1.75 15.50 2.18
1  Yes c.   7.70 15.06 21.14   5.69 14.06 7.84   3.98 13.13 6.21
1  Yes d.   18.05 1.04 66.94   16.26 0.52 25.35   12.98 0.41 15.33
2  No  a.   11.79 40.12 6.09   8.10 37.98 3.55   2.43 35.44 3.37
2  No  b.   13.65 39.48 5.00   6.82 37.86 4.83   2.54 35.51 4.03
2  No  c.   179.87 39.20 14.75   6.35 37.00 8.34   4.54 34.44 7.12
2  No  d.   31.60 20.45 43.43   10.24 19.26 12.88   9.99 17.08 6.25
2  Yes a.   24.70 65.60 21.39   19.83 62.00 18.45   4.78 57.51 13.70
2  Yes b.   19.34 54.27 12.99   12.61 51.19 12.28   3.46 47.72 7.48
2  Yes c.   220.14 46.60 23.01   12.19 44.01 12.01   5.10 40.88 5.10
2  Yes d.   52.61 39.08 39.45   19.62 36.75 5.32   9.20 33.19 2.25
3  No  a.   234.60 92.75 18.61   59.67 84.91 12.22   33.00 76.00 6.28
3  No  b.   315.97 93.29 19.42   70.32 84.68 11.68   34.78 76.05 5.84
3  No  c.   188.17 91.50 20.53   46.14 84.01 18.50   24.92 75.07 15.80
3  No  d.   139.27 72.67 14.24   63.56 67.36 12.18   6.79 59.83 5.33
3  Yes a.   753.73 189.70 42.19   220.00 172.10 28.99   115.25 151.85 17.02
3  Yes b.   421.00 146.70 30.93   151.00 133.50 21.05   74.50 118.40 11.99
3  Yes c.   240.20 119.80 28.28   74.66 109.50 25.99   42.57 97.22 21.86
3  Yes d.   47.95 144.61 33.85   18.52 130.93 28.69   3.63 114.00 18.63
Table 4-1. Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 24 synthetically generated data sets. The table shows errors for three different sampling fractions (1%, 5% and 10%) and, for each of these fractions, it shows the error for the three estimators: U, unbiased estimator; C, concurrent sampling estimator; and B, model-based biased estimator.

Gamma  Corr.?  Dist.   1% Sample: U(%) C(%) B(%)   5% Sample: U(%) C(%) B(%)   10% Sample: U(%) C(%) B(%)
4  No  a.   153.70 36.20 14.52   37.17 33.90 4.73   24.47 31.20 0.89
4  No  b.   226.00 37.00 18.56   50.32 33.95 5.27   42.87 31.11 1.33
4  No  c.   242.70 35.20 11.10   19.40 32.85 3.62   17.03 30.04 3.59
4  No  d.   146.37 16.56 45.16   23.60 14.85 21.26   8.85 12.62 16.61
4  Yes a.   418.70 64.50 10.85   116.55 59.94 2.71   27.55 54.52 1.64
4  Yes b.   327.02 52.06 8.62   75.95 48.42 3.92   45.62 44.12 2.83
4  Yes c.   359.60 43.40 13.90   30.19 40.39 7.17   27.21 36.80 5.16
4  Yes d.   1.1e3 37.53 40.29   54.33 33.99 10.66   18.94 29.32 5.68
5  No  a.   236.00 72.04 13.19   46.18 66.08 12.07   38.30 59.60 6.15
5  No  b.   395.00 72.30 11.78   55.78 66.09 11.73   42.73 59.55 5.37
5  No  c.   167.70 71.10 7.70   120.81 65.20 1.99   62.70 58.50 1.15
5  No  d.   135.65 51.87 13.58   77.12 48.29 4.30   24.14 42.21 4.16
5  Yes a.   862.00 71.79 31.25   203.81 64.90 7.21   57.22 57.00 2.93
5  Yes b.   650.80 56.60 28.64   129.75 51.46 6.75   74.16 43.90 1.86
5  Yes c.   298.70 92.30 11.47   189.70 84.22 4.06   69.63 74.80 2.53
5  Yes d.   283.26 105.24 10.84   178.61 95.07 9.38   145.78 81.86 3.04
6  No  a.   7.1e3 95.13 19.30   6.2e3 79.49 9.82   4.1e3 63.33 6.09
6  No  b.   1.9e4 95.20 18.40   2.1e3 79.58 9.47   6.6e2 63.40 5.74
6  No  c.   1.9e4 94.32 13.03   1.2e3 78.60 5.96   9.6e2 62.74 1.71
6  No  d.   4.7e4 76.71 7.54   2.0e2 66.87 8.42   68.87 54.96 3.97
6  Yes a.   5.4e4 307.0 62.00   3.0e4 249.30 30.90   5.7e3 119.00 18.78
6  Yes b.   4.2e4 214.0 42.70   1.9e4 174.25 21.12   7.0e3 135.00 12.88
6  Yes c.   3.2e4 156.3 22.70   2.0e3 128.10 10.87   8.7e2 100.12 3.05
6  Yes d.   1.3e5 234.4 29.78   2.9e3 192.46 28.25   2.4e3 148.28 12.79
Table 4-2. Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 24 synthetically generated data sets. The table shows errors for three different sampling fractions (1%, 5% and 10%) and, for each of these fractions, it shows the error for the three estimators: U, unbiased estimator; C, concurrent sampling estimator; and B, model-based biased estimator.

Gamma  Violates   1% Sample: U(%) C(%) B(%)   5% Sample: U(%) C(%) B(%)   10% Sample: U(%) C(%) B(%)
1  (1)   8.83 13.37 62.60   3.12 12.47 15.24   1.19 11.75 4.62
2  (1)   24.66 39.33 34.39   8.14 37.89 2.74   3.41 35.60 2.48
3  (1)   94.11 92.31 21.14   72.94 84.82 16.76   20.27 75.78 13.05
4  (1)   22.30 36.67 37.99   12.72 34.07 7.96   6.34 31.12 2.95
5  (1)   231.50 72.60 6.76   123.30 66.14 6.37   85.68 59.48 4.35
6  (1)   1366.80 95.96 9.99   1.2e3 78.64 5.85   700.0 62.62 1.88
1  (2)   14.18 21.70 100.70   4.42 21.09 26.34   2.69 20.20 12.44
2  (2)   21.62 72.24 59.94   14.25 67.50 7.56   6.25 62.90 4.47
3  (2)   886.2 220.20 45.73   136.0 201.90 31.73   79.75 180.10 25.76
4  (2)   462.0 95.80 106.80   269.19 88.74 22.18   81.03 82.43 11.52
5  (2)   247.60 205.0 18.84   233.0 187.00 17.69   88.55 168.30 9.78
6  (2)   6891.00 369.0 42.30   5988.0 310.00 40.90   1924.00 246.57 19.77
1  (3)   14.70 21.14 61.86   6.24 20.20 10.15   1.13 19.13 2.67
2  (3)   26.15 66.73 29.10   22.49 62.25 20.25   5.38 57.69 17.35
3  (3)   920.10 185.30 41.86   147.60 167.20 30.12   65.63 146.88 27.20
4  (3)   2.3e5 64.42 35.96   714.00 60.54 16.87   150.80 54.77 9.24
5  (3)   1350.30 143.00 33.59   856.00 127.76 29.58   306.70 113.14 10.08
6  (3)   2.2e5 264.02 38.37   4519.10 212.80 34.92   2530.00 162.70 21.96
Table 4-3. Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 18 synthetically generated data sets. The table shows errors for three different sampling fractions (1%, 5% and 10%) and, for each of these fractions, it shows the error for the three estimators: U, unbiased estimator; C, concurrent sampling estimator; and B, model-based biased estimator.

Data Set  Query   1% Sample: U C B (%)   5% Sample: U C B (%)   10% Sample: U C B (%)
IMDB   27.67 70.88   3.3e3 17.51 33.44   4.1e2 13.71 14.14
IMDB   75.12 65.10   91.26 62.86 31.97   49.82 52.69 9.31
IMDB   25.21 18.47   3.5e3 16.58 14.38   4.7e2 12.71 1.92
SCR   65.22 10.31   5.0e3 44.97 6.84   8.2e2 23.27 4.41
SCR   59.06 9.42   4.6e3 41.62 7.51   7.8e2 24.07 3.95
KDDCup   60.47 12.39   7.4e4 54.92 10.96   7.6e3 42.08 2.10
KDDCup   41.30 11.24   5.8e83 26.54 4.32   9.3e36 17.04 3.28
KDDCup   15.24 8.46   3.6e172 10.80 1.56   2.3e120 6.35 0.98
Table 4-4. Observed standard error as a percentage of the total aggregate value of all records in the database for 8 queries over 3 real-life data sets. The table shows errors for three different sampling fractions (1%, 5% and 10%) and, for each of these fractions, it shows the error for the three estimators: U, unbiased estimator; C, concurrent sampling estimator; and B, model-based biased estimator.
that the unbiased estimator's worst performance overall was observed on Q8 over the KDD Cup data, where the error was astronomically high: larger than 10^100. In comparison, the biased estimator generally did a very good job of predicting the final query result, and in most cases with a 5% or 10% sampling fraction the observed standard error was less than 10% of the total aggregate value found in EMP. In other words, if the total value of SUM(e.SAL) with no NOT EXISTS clause is x, then for just about any query tested, the standard error was less than x/10, and it was frequently much smaller. This is actually quite impressive when one considers the difficulty of the problem. The primary drawback associated with the biased estimator is its complexity (requiring nontrivial and substantially statistically oriented computations) and the fact that a significant amount of computation is required, most of it associated with running the EM algorithm to completion. By comparison, the unbiased estimate can be calculated via an almost trivial recursive routine that relies on the calculation of simple hypergeometric probabilities. One case where the biased estimator had questionable qualitative performance was with the 16 tests associated with data sets 3 and 6. The problem in this case was that ... Table 4-3.
The first six rows in the table show results for data sets in which more than one EMP record can match with a given record from SALE. The results show that violating this assumption of the model in the actual data set did not affect the accuracy of the biased estimator significantly. The next set of six rows in the table shows results for data sets in which there is no linear relationship between the mean aggregate values of the different classes of EMP records. The results show that the biased estimator is about twice as inaccurate over these data sets as compared to the corresponding data sets that do not have a strict violation of the assumption. The last six rows in the table show results over data sets in which the variances of the aggregate values of records from different classes are significantly different. Results show that these data sets affect the accuracy of the biased estimator as much as the data sets that violate the "linear relationship of mean values" assumption. However, the results are certainly not poor when these assumptions are violated, and the method still seems to have qualitative performance that may be acceptable for many applications, particularly with a larger sample size.

The results from the eight queries over the three real-life data sets are depicted in Table 4-4. The key difference in the characteristics of the real-life data sets compared ... Table 4-4 shows that the accuracy of the biased estimator is generally quite good over the real data. We also note that the standard error of the biased estimator over the learned superpopulation seems to be a reasonable surrogate for the standard error of the biased estimator in practice. For most biased estimators, it is reasonable to use the standard error of the biased estimator in the same way that one would use the standard deviation of an unbiased estimator when constructing confidence bounds (see Sarndal et al. [109], Section 5.2). According to the Vysochanskii-Petunin inequality [120
], any unbiased unimodal estimator will be within three standard deviations of the correct answer 95% of the time, and according to the more aggressive central limit theorem, an estimator will be within two standard deviations of the correct answer 95% of the time. We observed that in almost all of the tests, ten out of ten of the errors for the biased estimator were actually within two predicted standard errors of zero. This seems to be strong evidence for the utility of the bounds computed using the predicted standard error of the biased estimator. We finally remark on the time required for the execution of the biased estimator. The biased estimator performs several computations, including learning the model parameters, generating sufficient statistics for several (population, sample) pairs, and then solving a system of equations to compute weights for the various components of the estimator. As discussed previously, this took no longer than four seconds for the largest samples tested. If this is not fast enough, we point out that it may be possible to speed this up even more, though this is beyond the scope of the thesis. While we used the traditional EM algorithm ... variants [69, 95, 116] of the EM algorithm. These variants of the EM algorithm typically achieve faster convergence time by implementing the Expectation and/or the Minimization step of the EM algorithm partially.
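The two-standard-error sanity check described above (counting how often an estimate falls within two predicted standard errors of the true answer) can be sketched as follows; the trial values below are invented for illustration, not taken from the experiments.

```python
def within_two_se(estimate, std_err, truth):
    # True when the estimate lies within two (predicted) standard errors
    # of the true query answer.
    return abs(estimate - truth) <= 2.0 * std_err

# Hypothetical repeated runs of one query: (estimate, predicted std err).
truth = 1000.0
trials = [(990.0, 20.0), (1030.0, 20.0), (1100.0, 20.0)]

# Count how many runs land inside the two-SE bound.
hits = sum(within_two_se(est, se, truth) for est, se in trials)
```

If the predicted standard errors are trustworthy, roughly 95% of runs should pass this check, which is the "ten out of ten" behavior reported in the text.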
97]. Other classic efforts at sampling-based estimation over database data are the adaptive sampling of Lipton and Naughton [83, 84] for join query selectivity estimation, and the sampling techniques of Hou et al. [64, 65] for aggregate queries. More recent well-known work on sampling is that on online aggregation by Haas, Hellerstein, and their colleagues [47, 60, 61]. The sampling-based database estimation problem that is closest to the one studied in this chapter is that of sampling for the number of distinct values in a database. As discussed in the introduction to this chapter, a solution to the problem of estimation over subset-based queries is a solution to the problem of estimating the number of distinct values in a database, since the latter problem can be written as a NOT EXISTS query. The classic paper in distinct-value estimation is due to Haas et al. [49]. For a survey of the state-of-the-art work on this problem in databases through the year 2000, we refer the reader to the introduction of the paper by Charikar et al. on the topic [17]. The paper of Bunge and Fitzpatrick [13] provides a survey of work in the statistics area, current through the early 1990's. Work in statistics continues on this problem to this day. In fact, a recent paper from statistics by Mingoti [90] on the distinct-value problem provided inspiration for our use of superpopulation techniques. Though the problems of distinct-value estimation and subset-based aggregate estimation are related, we note that the problem of estimating the number of distinct values is a very restricted version of the problem we study in this thesis, and it is not immediately clear how arbitrary solutions to the distinct-value problem can be generalized ... [43] ... As discussed in Section 4.5.1 of the thesis, one of the most controversial decisions made in the development of the latter estimator was our choice of a very general prior distribution. To a statistician from the so-called "Bayesian" school [39
], this may be seen as a poor choice, and a Bayesian statistician may argue that a more descriptive prior distribution, if appropriate, would increase the accuracy of the method. This is certainly true, if the selected distribution were a good match for the actual data distribution. In our work, however, we have consciously chosen generality and its associated drawbacks in place of specificity. Our experimental results seem to argue that for a variety of different ... [47]. This means that the join itself must be modeled, which is a problem for future work. Another problem for future work is arbitrary levels of nesting: an inner query may itself be linked with another inner query via a NOT EXISTS or similar clause. ... [8]. We consider very selective queries because they are the one class of queries that is hardest to handle approximately without workload knowledge: if a query references only a few tuples from the data set, then it is very hard to make sure that a synopsis structure (such as a sample) will contain the information needed to answer the query. The most natural method for handling highly selective queries using sampling is to make use of stratification [25]. In order to answer an aggregate query over a relation, one could first (offline) partition the relation's tuples into various subsets so that similar tuples are grouped together, the assumption being that the relational selection predicate associated with a given query will tend to favor certain strata. Even if a given query is very selective, at least one or two of the strata will have a relatively heavy concentration of tuples that will contribute to the query answer. When the query is processed, those "important" strata can be sampled first and more heavily than the others. This is illustrated with the following example.

Example 1: The relation MOVIE(MovieYear, Sales) is partitioned into two strata as follows: ... The query Q is then issued:

SELECT SUM(Sales) ...

While stratification may be very useful, it is not a new idea. It has been studied in statistics for decades, and it has been suggested previously as a way to make approximate aggregate query processing more accurate [18-20
]. However, in the context of databases, researchers have previously considered only half of the problem: how to divide the database into strata. This may actually be the easy and less important half of the problem, since even the relatively naive partitioning strategy we use in our experiments can give excellent results. The equally fundamental problem we consider in this paper is: how to allocate samples to strata when actually answering the query. More specifically, given a budget of n samples, how does one choose how to "spend" those samples on the various strata in order to achieve the greatest accuracy? The classic allocation method from statistics is the Neyman allocation, and it is the one advocated previously in the database literature [19]. The key difficulty with applying the Neyman allocation in practice is that it requires extensive knowledge of certain statistical characteristics of each stratum, with respect to the incoming query. In practice ... [14] that allow us to take into account any prior expectation (such as the expected efficacy of the stratification) in a principled fashion. We carefully evaluate our methods experimentally, and show that if one is very careful in developing a sampling plan, even a naive partitioning of samples to strata that uses no workload information can show dramatic accuracy for very selective queries. Our methods are very general. They can be used with any partitioning (such as those proposed by Chaudhuri et al. [18-20]), or even in cases where the partitioning is not user-defined and is imposed by the problem domain (for example, when the various "strata" are different data sources in a distributed environment). Our methods can also be extended to more complicated relational operations such as joins, though this problem is beyond the scope of the paper.
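The allocation question posed above has a classical textbook answer: sample each stratum in proportion to the product of its size and its (pilot-estimated) standard deviation. A minimal sketch of that textbook Neyman rule follows; the stratum sizes and standard deviations are assumed inputs for illustration, and the thesis's exact formulation (Equation 5-4) is derived in the next section.

```python
def neyman_allocation(stratum_sizes, stratum_sds, n):
    # Textbook Neyman allocation: the sample size for stratum i is
    # proportional to N_i * sigma_i, scaled so allocations sum to n.
    weights = [N * s for N, s in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Two hypothetical strata: a large homogeneous one and a small,
# highly variable one end up with equal shares of the budget.
alloc = neyman_allocation([10_000, 1_000], [5.0, 50.0], n=1000)
```

In this example the small stratum's tenfold-larger spread exactly offsets its tenfold-smaller size, so the 1000-sample budget splits evenly, which illustrates why naive proportional-to-size sampling can badly undersample volatile strata.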
Ŷ = Σ_{i=1..L} (N_i/n_i) Σ_{r ∈ S_i} f(r),    σ̂_i² = (1/(n_i − 1)) Σ_{r ∈ S_i} (f(r) − f̄_i)²

where S_i is the sample taken from stratum i and f̄_i is the sample mean of the f(·) values in S_i. The variance estimate is obtained from Equation 5-2 by simply replacing all the σ_i² terms with their corresponding unbiased estimators σ̂_i². Central-limit-theorem-based confidence bounds [112] for Ŷ can then be computed as Ŷ ± z_p·σ̂, where z_p is the z-score for the desired confidence level. If desired, more conservative confidence bounds from the literature (such as Chebyshev-based bounds [112]) can also be used. Finally, we note that aggregate queries like COUNT and AVG can also be handled by stratified sampling estimators like the one described above by using ratios of two different estimates. Aggregate queries with a GROUP BY clause can also be answered by using ... [54], though that is beyond the scope of the paper.

... Equation 5-1. Since Ŷ is unbiased, minimizing its error is equivalent to minimizing its variance. An optimization problem can be formulated for the choice of n_i values so that the variance σ² is minimized; solving the problem leads to the well-known Neyman allocation [25] from statistics. Specifically, the Neyman allocation states that the variance of a stratified sampling estimator is minimized when the sample size n_i is proportional to the size of the stratum, N_i, and to the variance of the f(·) values in the stratum, σ_i². That is, ...

... Section 5.2.1. The number of records from R1 accepted by f2() is 10, while the number of records from R2 accepted by f2() is 1000. Further, let f1(r) ~ N(1000, 100) for all r ∈ R1 and f1(r) ~ N(10, 100) for all r ∈ R2, where N(μ, σ) denotes a normal distribution with mean μ and variance σ². We use a pilot sample of 100 records to estimate the variance of the f(·) values in each stratum. These estimates are σ̂₁² and σ̂₂². If the desired sample size is n = 1000, the estimated variances can be used with Equation 5-4 to obtain an estimate for the optimal sampling allocation as follows:

n₁ = 1000 · σ̂₁²/(σ̂₁² + σ̂₂²),    n₂ = 1000 · σ̂₂²/(σ̂₁² + σ̂₂²)

... the variance of the resulting estimator is estimated (Equation 5-2), since this variance would be used to report confidence bounds to the user. We then compute the average estimated variance across the 1000 iterations. Finally, we use the true variances of both strata to obtain an optimal sample allocation, and repeat the above experiment using the optimal allocation. We summarize the results in the following table.
True query result      20150
Avg. observed bias     10200
Avg. estimated MSE     0.76 million
Avg. observed MSE      100 million
MSE of true optimal    58.6 million

... [14] called the Bayes-Neyman allocation that can incorporate such intuition into the process in a principled fashion. In general, Bayesian methods formally model such prior intuition or belief as a probability distribution. Such methods then refine the distribution by incorporating additional information (in our case, information from the pilot sample) to obtain an overall improved probability distribution. At the highest level, the proposed Bayes-Neyman allocation works as follows: ... [33]. This means that we view the probability p_i that an arbitrary tuple from stratum i will be accepted by the relational selection predicate f2() as being the result of a random sample from the Beta distribution, which produces a result from 0 to 1. Since we view each tuple as a separate and independent application of f2(), the number of tuples from stratum i that are accepted by f2() is then binomially distributed
].Asecondmethodistosimplyassumethatthestraticationwechooseusuallyworkswell.Inthiscase,moststratawilleitherhaveaveryloworaveryhighpercentageofitsrecordsacceptedbyf2().Choosing==:5resultsinaUshapeddistributionthatmatchesthisintuitionexactly,andisacommonchoiceforaBetaprior.TheresultingBetaisillustratedinFigure 51 .Inpracticewendthatthisproducesexcellentresults. 131 PAGE 132 5.5 willupdateandasneededtotakeintoaccounttheinformationpresentinthepilotsample.ProducingtheVectorofCounts 132 PAGE 133 5.4.5 133 PAGE 134 33 ]{justastheBetadistributionisthestandardconjugatepriorforabinomialdistribution.TheDirichletisthemultidimensionalgeneralizationoftheBeta.AkdimensionalDirichletdistributionmakesuseoftheparametervector=f1;2;;kg.JustasinthecaseoftheBetapriorusedbyXcnt,theDirichletpriorrequiresaninitialsetofparametersthatrepresentourinitialbelief.Sincewetypicallyhavenoknowledgeabouthowlikelyitisthatagivenf1()valuewillbeselectedbyf2(),thesimplestinitialassumptiontomakeisthatallvaluesareequallylikely.InthecaseoftheDirichletdistribution,usingi=1foralliisthetypicalzeroknowledgeprior[ 33 ].Given,itisthenasimplemattertosamplefromX0,aswedescribeformallyinthenextsubsection.Wenotethatalthoughthisinitialparameterchoicemaybeinaccurate,inBayesianfashiontheparameterswillbemademoreaccuratebasedupontheinformationpresentinthepilotsample.Section 5.5 providesdetailsofhowtheupdateisaccomplished.ProducingtheVector0 5.4.2 .AlgorithmGetMoments(1;;L,D)f1//LetidenotetheDirichletparametersforstratumi2//LetDbeanarrayofalldistinctvaluesfromtherangeoff2() PAGE 135 135 PAGE 137 5.4.3 
, the one remaining problem regarding how to sample from X′ is the problem of having a very large (or even unknown) range for the function f1(). In this case, dealing with the vectors D and V may be impossible, for both storage and computational reasons. The simple solution to this problem is to break the range of f1() into a number of buckets and make use of a histogram over the range, rather than using the range itself. In this case, D is generalized to be an array of histogram buckets, where each entry in D has summary information for a group of distinct f1() values. Each entry in D has the following four specific pieces of information:

1. low and high, which are the upper and lower bounds for the f1() values that are found in this particular bucket.
2. μ₁, which is the mean of the f1() values that are found in this particular bucket. That is, if A is the set of distinct values from low to high, then μ₁ = (Σ_{a ∈ A} a)/...

... any appropriate histogram construction scheme [42, 45, 72] over the attribute that is to be queried. In the case that multiple attributes might be queried, one histogram can be constructed for each attribute. This is the method that we test experimentally. Another appropriate method is to construct D on the fly by making use of the pilot sample that is used to compute the sampling plan. This has the advantage that any arbitrary f1() can be handled at run time. Again, any appropriate histogram construction scheme can be used, but rather than constructing D offline using the entire relation R, f1() ...

... Algorithm GetMoments (Section 5.4.3) must be modified so as to handle the modified D. The following is an appropriately modified GetMoments; we call it GetMomentsFromHist.

Algorithm GetMomentsFromHist(α₁, ..., α_L, D)
 1  // Let α_i denote the vector of Dirichlet parameters for stratum i
 2  // Let D be an array of histogram buckets
 3  // Let μ′ = ⟨μ′₁, ..., μ′_L⟩ be a vector of moments of all strata
 4  for (int i = 1; i <= L; i++) {
 5      p ← Dirichlet(α_i)
 6      μ₁ = μ₂ = 0
 7      // Let V be an array of counts for each bucket
 8      V ← Multinomial(cnt_i, p)
 9      for (int j = 1; j <= |D|; j++) {
10          μ₁ += V[j] · D[j].μ₁
11          μ₂ += V[j] · D[j].μ₂
12      }
13      μ₁ /= cnt_i
14      μ₂ /= cnt_i
15      (μ₁, μ₂)_i = (μ₁, μ₂)
16      μ′_i = (μ₁, μ₂)_i
17  }
18  return μ′

... Section 5.4
, we described how we assign initial values to the parameters of the two prior distributions, the Beta and the Dirichlet distributions. In this section, we explain how these initial values can be refined by using information from a pilot sample to obtain corresponding posterior distributions. Updating these priors using the pilot sample in the proposed Bayes-Neyman approach is analogous to using the pilot sample to estimate the stratum variances under the classic Neyman allocation. The update rules described in this section are fairly straightforward applications of the standard Bayesian update rules [14]. The Beta distribution has two parameters, α and β. Let R_pilot denote the pilot sample and let s denote the number of records that are accepted by the predicate f2(). Thus, |R_pilot| − s will be the number of records that fail to be accepted by the query. Then, the following update rules can be used to directly update the α and β parameters of the Beta distribution:

α = α + s
β = β + (|R_pilot| − s)

The Dirichlet distribution is updated similarly. Recall that this distribution uses a vector of parameters, α = {α₁, α₂, ..., α_k}, where k is the number of dimensions. To update the parameter vector, we can use the same pilot sample that was used to update the Beta, as follows. We initialize to zero all elements of an array count of size k. These elements denote counts of the number of times that different values from the range of f1() appear in the pilot sample and are accepted by f2(). The following update rule can be used to update all the different parameters of the Dirichlet distribution:

α_i = α_i + count_i

Algorithm UpdatePriors describes exactly how pilot sampling is used to update the parameters of the prior Beta and Dirichlet distributions for the i-th stratum.

... Equation 5-2 of the thesis. Our situation differs from the classic setup only in that (in Bayesian fashion) we now use X to implicitly define a distribution over the per-stratum variance values ⟨σ₁, σ₂, ..., σ_L⟩. Thus, we cannot minimize σ² directly because under the Bayesian regime, σ² is now a random variable. Instead, it makes sense to minimize the expected value or average of σ², which (using Equation 5-2) can be computed as:

E[σ²] = E[ Σ_{i=1..L} (N_i(N_i − n_i)/n_i) σ_i² ] = Σ_{i=1..L} (N_i(N_i − n_i)/n_i) E[σ_i²]

5.7.1 Goals
The specific goals of our experimental evaluation are as follows: ...
139 PAGE 141 5{2 ofthethesis.Oursituationdiersfromtheclassicsetuponlyinthat(inBayesianfashion)wenowuseXtoimplicitlydeneadistributionovertheperstratavariancevaluesh1;2;;Li.Thus,wecannotminimize2directlybecauseundertheBayesianregime,2isnowarandomvariable.Instead,itmakessensetominimizetheexpectedvalueoraverageof2,which(usingEquation 5{2 )canbecomputedas:E[2]=E"LXi=1Ni(Nini) 141 PAGE 143 5.7.1GoalsThespecicgoalsofourexperimentalevaluationareasfollows: 143 PAGE 144 2 ]andhasasinglerelationwithover9.5millionrecords.Thedatahastwelvenumericalattributesandonecategoricalattributewith29categories.ThethirdistheKDDdataset,whichisthedatasetfromthe1999KDDCupevent.Thisdatasethas42attributeswithstatusinformationregardingvariousnetworkconnectionsforintrusiondetection.Thisdatasetconsistsofaround5millionrecordswithinteger,realvalued,aswellascategoricalattributes.QueriesTested.Foreachdataset,wetestqueriesoftheform:SELECTSUM(f1(r))FROMRAsrWHEREf2(r)f1()andf2()varydependinguponthedataset.FortheGMMdataset,f1()projectsoneofthethreedierentnumericalattributes(eachqueryprojectsarandomattribute).ForthePersondataset,eithertheTotalIncomeattributeortheWageIncomeattributeare 144 PAGE 145 bytesorthedst 
bytesattributesareprojected.Foreachofthedatasets,threedierentclassesofselectionpredicatesencodedbyf2()areused.Eachclasshasadierentselectivity.Thethreeselectivityclassesforf2()haveselectivitiesof(0:01%0:001%),(0:1%0:01%),and(1:0%0:1%),respectively.FortheGMMdataset,f2()isconstructedbyrollingathreefaceddietodecidehowmanyattributeswillbeincludedintheconjunctioncomputedbyf2().TheappropriatenumberofattributesarethenrandomlyselectedfromamongthesixGMMattributes.Ifacategoricalattributeischosenasoneoftheattributesinf2(),thentheattributewillbecheckedwitheitheranequalityorinequalityconditionoverarandomlyselecteddomainvalue.Ifanumericalattributeischosen,thenarangepredicateisconstructed.Foragivennumericalattribute,assumethatlowandhigharetheknownminimumandmaximumattributevalues.Therangeisconstructedusingqlow=low+v1(highlow)andqhigh=qlow+v2(highqlow)wherev1andv2arerandomlychosenrealvaluesfromtherange[01].Foreachselectivityclass,50dierentqueriesaregeneratedbyrepeatingthequerygenerationprocessuntilenoughqueriesfallingtheappropriateselectivityrangehavebeengenerated.Thef2()functionsfortheothertwodatasetsareconstructedsimilarly.StraticationTested.Foreachofthevariousdatasets,asimplenearestneighborclassicationalgorithmisusedtoperformthestatication.InordertopartitionadatasetintoLstrata,Lrecordsarerstchosenrandomlyfromthedatatoserveas\seeds"foreachofthestrata,andalloftheotherrecordsareaddedtothestratawhoseseedisclosesttothedatapoint.Fornumericalattributes,theL2normisusedasthedistancefunction.Forcategoricalattributes,wecomputethedistanceusingthesupportfromthedatabasefortheattributevalues[ 36 ].Sinceeachdatasethasbothnumericalandcategoricaldata,theactualdistancefunctionusedisthesumofthetwo\sub"distancefunctions.Notethatitwouldbepossibletouseamuchmoresophisticatedstratication,butactually 145 PAGE 146 Sel Bandwidth Coverage Size (%) GMM/Person/KDD GMM/Person/KDD 50K 0.01 3.277/2.289/2.140 918/892/921 0.1 1.776/0.514/1.520 926/912/988 1 0.587/0.184/0.210 947/944/942 100K 0.01 2.626/2.108/1.48 
922/941/937 0.1 1.273/0.351/0.910 939/948/940 1 0.415/0.128/0.120 948/952/946 500K 0.01 2.192/1.740/0.820 923/943/940 0.1 0.551/0.132/0.630 946/947/942 1 0.178/0.087/0.070 946/947/948 Table51. Bandwidth(asaratiooferrorboundswidthtothetruequeryanswer)andCoverage(for1000queryruns)foraSimpleRandomSamplingestimatorfortheKDDCupdataset.Resultsareshownforvaryingsamplesizesandforthreedierentqueryselectivities0.01%,0.1%and1%. performingthestraticationisnotthepointofthisthesis{ourgoalistostudyhowtobestusethestratication.Inourexperiments,wetestL=1,L=20,andL=200.NotethatifL=1thenthereisactuallynostraticationperformed,andsothiscaseisequivalenttosimplerandomsamplingwithoutreplacementandwillserveasasanitycheckinourexperiments.TestsRun.FortheNeymanallocationandourBayesNeymanallocation,ourtestsuiteconsistsof54dierenttestcasesforeachdataset,plusninemoretestsusingL=1.Thesetestcasesareobtainedbyassigningthreedierentvaluestothefollowingfourparameters:Numberofstrata{WeuseL=1,L=20,L=200;asdescribedabove,L=1isalsoequivalenttosimplerandomsamplingwithoutreplacement.Pilotsamplesize{Thisisthenumberofrecordsweobtainfromeachstratuminordertoperformtheallocation.Wechoosevaluesof5,20and100records.SampleSize{Thisisthetotalsamplesizethathastobeallocated.Weuse50,000,100,000and500,000samplesinourtests.QuerySelectivity{Asdescribedabove,wetestqueryselectivitiesof0.01%,0.1%and1%. 146 PAGE 147 Neyman BayesNeyman GaussianMixture 1.5 2.4 Person 2.3 3.1 KDDCup 2.1 2.8 Table52. AveragerunningtimeofNeymanandBayesNeymanestimatorsoverthreerealworlddatasets. Eachofthe50queriesforeach(dataset,selectivity)combinationisrerun20timesusing20dierent(pilotsample,sample)combinations.Thus,foreach(dataset,selectivity)combinationweobtainresultsfor1000queryrunsinall. 
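The nearest-seed partitioning used for the stratification above can be sketched as follows for purely numerical records; the support-based distance for categorical attributes is omitted, and the record values and number of strata are illustrative assumptions.

```python
import random

def l2(a, b):
    # Euclidean (L2) distance between two numerical records.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def stratify(records, L, seed=0):
    # Pick L random records as "seeds", then assign every record to the
    # stratum of its nearest seed under the L2 norm.
    rng = random.Random(seed)
    seeds = rng.sample(records, L)
    return [min(range(L), key=lambda i: l2(r, seeds[i])) for r in records]

rng = random.Random(1)
records = [(rng.random(), rng.random()) for _ in range(200)]
labels = stratify(records, L=4)
```

Each seed record is at distance zero from itself, so every stratum is guaranteed to be nonempty; a real implementation would add the categorical "sub" distance described in the text before taking the minimum.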
Table 5-1 shows the results for the nine cases where L = 1; that is, where no stratification is performed. We report two numbers: the bandwidth and the coverage. The bandwidth is the ratio of the width of the 95% confidence bounds computed as the result of using the allocation to the true query answer. The coverage is the number of times out of the 1000 trials that the true answer is actually contained in the 95% confidence bounds reported by the estimator. Naturally, one would expect this number to be close to 950 if the bounds are in fact reliable. Tables 5-3 and 5-4 show the results for the 54 different test cases where a stratification is actually performed. For each of the 54 test cases and both of the sampling plans used (the Neyman allocation and the Bayes-Neyman allocation) we again report the bandwidth and the coverage. Finally, Table 5-2 shows the average running times for the two stratified sampling estimators on all three data sets. There is generally around a 50% hit in terms of running time when using the Bayes-Neyman allocation compared to the Neyman allocation.

                        Bandwidth (GMM/Person/KDD)          Coverage (GMM/Person/KDD)
NS   PS   SS     Sel    Neyman           Bayes-Neyman       Neyman        Bayes-Neyman
20   5    50K    0.01   0.00/0.00/0.00   2.90/0.19/1.12     0/0/0         935/882/927
                 0.1    0.03/0.01/0.02   1.27/0.02/0.80     3/49/23       929/939/938
                 1      0.05/0.02/0.14   0.39/0.01/0.09     11/247/155    940/950/945
          100K   0.01   0.00/0.00/0.00   2.77/0.16/1.08     0/0/0         936/961/930
                 0.1    0.02/0.01/0.01   0.90/0.02/0.73     3/53/28       941/941/938
                 1      0.05/0.01/0.03   0.28/0.01/0.08     24/306/170    941/947/947
          500K   0.01   0.01/0.00/0.00   2.05/0.06/0.87     3/0/4         938/948/932
                 0.1    0.01/0.00/0.01   0.37/0.01/0.55     10/62/51      954/954/941
                 1      0.03/0.01/0.02   0.12/0.00/0.04     38/316/184    957/955/945
     20   50K    0.01   0.06/0.00/0.04   2.72/0.22/1.06     14/0/5        942/941/938
                 0.1    0.17/0.03/0.09   1.21/0.03/0.81     106/61/88     908/938/944
                 1      0.21/0.05/0.27   0.34/0.01/0.09     404/692/561   948/948/947
          100K   0.01   0.01/0.00/0.01   2.58/0.16/0.91     23/0/6        941/937/941
                 0.1    0.11/0.02/0.06   0.85/0.02/0.74     165/66/107    934/954/939
                 1      0.14/0.03/0.09   0.25/0.01/0.06     431/728/612   954/962/953
          500K   0.01   0.01/0.00/0.01   1.93/0.07/0.62     30/0/21       946/943/944
                 0.1    0.01/0.01/0.01   0.34/0.01/0.51     230/145/245   942/952/945
                 1      0.04/0.01/0.03   0.09/0.00/0.02     447/751/746   943/961/950
     100  50K    0.01   0.15/0.04/0.08   2.33/0.19/0.82     24/58/20      938/922/938
                 0.1    0.26/0.10/0.16   1.09/0.02/0.58     436/204/172   929/949/942
                 1      0.47/0.18/0.34   0.32/0.01/0.05     870/891/866   932/962/951
          100K   0.01   0.12/0.03/0.06   2.26/0.16/0.57     29/59/41      935/945/940
                 0.1    0.18/0.05/0.11   0.81/0.02/0.40     435/249/355   927/957/942
                 1      0.31/0.08/0.02   0.22/0.01/0.04     895/928/914   948/968/943
          500K   0.01   0.01/0.01/0.01   1.72/0.07/0.33     45/66/50      939/952/947
                 0.1    0.06/0.02/0.04   0.31/0.01/0.28     474/297/412   954/954/952
                 1      0.06/0.02/0.06   0.08/0.00/0.02     926/935/942   950/970/949

Table 5-3. Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 20 strata and for varying number of records in the pilot sample per stratum (PS), and sample sizes (SS) for three different query selectivities: 0.01%, 0.1% and 1%.

                        Bandwidth (GMM/Person/KDD)          Coverage (GMM/Person/KDD)
NS   PS   SS     Sel    Neyman           Bayes-Neyman       Neyman        Bayes-Neyman
200  5    50K    0.01   0.00/0.00/0.00   1.73/0.18/0.91     0/0/0         933/931/924
                 0.1    0.00/0.02/0.01   0.97/0.02/0.76     0/56/27       933/953/936
                 1      0.05/0.02/0.03   0.26/0.01/0.09     19/162/149    940/960/940
          100K   0.01   0.00/0.01/0.01   1.57/0.13/0.75     0/43/28       936/916/930
                 0.1    0.01/0.01/0.01   0.72/0.02/0.64     7/60/41       938/958/936
                 1      0.03/0.01/0.01   0.19/0.00/0.08     34/365/212    945/955/947
          500K   0.01   0.01/0.00/0.00   1.20/0.08/0.52     5/45/34       940/939/938
                 0.1    0.02/0.01/0.00   0.28/0.01/0.44     22/89/76      946/946/944
                 1      0.02/0.01/0.01   0.07/0.00/0.06     45/372/336    954/954/951
     20   50K    0.01   0.05/0.03/0.04   1.59/0.18/0.85     19/51/21      943/931/934
                 0.1    0.11/0.03/0.07   0.75/0.02/0.72     91/70/94      943/953/939
                 1      0.09/0.04/0.09   0.18/0.01/0.07     345/627/580   958/962/945
          100K   0.01   0.01/0.01/0.03   1.35/0.14/0.67     22/66/45      948/948/941
                 0.1    0.02/0.02/0.04   0.54/0.01/0.54     131/135/128   935/955/949
                 1      0.05/0.02/0.05   0.12/0.00/0.06     488/702/643   945/955/952
          500K   0.01   0.01/0.00/0.01   1.04/0.06/0.42     49/83/72      941/954/947
                 0.1    0.01/0.00/0.02   0.20/0.00/0.35     210/209/282   955/945/950
                 1      0.04/0.01/0.01   0.03/0.00/0.03     617/830/869   948/958/953
     100  50K    0.01   0.08/0.03/0.06   1.35/0.14/0.54     28/56/39      939/938/939
                 0.1    0.20/0.05/0.09   0.56/0.02/0.40     313/357/243   949/949/942
                 1      0.10/0.01/0.15   0.14/0.01/0.03     543/823/874   948/948/951
          100K   0.01   0.07/0.02/0.04   1.11/0.12/0.39     47/77/53      938/935/947
                 0.1    0.08/0.03/0.06   0.40/0.01/0.28     533/456/427   948/948/951
                 1      0.06/0.06/0.08   0.09/0.01/0.02     918/912/930   959/956/952
          500K   0.01   0.01/0.00/0.02   0.89/0.05/0.21     63/91/104     946/936/937
                 0.1    0.02/0.01/0.02   0.10/0.00/0.13     580/540/607   945/945/948
                 1      0.04/0.03/0.05   0.01/0.00/0.01     936/920/941   960/953/950

Table 5-4. Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 200 strata with varying number of records in the pilot sample per stratum (PS), and sample sizes (SS) for three different query selectivities: 0.01%, 0.1% and 1%.

[32, 103, 104]. At a high level, the biggest difference between this work and that prior work is the specificity of our work with respect to database queries. Sampling from a database is unique in that the distribution of values that are aggregated is typically ill-suited to traditional parametric models. Due to the inclusion of the selection predicate encoded by f2(), the distribution of the f() values that are aggregated tends to have a large "stovepipe" located at zero corresponding to those records that are not accepted by f2(), with a more well-behaved distribution of values located elsewhere corresponding to those f1() values for records that were accepted by f2(). The Bayes-Neyman allocation scheme proposed in this thesis explicitly allows for such a situation via its use of a two-stage model where first a certain number of records are accepted by f2() (modeled via the random variable Xcnt) and then the f1() values for those accepted records are produced (modeled by X0). This is quite different from the general-purpose methods described in the statistics literature, which typically attach a well-behaved, standard distribution to the mean and/or variance of each stratum [32, 104]. Sampling for the answer to database queries has also been studied extensively [63, 67, 96
]. In particular, Chaudhuri and his co-authors have explicitly studied the idea of stratification for approximating database queries [18-20]. However, there is a key difference between that work and our own: these existing papers focus on how to break the data into strata, and not on how to sample the strata in a robust fashion. In that sense, our work is completely orthogonal to Chaudhuri et al.'s prior work, and our sampling plans could easily be used in conjunction with the workload-based stratifications that their methods can construct.

1. IMDB data set. http://www.imdb.com

2. Person data set. http://usa.ipums.org/usa

3. Synoptic cloud report data set. http://cdiac.ornl.gov/epubs/ndp/ndp026b/ndp026b.htm

4. Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: Tech. Report, Bell Laboratories, Murray Hill, New Jersey (1999)

5. Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: SIGMOD, pp. 487-498 (2000)

6. Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: Join synopses for approximate query answering. In: SIGMOD, pp. 275-286 (1999)

7. Alon, N., Gibbons, P.B., Matias, Y., Szegedy, M.: Tracking join and self-join sizes in limited storage. In: PODS, pp. 10-20 (1999)

8. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. In: STOC, pp. 20-29 (1996)

9. Antoshenkov, G.: Random sampling from pseudo-ranked B+ trees. In: VLDB, pp. 375-382 (1992)

10. Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: SIGMOD, pp. 539-550 (2003)

11. Bradley, P.S., Fayyad, U.M., Reina, C.: Scaling clustering algorithms to large databases. In: KDD, pp. 9-15 (1998)

12. Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: ICDE, p. 6 (2006)

13. Bunge, J., Fitzpatrick, M.: Estimating the number of species: A review. Journal of the American Statistical Association 88, 364-373 (1993)

14. Carlin, B., Louis, T.: Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall (1996)

15.
Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. The VLDB Journal 10(2-3), 199-223 (2001)

16. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.: Towards estimation error guarantees for distinct values. In: PODS, pp. 268-279 (2000)

17. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.: Towards estimation error guarantees for distinct values. In: PODS, pp. 268-279 (2000)

18. Chaudhuri, S., Das, G., Datar, M., Motwani, R., Narasayya, V.R.: Overcoming limitations of sampling for aggregation queries. In: ICDE, pp. 534-542 (2001)

19. Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach for approximate answering of aggregate queries. In: SIGMOD, pp. 295-306 (2001)

20. Chaudhuri, S., Das, G., Narasayya, V.: Optimized stratified sampling for approximate query processing. ACM TODS, To Appear (2007)

21. Chaudhuri, S., Das, G., Srivastava, U.: Effective use of block-level sampling in statistics estimation. In: SIGMOD, pp. 287-298 (2004)

22. Chaudhuri, S., Motwani, R.: On sampling and relational operators. IEEE Data Eng. Bull. 22(4), 41-46 (1999)

23. Chaudhuri, S., Motwani, R., Narasayya, V.: Random sampling for histogram construction: how much is enough? SIGMOD Rec. 27(2), 436-447 (1998)

24. Chaudhuri, S., Motwani, R., Narasayya, V.: On random sampling over joins. In: SIGMOD, pp. 263-274 (1999)

25. Cochran, W.: Sampling Techniques. Wiley and Sons (1977)

26. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B. 39 (1977)

27. Diwan, A.A., Rane, S., Seshadri, S., Sudarshan, S.: Clustering techniques for minimizing external path length. In: VLDB, pp. 342-353 (1996)

28. Dobra, A.: Histograms revisited: when are histograms the best approximation method for aggregates over joins? In: PODS, pp. 228-237 (2005)

29. Dobra, A., Garofalakis, M., Gehrke, J., Rastogi, R.: Processing complex aggregate queries over data streams. In: SIGMOD Conference, pp. 61-72 (2002)

30. Domingos, P.: Bayesian averaging of classifiers and the overfitting problem. In: 17th International Conf. on Machine Learning (2000)

31. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman & Hall/CRC (1998)

32.
Ericson, W.A.: Optimum stratified sampling using prior information. JASA 60(311), 750-771 (1965)

33. Evans, M., Hastings, N., Peacock, B.: Statistical Distributions. Wiley and Sons (2000)

34. Fan, C., Muller, M., Rezucha, I.: Development of sampling plans by using sequential (item by item) selection techniques and digital computers. Journal of the American Statistical Association 57, 387-402 (1962)

35. Ganguly, S., Gibbons, P., Matias, Y., Silberschatz, A.: Bifocal sampling for skew-resistant join size estimation. In: SIGMOD, pp. 271-281 (1996)

36. Ganti, V., Gehrke, J., Ramakrishnan, R.: Cactus: clustering categorical data using summaries. In: KDD, pp. 73-83 (1999)

37. Ganti, V., Lee, M.L., Ramakrishnan, R.: ICICLES: self-tuning samples for approximate query answering. In: VLDB, pp. 176-187 (2000)

38. Garcia-Molina, H., Widom, J., Ullman, J.D.: Database System Implementation. Prentice Hall, Inc. (1999)

39. Gelman, A., Carlin, J., Stern, H., Rubin, D.: Bayesian Data Analysis, Second Edition. Chapman & Hall/CRC (2003)

40. Gibbons, P.B., Matias, Y.: New sampling-based summary statistics for improving approximate query answers. In: SIGMOD, pp. 331-342 (1998)

41. Gibbons, P.B., Matias, Y., Poosala, V.: Aqua project white paper. In: Technical Report, Bell Laboratories, Murray Hill, New Jersey, pp. 275-286 (1999)

42. Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.: Optimal and approximate computation of summary statistics for range aggregates. In: PODS (2001)

43. Goodman, L.: On the estimation of the number of classes in a population. Annals of Mathematical Statistics 20, 272-579 (1949)

44. Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In: ICDE, pp. 152-159 (1996)

45. Guha, S., Koudas, N., Srivastava, D.: Fast algorithms for hierarchical range histogram construction. In: PODS, pp. 180-187 (2002)

46. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMOD Conference, pp. 47-57 (1984)

47. Haas, P., Hellerstein, J.: Ripple joins for online aggregation. In: SIGMOD Conference, pp. 287-298 (1999)

48.
Haas, P., Naughton, J., Seshadri, S., Stokes, L.: Sampling-based estimation of the number of distinct values of an attribute. In: 21st International Conference on Very Large Databases, pp. 311-322 (1995)

49. Haas, P., Naughton, J., Seshadri, S., Stokes, L.: Sampling-based estimation of the number of distinct values of an attribute. In: VLDB, pp. 311-322 (1995)

50. Haas, P., Stokes, L.: Estimating the number of classes in a finite population. Journal of the American Statistical Association 93, 1475-1487 (1998)

51. Haas, P.J.: Large sample and deterministic confidence intervals for online aggregation. In: Statistical and Scientific Database Management, pp. 51-63 (1997)

52. Haas, P.J.: The need for speed: Speeding up DB2 using sampling. IDUG Solutions Journal 10, 32-34 (2003)

53. Haas, P.J., Hellerstein, J.: Join algorithms for online aggregation. IBM Research Report RJ10126 (1998)

54. Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. In: SIGMOD, pp. 287-298 (1999)

55. Haas, P.J., Koenig, C.: A bi-level Bernoulli scheme for database sampling. In: SIGMOD, pp. 275-286 (2004)

56. Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Fixed-precision estimation of join selectivity. In: PODS, pp. 190-201 (1993)

57. Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Selectivity and cost estimation for joins based on random sampling. J. Comput. Syst. Sci. 52(3), 550-569 (1996)

58. Haas, P.J., Naughton, J.F., Swami, A.N.: On the relative cost of sampling for join selectivity estimation. In: PODS, pp. 14-24 (1994)

59. Haas, P.J., Swami, A.N.: Sequential sampling procedures for query size estimation. In: SIGMOD, pp. 341-350 (1992)

60. Hellerstein, J., Avnur, R., Chou, A., Hidber, C., Olston, C., Raman, V., Roth, T., Haas, P.: Interactive data analysis: The CONTROL project. IEEE Computer 32(8), 51-59 (1999)

61. Hellerstein, J., Haas, P., Wang, H.: Online aggregation. In: SIGMOD Conference, pp. 171-182 (1997)

62. Hellerstein, J.M., Avnur, R., Chou, A., Hidber, C., Olston, C., Raman, V., Roth, T., Haas, P.J.: Interactive data analysis: The CONTROL project. In: IEEE Computer 32(8), pp. 51-59 (1999)

63. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: SIGMOD, pp. 171-182 (1997)

64.
Hou, W.C., Ozsoyoglu, G.: Statistical estimators for aggregate relational algebra queries. ACM Trans. Database Syst. 16(4), 600-654 (1991)

65. Hou, W.C., Ozsoyoglu, G.: Processing time-constrained aggregate queries in CASE-DB. ACM Trans. Database Syst. 18(2), 224-261 (1993)

66. Hou, W.C., Ozsoyoglu, G., Dogdu, E.: Error-constrained COUNT query evaluation in relational databases. SIGMOD Rec. 20(2), 278-287 (1991)

67. Hou, W.C., Ozsoyoglu, G., Taneja, B.K.: Statistical estimators for relational algebra expressions. In: PODS, pp. 276-287 (1988)

68. Hou, W.C., Ozsoyoglu, G., Taneja, B.K.: Processing aggregate relational queries with hard time constraints. In: SIGMOD, pp. 68-77 (1989)

69. Huang, H., Bi, L., Song, H., Lu, Y.: A variational EM algorithm for large databases. In: International Conference on Machine Learning and Cybernetics, pp. 3048-3052 (2005)

70. Ioannidis, Y.E.: Universality of serial histograms. In: VLDB, pp. 256-267 (1993)

71. Ioannidis, Y.E., Poosala, V.: Histogram-based approximation of set-valued query answers. In: VLDB (1999)

72. Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.: Optimal histograms with quality guarantees. In: VLDB, pp. 275-286 (1998)

73. Jermaine, C., Dobra, A., Arumugam, S., Joshi, S., Pol, A.: A disk-based join with probabilistic guarantees. In: SIGMOD, pp. 563-574 (2005)

74. Jermaine, C., Dobra, A., Arumugam, S., Joshi, S., Pol, A.: The sort-merge-shrink join. ACM Trans. Database Syst. 31(4), 1382-1416 (2006)

75. Jermaine, C., Dobra, A., Pol, A., Joshi, S.: Online estimation for subset-based SQL queries. In: 31st International Conference on Very Large Databases, pp. 745-756 (2005)

76. Jermaine, C., Pol, A., Arumugam, S.: Online maintenance of very large random samples. In: SIGMOD, pp. 299-310. ACM Press, New York, NY, USA (2004)

77. Kempe, D., Dobra, A., Gehrke, J.: Gossip-based computation of aggregate information. In: FOCS, pp. 482-491 (2003)

78. Krewski, D., Platek, R., Rao, J.: Current Topics in Survey Sampling. Academic Press (1981)

79. Lakshmanan, L.V.S., Pei, J., Han, J.: Quotient cube: How to summarize the semantics of a data cube. In: VLDB, pp. 778-789 (2002)

80.
Lakshmanan, L.V.S., Pei, J., Zhao, Y.: QC-trees: An efficient summary structure for semantic OLAP. In: SIGMOD, pp. 64-75 (2003)

81. Leutenegger, S.T., Edgington, J.M., Lopez, M.A.: STR: A simple and efficient algorithm for R-tree packing. In: ICDE, pp. 497-506 (1997)

82. Ling, Y., Sun, W.: A supplement to sampling-based methods for query size estimation in a database system. SIGMOD Rec. 21(4), 12-15 (1992)

83. Lipton, R., Naughton, J.: Query size estimation by adaptive sampling. In: PODS, pp. 40-46 (1990)

84. Lipton, R., Naughton, J., Schneider, D.: Practical selectivity estimation through adaptive sampling. In: SIGMOD Conference, pp. 1-11 (1990)

85. Lipton, R.J., Naughton, J.F.: Estimating the size of generalized transitive closures. In: VLDB, pp. 165-171 (1989)

86. Lipton, R.J., Naughton, J.F.: Query size estimation by adaptive sampling. J. Comput. Syst. Sci. 51(1), 18-25 (1995)

87. Luo, G., Ellmann, C.J., Haas, P.J., Naughton, J.F.: A scalable hash ripple join algorithm. In: SIGMOD, pp. 252-262 (2002)

88. Matias, Y., Vitter, J., Wang, M.: Wavelet-based histograms for selectivity estimation. In: SIGMOD Conference, pp. 448-459 (1998)

89. Matias, Y., Vitter, J.S., Wang, M.: Wavelet-based histograms for selectivity estimation. SIGMOD Record 27(2), 448-459 (1998)

90. Mingoti, S.: Bayesian estimator for the total number of distinct species when quadrat sampling is used. Journal of Applied Statistics 26(4), 469-483 (1999)

91. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, New York (1995)

92. Muralikrishna, M., DeWitt, D.: Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In: SIGMOD Conference, pp. 28-36 (1988)

93. Muth, P., O'Neil, P.E., Pick, A., Weikum, G.: Design, implementation, and performance of the LHAM log-structured history data access method. In: VLDB, pp. 452-463 (1998)

94. Naughton, J.F., Seshadri, S.: On estimating the size of projections. In: ICDT: Proceedings of the Third International Conference on Database Theory, pp. 499-513 (1990)

95. Neal, R., Hinton, G.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models (1998)

96. Olken, F.: Random sampling from databases. In: Ph.D. Dissertation (1993)

97.
Olken, F.: Random sampling from databases. Tech. Rep. LBL-32883, Lawrence Berkeley National Laboratory (1993)

98. Olken, F., Rotem, D.: Simple random sampling from relational databases. In: VLDB, pp. 160-169 (1986)

99. Olken, F., Rotem, D.: Random sampling from B+ trees. In: VLDB, pp. 269-277 (1989)

100. Olken, F., Rotem, D.: Sampling from spatial databases. In: ICDE, pp. 199-208 (1993)

101. Olken, F., Rotem, D., Xu, P.: Random sampling from hash files. In: SIGMOD, pp. 375-386 (1990)

102. Piatetsky-Shapiro, G., Connell, C.: Accurate estimation of the number of tuples satisfying a condition. In: SIGMOD, pp. 256-276 (1984)

103. Rao, T.J.: On the allocation of sample size in stratified sampling. Annals of the Institute of Statistical Mathematics 20, 159-166 (1968)

104. Rao, T.J.: Optimum allocation of sample size and prior distributions: A review. International Statistical Review 45(2), 173-179 (1977)

105. Roussopoulos, N., Kotidis, Y., Roussopoulos, M.: Cubetree: organization of and bulk incremental updates on the data cube. In: SIGMOD, pp. 89-99 (1997)

106. Rowe, N.C.: Top-down statistical estimation on a database. SIGMOD Record 13(4), 135-145 (1983)

107. Rowe, N.C.: Antisampling for estimation: an overview. IEEE Trans. Softw. Eng. 11(10), 1081-1091 (1985)

108. Rusu, F., Dobra, A.: Statistical analysis of sketch estimators. In: To Appear, SIGMOD (2007)

109. Sarndal, C., Swensson, B., Wretman, J.: Model Assisted Survey Sampling. Springer, New York (1992)

110. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD, pp. 23-34 (1979)

111. Severance, D.G., Lohman, G.M.: Differential files: Their application to the maintenance of large databases. ACM Trans. Database Syst. 1(3), 256-267 (1976)

112. Shao, J.: Mathematical Statistics. Springer-Verlag (1999)

113. Sismanis, Y., Deligiannakis, A., Roussopoulos, N., Kotidis, Y.: Dwarf: Shrinking the petacube. In: SIGMOD, pp. 464-475 (2002)

114. Sismanis, Y., Roussopoulos, N.: The polynomial complexity of fully materialized coalesced cubes. In: VLDB, pp. 540-551 (2004)

115.
Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E., O'Neil, P., Rasin, A., Tran, N., Zdonik, S.: C-store: a column-oriented DBMS. In: VLDB, pp. 553-564 (2005)

116. Thiesson, B., Meek, C., Heckerman, D.: Accelerating EM for large databases. Mach. Learn. 45(3), 279-299 (2001)

117. Thorup, M., Zhang, Y.: Tabulation-based 4-universal hashing with applications to second moment estimation. In: SODA, pp. 615-624 (2004)

118. Vitter, J.S., Wang, M.: Approximate computation of multidimensional aggregates of sparse data using wavelets. SIGMOD Rec. 28(2), 193-204 (1999)

119. Vitter, J.S., Wang, M., Iyer, B.: Data cube approximation and histograms via wavelets. In: CIKM, pp. 96-104 (1998)

120. Vysochanskii, D., Petunin, Y.: Justification of the 3-sigma rule for unimodal distributions. Theory of Probability and Mathematical Statistics 21, 25-36 (1980)

121. Yu, X., Zuzarte, C., Sevcik, K.C.: Towards estimating the number of distinct value combinations for a set of attributes. In: CIKM, pp. 656-663 (2005)

Shantanu Joshi received his Bachelor of Engineering in Computer Science from the University of Mumbai, India in 2000. After a brief stint of one year at Patni Computer Systems in Mumbai, he joined the graduate school at the University of Florida in fall 2001, where he received his Master of Science (MS) in 2003 from the Department of Computer and Information Science and Engineering. In the summer of 2006, he was a research intern at the Data Management, Exploration and Mining Group at Microsoft Research, where he worked with Nicolas Bruno and Surajit Chaudhuri. Shantanu will receive a Ph.D. in Computer Science in August 2007 from the University of Florida and will then join the Database Server Manageability group at Oracle Corporation as a member of technical staff.