[DAITSS ingest report (XML, schema http://www.fcla.edu/dls/md/daitss/daitssReport.xsd): INGEST IEID E20101129_AAAAAO, INGEST_TIME 2010-11-29T22:36:51Z, PACKAGE UFE0021132_00001; ACCOUNT UF, PROJECT UFDC. The report enumerates the package's preservation files (pol_a_Page_*.jpg, *.jp2, *.tif, *.pro, *.txt) with DFIDs, byte sizes, and MD5/SHA-1 message digests; the full machine-generated manifest is omitted here.]
472dc9b98191ca3fe123fe24efaaf645 33e470fdd20e15c205314d777356d6749cf7da39 57542 F20101129_AAAIZE pol_a_Page_026.pro 7c2f3204a8158b50b210e339ff20dc80 efb1734cb1e5942a64dfbf6d37cb00a27c8da96e 1854 F20101129_AAAJDX pol_a_Page_074.txt 4e5413a8fd288c02d05a057aa1e08ab0 1236e5b87d64cf0ab1944f630b38d721eb538e9e 14502 F20101129_AAAIYR pol_a_Page_011.pro c2067ca9eaea3d8a5bdeda3865b8f641 7ae9653eeda57868c5778b66d5c9f0166060f4b0 687 F20101129_AAAJFA pol_a_Page_115.txt 403b584e3a6140c81858c9fb81a5aff2 708d5349e9c6d2ef169eced61dc8c1f8eeb164df 2165 F20101129_AAAJEL pol_a_Page_092.txt 1d8bfb18d0279555d96edfaf23413e13 0d4026b955b3f855d396964cd0849299157d9c64 22129 F20101129_AAAIZF pol_a_Page_027.pro 5d2b62d67284a26616520284de6460b8 9a62856eccedffa892f19036cff2e9f456f1b990 2075 F20101129_AAAJDY pol_a_Page_076.txt 74ecbc92a5e79a59db6a169a4156b754 0737098361c62a4caac394575d09a3c411bcb926 64312 F20101129_AAAIYS pol_a_Page_012.pro fdfbd1774c8d6143c2fca57566c290b7 a6799274aa18b421e18dcdf7edff4ea94edf3818 2426 F20101129_AAAJFB pol_a_Page_116.txt d6f8c8d6f2d73b2ccafbc9016f1f2159 1ac25b005949a6adc4d3e37e2f84f238f6cb41db 1953 F20101129_AAAJEM pol_a_Page_093.txt ae43c78bc63869088a97f07470e3e77a dee9966519b04d51070f926a9e3612da373d1792 47932 F20101129_AAAIZG pol_a_Page_028.pro bb56c00c93c8ce7013a43e725a6ef768 ce8b3292a32fd6c6c51165fc6548414c57f7fb56 62395 F20101129_AAAIYT pol_a_Page_013.pro f379e6a8f46878133f92e436d01e053e 8a3a2b2e089bfe3f460d89bd2d966e7d5ff9d393 2511 F20101129_AAAJFC pol_a_Page_120.txt eb75a61797a2c97e39c0e5364622a40c 07125fb0d09260cf91bb5dd0492128806f83a6ee 2530 F20101129_AAAJEN pol_a_Page_094.txt b3580c1dce76659e8496809a4593aa7f c7ac010b2a1a66131aae04089a8587ee70a15333 54791 F20101129_AAAIZH pol_a_Page_029.pro 9a2a8d28f814b730ad5aec5a1da9d084 4be41ac626c7a7b57caf238e2e6fe32225673717 1710 F20101129_AAAJDZ pol_a_Page_077.txt b37743cd0740e3d07dd127fa030a96f6 c8db92bd0af4fbd8564de50da4620f3ace6060ef 65347 F20101129_AAAIYU pol_a_Page_014.pro d6c0605d135fc84219653a7a51e874a5 
749659cc2d360a9aabcbeb6efc468f7629753ceb 1834 F20101129_AAAJFD pol_a_Page_122.txt b12468731020dcba567befdb9f53f4a9 a9b897e50a62ae0d0edf2c05bfcca22423faec99 2375 F20101129_AAAJEO pol_a_Page_097.txt b51b907615fe644a016babcdd3d1a47f 7bce775ba59eb0ea991d35d3f238badd93785ea5 48899 F20101129_AAAIZI pol_a_Page_030.pro d37ae1358927e2e5aefb7e4cd135d0ed dcfa2e047254d8e563729bbe4eeae1237cc690a7 63447 F20101129_AAAIYV pol_a_Page_015.pro e4e926024d602204ed34da162a28c208 513dade5c2be38e0e810cd93a75db47572faaec8 729421 F20101129_AAAJFE pol_a.pdf e7427db401ee3dd485d39182e790a835 bad3bd08a01b4e204532c1e1b898614c9acfce8a 429 F20101129_AAAJEP pol_a_Page_098.txt e07068399cb164b1422b3dbbf4e9741d a25041a0880f8ffe060b06a2387ea9f81dffa3ee 55812 F20101129_AAAIZJ pol_a_Page_031.pro a10bd491b8b0dc0bb8a55372a26bdc27 2cb9c76c409394337531b6b8657e8d5286857be3 39481 F20101129_AAAIYW pol_a_Page_017.pro 60a2a19023ddbd13a94461a18ecd8775 dddce70f68f28c7ad8941b15c1b3ee152cee793d 6276 F20101129_AAAJFF pol_a_Page_065thm.jpg ae241ef1306459d46316824f2eaa76ee 589d8b3ebda84eb201791323fede5d7e9477e089 2371 F20101129_AAAJEQ pol_a_Page_099.txt eda22423ff9d8c067cb81ccf1ea6f117 d25c58d3b6b6e10f9a600f3bd06b650934515ec8 64715 F20101129_AAAIZK pol_a_Page_032.pro 99729cb43532d23bbf96c01f86acded3 1aae1af30da2e9efc87667127307e2530c9c55cb 62460 F20101129_AAAIYX pol_a_Page_018.pro 466ffe502475b669bf1b666df7af0bfb cb977b4ce2f52043b67339e0abd3e068fff91746 6773 F20101129_AAAJFG pol_a_Page_033thm.jpg 8bce621e0d7cb6144baaf25acf5804cb 1ea7cb19a901642ef0335e128eced6dbd6ac167c 2510 F20101129_AAAJER pol_a_Page_100.txt f222c667bf2814523ed442fea03dbe3e 09ae15f2d8bdbf9749157c1738e46cdc39b7c6ac 62155 F20101129_AAAIZL pol_a_Page_034.pro 99c63b07a537d1b076193642044bd1bb 4c69653c94603abb35b6875566dee2f172d984d2 65415 F20101129_AAAIYY pol_a_Page_019.pro 02952a6da62768144551d66d8299c058 e8a85f25b106130730b99e38a72362adbeb9da52 623 F20101129_AAAJES pol_a_Page_101.txt ece61a758de4b610ddadcc1ea1622a0e 
527f8ec9573aff572f07c664a665afcc1b4f5690 67067 F20101129_AAAIZM pol_a_Page_035.pro 06f39a81f595fa25d866cbf6bbd432e7 76136f48a317b937c913f99dc801360b793a3340 66377 F20101129_AAAIYZ pol_a_Page_020.pro 430aec24b39370e6278549c3937ec2d2 e00e67a3573749afab1b6839c25f7626241aa62f 6763 F20101129_AAAJFH pol_a_Page_113thm.jpg 024dcb8cd5bb4c2c1f8329b0a6cdd66c 07f3f72737b1b44960794fe5b202ba9b34e3cb8e 1907 F20101129_AAAJET pol_a_Page_102.txt b4e212600971fdf015652335d4af2d46 562427332f24106b937f53e7e9d1f2e66891a992 59972 F20101129_AAAIZN pol_a_Page_037.pro 7c17ffc85ab8c2574ca4a7076aadcec9 a0520f74059b5fff5427802170297e52e91fcdee 28993 F20101129_AAAJFI pol_a_Page_058.QC.jpg 16ccb801ec5943a90cc4583fc7fff4e2 9c2568d90b1ad167415ca88a30058159ac9274ab 2144 F20101129_AAAJEU pol_a_Page_105.txt 19fc6c64d17c06eea329cb5c21d66c14 3e9cb73876d064323562b342d053243187eaaae5 31497 F20101129_AAAIZO pol_a_Page_038.pro 8aa05dbf904061911764681a08c65821 1fab6a1518406676ee6c6622ecb0a1b29c1babf1 23568 F20101129_AAAJFJ pol_a_Page_044.QC.jpg f02b90bef89ad048684afc8d6399fa63 6b30261c2f5bd9ceeb4a953274c99f2dd9f6b47a 2167 F20101129_AAAJEV pol_a_Page_106.txt 30f1d4c4368113ca12adc646a5a70f66 d6450d474a2da1cd93cc056fe580cad47df4a58b 10115 F20101129_AAAIZP pol_a_Page_039.pro 4744ac2e12617040732250b5cc9534cb 36b60b3775f687b445a30a4c376494f23127dc67 21717 F20101129_AAAJFK pol_a_Page_041.QC.jpg 0862959ab0c1807c2d809d7ba65ab6c2 956fda07bd3046c931982c4908c4cca57b4fea28 2269 F20101129_AAAJEW pol_a_Page_109.txt adeefae8af2ca5de3ffa387e46a9940b f435aff77825d3cbaa22c4d82e8db6f3260a2751 62206 F20101129_AAAIZQ pol_a_Page_040.pro 4987fa67f404978dd2fe24eeaee73d21 c52ee37cd7f9eaff29c5e15c2dd840bbb94e5995 7001 F20101129_AAAJFL pol_a_Page_051thm.jpg 3bbe2be3134206d9e838f99094c864c3 b48731d9647b6aad59cad360c9880b9363940c9d 2473 F20101129_AAAJEX pol_a_Page_111.txt aa401f87a67d8deeaa56da8b4bdb21dc f857e0f35dc333758eda6a1bcf7a921bab75051e 48568 F20101129_AAAIZR pol_a_Page_041.pro 66f3e47863a8e7221678eac3046cfdc4 
c872ea0f68f9f29b7b1ef930a81827c0910a2f85 6485 F20101129_AAAJGA pol_a_Page_062thm.jpg c04c05d7a151154762c4e92675e268ae 604794643512da203f0d59a0da77438def37ffd2 6915 F20101129_AAAJFM pol_a_Page_118thm.jpg dd048b57aa216e5f826f2220d2f34ddf f37db309a6b2315e425b0d6431dfa8cdb61f1a37 2422 F20101129_AAAJEY pol_a_Page_113.txt 6c55154e71f797a5c3e41b0944044e4a f3fcad0eeb178f303aa654bf6dc932f65f18297c 11988 F20101129_AAAIZS pol_a_Page_043.pro 760e90ae001616d259d03a37306322f5 76a6a9755911af3a9f62e5a55c20b9feee589eed 7648 F20101129_AAAJGB pol_a_Page_032thm.jpg 0283f6dd66d2f042904a8e261445a67b 96e8017d0058ff9a0666bc33d233e546eb71eacd 29055 F20101129_AAAJFN pol_a_Page_012.QC.jpg fe731eb33d5400add87a84754a836ecc fd42f9ce15cdadd2d6023e74f3a0e11106b0f97f 2547 F20101129_AAAJEZ pol_a_Page_114.txt a29afde3dc71dfa46f28074b78154209 8e49b1be529f6d35378c1385405224a8b2d1a527 60593 F20101129_AAAIZT pol_a_Page_045.pro 2e941641d9b15f78ed4bc25830a9b504 6a7fdf0779a2c6710e853f00b510c57ed8df332f 7669 F20101129_AAAJGC pol_a_Page_035thm.jpg c9c659a2eecc5be33d140ec12fcc6f08 c2f23f9151851e10cf07edb9d575546781c3fb84 6660 F20101129_AAAJFO pol_a_Page_116thm.jpg f40d8650bf2d973c18f36c2cf0d3a6b0 9fc9dd227330678d209d2c0b9259b3086ea452d1 65898 F20101129_AAAIZU pol_a_Page_046.pro a829e78647b8db575d45090b63639b78 f8951d2d58416d2b7d4ac76e66f37784a741c2ad 26746 F20101129_AAAJGD pol_a_Page_061.QC.jpg e4ae68faee5aadf6f899bde46331c1b6 8762ce711a6b74c0970f032e1156dc778a171f23 5511 F20101129_AAAJFP pol_a_Page_112thm.jpg 71ddefc2c9df4f2946fb1ddcf86f5bfa e5de8161cf162a0996f23de6fa1b54fc534d8de8 58271 F20101129_AAAIZV pol_a_Page_047.pro b036e4161c9bf95a20d684e6849e4af5 f32f0e9220f64f950812db662dfab4cc0bce2939 26613 F20101129_AAAJGE pol_a_Page_075.QC.jpg 18bec1fe1883eef29713334c1b51db33 72567c870ca9173e21e275772d5fdb53a5e5d6b5 5901 F20101129_AAAJFQ pol_a_Page_107thm.jpg 517bee17719a42d38f9db8e57ce8c89a eac40680ffc82ca505c97d25910b013f8eb58d8d 59358 F20101129_AAAIZW pol_a_Page_049.pro 9dfa60abede754d1daa98630ae3ba9ce 
3ba63a10ca6bbc62877f2b6bedbb65e2af9a207a 4188 F20101129_AAAJGF pol_a_Page_070thm.jpg 940bd426f7d5fed11719a48e84b6d546 ca26de83bf9ea1c8789a9d2fdd35bd015e980769 7130 F20101129_AAAJFR pol_a_Page_059thm.jpg 09fa012676eea7a8cb8b18d09e15d220 bc9e41f62329b9a8674b622f2d5cca28016f0cc6 55901 F20101129_AAAIZX pol_a_Page_051.pro 1347c3203fa201d553c0847615ba18ef 63338c1b89c48af4ebd3496e7335d3fc8c2abd4c 1333 F20101129_AAAJGG pol_a_Page_002thm.jpg 5836e4dba3af7d19754b1d387517bf17 034ac08e47aa63facad66c673a89eb4782881895 5218 F20101129_AAAJFS pol_a_Page_078thm.jpg 0c62a7e6d4b139f30ce578d5f44f70d0 6eb24f56185c12e3add39d7fc3903bd2aa043dfa 62313 F20101129_AAAIZY pol_a_Page_052.pro 87be44081b0adc4bc8f94aa76919539d bcebc51f55e9c8c851104629beec0b7aefebcead 6359 F20101129_AAAJGH pol_a_Page_044thm.jpg 303417f81cfb6b49f090ae693bc04612 4dbbf5478a310a0d8d154fa9e9eed00f7c8d0595 7037 F20101129_AAAJFT pol_a_Page_037thm.jpg c385638af7efbbcdd5097c6af7bf70f8 86422db7ab65d84d7295a86a20f2371f9bee26d6 59931 F20101129_AAAIZZ pol_a_Page_053.pro e77a6a762632242b96247f7b27d549f0 86e3a23f73068e197407a52eef7bf1333ccc2eff 2666 F20101129_AAAJFU pol_a_Page_011thm.jpg 589f6c7a9dbb5891d082c967b6a080a4 71baa64bc7e1bf455e72711a3bdfd79ec3693e0a 9649 F20101129_AAAJGI pol_a_Page_007.QC.jpg 22817d9b06cff29e0888f48a8cb4af6e 4a6e92ff86c804f524d78b10477a0bd31c0f2e30 3892 F20101129_AAAJFV pol_a_Page_039thm.jpg ba486ddb703e271301268da91bb8f93e eb664dd9c047b702614c7d9928428cdd21f2b660 2715 F20101129_AAAJGJ pol_a_Page_007thm.jpg 4c8952662f903e6264a4949f83afd2b4 83c676e8c3d5d84fb2971f989428b0d774163c43 24033 F20101129_AAAJFW pol_a_Page_005.QC.jpg 7a935e9ebd4ccaeb7dc174aee86d7d7a 35681c7cbceb75a735a96de7ed579336d1e0213a 6862 F20101129_AAAJGK pol_a_Page_061thm.jpg e2633862ce8fc0595d0b663cde7dd9d2 a4f54e1a5bbf14bc6ebcf6facd70b143b2f4ab13 22101 F20101129_AAAJFX pol_a_Page_107.QC.jpg d5337790bb61de7d5ef73e10d265a044 b86b8a9df2ea751b48a3309ef50f58ab1c9fe1bc 23924 F20101129_AAAJHA pol_a_Page_066.QC.jpg 
2b55d257130fc9deb80d4e1528d7f8be 63074c00b590f6fabc002c3b48646afd0a862621 19421 F20101129_AAAJGL pol_a_Page_060.QC.jpg b4322f2f5dcb95a3f0b44645b2c4aee7 c1193b14eca0a8faafb38211cf338669cb774b24 26930 F20101129_AAAJFY pol_a_Page_120.QC.jpg e897a9ce8ce9ce0daec98e2b10f9f4a4 b0e69a2390d680cd0bb8d61c2dbe34bde601064f 6134 F20101129_AAAJHB pol_a_Page_066thm.jpg 494df1dd9ee6e4fdff99ca47eb144753 31d1f1c76f0614d289f53bb55218488df19efaac 16774 F20101129_AAAJGM pol_a_Page_068.QC.jpg 7b9bf160406a98c819d1cfe5e5f25a84 43d874b0a015bc9651a09f36f1aebd6925f2c584 7464 F20101129_AAAJFZ pol_a_Page_014thm.jpg 7eedae8c839ad67d0a415e6844bc573f fb57f7c8d2f150231ca34226ed8edebf520d2cf5 21527 F20101129_AAAJHC pol_a_Page_088.QC.jpg d11d6ce16bff31446fb312329d6a8a6c 707edacc508c9bda4804b503e009bf40ef9c329b 24387 F20101129_AAAJGN pol_a_Page_095.QC.jpg 42f64cd1d520f6a1ceb777da2b1abf72 e3919a92efde91d6dcadb8746b443d1c10ff5a2d 7996 F20101129_AAAJHD pol_a_Page_008.QC.jpg 5bca18d55b59361fe98c7b9843f4d933 0732ecc8779bb2e0f92289ab71148ebf1520dbfd 28401 F20101129_AAAJGO pol_a_Page_096.QC.jpg 394da0fa684ca06ec463822b4ec2ed53 b413317f3b2cc0cc2673f3bca301a99873c0d473 7603 F20101129_AAAJHE pol_a_Page_025thm.jpg c8cc48572fb7f2b12b0d5f4a28b9c4f0 10e6dc66833f3e33e4548f556b6ca1c1f26d8e5d 24341 F20101129_AAAJGP pol_a_Page_030.QC.jpg 2174fb35ad7e2ca732050bb4d379b90d b278fd81a18e48c4e692abe685783960d0af2925 20160 F20101129_AAAJHF pol_a_Page_038.QC.jpg a29e7a537990c6a9edc8a33e743dfb9d d5df272e2bad3c9df94202d92d731b34d8f7e301 5524 F20101129_AAAJGQ pol_a_Page_077thm.jpg 0add00e20116d64558e05425f405ab99 547b8de562218c6e6723b9a38f55d595a66e7476 14514 F20101129_AAAJHG pol_a_Page_072.QC.jpg 311e249e462cf3a4e3ae037dd3576898 795b83839fc80aee641e7d7a9d696d4871dd1bfc 6905 F20101129_AAAJGR pol_a_Page_048thm.jpg b6a302be026612df2597d3c2904b6f37 0e1708fec3e0978fe2d0430adbd24d0cd65066a3 7252 F20101129_AAAJHH pol_a_Page_094thm.jpg f65c9144de6aa4b0a72bf7f028574b5c d2390e016e3cd1d442bd0f02e3a68c2366aadf33 29450 F20101129_AAAJGS 
pol_a_Page_091.QC.jpg c5a6b462131aa3edeecfaa0e29f1f46b 3198a3ef145d54761469f9f5666d607161044bea 19022 F20101129_AAAJHI pol_a_Page_064.QC.jpg e4bf38bd5df0ce69403551066bbff22d 6c4fbfae306f67ac59ba034ce0e3d631b1c3f3b7 7392 F20101129_AAAJGT pol_a_Page_090thm.jpg 04714e9523e37e45ab91626aa85f450c 895e3d96fd51cfb7a10c6713d22b06691e2a3420 23424 F20101129_AAAJGU pol_a_Page_028.QC.jpg 5d843b3d3b118e71868c290897554bb2 6e5f45cae48719a8981fb8d8b78510bb2c872323 7102 F20101129_AAAJHJ pol_a_Page_099thm.jpg 9feb0987d6f77600137a2a50758cba36 5547a88db7dff81742bae21259e12f45284198e0 1351 F20101129_AAAJGV pol_a_Page_003thm.jpg 8f6e54e3c3b29eaeb8add38c54a8f585 eaecf05ae2cb9fa41d8f7c7b292cddff8ee85f26 4060 F20101129_AAAJHK pol_a_Page_009thm.jpg 424f653800589622017a7b47333d4f53 81c8b0820d879f293e7a8f735b122ede276e17f0 29982 F20101129_AAAJGW pol_a_Page_052.QC.jpg a0dc0074c8e4554cf064b851366333ec f38ffd7755ada4218114e6ba5323ab3e6debfa01 5885 F20101129_AAAJIA pol_a_Page_010thm.jpg 59ec44f8efcd01aeeac87b296e928999 69bdf7054c72a5c3d77ac3d4d8cd11c3c64eaa68 7017 F20101129_AAAJHL pol_a_Page_019thm.jpg 0cb4d7e9cc70c9aed66fe7393ddd2ba5 6514eaad941a96edf9fc423932022e66625792ec 25946 F20101129_AAAJGX pol_a_Page_018.QC.jpg 94552078a294edc806a6f7d15d1e14eb a82cd62ce0ce3f02559bca58b6422138b59f04f8 5818 F20101129_AAAJIB pol_a_Page_028thm.jpg a26efff575003fcaed1b4cd8ef71ecc7 331a1f12bebf5379cf53e96995965d969103b735 4931 F20101129_AAAJHM pol_a_Page_082thm.jpg 6e263b2fc9a31243277181c1d75c05d3 be7bf596d363346a50c37a741e34e1318ad62232 30227 F20101129_AAAJGY pol_a_Page_023.QC.jpg 38ae783b77ce85ea0ca08d3c033744e9 2aebe72069dc14b69b47b9293273181f3d151472 F20101129_AAAIFA pol_a_Page_033.tif 8ad280ddb0bec755586071d1562af426 1fda9e26998440b933f3022c508fabb2e7447c4a 27745 F20101129_AAAJIC pol_a_Page_099.QC.jpg 9a7f33f89110f22c2c90f46e39143339 18a609f3c481c19b4648e87f0bceac3abb11bc99 F20101129_AAAIEL pol_a_Page_034.tif 096058d9d029350b918c0242dcfad1a1 74567feab927ea0a0bbc5e0eca2cfa77c7ff836a 16561 F20101129_AAAJHN 
pol_a_Page_069.QC.jpg 6045c7154b931be74fca7f5ea2fd69a8 fcd13ccce1fa99b284ba9bf636bf70cdba61f43c 29789 F20101129_AAAJGZ pol_a_Page_055.QC.jpg fb8e6fcf5c6332b50ba629c30e8abf3d d330616ddeaac65d248b3acb330aef2ddb91ec38 30405 F20101129_AAAIFB pol_a_Page_108.QC.jpg d1a0ab0f29eae20e229509ad987a1827 4d3c2e72259e4b93dce7c84245758446efd24238 27861 F20101129_AAAJID pol_a_Page_026.QC.jpg 3f18a0055214514a1a8a1c013427cef8 b16a1e8d4e2c59f50baa8374b914431758483159 52873 F20101129_AAAIEM pol_a_Page_087.pro 9b36e0a8e7dd2559d3fe7c967510e01e 8253c0dc78a16a823992662f300adaa784a3675e 26249 F20101129_AAAJHO pol_a_Page_062.QC.jpg cc8ace4508e7cba4e757716d98bb49e8 74fae0757871b984175a9651b4d6b9d13979b5b6 F20101129_AAAIFC pol_a_Page_087.tif ccc2a290d5fc99b4737a98b06c0515a5 1fe4d7e60dcf8e4ae7028197bf2a2e90d1436049 180505 F20101129_AAAJIE UFE0021132_00001.xml FULL c006607d8e805030a27eb223b48f7b1e 9bee4b7b5c63b14d1b704de11709dde758167c51 55688 F20101129_AAAIEN pol_a_Page_033.pro f960ef923bfa7dc82594a4a51ce630de 7ec304966aebb9c2706120c47134db86a05ecf2c 7453 F20101129_AAAJHP pol_a_Page_052thm.jpg 0c2802ddd9aa9883fa46ed04bb68e058 f26328bbf3bb982e825b10bf48a687c50af881eb 24452 F20101129_AAAIFD pol_a_Page_076.QC.jpg e1b66e0973bf67c1b32111dced638b77 efd348fb693ace43c3d9d929baa7a6dad89bcd49 5800 F20101129_AAAJIF pol_a_Page_005thm.jpg bc004184ff9e5f737ca668f03c4dc8a9 83587b7b77f82483a5d7ec9ac176e0db5e9a49ad 64767 F20101129_AAAIEO pol_a_Page_025.pro d209a6cac32f216303f55d0b9759da8f 0e3137b595bdc8249d0a5a893f5aa79217bebb6e 5722 F20101129_AAAJHQ pol_a_Page_038thm.jpg 4d0e36aa15909d7a97965531128cc2b4 61591c11ef49d74abaef9eae2c694e691339dd50 1051911 F20101129_AAAIFE pol_a_Page_042.jp2 e0aa9657ea92080dff72329271dbd14e 3143840c2b311724fdaddff8432a8b0fe2366b4f 27346 F20101129_AAAJIG pol_a_Page_006.QC.jpg a9bc2af1eb8436daecbc3865e1894b31 55bbbecc75320889164423c9d263d358feb2d120 133269 F20101129_AAAIEP pol_a_Page_086.jp2 da5548defbb67edc2df6cffdc6d111ba e15ba1a70ef3e1258a14b0387a4beee71e992d9c 27214 
F20101129_AAAJHR pol_a_Page_042.QC.jpg 1b2866b402d170c2e703a802eee3cb00 fe61331415814753dd849efb1eefe2764ae1668f F20101129_AAAIFF pol_a_Page_092.jp2 ce720c9fe99a44a4cab9e9640a46dfe7 1ba55d56f76d56307da3764e129072aa82dc2a51 6396 F20101129_AAAJIH pol_a_Page_006thm.jpg aae429cdd898e7e968f96e296319b1dc 4fcea5ec518e487b1a5aaf37de0269d0837de073 F20101129_AAAIEQ pol_a_Page_095.tif cf46f4adfd9380ee1ef29a166752578d f2bdad96ef543a69326244f2c5043d070899f4ce 4749 F20101129_AAAJHS pol_a_Page_071thm.jpg 996ce955beb0b909385b0e31088e2524 e7420d20928b4e941e031d7ecc2e101b0591c3de 21578 F20101129_AAAIFG pol_a_Page_054.QC.jpg 43fa79ae9a60a979c581fb356b36f17e 27f181ce015ffc2d85ea7c274671ecba9fc026d2 2540 F20101129_AAAJII pol_a_Page_008thm.jpg a0cd3221f94bfab85f91def7956516fc 47181a55d0753e8030967975f699cf0df452d5b0 2093 F20101129_AAAIER pol_a_Page_036.txt f62b06e074641912744bd460d2374d24 58f68e22b0b9450a6ae8f8d10aadd78671b76a8c 6981 F20101129_AAAJHT pol_a_Page_001.QC.jpg ad6ce50a6d1071fb930ff3615d2aa235 124d424f6325c075252a78a27f273190f5727dd5 27295 F20101129_AAAIFH pol_a_Page_081.QC.jpg 7c5eebcff1814c99751d24970a45ceb4 43e6aa0f3e22182bf6ed36a34be1eeb703bb92d1 15172 F20101129_AAAJIJ pol_a_Page_009.QC.jpg 724052da585e957ed0e54eb647d3034d ca99ee58b44b01b4e74bdfbab2205f25f87aa913 F20101129_AAAIES pol_a_Page_034.txt 6f7a5448a3903aff4d02ea54a7a13ae4 54b60bdd49293edd375031c9a62e026e9f5682f3 7416 F20101129_AAAJHU pol_a_Page_080thm.jpg c21279a05bfe2eb1be98b86b185a92ed 13805d76684a0f60b50f79724f3a08484f7066ef 22803 F20101129_AAAIET pol_a_Page_106.QC.jpg bc50f572e3a1fb0bb38f9c5d316c7d92 70d082fea533ad6d046260cabb835cbef458f53c 19105 F20101129_AAAJHV pol_a_Page_073.QC.jpg aef3ba7f5ed359e1882e1a34f7b1db80 4bb4ba642790500f8dd1a856611e64107f9ad095 44846 F20101129_AAAIFI pol_a_Page_106.pro 4ca6dc5f9858b62c93a79624c3f77926 58d96b3dbb0dae2f244613b1022115803c94bac5 22309 F20101129_AAAJIK pol_a_Page_010.QC.jpg 61cefdd0d4ae536f24f0dd90da4a69c3 5bdf3a0536e9aa834f85c351bc07d42bce367101 7552 F20101129_AAAIEU 
pol_a_Page_023thm.jpg 9d28160019cdc72050a7cb5b45d67965 b745b26f09dcfcfd9ca759dd5fef6ffe91618ec4 15356 F20101129_AAAJHW pol_a_Page_121.QC.jpg 05b6ee0accd2a2ad1ef28f5be7611846 dc9d70d89509378f9cf45e81d2f912b9c5c6a219 31802 F20101129_AAAJJA pol_a_Page_035.QC.jpg 95f8760664194863c4855b5cdf8d8ad8 26c0969c82121a28d0c6f006cdb339a5a76a052f 7198 F20101129_AAAIFJ pol_a_Page_042thm.jpg df7fc4579deb49263a730b6ff1194600 56a56fa7e12c0bdc8d5475de3c57aff4f4d1772f 7127 F20101129_AAAJIL pol_a_Page_013thm.jpg 60c04f780407132be7ef569df7d4cfee 88bf250ec10ebc3a7863f2a4f7abf91dd1084f0e 2289 F20101129_AAAIEV pol_a_Page_001thm.jpg c6ec37bb532f4cbf1dd509414ad6133e 1f9b4c18003a6af7986cc42c646632d779d702f6 5231 F20101129_AAAJHX pol_a_Page_102thm.jpg 31cbacea019f5ba6b34442d1441fe3a5 ee4ab074b59ebc88d1c4cbcb36c53e0db606f554 22783 F20101129_AAAJJB pol_a_Page_036.QC.jpg 0d2dd1a5a0a84c533fb1018b4bc5be21 e409ccf6caa6aa1a4b01d8296204cd388fb66194 66619 F20101129_AAAIFK pol_a_Page_122.jpg 1eb67b1185c14c6a197fb6189ec573a3 e0ef44ad99e84c5289d0d861f2da732ecda11167 28230 F20101129_AAAJIM pol_a_Page_015.QC.jpg 9b4912a6eae68aa19eb36c39ba293366 07e4dfdf572e9422bed8e5309a2461cc902edf95 56749 F20101129_AAAIEW pol_a_Page_016.pro b735045d76ed1b94e002ac2b77f639c4 8764aed0a1f8f407aa4100bb882f6fbf65a221c6 4987 F20101129_AAAJHY pol_a_Page_004thm.jpg 39270af85393d44a8c7cc8312fe9bb01 3cdb5cfbb74685fc4087f904a23289c03cfc0fee 6277 F20101129_AAAJJC pol_a_Page_036thm.jpg ac5885ece83267671c5b59eee8c096cc 9160ca6892e1f7ee8eb95ffd92fade7bd8496c1b 27183 F20101129_AAAJIN pol_a_Page_016.QC.jpg e519ef14d275cee2fa718c6dcaaed5a1 696c8dab0273d16f05aeebfa5fd98c33448fcbed 9540 F20101129_AAAIEX pol_a_Page_115.QC.jpg b3a12406fb65770b05fa443732ec80b6 ca60acd12b3802e427fa8e4a9c9635ed386cf4b0 8500 F20101129_AAAJHZ pol_a_Page_011.QC.jpg 1d7f26114d9218dadacb7c2b7604ce2b e2c55ef005c4dbd0c1082a3f144db32ecfa08617 F20101129_AAAIGA pol_a_Page_104.tif f27518c56987295228614e99336842ae 9886c58cae0d8fd8eead84b19535779b44f2ee23 1976 F20101129_AAAIFL 
pol_a_Page_112.txt b5bfa58642bb42dc2b71483334cc95a1 639f84e8455c0247618dfd7625be858d2b23252a 27879 F20101129_AAAJJD pol_a_Page_037.QC.jpg 88d9772ffdd3653acf7dafcec0367055 d96bb68ff971d9848491efad12da9792989a2f0b 6872 F20101129_AAAJIO pol_a_Page_016thm.jpg 5d9e54d7dc3dc5eaf8036fe264326457 0874d7d6c96d07d7fe06fe803a53a07e3d293ccc 1051926 F20101129_AAAIEY pol_a_Page_033.jp2 a5730f52b46bdd39c9455c7a017d84f4 5a96d9e594717c170450401ad187fda99815cd37 17110 F20101129_AAAIGB pol_a_Page_004.QC.jpg 99473c24069afc3a675559bd354f17b4 50cd196e8517c9b43b2d2e8ba820fb2530845ce5 8423998 F20101129_AAAIFM pol_a_Page_039.tif 5edc4bddd232ffe25bde21145af68be9 8d539ec480892cf72ca82d4046a0320e4a1afa2d 29543 F20101129_AAAJJE pol_a_Page_040.QC.jpg ebc6c5ba301fdea34c2d31227fa8bc20 7f8edfc0ae00c95240d7f6cd4e9a9f5ef802a755 28667 F20101129_AAAJIP pol_a_Page_019.QC.jpg 6475ba88d04807cf40cdf41ef7436076 6932588428e70e70c8aee35248d82102ed5d3676 83489 F20101129_AAAIEZ pol_a_Page_031.jpg 20ba0ce66f8b683dc4248dda59b041d0 cd8e77e0671df838368f53e6dc467415b2c309f0 F20101129_AAAIGC pol_a_Page_082.tif ade562da24df51b0c6bc93756dc29967 0ec7550d1bba7f6de7b121d36993b2df72ad1512 698 F20101129_AAAIFN pol_a_Page_117.txt ea2b24c7c2f61864cd5a1bf3a41a8ab1 6fb602dbaffaf62009dafaadcd249dabbedee327 7165 F20101129_AAAJJF pol_a_Page_040thm.jpg ba290b740a1768c46d67e6f6c1605edb a2014b5b795d4765dc01ce6658bf81b577127b85 3554 F20101129_AAAJIQ pol_a_Page_021thm.jpg 2104cc326344125b2b7fbc2342f6ae28 e954d508731b02d902872561cb9cbe0a4e2d5fb0 75785 F20101129_AAAIGD pol_a_Page_004.jp2 051d140dfbf57e50fa62f85fd0173346 79fb7a3fe79c133059ccb35cb722f234a7c4243b F20101129_AAAIFO pol_a_Page_022.tif 6a3580a317b02405805678502c2d7fd1 31f6117e7826dbf3c696d8faaa74709cb94b99a6 5921 F20101129_AAAJJG pol_a_Page_041thm.jpg 8c9de56b11c05895d82ce7fa461476b2 de4ede57a45e293ef7c1759eee7d387a645298e4 28073 F20101129_AAAJIR pol_a_Page_022.QC.jpg 1145d4fbceab8660757923bf5b6d3e1c a8de5707f6a691f3450237a359f21027e0ad572c 30658 F20101129_AAAIGE 
pol_a_Page_100.QC.jpg d7699458c099760ca7204e956fdabd7e 9d668f7bbe1d264515bff753680434333477ebac 832746 F20101129_AAAIFP pol_a_Page_017.jp2 3d508f44cde42c05f482d3cf3f59b595 631ae8c4709926afb962031c5aa2fe0f554caa5e 31691 F20101129_AAAJJH pol_a_Page_046.QC.jpg 333ea2eb83f49831c9c834d3c5462660 3089c48f9b4a3d32458c9d7a644b8a6121966499 29859 F20101129_AAAJIS pol_a_Page_024.QC.jpg ef6ec37ce84d9a1cd71d83b5d5be24ce 604c141ab76ae87ff6aedfdc14a107f5e4c4515d F20101129_AAAIGF pol_a_Page_052.tif 55fffda7ce875807691cf9b7d52dd344 d3cbbdc208c5031df2069e8464e8b046a9288654 1047601 F20101129_AAAIFQ pol_a_Page_028.jp2 1a3e09a570de47c7f3dec5a59596d1d7 5816e08ad5e7f3f1710aca0677726aa19d5e88d1 25831 F20101129_AAAJJI pol_a_Page_047.QC.jpg c991ae36444985294e7b0145dd321e6b 39eceed247db72ffa9b68b70d11f2905f0a2ee76 30412 F20101129_AAAJIT pol_a_Page_025.QC.jpg bdd77a6efb6210fae50fa8b4024ab856 94ea892924c4c11ccf51d300c9a5ce33116db343 7747 F20101129_AAAIGG pol_a_Page_046thm.jpg 04655a3ee70da1cadda73cbbab3327b1 c54039e0c3e6dfa8d0dccf307adcadb680ba104a 51563 F20101129_AAAIFR pol_a_Page_095.pro 092d94d3236a51356d2644df7c5b1dbb 2ff6921f0f39997385da4574f374c200689fe7b2 6813 F20101129_AAAJJJ pol_a_Page_047thm.jpg f7f9d32678515d303e710a165f5f6c8a 793845989733260164388c0f03c71fe061f54839 11993 F20101129_AAAJIU pol_a_Page_027.QC.jpg 3eae443fbc196c01bbfaae9d640e438a 9fdef625f8f7cf5806156b493a1f658dd827d0a7 33566 F20101129_AAAIGH pol_a_Page_078.pro 9bafaa27e265dc27abe4445b8ecd516f 25ee1767ff1472310804526b08d271568c2e9e34 49673 F20101129_AAAIFS pol_a_Page_121.jpg e45395482c81cf73ec1786d0248d3f5d 40c84337e4dfa4ed70ce5961d15fb34f63d98baa 27588 F20101129_AAAJJK pol_a_Page_048.QC.jpg eed6c2aaa6f445843102eacf7c6d52c3 db9c81a800cd4192b1a72f768ac3bc097925c2bc 26484 F20101129_AAAJIV pol_a_Page_029.QC.jpg 0e841b2602247d771d06e247292eaa86 3929f6d57fad93517a7060cc7419464557f8feb4 F20101129_AAAIGI pol_a_Page_055.jp2 d880036fe48bf7a0aa0cff27e29c62eb 34bb9dfdb5cee5cca206f644c0fbc5c030abe402 F20101129_AAAIFT 
pol_a_Page_050.tif 7c06099db75d02a6430544999fe78a87 8a95f07806f1b348466214acbf000b55e030c4c3 6759 F20101129_AAAJIW pol_a_Page_029thm.jpg 506f0f17165257770aa9778287ff5c1e 49cc9dd3fa6ecbf123bf6bb3dd6a0496f584fa63 40806 F20101129_AAAIFU pol_a_Page_044.pro 216e151a117c22a03eb3cc8f55ccca49 9acefa5bb9bf4755dd00e363262996becc912ce7 12831 F20101129_AAAJKA pol_a_Page_070.QC.jpg bd3585298af9a60f4bff992720bca00b 13f0b0049dd650cb7c1c6ce4a507fff9a98df600 7474 F20101129_AAAJJL pol_a_Page_049thm.jpg e57b6939b7e29b0ed571f20cd602af8d a7993f885d831ed5c3d7f524a5875cc1427b179c 6429 F20101129_AAAJIX pol_a_Page_030thm.jpg 7ae7c4b9e03c87502d1b1f002b3c8d70 45d47a4cb3a8783ed6e389d9a9ef3ec4673a8925 F20101129_AAAIGJ pol_a_Page_118.tif 3366c3c2fb9a332e34496c94debe4691 757a17374ea3b88a5edaf0894b33d9cb43791114 59330 F20101129_AAAIFV pol_a_Page_078.jpg 2075dd8c062e55e02747e6936d07b9d9 ebfc6e0bf53739e5acbe3b1f3a7fe2e6deb2e4f7 17717 F20101129_AAAJKB pol_a_Page_071.QC.jpg 91d8f25b9288b5f703c363c667cd4152 aae3f20a0526fa1824164502003e24a3e3007b48 27063 F20101129_AAAJJM pol_a_Page_051.QC.jpg c702dc24d39366507777386af3c29e01 151cf9e2a01cfcdc329f94fe1b015f86c7a8a4fb 25996 F20101129_AAAJIY pol_a_Page_033.QC.jpg acb16a6f3780cb9a194f97cf5c30aa3b 1c441ef411d654d4a7405ed742fb6fd2bc9279ce 74481 F20101129_AAAIGK pol_a_Page_054.jpg d3a0695512bc9a5dc45fa2f199617f8b c66bc3645eaf5cb21d81b113f86958770c0c96d8 2166 F20101129_AAAIFW pol_a_Page_107.txt 1d4ca6b4c8ff9853bdb2beb1f6d09af9 13fa1200ffa48b9368976a2e971543b7ce46b2d6 4310 F20101129_AAAJKC pol_a_Page_072thm.jpg eb80ecee85ac3263751ccbcef12e3586 abc86829fc24e6ffe47aee1384453458a3a9217f 5961 F20101129_AAAJJN pol_a_Page_054thm.jpg ee0402a12f9076e39e1b52ada87892fe b847f35b96f598cf1a9271b9b4191e770e3850f4 7454 F20101129_AAAJIZ pol_a_Page_034thm.jpg 58e8aab58395eceb8ce1e5dd936c1224 5b5fa3b74f458620307cb2be6f9dd74f801ac128 5158 F20101129_AAAIHA pol_a_Page_068thm.jpg 215da062e152def99ed81a7b825f8543 de07da461624cf66fdc3e58c636992a5170c6161 54740 F20101129_AAAIGL 
MAINTAINING VERY LARGE SAMPLES USING THE GEOMETRIC FILE

By

ABHIJIT A. POL

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007

2007 Abhijit A. Pol

To my wonderful parents

ACKNOWLEDGMENTS

At the end of my dissertation, I would like to thank all the people who made this dissertation possible and an enjoyable experience for me.

First of all, I wish to express my sincere gratitude to my adviser, Chris Jermaine, for his patient guidance, encouragement, and excellent advice throughout this study. If I had access to a magic create-your-own-adviser tool, I still would not have ended up with anyone better than Chris. He always introduces me to interesting research problems. He is around whenever I have a question, but at the same time encourages me to think on my own and work on any problems that interest me.

I am also indebted to Alin Dobra for his support and encouragement. Alin is a constant source of enthusiasm. The only topic I have not discussed with him is the strategy of Gator football games.

I am grateful to my dissertation committee members, Tamer Kahveci, Joachim Hammer, and Ravindra Ahuja, for their support and encouragement. I acknowledge the Department of Industrial and Systems Engineering, Ravindra Ahuja, and chair Donald Hearn for the financial support and advice I received during the initial years of my studies.

Finally, I would like to express my deepest gratitude for the constant support, understanding, and love that I received from my parents during the past years.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
4
LIST OF TABLES .................................................. 8
LIST OF FIGURES ................................................. 9
ABSTRACT ....................................................... 10

CHAPTER

1 INTRODUCTION ................................................. 12
  1.1 The Geometric File ........................................ 14
  1.2 Biased Reservoir Sampling ................................. 16
  1.3 Sampling The Sample ....................................... 18
  1.4 Index Structures For The Geometric File ................... 19

2 RELATED WORK ................................................. 22
  2.1 Related Work on Reservoir Sampling ........................ 22
  2.2 Biased Sampling Related Work .............................. 24

3 THE GEOMETRIC FILE ........................................... 28
  3.1 Reservoir Sampling ........................................ 28
  3.2 Sampling: Sometimes a Little is not Enough ................ 30
  3.3 Reservoir for Very Large Samples .......................... 31
  3.4 The Geometric File ........................................ 34
  3.5 Characterizing Subsample Decay ............................ 36
  3.6 Geometric File Organization ............................... 40
  3.7 Reservoir Sampling With a Geometric File .................. 40
    3.7.1 Introducing the Required Randomness ................... 41
    3.7.2 Handling the Variance ................................. 42
    3.7.3 Bounding the Variance ................................. 45
  3.8 Choosing Parameter Values ................................. 47
    3.8.1 Choosing a Value for Alpha ............................ 47
    3.8.2 Choosing a Value for Beta ............................. 48
  3.9 Why Reservoir Sampling with a Geometric File is Correct? .. 49
    3.9.1 Correctness of the Reservoir Sampling Algorithm with a Buffer ... 49
    3.9.2 Correctness of the Reservoir Sampling Algorithm with a Geometric File ... 50
  3.10 Multiple Geometric Files ................................. 51
  3.11 Reservoir Sampling with Multiple Geometric Files ......... 51
    3.11.1 Consolidation And Merging ............................ 53
    3.11.2 How Can Correctness Be Maintained? ...................
53
    3.11.3 Handling the Stacks in Multiple Geometric Files ...... 56
  3.12 Speed-Up Analysis ........................................

4 BIASED RESERVOIR SAMPLING .................................... 58
  4.1 A Single-Pass Biased Sampling Algorithm ................... 59
    4.1.1 Biased Reservoir Sampling ............................. 59
    4.1.2 So, What Can Go Wrong? (And a Simple Solution) ........ 60
    4.1.3 Adjusting Weights of Existing Samples ................. 62
  4.2 Worst Case Analysis for Biased Reservoir Sampling Algorithm ... 65
    4.2.1 The Proof for the Worst Case .......................... 66
    4.2.2 The Proof of Theorem 1: The Upper Bound on totalDist .. 73
  4.3 Biased Reservoir Sampling With The Geometric File ......... 75
  4.4 Estimation Using a Biased Reservoir ....................... 76

5 SAMPLING THE GEOMETRIC FILE .................................. 80
  5.1 Why Might We Need To Sample From a Geometric File? ........ 80
  5.2 Different Sampling Plans for the Geometric File ........... 80
  5.3 Batch Sampling From a Geometric File ...................... 81
    5.3.1 A Naive Algorithm ..................................... 81
    5.3.2 A Geometric File Structure-Based Algorithm ............ 82
    5.3.3 Batch Sampling Multiple Geometric Files ............... 84
  5.4 Online Sampling From a Geometric File ..................... 84
    5.4.1 A Naive Algorithm ..................................... 84
    5.4.2 A Geometric File Structure-Based Algorithm ............ 85
  5.5 Sampling A Biased Sample .................................. 88

6 INDEX STRUCTURES FOR THE GEOMETRIC FILE ...................... 89
  6.1 Why Index a Geometric File? ............................... 89
  6.2 Different Index Structures for the Geometric File ......... 90
  6.3 A Segment-Based Index Structure ........................... 91
    6.3.1 Index Construction During Startup ..................... 91
    6.3.2 Maintaining Index During Normal Operation ............. 92
    6.3.3 Index Look-Up and Search .............................. 93
  6.4 A Subsample-Based Index Structure ......................... 93
    6.4.1 Index Construction and Maintenance .................... 94
    6.4.2 Index Look-Up .........................................
95
  6.5 An LSM-Tree-Based Index Structure ......................... 96
    6.5.1 An LSM-Tree Index ..................................... 96
    6.5.2 Index Maintenance and Look-Ups ........................ 97

7 BENCHMARKING ................................................. 99
  7.1 Processing Insertions ..................................... 99
    7.1.1 Experiments Performed ................................. 99
    7.1.2 Discussion of Experimental Results ................... 100
  7.2 Biased Reservoir Sampling ................................ 103
    7.2.1 Experimental Setup ................................... 104
    7.2.2 Discussion ........................................... 106
  7.3 Sampling From a Geometric File ........................... 107
    7.3.1 Experiments Performed ................................ 108
    7.3.2 Discussion of Experimental Results ................... 109
  7.4 Index Structures For The Geometric File .................. 110
    7.4.1 Experiments Performed ................................ 110
    7.4.2 Discussion ........................................... 112

8 CONCLUSION .................................................. 116

REFERENCES .................................................... 118

BIOGRAPHICAL SKETCH ........................................... 122

LIST OF TABLES

Table                                                         page

1-1 Population: student records ................................ 17
1-2 Random sample of size 4 .................................... 17
1-3 Biased sample of size 4 .................................... 17
7-1 Millions of records inserted in 10 hrs .................... 110
7-2 Query timing results for 1 KB records, |R| = 10 million, and |B| = 50K ... 113
7-3 Query timing results for 200-byte records, |R| = 50 million, and |B| = 250K ... 114

LIST OF FIGURES

Figure                                                        page

3-1 Decay of a subsample after multiple buffer flushes. ........ 38
3-2 Basic structure of the geometric file. ..................... 39
3-3 Building a geometric file. ................................. 43
3-4 Distributing new records to existing subsamples. ........... 44
3-5 Speeding up the processing of new samples using multiple geometric files. ... 54
4-1 Adjustment of r' to r'' ....................................
69
7-1 Results of benchmarking experiments (processing insertions). ... 101
7-2 Results of benchmarking experiments (sampling from a geometric file). ... 102
7-3 Sum query estimation accuracy for zipf=0.2. ............... 104
7-4 Sum query estimation accuracy for zipf=0.5. ............... 105
7-5 Sum query estimation accuracy for zipf=0.8. ............... 106
7-6 Sum query estimation accuracy for zipf=1. ................. 107
7-7 Disk footprint for 1 KB record size ....................... 110
7-8 Disk footprint for 200 B record size ...................... 112

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

MAINTAINING VERY LARGE SAMPLES USING THE GEOMETRIC FILE

By

Abhijit A. Pol

August 2007

Chair: Christopher M. Jermaine
Major: Computer Engineering

Sampling is one of the most fundamental data management tools available. It is one of the most powerful methods for building a one-pass synopsis of a data set, especially in a streaming environment where the assumption is that there is too much data to store all of it permanently. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a "sample" is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples in an online manner from streaming data.

We present a new data organization called the geometric file and online algorithms for maintaining very large, on-disk samples. The algorithms are designed for any environment where a large sample must be maintained online in a single pass through a data set.
The geometric file organization meets the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. We modify the classic reservoir sampling algorithm to compute a fixed-size sample in a single pass over a data set, where the goal is to bias the sample using an arbitrary, user-defined weighting function. We also describe how the geometric file can be used to perform biased reservoir sampling.

While a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. We therefore develop efficient techniques which allow a geometric file to itself be sampled in order to produce smaller data objects.

Efficiently searching and discovering information in the geometric file is essential for query processing. A natural way to support this is to build an index structure. We discuss three secondary index structures and their maintenance as new records are inserted into a geometric file.

CHAPTER 1
INTRODUCTION

Despite the variety of alternatives for approximate query processing [1, 21, 30, 34, 39], sampling is still one of the most powerful methods for building a one-pass synopsis of a data set, especially in a streaming environment where the assumption is that there is too much data to store all of it permanently. Sampling's many benefits include:

Sampling is the most widely-studied and best understood approximation technique currently available. Sampling has been studied for hundreds of years, and many fundamental results describe the utility of random samples (such as the Central Limit Theorem and the Chernoff, Hoeffding, and Chebyshev bounds [16, 49]).

Sampling is the most versatile approximation technique available. Most data processing algorithms can be used on a random sample of a data set rather than the original data with little or no modification.
For example, almost any data mining algorithm for building a decision tree classifier can be run directly on a sample.

Sampling is the most widely-used approximation technique. Sampling is common in data mining, statistics, and machine learning. The sheer number of recent papers from ICDE, VLDB, and SIGMOD [2, 3, 8, 14, 15, 28, 32, 33, 35, 46, 51, 52] that use samples testifies to sampling's popularity as a data management tool.

Given the obvious importance of random sampling, it is perhaps surprising that there has been very little work in the data management community on how to actually perform random sampling. The most well-known papers in this area are due to Olken and Rotem [25, 27], who also offer the definitive survey of related work through the early 1990s [26]. However, this work is relevant mostly for sampling from data stored in a database, and implicitly assumes that a "sample" is a small data structure that is easily stored in main memory.

Such assumptions are sometimes overly restrictive. Consider the problem of approximate query processing. Recent work has suggested the possibility of maintaining a sample of a large database and then executing analytic queries over the sample rather than the original data as a way to speed up processing [4, 31]. Given the most recent TPC-H benchmark results [17], it is clear that processing standard report-style queries over a large, multi-terabyte data warehouse may take hours or days. In such a situation, maintaining a fully materialized random sample of the data (or "sample view" [43]) may be desirable. In order to save time and/or computer resources, queries can then be evaluated over the sample rather than the original data, as long as the user can tolerate some carefully controlled inaccuracy in the query results. This particular application has two specific requirements that are addressed by the dissertation.
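The claim that samples may need to be very large can be made concrete with a standard confidence-interval calculation. The sketch below is only an illustration with invented numbers (none of them come from the dissertation); it shows that for a fixed accuracy target, the required sample size is governed by the data's variance, while the size of the underlying database never enters the formula:

```python
import math

def required_sample_size(sigma, target_half_width, z=1.96):
    """Sample size n so that a CLT-based 95% confidence interval for a
    mean has the requested half-width.  Note that n depends only on the
    data's standard deviation and the accuracy target -- the size of the
    underlying database does not appear anywhere."""
    return math.ceil((z * sigma / target_half_width) ** 2)

# A low-variance query: sigma comparable to the mean, 1% error -> tens of
# thousands of samples suffice.
n_easy = required_sample_size(sigma=1.0, target_half_width=0.01)

# A high-variance (or highly selective) query: sigma is 100x larger, and
# the required sample is 10,000x larger -- hundreds of millions of
# records, i.e., gigabytes of sampled data.
n_hard = required_sample_size(sigma=100.0, target_half_width=0.01)
```

This is why the dissertation's setting assumes the sample itself can far exceed main memory, regardless of how large the database grows.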
First, it may be necessary to use quite a large sample in order to achieve acceptable accuracy, perhaps on the order of gigabytes in size. This is especially true if the sample will be used to answer selective queries or aggregates over attributes with high variance (see Section 3.2). Second, whatever the required sample size, it is often independent of the size of the database, since estimation accuracy depends primarily on sample size.¹ In other words, the required sample size will generally not grow as the database size increases, as long as other factors such as query selectivity remain relatively constant. Thus, this application requires that we be able to maintain a large, disk-based, fixed-size random sample of the archived data, even as new data are added to the warehouse. This is precisely the problem we tackle in the dissertation.

¹ The unimportance of database size for certain queries is due to the fact that the bias and variance of many sampling-based estimators are related far more to sample size than to the sampling fraction (see Cochran [16] for a thorough treatment of finite population random sampling).

For another example of a case where existing sampling methods can fall short, consider stream-based data management tasks, such as network monitoring (for an example of such an application, we point to the Gigascope project from AT&T Laboratories [18-20]). Given the tremendous amount of data transported over today's computer networks, the only conceivable way to facilitate ad hoc, after-the-fact query processing over the set of packets that have passed through a network router is to build some sort of statistical model for those packets. The most obvious choice would be to produce a very large, statistically random sample of the packets that have passed through the router. Again, maintaining such a sample is precisely the problem we tackle in this dissertation. While other researchers have tackled the problem of maintaining an
online sample targeted towards more recent data [7], no existing methods have considered how to handle very large samples that exceed the available main memory.

In this dissertation we describe a new data organization called the geometric file and related online algorithms for maintaining a very large, disk-based sample from a data stream. The dissertation is divided into four parts. In the first part we describe the geometric file organization and detail how geometric files can be used to maintain a very large simple random sample. In the second part we propose a simple modification to the classical reservoir sampling algorithm to compute a biased sample in a single pass over the data stream and describe how the geometric file can be used to maintain a very large biased sample. In the third part we develop techniques which allow a geometric file to itself be sampled in order to produce smaller sets of data objects. Finally, in the fourth part, we discuss secondary index structures for the geometric file. Index structures are useful to speed up search and discovery of required information from a huge sample stored in a geometric file. The index structures must be maintained concurrently with constant updates to the geometric file and at the same time provide efficient access to its records. We now give an introduction to these four parts of the dissertation in the subsequent sections.

1.1 The Geometric File

If one accepts the notion that being able to maintain a very large (but fixed-size) random sample from a data stream is an important problem, it is reasonable to ask: Is maintaining such a sample difficult or costly using modern algorithms and hardware? Fortunately, modern storage hardware gives us the capacity to inexpensively store very large samples that should suffice for even difficult and emerging applications. A terabyte of commodity hard disk storage now costs less than $1,000.
Given current trends, we should see storage costs of $1,000 per petabyte by the year 2020. However, even given such large storage capacities, it turns out that maintaining a large sample is difficult using current technology. The problem is not purchasing the hardware to store the sample; rather, the problem is actually getting the samples onto disk, so as to guarantee the statistical randomness of the sample, in the face of data streams that may exceed tens of gigabytes per minute in the case of a network monitoring application.

Current techniques suitable for maintaining samples from a data stream are based on reservoir sampling [11, 38]. Reservoir sampling algorithms can be used to dynamically maintain a fixed-size sample of N records from a stream, so that at any given instant, the N records in the sample constitute a true random sample of all of the records that have been produced by the stream. However, as we will discuss in this dissertation, the problem is that existing reservoir techniques are suitable only when the sample is small enough to fit into main memory.

Given that there are limited techniques for maintaining very large samples, the problem addressed in the first part of this dissertation is as follows: Given a main memory buffer B large enough to hold |B| records, can we develop efficient algorithms for dynamically maintaining a massive random sample containing exactly N records from a data stream, where N > |B|?

Key design goals for the algorithms we develop are:

1. The algorithms must be suitable for streaming data, or any similar environment where a large sample must be maintained online in a single pass through a data set, with the strict requirement that the sample always be a true, statistically random sample of fixed size N (without replacement) from all of the data produced by the stream thus far.

2. When maintaining the sample, the fraction of I/O time devoted to reads should be close to zero.
Ideally, there would never be a need to read a block of samples from disk simply to add one new sample and subsequently write the block out again.

3. The fraction of I/O time spent performing random I/Os should also be close to zero. Costly random disk seeks should be few and far between. Almost all I/O should be sequential.

4. Finally, the amount of data written to disk should be bounded by the total size of all of the records that are ever sampled.

The geometric file meets each of the requirements listed above. With memory large enough to buffer |B| > 1 records, the geometric file can be used to maintain an online sample of arbitrary size with an amortized cost of O(u × log|B| / |B|) random disk head movements for each newly sampled record (see Section 3.12). The multiplier u can be made arbitrarily small by making use of additional disk space. A rigorous benchmark of the geometric file demonstrates its superiority over the obvious alternatives.

1.2 Biased Reservoir Sampling

In this part of the dissertation, we study the problem of how to compute a simple, fixed-size random sample (without replacement) in a single pass over a data stream, where the goal is to bias the sample using some arbitrary weighting function. The need for biased sampling can easily be illustrated with an example population, given in Table 1-1. This particular data set contains records describing graduate student salaries in a university academic department, and our goal is to guess the total graduate student salary. Imagine that a simple random sample of the data set is drawn, as shown in Table 1-2. The four sampled records are then used to guess that the total student salary is (520 + 700 + 580 + 600) × 12/4 = $7,200, which is considerably less than the true total of $9,545. The problem is that we happened to miss most of the high-salary students, who are generally more important when computing the overall total.
Now, imagine that we weight each record, so that the probability of including any given record with a salary of 700 or greater in the sample is (2) × (4/12), and the probability of including a given record with a salary less than 700 is (1/2) × (4/12). Thus, our sample will tend to include those records with higher values, which are more important to the overall sum. The resulting biased sample is depicted in Table 1-3. The standard Horvitz-Thompson estimator [50] is then applied to the sample (where each record is weighted according to the inverse of its sampling probability), which gives us an estimate of (1200 + 1500 + 750) × (12/8) + (580) × (24/4) = $8,655. This is obviously a better estimate than $7,200, and the fact that it is better than the original estimate is not just accidental: if one chooses the weights carefully, it is easily possible to produce a sample whose associated estimator has lower variance (and hence higher accuracy) than the simple, uniform-probability sample. For instance, the variance of the estimator in the student salary example is 2.533 × 10^6 under uniform-probability sampling and 5.083 × 10^5 under the biased sampling scheme.

Table 1-1. Population: student records

Rec #  Name      Class      Salary ($/month)
1      James     Junior     1200
2      Tom       Freshman    520
3      Sandra    Junior     1250
4      Jim       Senior     1500
5      Ashley    Sophomore   700
6      Jennifer  Freshman    530
7      Robert    Sophomore   750
8      Frank     Freshman    580
9      Rachel    Freshman    605
10     Tim       Freshman    550
11     Maria     Sophomore   760
12     Monica    Freshman    600
Total Salary: 9545.00

Table 1-2. Random sample of size 4

Rec #  Name      Class      Salary ($/month)
2      Tom       Freshman    520
5      Ashley    Sophomore   700
8      Frank     Freshman    580
12     Monica    Freshman    600

Other cases where a biased sample is preferable abound. For example, if the goal is to monitor the packets flowing through a network, one may choose to weight more recent packets more heavily, since they would tend to figure more prominently in most query workloads.
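The arithmetic in the student-salary example can be checked mechanically. The short script below uses the sampled values and inclusion probabilities exactly as stated above; each Horvitz-Thompson weight is the inverse of the corresponding inclusion probability:

```python
# Uniform sample (Table 1-2): every record included with probability 4/12,
# so each sampled value is scaled by 12/4.
uniform_sample = [520, 700, 580, 600]
uniform_estimate = sum(v * 12 / 4 for v in uniform_sample)

# Biased scheme: salary >= 700 included with probability 2 * (4/12) = 8/12,
# salary < 700 with probability (1/2) * (4/12) = 2/12.  Horvitz-Thompson
# scales each sampled value by the inverse of its inclusion probability.
high_salary_draws = [1200, 1500, 750]   # each scaled by 12/8
low_salary_draws = [580]                # each scaled by 24/4
biased_estimate = (sum(v * 12 / 8 for v in high_salary_draws)
                   + sum(v * 24 / 4 for v in low_salary_draws))

print(uniform_estimate, biased_estimate)  # 7200.0 8655.0
```

Against the true total of $9,545, the biased estimator's error is far smaller, consistent with the variance figures quoted above.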
We propose a simple modification to the classic reservoir sampling algorithm [11, 38] in order to derive a very simple algorithm that permits the sort of fixed-size, biased sampling given in the example. Our method assumes the existence of an arbitrary, user-defined weighting function f which takes as an argument a record ri, where f(ri) > 0 describes the record's utility in subsequent query processing.

Table 1-3. Biased sample of size 4

Rec #  Name      Class      Salary ($/month)
1      James     Junior     1200
4      Jim       Senior     1500
7      Robert    Sophomore   750
11     Maria     Sophomore   760

We then compute (in a single pass) a biased sample Ri of the i records produced by a data stream. Ri is fixed-size, and the probability of sampling the jth record from the stream is proportional to f(rj) for all j < i. This is a fairly simple and yet powerful definition of biased sampling, and is general enough to support many applications. The key contributions of this part of the dissertation are as follows:

1. We present a modified version of the classic reservoir sampling algorithm that is exceedingly simple, and is applicable for biased sampling using any arbitrary user-defined weighting function f.

2. In most cases, our algorithm is able to produce a correctly biased sample. However, given certain pathological data sets and data orderings, this may not be the case. Our algorithm adapts in this case and provides a correctly biased sample for a slightly modified bias function f'. We analytically bound how far f' can be from f in such a pathological case, and experimentally evaluate the practical significance of this difference.

3. We describe how to perform biased reservoir sampling and maintain large biased samples with the geometric file.

4. Finally, we derive the correlation (covariance) between the Bernoulli random variables governing the sampling of two records ri and rj using our algorithm.
We use this covariance to derive the variance of a Horvitz-Thompson estimator making use of a sample computed using our algorithm.

1.3 Sampling The Sample

A geometric file is a simple random sample (without replacement) from a data stream. In this part of the dissertation we develop techniques which allow a geometric file to itself be sampled in order to produce smaller sets of data objects that are themselves random samples (without replacement) from the original data stream. The goal of the algorithms described in this part is to efficiently support further sampling of a geometric file by making use of its own structure.

Small samples frequently do not provide enough accuracy, especially when the resulting statistical estimator has a very high variance. However, while in the general case a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. For example, consider the problem of estimating the average net worth of American households. In the general case, many millions of samples may be needed to estimate the net worth of the average household accurately (due to a small ratio between the average household's net worth and the standard deviation of this statistic across all American households). However, if the same set of records held information about the size of each household, only a few hundred records would be needed to obtain similar accuracy for an estimate of the average size of an American household, since the ratio of average household size to the standard deviation of household size across households in the United States is greater than 2. Thus, to estimate the answer to these two queries, vastly different sample sizes are needed.
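Drawing a smaller sample out of an already-materialized large one can be pictured with a toy in-memory sketch. The class and method names below are invented for illustration, and a plain list stands in for the geometric file; Chapter 5's algorithms achieve the same contract efficiently over the on-disk structure:

```python
import random

class OnlineSampler:
    """Draw a without-replacement sample one record at a time, so that
    the first N calls to get_next() always form a size-N random sample
    of the underlying records."""

    def __init__(self, records):
        self._records = list(records)
        self._remaining = len(self._records)

    def get_next(self):
        if self._remaining == 0:
            raise IndexError("all records have been sampled")
        # Lazy Fisher-Yates step: choose uniformly among the records not
        # yet returned, then swap the choice into the consumed suffix.
        i = random.randrange(self._remaining)
        self._remaining -= 1
        r = self._records
        r[i], r[self._remaining] = r[self._remaining], r[i]
        return r[self._remaining]

sampler = OnlineSampler(range(1_000_000))
first_ten = [sampler.get_next() for _ in range(10)]  # a sample of size 10
```

Because the total number of draws need not be fixed in advance, a caller can keep requesting records until its estimator reaches the desired accuracy.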
Since there is no single sample size that is optimal for answering all queries, and the required sample size can vary dramatically from query to query, this part of the dissertation considers the problem of generating a sample of size N from a data stream using an existing geometric file that contains a large sample of records from the stream, where N < |R|. We will consider two specific problems. First, we consider the case where N is known beforehand. We will refer to a sample retrieved in this manner as a batch sample. We will also consider the case where N is not known beforehand, and we want to implement an iterative function GetNext. Each call to GetNext results in an additional sampled record being returned to the caller, and so N consecutive calls to GetNext result in a sample of size N. We will refer to a sample retrieved in this manner as an online or sequential sample.

1.4 Index Structures For The Geometric File

A geometric file could easily contain a sample several gigabytes or even terabytes in size. A huge sample like this may often contain too much information, and it becomes expensive to scan all the records of a sample to find those (most likely very few) records that match a given condition. A natural way to speed up the search and discovery of those records in a geometric file that have a particular value for a particular attribute is to build an index structure. In this part of the dissertation we discuss and compare three different index structures for the geometric file.

In general, an index is a data structure that lets us find a record without having to look at more than a small fraction of all possible records. An index is referred to as a primary index if it determines the location of the indexed records in the file. An index is referred to as a secondary index if it tells us the current location of records that may have been decided by a primary index.
Thus, a secondary index is an index that is maintained for a data file, but not used to control the current processing order of the file. In the case of a geometric file, the physical location of a sampled record is determined (randomly) by the insertion algorithms. We could therefore build a secondary index structure on one or more attributes, including the key attribute of the record.

Apart from providing efficient access to the desired information in the file, the index must be maintained as new records are inserted into the geometric file. For instance, we could build a secondary index on an attribute when the new records are bulk inserted into the geometric file. At this time we must determine how we merge the new secondary index with the existing indexes built for the rest of the file. Furthermore, we must maintain the index as existing records are overwritten with newly inserted records and hence are deleted from the geometric file.

With these goals in mind, we discuss three secondary index structures for the geometric file: (1) a segment-based index, (2) a subsample-based index, and (3) a Log-Structured Merge-Tree (LSM) based index. The first two indexes are developed around the structure of the geometric file. Multiple B+-tree indexes are maintained, one for each segment or subsample in a geometric file. As new records are added to the file in units of a segment or subsample, a new B+-tree indexing the new records is created and added to the index structure. Also, an existing B+-tree is deleted from the structure when all the records indexed by it are deleted from the file. The third index structure makes use of the LSM-tree index [44], a disk-based data structure designed to provide low-cost indexing in an environment with a high rate of inserts and deletes. We evaluate and compare these three index structures experimentally by measuring build time and disk footprint as new records are inserted in the geometric file.
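The behavior of the segment-based scheme can be pictured with a toy sketch. The structure below keeps one tiny dictionary "index" per segment; the real version keeps an on-disk B+-tree per segment, and all names here are hypothetical:

```python
from collections import defaultdict

class SegmentIndex:
    """Toy segment-based secondary index: one small key->record-id index
    per segment, where segments are added and dropped as whole units."""

    def __init__(self):
        self.segments = {}  # segment id -> {key -> [record ids]}

    def add_segment(self, seg_id, records):
        # Bulk-build an index for a freshly written segment of
        # (record id, key) pairs.
        idx = defaultdict(list)
        for rid, key in records:
            idx[key].append(rid)
        self.segments[seg_id] = idx

    def drop_segment(self, seg_id):
        # All records of the segment were overwritten by newer samples:
        # the segment's index is dropped whole, never updated in place.
        del self.segments[seg_id]

    def lookup(self, key):
        # A point query probes every live per-segment index.
        hits = []
        for idx in self.segments.values():
            hits.extend(idx.get(key, []))
        return hits
```

The appeal of the scheme is that index maintenance mirrors the geometric file's own bulk writes: per-segment indexes are created and destroyed in whole units, never updated record by record.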
We also compare the efficiency of these structures for point and range queries.

Dissertation organization and original publications: The rest of the dissertation is organized as follows. We present the related work in Chapter 2. In Chapter 3 we present the geometric file organization and show how this structure can be used to maintain a very large simple random sample. In Chapter 4 we propose a single-pass biased reservoir sampling algorithm. In Chapter 5 we develop techniques that can be used to sample geometric files to obtain a small sample. In Chapter 6 we present secondary index structures for the geometric file. In Chapter 7 we discuss the benchmarking results. The dissertation is concluded in Chapter 8.

Most of the work in the dissertation is either already published or under review for publication. The material in Chapter 3 is from the paper with Christopher Jermaine and Subramanian Arumugam that was originally published in SIGMOD 2004 [36]. The work presented in Chapter 4 has been submitted to TKDE and is under review [47]. The material in Chapter 5 is part of a journal paper accepted at VLDBJ [48]. The results in Chapter 7 are taken from the above three papers as well.

CHAPTER 2
RELATED WORK

In this chapter, we first review the literature on reservoir sampling algorithms. We then present a summary of existing work on biased sampling.

2.1 Related Work on Reservoir Sampling

Sampling has a very long history in the data management literature, and research continues unabated today [2, 3, 8, 14, 15, 28, 32, 33, 35, 51, 52]. However, most previous papers (including the aforementioned references) are concerned with how to use a sample, and not with how to actually store or maintain one. Most of these algorithms could be viewed as potential users of a large sample maintained as a geometric file.
As mentioned in the introduction chapter, a series of papers by Olken and Rotem (including two papers listed in the References section [25, 27]) probably constitutes the most well-known body of research detailing how to actually compute samples in a database environment. Olken and Rotem give an excellent survey of work in this area [26]. However, most of this work is very different from ours, in that it is concerned primarily with sampling from an existing database file, where it is assumed that the data to be sampled from are all present on disk and indexed by the database. Single-pass sampling is generally not the goal, and when it is, management of the sample itself as a disk-based object is not considered.

The algorithms in this dissertation are based on reservoir sampling, which was first developed in the 1960s [11, 38]. In his well-known paper [53], Vitter extends this early work by describing how to decrease the number of random numbers required to perform the sampling. Vitter's techniques could be used in conjunction with our own, but the focus of existing work on reservoir sampling is again quite different from ours; management of the sample itself is not considered, and the sample is implicitly assumed to be small and in-memory. However, if we remove the requirement that our sample of size N be maintained online, so that it is always a valid snapshot of the stream and must evolve over time, then there are sequential sampling techniques related to reservoir sampling that could be used to build (but not maintain) a large, on-disk sample (see Vitter [54], for example).

Several data structures and algorithms have been proposed to speed up index inserts, such as the LSM-Tree [44], Buffer-Tree [6], and Y-Tree [12]. These papers consider the problem of providing I/O-efficient indexing for a database experiencing a very high record insertion rate, which is impossible to handle using a traditional B+-Tree indexing structure.
In general, these methods buffer a large set of insertions and then scan the entire base relation, which is typically organized as a B+-Tree, all at once, adding the new data to the structure. Any of the above methods could trivially be used to maintain a large random sample of a data stream. Every time a sampling algorithm probabilistically selects a record for insertion, it must overwrite, at random, an existing record of the reservoir. Once an evictee is determined, we can attach its location as a position identifier (a number between 1 and |R|) to the new sample record. This position field is then used to insert the new record into these index structures. While performing its efficient batch inserts, if an index structure discovers that a record with the same position identifier already exists, it simply overwrites the old record with the newer one. However, none of these methods can come close to the raw write speed of the disk, as the geometric file can [13]. In a sense, the issue is that while the indexing provided by these structures could be used to implement efficient, disk-based reservoir sampling, it is too heavy-duty a solution. We would end up paying too much in terms of disk I/O to send a new record to overwrite a specific, existing record chosen at the time the new record is inserted, when all one really needs is to have a new record overwrite any random, existing record. There has been much recent interest in approximate query processing over data streams (a very small subset of these papers is listed in the References section [1, 21, 34]); there is even some work on sampling from a data stream [7]. This work is very different from our own, in that most existing approximation techniques try to operate in very small space. Instead, our focus is on making use of today's very large and very inexpensive secondary storage to physically store the largest snapshot possible of the stream. Finally, we mention the U.C.
Berkeley CONTROL project [37] (which resulted in the development of online aggregation [33] and ripple joins [32]). This work does address issues associated with randomization and sampling from a data management perspective. However, the assumption underlying the CONTROL project is that all of the data are present and can be archived by the system; online sampling is not considered. Our work is complementary to the CONTROL project in that their algorithms could make use of our samples. For example, a sample maintained as a geometric file could easily be used as input to a ripple join or online aggregation.

2.2 Biased Sampling Related Work

Our biased sampling algorithm is based on the reservoir sampling algorithm, which was first proposed in the 1960s [11, 38]. Recently, Gemulla et al. [29] extended the reservoir sampling algorithm to handle deletions. In their algorithm, called "random pairing" (RP), every deletion from the dataset is eventually compensated by a subsequent insertion. The RP algorithm keeps track of uncompensated deletions and uses this information while performing the inserts. The algorithm guards the bound on the sample size and at the same time utilizes the sample space effectively to provide a stable sample. Another extension to the classic reservoir sampling algorithm has been recently proposed by Brown and Haas for warehousing of sample data [10]. They propose hybrid reservoir sampling for independent and parallel uniform random sampling of multiple streams. These algorithms can be used to maintain a warehouse of sampled data that shadows the full-scale data warehouse. They have also provided methods for merging samples from different streams to create a uniform random sample. The problem of temporally biased sampling in a stream environment has also been considered. Babcock et al. [7] presented the sliding window approach, with a restricted horizon of the sample, to bias the sample toward the recent streaming records.
However, this solution has the potential to completely lose the entire history of past stream data that is not part of the sliding window. The work by Aggarwal [5] addresses this limitation and presents a biased sampling method that provides temporal bias toward recent records while still keeping representation from the stream's history. This work exploits some interesting properties of the class of memoryless bias functions to present a single-pass biased sampling algorithm for this type of bias function. However, since these techniques are tailored for a specific class of bias functions, one cannot adapt them directly for arbitrary user-defined bias functions. On the other hand, one can perform temporally biased sampling using our algorithm by simply attaching a temporal weight to each streaming record. Another piece of work on single-pass sampling with a non-uniform distribution is due to Kolonko and Wasch [40]. They present a single-pass algorithm to sample a data stream of unknown size (that is, not known beforehand) to obtain a sample of arbitrary size n such that the probability of selecting a data item i depends on the individual item. The weight or fitness of an item, which is used for its probabilistic selection, is derived using exponentially distributed auxiliary values, with the item's weight serving as the parameter of the exponential distribution; the auxiliary values determine the sample. Like the temporally biased sampling methods discussed above, this algorithm cannot be directly adapted for arbitrary user-defined bias functions. Surprisingly, the above three papers are the only pieces of work known to the authors on how to perform single-pass biased sampling over large datasets or streaming data. Another body of related work is the set of papers from the network usage area [22-24, 41]. These papers present techniques for estimating the total network traffic (or usage) based on a sample of the flow records produced by routers.
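The exponential-auxiliary-value scheme just described can be sketched in Python. This is a hedged reconstruction of the general idea, not Kolonko and Wasch's exact algorithm; the function name and parameters are ours. Each item draws an auxiliary key from an exponential distribution whose rate is the item's weight, and the n items with the smallest keys form the sample:

```python
import heapq
import random

def weighted_sample(stream, n):
    """Single pass: keep the n items with the smallest Exp(weight) keys.
    Items with larger weights tend to draw smaller keys, biasing selection
    toward heavy items while using only O(n) memory."""
    heap = []                                  # max-heap (negated keys) of size n
    for item, weight in stream:
        key = random.expovariate(weight)       # auxiliary value ~ Exp(weight)
        if len(heap) < n:
            heapq.heappush(heap, (-key, item))
        elif key < -heap[0][0]:                # smaller key than current worst
            heapq.heapreplace(heap, (-key, item))
    return [item for _, item in heap]
```

The single pass and bounded memory mirror the reservoir setting, but, as noted above, the bias realized by this construction is fixed by the exponential-key mechanism rather than by an arbitrary user-defined bias function.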
Since these flows typically have heavy-tailed distributions, the techniques presented in these papers make use of a size-dependent sampling scheme. In general, such schemes work by sampling all of the records whose traffic is above a certain threshold and sampling the rest with probability proportional to their traffic. Although such techniques introduce sampling bias, where size can be thought of as the weight of a record, there are key differences between such techniques and the algorithm presented in this dissertation. The goal of our algorithm is to obtain a fixed-size biased sample that complies with an arbitrary user-defined bias function. The goal of the size-dependent sampling scheme is to obtain a sample that will provide the best accuracy for estimating the total network traffic that follows a specific distribution. The sample gathered by these schemes is not necessarily a fixed-size biased sample. It only guarantees that the expected sample size is no larger than the expected sample size of a random sample obtained with sampling probability 1/r, where r is the threshold used by these algorithms. Thus, the threshold r is carefully selected to control the sample size and, if required, it is increased to honor the upper bound on the sample size. The problem of implementing fixed-size sampling designs with desired and unequal inclusion probabilities has been studied in statistics. The monograph Theory of Sample Surveys [50] discusses several methods for such sampling, which is of some practical importance in survey sampling. This monograph begins by discussing two designs which mimic simple random sampling without replacement with selection probabilities for a given draw that are not the same for all the units. We first summarize these techniques. Successive Sampling: Let the selection probabilities p_1, p_2, ..., p_L be such that p_i > 0 and Σ_{i=1}^{L} p_i = 1, and let the desired sample size be N = 2.
Then the design suggests that we draw r with probability p_r, and then q with probability p_q/(1 − p_r). The inclusion probabilities can be expressed in terms of the selection probabilities by the fact that r is included if it is drawn on the first draw, or on the second draw not having been chosen on the first. Thus, the inclusion probability π_r is given by p_r (1 + Σ_{q≠r} p_q/(1 − p_q)). Similarly, the value for the joint inclusion probability π_rq can be deduced. The monograph suggests that the value of p_r be found using an iterative computation method. Fellegi's Method: This method is very much like Successive Sampling described above, except that the selection probabilities are different for the second draw. The second-draw probabilities are chosen such that the marginal selection probabilities for both draws are the same. This feature makes the method suitable for rotating sampling, as in labor force sampling where a fixed proportion of the sample is replaced each month. The procedure is as follows: the first draw is made with probability p_r = α_r, and then q is drawn with probability p'_q/(1 − p'_r), where p'_1, ..., p'_L is another set of selection probabilities chosen so that Σ_{r≠q} α_r p'_q/(1 − p'_r) = α_q, and the α_i are specified positive numbers such that Σ_i α_i = 1. The above two methods implement simple random sampling (SRS) without replacement using successive draws. An alternative method for fixed-size SRS is to select units with replacement, and then to reject the sample if there are duplicates. We discuss one such method here, called Sampford's Method. Sampford's Method: In this method we first draw r with probability α_r, and in the remaining N − 1 draws, which are carried out with replacement, we use the selection probabilities β_i = κ α_i/(1 − N α_i), where κ is a normalizing constant. If there are any duplicates in the sample, we start again from the beginning and repeat the procedure until a sample with no duplicates is obtained.
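Sampford's rejective procedure can be sketched directly from this description (an illustrative sketch; the function name is ours, and we assume Σ_i α_i = 1 and N × α_i < 1 for every unit so that the β_i are well defined):

```python
import random

def sampford(alphas, n, rng=random.Random(0)):
    """Sampford's rejective method (sketch): one draw with probabilities
    alpha_i, then n-1 draws with replacement using beta_i proportional to
    alpha_i / (1 - n * alpha_i); restart the round whenever duplicates occur."""
    k = len(alphas)
    # The normalizing constant kappa cancels inside choices(), so it is omitted.
    betas = [a / (1 - n * a) for a in alphas]
    while True:
        sample = rng.choices(range(k), weights=alphas, k=1)
        sample += rng.choices(range(k), weights=betas, k=n - 1)
        if len(set(sample)) == n:              # reject rounds with duplicates
            return set(sample)
```

Because rejection restarts the whole round, the method stays practical only while duplicates remain unlikely.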
The main drawback of this sampling design is that as N becomes large, it becomes likely that duplicates will occur in each sampling round.

CHAPTER 3
THE GEOMETRIC FILE

In this chapter we give an introduction to the basic reservoir sampling algorithm that was proposed to obtain an online random sample of a data stream. The algorithm assumes that the sample maintained is small enough to fit in main memory in its entirety. We discuss and motivate why very large sample sizes can be mandatory in common situations. We describe three alternatives for maintaining very large, disk-based samples in a streaming environment. We then introduce the geometric file organization and present algorithms for reservoir sampling with the geometric file. We also describe how multiple geometric files can be maintained all at once to achieve a considerable speedup.

3.1 Reservoir Sampling

The classic algorithm for maintaining an online random sample of a data stream is known as reservoir sampling [11, 38]. To maintain a reservoir sample R of target size |R|, the following loop is used:

Algorithm 1 Reservoir Sampling
1: Add the first |R| items from the stream directly to R
2: for i = |R| + 1 to ∞ do
3:   Wait for a new record r to appear in the stream
4:   with probability |R|/i do
5:     Remove a randomly selected record from R
6:     Add r to R

A key benefit of the reservoir algorithm is that after each execution of the for loop, it can be shown that the set R is a true, uniform random sample (without replacement) of the first i records from the stream. Thus, at all times, the algorithm maintains an unbiased snapshot of all of the data produced by the stream. The name "reservoir sampling" is an apt one. The sample R serves as a reservoir that buffers certain records from the data stream. New records appearing in the stream may be trapped by the reservoir, whose limited capacity then forces an existing record to exit the reservoir.
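Algorithm 1 can be rendered in a few lines of Python (a sketch; the names are ours, and the "with probability |R|/i" step becomes a comparison against a uniform random draw):

```python
import random

def reservoir_sample(stream, size, rng=random.Random(42)):
    """Algorithm 1: maintain a uniform random sample of `size` records.
    After the i-th record, the reservoir is a uniform random sample
    (without replacement) of the first i records."""
    reservoir = []
    for i, record in enumerate(stream, start=1):
        if i <= size:
            reservoir.append(record)           # first |R| records go straight in
        elif rng.random() < size / i:          # admit with probability |R|/i
            victim = rng.randrange(size)       # evict a random existing record
            reservoir[victim] = record
    return reservoir
```

Note that the reservoir here is an in-memory list; the rest of this chapter is concerned with what happens when it must live on disk instead.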
Reservoir sampling can be very efficient, with time complexity less than linear in the size of the stream. Variations on the algorithm allow it to "go to sleep" for a period of time during which it only counts the number of records that have passed by [53]. After a certain number of records have been seen, the algorithm "wakes up" and captures the next record from the stream. Correctness of the reservoir sampling algorithm: The reservoir sampling process can be viewed as a two-phase process: (1) adding the first |R| records to the reservoir, and (2) adding subsequent records until the input is consumed. A reservoir algorithm should maintain the following invariant in the second phase: after each record is processed, the reservoir should be a simple random sample of size |R| of the records processed so far. Algorithm 1 maintains this invariant in steps (2-6) as follows [11, 38]. When the ith record is processed (i > |R|), it is added to the reservoir with probability |R|/i by step 4. We need to show that for all other records processed thus far, the inclusion probability is |R|/i. Let r_k be any record in the reservoir such that k ≠ i. Let R_i denote the state of the reservoir just after the addition of the ith record. Thus, we are interested in Pr[r_k ∈ R_i]:

Pr[r_k ∈ R_i] = Pr[r_k ∈ R_{i−1}] · Pr[r_i ∈ R_i] · Pr[r_k not expelled] + Pr[r_k ∈ R_{i−1}] · Pr[r_i ∉ R_i]
             = (|R|/(i−1)) · ((|R|/i) · ((|R|−1)/|R|) + (1 − |R|/i))
             = (|R|/(i−1)) · ((i−1)/i)
             = |R|/i

The correctness of the inclusion probability alone is not sufficient to prove the required invariant. Consider the systematic sampling described in Chapter 8 of Cochran's book [16]. To select a sample of |R| units, systematic sampling takes a unit at random from the first k units and "every kth" unit thereafter. Although the inclusion probability in systematic sampling is the same as in simple random sampling, the properties of the sample, such as its variance, can be far different.
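The univariate recurrence above can be checked exactly with rational arithmetic rather than by simulation (a sketch; the function name is ours, and R stands for |R|):

```python
from fractions import Fraction

def inclusion_probability(R, i):
    """Unroll the recurrence for Pr[r_k in R_i], for a record r_k already in
    the reservoir after the initial fill, using exact rationals."""
    p = Fraction(1)                        # r_k is in the reservoir after step R
    for j in range(R + 1, i + 1):
        keep_new = Fraction(R, j)          # Pr[record j is admitted] = R/j
        not_expelled = Fraction(R - 1, R)  # r_k survives the random eviction
        p = p * (keep_new * not_expelled + (1 - keep_new))
    return p

# The invariant: the probability collapses to exactly R/i at every step.
assert all(inclusion_probability(5, i) == Fraction(5, i) for i in range(6, 60))
```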
It is known that the variance of systematic sampling can be better or worse than that of simple random sampling, depending on data heterogeneity and the correlation coefficient between pairs of sampled units. We therefore also need to show that the pairwise probability Pr[r_k, r_l ∈ R_i] has the correct value. All three-way inclusion probabilities must also be correct, as well as all four-way inclusion probabilities, and so on. In other words, we need to show that for a set S of interest, Pr[S ⊆ R_i] has the correct value, for all S ⊆ R. The proof that reservoir sampling maintains the correct inclusion probability for any set of interest is actually very similar to the univariate inclusion probability correctness discussed above. We know that the univariate inclusion probability is Pr[r_k ∈ R_i] = |R|/i. For any arbitrary value of |S| ≤ |R|, assume that we have the correct probabilities when we have seen i − 1 input records, i.e., Pr[S ⊆ R_{i−1}] = C(i−1−|S|, |R|−|S|) / C(i−1, |R|), where C(n, k) denotes the binomial coefficient. When the ith record is processed (i > |R|), we have

Pr[S ⊆ R_i] = Pr[S ⊆ R_{i−1}] · Pr[r_i ∈ R_i] · Pr[none of S's records are expelled] + Pr[S ⊆ R_{i−1}] · Pr[r_i ∉ R_i]
           = Pr[S ⊆ R_{i−1}] · ((|R|/i) · ((|R|−|S|)/|R|) + (1 − |R|/i))
           = Pr[S ⊆ R_{i−1}] · (i−|S|)/i
           = C(i−|S|, |R|−|S|) / C(i, |R|),

which is the desired probability.

3.2 Sampling: Sometimes a Little is not Enough

One advantage of random sampling is that samples usually offer statistical guarantees on the estimates they are used to produce. Typically, a sample can be used to produce an estimate for a query result that is guaranteed to have error less than ε with a probability δ (see Cochran for a nice introduction to sampling [16]). The δ value is known as the confidence of the estimate. Very large samples are often required to provide accurate estimates with suitably high confidence. The need for very large samples can be easily explained in the context of the Central Limit Theorem (CLT) [27].
The CLT implies that if we use a random sample of size N to estimate the mean μ of a set of numbers, the error of our estimate is usually normally distributed with mean zero and variance σ²/N, where σ² is the variance of the set over which we are performing our estimation. Since the "spread" of a normally distributed random variable is proportional to the square root of the variance (also known as the standard deviation), the error observed when using a random sample is governed by two factors: 1. The error is inversely proportional to the square root of the sample size. 2. The error is directly proportional to the standard deviation of the set over which we are estimating the mean. The significance of this observation is that the sample size required to produce an accurate estimate can vary tremendously in practice, and grows quadratically with increasing standard deviation. For example, say that we use a random sample of 100 students at a university to estimate the average student's age. Imagine that the average age is 20 with a standard deviation of 2 years. According to the CLT, our sample-based estimate will be accurate to within 2.5% with confidence of around 98%, giving us an accurate guess as to the correct answer with only 100 sampled students. Now, consider a second scenario. We want to use a second random sample to estimate the average net worth of households in the United States, which is around $140,000, with a standard deviation of at least $5,000,000. Because the standard deviation is so large, a quick calculation shows we will need more than 12 million samples to achieve the same statistical guarantees as in the first case. Required sample sizes can be far larger when standard database operations like relational selection and join are considered, because these operations can effectively magnify the variance of our estimate.
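The two scenarios above can be reproduced with a quick calculation (a sketch; we assume a normal critical value of z ≈ 2.33 for roughly 98% confidence, a value not stated explicitly in the text):

```python
import math

def required_sample_size(std_dev, mean, rel_error, z):
    """CLT-based sizing: we need z * (std_dev / sqrt(N)) <= rel_error * mean,
    so N >= (z * std_dev / (rel_error * mean)) ** 2."""
    return math.ceil((z * std_dev / (rel_error * mean)) ** 2)

# Students: mean age 20, std dev 2, 2.5% relative error, ~98% confidence.
students = required_sample_size(2, 20, 0.025, 2.33)
# Net worth: mean $140,000, std dev $5,000,000, same guarantees.
households = required_sample_size(5_000_000, 140_000, 0.025, 2.33)
```

With these inputs the student survey needs fewer than 100 samples, while the net-worth survey needs on the order of the 12 million samples quoted above — the quadratic growth in the standard deviation is what drives the gap.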
For example, the work on ripple joins [32] provides an excellent example of how variance can be magnified by sampling over the relational join operator.

3.3 Reservoir for Very Large Samples

Reservoir sampling is very efficient if the sample is small enough to be stored in main memory. However, efficiency is difficult to achieve if a large sample must be stored on disk. Obvious extensions of the reservoir algorithm to on-disk samples all have serious drawbacks. We discuss these obvious extensions now. The virtual memory extension. The most obvious adaptation for very large sample sizes is to simply treat the reservoir as if it were stored in virtual memory. The problem with this solution is that every new sample that is added to the reservoir will overwrite a random, existing record on disk, and so it will require two random disk I/Os: one to read in the block where the record will be written, and one to rewrite it with the new sample. This means we can sample only on the order of 50 records per second at 10 ms per random I/O per disk. Currently, a terabyte of storage requires as few as five disks, giving us a sampling rate of only 5 × 50 = 250 records per second. To put this in perspective, it would take months to sample enough 100-byte records to fill that terabyte. The massive rebuild extension. As an alternative, when new samples are selected from the stream, they are not added to the on-disk reservoir immediately. Rather, we make use of all of our available main memory to buffer new samples. At all times, the records stored in the buffer B logically represent a set of samples that should have been used to replace on-disk samples in order to preserve the correctness of the reservoir algorithm, but that have not yet been moved to disk for performance reasons. When the buffer B fills, we simply scan the entire reservoir R, and replace a random subset of the existing records with the new, buffered samples. The modified algorithm is given as Algorithm 2.
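This buffered scheme can be sketched in Python as follows (illustrative names; the `flush` callback stands in for Algorithm 2's single sequential scan, here a massive-rebuild flush that replaces a random subset of the on-disk records):

```python
import random

def buffered_reservoir(stream, R, buf_cap, flush, rng=random.Random(7)):
    """Buffered reservoir sampling (sketch): admitted records accumulate in a
    main-memory buffer; when the buffer fills, `flush` merges it into the
    on-disk reservoir in one pass."""
    reservoir, buffer = [], []
    for i, r in enumerate(stream, start=1):
        if i <= R:
            reservoir.append(r)                        # initial fill
        elif rng.random() < R / i:                     # admit with prob |R|/i
            if buffer and rng.random() < len(buffer) / R:
                buffer[rng.randrange(len(buffer))] = r # overwrite a buffered sample
            else:
                buffer.append(r)
            if len(buffer) == buf_cap:
                flush(reservoir, buffer, rng)          # one sequential scan
                buffer = []
    return reservoir, buffer

def flush(reservoir, buffer, rng):
    # Massive-rebuild flush: buffered samples replace a random on-disk subset.
    for pos, rec in zip(rng.sample(range(len(reservoir)), len(buffer)), buffer):
        reservoir[pos] = rec
```

In the real setting the reservoir lives on disk, so this `flush` costs a scan of the entire file; the geometric file introduced below exists precisely to replace it with something cheaper.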
Algorithm 2 Reservoir Sampling with a Buffer
1: for i = 1 to ∞ do
2:   Wait for a new record r to appear in the stream
3:   if i ≤ |R| then
4:     Add r directly to R and continue
5:   else
6:     with probability |R|/i do
7:       with probability Count(B)/|R| do
8:         // new samples can overwrite buffered samples
9:         Replace a random record in B with r
10:      else do
11:        Add r to B
12:    if Count(B) == |B| then
13:      Scan the reservoir R and empty B in one pass
14:      B = ∅

Count(B) refers to the current number of records in B. Note that since the records contained in B logically represent records in the reservoir that have not yet been added to disk, a newly-sampled record can either be assigned to replace an on-disk record, or it can be assigned to replace a buffered record (this is decided in step (7) of the algorithm). In a realistic scenario, the ratio of the number of disk blocks to the number of records buffered in main memory may approach or even exceed one. For example, a 1 TB database with 128 KB blocks will have 7.8 million blocks; and for such a relatively large database it is realistic to expect that we have access to enough memory to buffer millions of records. As the number of buffered records per block meets or exceeds one, most or all of the blocks on disk will contain a record that has been randomly selected for replacement by line (9) of Algorithm 2, and so all of the database blocks must be updated. Thus, it makes sense to rely on fast, sequential I/O to update the entire file in a single pass. The drawback of this approach is that every time the buffer fills, we are effectively rebuilding the entire reservoir to process a set of buffered records that are a small fraction of the existing reservoir size. The localized overwrite extension. We can do better if we enforce a requirement that all samples be stored in a random order on disk. If data are clustered randomly, then we can simply write the buffer sequentially to disk at any arbitrary position.
Because of the random clustering, we can guarantee that wherever the buffer is written to disk, the new samples will overwrite a random subset of the records in the reservoir and preserve the correctness of the algorithm. The problem with this solution is that after the buffered samples are added, the data are no longer clustered randomly, and so a randomized overwrite cannot be used a second time. The data are now clustered by insertion time, since the buffered samples were the most recently seen in the data stream, and were written to a single position on disk. Any subsequent buffer flush will need to overwrite portions of both the new and the old records to preserve the algorithm's correctness, requiring an additional random disk head movement. With each subsequent flush, maintaining randomness becomes more costly, as data become more and more clustered by insertion time. Eventually, this solution will deteriorate, unless we periodically re-randomize the entire reservoir. Unfortunately, re-randomizing the entire reservoir is as costly as performing an external-memory sort of the entire file containing the samples, and requires taking the sample offline.

3.4 The Geometric File

The three extensions to Algorithm 1 can be used to maintain a large, on-disk sample, but all of them have drawbacks. In this section, we discuss a fourth algorithm and an associated data organization called the geometric file that addresses these pitfalls. The geometric file is best seen as an extension of the massive rebuild option given as Algorithm 2. Just like Algorithm 2, the geometric file makes use of a main-memory buffer that allows new samples selected by the reservoir algorithm to be added to the on-disk reservoir in a lazy fashion. However, the key difference between Algorithm 2 and the algorithms used by the geometric file is that the geometric file makes use of a far more efficient algorithm for merging those new samples into the reservoir.
Intuitive description: Except for step (13) of Algorithm 2, the basic algorithm employed by the geometric file is not much different. As far as step (13) is concerned, the difference between the geometric file and the massive rebuild extension is that the geometric file empties the buffer more efficiently, in order to avoid scanning or periodically re-randomizing the entire reservoir. To accomplish this, the entire sample in main memory that is flushed into the reservoir is viewed as a single subsample or stratum [16], and the reservoir itself is viewed as a collection of subsamples, each formed via a single buffer flush. Since the records in a subsample are a non-random subset of the records in the reservoir (they are sampled from the stream during a specific time period), each new subsample needs to overwrite a true, random subset of the records in the reservoir in order to maintain the correctness of the reservoir sampling algorithm. If this can be done efficiently, we can avoid rebuilding the entire reservoir in order to process a buffer flush. At first glance, it may seem difficult to achieve the desired efficiency. The buffered records that must be added to the reservoir will typically overwrite a subset of the records stored in each of the existing subsamples during a buffer flush. Though we may be able to avoid rebuilding the entire file, the fact that the buffer must overwrite a subset of each on-disk subsample presents a challenge when trying to maintain acceptable performance, because this naturally leads to fragmentation (see the discussion of the localized overwrite extension in Section 3.3). For example, if there are 100 on-disk subsamples, the buffer must be split 100 ways in order to write to a portion of each of the 100 on-disk subsamples.
This fragmented buffer then becomes a new subsample, and subsequent buffer flushes that need to replace a random portion of this subsample must somehow efficiently overwrite a random subset of the subsample's fragmented data. The geometric file uses a careful, on-disk data organization in order to avoid such fragmentation. The key observation behind the geometric file is that the number of records of a subsample that are replaced with records from a buffered sample can be characterized with reasonable accuracy using a geometric series (hence the name geometric file). As buffered samples are added to the reservoir via buffer flushes, we observe that each existing subsample loses approximately the same fraction of its remaining records every time, where the fraction of records lost is governed by the ratio of the size of a buffered sample to the overall size of the reservoir. By "loses", we mean that the subsample has some of its records replaced in the reservoir with records from a subsequent subsample. Thus, the size of a subsample decays approximately exponentially as buffered samples are added to the reservoir. This exponential decay is used to great advantage in the geometric file, because it suggests a way to organize the data in order to avoid problems with fragmentation. Each subsample is partitioned into a set of segments of exponentially decreasing size. These segments are sized so that every time a buffered sample is added to the reservoir, we expect that each existing subsample loses exactly the set of records contained in its largest remaining segment. As a result, each subsample loses one segment to the newly-created subsample every time the buffer is emptied, and a geometric file can be organized into a fixed and unchanging set of segments that are stored as contiguous runs of blocks on disk.
Because the set of segments is fixed beforehand, fragmentation and update performance are not problematic: in order to replace records in an existing subsample with the records from a new buffer flush, a simple, efficient, sequential overwrite of the existing subsample's largest segment generally suffices.

3.5 Characterizing Subsample Decay

To describe the geometric file in detail, we begin with an analogy between the samples in a subsample S that are lost over time, and radioactive decay. Imagine that we have 100 grams of Uranium at an initial point in time (U_0 = 100), and a decay rate (1 − a) = 0.1 with a retention rate of a. On day one, the mass of Uranium decays to U_0 × a = 90 grams, because the Uranium loses U_0 × (1 − a) = 10 grams of its mass. We define n = U_0 × (1 − a) to be the mass of Uranium lost on the very first day, giving n = 10 for our example. On day two (with U_1 = 90), the Uranium further decays to U_1 × a = 81 grams, this time losing U_1 × (1 − a) = U_0 × a × (1 − a) = n × a = 9 grams of its mass. On day three, it further decays by n × a² = 8.1 grams, and so on. The decay process is allowed to continue until we have less than 3 grams of Uranium remaining. Continuing with the Uranium analogy, three questions that are relevant to our problem of maintaining very large samples from a data stream are:
1. What is the amount of Uranium lost on any given ith day?
2. How can the initial mass of Uranium, 100 grams, be expressed in terms of n and a?
3. How many days will it take before we are left with 3 grams or less of Uranium?
These questions can be answered using the following three simple observations related to geometric series:
Observation 1: Given a retention rate a < 1 and n the first term of a geometric series, the ith term is given by n × a^(i−1) for any n ∈ R.
Observation 2: Given a retention rate a < 1, it holds that Σ_{i=1}^{∞} n × a^(i−1) = n/(1 − a) for any n ∈ R.
Observation 3: Given a retention rate a < 1, define f(j) as Σ_{i=j+1}^{∞} n × a^(i−1) = (n/(1 − a)) × a^j.
From Observations 2 and 3, it follows that the largest j such that f(j) > β is j = ⌊log(β(1 − a)/n) / log(a)⌋. We denote this floor by T. To relate this back to the task of reservoir sampling, imagine that our large, disk-based reservoir sample R is maintained using a reservoir sampling algorithm in conjunction with a main-memory buffer B (as in Algorithm 2). Recall that the way reservoir sampling works is that new samples from the data stream are chosen to overwrite random samples currently in the reservoir. The buffer temporarily stores these new samples, delaying the overwrite of a random set of records that are already stored on disk. Once the buffer is full, all new samples are merged with R by overwriting a random subset of the existing samples in R. Consider some arbitrary subsample S of R (so S ⊆ R), with capacity |S|. Since the buffer B represents samples that have already overwritten an equal number of records of R, a buffer flush overwrites exactly |B| samples of R. Thus, on expectation the merge will overwrite (|S| × |B|)/|R| samples of S. If we define (1 − a) = |B|/|R|, then on expectation, S should lose |S| × (1 − a) of its own records due to the buffer flush.¹ We refer to this loss as subsample decay. We can roughly describe the expected decay of S after repeated buffer merges using the three observations stated before. If the subsample retention rate is a = 1 − |B|/|R|, then:
From Observation 1, it follows that the ith buffer merge, on expectation, removes n × a^(i−1) samples from what remains of S.
From Observation 2, it follows that the initial size of a subsample is |S| = n/(1 − a).
From Observation 3, it follows that the expected number of merges required until S has β or fewer samples left is T.
The net result of this is that it is possible to characterize the expected decay of any arbitrary subset of the records in our disk-based sample as new records are added to the sample through multiple emptyings of the buffer.
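The three observations translate into a small planning routine (a sketch, with names of our choosing) that computes the expected segment sizes, the initial subsample size, and the number of merges T:

```python
import math

def segment_plan(n, a, beta):
    """Expected decay of a subsample: the i-th merge removes n * a**(i-1)
    records, the initial size is n / (1 - a), and T merges leave at most
    beta records (T is the largest j with f(j) > beta)."""
    T = math.floor(math.log(beta * (1 - a) / n, a))
    segments = [n * a ** i for i in range(T)]    # expected size of each segment
    return segments, n / (1 - a), T
```

With the Uranium parameters (n = 10, a = 0.9, β = 3), `segment_plan` gives T = 33 and an initial size of 100, matching the analogy above.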
If we view S as being composed of T on-disk "segments" of exponentially decreasing size, plus a single group of final segments of total size β that are buffered in main memory (subsequently referred to as the "beta segment"), then the ith buffer flush into R will on expectation overwrite exactly one on-disk segment from S. S loses an additional segment with every buffer flush until the subsample has only its beta segment remaining. At the point that only the subsample's beta segment remains, the samples contained therein can be replaced directly. The reason that the beta segment is buffered in main memory is that overwriting a segment requires at least one random disk head movement, which is costly. By storing the beta segment in main memory, we can reduce the number of disk head movements with little main-memory storage cost. The process is depicted in Figure 3-1.

¹ Actually, this is only a fairly tight approximation to the expected rate of decay. It is not an exact characterization because these expressions treat the emptying of the buffer into the reservoir as a single, atomic event, rather than a set of individual record additions (see Section 3.7).

Figure 3-1. Decay of a subsample after multiple buffer flushes.

Figure 3-2. Basic structure of the geometric file.
3.6 Geometric File Organization
This decay process suggests a file organization for efficiently maintaining very large random samples from a data stream. Let a subsample S be the set of records that are loaded into our disk-based reservoir sample R in a single emptying of the buffer. Since we know that the number of records that remain in S will on expectation decay over time as depicted in Figure 3-1, we can organize our large, disk-based sample as a set of decaying subsamples. At any point in time, the largest subsample was created by the most recent flushing of the buffer into R, and has not yet lost any segments. The second largest subsample was created by the second most recent buffer flush; it lost its largest segment in the most recent buffer flush. In general, the ith largest subsample was created by the ith most recent buffer flush, and it has had i − 1 segments removed by subsequent buffer flushes. The overall file organization is depicted in Figure 3-2.
3.7 Reservoir Sampling With a Geometric File
Given this organization, processing a buffer flush becomes an easy task. The overall reservoir sampling algorithm for the geometric file organization is given as Algorithm 3. The terms n, α, and T carry the meaning discussed in Section 3.5. The process described by Algorithm 3 is depicted graphically in Figure 3-3. First, the file is filled with the initial data produced by the stream (a through c). To add the first records to the file, the buffer is allowed to fill with samples. The buffered records are then randomly grouped into segments, and the segments are written to disk to form the largest initial subsample (a). For the second initial subsample, the buffer is only allowed to fill to |B| × α of its capacity before being written out (b). For the third initial subsample, the buffer fills to |B| × α^2 of its capacity before it is written (c). This is repeated until the reservoir has completely filled (as was shown in Figure 3-2).
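The initial filling schedule above can be sketched as follows (a minimal sketch with illustrative sizes; the function name and parameters are ours):

```python
def initial_subsample_sizes(reservoir_size, buf_size):
    """Buffer fill targets |B|, |B|*alpha, |B|*alpha**2, ... used to seed
    the geometric file, where alpha = 1 - |B|/|R| (Lemma 1)."""
    alpha = 1 - buf_size / reservoir_size
    sizes = []
    size = float(buf_size)
    while size >= 1:  # stop once a subsample would hold less than one record
        sizes.append(size)
        size *= alpha
    return sizes

# For example, a 1000-record reservoir filled through a 100-record buffer
# (alpha = 0.9) is seeded with subsamples of size 100, 90, 81, ...
sizes = initial_subsample_sizes(1000, 100)
```

The sizes form a geometric series whose sum approaches the reservoir capacity, which is exactly why the file "completely fills" after the initial sequence of flushes.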
At this point, new samples must overwrite existing ones. To facilitate this, the buffer is again allowed to fill to capacity. Records are then randomly grouped into segments of appropriate size, and those segments overwrite the largest segment of each existing subsample (d). This process is then repeated indefinitely, as long as the stream produces new records (e and f). This file organization has several significant benefits for use in maintaining a very large sample from a data stream:
- Performing a buffer flush requires absolutely no reads from disk.
- Each buffer flush requires only T random disk head movements; all other disk I/Os are sequential writes. To add the new samples from the buffer into the geometric file to create a new subsample S, we need only seek to the position that will be occupied by each of S's on-disk segments.
- Even if segments are not block-aligned, only the first and last block in each overwritten segment must be read and then rewritten (to preserve the records from adjacent segments).
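As a rough illustration of this I/O pattern, the cost of one flush can be modeled as random seeks plus one sequential write. This is a sketch with illustrative constants (the chapter later quotes around four seeks per written segment at roughly 10 ms each, and about 25 seconds to write 1GB sequentially); the function name and defaults are ours:

```python
def flush_cost_seconds(num_segments, seeks_per_segment=4, seek_ms=10.0,
                       sequential_write_s=25.0):
    """Rough cost of one buffer flush: random seeks to position each of the
    new subsample's T segments, plus one sequential write of the buffer."""
    return num_segments * seeks_per_segment * seek_ms / 1000.0 + sequential_write_s
```

Under these assumptions, a flush writing 1029 segments costs roughly 66 seconds, most of it seek time, which is why minimizing the segment count matters so much.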
Algorithm 3 Reservoir Sampling with a Geometric File
1: Set numSubsamples = 0
2: for int i = 1 to ∞ do
3:   Wait for a new record r to appear in the stream
4:   if i ≤ |R| then
5:     Add r to B
6:     if Count(B) == |B| × α^numSubsamples then
7:       Randomize the ordering of the records in B
8:       Set n = Count(B) × (1 − α)
9:       Partition B into segments of size n, nα, nα^2, and so on
10:      Flush the first T segments to the disk
11:      Store the group of remaining segments in main memory
12:      numSubsamples++
13:      B = ∅
14:   else
15:     with probability |R|/i do
16:       with probability Count(B)/|R| do
17:         Replace a random record in B with r
18:       else do
19:         Add r to B
20:       if Count(B) == |B| then
21:         Partition the buffer into segments of size n, nα, nα^2, and so on (see Section 3.7.1)
22:         for each segment sg_j from B do
23:           Overwrite the largest segment of the jth largest subsample of R with sg_j
24:         B = ∅
3.7.1 Introducing the Required Randomness
One issue that needs to be addressed is the partitioning of the buffer into segments in Algorithm 3, Step (21). In order to maintain the algorithm's correctness, when the buffer is flushed to disk it must overwrite a truly random subset of the records on disk. Thus, when performing the flush, we need to randomly choose records from the reservoir to replace. This implies that the on-disk subsamples (which are expectedly of size n/(1 − α), nα/(1 − α), nα^2/(1 − α), and so on) will lose around n, nα, nα^2 records, and so on, respectively. However, while the number of records replaced in a subsample S will on expectation be proportional to the size of S (and hence equal to the size of S's largest on-disk segment), this replacement must be performed in a randomized fashion. The situation can be illustrated as follows. Say we have a set of numbers, divided into three buckets, as shown in Figure 3-4. Now, we want to add five additional numbers to our set, by randomly replacing five existing numbers.
While we do expect numbers to be replaced in a way that is proportional to bucket size (Figure 3-4(b)), this is not always what will happen (Figure 3-4(c)).
Algorithm 4 Randomized Segmentation of the Buffer
1: for each subsample S_i in the reservoir R do
2:   Set N_i = Number of records in S_i
3:   Set M_i = 0
4: for each record r in the buffer B do
5:   Randomly choose a victim subsample S_i such that Pr[choosing S_i] = N_i / Σ_j N_j
6:   N_i--; M_i++
In order to correctly introduce this variance into the geometric file, we need to add a few additional steps to Algorithm 3. Before we add a new subsample to disk via a buffer flush in Step (21), we first perform a logical, randomized partitioning of the buffer into segments, described by Algorithm 4. In Algorithm 4, each newly sampled record is randomly assigned to replace a sample from an existing, on-disk subsample, so that the probability of each subsample losing a record is proportional to its size. The result of Algorithm 4 is an array of M values, where M_i tells Step (21) of Algorithm 3 how many records should be assigned to overwrite the ith on-disk subsample.

Figure 3-3. Building a geometric file.

Figure 3-4. Distributing new records to existing subsamples. (a) Five new samples randomly replace existing samples, which are grouped into three buckets holding 1/5, 1/5, and 3/5 of the total. (b) Most likely outcome: new samples distributed proportionally. (c) Possible (though unlikely) outcome: new samples all distributed to the smallest bucket.

3.7.2 Handling the Variance
Of course, there is no guarantee that M_1 = n, M_2 = nα, M_3 = nα^2, and so on, so there is no guarantee that Algorithm 3 will overwrite exactly the number of records contained in each
subsample's largest segment. To handle this problem, we associate a stack (or buffer¹) with each of the subsamples. The stack associated with a subsample will buffer any of the subsample's records that logically should not have been overwritten during a buffer flush into the subsample (because M_i for some buffer flush for that subsample was smaller than expected), but whose space had to be claimed by the buffer flush in order to write a new subsample to disk. If the size of the stack is positive, it means that the corresponding subsample is larger than expected, because it has had fewer of its records overwritten than expected. We also allow a negative stack size. This simply means that some of the subsample's records should have been overwritten but were not, because an M_i value for that subsample was larger than expected. A stack size of −k means that k of the subsample's on-disk records logically are not part of the reservoir (even though they are physically present on disk), and should be ignored during query processing.
¹ We use the term "stack" rather than "buffer" to clearly differentiate the extra storage associated with each subsample from the buffer B.
Making use of the set of stacks is fairly straightforward. Imagine that nα^(i−1) of a buffer's records are sent to overwrite a segment from an existing subsample S_i, but according to Algorithm 4, M_i should have been. Then, there are two possible cases:
Case 1: M_i is smaller than nα^(i−1) by some number of records ε. In this case, ε records are removed from the segment that is about to be overwritten and pushed onto S_i's stack in order to buffer them. This is necessary because these records logically should not be overwritten by the records that are going to be added to the disk, but they will be.
Case 2: M_i is larger than nα^(i−1) by some number of records ε. In this case, ε records are popped off of S_i's stack to reflect the additional records that should have been removed from S_i, but were not.
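Algorithm 4's proportional assignment, together with the stack reconciliation of the two cases above, can be sketched as follows (a minimal sketch; the function names are ours):

```python
import random

def randomized_segmentation(buffer_len, subsample_sizes):
    """Algorithm 4: assign each buffered record to a victim subsample with
    probability proportional to the subsample's current size N_i."""
    N = list(subsample_sizes)
    M = [0] * len(N)
    for _ in range(buffer_len):
        i = random.choices(range(len(N)), weights=N)[0]
        N[i] -= 1  # the victim logically loses one record ...
        M[i] += 1  # ... and M_i buffered records will overwrite subsample i
    return M

def reconcile_stack(stack_size, physical_loss, logical_loss):
    """Case 1 (logical_loss < physical_loss): push the surplus survivors.
    Case 2 (logical_loss > physical_loss): pop records that should have died.
    A negative stack size marks on-disk records to ignore at query time."""
    return stack_size + (physical_loss - logical_loss)
```

For example, partitioning a 1000-record buffer over subsamples of sizes 5000, 3000, and 2000 yields M values that sum to 1000 and are, on expectation, proportional to those sizes.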
These stack operations are performed just prior to Step (23) in Algorithm 3. Note that since the final group of segments from a subsample, of total size β, is buffered in main memory, its maintenance does not require any stack operations. Once a subsample has lost all of its on-disk samples, overwrites of records in this set can be handled by simply replacing the records directly.
3.7.3 Bounding the Variance
Because the stacks associated with each subsample will be used with high frequency as insertions are processed, each stack must be maintained with extreme efficiency. Writes should be entirely sequential, with no random disk head movements. To assure this efficiency and avoid any sort of online reorganization, it is desirable to preallocate space for each of the stacks on disk. To preallocate space for these stacks, we need to characterize how much overflow we can expect from a given subsample, which will bound the growth of the subsample's stack. It is important to have a good characterization of the expected stack growth. If we allocate too much space for the stacks, then we allocate disk space for storage that is never used. If we allocate too little space, then the top of one stack may grow up into the base of another. If a stack does overflow, it can be handled by buffering the additional records temporarily in memory, or by moving the stack to a new location on disk until the stack can again fit in its allocated space. This is not a catastrophic event, but it increases the disk I/O associated with stack maintenance and leads to fragmentation, and so it is an event that we would like to render very rare. To avoid this, we observe that if the stack associated with a subsample S contains any samples at a given moment, then S has had fewer of its own samples removed than expected.
Thus, our problem of bounding the growth of S's stack is equivalent to bounding the difference between the expected and the observed number of samples that S loses as N new samples are added to the reservoir, over all possible values of N. To bound this difference, we first note that after adding N new samples to the reservoir, the probability that any existing sample in the reservoir has been overwritten by a new sample is 1 − (1 − 1/|R|)^N. During the addition of new records to the reservoir, we can view a subsample S of initial size |B| as a set of |B| identical, independent Bernoulli trials (coin flips). The ith trial determines whether the ith sample was removed from S. Given this model, the number of samples remaining in S after N new samples have been added to the reservoir is binomially distributed with |B| trials and P = Pr[s ∈ S remains] = (1 − 1/|R|)^N. Since we are interested in characterizing the variance in the number of samples removed from S primarily when |B| is large, the binomial distribution can be approximated with very high accuracy using a normal distribution with mean μ = |B|P and standard deviation σ = √(|B|P(1 − P)) [42]. Simple arithmetic implies that the greatest variance is achieved when a subsample has on expectation lost 50% of its records to new samples (P = 0.5); at this point the standard deviation σ is 0.5√|B|. Since we want to ensure that stack overruns are essentially impossible, we choose a stack size of 3√|B|. This allows the amount of data remaining in a given subsample to be up to six standard deviations from the norm without a stack overflow, and is not too costly an additional overhead. A quick lookup in a standard table of normal probabilities tells us that this will yield only around a 10^−9 probability that any given subsample overflows its stack.
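The sizing rule can be checked numerically under the normal approximation. This sketch computes the one-sided six-sigma tail via the complementary error function; the buffer capacity is an illustrative choice of ours:

```python
import math

B = 10**7  # illustrative buffer capacity, in records

sigma_max = 0.5 * math.sqrt(B)            # worst-case std. dev., at P = 0.5
stack_capacity = 3 * math.sqrt(B)         # the 3*sqrt(|B|) rule: six sigmas
tail = 0.5 * math.erfc(6 / math.sqrt(2))  # Pr[Z > 6] for a standard normal
```

Here `tail` comes out to roughly 9.9 × 10^−10, matching the quoted ~10^−9 figure, and even over 100,000 such opportunities for overflow the no-overflow probability stays near 99.99%.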
While achieving such a small probability may seem like overkill, it is important to remember that many thousands of subsamples may be created during the life of the geometric file, and we want to ensure that very few of them overflow their respective stacks. If 100,000 on-disk segments are replaced, then using a stack of size 3√|B| will yield a very reasonable probability of experiencing no overflows of (1 − 10^−9)^100,000, or 99.99%. In practice, the actual probability of experiencing no overflows will be even greater. This is due to the fact that the standard deviation in subsample size for most of a subsample's lifespan will be much less than 0.5√|B|, due to the high percentage of its lifespan during which its associated P is less than 0.5 as it slowly loses all of its samples.
3.8 Choosing Parameter Values
Given a specified file size and buffer size, two parameters associated with using the geometric file must be chosen: α, which is the fraction of a subsample's records that remain after the addition of a new subsample, and β, which is the total size of a subsample's segments that are buffered in memory.
3.8.1 Choosing a Value for Alpha
In general, it is desirable to minimize α. Decreasing α decreases the number of segments used to store each subsample. Fewer segments means fewer random disk head movements are required to write a new subsample to disk, since each segment requires around four disk seeks to write (one to read the location and one to write the new segment, and two more to subsequently adjust the stack of the previous owner). To illustrate the importance of minimizing α, imagine that we have a 1GB buffer and a stream producing 100B records, and we want to maintain a 1TB sample. Assume that we use an α value of 0.99. Thus, each subsample is originally 1GB, and |B| = 10^7 records. From Observation 2 we know that n/(1 − α) must be 10^7, so we must use n = 10^5.
If we choose β = 320 (so that the beta segment is around the size of one 32KB disk block), then from Observation 3 we will require ⌊(log 320 + log(1 − 0.99) − log 10^5) / log 0.99⌋ = 1029 segments to store the entire new subsample. Now, consider the situation if α = 0.999. A similar computation shows that we will now require 10,344 segments to store the same 1GB subsample. This is an order-of-magnitude difference, with significant practical importance. With four disk seeks per segment, 1029 segments means that we spend around 40 seconds of disk time on random I/Os (at 10ms each), whereas 10,344 segments means that around 400 seconds of disk time are spent on random disk I/Os. This is important when one considers that the time required to write 1GB to a disk sequentially is only around 25 seconds. While minimizing α is vital, it turns out that we do not have the freedom to choose α. In fact, to guarantee that the sum of all existing subsamples is |R|, the choice of α is governed by the ratio of |R| to the size of the buffer |B|:
Lemma 1. (The size of a geometric file is |R|) ⟺ (1 − α = |B|/|R|).
Proof. In the proof (and consequently the Lemma) we ignore the fact that |B| × α^(i−1) may not be integral; we also ignore the storage associated with auxiliary structures such as the stacks and the beta segments. In this case, the geometric file is simply a collection of subsamples of decaying size. We know that the largest subsample on disk is created by the most recent buffer flush and has |B| records in it. From Observation 1, the size of the ith subsample of a file is |B| × α^(i−1). It then follows from Observation 2 that the total size of all subsamples of a geometric file is Σ_{i=1}^∞ |B| × α^(i−1) = |B|/(1 − α), and setting this equal to |R| yields (1 − α) = |B|/|R|. □
We will address this limitation in Section 3.10.
3.8.2 Choosing a Value for Beta
It turns out that the choice of β is actually somewhat unimportant, with far less impact than α.
For example, if we allocate 32KB for holding the β in-memory samples of each subsample, and |B|/|R| is 0.01, then as described above, adding a new subsample requires that 1029 segments be written, which will require on the order of 1029 seeks. Redoing this calculation with 1MB allocated to buffering samples from each on-disk subsample (so that β = 10^4), the number of on-disk segments is ⌊(log 10^4 + log(1 − 0.99) − log 10^5) / log 0.99⌋, or 687. By increasing the amount of main memory devoted to holding the smallest segments of each subsample by a factor of 32, we are able to reduce the number of disk head movements by less than a factor of two. Thus, we will not consider optimizing β. Rather, we will fix β to hold a set of samples equivalent to the system block size, and search for a better way to increase performance.
3.9 Why Is Reservoir Sampling with a Geometric File Correct?
We discuss the correctness of the geometric file by answering the following questions:
1. Why is the classical reservoir sampling algorithm (presented as Algorithm 1) correct? That is, what is the invariant maintained by Algorithm 1?
2. Why is the obvious disk-based extension of Algorithm 1 (presented as Algorithm 2) correct? That is, how does Algorithm 2 maintain the invariant of Algorithm 1 via the use of a main memory buffer?
3. Why is the proposed geometric-file-based sampling technique of Algorithm 3 correct?
We have answered the first question in Section 3.1. We discuss the second and third questions here.
3.9.1 Correctness of the Reservoir Sampling Algorithm with a Buffer
Algorithm 2 makes use of a main memory buffer of size |B| to hold new samples. The buffered samples logically represent a set of samples that should have been used to replace on-disk samples in order to preserve the correctness of the sampling algorithm, but that have not yet been moved to disk for performance reasons (that is, due to lazy writes). It is not hard to see that the invariant maintained by Algorithm 1 is also maintained by Algorithm 2 in step (6).
The new records are sampled with the same probability |R|/i. The only difference is that newly sampled records are added to the reservoir using steps (7-14) instead of the simple steps (5-6) of Algorithm 1. We now discuss why these steps are equivalent. One straightforward way of keeping the sampled records in the buffer and performing lazy writes is as follows. Every time we decide to add a new sample to the buffer (i.e., with probability |R|/i), we also generate a random number between 1 and |R| to decide its position in the reservoir. However, we store this position in the position array, and thus avoid an immediate disk seek. If we happen to generate a position that is already in the position array, we overwrite the corresponding record in the buffer with the newly sampled record. Had we flushed that record to disk using the classic algorithm (rather than buffering it), we would likewise have replaced it with the newly sampled record; thus we obtain the same result. Once the buffer is full, we flush it in a single scan of the reservoir and overwrite the records as dictated by the sorted order of the position array. It is obvious that this process is equivalent to steps (5-6) of Algorithm 1 as far as correctness is concerned. Logically, steps (7-14) of Algorithm 2 implement exactly this process. The probability that we will generate a random position between 1 and |R| that is already in the position array of size |B| is |B|/|R|. Step (7) of Algorithm 2 decides whether to overwrite a random buffered record with a newly sampled record. Once the buffer is full, step (13) performs a one-pass buffer-reservoir merge by generating sequential random positions in the reservoir on the fly.
3.9.2 Correctness of the Reservoir Sampling Algorithm with a Geometric File
In Algorithm 2 we store the samples sequentially on the disk and overwrite them in a random order.
Though correct, the algorithm demands almost a complete scan of the reservoir (to perform all random overwrites) for every buffer flush. We can do better if we instead force the samples to be stored in a random order on disk, so that they can be replaced via an overwrite using sequential I/Os. The localized overwrite extension discussed before uses this idea. Every time the buffer is flushed to the reservoir, it is randomized in main memory and written as a random cluster on the disk. We maintain the correctness of this technique by splitting the random cluster N ways, where N is the number of existing clusters on the disk, and by overwriting a random subset of each existing cluster. This avoids the problem of clustering by insertion time. However, the drawback of this technique is that the solution deteriorates because of fragmentation of the clusters. The geometric file overcomes the drawbacks of these two techniques, and can be viewed as a combination of Algorithm 2 and the idea used in the localized overwrite extension. The correctness of the geometric file results directly from the correctness of these two techniques. In the case of the geometric file, the entire sample in main memory (referred to as a subsample) is randomized and flushed into the reservoir. Furthermore, each new subsample is split into exactly as many segments as there are existing subsamples on the disk. These segments then overwrite a random portion of each disk-based subsample. The only difference with the geometric file is that it organizes the records to be overwritten systematically on the disk, by making the observation that each existing subsample loses approximately the same fraction of its remaining records every time.
3.10 Multiple Geometric Files
The value of α can have a significant effect on geometric file performance. If α = 0.999, we can expect to spend up to 95% of our time on random disk head movements.
However, if we were instead able to choose α = 0.9, then we would reduce the number of disk head movements by a factor of 100, and we would spend only a tiny fraction of the total processing time on seeks. Unfortunately, as things stand, we are not free to choose α. According to Lemma 1, α is fixed by the ratio |B|/|R|. That is, for a fixed desired reservoir size, we need a larger buffer to lower the value of α. However, there is a way to improve the situation. Given a buffer of fixed capacity |B| and desired sample size |R|, we choose a smaller value α' < α, and then maintain more than one geometric file at the same time to achieve a large enough sample. Specifically, we need to maintain m = |R|(1 − α')/|B| geometric files at once. These files are identical to what we have described thus far, except that the parameter α' is used to compute the sizes of a subsample's on-disk segments, and the size of each file is |B|/(1 − α'). The remainder of this Section describes the details of how multiple geometric files are used to achieve greater efficiency.
3.11 Reservoir Sampling with Multiple Geometric Files
The reservoir sampling algorithm with multiple geometric files is similar to Algorithm 3. Each of the m geometric files is still treated as a set of decaying subsamples, and each subsample is partitioned into a set of segments of exponentially decreasing size, just as is done in Algorithm 3, Steps (5)-(13). The only difference is that as each file is created, the parameter α' is used instead of α in Steps (6), (8)-(9), and the m geometric files are filled one after another, in turn. Thus, each subsample of each geometric file will have segments of size n, nα', nα'^2, and so on.
Algorithm 5 Randomized Segmentation of the Buffer for Multiple Geometric Files
1: for each S_ij, the ith subsample in the jth file, do
2:   Set N_ij = Number of records in S_ij
3:   Set M_ij = 0
4: for each record r in the buffer B do
5:   Randomly choose a victim subsample S_ij such that Pr[choosing S_ij] = N_ij / Σ_kl N_kl
6:   N_ij--; M_ij++
However, processing additional records from the stream is somewhat different. As more and more records are produced by the stream, new samples are captured and are added to the buffer exactly as in Algorithm 3, Steps (15)-(20), until the buffer is full. Once the buffer is full, its record order is randomized, just as with a single geometric file. Next, the buffer is flushed to disk. This is where the algorithm is modified. Overwriting records on disk with records from the buffer differs in two primary ways, as discussed next.
Partitioning the buffer: In Algorithm 4, the buffer is partitioned so that the size of each buffer segment is on expectation proportional to the current size of the subsamples in a single file. In the case of multiple geometric files, we partition the buffer just as in Algorithm 4; however, we randomly partition the buffer across all subsamples from all geometric files. The number of buffer segments after the partitioning is the same as the total number of subsamples in the entire reservoir, and the size of each buffer segment is on expectation proportional to the current size of one of the subsamples from one of the geometric files. This allows us to maintain the correctness of the reservoir sampling algorithm. The buffer partitioning steps in the case of multiple geometric files are given in Algorithm 5.
Merging buffer segments with multiple geometric files: This step requires quite a different approach compared to Algorithm 3's buffer merge. We discuss all the intricacies subsequently, but at a high level, the largest segment of each subsample from only one geometric file is overwritten with samples from the buffer. This allows for considerable speedup, as we discuss in Section 3.12.
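Algorithm 5's partitioning across all files can be sketched as follows (a minimal sketch; the function name and data layout are ours):

```python
import random

def randomized_segmentation_multi(buffer_len, file_subsample_sizes):
    """file_subsample_sizes[j][i] holds N_ij, the current size of the i-th
    subsample of the j-th geometric file.  Returns M with M[(j, i)] equal to
    the number of buffered records assigned to overwrite that subsample."""
    N = {(j, i): size
         for j, sizes in enumerate(file_subsample_sizes)
         for i, size in enumerate(sizes)}
    M = dict.fromkeys(N, 0)
    keys = list(N)
    for _ in range(buffer_len):
        # victim chosen with probability N_ij / (sum over all k,l of N_kl)
        victim = random.choices(keys, weights=[N[k] for k in keys])[0]
        N[victim] -= 1
        M[victim] += 1
    return M
```

The consolidation step that follows would then group, for each subsample rank i, the m per-file counts M[(0, i)], ..., M[(m−1, i)] into one consolidated segment.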
At first glance, this would seem to compromise the correctness of the algorithm: logically, the buffered samples must overwrite samples from every one of the geometric files (in fact, this is precisely why the buffer is partitioned across all geometric files, as described in the previous bullet). However, the correctness can be maintained by making use of some additional buffer space. In Sections 3.11.1 to 3.11.3, we describe in detail an algorithm that is able to maintain the correctness of the sample.
3.11.1 Consolidation and Merging
As stated previously, the process of flushing a buffer to disk once it has been partitioned must be altered. The first step in flushing the buffer to disk is the consolidation of the many small buffer segments that result from partitioning the buffer across all files, to form larger segments that are then used to overwrite segments in only a single geometric file. To form the largest consolidated segment, we group the m buffer segments assigned to the largest subsample of every file. The next largest consolidated segment is formed by grouping the m buffer segments corresponding to the next largest subsample across every file, and so on. Once the segments assigned to the various files have been consolidated, the resulting segments are used to overwrite subsamples from a single geometric file using exactly the algorithm from Section 3.4, subject to the constraint that the jth buffer merge overwrites subsamples from the (j mod m)th geometric file.
3.11.2 How Can Correctness Be Maintained?
Logically, samples from the buffer have been partitioned so as to preserve the correctness of the reservoir algorithm: each record has been assigned to a subsample with probability proportional to the subsample's size. However, the fact that these partitions are then consolidated and merged into a single subsample would seem to compromise algorithm correctness, since the subsamples in the (j mod m)th geometric file are overwritten with too many new samples.
Thus, this file physically loses many of its samples before it should. This results in a subsample with fewer samples stored on disk than it should have in order to preserve the correctness of the reservoir sampling algorithm. Our remedy to this problem is to delay overwriting a subsample's largest segment until the time that all (or most) of the records that will be overwritten on disk are invalid, in the sense that they have logically been "overwritten" by having records from subsequent buffer flushes assigned to replace them.

Figure 3-5. Speeding up the processing of new samples using multiple geometric files. (a) Initial configuration: each of the m geometric files has an additional dummy that holds no data. (b) The jth new subsample is added by overwriting the dummy in the i = (j mod m)th geometric file. (c) Existing subsamples give up their largest segment to reconstitute the dummy; the data in these segments are protected until the next time the dummy is overwritten. (d) The next m − 1 buffer flushes write new subsamples to the other m − 1 geometric files, using the same process. The mth buffer flush again overwrites the dummy in the ith geometric file, and the process is repeated from step (c).

In order to accomplish this, we note that if we did not perform consolidation, and instead replaced a segment of each subsample with exactly those records assigned to overwrite records from that subsample, then on expectation a subsample would lose all of the records in its largest segment after m buffer flushes. Thus, if we somehow delay overwriting the largest segment in each file for m buffer flushes, we can sidestep the problem of losing too many records due to consolidation.
The way to accomplish this is to overwrite subsamples in a lazy manner. We merge the buffer with the (j mod m)th geometric file, but we do not overwrite any of the valid samples stored in the file until the next time we get to that file. We achieve this by allocating enough extra space in each geometric file to hold a complete, empty subsample. This subsample is referred to as the dummy. The dummy never decays in size, and never stores its own samples. Rather, it is used as a buffer that allows us to sidestep the problem of a subsample decaying too quickly. When a new subsample is added to a geometric file, the new subsample overwrites segments of the dummy rather than overwriting the largest segment of any existing subsample. Thus, we protect the segments of subsamples that contain valid data by overwriting the dummy's records instead. When records are merged from the buffer into the dummy, the space previously owned by the dummy is given up to allow storage of the file's newest subsample. After this flush, the largest segment of each of the subsamples in the file is given up to reconstitute the new dummy. Because the records in the (new) dummy's segments will not be overwritten until the next time that this particular geometric file is written to, all of the data contained within them is protected. Note that with a dummy subsample, we no longer have a problem with a subsample losing its samples too quickly. Instead, a subsample may have slightly too many samples present on disk at any given time, buffered by the file's dummy. These extra samples can easily be ignored during query processing. The only additional cost we incur with the dummy is that each of the geometric files on disk must have |B| additional units of storage allocated. The use of a dummy subsample is illustrated in Figure 3-5.
3.11.3 Handling the Stacks in Multiple Geometric Files
One final issue that should be considered is the maintenance of the stacks associated with the subsamples of the (j mod m)th geometric file. Just as in the single-file case, the purpose of the stack associated with a subsample is to store samples that are still valid, but whose space must be given up in order to store new samples from the buffer that have been flushed to disk. With multiple geometric files, this does not change. It is possible that when the buffer is written to the dummy subsample in a file, the dummy may still contain valid samples from a subsample in that file. Specifically, one or more of the dummy's segments may contain valid samples from the last subsample to own the segment. In that case, the valid samples are saved to that subsample's stack before the dummy is overwritten.
3.12 Speed-Up Analysis
The increase in speed achieved using multiple geometric files can be dramatic. The time required to flush a set of new samples to disk as a new subsample is dominated by the need to perform random disk head movements. For each subsample, we need two random movements to overwrite its largest segment (one to read the location and one to write the new segment) and then two more seeks for its stack adjustment: a total of around 40 ms/segment. The number of segments required to write a new subsample to disk in the case of multiple geometric files (and thus the number of random disk head movements required) is given by Lemma 2.
Lemma 2. Let u = (log(1/α'))^−1. Multiple geometric files can be used to maintain an online sample of arbitrary size with a cost of O(u × log|B| / |B|) random disk head movements for each newly sampled record.
Proof. We know that for every buffer flush, m segments in the buffer are grouped to form a consolidated segment. All such consolidated segments are then used to overwrite the largest on-disk segments of the subsamples stored in a single geometric file.
From Observation 3, we know that the number of on-disk segments of a subsample (and thus the number of consolidated segments) is ⌊(log β − log n)/log α'⌋. Substituting n = (1 − α') × |B| and simplifying the expression (absorbing the constant log(1 − α') term and ignoring the floor), we compute the number of segments to write as ω(log |B| − log β), where ω = (log(1/α'))^-1. Assuming a constant number c of random seeks per segment written to the disk, the total number of random disk head movements required per record is ωc(log |B| − log β)/|B|, which is O(ω × (log |B|)/|B|). □

In the case of multiple geometric files we use additional space for m dummy subsamples. Thus, the total storage required by all geometric files is R + (m × |B|). If we wish to maintain a 1TB reservoir of 100B samples with 1GB of memory, we can achieve α' = 0.9 while using only 1.1TB of disk storage in total. For α' = 0.9, we need to write fewer than 100 segments per 1GB buffer flush. At 40 ms per segment, this is only 4 seconds of random disk head movements to write 1GB of new samples to disk.

In order to test the relative ability of the geometric file to process a high-speed stream of insertions, we have implemented and benchmarked five alternatives for maintaining a large reservoir on disk: the three alternatives discussed in Section 3.3, the geometric file, and the framework described in Section 3.10 for using multiple geometric files at once. We present these benchmarking results in Chapter 7.

CHAPTER 4
BIASED RESERVOIR SAMPLING

Random sampling selects a subset of the items in a population so that statistical properties of the population can be inferred by studying the sample rather than the entire population. In this chapter we study the problem of how to compute a simple, fixed-size random sample (without replacement) in a single pass over a data stream, where the goal is to bias the sample using some arbitrary weighting function.
In this chapter we propose a simple modification to the classic reservoir sampling algorithm [11, 38] in order to derive a very simple algorithm that permits the sort of fixed-size, biased sampling given in the example. Our method assumes the existence of an arbitrary, user-defined weighting function f which takes as an argument a record ri, where f(ri) > 0 describes the record's utility in subsequent query processing. We then compute (in a single pass) a biased sample Ri of the i records produced by a data stream. Ri is fixed-size, and the probability of sampling the jth record from the stream is proportional to f(rj) for all j ≤ i. This is a fairly simple and yet powerful definition of biased sampling, and is general enough to support many applications.

Of course, one straightforward way to sample according to a well-defined bias function would be to make a complete pass over the data set to compute the total weight of all the records, Σ_{j=1}^{N} f(rj). During a second pass, we can then choose the ith record of the data set with probability |R|f(ri)/Σ_{j=1}^{N} f(rj) by flipping a biased coin in a Bernoulli fashion. However, there are two problems with this method. First, this algorithm requires two passes over the data set. This may not be practical for very large data sets, and it may be infeasible in a streaming environment. Second, the resulting sample is not fixed-size, which may be undesirable for several reasons: the resources required to store the sample are not fixed, and most estimators over the resulting sample will have higher variance.

In most cases, our algorithm is able to produce a correctly biased sample. However, given certain pathological data sets and data orderings, this may not be the case. Our algorithm adapts in this case and provides a correctly biased sample for a slightly modified bias function f'. We analytically bound how far f' can be from f in such a pathological case, and experimentally evaluate the practical significance of this difference.
Finally, we derive the correlation (covariance) between the Bernoulli random variables governing the sampling of two records ri and rj using our algorithm. We use this covariance to derive the variance of a Horvitz-Thompson estimator making use of a sample computed using our algorithm.

The rest of the chapter is organized as follows. We describe a single-pass biased sampling algorithm. We also define a distance metric to evaluate the worst-case deviation from the user-defined weighting function f. Finally, we derive a simple estimator for a biased reservoir. The experiments performed to test our algorithms are presented in Chapter 7.

4.1 A Single-Pass Biased Sampling Algorithm

We introduced the classical reservoir sampling algorithm, which maintains an unbiased sample of a data stream, in the previous chapter. We will extend this algorithm to give our biased reservoir sampling algorithm, and prove various properties of (and exhibit pathological cases for) the resulting method.

4.1.1 Biased Reservoir Sampling

It turns out that in most cases, one may produce a correctly biased sample by simply modifying the reservoir algorithm to maintain a running sum totalWeight over all observed f(ri). Incoming records are then added to the reservoir so that the probability of sampling record rj is |R|f(rj)/totalWeight. This basic version of the algorithm is given as Algorithm 6. It is possible to prove that this modified algorithm results in a correctly biased sample, provided that the "probability" from line (8) of Algorithm 6 does not exceed one.

Lemma 3. Let Ri be a state of the biased sample just after the ith record in the stream has been processed.
Using the biased sampling described in Algorithm 6, we are guaranteed that for each Ri and for each record rj produced by the data stream such that j ≤ i, we have

Pr[rj ∈ Ri] = |R|f(rj) / Σ_{k=1}^{i} f(rk).

Proof. We need to prove that when a new record ri appears in the stream, then for each record rj from the stream, Pr[rj ∈ Ri] = |R|f(rj)/Σ_{l=1}^{i} f(rl). A new record produced by the stream is sampled with probability |R|f(ri)/Σ_{l=1}^{i} f(rl) in Step (8) of the algorithm, so the probability requirement trivially holds for the new record. We now must prove this fact for rk, for all k < i. Since R_{i-1} is correct, we know that for k < i, Pr[rk ∈ R_{i-1}] = |R|f(rk)/Σ_{l=1}^{i-1} f(rl). There are then two cases to consider: either the new record ri is chosen for the reservoir, or it is not. If ri is not chosen, then rk simply remains in the reservoir. If ri is chosen, then rk remains in the reservoir only if rk is not selected for expulsion from the reservoir (the chance of surviving, given that ri is chosen, is (|R| - 1)/|R|). Thus, the probability that a record rk is in Ri is

Pr[rk ∈ Ri] = Pr[ri ∈ Ri] × Pr[rk ∈ R_{i-1}] × (|R| - 1)/|R| + (1 - Pr[ri ∈ Ri]) × Pr[rk ∈ R_{i-1}]
            = Pr[rk ∈ R_{i-1}] × (1 - Pr[ri ∈ Ri]/|R|)
            = (|R|f(rk)/Σ_{l=1}^{i-1} f(rl)) × (1 - f(ri)/Σ_{l=1}^{i} f(rl))
            = (|R|f(rk)/Σ_{l=1}^{i-1} f(rl)) × (Σ_{l=1}^{i-1} f(rl)/Σ_{l=1}^{i} f(rl))
            = |R|f(rk)/Σ_{l=1}^{i} f(rl).

This is the desired result, and it proves the statement of the lemma. □

4.1.2 So, What Can Go Wrong? (And a Simple Solution)

This simple modification to the reservoir sampling algorithm will give us the desired biased sample as long as the probability |R|f(ri)/totalWeight never exceeds one. If this value does exceed one, then the correctness of the algorithm is not preserved.
Unfortunately, we may very well see such meaningless probabilities, especially early on, when the reservoir has just been filled with samples and the value of totalWeight is still relatively small. Fortunately, the situation improves after some time: the number of records produced by the stream is very large and totalWeight grows accordingly, making it unlikely that any single record will have |R|f(ri)/totalWeight > 1.

Algorithm 6 Biased Reservoir Sampling (A Simple Modification to Algorithm 1)
1: Set totalWeight = 0
2: for int i = 1 to ∞ do
3:   Wait for a new record ri to appear in the stream
4:   totalWeight = totalWeight + f(ri)
5:   if i ≤ |R| then
6:     Add ri directly to R
7:   else
8:     with probability |R|f(ri)/totalWeight do
9:       Remove a randomly selected record from R
10:      Add ri to R

We define an overweight record to be a record ri for which |R|f(ri)/Σ_{k=1}^{i} f(rk) > 1. This is simply a record for which the selection "probability" exceeds one. There are two methods for handling such overweight records. The first solution, which we describe presently, is to use some additional buffer memory. Every time we encounter an overweight record, we do not process the record immediately; instead, we buffer the record in a priority queue. The queue is arranged so that at all times it gives us access to the minimum-weight buffered record, which we term r^min. Every time that totalWeight is incremented, we check whether |R|f(r^min)/totalWeight ≤ 1. If it is, we remove the record from the queue and reconsider it for selection. The process is then repeated until the record at the head of the queue is again found to be overweight, at which point the modified reservoir algorithm proceeds normally. An important factor to consider when determining the feasibility of maintaining such a queue in the general case is providing an upper bound on its size.
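Under the assumption that random choices come from Python's standard random module, the basic algorithm together with the priority-queue treatment of overweight records might be sketched as follows (function and variable names are illustrative, and replacing a random slot stands in for "remove a random record, then add"):

```python
import heapq
import itertools
import random

def biased_reservoir(stream, f, size, rng=None):
    """Sketch of Algorithm 6 plus the priority-queue fix for overweight
    records; f maps a record to a positive weight."""
    rng = rng or random.Random(0)
    R, total_weight = [], 0.0
    heap, tie = [], itertools.count()      # min-heap keyed on weight

    def try_insert(r, w):
        p = size * w / total_weight
        if p > 1.0:                        # overweight: defer the record
            heapq.heappush(heap, (w, next(tie), r))
        elif rng.random() < p:
            R[rng.randrange(size)] = r     # expel a random victim

    for r in stream:
        total_weight += f(r)
        if len(R) < size:
            R.append(r)                    # reservoir still filling
            continue
        try_insert(r, f(r))
        # Reconsider buffered records once their probability drops to <= 1.
        while heap and size * heap[0][0] / total_weight <= 1.0:
            w, _, rb = heapq.heappop(heap)
            try_insert(rb, w)
    return R, [r for _, _, r in heap]
```

For example, with |R| = 2 and a stream of unit-weight records plus one record of weight 10, the heavy record is buffered on arrival (2 × 10/12 > 1) and is reinserted, with probability exactly one, as soon as totalWeight reaches 20.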
This can be done by considering the worst possible ordering of the records input to the algorithm, subject to the constraint that the bias function is well-defined. In general, we describe the user-defined weighting function f as being well-defined if

|R|f(ri) / Σ_{k=1}^{N} f(rk) ≤ 1, ∀i = 1, 2, ..., N.

It turns out that in the worst-case scenario we might have to buffer almost the entire data stream. We describe the case by construction. For a given arbitrary reservoir size |R| and stream size N, we first add |R| records, all with the same weight wt1 = 1, to the reservoir. Next, we set f(r_{|R|+1}) = [Σ_{k=1}^{|R|} f(rk)/|R|] + 1 = wt1 + 1 = 2. The inclusion probability of r_{|R|+1} is |R|f(r_{|R|+1})/Σ_{k=1}^{|R|+1} f(rk) = 2|R|/(|R| + 2) > 1. Since r_{|R|+1} is an overweight record, we buffer it. We construct the remaining records of the stream with f(r_{|R|+2}) = ... = f(rN) = f(r_{|R|+1}) = 2, so that all of them are overweight and we must buffer them all. The priority queue thus contains N − |R| records. Since f(ri) = 1 ∀i ≤ |R|, we have |R|f(ri)/Σ_{k=1}^{N} f(rk) = |R|/[|R| + 2(N − |R|)] < 1, and since f(ri) = 2 ∀i > |R| and N > |R|, we have |R|f(ri)/Σ_{k=1}^{N} f(rk) = 2|R|/[|R| + 2(N − |R|)] < 1. Thus, for a well-defined bias function f and the constructed stream, the required queue size is N − |R|. We therefore conclude that for N > |R|, the size of the buffer required for delayed insertion of the overweight records is O(N). We stress that though this upper bound is quite poor (it implies that we may need to buffer the entire data stream!), it is in fact a worst-case scenario, and the approach will often be feasible in practice. This is because weights will often increase monotonically over time (as in the case where newer records tend to be more relevant for query processing than older ones). Still, given the poor worst-case upper bound, a more robust solution is required, which we now describe.
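A quick numeric check of two boundary claims in the construction, using the illustrative sizes |R| = 10 and N = 100 (these particular numbers are an assumption for the example): the (|R|+1)th record is overweight when it arrives, yet the bias function remains well-defined over the completed stream.

```python
def is_overweight(w, size, total_weight):
    """A record is overweight when |R| * f(r) / totalWeight exceeds one."""
    return size * w / total_weight > 1.0

size, N = 10, 100
weights = [1.0] * size + [2.0] * (N - size)   # the adversarial stream above

# The (|R|+1)th record is overweight on arrival: 2|R|/(|R|+2) > 1.
assert is_overweight(2.0, size, size + 2.0)

# Yet over the whole stream, no record violates well-definedness:
total = sum(weights)                           # |R| + 2(N - |R|)
assert all(not is_overweight(w, size, total) for w in weights)
```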
4.1.3 Adjusting Weights of Existing Samples

Another, orthogonal method for handling overweight records (one that can be applied when the available buffer memory is exceeded) is to simply adjust the bias function and do the best that we can. Specifically, when we encounter an overweight record, we simply bump up the weights of all existing samples so as to ensure that the inclusion probability of the current record is exactly one. Of course, as a result of this we will not be able to ensure that the weight of each record ri is exactly f(ri). We describe what we will be able to guarantee in terms of the true weight of a record:

Definition 1. If Ri is the biased sample of the first i records produced by a data stream, then the value f'(rj) is the true weight of a record rj if and only if Pr[rj ∈ Ri] = |R|f'(rj) / Σ_{k=1}^{i} f'(rk).

What we will be able to guarantee is then twofold:

1. First, we will be able to guarantee that f'(rj) will be exactly f(rj) if |R|f(rk)/totalWeight does not exceed one for any k ≥ j.

2. Second, we will be able to guarantee that we can compute the true weight for a given record, in order to unbias any estimate made using our sample (see Section 4.4).

In other words, our biased sample can still be used to produce unbiased estimates that are correct in expectation [16], but the sample might not be biased exactly as specified by the user-defined function f if the value of f(r) tends to fluctuate wildly. While this may seem like a drawback, the number of records not sampled according to f will usually be small. Furthermore, since the function used to measure the utility of a sample in biased sampling is usually the result of an approximate answer to a difficult optimization problem [15] or the application of a heuristic [52], a small deviation from that function might not be of much concern. We present a single-pass biased sampling algorithm that provides both guarantees outlined above as Algorithm 7, and Lemma 4 proves the correctness of the algorithm.

Lemma 4.
Let Ri be a state of the biased sample just after the ith record in the stream has been processed. Using the biased sampling described in Algorithm 7, we are guaranteed that for each Ri and for each record rj produced by the data stream such that j ≤ i, we have

Pr[rj ∈ Ri] = |R|f'(rj) / Σ_{l=1}^{i} f'(rl).

Proof. We know that the probability of selecting the ith record into the reservoir is |R|f(ri)/totalWeight. There are then two cases to explore: the first covers the period after the reservoir is full but before we encounter an overweight record ri, and the second covers the period after we encounter such an ri.

Case (i): The proof of this case is very similar to the proof of Lemma 3. We simply use f' instead of f to prove the desired result.

Algorithm 7 Biased Reservoir Sampling (Adjusting Weights of Existing Samples)
1: Set totalWeight = 0
2: for int i = 1 to ∞ do
3:   Wait for a new record ri to appear in the stream
4:   Set ri.weight = f(ri)
5:   totalWeight = totalWeight + f(ri)
6:   if i ≤ |R| then
7:     Add ri directly to R
8:   else
9:     if |R|f(ri)/totalWeight ≤ 1 then
10:      with probability |R|f(ri)/totalWeight do
11:        Remove a randomly selected record from R
12:        Add ri to R
13:    else
14:      for each record rj in R do
15:        rj.weight = ((|R| − 1)f(ri)/(totalWeight − f(ri))) × rj.weight
16:      totalWeight = |R|f(ri)
17:      Remove a randomly selected record from R
18:      Add ri to R

Case (ii): If |R|f(ri)/totalWeight > 1, we scale the true weight of every existing sample so as to have Σ_{l=1}^{i} f'(rl) = |R|f(ri). This is done by first setting C = (|R| − 1)f(ri)/(totalWeight − f(ri)) and then scaling f'(rk) = C × f(rk) ∀k < i. As a result of this linear scaling, we have

Pr[rj ∈ Ri] = |R| × C × f(rj) / (Σ_{k=1}^{i−1} C × f(rk) + f(ri)) = |R|f'(rj) / Σ_{l=1}^{i} f'(rl). □

An important factor to consider when determining the applicability of Algorithm 7 is the deviation of f' from f. That is: how far off from the correct weighting can we be, in the worst case?
When the stream has no overweight records, we expect f' to be exactly equal to f, but the two may be far apart under certain circumstances. To address this, we define a distance metric in Definition 2 and evaluate the worst-case distance between f' and f.

Definition 2. If f is the user-defined bias function and f' is the actual bias function, then the distance between the two functions is defined as totalDist(f, f') = Σ_{i=1}^{N} dist(ri), where dist(ri) = | f'(ri)/Σ_{k=1}^{N} f'(rk) − f(ri)/Σ_{k=1}^{N} f(rk) |.

For a data stream with no overweight records, totalDist(f, f') = 0 (the best case). The worst-case distance is given by Theorem 1, and is analyzed and proved in Section 4.2.

Theorem 1. Given a set of streaming records r1, r2, ..., rN and a user-defined weighting function f, Algorithm 7 will sample with an actual bias function f' where totalDist(f, f') is upper bounded by

2 [ (|R| − 1)f(r'N) / (|R|f(r'N) + Σ_{k=|R|+1}^{N−1} f(r'k)) − Σ_{k=1}^{|R|} f(r'k) / Σ_{k=1}^{N} f(r'k) ],

where r'1, r'2, ..., r'N is the permutation (reordering) of the streaming records such that f(r'1) ≤ f(r'2) ≤ ... ≤ f(r'N).

According to this theorem, the worst case occurs when the reservoir is initially filled (on startup) with the |R| records having the smallest possible weights (that is, we have the smallest totalWeight when the reservoir is filled) and we encounter the record with the largest weight immediately thereafter. We evaluate the effect of this worst-possible ordering in the experimental section.

4.2 Worst Case Analysis for the Biased Reservoir Sampling Algorithm

Algorithm 7 computes a biased sample according to f', where f' is a function "close" to the user-defined weighting function f according to the following distance metric:

totalDist(f, f') = Σ_{i=1}^{N} dist(ri), where dist(ri) = | f'(ri)/Σ_{k=1}^{N} f'(rk) − f(ri)/Σ_{k=1}^{N} f(rk) |.

The worst case for Algorithm 7 occurs when (1) the reservoir is initially filled with the |R| records having the smallest possible weights and (2) we encounter the record r_max with the largest weight immediately thereafter.
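The weight-adjustment rule of Algorithm 7 might be sketched as follows. This is a simplified in-memory sketch: the function name and the [record, effective_weight] pair layout are illustrative assumptions, and replacing a random slot stands in for "remove a random record, then add".

```python
import random

def biased_reservoir_adjust(stream, f, size, rng=None):
    """Sketch of Algorithm 7: when a record is overweight, scale the stored
    weights so that its inclusion probability becomes exactly one."""
    rng = rng or random.Random(1)
    R, total_weight = [], 0.0          # entries are [record, effective_weight]
    for r in stream:
        w = f(r)
        total_weight += w
        if len(R) < size:
            R.append([r, w])
            continue
        p = size * w / total_weight
        if p <= 1.0:
            if rng.random() < p:
                R[rng.randrange(size)] = [r, w]
        else:
            # Overweight: bump every existing effective weight by the factor
            # C = (|R| - 1) f(r_i) / (totalWeight - f(r_i)) ...
            C = (size - 1) * w / (total_weight - w)
            for entry in R:
                entry[1] *= C
            # ... after which totalWeight = |R| f(r_i), so r_i enters with
            # probability |R| f(r_i) / totalWeight = 1.
            total_weight = size * w
            R[rng.randrange(size)] = [r, w]
    return R, total_weight
```

With |R| = 3, three weight-1 records followed by a weight-100 record (the worst-case ordering just described), the scaling factor is C = 2 × 100/3, the surviving entries' weights become 200/3, and totalWeight becomes exactly |R| × f(r_max) = 300.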
Theorem 1 presented an upper bound on totalDist(f, f') in this worst case. In this section, we first prove that this case is indeed the worst case for Algorithm 7, and then prove the upper bound on totalDist(f, f') given by Theorem 1.

4.2.1 The Proof for the Worst Case

To establish the worst case for Algorithm 7, we first prove the following three propositions; together they lead to the worst-case argument. If we denote the record with the highest weight in the stream as r_max, then for any given ordering of the streaming records r1, ..., r_{i−1}, r_max, ..., rN in which r_max is located at position i, we prove that:

1. Moving the record r_max earlier within the range r_{|R|+1} ... rN cannot decrease totalDist(f, f').

2. When we are initially filling the reservoir, choosing the |R| records with the smallest possible weights maximizes totalDist(f, f').

3. Reordering the records that appear after r_max, in the range r_{i+1} ... rN, cannot increase totalDist(f, f').

The proof of the first proposition (moving r_max earlier in the stream): We prove this proposition by showing that if we move r_max from position i to position i−1, totalDist(f, f') cannot decrease. If r_max is not an overweight record, the claim trivially holds, since moving a non-overweight record does not change totalDist(f, f'). If r_max is an overweight record, we show that the move can only increase totalDist(f, f'). We first compute totalDist_1(f, f') with r_max at position i, then totalDist_2(f, f') with r_max at position i−1, and prove the claim by showing totalDist_2(f, f') − totalDist_1(f, f') ≥ 0.

1. An expression for totalDist_1(f, f'). Since r_max is the ith record of the stream, the result of Lemma 5 (given below) determines the sign of every term of the totalDist formula: each record rj with j < i contributes f'(rj)/Σ_{k=1}^{N} f'(rk) − f(rj)/Σ_{k=1}^{N} f(rk), while each record rj with j ≥ i contributes f(rj)/Σ_{k=1}^{N} f(rk) − f'(rj)/Σ_{k=1}^{N} f'(rk). We know that f'(rj) = C × f(rj) for all j < i, where C = (|R| − 1)f(r_max)/Σ_{k=1}^{i−1} f(rk); that f'(rj) = f(rj) for all j ≥ i; and that Σ_{k=1}^{N} f'(rk) = |R|f(r_max) + Σ_{k=i+1}^{N} f(rk). Summing the two groups of terms, the positive and negative parts collapse to the same quantity, and we obtain

totalDist_1(f, f') = 2 [ (|R| − 1)f(r_max) / (|R|f(r_max) + Σ_{k=i+1}^{N} f(rk)) − Σ_{k=1}^{i−1} f(rk) / Σ_{k=1}^{N} f(rk) ].   (4-1)

2. An expression for totalDist_2(f, f'). In this case r_max is the (i−1)th record of the stream. We denote the record that is swapped with r_max as r_swap. Applying Lemma 5 exactly as before, with the prefix now ending at position i−2 and with f(r_swap) joining the suffix, we obtain

totalDist_2(f, f') = 2 [ (|R| − 1)f(r_max) / (|R|f(r_max) + f(r_swap) + Σ_{k=i+1}^{N} f(rk)) − Σ_{k=1}^{i−2} f(rk) / Σ_{k=1}^{N} f(rk) ].   (4-2)

Figure 4-1. Adjustment of r_max from position i to position i−1 (r_swap moves to position i).

3. The difference totalDist_2(f, f') − totalDist_1(f, f'). Let Y = |R|f(r_max) + Σ_{k=i+1}^{N} f(rk). Subtracting Equation (4-1) from Equation (4-2) gives

totalDist_2(..) − totalDist_1(..) = 2 f(r_swap) / Σ_{k=1}^{N} f(rk) − 2 (|R| − 1)f(r_max) f(r_swap) / (Y × [Y + f(r_swap)]).

Since Y = |R|f(r_max) + Σ_{k=i+1}^{N} f(rk) ≥ (|R| − 1)f(r_max), we have

totalDist_2(..) − totalDist_1(..) ≥ 2 f(r_swap) / Σ_{k=1}^{N} f(rk) − 2 f(r_swap) / [|R|f(r_max) + Σ_{k=i+1}^{N} f(rk) + f(r_swap)].

Since r_max is overweight, Σ_{k=1}^{i−1} f(rk) ≤ (|R| − 1)f(r_max), and therefore [|R|f(r_max) + f(r_swap) + Σ_{k=i+1}^{N} f(rk)] ≥ Σ_{k=1}^{N} f(rk). Hence

totalDist_2(f, f') − totalDist_1(f, f') ≥ 0.

Furthermore, we know that Algorithm 7 accepts the first |R| records of the stream with probability 1; no weight adjustments are triggered for the first |R| records, irrespective of their weights. Therefore, the earliest position at which r_max can appear (and trigger an adjustment) is right after the reservoir is filled. This proves the proposition.
We now turn to proving Lemma 5, which was used in the previous proof.

Lemma 5. If r_max appears as the ith record of the stream, then for all j < i we have

f'(rj)/Σ_{k=1}^{N} f'(rk) ≥ f(rj)/Σ_{k=1}^{N} f(rk),

and for all j > i we have

f'(rj)/Σ_{k=1}^{N} f'(rk) ≤ f(rj)/Σ_{k=1}^{N} f(rk).

Proof. When we encounter r_max as the ith record of the stream, we increase the weights of the records rj, j < i, by the factor C = (|R| − 1)f(r_max)/Σ_{k=1}^{i−1} f(rk), after which Σ_{k=1}^{N} f'(rk) = |R|f(r_max) + Σ_{k=i+1}^{N} f(rk). We also know that f'(rj) = f(rj) for all j ≥ i.

Part 1: For all j < i we have

f'(rj)/Σ_{k=1}^{N} f'(rk) = C f(rj) / [C Σ_{k=1}^{i−1} f(rk) + Σ_{k=i}^{N} f(rk)].

Since C ≥ 1 (because r_max is overweight), multiplying the numerator by C while multiplying only part of the denominator by C cannot decrease the ratio, so

C f(rj) / [C Σ_{k=1}^{i−1} f(rk) + Σ_{k=i}^{N} f(rk)] ≥ f(rj)/Σ_{k=1}^{N} f(rk).

This proves the first part of the lemma.

Part 2: For all j > i we have

f'(rj)/Σ_{k=1}^{N} f'(rk) = f(rj) / [|R|f(r_max) + Σ_{k=i+1}^{N} f(rk)].

Since Σ_{k=1}^{i} f(rk) ≤ |R|f(r_max) (again because r_max is overweight), we have

f(rj) / [|R|f(r_max) + Σ_{k=i+1}^{N} f(rk)] ≤ f(rj)/Σ_{k=1}^{N} f(rk).

This proves the second part of the lemma. □

The proof of the second proposition (the effect of the first |R| records): We now turn our attention to the effect of the first |R| records of the stream on the worst-case distance. If r_max appears as the (|R|+1)th record, then using the result of Lemma 5, for every j ≤ |R| we know that

dist(rj) = (|R| − 1)f(r_max) f(rj) / (Σ_{k=1}^{|R|} f(rk) × [|R|f(r_max) + Σ_{k=|R|+2}^{N} f(rk)]) − f(rj)/Σ_{k=1}^{N} f(rk).

For a given set of records, Σ_{k=1}^{N} f(rk) is a constant. Therefore, dist(rj) increases as Σ_{k=1}^{|R|} f(rk) decreases; in other words, dist(rj) is maximized for the smallest possible Σ_{k=1}^{|R|} f(rk). Thus, totalDist(f, f') is largest when the reservoir is initially filled with the |R| records having the smallest possible weights.
This proves the claim in the second proposition.

The proof of the third proposition (reordering records after r_max): This is immediate: since r_max is the highest-weight record of the stream, no record appearing after r_max can be an overweight record, and so reordering such records does not change f'.

From the above three propositions, we can conclude that the worst case for Algorithm 7 occurs when (1) the reservoir is initially filled with the |R| records having the smallest possible weights and (2) we encounter the record r_max with the largest weight immediately thereafter.

4.2.2 The Proof of Theorem 1: The Upper Bound on totalDist

To derive the upper bound, we start with the totalDist formula and evaluate it in the worst case just established: the reservoir is first filled with the |R| smallest-weight records, and r_max appears as the (|R|+1)th record of the stream. Resolving the absolute values with Lemma 5, exactly as in the derivation of Equation (4-1) (with i = |R|+1), we obtain

totalDist_w(f, f') = 2 [ (|R| − 1)f(r_max) / (|R|f(r_max) + Σ_{k=|R|+2}^{N} f(rk)) − Σ_{k=1}^{|R|} f(rk) / Σ_{k=1}^{N} f(rk) ].   (4-3)
In the worst case, the reservoir is initially filled with the |R| records having the smallest possible weights. If r1, r2, ..., rN are the records in appearance order, then we define r'1, r'2, ..., r'N as the permutation (reordering) of the records such that f(r'1) ≤ f(r'2) ≤ ... ≤ f(r'N). The worst case is then obtained by substituting this ordering into Equation (4-3): the reservoir is filled with r'1, ..., r'_|R|, and r_max = r'N appears immediately thereafter, which yields

totalDist_w(f, f') = 2 [ (|R| − 1)f(r'N) / (|R|f(r'N) + Σ_{k=|R|+1}^{N−1} f(r'k)) − Σ_{k=1}^{|R|} f(r'k) / Σ_{k=1}^{N} f(r'k) ].   (4-4)

This is exactly the upper bound given by Theorem 1.

4.3 Biased Reservoir Sampling With The Geometric File

It is easy to use the biased reservoir sampling algorithm with a geometric file. To use the geometric file for biased sampling, it is vital that we be able to compute the true weight of any given record. To allow this, we will require that the following auxiliary information be stored:

- Each record r will have its effective weight r.weight stored along with it in the geometric file on disk. Once totalWeight becomes large, we can expect that for each new record r, r.weight = f(r). However, for the initial records from the data stream, these two values will not necessarily be the same.

- Each subsample Sj will have a weight multiplier Mj associated with it. Again, for subsamples containing records produced by the data stream after totalWeight becomes large, Mj will typically be one. For efficiency, Mj can be buffered in main memory. Together with the effective weight, the weight multiplier gives us the true weight of a given record r, which is M(r) × r.weight.

Algorithmic changes: Given that we need to store this auxiliary information, the algorithms for sampling from a data stream using the geometric file require three changes to support biased sampling. These modifications are described now.

During startup. To begin with, the reservoir is filled with the first |R| records from the stream. For each of these initial records, r.weight is set to one. Let totalWeight be the sum of f(r) over the first |R| records. When the reservoir is finished filling, Mj is set to totalWeight/|R| for every one of the initial subsamples.
In this way, the true weight of each of the first |R| records produced by the data stream is set to the mean value of f(r) over those records. Giving the first |R| records a uniform true weight is a necessary evil, since they will all be overwritten by subsequent buffer flushes with equal probability.

As subsequent records are produced by the data stream. Just as suggested by Algorithm 4, additional records produced by the stream are added to the buffer with probability (|R|f(ri))/totalWeight, so that, at least initially, the true weight of the ith record is exactly f(ri). The interesting case is when |R|f(ri)/totalWeight > 1 when the ith record is produced by the data stream. In this case, we must scale the true weight of every existing record up so that |R|f(ri)/totalWeight becomes exactly one. To accomplish this, we do the following:

1. For each on-disk subsample, Mj is scaled by the factor C = (|R| − 1)f(ri)/(totalWeight − f(ri)).

2. For each sampled record still in the buffer, rj.weight is scaled by the same factor C.

3. Finally, totalWeight is set to |R|f(ri).

As the buffer fills. When the buffer fills and the jth subsample is to be created and written to disk, Mj is set to one.

4.4 Estimation Using a Biased Reservoir

The biased sampling algorithm presented gives a user the opportunity to make use of different weighting schemes and estimators, depending upon the particular application domain. We discuss one such simple estimator, the standard Horvitz-Thompson estimator [50], for a sample computed using our algorithm. We derive the correlation (covariance) between the Bernoulli random variables governing the sampling of two records ri and rj under our algorithm, and use this covariance to derive the variance of a Horvitz-Thompson estimator. Combined with the Central Limit Theorem, the variance can then be used to provide bounds on the estimator's accuracy.
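Assuming the bookkeeping of Section 4.3 (effective weights plus per-subsample multipliers), the inclusion probability that a Horvitz-Thompson estimate divides by can be recovered from the stored attributes. The helper names and the tuple layout below are illustrative assumptions, not the dissertation's implementation:

```python
def true_weight(weight, multiplier):
    """True weight f'(r) of a stored record: its effective weight times the
    weight multiplier of the subsample holding it (Section 4.3)."""
    return weight * multiplier

def inclusion_probability(weight, multiplier, size, total_weight):
    """Pr[r in R] = |R| * f'(r) / totalWeight, the divisor used by the
    Horvitz-Thompson estimator."""
    return size * true_weight(weight, multiplier) / total_weight

def horvitz_thompson(sample, size, total_weight):
    """Sketch of the HT estimate of SUM g(r); each sampled record is a
    (g_value, effective_weight, multiplier) tuple."""
    return sum(g / inclusion_probability(w, m, size, total_weight)
               for g, w, m in sample)
```

For instance, a record with effective weight 10 and multiplier 1 in a reservoir of size 2 with totalWeight 100 has inclusion probability 2 × 10/100 = 0.2, so its g-value is scaled up by a factor of 5 in the estimate.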
The estimator is suitable for the SUM aggregate function (and, by extension, the AVERAGE and COUNT aggregates) over a single database table for which the reservoir is maintained. Though handling more complicated queries using the biased sample is beyond the scope of this work, it is straightforward to extend the analysis of this section to more complicated queries such as joins [32]. Imagine that we have the following single-table query, whose (unknown) answer is q:

SELECT SUM g1(r)
FROM THE_TABLE AS r
WHERE g2(r)

Given such a query, let g(r) = g1(r) if g2(r) evaluates to true, and 0 otherwise. Let Ri be a state of the biased sample just after the ith record in the stream has been processed. Then the unbiased Horvitz-Thompson estimator for the query answer q can be written as

q̂ = Σ_{rj ∈ Ri} g(rj) / Pr[rj ∈ Ri].

In the Horvitz-Thompson estimator, each record is weighted according to the inverse of its sampling probability. Next, we derive the variance of this estimator. To do this, we need a result similar to Lemma 3 that can be used to compute the probability Pr[{rj, rk} ∈ Ri] under our biased sampling scheme.

Lemma 6. Let Ri be a state of the biased sample just after the ith record in the stream has been processed. Using the biased sampling algorithm described in Algorithm 7, for each Ri and for each record pair {rj, rk} produced by the data stream, where j < k ≤ i, we have

Pr[{rj, rk} ∈ Ri] = (|R|(|R| − 1) f'(rj) f'(rk)) / (Σ_{l=1}^{k−1} f'(rl) × Σ_{l=1}^{k} f'(rl)) × Π_{l=k+1}^{i} (1 − 2Pr[rl ∈ Rl]/|R|).

Proof. The proof is analogous to the proof of Lemma 3. There are two subcases to consider. If k = i, then the proof is relatively easy. In this case,

Pr[{rj, rk} ∈ Ri] = Pr[rj ∈ R_{i−1}] × Pr[ri ∈ Ri] × Pr[rj not expelled]
                  = (|R|f'(rj)/Σ_{l=1}^{i−1} f'(rl)) × (|R|f'(ri)/Σ_{l=1}^{i} f'(rl)) × ((|R| − 1)/|R|)
                  = |R|(|R| − 1) f'(rj) f'(ri) / (Σ_{l=1}^{i−1} f'(rl) × Σ_{l=1}^{i} f'(rl)).

If both k and j are less than i, then the proof becomes a bit more involved, and the probability must be computed recursively.
In this case, we have

Pr[{rj, rk} ∈ Ri] = Pr[{rj, rk} ∈ R_{i−1}] × Pr[ri ∈ Ri] × Pr[{rj, rk} not expelled] + Pr[{rj, rk} ∈ R_{i−1}] × (1 − Pr[ri ∈ Ri])
                  = Pr[{rj, rk} ∈ R_{i−1}] × (Pr[ri ∈ Ri] × (|R| − 2)/|R| + 1 − Pr[ri ∈ Ri])
                  = Pr[{rj, rk} ∈ R_{i−1}] × (1 − 2Pr[ri ∈ Ri]/|R|).

Unrolling this recursion down to the point at which rk was inserted, we obtain

Pr[{rj, rk} ∈ Ri] = (|R|(|R| − 1) f'(rj) f'(rk)) / (Σ_{l=1}^{k−1} f'(rl) × Σ_{l=1}^{k} f'(rl)) × Π_{l=k+1}^{i} (1 − 2Pr[rl ∈ Rl]/|R|),

which is the desired probability. □

This expression can then be used in conjunction with the next lemma to compute the variance of the natural estimator for q.

Lemma 7. The variance of q̂ is

Var(q̂) = Σ_{rj} g²(rj)/Pr[rj ∈ Ri] + Σ_{rj ≠ rk} Pr[{rj, rk} ∈ Ri] g(rj)g(rk) / (Pr[rj ∈ Ri] Pr[rk ∈ Ri]) − q².

Proof. Let Xj be the indicator variable for the event rj ∈ Ri. Then

Var(q̂) = E[q̂²] − (E[q̂])²
       = E[(Σ_{rj ∈ Ri} g(rj)/Pr[rj ∈ Ri])²] − q²
       = Σ_{rj} E[Xj] g²(rj)/Pr²[rj ∈ Ri] + Σ_{rj ≠ rk} E[Xj Xk] g(rj)g(rk)/(Pr[rj ∈ Ri] Pr[rk ∈ Ri]) − q²
       = Σ_{rj} g²(rj)/Pr[rj ∈ Ri] + Σ_{rj ≠ rk} Pr[{rj, rk} ∈ Ri] g(rj)g(rk)/(Pr[rj ∈ Ri] Pr[rk ∈ Ri]) − q².

This proves the lemma. □

By using the result of Lemma 6 to compute Pr[{rj, rk} ∈ Ri], the variance of the estimator is then easily obtained for a specific query. In practice, the variance itself must be estimated by considering only the sampled records, as we typically do not have access to each and every rj during query processing. The q² term and the two sums in the expression for the variance are thus computed over each rj in the sample of the biased geometric file rather than over the entire reservoir.

There is one additional issue regarding biased sampling that is worth some additional discussion: how to efficiently compute the value Pr[{rj, rk} ∈ Ri] in order to estimate the variance during query evaluation. Computing Pr[{rj, rk} ∈ Ri] requires that we be able to compute two subexpressions for each sampled record pair:

|R|(|R| − 1) f'(rj) f'(rk) / (Σ_{l=1}^{k−1} f'(rl) × Σ_{l=1}^{k} f'(rl))   and   Π_{l=k+1}^{i} (1 − 2Pr[rl ∈ Rl]/|R|).

The first subexpression can be easily computed with the help of the running total totalWeight, along with the weight multipliers associated with each subsample.
When sample records are added to the reservoir, in addition to the attribute r_i.weight, we store two more attributes with each record: r_i.oldTotalWeight and r_i.oldM. The first attribute gets its value from the current value of totalWeight, whereas M(r_i) is stored in the second attribute. When a query is evaluated and we need to compute the first subexpression for a given record pair r_j and r_k, we compute the terms in its denominator as follows:

Σ_{l=1}^{k} f'(r_l) = r_k.oldTotalWeight × M(r_k)
Σ_{l=1}^{k−1} f'(r_l) = Σ_{l=1}^{k} f'(r_l) − f'(r_k) = Σ_{l=1}^{k} f'(r_l) − (r_k.weight × M(r_k))

The second subexpression can also be easily computed if we maintain a running total subexp2Total for the sum Σ_l log(1 − 2 Pr[r_l ∈ R_l]/|R|) at all times. When a new record is added to the reservoir, the current value of subexp2Total is stored as another attribute r_i.subexp2Val along with each record. When a query is evaluated, for a given record pair r_j and r_k we simply evaluate:

Π_{l=k+1}^{i} ( 1 − 2 Pr[r_l ∈ R_l]/|R| ) = exp( subexp2Total − r_k.subexp2Val )

CHAPTER 5
SAMPLING THE GEOMETRIC FILE

A geometric file is a simple random sample (without replacement) from a data stream. In this chapter we develop techniques which allow a geometric file to itself be sampled in order to produce smaller sets of data objects that are themselves random samples (without replacement) from the original data stream. The goal of the algorithms described in this chapter is to efficiently support further sampling of a geometric file by making use of its own structure.

5.1 Why Might We Need To Sample From a Geometric File?

In Section 3.2, we argued that small samples frequently do not provide enough accuracy, especially in the case when the resulting statistical estimator has a very high variance. However, while in the general case a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. For example, reconsider the problem of estimating the average net worth of American households as described in Section 3.2.
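The bookkeeping just described can be sketched in a few lines. This is a hypothetical Python illustration; the attribute names (weight, oldTotalWeight, subexp2Val) follow the text, while the container class and function are assumptions:

```python
import math

class SampledRecord:
    # Hypothetical container for the per-record attributes described above.
    def __init__(self, weight, mult, old_total_weight, subexp2_val):
        self.weight = weight                    # f(r) at insertion time
        self.M = mult                           # weight multiplier M(r)
        self.oldTotalWeight = old_total_weight  # totalWeight when r arrived
        self.subexp2Val = subexp2_val           # running log-sum when r arrived

def pair_inclusion_prob(rj, rk, R_size, subexp2_total):
    # First subexpression: assembled from r_k's stored running totals.
    sum_to_k = rk.oldTotalWeight * rk.M                # sum_{l<=k} f'(r_l)
    sum_to_k_minus_1 = sum_to_k - rk.weight * rk.M     # drop f'(r_k)
    first = (R_size * (R_size - 1) * (rj.weight * rj.M) * (rk.weight * rk.M)
             / (sum_to_k_minus_1 * sum_to_k))
    # Second subexpression: exponentiate the difference of log-sums.
    second = math.exp(subexp2_total - rk.subexp2Val)
    return first * second
```

The point of the two stored attributes is visible here: both subexpressions fall out of simple differences of running totals, so no per-pair scan of the stream history is ever needed.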
In the general case, many millions of samples may be needed to estimate the net worth of the average household accurately (due to a small ratio between the average household's net worth and the standard deviation of this statistic across all American households). However, if the same set of records held information about the size of each household, only a few hundred records would be needed to obtain similar accuracy for an estimate of the average size of an American household, since the ratio of the average household size to the standard deviation of household size across households in the United States is greater than 2. Thus, to estimate the answers to these two queries, vastly different sample sizes are needed.

5.2 Different Sampling Plans for the Geometric File

Since there is no single sample size that is optimal for answering all queries and the required sample size can vary dramatically from query to query, this chapter considers the problem of generating a sample of size N from a data stream using an existing geometric file that contains a large sample of records from the stream, where N < |R|. We will consider two specific problems. First, we consider the case where N is known beforehand. We will refer to a sample retrieved in this manner as a batch sample. Batch samples of fixed size have been suggested for use in several approximate query processing applications [1, 21, 30, 34, 39]. In general, the drawback of making use of a batch sample is that the accuracy of any estimator which makes use of the sample is fixed at the time that the sample is taken, whereas the benefit of batch sampling is that the sample can be drawn with very high efficiency. We will also consider the case where N is not known beforehand, and we want to implement an iterative function GetNext. Each call to GetNext results in an additional sampled record being returned to the caller, and so N consecutive calls to GetNext result in a sample of size N.
We will refer to a sample retrieved in this manner as an online or sequential sample. The drawback of online sampling compared to batch sampling is that it is generally less efficient to obtain a sample of size N using online methods. However, since the consumer of the sample can call GetNext repeatedly until an estimator with enough accuracy is obtained, online sampling is more flexible than batch sampling. An online sample retrieved from a geometric file can be useful for many applications, including online aggregation [32, 33]. In online aggregation, a database system tries to quickly gather enough information so as to approximate the answer to an aggregate query. As more and more information is gathered, the approximation quality is improved, and the online sampling procedure is halted when the user is happy with the approximation accuracy.

5.3 Batch Sampling From a Geometric File

5.3.1 A Naive Algorithm

The most obvious way to implement batch sampling is to make use of the reservoir sampling algorithm to draw a sample of size N from a geometric file of size |R| in a single pass. As the following lemma asserts, the resulting sample is also a sample of size N from the original data stream.

Lemma 8. The reservoir sampling algorithm over a geometric file produces a correct random sample of the stream.

Proof. If S is the batch sample of size N retrieved from a geometric file R of size |R| using the reservoir sampling algorithm, then since R is itself a uniform random sample of the data stream D, we know from the correctness of the reservoir sampling algorithm that:

Pr[S ⊆ R] = (number of subsets of size N in R) / (number of such subsets in the data stream D) = C(|R|, N) / C(|D|, N)

Now, imagine that S ⊆ R. If we obtain a sample of size N from R using the reservoir algorithm, the probability that we choose precisely S is:

Pr[S sampled from R | S ⊆ R] = 1 / C(|R|, N)

Thus we have:

Pr[S sampled from R] = Pr[S sampled from R | S ⊆ R] × Pr[S ⊆ R] = (1 / C(|R|, N)) × (C(|R|, N) / C(|D|, N)) = 1 / C(|D|, N)

This is precisely the probability we would expect if we sampled directly from the stream without replacement.
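The naive batch sampler of Lemma 8 is the textbook one-pass reservoir algorithm run over a scan of the file. A minimal sketch in Python (the record iterator and RNG parameter are assumptions for illustration):

```python
import random

def reservoir_sample(stream, n, rng=random):
    # Classic one-pass reservoir sampling (Algorithm R): keep the first n
    # records, then replace a random slot with probability n/(i+1).
    reservoir = []
    for i, rec in enumerate(stream):
        if i < n:
            reservoir.append(rec)
        else:
            j = rng.randrange(i + 1)
            if j < n:
                reservoir[j] = rec
    return reservoir
```

Run over the |R| records of a geometric file, this returns a uniform size-N sample without replacement; by Lemma 8 it is then also a uniform sample of the original stream.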
Unfortunately, though it is very simple, the naive algorithm will be inefficient for drawing a small sample from a large geometric file, since it requires a full scan of the geometric file to obtain a true random sample for any value of N. Since the geometric file may be gigabytes in size, this can be problematic.

5.3.2 A Geometric File Structure-Based Algorithm

We can do better if we make use of the structure of a geometric file itself. The intuitive outline of this approach is as follows. To obtain a batch sample of size N, we precalculate how many records from each on-disk subsample will be included in the batch sample, and then we read the appropriate number of records sequentially from the various segments of each subsample. The process of choosing the number of records to select from each subsample is analogous to Olken and Rotem's procedure for choosing the number of records to select from each hash bucket when performing batched sampling from a hashed file [26]. Once the number of sampled records from each segment has been determined, sampling those records can be done with an efficient sequential read, since within each on-disk segment, all records are stored in a randomized order. The key algorithmic issue is how to calculate the contribution of each subsample. Since this contribution is a multivariate hypergeometric random variable, we can use an approach analogous to Algorithm 4, which is used to partition the buffer to form the segments of a subsample. In other words, we can view retrieving N samples from a geometric file as analogous to choosing N random records to overwrite when new records are added to the file. The resulting algorithm can be described as follows. To start with, we partition the sample space of N records into segments of varying size exactly as in Algorithm 4. We refer to these segments of the sample space as sampling segments.
The sampling segments are then filled with samples from the disk using a series of sequential reads, analogous to the set of writes that are used to add new samples to the geometric file. The largest sampling segment obtains all of its records from the largest subsample, the next largest sampling segment obtains all of its records from the second largest subsample, and so on.

Algorithm 8 Batch Sampling a Geometric File
1: Set NS = Number of subsamples in a geometric file
2: for i = 1 to NS do
3:   Set RecsInSubsam[i] = Size of ith subsample
4:   Set RecsToRead[i] = 0
5: for i = 1 to N do
6:   Choose j such that Pr[choosing j] = RecsInSubsam[j] / (|R| − i + 1)
7:   RecsInSubsam[j] −−
8:   RecsToRead[j] ++
9: for i = 1 to NS do
10:   Append to the batch sample RecsToRead[i] records from the ith subsample

When using this algorithm, some care needs to be taken when N approaches the size of the geometric file. Specifically, when all disk segments of a subsample are returned to a corresponding sampling segment, we must also consider the subsample's in-memory buffered records and any records contained in its stack in order to obtain a sample of the desired size. The detailed algorithm is presented as Algorithm 8. It is clear that this algorithm obtains the desired batch sample by scanning exactly N records, as opposed to the full file scan required by reservoir sampling, at the cost of a few random disk seeks. Since the sampling process is analogous to the process of adding more samples to the file, it is just as efficient, requiring O(w × log |B|/N) random disk head movements for each newly sampled record, as described in Lemma 2.
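The per-subsample counts chosen in steps 5-8 of Algorithm 8 follow a multivariate hypergeometric distribution. One way to draw them, sketched here under assumed names rather than as the dissertation's code, is to allocate the N picks one record at a time, each time choosing a subsample with probability proportional to its remaining size:

```python
import random

def batch_allocation(subsample_sizes, n, rng=random):
    # Draw multivariate hypergeometric counts: each of the n picks chooses
    # a subsample with probability proportional to its remaining records.
    remaining = list(subsample_sizes)
    to_read = [0] * len(remaining)
    total = sum(remaining)
    for _ in range(n):
        pick = rng.randrange(total)       # uniform over all remaining records
        for j, cnt in enumerate(remaining):
            if pick < cnt:
                remaining[j] -= 1
                to_read[j] += 1
                break
            pick -= cnt
        total -= 1
    return to_read
```

Once the counts are fixed, each subsample's contribution can be fetched with a single sequential read, which is where the algorithm's efficiency comes from.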
The extension is fairly straightforward, with an additional first step where we determine the number of records to be sampled from each geometric file. Once this number is determined, we execute Algorithm 8 on each file in order to obtain the desired batch sample.

5.4 Online Sampling From a Geometric File

5.4.1 A Naive Algorithm

One straightforward way of supporting online sampling from a geometric file is to implement the iterative function GetNext as follows. For every call to GetNext, we simply generate a random number i between 1 and the size of the file |R|, and then return the record at the ith position in the geometric file. Care must be taken to avoid choosing the same record of R more than once in order to obtain a correct sample without replacement. For example, to sample N records from R, the numbers 0 through N − 1 could be hashed or randomized using a bijective pseudorandom function onto the domain 0 through |R| − 1, and the resulting N numbers used to generate the sample. To pick the next record to sample, we simply hash N. It is easy to see that the naive algorithm will give us a correct online sample of a geometric file. However, we will use one disk seek per call to GetNext. Since each random I/O requires around 10 milliseconds, the naive algorithm can only sample around 6,000 records from the geometric file per minute per disk. This performance is unacceptable for most applications.

5.4.2 A Geometric File Structure-Based Algorithm

As in the case of the batch sampling algorithm, we can make use of the structure of a geometric file to efficiently support online sampling. Instead of selecting a random record of a geometric file, we randomly pick a subsample and choose its next available record as the return value of GetNext. This is analogous to the classic online sampling algorithm for sampling from a hashed file [26], where first a hash bucket is selected and then a record is chosen.
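One way to realize the bijective pseudorandom function mentioned above, so that positions are visited at most once without materializing the whole permutation, is a Feistel network over a power-of-two superset with cycle-walking. This is a sketch of one standard construction, not necessarily the one the text has in mind:

```python
import hashlib

def feistel_position(x, domain, key, rounds=4):
    # Pseudorandom bijection on [0, domain): apply a balanced Feistel
    # permutation over the next power-of-two-sized superset, then
    # "cycle-walk" (re-apply) until the output lands inside the domain.
    bits = max(2, (domain - 1).bit_length())
    half = (bits + 1) // 2
    mask = (1 << half) - 1

    def round_fn(rnd, v):
        digest = hashlib.sha256(f"{key}:{rnd}:{v}".encode()).digest()
        return int.from_bytes(digest[:4], "big") & mask

    def permute(v):
        left, right = v >> half, v & mask
        for rnd in range(rounds):
            left, right = right, left ^ round_fn(rnd, right)
        return (left << half) | right

    while True:
        x = permute(x)
        if x < domain:
            return x
```

Mapping the counters 0, 1, 2, ... through feistel_position yields distinct positions in [0, |R|), so each call touches a record that has not been returned before, giving a sample without replacement.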
Since the selection of a random record within a subsample is sequential, we may reduce the number of costly disk seeks if we read the subsample in its entirety and buffer the subsample's records in memory. Using this basic methodology, we now describe how a call to GetNext is processed: We first randomly pick a subsample S_i, with the probability of selecting i proportional to the size of the ith subsample. Next, we look for buffered records of S_i; if such records exist, we choose and return the first available record as the return value of GetNext. If no buffered records are found, we fetch and buffer a number of blocks of records from subsample S_i, and return the first buffered record as the return value of GetNext. Since the records from each subsample are read and buffered in memory sequentially, we are guaranteed to choose each record of the reservoir at most once, giving us the desired random sample without replacement. A proof of this is simple, and analogous to the proof of Lemma 3. However, thus far we have not considered a very important question: How many blocks of a subsample S_i should we fetch at the time of a buffer refill? In general there are two extremes that we may consider:

* Fetch many. If we fetch a large number of blocks at the time of the buffer refill, we reduce the overall time to sample N records for large N. This is due to the fact that by fetching many blocks using a sequential read, we amortize the seek time over a large number of blocks and at the same time prepare ourselves for future calls to GetNext; once the records are fetched from disk, the response time for subsequent calls to GetNext is almost instantaneous (only in-memory computations are required). However, the drawback of this approach is that the more records we fetch sequentially from the disk during a single call to GetNext, the longer the response time will be for the particular call to GetNext during which we fetch those blocks.
This is particularly worrisome if we spend a lot of time fetching blocks which are never used (which will be the case if the user intends to draw only a relatively small sample).

* Fetch few. If we fetch a small number of blocks at the time of buffer refills, we reduce the maximum response time for any given GetNext call. However, we then need more seeks to sample N records. This approach can be problematic if the user intends to draw a relatively large sample from the file.

In order to discuss such considerations more concretely, we note that the time required to process a GetNext call is proportional to the number of blocks fetched on the call, assuming that the cost to perform the required in-memory calculations is minimal. If b blocks are fetched during a particular call, we spend s + br time units on that particular call to GetNext, where s is the seek time and r is the time required to scan a block. Once these b blocks are fetched, we incur zero cost for the next bn calls to GetNext, where n is the blocking factor (the number of records per block). Thus, in the case where all blocks are fetched at the first call to GetNext, we incur a total cost of s + br to sample bn records, and have a response time of s + br units at the first call to GetNext, with all subsequent calls having zero cost. Now imagine that instead we split the b blocks into two chunks of size b/2 each, and read a chunk at a time. Thus, the first GetNext call will cost us s + br/2 time units. Once these bn/2 records are used up, we read the next chunk of blocks. The total cost in this scenario is 2s + br, with a response time of s + br/2 time units once at the starting point and once midway through. Note that although the maximum response time on any call to GetNext is reduced by half, we require more time to sample bn records. The question then becomes: How do we reconcile response time with overall sampling time to give the user optimal performance?
The systematic approach we take to answering this question is based on minimizing the average squared sum of response times over all GetNext calls. This idea is similar to the widely utilized sum-squared-error or MSE criterion, which tries to keep the average error or "cost" from being too high, but also penalizes particularly poor individual errors or costs. However, one problem we face using this strategy in the context of online sampling is that we do not know beforehand the value of N, the number of records to be sampled.

Algorithm 9 GetNext for Online Sampling
1: Set NS = Number of subsamples in a geometric file
2: for i = 1 to NS do
3:   Set RecsInSubsam[i] = Size of ith subsample
4:   Set BufferedSubsamSize[i] = 0
5: Randomly choose a subsample S_i such that Pr[choosing i] = RecsInSubsam[i] / |R|
6: RecsInSubsam[i] −−
7: if BufferedSubsamSize[i] == 0 then
8:   Set numRecs to the minimum of bs/r and RecsInSubsam[i]
9:   Read and buffer numRecs records of S_i
10:   BufferedSubsamSize[i] = numRecs
11: BufferedSubsamSize[i] −−
12: Return the next available buffered record of S_i

To address this issue, we use a simple heuristic. Every time we refill a buffer, we look at the number of records already sampled from a subsample and assume that the user will ask for the same number of samples as the algorithm progresses. This gives us the planning horizon for which we can determine the number of blocks to be fetched. We also use the obvious constraint that the total number of samples fetched from the subsample should not exceed the number of records in the subsample. Given this, an analytic solution to the problem of minimizing the average squared cost over all calls to GetNext is as follows: If there are b records per block, then let N/b be the number of blocks in the planning horizon, and let X be the number of equal-size chunks that we read on every buffer refill. Our goal is to determine the value of X and the number of blocks in each chunk.
We know that the time to read a chunk is proportional to s + (N/b × r)/X, and thus the squared sum of the response times of all GetNext calls is:

X × ( s + (N/b × r)/X )²

In order to derive a formula for the value of X that minimizes this, we simply differentiate it with respect to X and then solve for the zero:

d/dX [ X ( s + (N/b × r)/X )² ] = d/dX [ Xs² + 2(N/b)rs + (N/b × r)²/X ] = s² − (N/b × r)²/X²

Setting this to zero, we have X = Nr/(bs). Thus, we divide the N/b blocks into Nr/(bs) chunks and read s/r blocks (that is, bs/r records) from a subsample every time we refill the buffer. It turns out that when this solution is used, the number of blocks read at the time of a buffer refill depends only on the ratio of the seek time to the block scan time. Since this solution is independent of the planning horizon, we always read s/r blocks irrespective of the number of records sampled so far. Algorithm 9 gives the detailed online sampling algorithm.

5.5 Sampling A Biased Sample

We end this chapter by noting that if a geometric file's sample is correctly biased, then the batch and online sampling algorithms we have given will also produce a correctly biased sample with no modification, as described by the following lemma.

Lemma 9. A simple, equal-probability random sample from a correctly biased geometric file will be correctly biased if the sample stored by the geometric file is correctly biased.

Proof. In biased sampling, the probability of a record being accepted into a geometric file is |R| × f(r)/totalWeight, where f(r) is the weight of the record under consideration and totalWeight is the sum of the weights of all records from the stream so far. Let Sample be the biased sample of the geometric file; then we have:

Pr[i ∈ Sample] = Pr[Selecting i from S_i] × Pr[Selecting S_i] × Pr[i ∈ S_i]
= (1/|S_i|) × (|S_i|/|R|) × (|R| × f(r)/totalWeight)
= f(r)/totalWeight

We examine the various algorithms for producing smaller samples from a large, disk-based geometric file in Chapter 7 of this dissertation.
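The optimization above is easy to sanity-check numerically. A small sketch, where the specific parameter values are arbitrary illustrations:

```python
def total_squared_response(X, N, b, s, r):
    # X chunks, each costing s + (N/b * r)/X time units, squared and summed.
    return X * (s + (N / b * r) / X) ** 2

def optimal_chunks(N, b, s, r):
    # The analytic minimizer X = Nr/(bs) derived above.
    return N * r / (b * s)

# With N=1000 records, b=10 records/block, s=0.5 seek time, r=1.0 block
# scan time, the optimum is X = 200 chunks; nearby chunk counts cost more.
X_star = optimal_chunks(1000, 10, 0.5, 1.0)
```

Evaluating the cost at the analytic optimum and at nearby values of X confirms that the derivative calculation did pick out the minimum.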
CHAPTER 6
INDEX STRUCTURES FOR THE GEOMETRIC FILE

Efficiently searching for and discovering required information from a sample stored in a geometric file is essential to speed up query processing. A natural way to support this functionality is to build an index structure for the geometric file. In this chapter we discuss three secondary index structures for the geometric file. The goal is to maintain the index structures as new records are inserted into the geometric file and at the same time provide efficient access to the desired information in the file.

6.1 Why Index a Geometric File?

A geometric file may contain a sample several gigabytes or even terabytes in size. For certain queries, a huge sample like this may contain too much information, and it becomes expensive to scan all the records of the sample to find those (most likely very few) records that match a given condition. For example, consider a geometric file that maintains a temporally biased sample of the daily transactions at a large retail store like Wal-Mart. The records feature all of the attributes that are necessary to capture the details of a transaction, such as StoreID, Location, TransTotal, CustomerID, PaymentMethod, and so on. Consider answering the following SQL query, which returns all Florida customers who made transactions during the current calendar year:

SELECT CustomerName, StoreName, TransTotal
FROM Transaction
WHERE StoreState = 'FL' AND TransDate > 1/1/2007

If a sample of the database were stored in a geometric file, a naive way to estimate the answer to this query would be to scan the entire geometric file, examining all of its records and testing the condition in the WHERE clause on each. The file might have only a few thousand records that satisfy the above criteria, yet we would have to scan several billion sampled records to answer the query.
It would be more efficient if we had some way of obtaining only the records from the current year, and then testing each of them to see if they are from the state of Florida. It would be even more efficient if we could directly obtain the few thousand tuples that satisfy both of the conditions of the WHERE clause. A natural way to speed up the search and discovery of those records from a geometric file that have a particular value for a particular attribute(s) is to build an index structure. In general, an index is a data structure that lets us find a record without having to look at more than a small fraction of all possible records. Thus, in our example, we could use an index built on either StoreState or TransDate (or both) to quickly access the specific set of records and test them for the conditions in the WHERE clause. In this chapter we focus on building such an index structure for the geometric file.

6.2 Different Index Structures for the Geometric File

An index is referred to as a primary index if it actually forces a specific location for each record in the file, whereas an index is referred to as a secondary index if it merely tells us the current location of a record. Thus, a secondary index is an index that is maintained for a data file, but not used to control the current processing order of the file. In the case of a geometric file, the physical location of a sampled record is determined (randomly) by the insertion algorithm. We therefore consider how to build a secondary index structure on one or more attributes in a geometric file. Apart from providing efficient access to the desired information in the file, a key consideration is that the index for the geometric file must be maintained as new records are inserted. For instance, we could build a secondary index on an attribute when the new records are bulk inserted into the geometric file. We must then determine how to merge the new secondary index with the existing indexes built for the rest of the file.
Furthermore, we must maintain the index as existing records are overwritten by newly inserted records and hence are deleted from the geometric file. With these goals in mind, we discuss three secondary index structures for the geometric file: (1) a segment-based index, (2) a subsample-based index, and (3) a Log-Structured Merge-Tree (LSM) based index. The first two indexes are developed around the structure of the geometric file. Multiple B+Tree indexes [9] are maintained, one for each segment or subsample in a geometric file. As new records are added to the file in units of a segment or subsample, a new B+Tree that indexes the new records is created and added to the index structure. Also, an existing B+Tree is deleted from the structure when all of the records indexed by it are deleted from the file. The third index structure makes use of the LSM-tree index [44], a disk-based data structure designed to provide low-cost indexing in an environment with a high rate of inserts and deletes. In the subsequent sections we discuss the construction, maintenance, and querying of these three types of indexes.

6.3 A Segment-Based Index Structure

The geometric file is a collection of subsamples of exponentially decreasing size. Each subsample is further divided into a number of segments of exponentially decreasing size. At every buffer flush, the buffered records are divided into different segments which are then used to overwrite the largest segment of each on-disk subsample. This structure of the geometric file suggests a simple way to construct and maintain an index structure for the file. We could create a B+Tree index for each segment of each subsample of a geometric file and maintain these indexes as new segments overwrite existing segments. We construct the index structure during startup, as the reservoir is filled with the first |R| records, and maintain it as subsequent records are produced by the data stream. We detail the construction and maintenance of a segment-based index structure in this section.
6.3.1 Index Construction During Startup

The geometric file makes use of steps (4)-(13) of Algorithm 3 from Chapter 3 during startup to fill the reservoir. Every time the buffer accumulates the desired number of records, it is segmented and flushed to the disk. We build a B+Tree index for each segment just before it is written out to the disk. For each buffered record of a segment we construct an index record. An index record is comprised of the value of the attribute on which the index is being built (the key value) and the position of the buffered record on disk. The position is stored as a number pair: a page number and an offset within the page. The index records are then used to create an index using the bulk insertion algorithm for a B+Tree. We use a simple array-based data structure to keep track of the B+Trees for each segment in the geometric file. Each array entry simply stores the position of a B+Tree root node. Rather than maintaining a file for each B+Tree created, we organize multiple B+Trees in a single disk file. We refer to this single file as the index file. The index file, in a sense, is similar to the log-structured file system proposed by Ousterhout [45]. In a log-structured file system, as files are modified, their contents are written out to the disk as logs in a sequential stream. This allows writes in full-cylinder units, with only track-to-track seeks, so the disk operates at nearly its full bandwidth. The index file enjoys similar performance benefits. Every time a B+Tree is created for a memory-resident segment, it is written to the index file in a sequential stream at the next available position. The array maintaining all B+Tree root nodes is augmented with the starting disk position of the B+Tree. Finally, we do not index segments that are never flushed to the disk. These segments are typically very small (the size of a disk block) and it is efficient to search them using a sequential memory scan when the geometric file is queried.
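The index-record construction just described can be sketched as follows; a sorted array stands in for the bulk-loaded B+Tree, and all names are illustrative rather than the dissertation's:

```python
from bisect import bisect_left

def build_segment_index(records, key_fn, page_size):
    # One index record per buffered record: (key value, page, offset),
    # sorted by key as a stand-in for a bulk-loaded B+Tree.
    entries = []
    for pos, rec in enumerate(records):
        page, offset = divmod(pos, page_size)
        entries.append((key_fn(rec), page, offset))
    entries.sort()
    return entries

def point_lookup(entries, key):
    # Return every (page, offset) position whose key matches.
    i = bisect_left(entries, (key,))
    hits = []
    while i < len(entries) and entries[i][0] == key:
        hits.append((entries[i][1], entries[i][2]))
        i += 1
    return hits
```

Because the index records are built from a fully in-memory segment, sorting and writing them out sequentially is cheap, which is what makes the one-B+Tree-per-segment design practical.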
6.3.2 Maintaining the Index During Normal Operation

Maintaining a segment-based index structure is exceedingly simple. During normal operation, as a new subsample and its segments are formed, we build a B+Tree index for each in-memory segment just as we did during startup. The only difference is that the B+Trees are written to the disk in a slightly different manner. As an in-memory segment overwrites an on-disk segment, the B+Tree for the in-memory segment overwrites the B+Tree for the on-disk segment. We update the B+Tree array entry for the root node of the new B+Tree that is added to the index structure. Thus, the index maintenance for records newly inserted into the geometric file and for the records that are deleted from the file is handled at the same time. The algorithm used to construct and maintain a segment-based index structure is given as Algorithm 10.

Algorithm 10 Construction and Maintenance of a Segment-Based Index Structure
1: Set n = |B| × (1 − α)
2: Set totSegsInSubsam = ⌈log n / log(1/α)⌉
3: Set totSubsamInR = 0
4: Set totSegsInR = 0
5: Set numRecs = 0
6: while numRecs < |R| do
7:   numRecs += |B| × α^totSubsamInR
8:   totSegsInR += totSegsInSubsam − totSubsamInR
9:   totSubsamInR ++
10: Set BTree array of size totSegsInR
11: for i = 1 to ∞ do
12:   if Buffer B is partitioned then
13:     for each segment sg_j in B do
14:       Build a B+Tree BT_j
15:       if i ≤ |R| then
16:         Flush BT_j to the disk at the next available spot in the index file
17:       else
18:         Overwrite the B+Tree for the largest segment of the jth largest subsample of R with BT_j
19:       Record BT_j's root and its disk position in the BTree array

6.3.3 Index Look-Up and Search

A segment-based index structure is a collection of B+Trees, one for each segment of the geometric file. Any index-based search involves looking up all of the B+Tree indexes. We use the existing B+Tree-based point query and range query algorithms and rerun them for each entry in the B+Tree array. The algorithm returns all index records that satisfy the search criteria.
We sort the valid index records by their page number attribute. We then retrieve the actual records from the geometric file and return them as the query result. We expect a segment-based index structure to be a compact structure, as there is exactly one index record present in the index structure for each record in the geometric file, and the index structure is maintained as records are deleted from the file.

6.4 A Subsample-Based Index Structure

Although compact, the segment-based index structure consists of a little too many small indexes. The requirement that we perform a lookup in every single one of a large number of B+Trees can easily degrade the performance of an index-based search. A geometric file could easily have multiple thousands of segments in it; even with two disk seeks per B+Tree to retrieve an index record, a simple point query may require thousands of disk seeks to return the query results. An alternative to a segment-based index structure is to build a B+Tree index for each subsample of the geometric file. We refer to this approach as a subsample-based index structure.

6.4.1 Index Construction and Maintenance

Every time the buffer accumulates the desired number of samples for a new subsample, we build a single B+Tree index over all of the buffered records. As in the case of a segment-based index structure, we construct an index record for each buffered record and then bulk insert them all to create a B+Tree index. The structure of the index record for a subsample-based index structure is the same as that of a segment-based index structure, except that we add an attribute recording the segment number to which the buffered record belongs. As discussed subsequently, we use the segment number associated with an index record to determine if it is stale. We remember the B+Tree added to the structure by keeping track of its root node in an array structure. As in the case of a segment-based index structure, we arrange the B+Tree indexes on disk in a single index file.
However, we need a slightly different approach, because during startup, subsamples are flushed to the geometric file until the reservoir is full; thereafter, subsamples of the same size |B| are added to the reservoir. Since each B+Tree will index no more than |B| records, we can bound the size of a B+Tree index. We use this bound to preallocate a fixed-size slot on disk for each B+Tree. Furthermore, for every buffer flush after the reservoir is full, exactly one subsample is added to the file and the smallest subsample of the file decays completely, keeping the number of subsamples in a geometric file constant. We use this information to lay out the subsample-based B+Trees on disk and maintain them as new records are sampled from the data stream. Thus, if totSubsamInR is the total number of subsamples in R, we first allocate totSubsamInR fixed-size slots in the index file. Initially all the slots are empty. During startup, as a new B+Tree is built, we seek to the next available slot and write out the B+Tree in a sequential manner. When the reservoir is full, we have used each of the slots exactly once. During normal operation, every time the buffer is full, the slot corresponding to the smallest subsample in the reservoir (which is about to decay completely) is used to write out the newly built B+Tree. Thus, during normal operation, B+Tree slots are used in round-robin fashion. The algorithm used to construct and maintain a subsample-based index structure is given as Algorithm 11.

Algorithm 11 Construction and Maintenance of a Subsample-Based Index Structure
1: Set totSubsamInR = ⌈(log |B| + log(1 − α)) / log α⌉
2: Set BTree allTrees[totSubsamInR]
3: Set btIndex = 0
4: for i = 1 to ∞ do
5:   if Buffer B is partitioned then
6:     for each segment j in B do
7:       allTrees[btIndex].BuildBTree(j)
8:     btIndex ++
9:     if i > |R| then
10:       btIndex = btIndex mod totSubsamInR
6.4.2 Index Lookup

In the subsample-based index structure, after every buffer flush exactly one B+Tree is created and written to disk, making insertions into the index structure very efficient. However, most deletions are deferred until a subsample decays completely. Thus, although every subsample loses its records to newer subsamples, B+Tree records are deleted from the index structure only when the entire B+Tree is to be deleted. In other words, at any given time all B+Trees except the most recently inserted one contain stale records that must be ignored during a search.

A search on the subsample-based index structure involves looking up all B+Tree indexes, one for each subsample in the geometric file. We modify the existing B+Tree-based point query and range query algorithms and run them for each entry in the B+Tree array of the index structure. The modification is required to ignore the stale records in the B+Trees. As mentioned before, the subsample corresponding to a B+Tree may lose its segments, but the index records are not deleted from the index tree until the subsample completely decays (when the entire tree is deleted). We refer to an index record as a stale record if it belongs to a segment of a subsample that has already been overwritten (lost). Recall that we have recorded a segment number in an additional field along with each index record. For a given subsample, we keep track of which of its segments have decayed so far and use this information to ignore the index records that are stale. We return all valid index records that satisfy the search criteria. We first sort these index records by their page number attribute and then retrieve the actual records from the geometric file and return them as the query result.
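A minimal sketch of the stale-record filtering performed for one subsample's B+Tree during a lookup (the function name and the dictionary representation of index records are illustrative):

```python
def filter_stale(matches, decayed_segments):
    """Drop index records whose segment has already been overwritten.

    `matches` is the list of index records a B+Tree lookup returned for
    one subsample; `decayed_segments` is the set of that subsample's
    segment numbers known to be lost. Valid records are returned sorted
    by page number so the geometric file is read in page order.
    """
    valid = [m for m in matches if m["segment"] not in decayed_segments]
    valid.sort(key=lambda m: m["page"])
    return valid

# usage: segment 2 of this subsample has decayed, so its hit is ignored
hits = [{"key": 5, "page": 9, "segment": 2},
        {"key": 5, "page": 1, "segment": 0}]
live = filter_stale(hits, decayed_segments={2})
```

Sorting the surviving records by page number is what allows the final retrieval pass over the geometric file to proceed with minimal disk head movement.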
Although the subsample-based index structure maintains and must search far fewer B+Trees than the segment-based index structure, we expect a reasonable search time per B+Tree due to their modest size, even with the lazy deletion policy.

6.5 An LSM-Tree-Based Index Structure

An alternative to the segment-based and subsample-based index structures is to build a single index structure for the entire geometric file and maintain it as new records are inserted into the file. Thus, we design a third index structure that makes use of the LSM-tree index [44]. The LSM-tree is a disk-based data structure designed to provide low-cost indexing in an environment with a high rate of inserts and deletes.

6.5.1 An LSM-Tree Index

An LSM-tree is composed of two or more tree-like component data structures. The smallest component of the index always resides entirely in main memory (referred to as the C0 tree), and all other, larger components reside on disk (referred to as C1, C2, ..., Cj). A schematic picture of a two-component LSM-tree is depicted in Figure 2.1 of the original LSM-tree paper [44]. Although the C1 (and higher) components are disk-resident, the most frequently referenced nodes of these trees (in general, nodes at higher levels) are buffered in main memory for performance reasons.

LSM-tree insertions and deletions: Index records are first inserted into the memory-resident C0 component, after which they migrate to the C1 component stored on disk. Insertion into the C0 component has no I/O cost associated with it. However, its size is limited by the size of the available memory. Thus, we must efficiently migrate part of the C0 component to the disk-resident C1 component. Whenever the C0 component reaches a threshold size, an ongoing rolling merge process removes some records (a contiguous segment) from the C0 component and merges them into the C1 component on disk. The rolling merge process is depicted pictorially in Figure 2.2 of the original LSM-tree paper [44].
The rolling merge is repeated for migration between higher components of an LSM-tree in a similar manner. Thus, there is a certain amount of delay before records in the C0 component migrate out to the disk-resident C1 and higher components. Deletions are performed concurrently in batch fashion, similar to inserts. The disk-resident components of an LSM-tree are comparable to a B+Tree structure, but are optimized for sequential disk access, with nodes 100% full. Lower levels of the tree are packed together in contiguous, multi-page disk blocks for better I/O performance during the rolling merge.

6.5.2 Index Maintenance and Lookups

As in the case of the previously proposed index structures, every time the buffer is filled and partitioned into segments, we create an index record for each buffered record and bulk insert them all into an LSM-tree index. The index record is comprised of five fields: (1) the key value, (2) the disk page number of the record, (3) an offset within the page, (4) the segment number to which the record belongs, and (5) the subsample number to which the record belongs. The segment and subsample numbers are recorded with each index record to determine its staleness. Every time a record is migrated from a lower component to a higher disk-based component, the rolling merge additionally identifies stale records and removes them from the tree structure. We refer to an index record as a stale record if it indexes a record either from a subsample that has decayed completely, or from a segment of a subsample that has been overwritten. We use the existing LSM-tree-based point query and range query algorithms to perform index lookups. As in the case of the previously proposed index structures, we sort the valid index records by their page number attribute and retrieve the actual records from the geometric file as the query result.
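A simplified sketch of one rolling merge step, assuming sorted runs of index records and a caller-supplied staleness test (all names here are illustrative, not part of the LSM-tree paper's interface):

```python
import heapq

def rolling_merge(c0_run, c1_run, is_stale):
    """Merge a sorted run of C0 index records into the sorted C1 run.

    Both inputs must be sorted by key. Records for which `is_stale` is
    true (their subsample fully decayed, or their segment overwritten)
    are dropped during the merge, mirroring how the rolling merge
    reclaims space as a side effect of migration.
    """
    merged = []
    for rec in heapq.merge(c0_run, c1_run, key=lambda r: r[0]):
        if not is_stale(rec):
            merged.append(rec)
    return merged

# records are (key, page, segment, subsample) tuples;
# here segment 4 of subsample 2 has been overwritten, so key 1 is dropped
c0 = [(3, 11, 0, 7), (9, 2, 1, 7)]
c1 = [(1, 5, 4, 2), (6, 8, 0, 2)]
out = rolling_merge(c0, c1, is_stale=lambda r: r[3] == 2 and r[2] == 4)
```

Because both runs are consumed and emitted sequentially, the real merge achieves the near-sequential disk I/O that the packed, multi-page disk blocks are designed for.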
In Chapter 7, we evaluate and compare the three index structures suggested in this chapter experimentally by measuring build time and disk footprint as new records are inserted into the geometric file. We also compare the efficiency of these structures for point and range queries.

CHAPTER 7
BENCHMARKING

In this chapter, we detail three sets of benchmarking experiments. In the first set of experiments, we attempt to measure the ability of the geometric file to process a high-speed stream of data records. In the second set of experiments, we examine the various algorithms for producing smaller samples from a large, disk-based geometric file. Finally, in the third set of experiments, we compare the three index structures for the geometric file for build time, disk space, and index lookup speed.

7.1 Processing Insertions

In order to test the relative ability of the geometric file to process a high-speed stream of insertions, we have implemented and benchmarked five alternatives for maintaining a large reservoir on disk: the three alternatives discussed in Section 3.3, the geometric file, and the framework described in Section 3.10 for using multiple geometric files at once. In the remainder of this section, we refer to these alternatives as the virtual memory, scan, local overwrite, geo file, and multiple geo files options. An α′ value of 0.9 was used for the multiple geo files option. All implementation was performed in C++. Benchmarking was performed using a set of Linux workstations, each equipped with 2.4 GHz Intel Xeon processors. 15,000 RPM, 80GB Seagate SCSI hard disks were used to store each of the reservoirs. Benchmarking of these disks showed a sustained read/write rate of 35-50 MB/second, and an "across the disk" random data access time of around 10ms.
7.1.1 Experiments Performed

The following three experiments were performed:

Insertion experiment 1: The task in this experiment was to maintain a 50GB reservoir holding a sample of 1 billion, 50B records from a synthetic data stream. Each of the five alternatives was allowed 600MB of buffer memory to work with when maintaining the reservoir. For the scan, local overwrite, geo file, and multiple geo files options, 100MB was used as an LRU buffer for disk reads/writes, and 500MB was used to buffer newly sampled records before processing. The virtual memory option used all 600MB as an LRU buffer. In the experiment, a continual stream of records was selected to be inserted into the reservoir (as many as each of the five options could handle). The goal was to test how many new records could be added to the reservoir in 20 hours, while at the same time expelling existing records from the reservoir as required by the reservoir algorithm. The number of new samples processed by each of the five options (that is, the number of records added to disk) is plotted as a function of time in Figure 7-1(a). By "number of samples processed" we mean the number of records that are actually inserted into the reservoir, not the number of records that have passed through the data stream.

Insertion experiment 2: This experiment is identical to Experiment 1, except that the 50GB sample was composed of 50 million, 1KB records. Results are plotted in Figure 7-1(b). Thus, we test the effect of record size on the five options.

Insertion experiment 3: This experiment is identical to Experiment 1, except that the amount of buffer memory is reduced to 150MB for each of the five options. The virtual memory option used all 150MB for an LRU buffer, and the four other options allocated 100MB to the LRU buffer and 50MB to the buffer for new samples. Results are plotted in Figure 7-1(c). This experiment tests the effect of a constrained amount of main memory.
7.1.2 Discussion of Experimental Results

All three experiments suggest that the multiple geo files option is superior to the other options. In Experiments 1 and 2, the multiple geo files option was able to write new samples to disk at almost the maximum sustained speed of the hard disk, around 40 MB/sec. It is worthwhile to point out a few specific findings. Each of the five options writes the first 50GB of data from the stream more or less directly to disk, as the reservoir is large enough to hold all of the data as long as the total is less than 50GB. However, Figures 7-1(a) and (b) show that only the multiple geo files option avoids much of a decline in performance after the reservoir fills (at least in Experiments 1 and 2). This is why the scan and virtual memory options plateau after the amount of data inserted reaches 50GB. There is something of a decline in performance in all of the methods once the reservoir fills in Experiment 3 (with restricted buffer memory), but it is far less severe for the multiple geo files option than for the other options.

Figure 7-1. Results of benchmarking experiments (processing insertions): (a) 50-byte records, 600MB buffer space; (b) 1KB records, 600MB buffer space; (c) 50-byte records, 150MB buffer space.
Figure 7-2. Results of benchmarking experiments (sampling from a geometric file): (a) batch sampling; (b) batch sampling (multiple geo files); (c) online sampling; (d) online sampling (multiple geo files); (e) variance plots; (f) variance plots (multiple geo files). Each plot compares the naive algorithm with the geometric file structure-based algorithm.

Furthermore, this degeneration in performance could probably be reduced by using a smaller value for α′. As expected, the local overwrite option performs very well early on, especially in the first two experiments (see Section 3.3 for a discussion of why this is expected). Even with limited buffer memory in Experiment 3, it uniformly outperforms a single geometric file. Furthermore, with enough buffer memory in Experiments 1 and 2, the local overwrite option is competitive with the multiple geo files option early on. However, fragmentation becomes a problem and performance decreases over time.
Unless offline re-randomization of the file is possible periodically, this degradation probably precludes long-term use of the local overwrite option. It is interesting that, as demonstrated by Experiment 3 (and explained in Section 3.8), a single geometric file is very sensitive to the ratio of the size of the reservoir to the amount of memory available for buffering new records from the stream. The geo file option performs well in Experiments 1 and 2, when this ratio is 100, but rather poorly in Experiment 3, when the ratio is 1000. Finally, we point out the general unusability of the scan and virtual memory options: scan generally outperformed virtual memory, but both did poorly. Except in Experiment 1, with large memory and small record size, with these two options more than 97% of the processing of records from the stream occurs in the first half hour, as the reservoir fills. In the 19.5 hours or so after the reservoir first fills, only a tiny fraction of additional processing occurs due to the inefficiency of the two options.

7.2 Biased Reservoir Sampling

In Section 4.1 we gave an upper bound for the distance between the actual bias function f′ computed using our reservoir algorithm and the desired, user-defined bias function f. While useful, this bound does not tell the entire story. In the end, what a user of a biased sampling algorithm is interested in is not how close the bias function that is actually computed is to the user-specified one; instead, the key question is what sort of effect any deviation has on the particular estimation task that is to be performed. Perhaps the easiest way to detail the practical effect of a pathological data ordering is through experimentation.

Figure 7-3. Sum query estimation accuracy for zipf=0.2. (The plot compares biased sampling without skewed records, unbiased reservoir sampling, and biased sampling in the worst case, as a function of the correlation factor.)
In this section we present experimental results evaluating the practical significance of a worst-case data ordering. Specifically, we design a set of experiments to compute the error (variance) one would expect when sampling for the answer to a SUM query in the following three scenarios:

1. When a biased sample is computed using our reservoir algorithm, with the data ordered so as to produce no overweight records.
2. When an unbiased sample is computed using the classical reservoir sampling algorithm.
3. When a biased sample is computed using our reservoir algorithm, with records arranged so as to produce the bias function furthest from the user-specified one, as described by Theorem 1.

By examining the results, it should become clear exactly what sort of practical effect on the accuracy of an estimator one might expect due to a pathological ordering.

7.2.1 Experimental Setup

In our experiments, we evaluated a SUM query over a set of synthetic data streams having various statistical properties. In each experiment, every record has two attributes: A and B. Attribute B is the attribute that is actually aggregated by the SUM query. Each set is generated so that attributes A and B both have a certain amount of Zipfian skew, specified by the parameter zipf. In each case, the bias function f is defined so as to minimize the variance for a SUM query evaluated over attribute A. In addition to the parameter zipf, each data set also has a second parameter, which we term the correlation factor. This is the probability that attribute A has the same value as attribute B.

Figure 7-4. Sum query estimation accuracy for zipf=0.5.
If the correlation factor is 1, then A and B are identical, and since the bias function is defined so as to minimize the variance of a query over A, the bias function also minimizes the variance of an estimate over the actual query attribute B. Thus, a correlation factor of 1 provides a perfect bias function. As the correlation factor decreases, the quality of the bias function for a query over attribute B declines, because the chance increases that a record deemed important by looking at attribute A is, in fact, one that should not be included in the sample. This models the case where one can only guess at the correct bias function beforehand, for example, when queries with an arbitrary relational selection predicate may be issued. A small correlation factor corresponds to the case when the guessed-at bias function is actually very incorrect.

By testing each of the three scenarios described in the previous subsection over a set of data sets created by varying zipf as well as the correlation factor, we can see the effect of data skew and of bias function quality on the relative quality of the estimator produced by each of the three scenarios. For each experiment, we generate a data stream of one million records and obtain a sample of size 1000. For each of the three scenarios and each of the data sets that we test, we repeat the sampling process 1000 times over the same data stream in Monte Carlo fashion. The variance of the corresponding estimator is reported as the observed variance of the 1000 estimates. The observed Monte Carlo variances are depicted in Figures 7-3, 7-4, 7-5, and 7-6.

Figure 7-5. Sum query estimation accuracy for zipf=0.8.

7.2.2 Discussion

It is possible to draw a couple of conclusions based on the experimental results.
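The Monte Carlo protocol described above can be sketched as follows; for illustration only, weighted sampling with replacement stands in for our reservoir algorithm, and all names are hypothetical:

```python
import random

def mc_variance(data, weights, n, trials, seed=0):
    """Observed Monte Carlo variance of a weighted SUM estimate.

    Each trial draws `n` records with probability proportional to
    `weights` (with replacement, a simplification of the reservoir
    scheme) and estimates SUM(data) as the mean of value/probability
    over the sample; the variance of the trial estimates is reported.
    """
    rng = random.Random(seed)
    total_w = sum(weights)
    probs = [w / total_w for w in weights]
    ests = []
    for _ in range(trials):
        picks = rng.choices(range(len(data)), weights=weights, k=n)
        ests.append(sum(data[i] / probs[i] for i in picks) / n)
    mean = sum(ests) / trials
    return sum((e - mean) ** 2 for e in ests) / trials

# biasing by the aggregated attribute itself drives the variance to ~0,
# while uniform weights leave the skewed values to inflate it
data = [10.0, 1.0, 1.0, 100.0]
v_biased = mc_variance(data, weights=data, n=50, trials=200)
v_uniform = mc_variance(data, weights=[1.0] * 4, n=50, trials=200)
```

This miniature version mirrors the experiments' structure: the weights play the role of the bias function over attribute A, and lowering their correlation with the aggregated values moves the observed variance toward the unbiased case.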
Most significant is that biased sampling under the pathological record ordering shows qualitative performance similar to biased sampling without any overweight records. Even though in the pathological case the sample might not be biased exactly as specified by the user-defined function f, the number of records not sampled according to f is usually small, and the resulting estimator typically suffers an increase in variance of around a factor of ten or less. This demonstrates that even for very skewed data sets, it is difficult for even an adversary to come up with a data ordering that can significantly alter the quality of the user-defined bias function.

Figure 7-6. Sum query estimation accuracy for zipf=1.

We also observe that for a low zipf parameter and a low correlation factor, unbiased sampling outperforms biased sampling. In other words, it is actually preferable not to bias in this case. This is because the low zipf value assigns relatively uniform values to attribute B, rendering an optimal biased scheme little different from uniform sampling. Furthermore, as the correlation factor decreases, the weighting scheme used by both biased sampling schemes becomes less accurate, hence the higher variance. As the weighting scheme becomes very inaccurate, it is better not to bias at all. Not surprisingly, there are more cases where the biased scheme under the pathological ordering is actually worse than the unbiased scheme. However, as the correlation factor increases and the bias scheme becomes more accurate, it quickly becomes preferable to bias.

7.3 Sampling From a Geometric File

We have also implemented and benchmarked the four techniques for sampling geometric files discussed in Chapter 5.
Specifically, we have compared the naive batch sampling and online sampling algorithms against the geometric file structure-based batch sampling and online sampling algorithms. We have also tested these four techniques with the framework that makes use of multiple geometric files. All of the algorithms were implemented on top of the geometric file prototype that was benchmarked in the previous sections.

7.3.1 Experiments Performed

To compare the various options, we used the following setup. We first initialize a geometric file by sampling and adding records from a synthesized data stream to the file for a period of several hours. This ensures a realistic scenario for testing: the reservoir in the file to be tested has been filled, a reasonable portion of each initial subsample has been overwritten, some of the smaller initial subsamples have been removed from the file, and a number of new subsamples have been created. The parameters used in building the geometric file are the same as those described in Experiment 2 of the previous section (a 50GB file with 50 million, 1KB records). Given such a file, the following set of experiments was performed:

Sampling experiment 1: The goal of this experiment was to compare the two options for obtaining a batch sample from a geometric file: the naive algorithm and the geometric file structure-based algorithm. For both algorithms, we plot the time to perform the sampling as a function of the desired sample size. Figure 7-2(a) depicts the plot for a single geometric file; Figure 7-2(b) shows an analogous plot for the multiple geometric files option.

Sampling experiment 2: This experiment is analogous to Sampling Experiment 1, except that online sampling is performed via multiple successive calls to GetNext.
The number of records sampled with multiple calls to GetNext versus the elapsed time is plotted in Figure 7-2(c) for both the naive algorithm and the more advanced geometric file structure-based algorithm, which is designed to increase the sampling rate and even out the response times. The analogous plot for the multiple geometric files case is shown in Figure 7-2(d). We also plot the variance in response times over all calls to GetNext as a function of the number of calls to GetNext in Figures 7-2(e) and 7-2(f) (the first is for a single geometric file; the second is with multiple files). Taken together, these plots show the tradeoff between overall processing time and the potential for waiting a long time to obtain a single sample.

7.3.2 Discussion of Experimental Results

Not surprisingly, these results suggest that the geometric file structure-based sampling methods are superior to the more obvious naive algorithms, in both the batch and online cases. As expected, the naive batch sampling algorithm took almost constant time to obtain a batch sample of any size, as it requires a scan of the entire geometric file to retrieve any batch sample. The geometric file structure-based algorithm can produce a small batch sample very quickly, and its total sampling time increases linearly with sample size. The time required for the geometric file structure-based algorithm is well below the time required by the naive approach even when 1/10 of the file is sampled. In the case of online sampling, the geometric file structure-based algorithm clearly outperformed the naive approach; this was not surprising, as the naive approach must expend one disk seek per sample. For both batch and online sampling, the multiple geometric files framework showed results analogous to the single geometric file case. As expected, and as demonstrated by the variance plots, the variance of the online naive approach is smaller than that of the geometric file structure-based algorithm.
Although the variance in response times is somewhat larger (less than 10 times larger for 100k samples), the structure-based approach executed orders of magnitude faster than the naive approach (more than 100 times faster for 100k samples) for any number of records sampled, justifying our approach of minimizing the average sum of squared response times. In other words, we got enough added speed for a small enough added variance in response time to make the tradeoff acceptable. As more and more samples are obtained, the variance of the structure-based algorithm approaches the variance of the naive algorithm, making the tradeoff even more reasonable for large intended sample sizes. Finally, we point out that both of the geometric file structure-based algorithms, in the batch and online cases, were able to read sample records from disk at almost the maximum sustained speed of the hard disk, around 45 MB/sec. This is comparable to the rate of a sequential read from disk, the best we can hope for.

Figure 7-7. Disk footprint for 1KB record size.

Table 7-1. Millions of records inserted in 10 hrs
                                                No Index  Subsample-Based  Segment-Based  LSM-Tree
1KB records; |R| = 10 million; |B| = 50k           13700            12550          10960      9680
200-byte records; |R| = 50 million; |B| = 250k     12810             7230           8030      2930

7.4 Index Structures For The Geometric File

In Chapter 6 we introduced three index structures for the geometric file: the segment-based, the subsample-based, and the LSM-tree-based index structures. In this section, we experimentally evaluate and compare these three index structures by measuring build time and disk footprint as new records are inserted into the geometric file. We also compare the efficiency of these structures for point and range queries. All of the index structures were implemented on top of the geometric file prototype that was benchmarked in the previous sections.

7.4.1 Experiments Performed

In order to compare the three index structures, we used the following setup.
We initialize a geometric file by sampling and adding records from a synthesized data stream to the file for a period of ten hours. As the geometric file is initialized, we build the index structures as discussed in Chapter 6. The ten hours of insertion into the geometric file ensures that a reasonable number of insertions and deletions are performed on an index structure. Given such a file, we collected the following three pieces of information for each of the three index structures under consideration.

Build time: With concurrent updates to the index structure, we record how many records can be inserted into a geometric file by the end of the insertion window (10 hrs). This helps compare the build times of the three index structures.

Disk footprint: We observe the disk footprint of the three proposed index structures by recording the total disk space used by an index structure each time a buffer full of index records is bulk inserted into an index file.

Index lookup time: After the ten hours of insertion, once the index structures are built for the geometric file, we query the index structures to look up records with a specific key or range of key values. For each index structure, a point query and range queries with different selectivities are executed. The point query returns exactly one record as its output. The range queries are designed to return approximately 10, 100, or 1000 records as an output set. For each selectivity we execute the query 100 times and report the average lookup time. Further, processing a query involves an index lookup followed by one or more seeks in the geometric file to access the output records. We therefore report the index lookup time, the geometric file access time, and the total query processing time.

With these metrics in mind, we performed the following two sets of experiments:

Indexing experiment 1: The task in this experiment was to maintain a 10GB reservoir holding a sample of 10 million, 1KB records from a synthetic data stream.
Each of the three alternatives was allowed 500MB of buffer memory to work with when maintaining the reservoir. The results for index build time are shown in Table 7-1, the disk space used by the three index structures is plotted in Figure 7-7, and the index lookup speed is tabulated in Table 7-2.

Indexing experiment 2: This experiment is identical to Experiment 1, except that the 10GB sample was composed of 50 million, 200B records. For this experiment, the results for index build speed are shown in Table 7-1, the disk space used by the three index structures is plotted in Figure 7-8, and the index lookup speed is tabulated in Table 7-3. Thus, we test the effect of record size on the three index structures.

Figure 7-8. Disk footprint for 200B record size.

7.4.2 Discussion

It is possible to draw a few conclusions based on the experimental results. The subsample-based index structure shows the best build time, the segment-based index structure has the most compact disk footprint, whereas the LSM-tree-based index structure responds best to index lookups. Table 7-1 shows the millions of records inserted into the geometric file after ten hours of insertions and concurrent updates to the index structure. For comparison, we present the number of records inserted into a geometric file when no index structure is maintained (the "no index" column). It is clear that the subsample-based index structure performs best on insertions, with performance comparable to the "no index" option. The difference reflects the cost of concurrently maintaining the index structure. The segment-based index structure does the next best. It is slower than the subsample-based index structure because of the higher number of seeks performed during startup. Recall that during startup the segment-based index must write a B+Tree for each segment. Table 7-2.
Query timing results for 1KB records, |R| = 10 million, and |B| = 50k

Scheme           Selectivity   Index Time  File Time  Total Time
Segment-Based    Point Query      38.2890     0.0226     38.3116
                 10 recs          40.2477     0.1803     40.4280
                 100 recs         43.2856     0.8766     44.1622
                 1000 recs        45.6276     6.2571     51.8847
Subsample-Based  Point Query      0.87551    0.02382     0.89937
                 10 recs          1.12740    0.15867     1.28607
                 100 recs         1.74911    1.10544     2.85455
                 1000 recs        2.09980    5.96637     8.06617
LSM-Tree         Point Query      0.00012    0.01996     0.02008
                 10 recs          0.00015    0.01263     0.01278
                 100 recs         0.00019    0.79358     0.79377
                 1000 recs        0.00056    5.82210     5.82266

Once the reservoir is initialized, both the segment-based and the subsample-based index structures perform an equal number of disk seeks. Finally, the LSM-tree-based index structure is the slowest amongst the three. The LSM-tree maintains the index by processing insertions and deletions more aggressively than the other two options, demanding more rolling merges and more disk seeks per buffer flush. Table 7-1 also shows the insertion figures for the smaller, 200B record size. Not surprisingly, all three index structures show similar insertion patterns, but since they have to process a larger number of records, the insertion rates are slower than in the case of the 1KB record size. We also observed and plotted the disk footprint size for the three index structures (Figures 7-7 and 7-8). As expected, all three index structures initially grow fairly quickly. The segment-based and the subsample-based index structures stabilize soon after the reservoir is filled, whereas the LSM-tree-based structure stabilizes a little later, when the removal of stale records by the rolling merges stabilizes. The subsample-based index structure has the largest footprint (almost 1/5th of the geometric file size). This is expected, as stale index records are removed from the B+Trees only when the entire subsample decays.
On the other hand, the segment-based index structure has the smallest footprint, as at every buffer flush all stale records are removed from the index structure. This results in a very compact index structure. The disk space usage of the LSM-tree-based index structure lies between these two index structures. Although at every rolling merge stale records are removed from the part of the index structure that is merging, not all of the stale records in the structure are removed at once. As soon as the rate of removal of stale records stabilizes, the disk footprint also becomes stable.

Finally, we compared the index lookup speed of the three index structures. We report index lookup and geometric file access times for queries of different selectivities. As expected, the geometric file access time remains constant irrespective of the index structure option and increases linearly as the query produces more output tuples. The index lookup time varied across the three index structures. The segment-based index structure (the slowest) was orders of magnitude slower than the LSM-tree-based index structure (the fastest). This is mainly because the segment-based index structure requires index lookups in several thousand B+Trees for a query of any selectivity, whereas the LSM-tree-based structure uses a single LSM-tree, requiring a small, constant number of seeks. The performance of the subsample-based index structure lies in between these two structures.

Table 7-3. Query timing results for 200-byte records, |R| = 50 million, and |B| = 250k

Scheme           Selectivity   Index Time  File Time  Total Time
Segment-Based    Point Query       6.2488     0.0338      6.2826
                 10 recs           9.6186     0.1267      9.7453
                 100 recs         12.9885     0.9288     13.9173
                 1000 recs        17.6891     5.9754     23.6645
Subsample-Based  Point Query      2.50717     0.0156      2.5227
                 10 recs          4.92744     0.1763      5.1037
                 100 recs          7.2387     0.8637      8.1024
                 1000 recs         9.9837     6.1363     16.1200
LSM-Tree         Point Query      0.00505     0.0174      0.0224
                 10 recs          0.00967     0.1565      0.1661
                 100 recs         0.01440     0.8343      0.8487
                 1000 recs        0.05987     4.9961      5.0559
This is expected, as the structure maintains many fewer B+-trees than the segment-based index but far more than the LSM-tree-based structure. In general, the subsample-based index structure gives the best build time with reasonable index lookup speed, at the cost of a slightly larger disk footprint. The LSM-tree-based index structure makes use of a reasonable amount of disk space and gives the best query performance, at the cost of a slow insertion rate or build time. The segment-based index structure gives comparable build time and has the most compact disk footprint, but suffers considerably when it comes to index lookups.

CHAPTER 8
CONCLUSION

Random sampling is a ubiquitous data management tool, but relatively little research from the data management community has been concerned with how to actually compute and maintain a sample. In this dissertation we have considered the problem of random sampling from a data stream, where the sample to be maintained is very large and must reside on secondary storage. We have developed the geometric file organization, which can be used to maintain an online sample of arbitrary size with an amortized cost of O(ω log|B| / |B|) random disk head movements for each newly sampled record. The multiplier ω can be made very small by making use of a small amount of additional disk space.

We have presented a modified version of the classic reservoir sampling algorithm that is exceedingly simple, and is applicable for biased sampling using any arbitrary user-defined weighting function f. Our algorithm computes, in a single pass, a biased sample Ri (without replacement) of the i records produced by a data stream. We have also discussed certain pathological cases where our algorithm can provide a correctly biased sample only for a slightly modified bias function f'. We have analytically bounded how far f' can be from f in such a pathological case. We have also experimentally evaluated the practical significance of this difference.
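The biased algorithm itself is developed in Chapter 4. For intuition only, a well-known alternative way to draw a fixed-size, weighted sample without replacement in one pass is the key-based method of Efraimidis and Spirakis; this sketch is illustrative of single-pass weighted reservoir maintenance and is not the dissertation's algorithm (the `weight` argument plays the role of f):

```python
import heapq
import random

def weighted_reservoir(stream, weights, n):
    """One-pass, fixed-size weighted sample without replacement.

    Key-based method: each record gets key u**(1/w) for uniform u; the
    n records with the largest keys form the sample. Illustrative only;
    not the algorithm developed in this dissertation.
    """
    heap = []  # min-heap of (key, record); root holds the smallest key
    for rec, w in zip(stream, weights):
        key = random.random() ** (1.0 / w)  # w = f(rec) > 0
        if len(heap) < n:
            heapq.heappush(heap, (key, rec))
        elif key > heap[0][0]:
            # New record's key beats the current minimum: replace it.
            heapq.heapreplace(heap, (key, rec))
    return [rec for _, rec in heap]

sample = weighted_reservoir(range(1000), [1.0] * 1000, 10)
```

With all weights equal this degenerates to a uniform fixed-size sample, which is the sense in which it generalizes classic reservoir sampling.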
We have also derived the variance of a Horvitz-Thompson estimator making use of a sample computed using our algorithm. Combined with the Central Limit Theorem, the variance can then be used to provide bounds on the estimator's accuracy. The estimator is suitable for the SUM aggregate function (and, by extension, the AVERAGE and COUNT aggregates) over a single database table for which the reservoir is maintained.

We have developed efficient techniques which allow a geometric file to itself be sampled in order to produce smaller data objects. We considered two sampling techniques: (1) batch sampling, where the sample size is known beforehand, and (2) online sampling, which implements an iterative function GetNext to retrieve one sample at a time. The goal of these algorithms was to efficiently support further sampling of a geometric file by making use of its own structure.

Efficiently searching for and discovering information in the geometric file is essential for query processing. A natural way to support this functionality is to build an index structure. We discussed three secondary index structures and their maintenance as new records are inserted into a geometric file. The segment-based and the subsample-based index structures are designed around the structure of the geometric file. The third, the LSM-tree-based index structure, makes use of the LSM-tree, an efficient structure for handling bulk insertions and deletions. We compared these structures for build time, disk space used, and index lookup time.

REFERENCES

[1] Das, A., Gehrke, J., Riedewald, M.: Approximate join processing over data streams. In: ACM SIGMOD International Conference on Management of Data (2003)

[2] Acharya, S., Gibbons, P., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: ACM SIGMOD International Conference on Management of Data (2000)

[3] Acharya, S., Gibbons, P., Poosala, V., Ramaswamy, S.: Join synopses for approximate query answering.
In: ACM SIGMOD International Conference on Management of Data (1999)

[4] Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: The Aqua approximate query answering system. In: ACM SIGMOD International Conference on Management of Data (1999)

[5] Aggarwal, C.C.: On biased reservoir sampling in the presence of stream evolution. In: VLDB 2006: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 607-618. VLDB Endowment (2006)

[6] Arge, L.: The buffer tree: A new technique for optimal I/O-algorithms. In: International Workshop on Algorithms and Data Structures (1995)

[7] Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: SODA '02: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 633-634. Society for Industrial and Applied Mathematics (2002)

[8] Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: ACM SIGMOD International Conference on Management of Data (2003)

[9] Bayer, R., McCreight, E.M.: Organization and maintenance of large ordered indexes. In: SIGFIDET Workshop, pp. 107-141 (1970)

[10] Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: ICDE '06: Proceedings of the 22nd International Conference on Data Engineering, p. 6. IEEE Computer Society, Washington, DC, USA (2006)

[11] Fan, C.T., Muller, M.E., Rezucha, I.: Development of sampling plans by using sequential (item by item) techniques and digital computers. Journal of the American Statistical Association 57, 387-402 (1962)

[12] Jermaine, C., Datta, A., Omiecinski, E.: A novel index supporting high volume data warehouse insertion. In: International Conference on Very Large Data Bases (1999)

[13] Jermaine, C., Omiecinski, E., Yee, W.G.: The partitioned exponential file for database storage management. In: International Conference on Very Large Data Bases (1999)

[14] Chaudhuri, S., Das, G., Datar, M., Motwani, R., Narasayya, V.: Overcoming limitations of sampling for aggregation queries. In: ICDE (2001)

[15] Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach for approximate answering of aggregate queries. In: ACM SIGMOD International Conference on Management of Data (2001)

[16] Cochran, W.: Sampling Techniques. Wiley and Sons (1977)

[17] Transaction Processing Performance Council: TPC-H Benchmark. http://www.tpc.org (2004)

[18] Cranor, C., Gao, Y., Johnson, T., Shkapenyuk, V., Spatscheck, O.: Gigascope: High performance network monitoring with an SQL interface. In: ACM SIGMOD International Conference on Management of Data (2002)

[19] Cranor, C., Johnson, T., Spatscheck, O., Shkapenyuk, V.: Gigascope: A stream database for network applications. In: ACM SIGMOD International Conference on Management of Data (2003)

[20] Cranor, C., Johnson, T., Spatscheck, O., Shkapenyuk, V.: The Gigascope stream database. IEEE Data Engineering Bulletin 26(1), 27-32 (2003)

[21] Dobra, A., Garofalakis, M., Gehrke, J., Rastogi, R.: Processing complex aggregate queries over data streams. In: ACM SIGMOD International Conference on Management of Data (2002)

[22] Duffield, N., Lund, C., Thorup, M.: Charging from sampled network usage. In: IMW '01: Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement, pp. 245-256. ACM Press, New York, NY, USA (2001)

[23] Estan, C., Naughton, J.F.: End-biased samples for join cardinality estimation. In: ICDE '06: Proceedings of the 22nd International Conference on Data Engineering, p. 20. IEEE Computer Society, Washington, DC, USA (2006)

[24] Estan, C., Varghese, G.: New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst. 21(3), 270-313 (2003)

[25] Olken, F., Rotem, D.: Random sampling from B+ trees. In: International Conference on Very Large Data Bases (1989)

[26] Olken, F., Rotem, D.: Random sampling from database files: A survey. In: International Working Conference on Scientific and Statistical Database Management (1990)

[27] Olken, F., Rotem, D., Xu, P.: Random sampling from hash files. In: ACM SIGMOD International Conference on Management of Data (1990)

[28] Ganguly, S., Gibbons, P., Matias, Y., Silberschatz, A.: Bifocal sampling for skew-resistant join size estimation. In: ACM SIGMOD International Conference on Management of Data (1996)

[29] Gemulla, R., Lehner, W., Haas, P.J.: A dip in the reservoir: Maintaining sample synopses of evolving datasets. In: VLDB 2006: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 595-606. VLDB Endowment (2006)

[30] Gunopulos, D., Kollios, G., Tsotras, V., Domeniconi, C.: Approximating multi-dimensional aggregate range queries over real attributes. In: ACM SIGMOD International Conference on Management of Data (2000)

[31] Haas, P.: The need for speed: Speeding up DB2 using sampling. IDUG Solutions Journal (2003)

[32] Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. In: ACM SIGMOD International Conference on Management of Data, pp. 287-298 (1999)

[33] Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: ACM SIGMOD International Conference on Management of Data, pp. 171-182 (1997)

[34] Gehrke, J., Korn, F., Srivastava, D.: On computing correlated aggregates over continual data streams. In: ACM SIGMOD International Conference on Management of Data (2001)

[35] Jermaine, C.: Robust estimation with sampling and approximate pre-aggregation. In: International Conference on Very Large Data Bases (2003)

[36] Jermaine, C., Pol, A., Arumugam, S.: Online maintenance of very large random samples. In: ACM SIGMOD International Conference on Management of Data, pp. 299-310 (2004)

[37] Hellerstein, J.M., Avnur, R., Raman, V.: Informix under CONTROL: Online query processing. Data Mining and Knowledge Discovery 4(4), 281-314 (2000)

[38] Jones, T.: A note on sampling from a tape file. Communications of the ACM 5, 343 (1964)

[39] Vitter, J.S., Wang, M.: Approximate computation of multidimensional aggregates of sparse data using wavelets. In: ACM SIGMOD International Conference on Management of Data (1999)

[40] Kolonko, M., Wasch, D.: Sequential reservoir sampling with a nonuniform distribution. ACM Trans. Math. Softw. 32(2), 257-273 (2006). DOI 10.1145/1141885.1141891

[41] Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: VLDB Conference (2002)

[42] Johnson, N.L., Kotz, S.: Discrete Distributions. Houghton Mifflin (1969)

[43] Olken, F.: Random Sampling from Databases. Ph.D. dissertation (1993)

[44] O'Neil, P., Cheng, E., Gawlick, D., O'Neil, E.: The log-structured merge-tree. Acta Informatica 33, 351-385 (1996)

[45] Ousterhout, J.K., Douglis, F.: Beating the I/O bottleneck: A case for log-structured file systems. Operating Systems Review 23(1), 11-28 (1989)

[46] Gibbons, P.B., Matias, Y., Poosala, V.: Fast incremental maintenance of approximate histograms. ACM Transactions on Database Systems 27(3), 261-298 (2002)

[47] Pol, A., Jermaine, C.: Biased reservoir sampling. IEEE Transactions on Knowledge and Data Engineering

[48] Pol, A., Jermaine, C., Arumugam, S.: Maintaining very large random samples using the geometric file. VLDB Journal (2007)

[49] Shao, J.: Mathematical Statistics. Springer-Verlag (1999)

[50] Thompson, M.E.: Theory of Sample Surveys. Chapman and Hall (1997)

[51] Toivonen, H.: Sampling large databases for association rules. In: International Conference on Very Large Data Bases (1996)

[52] Ganti, V., Lee, M.L., Ramakrishnan, R.: ICICLES: Self-tuning samples for approximate query answering. In: International Conference on Very Large Data Bases (2000)

[53] Vitter, J.: Random sampling with a reservoir. ACM Transactions on Mathematical Software (1985)

[54] Vitter, J.: An efficient algorithm for sequential random sampling.
ACM Transactions on Mathematical Software 13(1), 58-67 (1987)

BIOGRAPHICAL SKETCH

Abhijit Pol was born and brought up in the state of Maharashtra in India. He received his Bachelor of Engineering from the Government College of Engineering Pune (COEP), University of Pune, one of the oldest and most prestigious engineering colleges in India, in 1999. Abhijit majored in mechanical engineering and obtained a distinguished record, ranking second in the university merit ranking. He was employed in the Research and Development department of Kirloskar Oil Engines Ltd. for one year.

Abhijit received his first Master of Science, in industrial and systems engineering, from the University of Florida in 2002. He then worked as a researcher in the Department of Computer and Information Science and Engineering at the University of Florida, where he received his second Master of Science and his Doctor of Philosophy (Ph.D.) in computer engineering in 2007.

During his studies at the University of Florida, Abhijit coauthored a textbook titled "Developing Web-Enabled Decision Support Systems." He taught the Web-DSS course several times in the Department of Industrial and Systems Engineering at the University of Florida. He presented several tutorials at workshops and conferences on the need for and importance of teaching DSS material, and he also taught at two instructor-training workshops on DSS development.

Abhijit's research focus is in the area of databases, with special interests in approximate query processing, physical database design, and data streams. He has presented research papers at several prestigious database conferences and performed research at the Microsoft Research Lab. He is now a Senior Software Engineer in the Strategic Data Solutions group at Yahoo! Inc.
ACKNOWLEDGMENTS

At the end of my dissertation I would like to thank all those people who made this dissertation possible and an enjoyable experience for me. First of all I wish to express my sincere gratitude to my adviser Chris Jermaine for his patient guidance, encouragement, and excellent advice throughout this study. If I would have access to a magic tool "create your own adviser," I still would not have ended up with anyone better than Chris. He always introduces me to interesting research problems. He is around whenever I have a question, but at the same time encourages me to think on my own and work on any problems that interest me.

I am also indebted to Alin Dobra for his support and encouragement. Alin is a constant source of enthusiasm. The only topic I have not discussed with him is strategies of Gator football games.

I am grateful to my dissertation committee members Tamer Kahveci, Joachim Hammer, and Ravindra Ahuja for their support and their encouragement. I acknowledge the Department of Industrial and Systems Engineering, Ravindra Ahuja, and chair Donald Hearn for the financial support and advice I received during initial years of my studies. Finally, I would like to express my deepest gratitude for the constant support, understanding, and love that I received from my parents during the past years.

TABLE OF CONTENTS

ACKNOWLEDGMENTS 4
LIST OF TABLES 8
LIST OF FIGURES 9
ABSTRACT 10

CHAPTER
1 INTRODUCTION 12
  1.1 The Geometric File 14
  1.2 Biased Reservoir Sampling 16
  1.3 Sampling the Sample 18
  1.4 Index Structures for the Geometric File 19
2 RELATED WORK 22
  2.1 Related Work on Reservoir Sampling 22
  2.2 Biased Sampling Related Work 24
3 THE GEOMETRIC FILE 28
  3.1 Reservoir Sampling 28
  3.2 Sampling: Sometimes a Little is not Enough 30
  3.3 Reservoir for Very Large Samples 31
  3.4 The Geometric File 34
  3.5 Characterizing Subsample Decay 36
  3.6 Geometric File Organization 40
  3.7 Reservoir Sampling With a Geometric File 40
    3.7.1 Introducing the Required Randomness 41
    3.7.2 Handling the Variance 42
    3.7.3 Bounding the Variance 45
  3.8 Choosing Parameter Values 47
    3.8.1 Choosing a Value for Alpha 47
    3.8.2 Choosing a Value for Beta 48
  3.9 Why Reservoir Sampling with a Geometric File is Correct? 49
    3.9.1 Correctness of the Reservoir Sampling Algorithm with a Buffer 49
    3.9.2 Correctness of the Reservoir Sampling Algorithm with a Geometric File 50
  3.10 Multiple Geometric Files 51
  3.11 Reservoir Sampling with Multiple Geometric Files 51
    3.11.1 Consolidation and Merging 53
    3.11.2 How Can Correctness Be Maintained? 53
    3.11.3 Handling the Stacks in Multiple Geometric Files 56
  3.12 ... 56
4 BIASED RESERVOIR SAMPLING 58
  4.1 A Single Pass Biased Sampling Algorithm 59
    4.1.1 Biased Reservoir Sampling 59
    4.1.2 So, What Can Go Wrong? (And a Simple Solution) 60
    4.1.3 Adjusting Weights of Existing Samples 62
  4.2 Worst Case Analysis for Biased Reservoir Sampling Algorithm 65
    4.2.1 The Proof for the Worst Case 66
    4.2.2 The Proof of Theorem 1: The Upper Bound on totalDist 73
  4.3 Biased Reservoir Sampling With the Geometric File 75
  4.4 Estimation Using a Biased Reservoir 76
5 SAMPLING THE GEOMETRIC FILE 80
  5.1 Why Might We Need To Sample From a Geometric File? 80
  5.2 Different Sampling Plans for the Geometric File 80
  5.3 Batch Sampling From a Geometric File 81
    5.3.1 A Naive Algorithm 81
    5.3.2 A Geometric File Structure Based Algorithm 82
    5.3.3 Batch Sampling Multiple Geometric Files 84
  5.4 Online Sampling From a Geometric File 84
    5.4.1 A Naive Algorithm 84
    5.4.2 A Geometric File Structure Based Algorithm 85
  5.5 Sampling a Biased Sample 88
6 INDEX STRUCTURES FOR THE GEOMETRIC FILE 89
  6.1 Why Index a Geometric File? 89
  6.2 Different Index Structures for the Geometric File 90
  6.3 A Segment Based Index Structure 91
    6.3.1 Index Construction During Startup 91
    6.3.2 Maintaining Index During Normal Operation 92
    6.3.3 Index Look Up and Search 93
  6.4 A Subsample Based Index Structure 93
    6.4.1 Index Construction and Maintenance 94
    6.4.2 Index Look Up 95
  6.5 A LSM-Tree Based Index Structure 96
    6.5.1 An LSM-Tree Index 96
    6.5.2 Index Maintenance and Look Ups 97
7 BENCHMARKING 99
  7.1 Processing Insertions 99
    7.1.1 Experiments Performed 99
    7.1.2 Discussion of Experimental Results 100
  7.2 ... 103
    7.2.1 Experimental Setup 104
    7.2.2 Discussion 106
  7.3 Sampling From a Geometric File 107
    7.3.1 Experiments Performed 108
    7.3.2 Discussion of Experimental Results 109
  7.4 Index Structures For The Geometric File 110
    7.4.1 Experiments Performed 110
    7.4.2 Discussion 112
8 CONCLUSION 116
REFERENCES 118
BIOGRAPHICAL SKETCH 122

LIST OF TABLES

1-1 Population: student records 17
1-2 Random sample of the size = 4 17
1-3 Biased sample of the size = 4 17
7-1 Millions of records inserted in 10 hrs 110
7-2 Query timing results for 1 KB record, |R| = 10 million, and |B| = 50K 113
7-3 Query timing results for 200-byte record, |R| = 50 million, and |B| = 250K 114

LIST OF FIGURES

3-1 Decay of a subsample after multiple buffer flushes 38
3-2 Basic structure of the geometric file 39
3-3 Building a geometric file 43
3-4 Distributing new records to existing subsamples 44
3-5 Speeding up the processing of new samples using multiple geometric files 54
4-1 Adjustment of r^max_i to r^max_{i-1} 69
7-1 Results of benchmarking experiments (processing insertions) 101
7-2 Results of benchmarking experiments (sampling from a geometric file) 102
7-3 Sum query estimation accuracy for zipf=0.2 104
7-4 Sum query estimation accuracy for zipf=0.5 105
7-5 Sum query estimation accuracy for zipf=0.8 106
7-6 Sum query estimation accuracy for zipf=1 107
7-7 Disk footprint for 1 KB record size 110
7-8 Disk footprint for 200 B record size 112

ABSTRACT

Sampling is one of the most fundamental data management tools available. It is one of the most powerful methods for building a one-pass synopsis of a data set, especially in a streaming environment where the assumption is that there is too much data to store all of it permanently. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a sample is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples in an online manner from streaming data.

We present a new data organization called the geometric file and online algorithms for maintaining a very large, on-disk sample. The algorithms are designed for any environment where a large sample must be maintained online in a single pass through a data set. The geometric file organization meets the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far.

We modify the classic reservoir sampling algorithm to compute a fixed-size sample in a single pass over a data set, where the goal is to bias the sample using an arbitrary, user-defined weighting function. We also describe how the geometric file can be used to perform a biased reservoir sampling.

Efficiently searching and discovering information from the geometric file is essential for query processing. A natural way to support this is to build an index structure. We discuss three secondary index structures and their maintenance as new records are inserted into a geometric file.

CHAPTER 1
INTRODUCTION

Despite the variety of alternatives for approximate query processing [1, 21, 30, 34, 39], sampling is still one of the most powerful methods for building a one-pass synopsis of a data set, especially in a streaming environment where the assumption is that there is too much data to store all of it permanently. Sampling's many benefits include: ... ([16, 49]). The numerous papers [2, 3, 8, 14, 15, 28, 32, 33, 35, 46, 51, 52] that use samples testify to sampling's popularity as a data management tool.
Given the obvious importance of random sampling, it is perhaps surprising that there has been very little work in the data management community on how to actually perform random sampling. The most well-known papers in this area are due to Olken and Rotem [25, 27], who also offer the definitive survey of related work through the early 1990s [26]. However, this work is relevant mostly for sampling from data stored in a database, and implicitly assumes that a sample is a small data structure that is easily stored in main memory.

Such assumptions are sometimes overly restrictive. Consider the problem of approximate query processing. Recent work has suggested the possibility of maintaining a sample of a large database and then executing analytic queries over the sample rather than the original data as a way to speed up processing [4, 31]. Given the most recent TPC-H benchmark results [17], it is clear that processing standard report-style queries over a large, multi-terabyte data warehouse may take hours or days. In such a situation, maintaining a fully materialized random sample ([43]) may be desirable. In order to save time and/or computer resources, queries can then be evaluated over the sample rather than the original data, as long as the user can tolerate some carefully controlled inaccuracy in the query results.
This particular application has two specific requirements that are addressed by the dissertation. First, it may be necessary to use quite a large sample in order to achieve acceptable accuracy, perhaps on the order of gigabytes in size. This is especially true if the sample will be used to answer selective queries or aggregates over attributes with high variance (see Section 3.2). Second, whatever the required sample size, it is often independent of the size of the database, since estimation accuracy depends primarily on sample size.

For another example of a case where existing sampling methods can fall short, consider stream-based data management tasks, such as network monitoring (for an example of such an application, we point to the Gigascope project from AT&T Laboratories [18-20]). Given the tremendous amount of data transported over today's computer networks, the only conceivable way to facilitate ad hoc, after-the-fact query processing over the set of packets that have passed through a network router is to build some sort of statistical model for those packets. The most obvious choice would be to produce a very large, statistically random sample of the packets that have passed through the router. Again, maintaining such a sample is precisely the problem we tackle in this dissertation. While other researchers have tackled the problem of maintaining an online sample [7], no existing methods have considered how to handle very large samples that exceed the available main memory. (See [16] for a thorough treatment of finite population random sampling.)
In this dissertation we describe a new data organization called the geometric file and related online algorithms for maintaining a very large, disk-based sample from a data stream. The dissertation is divided into four parts. In the first part we describe the geometric file organization and detail how geometric files can be used to maintain a very large simple random sample. In the second part we propose a simple modification to the classical reservoir sampling algorithm to compute a biased sample in a single pass over the data stream and describe how the geometric file can be used to maintain a very large biased sample. In the third part we develop techniques which allow a geometric file to itself be sampled in order to produce smaller sets of data objects. Finally, in the fourth part, we discuss secondary index structures for the geometric file. Index structures are useful to speed up search and discovery of required information from a huge sample stored in a geometric file. The index structures must be maintained concurrently with constant updates to the geometric file and at the same time provide efficient access to its records.

We now give an introduction to these four parts of the dissertation in subsequent sections.

1.1 The Geometric File

... [11, 38]. Reservoir sampling algorithms can be used to dynamically maintain a fixed-size sample of N records from a stream, so that at any given instant, the N records in the sample constitute a true random sample of all of the records that have been produced by the stream. However, as we will discuss in this dissertation, the problem is that existing reservoir techniques are suitable only when the sample is small enough to fit into main memory.

Given that there are limited techniques for maintaining very large samples, the problem addressed in the first part of this dissertation is as follows:

1. The algorithms must be suitable for streaming data, or any similar environment where a large sample must be maintained online in a single pass through a data set, with the strict requirement that the sample always be a true, statistically random sample of fixed size N (without replacement) from all of the data produced by the stream thus far.

2. When maintaining the sample, the fraction of I/O time devoted to reads should be close to zero. Ideally, there would never be a need to read a block of samples from disk simply to add one new sample and subsequently write the block out again.

3. The fraction of I/O time spent performing random I/Os should also be close to zero. Costly random disk seeks should be few and far between. Almost all I/O should be sequential.

4. Finally, the amount of data written to disk should be bounded by the total size of all of the records that are ever sampled.

The geometric file meets each of the requirements listed above. With memory large enough to buffer |B| > 1 records, the geometric file can be used to maintain an online sample of arbitrary size with an amortized cost of O(ω log|B| / |B|) random disk head movements for each newly sampled record (see Section 3.12). The multiplier ω can be made arbitrarily small by making use of additional disk space. A rigorous benchmark of the geometric file demonstrates its superiority over the obvious alternatives.

1.2 Biased Reservoir Sampling

The need for biased sampling can easily be illustrated with an example population, given in Table 1-1. This particular data set contains records describing graduate student salaries in a university academic department, and our goal is to guess the total graduate student salary. Imagine that a simple random sample of the data set is drawn, as shown in Table 1-2. The four sampled records are then used to guess that the total student salary is (520 + 700 + 580 + 600) × 12/4 = $7,200, which is considerably less than the true total of $9,545. The problem is that we happened to miss most of the high-salary students, who are generally more important when computing the overall total.
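The $7,200 figure is a simple scale-up estimate: the sample total multiplied by the population size over the sample size. A quick check of the arithmetic, with the salary values taken from the student-records example:

```python
# Salaries from the population of student records (Table 1-1) and the
# size-4 uniform random sample (Table 1-2).
population = [1200, 520, 1250, 1500, 700, 530, 750, 580, 605, 550, 760, 600]
sample = [520, 700, 580, 600]  # Tom, Ashley, Frank, Monica

# Scale-up estimator: sample total times N/n.
estimate = sum(sample) * len(population) / len(sample)

assert sum(population) == 9545      # the true total
assert estimate == 7200.0           # underestimates: high salaries were missed
```

The gap between 7,200 and 9,545 is exactly the motivation for biasing the sample toward the high-salary records.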
Now, imagine that we weight each record, so that the probability of including any given record with a salary of 700 or greater in the sample is (2)(4/12), and the probability of including a given record with a salary less than 700 is (1/2)(4/12). Thus, our sample will tend to include those records with higher values, which are more important to the overall sum. The resulting biased sample is depicted in Table 1-3. The standard Horvitz-Thompson estimator [50] is then applied to the sample (where each record is weighted according to the inverse of its sampling probability), which gives us an estimate of (1200 + 1500 + 750)(12/8) + (580)(24/4) = $8,655. This is obviously a better estimate than $7,200, and the fact that it is better than the original estimate is not just accidental: if one chooses the weights carefully, it is easily possible to produce a sample whose associated estimator has lower variance (and hence higher accuracy) than the simple, uniform probability sample. For instance, the variance of the estimator in the student salary example is 2.533 × 10^6 under uniform probability sampling, and it is 5.083 × 10^5 under the biased sampling scheme.

Table 1-1. Population: student records

Rec#  Name      Class      Salary ($/month)
1     James     Junior     1200
2     Tom       Freshman    520
3     Sandra    Junior     1250
4     Jim       Senior     1500
5     Ashley    Sophomore   700
6     Jennifer  Freshman    530
7     Robert    Sophomore   750
8     Frank     Freshman    580
9     Rachel    Freshman    605
10    Tim       Freshman    550
11    Maria     Sophomore   760
12    Monica    Freshman    600
Total salary: 9545.00

Table 1-2. Random sample of the size = 4

Rec#  Name    Class      Salary ($/month)
2     Tom     Freshman    520
5     Ashley  Sophomore   700
8     Frank   Freshman    580
12    Monica  Freshman    600

Other cases where a biased sample is preferable abound. For example, if the goal is to monitor the packets flowing through a network, one may choose to weight more recent packets more heavily, since they would tend to figure more prominently in most query workloads.

We propose a simple modification to the classic reservoir sampling algorithm [11, 38] in order to derive a very simple algorithm that permits the sort of fixed-size, biased sampling given in the example. Our method assumes the existence of an arbitrary, user-defined weighting function f which takes as an argument a record ri, where f(ri) > 0 describes the record's utility.

Table 1-3. Biased sample of the size = 4

Rec#  Name    Class      Salary ($/month)
1     James   Junior     1200
4     Jim     Senior     1500
7     Robert  Sophomore   750
11    Maria   Sophomore   760

The key contributions of this part of the dissertation are as follows:

1. We present a modified version of the classic reservoir sampling algorithm that is exceedingly simple, and is applicable for biased sampling using any arbitrary user-defined weighting function f.

2. In most cases, our algorithm is able to produce a correctly biased sample. However, given certain pathological data sets and data orderings, this may not be the case. Our algorithm adapts in this case and provides a correctly biased sample for a slightly modified bias function f'. We analytically bound how far f' can be from f in such a pathological case, and experimentally evaluate the practical significance of this difference.

3. We describe how to perform a biased reservoir sampling and maintain large biased samples with the geometric file.

4. Finally, we derive the correlation (covariance) between the Bernoulli random variables governing the sampling of two records ri and rj using our algorithm. We use this covariance to derive the variance of a Horvitz-Thompson estimator making use of a sample computed using our algorithm.
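The Horvitz-Thompson arithmetic above weights each sampled value by the inverse of its inclusion probability: 2 × (4/12) = 8/12 for high-salary records and (1/2) × (4/12) = 4/24 for the rest. Reproducing the computation with the values used in the text:

```python
# Inclusion probabilities from the example: salary >= 700 is sampled
# with probability 2*(4/12), salary < 700 with probability (1/2)*(4/12).
p_high = 2 * (4 / 12)        # = 8/12
p_low = (1 / 2) * (4 / 12)   # = 4/24

# Sampled values as used in the text's computation.
high_sampled = [1200, 1500, 750]
low_sampled = [580]

# Horvitz-Thompson: sum of each value divided by its inclusion probability.
estimate = (sum(v / p_high for v in high_sampled)
            + sum(v / p_low for v in low_sampled))

assert round(estimate) == 8655  # much closer to the true total of 9545
```

The inverse-probability weighting is what keeps the estimator unbiased even though the sample itself deliberately over-represents the large values.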
1.3 Sampling the Sample

Small samples frequently do not provide enough accuracy, especially in the case when the resulting statistical estimator has a very high variance. However, while in the general case a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. For example, consider the problem of estimating the average ...

Since there is no single sample size that is optimal for answering all queries, and the required sample size can vary dramatically from query to query, this part of the dissertation considers the problem of generating a sample of size N from a data stream using an existing geometric file that contains a large sample of records from the stream, where N ≪ R. We will consider two specific problems. First, we consider the case where N is known beforehand. We will refer to a sample retrieved in this manner as a batch sample. We will also consider the case where N is not known beforehand, and we want to implement an iterative function GetNext. Each call to GetNext results in an additional sampled record being returned to the caller, and so N consecutive calls to GetNext result in a sample of size N. We will refer to a sample retrieved in this manner as an online or sequential sample.
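The batch/online distinction is visible in the interface alone. A toy, in-memory version of the GetNext contract (the class name and swap-and-pop mechanics here are illustrative assumptions; the algorithms of Chapter 5 realize this interface I/O-efficiently over disk-resident subsamples of a geometric file):

```python
import random

class ToyOnlineSampler:
    """Illustrates the GetNext interface over an in-memory record list.

    Each call returns one more record of a without-replacement random
    sample, so N consecutive calls yield a sample of size N that need
    not be chosen in advance.
    """
    def __init__(self, records, seed=None):
        self._remaining = list(records)
        self._rng = random.Random(seed)

    def get_next(self):
        # Pick a uniform position among the not-yet-returned records.
        i = self._rng.randrange(len(self._remaining))
        # Swap-and-pop keeps each draw O(1).
        self._remaining[i], self._remaining[-1] = (
            self._remaining[-1], self._remaining[i])
        return self._remaining.pop()

s = ToyOnlineSampler(range(100), seed=42)
sample = [s.get_next() for _ in range(10)]  # an online sample of size 10
```

A batch sampler, by contrast, would take N up front and could plan all of its work (and, for a geometric file, its disk accesses) in one pass.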
In general, an index is a data structure that lets us find a record without having to look at more than a small fraction of all possible records. An index is referred to as a primary index if it determines the physical placement of the records themselves; otherwise it is called a secondary index.

With these goals in mind, we discuss three secondary index structures for the geometric file: (1) a segment-based index, (2) a subsample-based index, and (3) a Log-Structured Merge-Tree (LSM) based index. The first two indexes are developed around the structure of the geometric file. Multiple B+-tree indexes are maintained for each segment or subsample in a geometric file. As new records are added to the file in units of a segment or subsample, a new B+-tree indexing the new records is created and added to the index structure. Also, an existing B+-tree is deleted from the structure when all the records indexed by it are deleted from the file. The third index structure makes use of the LSM-tree index [44], a disk-based data structure designed to provide low-cost indexing in an environment with a high rate of inserts and deletes. We evaluate and compare these three index structures experimentally by measuring build time and disk footprint as new records are inserted in the geometric file. We also compare the efficiency of these structures for point and range queries.

The remainder of the dissertation is organized as follows. We review related work in Chapter 2. In Chapter 3 we present the geometric file organization and show how this structure can be used to maintain a very large sample. In Chapter 4 we propose a single-pass biased reservoir sampling algorithm. In Chapter 5 we develop techniques that can be used to sample geometric files to obtain a small-size sample. In Chapter 6 we present secondary index structures for the geometric file. In Chapter 7 we discuss the benchmarking results. The dissertation is concluded in Chapter 8.

Most of the work in the dissertation is either already published or is under review for publication. The material from Chapter 3 is from the paper with Christopher Jermaine and Subramanian Arumugam that was originally published in SIGMOD 2004 [36]. The work presented in Chapter 4 is submitted to TKDE and is under review [47]. The material in Chapter 5 is part of a journal paper accepted at VLDBJ [48]. The results in Chapter 7 are taken from the above three papers as well.

In this chapter, we first review the literature on reservoir sampling algorithms. We then present a summary of existing work on biased sampling.
[2, 3, 8, 14, 15, 28, 32, 33, 35, 51, 52]. However, most previous papers (including the aforementioned references) are concerned with how to use a sample, and not with how to actually store or maintain one. Most of these algorithms could be viewed as potential users of a large sample maintained as a geometric file.

As mentioned in the introduction chapter, a series of papers by Olken and Rotem (including two papers listed in the References section [25, 27]) probably constitute the most well-known body of research detailing how to actually compute samples in a database environment. Olken and Rotem give an excellent survey of work in this area [26]. However, most of this work is very different from ours, in that it is concerned primarily with sampling from an existing database file, where it is assumed that the data to be sampled from are all present on disk and indexed by the database. Single-pass sampling is generally not the goal, and when it is, management of the sample itself as a disk-based object is not considered.

The algorithms in this dissertation are based on reservoir sampling, which was first developed in the 1960s [11, 38]. In his well-known paper [53], Vitter extends this early work by describing how to decrease the number of random numbers required to perform the sampling. Vitter's techniques could be used in conjunction with our own, but the focus of existing work on reservoir sampling is again quite different from ours; management of the sample itself is not considered, and the sample is implicitly assumed to be small and in memory. However, if we remove the requirement that our sample of size N be maintained online so that it is always a valid snapshot of the stream and must evolve over time, then sequential sampling techniques related to reservoir sampling could be used to build (but not maintain) a large, on-disk sample (see Vitter [54], for example).

A related body of work concerns write-optimized index structures such as the LSM-Tree [44], Buffer Tree [6], and Y-Tree [12]. These papers consider the problem of providing I/O-efficient indexing for a database experiencing a very high record insertion rate, which is impossible to handle using a traditional B+-Tree indexing structure. In general these methods buffer a large set of insertions and then scan the entire base relation, which is typically organized as a B+-Tree, all at once, adding new data to the structure.
Any of the above methods could trivially be used to maintain a large random sample of a data stream. Every time a sampling algorithm probabilistically selects a record for insertion, it must overwrite, at random, an existing record of the reservoir. Once an evictee is determined, we can attach its location as a position identifier (a number between 1 and R) to the new sample record. This position field is then used to insert the new record into these index structures. While performing the efficient batch inserts, if an index structure discovers that a record with the same position identifier exists, it simply overwrites the old record with the newer one.

However, none of these methods can come close to the raw write speed of the disk, as the geometric file can [13]. In a sense, the issue is that while the indexing provided by these structures could be used to implement efficient, disk-based reservoir sampling, it is too heavy-duty a solution. We would end up paying too much in terms of disk I/O to send a new record to overwrite a specific, existing record chosen at the time the new record is inserted, when all one really needs is to have a new record overwrite any random, existing record.

There has been much recent interest in approximate query processing over data streams (a very small subset of these papers is listed in the References section [1, 21, 34]); even some work on sampling from a data stream [7]. This work is very different from our own, in that most existing approximation techniques try to operate in very small space. Instead, our focus is on making use of today's very large and very inexpensive secondary storage to physically store the largest snapshot possible of the stream.
Finally, we mention the U.C. Berkeley CONTROL project [37] (which resulted in the development of online aggregation [33] and ripple joins [32]). This work does address issues ...

Recently, Gemulla et al. [29] extended the reservoir sampling algorithm [11, 38] to handle deletions. In their algorithm, called random pairing (RP), every deletion from the data set is eventually compensated by a subsequent insertion. The RP algorithm keeps track of uncompensated deletions and uses this information while performing the inserts. The algorithm guards the bound on the sample size and at the same time utilizes the sample space effectively to provide a stable sample. Another extension to the classic reservoir sampling algorithm has been recently proposed by Brown and Haas for warehousing of sample data [10]. They propose hybrid reservoir sampling for independent and parallel uniform random sampling of multiple streams. These algorithms can be used to maintain a warehouse of sampled data that shadows the full-scale data warehouse. They have also provided methods for merging samples from different streams to create a uniform random sample.

The problem of temporally biased sampling in a stream environment has been considered. Babcock et al. [7] presented the sliding window approach, with a restricted horizon of the sample, to bias the sample towards the recent streaming records. However, this solution has the potential to completely lose the entire history of past stream data that is not a part of the sliding window. The work done by Aggarwal [5] addresses this limitation and presents a biased sampling method so that we can have temporal bias for recent records while also keeping representation from the stream history. This work exploits some interesting properties of the class of memoryless bias functions to present a single-pass biased sampling algorithm for these types of bias functions. However, ...

Another piece of work on single-pass sampling with a nonuniform distribution is due to Kolonko and Wäsch [40
]. They present a single-pass algorithm to sample a data stream of unknown size (that is, not known beforehand) to obtain a sample of arbitrary size n such that the probability of selecting a data item i depends on the individual item. The weight or fitness of the item that is used for its probabilistic selection is derived using exponentially distributed auxiliary values, with the item's weight as the parameter of the exponential distribution, and the largest auxiliary values determine the sample. Like the temporally biased sampling methods discussed above, this algorithm cannot be directly adapted for arbitrary user-defined bias functions.

Surprisingly, the above three papers are the only pieces of work known to the authors on how to perform single-pass biased sampling over large data sets or streaming data.

Another body of related work is the papers from the network usage area [22, 24, 41]. These papers present techniques for estimating the total network traffic (or usage) based on a sample of the flow records produced by routers. Since these flows typically have heavy-tailed distributions, the techniques presented in these papers make use of a size-dependent sampling scheme. In general, such schemes work by sampling all the records whose traffic is above a certain threshold and sampling the rest with probability proportional to their traffic. Although such techniques introduce sampling bias, where size can be thought of as the weight of a record, there are key differences between such techniques and the algorithm presented in this dissertation. The goal of our algorithm is to obtain a fixed-size biased sample that complies with an arbitrary user-defined bias function. The goal of the size-dependent sampling scheme is to obtain a sample that will provide the best accuracy for estimating the total network traffic that follows a specific distribution. The sample gathered by these schemes is not necessarily a fixed-size biased sample. It only guarantees that the expected sample size is no larger than the expected sample size.

The problem of implementing a fixed-size sampling design with desired and unequal inclusion probabilities has been studied in statistics. The monograph Theory of Sample Surveys [50
] discusses several methods for such a sampling technique, which is of some practical importance in survey sampling. This monograph begins by discussing two designs which mimic simple random sampling without replacement, with selection probabilities for a given draw that are not the same for all the units. We first summarize these techniques.

One way to implement such a design with fixed size is to select units with replacement, and then to reject the sample if there are duplicates. We discuss one such method here, called Sampford's Method.

In this chapter we give an introduction to the basic reservoir sampling algorithm that was proposed to obtain an online random sample of a data stream. The algorithm assumes that the sample maintained is small enough to fit in main memory in its entirety. We discuss and motivate why very large sample sizes can be mandatory in common situations. We describe three alternatives for maintaining very large, disk-based samples in a streaming environment. We then introduce the geometric file organization and present algorithms for reservoir sampling with the geometric file. We also describe how multiple geometric files can be maintained all at once to achieve considerable speedup.

The classic algorithm [11, 38] works as follows. To maintain a reservoir sample R of target size |R|, the following loop is used (Algorithm 1). Vitter's well-known optimization [53] reduces the number of random numbers that must be generated: after a certain number of records have been seen, the algorithm wakes up and captures the next record from the stream.
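Since this loop is referenced repeatedly below, a minimal Python sketch of the classic algorithm may help; it is a standard reconstruction, not the dissertation's exact Algorithm 1, and the names `stream` and `size` are illustrative:

```python
import random

def reservoir_sample(stream, size):
    """Classic reservoir sampling: after processing i records, every
    record seen so far is in the reservoir with probability size/i."""
    reservoir = []
    for i, record in enumerate(stream, start=1):
        if i <= size:
            # The first |R| records fill the reservoir directly.
            reservoir.append(record)
        elif random.random() < size / i:
            # Record i is sampled with probability |R|/i and
            # overwrites a uniformly chosen existing record.
            reservoir[random.randrange(size)] = record
    return reservoir
```

For example, `reservoir_sample(range(10**6), 100)` yields a uniform 100-record sample of a million-record stream in a single pass.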
Algorithm 1 maintains this invariant in steps (2-6) as follows [11, 38]. When the ith record is processed (i > |R|), it is added to the reservoir with probability |R|/i by step 4. We need to show that for all other records processed thus far, the inclusion probability is also |R|/i. Let r_k be any record in the reservoir s.t. k ≠ i, and let R_i denote the state of the reservoir just after addition of the ith record. Thus, we are interested in Pr[r_k ∈ R_i]. The record r_k must already be in R_{i-1} (probability |R|/(i−1)) and must not be overwritten when record i arrives (probability 1 − (|R|/i)(1/|R|) = (i−1)/i), so

Pr[r_k ∈ R_i] = (|R|/(i−1)) · ((i−1)/i) = |R|/i.

A related classical technique is systematic sampling [16]. To select a sample of |R| units, systematic sampling takes a unit at random from the first k units and every kth unit thereafter. Although the inclusion probability in systematic sampling is the same as in simple random sampling, the properties of a sample such as variance can be far different. It is known that the variance of systematic sampling can be better or worse compared to simple random sampling, depending on data heterogeneity and the correlation coefficient between pairs of sampled units.

The proof that reservoir sampling maintains the correct inclusion probability for any set of interest is actually very similar to the univariate inclusion probability correctness discussed above. We know that the univariate inclusion probability is Pr[r_k ∈ R_i] = |R|/i. For any arbitrary set S with |S| ≤ |R|, assume that we have the correct probability when we have seen i−1 input records, i.e. Pr[S ⊆ R_{i-1}] = C(|R|, |S|)/C(i−1, |S|), where C(a, b) denotes the binomial coefficient. When the ith record is processed (i > |R|), S survives if record i is not sampled, or if it is sampled but overwrites a record outside S:

Pr[S ⊆ R_i] = Pr[S ⊆ R_{i-1}] · ((1 − |R|/i) + (|R|/i) · (|R|−|S|)/|R|)
            = (C(|R|, |S|)/C(i−1, |S|)) · ((i−|S|)/i)
            = C(|R|, |S|)/C(i, |S|)

(see [16]). This value is known as the confidence of the estimate.

Very large samples are often required to provide accurate estimates with suitably high confidence. The need for very large samples can be easily explained in the context of the Central Limit Theorem (CLT) [27]. The CLT implies that if we use a random sample of size N to estimate the mean of a set of numbers, the error of our estimate is usually normally distributed, with two important properties:

1. The error is inversely proportional to the square root of the sample size.

2. The error is directly proportional to the standard deviation of the set over which we are estimating the mean.
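These two properties can be folded into a back-of-the-envelope sample-size calculation. The sketch below is an illustration rather than the dissertation's derivation; the z value of 2.326 (for roughly 98% two-sided confidence) is an assumption:

```python
def required_sample_size(mean, stddev, rel_error=0.025, z=2.326):
    """CLT-based estimate of the sample size needed so that the
    sample mean is within rel_error of the true mean, with the
    confidence level implied by z (2.326 ~ 98% two-sided)."""
    # error ~ z * stddev / sqrt(N)  =>  N = (z * stddev / error)^2
    absolute_error = rel_error * mean
    return (z * stddev / absolute_error) ** 2

# Students: mean age 20, stddev 2 -> fewer than 100 samples suffice.
students = required_sample_size(20, 2)
# Household net worth: mean $140,000, stddev $5,000,000 -> millions.
net_worth = required_sample_size(140_000, 5_000_000)
```

The second call lands in the ten-million range, consistent with the "more than 12 million samples" figure quoted in the example that follows.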
The significance of this observation is that the sample size required to produce an accurate estimate can vary tremendously in practice, and grows quadratically with increasing standard deviation. For example, say that we use a random sample of 100 students at a university to estimate the average student's age. Imagine that the average age is 20, with a standard deviation of 2 years. According to the CLT, our sample-based estimate will be accurate to within 2.5% with confidence of around 98%, giving us an accurate guess as to the correct answer with only 100 sampled students.

Now, consider a second scenario. We want to use a second random sample to estimate the average net worth of households in the United States, which is around $140,000, with a standard deviation of at least $5,000,000. Because the standard deviation is so large, a quick calculation shows we will need more than 12 million samples to achieve the same statistical guarantees as in the first case.

Required sample sizes can be far larger when standard database operations like relational selection and join are considered, because these operations can effectively magnify the variance of our estimate. For example, the work on ripple joins [32] provides an excellent example of how variance can be magnified by sampling over the relational join operator.

The disk-based extension with a main-memory buffer is given as Algorithm 2. Count(B) refers to the current number of records in B. Note that since the records contained in B logically represent records in the reservoir that have not yet been added to disk, a newly sampled record can either be assigned to replace an on-disk record, or it can be assigned to replace a buffered record (this is decided in Step (7) of the algorithm).
In a realistic scenario, the ratio of the number of disk blocks to the number of records buffered in main memory may approach or even exceed one. For example, a 1TB database with 128KB blocks will have 7.8 million blocks; and for such a relatively large database it is realistic to expect that we have access to enough memory to buffer millions of records. As the number of buffered records per block meets or exceeds one, most or all of the blocks on disk will contain at least one record to be overwritten by Algorithm 2, and so all of the database blocks must be updated. Thus, it makes sense to rely on fast, sequential I/O to update the entire file in a single pass. The drawback of this approach is that every time the buffer fills, we are effectively rebuilding the entire reservoir to process a set of buffered records that are a small fraction of the existing reservoir size.

The extensions of Algorithm 1 discussed above can be used to maintain a large, on-disk sample, but all of them have drawbacks. In this section, we discuss a fourth algorithm and an associated data organization called the geometric file to address these pitfalls. The geometric file is best seen as an extension of the massive rebuild option given as Algorithm 2. Just like Algorithm 2, the geometric file makes use of a main-memory buffer that allows new samples selected by the reservoir algorithm to be added to the on-disk reservoir in a lazy fashion. However, the key difference between Algorithm 2 and the algorithms used by the geometric file is that the geometric file makes use of a far more efficient algorithm for merging those new samples into the reservoir. Compared to Algorithm 2, the basic algorithm employed by the geometric file is not much different. As far as Step (13) is concerned, the difference between the geometric file and the massive rebuild extension is that the geometric file empties the buffer more efficiently, in order to avoid scanning or periodically re-randomizing the entire reservoir.
To accomplish this, the entire sample in main memory that is flushed into the reservoir is viewed as a single subsample or a stratum [16], and the reservoir itself is viewed as a collection of subsamples, each formed via a single buffer flush. Since the records in a subsample are a non-random subset of the records in the reservoir (they are sampled from the stream during a specific time period), each new subsample needs to overwrite a true, random subset of the records in the reservoir in order to maintain the correctness of the reservoir sampling algorithm. If this can be done efficiently, we can avoid rebuilding the entire reservoir in order to process a buffer flush.

At first glance, it may seem difficult to achieve the desired efficiency. The buffered records that must be added to the reservoir will typically overwrite a subset of the records stored in each of the on-disk subsamples (Section 3.3). For example, if there are 100 on-disk subsamples, the buffer must be split 100 ways in order to write to a portion of each of the 100 on-disk subsamples. This fragmented buffer then becomes a new subsample, and subsequent buffer flushes that need to replace a random portion of this subsample must somehow efficiently overwrite a random subset of the subsample's fragmented data.

The geometric file uses a careful, on-disk data organization in order to avoid such fragmentation. The key observation behind the geometric file is that the number of records of a subsample that are replaced with records from a buffered sample can be characterized with reasonable accuracy using a geometric series (hence the name geometric file). As buffered samples are added to the reservoir via buffer flushes, we observe that each existing subsample loses approximately the same fraction of its remaining records every time, where the fraction of records lost is governed by the ratio of the size of a buffered sample to the overall size of the reservoir. By "loses", we mean that the subsample has some of its records replaced in the reservoir with records from a subsequent subsample. Thus, the size of a subsample decays approximately in an exponential manner as buffered samples are added to the reservoir.
This exponential decay is used to great advantage in the geometric file, because it suggests a way to organize the data in order to avoid problems with fragmentation. Each subsample is partitioned into a set of segments of exponentially decreasing size. These segments are sized so that every time a buffered sample is added to the reservoir, we expect that each existing subsample loses exactly the set of records contained in its largest remaining segment. As a result, each subsample loses one segment to the newly created subsample every time the buffer is emptied, and a geometric file can be organized into a fixed and unchanging set of segments that are stored as contiguous runs of blocks on disk. Because the set of segments is fixed beforehand, fragmentation and update performance are not problematic: in order to replace records in an existing subsample, we simply overwrite its largest remaining segment in place.

On day two (with U_1 = 90), the Uranium further decays to U_2 = 81 grams, this time losing U_1(1−α) = U_0(1−α)α = nα = 9 grams of its mass. On day three, it further decays by nα² grams, and so on. The decay process is allowed to continue until we have less than γ grams of Uranium remaining.

Continuing with the Uranium analogy, three questions are relevant to our problem of maintaining very large samples from a data stream. These questions can be answered using the following three simple observations related to geometric series; in particular, the number of decay steps before less than γ grams remain is ⌊log_α(γ/U_0)⌋, a floor we denote by ℓ.

Recall (Algorithm 2) that the way reservoir sampling works is that new samples from the data stream are chosen to overwrite random samples currently in the reservoir. The buffer temporarily stores these new samples, delaying the overwrite of a random set of records that are already stored on disk. Once the buffer is full, all new samples are merged with R by overwriting a random subset of the existing samples in R.
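The geometric sizing of segments can be sketched in a few lines; this is an illustration of the decay pattern with hypothetical parameter values, where `subsample_size`, `alpha`, and `gamma` play the roles of n's parent size, α, and γ above:

```python
def segment_sizes(subsample_size, alpha, gamma):
    """Partition a subsample into segments of geometrically decreasing
    size: each buffer flush is expected to remove a (1 - alpha)
    fraction of what remains, i.e. the largest remaining segment.
    Records below the threshold gamma stay buffered in memory."""
    sizes = []
    remaining = subsample_size
    while remaining >= gamma:
        lost = remaining * (1 - alpha)  # removed by the next flush
        sizes.append(lost)
        remaining -= lost               # remaining = size * alpha^k
    return sizes

sizes = segment_sizes(1000, 0.9, 10)
# Largest segment first: 100, 90, 81, ... down to the gamma threshold.
```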
Consider some arbitrary subsample S of R (so S ⊆ R), with capacity |S|. Since the buffer B represents samples that have already logically overwritten an equal number of records of R, a buffer flush overwrites exactly |B| samples of R. Thus, on expectation the merge will overwrite |S|·|B|/|R| samples of S. If we define |B|/|R| = 1−α, then on expectation, S should lose |S|(1−α) of its own records due to the buffer flush.

We can roughly describe the expected decay of S after repeated buffer merges using the three observations stated before. If the subsample retention rate is α = 1 − |B|/|R|, then the size of S after k buffer flushes is approximately |S|α^k.

The net result of this is that it is possible to characterize the expected decay of any arbitrary subset of the records in our disk-based sample as new records are added to the sample through multiple emptyings of the buffer. We can view S as being composed of on-disk segments of exponentially decreasing size, plus a special, single group of final segments of total size γ (Figure 3-7).

Figure 3-1: Decay of a subsample after multiple buffer flushes.

Figure 3-2: Basic structure of the geometric file.

As shown in Figure 3-1, we can organize our large, disk-based sample as a set of decaying subsamples. At any point of time, the largest subsample was created by the most recent flushing of the buffer into R, and has not yet lost any segments. The second largest subsample was created by the second most recent buffer flush; it lost its largest segment in the most recent buffer flush. In general, the ith largest subsample was created by the ith most recent buffer flush, and it has had i−1 segments removed by subsequent buffer flushes. The overall file organization is depicted in Figure 3-2.

The process of filling the file is given as Algorithm 3. The terms n, α, and γ carry the meaning discussed in Section 3.5. The process described by Algorithm 3 is depicted graphically in Figure 3-3. First, the file is filled with the initial data produced by the stream ((a) through (c)). To add the first records to the file, the buffer is allowed to fill with samples. The buffered records are then randomly grouped into segments, and the segments are written to disk to form the largest initial subsample (a). For the second initial subsample, the buffer is only allowed to fill to α|B| of its capacity before being written out (b). For the third initial subsample, the buffer fills to α²|B| of its capacity before it is written (c). This is repeated until the reservoir has completely filled (as was shown in Figure 3-2
). At this point, new samples must overwrite existing ones. To facilitate this, the buffer is again allowed to fill to capacity. Records are then randomly grouped into segments of appropriate size, and those segments overwrite the largest segment of each existing subsample (d). This process is then repeated indefinitely, as long as the stream produces new records ((e) and (f)).

In order to maintain the algorithm's correctness when the buffer is flushed (Algorithm 3, Step (21)), consider the example of Figure 3-4. We want to add five additional numbers to our set, by randomly replacing five existing numbers. While we do expect numbers to be replaced in a way that is proportional to bucket size (Figure 3-4(b)), this is not always what will happen (Figure 3-4(c)).

Before we add a new subsample to disk via a buffer flush in Step (21) of Algorithm 3, we first perform a logical, randomized partitioning of the buffer into segments, described by Algorithm 4. In Algorithm 4, each newly sampled record is randomly assigned to replace a sample from an existing, on-disk subsample so that the probability of each subsample losing a record is proportional to its size. The result of Algorithm 4 is an array of M_i values, where M_i tells Step (21) of Algorithm 3 how many records should be assigned to overwrite the ith on-disk subsample.

In general, there is no guarantee that Algorithm 3 will overwrite exactly the number of records contained in each subsample's largest segment.

Figure 3-3: Building a geometric file.

Figure 3-4: Distributing new records to existing subsamples.

To handle this problem, we associate a stack (or buffer) with each subsample. Suppose that the number of records assigned to a subsample differs from what, according to Algorithm 4, M_i should have been. Then, there are two possible cases. These stack operations are performed just prior to Step (23) in Algorithm 3. Note that since the final group of segments from a subsample, of total size γ, is buffered in main memory, its maintenance does not require any stack operations. Once a subsample has lost all of its on-disk samples, overwrites of records in this set can be handled by simply replacing the records directly.
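The partitioning step can be sketched in Python; this is a simplified stand-in for Algorithm 4 rather than its exact pseudocode, and the subsample sizes used below are hypothetical:

```python
import random
from collections import Counter

def partition_buffer(buffer_size, subsample_sizes):
    """Randomly assign each buffered record to overwrite a record of
    one of the existing on-disk subsamples, with probability
    proportional to each subsample's current size. Returns the array
    of M_i values: how many records overwrite the ith subsample."""
    picks = random.choices(range(len(subsample_sizes)),
                           weights=subsample_sizes,
                           k=buffer_size)
    counts = Counter(picks)
    return [counts[i] for i in range(len(subsample_sizes))]

m = partition_buffer(100, [500, 300, 200])
# m sums to 100; on expectation m is near [50, 30, 20].
```

The randomness of the assignment is exactly why M_i can deviate from the largest segment's true size, motivating the per-subsample stacks discussed above.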
To preallocate space for these stacks, we need to characterize how much overflow we can expect from a given subsample, which will bound the growth of the subsample's stack. It is important to have a good characterization of the expected stack growth. If we allocate too much space for the stacks, then we allocate disk space for storage that is never used. If we allocate too little space, then the top of one stack may grow up into the base of another. If a stack does overflow, it can be handled by buffering the additional records temporarily in memory, or by moving the stack to a new location on disk until the stack can again fit in its allocated space. This is not desirable, however.

To avoid this, we observe that if the stack associated with a subsample S contains any samples at a given moment, then S has had fewer of its own samples removed than expected. Thus, our problem of bounding the growth of S's stack is equivalent to bounding the difference between the expected and the observed number of samples that S loses as |B| new samples are added to the reservoir, over all possible values for |B|.

To bound this difference, we first note that after adding |B| new samples into the reservoir, the probability that any existing sample in the reservoir has been overwritten by a new sample is P = 1 − (1 − 1/|R|)^|B| [42]. Simple arithmetic implies that the greatest variance is achieved when a subsample has on expectation lost 50% of its records to new samples (P = 0.5); at this point the standard deviation is 0.5√|S|.

To illustrate the importance of minimizing the number of segments, imagine that we have a 1GB buffer and a stream producing 100B records, and we want to maintain a 1TB sample. Assume that we use an α value of 0.99. Thus, each subsample is originally 1GB, and |B| = 10^7. From Observation 2 we know that we need ⌈log_0.99(γ/|B|)⌉ = 1029 segments to store the entire new subsample.

Now, consider the situation if α = 0.999. A similar computation shows that we will now require 10,344 segments to store the same 1GB subsample. This is an order of magnitude difference, with significant practical importance. With four disk seeks per segment, 1029 segments might mean that we spend around 40 seconds of disk time in random I/Os (at 10 ms per seek).
We will address this limitation in Section 3.10. Increasing γ by a factor of 32 reduces the required number of segments only to ⌈log_0.99(32γ/|B|)⌉, or 687. That is, by increasing the amount of main memory devoted to holding the smallest segments for each subsample by a factor of 32, we are able to reduce the number of disk head movements by less than a factor of two. Thus, we will not consider optimizing γ. Rather, we will fix γ to hold a set of samples equivalent to the system block size, and search for a better way to increase performance.

1. Why is the classic reservoir sampling algorithm (presented as Algorithm 1) correct? That is, what is the invariant maintained by Algorithm 1?

2. Why is the obvious disk-based extension of Algorithm 1 (presented as Algorithm 2) correct? That is, how does Algorithm 2 maintain the invariant of Algorithm 1 via the use of a main-memory buffer?

3. Why is the proposed geometric file based sampling technique in Algorithm 3 correct?

We have answered the first question in Section 3.1. We discuss the second and third questions here.

Algorithm 2 makes use of the main-memory buffer of size |B| to buffer new samples. The buffered samples logically represent a set of samples that should have been used to replace on-disk samples in order to preserve the correctness of the sampling algorithm, but that have not yet been moved to disk for performance reasons (that is, due to lazy writes).

It is not hard to see that the invariant maintained by Algorithm 1 is also maintained by Algorithm 2 in step (6). The new records are sampled with the same probability |R|/i. The only difference is that newly sampled records are added to the reservoir using steps (7-14) instead of the simple steps (5-6) of Algorithm 1. We now discuss why these steps are equivalent.
One straightforward way of keeping the sampled records in the buffer and doing lazy writes is as follows. Every time we decide to add a new sample to the buffer (i.e., with probability |R|/i), we also generate a random number between 1 and |R| to decide its position in the reservoir. However, we store this position in the position array, and thus avoid an immediate disk seek. If we happen to generate a position that is already in the position array, we overwrite the corresponding record in the buffer with the newly sampled record. If we had flushed that record to disk using the classic algorithm (rather than buffering it), we would have replaced it with the newly sampled record, so we obtain the same result. Once the buffer is full, we flush it to disk; this scheme is equivalent to Algorithm 1 as far as correctness is concerned.

Logically, steps (7-14) of Algorithm 2 actually implement exactly this process. The probability that we will generate a random position between 1 and |R| that is already in the position array of size |B| is |B|/|R|. Step (7) of Algorithm 2 decides whether to overwrite a random buffered record with a newly sampled record. Once the buffer is full, step (13) performs a one-pass buffer-reservoir merging by generating sequential random positions in the reservoir on the fly.

In Algorithm 2 we store the samples sequentially on the disk and overwrite them in a random order. Though correct, the algorithm demands almost a complete scan of the reservoir (to perform all random overwrites) for every buffer flush. We can do better if we instead force the samples to be stored in a random order on disk, so that they can be replaced via an overwrite using sequential I/Os. The localized overwrite extension discussed before uses this idea. Every time a buffer is flushed to the reservoir, it is randomized in main memory and written as a random cluster on the disk. We maintain the correctness of this technique by splitting the random cluster N ways, where N is the number of existing clusters on the disk, and by overwriting a random subset of each existing cluster. This avoids the problem of clustering by insertion time. However, the drawback of this technique is that the solution deteriorates because of fragmentation of clusters.
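The lazy-write bookkeeping described in the first paragraph above can be sketched as follows; this is illustrative only, and a real implementation would flush the buffer to disk with sequential I/O rather than merge into an in-memory list:

```python
import random

def buffered_reservoir(stream, reservoir_size, buffer_size):
    """Reservoir sampling with lazy writes: sampled records wait in a
    main-memory buffer along with their destination positions, and a
    repeated position overwrites the matching buffered record."""
    reservoir = [None] * reservoir_size
    buffer, positions = [], {}       # positions: slot -> buffer index
    for i, record in enumerate(stream, start=1):
        if i <= reservoir_size:
            reservoir[i - 1] = record
            continue
        if random.random() < reservoir_size / i:
            slot = random.randrange(reservoir_size)
            if slot in positions:
                # Same position drawn twice: overwrite in the buffer.
                buffer[positions[slot]] = record
            else:
                positions[slot] = len(buffer)
                buffer.append(record)
                if len(buffer) == buffer_size:   # "flush" the buffer
                    for s, bi in positions.items():
                        reservoir[s] = buffer[bi]
                    buffer, positions = [], {}
    for s, bi in positions.items():              # final flush
        reservoir[s] = buffer[bi]
    return reservoir
```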
The geometric file overcomes the drawbacks of these two techniques, and can be viewed as a combination of Algorithm 2 and the idea used in the localized overwrite extension. The correctness of the geometric file results directly from the correctness of these two techniques. In the case of the geometric file, the entire sample in main memory (referred to as a subsample) is randomized and flushed into the reservoir. Furthermore, each new subsample is split into exactly as many segments as there are existing subsamples on the disk. These segments then overwrite a random portion of each disk-based subsample. The only difference with the localized overwrite extension is in how the segments are sized.

One limitation is that α, just as in Algorithm 1, is fixed by the ratio |B|/|R|. That is, for a fixed desired size of reservoir, we need a larger buffer to lower the value of α.

However, there is a way to improve the situation. Given a buffer of fixed capacity |B| and desired sample size |R|, we choose a smaller value α′ < α, and then maintain more than one geometric file at the same time to achieve a large enough sample. Specifically, we need to maintain m = (1−α′)/(1−α) geometric files at once. These files are identical to what we have described thus far, except that the parameter α′ is used to compute the sizes of a subsample's on-disk segments, and the size of each file is |R|/m.

Each of the m geometric files is still treated as a set of decaying subsamples, and each subsample is partitioned into a set of segments of exponentially decreasing size, just as is done in Algorithm 3, Steps (5)-(13). The only difference is that as each file is created, the parameter α′ is used instead of α in Steps (6), (8)-(9), and each of the m geometric files is filled one after another, in turn. Thus, each subsample of each geometric file will have segments of size n, nα′, nα′², and so on.

New samples are processed as in Algorithm 3, Steps (15)-(20), until the buffer is full. Once the buffer is full, its record order is randomized, just as in a single geometric file. Next the buffer is flushed to disk. This is where the algorithm is modified. Overwriting records on disk with records from the buffer is somewhat different, in two primary ways, as discussed next.
First, in Algorithm 4 the buffer is partitioned so that the size of each buffer segment is on expectation proportional to the current size of the subsamples in a single file. In the case of multiple geometric files, we partition the buffer just like in Algorithm 4; however, we randomly partition the buffer across all subsamples from all geometric files. The number of buffer segments after the partitioning is the same as the total number of subsamples in the entire reservoir, and the size of each buffer segment is on expectation proportional to the current size of each of the subsamples from one of the geometric files. This allows us to maintain the correctness of the reservoir sampling algorithm. The buffer partitioning steps in the case of multiple geometric files are given in Algorithm 5.

Second, the merge differs from Algorithm 3's buffer merge algorithm. We discuss all the intricacies subsequently, but at a high level, the largest segment of each subsample from only one geometric file is overwritten with samples from the buffer. This allows for considerable speedup, as we discuss in Section 3.12. At first, this would seem to compromise the correctness of the algorithm: logically, the buffered samples must overwrite samples from every one of the geometric files (in fact, this is precisely why the buffer is partitioned across all geometric files, as described above). In Sections 3.11.1 to 3.11.3, we describe in detail an algorithm that is able to maintain the correctness of the sample.

Once the segments assigned to the various files have been consolidated, the resulting segments are used to overwrite subsamples from a single geometric file, using exactly the algorithm from Section 3.4, subject to the constraint that the jth buffer merge overwrites subsamples from the (j mod m)th geometric file.

Our remedy to this problem is to delay overwriting a subsample's largest segment until the time that all (or most) of the records that will be overwritten on disk are invalid, in the sense that they have logically been overwritten by having records from subsequent buffer flushes assigned to them.

Figure 3-5: Speeding up the processing of new samples using multiple geometric files.
The way to accomplish this is to overwrite subsamples in a lazy manner. We merge the buffer with the (j mod m)th geometric file, but we do not overwrite any of the valid samples stored in the file until the next time we get to the file. We can achieve this by allocating enough extra space in each geometric file to hold a complete, empty subsample. This subsample is referred to as the dummy. The dummy never decays in size, and never stores its own samples. Rather, it is used as a buffer that allows us to sidestep the problem of a subsample decaying too quickly. When a new subsample is added to a geometric file, the new subsample overwrites segments of the dummy rather than overwriting the largest segment of any existing subsample. Thus, we have protected segments of subsamples that contain valid data by overwriting the dummy's records instead.

When records are merged from the buffer into the dummy, the space previously owned by the dummy is given up to allow storage of the file's newest subsample. After this flush, the largest segment from each of the subsamples in the file is given up to reconstitute the new dummy. Because the records in the (new) dummy's segments will not be overwritten until the next time that this particular geometric file is written to, all of the data that is contained within it is protected.

Note that with a dummy subsample, we no longer have a problem with a subsample losing its samples too quickly. Instead, a subsample may have slightly too many samples present on disk at any given time, buffered by the file's dummy. These extra samples can easily be ignored during query processing. The only additional cost we incur with the dummy is that each of the geometric files on disk must have |B| additional units of storage allocated. The use of a dummy subsample is illustrated in Figure 3-5.

Proof of Theorem 2. The number of segments in a subsample is ⌈log_α′(γ/n)⌉. Substituting n = (1−α′)|B| and simplifying the expression (as well as dropping the constant term) gives log_α′(γ) − log_α′(|B|). If we let ω = (log(1/α′))⁻¹, the number of segments can be expressed as ω(log|B| − log γ). Assuming a constant number c of random seeks per segment written to the disk, the total random disk head movement required per record is ωc((log|B| − log γ)/|B|), which is O(ω log|B| / |B|).
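The arithmetic behind the multiple-file setup is easy to check; this sketch uses the 1TB reservoir / 1GB buffer figures from this chapter's running example, and the function name is illustrative:

```python
def multi_file_plan(reservoir_bytes, buffer_bytes, alpha_prime):
    """Given the buffer-to-reservoir ratio (which fixes alpha) and a
    desired smaller decay parameter alpha_prime, return the number of
    geometric files m = (1 - alpha_prime) / (1 - alpha) and the total
    disk footprint including one dummy subsample (|B|) per file."""
    alpha = 1 - buffer_bytes / reservoir_bytes
    m = round((1 - alpha_prime) / (1 - alpha))
    total_bytes = reservoir_bytes + m * buffer_bytes
    return m, total_bytes

m, total = multi_file_plan(10**12, 10**9, 0.9)
# 1TB reservoir, 1GB buffer, alpha' = 0.9 -> m = 100 files, 1.1TB total.
```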
In the case of multiple geometric files we use additional space for m dummy subsamples. Thus, the total storage required by all geometric files is |R| + m|B|. If we wish to maintain a 1TB reservoir of 100B samples with 1GB of memory, we can achieve a0 = 0.9 by using only 1.1TB of disk storage in total. For a0 = 0.9, we need to write fewer than 100 segments per 1GB buffer flush. At 40ms per segment, this is only 4 seconds of random disk head movement to write 1GB of new samples to disk.

In order to test the relative ability of the geometric file to process a high-speed stream of insertions, we have implemented and benchmarked five alternatives for maintaining a large reservoir on disk: the three alternatives discussed in Section 3.3, the geometric file, and the framework described in Section 3.10 for using multiple geometric files at once. We present these benchmarking results in Chapter 7.

In this chapter we propose a simple modification to the classic reservoir sampling algorithm [11, 38] in order to derive a very simple algorithm that permits the sort of fixed-size, biased sampling given in the example. Our method assumes the existence of an arbitrary, user-defined weighting function f which takes as an argument a record r_i, where f(r_i) > 0 describes the record's utility in subsequent query processing. We then compute (in a single pass) a biased sample R_i of the i records produced by a data stream. R_i is fixed-size, and the probability of sampling the jth record from the stream is proportional to f(r_j) for all j <= i. This is a fairly simple and yet powerful definition of biased sampling, and is general enough to support many applications.
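The storage and flush-cost figures quoted above can be checked with simple arithmetic. In this sketch we assume m = 100 geometric files, which is the value that makes |R| + m|B| come out to 1.1TB for a 1TB reservoir and a 1GB buffer; the segment count per flush and the 40ms seek cost per segment are taken from the text.

```cpp
// Back-of-envelope check of the storage overhead |R| + m|B| and the
// per-flush seek cost quoted in the text. Assumption (ours): m = 100 files.
struct Costs {
    double totalTB;       // total disk storage, in TB (1 TB = 1000 GB here)
    double flushSeconds;  // random-seek time per 1GB buffer flush
};

Costs geoFileCosts(double reservoirTB, double bufferGB, int numFiles,
                   int segmentsPerFlush, double secondsPerSegment) {
    Costs c;
    c.totalTB = reservoirTB + numFiles * bufferGB / 1000.0;  // |R| + m|B|
    c.flushSeconds = segmentsPerFlush * secondsPerSegment;   // seeks per flush
    return c;
}
```

With a 1TB reservoir, a 1GB buffer, 100 files, 100 segments per flush, and 40ms per segment, this reproduces the 1.1TB and 4-second figures from the text.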
Of course, one straightforward way to sample according to a well-defined bias function would be to make a complete pass over the data set to compute the total weight of all the records, S = sum_{j=1}^N f(r_j). During a second pass, we can then choose the ith record of the data set with probability |R| f(r_i) / S.

In most cases, our algorithm is able to produce a correctly biased sample. However, given certain pathological data sets and data orderings, this may not be the case. Our algorithm adapts in this case and provides a correctly biased sample for a slightly modified bias function f'.

The rest of the chapter is organized as follows. We describe a single-pass biased sampling algorithm. We also define a distance metric to evaluate the worst-case deviation from the user-defined weighting function f. Finally, we derive a simple estimator for a biased reservoir. The experiments performed to test our algorithms are presented in Chapter 7.

It is possible to prove that this modified algorithm results in a correctly biased sample, provided that the probability from line (8) of Algorithm 6 does not exceed one. For Algorithm 6, we are guaranteed that for each R_i and for each record r_j produced by the data stream such that j <= i, we have Pr[r_j in R_i] = |R| f(r_j) / sum_{k=1}^i f(r_k).

Proof.

We define an overweight record to be a record r_i for which |R| f(r_i) / totalWeight > 1.

An important factor to consider while determining the feasibility of maintaining such a queue in the general case is providing an upper bound on its size. This can be done by considering the worst possible ordering of the records input into the algorithm, subject to the constraint that the bias function is well defined. In general, we describe the user-defined weighting function f as being well defined if |R| f(r_i) / sum_{j=1}^i f(r_j) <= 1 for all i.

We stress that though this upper bound is quite poor (requiring that we need to buffer the entire data stream!) it is in fact a worst-case scenario, and the approach will often be feasible in practice. This is because weights will often increase monotonically over time (as in the case where newer records tend to be more relevant for query processing than older ones). Still, given the poor worst-case upper bound, a more robust solution is required, which we now describe.

1. First, we will be able to guarantee that f'(r_j) will be exactly f(r_j) if |R| f(r_k) / totalWeight <= 1 for all k > j.

2.
We can also guarantee that we can compute the true weight for a given record to unbias any estimate made using our sample (see Section 4.4).

In other words, our biased sample can still be used to produce unbiased estimates that are correct on expectation [16], but the sample might not be biased exactly as specified by the user-defined function f, if the value of f(r) tends to fluctuate wildly. While this may seem like a drawback, the number of records not sampled according to f will usually be small. Furthermore, since the function used to measure the utility of a sample in biased sampling is usually the result of an approximate answer to a difficult optimization problem [15] or the application of a heuristic [52], having a small deviation from that function might not be of much concern.

We present a single-pass biased sampling algorithm that provides both guarantees outlined above as Algorithm 7, and Lemma 4 proves the correctness of the algorithm. For Algorithm 7, we are guaranteed that for each R_i and for each record r_j produced by the data stream such that j <= i, we have Pr[r_j in R_i] = |R| f'(r_j) / sum_{k=1}^i f'(r_k).

Proof. The proof is analogous to that of Lemma 3. We simply use f' instead of f to prove the desired result.

One remaining question regarding Algorithm 7 is the deviation of f' from f. That is: how far off from the correct weighting can we be, in the worst case? When the stream has no overweight records, we expect f' to be exactly equal to f, but it may be very far away under certain circumstances. To address this, we define a distance metric in Definition 2 and evaluate the worst-case distance between f' and f. This worst case is characterized by Theorem 1 and is analyzed and proved in the Appendix of this paper.
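The single-pass biased reservoir described above can be sketched as follows. This is our reconstruction of the idea behind Algorithms 6 and 7, not the dissertation's exact pseudocode: each arriving record replaces a uniformly chosen victim with probability |R| f(r) / totalWeight, and an overweight record (one for which that probability would exceed one) triggers the renormalization of totalWeight that realizes the adjusted bias function f'.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Sketch of a single-pass biased reservoir (our reconstruction). A record r
// with weight f(r) evicts a uniformly chosen resident with probability
// |R|*f(r)/totalWeight; an overweight record clamps that probability to one
// by renormalizing totalWeight, which is the f -> f' adjustment in the text.
struct Record { double value; double weight; };

class BiasedReservoir {
public:
    explicit BiasedReservoir(std::size_t capacity, unsigned seed = 0)
        : capacity_(capacity), gen_(seed) {}

    void insert(const Record& r) {
        totalWeight_ += r.weight;
        if (sample_.size() < capacity_) {  // first |R| records accepted w.p. 1
            sample_.push_back(r);
            return;
        }
        double p = capacity_ * r.weight / totalWeight_;
        if (p > 1.0) {                            // overweight record: totalWeight
            totalWeight_ = capacity_ * r.weight;  // is renormalized so p becomes 1
            p = 1.0;
        }
        std::bernoulli_distribution accept(p);
        if (accept(gen_)) {
            std::uniform_int_distribution<std::size_t> victim(0, capacity_ - 1);
            sample_[victim(gen_)] = r;  // evict a uniformly chosen resident
        }
    }

    const std::vector<Record>& sample() const { return sample_; }
    double totalWeight() const { return totalWeight_; }

private:
    std::size_t capacity_;
    std::mt19937 gen_;
    std::vector<Record> sample_;
    double totalWeight_ = 0.0;
};
```

Under uniform weights this degenerates to classic reservoir sampling; the renormalization path only fires when a record's weight is large relative to everything seen so far.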
7 will sample with an actual bias function f' where totalDist(f, f') is upper bounded by a quantity determined by the sums sum_{k=|R|}^N f(r'_k) and sum_{k=1}^{|R|-1} f(r'_k). Algorithm 7 computes a biased sample according to f', where f' is a close function to the user-defined weighting function f according to the following distance metric:

The worst case for Algorithm 7 occurs when (1) the reservoir is initially filled with the |R| records having the smallest possible weights and (2) we encounter the record r_max with the largest weight immediately thereafter. Theorem 1 presented an upper bound on totalDist(f, f') in this worst case. In this section, we first provide the proof of this worst case for Algorithm 7 and then prove the upper bound on totalDist(f, f') given by Theorem 1.

To establish the worst case for Algorithm 7, we first prove the following three propositions. These proofs lead us to the worst-case argument. If we denote the record with the highest weight in the stream as r_max and use r_max_i to denote the case where r_max is located at position i in the stream, then for any given random ordering of the streaming records r_1, ..., r_{i-1}, r_max_i, ..., r_N, we prove that:

1. Moving the record r_max_i earlier in the range r_|R| ... r_N cannot decrease totalDist(f, f').

2. When we are initially filling the reservoir, choosing |R| records with the smallest possible weights maximizes totalDist(f, f').

3. Reordering of any record that appears after r_max_i in the range r_{i+1} ... r_N cannot increase totalDist(f, f').

Using Lemma 5 (given below), we rewrite the totalDist formula and consider the adjustment of r_max_i to r_max_{i-1}. Figure 4-1 shows this adjustment. We denote the record that is swapped with r_max as r_swap.

Algorithm 7 accepts the first |R| records of the stream with probability 1. No weight adjustments are triggered for the first |R| records irrespective of their weights. Therefore, the earliest position at which r_max can appear in the stream is right after the reservoir is filled. This proves the proposition. We now turn to proving Lemma 5, which was used in the previous proof.

Proof.
From the above three propositions, we can conclude that the worst case for Algorithm 7 occurs when (1) the reservoir is initially filled with the |R| records having the smallest possible weights and (2) we encounter the record r_max with the largest weight immediately thereafter.

Theorem 1: The Upper Bound on totalDist. Using Lemma 5 we rewrite the totalDist formula. In the worst case the reservoir is initially filled with the |R| records having the smallest possible weights. If r_1, r_2, ..., r_N are the records in appearance order, then we define r'_1, r'_2, ..., r'_N as the permutation (reordering) of the records such that f(r'_1) <= f(r'_2) <= ... <= f(r'_N). The condition requiring the reservoir to be filled with the smallest possible weights can then be written in terms of this permutation.

When an overweight record r_i is encountered, the stored weights are adjusted as follows:

1. For each on-disk subsample, M_j is set to |R| M_j f(r_i) / totalWeight.

2. For each sampled record still in the buffer, r_j.weight is set to |R| r_j.weight f(r_i) / totalWeight.

3. Finally, totalWeight is set to |R| f(r_i).

It is possible to derive a variance estimate [50] for a sample computed using our algorithm. We derive the correlation (covariance) between the Bernoulli random variables governing the sampling of two records r_i and r_j using our algorithm, and use this covariance to derive the variance of a Horvitz-Thompson estimator. Combined with the Central Limit Theorem, the variance can then be used to provide bounds on the estimator's accuracy. The estimator is suitable for the SUM aggregate function (and, by extension, the AVERAGE and COUNT aggregates) over a single database table for which the reservoir is maintained. Though handling more complicated queries using the biased sample is beyond the scope of the paper, it is straightforward to extend the analysis of this Section to more complicated queries such as joins [32].

Imagine that we have the following single-table query, whose (unknown) answer is q:

FROM TABLE AS r

Next, we derive the variance of this estimator. To do this, we need a result similar to Lemma 3 that can be used to compute the probability Pr[{r_j, r_k} in R_i] under our biased sampling scheme. For Algorithm 7, for each R_i and for each record pair {r_j, r_k} produced by the data stream where j < k <= i, this probability can be computed in closed form. This expression can then be used in conjunction with the next lemma to compute the variance of the natural estimator for q.
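The three renormalization steps listed above can be written down directly. This is our illustration of those steps, with M[] standing for the per-subsample weight multipliers and bufWeights for the weights of records still in the buffer; note that after step 3 the new ratio |R| f(r_i) / totalWeight is exactly one, so r_i is no longer overweight.

```cpp
#include <cstddef>
#include <vector>

// Apply the three weight-adjustment steps from the text when an overweight
// record with weight fi arrives. All previously stored weights are scaled by
// |R|*fi/totalWeight (> 1 for an overweight record), and the running total
// is reset so that the new record's acceptance probability is exactly one.
void handleOverweight(double fi, std::size_t R,
                      std::vector<double>& M,
                      std::vector<double>& bufWeights,
                      double& totalWeight) {
    const double scale = R * fi / totalWeight;  // > 1 for an overweight record
    for (double& m : M) m *= scale;             // step 1: on-disk multipliers
    for (double& w : bufWeights) w *= scale;    // step 2: buffered weights
    totalWeight = R * fi;                       // step 3: new running total
}
```

Scaling every stored multiplier by the same factor preserves the relative bias among already-sampled records, which is what makes the adjusted function f' well defined.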
By using the result of Lemma 6 to compute Pr[{r_j, r_k} in R_i], the variance of the estimator is then easily obtained for a specific query. In practice, the variance itself must be estimated by considering only the sampled records, as we typically do not have access to each and every r_j during query processing. The q^2 term and the two sums in the expression for the variance are thus computed over each r_j in the sample of the biased geometric file rather than over the entire reservoir.

There is one additional issue regarding biased sampling that is worth some additional discussion: how to efficiently compute the value Pr[{r_j, r_k} in R_i] in order to estimate the variance.

The first subexpressions can be easily computed with the help of the running total totalWeight along with the weight multipliers associated with each subsample. When sample records are added to the reservoir, alongside the attribute r_i.weight we store two more attributes with each record: r_i.oldTotalWeight and r_i.oldM. The first attribute gets its value from the current value of totalWeight, whereas M(r_i) is stored in the second attribute. When a query is evaluated and we need to compute the first subexpressions for a given record pair r_j and r_k, we compute the terms in the denominator as follows: sum_{l=1}^k f'(r_l) = r_k.oldTotalWeight M(r_k) / r_k.oldM.

A geometric file is a simple random sample (without replacement) from a data stream. In this chapter we develop techniques which allow a geometric file to itself be sampled in order to produce smaller sets of data objects that are themselves random samples (without replacement) from the original data stream. The goal of the algorithms described in this chapter is to efficiently support further sampling of a geometric file by making use of its own structure.
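The Horvitz-Thompson estimate itself is simple to compute once the inclusion probabilities are available. A minimal sketch (our illustration): each sampled record contributes its aggregated value divided by its inclusion probability |R| f'(r) / totalWeight, which is what makes the SUM estimate correct on expectation.

```cpp
#include <cstddef>
#include <vector>

// Sketch of a Horvitz-Thompson SUM estimate over a biased sample (ours).
// `value` is the aggregated attribute; `fprime` is the record's adjusted
// weight f'(r). The inclusion probability of each record is
// |R|*f'(r)/totalWeight, and dividing by it unbiases the sum.
struct Sampled { double value; double fprime; };

double htSumEstimate(const std::vector<Sampled>& sample,
                     std::size_t R, double totalWeight) {
    double est = 0.0;
    for (const Sampled& s : sample) {
        double inclProb = R * s.fprime / totalWeight;  // Pr[r in sample]
        est += s.value / inclProb;                     // inverse-probability weight
    }
    return est;
}
```

For example, with |R| = 2, four records of equal weight (totalWeight = 4), and sampled values 10 and 40, each inclusion probability is 0.5 and the estimate is (10 + 40)/0.5 = 100, which equals the true sum of 10 + 20 + 30 + 40.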
3.2, we argued that small samples frequently do not provide enough accuracy, especially in the case when the resulting statistical estimator has a very high variance. However, while in the general case a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. For example, reconsider the problem of estimating the average net worth of American households as described in Section 3.2. In the general case, many millions of samples may be needed to estimate the net worth of the average household accurately (due to a small ratio between the average household's net worth and the standard deviation of this statistic across all American households). However, if the same set of records held information about the size of each household, only a few hundred records would be needed to obtain similar accuracy for an estimate of the average size of an American household, since the ratio of average household size to the standard deviation of household size across households in the United States is greater than 2. Thus, to estimate the answers to these two queries, vastly different sample sizes are needed.

[1, 21, 30, 34, 39]. In general, the drawback of making use of a batch sample is that the accuracy of any estimator which makes use of the sample is fixed at the time that the sample is taken, whereas the benefit of batch sampling is that the sample can be drawn with very high efficiency.
We will also consider the case where N is not known beforehand, and we want to implement an iterative function GetNext. Each call to GetNext results in an additional sampled record being returned to the caller, and so N consecutive calls to GetNext result in a sample of size N. We will refer to a sample retrieved in this manner as an online or sequential sample. The drawback of online sampling compared to batch sampling is that it is generally less efficient to obtain a sample of size N using online methods. However, since the consumer of the sample can call GetNext repeatedly until an estimator with enough accuracy is obtained, online sampling is more flexible than batch sampling. An online sample retrieved from a geometric file can be useful for many applications, including online aggregation [32, 33]. In online aggregation, a database system tries to quickly gather enough information so as to approximate the answer to an aggregate query. As more and more information is gathered, the approximation quality is improved, and the online sampling procedure is halted when the user is happy with the approximation accuracy.

5.3.1 A Naive Algorithm

Proof. Each record of the file D is selected with probability N/|D|, so the result is a true random sample of size N taken without replacement.

Unfortunately, though it is very simple, the naive algorithm will be inefficient for drawing a small sample from a large geometric file since it requires a full scan of the geometric file to obtain a true random sample for any value of N. Since the geometric file may be gigabytes in size, this can be problematic.

[26]. Once the number of sampled records from each segment has been determined, sampling those records can be done with an efficient sequential read since within each on-disk segment, all records are stored in a randomized order. The key algorithmic issue is how to calculate the contribution of each subsample. Since this contribution is a multivariate hypergeometric random variable, we can use an approach analogous to Algorithm 4, which is used to partition the buffer to form the segments of a subsample. In other words, we can view retrieving N samples from a geometric file as analogous to choosing N random records to overwrite when new records are added to the file.
The resulting algorithm can be described as follows. To start with, we partition the sample space of N records into segments of varying size exactly as in Algorithm 4. We refer to these segments of the sample space as sampling segments. The sampling segments are then filled with samples from the disk using a series of sequential reads, analogous to the set of writes that are used to add new samples to the geometric file. The largest sampling segment obtains all of its records from the largest subsample, the next largest sampling segment obtains all its records from the second largest subsample, and so on.

When using this algorithm, some care needs to be taken when N approaches the size of the geometric file. Specifically, when all disk segments of a subsample are returned to a corresponding sampling segment, we must also consider the subsample's in-memory buffered records.

It is clear that Algorithm 8 obtains the desired batch sample by scanning exactly N records, as against the full scan required by reservoir sampling, at the cost of a few random disk seeks. Since the sampling process is analogous to the process of adding more samples to the file, it is just as efficient, requiring O(w log |B| / N) random disk head movements for each newly sampled record, as described in Lemma 2. In the case of multiple geometric files, we run Algorithm 8 on each file in order to obtain the desired batch sample.
5.4.1 A Naive Algorithm

It is easy to see that a naive algorithm will give us a correct online sample of a geometric file. However, we will use one disk seek per call to GetNext, and each random I/O requires a costly disk seek.

Instead of selecting a random record of a geometric file, we randomly pick a subsample and choose its next available record as the return value of GetNext. This is analogous to the classic online sampling algorithm for sampling from a hashed file [26], where first a hash bucket is selected and then a record is chosen. Since the selection of a random record within a subsample is sequential, we may reduce the number of costly disk seeks if we read the subsample in its entirety and buffer the subsample's records in memory. Using this basic methodology, we now describe how a call to GetNext will be processed:

Since the records from each subsample are read and buffered in memory sequentially, we are guaranteed to choose each record of the reservoir at most once, giving us the desired random sample without replacement. A proof of this is simple, and analogous to the proof of Lemma 3. However, thus far we have not considered a very important question: How many blocks of a subsample S_i should we fetch at the time of a buffer refill? In general there are two extremes that we may consider:

In order to discuss such considerations more concretely, we note that the time required to process a GetNext call is proportional to the number of blocks fetched on the call, assuming that the cost to perform the required in-memory calculations is minimal. If b blocks are fetched during a particular call, we spend s + br time units on that particular call to GetNext, where s is the seek time and r is the time required to scan a block. Once these b blocks are fetched we incur zero cost for the next bn calls to GetNext, where n is the blocking factor (number of records per block). Thus, in the case where all b blocks are fetched at the first call to GetNext, we incur the total cost of s + br to sample bn records, and have a response time of s + br units at the first call to GetNext, with all subsequent calls having zero cost.
Now imagine that instead we split the b blocks into two chunks of size b/2 each, and read a chunk at a time. Thus, the first GetNext call will cost us s + br/2 time units. Once these bn/2 records are used up we read the next chunk of blocks. The total cost in this scenario is 2s + br, with a response time of s + br/2 time units once at the starting point and once midway through. Note that although the maximum response time on any call to GetNext is reduced by half, we required more time to sample bn records. The question then becomes: How do we reconcile response time with overall sampling time to give the user optimal performance?

The systematic approach we take to answering this question is based on minimizing the average square sum of response times over all GetNext calls. This idea is similar to the widely utilized sum-squared-error or MSE criterion, which tries to keep the average error or cost from being too high, but also penalizes particularly poor individual errors or costs. However, one
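The tradeoff above can be made concrete with a small sketch (our illustration, under the cost model just described). With X chunks, each refill call costs s + br/X, the total cost to sample bn records is Xs + br, and the sum of squared response times over the X expensive calls is X(s + br/X)^2; differentiating that objective in X gives the minimizer X* = br/s.

```cpp
// Sketch of the chunking tradeoff for GetNext buffer refills (ours).
// s  = seek time per refill call
// br = total sequential transfer time for the subsample's b blocks
double sumSquaredResponse(double X, double s, double br) {
    double perCall = s + br / X;   // response time of each of the X refill calls
    return X * perCall * perCall;  // = X*s^2 + 2*s*br + (br)^2/X
}

double optimalChunkCount(double s, double br) {
    return br / s;  // root of d/dX [X*s^2 + 2*s*br + (br)^2/X] = s^2 - (br)^2/X^2
}
```

For instance, with a 10ms seek and one second of total transfer time, the optimum is 100 chunks; fewer chunks raise the squared response times, more chunks waste seeks.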
FROMTransaction WHEREStoreState='FL'ANDTransDate>1/1/2007 89 PAGE 90 Anaturalwaytospeedupthesearchanddiscoveryofthoserecordsfromageometriclethathaveaparticularvalueforaparticularattribute(s)istobuildanindexstructure.Ingeneralanindexisadatastructurethatletsusndarecordwithouthavingtolookatmorethanasmallfractionofallpossiblerecords.Thus,inourexample,wecouldusetheindexbuiltoneitherStoreStateorTransDate(orboth)toquicklyaccessspecicsetofrecordsandtestthemfortheconditionsintheWHEREclause.Inthischapterwefocusonbuildingsuchanindexstructureforthegeometricle. Apartfromprovidingefcientaccesstothedesiredinformationinthele,akeyconsiderationisthattheindexforthegeometriclemustbemaintainedasnewrecordsareinserted.Forinstance,wecouldbuildasecondaryindexonanattributewhenthenewrecordsarebulkinsertedintothegeometricle.Wemustthendeterminehowdowemergethenewsecondaryindexwiththeexistingindexesbuiltfortherestofthele.Furthermore,wemustmaintaintheindexasexistingrecordsarebeingoverwrittenwithnewlyinsertedrecordsandhencearedeletedfromthegeometricle. Withthesegoalsinmind,wediscussthreesecondaryindexstructuresforthegeometricle:(1)asegmentbasedindex,(2)asubsamplebasedindex,and(3)aLogStructuredMergeTree(LSM)basedindex.Thersttwoindexesaredevelopedaroundthestructureofthegeometricle.MultipleB+treeindexes[ 9 ]aremaintainedforeachsegmentorsubsampleinageometric 90 PAGE 91 44 ]adiskbaseddatastructuredesignedtoprovidelowcostindexinginanenvironmentwithahighrateofinsertsanddeletes. Inthesubsequentsectionswediscussconstruction,maintenance,andqueryingofthesethreetypesofindexes. Wedetailconstructionandmaintenanceofasegmentbasedindexstructureinthissection. 
A natural way to speed up the search and discovery of those records from a geometric file that have a particular value for a particular attribute (or attributes) is to build an index structure. In general an index is a data structure that lets us find a record without having to look at more than a small fraction of all possible records. Thus, in our example, we could use an index built on either StoreState or TransDate (or both) to quickly access a specific set of records and test them for the conditions in the WHERE clause. In this chapter we focus on building such an index structure for the geometric file.

Apart from providing efficient access to the desired information in the file, a key consideration is that the index for the geometric file must be maintained as new records are inserted. For instance, we could build a secondary index on an attribute when the new records are bulk inserted into the geometric file. We must then determine how to merge the new secondary index with the existing indexes built for the rest of the file. Furthermore, we must maintain the index as existing records are overwritten with newly inserted records and hence are deleted from the geometric file.

With these goals in mind, we discuss three secondary index structures for the geometric file: (1) a segment-based index, (2) a subsample-based index, and (3) a Log-Structured Merge-Tree (LSM-Tree) based index. The first two indexes are developed around the structure of the geometric file: multiple B+-tree indexes [9] are maintained, one for each segment or subsample in a geometric file. The third is based on the LSM-Tree [44], a disk-based data structure designed to provide low-cost indexing in an environment with a high rate of inserts and deletes.

In the subsequent sections we discuss the construction, maintenance, and querying of these three types of indexes.

We detail the construction and maintenance of a segment-based index structure in this section.

Recall that we use Algorithm 3 from Chapter 3 during startup to fill the reservoir. Every time the buffer accumulates the desired number of records, it is segmented and flushed to the disk. We build a B+-tree index for each segment just before it is written out to the disk. For each buffered record of a segment we construct an index record. An index record is comprised of the value of the attribute on which the index is being built (the key value) and the position of the buffered record on the disk. The position is stored as a number pair: a page number and an offset within the page. The index records are then used to create an index using the bulk insertion algorithm for a B+-tree. We use a simple array-based data structure to keep track of the B+-tree root nodes.

Rather than maintaining a file for each B+-tree created, we organize multiple B+-trees in a single disk file. We refer to this single file as the index file. The index file, in a sense, is similar to the log-structured file system proposed by Ousterhout [45]. In a log-structured file system, as files are modified, the contents are written out to the disk as logs in a sequential stream. This allows writes in full cylinder units, with only track-to-track seeks; thus the disk operates at nearly its full bandwidth. The index file enjoys similar performance benefits. Every time a B+-tree is created for a memory-resident segment, it is written to the index file in a sequential stream at the next available position. The array maintaining all B+-tree root nodes is augmented with the starting disk position of each B+-tree.

Finally, we do not index segments that are never flushed to the disk. These segments are typically very small (the size of a disk block) and it is efficient to search them using a sequential memory scan when the geometric file is queried.

The algorithm used to construct and maintain a segment-based index structure is given as Algorithm 10.

We expect a segment-based index structure to be a compact structure, as there is exactly one index record present in the index structure for each record in the geometric file, and the index structure is maintained as new records are deleted from the file.
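The per-segment index record and bulk build described above can be sketched as follows. This is our illustration only: a sorted std::vector stands in for the bulk-loaded B+-tree, and a real implementation would append the tree to the shared index file and record its starting disk position alongside its root.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of a per-segment index (ours). An index record pairs the key value
// with the record's on-disk position, stored as (page number, offset).
struct IndexRecord {
    std::string key;      // value of the indexed attribute
    std::uint32_t page;   // page number of the record on disk
    std::uint16_t offset; // offset of the record within the page
};

class SegmentIndex {
public:
    // Bulk build from the buffered records of one segment, just before the
    // segment is flushed to disk (sorting stands in for B+-tree bulk load).
    explicit SegmentIndex(std::vector<IndexRecord> recs) : recs_(std::move(recs)) {
        std::sort(recs_.begin(), recs_.end(),
                  [](const IndexRecord& a, const IndexRecord& b) { return a.key < b.key; });
    }

    // Point lookup: on-disk positions of all records matching `key`.
    std::vector<IndexRecord> find(const std::string& key) const {
        auto it = std::lower_bound(recs_.begin(), recs_.end(), key,
            [](const IndexRecord& r, const std::string& k) { return r.key < k; });
        std::vector<IndexRecord> out;
        for (; it != recs_.end() && it->key == key; ++it) out.push_back(*it);
        return out;
    }

private:
    std::vector<IndexRecord> recs_;  // kept sorted by key
};
```

One such structure would be built per flushed segment, with the array of roots (here, the SegmentIndex objects themselves) consulted at query time.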
As in the case of a segment-based index structure, we arrange the B+-tree indexes on disk in a single index file. However, we need a slightly different approach, because during startup subsamples are flushed to the geometric file until the reservoir is full; thereafter, subsamples of the same size |B| are added to the reservoir. Since each B+-tree will index no more than |B| records, we can bound the size of a B+-tree index. We use this bound to preallocate a fixed-size slot on disk for each B+-tree. Furthermore, for every buffer flush after the reservoir is full, exactly one subsample is added to the file and the smallest subsample of the file decays completely, keeping the number of subsamples in a geometric file constant. We use this information to lay out the subsample-based B+-trees on disk and maintain them as new records are sampled from the data stream.

Thus, if totSubsamples is the total number of subsamples in R, we first allocate totSubsamples fixed-size slots in the index file. Initially all the slots are empty. During startup, as each new B+-tree is built, we seek to the next available slot and write out the B+-tree in a sequential stream.

The algorithm used to construct and maintain a subsample-based index structure is given as Algorithm 11.

A search on a subsample-based index structure involves looking up all B+-tree indexes, one for each subsample in the geometric file. We modify the existing B+-tree-based point query and range query algorithms and run them for each entry in the B+-tree array of the index structure. The modification is required to ignore the stale records in the B+-trees. As mentioned before, the subsample corresponding to a B+-tree may lose its segments, but the index records are not immediately removed.

Recall that we have recorded a segment number in an additional field along with each index record. For a given subsample, we keep track of which of its segments have decayed so far and use this information to ignore the index records that are stale. We return all valid index records that satisfy the search criteria. We first sort these index records by their page number attribute and then retrieve the actual records from the geometric file, returning them as the query result.
Although the subsample-based index structure maintains and must search far fewer B+-trees compared to the segment-based index structure, we expect reasonable search time per B+-tree due to the smaller size and the lazy deletion policy.

The third index structure is based on the LSM-Tree [44], a disk-based data structure designed to provide low-cost indexing in an environment with a high rate of inserts and deletes [44].

Although the C1 (and higher) components are disk resident, the most frequently referenced nodes (in general, nodes at higher levels) of these trees are buffered in main memory for performance reasons.

Whenever the C0 component reaches a threshold size, an ongoing rolling merge process removes some records (a contiguous segment) from the C0 component and merges them into the C1 component on disk. The rolling merge process is depicted pictorially in Figure 2.2 of the original LSM-Tree paper [44]. The rolling merge is repeated for migration between higher components of an LSM-Tree in a similar manner. Thus, there is a certain amount of delay before records in the C0 component migrate out to the disk-resident C1 and higher components. Deletions are performed concurrently in batch fashion, similar to inserts.

The disk-resident components of an LSM-Tree are comparable to a B+-tree structure, but are optimized for sequential disk access, with nodes 100% full. Lower levels of the tree are packed together in contiguous, multi-page disk blocks for better I/O performance during the rolling merge.

We use the existing LSM-Tree-based point query and range query algorithms to perform index lookups. As in the case of the previously proposed index structures, we sort the valid index records by page number before retrieving the actual records from the geometric file.

In Chapter 7, we evaluate and compare the three index structures suggested in this chapter experimentally by measuring build time and disk footprint as new records are inserted into the geometric file. We also compare the efficiency of these structures for point and range queries.
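The component structure and rolling merge described above can be illustrated with a deliberately tiny two-level sketch (ours, not the LSM-Tree paper's algorithm): C0 is a small in-memory component, and when it reaches its threshold it is sorted and merged into the always-sorted, simulated disk-resident C1 component.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Two-level LSM sketch (ours). Inserts land in the memory-resident C0;
// when C0 reaches its threshold, a "rolling merge" migrates its records
// into the sorted, disk-resident C1 component in one sequential pass.
class TinyLsm {
public:
    explicit TinyLsm(std::size_t c0Limit) : c0Limit_(c0Limit) {}

    void insert(int key) {
        c0_.push_back(key);
        if (c0_.size() >= c0Limit_) rollingMerge();
    }

    bool contains(int key) const {
        if (std::find(c0_.begin(), c0_.end(), key) != c0_.end()) return true;
        return std::binary_search(c1_.begin(), c1_.end(), key);
    }

    std::size_t c1Size() const { return c1_.size(); }

private:
    void rollingMerge() {  // migrate C0's records into C1 sequentially
        std::sort(c0_.begin(), c0_.end());
        std::vector<int> merged;
        merged.reserve(c0_.size() + c1_.size());
        std::merge(c0_.begin(), c0_.end(), c1_.begin(), c1_.end(),
                   std::back_inserter(merged));
        c1_.swap(merged);
        c0_.clear();
    }

    std::size_t c0Limit_;
    std::vector<int> c0_;  // memory-resident component
    std::vector<int> c1_;  // disk-resident component (simulated)
};
```

A real LSM-Tree merges only a contiguous run of C0 per rolling-merge step and cascades the process to higher components; the sketch collapses both into a single full merge for clarity.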
In this chapter, we detail three sets of benchmarking experiments. In the first set of experiments, we attempt to measure the ability of the geometric file to process a high-speed stream of data records. In the second set of experiments, we examine the various algorithms for producing smaller samples from a large, disk-based geometric file. Finally, in the third set of experiments, we compare the three index structures for the geometric file on build time, disk space, and index lookup speed.

We benchmarked the three alternatives discussed in Section 3.3, the geometric file, and the framework described in Section 3.10 for using multiple geometric files at once. In the remainder of this Section, we refer to these alternatives as the virtual memory, scan, local overwrite, geofile, and multiple geofiles options. An a0 value of 0.9 was used for the multiple geofiles option.

All implementation was performed in C++. Benchmarking was performed using a set of Linux workstations, each equipped with 2.4GHz Intel Xeon processors. 15,000-RPM, 80GB Seagate SCSI hard disks were used to store each of the reservoirs. Benchmarking of these disks showed a sustained read/write rate of 35-50 MB/second, and an across-the-disk random data access time of around 10ms.

The results of the first experiment are shown in Figure 7-1(a). By number of samples processed we mean the number of records that are actually inserted into the reservoir, and not the number of records that have passed through the data stream. The results of the second experiment are shown in Figure 7-1(b); thus, we test the effect of record size on the five options. The results of the third experiment are shown in Figure 7-1(c); this experiment tests the effect of a constrained amount of main memory.

It is worthwhile to point out a few specific findings. Each of the five options writes the first 50GB of data from the stream more or less directly to disk, as the reservoir is large enough to hold all of the data as long as the total is less than 50GB. However, Figures 7-1(a) and (b) show that only the multiple geofiles option does not have much of a decline in performance after the reservoir fills (at least in Experiments 1 and 2). This is why the scan and virtual memory options plateau after the amount of data inserted reaches 50GB. There is something of a decline in performance in all of the methods once the reservoir fills in Experiment 3 (with restricted buffer memory), but it is far less severe for the multiple geofiles option than for the other options.

Figure 7-1. Results of benchmarking experiments (processing insertions).
Figure 7-2. Results of benchmarking experiments (sampling from a geometric file).

As expected, the local overwrite option performs very well early on, especially in the first two experiments (see Section 3.3 for a discussion of why this is expected). Even with limited buffer memory in Experiment 3, it uniformly outperforms a single geometric file. Furthermore, with enough buffer memory in Experiments 1 and 2, the local overwrite option is competitive with the multiple geofiles option early on. However, fragmentation becomes a problem and performance decreases over time. Unless offline re-randomization of the file is possible periodically, this degradation probably precludes long-term use of the local overwrite option.

It is interesting that, as demonstrated by Experiment 3 (and explained in Section 3.8), a single geometric file is very sensitive to the ratio of the size of the reservoir to the amount of memory available for buffering new records from the stream. The geofile option performs well in Experiments 1 and 2 when this ratio is 100, but rather poorly in Experiment 3 when the ratio is 1000.

Finally, we point out the general unusability of the scan and virtual memory options. scan generally outperformed virtual memory, but both generally did poorly. Except in Experiment 1 with large memory and small record size, with these two options more than 97% of the processing of records from the stream occurs in the first half hour as the reservoir fills. In the 19.5 hours or so after the reservoir first fills, only a tiny fraction of additional processing occurs due to the inefficiency of the two options.

In Section 4.1 we gave an upper bound for the distance between the actual bias function f' computed using our reservoir algorithm and the desired, user-defined bias function f. While useful, this bound does not tell the entire story. In the end, what a user of a biased sampling algorithm is interested in is not how close the bias function that is actually computed is to the user-specified one; instead, the key question is what sort of effect any deviation has on the particular estimation task that is to be performed. Perhaps the easiest way to detail the practical effect of a pathological data ordering is through experimentation.

Figure 7-3. Sum query estimation accuracy for zipf = 0.2.
In this section we present experimental results evaluating the practical significance of a worst-case data ordering. Specifically, we design a set of experiments to compute the error (variance) one would expect when sampling for the answer to a SUM query in the following three scenarios:

1. When a biased sample is computed using our reservoir algorithm, with the data ordered so as to produce no overweight records.

2. When an unbiased sample is computed using the classical reservoir sampling algorithm.

3. When a biased sample is computed using our reservoir algorithm, with records arranged so as to produce the bias function furthest from the user-specified one, as described by Theorem 1.

By examining the results, it should become clear exactly what sort of practical effect on the accuracy of an estimator one might expect due to a pathological ordering.

Figure 7-4. Sum query estimation accuracy for zipf = 0.5.

Attribute B is the attribute that is actually aggregated by the SUM query. Each data set is generated so that attributes A and B both have a certain amount of Zipfian skew, specified by the parameter zipf. In each case, the bias function f is defined so as to minimize the variance for a SUM query evaluated over attribute A.

In addition to the parameter zipf, each data set also has a second parameter which we term the correlation factor. This is the probability that attribute A has the same value as attribute B. If the correlation factor is 1, then A and B are identical, and since the bias function is defined so as to minimize the variance of a query over A, the bias function also minimizes the variance of an estimate over the actual query attribute B. Thus, a correlation factor of 1 provides for a perfect bias function. As the correlation factor decreases, the quality of the bias function for a query over attribute B declines, because the chance increases that a record deemed important by looking at attribute A is, in fact, one that should not be included in the sample. This models the case where one can only guess at the correct bias function beforehand; for example, when queries with an arbitrary relational selection predicate may be issued. A small correlation factor corresponds to the case when the guessed-at bias function is actually very incorrect.
Figure 7-5. Sum query estimation accuracy for zipf = 0.8.

By testing each of the three different scenarios described in the previous subsection over a set of data sets created by varying zipf as well as the correlation factor, we can see the effect of data skew and of bias function quality on the relative quality of the estimator produced by each of the three scenarios.

For each experiment, we generate a data stream of one million records and obtain a sample of size 1000. For each of the three scenarios and each of the data sets that we test, we repeat the sampling process 1000 times over the same data stream in Monte Carlo fashion. The variance of the corresponding estimator is reported as the observed variance of the 1000 estimates. The observed Monte Carlo variances are depicted in Figures 7-3, 7-4, 7-5, and 7-6.

Figure 7-6. Sum query estimation accuracy for zipf = 1.

The results show that even for very skewed data sets, it is difficult for even an adversary to come up with a data ordering that can significantly alter the quality of the user-defined bias function.

We also observe that for a low zipf parameter and a low correlation factor, unbiased sampling outperforms biased sampling. In other words, it is actually preferable not to bias in this case. This is because the low zipf value assigns relatively uniform values to attribute B, rendering an optimal biased scheme little different from uniform sampling. Furthermore, as the correlation factor decreases, the weighting scheme used by both biased sampling schemes becomes less accurate, hence the higher variance. As the weighting scheme becomes very inaccurate, it is better not to bias at all. Not surprisingly, there are more cases where the biased scheme under the pathological ordering is actually worse than the unbiased scheme. However, as the correlation factor increases and the bias scheme becomes more accurate, it quickly becomes preferable to bias.

Next, we examine the sampling algorithms of Chapter 5. Specifically, we have compared the naive batch sampling and the online sampling algorithms against the geometric file structure-based batch sampling and online sampling algorithms.
72 (c)forboththenaivealgorithmandthemoreadvanced,geometriclestructurebasedalgorithmdesignedtoincreasethesamplingrateandevenouttheresponsetimes.TheanalogousplotformultiplegeometriclecaseisshowninFigure 72 (d).WealsoplotthevarianceinresponsetimesoverallcallstoGetNextasafunctionofthenumberofcallstoGetNextinFigures 72 (e)and 72 (f)(therstisforasinglegeometricle;thesecondiswithmultipleles).Takentogether,theseplotsshowthetradeoffbetweenoverallprocessingtimeandthepotentialforwaitingforalongtimeinordertoobtainasinglesample. 108 PAGE 109 Asexpectedandthendemonstratedbyvarianceplots,thevarianceofonlinenaiveapproachissmallerthangeometriclestructurebasedalgorithm.Althoughwiththislittlelargervariance(lessthan10timesfor100ksamples)intheresponsetimes,thestructurebasedapproachexecutedorderofmagnitudefaster(morethan100timesfor100ksamples)thanthenaiveapproachforanynumberofrecordssampled,justifyingourapproachofminimizingtheaveragesquaresumoftheresponsetime.Inotherwords,wegotenoughaddedspeedforasmallenoughaddedvarianceinresponsetimetomakethetradeoffacceptable.Asmoreandmoresamplesareobtainedthevarianceofstructurebasedalgorithmapproachedvarianceofthenaivealgorithmmakingthetradeoffevenmorereasonableforlargeintendedsamplesizes. Finally,wepointoutthatboththegeometriclestructurebasedalgorithms,batchandonlinecase,wereabletoreadsamplerecordsfromdiskalmostatthemaximumsustainedspeedoftheharddisk,ataround45MB/sec.Thisiscomparabletotherateofasequentialreadfromdisk,thebestwecanhopefor. 109 PAGE 110 Diskfootprintfor1KBrecordsize Table71. 
Table 7-1. Millions of records inserted in 10 hrs

Record size   No index   Subsample-based   Segment-based   LSM-tree
1 KB           13700        12550             10960           9680
200 B          12810         7230              8030           2930

In Chapter 6 we introduced three index structures for the geometric file: the segment-based, the subsample-based, and the LSM-tree based index structure. In this section, we experimentally evaluate and compare these three index structures by measuring build time and disk footprint as new records are inserted into the geometric file. We also compare the efficiency of these structures for point and range queries. All of the index structures were implemented on top of the geometric file prototype that was benchmarked in the previous sections.

For these experiments, records were inserted into the geometric file for ten hours. The ten hours of insertion ensures that a reasonable number of insertions and deletions are performed on an index structure. Given such a file, we collected three pieces of information for each of the three index structures under consideration: the number of records inserted, the disk footprint over time, and the index lookup speed.

With these metrics in mind, we performed two sets of experiments. In the first set of experiments (1 KB records), the insertion rates are shown in Table 7-1, the disk space used by the three index structures is plotted in Figure 7-7, and the index lookup speed is tabulated in Table 7-2.

Figure 7-8. Disk footprint for 200 B record size.

In the second set of experiments (200 B records), the insertion rates are shown in Table 7-1, the disk space used by the three index structures is plotted in Figure 7-8, and the index lookup speed is tabulated in Table 7-3. Thus, we test the effect of record size on the three index structures.

Table 7-1 shows the millions of records inserted into the geometric file after ten hours of insertions and concurrent updates to the index structure. For comparison, we present the number of records inserted into a geometric file when no index structure is maintained (the no index column). It is clear that the subsample-based index structure performs the best on insertions, with performance comparable to the no index option; the difference reflects the cost of concurrently maintaining the index structure. The segment-based index structure does the next best. It is slower than the subsample-based index structure because of the higher number of seeks performed during startup. Recall that during startup the segment-based index must write a B+-tree for each segment.
Table 7-2. Query timing results for 1 KB records, |R| = 10 million, and |B| = 50K

Index structure    Selectivity    Index time   File time   Total time
Segment-based      Point query       38.2890     0.0226      38.3116
                                     40.2477     0.1803      40.2480
                                     43.2856     0.8766      44.1622
                                     45.6276     6.2571      51.8847
Subsample-based    Point query        0.87551    0.02382      0.89937
                                      1.12740    0.15867      1.28607
                                      1.74911    1.10544      2.85455
                                      2.09980    5.96637      8.06617
LSM-tree           Point query        0.00012    0.01996      0.02008
                                      0.00015    0.01263      0.01278
                                      0.00019    0.79358      0.79377
                                      0.00056    5.82210      5.82266

Once the reservoir is initialized, both the segment-based and the subsample-based index structures perform an equal number of disk seeks. Finally, the LSM-tree based index structure is the slowest of the three on insertions. The LSM-tree maintains the index by processing insertions and deletions more aggressively than the other two options, demanding more rolling merges and more disk seeks per buffer flush.

Table 7-1 also shows the insertion figures for the smaller, 200 B record size. Not surprisingly, all three index structures show similar insertion patterns, but since they have to process a larger number of records, the insertion rates are slower than in the case of the 1 KB record size. We also observed and plotted the disk footprint size for the three index structures (Figure 7-7 and Figure 7-8). As expected, all three index structures initially grow fairly quickly. The segment-based and the subsample-based index structures stabilize soon after the reservoir is filled, whereas the LSM-tree based structure stabilizes a little later, when the removal of stale records by the rolling merges stabilizes.
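The rolling-merge behavior described above can be illustrated with a minimal sketch: stale (deleted) records are discarded only from the runs that participate in the current merge, while stale records elsewhere in the structure survive until a later merge reaches them. This is a simplification for illustration, not the prototype's LSM-tree code; the run layout and the `stale_keys` set are assumptions.

```python
import heapq

def rolling_merge(run_a, run_b, stale_keys):
    # Merge two sorted runs of (key, payload) entries into one sorted
    # run, dropping any entry whose key is known to be stale.
    return [(k, v) for k, v in heapq.merge(run_a, run_b)
            if k not in stale_keys]

run_a = [(1, 'a'), (4, 'd'), (7, 'g')]
run_b = [(2, 'b'), (4, 'x'), (9, 'i')]
# Key 4 has been deleted: both of its versions vanish during this merge,
# but a stale key 4 sitting in some other, unmerged run would persist
# until a later rolling merge reached it.
merged = rolling_merge(run_a, run_b, stale_keys={4})
```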
The subsample-based index structure has the largest footprint (almost 1/5th of the geometric file size). This is expected, as stale index records are removed from the B+-trees only when the entire subsample decays. On the other hand, the segment-based index structure has the smallest footprint, as at every buffer flush all stale records are removed from the index structure. This results in a very compact index structure. The disk space usage of the LSM-tree based index structure lies between these two. Although at every rolling merge stale records are removed from the part of the index structure that is merging, not all of the stale records in the structure are removed at once. As soon as the rate of removal of stale records stabilizes, the disk footprint also becomes stable.

Table 7-3. Query timing results for 200 B records, |R| = 50 million, and |B| = 250K

Index structure    Selectivity    Index time   File time   Total time
Segment-based      Point query       6.2488      0.0338       6.2826
                                     9.6186      0.1267       9.7453
                                    12.9885      0.9288      13.9173
                                    17.6891      5.9754      23.6645
Subsample-based    Point query       2.50717     0.0156       2.5227
                                     4.92744     0.1763       5.1037
                                     7.2387      0.8637       8.1024
                                     9.9837      6.1363      16.1200
LSM-tree           Point query       0.00505     0.0174       0.0224
                                     0.00967     0.1565       0.1661
                                     0.01440     0.8343       0.8487
                                     0.05987     4.9961       5.0559
Finally, we compared the index lookup speed of these three index structures. We report index lookup and geometric file access times for queries of different selectivities. As expected, the geometric file access time remains constant irrespective of the index structure option, and increases linearly as the query produces more output tuples. The index lookup time varied for the three index structures. The segment-based index structure (the slowest) was orders of magnitude slower than the LSM-tree based index structure (the fastest). This is mainly because the segment-based index structure requires index lookups in several thousand B+-trees for any selectivity query, whereas the LSM-tree based structure uses a single LSM-tree, requiring a small, constant number of seeks. The performance of the subsample-based index structure lies in between.

In general, the subsample-based index structure gives the best build time with reasonable index lookup speed, at the cost of a slightly larger disk footprint. The LSM-tree based index structure makes use of a reasonable amount of disk space and gives the best query performance, at the cost of a slow insertion rate (or build time). The segment-based index structure gives comparable build time and has the most compact disk footprint, but suffers considerably when it comes to index lookups.

Random sampling is a ubiquitous data management tool, but relatively little research from the data management community has been concerned with how to actually compute and maintain a sample. In this dissertation we have considered the problem of random sampling from a data stream, where the sample to be maintained is very large and must reside on secondary storage. We have developed the geometric file organization, which can be used to maintain an online sample of arbitrary size with an amortized cost of O(ω log|B| / |B|) random disk head movements for each newly sampled record. The multiplier ω can be made very small by making use of a small amount of additional disk space.

We have presented a modified version of the classic reservoir sampling algorithm that is exceedingly simple, and is applicable for biased sampling using any arbitrary user-defined weighting function f. Our algorithm computes, in a single pass, a biased sample Ri (without replacement) of the i records produced by a data stream.
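The dissertation's own biased reservoir algorithm is not reproduced here. As an illustration of the same general idea, a single pass that maintains a without-replacement sample biased by a user-defined weight function f, the sketch below uses the well-known Efraimidis-Spirakis key method, which is a different but related technique.

```python
import heapq
import random

def weighted_reservoir(stream, f, k):
    # Efraimidis-Spirakis weighted reservoir sampling: assign each
    # record r the key u ** (1 / f(r)) with u ~ Uniform(0, 1), and keep
    # the k largest keys in a min-heap.  One pass, no replacement.
    heap = []  # (key, record); the smallest key sits at heap[0]
    for r in stream:
        key = random.random() ** (1.0 / f(r))
        if len(heap) < k:
            heapq.heappush(heap, (key, r))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, r))
    return [r for _, r in heap]

random.seed(7)
# Weighting by the record's value biases the sample toward large records.
sample = weighted_reservoir(range(1, 10_001), f=lambda r: float(r), k=100)
```

The heap makes each arriving record cost O(log k) at worst, and records whose key cannot beat the current minimum are rejected in O(1).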
We have also discussed certain pathological cases where our algorithm can provide a correctly biased sample only for a slightly modified bias function f'. We have analytically bounded how far f' can be from f in such a pathological case. We have also experimentally evaluated the practical significance of this difference.

We have also derived the variance of a Horvitz-Thompson estimator making use of a sample computed using our algorithm. Combined with the Central Limit Theorem, this variance can be used to provide bounds on the estimator's accuracy. The estimator is suitable for the SUM aggregate function (and, by extension, the AVERAGE and COUNT aggregates) over a single database table for which the reservoir is maintained.

We have developed efficient techniques which allow a geometric file to itself be sampled in order to produce smaller data objects. We considered two sampling techniques: (1) batch sampling, when the sample size is known beforehand, and (2) online sampling, which implements an iterative function GetNext to retrieve one sample at a time. The goal of these algorithms was to efficiently support further sampling of a geometric file by making use of its own structure.
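The Horvitz-Thompson estimator referred to above has a simple generic form. The sketch below is the textbook estimator with assumed inclusion probabilities, not the dissertation's derived variance formula; the names and the uniform-sampling example are illustrative only.

```python
def horvitz_thompson_sum(sampled):
    # Horvitz-Thompson SUM estimate: each sampled value y is weighted by
    # the inverse of its inclusion probability pi, so rarely-sampled
    # records count for more when they do appear.
    return sum(y / pi for y, pi in sampled)

# Uniform special case: n records sampled from N, each with pi = n / N,
# which reduces the estimate to (N / n) * (sum of the sampled values).
N, n = 1000, 10
sampled = [(5.0, n / N)] * n
estimate = horvitz_thompson_sum(sampled)
```

Under biased sampling the inclusion probabilities differ per record; plugging those per-record probabilities into the same formula keeps the estimator unbiased, which is what makes it suitable for SUM and, by extension, AVERAGE and COUNT.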
Abhijit Pol was born and brought up in the state of Maharashtra, India. He received his Bachelor of Engineering from the Government College of Engineering, Pune (COEP), University of Pune, one of the oldest and most prestigious engineering colleges in India, in 1999. Abhijit majored in mechanical engineering and obtained a distinguished record; he ranked second in the university merit ranking. He was employed in the Research and Development department of Kirloskar Oil Engines Ltd. for one year. Abhijit received his first Master of Science degree from the University of Florida in 2002, majoring in industrial and systems engineering. Abhijit then worked as a researcher in the Department of Computer and Information Science and Engineering at the University of Florida. He received his second Master of Science degree and his Doctor of Philosophy (Ph.D.) in computer engineering in 2007. During his studies at the University of Florida, Abhijit coauthored a textbook titled Developing Web-Enabled Decision Support Systems. He taught the Web-DSS course several times in the Department of Industrial and Systems Engineering at the University of Florida. He presented several tutorials at workshops and conferences on the need and importance of teaching DSS material, and he also taught at two instructor training workshops on DSS development. Abhijit's research focus is in the area of databases, with special interests in approximate query processing, physical database design, and data streams. He has presented research papers at several prestigious database conferences and performed research at Microsoft Research. He is now a Senior Software Engineer in the Strategic Data Solutions group at Yahoo! Inc.