
Indexing Techniques for Metric Databases with Costly Searches

Permanent Link: http://ufdc.ufl.edu/UFE0021666/00001

Material Information

Title: Indexing Techniques for Metric Databases with Costly Searches
Physical Description: 1 online resource (127 p.)
Language: english
Creator: Venkateswaran, Jayendr
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2007

Subjects

Subjects / Keywords: databases, dynamic, indexing, metric, multimedia, reference, sequence, similarity
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Similarity search in database systems is becoming an increasingly important task in modern application domains such as artificial intelligence, computational biology, pattern recognition and data mining. With the evolution of information, applications with new data types such as text, images, videos, audio, DNA and protein sequences have begun to appear. Despite extensive research and the development of a plethora of index structures, similarity search is still too costly in many application domains, especially when measuring the similarity between a pair of objects is expensive. In this dissertation, the queries we consider are classified into similarity search and similarity join queries. Several new indexing techniques to improve the performance of similarity search are proposed. For similarity search queries, reference-based indexing methods applicable to both static and growing databases are proposed. For similarity join queries, a generalized nearest neighbor framework and several search and optimization algorithms are proposed. Extensive experiments evaluate the different parameters used by the proposed methods and the performance improvements over state-of-the-art algorithms.
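
Note: the abstract is the only technical description in this record. As a rough, generic illustration of what reference-based indexing for a costly metric means (not the specific methods proposed in the dissertation), the Python sketch below precomputes distances from every database object to a few reference objects and then uses the triangle-inequality lower bound |d(q, r) - d(o, r)| <= d(q, o) to discard most objects at query time without invoking the expensive distance function. The reference-selection strategy, the number of references, and the toy metric are placeholder assumptions.

import random

def build_reference_index(objects, dist, num_refs=5, seed=0):
    """Offline step: pick a few reference objects and precompute the
    distance from every database object to each reference."""
    rng = random.Random(seed)
    refs = rng.sample(objects, num_refs)
    table = [[dist(o, r) for r in refs] for o in objects]
    return refs, table

def range_query(q, radius, objects, refs, table, dist):
    """Return every object within `radius` of q, using the precomputed
    reference distances to prune before calling the costly metric."""
    q_to_refs = [dist(q, r) for r in refs]  # only num_refs costly calls up front
    results = []
    for obj, obj_to_refs in zip(objects, table):
        # Triangle inequality: d(q, obj) >= |d(q, r) - d(obj, r)| for every reference r
        lower_bound = max(abs(qr - dr) for qr, dr in zip(q_to_refs, obj_to_refs))
        if lower_bound > radius:
            continue  # pruned without computing d(q, obj)
        if dist(q, obj) <= radius:  # costly distance only for surviving candidates
            results.append(obj)
    return results

# Toy usage with a cheap stand-in metric (absolute difference on integers):
if __name__ == "__main__":
    data = list(range(0, 300, 7))
    metric = lambda a, b: abs(a - b)
    refs, table = build_reference_index(data, metric)
    print(range_query(100, 10, data, refs, table, metric))

With an expensive metric such as edit distance on protein sequences, the savings come from the pruning step: only objects whose lower bound falls within the query radius ever reach the costly dist(q, obj) call.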
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Jayendr Venkateswaran.
Thesis: Thesis (Ph.D.)--University of Florida, 2007.
Local: Adviser: Kahveci, Tamer.
Local: Co-adviser: Jermaine, Christophe.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2007
System ID: UFE0021666:00001



This item has the following downloads:


Full Text
6a459da8e750ccce682393a28f0fd4d4878da0ee
68642 F20101117_AABRGU venkateswaran_j_Page_052.jpg
baab38649a0257bed615e2dbe8377744
a3e912e9b1229db7dcd757ae4e2165fc0bd924c8
91650 F20101117_AABRHI venkateswaran_j_Page_066.jpg
d4dff84bb7e3ea088fab6d4313a37616
106d7c8dd7188eec862a5a78a189cf858965bcd1
6619 F20101117_AABSLA venkateswaran_j_Page_105thm.jpg
da141a6243da4f2e48b7318ae7d28e79
80546274e223ba417c901034b50288afaa00c434
18002 F20101117_AABSKL venkateswaran_j_Page_097.QC.jpg
e9e570968a6305b734dabeca06080507
ede15f8d4c8bbf973675f597902bf814a416656a
7838 F20101117_AABSJX venkateswaran_j_Page_090.QC.jpg
e702ea4e4ae4a15a00436f00dbd1956d
7681520b0015fb6a8ff3fc9bf0c34a242edebedc
95837 F20101117_AABRGV venkateswaran_j_Page_053.jpg
d666b108b20b6bb370e3b6e5080203b9
71938861bbd92d3dcf1ebcebcad90493fc41a305
100517 F20101117_AABRHJ venkateswaran_j_Page_067.jpg
70d0c6b9eedf71220f43106b95c55d85
5f28938e36c609d28877a484a82f36363d1e44e3
29329 F20101117_AABSLB venkateswaran_j_Page_105.QC.jpg
611cbc3fa9200fddb48a8d6fb610526f
4533c23ab37c77177196dff7d1ec6d871c59f54f
6008 F20101117_AABSKM venkateswaran_j_Page_098thm.jpg
114c3dfe93ab9547f65f445d0307dc6a
5f4e52049eb3931f6e1610d88c5643c21f6aff61
6996 F20101117_AABSJY venkateswaran_j_Page_091thm.jpg
c3d0b81a92e4d6e5872b037b0dbdaeab
c44b393b11df6dbe12301c76bd8132032be8b39a
40981 F20101117_AABRGW venkateswaran_j_Page_054.jpg
e36aa09b30c1652ad69ed13a2a7bbf21
680929b7c59ffddde62aa668f8273df72fab4bd0
41914 F20101117_AABRHK venkateswaran_j_Page_068.jpg
4f5a26d1b1f4a9d67b4faf6489c6ce6f
6ce1611aa2586a832ecda878868cf3197ea7c839
5892 F20101117_AABSLC venkateswaran_j_Page_106thm.jpg
6d29b0190037b2df7f46bdb18cfe0b7b
511efc9960b4dc606d2d27e9011e29d74eea292a
26440 F20101117_AABSKN venkateswaran_j_Page_098.QC.jpg
1f6b3a0d3d6af97eb85c3e17fcf4b21c
ca9313ee9138ca233c67e9a4b464f20b932de997
32100 F20101117_AABSJZ venkateswaran_j_Page_091.QC.jpg
91e1b66b8b2940560d44c77351e04ecb
d1a8f302f786d88850329ed4f56869e6535aa48e
91125 F20101117_AABRGX venkateswaran_j_Page_055.jpg
4b91a4c2da4b6498d4ecfd2eb41e8fca
cb72d41b4a5d5df66d9a297200cc8b7c3e04ef6f
44535 F20101117_AABRIA venkateswaran_j_Page_084.jpg
4979d00f8d5909f5856b907f01e24710
c57f6720dff01f56f4551daa99408183c9ed9b3b
72965 F20101117_AABRHL venkateswaran_j_Page_069.jpg
001edd275d22a69ea3e3decd59a01011
1d61466754802a007e31c2629b59eaa3e6ae9ac0
24215 F20101117_AABSLD venkateswaran_j_Page_106.QC.jpg
2c19f246dfbdf3583d61d24329691f18
9ca780d07c800fc3af87c3ea9244eeb4e6c6c194
6493 F20101117_AABSKO venkateswaran_j_Page_099thm.jpg
3c17808ecd2617c0ff68f6301112da52
aa984c9510defaa136056f5bdb298fd53f8e3f08
88987 F20101117_AABRGY venkateswaran_j_Page_056.jpg
0001c1f9eef7781ce600662c3e36fdac
6b8550aead5aa6794310478c7f31b995d02e2036
68978 F20101117_AABRIB venkateswaran_j_Page_085.jpg
453eb5b45c3485a0e14df692bc41f4f6
ac239e75aeb60a6e95ef4c1e0da191a88985876c
63371 F20101117_AABRHM venkateswaran_j_Page_070.jpg
2bdf6a61ba97181667e845de2b4a5671
de812d8e0799741b706d16414eae016c1105eb7c
28664 F20101117_AABSKP venkateswaran_j_Page_099.QC.jpg
ee4ce79e97154e0d563b0b32f0cca202
760a01a5f015da0c61fc2ad6ce5b50a76830a3a5
89825 F20101117_AABRGZ venkateswaran_j_Page_057.jpg
205f095b2798ce2a4c7cb16ed40c6cd4
8492f6403d41f4913dc27b061098bd2a6c2ade54
43392 F20101117_AABRHN venkateswaran_j_Page_071.jpg
723f115fe2777d5b39acf6ee0f481b51
9c487f79c83570cfdc8de7a45172d81411fda77e
5429 F20101117_AABSLE venkateswaran_j_Page_107thm.jpg
da7194ab0b0e3b1920f5b9f1dbb2464b
1bbe5c4f109ce37d5aa3a55c9fc24fc861ced61a
6568 F20101117_AABSKQ venkateswaran_j_Page_100thm.jpg
300aaa0b41c3af3347800540af329e15
1c28c683a14f3560692e4dffcf88aa505ef7a789
51229 F20101117_AABRIC venkateswaran_j_Page_086.jpg
b744b2696a1e1f9a41e450f063e3e2bc
e4090efb3a2873cd719d8732ed6ebfa70671894d
43266 F20101117_AABRHO venkateswaran_j_Page_072.jpg
65634779e26bb060949ecda65f140daa
a454b6bb0602222ac2f1f9b159d305f0cd868f07
22623 F20101117_AABSLF venkateswaran_j_Page_107.QC.jpg
62e2fb3d35f13d0f9a38dc3372baba27
5d2933e268d7f5015b1397a3698df2cc0af7842c
26679 F20101117_AABSKR venkateswaran_j_Page_100.QC.jpg
7e1541a95d93e70be7878c6c27fc57f8
98f235b808a5682cf4ae939c2cfb3223bb2f866f
57989 F20101117_AABRID venkateswaran_j_Page_087.jpg
6289e0f358d9adfeaff2ee7a6f6ca2f3
9ba795238b561b780b2e5c8fc6dd4264dc8cc970
70010 F20101117_AABRHP venkateswaran_j_Page_073.jpg
b0335ca6d395ab33bfd38fe6b3170ace
c6736e0a57a0c1c5c7b4e538a073d6ff56206750
5803 F20101117_AABSLG venkateswaran_j_Page_108thm.jpg
72ef1e9e8f79de91c4bd8d875a0a6d29
cfb57ae5968a2b6778c546d7a743b26bdf392806
6185 F20101117_AABSKS venkateswaran_j_Page_101thm.jpg
8dde742af7056737425a3b4d371be35d
c05cb4fa19bd09ae34e61e85e20e6c4cf1702919
55075 F20101117_AABRIE venkateswaran_j_Page_088.jpg
6c0e1e000b945c124281f2babea6e69e
38f50984dd5780d205417b22e8ab0269b76d97e8
74058 F20101117_AABRHQ venkateswaran_j_Page_074.jpg
67142f959b585a02b1002d6416e72f5a
65b8774c41f46ead390dc091e07d924de947702d
23795 F20101117_AABSLH venkateswaran_j_Page_108.QC.jpg
99b6fad2cc21574289eeadd71d30262d
806e2332dd4ebb5a722b10047b84c5027af28b37
26258 F20101117_AABSKT venkateswaran_j_Page_101.QC.jpg
c3e63b76d20b9cbbe2f20f3d8d2669c2
a430080339e600a57c13febe85874090b8b24971
63718 F20101117_AABRIF venkateswaran_j_Page_089.jpg
d15ed11cdd1fb8905d85cd15f55be717
4f5bba357833fea26de01771ad9d5a4b9f5e64d0
17585 F20101117_AABRHR venkateswaran_j_Page_075.jpg
a185bf1754f2ef8c10aef4dc9c854755
25cf7dcc795a616d352056537d8bfea2e73c33e0
5828 F20101117_AABSLI venkateswaran_j_Page_109thm.jpg
fcdbb4b3aed6eb855cad50e765e9d33f
0fe9eaf99eef78cf75738c0e7b7692f0ec9eb226
6963 F20101117_AABSKU venkateswaran_j_Page_102thm.jpg
03f540a0bcb47950827c7f652f29636b
933491a0f721e50a80370a8ba7ede9cfa8771c7e
23318 F20101117_AABRIG venkateswaran_j_Page_090.jpg
5ef35704ccb78fc287862ba3c837c824
bcef3d4a0a4e0dfaed045d608c1438ea7e9494a3
99321 F20101117_AABRHS venkateswaran_j_Page_076.jpg
ad76d70fd73b3e926a9d45b34fb1b6c3
9c8bcde032d77919a96ba2f990fef9b96abfc72b
23536 F20101117_AABSLJ venkateswaran_j_Page_109.QC.jpg
9ab58412355cdeba590dfdc77276b42e
66ad55e4890454d1bf0f880806931f792ff9a0f9
30913 F20101117_AABSKV venkateswaran_j_Page_102.QC.jpg
127d2afe3f09d2a0ea76410f91805e48
872e982902fab5d26037214d0d65cda4d4f5eab3
103783 F20101117_AABRIH venkateswaran_j_Page_091.jpg
e791f2648755d9ce0aae1eb71b336f09
080768f3db903983a6a2abdd97e9f6b9f9782029
95352 F20101117_AABRHT venkateswaran_j_Page_077.jpg
c67e5bdd0122c4f6904ff86123aab589
bb3650f2da1bdc3031b240385f67b0ce4adb6df0
5805 F20101117_AABSLK venkateswaran_j_Page_110thm.jpg
c2b611b2b6839a25bcad95e9ad8f9414
257ae8bd0c455697f35b7527e1546abbdd8f41dc
5908 F20101117_AABSKW venkateswaran_j_Page_103thm.jpg
f72cbc392265291f54d80ba623d16c44
80f8819c75eb5ecbdc430c5eecbae74dea4c2cc2
70184 F20101117_AABRII venkateswaran_j_Page_092.jpg
932efbbe667938728336351acac56529
04de9f96110d1c69bced9acab7f2f587e923868b
63056 F20101117_AABRHU venkateswaran_j_Page_078.jpg
fe41c0e2249836d43441a6510032b0e0
0338f7c1372630c20bc36961d1442555f538c25c
7003 F20101117_AABSMA venkateswaran_j_Page_118thm.jpg
d3326a2187be2b882286a1e84924a6d5
11ab03018446d762137cf14bb3fcd09f55a3dfed
23935 F20101117_AABSLL venkateswaran_j_Page_110.QC.jpg
fe74ea377796b2ce73cdf6268f85ffcf
6bae44f52b9f65f6b454f1c223ceb8c4bf578f8b
23507 F20101117_AABSKX venkateswaran_j_Page_103.QC.jpg
d030f621b2866f7785480e68156ad235
6d22290d82e98b593aeb954c5f7060944c4c0c63
96884 F20101117_AABRIJ venkateswaran_j_Page_093.jpg
813313ccca353ccfd9325060f131473c
20627ce47347169bc63b21f0d30bdfc5bed48a60
94818 F20101117_AABRHV venkateswaran_j_Page_079.jpg
0a58b83b222d84ab59cdd5c06f96679f
d284cb59dbb97c9697c80d2cf105b4c464014715
31935 F20101117_AABSMB venkateswaran_j_Page_118.QC.jpg
19d1cd1dbc55e34583e0db312c8b779e
23be91f196a868d66d63c6f75459e376046c17b4
6017 F20101117_AABSLM venkateswaran_j_Page_111thm.jpg
979fc9ffa9ae051612ba19f846cbe243
e4b9888e325d115d53ce19cc0100bcd71164c5b6
6234 F20101117_AABSKY venkateswaran_j_Page_104thm.jpg
7501395bffca001a976e27d981ce616f
8e3c4f3a0784cdc193ed10f45cf31bc019d62c4c
98477 F20101117_AABRIK venkateswaran_j_Page_094.jpg
6ebe958a697befd597c05ac7699ae5b8
82e9d9353fb5b7faae1527aa2240b9275d14bceb
60118 F20101117_AABRHW venkateswaran_j_Page_080.jpg
19d9360069b092288a607bcaa149eed2
1d1c9bf781fcb028a2821266f3771c2045820569
2062 F20101117_AABSMC venkateswaran_j_Page_119thm.jpg
76a0a1c76d52fa448128e65849a3cc35
5dd38b62abb3d7b0bbe30cb73bb18dc250fef2f4
23155 F20101117_AABSLN venkateswaran_j_Page_111.QC.jpg
02755a370335c19be230ea689d62319b
c73810dabf75673d0cd9846b74f0abbdfbba3f81
25666 F20101117_AABSKZ venkateswaran_j_Page_104.QC.jpg
11caec245abaf0388c1428861f3713dd
3c30059201f67297cb1bfc18fa1722de31889ff5
74888 F20101117_AABRJA venkateswaran_j_Page_111.jpg
d4229375744c4fa6edcb8643f99c5c24
3a647ce2edd1d0a76a345d6e7076290ef0c2f40b
98457 F20101117_AABRIL venkateswaran_j_Page_095.jpg
1c0b8ab59eaaf2d220115832015ae301
8da8a77be23236c4ee21bd0c06da437f8c25efd5
89203 F20101117_AABRHX venkateswaran_j_Page_081.jpg
a27b0122d0bca21b8ccd21cbd38d30ec
7b2db988daa6e637d080f63439d45a98347794ac
8996 F20101117_AABSMD venkateswaran_j_Page_119.QC.jpg
4b3c409cb315c50ac2f7957611c6dda0
3d12588ea599ce1d8c3c804f3be2851fb6bde9f7
5947 F20101117_AABSLO venkateswaran_j_Page_112thm.jpg
78e0ded5c8853990b5ad1fd3dbe07a89
9e08f6c0aa28c73d8e12b733c361b5b420b1392c
77688 F20101117_AABRJB venkateswaran_j_Page_112.jpg
9f3f35acf0dd702cd4ca31b94aec58ff
23d8f13a8e7eeb0fc627323bb98b690fa5deff52
71383 F20101117_AABRIM venkateswaran_j_Page_096.jpg
ae448188f1e451c3d93405f3488e3fed
2763e22489a5924c9d71e184f675927db25830fd
76316 F20101117_AABRHY venkateswaran_j_Page_082.jpg
1d6cf5e1c7fc41d1d0f1e56249f40a61
b41520b854818545039529d932caba9c7558fe20
6928 F20101117_AABSME venkateswaran_j_Page_120thm.jpg
0bf5c24a8215547bb0570e07d5d76668
9c1d31e2ea0d3bfc2d24da1f6ca64de2c569d6e9
24789 F20101117_AABSLP venkateswaran_j_Page_112.QC.jpg
3077576afa9193cada9e0b1973956e4c
b916b679c640052e2624bfaae025ffd77ae1729f
80384 F20101117_AABRJC venkateswaran_j_Page_113.jpg
ff67f5aeec6a02a05c993bfff379fb27
176c3100fa0cf7bf4c65026f882feed43098c765
56458 F20101117_AABRIN venkateswaran_j_Page_097.jpg
6b9609819ad05895faceba8c85f0579c
c25a5c71d5799914b80bf86baedbeaf03171acb9
67899 F20101117_AABRHZ venkateswaran_j_Page_083.jpg
0921054b0981f6cfceef4d81c2ae9be3
13fef3a9747484478d87256c087b84e661c63d73
6189 F20101117_AABSLQ venkateswaran_j_Page_113thm.jpg
3584a96d2585d2debdb1e47028149e25
6d5fe03f480ec0727f716db5d25592870acef7a0
82592 F20101117_AABRIO venkateswaran_j_Page_098.jpg
8f0da527410935ab4e9f82174c2ac580
ef17fc796ba369189cc25ae6ae979a6c21f6bf53
30040 F20101117_AABSMF venkateswaran_j_Page_120.QC.jpg
7c0987cc585a67331feba7a7e0958409
f41603e77bf3042fb44c704bcb974c4196231d2a
24779 F20101117_AABSLR venkateswaran_j_Page_113.QC.jpg
7c53b5862f8d01c04aebd1485830fea6
0947ba6eacd0ec2d9cf69465ccabbe33d1b59277
80898 F20101117_AABRJD venkateswaran_j_Page_114.jpg
b0bdfe709d720182b84457d39f11b6a9
710037ed00f542bea7a55e65045f631dd4b86a3f
92617 F20101117_AABRIP venkateswaran_j_Page_099.jpg
a578eef5d4bf00e430ee120789da5fb9
09d4779456edf9f92d24ffc759704a040d1c98fc
6668 F20101117_AABSMG venkateswaran_j_Page_121thm.jpg
ce8d591dcaa8d3269f98243c570f48d7
e18c92dcf80db66e31df4935856420fcf1161be9
6079 F20101117_AABSLS venkateswaran_j_Page_114thm.jpg
c50f819dd914bd77a97cf579d515a357
6b3cd338d8d9f0d68506056282d5708dcc603a83
72688 F20101117_AABRJE venkateswaran_j_Page_115.jpg
d08350da38b031b495de5b778ee9ac4b
d251dc2f60b049dc410bbd5dcbc357abc4f6bea3
88148 F20101117_AABRIQ venkateswaran_j_Page_100.jpg
728487f9296e7b9f7b9c9abfb7367df6
76f3fb0fdac859e83ad84e1f7dd3915e55a08fe7
30613 F20101117_AABSMH venkateswaran_j_Page_121.QC.jpg
8a1c664b7ba158a63c319c5f149563f8
06b66410a3691ece630adc103cd3f77853a04758
24954 F20101117_AABSLT venkateswaran_j_Page_114.QC.jpg
2654ddfa2ed9e9273d4f5e0e0ecffbda
e6393ae6a0725e47f791e939fea0d2ec431f0cc5
73257 F20101117_AABRJF venkateswaran_j_Page_116.jpg
44398bee57a8b64d82f60ee11d81701b
380fedafa670e259805da772bf70fa6b485ce1c2
100847 F20101117_AABRIR venkateswaran_j_Page_102.jpg
88703e9a2a97eb1245e10d428d899222
37b1bb8cfcd58ca01f49c22884e6780fc7e2f4fc
6953 F20101117_AABSMI venkateswaran_j_Page_122thm.jpg
8ac8e365133242345d5f7ae28c6038e9
f9017d71619afa097ccc041a06cbb99a6d99c111
5933 F20101117_AABSLU venkateswaran_j_Page_115thm.jpg
c8f253af6e3eeb7d6ce925e957767ab7
3f1137f32f1d6410efd9095aebb506dcb50cf3c1
98164 F20101117_AABRJG venkateswaran_j_Page_117.jpg
f3371dba5d3ae2edf9a7f18f03d00e86
57d67c542a5f2157d031e86a5d1775ce3d7994fe
74357 F20101117_AABRIS venkateswaran_j_Page_103.jpg
60ebd41208e53919e8bbd31db1464bfc
261df689f369241f1376e1fdaf0ac192cef7d105
30318 F20101117_AABSMJ venkateswaran_j_Page_122.QC.jpg
9a6fe86485528e99c0b4d3f26f199e9f
82844e7f6066efba91f0344a1bc7261cee59ffa0
22920 F20101117_AABSLV venkateswaran_j_Page_115.QC.jpg
d06e3437914bf33422129fc6947ed603
ecfff48a6e25334fc63f302400e2dc8099e3e908
104796 F20101117_AABRJH venkateswaran_j_Page_118.jpg
03a83ed4f1c2730ce48d681cd1e53379
25c1e67f9d085d5f5148f717b835feceb2d3436c
82362 F20101117_AABRIT venkateswaran_j_Page_104.jpg
688999ac0d9cbcdc072242b8ace04d0e
23c74fe8dcbf3b929e7ad392a6e331eb1461b439
6959 F20101117_AABSMK venkateswaran_j_Page_123thm.jpg
fb1c35c2dabf53392cb81397295deee5
6db3f93b23d0e2971bf0f2a0f5b57ee26527d418
5635 F20101117_AABSLW venkateswaran_j_Page_116thm.jpg
740a590912e867d6c20764987e469f7e
60a9425b3b2fed60515a5c836f1e9181778a6335
28115 F20101117_AABRJI venkateswaran_j_Page_119.jpg
acebaa997f7932f512fe86177b32c011
d276e9f175560b6be0fac573cefa4eb7b554107a
94653 F20101117_AABRIU venkateswaran_j_Page_105.jpg
7eae063ef0763bee1b356a068f181f70
39190f5e26744d86185065f820f11ce7dca8dad8
30485 F20101117_AABSML venkateswaran_j_Page_123.QC.jpg
de5c9b988e8fc3e0de991e303f294b0b
d416219f7bfeff5636be54468f65f05e5ddb0df8
22294 F20101117_AABSLX venkateswaran_j_Page_116.QC.jpg
0b4a43a32a6b0d62741ea3b5c65d463e
1cfd420b623c08e9dc0c110941a03c6ff9a4efc6
106592 F20101117_AABRJJ venkateswaran_j_Page_120.jpg
8035ba58c0ad909b4b038a780e76aae1
288d5baacfbc2ead4f52d41c07f65ddb49a43073
79567 F20101117_AABRIV venkateswaran_j_Page_106.jpg
910dc650bab79424812e3edf942f5351
24dc0324b9a2576aef40024c44076ae01d7dbbdb
7109 F20101117_AABSMM venkateswaran_j_Page_124thm.jpg
60a2baa3c809f177f163a948a45e3c34
454a8850a5c4386e821e7ca311bf9a56dc6bb3f1
6711 F20101117_AABSLY venkateswaran_j_Page_117thm.jpg
f22021e9f5ffe92bf4704f2450107c92
437f2a76017e8b647cffd7c3b4c1506d584721e1
109815 F20101117_AABRJK venkateswaran_j_Page_121.jpg
6b081706dadc372268a4f6415a749298
2822be7f4ac4c570b118adff3667519d4cd839b5
72429 F20101117_AABRIW venkateswaran_j_Page_107.jpg
30b2320e5960d82559f00d03a77eb172
2a69ecbf3819655fe3b6b34f5a006249328ca4d6
30587 F20101117_AABSMN venkateswaran_j_Page_124.QC.jpg
d1364f6fa7287c9c87feeb70d70c61be
352c63abcf60db1e794c3fb72dac65d895c85e78
29745 F20101117_AABSLZ venkateswaran_j_Page_117.QC.jpg
513b56513d898360d4fd877fbb750ccf
25f4b66bea13c2721ce14bfb0e458022351dfc7f
106953 F20101117_AABRJL venkateswaran_j_Page_122.jpg
7a4aad77a8f070de33a8f2d395750aed
3227a51b3f7a38665f7d58d5a64c6ac7b873cc11
F20101117_AABRIX venkateswaran_j_Page_108.jpg
7f5b57204efa5a8e29d7662959795031
2bda7c2f7690d885dd71109ad454583d14bdb284
1051964 F20101117_AABRKA venkateswaran_j_Page_010.jp2
f57c63e97772271bcb33ee570d1c1f7b
b1269f7536a6ac5f3fe9fd2f6da4c68d10249f2d
6984 F20101117_AABSMO venkateswaran_j_Page_125thm.jpg
e83292dc2e652f6a07b1e303fde02366
78728e9d9686333b5530825a40de3094dbf6974f
106454 F20101117_AABRJM venkateswaran_j_Page_123.jpg
c9de0aad0a64eee1b7cbc1b52de9a231
6b70862198a63f9757c9e162be4f06d3eb86d7d4
75789 F20101117_AABRIY venkateswaran_j_Page_109.jpg
b008dd3664cb0ec0c3a108d8708de69d
c322d37f5086798bc9bd478d4a009bed198268a1
990412 F20101117_AABRKB venkateswaran_j_Page_011.jp2
6f2b042e53f32723c53ae033da7bf5b7
743fb3b9fd8a6236888404cb46d7d942c2ae9d1e
30247 F20101117_AABSMP venkateswaran_j_Page_125.QC.jpg
6ac5471de893a7579fb76b8b480c6543
37baad8be8f0f79b83300dccb428d1dcf0920e06
108755 F20101117_AABRJN venkateswaran_j_Page_124.jpg
90d02c42a4fb6be7ea991b260647a12b
4782d8d98bce7990a10bf6b5ca551086ce21e0f1
77668 F20101117_AABRIZ venkateswaran_j_Page_110.jpg
fc652ceb658a9abd5b32cf1d8906c5b1
a60c0444dd0725cceca88de127c5b09f649ae86c
89772 F20101117_AABRKC venkateswaran_j_Page_012.jp2
f444e8698f0e070f12fb39848faa3d0c
f075e306357b38bb2277d9435b1a789fdd0ed371
2715 F20101117_AABSMQ venkateswaran_j_Page_126thm.jpg
682f44618de0517fe2c607abc854ba20
4981f8989587894e50d2390c807091b6d87ef1a0
105864 F20101117_AABRJO venkateswaran_j_Page_125.jpg
97e3177ecf1e8a0f29839d961112c7a3
3aee53e2cd47987bb0376de885e6beb91d0e8cac
1051924 F20101117_AABRKD venkateswaran_j_Page_013.jp2
4c6bdb8229de1f91ff2a9663d5c29bd6
aac599f74f32605e784beb94065b9ff92d0aa5b6
11412 F20101117_AABSMR venkateswaran_j_Page_126.QC.jpg
df9e68f6ff7708f47c3d586d9b4ff68a
b6bd4838bd99bc31892820db02b4bd405fab24fc
38876 F20101117_AABRJP venkateswaran_j_Page_126.jpg
b665906cc9180f44d53e6ecdeb82289b
3dc2a2d3743e120fc7157608dbdcbacb4b654e5a
4530 F20101117_AABSMS venkateswaran_j_Page_127thm.jpg
8c2d19c669d8d8edc94c318873110be4
f59b85fff03ad4f7ecba82c671821e8c0bba7bb6
729673 F20101117_AABRKE venkateswaran_j_Page_014.jp2
dc7aaac04f0a4bc55266db00cf045715
d05ad3335debaeba40d755eaf763bc60ddfd9f61
67845 F20101117_AABRJQ venkateswaran_j_Page_127.jpg
0779d74f372abeecb6fb90885a641abf
43503cffb89eddfdf14fc07e92b94e53064ce3f3
20338 F20101117_AABSMT venkateswaran_j_Page_127.QC.jpg
0dcd73020e391454141a64afe99a6214
bdc3c54815de7ba19d71cebccb34e044016450de
1051952 F20101117_AABRKF venkateswaran_j_Page_015.jp2
b717e54ed2857eb36a39d354feb2a6d1
6a42cdb6bef48ce0d2957b9def77ce946b407f1f
24700 F20101117_AABRJR venkateswaran_j_Page_001.jp2
71840241d2846a9d592f262de3d8e144
e2354d4174de27f94b9e483e12221739cf6f9729
151598 F20101117_AABSMU UFE0021666_00001.mets
636298728d34787c029660c92318f321
eda763b8f100f8c778626e5b1f53109d91814093
833588 F20101117_AABRKG venkateswaran_j_Page_016.jp2
0e74738f21df3cc4ac49e65114ee0f2c
c51c9aaf0f49b11807138bc2d8a1ce1683cc3b3d
6287 F20101117_AABRJS venkateswaran_j_Page_002.jp2
2247ffd19191fb5e6c3b6a5a6b81c60d
89c47e7d4e2304e52554f5293a720e9e717387ee
1051976 F20101117_AABRKH venkateswaran_j_Page_017.jp2
ba3ff11898f422398d57142e93a1260d
999133c1c61613971618e61eda5d4752f1f4abb6
4950 F20101117_AABRJT venkateswaran_j_Page_003.jp2
4858c179029380562d9e307ce67f0aa4
d5bb480f680033c8876b145abc44590d9cee29a4
1051978 F20101117_AABRKI venkateswaran_j_Page_018.jp2
8e4fa887cb531767deee75aff8b9dfdc
740eb4e0d151c740bb360fb7fa5542fc1df3f138
46242 F20101117_AABRJU venkateswaran_j_Page_004.jp2
f5016ba9984be33b6ef85d1cde27ceb4
4a3508af2ccd6e378c2b21cd1e22d088c85dbdbe
1051986 F20101117_AABRKJ venkateswaran_j_Page_019.jp2
d0a4f5ef880ad5bbac464a872c0421ff
8f81dc29e2d86de19af8de7b66502f6327c14999
1051984 F20101117_AABRJV venkateswaran_j_Page_005.jp2
4b78a5e54139b6fae6e8b5d0f550e654
0ea239d6096bb9515e3c560a1a6da6ee04b0ad4a
122393 F20101117_AABRKK venkateswaran_j_Page_020.jp2
98e1e394d999f93d30e7ab1ff23c3c03
3fff5c634edc8b7682b8a9080ef4742c8f3a767d
1051962 F20101117_AABRJW venkateswaran_j_Page_006.jp2
42267b742a8639fd15a03587f12a792e
76fde0bf4e2645c5260240cac023802aea6c2a03
1051931 F20101117_AABRLA venkateswaran_j_Page_036.jp2
f84aa0f49a0ba0171e1a6fcdbbd46457
d36f31c50b37300d4677ee7581f80ca2c10d83e0
1051934 F20101117_AABRKL venkateswaran_j_Page_021.jp2
fee8fc0d1dea83cc83f00469be13fdf6
6473a8bdd761d871919cd92b786b2f13670c7c12
1051968 F20101117_AABRJX venkateswaran_j_Page_007.jp2
6377808ef861b199d1d6527a9ada8c29
7fa4edf77aeea014d93e3374b796cda22b5cdcab
1051983 F20101117_AABRLB venkateswaran_j_Page_037.jp2
e4f3353ea88cf023cc063a299df6f4e1
42af85c25585f04ed29ba57546bf3b672319db85
F20101117_AABRKM venkateswaran_j_Page_022.jp2
f0f368cfbc14816f0e523c617bdbe641
9bbf71ed95fc8da35f8262f5726520b65d76e287
386645 F20101117_AABRJY venkateswaran_j_Page_008.jp2
561d407a43f5cd29b4f682597e924c9e
5bbe151121015f32c7545ee8b8517331b4c6987f
1051981 F20101117_AABRLC venkateswaran_j_Page_038.jp2
cfaf398a79c04f175d630e3c49d57125
5ef7f0a548763abd3045367501aa386f0cf9674d
92427 F20101117_AABRKN venkateswaran_j_Page_023.jp2
a7feadf82ce140900535142bccefab43
5b565d7bd30ec27fda0f2ee2b954abc5ed297f7f
1051969 F20101117_AABRJZ venkateswaran_j_Page_009.jp2
8c58c98de74339b1c76cff2cbf58f424
3392935f6f3f2c3ab0a7b3ecbccb953ff2d6fd3f
1051979 F20101117_AABRLD venkateswaran_j_Page_039.jp2
051dbbc0504767011178295be5d32c09
7108356df4689747a2df2b94daa59e4ee75943f5
1019910 F20101117_AABRKO venkateswaran_j_Page_024.jp2
81e4c34c2088f7e4e633093b8ec4e9d6
d638269b74ffbe004924ce1c80b3c7d9787a816f
F20101117_AABRLE venkateswaran_j_Page_040.jp2
d2d293aeee14b4bca3e0c162012c8b3e
7cb5de43c20d2ccd50daeba48d427b3246fb6f61
927677 F20101117_AABRKP venkateswaran_j_Page_025.jp2
44680f917024bbab4b039a04e6d1df88
fb61c783d107ef2090dad365e12d44b9de87586e
18611 F20101117_AABRKQ venkateswaran_j_Page_026.jp2
3b883662ef93a02b68ba389926f14f90
50c36c9ee3fafde78f42ed5151eac26a6d79315a
885856 F20101117_AABRLF venkateswaran_j_Page_041.jp2
ce802c352491a095dd3b783629a36e3b
80a2c3f340fcaaa661a665b1e8cf69149910c5bd
1051974 F20101117_AABRKR venkateswaran_j_Page_027.jp2
d12050bf5fd2f79133fabbf9703f8aba
081925b3f67428431dbeb4fe8e43c98719a7034a
981450 F20101117_AABRLG venkateswaran_j_Page_042.jp2
2e1ffe8fa8c56850d894117596249395
690bba16263fed4ec68057cc53c7695f0d4410d8
917407 F20101117_AABRKS venkateswaran_j_Page_028.jp2
1cca4fa5642aad600a2c877ec2276cc7
aa574e294c1f8442f7df10a2b11cd4db5de6f655
920203 F20101117_AABRLH venkateswaran_j_Page_043.jp2
712e87100d316337df620c6d6fe64a0c
9f92c39800a8e5928edb22fd3f1b3e592b518b78
1051938 F20101117_AABRKT venkateswaran_j_Page_029.jp2
8b3d378d95bbff998951b19c1b6830f9
8ea3f89c1578b560c6b190763822bed10e3c163c
1001263 F20101117_AABRLI venkateswaran_j_Page_044.jp2
cc91bbe5ad192b00d858e5537dbbe503
b9d82f18e5d41240b5843a78080b8752d79bdac6
1051982 F20101117_AABRKU venkateswaran_j_Page_030.jp2
5274549ae1ee5674b287ba23fc8389b6
7349855b7152e9acbba38401a652f71ca368f702
1051927 F20101117_AABRLJ venkateswaran_j_Page_045.jp2
a3f0f95b31cf91a3981fe3cc3665410c
83cad831e5401952112989aea8cae08ca127f243
1051960 F20101117_AABRKV venkateswaran_j_Page_031.jp2
4dd742767de2506a3b3851e8d3f3666e
143a4e7ea2c9e16dad2c1947d5f5b4a44f03c005
85969 F20101117_AABRLK venkateswaran_j_Page_046.jp2
022daa5a35935dbdbd2524db39751512
b82310d96c1ac7a5975de99c5af49cb934db4171
F20101117_AABRKW venkateswaran_j_Page_032.jp2
ac1b455e053ecd72d5b75f1531f16c23
843bb54236871d03e485ba6dfef732f3f7495ae9
F20101117_AABRLL venkateswaran_j_Page_047.jp2
1968f7f2098e50e043fe21cbd6fac3e2
89cdb8dd8e9657a6b395ae8609e081aa28d4f4e4
F20101117_AABRKX venkateswaran_j_Page_033.jp2
314c6de26b5ce58305d366a70d470d0e
efff3698363bda23250bd742761435874d686542
897879 F20101117_AABRMA venkateswaran_j_Page_062.jp2
78dfde58ff091179717207b0c2fc0125
9af3afcfe8006abdf2f81fc7ddf6e16f8337d909
1044358 F20101117_AABRLM venkateswaran_j_Page_048.jp2
cca45a59bcbe4e8e88b965e5aa0608a4
3beb4a23423457a174a8b242eafd2fad6a45eaef
1051906 F20101117_AABRKY venkateswaran_j_Page_034.jp2
68fa3e58750910cf19d7febec9562122
9117268bb3734aed89f9dfd9316d3a2684728a2b
771420 F20101117_AABRMB venkateswaran_j_Page_063.jp2
da9f15c150616b4a1ff7e88b341682e6
ed3f493ca1ad508b0f0ce86035f476a2d78fc6c4
F20101117_AABRLN venkateswaran_j_Page_049.jp2
7c8b04630acd8af7460bbaaeb27aba34
5959a6287fbd6219032ae1f97fb6323427b45371
112937 F20101117_AABRKZ venkateswaran_j_Page_035.jp2
5e1d31a6ce620883d786f135b918612d
7d9ee91df8031cb24d05e808bb0cb4356094f137
934987 F20101117_AABRMC venkateswaran_j_Page_064.jp2
f9c6da83ff7c17de026aa56bd327dc1a
0bc7b03df91344b766f0aa87888e006bad4bc7f7
937517 F20101117_AABRLO venkateswaran_j_Page_050.jp2
0689d9367b04a89bc1fe75bc166ec655
30844ef7be254f66a2f29e06f6784ffe3745b606
426502 F20101117_AABRMD venkateswaran_j_Page_065.jp2
e2eb4c3c4f1e4fb4ca6bdbe5771f1ce0
f469596d3e1927a8c2206fa61fb818ae68a3f4f8
1051957 F20101117_AABRLP venkateswaran_j_Page_051.jp2
edc03d348401786e9846d19a229f1032
0742a7e8b30bc2c68ebc54a1b32f6fb9f76895f9
1051948 F20101117_AABRME venkateswaran_j_Page_066.jp2
0195a7fdedf938ec54eb3956d9eccac1
af489a460d6e6271a956d0dd329ee6d6aa6b6201
878878 F20101117_AABRLQ venkateswaran_j_Page_052.jp2
2f8bf58f3aafda3b42d01c38dd1cc564
00406cd00d5144ac3f22402002d17b4f583e2fe0
1051980 F20101117_AABRMF venkateswaran_j_Page_067.jp2
06be621eecc3c0f0edaa8b765e04d82c
8762953a51bbc1b55e09766ef9f47048a1a15add
F20101117_AABRLR venkateswaran_j_Page_053.jp2
df77025ada9d796df7716d38fa621383
bba221df3299f79b138da9f79ce2a130046ff28b
50014 F20101117_AABRLS venkateswaran_j_Page_054.jp2
557490beafd17f313e0a471197e12e28
75b5501c2008ba7a477a7c5772dd0c0da795c9d4
418363 F20101117_AABRMG venkateswaran_j_Page_068.jp2
4fb13ceef005b1564088c32c41c1b4e6
1ff4add58b0964afd76e7ced6fdb24564448be1e
1051939 F20101117_AABRLT venkateswaran_j_Page_055.jp2
64050ddb0370f909217d5aa5ecb3fb81
9ca0766244b0c27519f0842e56b8df2d11d158d8
884795 F20101117_AABRMH venkateswaran_j_Page_069.jp2
2f7f217c32818880c93e09c01b31f96c
9cc70cbbff0acaa14ae841312478c26f57c323ba
1051973 F20101117_AABRLU venkateswaran_j_Page_056.jp2
1b25f132a9f3376ae409556e1d37bddd
5eda68bb0a3150d1d8029865a77150b0d62d26a1
727558 F20101117_AABRMI venkateswaran_j_Page_070.jp2
7b6e4c623c3da65133bdf7dc552b6803
5bc40a9d2b9bd492fa48b4a67fc0c893b9b030e6
F20101117_AABRLV venkateswaran_j_Page_057.jp2
a039ddfc7fc2b081347ba0d68e08cbb8
3e9fc61ec3edf5419ae0f5c1ffdadafabbd2776e
405821 F20101117_AABRMJ venkateswaran_j_Page_071.jp2
cebd7a97da7379f9b81a03acf2b0cf94
59128446f4490f9219ce66505804cc0842353389
1051928 F20101117_AABRLW venkateswaran_j_Page_058.jp2
d09218402f2b8e49de2ad992a28f6e92
c86a5cada979cca88d365f9fbaf05a1584837236
421501 F20101117_AABRMK venkateswaran_j_Page_072.jp2
603011477062ae3b70229e4b353f73de
df676d5038bfbce64060385d9213fdc1335e6804
902001 F20101117_AABRLX venkateswaran_j_Page_059.jp2
b5ea2fe1ee1d7eb8d10ab362c5843ecd
b18a4999ebdb810b75abef7140bf81e292a2774e
705835 F20101117_AABRNA venkateswaran_j_Page_088.jp2
f9467db78665c6431d3151f2706b0155
fbe49b8a6945b5c734bcecc03c122d06e3aac3e9
805394 F20101117_AABRML venkateswaran_j_Page_073.jp2
aca3c6053d6d028d554cd0b456120917
5eeded77e95a196d42f2d13e2cd9a071b6a37c99
440445 F20101117_AABRLY venkateswaran_j_Page_060.jp2
24e6d47d92aac810e4850a0fd93f02a8
9fa6322cc0c16769f3167c3e006ec02ad40d695b
842091 F20101117_AABRNB venkateswaran_j_Page_089.jp2
8a51fa5cc3743eefa4001cffa666e0d2
0fbe3e0d3d8cac848489ee7e0d246253c3f12ddb
885063 F20101117_AABRMM venkateswaran_j_Page_074.jp2
da926a1a37dfedb620132f7a72754a1a
c5ff7e51dce4bb5dcf20cab198ce09ce8285be28
771891 F20101117_AABRLZ venkateswaran_j_Page_061.jp2
07b2265469fa25151a4fe50be64cbe90
0a1d84b2dd9689cf97a854226339ce5f573cf0dc
30398 F20101117_AABRNC venkateswaran_j_Page_090.jp2
d95b6924fd065c80503107599496b384
b983a8585da9fe7ba0da3de55ad23cfb0003cd39
139032 F20101117_AABRMN venkateswaran_j_Page_075.jp2
bdd3a7976e03347759e34bd5104a5793
6d9163b7e3a4a599f4ce7c10a602fb45cb6263c7
F20101117_AABRND venkateswaran_j_Page_091.jp2
6690d50d919aa7e904681bf82147810a
dd0e74762a5fc2be77e830e0d2dec7fdf167263d
F20101117_AABRMO venkateswaran_j_Page_076.jp2
40e2831bff040f97c20cc5a46c3e50dd
272dd47f44dc71d666e837e0c77cb7b1161111dd
864951 F20101117_AABRNE venkateswaran_j_Page_092.jp2
3a3ee65ab60d86b40f7fae3ecb72581b
529694582502ece4d251fa0a5eef216a347ba1aa
F20101117_AABRMP venkateswaran_j_Page_077.jp2
c2fe5533e37bebb5a81c77d5e2503204
8d277e56147254646c2ea814daabbb289fc70aca
F20101117_AABRNF venkateswaran_j_Page_093.jp2
89f2deae1cc568c4a64f02eb61907e16
29d3e20cef7e780c3484384b000fbc4dd69fa1c4
818986 F20101117_AABRMQ venkateswaran_j_Page_078.jp2
bdae57b9fe3d3e20d70855eee22ee4f6
d687552962ba6255c0dc340aced30549202fa17a
F20101117_AABRNG venkateswaran_j_Page_094.jp2
49228e24d2d4274b7e35af8462466028
e789cbc26a5b2c5ab883d884f94d0d07859dc077
F20101117_AABRMR venkateswaran_j_Page_079.jp2
892428ce75c95bf1f9b9b2eac8764df3
faea5ee0e0be9a1c556398ed6e903ca24d7f723e
76166 F20101117_AABRMS venkateswaran_j_Page_080.jp2
b5929f153ed9849076c559ba47ffd979
e4cdda6445a82a50d8073663e955dd0beafe0856
1051944 F20101117_AABRNH venkateswaran_j_Page_095.jp2
05a84982da5ac75617fd36e264f2ce05
e633efe84554f90fe3ce6b939611d46d2a399c8f
115130 F20101117_AABRMT venkateswaran_j_Page_081.jp2
c43d76497182ac350627a3e303b4d48b
22e3172afbf2cdcc348d4c2856d662ac5647de81
867149 F20101117_AABRNI venkateswaran_j_Page_096.jp2
bfdc976b61cfd595d13ec70d6f4fc450
b335ea0aaf95b4f30bf6a0f45b7c83b4e77077f2
976509 F20101117_AABRMU venkateswaran_j_Page_082.jp2
b420aab7b71fbd8c7c72314b73b1bbb0
dc1326c1a0c02d7ef172c9c375b4cf0dadaa13d7
713680 F20101117_AABRNJ venkateswaran_j_Page_097.jp2
fd599e3eb4246ce9cbee81d8412c24b6
4ffc8eb18b4627ecf790bb04e61e43e8a72ee416
83613 F20101117_AABRMV venkateswaran_j_Page_083.jp2
524b548a27f2e9b56ac0835ce07e9a92
3845ca7f095dddefd181eea23f71c755fc095ce5
F20101117_AABRNK venkateswaran_j_Page_098.jp2
e8cd91e9f12f3001cbb5ab9661442ff1
121f24608632ad629b506c30a576d00581ce7a89
60479 F20101117_AABRMW venkateswaran_j_Page_084.jp2
5a21520677bf03c9ba469458ddb328e2
539684d5f4a110965f106a5e63305fe464b92c31
978547 F20101117_AABROA venkateswaran_j_Page_114.jp2
40bb127b38b683b300709c590ea19573
a4bf46051f7631aa32b3ee2a07df8ce162d4f63c
F20101117_AABRNL venkateswaran_j_Page_099.jp2
0b1577cb0b246a4dd653704be3e04cc8
8ebfbd3d4468fea751f753ff6eba0e7f0c230f4b
882580 F20101117_AABRMX venkateswaran_j_Page_085.jp2
cd2bf2e0f980f4389085df8fa9480829
539641b277266ee804f4a72335224a2edec91555
903070 F20101117_AABROB venkateswaran_j_Page_115.jp2
5b02f048ad94c940c9eb92118f712d51
f4ff2fa1216fd897e5bd776ee0c9c75e8781c707
1051985 F20101117_AABRNM venkateswaran_j_Page_100.jp2
ec2eefbf1d0c880a4b64c02bc7ec73d2
104c47e4e8ad0ae8aa40be587665a1c54c211f4d
67056 F20101117_AABRMY venkateswaran_j_Page_086.jp2
28113ae7b1c96f4e6bb1ca2f6218cafe
f47c9f9506aeed0e4ccccb9d6cf63ecbbb831826
890076 F20101117_AABROC venkateswaran_j_Page_116.jp2
cac71bb2f7208e6e99f784d8252df0ce
ac0f5bed8dc72c43d6404620440d7ce3880e2203
1051907 F20101117_AABRNN venkateswaran_j_Page_101.jp2
11d19e1e14ed6ffd06bc127ed9269932
c4cea85bfd01753630642a17c8707da7364f4055
753142 F20101117_AABRMZ venkateswaran_j_Page_087.jp2
0f6de0c1e8dd1596358a562f98339a79
3b2cd87e7c3778b957be5e8adc45007b5f3fae75
1051961 F20101117_AABROD venkateswaran_j_Page_117.jp2
adcf4f28eea1f081362958ae96dec107
ebb8988cea567082cd5a266f540ba4e032fe2589
1051967 F20101117_AABRNO venkateswaran_j_Page_102.jp2
30befb8a799216cea4bc92b2d94a248f
79c24cb3013e3fbcd20fafd481dab5f3a3cd0ed4
F20101117_AABROE venkateswaran_j_Page_118.jp2
80fe98310f596f4baba40515bc034340
30d0e0a1d93b743c9933b2a73d712cd45334d012
952273 F20101117_AABRNP venkateswaran_j_Page_103.jp2
b777dbe5c939230286c631270698c890
9f9b5a45617e90c88e57cfbe6ed2a834d046acf8
35525 F20101117_AABROF venkateswaran_j_Page_119.jp2
b3373f9421585dfe46901a6c18cdefea
4303263e3ba8d959bc286f8715763ea9fffda5a7
1026651 F20101117_AABRNQ venkateswaran_j_Page_104.jp2
1c958717086d4600c6417e8858d61052
996faa0cd4c48066412bfc128f0419d7d3a30160
132361 F20101117_AABROG venkateswaran_j_Page_120.jp2
8fe5f87e0cddf798fba1f4e34b765fdd
9bd8c608b84d3d6c51f2dba46cda03bc4cbbc2fa
F20101117_AABRNR venkateswaran_j_Page_105.jp2
99d82ca848af1c460f3c149c6761c525
de4efa5a44b39b0791a6960e7883309af26a8430
135868 F20101117_AABROH venkateswaran_j_Page_121.jp2
9db2d47a01d92ca9e4c9f71a98170370
1ee505d91fdbbc606ceb3188ba8f1a6a5b8bdc3b
974317 F20101117_AABRNS venkateswaran_j_Page_106.jp2
1fa8b8234195bed67e70eb049ff1080e
bbc8fef808f48cbb2f343bbafb1412a91cbaf8a7
887225 F20101117_AABRNT venkateswaran_j_Page_107.jp2
04bac85a5985e67521d24a26c136eef6
3cf9c853867ed4b1c6c610b648ca6d26051b3554
131573 F20101117_AABROI venkateswaran_j_Page_122.jp2
af48c94ff5d3f27670fb434ba3710689
dc4a1dfe98ff897e1d359938e3c3a6545323cfa2
947078 F20101117_AABRNU venkateswaran_j_Page_108.jp2
a3db71c64ed0a8bd7bd6cfcef789c93e
c707f58ea254d6370e6d93511ca132908db2ea88
133140 F20101117_AABROJ venkateswaran_j_Page_123.jp2
d4646833ddd0205275b7838fe1c90210
da08e8177a77e37d5bb58abb4b16e34941fe4c89
923588 F20101117_AABRNV venkateswaran_j_Page_109.jp2
ecf528eac6368a7063ebd2c451964a95
7cbfa31f8b27cb74cd8d0eaab4ec024fb232c29f
135307 F20101117_AABROK venkateswaran_j_Page_124.jp2
d009b8927549b4e27dc44a36285e1fcc
d4bc11f70e18557f62fcf723d41b6b044716896f
1009533 F20101117_AABRNW venkateswaran_j_Page_110.jp2
df55ce1f3e40a6e6151ec7902e531f4b
4fb81e080eff50599b02c54037473b169f00dbf8
133708 F20101117_AABROL venkateswaran_j_Page_125.jp2
fab15fc612e33c6c89a48e96f836e4b8
e9ff47523907f651014f35c83f8b920067222052
904945 F20101117_AABRNX venkateswaran_j_Page_111.jp2
62f18a309b34268e6d8525f3069f4f76
f0da94d116b41a1d6266ef8a3aeb9114c44189e2
F20101117_AABRPA venkateswaran_j_Page_013.tif
15df4cdba16b9cc2e504e132f92c50a7
848544e66235446a3627dd33c17d0558d30fdaef
980913 F20101117_AABRNY venkateswaran_j_Page_112.jp2
39d2923a654ef86af123065fa27d375e
a851c374f7b617acf6c3289bcc79d51514cb9dec
F20101117_AABRPB venkateswaran_j_Page_014.tif
3c1acdc20836d0bf4fa836784de1d8f1
eda6c35118e3e5eeda6846f87682b63d5be0d956
46622 F20101117_AABROM venkateswaran_j_Page_126.jp2
a97cdb6ee58d4fe0c907968aff00fa50
1de5748e34845109a415fb4f248bd47a647d89ba
986412 F20101117_AABRNZ venkateswaran_j_Page_113.jp2
211f79bf5b0463034ab726fc49f36c5e
27bd08f77a1fe9b8e8c66025ddbeaee5103ddcaa
F20101117_AABRPC venkateswaran_j_Page_015.tif
e7313461e85a6a5404adeb51422e28eb
bd3ef316b3bc4a7a1f977a0950ffda10f2a4dcb1
84013 F20101117_AABRON venkateswaran_j_Page_127.jp2
fbf4bdd63171aaded548c109d15863a3
c294a4ae7e713801b64f74bebcd6d2a564988251
F20101117_AABRPD venkateswaran_j_Page_016.tif
7e4f0ebc3475afbacda340a65294518c
be76847136ac558868d795166afaf2cc0b1cc7e9
1053954 F20101117_AABROO venkateswaran_j_Page_001.tif
f5ee8ca9bdaa7facf1ac5a1e301c3ba7
8d59add259e2c6c6e02ef24c9374e1e909cfd6be
F20101117_AABRPE venkateswaran_j_Page_017.tif
323015e4bcffdbf7065fcf17ba43ed0a
e76d8f7f6729c93090a9e6bd014d8a4d35dddaba
F20101117_AABROP venkateswaran_j_Page_002.tif
7c1a1a9a1207167603068f312709433f
ee38d4c8e86936d376b12bd209e4e1e0d80a4e07
F20101117_AABRPF venkateswaran_j_Page_018.tif
2bffde40a647c4375097a6c21ebdf3b9
7735d7dd34844e2a14351477dad3a06b8b857873
F20101117_AABROQ venkateswaran_j_Page_003.tif
09307364b6cc75ccb9d56f3d8b2f62f0
776de0c000b259ed309327a5b29159aa6daf24c6
F20101117_AABRPG venkateswaran_j_Page_019.tif
e5c17a0e26f760bbaaaca1813000baff
6d40afb1929d9c79903ad6635852526539130cbc
F20101117_AABROR venkateswaran_j_Page_004.tif
49abbcf17addd02aedc1e306da385e96
3fb858cd8022d4e4111411269bdeba1badaebc20
F20101117_AABRPH venkateswaran_j_Page_020.tif
c4f54afb1b74c83ec4356bd6bf14b094
8f3abf9acf9598eff9f57cc7ab59370b1870fea3
F20101117_AABROS venkateswaran_j_Page_005.tif
0fc4514385622bfc35f7417519c00fa1
89b7f764e3a1a73d365df0a89bd8cf36345b0222
F20101117_AABRPI venkateswaran_j_Page_021.tif
a303a61d5a31ba83c9b0bf6dd1e681fc
467a5c8311e45cfde43b9ad890ba0cb077bb2340
F20101117_AABROT venkateswaran_j_Page_006.tif
4d34a0c0a70a9cc314ce15c6ad1c7bc4
268ac14f8aea4e3f45323586b8365d54123dc696
F20101117_AABROU venkateswaran_j_Page_007.tif
1465d4a3a20d5b4c4b93ab2cc274e91b
d59117c38156fe01384056c6a996a4941f9654fb
F20101117_AABRPJ venkateswaran_j_Page_022.tif
b184b2167908413b887a888bba4ad45a
0cb4a9babe6ca37bd3df764fd223bbd82c6ef80d
F20101117_AABROV venkateswaran_j_Page_008.tif
16f6e756349ead87eed34a7f113b64c3
19b77eca34ca91eeb35b1cca02133850f57abf1d
F20101117_AABRPK venkateswaran_j_Page_023.tif
8ddec9310a9fea0a37d868956c630ab5
36e533bed734c5bf2335a296c585c3958c06c76c
F20101117_AABROW venkateswaran_j_Page_009.tif
97c329b52954cd9284c20084520cfd2e
585dfe815e1592b58dfb19b31cb55cace9261d5f
F20101117_AABRQA venkateswaran_j_Page_040.tif
01090119f8cbdfff8dd2cb9c07cdf852
218b6d8c4977aa5a3bf1242ea35c85b4ba5d8aa0
F20101117_AABRPL venkateswaran_j_Page_024.tif
d9f357897069fca410dbb5d49237071c
5ea4042dd5131a818f3a8948c0c1a9c1c9d54534
F20101117_AABROX venkateswaran_j_Page_010.tif
dd8db129a00b2f31319bb0918eb2a6d2
ef9b19dfaed30bcd23f8054d06530aa8e6ac3ef5
F20101117_AABRQB venkateswaran_j_Page_041.tif
0daef0e77422187210895863852ce078
ec41be82122ed929dfd1235bd8cbfae7937f4846
F20101117_AABRPM venkateswaran_j_Page_025.tif
c0b017eef88e0e7a12e03b5e2b7a2c79
f3b2b3505f351861bf2d5f69fc4664d5c371e4fb
F20101117_AABROY venkateswaran_j_Page_011.tif
f27ca5b81bed033b52705891ca590246
4378eb49f18a9adb90489cf7ad574cc5f7457d95
F20101117_AABRQC venkateswaran_j_Page_042.tif
819c691809f90d94da883f16a2911415
708af4dbded32c00a8bd4583ee29a6fffb3bb7bd
F20101117_AABRPN venkateswaran_j_Page_026.tif
fd6f0663dfd6a79fe3c38b5cf6c90674
84566651dc5aa3c1085c2342b4bbebd48cd2745a
F20101117_AABROZ venkateswaran_j_Page_012.tif
d225f0bb9ba1b65125eba02e629e1f09
6b2ccdd58b130a7cc058bf986672d4fd4a95f04a
8423998 F20101117_AABRQD venkateswaran_j_Page_043.tif
e2b76c4b86515be8d89b3eb497441fdc
df99d79b22a1350786dfd68a2649bbef06b33cb7
F20101117_AABRPO venkateswaran_j_Page_027.tif
1c672fd27609795980bd4d9a03c76735
b461edc240ec0dac5e6d2178a668126189e8e717
F20101117_AABRQE venkateswaran_j_Page_044.tif
50ff90642da2d7e19e91027c46b120c3
ff85a0f635bb8419b50895d5fe9e346ed68258ba
F20101117_AABRPP venkateswaran_j_Page_028.tif
d5e5b34c7e641f81d06beb74ef85956d
e27ea59281546b5d48170ac0fb91a59b45ef1899
F20101117_AABRQF venkateswaran_j_Page_045.tif
560b4149d3c2771f8e1b7884a29f93ec
659e9b2c2bb2752ee7087ad58d627c3cab26bcc0
F20101117_AABRPQ venkateswaran_j_Page_029.tif
f809c8e7eea4dbe0b33777d5ad94080c
8ddcfc17ba22aa79c4d72c3d059c54ba9aa18ec8
F20101117_AABRQG venkateswaran_j_Page_046.tif
73a96c3f8216488e5af76c06f0276377
35f5ec1a5425c036d572582e1c181a4130d249f7
F20101117_AABRPR venkateswaran_j_Page_030.tif
82ca1ed157d6c277569aa72bf91c5314
073170d8e91433fe245d8870f63582e330eb8cd9
F20101117_AABRQH venkateswaran_j_Page_047.tif
c3552e0fd1ec7c21b1c5386674316cb3
cc9e59112958fd066347479fef65316773251d50
F20101117_AABRPS venkateswaran_j_Page_031.tif
ee206fe48bb8d90fb21b57b88f6bf132
c4e2be6f521d652f3d166c3127fce2ce13988ff5
F20101117_AABRQI venkateswaran_j_Page_048.tif
eb5d901fc0ca9f80977cc51f48af0ead
79294573e0421825b865d5a0c746249d60ac4da6
F20101117_AABRPT venkateswaran_j_Page_032.tif
f8e5c77dae6b3d1ed8d69575e3007b11
7c3d289cbcd663fa061e4dc62ef0df03d653ca34
F20101117_AABRQJ venkateswaran_j_Page_050.tif
d610ccc1c8977577d2bc2be9a5cc81f7
997ab6f31d41ee7756811cef30e3eea982181e2b
F20101117_AABRPU venkateswaran_j_Page_034.tif
dcf09dfc6ebe8092f24483567842279b
5f08e70af3199990cee2d908ded05ac7b4a3feb8
F20101117_AABRPV venkateswaran_j_Page_035.tif
02b4644340697d281a403f214a9b4c47
1a06f7ea7efdee53a3362ca2fcda4e4ca7b13533
F20101117_AABRQK venkateswaran_j_Page_051.tif
3a0ff7fb27fd1b257840066e2471a63c
54caaf8c1e377eff7daadbb90c882fd4b5929c1a
F20101117_AABRPW venkateswaran_j_Page_036.tif
d613882cd7d45e26df0d0396bc53d9e4
d35bbfac8bb196dbea35e58540defce149321ea8
F20101117_AABRQL venkateswaran_j_Page_052.tif
e2cd6c2442f1bb1f58765ca54e1ffcf6
ca1ed85046e7e414f1573dc1d6702c4a3c0f89c8
F20101117_AABRPX venkateswaran_j_Page_037.tif
4426bcc5b7220a3b7929abf147c4af2b
86f7fea362133cdbd782eef01a3504b193a1a2a2
F20101117_AABRRA venkateswaran_j_Page_067.tif
62ea2759d2fab7e08f47e955bb943292
0513c1907f2b0ad45c1c82bead4a6e615ccf31b0
F20101117_AABRQM venkateswaran_j_Page_053.tif
c70cbc74462b5f8c047e7bd039fc9883
5dbe41715b8312a1a81b3904aa4c85b6fe732c1c
F20101117_AABRPY venkateswaran_j_Page_038.tif
2969183dd6e1b2e5b1b723b2761e4a1c
3aa144693f3eba3c59a4862cb9c66773035f2c2d
F20101117_AABRRB venkateswaran_j_Page_068.tif
2babb1070e5ac44565fff8dc305b5375
b4cfd44a2ebd791e751f446c94e8c2812030bd3b
F20101117_AABRQN venkateswaran_j_Page_054.tif
8248d1b66ff690c06f4483f9cf8e1453
c86d383f8b49d9485ee5f26c1d239339160b34e0
F20101117_AABRPZ venkateswaran_j_Page_039.tif
7a31864516030bc5f32d8ff5b5aa0e6b
87249d48395ec3bc9633fdeb7b02622544220224
F20101117_AABRRC venkateswaran_j_Page_069.tif
b43489e62c45ebadb6132af00c1f223a
1359a44b268e0224a8621297bb221d66b1759e4f
F20101117_AABRQO venkateswaran_j_Page_055.tif
8f2945e694ba5873c15b809c4d153545
c74ad41a44557dd925362356552e66bad817732d
F20101117_AABRRD venkateswaran_j_Page_070.tif
d0292ebce08e17fffe4c96b676243911
7f3f480f4b9af8e095ca36170c81b7946f8ad986
F20101117_AABRQP venkateswaran_j_Page_056.tif
81eb5b0ed4d80fff538ccba4172f8ebe
1f232dc95b1d62d3e3cf8ec339973432b924cc61
F20101117_AABRRE venkateswaran_j_Page_071.tif
1eae170688cff531460f23e8e541700b
1698184ae74260b2dc60b5695cc1d36393eef658
F20101117_AABRQQ venkateswaran_j_Page_057.tif
269e0a32631550d3b6400e998a48c53a
983f4fbecee28795d71de636c4ea60e2b962a684
F20101117_AABRRF venkateswaran_j_Page_072.tif
bb087aff48e93f94a4fc6547365b7aff
57bf9deb52cfe6a8e728652f63c5bed35f441a9b
F20101117_AABRQR venkateswaran_j_Page_058.tif
5217753f0b1e4379edea2564a902d2b7
8f424473ab6fa24ca7f0ebad3577d2eeeb172d75
F20101117_AABRRG venkateswaran_j_Page_073.tif
44ee817099fe742ad3e06e3eaae3996e
76e06944c1655becf611394013790b2e774e197e
F20101117_AABRQS venkateswaran_j_Page_059.tif
f06a10236c9f8df43d7725b84826c6fa
9868e172147b0d348864c92e83bf7ec6c9060f71
F20101117_AABRRH venkateswaran_j_Page_074.tif
0e0c062ee0c1f5e830243753b931592a
15bee2be2714c88ef4e468fd925550d21bb179c6
F20101117_AABRQT venkateswaran_j_Page_060.tif
82c36fb8406244d448ba54a1ff314418
aa942a9f05797a8b6d18785f536d5aacaf9aa32d
F20101117_AABRRI venkateswaran_j_Page_075.tif
ff0c534d3205422a7a9bef890be4b546
140ceea58ae38de3aeee77327bd762ca7da7721d
F20101117_AABRQU venkateswaran_j_Page_061.tif
022ded1ec9617ac0185c2f66bc8bd1bc
db60a9b25937318b793502da4d824d3f6db75a71
F20101117_AABRRJ venkateswaran_j_Page_076.tif
91284024d49d50db6d95039a1928fa0a
7cbde8eb9b48c01a1559b52377020e9008febc72
F20101117_AABRQV venkateswaran_j_Page_062.tif
21f5f65ec4bb322a5fed97471ed049c5
3b33ded21f02f3daa80c7bba80a9f53f6027cf4b
F20101117_AABRRK venkateswaran_j_Page_077.tif
3a56b6da8e6fde3ce84d33ed6311feb0
30fa7868dc400641180d461ff6e9f774679ec616
F20101117_AABRQW venkateswaran_j_Page_063.tif
498a701e39e8060f78d18b5150a582eb
50501916c4d1404c7ddf4d8cbe283351a7b75c3f
F20101117_AABRQX venkateswaran_j_Page_064.tif
ce1aa33daa96da42bd4cb0888361d78f
f2418893bf1ee9e94dec2ce8cac8db9b9b3afb05
F20101117_AABRSA venkateswaran_j_Page_093.tif
a254ab894cbdf4a1ee99a5180159797f
abfe3bc49fef740bbf62923d9e84eee5a7336e20
F20101117_AABRRL venkateswaran_j_Page_078.tif
fede30ad2e678c1368acf141376e5c8e
bdb62a8125517c0835fc5bb3090f210801e8e073
F20101117_AABRQY venkateswaran_j_Page_065.tif
843a7ad3358fef5a5f79d3edaecfccad
d12539623f0ad6ce715ca949fedfa20b7b9f4c3b
F20101117_AABRSB venkateswaran_j_Page_094.tif
9ab25cf29c8228f1d876c36a03eed736
ae78c29339ea04778a8f9f581b315dcc59591bac
F20101117_AABRRM venkateswaran_j_Page_079.tif
fcf43b57ffbcb77f2a0165bf3f5da1db
fb89e8005dab12e07d6f74ae36d817dee8ad1f21
F20101117_AABRQZ venkateswaran_j_Page_066.tif
520fb1c3a2f95a40eec2b640a15078c0
ec80d6f6c9d0c4247cfc3b4b3119fdb5effb967d
F20101117_AABRSC venkateswaran_j_Page_095.tif
504f761c52d319ab9f71932701c5fe86
c0659ae96ccb4d63330dc9b6dca6ad3d54372929
F20101117_AABRRN venkateswaran_j_Page_080.tif
a8c20f03211890c349c0c94d87bdb292
3f6016261191ce8edb7ae4e1a8c732ad483b8ca8
F20101117_AABRSD venkateswaran_j_Page_096.tif
ad4b83529aa62afdc92c4192a2b0c47a
2a0c9c2082666ee6678ba11052c1e1eeea167f21
F20101117_AABRRO venkateswaran_j_Page_081.tif
a201b6f7c7c3cd4d6aa0138d48bce2a6
9701925239303946257fa17a94daaaf46b30cea8
F20101117_AABRSE venkateswaran_j_Page_097.tif
3cf19f3515b2658511df7f7261a5cd98
5a2de50f21cfa5c282ad28166c7760c43fad6a03
F20101117_AABRRP venkateswaran_j_Page_082.tif
526125af4c666e2f9c3dcd19d735f6aa
9196b79149df9c7bfbc80fd77ebefa16ea525101
F20101117_AABRSF venkateswaran_j_Page_098.tif
a48b5609a5ff210fa826ce29df8f2ac2
debc6c302b1f71adf4321fad5460f89096ebf0e6
F20101117_AABRRQ venkateswaran_j_Page_083.tif
549fe00dd3928ece94720ce82b8c93b1
06f1c5c5cbac0a5a6dbf7509661aaa1f23b2d0c5
F20101117_AABRSG venkateswaran_j_Page_099.tif
ec6cb3c0c69f35ae019eeee6ea696400
0a0a99123a328f3cd4c700d2839fb17e337305d8
F20101117_AABRRR venkateswaran_j_Page_084.tif
8102825ed997f82113dfda08f9999d42
c2800ef413d7288da430f80b67a9a1050127822f
F20101117_AABRSH venkateswaran_j_Page_100.tif
d1c0a129db59851415154597abee2672
8f409b5bfbe1dc770fa71e65a728e8f4d2ce8362
F20101117_AABRRS venkateswaran_j_Page_085.tif
d24b9132d7de609c3f54ec8d724124fa
a21435b8e7ccce48fdb95631a306a24cfc4e00c3
F20101117_AABRSI venkateswaran_j_Page_101.tif
c4487fd8a71e572e30ee669b3b87208d
c215818283231e119fd6a5e21ee3bc08a181fd7a
F20101117_AABRRT venkateswaran_j_Page_086.tif
229c80082b24171f90e42405c3e4170a
5ba0e0de6577ed40544dd2f470815688e623d2d6
F20101117_AABRSJ venkateswaran_j_Page_102.tif
24f99cc5dfc6ff31d1b7a1b0598356d7
a8f49e1b420993651eebc7ef9bc303ac2ed4bfc7
F20101117_AABRRU venkateswaran_j_Page_087.tif
0af203629a635ff454b72a3bc00adc51
b2ef5419feae55d7e64ed1ae378ba1004562fabd
F20101117_AABRSK venkateswaran_j_Page_103.tif
90bf88514a71fd7f3900e220ea0b0da1
4afa7c3d74fcfd1f3af29c46a0daf48e5a095d99
F20101117_AABRRV venkateswaran_j_Page_088.tif
1fe16ed388a6529a07c1c5c0ba008181
dabef09756a7000d5e8c21f34ebb3b92aa2aaad9
F20101117_AABRSL venkateswaran_j_Page_104.tif
0d3814e3fe63076391d3612b11a857c8
59ea21d4a5f414797dfa756d021eea8acdca50b7
F20101117_AABRRW venkateswaran_j_Page_089.tif
6355e96dd1474970fa4392925f11cd39
48350a198e5c1051761b875781109466380307dc
F20101117_AABRTA venkateswaran_j_Page_119.tif
dde0e097c4e39bf6f2ab2181ab544187
0ad77efbae130bdd912c455e572ff5263555e6b5
F20101117_AABRRX venkateswaran_j_Page_090.tif
765b98128af5c72a761c10223d7d8b65
70d13b76db6c31158b5fb1af17b7402b06891d2e
F20101117_AABRTB venkateswaran_j_Page_120.tif
67b20b1e202d59dae3857cefaa7ddd9c
67ac127350b7babb2d10e05a7b4bd43fa596f766
F20101117_AABRSM venkateswaran_j_Page_105.tif
10ebde888b6264dfefaca2bce1b5ad4f
343ccbf30aa081779a7afbfff59150a5bdc2a249
F20101117_AABRRY venkateswaran_j_Page_091.tif
7c51b726ce648a77b0f16cf0439bcc92
f61c84261b6e2a29928b296b48bfd4f16735ccfa
F20101117_AABRTC venkateswaran_j_Page_121.tif
fc62f1b9212ebecd729e92a724560f88
2c043217bdaffecbed8688c3eb1eee08aab196e1
F20101117_AABRSN venkateswaran_j_Page_106.tif
849606ba204f1f1edd58aa2117762548
9e7821c3ff7c566286c6ec76af76fd6f14c4a4ce
F20101117_AABRRZ venkateswaran_j_Page_092.tif
02fb6bcdc4a5c7ab0b4b0a1e9bc25960
72abe81690585546400e6c0660a7238ecb31a0cd
F20101117_AABRTD venkateswaran_j_Page_122.tif
4447ac439a5acc7276f0dedd6f2005b1
2201f75f5afde8de08cd172b178c88cba0bac95e
F20101117_AABRSO venkateswaran_j_Page_107.tif
7055ac484eb250f446349b3ed7c70d74
95d1bd8baa2a0ee64b0a0a2c3917107eadbabc64
F20101117_AABRTE venkateswaran_j_Page_123.tif
2d924c5b1d9453cadef4651804cfe786
288dac2c8182bb8a12e65c4b51291dc268c69140
F20101117_AABRSP venkateswaran_j_Page_108.tif
d7d62279ba0bc20593c3298bb105de8c
a12d95e42fb4ef2ed22cb64a1956306fc1642c4c
F20101117_AABRTF venkateswaran_j_Page_124.tif
ccefef288a0daf93dbd1d4fc4c796b84
91e0d5b91b5f4382232f596700836dd274403450
F20101117_AABRSQ venkateswaran_j_Page_109.tif
57f2d943b4daa780867c2bca90d192b3
50cb78d72c5fa9bd9e0da9073982e75eaff4819b
F20101117_AABRTG venkateswaran_j_Page_125.tif
35c147f2553dde43ce63047e1d7781c3
c64f619609cb158cf1a7ff0f4007b465c894abe4
F20101117_AABRSR venkateswaran_j_Page_110.tif
940dab3d699218dc91b6594a6356725c
e86e78ecfc0a0dde8a583f04f37eec7d7c02b4ca
F20101117_AABRTH venkateswaran_j_Page_126.tif
da4a6e6ff80aa3903439d636d6dc226a
eeb62506f0290c08ec5b372264fc6bdb57b0175c
F20101117_AABRSS venkateswaran_j_Page_111.tif
14f7fc95a18e0d40010352d8d9fa29ba
fc6a50ed37d1c63ca67d2b25cd9914095438d751
F20101117_AABRTI venkateswaran_j_Page_127.tif
fad7b29167816b5fb1f598991e85265b
1a80f17ee78e3ff369e5b3389c7d45f25c341796
F20101117_AABRST venkateswaran_j_Page_112.tif
ee41a301558f98e6d79707e9c31cfc10
099484f929ce5fefd4cf976093f3dfaacabafcdb
8034 F20101117_AABRTJ venkateswaran_j_Page_001.pro
52e20fbf206136698a8db70c46ef1a09
77699230b6f3b2391b60c7f2a1128659c7b51ee6
F20101117_AABRSU venkateswaran_j_Page_113.tif
9913f22eb976e51a915d76cb5c4d0881
2cb5e55a81f9d85ff68923f88a0ecc8f17aae8fa
F20101117_AABRSV venkateswaran_j_Page_114.tif
ed11c395a32295dafb206e5d7e69999a
4e5b155408e58b7303cbf2f88dfb184ca306b73e
1100 F20101117_AABRTK venkateswaran_j_Page_002.pro
b98d89b59e007d0f8f6289ccd00be89c
0466f21e9d116c603a7e26846d5139ae394f806c
F20101117_AABRSW venkateswaran_j_Page_115.tif
03f0e1f58e1d7f31cb876c20972b387c
7ef1ca30dee592e0c1aa19028fcf3e9c74efe855
892 F20101117_AABRTL venkateswaran_j_Page_003.pro
a75db5d80a166e6a06424c1c74614c37
9e273712374b199de96bc67907ed2e8aef57b626
F20101117_AABRSX venkateswaran_j_Page_116.tif
28111484c14640cadb14c6bc6188df23
ed7c5cdd88ebf5785bc5fa18538577b3fc922385
61054 F20101117_AABRUA venkateswaran_j_Page_018.pro
f2bcdef822fd7e8509f355a78ecdff52
de5ec2f4cb1d3f992ac5c67d0a45c27ad313cfce
20829 F20101117_AABRTM venkateswaran_j_Page_004.pro
281b7ef3b903b61bc983721d2cb934cd
7ba549c0b56ddf1cb758db608aac5cb34797a19f
F20101117_AABRSY venkateswaran_j_Page_117.tif
6984a96b5ec39824e320fc308e00b107
d01804266b61c76e4c60baa1ba17612376742a95
56548 F20101117_AABRUB venkateswaran_j_Page_019.pro
8f92ddcb80ddf7e8ecd4183618390341
720df22e0b8f97d1d80e02dd4ea5b45c449f1332
F20101117_AABRSZ venkateswaran_j_Page_118.tif
ee95a3cd0ef661c36ff7ab5ceb3cc1cb
2ac25428b375036b6cda3bcf51d60decb3490b76
59250 F20101117_AABRUC venkateswaran_j_Page_020.pro
55f290310a04378d4864a89c383660e4
fb91dabadd91dbcfff47c30ac29f827d5ccd48d4
61569 F20101117_AABRTN venkateswaran_j_Page_005.pro
dc0d1859553365cc2f07b8848619679a
086352fa71d184261485c7e34b0011a6c9452ffc
50105 F20101117_AABRUD venkateswaran_j_Page_021.pro
2565a6ea0919a2eac9f8d123969777ef
1b9e23fc6e22ffeafacbe98e9b6720e84d64d456
81447 F20101117_AABRTO venkateswaran_j_Page_006.pro
f7ca0c4f4edfaa3b0a4ec2b54019ee57
72e4012b03be188bcc22b861b67f627b8ce29add
56625 F20101117_AABRUE venkateswaran_j_Page_022.pro
7b48cc26a3cd2cee9ff74a09a7feaebe
c95e8aeb2afb5537eb2f68c2fd6b4457d0484233
55501 F20101117_AABRTP venkateswaran_j_Page_007.pro
37dc0e2a812f84646a015e1cc3cb7c55
481a5b21594300aed70fa048ffaf9d23218f55fe
42826 F20101117_AABRUF venkateswaran_j_Page_023.pro
0a3a8d5f93e051a15306065cf6a291f6
8aea4fa553bcdefc57325a2196f091ef19dad8e5
13729 F20101117_AABRTQ venkateswaran_j_Page_008.pro
24d1b6702c13874c64816b532940ee22
9b421b1219af223d341f3c4d2ddbaef99a1effcc
2423 F20101117_AABSAA venkateswaran_j_Page_047.txt
5be17607ad3b5a671c78e35d98aa5ef4
184e830ed76b06f2b4bae8eb760811a27a009739







INDEXING TECHNIQUES FOR METRIC DATABASES WITH COSTLY SEARCHES


By

JAYENDRA G. VENKATESWARAN




















A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007


































© 2007 Jayendra G. Venkateswaran



































To my wonderful parents









ACKNOWLEDGMENTS

First and foremost, I would like to thank my advisor Dr. Tamer Kahveci, for his exceptional

guidance during the course of this research. He has been a constant source of motivation and was

always available for technical discussions and professional advice.

I am also indebted to my co-advisor, Dr. Christopher Jermaine, for his excellent guidance

and support in a major part of my research.

I am grateful to my dissertation committee members Alin Dobra, Sartaj Sahni and Lei Zhou

for their help and guidance.

Additionally, I thank all my friends at the University of Florida, Gainesville, for their friendship.

Last but not least, I owe a special debt of gratitude to my family. I would not have been able

to get this far without their constant support and encouragement.









TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

  1.1 Similarity Search
    1.1.1 Similarity Search Query
    1.1.2 Similarity Join Query
  1.2 Performance Issues
  1.3 Contributions and Organization
    1.3.1 Similarity Search Query
    1.3.2 Similarity Join Query
    1.3.3 Dissertation Organization

2 METRIC SPACE MODEL

  2.1 Euclidean Distance
  2.2 Edit Distance
  2.3 Earth Mover's Distance

3 RELATED WORK

  3.1 Partition-based Indexing Techniques
    3.1.1 R-Tree and its Variants
    3.1.2 The STR-Ordering
  3.2 Nearest Neighbor Queries
    3.2.1 k-Nearest Neighbor Query
    3.2.2 Reverse-Nearest Neighbor Query
    3.2.3 All Nearest Neighbor Query
  3.3 Reference-based Indexing Techniques
    3.3.1 Distance-matrix based Indexing Methods
    3.3.2 Reference-based Hierarchical Indexing Methods
  3.4 String Search Methods

4 REFERENCE-BASED INDEXING FOR STATIC DATABASES

  4.1 Maximum Variance
    4.1.1 Algorithm
    4.1.2 Computational Complexity
  4.2 Maximum Pruning
    4.2.1 Algorithm
    4.2.2 Computational Complexity
    4.2.3 Sampling-based Optimization
      4.2.3.1 Estimation of gain
      4.2.3.2 Estimation of largest gain
    4.2.4 Impact of the Sample Query Set
  4.3 Assignment of References
    4.3.1 Motivation and Problem Definition
    4.3.2 Algorithm
    4.3.3 Computational Complexity
  4.4 Search Algorithm
    4.4.1 Algorithm
    4.4.2 Computational Complexity
  4.5 Experiments
    4.5.1 Effect of the Parameters
      4.5.1.1 Impact of m
      4.5.1.2 Impact of query set size, |Q|
    4.5.2 Comparison of Proposed Methods
      4.5.2.1 Impact of query range (ε)
      4.5.2.2 Impact of number of references (k)
    4.5.3 Comparison with Existing Methods
      4.5.3.1 Impact of query range (ε)
      4.5.3.2 Impact of number of references (k)
      4.5.3.3 Impact of input queries
      4.5.3.4 Scalability in database size
      4.5.3.5 Scalability in string length

5 REFERENCE-BASED INDEXING FOR DYNAMIC DATABASES

  5.1 Overview of SP and TP
    5.1.1 Basic Approach
    5.1.2 Maintaining the Query Distribution
  5.2 Single-pass Algorithm in Depth
    5.2.1 Computational Complexity
  5.3 Three-pass Algorithm in Depth
    5.3.1 Computational Complexity
  5.4 Experiments
    5.4.1 Query Performance
      5.4.1.1 Experimental setup
      5.4.1.2 Experimental results
    5.4.2 Distance Computations
    5.4.3 Impact of Index Construction Time
    5.4.4 Analyzing the Results

6 GENERALIZED NEAREST NEIGHBORS FOR SIMILARITY JOINS

  6.1 Problem Definition
  6.2 Overview of the Algorithm
  6.3 Predicting the Solution: Priority Table Construction
  6.4 Static Search Strategies
    6.4.1 Fetch All
      6.4.1.1 Creating clusters
      6.4.1.2 Ordering clusters
      6.4.1.3 Processing clusters
    6.4.2 Fetch One
  6.5 Dynamic Strategy
  6.6 Further Improvements for GNN Queries
    6.6.1 Adaptive Filter
    6.6.2 Partitioning
    6.6.3 Packing
  6.7 Experiments
    6.7.1 Evaluation of Optimizations
    6.7.2 Comparison of Proposed Methods
      6.7.2.1 Evaluation of buffer size
      6.7.2.2 Evaluation of the number of NN
    6.7.3 Comparison to Existing Methods
      6.7.3.1 Evaluation of buffer size
      6.7.3.2 Evaluation of the number of NN
      6.7.3.3 Evaluation of database size
      6.7.3.4 Evaluation of the number of dimensions

7 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH









LIST OF TABLES

4-1 Summary of symbols and definitions.

4-2 A list of proposed methods.

4-3 Comparison with Tree-based index structures.

6-1 Comparison with an Optimal solution.

6-2 Comparison with GORDER.

6-3 Comparison with RkNN.









LIST OF FIGURES

1-1 Similarity Search Example.

1-2 Similarity Join Queries Example.

2-1 Edit Distance Example.

2-2 EMD Example: Image1.

2-3 EMD Example: Image2.

2-4 EMD Example: Transformed Image1.

3-1 Evolution of Index Structures in Metric Spaces.

3-2 Reference-based Indexing Example.

3-3 Example for Omni.

4-1 Maximum Variance example.

4-2 Number of comparisons for Omni with varying number of references.

4-3 Number of comparisons for MP-D for DNA object database for different values of m.

4-4 Number of comparisons for MP-D for protein database for different values of m.

4-5 Number of comparisons for MP-D for image database for different values of m.

4-6 Impact of |Q| on index construction time.

4-7 Impact of |Q| on query performance.

4-8 Comparison of the proposed methods for DNA database for queries with varying ranges.

4-9 Comparison of the proposed methods for Protein database for queries with varying ranges.

4-10 Comparison of the proposed methods for DNA database with a varying number of references.

4-11 Comparison of the proposed methods for Protein database with a varying number of references.

4-12 Comparison with other methods on DNA database for queries with varying ranges.

4-13 Comparison with other methods on protein database for queries with varying ranges.

4-14 Comparison with other methods on image database for queries with varying ranges.

4-15 Comparison with other methods on DNA database for a varying number of references.

4-16 Comparison with other methods on protein database for a varying number of references.

4-17 Comparison with other methods on image database for a varying number of references.

4-18 Comparison with other methods on DNA database for queries from Heliconius Melpomene with varying query ranges.

4-19 Comparison with other methods on DNA database for queries from Mus Musculus with varying query ranges.

4-20 Comparison with other methods on DNA database for queries from Danio Rerio with varying query ranges.

4-21 Scalability in database size.

4-22 Scalability in string length.

5-1 Comparison of the proposed methods on DNA database with Hilbert-ordered data and query distributions.

5-2 Comparison of the proposed methods on DNA database with random-ordered data and query distributions.

5-3 Sparse method for DNA database with Hilbert and random-ordered data and query distributions.

5-4 Comparison with Sparse for the randomly ordered image database.

5-5 Distance computation time of DNA database.

5-6 Distance computation time of image database.

5-7 Index construction (Ic) times of the methods SP and TP for DNA and image databases.

5-8 Index construction (Ic) times of MP for DNA and image databases.

6-1 An example for GNN query.

6-2 A sample Priority Table for two databases R and S.

6-3 First row of the Priority Table.

6-4 Adaptive Filter Example.

6-5 Partitioning Example.

6-6 Evaluating Optimizations 1, 2 and 3.

6-7 Evaluation of Partitioning.

6-8 Comparison of the proposed methods on two-dimensional image database for different buffer sizes.

6-9 Comparison of the proposed methods on two-dimensional image database for different values of k.

6-10 Comparison with other methods on two-dimensional image database for different buffer sizes.

6-11 Comparison with other methods on protein database for different buffer sizes.

6-12 Comparison of SS, RT, and MuX FD on two-dimensional image database for different values of k.

6-13 Comparison of SS, RT, and MuX FD on protein databases for different values of k.

6-14 Comparison with other methods on two-dimensional image database for varying database sizes.

6-15 Comparison with other methods on two-dimensional image database for varying number of dimensions.









Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

INDEXING TECHNIQUES FOR METRIC DATABASES WITH COSTLY SEARCHES

By

Jayendra G. Venkateswaran

December 2007

Chair: Tamer Kahveci
Major: Computer Engineering

Similarity search in database systems is becoming an increasingly important task in modern

application domains such as artificial intelligence, computational biology, pattern recognition

and data mining. With the evolution of information, applications with new data types such as

text, images, videos, audio, DNA and protein sequences have begun to appear. Despite extensive

research and the development of a plethora of index structures, similarity search is still too costly

in many application domains, especially when measuring the similarity between a pair of objects

is expensive. In this dissertation, the similarity search queries we consider are classified under

similarity search and similarity join queries. Several new indexing techniques to improve the

performance of similarity search are proposed. For the similarity search queries, reference-based

indexing methods applicable to both static and growing databases are proposed. For similarity

join queries, a generalized nearest neighbor framework and several search and optimization

algorithms are proposed. Extensive experiments evaluate the different parameters used by

the proposed methods and demonstrate performance improvements over the state-of-the-art algorithms.









CHAPTER 1
INTRODUCTION

Similarity search refers to the retrieval of database objects that are similar or close to a

given query object. It has applications in diverse fields such as databases, artificial intelligence,

computational biology, pattern recognition and data mining. The use of data structures to speed

the search known as indexing is an important technique when answering questions over large

databases. Many indexing techniques have been proposed for similarity search. Despite the

extensive research and the development of a plethora of index structures, similarity search is

still too costly in many application domains, especially when measuring the similarity between a

pair of objects is expensive. New indexing techniques are needed to improve the performance of

searches in these databases.

1.1 Similarity Search

Classical databases have stored numerical or alphabetical information. The database

objects are stored using different database models such as the relational [30], object-oriented [3]

and object-relational [83] models. Exact match, range and join queries over these models are

typically based on whether two keys are equal or whether a key falls within a range

of values.

With the evolution of information, applications with new data types such as text, images,

videos, audio, DNA and protein sequences have begun to appear. The problem with these new data

types is that they can neither be ordered for range queries, nor is it meaningful to perform equality

comparisons on them. For example, consider the application of image databases. A user might

be interested in retrieving the images from the database that have similar color, texture and shape

to the given query image [71]. Existing database models identify images from the filenames,

keywords and so on. These approaches cannot be applied for the similarity search query. The

query requires a more complicated distance function that can capture the similarity between two

images accurately.

Similarity search [28, 37, 45] retrieves database objects that are similar or close to a given

query object, given some arbitrary, user-defined similarity function. In this dissertation, similarity

















Figure 1-1. Similarity Search Example.


search queries are classified under the following two categories: a) similarity search queries and

b) similarity join queries. The next two subsections describe these two categories in detail.

1.1.1 Similarity Search Query

There are two types of similarity search queries: a) range queries and b) nearest neighbor

queries.

Let D() be a distance function. In a Range Query (RQ), the goal is to find the database

objects that are similar to a given query. Given a database R, a query q, and a similarity threshold

ε, the search returns the set of all objects in R whose distance to q is less than ε. Formally, the

answer to an RQ is defined as:



RQ(q, ε) = {r ∈ R : D(q, r) < ε}

Figure 1-1 presents an example of an RQ. Here the shaded circles represent the objects in R.

The black circle represents a given query q and the distance threshold is given by ε. The query

returns the five objects enclosed by the circle with q as the center and ε as the radius.
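To make the definition concrete, the following is a minimal sketch of answering an RQ by a linear scan. It is illustrative only and not the dissertation's own implementation; the names range_query, database, eps and the user-supplied metric dist() are hypothetical.

    def range_query(database, q, eps, dist):
        """Return all database objects whose distance to the query q is below eps (linear scan)."""
        result = []
        for r in database:
            if dist(q, r) < eps:   # one distance computation per database object
                result.append(r)
        return result

Answering a query this way costs |R| distance computations; the indexing techniques discussed in later chapters aim to avoid most of them.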

For an example of a RQ, consider the application spell checker in text retrieval. The Edit

Distance (ED) [58] between two strings gives the minimum number of transformations needed









to convert one string to another. Given a text database, a query string q and a similarity threshold

ε, the spell checker will look for the database strings that are within distance ε of the

query string. As another example, consider a content-based image retrieval system. A commonly

used measure in this application is the Earth Mover's Distance (EMD) [71]. EMD measures

the transformation needed to convert the histogram distribution of one image to that of another

image. This is used as a measure of dissimilarity between two given images. Given an image

database, a query image q and an EMD threshold ε, the image retrieval system returns the images

from the database that have an EMD within ε of q.

In a Nearest Neighbor (NN) query, given a database R and a query object q, the goal is to

retrieve the element in R that is closest to q based on some distance measure. The answer to an

NN query is defined as:



NN(R, q) = {r ∈ R : ∀v ∈ R, D(q, r) ≤ D(q, v)}

In a variation of the NN query, the k-Nearest Neighbor (kNN) query, given a database R, a

query q and a positive integer k, the goal is to return the set of the k objects from R closest to q based

on some distance measure. The answer to a kNN query is defined as:




kNN(R, q, k) = {U ⊆ R : |U| = k, ∀u ∈ U, ∀v ∈ R \ U, D(q, u) ≤ D(q, v)}


Figure 1-1 presents an example of the two NN query types. The result of an NN query on

the object q is given by NN(R, q) = {a}. The result of a 2NN query on the object q is given by

kNN(R, q, 2) = {a, b}.
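For illustration only, a naive kNN query can be answered by a linear scan over the database; the function name and the user-supplied metric dist() below are hypothetical, and this is not one of the proposed algorithms.

    import heapq

    def knn_query(database, q, k, dist):
        """Return the k database objects closest to q (naive linear scan)."""
        # nsmallest keeps only the k best objects ranked by their distance to q.
        return heapq.nsmallest(k, database, key=lambda r: dist(q, r))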

For an example application, consider the geographic information system (GIS) described in [70].

The user of a tourist information system may require information about the locations of the

attractions that are close to the current location, or the tourist might be interested in knowing

the three hospitals nearest to the location of a car accident. These queries require computing











Figure 1-2. Similarity Join Queries Example.

the Euclidean distance between a given location to the database of locations of attractions or

hospitals.

1.1.2 Similarity Join Query

In a similarity join query [28], given two databases R and S, the goal is to retrieve the pairs

of objects from the two databases that are similar to each other based on a given criterion. All

similarity join queries can be further divided into the following three types: a) the k-Nearest

Neighbor (kNN) join query [19], b) the Reverse Nearest Neighbor (RNN) join query [55] and c)

the All Nearest Neighbor (ANN) join query [97]. Figure 1-2 presents an example of the different

types of join queries. Here the shaded circles represent the objects from R ({a, b, c} ⊆ R) and

the unshaded circles represent the objects from S ({1, 2, 3, 4} ⊆ S).

The kNN join query is the most commonly used type of join query. Given a positive integer

k, the join query returns, for each object of R, the k objects from S closest to it based on some distance

measure. The answer to a kNN join query is defined as:



kNN(R, S, k) = {⟨r, U⟩ : r ∈ R, U ⊆ S, |U| = k, ∀u ∈ U, ∀v ∈ S \ U, D(r, u) ≤ D(r, v)}

In the example given in Figure 1-2, a 2NN join query between R and S returns the set

of pairs {{a, {1, 2}}, {b, {2, 3}}, {c, {2, 4}}}. For an example application, consider kNN









classification in data mining [18]. Given a set R of database objects that are not yet classified and

a set S of objects that are already classified, a kNN query over S is computed

for every object in R. Each unclassified object is then assigned to the category that forms the

majority among its k nearest neighbors.

In a RNN join query, given R and S, the goal is to retrieve the set of all objects from S such

that every object from the result set is a nearest neighbor for some object in R. For each object in

this subset of S (∀u ∈ U, U ⊆ S), there exists some object in R (∃v ∈ R) such that u is the

nearest neighbor of v. The answer to an RNN join query is defined as:



RNN(R, S) = {U : U ⊆ S, ∀u ∈ U, ∃v ∈ R, D(v, u) ≤ D(v, t) ∀t ∈ S}

A variant of this query, the RkNN query, retrieves the set of all objects in S which are one of

the k nearest neighbors of some object in R.

An example of a reverse 2NN query is given in Figure 1-2. It returns {2}, which is one of

the two nearest neighbors of each of the objects {a, b, c} ⊆ R. For an example application of the RkNN query,

consider a decision support system. Suppose that people usually dine at one of the k nearest

restaurants. An entrepreneur who wants to invest in the Mexican restaurant business would want to

know the potential customers who have a Mexican restaurant as one of the k nearest restaurants.

In this application, given the locations of all customers and locations of all restaurants, the

RkNN join query would return the locations of the customers who have at least one Mexican

restaurant among their k nearest ones. These customers are likely to visit such a restaurant due to

its geographical proximity.

In an ANN query, given R and S, the goal is to retrieve for every object in R its nearest

neighbor in S. The answer to an ANN join query is defined as:



ANN(R, S) = {⟨r, s⟩ : r ∈ R, s ∈ S, D(r, s) ≤ D(r, v) ∀v ∈ S}

The ANN query is not commutative, i.e., in general ANN(R, S) ≠ ANN(S, R). In a

variant of this query type, called the closest-pair query [19], given a positive integer k, the goal is









to retrieve the k object pairs from the two databases that are closest among all object pairs from the

ANN query.

An example ANN query is given in Figure 1-2. It returns the set of object pairs {{a, 1}, {b, 2}, {c, 4}}.

For an example application, consider a geographic information system. The user of a tourist in-

formation system may require information about the nearest gas station for each attraction. In this

application, given the locations of all gas stations and locations of all attractions, the ANN join

query would return the location of the nearest gas station to each attraction.

1.2 Performance Issues

There are two issues that make similarity queries hard: 1) even for small database sizes,

distance computations can be very expensive, and 2) for larger databases, even with inexpensive

distance computations, limitations in available main memory (when the database size is much

larger than the main memory) can result in costly similarity join queries. The remainder of this

section describes these two performance issues in detail.

Even though the databases over which similarity search queries are performed are not al-

ways huge (and are sometimes small enough to be stored in several gigabytes of main memory),

tremendous expense is associated with computing the distance from a query to each of the data

objects. This renders sequential scan, the access method most applicable to main-memory resident

data sets, infeasible for similarity search queries. For example, consider searching an

image database using the Earth Mover's Distance (EMD) [71] as the similarity measure. Computing

EMD is an expensive linear programming problem and takes about 40 seconds to compare two

given images. A small database of about 4000 images can be easily loaded into the memory.

However, due to the high complexity of EMD, even a single search on this database can take up

to 45 hours. As another example, consider a less complex measure, the edit distance (ED), used

over DNA databases. Computing the edit distance between two strings requires O(n^2) time [67],

which can translate to seconds of CPU time for long strings. The human genome contains 30

million strings of length 100 [10]. Searching for the substrings similar to a given substring in this

database can take up to an hour.









The second issue that can affect the performance of a query is large database size, particu-

larly for the various types of join queries. The obvious way to join two databases R and S is to

perform one distance computation for each (r, s) ∈ R × S. But if |R| = |S| = one million, this

is 10^12 distance computations! For an example application, consider the decision support

system. An entrepreneur who wants to invest in Mexican restaurants would want to know the

potential customers who have a Mexican restaurant as one of the k nearest restaurants. If R is

the set of all houses in the country and S is the set of all restaurants, the similarity join query

requires comparison of all pairs of locations from R and S. The databases are too big to be

loaded into the memory. This results in pages being fetched again and again during the search

and hence increases the I/O costs. The sort-based and hash-based methods [33, 38, 54] for han-

dling large databases do not work here; they work only for exact-match queries (not similarity

join queries). Thus, even with inexpensive distance computations, new indexing techniques are

needed.

1.3 Contributions and Organization

In this dissertation, the above two performance issues in metric databases are addressed. The

following sections present the contributions of this dissertation.

1.3.1 Similarity Search Query

The problem of reducing the number of distance computations needed to obtain the results

of a similarity search query is considered. The work presented in this dissertation tries to solve

this problem using several reference-based indexing methods [35, 48, 99]. In the reference-based

indexing approach, a small fraction of the database objects, referred to as the set of reference objects,

are selected. The distances between references and database objects are pre-computed and

stored as an index. Given a query, the search algorithm computes the distance from each of the

references to the query object. Without any further comparisons, objects that are close to or far

away from the reference can be removed from the candidate set based upon those distances.
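The following is a minimal sketch of this idea under simplifying assumptions (every object is indexed against every reference, and the query is a range query); the names are illustrative and this is not the actual index of Chapter 4, which assigns a possibly different reference set to each object.

    def build_index(database, references, dist):
        """Pre-compute D(o, v) for every object o and reference v (done once, offline)."""
        return [[dist(o, v) for v in references] for o in database]

    def range_search(database, references, index, q, eps, dist):
        """Answer a range query, using the triangle inequality to skip distance computations."""
        q_to_ref = [dist(q, v) for v in references]       # one computation per reference
        result = []
        for i, o in enumerate(database):
            # Lower bound on D(q, o): the largest |D(q, v) - D(o, v)| over the references.
            lower = max(abs(qr - orr) for qr, orr in zip(q_to_ref, index[i]))
            if lower > eps:
                continue                                   # pruned without computing D(q, o)
            if dist(q, o) < eps:                           # refine the surviving candidates
                result.append(o)
        return result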









For a similarity search query in a database with a complex similarity measure, three primary

problems are addressed when using reference-based indexing: 1) selection of references, 2)

assignment of references and 3) reference selection for dynamic databases.

References are selected such that they represent all parts of the database. Two novel

strategies for the selection of references are proposed. They are:

Maximum Variance: This approach is based on the following two rules: i) If a query q

is close to a reference v, the objects far away from v can be pruned and objects close to

q are added to the result set, and ii) If q is far from v, the objects close to v are pruned.

These observations imply that the spread of the database objects around a reference is a

good measure for reference quality. In the Maximum Variance method, for each reference

candidate, the distance to each database object is computed. Then the variance of these

distances is determined. References with high variances that can prune objects not pruned

by the others are then selected; a small illustrative sketch of this selection idea follows this list.

Maximum Pruning: The second approach is based upon a combinatorial calculation that

specifically selects the references to maximize the pruning of objects. It is based on the

following two rules: i) select each reference based on the number of objects it can prune,

and ii) Objects not pruned by one reference should be pruned by at least one of the other

references. Since the complexity of the method increases with the database size, sampling

techniques are used to speed up the process.
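As promised in the Maximum Variance bullet above, here is a simplified sketch of variance-based selection. It keeps only the first criterion (largest spread of distances) and ignores the complementarity criterion that the actual algorithm of Chapter 4 also enforces; all names are illustrative.

    from statistics import pvariance

    def select_references_by_variance(database, candidates, k, dist):
        """Pick the k candidates whose distances to the database objects have the largest variance."""
        def spread(v):
            return pvariance([dist(v, o) for o in database])
        return sorted(candidates, key=spread, reverse=True)[:k]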

The second problem is the mapping of references to database objects. In the proposed

solution, the number of references is larger than in other methods. In order to keep the number of

references assigned to each object manageable, only the best references for each given database

object are used to index it. Thus, each database object may have a different set of references

assigned to it.

The third problem is reference-based indexing for dynamic databases. For such databases,

two incremental variations of the Maximum Pruning strategy are proposed. After insertion

of a new object, these incremental methods update the reference set using the distribution of









actual queries. The actual queries are sampled using the reservoir algorithm [93]. Realistic

environments are simulated, where the distribution of the database objects and queries changes as

the application and user base evolve over time.
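For reference, the reservoir algorithm mentioned above can be sketched as follows; this is a standard formulation of the technique, not the exact code used in Chapter 5. It maintains a uniform random sample of a fixed size over an arbitrarily long stream of queries.

    import random

    def reservoir_sample(stream, size):
        """Return a uniform random sample of `size` items from a stream of unknown length."""
        reservoir = []
        for n, item in enumerate(stream):
            if n < size:
                reservoir.append(item)
            else:
                j = random.randint(0, n)     # item n survives with probability size / (n + 1)
                if j < size:
                    reservoir[j] = item
        return reservoir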

Papers describing the proposed indexing techniques have already been published. The

approaches for static databases from Chapter 4 are from the paper co-authored by Deepak

Lachwani, Tamer Kahveci and Christopher Jermaine and were originally published in the 2006

Very Large Data Base (VLDB) conference [91]. An extension of this work on reference indexing

with approaches for dynamic databases has been accepted by the Very Large Data Base Journal

(VLDBJ) [90].

1.3.2 Similarity Join Query

For similarity join queries, we first consider the problem of high I/O costs due to the

redundant disk fetches in large databases. Three strategies are proposed to index and process

the disk pages. These strategies reduce this redundancy as well as the overall I/O costs for

the queries. For the problem of high CPU costs due to a large number of complex distance

computations, several optimizations are proposed that reduce the number of actual objects that

are compared.

In this dissertation, all of the following ideas are introduced for the first time and have not been

proposed elsewhere:

The Generalized Nearest Neighbor (GNN) Query: A new database primitive called the

GNN is defined. A GNN query finds the set of all objects S′ ⊆ S that appear in the

kNN set of at least t objects of R, where t is a cutoff threshold. It provides a generalized

framework for the different kinds of nearest neighbor join queries: kNN, ANN and

RkNN. A naive illustrative sketch of this definition appears after this list.

Priority Table: A global-coarse-grain view of the objects in the data space using a Priority

Table (PT) is proposed. The PT defines the object pairs that (potentially) need to be

compared to answer a given GNN query.









Search Strategies: The problem of redundant disk seeks is addressed by proposing three

search strategies, Fetch All (FA), Fetch One (FO) and Fetch Dynamic (FD). The first

algorithm, FA, is a pessimistic approach providing as many candidate pages as possible

from S for each page of R. The second algorithm, FO, is an optimistic approach that

returns one candidate page at a time from S for each page of R. The third algorithm, FD,

dynamically decides the number of pages that needs to be fetched for each page of R by

analyzing query history.

Search Optimizations: The problem of expensive distance computations is addressed by

proposing three optimizations. The first one, called the row filter, prunes the pages of S

that are not in the kNN set of sufficiently many objects in R. The second one, called the

column filter, prunes the pages of R whose nearest neighbors do not contribute to the

result. The third optimization, called the adaptive filter, prunes the objects of S that are

too far from the objects of R. The costs are reduced further by pre-processing the input

databases using a packing technique called Sort-Tile-Recursive (STR) [76].
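As noted in the GNN bullet above, the following naive sketch spells out the GNN semantics without any of the proposed machinery (Priority Table, search strategies, filters or packing); it simply counts, for each object of S, in how many kNN sets of R it appears. The names are illustrative, and the objects of S are assumed to be hashable.

    import heapq
    from collections import Counter

    def gnn_naive(R, S, k, t, dist):
        """Return the objects of S that appear in the kNN set of at least t objects of R."""
        counts = Counter()                       # assumes objects of S are hashable (e.g., tuples)
        for r in R:
            for s in heapq.nsmallest(k, S, key=lambda x: dist(r, x)):
                counts[s] += 1
        return {s for s, c in counts.items() if c >= t}

For example, under the reading that neighbors are taken from S, setting t = 1 yields exactly the RkNN join result defined earlier, which is the sense in which GNN generalizes the join queries of Chapter 1.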

The paper describing the proposed work on generalized nearest neighbors for similarity join

queries has already been published. The methods and experiments presented in Chapter 6

are from the paper co-authored by Tamer Kahveci and Orhan Camoglu, which was

originally published in the 2006 Extending Database Technology (EDBT) conference [89].

1.3.3 Dissertation Organization

The rest of the dissertation is organized as follows. Chapter 2 describes the metric

space model with a few examples. Related work is presented in Chapter 3. Chapter 4 presents

strategies for selecting the references and how these references can be used to answer a similarity

search query in a static database. Chapter 5 presents the reference selection strategies for

databases with frequent updates and varying query distributions. Chapter 6 presents a generalized

nearest neighbor framework for large databases using the three search strategies. The conclusions

are given in Chapter 7.









CHAPTER 2
METRIC SPACE MODEL

Similarity queries are needed in many diverse applications such as geographic in-

formation systems (GIS), multimedia database systems (MDS), text retrieval and

computational molecular biology. Thus, similarity search is one of the most studied topics in the context of

databases. These applications often use a metric similarity measure such as the Euclidean Dis-

tance, the Edit Distance and the Earth Mover's Distance to compute the dissimilarity between

two objects. This chapter presents the concept of metric measures and elaborates on several

measures that are commonly used in recent databases.

Given any two objects a and b from the database, a distance measure D() is said to be a

metric if it satisfies the following four properties:

D(a, b) = D(b, a) (Symmetry)

D(a, b) > 0 for a ≠ b (Non-Negativity)

D(a, a) = 0 (Reflexivity)

D(a, c) ≤ D(a, b) + D(b, c) for any object c (Triangle Inequality)

If a measure does not satisfy the strict non-negativity property, the measure is called a

pseudo-metric. The triangle inequality is a key property of metric databases that

has been used effectively by indexing techniques. The following are some of the commonly used

measures in spatial, text and image databases, respectively.
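As a small illustrative aid (not part of the original text), the four properties above can be spot-checked on sampled object triples; a failed check proves a measure is not a metric, while a passing check is only evidence. All names below are hypothetical.

    import random

    def looks_like_metric(objects, dist, trials=1000):
        """Spot-check the four metric properties on random triples of objects."""
        for _ in range(trials):
            a, b, c = (random.choice(objects) for _ in range(3))
            if dist(a, b) != dist(b, a):                        # symmetry
                return False
            if dist(a, b) < 0 or dist(a, a) != 0:               # non-negativity, reflexivity
                return False
            if dist(a, c) > dist(a, b) + dist(b, c) + 1e-9:     # triangle inequality (with tolerance)
                return False
        return True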

2.1 Euclidean Distance

The Euclidean Distance (EuD) is used to measure the proximity of database objects in

vector space. A vector space can be seen as a metric space where objects are represented as arrays

of real numbers. Given two vectors R = {r_1, r_2, ..., r_n} and S = {s_1, s_2, ..., s_n}, the EuD

between them is defined as:



EuD(R, S) = sqrt( Σ_{i=1}^{n} (r_i − s_i)^2 )









P: ACGTACGTAC GT

Q:A GTACCTACCGT

Figure 2-1. Edit Distance Example.


The complexity of EuD is linear in terms of the number of terms or the dimensionality of

the vector.
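A direct transcription of this formula, for illustration only (names are hypothetical):

    from math import sqrt

    def euclidean_distance(r, s):
        """Euclidean distance between two equal-length vectors; O(n) in the dimensionality n."""
        return sqrt(sum((ri - si) ** 2 for ri, si in zip(r, s)))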

There exist many techniques for indexing in vector spaces [? ]. These methods perform

well up to about twenty dimensions. Beyond that, their performance degrades with the increase in the

number of dimensions, which is referred to as the curse of dimensionality, and they become inefficient in practice.

Several applications with objects represented in vector space, such as GIS and decision support

systems, use EuD as the distance measure.

2.2 Edit Distance

An Edit Distance (ED) is used to measure the similarity between two strings. Given two

strings P and Q, the ED between P and Q is the minimum number of edit operations required to

transform P into Q. For each letter P_i ∈ P, an edit operation refers to one of the following

three: a) insert a new character after the letter P_i, b) delete the letter P_i, and c) replace the letter

P_i with another.

Figure 2-1 presents two strings P and Q and the ED between them. The two strings,

each having 12 characters, can be optimally aligned at a cost of 3 (two insertions and one

replacement).

The dynamic programming solution to find the ED and the alignment between two strings

runs in O(n^2) time and space [67], where n is the average length of a string. If ε is the allowable

number of transformations, the space and time complexity for the bounded version of this

problem is O(nε) [4, 6, 65, 87]. In many applications, ε = O(n), thus making the complexity of

the bounded version O(n^2) [42, 50]. Several applications, such as spell checking and computational

molecular biology, use the ED to compare two given strings.
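The O(n^2) dynamic program referred to above can be sketched as follows; this is a minimal, illustrative version with unit costs, not the bounded O(nε) variant.

    def edit_distance(p, q):
        """Classic dynamic-programming edit distance with unit-cost operations."""
        m, n = len(p), len(q)
        prev = list(range(n + 1))                  # distances for the empty prefix of p
        for i in range(1, m + 1):
            curr = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if p[i - 1] == q[j - 1] else 1
                curr[j] = min(prev[j] + 1,         # delete p[i-1]
                              curr[j - 1] + 1,     # insert q[j-1]
                              prev[j - 1] + cost)  # match or replace
            prev = curr
        return prev[n]

With the two strings of Figure 2-1, edit_distance("ACGTACGTACGT", "AGTACCTACCGT") returns 3, matching the alignment cost stated above.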



















Figure 2-2. EMD Example: Image1.
Figure 2-3. EMD Example: Image2.
Figure 2-4. EMD Example: Transformed Image1.

2.3 Earth Mover's Distance
An Earth Mover's Distance (EMD) [71] is a measure of work needed to transform the
histogram distribution of one image to another. Given two distributions, one can be seen as a
mass of earth properly spread in space, while the other distribution can be seen as a collection of
holes in that same space. It can always be assumed that there is at least as much earth as needed
to fill all the holes to capacity by switching the earth and the holes if necessary. Then, the EMD
measures the least amount of work needed to fill the holes with the earth. Here, a unit of work
corresponds to transporting a unit of the earth by a unit of the (ground) distance.
Figures 2-2 to 2-4 present an example of the EMD similarity measure. Here the black
and grey bars represent the two given simple histogram distributions (Figures 2-2 and 2-3).
The EMD is given by the least amount of work needed to convert the distribution in Figure 2-2 to
that in Figure 2-4, i.e., transforming the region represented by the white block in the first bin to the
grey region represented by the second bin.
The EMD involves solving a linear-programming-based transportation problem [31], which
may take a long time. For example, for 256-dimensional features extracted from images that
are partitioned into 8 × 12 tiles, each EMD computation takes about 40 seconds, so a similarity search on









a database of 4,000 images can take almost 45 hours to complete. The EMD has been used

extensively in medical image applications. Here images are represented by multi-dimensional

distributions of the histograms of their features such as color, shape and texture.
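The general EMD requires a linear-programming solver, but for the special case of one-dimensional histograms with equal total mass and unit ground distance it reduces to accumulating the earth carried between neighboring bins. The sketch below covers only that special case and is illustrative; it is not the EMD implementation used in the experiments.

    def emd_1d(hist1, hist2):
        """EMD between two 1-D histograms of equal total mass (unit ground distance)."""
        assert abs(sum(hist1) - sum(hist2)) < 1e-9, "histograms must have equal total mass"
        work, carried = 0.0, 0.0
        for h1, h2 in zip(hist1, hist2):
            carried += h1 - h2          # earth carried over to the next bin
            work += abs(carried)        # moving it one bin costs |carried| units of work
        return work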









CHAPTER 3
RELATED WORK

Different indexing schemes have been proposed to speed up the similarity search queries.

The basic idea in these methods is to use the triangle inequality property of the metric measure to

filter out the database objects that can be proved to be far enough from the query objects, without

doing the actual distance computations. These techniques in general follow a two-

step filter-refine process:

Filter Phase: This phase determines the collection of objects that can be the candidates for

the result set, referred to as a candidate set.

Refinement Phase: The actual similarity of each object in the candidate set with the query

is computed to eliminate the false positives and to find the answer to the query.

The evolution of the index structures discussed in this section is given in Figure 3-1, based

on the taxonomy of several multi-dimensional index structures given by Gaede and Gunther [37].

The indexing techniques can be classified under two categories: Partition-based indexing

techniques and Reference-based indexing [28, 45] techniques.

Numerous methods employ index structures to filter unpromising database objects quickly.

One of the important classes of index structures partitions the data space into hierarchical

sets. Section 3.1.1 discusses the index structures that belong to this class with a focus on the R-

Tree. Section 3.1.2 discusses how the R-Trees are packed to improve the query performance.

Section 3.2 presents the existing methods for NN queries based on partition-based approach.

Section 3.3 presents another important class of index structures, namely reference-based

indexing. Section 3.4 discusses methods that are specific for the text retrieval application.

3.1 Partition-based Indexing Techniques

The first class of index structures, partition-based techniques, divide the data space into

sets. Objects in each set are close to each other in the data space. Each set is represented by its

bounding box or region, so that comparing a query object with the representative box or region

has good chance of eliminating the objects in them without any further comparisons. Generally,

these methods are hierarchical in nature, i.e. the sets are recursively divided into subsets.














Figure 3-1. Evolution of Index Structures in Metric Spaces.


Several index structures have been proposed in literature for storage and retrieval using

partition-based indexing techniques. Some of the popular ones include the kd-tree [11, 12], the

R-tree [43], quadtrees [36, 74], the R+-Tree [79], the R*-Tree [68], the X-Tree [14] and the

SR-Tree [53]. These techniques make extensive use of coordinate information to group and

classify points in the space. For example, kd-trees divide the space along different coordinates

and R-trees group points in hyper-rectangles.

3.1.1 R-Tree and its Variants

This section focuses on the R-Tree structure. R-Tree is one of the most important partition-

based index structures. It is one of the most common hierarchical structures for indexing spatial

objects in high-dimensional spaces. A number of R-Tree variants have evolved over the

years.

R-Tree [43] is a multi-dimensional generalization of the B-Tree [8]. Similar to B-tree, R-tree

is a height-balanced tree with index records in its leaf nodes containing pointers to data objects.











It uses the minimum bounding rectangle (MBR) as a representation of the underlying objects.

MBR of a group of objects is the smallest rectangle enclosing them. Entries in the leaf nodes are

of the form (MBR, ObjPtr), where MBR represents the bounding box of the object pointed by

ObjPtr. The entries in non-leaf nodes are of the form (MBR, ChildPtr), where MBR represents

the bounding boxes of the child node pointed by ChildPtr. Let M be the maximum number

of entries possible in a node, let m < M/2 represent the minimum number of entries in a node

and d the number of dimensions. This lower bound m prevents the degeneration of the tree and

ensures efficient storage utilization. If the number of entries in a node falls below m, the node is

deleted and the rest of its entries are distributed among the sibling nodes.

By storing the bounding boxes of geometric objects such as points, polygons and more

complex objects, an R-Tree can be used to determine which objects intersect a given query

region. As each node has at least m entries, the height of an R-Tree storing N objects is at most

⌈log_m N⌉ − 1. Generally, nodes will tend to have more than m entries, which will decrease the height

of the tree and improve space utilization.

In an R-Tree, all data objects that overlap with the query region are searched to retrieve

objects in the query region. When a node is searched, more than one subtree may need to be

traversed. Thus, it is not possible to guarantee good worst-case performance. With efficient

update algorithms, the tree will be maintained in such a form so as to eliminate the irrelevant

regions (regions that are away from the query region) and examine only data near the search area.

The search algorithm descends down the tree, at each level selecting those entries whose MBRs

overlap with that of the query. When a leaf-node is reached, entries whose MBRs overlap with

the MBR of the query region, are selected.

A variant of R-Tree, the R+-Tree [79], ensures that there exists only one path while

searching for a data object. This is done by splitting the overlapping rectangles and increasing

the storage required by having duplicate entries. Another variant of R-Tree, the X-Tree [14]

was designed for high-dimensional objects with overlap-free split according to a split history

and supernodes. A supernode is an oversized node, which prevents overlap when an efficient









split axis cannot be determined. R*-Tree [68] is an improved variant of the R-Tree. In addition

to using criteria like margin, area and overlap, it uses the concept of forced-reinsertion to re-

organize the structure for better storage utilization. The SS-Tree [96] is an index structure

designed for similarity indexing for multi-dimensional data. It is an improvement of R*-Tree,

but uses bounding spheres instead of bounding rectangles and uses a modified forced re-insertion

algorithm. Unlike the R*-Tree, the SS-Tree re-inserts entries only when the entries in a node have not already been

re-inserted. The SR-Tree [53] can be viewed as a combination of SS-Tree and R*-Tree. It uses

the intersection between the bounding sphere and the bounding rectangle. This outperforms both

SS-Tree and R*-Tree. The size of the directory entry is increased significantly by this approach.

An extensive survey of these methods has been given by Samet [74], Gaede and Gunther [?

] and Bohm et al. [18]. Unfortunately, the existing techniques are very sensitive to the data

space dimension. Similarity queries have an exponential dependency on the dimension of

the space, referred to as the curse of dimensionality. For this reason, despite its complexity,

researchers prefer metric-space indexing for complex data types.

3.1.2 The STR-Ordering

For applications such as multimedia database systems using static databases, pre-processing

the database objects and packing the partition-based index structure, say R-Tree, yields better

space utilization and improved query performance. Packing refers to grouping of similar objects

(objects within a close neighborhood) together. The problem of grouping the objects necessitated

the usage of several heuristics. Space-filling curves such as Z-Ordering [73] and Hilbert-

Ordering [51] transform the data objects into a one-dimensional space and order them. There

exist several extensive works on space-filling curves [17, 73, 75]. This section describes

another heuristic, the STR-Ordering proposed by Leutenegger et al. [76].

The Sort-Tile Recursive (STR) is a method to order the MBRs of R-Tree based static

databases. Let N be the number of d-dimensional objects and b the number of entries per node of

the R-Tree. In two dimensions, the data space is divided into roughly √(N/b) vertical slices so that each slice contains enough

rectangles to pack roughly √(N/b) nodes. Then, the entries are recursively partitioned. The following

steps are involved in packing (a minimal sketch of the two-dimensional case follows this list):

Determine the number of leaf pages P = ⌈N/b⌉ and let S = ⌈P^{1/d}⌉ be the number of slices.

Sort the MBRs by the first dimension of the center point and partition them into S vertical

slices. A slice consists of a run of ⌈P^{(d−1)/d}⌉ × b consecutive MBRs from the sorted list. The

last slice may contain fewer than ⌈P^{(d−1)/d}⌉ × b rectangles.

Each slice is now processed with the remaining d − 1 coordinates and this continues

recursively.
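As noted above, the following is a minimal sketch of this ordering for the two-dimensional case, with rectangles represented by their center points; the names and the node capacity b are illustrative.

    from math import ceil, sqrt

    def str_order(centers, b):
        """Order 2-D center points with Sort-Tile-Recursive packing."""
        P = ceil(len(centers) / b)            # number of leaf pages
        S = ceil(sqrt(P))                     # number of vertical slices
        by_x = sorted(centers, key=lambda p: p[0])
        slice_size = S * b                    # each slice holds enough points for about S leaves
        ordered = []
        for i in range(0, len(by_x), slice_size):
            vertical_slice = by_x[i:i + slice_size]
            ordered.extend(sorted(vertical_slice, key=lambda p: p[1]))   # sort each slice by y
        return ordered

Consecutive groups of b points in the returned order are then packed into the same leaf node.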

The STR method orders the database objects such that similar objects are grouped together

and packed in either the same or neighboring nodes. By effectively pruning the objects using

their MBRs, the number of distance computations of objects and hence the CPU time of the

search can be reduced significantly. Every node in the index structure, say the R-Tree, is packed up

to its capacity. This reduces the number of nodes and hence the height of the underlying R-Tree.

Each node is stored in a disk page. Reduction in the number of nodes decreases the number of

pages fetched from the disk and thus the I/O cost of a search improves dramatically.

3.2 Nearest Neighbor Queries

The commonly used nearest neighbor queries are: a) the k-Nearest Neighbor (kNN) query, b) the Reverse

Nearest Neighbor (RNN) query and c) the All Nearest Neighbor (ANN) query.

3.2.1 k-Nearest Neighbor Query

A number of index-based methods have been developed for kNN queries. Hjaltason and

Samet [44] used the PMR quadtree to index the search space. They search this tree in a depth-first

manner until the k nearest neighbors are found. The PMR quadtree is an edge-based variant of the PM quadtree that

uses a probabilistic splitting rule.

Roussopoulos et al. [70] employed a branch-and-bound R-Tree traversal algorithm. It

uses the MINDIST and MINMAXDIST to order the MBRs of the R-Tree. Given two MBRs,

the MINDIST gives the minimum distance between them while MINMAXDIST guarantees the

presence of at least one object in the second MBR. The k nearest objects are stored in a buffer in









sorted order. The MBRs having a distance higher than the current kth largest distance are pruned

during kNN search.
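For intuition, a common formulation of MINDIST, the minimum distance between a query point and an axis-aligned MBR, can be sketched as follows; this is illustrative only (the discussion above also considers MBR-to-MBR distances), and the function name is hypothetical.

    def mindist(q, low, high):
        """Squared minimum distance from point q to the axis-aligned box [low, high]."""
        d = 0.0
        for qi, lo, hi in zip(q, low, high):
            if qi < lo:
                d += (lo - qi) ** 2
            elif qi > hi:
                d += (qi - hi) ** 2
            # otherwise this coordinate lies inside the box and contributes nothing
        return d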

The two-phase method [56] gets the resultant objects in two phases. In the first phase, a

kNN search is performed using the distance function on feature vectors. The actual distance

to these k objects are computed and the maximum value is determined. In the second phase, a

range query on the index returns all objects within this maximum distance from the same feature

distance function. For all the objects from the result set of the range query, the actual object

distance is computed and kNNs are returned.

Seidl and Kriegel [78] proposed a method that runs in multiple phases, iteratively updating

the kNN distance. In the first phase, it sorts the objects in increasing order of their feature

distance. Then for every object from the sorted list with distance less than the current kNN

distance, the actual query-to-object distance is computed and the kNNs are updated.

Berchtold et al. [13] divide the search space using Voronoi cells. Their method first pre-computes the

result of any nearest-neighbor search, which corresponds to a computation of the Voronoi cell

of each database object. Then, the Voronoi cells are stored in an index structure efficient for

high-dimensional data spaces. As a result, nearest neighbor search corresponds to a simple point

query on the index structure. Beyer et al. [15] show that for a broad set of data distributions

most of the known kNN algorithms run slower than sequential scan. Thus, despite its simplicity,

sequential scan still remains a formidable competitor to index-based kNN methods.

3.2.2 Reverse-Nearest Neighbor Query

Given a query q, Reverse Nearest Neighbor (RNN) retrieves all objects that have q as their

nearest neighbor. Korn et al. [55] introduced the Reverse Nearest Neighbor (RNN) problem.

They pre-compute the nearest neighbor of all the objects in the database and generate the nearest

neighbor circles for the objects. Then during search, all objects which have the query in their

nearest neighbor circle are retrieved. Because the RNN-tree is optimized for RNN, but not NN

search, Korn et al. [55] use an additional (conventional) R-tree on the data points for nearest

neighbors and other spatial queries.









In order to avoid the maintenance of two separate structures, Yang and Lin introduced the

RdNN-tree for RNN queries [98]. Similar to the RNN-tree, a leaf node of the RdNN-tree contains

vicinity circles of data points. On the other hand, an intermediate node contains the MBR of the

underlying points (not their vicinity circles), together with the maximum distance from every

point in the sub-tree to its nearest neighbor. As given in their experiments, the RdNN-tree is

efficient for both RNN and NN queries because, intuitively, it contains the same information as

the RNN-tree and has the same structure (for node MBRs) as a conventional R-tree.

The method of Stanoi et al. [82] does not use pre-computation and extends an existing solution to the bi-chromatic RNN query. The basic idea for finding the RNNs of an object is to dynamically construct the influence region, or Voronoi cell, by examining the R-tree of the database objects. Here, an influence region is defined as a polygon in space which encloses the locations that are closer to the query object than to any other object. Once the influence region is computed, a range query in the R-tree of objects is performed to locate the RNNs of the object.

Tao et al. [84] generalize the RNN problem to an arbitrary number of NNs using a filter-and-refine approach. The method uses a hierarchical tree-based index structure such as the R-Tree to compute a nearest neighbor ranking of the query object. The key idea is to iteratively construct a Voronoi cell around the query object from the ranking. Objects that are beyond k Voronoi planes with respect to the query can be pruned and need not be considered for Voronoi construction. The remaining objects must be refined, i.e., for each of these candidates, a k-nearest neighbor query must be launched.

3.2.3 All Nearest Neighbor Query

Despite its wide use in many areas, the All Nearest Neighbor (ANN) query is the least studied NN query type. The MuX method [19] uses a two-level index structure called the MuX index. At the top level, the MuX index contains large pages (or MBRs). At the next level, these pages contain much smaller buckets. For each bucket from R, it computes a pruning distance as it scans the candidate points from S. It prunes the pages, buckets, and points of S beyond this distance for each bucket of R. GORDER [97] is a block nested loop join method. It first reduces the dimensionality of R and S

















Figure 3-2. Reference-based Indexing Example.


by using Principal Component Analysis. It then places a grid on the space defined by the reduced

dimensions and hashes data objects into grid cells. Later, it reads blocks of data objects from grid

cells by traversing the cells in grid order and compares all the objects in pairs of grid cells whose

MINDIST is less than the pruning distance defined by the kth NN.

3.3 Reference-based Indexing Techniques

In a reference-based indexing method, a small fraction of database objects, referred to as the

set of reference objects, are selected. The distances between references and database objects are

pre-computed and stored as an index. Given a query, the search algorithm finds the distance from

each of the references to the query object. Without any further comparisons, objects that are too

close to or too far away from a reference are removed from the candidate set based upon those

distances with the help of the triangle inequality.

Figure 3-2 illustrates the reference-based indexing in a hypothetical two-dimensional space.

Here, the database objects are represented by points. The distance between two points in this

space corresponds to the underlying distance between the two objects (e.g., if the points denote

strings, rdist1 between the points ref1 and p corresponds to the edit distance between the strings represented by them). In reference-based indexing, the distances between the object p and references ref1 and ref2 are pre-computed. Let rdist1 and rdist2 be these two pre-computed distances, respectively. Given a query q with range r, the first step is to compute the query-to-reference distances qdist1 and qdist2. A lower bound for the distance between q and an object p with reference ref1 is computed as bound1 = |qdist1 - rdist1| using the triangle inequality. Similarly, bound2 gives a lower bound for the distance between q and p with reference ref2. Since bound1 > r with ref1 as the reference, object p can be pruned from the candidate set of q.

Let V = {v1, ..., vm} denote the set of reference objects, where V ⊆ S and |V| = m is the number of references. The distances from the references to the database objects are pre-computed and are given by the set {D(si, vj) : (∀si ∈ S) ∧ (∀vj ∈ V)}. This is a one-time cost for the database.

During search, the algorithm first computes the distances from the query q to the references.

For each object s, a lower bound (LB) and an upper bound (UB) for D(q, s) are defined as:

LB = max_{v ∈ V} { |D(q, v) − D(v, s)| }

UB = min_{v ∈ V} { D(q, v) + D(v, s) }

LB and UB are used in reference-based indexing as follows. For each (q, si) pair:

If r < LB, add si to the pruned set.

If r > UB, add si to the result set.

If LB ≤ r ≤ UB, add si to the candidate set.

After this pruning, only the objects in the candidate set are compared with q using the costly

comparison operation to complete the query.
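As a concrete illustration of these rules, the short Python sketch below classifies a single (q, s) pair for a range query of radius r, given the pre-computed reference-to-object distances and the query-to-reference distances; the function and argument names are illustrative only.

def classify(r, q_to_ref, ref_to_s):
    # q_to_ref[j] = D(q, v_j) and ref_to_s[j] = D(v_j, s) for every reference v_j.
    lb = max(abs(qv - vs) for qv, vs in zip(q_to_ref, ref_to_s))
    ub = min(qv + vs for qv, vs in zip(q_to_ref, ref_to_s))
    if r < lb:
        return "pruned"     # D(q, s) >= LB > r, so s cannot qualify
    if r > ub:
        return "result"     # D(q, s) <= UB < r, so s qualifies without comparison
    return "candidate"      # only the costly exact comparison can decide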

The factors that determine the cost of the strategies used for selection and assignment of

references to database objects are memory and computation time. Let the available main memory

be B bytes and |S| = N be the number of objects in the database. We assume that four bytes are used to store a distance value and that an object uses z bytes of storage on average. Thus, zm bytes are needed to store the m references. For each object s ∈ S and each of its references vi ∈ V, the object-reference mapping is of the form [i, D(s, vi)]. Thus, 8mN bytes are used to store the pre-computed distances for the N objects. The value for m can be obtained by comparing the available memory with the memory needed for storage: B ≥ 8mN + zm.

Let Q be the given query set, t be the time taken for one object comparison, and Cavg be the average number of objects in the candidate set for each query. First, each query object is














Figure 3-3. Example for Omni.


compared with each of the m references. This takes tm|Q| time. Then, for each query, the Cavg candidate objects have to be compared to filter out the false positives. This takes tCavg|Q| time. The total time taken is thus t(m + Cavg)|Q|.
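As a rough, purely illustrative application of this cost model (the numbers below are made up and are not taken from the experiments): with N = 20,000 objects, an average object size of z = 100 bytes, and B = 8 MB of memory, the constraint B ≥ 8mN + zm gives m ≤ 8,000,000 / (8 · 20,000 + 100) ≈ 49 references; if one comparison takes t = 1 ms and Cavg = 1,000, each query then costs roughly t(m + Cavg) ≈ 1.05 seconds.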

Reference-based similarity searches in metric spaces can be divided into two categories:

distance-matrix based and tree-based reference indexing methods. An extensive survey of these

categories is given by Chavez et al. [28] and Hjaltason and Samet [45].

3.3.1 Distance-matrix based Indexing Methods

Several methods have been proposed for selecting references in distance-matrix based

methods. Filho et al. [35] proposed a method called Omni. Omni selects references from the convex hull of the dataset, i.e., it selects objects that are far away from each other. This may achieve poor pruning rates: multiple references prune the same set of objects near the hull, making some references redundant, since such an object could be pruned by any one of them. Furthermore, none of the references prunes the objects that are far from the hull.

Figures 3-2 and 3-3 illustrate this problem. In Figure 3-2, ref1 and ref2 are the references selected using the Omni method. Object p is close to ref1 and far away from ref2. For query q, p can be pruned by both ref1 and ref2, so one of them is a "wasted" reference. On the other hand, objects inside the hull will not be pruned at all, as illustrated in Figure 3-3. Here, the bounds obtained for the query q and object d using the two Omni references do not remove d from the candidate set. Had the reference rand in this figure been selected instead, then it would have been possible to prune d. It is essential to select the references so that each reference is









able to prune a different subset of the database. A similar approach for selecting references in

content-based image retrieval has been proposed by Vleugels et al. [94].

Reiner et al. [57] proposed a spacing-based selection of references. The basic idea is to add database objects to the index based on two criteria. The first criterion is called spacing, where references with a small variance of spacing are assumed to have discriminative power over the database. The second criterion, correlation, reduces the redundancy of the references by using the linear correlation coefficients among the reference objects; it tries to minimize the number of false positives. As soon as either the variance of spacing of one object or the correlation of a pair of objects exceeds a certain threshold, a reference object is replaced by a randomly chosen new reference object. However, no guidelines were given on selecting the spacing and correlation thresholds or the number of reference objects.

Brisaboa et al. [21] proposed the Sparse Spatial Selection (SSS) method, which adapts to updates in dynamic databases. An object becomes a reference if it is located at more than a fraction of the maximum distance with respect to all of the existing references. The number of references selected depends on the intrinsic dimensionality of the underlying database. This approach is dynamic in nature and selects references that are not outliers. But, like Omni, it can result in redundancy, i.e., references that prune the same set of database objects.

Bustos et al. [23] present a criterion to compare the efficiency of two sets of references of the same size. They select a reference set that maximizes the mean of the query-to-object distances. Different reference selection strategies were proposed: a) Selection, which picks, from a set of N random sets of references, the one that maximizes the mean; b) Incremental, which selects a new reference that maximizes the mean of the current set of references; and c) Local Optimum, an iterative strategy that starts with a random reference set and in each iteration replaces the reference that makes the least contribution to the mean. These strategies select references that are far from the other database objects. This approach has the same drawbacks as Omni and does not always result in a good set of references.









The Approximating and Eliminating Search Algorithm (AESA) [72] is simply a matrix with the pre-computed distances between every pair of objects in the database. When answering a query, AESA computes the distance between the query and an arbitrarily selected set of objects, and uses these distances to generate the candidate set. This method has high storage and pre-processing costs. The Linear AESA (LAESA) [64] solves this problem by selecting a small subset of the objects, called Base Prototypes (BP), as references and pre-computing the distances from the other objects. Its efficiency depends on the number of objects selected as BPs and their location with respect to the other database objects. The objects are simply traversed linearly, and those that cannot be eliminated after considering the references are compared directly against the query object. This method has high CPU costs, like AESA.

3.3.2 Reference-based Hierarchical Indexing Methods

Traditional tree-based indexing methods impose a hierarchy on the reference objects that

guides the order of distance computations. These methods use O(log n) references, since they

partition the space hierarchically using one or more references at each level.

The M-Tree [29] is a height-balanced tree with the data objects in its leaf nodes. It aims at

providing dynamic capabilities and good I/O performance with fewer distance computations. A

set of reference objects are selected at each node and objects closer to the reference objects are

organized into a subtree rooted by that reference. Each reference stores its covering radius. The

search algorithm recursively searches the nodes that cannot be pruned using the covering radius

criterion. The main problem with M-Tree is that it has poor selectivity at higher dimensions.

A variation of the M-Tree, the Slim-Tree [86], reduces the amount of overlap between the tree nodes. It uses the slim-down algorithm, which leads to a better tree, and a faster splitting algorithm based on the minimal spanning tree. It also uses a chooseSubtree algorithm that leads to tighter trees, which have fewer disk pages and thus faster retrievals.

These tree-based structures are height balanced and attempt to reduce the height of the tree at

the expense of flexibility in reducing the overlap between the nodes. This constraint was relaxed

in the Density Based Metric Tree (DBM-Tree) [92] by reducing the overlap between nodes in









high-density regions, resulting in an unbalanced tree. An optimization algorithm that reorganizes the objects in the nodes to improve performance is also presented. The DF-Tree [85] selects a global set of representatives in order to prune candidate objects when answering queries. Its pruning power in terms of the number of distance computations is very high, but it is less efficient than the Slim-Tree in terms of the number of disk accesses.

The Vantage-Point tree (VP-tree) [99] partitions the data space into spherical cuts by

selecting random reference points from the data. The goal is to reduce the number of distance

calculations to answer similarity queries in the VP-tree. A variation of the VP-Tree called the

MVP-Tree [20] uses more than one reference at each level. By using many representatives, the MVP-tree requires fewer distance calculations than the VP-tree to answer similarity queries.

Burkhard-Keller Tree (BKT) [22] provides interesting techniques for partitioning a metric

database where the recursive process is materialized as a tree. The first technique partitions a

database by choosing a reference from the database and grouping the objects with respect to

their distance from it. The second technique partitions the original database into a fixed number

of subsets and chooses a reference from each of the subsets. The reference and the maximum

distance from the reference to a point of the corresponding subset are also maintained to support

the nearest neighbor queries.

The Fixed-Queries Tree (FQT) [5] improves the BKT by storing the same reference in all nodes of the same level; the actual objects are stored in the leaves. The advantage of this approach is that some comparisons between the query and nodes are saved during the backtracking that occurs in the tree. The Fixed-Height FQT (FHQT) [5] has all leaves at the same depth, thus making some leaves deeper than necessary. This makes sense because the comparison between the query and the reference of an intermediate level may already have been performed, but it has space limitations similar to the FQT. The Fixed Queries Array (FQA) [27] is a compact representation of the FHQT. It is similar to traversing the leaves of an FHQT of fixed height from left to right. Using the same memory, the FQA is able to use more references than the FHQT, improving its efficiency.









The Spaghettis structure [26] is a dynamic reference-based method. During pre-processing, the distances from the references to the database objects are computed. It reduces the CPU cost while retaining the array structure of the FHQT by sorting each array and saving the permutations with respect to the preceding array. All of these methods lose their linear CPU time when applied to continuous distance functions. Except for the M-Tree based methods, all of the other structures are designed for static databases and are not suitable for large databases.

Pivoting M-Tree (PM-Tree) [80], a hybrid between the two categories, is an extension of

the M-Tree. It combines the hierarchy used by tree-based methods with a distance-matrix based

method. The result is a flexible metric access method providing more efficient similarity search

performance than the M-Tree. But pivot selection requires a part of the database to be known in

advance. A combination of PM-Tree and a slim down algorithm makes an efficient metric access

method.

3.4 String Search Methods

In this section, we discuss some of the existing methods that are specific for string database

searches. In string databases, there are two types of string alignment: Global Alignment and

Local Alignment. Global alignment attempts to align all residues in two strings and is useful when the strings are similar and of approximately the same size. A dynamic programming based algorithm for global alignment was proposed by Needleman and Wunsch [67]. Local alignment is more useful for dissimilar strings that may contain regions of similarity. The Smith-Waterman algorithm [81] is a dynamic programming based solution for the local alignment of two strings. A number of index structures

have been developed to reduce the cost of searches in string databases. They can be classified

under three categories: k-gram indexing, suffix trees, and vector space indexing.

A k-gram is a string of length k, where k is a positive integer [40]. k-gram based methods look for the shortest substrings that match exactly; these strings are then extended to find longer alignments with mismatches and insertions/deletions. k-grams are usually indexed using hash tables. Two of the most well known genome search tools that use hash tables are FASTA [69] and









BLAST [2]. The performance of these tools deteriorates quickly as the size of the database

increases.

Suffix trees were first proposed by Weiner [95] under the name position tree. Later, efficient

suffix tree construction methods [61, 88] and variations [34, 47, 52, 60] were developed.

The suffix tree for a string S is defined as a tree where each path from the root to a leaf node represents a suffix of S. The time complexity of building a suffix tree is O(length(S)). However, there are two significant problems with the suffix-tree approach: (1) suffix trees manage mismatches inefficiently, and (2) suffix trees are notorious for their excessive memory usage [66]. The size of the suffix tree varies between 10 and 37 bytes per letter [32, 47, 52, 62].

The suffix array [60] is basically a lexicographically sorted list of all suffixes of the string S. It was developed to reduce the space consumption of the suffix tree. A binary search on the list gives the matching suffixes.

A number of index structures have been developed to function in vector space, such as

SST [39] and the frequency vectors [49, 50]. The frequency vector of a string stores the number

of letters of each type in that string. This method computes a lower bound to the distance

between two strings using the frequency vectors corresponding to the two strings. It uses this

lower bound to eliminate unpromising strings. However, as the query range increases, frequency vectors perform poorly.
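As an illustration of the idea, the sketch below computes one such frequency-based lower bound for unit-cost edit distance: a single insertion, deletion, or substitution can reduce the surplus letter counts on either side by at most one, so the larger surplus lower-bounds the edit distance (the exact bound used in [49, 50] may differ in its details).

from collections import Counter

def freq_lower_bound(s1, s2):
    # Counter subtraction keeps only the positive differences per letter.
    f1, f2 = Counter(s1), Counter(s2)
    surplus1 = sum((f1 - f2).values())  # letters s1 has in excess of s2
    surplus2 = sum((f2 - f1).values())  # letters s2 has in excess of s1
    return max(surplus1, surplus2)

# freq_lower_bound("ACCGT", "AAAAT") == 3, so a range query with radius 2
# can discard this pair without computing the actual edit distance.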









CHAPTER 4
REFERENCE-BASED INDEXING FOR STATIC DATABASES

An important issue overlooked by the existing methods is that the performance of reference-

based indexing can be improved by selecting references that have a significant number of objects

close to and far from them. The key symbols used throughout this chapter are summarized in

Table 4-1.

Table 4-1. Summary of symbols and definitions.

Symbol      Definition
S           Database of objects
ε           Similarity threshold
q           Query object
N           Number of objects in the database
V           Reference set
m           Number of references in V
t           Time taken to compare a pair of database objects
Q           Sample query set
Gi          Gain in pruning from reference vi ∈ V
f           Size of a sample database
τ           Accuracy of the estimated maximum gain
ψ           Probability of the estimated maximum gain
μ and σ     Mean and variance of the distribution of gains
h           Size of a sample candidate set
k           Number of references assigned to each object
Ic          Index construction time


4.1 Maximum Variance

The variance of the distribution of distances of a reference to other objects is a good

indicator of the spread of objects in the database around that reference, and the first heuristic

uses this observation. The Maximum Variance heuristic assumes that queries follow the same

distribution as the database objects. It selects a reference set that represents the object distribution

of the database. Each new reference prunes some part of the database not pruned by the current

objects in the reference set.

This suggests an algorithm for choosing reference points. For example, in Figure 4-1, points

in two-dimensional space represent the database objects. The database is given by the objects














Figure 4-1. Maximum Variance example.

{a, b, c, d, e, f, g}. Objects are represented by shaded points; e and c are the references selected by Maximum Variance. Objects f and g are close to e and b is far from e. Object d is close to c and object a is far from c. The edit distance between two objects corresponds to the distance between the points representing them. Object e has the highest variance of distances, so e is chosen as the first reference. Objects close to e (objects f and g) and far from it (object b) can be pruned using e as the reference. The objects not pruned by e (objects a, c and d) remain in the candidate set. The object from the candidate set having the next highest variance, c, is selected as the next reference. Reference c can remove the object d close to it and the object a far from it from the candidate set. This process is repeated until all of the references have been selected.

Let L denote the length of the longest object in S. For an object si ∈ S, μi and σi are the mean and variance of its distances to the other objects in S. A cut-off distance, w, is computed to measure the closeness of two objects: sj (sj ∈ S) is close to si if ED(si, sj) < (μi − w), and sj is far away from si if ED(si, sj) > (μi + w). w is computed as a fraction of L, given by w = L · perc, where 0 < perc < 1. sj is not considered as a potential reference if ED(si, sj) < (μi − w) or (μi + w) < ED(si, sj) for some si ∈ V. A large value for perc will include objects that are close to or far away from the existing references. This results in objects being









pruned by multiple references. A small value for perc can remove promising references, resulting

in a small number of references. Experimental results show that perc = 0.15 is a good choice.

4.1.1 Algorithm

input : Object database S, with |S| = N.
        Number of references m.
        Cutoff percentage perc.
        Length of an object L.
output: Set of references V = {v1, v2, ..., vm}.
1  V = {}.  /* Initialize */
2  for si ∈ S do
3      Select a sample set of objects, S1 ⊆ S.
4      Compute Di = {ED(si, sj) | ∀sj ∈ S1}.
5      Compute the mean μi and variance σi of the distances in Di.
6  end
7  w = L · perc.
8  Sort the N objects in descending order of their variances.
9  while |V| < m do
10     V = V ∪ {sl}, where sl is the unselected object with the largest variance.
11     S = S − {sj}, ∀sj ∈ S with ED(sl, sj) < (μl − w) or ED(sl, sj) > (μl + w).
       /* Remove objects close to or far away from the new reference */
12 end
13 Return the set of reference objects, V.
Algorithm 1: Maximum Variance Algorithm.


Algorithm 1 presents the method in detail. For each object si ∈ S, a sample database S1 ⊆ S is selected in Step 3 and the set of distances Di = {ED(si, sj) | ∀sj ∈ S1} is computed (Step 4). The mean μi and variance σi of the distances in Di are computed in Step 5. The objects are then sorted in descending order of their σ values (Step 8). Then, the following steps are repeated until the required number of references is obtained. The object sl with the maximum variance is selected as the next reference and added to V (Step 10). Then the objects of S that are close to or far away from the new reference sl are removed (Step 11). Steps 10 and 11 are repeated until there are enough references, i.e., |V| = m. Each iteration of the algorithm selects a new reference that is neither close to nor far away from the existing references.
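A minimal Python sketch of this selection loop is given below; dist stands for the underlying comparison (e.g., edit distance), and the sampling of S1 is omitted for brevity, so every pairwise distance is computed directly.

import statistics

def maximum_variance(S, m, L, dist, perc=0.15):
    # Pre-compute the mean and variance of each object's distances to the others.
    stats = {}
    for si in S:
        d = [dist(si, sj) for sj in S if sj is not si]
        stats[si] = (statistics.mean(d), statistics.pvariance(d))
    w = L * perc
    remaining = sorted(S, key=lambda s: stats[s][1], reverse=True)
    V = []
    while remaining and len(V) < m:
        ref = remaining.pop(0)       # unselected object with the largest variance
        V.append(ref)
        mu = stats[ref][0]
        # Drop objects that the new reference can already prune (too close or too far).
        remaining = [s for s in remaining if mu - w <= dist(ref, s) <= mu + w]
    return V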









4.1.2 Computational Complexity

Each candidate reference is compared with all objects in the sample S1 (Steps 2–6). This requires O(N|S1|) distance computations, and each distance computation takes O(L²) time. Sorting the variances of the N objects in Step 8 takes O(N log N) time. Steps 9–12 take O(mN) time in the worst case. Thus the overall time of the algorithm is O(N L² |S1| + N log N + mN).

4.2 Maximum Pruning

This section describes the second approach for choosing the reference set. The method, called Maximum Pruning, combinatorially attempts to compute the best reference set for a given

query distribution.

A purely combinatorial approach which tests all possible combinations of references in

order to maximize performance over a given query distribution would be prohibitively expensive.

Exhaustively testing all possible combinations of m references from S takes O(C(N,m) × N × |Q|) time, where C(N,m) denotes the binomial coefficient "N choose m". This is due to the fact that a purely combinatorial approach would need to consider all possible reference sets, one at a time, and for each it would need to compute the pruning power with respect to Q. The method could perhaps be improved a bit by making use of the fact that many of the reference sets to be tested would overlap, but it seems impossible to reduce the complexity of an exact computation below O(C(N,m)). To speed up this computation,

a greedy solution to this problem is proposed. In order to speed the greedy solution even further,

sampling-based optimizations are considered in Section 4.2.3.

4.2.1 Algorithm

Every object in the database is considered as a candidate reference. An initial set of

references is selected. Then the references in the set are refined iteratively. At each iteration,

an existing reference is replaced with a new reference if this modification improves pruning with

respect to the sample query set Q. The algorithm terminates when the reference set cannot be

improved.

Algorithm 2 presents the Maximum Pruning algorithm. S, Q, and m are given as input.

Here, PRUNE(V', q, s) returns true if one of the references in V' can prune s. MAX(P) and










input : Database S.
        Number of references m.
        Sample query set Q.
output: Set of references V = {v1, v2, ..., vm}.
1  Initialize V with the top m references obtained using the Maximum Variance method.
2  S = S − V.
3  repeat
4      G[i] = 0, 1 ≤ i ≤ |S|.  /* Sum of gains for each si ∈ S */
5      foreach [si, sj] pair, ∀si ∈ S, ∀sj ∈ S: let g0 be the number of queries for which sj is
       pruned using the references in V.
6      P[a] = 0, 1 ≤ a ≤ m.  /* Initialize gain for the a-th reference in V */
7      for each [e, q] pair, ∀e ∈ V and ∀q ∈ Q do
8          V' = V − {e} ∪ {si}.
9          if PRUNE(V', q, sj) = 1 then
10             P[e]++.
11         end
12     end
13     if MAX(P) > g0 then
14         G[i] += MAX(P) − g0.
15     end
16     if MAX(G) ≤ 0 then
17         Return V.
18     end
19     Let v = argmax_i(G[i]).
20     Update V with v.
21     S = S − {v}.
22 until true;
Algorithm 2: Maximum Pruning Algorithm.


MAX(G) returns the maximum of the values in P and G respectively. The reference set V is first

initialized as the top m references obtained using the Maximum Variance (MV) method (though

this is not a requirement, since one can start with a random initial reference set). The MV method

selects as references those database objects having a high variance in pairwise distances with

other database objects.

The Do-While loop replaces one existing reference with a better reference during each iteration. An iteration of this loop starts by initializing the array G to zero. Each entry G[i] of G gives the amount of pruning gained by including the ith candidate reference in the reference set. The term gain is used to denote the amount of improvement in pruning. Steps 5–15 iterate over all candidate references si, and the gain obtained by replacing an existing reference with si is computed. This is done as follows. The total number of objects pruned using the existing reference set for all queries in Q is computed in Step 5. The array P is initialized in Step 6. Each entry P[a] of P denotes the number of objects pruned after replacing the ath existing reference with si. Steps 7–12 iterate over all existing references. At each iteration, an existing reference is replaced with si and the total number of objects pruned for all queries in Q is computed. The largest possible gain, obtained by subtracting the number of objects pruned using the original reference set from that of the best possible new reference set, is computed in Steps 13–15. If there is no gain, then the algorithm terminates, returning the current reference set (Steps 16–18). Otherwise, the reference set is updated using the new reference that gives the best possible gain (Steps 19–21).
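The greedy loop can be sketched in Python as follows; prunes(v, q, s) is assumed to apply the triangle-inequality test of Chapter 3 using pre-computed distances, and the bookkeeping of Algorithm 2 (the arrays G and P) is folded into plain counting.

def maximum_pruning(S, Q, V, prunes):
    def pruned_pairs(refs):
        # Number of (query, object) pairs removed from the candidate sets by refs.
        return sum(1 for q in Q for s in S
                   if s not in refs and any(prunes(v, q, s) for v in refs))

    current = pruned_pairs(V)
    improved = True
    while improved:
        improved = False
        best_gain, best_swap = 0, None
        for cand in S:
            if cand in V:
                continue
            for old in V:
                trial = [cand if v == old else v for v in V]
                gain = pruned_pairs(trial) - current
                if gain > best_gain:
                    best_gain, best_swap = gain, (old, cand)
        if best_swap is not None:
            old, cand = best_swap
            V = [cand if v == old else v for v in V]
            current += best_gain
            improved = True
    return V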

4.2.2 Computational Complexity

The algorithm needs access to the distances between all pairs of objects in S. This requires O(N²) comparisons. Note that this is a one-time cost, and is not incurred at each iteration of the Do-While loop. The PRUNE function requires the distances between all (query, object) pairs. The pre-computation of these distances requires N|Q| comparisons. So, the total number of object comparisons is O(N² + N|Q|). Step 5 has to consider O(N²) pairs of objects. For each pair it computes the gain for all queries in Q after replacing the objects in the reference set one by one. This takes O(m|Q|) time, since a new reference can replace any of the m references (Steps 7–12). Thus, the overall time taken by this algorithm is O(N²m|Q|) for each iteration of the Do-While loop. There can be at most N iterations, leading to the worst case complexity O(N³m|Q|).

4.2.3 Sampling-based Optimization

Although the Maximum Pruning algorithm in Section 4.2.1 is faster than a purely combinatorial approach, it is still impractical for large databases. To address this problem, two sampling-based optimizations are proposed that improve the complexity of the algorithm by reducing the number of (object, reference) pairs processed. The first optimization reduces the number of objects and the second reduces the number of references.









input : Database S, |S| = N.
        Sample queries Q, |Q| = q.
        Accuracy threshold ε.
output: G[i], ∀si ∈ S.
1  for each si ∈ S do
2      g1 = g2 = 0.  /* Initialize total and estimate for 2nd moment */
3      a = 0.  /* Initialize sampling fraction */
4      repeat
5          Select a random sj ∈ S, where sj ≠ si.
6          newGain = GAIN(si, sj, Q).
7          g1 += newGain; g2 += newGain².
8          a = a + 1/|S|.
9          σ² = g2/a² − g1²/(|S| · a³).
10     until 2σa/g1 < ε;
11     G[i] = g1/a.
12 end
Algorithm 3: Estimation of Gain.


4.2.3.1 Estimation of gain

The number of objects that must be considered when computing the gain associated with

a new reference point is reduced in the first optimization. This algorithm replaces the gain computation (Steps 4–15) of the

Maximum Pruning algorithm in Algorithm 2. The gain is estimated based on a small sample of

the database rather than the entire database. One of the most important technical considerations

in the design of this algorithm is how to decide whether the gain estimate is accurate enough

based upon the sample.

Algorithm 3 presents the sampling algorithm in detail. S, Q and ε are given as input. It returns, for every candidate reference, the total gain obtained by replacing an existing reference with that candidate. For each candidate reference si, a random object sj ∈ S is selected at every iteration (Step 5). The gain obtained by using si as a reference for sj over the queries in Q is computed in Step 6. The gain is computed as follows: Steps 5–12 of the Maximum Pruning algorithm in Algorithm 2 are executed to compute the total pruning achieved with respect to sj by replacing each existing reference with si; the gain is then given by the best pruning over all possible replacements. The total gain seen as well as the total squared gain seen (which can be used to estimate the second moment of the sampled gains) are updated in Step 7. The sampling fraction is updated in Step 8, and an estimate for the variance of the gain is computed in Step 9. The algorithm terminates when the desired accuracy is reached. The accuracy of the gain estimate is assessed by making use of the Central Limit Theorem, which implies that errors over estimated sums and averages of this type are normally distributed. Since 95% of the standard normal distribution's mass is within two standard deviations of zero, if the current gain estimate g1/a is treated as the true gain and the loop terminates when twice the relative standard deviation is less than ε, then the relative error is less than ε with 95% probability.
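The stopping rule can be sketched as follows in Python; gain_wrt is a hypothetical callable returning the gain of candidate si measured against one sampled object, eps is the target relative error, and the variance estimate is the usual one for a sampled total.

import math, random

def estimate_gain(si, S, Q, gain_wrt, eps=0.1, min_samples=30):
    g1 = g2 = 0.0          # running sum of sampled gains and of their squares
    n = 0
    N = len(S)
    while True:
        sj = random.choice(S)
        if sj is si:
            continue
        g = gain_wrt(si, sj, Q)    # gain of si measured on one sampled object
        g1, g2, n = g1 + g, g2 + g * g, n + 1
        mean = g1 / n
        var = max(g2 / n - mean * mean, 0.0)   # sample variance of the gains
        rel_sd = math.sqrt(var / n) / mean if mean > 0 else 0.0
        if n >= min_samples and 2 * rel_sd < eps:
            break
    return N * mean        # scale the sample mean up to an estimated total gain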

For an average sample size of f, with f << N, the overall complexity becomes O(N²fm|Q|). This is because the algorithm iterates over f objects rather than all N objects while computing the gain.

4.2.3.2 Estimation of largest gain

The Maximum Pruning algorithm uses all database objects as candidate references when

selecting the reference set. The second optimization reduces, with the help of sampling, the number of candidate references processed in each iteration. The goal is to use a small subset

of the database as the candidate references, and yet achieve pruning rates close to Maximum

Pruning.

Formally, the problem is defined as follows. Let G[i] be the gain that can be achieved by including the ith reference in the reference set. Let G[e] be the largest gain (i.e., e = argmax_i{G[i]}). Given two parameters τ and ψ, where 0 < τ, ψ < 1, the candidate reference set has to be sampled in a way that ensures that the largest gain from this sample is at least τG[e] with probability ψ.

Since G[e] is not known in advance, the Type-I Extreme Value Distribution (also known

as the Gumbel distribution [41]) can be used to estimate its value. This is done as follows. Let us assume that each G[i] is produced by sampling repeatedly from a normally distributed random variable. The first step is to determine the mean and standard deviation of this variable. To do this, a sample set of candidate references is selected and the mean μ and the standard deviation σ









of the gains are computed for the sample. These are taken as the mean and standard deviation of

the underlying distribution.

Since the values in G are assumed to be samples from a normal distribution, the largest gain G[e] is known to follow approximately a Gumbel distribution whose parameters can be computed using μ and σ. Let N and w be the number of candidate references and the sample size, respectively. The two parameters a and b of the Gumbel distribution are computed as follows:

a = sqrt(2 log N)

b = sqrt(2 log N) − (log log N + log 4π) / (2 sqrt(2 log N))

The mean of the corresponding Gumbel distribution is then calculated as:

T = μ + σ [ (−log_e(−log_e(0.5)))/a + b ]

This tells exactly what the expected value of the gain associated with the best reference object is. Thus, the sampling stops when the best gain is (with high probability) almost good enough; that is, when it is at least τT. To compute how many samples are needed, let P(x < τT) be the probability that the gain of a random reference x is less than τT. This probability can be calculated from the distribution of the G[i]. The probability that the gains of all of the w randomly selected references are less than τT is P(x < τT)^w. Solving the inequality

1 − P(x < τT)^w ≥ ψ

gives the required number of samples as:

w ≥ log_e(1 − ψ) / log_e(P(x < τT))

The experiments use τ = ψ = 0.99. Each iteration of the Do-While

loop of the Maximum Pruning algorithm in Algorithm 2 computes the gain from random

candidate references until the required accuracy is reached. In each iteration, a different sample









of candidate references is used. If the average sample size over the iterations is h, h < N, then this optimization, together with the first optimization, reduces the overall complexity to O(Nfhm|Q|).
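Under these assumptions, the required number of candidate samples can be computed directly. The following minimal sketch uses the normal CDF for P(x < τT) and the standard Gumbel approximation for the maximum of N normal gains; the exact constants used in the experiments may differ slightly.

import math

def samples_needed(mu, sigma, N, tau=0.99, psi=0.99):
    a = math.sqrt(2 * math.log(N))
    b = a - (math.log(math.log(N)) + math.log(4 * math.pi)) / (2 * a)
    # Estimated largest gain T among N normally distributed gains.
    T = mu + sigma * (b + (-math.log(-math.log(0.5))) / a)
    # P(gain of one random candidate < tau * T) under the fitted normal model.
    p = 0.5 * (1 + math.erf((tau * T - mu) / (sigma * math.sqrt(2))))
    # Smallest w with 1 - p**w >= psi.
    return math.ceil(math.log(1 - psi) / math.log(p))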

4.2.4 Impact of the Sample Query Set

All reference selection strategies must at least implicitly assume that queries follow a certain distribution, since the quality of a given reference depends on the query workload that is to be answered. For example, as described in the Introduction, a single reference is enough to prune all of the data in a portion of the data space that is far from most queries; additional references in such a portion of the data space are wasted. The implicit assumption behind most existing reference selection strategies seems to be that queries are distributed similarly to database objects or that the queries are spread evenly throughout the space to be indexed [21, 23, 35, 64, 94].

MP differs from most selection strategies in that it makes any assumptions about the query distribution explicit. Of course, MP can easily be run by simply using the data to be indexed as the query distribution; in this case, like the other methods, MP will be optimized to run on queries that are similar to database objects. However, if the queries follow a different distribution, or if the query distribution keeps changing, MP has the benefit that it can explicitly take these factors into consideration while creating the index.

As seen in the experiments of Sections 4.5 and 5.4, the efficiency and efficacy of MP do

depend upon the size and accuracy of the training query set. However, MP is surprisingly robust

in this regard, and even for a small and/or somewhat inaccurate training set, MP outperforms its competitors. Thus, it can be argued that the fact that MP uses a training query set is actually a beneficial feature of the MP methodology.

4.3 Assignment of References

4.3.1 Motivation and Problem Definition

A simple way to improve the performance of a reference-based index is to increase the

number of references. For example, the number of object comparisons required by Omni [35] to











Figure 4-2. Number of comparisons for Omni with varying number of references.


answer a query over a DNA object database as a function of the number of references is given

in Figure 4-2 (see Section 4.5 for details). This uses the DNA database with a query range of 8.

It can be seen that the number of comparisons required to answer a query initially decreases as the number of references increases. After 400 references, it begins to increase linearly with the

number of references. Thus, the "optimal" number of references for this particular example is in

the hundreds.

Unfortunately, due to memory constraints, selecting and assigning 400 references to each

and every object in a database is not always a practical solution. For example, the human genome

(with 3 billion base pairs) contains 30 million objects of length 100. From the cost model, the

main memory storage of an index that contains 400 references for each of these objects would

require about 90GB of main memory. While 90GB of RAM may be feasible, it is certainly at

the upper end of what would be acceptable. For an even larger database (or even in the case of

the human genome if one were to index the substring starting at each and every base pair) the

memory requirements quickly become unmanageable.
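The arithmetic behind this estimate is simply the 8mN term of the cost model from Chapter 3: 8 bytes × 400 references × 3 × 10^7 objects ≈ 9.6 × 10^10 bytes, i.e., roughly 90 GB, before even accounting for the zm bytes needed to store the reference strings themselves.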











This section proposes a new strategy that selects only those references that can help in

pruning a given database object. This makes it possible to construct an index whose performance is

equivalent to an index that makes use of a much larger set of references, but whose memory costs

are much smaller because only a small number of reference-to-object distances are stored for

each database object.

Formally, given a set of m references, the goal is to assign a set of k references (k < m) to each database object s such that these references remove s from the candidate set of as many queries as possible. When m is much smaller than the database size, this strategy improves the performance

with little increase in the storage cost.

4.3.2 Algorithm

It is important to note that the task of assigning the references to the database objects

is orthogonal to actually choosing the references. The set of m initial references (V) can be

chosen in many different ways; it can be chosen using the Maximum Pruning algorithm, or using

one of the other selection strategies. The only assumption here is that the size of the set V is

substantially larger than the number of references.

Algorithm 4 shows the algorithm for assigning references to each database object. Q, S, the potential reference set V, and the number of references per object, k, are given as input. The algorithm returns a mapping from each database object to k references in V. For each database object s, the algorithm iteratively maps one reference at a time until k references are mapped. This is done as follows. An array Vcount is maintained, where Vcount[i] gives the number of queries for which the object is pruned using the reference vi ∈ V. The algorithm selects the reference e ∈ V for which s is pruned for the maximum number of queries (Steps 5–11). e is removed from V and mapped to s in Steps 13 and 14. The queries for which s is pruned using e are removed from the query set in Step 15. This is done to ensure that the new references are selected only to improve the queries for which s cannot be pruned using the existing references. Each reference costs |Q| extra object comparisons, because each query needs to be compared with all the references. Thus, a reference is useful only if it prunes a total of more than |Q| objects for all















input : Database S, |S| = N.
        Reference set V, |V| = m.
        Sample queries Q, |Q| = q.
        References per object k.
output: E = {E1, E2, ..., EN}.
        Ei is the reference set assigned to object si ∈ S.
1  G[i] = 0, 1 ≤ i ≤ m.  /* Total gain from each reference vi ∈ V */
2  Ei = {}, 1 ≤ i ≤ N.  /* Initialize reference set of each object */
3  for each si ∈ S do
4      repeat
5          Vcount[i] = 0, 1 ≤ i ≤ m.  /* Initialize gain for each [vi, si] pair */
6          for all [v, Qj], ∀v ∈ V and ∀Qj ∈ Q do
7              if PRUNE(si, Qj, v) then
8                  Vcount[v]++.
9              end
10         end
11         Let e = argmax_x(Vcount[x]).
12         G[e] += Vcount[e].
13         V = V − {e}.
14         Ei = Ei ∪ {e}.
15         Remove from Q the queries for which si is pruned with reference e.
16     until |Ei| = k;
17     Re-insert all deleted entries into the sets V and Q.
18 end
19 for all v ∈ V do
20     if G[v] ≤ |Q| then
21         V = V − {v}.
22     end
23 end
24 Update the reference sets Ei, ∀si ∈ S.
Algorithm 4: Algorithm to Assign References to Objects.









the queries in Q. If the total gain from a reference is not greater than |Q|, it is removed from the reference set (Steps 19–23). The reference sets of objects which have fewer than k references are then updated with new references from V (Step 24).
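A simplified Python sketch of this per-object assignment is given below; prunes(s, q, v) is assumed to be the same pruning predicate as before, and the final top-up of objects left with fewer than k references (Step 24) is omitted for brevity.

def assign_references(S, V, Q, k, prunes):
    total_gain = {v: 0 for v in V}
    assignment = {}
    for s in S:
        remaining_q = list(Q)
        available = list(V)
        chosen = []
        while available and len(chosen) < k:
            # Count, for every available reference, the queries it prunes s for.
            counts = {v: sum(1 for q in remaining_q if prunes(s, q, v))
                      for v in available}
            best = max(counts, key=counts.get)
            total_gain[best] += counts[best]
            chosen.append(best)
            available.remove(best)
            remaining_q = [q for q in remaining_q if not prunes(s, q, best)]
        assignment[s] = chosen
    # A reference must prune more than |Q| (object, query) pairs in total to pay
    # for the extra query-to-reference comparisons it causes.
    useful = {v for v in V if total_gain[v] > len(Q)}
    return {s: [v for v in refs if v in useful] for s, refs in assignment.items()}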

4.3.3 Computational Complexity

During index construction, the algorithm needs access to the distances between all (query, reference) pairs. This takes O(tm|Q|) time, where t is the time taken for one object comparison. If the selection strategy is Maximum Pruning, then the pre-computed query-reference distances are reused. In each of the k iterations, Steps 6–10 of the algorithm can process up to O(m|Q|) query-reference pairs. Thus the overall time taken by the algorithm is O(Nmk|Q|).

4.4 Search Algorithm

Sections 4.1 and 4.2 discussed how to find references, and Section 4.3 discussed how to map them to database objects. This section describes how to use the mapped references to answer range

queries.

4.4.1 Algorithm

Let S be the database of N objects and V be the reference set. The set of k references mapped to si is given by Ei, ∀si ∈ S, where Ei ⊆ V. The edit distances ED(v, si), ∀si ∈ S and ∀v ∈ Ei, are pre-computed. This is a one-time cost for the database. For a query q ∈ Q and range ε, the edit distances from q to the reference objects are computed, i.e., ED(q, v), ∀v ∈ V. For each pair (q, si), the lower and upper bounds LB and UB are computed as given in Chapter 1. Depending on LB, UB, and ε, si is inserted into one of two sets, the result set or the candidate set, as follows. If UB < ε, si is inserted into the result set. If LB ≤ ε ≤ UB, si is inserted into the candidate set. Otherwise, si is pruned. Once the candidate set is determined, the actual object comparisons between q and all the objects in the candidate set are performed to filter out the false positives.
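A minimal sketch of this range search in Python; refs_of[s] is the list of (reference index, pre-computed edit distance) pairs assigned to s, dist is the costly comparison, and all names are illustrative.

def range_search(q, eps, S, V, refs_of, dist):
    q_to_ref = [dist(q, v) for v in V]          # m costly comparisons
    results, candidates = [], []
    for s in S:
        lb = max(abs(q_to_ref[i] - d) for i, d in refs_of[s])
        ub = min(q_to_ref[i] + d for i, d in refs_of[s])
        if ub < eps:
            results.append(s)                   # accepted without a comparison
        elif lb <= eps:
            candidates.append(s)                # must be verified
        # otherwise s is pruned
    results.extend(s for s in candidates if dist(q, s) <= eps)
    return results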

Given an object database S with N objects, the selection strategy selects m references. The assignment strategy maps each object to k, k < m, references. For each object s and each of its references vi ∈ V, the mapping is of the form [i, ED(s, vi)]. The Nk edit distances between the objects and the references mapped to them are stored in a file. Selecting the references, mapping the database objects s ∈ S to references, and computing the [s, vi] edit distances are all one-time costs for an object database.

During database search, the references and the reference-to-object edit distances are loaded into memory. With an average object size of z bytes and four bytes per integer, this requires (8Nk + mz) bytes of memory; here 8 bytes are used to store each of the Nk mappings. As m increases, the objects can be assigned better references. However, this also increases the number of query-to-reference computations. Hence m is restricted to a fraction of the database size in the experiments.

4.4.2 Computational Complexity

Given a query set Q, the edit distances of all [query, reference] pairs are computed. This involves m|Q| object comparisons. For every [query, object] pair, all k references of the object are used to compute the lower and upper bounds. This takes O(Nk|Q|) time. If Cm is the average candidate set size for Q using the m references, it takes Cm|Q| object comparisons to obtain the final results. For an average object length of L, an object comparison takes O(L²) time. Thus, the overall time taken by the search algorithm is O((m + Cm)|Q|L² + Nk|Q|).

4.5 Experiments

This section compares the performance of different methods for indexing based on the

number of object comparisons performed.

Table 4-2. A list of proposed methods.

Assignment Strategy    Maximum Variance    Maximum Pruning
Same References        MV-S                MP-S
Diff. References       MV-D                MP-D


The proposed methods that have been implemented are listed in Table 4-2. For the proposed

algorithms, two different reference-to-object assignment strategies are considered. The first one

is the traditional way of assigning all references to the database objects (MV-S, MP-S). The

second strategy is the proposed approach of increasing the reference set and assigning different









subsets of references to each database object (MV-D, MP-D). For MV-D and MP-D, 200 candidate

references are selected (i.e. m = 200) unless otherwise stated and the top k (k < m) references

are used for each database object to index that object. The methods are compared with many

existing methods: 1) Omni [35], 2) the Sparse spatial selection strategy (Sparse) proposed by

Brisaboa et.al. [21], 3) the random (BRAND) and 4) incremental (BINC) reference selection

strategies proposed by Bustos et.al. [23]. The following four types of databases have been used in

the experiments:

DNA Database: The Escherichia Coli (E.Coli) (K-12 MG1655) genome from GenBank [10]

(ftp://ftp.ncbi.nih.gov/genbank/) has been used. The alphabet size of a DNA object is 4 (A, C, G, and T). Four databases of non-overlapping objects of lengths 25, 50, 100 and

200 have been created. Each database is obtained by chopping the E.Coli database into 20,000

non-overlapping objects. Databases of different sizes containing 5,000, 10,000, 15,000 and

30,000 non-overlapping objects, each of length 100, are also created similarly.

Protein database: Protein objects in the Eukaryota kingdom of organisms are used for

this database. They are obtained from SwissProt [7] (ftp://ftp.ebi.ac.uk/pub/databases/swissprot/). The alphabet size is 20. A set of 4,000 objects having up to 500

amino acids is selected randomly.

Text database: A text database from 33 randomly selected books from the Gutenberg project

(http://www.gutenberg.org/) is created. The alphabet consists of 36 alphanumeric

characters. The database contains 8,000 objects of length 100.

Retinal Images: Color histograms of images of retinal tissues from the UCSB BioImage database (http://bioimage.ucsb.edu/) are used. The database consists of histograms of 3,932 images. The images of retina are obtained under different conditions

corresponding to different stages of a biological process [16, 59]. All comparisons are made

using EMD.









In order to select the queries and the query workload used to train the algorithms, the

following protocol is used. For each object database, 200 objects from different parts of the same

species/database are selected. For the image database, 200 images are selected randomly from the

database. 100 are used as sample queries for MP and another 100 are used as query objects. Note

that the query and database objects for the DNA databases are taken from different parts of the

same species and they do not overlap. Similarly for the protein data, the query objects are taken

from proteins from a different part of the database without any overlap. For the DNA database,

the mean distance between the query and the database objects is 56 and for the protein data, the

mean distance is 192. Three additional query sets tested in Section 8.2.3 have also been created:

one from each of the organisms Danio Rerio (Zebrafish), Mus Musculus (Mouse) and Heliconius

Melpomene (Butterfly). Each query set contains 100 objects of length 100 and each is selected

randomly from its organism.

The experiments ran on an Intel Pentium 4 processor running Linux with 2.8 GHz clock

speed and 1GB of main memory.

4.5.1 Effect of the Parameters

This section presents experimental results of the proposed methods under different parame-

ter settings.

4.5.1.1 Impact of m

The goal of this experiment is to understand the behavior of MP-D for different cardinalities

of the reference set, |V| = m. Three databases have been used: the DNA object, protein and image databases. The number of references is 32. The query range for the DNA database is 8. It is 240 for the protein database and 16% for the image database.

The number of object comparisons for different reference set cardinalities for the three

databases are given in Figures 4-3 to 4-5. For the DNA database, up to m = 200, the number of

object comparisons reduces at a fast rate. From m = 200 to 300, there is very little improvement

in performance. For m > 300, the number of object comparisons increases. This is due to

increase in number of query-reference distance computations. For the image database, the













Figure 4-3. Number of comparisons for MP-D for DNA object database for different values of m.


improvements obtained are only up to m = 200. The number of comparisons increases slowly

from m = 200 to 300 and for m > 300, the rate of increase is fast. From these results, it can

be concluded that using an m value in the low hundreds is a good choice, since this gives good

performance and allows for reasonable index construction times.

One question is: how can one determine the best value of m if the index construction time is completely ignored? Figures 4-3 to 4-5 show that the number of object comparisons follows a U-shaped curve. This is indeed intuitive: for small m, the number of possibilities for selecting different references is small. Thus, the pruning rate drops and the number of comparisons with unpruned candidates increases. For large m, the candidate set size decreases; however, the number of comparisons with the references (i.e., m) increases. In other words, as m increases, the benefit gained by pruning more candidates becomes less than the cost of comparing the query to the reference set. Since this curve follows a U-shape, the best value of m can be determined by either using binary search over m or by adopting the Newton-Raphson method.
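Since the curve is (roughly) unimodal in m, the following small sketch finds its minimum with a ternary-style search; cost(m) is assumed to measure the number of object comparisons of a trial index built with m references, which is an offline and possibly expensive measurement.

def best_m(cost, lo=25, hi=1600):
    # Shrink the interval while keeping the minimum of the U-shaped curve inside it.
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if cost(m1) < cost(m2):
            hi = m2
        else:
            lo = m1
    return min(range(lo, hi + 1), key=cost)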

















Figure 4-4. Number of comparisons for MP-D for protein database for different values of m.


Figure 4-5. Number of comparisons for MP-D for image database for different values of m.












Figure 4-6. Impact of |Q| on index construction time.


4.5.1.2 Impact of query set size, |Q|

The goal of this experiment is to understand the behavior of MP-D for different cardinalities of the sample query set, |Q|. The DNA object database of 30,000 objects has been used. The number of references is 32 and the query range is 16.

The time taken to construct the index using MP-D for different cardinalities of the sample query set is given in Figure 4-6. It can be seen that with the increase in the size of the sample query set, the index construction

time increases almost linearly.

The number of comparisons needed is given in Figure 4-7. With increasing cardinality of the training query set, the number of comparisons reduces. From |Q| = 200 to 1000, there is only a slight improvement in performance. Given that the cost of building the index increases rapidly with the size of the training query set (Figure 4-6), |Q| = 100 is chosen in all of the experiments.










Figure 4-7. Impact of |Q| on query performance.


4.5.2 Comparison of Proposed Methods

This section presents the experimental results comparing MP-S and MP-D under different

parameter settings.

4.5.2.1 Impact of query range (e)

The goal of this experiment set is to understand the behavior of the proposed methods for

different query ranges. It compares the performance of MV-S, MP-S, MV-D and MP-D with e =

2 to 32 for DNA objects and e = 60 to 420 for protein database. The number of references used

for DNA and protein databases are 4 and 32 respectively. The plots are given in log scale.

The number of object comparisons for the DNA database is given in Figure 4-8. This

increases with the query range for all four methods. For different ranges, MP-D and MV-D have

lesser number of object comparisons compared to those of MV-S and MP-S. MP-D is gives

the best results. For ranges up to 8, assigning different reference sets to each object results in

significant reduction of object comparisons for both selection strategies. For both same and

different reference sets, MP performs slightly better than MV. This is due to the fact that MP is

using the knowledge of input query distribution. The results for protein database are given in











Figure 4-8. Comparison of the proposed methods for DNA database for queries with varying ranges.


Figure 4-9. MP-D outperforms the others for all query ranges, and for most of the query ranges the MP methods perform better than the MV methods. Similar results have been observed for the text database for varying query ranges. Overall, these results show that assigning different reference sets to each object gives better pruning than the traditional approach of assigning the same references to all database objects.

4.5.2.2 Impact of number of references (k)

The goal of this experiment is to understand the behavior of our methods for different

numbers of references, k. It compares MV-S, MP-S, MV-D and MP-D by fixing the query range

and varying the number of references assigned, k = 2, 4, 8, 16 and 32. Both protein and DNA

databases are used. The plots for the DNA database are in log scale.










Figure 4-9. Comparison of the proposed methods for Protein database for queries with varying ranges.


The number of object comparisons for the DNA database is given in Figure 4-10. For all four methods, the number of object comparisons decreases dramatically with increasing k. As the number of references increases from 2 to 32, the number of objects compared drops by factors of 5 to 20 between the methods MP-D and MP-S. The results for the protein database are given in Figure 4-11. MP-S and MV-S compare a larger number of objects for all numbers of references. With the increase in the number of references, there is a gradual decrease in the number of object comparisons. The MV-S strategy outperforms MP-S at k = 8, 16 and 32 for the protein database, while MP-S outperforms MV-S for all values of k for the DNA object database. The experiments using the text database gave results similar to those for the DNA object database, with MP-D giving the best results. This shows that, as the number of references increases, the memory can be utilized better by assigning a subset of the references to each database object.

4.5.3 Comparison with Existing Methods

This section compares MV-D and MP-D with Omni, FV, M-Tree [29], DBM-Tree [92], Slim-Tree [86], DF-Tree [85], Sparse [21], BRAND and BINC [23].













[Plot: number of object comparisons (log scale) vs. number of references (2 to 32) for MV-S, MP-S, MV-D and MP-D.]


Figure 4-10. Comparison of the proposed methods for DNA database with a varying number of
references.


[Plot: number of object comparisons vs. number of references (2 to 32) for MV-S, MP-S, MV-D and MP-D.]


Figure 4-11. Comparison of the proposed methods for Protein database with a varying number of
references.









4.5.3.1 Impact of query range (ε)

This experiment compares the behavior of the proposed method with several existing methods for different query ranges. For the protein database, the query range is varied from ε = 60 to 420. ε is tested for values from 2 to 32 for the DNA database, and from 2% to 32% of the largest distance for the image database. 32 references are used for MP-D, BRAND, BINC and Omni, since these methods allow the number of references to be chosen. Sparse chooses the number of references itself. The number of references used by Sparse is 586 for the DNA database, 598 for the protein database and 69 for the image database. The plots are given in log-scale for the protein and image databases.

Table 4-3. Comparison with Tree-based index structures.

QR    M-Tree      Slim-Tree   DBM-Tree    DF-Tree
      Ic=50 ms    Ic=15 ms    Ic=14 ms    Ic=480 ms
2     9946        10228       7313        8041
4     13426       14390       10963       10773
8     20147       19691       17507       19301
16    22865       21176       19977       24861
32    22892       21192       19997       25021

QR    Omni        FV          MV-D        MP-D
      Ic=14 ms    Ic=6 ss     Ic=74 ms    Ic=180 ms
2     228         247         205         200
4     1202        1264        273         208
8     12677       6208        2648        1126
16    19927       16088       18541       18296
32    20000       19811       19840       19836


The results of six existing methods along with MV-D and MP-D for the DNA object database are given in Table 4-3. QR denotes the query range; ss and ms denote running times in seconds and minutes respectively. The number of references for Omni, DF-Tree, MV-D and MP-D is 16. With the increase in query range, the number of object comparisons increases for all the methods. For larger query ranges, the tree-based methods compare more objects than a simple sequential scan. This is due to the comparison of the query object with the objects in the intermediate nodes of the tree structures. For ranges 8 to 32, Omni performs more object comparisons than FV, MV-D and MP-D. For ranges 2 to 8, MP-D reduces the number of object comparisons by a factor of 6 to 100. Even for higher ranges, MP-D reduces the number of object comparisons by up to a factor of 2. Thus MP-D outperforms all other methods for most of the query ranges.

Ic denotes the index construction time. For the tree-based methods it refers to the time taken to construct the tree structure. Ic of Omni gives the time taken to generate 16 references using that method. Ic of MP-D gives the time taken to generate 200 references for the reference set using the maximum pruning method with sampling-based optimizations. This includes the time taken for selecting the references and the index construction time, during which the reference-to-data-object distances are recorded. The index construction step is a one-time cost for these databases: they receive thousands of queries in a short period of time with very few updates. The number of object comparisons forms a major part of the overall computation in the search performance of these databases. In the remaining experiments, Omni and FV are used for comparison, as they perform the best among all competitors.

The results for the DNA database are given in Figure 4-12. Not surprisingly, with increasing query range, the number of object comparisons increases for all methods. MP-D has up to 40 times fewer comparisons than Omni, BRAND and BINC, and up to 10 times fewer comparisons than Sparse. Due to its large reference set, Sparse has more comparisons than the other methods at a range of 2. The results for range queries on the protein database are given in Figure 4-13. Sparse has more comparisons than all other methods. BINC performs slightly better than BRAND. For all query ranges, MP-D has up to two times fewer comparisons than the other methods. The results for the image database are given in Figure 4-14. MP-D has up to 5 times fewer comparisons than Sparse and up to three times fewer than BRAND.

4.5.3.2 Impact of number of references (k)

The goal of this experiment is to compare the behavior of the proposed methods with existing methods using different numbers of references. The performance of Omni, BRAND and BINC is compared with MP-D by fixing the query range and varying the number of references, k = 2, 4, 8, 16 and 32. The DNA, protein and image databases are used. The query range is 8














[Plot: number of object comparisons (log scale) vs. query range (2 to 32) for Omni, Sparse, BRAND, BINC and MP-D.]


Figure 4-12. Comparison with other methods on DNA database for queries with varying ranges.


[Plot: number of object comparisons (log scale) vs. query range (60 to 420) for Omni, Sparse, BRAND, BINC and MP-D.]


Figure 4-13. Comparison with other methods on protein database for queries with varying ranges.











[Plot: number of object comparisons (log scale) vs. query range (2% to 32% of the largest distance) for Omni, Sparse, BRAND, BINC and MP-D.]


Figure 4-14. Comparison with other methods on image database for queries with varying ranges.


for the DNA database, 300 for the protein database and 8% of the maximum distance value for the image database. Since Sparse uses a fixed number of references for static databases, its results are compared against the other methods run with different numbers of references. The plots are given in log-scale.

The number of object comparisons for all four methods for the DNA database is given in

Figure 4-15. BINC performs better than BRAND and Omni performs better than both BRAND

and BINC. For all reference values, MP-D outperforms the other methods. As the number of

references increases, the number of comparisons required by MP-D is smaller by up to a factor of

20 compared to Omni, BRAND and BINC.

The results for the protein database are given in Figure 4-16. With fewer references BINC

has more comparisons than BRAND. MP-D reduces the number of comparisons by a factor of 2

compared to Omni and outperforms BRAND and BINC for all ranges of references.

The results for the image database are given in Figure 4-17. For a varying number of refer-

ences, BINC requires fewer comparisons than BRAND. Omni requires more comparisons than











[Plot: number of object comparisons (log scale) vs. number of references (2 to 32) for Omni, BRAND, BINC and MP-D.]

Figure 4-15. Comparison with other methods on DNA database for a varying number of refer-
ences.


the other methods. MP-D has up to three times fewer comparisons than Omni and outperforms

BRAND and BINC.

4.5.3.3 Impact of input queries

This experiment evaluates the proposed method using the DNA database when the distribu-

tion of queries differs from that of the sample query objects used in reference selection. Three

query sets from three different species (Mouse, Zebrafish and Butterfly) which are taxonomically

distant from the species E. coli are used. The the number of references for Omni, BRAND, BINC

and MP-D is 8. Sparse used 586 references for the DNA database.

The results are given in Figures 4-18 to 4-20. For all query ranges, MP-D outperforms the other methods, even though it has been trained on a species that is different from the query species. This suggests that MP-D is at least somewhat robust to changes in the query distribution or inaccuracies in the training distribution.











[Plot: number of object comparisons vs. number of references (2 to 32) for Omni, BRAND, BINC and MP-D.]

Figure 4-16. Comparison with other methods on protein database for a varying number of refer-
ences.


[Plot: number of object comparisons vs. number of references (2 to 32) for Omni, BRAND, BINC and MP-D.]

Figure 4-17. Comparison with other methods on image database for a varying number of refer-
ences.




Figure 4-18. Comparison with other methods on DNA database for queries from Heliconius
Melpomene with varying query ranges.


Figure 4-19. Comparison with other methods on DNA database for queries from Mus Musculus
with varying query ranges.


[Plots for Figures 4-18 and 4-19: number of object comparisons (log scale) vs. query range (2 to 32) for OMNI, Sparse, BRAND, BINC and MP-D.]


Figure 4-20. Comparison with other methods on DNA database for queries from Danio Rerio
with varying query ranges.


4.5.3.4 Scalability in database size

Next, the scalability in database size of Omni, Sparse, BRAND, BINC and MP-D is tested. Four DNA object databases with 5,000, 10,000, 15,000 and 20,000 objects each have been used. The number of references used by Omni, BRAND, BINC and MP-D is 32. The numbers of references selected by Sparse are 379, 478, 534 and 586 for the database sizes 5000, 10000, 15000 and 20000 respectively. The query range is 8. The plot is given in log-scale.

The results are given in Figure 4-21. BRAND has the largest number of comparisons for all database sizes. With an increase in the size of the database, MP-D outperforms all other methods. Even with its large reference set, Sparse has up to two times more comparisons than MP-D. MP-D has up to 20 times fewer comparisons than Omni, BRAND and BINC. With an increase in database size, the numbers of comparisons required by Omni, BRAND and BINC increase at a faster rate than that of MP-D.


[Plot for Figure 4-20: number of object comparisons (log scale) vs. query range (2 to 32) for OMNI, Sparse, BRAND, BINC and MP-D.]












Figure 4-21. Scalability in database size.


4.5.3.5 Scalability in string length

The goal of this experiment set is to compare the behavior of the proposed method with existing methods for increasing string lengths. The methods Omni, FV and MP-D have been compared using four DNA databases of 10,000 strings each, with string lengths 25, 50, 100 and 200. The number of references used in the methods Omni and MP-D is 32 and the query range is 8. The number of string comparisons for the different string lengths is given in Figure 4-22. All of the methods show a reduction in the number of string comparisons as the string length increases. For shorter strings, the range of 8 is large relative to the string length; in these cases MP-D outperforms Omni and FV by a factor of 2. For long strings, the range of 8 is relatively low; for these strings, MP-D reduces the string comparisons by a factor of 20 compared to FV and Omni. As the string length increases, Omni outperforms FV. Separate scalability experiments on protein databases were not performed. Protein databases with string lengths up to 500 are used in the experiments given in Figures 4-13 and 4-16. These results show that the proposed methods scale well to the different lengths of protein strings.


[Plot for Figure 4-21: number of object comparisons (log scale) vs. database size (5,000 to 20,000) for Omni, Sparse, BRAND, BINC and MP-D.]














































Figure 4-22. Scalability in string length.









CHAPTER 5
REFERENCE-BASED INDEXING FOR DYNAMIC DATABASES

Databases are often updated by inserting new objects. For example, on average, the GenBank database [10] (ftp://ftp.ncbi.nih.gov/genbank/) has 100,000 insertions of new DNA objects each day. Applying the MP algorithm presented in Chapter 4 to select the best reference set after each insertion is a possible approach, but it may be infeasible due to its cost. This chapter addresses this problem by proposing two incremental variations of MP: the Single Pass (SP) and the Three Pass (TP) variations. Since neither algorithm re-computes the index from scratch, both must make assumptions about the change (or lack thereof) of the various gain statistics used by the MP algorithm over time.

5.1 Overview of SP and TP

5.1.1 Basic Approach

Both SP and TP assume that MP has been used to construct an index for the data that will be

included in the database when it is first made available for public use.

Each newly inserted database object is considered as a candidate reference by SP and its

gain is computed by passing over the database once. If the gain from the new object is more than

that of any reference in the reference set, then the reference set is updated with the new object.

In TP, the gain associated with the new object is computed using SP. Then the gains of all

the objects in the database are updated based on whether they can prune the new object with their

assigned references. In the final step, if the candidate reference with maximum gain is not already

in the reference set, its gain is recomputed and the candidate is added to the set of references.

In this method, the objects in the database are scanned three times; hence the name Three Pass

algorithm.

5.1.2 Maintaining the Query Distribution

Just as MP uses a query set Q in order to compute the quality of the resulting index

with respect to a given query load, so do both SP and TP. However, since in a realistic environment the query workload may change over time, it makes little sense for SP and TP to

employ a static query workload. Instead, they use a training query set Q that is allowed to evolve









over time via a reservoir sampling algorithm [93]. The reservoir algorithm is a classic, one-pass

sampling algorithm with the key characteristic that at all times, the set of objects maintained by

the algorithm is a true random sample of all of the objects seen thus far. The reservoir algorithm

is used by both SP and TP, to maintain in an online fashion a random sample of all of the queries

that have been observed thus far. This set is then used as the query set Q in order to optimize the

index. In this way, as the query distribution changes, both algorithms tend to include examples of

both the newer and the older queries in their training set, meaning that the index can evolve over

time in order to take into account an evolving query distribution.
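To make this mechanism concrete, the following sketch shows standard reservoir sampling in Python (a minimal illustration of the classic one-pass algorithm [93]; the function name, the fixed sample size and the toy stream are illustrative choices, not taken from the dissertation).

    import random

    def reservoir_update(reservoir, size, item, n_seen):
        """Maintain a uniform random sample of all items seen so far.

        reservoir : list holding the current sample (at most `size` items)
        size      : desired sample size (|Q| in the text)
        item      : the newly observed query
        n_seen    : number of items observed so far, including `item`
        """
        if len(reservoir) < size:
            # The first `size` items are always kept.
            reservoir.append(item)
        else:
            # Keep the new item with probability size / n_seen,
            # replacing a uniformly chosen existing element.
            j = random.randrange(n_seen)
            if j < size:
                reservoir[j] = item
        return reservoir

    # Example: maintain a 100-query training sample as queries stream in.
    Q = []
    for n, query in enumerate(range(1, 10001), start=1):
        reservoir_update(Q, 100, query, n)

At every point in the stream, each query observed so far has the same probability of being in Q, which is exactly the property the training set needs as the query distribution drifts.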

5.2 Single-pass Algorithm in Depth

This section presents SP, the first of the two incremental variations of MP.

The gains of the objects that MP has selected as references are given as input to the SP

algorithm. After each insertion of a new database object, the new object's gain is computed by

assuming that every existing database object will use it as a reference. If the gain associated with

using the new object is more than that of any of the existing reference objects, then the reference

set is updated with the new object and its gain.

The SP algorithm is given in Algorithm 5. The algorithm takes the existing reference

set V, the sample query set Q, and the gains G[1..|V|], where G[i] is the gain associated with the

reference V[i], as input. Given the new database object X, the algorithm first computes its gain

by including X in the candidate set (Steps 2.a to 2.d). This step is similar to Step 2 of the MP in

Algorithm 2. The reference with minimum gain, e, is selected in Step 4. If the gain from the new

reference is more than that of e, then the reference set is updated with the new reference.

5.2.1 Computational Complexity

Assuming that each distance computation takes O(t) time, the total time taken for object comparisons is O(t((N + 1) + |Q|)). Step 2 of the algorithm replaces each reference of a database object with the new candidate reference over all sample queries. This takes O(Nm|Q|) time. Thus the overall time taken for an insertion operation is O(t((N + 1) + |Q|) + Nm|Q|). By










input : Database S, |S| = N.
        Reference set V, |V| = m.
        Sample queries Q, |Q| = q.
        New database object, X.
        Gains of the m references, G[1..|V|].
output: Set of references V = {v1, v2, ..., vm}.
1  G_X = 0. /* Sum of gains with X as candidate reference */
2  for each [X, s_j], ∀ s_j ∈ S do
3      Let g_0 be the number of queries for which s_j is pruned using its references in V.
4      P[a] = 0, 1 ≤ a ≤ m. /* Initialize gain for the a-th reference in V */
5      for each [e, q] pair, ∀ e ∈ V and ∀ q ∈ Q do
6          V' = V - {e} ∪ {X}.
7          if PRUNE(V', q, s_j) = 1 then
8              P[e]++.
9          end
10     end
11     if MAX(P) > g_0 then
12         G_X += MAX(P) - g_0.
13     end
14 end
15 if G_X ≤ 0 then
16     Return.
17 end
18 Let e = argmin_i(G[i]).
19 if G[e] < G_X then
20     V = V - {V[e]} ∪ {X}.
21     G[e] = G_X.
22 end
Algorithm 5: Single Pass Algorithm.


applying the first optimization of MP given in Algorithm 3 and by using a sample size of f, the time complexity of SP can be reduced to O(t(f + |Q|) + fm|Q|).
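As a rough illustration of the single-pass update, the logic of Algorithm 5 can be written as follows. This is a sketch only: the per-object reference assignment of the dissertation is simplified to a single global reference set, and `prune` is an assumed predicate standing in for the triangle-inequality pruning test.

    def single_pass_update(V, G, Q, S, X, prune):
        """Consider the newly inserted object X as a candidate reference.

        V     : list of current reference objects
        G     : list of gains, G[i] is the gain of reference V[i]
        Q     : training query sample
        S     : existing database objects
        prune : prune(refs, q, s) -> True if object s can be pruned for
                query q using the reference set refs (assumed to be given)
        """
        gain_x = 0
        for s in S:
            # g0: queries for which s is already pruned by the current references.
            g0 = sum(1 for q in Q if prune(V, q, s))
            best = 0
            for e in V:
                # Gain if reference e were swapped for X.
                trial = [r for r in V if r is not e] + [X]
                p = sum(1 for q in Q if prune(trial, q, s))
                best = max(best, p)
            if best > g0:
                gain_x += best - g0
        if gain_x <= 0:
            return V, G                      # X is not worth adding.
        worst = min(range(len(G)), key=lambda i: G[i])
        if G[worst] < gain_x:
            V[worst], G[worst] = X, gain_x   # Replace the weakest reference.
        return V, G

The key point of the sketch is that only one scan over the existing objects is needed: all other gains are taken as already computed by MP and left untouched.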

5.3 Three-pass Algorithm in Depth

Every new object inserted into the database is considered as a candidate reference by SP,

but it ignores the fact that as new database objects are inserted, the distribution of the objects in

the database may change over time. The gains of the other objects in the database can also be

improved if they can prune the newer objects. Thus, over time it may become desirable to include

in the reference set one of the objects previously dismissed as being unsuitable.









The TP algorithm attempts to address this shortcoming. When a new database object is

inserted, the first pass of this method computes the gain for the new object using the SP. In the

second pass, the gains for other objects in the database are updated by considering each of them

as a candidate reference to the new database object. A candidate with maximum gain is selected.

In the third pass, the gain of the candidate is recomputed using its assigned references by adding

it to the reference set of each database object. If this gain is greater than any of the existing

references, the candidate is added to the reference set.

The TP algorithm is formally presented in Algorithm 6. The algorithm first computes the gain of the newly-inserted object X as a candidate reference using Step 2 of the SP. Step 2 then updates the gains of the other objects in the database, similar to Step 2 of MP in Algorithm 2. Step 3 locates the object with maximum gain. Step 4 recomputes the gain of this candidate using Step 2 of the SP given in Algorithm 5. If the candidate has a gain greater than that of any of the existing references, it is included in the reference set (Steps 5 and 6).

5.3.1 Computational Complexity

The algorithm needs access to the distances between the new object and all the database objects. Step 2.c needs the query-to-object distances. With each distance computation taking O(t) time, the total time taken for object comparisons is O(t((N + 1) + |Q|)). Step 2 of the algorithm considers each object in the database as a candidate reference and updates its gain; this takes O(Nm|Q|) time. SP is used twice, in Steps 1 and 4, which takes O(2fm|Q|). Thus the overall time taken for an insertion operation is O(t((N + 1) + |Q|) + Nm|Q| + 2fm|Q|).

5.4 Experiments

In perhaps the most common application of reference-based indexing, a moderately-sized

database is made available via the Internet to an extended user community. The user community

submits both queries and new objects to be included into the database. For example, consider the

GenBank database of DNA objects. On average, this particular database processes 100 thousand

object insertions each day. Given this sort of application, natural questions are:









input : Database S, |S| = N.
        Reference set V, |V| = m.
        Sample queries Q, |Q| = q.
        New database object, X.
        Gains of the m references, G[1..|V|].
output: Set of references V = {v1, v2, ..., vm}.
1  Compute G[X] using the SP.
2  for each [s_i, X], ∀ s_i ∈ S do
3      Let g_0 be the number of queries for which X is pruned using the references in V.
4      P[a] = 0, 1 ≤ a ≤ m. /* Initialize gain for the a-th reference in V */
5      for each [e, q] pair, ∀ e ∈ V and ∀ q ∈ Q do
6          V' = V - {e} ∪ {s_i}.
7          if PRUNE(V', q, X) = 1 then
8              P[e]++.
9          end
10     end
11     if MAX(P) > g_0 then
12         G[i] += MAX(P) - g_0.
13     end
14 end
15 Let w = argmax_i(G[i]).
16 Compute G[w] using the SP.
17 Let e = argmin_i(G[V[1..m]]).
18 if G[e] < G[w] then
19     V = V - {V[e]} ∪ {w}.
20 end
Algorithm 6: Three Pass Algorithm.


Which of the proposed methods (MP, SP, or TP) is most suitable for a dynamic environ-

ment where both insertions and queries must be processed?

Does the suitability of each method change as the specific characteristics of the database change?

There are many factors that need to be considered when answering these questions. MP

can be used in a dynamic environment by simply rebuilding the index from scratch after every

new insertion (or periodically, if it is not too problematic that the index become stale). Almost

certainly, this will result in a database that gives superior query-processing speed compared to SP









or TP which use less information during index construction. However, the cost associated with

frequent index rebuilds may be problematic.

SP and TP will expectedly give inferior query processing capability compared to MP, yet

will be able to process insertions more easily. All of this is complicated by the fact that all of

the proposed algorithms require that the distance from each new database object to each existing

database object be computed as the database grows, which adds a very significant fixed cost to

every new database insertion for each algorithm.

These considerations must all be taken into account when choosing which of the three

methods to use. The goal of this experimental section is to carefully benchmark each of the three

methods in order to be able to point out under what circumstances each method may or may not

be preferred.

5.4.1 Query Performance

The goal of this particular experiment is to simulate a real-life environment such as GenBank, where the database is growing in size due to insertions, and where queries are asked concurrently with the database growth.

5.4.1.1 Experimental setup

One important consideration when designing this experiment is to re-create the realistic

situation where the distribution of queries as well as recently inserted data is not stationary. In a

realistic environment, one may expect that the distribution of the data objects that are stored in

the database (as well as the queries asked) change as the application and user base evolves over

time. Such a non-stationary distribution may be expected to favor the "massive-rebuild" option

presented by the MP method, where the index is simply reconstructed from scratch periodically.

The reason for this is that both SP and TP save computation by assuming that the index does not

change too much over time. If the distribution of data and/or queries does change significantly,

then this assumption may be problematic and one might expect the query performance to

deteriorate. The goal is to test what the effect of such a non-stationary distribution is on the

quality of the three methods.











[Plot: number of object comparisons vs. database size (up to 2.5 x 10^4) for SP, TP and MP.]

Figure 5-1. Comparison of the proposed methods on DNA database with Hilbert-ordered data and query distributions.



In order to test the effect of such considerations on query performance, the following

experimental setup is used. Given the DNA data set of 40,000 objects, the database objects are

first ordered (as we will describe later in this section). Then, the first 4,100 objects from the

ordering are considered. A random selection of 100 of these is used as a sample query set and the remaining 4,000 as the initial set of objects in the database. An index is constructed for this initial database using MP and the sample query set. The remaining objects in the DNA database are then processed according to their ordering, one object at a time. The first twenty-six

objects from the ordering are inserted into the database. For SP and TP, the index is updated after

each insertion. For MP, the index is reconstructed from scratch after the twenty-sixth insertion.

Then, the twenty-seventh object is selected as the query object for the search. For this query

object, the number of object comparisons required to answer the query using each of the three

indexing strategies is computed. For SP and TP, this query is then added to the query sample

using the reservoir algorithm [93]. Then, the next twenty-six objects (in order) are inserted. The











[Plot: number of object comparisons vs. database size for SP, TP and MP.]

Figure 5-2. Comparison of the proposed methods on DNA database with random-ordered data
and query distributions.



next object is used as a query over all of the inserted data. This process is repeated until 1,000

queries and 26,000 insertions have been performed.
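The interleaving of insertions and queries in this setup can be sketched as follows (a schematic of the experimental loop only; `insert_and_update` and `run_query` are placeholders for the index-specific operations, and `reservoir_update` is the reservoir sketch given in Section 5.1.2).

    def simulate_workload(ordered_objects, index, query_sample,
                          insert_and_update, run_query, reservoir_update,
                          batch=26, n_queries=1000):
        """Alternate batches of insertions with a single query, as in Section 5.4.1.1."""
        comparisons = []
        stream = iter(ordered_objects)
        n_seen = len(query_sample)
        for _ in range(n_queries):
            # Insert the next `batch` objects; SP/TP update the index after each
            # insertion, while MP would instead rebuild after the whole batch.
            for _ in range(batch):
                insert_and_update(index, next(stream))
            # The following object is used as the query over all inserted data.
            query = next(stream)
            comparisons.append(run_query(index, query))
            # For SP and TP the query joins the training sample via reservoir sampling.
            n_seen += 1
            reservoir_update(query_sample, 100, query, n_seen)
        return comparisons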

Two different orderings of the DNA data are considered. The first ordering, a Hilbert ordering, is obtained as follows. The percentage of each letter in the DNA object defines a multi-dimensional vector (four-dimensional, since a DNA object contains four different letters). The objects are then ordered according to the Hilbert ordering of their resulting vectors. Because of the Hilbert ordering, the distribution of the data that are inserted will vary over time. The first few objects will have a high percentage of a single letter and low percentages of the rest of the letters, but that percentage will drop for objects inserted later on, as the database is processed and other letters become prominent. For example, the DNA objects have {A, C, G, T} as the alphabet set. The Hilbert ordering of this database has more than 40% A's in the first few objects. This percentage decreases to 10% towards the end of the ordering. Because of the experimental setup,












[Plot: number of object comparisons vs. database size for the Hilbert and random orderings.]

Figure 5-3. Sparse method for DNA database with Hilbert and random-ordered data and query
distributions.


this simulates the situation where queries are always more oriented towards the most-recently

inserted data, which may be a realistic scenario.

For the second experiment over the DNA data set, the data is ordered randomly. Thus, there

is no drift of the data and query distribution over time in this case. The same experiment is also

performed over the randomly ordered images from the retinal image data set. Here the initial data

set contained 1000 images and 100 sample query images. Then a random image is used as query

for every 10 images inserted. This process is continued until 3500 images have been inserted into

the database.

For the image data set, ε = 8% of the largest distance and k = 16 are used. For the two orderings of the DNA database, ε = 8 and k = 32 are used.





















[Plot: number of object comparisons vs. database size (1500 to 3000).]

Figure 5-4. Comparison with Sparse for the randomly ordered image database.


5.4.1.2 Experimental results

The number of comparisons required to answer each query as a function of database size (or

equivalently, time) for each of the various methods over various databases is given in Figures 5-

1 to 5-4. In addition to MP, SP, and TP, we present the results obtained by running the same

experiment using the Sparse method proposed by Brisaboa et.al. [21] for dynamic databases. The

other methods considered experimentally in this paper such as OMNI and BINC are not able to

handle dynamic updates, and so are not tested in these experiments.

The various plots depicted are fairly jagged due to the large number of queries asked

and the high variance of index performance on any given query, but the results show that the

incremental algorithms require may require more than three times more comparisons than MP.

Not surprisingly, SP required more comparisons than TP for all three data sets, though the

difference between the two methods is negligible for the randomly-ordered DNA data set. This

particular data set also showed the closest performance among all three methods. This is also not















[Plot: distance computation time (seconds) vs. database size (1500 to 3000).]


Figure 5-5. Distance computation time of DNA database.



surprising, given that the data and query distributions are stationary over time. Perhaps the most

surprising finding is that for the randomly ordered image database, there is a very clear separation

in performance of the various methods over time.

For the DNA databases, the Sparse method failed to generate a good set of references and it

must scan almost the entire database. For the image database, Sparse requires up to three times

more comparisons than MP and up to two times more comparisons than TP.

5.4.2 Distance Computations

The previous experiment tested the ability of the various methods to process queries

efficiently, but in an dynamic environment insert processing speed is also important.

The first experiment regarding insertion processing speed is meant to test the magnitude of a

cost that is common to all three algorithms: the cost of computing the distance of a new database

object to all existing database objects. For the two different distance measured considered














[Plot: distance computation time (seconds) vs. database size (1500 to 3000).]

Figure 5-6. Distance computation time of image database.


(ED and EMD), the time taken to process the necessary distance computations associated with insertions numbered 1000 to 3000 is measured.

The timing results for the DNA database (using ED) are given in Figure 5-5. With the increase in size from 1000 to 3000, the time taken to process each additional insertion increases from 0.12 to 0.36 seconds. The time required for the image data set (using EMD) is given in Figure 5-6. With the increase in size from 1000 to 3000, the time taken increases from 2500 seconds to 7500 seconds. As one might expect, times scale linearly with database size for both distance measures, but there is a very significant difference in the magnitude of the cost between the two.

5.4.3 Impact of Index Construction Time

After demonstrating the fixed costs that all three methods must pay in order to process an additional insertion, the actual index construction times of the three methods are compared.

The time taken for index construction by the two incremental methods as the database size increases is given in Figure 5-7 for the image data set, using a query range expressed as a percentage of the largest distance












[Plot: index construction time (seconds) vs. database size (1000 to 3000) for SP and TP.]


Figure 5-7. Index construction (Ic) times of the methods SP and TP for DNA and image databases.



and k = 32. The time taken by TP varies from 0.06 seconds for 1000 objects to 0.13 seconds for 3000 objects. Similarly, the time taken by SP varies from 0.02 seconds to 0.04 seconds. The time taken to construct the index for MP is given in Figure 5-8. It varies from 175 seconds for 1000 objects to 675 seconds for 3000 objects and increases linearly. Thus the incremental algorithms are orders of magnitude faster than MP. Similar results were obtained for the methods SP, TP and MP using the DNA database.

5.4.4 Analyzing the Results

The experiments suggest that there is no easy answer to the question: Which is better, MP,

SP, or TP? Depending upon the application characteristics, either MP or one of the incremental

methods may be superior.

For distance metrics of intermediate computational cost such as the ED, the incremental

methods such as SP and TP seem preferable. The distance computation cost of ED is significant,















[Plot: index construction time (seconds) vs. database size (1000 to 3000) for MP.]

Figure 5-8. Index construction (Ic) times of MP for DNA and image databases.


but not debilitating. Specifically, it has a low cost (Figure 5-5) compared to the index construction cost of MP (Figure 5-8), which dominates: the time required to rebuild the index using MP is around 3000 times the cost of computing all of the distances to a newly inserted database object. The gain from MP's superior query speed (up to two times faster compared to the incremental algorithms) does not seem to justify its costs. For example, the statistics associated with the DNA objects in the GenBank [10] database show that for every new object that is inserted into the database, the database receives around one new query. With a 1:1 query-to-update ratio, the update cost is just as important as the query cost, making MP far less attractive.

On the other hand, for distance metrics of high computational cost such as EMD, applying MP is preferred. The index construction time for MP (Figure 5-8) is low compared to the costly EMD distance computation (Figure 5-6). Even though the incremental algorithms have an index construction cost that is negligible (Figure 5-7), their query performance is up to three times slower than MP's. The difference in their query times is much greater than the index construction cost of MP (300 seconds more than MP). Hence for the image database applying MP is probably a better option.

If an incremental algorithm is selected, choosing between SP and TP is somewhat difficult. The latter generally has superior query performance, though not over the DNA data set with a random ordering. The former has better insert-processing performance. A rule of thumb might be: if inserts are more common, choose SP; if queries are more common, choose TP.









CHAPTER 6
GENERALIZED NEAREST NEIGHBORS FOR SIMILARITY JOINS

This chapter presents a generalized framework for Nearest Neighbor queries called General-

ized Nearest Neighbors (GNN) to answer the similarity join queries.

Finding the broadness of data is needed in many applications such as life sciences (e.g.,

detecting repeat regions in biological sequences [46] or protein classification [24]), distributed

systems (e.g., resource allocation), spatial databases (e.g., decision support system or continuous

referral system [55]), profile-based marketing, etc.

In this dissertation, a new database primitive, called the Generalized Nearest Neighbor (GNN), which naturally detects data broadness, is defined. Given two databases R and S, the GNN query finds all the objects in S_I ⊆ S that appear in the k-NN set of at least t objects of R, where t is a cutoff threshold. The objects in the result set of a GNN query are broad. Here, S_I is the set of objects on which the user focuses for the broadness property. If R = S, it is called a mono-chromatic query; otherwise, it is called a bi-chromatic query.

The trivial solution to a GNN query is to run a k-NN query for each object in R one by one, and accumulate the results for each object in S. However, this approach suffers from both an excessive amount of disk I/O and CPU computation. When the databases do not fit into the available buffer, a page that will be needed again might be removed from the buffer while processing a single k-NN query. The CPU cost also accounts for a significant portion of the total cost, since the k-NN set is determined for each object in R from scratch.

Assume that each of the databases is larger than the available buffer. Three solutions that arrange the data objects into pages and prune the candidate pages with the help of R*-trees [9] are proposed: a pessimistic approach (Fetch All), an optimistic approach (Fetch One), and a dynamic approach (Fetch Dynamic) that analyzes the query history. These methods are summarized in Section 6.2 and developed in the remainder of this chapter.


























Figure 6-1. An example GNN query.



6.1 Problem Definition

Let R and S be two databases. The GNN query is defined by a 5-tuple GNN(R, S, S_I, k, t), where S_I ⊆ S, and k and t are positive integers. This query returns the set of tuples (s, R_s), where s ∈ S_I and R_s ⊆ R is the set of objects that have s as one of their k-NN, with |R_s| ≥ t. The Euclidean distance is used as the distance measure unless otherwise stated.

Assume that the white and black points in Figure 6-1 show the layout of the 2-D databases R = {r1, ..., r8} and S = {s1, ..., s5} respectively. Consider the following query: GNN(R, S, S_I = {s1, s2, s5}, 2, 3).









This translates as: "Find the objects in S_I that are in the 2-NN set of at least three objects in R". In Figure 6-1, the circle centered at each r_i ∈ R covers the 2-NN of r_i. Only s2 and s4 are covered by at least three circles. s4 can be ignored since s4 ∉ S_I. The set of objects that have s2 in their 2-NN set is {r1, r2, r3}. Therefore, the output of this query is {(s2, {r1, r2, r3})}. Note that the data points in S - S_I cannot be ignored prior to the GNN query. In other words, GNN(R, S, S_I, k, t) ≠ GNN(R, S_I, S_I, k, t). For example, in Figure 6-1, removal of s3 and s4 prior to the GNN query changes the 2-NNs of r2, r3 and r4. As a result, s1 becomes one of the 2-NNs of r2 and r3. Hence s1 is incorrectly classified as broad.
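To make the semantics concrete, the following Python sketch implements the GNN definition by brute force, in the spirit of the trivial nested-loop solution mentioned earlier (the names and the in-memory point lists are illustrative; real objects would live on disk and the distance computation would be far more expensive).

    from math import dist          # Euclidean distance for coordinate tuples

    def gnn_brute_force(R, S, S_I, k, t):
        """Return {s: R_s} for every s in S_I that is a k-NN of at least t objects of R."""
        focus = set(S_I)
        supporters = {s: [] for s in focus}   # s -> objects of R having s among their k-NN
        for r in R:
            # The k-NN of r is computed over the whole of S, not just S_I.
            knn = sorted(S, key=lambda s: dist(r, s))[:k]
            for s in knn:
                if s in focus:
                    supporters[s].append(r)
        return {s: rs for s, rs in supporters.items() if len(rs) >= t}

    # Toy usage with 2-D points:
    R = [(0, 0), (1, 0), (0, 1), (5, 5)]
    S = [(0.1, 0.1), (4, 4), (9, 9)]
    print(gnn_brute_force(R, S, S_I=[(0.1, 0.1)], k=2, t=3))

Note how the inner k-NN search ranges over all of S; restricting it to S_I would reproduce exactly the misclassification illustrated above.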

A nice property of the GNN query is that both the mono-chromatic and bi-chromatic versions of the standard k-NN, ANN and RNN queries are its special cases. The following observations state these cases; one can prove them from the definition of the GNN query. Note that the goal here is not to find different solutions to each of these special cases, but to solve a broader problem which cannot be solved trivially using these special cases.

Observation 1. GNN({r}, S, S, k, 1) returns the k-NN of the object r in S. If r ∈ S, then it corresponds to the mono-chromatic k-NN query. Otherwise, it corresponds to the bi-chromatic k-NN query.

Observation 2. GNN(R, S, S, 1, 1) returns the ANN of R in S. If R = S, then it is the mono-chromatic case, otherwise the bi-chromatic case.

Observation 3. GNN(R, S, {s}, 1, 1), where s ∈ S, returns the RNN of the object s in R. If R = S, then it is the mono-chromatic case, otherwise the bi-chromatic case.

6.2 Overview of the Algorithm

Assume that each of the databases is larger than the available buffer. The three proposed solutions arrange the data objects into pages. Each page represents a set of objects described by their minimum bounding rectangle (MBR). Two R*-trees [9] are constructed, one built on R and the other on S. With the help of the R*-trees, a set of candidate pages from S that may contain k-NNs for each MBR of R is predicted. Each candidate page is assigned a priority based on its proximity to that MBR of R and is stored in a Priority Table (PT). The first algorithm, the pessimistic approach, fetches as many candidate pages as possible from S for each page of R. The second algorithm, the optimistic approach, fetches one candidate page at a time from S for each page of R. The third algorithm dynamically decides the number of pages that needs to be fetched for each page of R by analyzing the query history. It reduces the CPU and I/O cost significantly through three optimizations, by dynamically pruning 1) pages of S that are not in the k-NN set of sufficiently many objects in R, 2) pages of R whose nearest neighbors do not contribute to the result, and 3) objects in candidate MBRs of S that are too far from the MBRs of R. The method further reduces these costs by pre-processing the input databases using a packing technique called Sort-Tile-Recursive (STR) [76].

6.3 Predicting the Solution: Priority Table Construction

Let R and S be two given databases, and let A and B be the sets of MBRs that contain the objects in these databases. This section discusses the computation of the candidate set of MBR pairs, one from A and the other from B, that need to be inspected to answer a given GNN query. Assume that the databases are packed and indexed prior to the GNN query; the packing of the database is discussed in more detail in Section 6.6. This is a one-time cost per database, and the same index is used for all queries. STR [76] ordering is used for a total ordering of the data. Throughout this chapter the R*-tree is used to index the databases; other index structures can replace the R*-tree. For simplicity, the capacity of each MBR of the R*-tree is chosen as one disk page, and leaf-level MBRs are used to prune the solution space.

Given two MBRs B1 and B2, MAXDIST(B1, B2) and MINDIST(B1, B2) are defined as the

maximum and minimum distance between B1 and B2. The following lemma establishes an upper

bound to the k-NN distance to the objects in a set of MBRs.

Lemma 1. Let A be the MBR of a set of objects and a E A be an object. Let B {B1, BI L}

be the set of leaf level MBRs of an index structure built on a database. Assume that the MBRs in

B/,where Bi C B, contain at least K objects. Let E denote the distance of object a to its kth NN

in B, then

E < max{MAXDIST(A, B)},Vk, 1 < k < K.
BEBI









Proof follows from the observation that all objects in BI appear in B too.For a given positive

integer k, let m be the integer, 1 < m < IB for which

-' I Bi I< k< B, ,

where IBI is the number of objects in Bi. Let MAXDIST(A, Bi) < MAXDIST(A, Bi+1), Vi 1

< i <|l|, where I\B is the cardinality of B.

From Lemma 1, the k-NN distance of the objects in A to the objects in B is at most MAXDIST(A, B_m). Hence, if MAXDIST(A, B_m) < MINDIST(A, B) for an MBR B ∈ B, then B does not contain any object from the k-NN set of any object in A. Therefore, B can be pruned away from B without any false dismissals during the computation of the k-NNs of the objects in A. From these observations, given a GNN query GNN(R, S, S_I ⊆ S, k, t), a priority list of the candidate boxes in B is computed for each A ∈ A as follows:

For each A ∈ A,

Step 1: Compute MAXDIST(A, B_m) for the given value of k as discussed above.

Step 2: Find the MBRs B ∈ B for which MINDIST(A, B) ≤ MAXDIST(A, B_m).

Step 3: Assign priorities to these MBRs in increasing MINDIST(A, B) order.

The algorithm for Step 1 takes an MBR A, the root node of an R*-tree, and an integer k as input. The root node is stored in a min-heap. The node with the smallest MAXDIST to A is extracted from this heap. If the MINDIST of this node to A is more than the threshold, then it is omitted; otherwise it is inspected. If it is an internal node, then its children are inserted into the min-heap; otherwise, it is inserted into the candidate set, which is maintained using a max-heap. If the candidate set contains more objects than necessary, then the MBR with the largest MINDIST value is removed from the candidate set. Although the worst-case time complexity of this step is O(|B|) (i.e., the entire index is traversed), the amortized complexity is only O(log |B|). Step 2 is computed using the classic range search algorithm on R-trees; therefore, the amortized time complexity of this step is also O(log |B|). This step eliminates all the leaf-level MBRs that contain only irrelevant points. Naturally, if an MBR contains at least one relevant point, it is detected in Step 1 and that MBR will be processed by the strategies proposed








[Figure: a priority table with rows r1-r8 and columns s1-s8; each numbered cell gives the priority of that candidate MBR of S for the corresponding row of R, and empty cells mark non-candidates.]

Figure 6-2. A sample Priority Table for two databases R and S.

in Section 6.4. Step 3 takes O(C log C) time, where C is the number of candidate MBRs found at Step 2.
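A rough sketch of this candidate-selection procedure is given below. It is an illustration under simplifying assumptions: the leaf MBRs are kept in a flat list rather than traversed through an R*-tree with heaps, and MINDIST/MAXDIST follow the usual rectangle-distance definitions.

    import math

    def mindist(A, B):
        # A, B are MBRs given as (lows, highs) pairs of coordinate tuples.
        return math.sqrt(sum(max(bl - ah, al - bh, 0.0) ** 2
                             for al, ah, bl, bh in zip(A[0], A[1], B[0], B[1])))

    def maxdist(A, B):
        return math.sqrt(sum(max(abs(ah - bl), abs(bh - al)) ** 2
                             for al, ah, bl, bh in zip(A[0], A[1], B[0], B[1])))

    def candidate_priorities(A, leaves, counts, k):
        """Steps 1-3: priority list of leaf MBRs of S for one MBR A of R.

        leaves : list of leaf-level MBRs of S
        counts : counts[i] = number of objects inside leaves[i]
        """
        # Step 1: order leaves by MAXDIST and find the threshold MAXDIST(A, B_m),
        # where the first m leaves together hold at least k objects (Lemma 1).
        by_max = sorted(range(len(leaves)), key=lambda i: maxdist(A, leaves[i]))
        total, threshold = 0, float("inf")
        for i in by_max:
            total += counts[i]
            if total >= k:
                threshold = maxdist(A, leaves[i])
                break
        # Step 2: keep leaves whose MINDIST does not exceed the threshold.
        cand = [i for i in range(len(leaves)) if mindist(A, leaves[i]) <= threshold]
        # Step 3: priorities in increasing MINDIST order (priority 1 = closest).
        cand.sort(key=lambda i: mindist(A, leaves[i]))
        return {leaf_idx: rank + 1 for rank, leaf_idx in enumerate(cand)}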

The candidate MBRs for all the MBRs in A are stored in a Priority Table (PT). Figure 6-2 depicts the PT constructed for the GNN(R, S, S_I, k, t) query. Here, r_i and s_i correspond to MBRs of R and S, and each row/column corresponds to a page of R/S on disk. The number in each cell shows the priority of that column for that row. Assume that S_I = {s1, s3, s4, s5, s7} in this example. For simplicity, two assumptions are made without affecting generality: 1) the objects in R and S are located sequentially on disk, and 2) each row and column of the PT (i.e., each MBR) corresponds to one disk page. The numbers in each row show the priority of the candidate MBRs in S for the corresponding MBR in R. For example, in row 1, the MBRs s1, s3 and s7 are in the candidate set of r1, such that s3 has the highest priority and s1 has the lowest priority. This is depicted in Figure 6-3, where r1 ∈ R and S = {s1, ..., s7}, and the objects shown within MAXDIST distance from r1 form its candidate set. If an MBR of S is not in the candidate set of an MBR in R, then the corresponding cell is left unnumbered.














[Figure: r1 and the MBRs s1-s7 of S; the MAXDIST radius around r1 determines its candidate set.]

Figure 6-3. First row of the Priority Table.


Given a query GNN(R, S, S_I ⊆ S, k, t), our search methods reduce the solution space by pruning the PT. The following two optimizations reduce the search space by inspecting the PT:

Optimization 1. (Column Filter) Let s_i correspond to an MBR in S_I. If the total number of objects in the MBRs of R which have s_i in their candidate set is less than t, then that column can be removed from S_I.

For example, in Figure 6-2, s5 is in the candidate set of only r4. If the total number of objects in r4 is less than t, then s5 can be removed from S_I safely. The correctness of the Column Filter follows from the fact that an object in a column s_i can only be in the k-NN set of objects in the rows that have s_i in their candidate set.

Optimization 2. (Row Filter) If a row does not contain any candidate MBRs in S_I, then it can be removed from the PT.

For example, in Figure 6-2, rows r3 and r8 do not have any candidates in S_I. Therefore, these rows can be omitted safely. If s5 is pruned from S_I due to the Column Filter, then the row r4 can also be ignored.
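As the example suggests, the two filters can trigger each other, so they can be applied repeatedly until nothing changes. The sketch below illustrates them on a simple dictionary representation of the PT (the data structures and names are assumptions for illustration, not the actual implementation).

    def apply_pt_filters(pt, r_counts, s_focus, t):
        """Apply the Column Filter and Row Filter to a priority table.

        pt       : dict  r_page -> set of candidate s_pages
        r_counts : dict  r_page -> number of objects in that R page
        s_focus  : set of S pages whose columns belong to S_I
        t        : broadness threshold
        Returns the pruned (pt, s_focus).
        """
        changed = True
        while changed:                      # the two filters can enable each other
            changed = False
            # Column Filter: drop S_I columns supported by fewer than t R objects.
            for s in list(s_focus):
                support = sum(r_counts[r] for r, cands in pt.items() if s in cands)
                if support < t:
                    s_focus.discard(s)
                    changed = True
            # Row Filter: drop rows with no remaining candidate in S_I.
            for r in list(pt):
                if not (pt[r] & s_focus):
                    del pt[r]
                    changed = True
        return pt, s_focus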









6.4 Static Search Strategies

The PT defines the MBR pairs that (potentially) need to be compared to answer a given GNN query. This section develops two methods to compute a GNN query, GNN(R, S, S_I ⊆ S, k, t), given the PT of the databases R and S. These methods are referred to as Fetch All (FA) and Fetch One (FO). Throughout this section, assume a limited buffer of size B; that is, the sizes of both R and S are larger than B.

6.4.1 Fetch All

The first method uses a pessimistic strategy: to process each page (i.e., MBR) of R (i.e., one row in the PT), it reads as many candidate pages from S as possible into the buffer at once, starting from the one with the highest priority. For example, for r1, FA reads s1, s3, and s7. FA runs in three phases: (1) find maximal clusters that fit into the buffer, (2) reorder the clusters to maximize the overlap, and (3) read the pages for each cluster and process their contents. Next, each phase is elaborated.

6.4.1.1 Creating clusters

The clusters are created by iteratively adding rows into the current cluster, starting from the first row, until its size reaches B. When the cluster becomes large enough, a new cluster is begun. For example, if B = 6 pages, then the clusters for the PT in Figure 6-2 are C1 = {r1, r2, s1, s3, s4, s7}, C2 = {r3, r4, s2, s5, s6, s8}, C3 = {r5, s1, s2, s4, s8}, and C4 = {r6, r7, r8, s1, s2, s6}. The total cost of this step is linear in the number of candidate pages, since each candidate page is visited only once.

6.4.1.2 Ordering clusters

The order in which the clusters are read affects the total amount of disk I/O. This is because, if consecutive clusters have common pages, these pages will be reused and do not need to be read again. For example, if C3 is read after C1, then s1 and s4 will be reused, saving two disk reads. Given a read schedule of clusters, the total number of disk reads saved by reusing the buffer is equal to the sum of the common pages between consecutive clusters. For example, if









the clusters are read in the order C1, C3, C2, C4, then the total savings add up to 6 pages (i.e., |C1 ∩ C3| + |C3 ∩ C2| + |C2 ∩ C4| = 6).

One can show that the Traveling Salesman Problem (TSP) can be reduced to the problem of finding the best schedule for reading clusters. Intuitively, the proof is as follows. Each vertex of the TSP maps to a cluster. Each edge weight w_ij between clusters C_i and C_j is computed as the number of overlapping pages between C_i and C_j. The best schedule on this graph is the Hamiltonian path that maximizes the sum of edge weights. Since TSP minimizes the sum of edge weights, the weight of each edge w_ij is updated as w_ij = w_max - w_ij, where w_max is the largest edge weight. This guarantees that the new edge weights are non-negative. Then a new node v is created and connected to all nodes by zero-weight edges. The optimal schedule is the path with the smallest sum of edge weights which begins at vertex v and visits all nodes once.

A greedy heuristic is used to find a good schedule as follows: We start with an empty path.

While there are unvisited vertices, we insert the next edge with the smallest weight into the path

if it does not destroy the path. Finally, the disconnected paths are attached randomly if there are

any.
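The following sketch is one possible rendering of this greedy heuristic (an illustration only: it works directly with overlap sizes, so picking the largest remaining overlap corresponds to picking the smallest inverted TSP weight, and leftover path fragments are simply concatenated).

    from itertools import combinations

    def order_clusters(clusters):
        """Greedily chain clusters so that consecutive ones share many pages.

        clusters : list of sets of page identifiers
        Returns a list of cluster indices giving the read schedule.
        """
        n = len(clusters)
        # Candidate edges sorted by decreasing overlap (= increasing inverted weight).
        edges = sorted(combinations(range(n), 2),
                       key=lambda e: len(clusters[e[0]] & clusters[e[1]]),
                       reverse=True)
        parent = list(range(n))          # union-find to avoid cycles
        degree = [0] * n                 # keep every vertex on a simple path
        adj = {i: [] for i in range(n)}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for u, v in edges:
            if degree[u] < 2 and degree[v] < 2 and find(u) != find(v):
                adj[u].append(v)
                adj[v].append(u)
                degree[u] += 1
                degree[v] += 1
                parent[find(u)] = find(v)
        # Attach the remaining path fragments and walk each one end to end.
        schedule, seen = [], set()
        for start in sorted(range(n), key=lambda i: degree[i]):  # endpoints first
            if start in seen:
                continue
            cur, prev = start, None
            while cur is not None and cur not in seen:
                schedule.append(cur)
                seen.add(cur)
                nxt = [w for w in adj[cur] if w != prev and w not in seen]
                prev, cur = cur, (nxt[0] if nxt else None)
        return schedule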

6.4.1.3 Processing clusters

Once the cluster schedule is determined, the contents of each cluster are iteratively read into

buffer using optimal disk scheduling [77]. The procedure used to process each cluster after it is

fetched into buffer is given in Algorithm 7. For each row in the cluster, the algorithm searches

the k-NN of each object starting from the box with the highest priority (Steps 1 and 2). The

results obtained at this step are used to prune the candidate set (Step 3). After the candidate set is

pruned, Optimizations 1 and 2 are applied to PT in order to further reduce the solution space.

6.4.2 Fetch One

FA reads many redundant pages if only a small percentage of the candidate pages contain actual k-NNs. FO uses an optimistic approach to avoid this problem: it iteratively reads one page per row as long as there are more candidates.










input : k, a positive integer.
1 for each row r_i in the buffer do
2     while row r_i has more uninspected candidate MBRs in the buffer do
3         Let s_ri be the next unprocessed candidate MBR with the highest priority.
4         Find the k-NN of each object in r_i in s_ri. Store the maximum of these k-NN distances in d_max.
5         Remove all candidates s of r_i in the PT for which MINDIST(r_i, s) > d_max.
6     end
7 end
Algorithm 7: Process Buffer Algorithm.


The pseudocode for the FO algorithm is given in Algorithm 8. The algorithm splits the buffer equally between the two databases, because one candidate page is read per row, starting from the highest priority (Step 1). Therefore, the number of pages from each database in the buffer will be equal at all times if all the candidate pages are distinct. After searching each candidate page (Step 2), the PT is further pruned by eliminating the pages that are farther than the kth NN found so far (Step 3), and by using Optimizations 1 and 2 (Step 4).

For example, for the PT in Figure 6-2, let the buffer size be 6 pages. FO first reads {r1, r2, r3} and {s3, s4, s6} into the buffer. Assume that the third candidate of r1 is pruned at the end of this step. Next, {s7, s8} are read to replace {s4, s6}. Although it is the second candidate of r2, s3 is not read at this step since it is already in the buffer. Assume that the third candidate of r2 is pruned at the end of this step. Since none of the rows {r1, r2, r3} have any remaining candidates, FO does not need to read any more pages for these rows. Therefore, {r1, r2, r3} is replaced with {r4, r5, r6}, and the search continues recursively.

input : k, a positive integer.
1 while there are unprocessed rows do
2     Fill half of the buffer with MBRs r_i from R and one page from S (s_ri) for each r_i.
3     ProcessBuffer(k).
4     Remove from the buffer all rows r_i that have no other uninspected candidate MBRs.
5     Apply Optimizations 1 and 2 on the PT.
6 end
Algorithm 8: Fetch One Algorithm.









input : k, a positive integer.
        B, the buffer size.
1 Initialize f.
2 while there are unprocessed rows do
3     Fill the buffer with ⌊B/(f+1)⌋ pages (r_i) from R and f pages from S (s_ri) for each r_i.
4     ProcessBuffer(k).
5     Remove from the buffer all rows r_i that have no other uninspected candidate MBRs.
6     Apply Optimizations 1 and 2 on the PT. Update the value of f.
7 end
Algorithm 9: Fetch Dynamic Algorithm.


6.5 Dynamic Strategy

FO reads only the necessary pages (i.e., MBRs) to compute a given GNN(R, S, S_I, k, t)

query since it reads one page at a time starting from the highest priority and stops when the

distance to the next MBR is more than the distance to the kth NN found so far. However, this

does not guarantee that the total I/O cost is minimized. This is because FO incurs a random

seek cost every time a new page is fetched from disk. Since a random seek is significantly more

costly than a page transfer, reading a few redundant pages sequentially at once may be faster than

FO. Thus, neither FO nor FA ensures the optimal I/O cost. The number of pages read at each

iteration, f, that minimizes the I/O cost depends on the query parameters and the distribution of

the database. A good approximation to this number can be obtained by sampling the MBRs of R.

The third method, Fetch Dynamic (FD) adaptively determines the value of f as follows. It

starts by guessing the value of f. It then reads the first cluster using this value. As it finds the

k-NNs of all the objects in the first cluster, it computes the optimal value of f for that cluster. It

then uses this value of f to choose the next cluster. After processing each cluster, it iteratively

updates f as the median of the number of pages needed for all of the rows processed so far. Note

that, the choice of the initial value of f has no impact in the performance after the first step, since

f is updated immediately after every iteration. As more rows are processed in each iteration, f

adapts to the query parameters and data distribution.
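The adaptive update of f can be sketched as follows (a toy illustration of Step 2.e; the counts of pages actually needed per row are assumed to come from the cluster-processing step).

    from statistics import median

    def update_f(pages_needed_history, new_counts):
        """Update f after a processed cluster.

        pages_needed_history : per-row counts of S pages actually needed,
                               for all rows processed so far (updated in place)
        new_counts           : counts for the rows of the cluster just processed
        Returns the new value of f: the median of all counts seen so far.
        """
        pages_needed_history.extend(new_counts)
        return max(1, int(median(pages_needed_history)))

    # Example: start with a guess, then adapt after each processed cluster.
    history, f = [], 4                  # initial guess for f
    for cluster_counts in [[3, 5, 2], [7, 6], [4, 4, 5]]:
        f = update_f(history, cluster_counts)
    print(f)                            # median of all counts seen so far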









The pseudocode of the FD algorithm is given in Algorithm 9. The algorithm first assigns an initial value to f (1 ≤ f ≤ candidate-set size); 20% of the average number of candidates of the R pages is used as the initial guess. Let B denote the buffer size. While there are unprocessed rows, FD reads ⌊B/(f+1)⌋ pages (r_i) from R and, for each R page in the buffer, the f highest-priority pages (s_ri) from S (Step 2.a). Thus, if all the candidates are distinct, the buffer is filled with pages from R and S. Step 2.b processes each candidate page s_ri. The processed pages (r_i's) are removed from the buffer at Step 2.c. The algorithm continues with Steps 2.a to 2.c until all the rows in the buffer are exhausted. Then Optimizations 1 and 2 are applied (Step 2.d). The value of f is updated at Step 2.e as the median of the number of candidates of the processed pages in R.

6.6 Further Improvements for GNN Queries

So far, two optimizations, the Row Filter and the Column Filter, that trim both the I/O and CPU costs have been discussed. This section discusses further optimizations to cut down both the CPU and I/O costs of FA, FO, and FD.

6.6.1 Adaptive Filter

The third optimization follows from the following observation. A given MBR r can be
expanded by dmax in all dimensions. If a candidate MBR s overlaps with this expanded MBR, the
distances between all pairs of points from r and s are computed (Steps 2 and 3 of Algorithm 7).
This incurs O(t²) comparisons if each MBR contains O(t) points. This cost is reduced in two
ways. First, instead of expanding by dmax, different dimensions can be adaptively expanded by
different amounts. Second, the O(t²) comparisons are avoided by pruning unpromising points from
S in a single pass. More formally, first all points in a candidate MBR s that are contained in the
expanded MBR of r are found. Next, the distances between all those points and all points of r are
computed. Let t1, t1 ≤ t, be the number of points in s that are contained in the expanded MBR
of r. The CPU cost for the comparison of an MBR pair then drops from O(t²) to O(t + t·t1). This is
summarized as the third optimization, the Adaptive Filter.

Optimization 3. (Adaptive Filter) Let p be a point in MBR r. Let d be the kNN distance of p to
the points in MBR s. Let the kNN-sphere of p denote the sphere with radius d centered at p. Let M
denote the MBR that tightly covers the kNN-spheres of all the points in r. A point cannot be a
kNN of a point in r if it is not contained in M.

Figure 6-4. Adaptive Filter Example.
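To illustrate Optimization 3, the following minimal sketch (in Python) constructs the bounding box M from the current kNN distances of the points in r and filters a candidate MBR's points in a single pass. It assumes that points are given as coordinate tuples and that the current kNN distance of each point in r is already available; the function name is illustrative.

def adaptive_filter(r_points, s_points, knn_dists):
    # r_points:  the points of MBR r, as coordinate tuples.
    # s_points:  the points of a candidate MBR s.
    # knn_dists: the current kNN distance of each point in r; these radii
    #            define the kNN-spheres of Optimization 3.
    d = len(r_points[0])
    # M tightly covers every kNN-sphere: extend each point by its own radius.
    lo = [min(p[i] - rad for p, rad in zip(r_points, knn_dists)) for i in range(d)]
    hi = [max(p[i] + rad for p, rad in zip(r_points, knn_dists)) for i in range(d)]
    # Single pass over s: only points inside M can still be a kNN of a point in r.
    return [q for q in s_points
            if all(lo[i] <= q[i] <= hi[i] for i in range(d))]

Only the surviving points are then compared with the points of r, which yields the O(t + t·t1) cost noted above.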

In Figure 6-4, m1 and m2 are the expanded MBRs of r without and with the Adaptive Filter,
respectively. s1, s2 ∈ S, and s1 is the MBR of the points {q1, q2, q3, q4}. The expanded MBR of r
in the worst case is given by m1. When adaptive bounds are used, the expanded MBR m2 is
obtained. In the former case, the two MBRs s1 and s2 intersect with m1. Thus, 3 disk I/Os
(r, s1, s2) and 32 comparisons are made. However, only s1 intersects with m2 in the latter case.
Hence MBR s2, which does not have any point inside m2, can be pruned. This reduces the I/O
cost to 2 page reads (r, s1) and the CPU cost to 16 comparisons. Furthermore, Optimization 3
states that a point in S is considered only if it is inside m2. Therefore, each point in s1 can be
scanned once to find such points. These points are then compared to the points in r to update the
k-NNs. Thus the CPU cost reduces to 12 comparisons (4 for scanning s1 and 8 for comparing the
points in r with q1 and q2).

6.6.2 Partitioning

Optimization 3 is improved further by partitioning the MBR r along selected dimensions.
Dimensions with high variances are selected for the partitioning, starting from the dimension
with the highest variance. The MBR is split along this dimension into two MBRs such that each
resulting MBR contains the same number of objects. Each of these MBRs is then recursively
partitioned along the dimension with the next highest variance.

Figure 6-5. Partitioning Example.
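The recursive, variance-based split described above can be sketched as follows. This is a simplified illustration assuming the objects of an MBR are available as a list of coordinate tuples; in the actual method the split is applied to the leaf MBRs of the R*-tree, and the function name is illustrative.

import statistics

def partition_mbr(points, max_dims=8):
    # Split the points of an MBR into equal-sized halves, one dimension per
    # level, taking dimensions in decreasing order of variance (at most
    # max_dims levels, as in the experiments).
    d = len(points[0])
    order = sorted(range(d),
                   key=lambda i: statistics.pvariance(p[i] for p in points),
                   reverse=True)[:max_dims]

    def split(pts, level):
        if level == len(order) or len(pts) < 2:
            return [pts]
        dim = order[level]
        pts = sorted(pts, key=lambda p: p[dim])
        mid = len(pts) // 2                     # equal number of objects per half
        return split(pts[:mid], level + 1) + split(pts[mid:], level + 1)

    return split(points, 0)

Each returned group is then enclosed in its own MBR and expanded adaptively as in Optimization 3.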

Partitioning improves the performance in two ways. First, since each of the partitions is
smaller than the original MBR, the pruning distance (dmax) along each dimension is reduced.
This reduces the I/O cost. Second, without partitioning, an object in MBR s is compared to all
the objects in r if the extended MBR of r contains it. With partitioning, however, an object in
s is not compared to the objects in partitions whose extended MBRs do not contain it. Thus,
the CPU cost is reduced by avoiding unnecessary comparisons. Note that as the number of partitions
increases, the number of point-MBR comparisons increases. When the number of partitions
becomes O(t) (i.e., the number of objects per MBR), the number of such comparisons becomes
O(t²), and partitioning becomes useless. In the experiments, MBRs are partitioned along at most
eight dimensions for the best performance.

In Figure 6-5, v1 and v2 are the two partitions of MBR r, and m3 and m4 are their extended
MBRs. s1 ∈ S is the MBR of the points {q1, q2, q3, q4}. The horizontal dimension is used to
partition the MBR r into the two partitions v1 and v2. As in Figure 6-4, MBR s2, which does not
have any points inside m2, can be pruned. Each point in s1 is scanned once to find the candidate
points for the partitions v1 and v2. Only q1 is present in the extended MBR of v2, reducing the
CPU cost to 10 comparisons (8 for comparing the points in s1 with v1 and v2, and 2 for comparing
the points in v2 with q1).

6.6.3 Packing

The performance of R-tree based methods can be improved by using packing algorithms,
which group similar objects (objects within a close neighborhood) together. The Sort-Tile-
Recursive (STR) method [76] is employed for packing the R-trees built on the databases. Let
N be the number of d-dimensional objects in a database, let B be the capacity of a node in the
R-tree, and let P = ⌈N/B⌉ be the number of leaf pages. STR sorts the objects according to their
first dimension. The sorted data is then divided into S = ⌈P^(1/d)⌉ slabs, where a slab consists of
a run of B·⌈P^((d-1)/d)⌉ consecutive objects from the sorted list. Each slab is then processed
recursively using the remaining d − 1 coordinates. It has been shown in [76] that for most types
of data distributions the STR ordering performs better than the space-filling-curve based Hilbert
ordering [51].
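As an illustration of the STR ordering, the following sketch packs point data into leaf pages of capacity B. It is a simplified version for points rather than rectangles; the function name is illustrative, and the recursion on the remaining coordinates follows the description above.

import math

def str_pack(points, B, dim=0):
    # Groups the points into leaf pages of at most B objects so that objects
    # that are close in space tend to end up on the same page.
    if len(points) <= B:
        return [points]
    d = len(points[0])
    P = math.ceil(len(points) / B)                   # leaf pages needed
    slabs = math.ceil(P ** (1.0 / (d - dim)))        # slabs along this dimension
    run = math.ceil(len(points) / slabs)             # objects per slab
    ordered = sorted(points, key=lambda p: p[dim])
    pages = []
    for i in range(0, len(ordered), run):
        slab = ordered[i:i + run]
        if dim + 1 < d:
            pages.extend(str_pack(slab, B, dim + 1)) # recurse on remaining coordinates
        else:
            # Last dimension: cut the slab into pages of at most B objects.
            pages.extend(slab[j:j + B] for j in range(0, len(slab), B))
    return pages

The pages produced this way can then be bulk-loaded bottom-up to obtain a packed R-tree.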

6.7 Experiments

Two classes of databases are used in the experiments.

Image databases: Each of Image1 and Image2 contains 60-dimensional feature vectors of
34,433 satellite images. Two databases are created from Image1 and Image2 by splitting
each 60-dimensional vector into 30 two-dimensional vectors. Each of the resulting databases
contains 1,032,990 data points.

Protein structure databases: Each of Protein1 and Protein2 contains 288,156 three-dimensional
feature vectors for secondary structures of proteins from the Protein Data Bank
(ftp://ftp.rcsb.org/pub/pdb), as discussed in [24].

In addition to FA, FO, and FD, three of the existing methods are implemented: sequential
search (SS), the R-tree-based NN method of Roussopoulos et al. (RT) [70], and Mux-Join [19].
To implement the buffer restrictions in RT, half of the available buffer is used for R and the other
half for S. In order to adapt these methods to GNN(R, S, S', k, t), a k-NN search is performed for
each object in R. SS is included in the experiments, as it is better than many complicated NN
methods for a broad set of data distributions [15]. The source codes of GORDER [97] and
RkNN [84] are obtained from their authors. However, at its current state, it is impossible to
restrict the memory usage of GORDER to a desired amount. Therefore, GORDER is used in only
one of the experiments, where this is possible.

Figure 6-6. Evaluating Optimizations 1, 2 and 3 (CPU and I/O time of FD for varying S' size as a percentage of S).

In all the experiments, S' = S is used unless otherwise stated. A 4 kB page size is used in all
the experiments. The experiments ran on an Intel Pentium 4 processor with a 2.8 GHz clock speed.

6.7.1 Evaluation of Optimizations

This section inspects the performance gain due to Optimizations 1, 2 and 3 and the
improvements in Section 6.6. The GNN query is performed by varying the size of S' from 0.5%
to 8% of S, by selecting pages of S randomly. In this experiment, FD used k = 10, t = 3,000,
and a buffer size of 10% of the database size. The queries are run on the two-dimensional image
database.

The CPU and I/O time of FD with four different settings of Optimizations 1, 2 and 3
(obtained by turning these optimizations on and off) on the two-dimensional image databases for
different sizes of S' are given in Figure 6-6. According to the results, the main performance
gain is obtained from Optimizations 2 and 3, yet there is a slight performance gain from
Optimization 1.

Figure 6-7. Evaluation of Partitioning.

Table 6-1. Comparison with an Optimal solution (total disk I/Os).
Buffer Size (%)       5       10       20       40
Oracle            11245    10980    10244    10252
FO                14601    13425    12710    12393
FD                16706    13825    11702    10789
FA               155727    73238    23885    10328

The reason that Optimization 1 has a smaller impact can be explained as

follows. t is only 0.3% of the total number of objects in R. Thus, Optimization 1 can eliminate a
page of S only if it is in the candidate set of fewer than 3 pages of R. The impact of Optimization 1
is larger when the ratio of the average number of candidate pages to t is lower. This happens
when t is large or k is small. Optimization 2 has a high impact when S' is smaller. This is
because fewer rows in PT have candidates in S' for small S'. Another way to obtain a high filtering
rate from Optimization 2 is to reduce the average number of candidate pages per row by choosing
a small value for k. Optimization 3 effectively reduces the CPU and I/O cost for different sizes
of S'. We can also see that for higher percentages of S, the impact of this optimization remains
constant and is independent of the size of S'. This can be understood from the fact that for a fixed


value of k and at higher percentages of S, every MBR r ∈ R has the same number of candidate
MBRs from S. This results in a constant reduction in CPU and I/O costs.

Figure 6-8. Comparison of the proposed methods on two-dimensional image database for different buffer sizes.

The performance gains on top of Optimizations 1, 2, and 3 (the Unpartitioned algorithm)
obtained by partitioning the MBRs and by using the STR-based packing algorithm are compared
in Figure 6-7. Here, the CPU and I/O time of FD with three different settings, Unpartitioned,
Partitioned, and Packed (along with partitioning), are reported on the two-dimensional image
databases for different sizes of S'. Partitioning reduces the I/O cost by up to a factor of 3 and the
CPU cost by orders of magnitude. The tighter bounds of the extended MBRs of the partitions
result in a reduction of the pruning distance, which explains the I/O and CPU performance gains
of the partitioned algorithm. Packing utilizes the distribution of the data and groups similar
objects in MBRs that have a common parent, yielding a better organization of the R*-tree index
structure. This results in a lower value for the parameter f in FD and hence better performance
gains. Packing reduces the I/O cost by up to a factor of 10 and the CPU cost by orders of
magnitude compared to the unpartitioned algorithm. It outperforms the partitioned algorithm by up to a










factor of 2 and 6 in I/O and CPU costs, respectively. From here on, all optimizations are used in all
of the proposed methods.

Figure 6-9. Comparison of the proposed methods on two-dimensional image database for different values of k.

Scheduling the pages is known as the paging problem [63]. Chan [25] proposed heuristic-based
O((R_p + S_p)²) algorithms (R_p and S_p are the number of pages in the two databases) for index-
based joins. For large databases, however, these heuristics are not efficient. An online scheduling
algorithm can be evaluated using competitive analysis [1]. In competitive analysis, an online
algorithm is compared with an optimal off-line algorithm which knows all the candidate pages
in advance. An algorithm is c-competitive if, for all sequences of page requests, C_A ≤ c·C + b,
where C_A is the cost of the given algorithm, C is the cost of the off-line algorithm, b is a constant,
and c is the competitive ratio.

The performance of the online methods is compared with that of an off-line version, named Oracle.
For each MBR r ∈ R, Oracle provides the set of MBRs from S such that every MBR in this
set contains at least one k-NN of at least one object in r. Then the number of I/Os of Oracle
is optimized using the heuristic discussed in Section 6.4.









The lower bound on the number of I/Os is the total number of pages in R and S. The purpose of
this experiment is to observe how the I/O cost of the online methods compares to that of an
off-line method and to the minimum possible I/O cost. Table 6-1 compares the performance of
Oracle with the proposed methods.

Since each database has 5,064 pages, the lower bound on the number of disk I/Os is 10,128
(5064 + 5064). The competitive ratio of FA is smallest for large buffer sizes (1.008 for a 40%
buffer) and that of FO is smallest for small buffer sizes (1.3 for a 5% buffer). FD has a small
competitive ratio across all buffer sizes (from 1.5 for a 5% buffer to 1.05 for a 40% buffer). It can
be concluded that our methods perform very close to the off-line method.
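For reference, the competitive ratios quoted above can be reproduced directly from Table 6-1 by dividing each method's I/O count by Oracle's count for the same buffer size. The short Python check below does exactly that; the numbers are copied from the table.

oracle = {5: 11245, 10: 10980, 20: 10244, 40: 10252}   # I/O counts of Oracle
counts = {
    "FO": {5: 14601, 10: 13425, 20: 12710, 40: 12393},
    "FD": {5: 16706, 10: 13825, 20: 11702, 40: 10789},
    "FA": {5: 155727, 10: 73238, 20: 23885, 40: 10328},
}
for name, row in counts.items():
    ratios = {b: round(row[b] / oracle[b], 3) for b in sorted(oracle)}
    print(name, ratios)   # competitive ratios relative to Oracle

Up to rounding, this reproduces the ratios quoted above, for example about 1.3 for FO with a 5% buffer and about 1.01 for FA with a 40% buffer.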

Table 6-2. Comparison with GORDER (memory usage and running time for varying grid sizes).
Grid Size          1000     500     200     100
Buffer Size (%)     175     108      88      85
Time (seconds)      305     535    1519    4259

Table 6-3. Comparison with RkNN (running times, in seconds, for 100 query points).
k              10        20        30        40
RkNN         2620     11750     84145    175495
FD            101    101.76     101.3    101.83


6.7.2 Comparison of Proposed Methods

This section compares FA, FO, and FD to each other for different parameter settings.

6.7.2.1 Evaluation of buffer size

Here, the performance of FA, FO, and FD is compared when the buffer size varies from 5
to 40% of the total size of R and S. The two-dimensional image database is used with k = 10 and
t = 500.

The I/O time and the running time of our methods are given in Figure 6-8. For lower buffer
sizes, FA retrieves all the candidate MBRs for every row and hence the I/O cost takes up most of
the total time. We can observe this from the performance of FA at a buffer size of 5%, which is
dominated by the I/O cost. As the buffer size increases, the cost of all three strategies drops since
more pages can










be kept in the buffer at a time. For small buffer sizes FO has the lowest cost since it does not load
unnecessary candidates. As the buffer size increases, FA has the lowest cost since it keeps almost
the entire S in the buffer. However, in all these experiments, the cost of FD is either the lowest or
very close to the lower of FA and FO. This means that FD can adapt to the available buffer size.

Figure 6-10. Comparison with other methods on two-dimensional image database for different buffer sizes.

6.7.2.2 Evaluation of the number of NN

The next experiment compares the performance of FA, FO, and FD for different values of k.
A 10% buffer size is used and t = 500 for the two-dimensional image database.

The I/O and the running times are given in Figure 6-9. The costs of all these methods
increase as k increases. For different values of k, FO has the lowest cost and FA has the highest
cost, due to the small buffer size (10%). Even when it does not have the lowest cost, FD is very
close to FO. This means that FD can adapt to the parameter k.

6.7.3 Comparison to Existing Methods

This section compares FD to five existing methods, SS, RT, Mux-Index, RkNN, and
GORDER, for different parameter settings. The two-dimensional image and protein databases are
used in the experiments. The experimental results comparing FD with well-known methods for
special cases are also presented. Table 6-2 lists the memory usage and running times (in seconds)
of GORDER on the image database with varying grid sizes; FD runs in only 11.03 seconds for the
same database using a 20% buffer.

Figure 6-11. Comparison with other methods on protein database for different buffer sizes.
6.7.3.1 Evaluation of buffer size

In this experiment set, the values of k and t are fixed, and the buffer size is varied. The two-
dimensional image database is used and k = 10. The running times of GORDER with different
amounts of memory usage and that of FD with 1.6 MB memory are computed. We measured
the actual memory usage of the methods using the top command of Linux. Although the buffer
size (an input parameter to GORDER) is set to 20% of the total database size, we observed that
GORDER uses a significant amount of memory (up to 175% of the database size) for additional
bookkeeping. In order to reduce the actual memory usage, GORDER is run with grid numbers
1000, 500, 200, and 100. However, the actual memory usage of GORDER is always much larger
than 20% of the total database size (i.e., 8 MB). For different memory settings, the running
time of GORDER varied from 300 to 4000 seconds while, for the same query, FD running times
Figure 6-12. Comparison of SS, RT, Mux, and FD on two-dimensional image database for different values of k.


varied from 10 to 13 seconds (see Table 6-2). According to these experiments, FD runs an order
of magnitude faster than GORDER even when it uses a much smaller buffer. It is impossible to
reduce the actual memory usage of GORDER to 20% in its current implementation. Therefore,
in order to be fair, it is not included in the remaining experiments.

The I/O and the running times of SS, RT, Mux-Index [19] and FD for different buffer sizes
on the two-dimensional image and protein databases are given in Figures 6-10 and 6-11. k = 10 and
t = 100 are used. FD is the fastest of these methods in all settings. It can be seen that for small
buffer sizes RT is dominated by the I/O cost. As the buffer size increases, the CPU cost of RT
dominates. Sequential scan is dominated by the CPU cost in all the experiments. The I/O cost of
FD is a fraction of that of RT. FD also reduces the CPU cost aggressively through Optimizations
1 to 3 and partitioning. In all the experiments, the total time of FD is less than the I/O time of RT
or SS alone. Mux-Index is dominated by I/O costs in all experiments. This is because for each
block in R it fills the buffer with blocks from S. Because of the nature of GNN queries, one needs
to load pages multiple times while working with a limited amount of memory, independent of the method









used, naive (sequential scan) or more sophisticated (RT and Mux-Index). FD performs only the
necessary leaf comparisons and uses a near-optimal buffering schedule, thus reducing both the
CPU and I/O costs effectively.

Figure 6-13. Comparison of SS, RT, Mux, and FD on protein databases for different values of k.

6.7.3.2 Evaluation of the number of NN

Here, the performance of FD, SS, Mux, RkNN and RT is compared for different values of
k. A 10% buffer size is used and t = 500 for the two-dimensional image and protein databases.
RkNN was evaluated by querying 100 random query points for different values of k.

The I/O and the running times are given in Figures 6-12 and 6-13. The cost of SS is almost
the same for all values of k. It increases slightly as k increases due to the cost of maintaining the
top k closest objects. The costs of RT, Mux and FD increase as k increases since their pruning
power drops for large values of k. The running times of RT, Mux and FD do not exceed that of SS
as k increases. FD runs significantly faster than the others. Depending on the value of k, FD runs
orders of magnitude faster than RT, SS and Mux. The I/O cost increases much more slowly for FD.
This is because FD adapts to different parameter settings quickly to minimize the amount of disk
reads. Table 6-3 presents the running times of FD and RkNN for 100 query points. While the running time










of RkNN increases at a faster rate and is not scalable for higher values of k, the running time of
FD, including the time taken for the creation of the priority table for each k, is almost constant for
the same query set and is an order of magnitude faster than RkNN.

Figure 6-14. Comparison with other methods on two-dimensional image database for varying database sizes.

6.7.3.3 Evaluation of database size

In this experiment, the performance of FD, SS, Mux, and RT is observed for increasing

database sizes. Smaller databases are created from the original two-dimensional image database

by randomly choosing 50, 25, and 12.5 % of all the vectors. The buffer size is fixed to 10 % of the

original image database, k = 10, and t = 500.

The I/O and the running times are given in Figure 6-14. As R and S grow, the running time
of FD increases almost linearly. This is because when both databases are doubled, the average
number of candidate pages per row in the PT stays almost the same. On the other hand, the total
running time of SS increases quadratically since it has to compare all pairs of data points. The
running time of RT is dominated by the I/O cost and increases faster than that of FD and slower
than that of SS. Like SS, the running time of Mux increases quadratically since it fills the buffer










with blocks from S and is dominated by I/O costs. Thus, the speedup of FD over SS, Mux and
RT increases as the database size increases. This means that the proposed method scales better
with increasing database size.

Figure 6-15. Comparison with other methods on two-dimensional image database for varying number of dimensions.

6.7.3.4 Evaluation of the number of dimensions

In this experiment, the performance of FD, SS, Mux, and RT is observed for an increasing
number of dimensions. Databases of d = 2, 4, 8, 16 dimensions are created by choosing the first
d values of the feature vectors from the original 60-dimensional image databases. The buffer
size is fixed to 10% of the total size of R and S, k = 10, and t = 500. The I/O and the running
times are given in Figure 6-15. As the number of dimensions increases, the running time of SS
increases linearly. On the other hand, the running times of RT and Mux increase faster. This is
also known as the dimensionality curse. For all the methods, the CPU time increases with the
number of dimensions and is significantly larger for 16 dimensions. However, even at 16
dimensions, FD is 1.3 times faster than the sequential scan, up to 3.5 times faster than RT, and up
to 1.2 times faster than Mux-Index.









CHAPTER 7
CONCLUSION

Similarity search in database systems is becoming an increasingly important task in modern
application domains such as artificial intelligence, computational biology, pattern recognition
and data mining. With the evolution of information, applications with new data types such as
text, images, videos, audio, DNA and protein sequences have begun to appear. Despite extensive
research and the development of a plethora of index structures, similarity search is still too costly
in many application domains, especially when measuring the similarity between a pair of objects
is expensive.

In this dissertation, new indexing techniques to improve the performance of similarity search

are proposed. Given a metric database and a similarity measure, the queries we consider are

classified under two categories: similarity search and similarity join queries. Several novel search

and indexing strategies are presented for each category.

Chapter 4 considered the problem of similarity search in static databases with complex

similarity measures. A family of reference-based indexing techniques was developed. Two

novel strategies were proposed for selecting references. Unlike existing methods, these methods

select references that represent all parts of the database. The first one, Maximum Variance

(MV), maximizes the spread of database around the references. The second one, Maximum

Pruning (MP), optimizes pruning based on a set of sample queries. Sampling methods were

used to improve the running times of the index construction. A novel approach to assign the
selected references to database objects was also proposed. The method dynamically maps a
different set of references to each database object rather than using the same references for all
objects. According to our experiments, our methods perform much better than existing strategies.
Among our methods, Maximum Pruning with dynamic assignment of reference sequences
performed the best. The total cost (number of sequence comparisons) of our methods was up to
30 times less than that of its competitors.

Chapter 5 considered the problem of similarity search in dynamic databases with complex

similarity measures. For dynamic databases with frequent updates, two incremental versions of









the MP-Algorithm, the Single Pass (SP) and Three Pass (TP) algorithms were proposed. Since

neither algorithm re-computes the index from scratch, both must make assumptions involving the

change (or lack thereof) of the various gain statistics used by the MP algorithm over time.

Experiments suggested that depending upon the application characteristics, either MP or one

of the incremental methods may be superior. For distance metrics of intermediate computational

cost such as the ED, the incremental methods such as SP and TP seemed preferable. On the other

hand, for distance metrics of high computational cost such as EMD, applying MP is preferred.

If an incremental algorithm is selected, choosing from among SP and TP is somewhat difficult.

The latter generally had superior query performance though not over the DNA data set with a

random ordering. The former had better insert processing performance. A rule of thumb might be

that if inserts are more common, choose SP; if queries are more common, choose TP.

Chapter 6 considered the problem of similarity join queries in large databases. A new

database primitive called Generalized Nearest Neighbor (GNN) was proposed. GNN queries can

answer a much broader range of problems than the k-Nearest Neighbor query and its variants, the
Reverse Nearest Neighbor and the All Nearest Neighbor queries. Based on the available memory

and the number of nearest-neighbors, either CPU or I/O time can dominate the computations.

Thus, one has to optimize both I/O and CPU cost for this problem.

Three methods were proposed to solve GNN queries. These methods arrange the given two

databases into pages and compute a priority table for each page. The priority table ranks the

candidate pages of one database based on their distances to the pages from the other database.

The first algorithm, FA, uses a pessimistic approach. It fetches as many candidate pages as
possible into the available buffer. The second algorithm, FO, uses an optimistic approach. It
fetches one candidate page at a time. The third algorithm, FD, dynamically computes the number
of pages that need to be fetched by analyzing past experience. Three optimizations, the column
filter, row filter and adaptive filter, were proposed to reduce the solution space of the priority
table. Packing

and partitioning strategies which provide significant performance gains were also developed.









These optimizations reduce the CPU cost of the k-NN searches and eliminate additional I/O

costs by pruning the MBRs which do not have a k-NN.

According to the experiments, FA is the best method when the buffer size is large and

FO is the best method when the buffer size is small. FD was the fastest method in most of the

parameter settings. Even when it was not the fastest, the running time of FD was very close to

that of the faster of FA and FO. FD was significantly faster compared to the sequential scan

and the standard R-tree based branch-and-bound k-NN solution to the GNN problem.









REFERENCES


[1] Albers, S.: Competitive Online Algorithms. Tech. Rep. LS-96-2, BRICS (1996)

[2] Altschul, S., Gish, W., Miller, W., Meyers, E.W., Lipman, D.J.: Basic Local Alignment
Search Tool. Journal of Molecular Biology 215(3), 403-410 (1990)

[3] Atkinson, M.P, Bancilhon, F., DeWitt, D.J., Dittrich, K.R., Maier, D., Zdonik, S.B.: The
Object-Oriented Database System Manifesto. In: SIGMOD Conference, p. 395 (1990)

[4] Baeza-Yates, R., Perleberg, C.: Fast and Practical Approximate String Matching. In: CPM,
pp. 185-192 (1992)

[5] Baeza-Yates, R.A., Cunto, W., Manber, U., Wu, S.: Proximity Matching Using Fixed-
Queries Trees. In: CPM '94: Proceedings of the 5th Annual Symposium on Combinatorial
Pattern Matching, pp. 198-212. Springer-Verlag, London, UK (1994)

[6] Baeza-Yates, R.A., Navarro, G.: Faster Approximate String Matching. Algorithmica 23(2),
127-158 (1999)

[7] Bairoch, A., Boeckmann, B., Ferro, S., Gasteiger, E.: Swiss-Prot: juggling between
evolution and stability. Briefings in Bioinformatics 1, 39-55 (2004)

[8] Bayer, R., McCreight, E.M.: Organization and Maintenance of Large Ordered Indices. Acta
Informatica 1, 173-189 (1972)

[9] Beckmann, N., Kriegel, H.P, Schneider, R., Seeger, B.: The R*-tree: An Efficient and
Robust Access Method for Points and Rectangles. In: International Conference on
Management of Data (SIGMOD), pp. 322-331 (1990)

[10] Benson, D., Karsch-Mizrachi, I., Lipman, D., Ostell, J., Rapp, B., Wheeler, D.: GenBank.
Nucleic Acids Research 28(1), 15-18 (2000)

[11] Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commu-
nications of ACM 18(9), 509-517 (1975)

[12] Bentley, J.L.: Multidimensional binary search trees in database applications. IEEE
Transactions on Software Engineering 5, 333-340 (1979)

[13] Berchtold, S., Ertl, B., Keim, D., Kriegel, H.P, Seidl, T.: Fast Nearest Neighbor Search in
High-dimensional Space. In: International Conference on Data Engineering (ICDE), pp.
209-218 (1998)

[14] Berchtold, S., Keim, D.A., Kriegel, H.P: The X-tree : An Index Structure for High-
Dimensional Data. In: VLDB'96, Proceedings of the 22nd International Conference on
Very Large Data Bases, pp. 28-39. Morgan Kaufmann, Mumbai, India (1996)

[15] Beyer, K.S., Goldstein, J., Ramakrishnan, R., Shaft, U.: When Is "Nearest Neighbor"
Meaningful? In: International Conference on Database Theory (ICDT), pp. 217-235 (1999)









[16] Bhattacharya, A., Ljosa, V., Pan, J.Y, Verardo, M.R., Yang, H., Faloutsos, C., Singh, A.K.:
ViVo: Visual Vocabulary Construction for Mining Biomedical Images. In: ICDM, pp.
50-57 (2005)

[17] Bially, T.: Space-Filling Curves: Their generation and their application to bandwidth
reduction. IEEE Transactions on Information Theory 15(6), 658-664 (1969)

[18] Böhm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces: Index
structures for improving the performance of multimedia databases. ACM Computing
Surveys 33(3), 322-373 (2001)

[19] Böhm, C., Krebs, F.: The k-Nearest Neighbour Join: Turbo Charging the KDD Process.
Knowledge and Information Systems (KAIS) 6(6) (2004)

[20] Bozkaya, T., Ozsoyoglu, M.: Distance-based indexing for high-dimensional metric spaces.
In: ACM SIGMOD, pp. 357-368 (1997)

[21] Brisaboa, N.R., Fariña, A., Pedreira, O., Reyes, N.: Similarity Search Using Sparse Pivots
for Efficient Multimedia Information Retrieval. In: ISM '06: Proceedings of the Eighth
IEEE International Symposium on Multimedia (2006)

[22] Burkhard, W.A., Keller, R.M.: Some approaches to best-match file searching. Commun.
ACM 16(4), 230-236 (1973)

[23] Bustos, B., Navarro, G., Chavez, E.: Pivot selection techniques for proximity searching in
metric spaces. Pattern Recogn. Lett. 24(14), 2357-2366 (2003)

[24] Camoglu, O., Kahveci, T., Singh, A.K.: Towards Index-based Similarity Search for Protein
Structure Databases. Journal of Bioinformatics and Computational Biology (JBCB) 2(1),
99-126 (2004)

[25] Chan, C.Y, Ooi, B.C.: Efficient Scheduling of Page Access in Index-Based Join Processing.
IEEE Transactions on Knowledge and Data Engineering (TKDE) 9(6), 1005-1011 (1997)

[26] Chavez, E., Marroquin, J.L., Baeza-Yates, R.: Spaghettis: An Array Based Algorithm for
Similarity Queries in Metric Spaces. In: SPIRE '99: Proceedings of the String Processing
and Information Retrieval Symposium & International Workshop on Groupware, p. 38.
IEEE Computer Society, Washington, DC, USA (1999)

[27] Chavez, E., Marroquin, J.L., Navarro, G.: Fixed Queries Array: A Fast and Economical
Data Structure for Proximity Searching. Multimedia Tools Appl. 14(2), 113-135 (2001)

[28] Chavez, E., Navarro, G., Baeza-Yates, R., Marroquin, J.L.: Searching in metric spaces.
ACM Comput. Surv. 33(3), 273-321 (2001)

[29] Ciaccia, P., Patella, M., Zezula, P.: M-Tree: An Efficient Access Method for Similarity
Search in Metric Spaces. In: The VLDB Journal, pp. 426-435 (1997)









[30] Codd, E.F.: A relational model of data for large shared data banks. Communications of the
ACM 26(1), 64-69 (1983)

[31] Dantzig, G.B.: Application of the simplex method to a transportation problem. In: Activity
Analysis of Production and Allocation, pp. 359-373 (1951)

[32] Delcher, A., Kasif, S., Fleischmann, R., Peterson, J., White, O., Salzberg, S.: Alignment
of Whole Genomes. Nucleic Acids Research 27(11), 2369-2376 (1999)

[33] DeWitt, D.J., Katz, R.H., Olken, F., Shapiro, L.D., Stonebraker, M.R., Wood, D.: Imple-
mentation techniques for main memory database systems. In: SIGMOD '84: Proceedings of
the 1984 ACM SIGMOD international conference on Management of data, pp. 1-8 (1984)

[34] Ferragina, P., Grossi, R.: The String B-tree: A New Data Structure for String Search in
External Memory and Its Applications. JACM 46(2), 236-280 (1999)

[35] Filho, R.F.S., Traina, A.J.M., Traina, C., Faloutsos, C.: Similarity Search without Tears:
The OMNI Family of All-purpose Access Methods. In: ICDE, pp. 623-630 (2001)

[36] Finkel, R.A., Bentley, J.L.: Quad Trees: A Data Structure for Retrieval on Composite Keys.
Acta Informatics 4, 1-9 (1974)

[37] Gaede, V., Günther, O.: Multidimensional Access Methods. ACM Computing Surveys
30(2), 170-231 (1998)

[38] Gao, D., Jensen, S., Snodgrass, T., Soo, D.: Join operations in temporal databases. The
VLDB Journal 14(1), 2-29 (2005)

[39] Giladi, E., Walker, M., Wang, J., Volkmuth, W.: SST: An Algorithm for Finding Near-
Exact Sequence Matches in Time Proportional to the Logarithm of the Database Size.
Bioinformatics 18(6), 873-877 (2002)

[40] Gravano, L., Ipeirotis, P., Jagadish, H., Koudas, N., Muthukrishnan, S., Srivastava, D.:
Approximate string joins in a database (almost) for free. In: VLDB, pp. 491-500 (2001)

[41] Gumbel, E.J.: Statistics of Extremes. Columbia University Press, New York, NY, USA
(1958)

[42] Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and
Computational Biology, 1 edn. Cambridge University Press (1997)

[43] Guttman, A.: R-Trees: A Dynamic Index Structure for Spatial Searching. In: SIGMOD, pp.
47-57. ACM Press (1984)

[44] Hjaltason, G., Samet, H.: Ranking in Spatial Databases. In: Symposium on Spatial
Databases, pp. 83-95. Portland, Maine (1995)

[45] Hjaltason, G.R., Samet, H.: Index-driven similarity search in metric spaces. ACM Trans.
Database Syst. 28(4), 517-580 (2003)









[46] Huang, X., Madan, A.: CAP3: A DNA Sequence Assembly Program. Genome Research
9(9), 868-877 (1999)

[47] Hunt, E., Atkinson, M.P., Irving, R.W.: A Database Index to Large Biological Sequences.
In: VLDB, pp. 139-148. Rome, Italy (2001)

[48] Jagadish, H.V., Ooi, B.C., Tan, K.L., Yu, C., Zhang, R.: iDistance: An adaptive B+-tree
based indexing method for nearest neighbor search. ACM Trans. Database Syst. 30(2),
364-397 (2005)

[49] Kahveci, T., Ljosa, V., Singh, A.: Speeding up whole-genome alignment by indexing
frequency vectors. Bioinformatics (2004). To appear

[50] Kahveci, T., Singh, A.: An Efficient Index Structure for String Databases. In: VLDB, pp.
351-360. Rome,Italy (2001)

[51] Kamel, I., Faloutsos, C.: Hilbert R-tree: An Improved R-tree using Fractals. In: VLDB, pp.
500-509 (1994)

[52] Karkkainen, J.: Suffix Cactus: A Cross between Suffix Tree and Suffix Array. In: CPM
(1995)

[53] Katayama, N., Satoh, S.: The SR-tree: An Index Structure for High-Dimensional Nearest
Neighbor Queries. In: SIGMOD Conference, pp. 369-380 (1997)

[54] Knuth, D.E.: The art of computer programming, volume 3: (2nd ed.) sorting and searching.
Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA (1998)

[55] Korn, F., Muthukrishnan, S.: Influence sets based on reverse nearest neighbor queries. In:
International Conference on Management of Data (SIGMOD), pp. 201-212 (2000)

[56] Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., Protopapas, Z.: Fast Nearest Neighbor
Search in Medical Databases. In: International Conference on Very Large Databases
(VLDB), pp. 215-226. India (1996)

[57] Leuken, R.H.V., Veltkamp, R.C., Typke, R.: Selecting vantage objects for similarity
indexing. In: ICPR '06: Proceedings of the 18th International Conference on Pattern
Recognition, pp. 453-456. IEEE Computer Society, Washington, DC, USA (2006)

[58] Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals.
Tech. Rep. 8 (1966)

[59] Ljosa, V., Bhattacharya, A., Singh, A.K.: Indexing Spatially Sensitive Distance Measures
Using Multi-resolution Lower Bounds. In: EDBT, pp. 865-883 (2006)

[60] Manber, U., Myers, E.: Suffix Arrays: A New Method for On-Line String Searches. SIAM
Journal on Computing 22(5), 935-948 (1993)

[61] McCreight, E.: A Space-Economical Suffix Tree Construction Algorithm. JACM 23(2),
262-272 (1976)









[62] Meek, C., Patel, J.M., Kasetty, S.: OASIS: An Online and Accurate Technique for Local-
alignment Searches on Biological Sequences. In: VLDB (2003)

[63] Merrett, T.H., Kambayashi, Y., Yasuura, H.: Scheduling of Page-Fetches in Join Operations.
In: International Conference on Very Large Databases (VLDB), pp. 488-498 (1981)

[64] Mico, M.L., Oncina, J., Vidal, E.: A new version of the Nearest-Neighbour Approximating
and Eliminating Search Algorithm (AESA) with linear preprocessing time and memory
requirements. Pattern Recognition Letters 15, 9-17 (1994)

[65] Myers, E.W.: An o(ND) difference algorithm and its variations. Algorithmica 1(2), 251-266
(1986)

[66] Navarro, G., Baeza-Yates, R.: A Hybrid Indexing Method for Approximate String Match-
ing. J. Discret. Algorithms 1(1), 205-239 (2000)

[67] Needleman, S.B., Wunsch, C.D.: A General Method Applicable to the Search for Similari-
ties in the Amino Acid Sequence of Two Proteins. JMB 48, 443-53 (1970)

[68] Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B.: The R*-Tree: An Efficient and
Robust Access Method for Points and Rectangles. In: SIGMOD Conference, pp. 322-331
(1990)

[69] Pearson, W., Lipman, D.: Improved Tools for Biological Sequence Comparison. PNAS 85,
2444-2448 (1988)

[70] Roussopoulos, N., Kelley, S., Vincent, F.: Nearest Neighbor Queries. In: International
Conference on Management of Data (SIGMOD). San Jose, CA (1995)

[71] Rubner, Y., Tomasi, C., Guibas, L.J.: A Metric for Distributions with Applications to Image
Databases. In: ICCV: Proceedings of the Sixth International Conference on Computer
Vision, p. 59. IEEE Computer Society, Washington, DC, USA (1998)

[72] Ruiz, E.V.: An algorithm for finding nearest neighbours in (approximately) constant average
time. Pattern Recogn. Lett. 4(3), 145-157 (1986)

[73] Sagan, H.: Space-Filling Curves. Springer Verlag, NewYork, NY, USA (1994)

[74] Samet, H.: The Quadtree and Related Hierarchical Data Structures. ACM Computing
Surveys 16(2), 187-260 (1984)

[75] Samet, H.: The design and analysis of spatial data structures. Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA (1990)

[76] Leutenegger, S.T., Lopez, M.A., Edgington, J.: STR: A Simple and Efficient Algorithm for R-Tree
Packing. In: International Conference on Data Engineering (ICDE), pp. 497-506 (1997)

[77] Seeger, B.: An analysis of schedules for performing multi-page requests. Information
Systems 21(5), 387-407 (1996)









[78] Seidl, T., Kriegel, H.: Optimal Multi-Step k-Nearest Neighbor Search. In: International
Conference on Management of Data (SIGMOD) (1998)

[79] Sellis, T.K., Roussopoulos, N., Faloutsos, C.: The R+-Tree: A Dynamic Index for Multi-
Dimensional Objects. In: VLDB'87: Proceedings of the 13th International Conference on
Very Large Data Bases, pp. 507-518. Morgan Kaufmann Publishers Inc., San Francisco,
CA, USA (1987)

[80] Skopal, T., Pokorny, J., Snasel, V.: PM-tree: Pivoting Metric Tree for Similarity Search in
Multimedia Databases. In: ADBIS (Local Proceedings) (2004)

[81] Smith, T., Waterman, M.: Identification of Common Molecular Subsequences. Journal of
Molecular Biology (1981)

[82] Stanoi, I., Riedewald, M., Agrawal, D., Abbadi, A.: Discovery of Influence Sets in Fre-
quently Updated Databases. In: International Conference on Very Large Databases
(VLDB), pp. 99-108 (2001)

[83] Stonebraker, M., Moore, D.: Object Relational DBMSs: The Next Great Wave. Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA (1995)

[84] Tao, Y., Papadias, D., Lian, X.: Reverse kNN Search in Arbitrary Dimensionality. In:
International Conference on Very Large Databases (VLDB) (2004)

[85] Traina, C., Traina, A.J.M., Filho, R.F.S., Faloutsos, C.: How to improve the pruning ability
of dynamic metric access methods. In: CIKM, pp. 219-226 (2002)

[86] Traina, C., Traina, A.J.M., Seeger, B., Faloutsos, C.: Slim-Trees: High Performance Metric
Trees Minimizing Overlap Between Nodes. In: EDBT, pp. 51-65 (2000)

[87] Ukkonen, E.: Algorithms for Approximate String Matching. Information and Control 64,
100-118 (1985)

[88] Ukkonen, E.: On-line Construction of Suffix-trees. Algorithmica 14, 249-260 (1995)

[89] Venkateswaran, J., Kahveci, T., Camoglu, O.: Finding Data Broadness Via Generalized
Nearest Neighbors. In: EDBT, pp. 645-663 (2006)

[90] Venkateswaran, J., Kahveci, T., Jermaine, C.M., Lachwani, D.: Reference-based indexing
for metric spaces with costly distance measures. The VLDB Journal (2007)

[91] Venkateswaran, J., Lachwani, D., Kahveci, T., Jermaine, C.M.: Reference-based indexing of
sequence databases. In: VLDB, pp. 906-917 (2006)

[92] Vieira, M.R., Traina, C., Chino, F.J.T., Traina, A.J.M.: DBM-Tree: A Dynamic Metric
Access Method Sensitive to Local Density Data. In: SBBD, pp. 163-177 (2004)

[93] Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37-57
(1985)









[94] Vleugels, J., Veltkamp, R.: Efficient image retrieval through vantage objects. In: VISUAL,
pp. 575-584. Springer (1999)

[95] Weiner, P.: Linear Pattern Matching Algorithms. In: IEEE Symposium on Switching and
Automata Theory, pp. 1-11 (1973)

[96] White, D.A., Jain, R.: Similarity Indexing with the SS-tree. In: ICDE, pp. 516-523 (1996)

[97] Xia, C., Lu, H., Ooi, B., Hu, J.: GORDER: An Efficient Method for KNN Join Processing.
In: International Conference on Very Large Databases (VLDB) (2004)

[98] Yang, C., Lin, K.I.: An Index Structure for Efficient Reverse Nearest Neighbor Queries. In:
International Conference on Data Engineering (ICDE), pp. 485-492 (2001)

[99] Yianilos, P.: Data Structures and Algorithms for Nearest Neighbor Search in General Metric
Spaces. In: SODA, pp. 311-321 (1993)









BIOGRAPHICAL SKETCH

Jayendra Gnanaskandan Venkateswaran was born and brought up in Chennai, India. He

received his Bachelor of Engineering from Coimbatore Institute of Technology (CIT), one of the
most prestigious and oldest engineering colleges in India, in 2001. Jayendra majored in computer

engineering and obtained a distinguished record. Jayendra received his Master of Science from

University of Missouri-Rolla in May, 2003. He majored in computer science. His master's

thesis was on packing methods for the SR-Tree index structure. Along with his master's advisor,
Dr. S.R. Subramanya, he has published his work at the 21st Annual ACM Symposium on Applied

Computing (SAC), Dijon, France, April 2006.

Jayendra joined the Doctor of Philosophy (Ph.D) program in Computer and Information

Science and Engineering at the University of Florida-Gainesville in the fall 2003. While pursuing

his graduate degree, Jayendra worked as a graduate research assistant. He received his Doctor of

Philosophy in Computer Engineering from the University of Florida in December 2007. Along

with his advisors Dr. Tamer Kahveci and Dr. Christopher Jermaine, he has published his research

at the top database venues Very Large Data Bases (VLDB), Extending Database Technology
(EDBT), and the VLDB Journal.

Jayendra's research focus is in the area of database searches, with special interests in

database indexing and querying, text mining, algorithms and bioinformatics.





PAGE 1

1

PAGE 2

2

PAGE 3

3

PAGE 4

Firstandforemost,IwouldliketothankmyadvisorDr.TamerKahveci,forhisexceptionalguidanceduringthecourseofthisresearch.Hehasbeenaconstantsourceofmotivationandwasalwaysavailablefortechnicaldiscussionsandprofessionaladvice.Iamalsoindebtedtomyco-advisor,Dr.ChristopherJermaineforhisexcellentguidanceandsupportinmajorpartofmyresearch.IamgratefultomydissertationcommitteemembersAlinDobra,SartajSahniandLeiZhoufortheirhelpandguidance.Additionally,IthankallmyfriendsatUniversityofFlorida-Gainesville,fortheirfriendship.Last,butnottheleast,Ioweaspecialdebtofgratitudetomyfamily.Iwouldnothavebeenabletogetthisfarwithouttheirconstantsupportandencouragement. 4

PAGE 5

page ACKNOWLEDGMENTS .................................... 4 LISTOFTABLES ....................................... 8 LISTOFFIGURES ....................................... 9 ABSTRACT ........................................... 12 CHAPTER 1INTRODUCTION .................................... 13 1.1SimilaritySearch .................................. 13 1.1.1SimilaritySearchQuery ........................... 14 1.1.2SimilarityJoinQuery ............................ 16 1.2PerformanceIssues ................................. 18 1.3ContributionsandOrganization ........................... 19 1.3.1SimilaritySearchQuery ........................... 19 1.3.2SimilarityJoinQuery ............................ 21 1.3.3DissertationOrganization .......................... 22 2METRICSPACEMODEL ................................ 23 2.1EuclideanDistance ................................. 23 2.2EditDistance .................................... 24 2.3EarthMover'sDistance ............................... 25 3RELATEDWORK .................................... 27 3.1Partition-basedIndexingTechniques ........................ 27 3.1.1R-TreeanditsVariants ........................... 28 3.1.2TheSTR-Ordering ............................. 30 3.2NearestNeighborQueries .............................. 31 3.2.1k-NearestNeighborQuery ......................... 31 3.2.2Reverse-NearestNeighborQuery ...................... 32 3.2.3AllNearestNeighborQuery ........................ 33 3.3Reference-basedIndexingTechniques ....................... 34 3.3.1Distance-matrixbasedIndexingMethods .................. 36 3.3.2Reference-basedHierarchicalIndexingMethods .............. 38 3.4StringSearchMethods ............................... 40 4REFERENCE-BASEDINDEXINGFORSTATICDATABASES ............ 42 4.1MaximumVariance ................................. 42 4.1.1Algorithm .................................. 44 4.1.2ComputationalComplexity ......................... 45 5

PAGE 6

................................. 45 4.2.1Algorithm .................................. 45 4.2.2ComputationalComplexity ......................... 47 4.2.3Sampling-basedOptimization ........................ 47 4.2.3.1Estimationofgain ........................ 48 4.2.3.2Estimationoflargestgain .................... 49 4.2.4ImpactoftheSampleQuerySet ...................... 51 4.3AssignmentofReferences ............................. 51 4.3.1MotivationandProblemDenition ..................... 51 4.3.2Algorithm .................................. 53 4.3.3ComputationalComplexity ......................... 55 4.4SearchAlgorithm .................................. 55 4.4.1Algorithm .................................. 55 4.4.2ComputationalComplexity ......................... 56 4.5Experiments ..................................... 56 4.5.1EffectoftheParameters ........................... 58 4.5.1.1Impactofm 58 4.5.1.2Impactofquerysetsize,jQj 61 4.5.2ComparisonofProposedMethods ..................... 62 4.5.2.1Impactofqueryrange() .................... 62 4.5.2.2Impactofnumberofreferences(k) ............... 63 4.5.3ComparisonwithExistingMethods ..................... 64 4.5.3.1Impactofqueryrange() .................... 66 4.5.3.2Impactofnumberofreferences(k) ............... 67 4.5.3.3Impactofinputqueries ..................... 70 4.5.3.4Scalabilityindatabasesize ................... 73 4.5.3.5Scalabilityinstringlength .................... 74 5REFERENCE-BASEDINDEXINGFORDYNAMICDATABASES .......... 76 5.1OverviewofSPandTP ............................... 76 5.1.1BasicApproach ............................... 76 5.1.2MaintainingtheQueryDistribution ..................... 76 5.2Single-passAlgorithminDepth .......................... 77 5.2.1ComputationalComplexity ......................... 77 5.3Three-passAlgorithminDepth ........................... 78 5.3.1ComputationalComplexity ......................... 79 5.4Experiments ..................................... 79 5.4.1QueryPerformance ............................. 81 5.4.1.1Experimentalsetup ....................... 81 5.4.1.2Experimentalresults ....................... 85 5.4.2DistanceComputations ........................... 86 5.4.3ImpactofIndexConstructionTime ..................... 87 5.4.4AnalyzingtheResults ............................ 88 6

PAGE 7

........ 91 6.1ProblemDenition ................................. 92 6.2OverviewoftheAlgorithm ............................. 93 6.3PredictingtheSolution:PriorityTableConstruction ................ 94 6.4StaticSearchStrategies ............................... 98 6.4.1FetchAll ................................... 98 6.4.1.1Creatingclusters ......................... 98 6.4.1.2Orderingclusters ......................... 98 6.4.1.3Processingclusters ........................ 99 6.4.2FetchOne .................................. 99 6.5DynamicStrategy .................................. 101 6.6FurtherImprovementsforGNNQueries ...................... 102 6.6.1AdaptiveFilter ................................ 102 6.6.2Partitioning ................................. 103 6.6.3Packing ................................... 105 6.7Experiments ..................................... 105 6.7.1EvaluationofOptimizations ......................... 106 6.7.2ComparisonofProposedMethods ..................... 110 6.7.2.1Evaluationofbuffersize ..................... 110 6.7.2.2EvaluationofthenumberofNN ................. 111 6.7.3ComparisontoExistingMethods ...................... 111 6.7.3.1Evaluationofbuffersize ..................... 112 6.7.3.2EvaluationofthenumberNN .................. 114 6.7.3.3Evaluationofdatabasesize ................... 115 6.7.3.4Evaluationofthenumberofdimensions ............ 116 7CONCLUSION ...................................... 117 REFERENCES ......................................... 120 BIOGRAPHICALSKETCH .................................. 127 7

PAGE 8

Table page 4-1Summaryofsymbolsanddenitions. ........................... 42 4-2Alistofproposedmethods. ................................ 56 4-3ComparisonwithTree-basedindexstructures. ...................... 66 6-1ComparisonwithanOptimalsolution. .......................... 107 6-2ComparisonwithGORDER. ............................... 110 6-3ComparisonwithRkNn. ................................. 110 8

PAGE 9

Figure page 1-1SimilaritySearchExample. ................................ 14 1-2SimilarityJoinQueriesExample. ............................. 16 2-1EditDistanceExample. .................................. 24 2-2EMDExample:Image1 .................................. 25 2-3EMDExample:Image2 .................................. 25 2-4EMDExample:TransformedImage1. .......................... 25 3-1EvolutionofIndexStructuresinMetricSpaces. ..................... 28 3-2Reference-basedIndexingExample. ........................... 34 3-3ExampleforOmni. .................................... 36 4-1MaximumVarianceexample. ............................... 43 4-2NumberofcomparisonsforOmniwithvaryingnumberofreferences. ......... 52 4-3NumberofcomparisonsforMP-DforDNAobjectdatabasefordifferentvaluesofm. 59 4-4NumberofcomparisonsforMP-Dforproteindatabasefordifferentvaluesofm. ... 60 4-5NumberofcomparisonsforMP-Dforimagedatabasefordifferentvaluesofm. .... 60 4-6ImpactofjQjonindexconstructiontime. ........................ 61 4-7ImpactofjQjonqueryperformance. ........................... 62 4-8ComparisonoftheproposedmethodsforDNAdatabaseforquerieswithvaryingranges. 63 4-9ComparisonoftheproposedmethodsforProteindatabaseforquerieswithvaryingranges. ........................................... 64 4-10ComparisonoftheproposedmethodsforDNAdatabasewithavaryingnumberofreferences. ......................................... 65 4-11ComparisonoftheproposedmethodsforProteindatabasewithavaryingnumberofreferences. ......................................... 65 4-12ComparisonwithothermethodsonDNAdatabaseforquerieswithvaryingranges. .. 68 4-13Comparisonwithothermethodsonproteindatabaseforquerieswithvaryingranges. 68 4-14Comparisonwithothermethodsonimagedatabaseforquerieswithvaryingranges. .. 69 4-15ComparisonwithothermethodsonDNAdatabaseforavaryingnumberofreferences. 70 9

PAGE 10

71 4-17Comparisonwithothermethodsonimagedatabaseforavaryingnumberofreferences. 71 4-18ComparisonwithothermethodsonDNAdatabaseforqueriesfromHeliconiusMelpomenewithvaryingqueryranges. ................................ 72 4-19ComparisonwithothermethodsonDNAdatabaseforqueriesfromMusMusculuswithvaryingqueryranges. ................................ 72 4-20ComparisonwithothermethodsonDNAdatabaseforqueriesfromDanioReriowithvaryingqueryranges. ................................... 73 4-21Scalabilityindatabasesize. ................................ 74 4-22Scalabilityinstringlength. ................................ 75 5-1ComparisonoftheproposedmethodsonDNAdatabasewithhilbert-ordereddataandquerydistributions. .................................... 82 5-2ComparisonoftheproposedmethodsonDNAdatabasewithrandom-ordereddataandquerydistributions. .................................. 83 5-3SparsemethodforDNAdatabasewithHilbertandrandom-ordereddataandquerydistributions. ........................................ 84 5-4ComparisonwithSparsefortherandomlyorderedimagedatabase. ........... 85 5-5DistancecomputationtimeofDNAdatabase. ...................... 86 5-6Distancecomputationtimeofimagedatabase. ...................... 87 5-7Indexconstruction(IC)timesofthemethodsSPandTPDNAandimagedatabases. .. 88 5-8Indexconstruction(IC)timesofMPforDNAandimagedatabases. .......... 89 6-1AnexampleforGNNquery. ............................... 92 6-2AsamplePriorityTablefortwodatabasesRandS. ................... 96 6-3FirstrowofthePriorityTable. .............................. 97 6-4AdaptiveFilterExample ................................. 103 6-5PartitioningExample ................................... 104 6-6EvaluatingOptimizations 1 2 and 3 ........................... 106 6-7EvaluationofPartitioning. ................................ 107 6-8Comparisonoftheproposedmethodsontwo-dimensionalimagedatabasefordiffer-entbuffersizes. ...................................... 108 10

PAGE 11

...................................... 109 6-10Comparisonwithothermethodsontwo-dimensionalimagedatabasefordifferentbuffersizes. ........................................ 111 6-11Comparisonwithothermethodsonproteindatabasefordifferentbuffersizes. ..... 112 6-12ComparisonofSS,RT,andMuxFDontwo-dimensionalimagedatabasefordifferentvaluesofk. ........................................ 113 6-13ComparisonofSS,RT,andMuxFDonproteindatabasesfordifferentvaluesofk. ... 114 6-14Comparisonwithothermethodsontwo-dimensionalimagedatabaseforvaryingdatabasesizes. ............................................ 115 6-15Comparisonwithothermethodsontwo-dimensionalimagedatabaseforvaryingnum-berofdimensions. ..................................... 116 11

PAGE 12

Similaritysearchindatabasesystemsisbecominganincreasinglyimportanttaskinmodernapplicationdomainssuchasarticialintelligence,computationalbiology,patternrecognitionanddatamining.Withtheevolutionofinformation,applicationswithnewdatatypessuchastext,images,videos,audio,DNAandproteinsequenceshavebegantoappear.Despiteextensiveresearchandthedevelopmentofaplethoraofindexstructures,similaritysearchisstilltoocostlyinmanyapplicationdomains,especiallywhenmeasuringthesimilaritybetweenapairorobjectsisexpensive.Inthisdissertation,thesimilaritysearchqueriesweconsiderareclassiedundersimilaritysearchandsimilarityjoinqueries.Severalnewindexingtechniquestoimprovetheperformanceofsimilaritysearchareproposed.Forthesimilaritysearchqueries,reference-basedindexingmethodsapplicabletobothstaticandgrowingdatabasesareproposed.Forsimilarityjoinqueries,ageneralizednearestneighborframeworkandseveralsearchandoptimizationalgorithmsareproposed.Theextensiveexperimentsevaluatesthedifferentparametersusedbytheproposedmethodsandperformanceimprovementsoverthestate-of-artalgorithms. 12


Similaritysearchreferstotheretrievalofdatabaseobjectsthataresimilarorclosetoagivenqueryobject.Ithasapplicationsindiverseeldssuchasdatabases,articialintelligence,computationalbiology,patternrecognitionanddatamining.Theuseofdatastructurestospeedthesearch-knownasindexing-isanimportanttechniquewhenansweringquestionsoverlargedatabases.Manyindexingtechniqueshavebeenproposedforsimilaritysearch.Despitetheextensiveresearchandthedevelopmentofaplethoraofindexstructures,similaritysearchisstilltoocostlyinmanyapplicationdomainsespeciallywhenmeasuringthesimilaritybetweenapairorobjectsisexpensive.Newindexingtechniquesareneededtoimprovetheperformanceofsearchesinthesedatabases. 30 ],object-oriented[ 3 ]andobject-relational[ 83 ]models.Exactmatch,rangeandjoinqueriesoverthesemodelsaretypicallybasedonwhetherornottwokeysareequalornotequalorwhetherakeyfallsinarangeofvalues. Withtheevolutionofinformation,applicationswithnewdatatypessuchastext,images,videos,audio,DNAandproteinsequencehavebegantoappear.Theproblemwiththesenewdatatypesisthattheycanneitherbeorderedforrangequeries,normeaningfultoperformequalitycomparisonsonthem.Forexample,considertheapplicationofimagedatabases.Ausermightbeinterestedinretrievingtheimagesfromthedatabasethathavesimilarcolor,textureandshapetothegivenqueryimage[ 71 ].Existingdatabasemodelsidentifyimagesfromthelenames,keywordsandsoon.Theseapproachescannotbeappliedforthesimilaritysearchquery.Thequeryrequiresamorecomplicateddistancefunctionthatcancapturethesimilaritybetweentwoimagesaccurately. 28 37 45 ]retrievesdatabaseobjectsthataresimilarorclosetoagivenqueryobject,givensomearbitrary,user-denedsimilarityfunction.Inthisdissertation,similarity 13


Figure 1-1. Similarity search example.

Similarity search queries are classified under the following two categories: a) similarity search queries and b) similarity join queries. The next two subsections describe these two categories in detail.

Let D() be a distance function. In a Range Query (RQ), the goal is to find the database objects that are similar to a given query. Given a database R, a query q, and a similarity threshold ε, the search returns the set of all objects in R whose distance to q is within ε. Formally, the answer to a RQ is defined as

RQ(R, q, ε) = {s ∈ R | D(q, s) ≤ ε}.

Figure 1-1 presents an example for a RQ. Here the shaded circles represent the objects in R. The black circle represents a given query q and the distance threshold is given by ε. The query returns the five objects enclosed by the circle with q as the center and ε as the radius.

For an example of a RQ, consider the application of a spell checker in text retrieval. The Edit Distance (ED) [58] between two strings gives the minimum number of transformations needed to convert one string into the other. Given a dictionary of words and a misspelled query word, the spell checker returns the dictionary words whose edit distance to the query is within a small threshold.
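Before any indexing, a range query can be answered by a sequential scan that compares q with every database object. The minimal sketch below makes the definition concrete; the function names and the toy two-dimensional data are illustrative only, and this baseline is exactly what the indexing techniques of the later chapters try to avoid.

def range_query(R, q, eps, dist):
    """RQ(R, q, eps): all objects of R whose distance to q is at most eps."""
    return [s for s in R if dist(q, s) <= eps]

# Toy usage with two-dimensional points and the Euclidean distance.
euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
R = [(0, 0), (1, 1), (3, 4), (6, 0)]
print(range_query(R, (0, 0), 2.0, euclid))   # [(0, 0), (1, 1)]

The scan performs one costly distance computation per database object, which is the cost that reference-based indexing reduces.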


As another example, the Earth Mover's Distance (EMD) [71] measures the transformation needed to convert the histogram distribution of one image to that of another image. This is used as a measure of dissimilarity between two given images. Given an image database, a query image q and an EMD threshold ε, the image retrieval system returns the images from the database that have an EMD distance within ε from q.

In a Nearest Neighbor (NN) query, given a database R and a query object q, the goal is to retrieve the element in R that is closest to q based on some distance measure. The answer to a NN query is defined as

NN(R, q) = {s ∈ R | D(q, s) ≤ D(q, s') for all s' ∈ R}.

A k-Nearest Neighbor (kNN) query, NN(R, q, k), generalizes this by returning the k objects in R closest to q. Figure 1-1 presents an example of the two NN query types. The result of a NN query on the object q is given by NN(R, q) = {a}. The result of a 2NN query on the object q is given by NN(R, q, 2) = {a, b}.

For an example application, consider the geographic information system (GIS) given in [70]. The user of a tourist information system may require information about the locations of the attractions which are close to the current location, or the tourist might be interested in knowing the three nearest hospitals from the location of a car accident. These queries require computing


SimilarityJoinQueriesExample. theEuclideandistancebetweenagivenlocationtothedatabaseoflocationsofattractionsorhospitals. 28 ],giventwodatabasesRandS,thegoalistoretrievethepairsofobjectsfromthetwodatabasesthataresimilartoeachotherbasedonagivencriterion.Allsimilarityjoinqueriescanbefurtherdividedintothefollowingthreetypes:a)thek-NearestNeighbor(kNN)joinquery[ 19 ],b)theReverseNearestNeighbor(RNN)joinquery[ 55 ]andc)theAllNearestNeighbor(ANN)joinquery[ 97 ].Figure 1-2 presentsanexampleofthedifferenttypesofjoinqueries.HeretheshadedcirclesrepresenttheobjectsfromR(fa;b;cg2R)andtheunshadedcirclesrepresenttheobjectsfromS(f1;2;3;4g2S). ThekNNjoinqueryisthemostcommonlyusedtypeofjoinquery.Givenapositiveintegerk,thejoinqueryreturnskobjectsfromSsimilartotheobjectsinRbasedonsomedistancemeasure.TheanswertoakNNjoinqueryisdenedas: 1-2 ,a2NNjoinquerybetweenRandSreturnsthesetofpointsffa;f1;2gg;fb;f2;3gg;fc;f2;4gg.Foranexampleapplication,considerkNN 16


18 ].GivenasetofdatabaseobjectthatarenotclassiedunderanycategoryRandsetofobjectsSthatarealreadyclassied,thekNNqueryoverSiscomputedforeveryobjectinR.Thentheunclassiedobjectsareassignedtothecategorythatformsthemajorityofitsknearestneighbors. InaRNNjoinquery,givenRandS,thegoalistoretrievethesetofallobjectsfromSsuchthateveryobjectfromtheresultsetisanearestneighborforsomeobjectinR.ForeachobjectinthissubsetfromS(8u2U;US),thereexistssomeobjectinR(9v2R)suchthatuisthenearestneighborofv.TheanswertoaRNNjoinqueryisdenedas: Anexampleofareverse2NNqueryisgiveninFigure 1-2 .Itreturnsf2g,whichisoneofthetwonearestneighboroftheobjectsfa;b;cg2R.ForanexampleapplicationofRkNNquery,consideradecisionsupportsystem.Supposethatpeopleusuallydineatoneoftheknearestrestaurants.AnentrepreneurwhowantstoinvestinMexicanrestaurantbusinesswouldwanttoknowthepotentialcustomerswhohaveaMexicanrestaurantasoneoftheknearestrestaurants.Inthisapplication,giventhelocationsofallcustomersandlocationsofallrestaurants,theRkNNjoinquerywouldreturnthelocationsofthecustomerswhohaveatleastoneMexicanrestaurantsasoneoftheirknearestones.Thesecustomersarelikelytousetherestaurantduetoitsgeographicalproximity. InanANNquery,givenRandS,thegoalistoretrieveforeveryobjectinRitsnearestneighborinS.TheanswertoanANNjoinqueryisdenedas: 19 ],givenapositiveintegerk,thegoalis 17


AnexampleANNqueryisgiveninFigure 1-2 .Itreturnsthesetofobjectpairsffa;1g;fb;2g;fc;4gg.Foranexampleapplication,considerageographicinformationsystem.Theuserofatouristin-formationsystemmayrequireinformationaboutthenearestgasstationforeachattraction.Inthisapplication,giventhelocationsofallgasstationsandlocationsofallattractions,theANNjoinquerywouldreturnthelocationofthenearestgasstationtoeachattraction. Eventhoughthedatabasesoverwhichsimilaritysearchqueriesareperformed,arenotal-wayshuge(andaresometimessmallenoughtobestoredinseveralgigabytesofmainmemory),tremendousexpenseisassociatedwithcomputingthedistancefromaquerytoeachofthedataobjects.Thisrenderstheaccessmethodmostapplicabletomain-memoryresidentdatasetssequentialscaninfeasibleforsimilaritysearchqueries.Forexample,considersearchinganimagedatabaseusingEarthMovers'Distance(EMD)[ 71 ]asthesimilaritymeasure.ComputingEMDisanexpensivelinearprogrammingproblemandtakesabout40secondstocomparetwogivenimages.Asmalldatabaseofabout4000imagescanbeeasilyloadedintothememory.However,duetothehighcomplexityofEMD,evenasinglesearchonthisdatabasecantakeupto45hours.Asanotherexample,consideralesscomplexmeasure,theeditdistance(ED),usedoverDNAdatabases.ComputingtheeditdistancebetweentwostringsrequiresO(n2)time[ 67 ],whichcantranslatetosecondsofCPUtimeforlongstrings.Thehumangenomecontains30millionstringsoflength100[ 10 ].Searchingforthesubstringssimilartoagivensubstringinthisdatabasecantakeuptoanhour. 18


33 38 54 ]forhan-dlinglargedatabasesdonotworkheretheseworkonlyforexact-matchqueries(notsimilarityjoinqueries).Thus,evenwithinexpensivedistancecomputations,newindexingtechniquesareneeded. 35 48 99 ].Inthereference-basedindexingapproach,asmallfractionofdatabaseobjects,referredtoasthesetofreferenceobjects,areselected.Thedistancesbetweenreferencesanddatabaseobjectsarepre-computedandstoredasanindex.Givenaquery,thesearchalgorithmcomputesthedistancefromeachofthereferencetothequeryobject.Withoutanyfurthercomparisons,objectsthatareclosetoorfarawayfromthereferencecanberemovedfromthecandidatesetbaseduponthosedistances. 19


Referencesareselectedsuchthattheyrepresentallpartsofthedatabase.Twonovelstrategiesfortheselectionofreferencesareproposed.Theyare: Thesecondproblemisthemappingofreferencestodatabaseobjects.Intheproposedsolution,thenumberofreferencesislargerthaninothermethods.Inordertokeepthenumberofreferencesassignedtoeachobjectmanageable,onlythebestreferencesforeachgivendatabaseobjectareusedtoindexit.Thus,eachdatabaseobjectmayhaveadifferentsetofreferencesassignedtoit. Thethirdproblemisreference-basedindexingfordynamicdatabases.Forsuchdatabases,twoincrementalvariationsoftheMaximumPruningstrategyareproposed.Afterinsertionofanewobject,theseincrementalmethodsupdatethereferencesetusingthedistributionof 20


93 ].Realisticenvironmentsaresimulated,wherethedistributionofthedatabaseobjectsandquerieschangesastheapplicationanduserbaseevolveovertime. Papersdescribingtheproposedindexingtechniqueshavebeenalreadypublished.TheapproachesforstaticdatabasesfromChapter 4 arefromthepaperco-authoredbyDeepakLachwani,TamerKahveciandChristopherJermaineandwereoriginallypublishedinthe2006VeryLargeDataBase(VLDB)conference[ 91 ].AnextensionofthisworkonreferenceindexingwithapproachesfordynamicdatabaseshasbeenacceptedbytheVeryLargeDataBaseJournal(VLDBJ)[ 90 ]. Inthisdissertation,theallofthefollowingideasareintroducedrsttimeandhavenotbeenproposedelsewhere: 21


76 ]. ThepaperdescribingtheproposedworkonGeneralizednearestneighborforsimilarityjoinquerieshasbeenpublishedalready.ThemethodsandexperimentspresentedinChapter 6 arefromthepaperco-authoredbyTamerKahveciandOrhanCamoglu,whichwasoriginallypublishedinthe2006ExtendedDataBaseTechnology(EDBT)conference[ 89 ]. 2 presentsthedescribesametricspacemodelwithfewexamples.RelatedworkispresentedinChapter 3 .Chapter 4 presentsstrategiesforselectingthereferencesandhowthesereferencescanbeusedtoanswerasimilaritysearchqueryinastaticdatabase.Chapter 5 presentsthereferenceselectionstrategiesfordatabaseswithfrequentupdatesandvaryingquerydistributions.Chapter 6 presentsageneralizednearestneighborframeworkforlargedatabasesusingthethreesearchstrategies.TheconclusionsaregiveninChapter 7 22


Similarity queries are needed in many diverse applications such as the geographic information system (GIS), the multimedia database system (MDS), text retrieval and computational molecular biology. Thus, it is one of the most studied topics in the context of databases. These applications often use a metric similarity measure such as the Euclidean Distance, the Edit Distance and the Earth Mover's Distance to compute the dissimilarity between two objects. This chapter presents the concept of metric measures and elaborates on several measures that are commonly used in recent databases.

Given any two objects a and b from the database, a distance measure D() is said to be a metric if it satisfies the following four properties:
1. Non-negativity: D(a, b) ≥ 0.
2. Strict non-negativeness: D(a, b) = 0 if and only if a = b.
3. Symmetry: D(a, b) = D(b, a).
4. Triangle inequality: D(a, c) ≤ D(a, b) + D(b, c) for any object c.

If a measure does not satisfy the strict non-negativeness property, the measure is called a pseudo-metric. The triangle inequality property is a key property of metric databases that has been used effectively by the indexing techniques. Following are some of the commonly used measures in spatial, text and image databases respectively.
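To make the four properties concrete, the short sketch below checks them empirically for the Euclidean distance on a handful of random vectors. The function names and the choice of Euclidean distance are illustrative assumptions, not part of the dissertation's implementation; an empirical check of this kind can of course only refute, not prove, metricity.

import itertools
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def check_metric(dist, objects, tol=1e-9):
    """Empirically test the four metric properties on a sample of objects."""
    for a, b in itertools.product(objects, repeat=2):
        assert dist(a, b) >= -tol                           # non-negativity
        assert (dist(a, b) <= tol) == (a == b)              # strict non-negativeness
        assert abs(dist(a, b) - dist(b, a)) <= tol          # symmetry
    for a, b, c in itertools.product(objects, repeat=3):
        assert dist(a, c) <= dist(a, b) + dist(b, c) + tol  # triangle inequality
    return True

random.seed(0)
pts = [tuple(random.random() for _ in range(3)) for _ in range(8)]
print(check_metric(euclidean, pts))   # True: no violation found on this sample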


Figure 2-1. Edit distance example.

The complexity of EuD is linear in terms of the number of terms, or the dimensionality, of the vector. There exist many techniques for indexing in vector spaces [?]. These methods perform well up to about twenty dimensions. After that, their performance degrades with the increase in the number of dimensions, referred to as the curse of dimensionality, and they are thus inefficient in practice. Several applications with objects represented in vector space, like GIS and decision support systems, use EuD as the distance measure.

Figure 2-1 presents two strings P and Q and the ED between them. The two strings, each having 12 characters, can be optimally aligned at a cost of 3 (two insertions and one replacement).

The dynamic programming solution to find the ED and the alignment between two strings runs in O(n^2) time and space [67], where n is the average length of a string. If ε is the allowable number of transformations, the space and time complexity for the bounded version of this problem is O(εn) [4, 6, 65, 87]. In many applications, ε = O(n), thus making the complexity of the bounded version O(n^2) [42, 50]. Several applications, such as spell checking and computational molecular biology, use ED to compare two given strings.
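As a concrete illustration of the dynamic programming solution mentioned above, the following sketch computes the unit-cost edit distance between two strings in O(n^2) time using only two rows of the table. It is a generic textbook implementation, not the dissertation's own code, and the sample strings are arbitrary.

def edit_distance(p, q):
    """Unit-cost edit distance (insertions, deletions, replacements) by dynamic programming."""
    n, m = len(p), len(q)
    prev = list(range(m + 1))               # distances between the empty prefix of p and prefixes of q
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if p[i - 1] == q[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # delete p[i-1]
                          curr[j - 1] + 1,    # insert q[j-1]
                          prev[j - 1] + cost) # match or replace
        prev = curr
    # The full alignment can be recovered by also storing back-pointers.
    return prev[m]

print(edit_distance("kitten", "sitting"))   # 3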


Figure 2-2. EMD example: Image 1.
Figure 2-3. EMD example: Image 2.
Figure 2-4. EMD example: Transformed Image 1.

The Earth Mover's Distance (EMD) [71] is a measure of the work needed to transform the histogram distribution of one image into that of another. Given two distributions, one can be seen as a mass of earth properly spread in space, while the other distribution can be seen as a collection of holes in that same space. It can always be assumed that there is at least as much earth as needed to fill all the holes to capacity, by switching the earth and the holes if necessary. Then, the EMD measures the least amount of work needed to fill the holes with the earth. Here, a unit of work corresponds to transporting a unit of the earth by a unit of the (ground) distance.

Figures 2-2 to 2-4 present an example for the EMD similarity measure. Here the black and grey bars represent the two given simple histogram distributions (Figures 2-2 and 2-3). The EMD is given by the least amount of work needed to convert the distribution in Figure 2-2 into that in Figure 2-4, i.e., transforming the region represented by the white block in the first bin into the grey region represented by the second bin.

The EMD involves solving a linear programming based transportation problem [31], which may take a long time. For example, for 256-dimensional features extracted from images that are partitioned into 8x12 tiles, each EMD computation takes about 40 seconds, so a similarity search over even a few thousand images can take many hours.
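The EMD used in this dissertation is computed over multi-dimensional image signatures by linear programming. The toy sketch below only illustrates the idea on one-dimensional histograms, where the EMD coincides with SciPy's wasserstein_distance; the bin positions and weights are made up for illustration and are not the values of Figures 2-2 to 2-4.

from scipy.stats import wasserstein_distance

# Two toy histograms over the same four bins, located at positions 0..3.
bins = [0, 1, 2, 3]
hist_a = [0.6, 0.1, 0.2, 0.1]   # source distribution ("earth")
hist_b = [0.1, 0.6, 0.2, 0.1]   # target distribution ("holes"): mass shifted to the second bin

# For one-dimensional distributions the EMD equals the 1-Wasserstein distance:
# the minimum total (mass x ground distance) needed to morph hist_a into hist_b.
emd = wasserstein_distance(bins, bins, u_weights=hist_a, v_weights=hist_b)
print(emd)   # 0.5: moving 0.5 units of mass by one bin costs 0.5 units of work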




Differentindexingschemeshavebeenproposedtospeedupthesimilaritysearchqueries.Thebasicideainthesemethodsistousethetriangleinequalitypropertyofthemetricmeasuretolteroutthedatabaseobjectsthatcanbeprovedtobefarenoughfromthequeryobjects,withoutdoingtheactualdistancecomputations.Thesetechniquesingeneralfollowthefollowingtwosteplter-reneprocess: TheevolutionoftheindexstructuresdiscussedinthissectionisgiveninFigure 3-1 ,basedonthetaxonomyofseveralmulti-dimensionalindexstructuresgivenbyGaedeandGunther[ 37 ].Theindexingtechniquescanbeclassiedundertwocategories:Partition-basedindexingtechniquesandReference-basedindexing[ 28 45 ]techniques. Numerousmethodsemployindexstructurestolterunpromisingdatabaseobjectsquickly.Oneoftheimportantclassesofindexstructurespartitionsthedataspaceintohierarchicalsets.Section 3.1.1 discussestheindexstructuresthatbelongthisclasswithafocusontheR-Tree.Section 3.1.2 discusseshowtheR-Treesarepackedtoimprovethequeryperformance.Section 3.2 presentstheexistingmethodsforNNqueriesbasedonpartition-basedapproach.Section 3.3 presentsanotherimportantclassofindexstructures,namelyreference-basedindexing.Section 3.4 discussesmethodsthatarespecicforthetextretrievalapplication. 27


EvolutionofIndexStructuresinMetricSpaces. Severalindexstructureshavebeenproposedinliteratureforstorageandretrievalusingpartition-basedindexingtechniques.Someofthepopularonesincludethekd-tree[ 11 12 ],theR-tree[ 43 ],quadtrees[ 36 74 ],theR+-Tree[ 79 ],theR*-Tree[ 68 ],theX-Tree[ 14 ]andtheSR-Tree[ 53 ].Thesetechniquesmakeextensiveuseofcoordinateinformationtogroupandclassifypointsinthespace.Forexample,kd-treesdividethespacealongdifferentcoordinatesandR-treesgrouppointsinhyper-rectangles. R-Tree[ 43 ]isamulti-dimensionalgeneralizationoftheB-Tree[ 8 ].SimilartoB-tree,R-treeisaheight-balancedtreewithindexrecordsinitsleafnodescontainingpointerstodataobjects. 28


Bystoringtheboundingboxesofgeometricobjectssuchaspoints,polygonsandmorecomplexobjects,anR-Treescanbeusedtodeterminewhichobjectsintersectagivenqueryregion.Aseachnodehasatleastmentries,theheightofanR-TreeofNobjectscanatmostbelogN1.Generally,nodeswilltendtohavemorethanmentries,whichwilldecreasetheheightofthetreeandimprovespaceutilization. InanR-Tree,alldataobjectsthatoverlapwiththequeryregionaresearchedtoretrieveobjectsinthequeryregion.Whenanodeissearched,morethanonesubtreemayneedtobetraversed.Thus,itisnotpossibletoguaranteegoodworst-caseperformance.Withefcientupdatealgorithms,thetreewillbemaintainedinsuchaformsoastoeliminatetheirrelevantregions(regionsthatareawayfromthequeryregion)andexamineonlydatanearthesearcharea.Thesearchalgorithmdescendsdownthetree,ateachlevelselectingthoseentrieswhoseMBRsoverlapwiththatofthequery.Whenaleaf-nodeisreached,entrieswhoseMBRsoverlapwiththeMBRofthequeryregion,areselected. AvariantofR-Tree,theR+-Tree[ 79 ],ensuresthatthereexistsonlyonepathwhilesearchingforadataobject.Thisisdonebysplittingtheoverlappingrectanglesandincreasingthestoragerequiredbyhavingduplicateentries.AnothervariantofR-Tree,theX-Tree[ 14 ]wasdesignedforhigh-dimensionalobjectswithoverlap-freesplitaccordingtoasplithistoryandsupernodes.Asupernodeisanoversizednode,whichpreventsoverlapwhenanefcient 29


68 ]isanimprovedvariantoftheR-Tree.Inadditiontousingcriterialikemargin,areaandoverlap,itusestheconceptofforced-reinsertiontore-organizethestructureforbetterstorageutilization.TheSS-Tree[ 96 ]isanindexstructuredesignedforsimilarityindexingformulti-dimensionaldata.ItisanimprovementofR*-Tree,butusesboundingspheresinsteadofboundingrectanglesandusesamodiedforcedre-insertionalgorithm.UnlikeR*-Tree,theSS-Treere-insertsentrieswhentheentriesinanodearenotre-inserted.TheSR-Tree[ 53 ]canbeviewedasacombinationofSS-TreeandR*-Tree.Itusestheintersectionbetweentheboundingsphereandtheboundingrectangle.ThisoutperformsbothSS-TreeandR*-Tree.Thesizeofthedirectoryentryisincreasedsignicantlybythisapproach. AnextensivesurveyofthesemethodshavebeengivenbySamet[ 74 ],GaedeandGunther[?]andBohmet.al.[ 18 ].Unfortunatelytheexistingtechniquesareverysensitivetothedataspacedimension.TheSimilarityquerieshaveanexponentialdependencyonthedimensionofthespace,referredtoasthecurseofdimensionality.Duetothisreason,despiteitscomplexity,researchersprefermetricspacesusingcomplexdatatypes. 73 ]andHilbert-Ordering[ 51 ]transformthedataobjectsintoanone-dimensionalspaceandorderthem.Thereexistsseveralextensiveworksonthespace-llingcurves[ 17 73 75 ].Thissectiondescribesanotherheuristic,theSTR-OrderingproposedbyLeuteneggeret.al[ 76 ]. TheSort-TileRecursive(STR)isamethodtoordertheMBRsofR-Treebasedstaticdatabases.LetNbethenumberofd-dimensionalobjectsandbthenumberofentriespernodeoftheR-Tree.Thedataspaceisdividedinto2p 30


TheSTRmethodordersthedatabaseobjectssuchthatsimilarobjectsaregroupedtogetherandpackedineitherthesameorneighboringnodes.ByeffectivelypruningtheobjectsusingtheirMBRs,thenumberofdistancecomputationsofobjectsandhencetheCPUtimeofthesearchcanbeimprovedsignicantly.Everynodeintheindexstructure,sayR-Tree,ispackeduptoitscapacity.ThisreducesthenumberofnodesandhencetheheightoftheunderlyingR-Tree.Eachnodeisstoredinadiskpage.ReductioninthenumberofnodesdecreasesthenumberofpagesfetchedfromthediskandthustheI/Ocostofasearchimprovesdramatically. 44 ]usedPMRquadtreetoindexthesearchspace.Theysearchthistreeinadepth-rstmanneruntiltheknearestneighborsarefound.Itisanedge-basedvariantofPMquatreethatusesprobabilisticsplittingrule. Roussopouloset.al.,[ 70 ]employedabranch-and-boundR-Treetraversalalgorithm.ItusestheMINDISTandMINMAXDISTtoordertheMBRsoftheR-Tree.GiventwoMBRs,theMINDISTgivestheminimumdistancebetweenthemwhileMINMAXDISTguaranteesthepresenceofatleastoneobjectinthesecondMBR.Theknearestobjectsarestoredinabufferin 31


Thetwo-phasemethod[ 56 ]getstheresultantobjectsintwophases.Intherstphase,aKNNsearchisperformedusingthedistancefunctiononfeaturevectors.Theactualdistancetothesekobjectsarecomputedandthemaximumvalueisdetermined.Inthesecondphase,arangequeryontheindexreturnsallobjectswithinthismaximumdistancefromthesamefeaturedistancefunction.Foralltheobjectsfromtheresultsetoftherangequery,theactualobjectdistanceiscomputedandkNNsarereturned. SeidlandKriegel[ 78 ]proposedamethodthatrunsinmultiplephases,iterativelyupdatingthekNNdistance.Intherstphase,itsortstheobjectsinincreasingorderoftheirfeaturedistance.ThenforeveryobjectfromthesortedlistwithdistancelessthanthecurrentkNNdistance,theactualquery-to-objectdistanceiscomputedandthekNNsareupdated. Berchtoldet.al.,[ 13 ]dividethesearchspaceusingvoronoicells.Itrstpre-computestheresultofanynearest-neighborsearchwhichcorrespondstoacomputationofthevoronoicellofeachdatabaseobject.Then,thevoronoicellsarestoredinanindexstructureefcientforhigh-dimensionaldataspaces.Asaresult,nearestneighborsearchcorrespondstoasimplepointqueryontheindexstructure.Beyeret.al.,[ 15 ]showthatforabroadsetofdatadistributionsmostoftheknownkNNalgorithmsrunslowerthansequentialscan.Thus,despiteitssimplicity,sequentialscanstillremainsaformidablecompetitortoindex-basedkNNmethods. 55 ]introducedtheReverseNearestNeighbor(RNN)problem.Theypre-computethenearestneighborofalltheobjectsinthedatabaseandgeneratethenearestneighborcirclesfortheobjects.Thenduringsearch,allobjectswhichhavethequeryintheirnearestneighborcircleareretrieved.BecausetheRNN-treeisoptimizedforRNN,butnotNNsearch,Kornet.al.,[ 55 ]useanadditional(conventional)R-treeonthedatapointsfornearestneighborsandotherspatialqueries. 32


98 ].SimilartotheRNN-tree,aleafnodeoftheRdNN-treecontainsvicinitycirclesofdatapoints.Ontheotherhand,anintermediatenodecontainstheMBRoftheunderlyingpoints(nottheirvicinitycircles),togetherwiththemaximumdistancefromeverypointinthesub-treetoitsnearestneighbor.Asgivenintheirexperiments,theRdNN-treeisefcientforbothRNNandNNqueriesbecause,intuitively,itcontainsthesameinformationastheRNN-treeandhasthesamestructure(fornodeMBRs)asaconventionalR-tree. Stanoiet.al.[ 82 ],doesnotusepre-computationandistoextendanexistingsolutiontobi-chromaticRNNquery.ThebasicideaistondtheRNNsofanobject,istodynamicallyconstructtheinuenceregionoraVoronoicellbyexaminingtheR-treeofthedatabaseobjects.Here,aninuenceregionisdenedasapolygoninspacewhichenclosesthelocationsthatareclosertothequeryobjectthantoanyotherobject.Oncetheinuenceregioniscomputed,arangequeryintheR-treeofobjectsisperformedtolocateRNNsoftheobject. Taoet.al.,[ 84 ]generalizetheRNNproblemtoarbitrarynumberofNNsusingalter-and-reneapproach.Themethodusesanhierarchicaltree-basedindexstructuresuchasR-Treetocomputeanearestneighborrankingofthequeryobject.ThekeyideaistoiterativelyconstructaVoronoicellaroundthequeryobjectfromtheranking.ObjectsthatarebeyondkVoronoiplanesw.r.t.thequerycanbeprunedandneednottobeconsideredforVoronoiconstruction.Theremainingobjectsmustberened,i.e.foreachofthesecandidates,ak-nearestneighborquerymustbelaunched. 19 ].Atthetoplevel,MuXindexcontainslargepages(orMBRs).Atthenextlevel,thesepagescontainmuchsmallerbuckets.ForeachbucketfromR,itcomputesapruningdistanceasitscansthecandidatepointsfromS.Itprunesthepages,buckets,andpointsofSbeyondthisdistanceforeachbucketofR.GORDER[ 97 ]isablocknestedloopjoinmethod.ItrstreducesthedimensionalityofRandS


Reference-basedIndexingExample. byusingPrincipalComponentAnalysis.Itthenplacesagridonthespacedenedbythereduceddimensionsandhashesdataobjectsintogridcells.Later,itreadsblocksofdataobjectsfromgridcellsbytraversingthecellsingridorderandcomparesalltheobjectsinpairsofgridcellswhoseMINDISTislessthanthepruningdistancedenedbythekthNN. Figure 3-2 illustratesthereference-basedindexinginahypotheticaltwo-dimensionalspace.Here,thedatabaseobjectsarerepresentedbypoints.Thedistancebetweentwopointsinthisspacecorrespondstotheunderlyingdistancebetweenthetwoobjects(e.g.,ifthepointsdenotestrings,rdist1betweenthepointsref1andpcorrespondstotheeditdistancebetweenthestringsrepresentedbythem).Inareference-basedindexing,thedistancesbetweentheobjectpandreferencesref1andref2arepre-computed.Letrdist1andrdist2bethetwopre-computeddistances,respectively.Givenaqueryqwithranger,therststepistocomputequery-to-referencedistancesqdist1andqdist2.Alowerboundforthedistancebetweenqandanobject 34


p can then be obtained from these distances alone: by the triangle inequality, |qdist1 - rdist1| is a lower bound on the true distance between q and p, so if this bound already exceeds the query range r, p is pruned without ever computing the costly distance; the second reference gives the analogous bound |qdist2 - rdist2|.

Let V = {v1, ..., vm} denote the set of reference objects, where vi ∈ S and |V| = m is the number of references. The distances from the references to the database objects are pre-computed and are given by the set {D(si, vj) | si ∈ S, vj ∈ V}. This is a one-time cost for the database.

During search, the algorithm first computes the distances from the query q to the references. For each object s, a lower bound (LB) and an upper bound (UB) for D(q, s) follow from the triangle inequality and are defined as

LB(q, s) = max_{vi ∈ V} |D(q, vi) - D(s, vi)|,
UB(q, s) = min_{vi ∈ V} (D(q, vi) + D(s, vi)).

Objects whose lower bound already exceeds the query range are pruned, and objects whose upper bound falls within the range are reported directly; the remaining objects form the candidate set. After this pruning, only the objects in the candidate set are compared with q using the costly comparison operation to complete the query.

The factors that determine the cost of the strategies used for selection and assignment of references to database objects are memory and computation time. Let the available main memory be B bytes and |S| = N be the number of objects in the database. We assume four bytes are used to store a distance value and an object uses on average z bytes of storage. Thus, zm bytes are needed to store the m references. For each object s ∈ S and its reference vi ∈ V, the object-reference mapping is of the form [i, D(s, vi)]. Thus, 8mN bytes are used to store the pre-computed distances for N objects. The value for m can be obtained by comparing the available memory with the memory needed for storage: B = 8mN + zm.

Let Q be the given query set, t be the time taken for one object comparison and cavg be the average number of objects in the candidate set for each query. First, each query object is


ExampleforOmni. comparedwitheachofthemreferences.ThistakestmjQjtime.Then,foreachquery,thecavgobjectshavetobecomparedtolteroutthefalsepositives.ThistakestcavgjQjtime.Thetotaltimetakenisgivenbyt(m+cavg)jQj. Reference-basedsimilaritysearchesinmetricspacescanbedividedintotwocategories:distance-matrixbasedandtree-basedreferenceindexingmethods.AnextensivesurveyofthesecategoriesisgivenbyChavezetal.[ 28 ]andHjaltasonandSamet[ 45 ]. 35 ]proposedamethodcalledtheOmni.Omniselectsreferencesfromtheconvexhullofthedataset.Thisisdonebyselectingobjectsthatarefarawayfromeachotherandmayachievepoorpruningrates.Multiplereferencesprunethesamesetofobjectsthatarenearthehull.Thereareredundantreferences,sinceanobjectcanbeprunedbyoneofthem.Furthermore,noneofthereferencesprunetheobjectsthatarefarfromthehull. Figures 3-2 and 3-3 illustratethisproblem.InFigure 3-2 ,ref1andref2arethereferencesselectedusingtheOmnimethod.Objectpisclosetoref1andfarawayfromref2.Forqueryq,pcanbeprunedbybothref1andref2,representingawastedreference.Ontheotherhand,objectsinsidethehullwillnotbeprunedatall,asillustratedinFigure 3-3 .Here,theboundsobtainedforthequeryqandobjectdusingthetwoOmnireferencesdonotremovedfromthecandidateset.Hadthereferencerandinthisgurebeenselectedinstead,thenitwouldhavebeenpossibletopruned.Itisessentialtoselectthereferencessothateachreferenceis 36


94 ]. Reineretal.[ 57 ]proposedaspacing-basedselectionofreferences.Thebasicideaistoadddatabaseobjectstotheindexbasedontwocriteria.Therstcriterioniscalledspacingwherereferenceswithsmallvarianceofspacingaresupposedtohavehavediscriminativepoweroverthedatabase.Thesecondcriterion,correlation,reducestheredundancyofthereferencesbyusingthelinearcorrelationcoefcientsamongthereferenceobjects.Ittriestominimizethenumberoffalsepositives.Assoonaseitherthevarianceofspacingofoneobjectorthecorrelationofapairofobjectsexceedacertainthreshold,areferenceobjectisreplacedbyarandomlychosennewreferenceobject.However,therewerenoguidelinesonselectingthespacingandcorrelationthresholdsandthenumberofreferenceobjects. Brisaboaet.al.[ 21 ]proposedSparseSpatialSelection(SSS)methodthatadaptstotheupdatesindynamicdatabases.Anobjectbecomesareference,ifitislocatedatmorethanafractionofthemaximumdistancewithrespecttoalloftheexistingreferences.Thenumberofreferencesselecteddependsontheintrinsicdimensionalityoftheunderlyingdatabase.Thisapproachisdynamicinnatureandselectsreferencesthatarenotoutliers.Butitcanresultinredundancy,i.e.referencethatprunesamesetofdatabaseobjectslikeOmni. Bustoset.al.[ 23 ]presentacriteriontocomparetheefciencyoftwosetsofreferencesofthesamesize.Theyselectareferencesetthatmaximizesthemeanofquery-to-objectdistances.Differentreferenceselectionstrategieswereproposed:a)SelectionthatselectsfromasetofNrandomsetsofreferencesonethatmaximizesthemeanb)Incrementalselectsanewreferencethatmaximizesthemeanofthecurrentsetofreferencesandc)LocalOptimumisaniterativestrategythatstartswitharandomreferencesetandineachiterationreplacesonethatmakeslesscontributiontothemean.Theyselectreferencesthatarefarfromtheotherdatabaseobjects.ThisapproachhadthesamedrawbacksasthatofOmnianddoesnotalwaysresultingoodsetofreferences. 37


72 ]issimplyamatrixwiththepre-computeddistancecomputationsbetweeneverypairofobjectsinthedatabase.Whenansweringaquery,AESAcomputesthedistancebetweenthequeryandanarbitrarilysetofobjects,andusedthesedistancestogeneratethecandidateset.Thismethodhashighstorageandpre-processingcosts.TheLinearAESA(LAESA)[ 64 ]solvesthisproblembyselectingasmallsubsetoftheobjectscalledasBasePrototypes(BP)asreferencesandpre-computingthedistancesfromtheotherobjects.ItsefciencydependsonthenumberofselectedobjectsasBPandtheirlocationwithrespecttootherdatabaseobjects.Theobjectsaresimplylinearlytraversedandthosethatcannotbeeliminatedafterconsideringthereferencesarecompareddirectlyagainstthequeryobject.ThismethodhashightCPUcostslikeAESA. TheM-Tree[ 29 ]isaheight-balancedtreewiththedataobjectsinitsleafnodes.ItaimsatprovidingdynamiccapabilitiesandgoodI/Operformancewithfewerdistancecomputations.Asetofreferenceobjectsareselectedateachnodeandobjectsclosertothereferenceobjectsareorganizedintoasubtreerootedbythatreference.Eachreferencestoresitscoveringradius.Thesearchalgorithmrecursivelysearchesthenodesthatcannotbeprunedusingthecoveringradiuscriterion.ThemainproblemwithM-Treeisthatithaspoorselectivityathigherdimensions. AvariationoftheM-Tree,theSlim-Tree[ 86 ],reducestheamountofoverlapbetweenthetreenodes.Itusestheslimdownalgorithmwhichleadstobettertree.Italsomakesuseoffastersplittingalgorithmbasedontheminimalspanningtree.ItmakesuseofchooseSubtreealgorithmfortheslim-treewhichleadstotightertrees,thushavefewerdiskpages,andfasterretrievals.Thesetree-basedstructuresareheightbalancedandattempttoreducetheheightofthetreeattheexpenseofexibilityinreducingtheoverlapbetweenthenodes.ThisconstraintwasrelaxedintheDensityBasedMetricTree(DBM-Tree)[ 92 ]byreducingtheoverlapbetweennodesin 38


85 ]selectsaglobalsetofrepresentativesinordertoprunecandidateobjectswhenansweringqueries.Itspruningwithrespecttonumberofdistancecomputationisveryhigh.Butitislessefcientthantheslim-treeinnumberofdiskaccess. TheVantage-Pointtree(VP-tree)[ 99 ]partitionsthedataspaceintosphericalcutsbyselectingrandomreferencepointsfromthedata.ThegoalistoreducethenumberofdistancecalculationstoanswersimilarityqueriesintheVP-tree.AvariationoftheVP-TreecalledtheMVP-Tree[ 20 ]usesmorethanonereferenceateachlevel.UsingmanyrepresentativestheMVP-treerequireslesserdistancecalculationstoanswersimilarityqueriesthantheVP-tree. Burkhard-KellerTree(BKT)[ 22 ]providesinterestingtechniquesforpartitioningametricdatabasewheretherecursiveprocessismaterializedasatree.Thersttechniquepartitionsadatabasebychoosingareferencefromthedatabaseandgroupingtheobjectswithrespecttotheirdistancefromit.Thesecondtechniquepartitionstheoriginaldatabaseintoaxednumberofsubsetsandchoosesareferencefromeachofthesubsets.Thereferenceandthemaximumdistancefromthereferencetoapointofthecorrespondingsubsetarealsomaintainedtosupportthenearestneighborqueries. TheFixed-QueriesTree(FQT)[ 5 ],improvesBKT,whereallthereferencesstoredinnodesofthesamelevelarethesame.Theactualobjectsarestoredintheleaves.Theadvantageofthisapproachisthatsomecomparisonsbetweenthequeryandnodesaresavedalongthebacktrackingthatoccursinthetree.TheFixed-HeightFQT(FHQT)[ 5 ]hasallleavesatthesamedepththusmakingsomeleaveddeeperthannecessary.Thismakessensebecausewemayhavealreadyperformedthecomparisonbetweenthequeryandreferenceofanintermediatelevel.ButhasspacelimitationssimilartoFQT.TheFixedQueriesArray(FQA)[ 27 ]isacompactrepresentationoftheFHQT.ItissimilartotraversingtheleavesofaFHQTwithxedheight,lefttoright.Usingthesamememory,FQAisabletousemorereferencesthanFHQTimprovingitsefciency. 39


26 ]presentadynamicreferencebasedmethod.Duringthepre-processing,thedistancebetweenthereferencetodatabaseobjectsarecomputed.ItreducestheCPUcostwhileretainingthearraystructureofFHQTbysortingeacharrayandsavingthepermutationswithrespecttotheprecedingarray.AllthesemethodswhenappliedtocontinuousdistancefunctionslosetheirlinearCPUtime.ExceptM-Treebasedmethods,allotherstructuresareforstaticdatabasesandarenotsuitableforlargedatabases. PivotingM-Tree(PM-Tree)[ 80 ],ahybridbetweenthetwocategories,isanextensionoftheM-Tree.Itcombinesthehierarchyusedbytree-basedmethodswithadistance-matrixbasedmethod.TheresultisaexiblemetricaccessmethodprovidingmoreefcientsimilaritysearchperformancethantheM-Tree.Butpivotselectionrequiresapartofthedatabasetobeknowninadvance.AcombinationofPM-Treeandaslimdownalgorithmmakesanefcientmetricaccessmethod. 67 ].LocalAlignmentaremoreusefulfordissimilarstringsthatcancontainregionsofsimilarity.TheSmith-WatermanAlgorithm[ 81 ]isadynamicprogramingbasedsolutionlocalalignmentoftwostrings.Anumberofindexstructureshavebeendevelopedtoreducethecostofsearchesinstringdatabases.Theycanbeclassiedunderthreecategories:k-gramindexing,sufxtrees,andvectorspaceindexing. Ak-gramisastringoflengthk,wherekisapositiveinteger[ 40 ].k-grambasedmethodslookfortheshortestsubstringsthatmatchexactly;thesestringsarethenextendedtondlongeralignmentswithmismatchesandinserts/deletes.k-gramsareusuallyindexedusinghashtables.TwoofthemostwellknowngenomesearchtoolsthatusehashtablesareFASTA[ 69 ]and 40


2 ].Theperformanceofthesetoolsdeterioratesquicklyasthesizeofthedatabaseincreases. SufxtreeswererstproposedbyWeiner[ 95 ]underthenamepositiontree.Later,efcientsufxtreeconstructionmethods[ 61 88 ]andvariations[ 34 47 52 60 ]weredeveloped.ThesufxtreeforthestringSisdenedasatreewhereeachpathfromtheroottoaleafnoderepresentsasufxofS.IttimecomplexityforbuildingasufxtreeisO(length(S)).However,therearetwosignicantproblemswiththesufx-treeapproach:(1)sufxtreesmanagemismatchesinefciently,and(2)sufxtreesarenotoriousfortheirexcessivememoryusage[ 66 ].Thesizeofthesufxtreevariesbetween10to37bytesperletter[ 32 47 52 62 ].SufxArray[ 60 ]isbasicallyalexicographicallysortedlistofallsufxesofthestringS.Thesufxarraywasdevelopedtoreducethespaceconsumptioninasufxtree.Abinarysearchonthelistgivesthematchingsufxes. Anumberofindexstructureshavebeendevelopedtofunctioninvectorspace,suchasSST[ 39 ]andthefrequencyvectors[ 49 50 ].Thefrequencyvectorofastringstoresthenumberoflettersofeachtypeinthatstring.Thismethodcomputesalowerboundtothedistancebetweentwostringsusingthefrequencyvectorscorrespondingtothetwostrings.Itusesthislowerboundtoeliminateunpromisingstrings.However,asthequeryrangeincreasesfrequencyvectorsperformpoorly. 41
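The frequency-vector filter mentioned above can be illustrated with a short sketch: the letter counts of two strings give a cheap lower bound on their edit distance, because each insertion, deletion or replacement repairs at most one surplus letter and one missing letter. This is a simplified rendering of the idea behind the published frequency-vector filters [49, 50]; the function names are illustrative, not taken from that work.

from collections import Counter

def frequency_lower_bound(s1, s2):
    """Lower bound on the edit distance computed from letter counts alone."""
    c1, c2 = Counter(s1), Counter(s2)
    alphabet = set(c1) | set(c2)
    surplus = sum(max(0, c1[ch] - c2[ch]) for ch in alphabet)   # letters s1 has too many of
    deficit = sum(max(0, c2[ch] - c1[ch]) for ch in alphabet)   # letters s1 is missing
    # Every edit operation fixes at most one surplus and one deficit, so the
    # true edit distance is at least the larger of the two totals.
    return max(surplus, deficit)

# A string whose letter counts differ a lot from the query can be discarded
# without running the O(n^2) dynamic program at all.
print(frequency_lower_bound("ACGTACGT", "AAAAAAAA"))   # 6, and the true edit distance is 6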


An important issue overlooked by the existing methods is that the performance of reference-based indexing can be improved by selecting references that have a significant number of objects both close to and far from them. The key symbols used throughout this chapter are summarized in Table 4-1.

Table 4-1. Summary of symbols and definitions.
  S       Database of objects
  ε       Similarity threshold
  q       Query object
  N       Number of objects in the database
  V       Reference set
  m       Number of references in V
  t       Time taken to compare a pair of database objects
  Q       Sample query set
  Gi      Gain in pruning from reference vi ∈ V
  f       Size of a sample database
  θ       Accuracy of the estimated maximum gain
  δ       Probability of the estimated maximum gain
  μ, σ    Mean and variance of the distribution of gains
  h       Size of a sample candidate set
  k       Subset of references assigned to each object
  IC      Index construction time

This suggests an algorithm for choosing reference points. For example, in Figure 4-1, points in a two-dimensional space represent the database objects. The database is given by the objects shown in the figure.


Figure 4-1. Maximum Variance example.

Let L denote the length of the longest object in S. For an object si ∈ S, μi and σi are the mean and variance of its distances to the other objects in S. A cut-off distance, w, is computed to measure the closeness of two objects. An object sj ∈ S is close to si if ED(si, sj) < (μi - w), and sj is far away from si if ED(si, sj) > (μi + w). Here w is computed as a fraction of L, given by w = L * perc, where 0 < perc < 1.

input:ObjectdatabaseS,withjSj=N.Numberofreferencesm.Cutoffpercentageperc.LengthofaobjectL. Algorithm 1 presentsthealgorithmindetail.Foreachobjectsi2S,asampledatabaseS0,S0SisselectedinStep2.aandthesetofdistancesDi=fED(si;sj)j8sj2S0garecomputed(Step2.b).ThemeaniandvarianceiofthedistancesinDiarecomputedinStep2.c.Thedistancesarethensortedindescendingorderoftheirivalues(Step4).Then,thefollowingaredonerepeatedlyuntiltherequirednumberofreferencesareobtained.Theobjects1withmaximumvarianceisselectedasthenextreferenceandaddedtoV(Step5.a).ThentheobjectsfromSthatareclosetoorfarawayfromthenewreferences1(Step5.b)areremoved.Steps5.aand5.barerepeateduntilthereareenoughnumberofreferences,i.e.jVj=m.Eachiterationofthealgorithmselectsanewreferencethatisneitherclosetonorfarawayfromtheexistingreferences. 44
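A condensed sketch of the Maximum Variance selection just described is given below, assuming string objects and a generic metric dist(). The sampling of S', the statistics module helpers and the tie handling are simplifications introduced here for illustration, so this is a sketch of the idea rather than the dissertation's implementation of Algorithm 1.

import random
import statistics

def maximum_variance_references(S, m, dist, perc=0.1, sample_size=200):
    """Pick m references that are neither close to nor far from one another."""
    L = max(len(s) for s in S)              # length of the longest object in S
    w = perc * L                            # cut-off distance w = L * perc
    stats = []
    for s in S:
        sample = random.sample(S, min(sample_size, len(S)))   # sample database S'
        dists = [dist(s, x) for x in sample]
        stats.append((statistics.pvariance(dists), statistics.mean(dists), s))
    stats.sort(key=lambda item: item[0], reverse=True)        # decreasing variance
    refs, remaining = [], stats
    while remaining and len(refs) < m:
        _, mu, ref = remaining[0]
        refs.append(ref)                    # highest-variance candidate becomes a reference
        # Remove objects that are close to or far away from the new reference,
        # keeping only the middle band from which the next reference is drawn.
        remaining = [item for item in remaining[1:]
                     if (mu - w) <= dist(ref, item[2]) <= (mu + w)]
    return refs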


Apurelycombinatorialapproachwhichtestsallpossiblecombinationsofreferencesinordertomaximizeperformanceoveragivenquerydistributionwouldbeprohibitivelyexpensive.ExhaustivelytestingallpossiblecombinationsofmreferencesfromStakesO(CNmNjQj)time,whereCNmisNm.Thisisduetothefactthatapurelycombinatorialapproachwouldneedtoconsiderallpossiblereferencesets,one-at-a-time,andforeachitwouldneedtocomputethepruningpowerwithrespecttoQ.Themethodcouldperhapsbeimprovedabitbymakinguseofthefactthatmanyofthereferencesetstobetestedwouldbeoverlapping,butitseemsimpossibletoreducethecomplexityofanexactcomputationbelowO(CNm).Tospeedupthiscomputation,agreedysolutiontothisproblemisproposed.Inordertospeedthegreedysolutionevenfurther,sampling-basedoptimizationsareconsideredinSection 4.2.3 Algorithm 2 presentstheMaximumPruningalgorithm.S,Q,andmaregivenasinput.Here,PRUNE(V0;q;s)returnstrueifoneofthereferencesinV0canprunes.MAX(P)and 45


MAX(G) return the maximum of the values in P and G respectively. The reference set V is first initialized as the top m references obtained using the Maximum Variance (MV) method (though this is not a requirement, since one can start with a random initial reference set). The MV method selects as references those database objects having a high variance in pairwise distances with other database objects.

The Do-While loop replaces one existing reference with a better reference during each iteration. An iteration of this loop starts by initializing the array G to zero. Each entry G[i] of G shows the amount of pruning gained by including the ith candidate reference in the reference set. The term gain is used to denote the amount of improvement in pruning. Steps 2.a-2.d iterate over the candidate references and the database objects to accumulate these gains.
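The sketch below captures the spirit of the greedy Maximum Pruning loop just described, without the sampling optimizations of the published algorithm and using a first-improvement swap rule rather than the gain array G. The brute-force distance pre-computation, the pruning test and the stopping rule are simplifications made for illustration, under the assumption that objects are hashable (for example strings).

from itertools import product

def maximum_pruning(S, Q, dist, eps, initial_refs):
    """Greedy Maximum Pruning sketch: swap references while the pruning over Q improves."""
    # Brute-force pre-computation of every object- and query-to-candidate distance.
    d_obj = {(s, r): dist(s, r) for s, r in product(S, S)}
    d_qry = {(q, r): dist(q, r) for q, r in product(Q, S)}

    def pruned_pairs(refs):
        # A pair (q, s) is pruned if some reference proves D(q, s) > eps (triangle inequality).
        return sum(any(abs(d_qry[q, r] - d_obj[s, r]) > eps for r in refs)
                   for q, s in product(Q, S))

    V = list(initial_refs)            # e.g. the top-m Maximum Variance references
    improved = True
    while improved:
        improved = False
        base = pruned_pairs(V)
        for cand in S:
            if cand in V:
                continue
            for out in list(V):       # try replacing one existing reference with cand
                trial = [r for r in V if r != out] + [cand]
                if pruned_pairs(trial) > base:
                    V[V.index(out)] = cand
                    base = pruned_pairs(V)
                    improved = True
                    break
    return V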


4.2.1 isfasterthanapurelycom-binatorialapproach,itisstillimpracticalforlargedatabases.Toaddressthisproblem,twosampling-basedoptimizationstoimprovethecomplexityofthealgorithmbyreducingthenum-berof(object,reference)pairsprocessedareproposed.Therstoptimizationreducesthenumberofobjectsandthesecondreducesthenumberofreferences. 47


2 .Thegainisestimatedbasedonasmallsampleofthedatabaseratherthantheentiredatabase.Oneofthemostimportanttechnicalconsiderationsinthedesignofthisalgorithmishowtodecidewhetherthegainestimateisaccurateenoughbaseduponthesample. Algorithm 3 presentsthesamplingalgorithmindetail.S,Qandaregivenasinput.Itreturnsthetotalgainobtainedbyreplacinganexistingreferencewithanewreferenceforallpossiblenewreferences.Foreachcandidatereferencevi,arandomobjectsj2Sisselectedateveryiteration(Step3.a).ThegainbyusingviasareferenceforsjforQiscomputedinStep3.b.Thegainiscomputedasfollows.Steps2.ato2.coftheMaximumPruningalgorithminAlgorithm 2 areexecutedtocomputethetotalpruningachievedwithrespecttosjbyreplacingeachexistingreferencewithvi.Thenthegainisgivenbythebestpruningoverallpossiblereplacements.Thetotalgainseenaswellasthetotalsquaredgainseen(whichcanbeusedtoestimatethesecondmomentofthegainsthathavebeensampled)isupdatedinStep3.c.The 48


ForanaveragesamplesizeoffwithfN,thisapproachreducesthecomplexitytoO(N2fmjQj).ThisisbecauseititeratesoverfobjectsratherthanallNobjectswhilecomputinggain. Formally,theproblemisdenedasfollows.LetG[i]bethegainthatcanbeachievedbyincludingtheithreferenceinthereferenceset.LetG[e]bethelargestgain(i.e.,e=argmaxifG[i]g).Giventwoparametersandwhere0,1,thecandidaterefer-encesethastobesampledtoensuringthatthelargestgainfromthissampleisatleastG[e]withprobability. SinceG[e]isnotknowninadvance,theType-IExtremeValueDistribution(alsoknownastheGumbeldistribution[ 41 ])canbeusedtoestimateitsvalue.Thisisdoneasfollows.LetsassumethateachG[i]isproducedbysamplingrepeatedlyfromanormallydistributedrandomvariable.Therststepistodeterminethemeanandstandarddeviationofthisvariable.Todothis,asamplesetofcandidatereferencesisselectedandthemeanandthestandarddeviation


Since the values in G are assumed to be samples from a normal distribution, the largest gain G[e] is known to have a Gumbel distribution whose parameters can be computed using μ and σ. Let N and t be the number of candidate references and the sample size, respectively. The two parameters of the Gumbel distribution, referred to as the location parameter a and the scale parameter b, are computed from μ, σ, N and t. The cumulative distribution function of the Gumbel distribution, P(x < g) = exp(-exp(-(g - a)/b)), is then used to test whether the largest gain observed so far is at least θ·G[e] with probability δ. The second optimization computes the gain from random candidate references until the required accuracy is reached. In each iteration, a different sample of candidate references is evaluated.


21 23 35 64 94 ]. MPdiffersfrommostselectionstrategiesinthatitmakesanyassumptionsaboutthequerydistributionexplicit.Ofcourse,MPcaneasilyberunbysimplyusingthedatatobeindexedasthequerydistribution;inthiscase,likeothermethods,MPwillbeoptimizedtorunonqueriesthataresimilartodatabaseobjects.However,ifthequeriesfollowdifferentdistributionorifthequerydistributionkeepschanging,MPhasthebenetthatitcanexplicitlytakeintoconsiderationthesefactorswhilecreatingtheindex. AsseenintheexperimentsofSections 4.5 and 5.4 ,theefciencyandefcacyofMPdodependuponthesizeandaccuracyofthetrainingqueryset.However,MPissurprisinglyrobustinthisregard,andevenforsmalland/orasomewhatinaccuratetrainingset,MPoutperformsitscompetitors.Thus,itcanbearguedthatthefactMPusesatrainingquerysetisactuallyabenecialfeatureoftheMPmethodology. 4.3.1MotivationandProblemDenition 35 ]to 51


NumberofcomparisonsforOmniwithvaryingnumberofreferences. answeraqueryoveraDNAobjectdatabaseasafunctionofthenumberofreferencesisgiveninFigure 4-2 (seeSection 4.5 fordetails).ThisusestheDNAdatabasewithaqueryrangeof8.Itcanbeseenthatinitiallythenumberofcomparisonsrequiredtoansweraqueryreduceswithincreaseinthenumberofreferences.After400references,itbeginstoincreaselinearlywiththenumberofreferences.Thus,theoptimalnumberofreferencesforthisparticularexampleisinthehundreds. Unfortunately,duetomemoryconstraints,selectingandassigning400referencestoeachandeveryobjectinadatabaseisnotalwaysapracticalsolution.Forexample,thehumangenome(with3billionbasepairs)contains30millionobjectsoflength100.Fromthecostmodel,themainmemorystorageofanindexthatcontains400referencesforeachoftheseobjectswouldrequireabout90GBofmainmemory.While90GBofRAMmaybefeasible,itiscertainlyattheupperendofwhatwouldbeacceptable.Foranevenlargerdatabase(oreveninthecaseofthehumangenomeifoneweretoindexthesubstringstartingateachandeverybasepair)thememoryrequirementsquicklybecomeunmanageable. 52
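The 90 GB figure quoted above follows directly from the cost model's 8 bytes per (object, reference) mapping. A quick back-of-the-envelope check, assuming exactly 30 million objects with 400 references each and ignoring the zm bytes for storing the references themselves:

objects = 30_000_000          # substrings of length 100 in the human genome
references_per_object = 400
bytes_per_mapping = 8         # 4-byte reference id plus 4-byte distance

total_bytes = objects * references_per_object * bytes_per_mapping
print(total_bytes / 2**30)    # about 89.4 GiB, i.e. roughly the 90 GB quoted above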


Formally, given a set of m references, the goal is to assign a set of k references (k ≤ m) to each database object.



Sections 4.1 and 4.2 discussed how to find references and how to map them to database objects (Section 4.3). This section describes how to use the mapped references to answer range queries.

Given a query q and a range ε, the search first computes the distances from q to the references. For each object si, the lower bound LB and the upper bound UB on D(q, si) are obtained from the pre-computed object-to-reference distances as in Chapter 3. Depending on the LB, the UB, and the ε, si is inserted into one of the two sets, Result set and Candidate set, as follows. If UB ≤ ε, si is inserted into the result set. If LB ≤ ε < UB, si is inserted into the candidate set. Otherwise, si is pruned. Once the candidate set is determined, the actual object comparison between q and all the objects in the candidate set is performed to filter out the false positives.

Given an object database S with N objects, the selection strategy selects m references. The assignment strategy maps each object with k, k ≤ m, references. For each object s and its reference vi ∈ V, its mapping is of the form [i, ED(s, vi)]. The N·k edit distances between the objects and their assigned references form the pre-computed part of the index.
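A minimal sketch of this filter-and-refine range search is shown below. Here index[s] is assumed to hold the pre-computed [reference id, distance] pairs for object s, and the function and variable names are illustrative rather than taken from the dissertation.

def range_search(q, eps, objects, references, index, dist):
    """Reference-based range query: classify objects by LB/UB before any costly comparison."""
    q_to_ref = {i: dist(q, r) for i, r in enumerate(references)}  # query-to-reference distances

    results, candidates = [], []
    for s in objects:
        lb, ub = 0.0, float("inf")
        for i, d_s_ref in index[s]:            # pre-computed [reference id, D(s, v_i)] pairs
            lb = max(lb, abs(q_to_ref[i] - d_s_ref))
            ub = min(ub, q_to_ref[i] + d_s_ref)
        if ub <= eps:
            results.append(s)                  # provably within range, no comparison needed
        elif lb <= eps:
            candidates.append(s)               # undecided: must be compared with q
        # else: lb > eps, so s is pruned

    # Refine step: only the candidates pay the costly distance computation.
    results.extend(s for s in candidates if dist(q, s) <= eps)
    return results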


During database search, the references and the reference-to-object edit distances are loaded into memory. With the average object size as z bytes and four bytes for an integer, this requires (8Nk + mz) bytes of memory. Here 8 bytes are used to store each of the Nk mappings. With an increase in m, the objects can be assigned better references. However, this will also increase the number of query-to-reference computations. Hence m is restricted to a fraction of the database size in the experiments.

Table 4-2. A list of proposed methods.
                        Selection strategy
  Assignment strategy   Maximum Variance   Maximum Pruning
  Same references       MV-S               MP-S
  Diff. references      MV-D               MP-D

The proposed methods that have been implemented are listed in Table 4-2. For the proposed algorithms, two different reference-to-object assignment strategies are considered. The first one is the traditional way of assigning all references to the database objects (MV-S, MP-S). The second strategy is the proposed approach of increasing the reference set and assigning different subsets of references to each object (MV-D, MP-D).


35 ],2)theSparsespatialselectionstrategy(Sparse)proposedbyBrisaboaet.al.[ 21 ],3)therandom(BRAND)and4)incremental(BINC)referenceselectionstrategiesproposedbyBustoset.al.[ 23 ].Thefollowingfourtypesofdatabaseshavebeenusedintheexperiments: 10 ]( 7 ]( 16 59 ].AllcomparisonsaremadeusingEMD. 57


TheexperimentsranonanIntelPentium4processorrunningLinuxwith2.8GHzclockspeedand1GBofmainmemory. ThenumberofobjectcomparisonsfordifferentreferencesetcardinalitiesforthethreedatabasesaregiveninFigures 4-3 to 4-5 .FortheDNAdatabase,uptom=200,thenumberofobjectcomparisonsreducesatafastrate.Fromm=200to300,thereisverylittleimprovementinperformance.Form>300,thenumberofobjectcomparisonsincreases.Thisisduetoincreaseinnumberofquery-referencedistancecomputations.Fortheimagedatabase,the 58


NumberofcomparisonsforMP-DforDNAobjectdatabasefordifferentvaluesofm. improvementsobtainedareonlyuptom=200.Thenumberofcomparisonsincreasesslowlyfromm=200to300andform>300,therateofincreaseisfast.Fromtheseresults,itcanbeconcludedthatusinganmvalueinthelowhundredsisagoodchoice,sincethisgivesgoodperformanceandallowsforreasonableindexconstructiontimes. OnequestionisHowcanonedeterminethebestvalueofmiftheindexconstructiontimeiscompletelyignored?Figures 4-3 to 4-5 showthatthenumberofobjectcomparisonsfollowaU-shapedcurve.Thisisindeedintuitive:Forsmallm,thenumberofpossibilitiesforselectingdifferentreferencesissmall.Thus,thepruningratedropsandthenumberofcomparisonswithunprunedcandidatesincreases.Forlargem,thecandidatesizedecreases.However,thenumberofcomparisonswiththereferences(i.e.,m)increases.Inotherwords,asmincreases,thebenetgainedbypruningmorecandidatesbecomeslessthanthecostofcomparingthequerytothereferenceset.SincethiscurvefollowsaU-shape,thebestvalueofmcanbedeterminedbyeitherusingbinarysearchovermorbyadoptingtheNewton-Rhapsonmethod. 59


NumberofcomparisonsforMP-Dforproteindatabasefordifferentvaluesofm. Figure4-5. NumberofcomparisonsforMP-Dforimagedatabasefordifferentvaluesofm. 60


ImpactofjQjonindexconstructiontime. ThetimetakentoconstructtheindexusingMP-DfordifferentcardinalitiesofthesamplequerysetisgiveninFigure 4-6 .TimetakentocreatetheindexusingMP-DfordifferentvaluesofjQj.Itcanseenthatwiththeincreaseinthesizeofsamplequeryset,theindexconstructiontimeincreasesalmostlinearly. ThenumberofcomparisonsneededisgiveninFigure 4-7 .NumberofcomparisonsforMP-DfordifferentvaluesofjQj.Withincreasingcardinalityofthetrainingqueryset,thenumberofcomparisonsreduces.FromjQj=200to1000,thereisslightimprovementintheperformance.Giventhatthecostofbuildingtheindexincreasesrapidlyforincreasingtrainingqueryset(Figure 4-6 ),inalloftheexperimentsjQj=100ischosen. 61


ImpactofjQjonqueryperformance. ThenumberofobjectcomparisonsfortheDNAdatabaseisgiveninFigure 4-8 .Thisincreaseswiththequeryrangeforallfourmethods.Fordifferentranges,MP-DandMV-DhavelessernumberofobjectcomparisonscomparedtothoseofMV-SandMP-S.MP-Disgivesthebestresults.Forrangesupto8,assigningdifferentreferencesetstoeachobjectresultsinsignicantreductionofobjectcomparisonsforbothselectionstrategies.Forbothsameanddifferentreferencesets,MPperformsslightlybetterthanMV.ThisisduetothefactthatMPisusingtheknowledgeofinputquerydistribution.Theresultsforproteindatabasearegivenin 62


ComparisonoftheproposedmethodsforDNAdatabaseforquerieswithvaryingranges. Figure 4-9 .MP-Doutperformsothersforallqueryranges.Formostofthequeryranges,MPmethodsperformbetterthanMVmethods.Similarresultsfortextdatabaseforvaryingqueryrangeshavebeenobserved.Theseexperimentsshowthatassigningdifferentreferencesetstoeachobjectgivesbetterpruningresultsthanthetraditionalapproachofassigningsamereferencetoallobjects. Overall,theresultsshowthatassigningdifferentreferencesetstoeachobjectgivesbetterpruningresultsthanthetraditionalapproachofassigningthesamereferencestoalldatabaseobjects. 63


ComparisonoftheproposedmethodsforProteindatabaseforquerieswithvaryingranges. ThenumberofobjectcomparisonsforDNAdatabaseisgiveninFigure 4-10 .Forallfourmethods,thenumberofobjectcomparisonsdecreasesdramaticallywithincreaseink.Asthenumberofreferencesincreasesfrom2to32,thenumberofobjectscompareddropsbyfactorsof5to20betweenthemethodsMP-DandMP-S.TheresultsforproteindatabaseisgiveninFigure 4-11 .MP-SandMV-Scomparemorenumberofobjectsforallreferences.Withtheincreaseinthenumberofreferences,thereisagradualdecreaseinthenumberofobjectcomparisons.MV-SstrategyoutperformsMP-Sstrategyatk=8;16and32inproteindatabaseandMP-SoutperformsMV-SforallvaluesofkinDNAobjectdatabase.TheexperimentsusingtextdatabasegavesimilarresultsastheDNAobjectdatabasewithMP-Dgivingthebestresults.Thisshowsthatwithincreaseinnumberofreferences,thememorycanbeutilizedbetterbyassigningsubsetofreferencestoeachdatabaseobject. 29 ],DBM-Tree[ 92 ],Slim-Tree[ 86 ],DF-Tree[ 85 ],Sparse[ 21 ],BRANDandBINC[ 23 ].. 64


ComparisonoftheproposedmethodsforDNAdatabasewithavaryingnumberofreferences. Figure4-11. ComparisonoftheproposedmethodsforProteindatabasewithavaryingnumberofreferences. 65


Table4-3. ComparisonwithTree-basedindexstructures. M-TreeSlim-TreeDBM-TreeDF-TreeQRIC=50msIC=15msIC=14msIC=480ms OmniFVMV-DMP-DQRIC=14msIC=6ssIC=74msIC=180ms TheresultsofsixexistingmethodsalongwithMV-DandMP-DfortheDNAobjectdatabasearegiveninTable 4-3 .QRdenotesthequeryrange.ssandmsdenotetherunningtimeinsecondsandminutesrespectively.NumberofreferencesforOmni,DF-Tree,MV-DandMP-Dare16.Withtheincreaseinqueryrange,thenumberofobjectcomparisonsincreasesforallthemethods.Thetreebasedmethodscomparemoreobjectsthanasimplesequentialscanwiththeincreaseinqueryrange.Thisisduetothecomparisonofqueryobjectwiththeobjectsintheintermediatenodesofthetreestructures.Forranges8to32OmniperformsmoreobjectcomparisonthanFV,MV-DandMP-D.Withtheincreaseintherangefrom2to8,MP-Dreduces 66


TheresultsfortheDNAdatabasearegiveninFigure 4-12 .Notsurprisingly,withincreasinginqueryrange,thenumberofobjectcomparisonsincreasesforallthemethods.MP-Dhasupto40timesfewercomparisonsthanOmni,BRANDandBINCandupto10timesfewercomparisonsthanSparse.Duetoitslargereferenceset,Sparsehasmorecomparisonsthanothermethodsatarangeof2.TheresultsforrangequeriesontheproteindatabasearegiveninFigure 4-13 .Sparsehasmorecomparisonsthanallothermethods.BINCperformsslightlybetterthanBRAND.Forallqueryranges,MP-Dhasuptotwotimesfewcomparisonsthantheothermethods.TheresultsfortheimagedatabasearegiveninFigure 4-14 .MP-Dhasupto5timesfewercomparisonsthanSparseanduptothreetimesfewerthanBRAND. 67


ComparisonwithothermethodsonDNAdatabaseforquerieswithvaryingranges. Figure4-13. Comparisonwithothermethodsonproteindatabaseforquerieswithvaryingranges. 68


Comparisonwithothermethodsonimagedatabaseforquerieswithvaryingranges. fortheDNAdatabase,300fortheproteindatabaseand8%ofthemaximumdistancevaluefortheimagedatabase.SinceSparseusesaxednumberofreferencesforthestaticdatabases,theresultsoftheothermethodsarecomparedwithdifferentnumbersofreferences.Theplotsaregiveninlog-scale. ThenumberofobjectcomparisonsforallfourmethodsfortheDNAdatabaseisgiveninFigure 4-15 .BINCperformsbetterthanBRANDandOmniperformsbetterthanbothBRANDandBINC.Forallreferencevalues,MP-Doutperformstheothermethods.Asthenumberofreferencesincreases,thenumberofcomparisonsrequiredbyMP-Dissmallerbyuptoafactorof20comparedtoOmni,BRANDandBINC. TheresultsfortheproteindatabasearegiveninFigure 4-16 .WithfewerreferencesBINChasmorecomparisonsthanBRAND.MP-Dreducesthenumberofcomparisonsbyafactorof2comparedtoOmniandoutperformsBRANDandBINCforallrangesofreferences. TheresultsfortheimagedatabasearegiveninFigure 4-17 .Foravaryingnumberofrefer-ences,BINCrequiresfewercomparisonsthanBRAND.Omnirequiresmorecomparisonsthan 69


ComparisonwithothermethodsonDNAdatabaseforavaryingnumberofrefer-ences. theothermethods.MP-DhasuptothreetimesfewercomparisonsthanOmniandoutperformsBRANDandBINC. TheresultsaregiveninFigures 5-1 to 4-20 .Forallqueryranges,MP-Doutperformstheothermethods,eventhoughithasbeentrainedonaspeciesthatisdifferentthanthequeryspecies.ThissuggeststhatMP-Disatleastsomewhatrobusttochangesinthequerydistributionorinaccuraciesinthetrainingdistribution. 70


Comparisonwithothermethodsonproteindatabaseforavaryingnumberofrefer-ences. Figure4-17. Comparisonwithothermethodsonimagedatabaseforavaryingnumberofrefer-ences. 71


ComparisonwithothermethodsonDNAdatabaseforqueriesfromHeliconiusMelpomenewithvaryingqueryranges. Figure4-19. ComparisonwithothermethodsonDNAdatabaseforqueriesfromMusMusculuswithvaryingqueryranges. 72


ComparisonwithothermethodsonDNAdatabaseforqueriesfromDanioReriowithvaryingqueryranges. TheresultsforthethreemethodsaregiveninFigure 4-21 .BRANDhasthemaximumnumberofcomparisonsforalldatabasesizes.Withanincreaseinthesizeofthedatabase,MP-Doutperformsallothermethods.Evenwithitslargereferenceset,SparsehasuptotwotimesmorecomparisonsthanMP-D.MP-Dhasupto20timesfewercomparisonsthanOmni,BRANDandBINC.Withanincreaseindatabasesize,thenumbersofcomparisonsrequiredbyOmni,BRANDandBINCincreaseatafasterratecomparedtoMP-D. 73


Scalabilityindatabasesize. 4-22 .Allofthemethodsshowreductioninthenumberofstringcomparisonswithincreaseinthestringlengths.Forshorterstrings,therangeof8islargerelativetothestringlength.InthesecasestheMP-DoutperformsOmniandFVbyafactorof2.Forlongstrings,therangeof8islow.Forthesestrings,MP-Dreducesthestringscomparisonbyafactorof20comparedtoFVandOmni.Asthestringlengthincreases,theOmnioutperformsFV.Separatescalabilityexperimentsonproteindatabaseswerenotperformed.Proteindatabaseshavingstringlengthsupto500areusedintheexperimentsgiveninFigures 4-13 and 4-16 .Theseresultsshowthattheproposedmethodsscalewelltothedifferentlengthsofproteinstrings. 74


Figure 4-22. Scalability in string length.


Databasesoftenupdatedbyinsertingnewobjects.Forexample,ontheaverage,theGenBankdatabase[ 10 ]( 4 toselectthebestreferencesetaftereachinsertionisapossibleapproach,butitmaybeinfeasibleduetoitscost.ThissectionaddressesthisproblembyproposingtwoincrementalvariationsoftheMP:theSinglePass(SP)andtheThreePass(TP)variations.Sinceneitheralgorithmre-computestheindexfromscratch,bothmustmakeassumptionsinvolvingthechange(orlackthereof)ofthevariousgainstatisticsusedbytheMPalgorithmovertime. 5.1.1BasicApproach EachnewlyinserteddatabaseobjectisconsideredasacandidatereferencebySPanditsgainiscomputedbypassingoverthedatabaseonce.Ifthegainfromthenewobjectismorethanthatofanyreferenceinthereferenceset,thenthereferencesetisupdatedwiththenewobject. InTP,thegainassociatedwiththenewobjectiscomputedusingSP.Thenthegainsofalltheobjectsinthedatabaseareupdatedbasedonwhethertheycanprunethenewobjectwiththeirassignedreferences.Inthenalstep,ifthecandidatereferencewithmaximumgainisnotalreadyinthereferenceset,itsgainisrecomputedandthecandidateisaddedtothesetofreferences.Inthismethod,theobjectsinthedatabasearescannedthreetimes;hencethenameThreePassalgorithm. 76


93 ].Thereservoiralgorithmisaclassic,one-passsamplingalgorithmwiththekeycharacteristicthatatalltimes,thesetofobjectsmaintainedbythealgorithmisatruerandomsampleofalloftheobjectsseenthusfar.ThereservoiralgorithmisusedbybothSPandTP,tomaintaininanonlinefashionarandomsampleofallofthequeriesthathavebeenobservedthusfar.ThissetisthenusedasthequerysetQinordertooptimizetheindex.Inthisway,asthequerydistributionchanges,bothalgorithmstendtoincludeexamplesofboththenewerandtheolderqueriesintheirtrainingset,meaningthattheindexcanevolveovertimeinordertotakeintoaccountanevolvingquerydistribution. ThegainsoftheobjectsthatMPhasselectedasreferencesaregivenasinputtotheSPalgorithm.Aftereachinsertionofanewdatabaseobject,thenewobject'sgainiscomputedbyassumingthateveryexistingdatabaseobjectwilluseitasareference.Ifthegainassociatedwithusingthenewobjectismorethanthatofanyoftheexistingreferenceobjects,thenthereferencesetisupdatedwiththenewobjectanditsgain. TheSPalgorithmisgiveninAlgorithm 5 .ThealgorithmtakestheexistingreferencesetV,thesamplequerysetQ,andgainsG[1::jVj],whereG[i]isthegainassociatedwiththereferenceV[i],asinput.GiventhenewdatabaseobjectX,thealgorithmrstcomputesitsgainbyincludingXinthecandidateset(Steps2.ato2.d).ThisstepissimilartoStep2oftheMPinAlgorithm 2 .Thereferencewithminimumgain,e,isselectedinStep4.Ifthegainfromthenewreferenceismorethanthatofe,thenthereferencesetisupdatedwiththenewreference. 77
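The reservoir algorithm referred to above is the classic one-pass Algorithm R. A minimal version that maintains the rolling training query sample Q is sketched below; the variable names and the stand-in query stream are illustrative only.

import random

def reservoir_update(reservoir, capacity, new_query, seen_so_far):
    """Maintain a uniform random sample of all queries observed so far (Algorithm R).

    seen_so_far is the number of queries observed including new_query.
    """
    if len(reservoir) < capacity:
        reservoir.append(new_query)          # fill the reservoir first
    else:
        j = random.randrange(seen_so_far)    # uniform in [0, seen_so_far)
        if j < capacity:
            reservoir[j] = new_query         # keep new_query with probability capacity/seen_so_far
    return reservoir

# Usage: keep a sample of 100 training queries while the workload streams in.
Q, seen = [], 0
for query in ("q%d" % i for i in range(1000)):   # stand-in for the real query stream
    seen += 1
    reservoir_update(Q, 100, query, seen)
print(len(Q))   # 100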


applyingtherstoptimizationofMPgiveninAlgorithm 3 andbyusingasamplesizeoff,thetimecomplexityofSPcanbereducedtoO(t(f+jQj)+fmjQj). 78


TheTPalgorithmisformallypresentedinAlgorithm 6 .Thealgorithmrstcomputesgainwiththenewly-insertedobjectXasacandidatereferenceusingStep2oftheSP.Step2updatesthegainsofotherobjectsinthedatabasesimilartotheStep2ofMPinAlgorithm 2 .Step3locatestheobjectwithmaximumgain.Step4recomputesthegainoftheobjectusingStep2oftheSPgiveninAlgorithm 5 .Ifthenewobjecthasgaingreaterthananyoftheexistingreferences,itisincludedinthereferenceset(Steps5and6). 79


ComputeG[X]usingtheSP.1 Therearemanyfactorsthatneedtobeconsideredwhenansweringthesequestions.MPcanbeusedinadynamicenvironmentbysimplyrebuildingtheindexfromscratchaftereverynewinsertion(orperiodically,ifitisnottooproblematicthattheindexbecomestale).Almostcertainly,thiswillresultinadatabasethatgivessuperiorquery-processingspeedcomparedtoSP 80


SP and TP will expectedly give inferior query-processing capability compared to MP, yet will be able to process insertions more easily. All of this is complicated by the fact that all of the proposed algorithms require that the distance from each new database object to each existing database object be computed as the database grows, which adds a very significant fixed cost to every new database insertion for each algorithm.

These considerations must all be taken into account when choosing which of the three methods to use. The goal of this experimental section is to carefully benchmark each of the three methods in order to point out under what circumstances each method may or may not be preferred.


Comparison of the proposed methods on the DNA database with Hilbert-ordered data and query distributions.

In order to test the effect of such considerations on query performance, the following experimental setup is used. Given the DNA dataset of 40,000 objects, the database objects are first ordered (as we will describe later in this section). Then, the first 4,100 objects from the ordering are considered. A random selection of 100 of these are considered as part of a sample query set, and the remaining 4,000 as part of the initial set of objects in the database. An index is constructed for this initial database using MP and the sample query set. The remaining objects in the DNA database are processed according to their ordering, one object at a time. The first twenty-six objects from the ordering are inserted into the database. For SP and TP, the index is updated after each insertion. For MP, the index is reconstructed from scratch after the twenty-sixth insertion. Then, the twenty-seventh object is selected as the query object for the search. For this query object, the number of object comparisons required to answer the query using each of the three indexing strategies is computed. For SP and TP, this query is then added to the query sample using the reservoir algorithm [93]. Then, the next twenty-six objects (in order) are inserted.


Comparison of the proposed methods on the DNA database with random-ordered data and query distributions.

The next object is used as a query over all of the inserted data. This process is repeated until 1,000 queries and 26,000 insertions have been performed.

Two different orderings of the DNA data are considered. The first ordering, a Hilbert ordering, is obtained as follows. The percentage of each letter in the DNA object defines a multi-dimensional vector (four dimensional, since a DNA object contains four different letters). The objects are then ordered according to the Hilbert ordering of their resulting vectors. Because of the Hilbert ordering, the distribution of the data that are inserted will vary over time. The first few objects will have a high percentage of a single letter and low percentages of the rest of the letters, but that percentage will drop for objects inserted later on, as the database is processed and other letters become prominent. For example, the DNA objects have {A, C, G, T} as the alphabet set. The Hilbert ordering of this database has more than 40% A's in the first few objects. This percentage decreases to 10% towards the end of the ordering.


Sparse method for the DNA database with Hilbert- and random-ordered data and query distributions.

Because of the experimental setup, this simulates the situation where queries are always more oriented towards the most-recently inserted data, which may be a realistic scenario.

For the second experiment over the DNA dataset, the data is ordered randomly. Thus, there is no drift of the data and query distribution over time in this case. The same experiment is also performed over the randomly ordered images from the retinal image dataset. Here the initial dataset contained 1,000 images and 100 sample query images. Then a random image is used as a query for every 10 images inserted. This process is continued until 3,500 images have been inserted into the database.

For the image dataset, a range of ε = 8% of the largest distance and k = 16 are used. For the two orderings of the DNA database, ε = 8 and k = 32 are used.


Comparison with Sparse for the randomly ordered image database.

The results of these experiments are given in Figures 5-1 to 5-4. In addition to MP, SP, and TP, we present the results obtained by running the same experiment using the Sparse method proposed by Brisaboa et al. [21] for dynamic databases. The other methods considered experimentally in this paper, such as OMNI and BINC, are not able to handle dynamic updates, and so are not tested in these experiments.

The various plots depicted are fairly jagged due to the large number of queries asked and the high variance of index performance on any given query, but the results show that the incremental algorithms may require more than three times more comparisons than MP. Not surprisingly, SP required more comparisons than TP for all three datasets, though the difference between the two methods is negligible for the randomly-ordered DNA dataset. This particular dataset also showed the closest performance among all three methods.


Distance computation time of the DNA database.

This is also not surprising, given that the data and query distributions are stationary over time. Perhaps the most surprising finding is that for the randomly ordered image database, there is a very clear separation in the performance of the various methods over time.

For the DNA databases, the Sparse method failed to generate a good set of references and it must scan almost the entire database. For the image database, Sparse requires up to three times more comparisons than MP and up to two times more comparisons than TP.

The first experiment regarding insertion processing speed is meant to test the magnitude of a cost that is common to all three algorithms: the cost of computing the distance of a new database object to all existing database objects.


Distance computation time of the image database.

For the two different distance measures considered (ED and EMD), the time taken to process the necessary distance computations associated with insertions numbered 1,000 to 3,000 is computed.

The timing results for the DNA database (using ED) are given in Figure 5-6. With the increase in size from 1,000 to 3,000, the time taken to process each additional insertion increases from 0.12 to 0.36 seconds. The time required for the image dataset (using EMD) is given in Figure 5-5. With the increase in size from 1,000 to 3,000, the time taken increases from 2,500 seconds to 7,500 seconds. As one might expect, times scale linearly with database size for both distances, but there is a very significant difference in the magnitude of the cost for the two distance measures.

The time taken for index construction by the two incremental methods with increasing database size for the image dataset is given in Figure 5-7, using ε = 8% of the largest distance and k = 32.


Index construction (IC) times of the methods SP and TP for the DNA and image databases.

The time taken by TP varies from 0.06 second for 1,000 objects to 0.13 second for 3,000 objects. Similarly, the time taken by SP varies from 0.02 second to 0.04 second. The time taken to construct the index for MP is given in Figure 5-8. It varies from 175 seconds for 1,000 objects to 675 seconds for 3,000 objects and increases linearly. Thus the incremental algorithms are orders of magnitude faster than MP. We obtained similar results for the methods SP, TP and MP using the DNA database.

For distance metrics of intermediate computational cost such as the ED, the incremental methods such as SP and TP seem preferable. The distance computation cost of ED is significant, but not debilitating.


Index construction (IC) times of MP for the DNA and image databases.

Specifically, it has a low cost (Figure 5-5) compared to the index construction cost of the MP (Figure 5-8), which dominates; the time required to rebuild the index using MP is around 3,000 times the cost of computing all of the distances to a newly inserted database object. The gain from MP's superior query speed (up to two times faster compared to the incremental algorithms) does not seem to justify its costs. For example, the statistics associated with the DNA objects in the GenBank [10] database show that for every new object that is inserted into the database, the database has around one new query. With a 1:1 query-to-update ratio, the update cost is just as important as the query cost, making MP far less attractive.

On the other hand, for distance metrics of high computational cost such as EMD, applying MP is preferred. The index construction time for MP (Figure 5-8) is low compared to the costly EMD distance computation (Figure 5-6). Even though the incremental algorithms have an index construction cost that is negligible (Figure 5-7), their query performance is up to three times slower than MP. The difference in their query times is much greater than the savings in index construction time.


If an incremental algorithm is selected, choosing from among SP and TP is somewhat difficult. The latter generally has superior query performance, though not over the DNA dataset with a random ordering. The former has better insert processing performance. A rule of thumb might be that if inserts are more common, choose SP; if queries are more common, choose TP.


This chapter presents a generalized framework for Nearest Neighbor queries called Generalized Nearest Neighbors (GNN) to answer the similarity join queries.

Finding the broadness of data is needed in many applications such as life sciences (e.g., detecting repeat regions in biological sequences [46] or protein classification [24]), distributed systems (e.g., resource allocation), spatial databases (e.g., decision support systems or continuous referral systems [55]), profile-based marketing, etc.

In this dissertation, a new database primitive, called the Generalized Nearest Neighbor (GNN), which naturally detects data broadness, is defined. Given two databases R and S, the GNN query finds all the objects in S′ ⊆ S that appear in the k-NN set of at least t objects of R, where t is a cutoff threshold. The objects in the result set of a GNN query are broad. Here, S′ is the set of objects that the user focuses on for the broadness property. If R = S, then it is called a mono-chromatic query. Otherwise, it is called a bi-chromatic query.

The trivial solution to a GNN query is to run a kNN query for each object in R one by one, and accumulate the results for each object in S. However, this approach suffers from both an excessive amount of disk I/Os and CPU computations. When the databases do not fit into the available buffer, a page that will be needed again might be removed from the buffer while processing a single kNN query. CPU cost also accounts for a significant portion of the total cost since the kNN set is determined for each object in R from scratch.

Let us assume that each of the databases is larger than the available buffer. Three solutions are proposed that arrange the data objects into pages. Each page represents a set of objects represented by their minimum bounding rectangle (MBR). Two R*-trees [9] are constructed, one built on R and the other on S. They predict a set of candidate pages from S that may contain kNNs for each MBR of R with the help of the R*-trees. Each candidate page is assigned a priority based on its proximity to that MBR of R and is stored in a Priority Table (PT). The first algorithm, the pessimistic approach, fetches as many candidate pages as possible from S for each page of R. The second algorithm, the optimistic approach, fetches one candidate page at a time from S for each page of R.
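For reference, the trivial solution mentioned above can be written in a few lines; `dist` stands for whatever metric the databases use, objects are assumed to be hashable, and all names are illustrative only.

```python
from heapq import nsmallest

def gnn_bruteforce(R, S, S_prime, k, t, dist):
    """Naive GNN(R, S, S', k, t): for every r in R compute its k nearest
    neighbours in S, then report the objects of S' that appear in the
    k-NN sets of at least t objects of R (with those objects of R)."""
    supporters = {s: [] for s in S_prime}      # assumes objects are hashable
    for r in R:
        knn = nsmallest(k, S, key=lambda s: dist(r, s))   # k-NN of r in S
        for s in knn:
            if s in supporters:
                supporters[s].append(r)
    return {s: rs for s, rs in supporters.items() if len(rs) >= t}
```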


An example for a GNN query.

The third algorithm dynamically decides the number of pages that needs to be fetched for each page of R by analyzing the query history. It reduces the CPU and I/O cost significantly through three optimizations that dynamically prune 1) pages of S that are not in the kNN set of sufficiently many objects in R, 2) pages of R whose nearest neighbors do not contribute to the result, and 3) objects in candidate MBRs of S that are too far from the MBRs of R. The method further reduces these costs by pre-processing the input databases using a packing technique called Sort-Tile-Recursive (STR) [76].

Assume that the white and black points in Figure 6-1 show the layout of the 2-D databases R = {r1, ..., r5} and S = {s1, ..., s5}, respectively. Consider the following query: GNN(R, S, S′ = {s1, s2, s5}, 2, 3).


In Figure 6-1, the circle centered at each ri ∈ R covers the 2-NN of ri, for all i. Only s2 and s4 are covered by at least three circles. s4 can be ignored since s4 ∉ S′. The set of nodes that have s2 in their 2-NN set is {r1, r2, r3}. Therefore, the output of this query is {(s2, {r1, r2, r3})}. Note that the data points in S − S′ cannot be ignored prior to the GNN query. In other words, GNN(R, S, S′, k, t) ≠ GNN(R, S′, S′, k, t). For example, in Figure 6-1, removal of s3 and s4 prior to the GNN query changes the 2-NNs of r2, r3 and r4. As a result, s1 becomes one of the 2-NNs of r2 and r3. Hence s1 is incorrectly classified as broad.

A nice property of the GNN query is that both the mono-chromatic and bi-chromatic versions of the standard k-NN, ANN and RNN queries are its special cases. The following observations state these cases. One can prove these from the definition of the GNN query. Note that the goal of this paper is not to find different solutions to each of these special cases. The goal is to solve a broader problem which cannot be solved trivially using these special cases.

Two R*-trees [9] are constructed, one built on R and the other on S. They predict a set of candidate pages from S that may contain k-NNs for each MBR of R with the help of the R*-trees. Each candidate page is assigned a priority based on its proximity to that MBR of R and is stored in a Priority Table (PT). The first algorithm fetches as many candidate pages as possible from S for each page of R, the second fetches one candidate page at a time, and the third dynamically decides the number of pages to fetch for each page of R.


The input databases are pre-processed by packing them using the STR technique [76], as discussed in Section 6.6. This is a one-time cost per database; the same index will be used for all the queries. The STR [76] ordering is used for a total ordering of the data. Throughout this paper the R*-Tree is used to index the databases. Other index structures can be used to replace the R*-tree. For simplicity, the capacity of each MBR of the R-tree is chosen as one disk page, and leaf-level MBRs are used to prune the solution space.

Given two MBRs B1 and B2, MAXDIST(B1, B2) and MINDIST(B1, B2) are defined as the maximum and minimum distance between B1 and B2. The following lemma establishes an upper bound on the k-NN distance to the objects in a set of MBRs.


From Lemma 1, the k-NN distance of the objects in A to the objects in B is at most MAXDIST(A, Bm). Hence, if MAXDIST(A, Bm) is smaller than MINDIST(A, B′) for another MBR B′ of S, then B′ cannot contain any of the k-NNs of the objects in A and can be excluded from the candidate set of A.
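For concreteness, the sketch below shows one common way to compute MINDIST and MAXDIST for axis-aligned MBRs under the Euclidean distance; representing an MBR as a pair of corner tuples is an assumption made for illustration, not the dissertation's code.

```python
def mindist(b1, b2):
    """Minimum distance between two axis-aligned MBRs, each given as a
    (lower_corner, upper_corner) pair of coordinate tuples."""
    gaps = []
    for lo1, hi1, lo2, hi2 in zip(b1[0], b1[1], b2[0], b2[1]):
        if hi1 < lo2:
            gaps.append(lo2 - hi1)
        elif hi2 < lo1:
            gaps.append(lo1 - hi2)
        else:
            gaps.append(0.0)          # the MBRs overlap along this dimension
    return sum(g * g for g in gaps) ** 0.5

def maxdist(b1, b2):
    """Maximum distance between two axis-aligned MBRs (farthest extent
    along each dimension)."""
    spans = [max(hi1 - lo2, hi2 - lo1)
             for lo1, hi1, lo2, hi2 in zip(b1[0], b1[1], b2[0], b2[1])]
    return sum(s * s for s in spans) ** 0.5
```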

A sample Priority Table for two databases R and S.

The search algorithms that use the PT are presented in Section 6.4. Step 3 takes O(C log C) time, where C is the number of candidate MBRs found at Step 2.

The candidate MBRs for all the MBRs in A are stored in a Priority Table (PT). Figure 6-2 depicts the PT constructed for the GNN(R, S, S′, k, t) query. Here, ri and si correspond to MBRs of R and S. Each row/column corresponds to a page of R/S on disk. The numbers at each cell show the priority of that column for that row. Let us assume that S′ = {s1, s3, s4, s5, s7} in this example. In this figure, each row and column corresponds to an MBR of R and S respectively. For simplicity, two assumptions are made without affecting the generality: 1) the objects in R and S are located sequentially on disk, and 2) each row and column of the PT (i.e., each MBR) corresponds to one disk page. The numbers at each row show the priority of the candidate MBRs in S for the corresponding MBR in R. For example, in row 1, the MBRs s1, s3 and s7 are in the candidate set of r1, such that s3 has the highest priority and s1 has the lowest priority. This is depicted in Figure 6-3. Here r1 ∈ R and S = {s1, ..., s7}. Objects in m1 are within MAXDIST distance from r1. If an MBR of S is not in the candidate set of an MBR in R, then the corresponding cell is unnumbered.


First row of the Priority Table.

Given a query, GNN(R, S, S′ ⊆ S, k, t), our search methods reduce the solution space by pruning the PT. The following two optimizations can be made to reduce the search space by inspecting the PT:

Column Filter. A column (page of S) whose candidate rows together contain fewer than t objects cannot contain a broad object, so it is removed from S′. For example, in Figure 6-2, s5 is in the candidate set of only r4. If the total number of objects in r4 is less than t, then s5 can be removed from S′ safely. The correctness of the Column Filter follows from the fact that an object in a column si can be in the k-NN set of only the objects in the rows that have si in their candidate set.

Row Filter. A row (page of R) that has no candidates in S′ can be ignored. For example, in Figure 6-2, rows r3 and r8 do not have any candidates in S′. Therefore, these rows can be omitted safely. If s5 is pruned from S′ due to the Column Filter, then the row r4 can also be ignored.
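The two filters can be applied repeatedly until the priority table stops shrinking. The sketch below is one possible rendering of that pruning loop, assuming the PT is represented as a mapping from R-pages to their candidate S-pages; it is an illustration, not the dissertation's implementation.

```python
def filter_priority_table(pt, s_prime, row_sizes, t):
    """Apply Column Filter and Row Filter until a fixed point.
    `pt` maps each R-page id to a list of candidate S-page ids,
    `row_sizes` gives the number of objects in each R-page, and
    `s_prime` is the set of S-pages still of interest."""
    changed = True
    while changed:
        changed = False
        # Column Filter: an S-page supported by < t objects of R is dropped.
        for s in list(s_prime):
            support = sum(row_sizes[r] for r, cands in pt.items() if s in cands)
            if support < t:
                s_prime.discard(s)
                changed = True
        # Row Filter: an R-page with no remaining candidates in S' is dropped.
        for r in [r for r, cands in pt.items() if not (set(cands) & s_prime)]:
            del pt[r]
            changed = True
    return pt, s_prime
```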


The clusters obtained for the PT in Figure 6-2 are C1 = {r1, r2, s1, s3, s4, s7}, C2 = {r3, r4, s2, s5, s6, s8}, C3 = {r5, s1, s2, s4, s8}, and C4 = {r6, r7, r8, s1, s2, s6}. The total cost of this step is linear in the number of candidate pages since each candidate page is visited only once.


One can show that the Traveling Salesman Problem (TSP) can be reduced to the problem of finding the best schedule for reading clusters. Intuitively, the proof is as follows. Each vertex of TSP maps to a cluster. Each edge weight w(i,j) between clusters Ci and Cj is computed as the number of overlapping pages between Ci and Cj. The best schedule on this graph is the Hamiltonian path that maximizes the sum of edge weights. Since TSP minimizes the sum of edge weights, the weight of each edge w(i,j) is updated as w′(i,j) = wmax − w(i,j), where wmax is the largest edge weight. This guarantees that the new edge weights are non-negative. Then a new node v is created and is connected to all nodes by zero-weight edges. The optimal schedule is the path with the smallest sum of edge weights which begins at vertex v and visits all nodes once.

A greedy heuristic is used to find a good schedule as follows: we start with an empty path; while there are unvisited vertices, we insert the next edge with the smallest weight into the path if it does not destroy the path; finally, the disconnected paths are attached randomly if there are any. An analysis of schedules for multi-page requests is given in [77].

The procedure used to process each cluster after it is fetched into the buffer is given in Algorithm 7. For each row in the cluster, the algorithm searches the k-NN of each object starting from the box with the highest priority (Steps 1 and 2). The results obtained at this step are used to prune the candidate set (Step 3). After the candidate set is pruned, Optimizations 1 and 2 are applied to the PT in order to further reduce the solution space.
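As a rough illustration of the greedy scheduling heuristic described above, the sketch below builds a path over clusters by repeatedly taking the cheapest transformed edge that keeps the selection a simple path. Representing clusters as sets of page identifiers, and all function names, are assumptions made for illustration only.

```python
def greedy_schedule(clusters):
    """Greedily select path edges between clusters, preferring pairs with
    large page overlap (i.e., small transformed weight wmax - overlap)."""
    n = len(clusters)
    overlap = {(i, j): len(clusters[i] & clusters[j])
               for i in range(n) for j in range(i + 1, n)}
    w_max = max(overlap.values(), default=0)
    edges = sorted(overlap, key=lambda e: w_max - overlap[e])  # smallest weight first
    degree = [0] * n
    comp = list(range(n))                    # union-find to avoid creating cycles

    def find(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]
            x = comp[x]
        return x

    path_edges = []
    for i, j in edges:
        if degree[i] < 2 and degree[j] < 2 and find(i) != find(j):
            path_edges.append((i, j))
            degree[i] += 1
            degree[j] += 1
            comp[find(i)] = find(j)
    return path_edges                        # leftover fragments can be joined arbitrarily
```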


(Algorithm 7 prunes the PT by removing all candidates s of r for which MINDIST(r, s) > dmax.)

The pseudocode for the FO algorithm is given in Algorithm 8. The algorithm splits the buffer equally between the two databases. This is because one candidate page is read per row, starting from the highest priority (Step 1). Therefore, the number of pages from each database in the buffer will be equal at all times if all the candidate pages are distinct. After searching each candidate page (Step 2), the PT is further pruned by eliminating the pages that are farther than the kth NN found so far for r (Step 3), and by using Optimizations 1 and 2 (Step 4).

For example, for the PT in Figure 6-2, let the buffer size be 6 pages; then FO reads {r1, r2, r3} and {s3, s4, s6} into the buffer. Assume that the third candidate of r1 is pruned at the end of this step. Next, {s7, s8} are read to replace {s4, s6}. Although it is the second candidate of r2, s3 is not read at this step since it is already in the buffer. Assume that the third candidate of r2 is pruned at the end of this step. Since none of the rows {r1, r2, r3} have any remaining candidates, FO does not need to read any more pages for these rows. Therefore, {r1, r2, r3} is replaced with {r4, r5, r6}, and the search continues recursively.


(Algorithm 9 initializes f, reads ⌊B/(f+1)⌋ pages ri from R and f pages s_ri from S for each ri, applies Optimizations 1 and 2 on the PT, and updates the value of f.)

The third method, Fetch Dynamic (FD), adaptively determines the value of f as follows. It starts by guessing the value of f. It then reads the first cluster using this value. As it finds the k-NNs of all the objects in the first cluster, it computes the optimal value of f for that cluster. It then uses this value of f to choose the next cluster. After processing each cluster, it iteratively updates f as the median of the number of pages needed for all of the rows processed so far. Note that the choice of the initial value of f has no impact on the performance after the first step, since f is updated immediately after every iteration. As more rows are processed in each iteration, f adapts to the query parameters and data distribution.
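The f-update rule used by FD (the median of the number of candidate pages needed by the rows processed so far) can be written directly; the function below is only an illustration of that rule, with hypothetical names.

```python
def update_f(pages_needed_per_row):
    """Return the new value of f as the median of the per-row page counts
    observed so far; f is kept at least 1."""
    counts = sorted(pages_needed_per_row)
    if not counts:
        return 1
    mid = len(counts) // 2
    if len(counts) % 2:                      # odd number of processed rows
        return max(1, counts[mid])
    return max(1, (counts[mid - 1] + counts[mid]) // 2)
```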


The FD algorithm is given in Algorithm 9. The algorithm first assigns an initial value for f (1 ≤ f ≤ candidate size); 20% of the average number of candidates of R is used as the initial guess. Let B denote the buffer size. While there are unprocessed rows, FD reads ⌊B/(f+1)⌋ pages (ri) from R and the f pages (s_ri) from S with the highest priority for each R page in the buffer (Step 2.a). Thus, if all the candidates are distinct, the buffer is filled with pages from R and S. Step 2.b processes each candidate page s_ri. The processed pages (ri's) are removed from the buffer at Step 2.c. The algorithm continues with Steps 2.a to 2.c until all the rows in the buffer are exhausted. Then Optimizations 1 and 2 are applied (Step 2.d). The value of f is updated at Step 2.e as the median of the number of candidates of the processed pages in R.

In the basic cluster-processing procedure (Algorithm 7), every point of a candidate MBR of S is compared against every point of the corresponding MBR of R. This incurs O(t²) comparisons if each MBR contains O(t) points. This cost is reduced in two ways. First, instead of expanding by dmax, different dimensions can be adaptively expanded by different amounts. Second, the t² comparisons are avoided by pruning unpromising points from S in a single pass. More formally, first all points in a candidate MBR s that are contained in the expanded MBR of r are found. Next, the distances between all those points and all points of r are computed. Let t′, t′ ≤ t, be the number of points in s that are contained in the expanded MBR of r. The CPU cost for the comparison of MBR pairs drops from O(t²) to O(t + t·t′). This is summarized as the third optimization, the Adaptive Filter.
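A minimal sketch of the Adaptive Filter comparison described above, assuming points are coordinate tuples and `expanded_mbr` holds the (adaptively) expanded bounds of r; all names are illustrative, and the sketch is not the dissertation's implementation.

```python
def adaptive_filter_compare(r_points, s_points, expanded_mbr, dist):
    """Scan the points of s once, keep only those inside the expanded MBR
    of r, and compare only the survivors against the points of r."""
    lo, hi = expanded_mbr
    inside = [q for q in s_points
              if all(l <= c <= h for c, l, h in zip(q, lo, hi))]   # one pass: O(t)
    # Only the t' surviving points of s are compared to the points of r: O(t * t').
    return [(p, q, dist(p, q)) for p in r_points for q in inside]
```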


Adaptive Filter example.

In Figure 6-4, m1 and m2 are the expanded MBRs for r without using and using the Adaptive Filter, respectively. s1, s2 ∈ S, and s1 is the MBR of the points {q1, q2, q3, q4}. The expanded MBR of r in the worst case is given by m1. When adaptive bounds are used, the expanded MBR m2 is obtained. In the former case, two MBRs s1 and s2 intersect with m1. Thus, 3 disk I/Os (r, s1, s2) and 32 comparisons are made. However, only s1 intersects with m2 in the latter case. Hence MBR s2, which does not have any point inside m2, can be pruned. This reduces the I/O cost to 2 page reads (r, s1) and the CPU cost to 16 comparisons. However, Optimization 3 states that a point in S is considered only if it is inside m2. Therefore, each point in s1 can be scanned once to find such points. These points are then compared to the points in r to update the k-NNs. Thus the CPU cost reduces to 12 comparisons (4 for scanning s1 and 8 for the comparison of the points in r with q1 and q2).

Optimization 3 is improved further by partitioning the MBR r along selected dimensions. Dimensions with high variances are selected for the partitioning. It starts from the dimension with the highest variance. The MBR is split along this dimension into two MBRs, such that each resulting MBR contains the same number of objects.


Partitioning example.

Each of these MBRs is then recursively partitioned along the dimension with the next highest variance.

Partitioning improves the performance in two ways. First, since each of the partitions is smaller than the original MBR, the pruning distance (dmax) along each dimension is reduced. This reduces the I/O cost. Second, without partitioning, an object in MBR s is compared to all the objects in r if the extended MBR of r contains it. However, with partitioning, an object in s is not compared to the objects in partitions whose extended MBRs do not contain it. Thus, the CPU cost is reduced by avoiding unnecessary comparisons. Note that as the number of partitions increases, the number of point-MBR comparisons increases. When the number of partitions becomes O(t) (i.e., the number of objects per MBR), the number of such comparisons becomes O(t²). Thus, partitioning becomes useless. In the experiments, the MBRs are partitioned along at most eight dimensions for the best performance.

In Figure 6-5, v1 and v2 are two partitions of MBR r, and m3 and m4 are their extended MBRs. s1 ∈ S is the MBR of the points {q1, q2, q3, q4}. The horizontal dimension is used to partition the MBR r into two partitions. v1 and v2 are the MBRs of these partitions, having m3 and m4 as extended MBRs. MBR s2, which does not have any points inside m2, can be pruned. Each point in s1 is scanned once to find the candidate points for the partitions v1 and v2; only q1 is present in the extended MBR of one of the partitions, so the remaining points of s1 are not compared to the points of r.
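The following sketch illustrates the variance-based partitioning just described: dimensions are ranked once by variance and the points of an MBR are split recursively into equal-sized halves along those dimensions. It is a simplified illustration under the assumption that objects are plain coordinate tuples; it is not the dissertation's implementation.

```python
import statistics

def partition_mbr(points, max_splits):
    """Split the points of an MBR into equal-sized partitions, splitting
    first along the dimension with the highest variance, then the next
    highest, and so on."""
    pts = list(points)
    dims = sorted(range(len(pts[0])),
                  key=lambda d: statistics.pvariance(p[d] for p in pts),
                  reverse=True)                       # highest variance first

    def split(group, level):
        if level >= max_splits or len(group) < 2:
            return [group]
        dim = dims[level % len(dims)]
        group = sorted(group, key=lambda p: p[dim])
        mid = len(group) // 2                         # equal-sized halves
        return split(group[:mid], level + 1) + split(group[mid:], level + 1)

    return split(pts, 0)
```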


The STR algorithm [76] is employed for packing the R-Tree built on the databases. Let N be the number of d-dimensional objects in a database, let B be the capacity of a node in the R-Tree, and let P = ⌈N/B⌉. STR sorts the objects according to the first dimension. Then the data is divided into S = ⌈P^(1/d)⌉ slices along this dimension, and each slice is packed recursively using the remaining dimensions. It is shown in [76] that for most types of data distributions STR ordering performs better than the space-filling-curve based Hilbert ordering [51].

The datasets used in the experiments include a two-dimensional image database and a protein database [24].

In addition to FA, FO, and FD, three of the existing methods are implemented: sequential search (SS), the R-tree-based NN method of Roussopoulos et al. (RT) [70], and Mux-Join [19]. To implement the buffer restrictions in RT, half of the available buffer is used for R and the other half for S. In order to adapt these methods to GNN, GNN(R, S, S′, k, t), a k-NN search is performed for each object in R. SS is included in the experiments, as it is better than many complicated NN methods for a broad set of data distributions [15].
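For illustration, the two-dimensional case of the STR packing referenced above can be sketched as follows (sort by x, cut into about √P vertical slices, sort each slice by y, and cut into pages); this is a generic sketch, not the dissertation's implementation.

```python
import math

def str_pack_2d(points, page_capacity):
    """Sort-Tile-Recursive packing for 2-D points: the returned groups
    become the leaf pages (MBRs) of the packed R-tree."""
    n = len(points)
    p = math.ceil(n / page_capacity)              # number of pages
    s = math.ceil(math.sqrt(p))                   # number of vertical slices
    slice_size = s * page_capacity                # points per slice
    pts = sorted(points, key=lambda pt: pt[0])    # sort by x
    pages = []
    for i in range(0, n, slice_size):
        slice_pts = sorted(pts[i:i + slice_size], key=lambda pt: pt[1])  # sort by y
        for j in range(0, len(slice_pts), page_capacity):
            pages.append(slice_pts[j:j + page_capacity])
    return pages
```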


Evaluating Optimizations 1, 2 and 3.

The source codes of GORDER [97] and RkNN [84] are obtained from their authors. However, at its current state, it is impossible to restrict the memory usage of GORDER to a desired amount. Therefore, GORDER is used in only one of the experiments, where it is possible.

In all the experiments, S′ = S is used unless otherwise stated. A page size of 4 kB is used in all the experiments. The experiments ran on an Intel Pentium 4 processor with a 2.8 GHz clock speed.

The first set of experiments evaluates Optimizations 1, 2 and 3 and the improvements in Section 6.6. The GNN query is performed by varying the size of S′ from 0.5% to 8% of S, by selecting pages of S randomly. In this experiment, FD used k = 10, t = 3,000, and a buffer size of 10% of the database size. The queries are run on the two-dimensional image database.

The CPU and I/O time of FD with four different settings of Optimizations 1, 2 and 3 (obtained by turning these optimizations on and off) on the two-dimensional image database for different sizes of S′ are given in Figure 6-6. According to the results, the main performance gain is obtained from Optimizations 2 and 3, yet there is a slight performance gain from Optimization 1.


Evaluation of partitioning.

Table 6-1. Comparison with an optimal solution. Buffer size (%): 5, 10, 20, 40.

The reason that Optimization 1 has a smaller impact can be explained as follows. t is only 0.3% of the total number of objects in R. Thus, Optimization 1 can eliminate a page of S only if it is in the candidate set of fewer than 3 pages of R. The impact of Optimization 1 is larger when the ratio of the average number of candidate pages to t is lower. This happens when t is large or k is small. Optimization 2 has a high impact when S′ is smaller. This is because fewer rows in the PT have candidates in S′ for small S′. Another way to obtain a high filtering rate from Optimization 2 is to reduce the average number of candidate pages per row by choosing a small value for k. Optimization 3 effectively reduces the CPU and I/O cost for different sizes of S′. We can also see that for higher percentages of S, the impact of this optimization remains constant and is independent of the size of S′.


Comparison of the proposed methods on the two-dimensional image database for different buffer sizes.

This can be understood from the fact that for a fixed value of k and at higher percentages of S, every MBR r ∈ R has the same number of candidate MBRs from S. This results in a constant reduction in CPU and I/O costs.

The performance gains on top of Optimizations 1, 2, and 3 (the unpartitioned algorithm) obtained by partitioning the MBRs and by using the STR-based packing algorithm are compared in Figure 6-7. Here, the CPU and I/O time of FD are shown with three different settings, Unpartitioned, Partitioned, and Packed (along with partitioning), on the two-dimensional image database for different sizes of S′. Packing is applied along with partitioning. Partitioning reduces the I/O cost by up to a factor of 3 and the CPU cost by orders of magnitude. The tighter bounds of the extended MBRs of the partitions result in a reduction of the pruning distance. This explains the I/O and CPU performance gains of the partitioned algorithm. Packing utilizes the distribution of the data and groups similar objects in MBRs that have a common parent, giving a better organization of the R*-Tree index structure. This results in a lower value for the parameter f in FD and hence better performance gains. Packing reduces the I/O cost by up to 10 times and the CPU cost by orders of magnitude compared to the unpartitioned algorithm. It outperforms the partitioned algorithm by up to a factor of 2 and 6 in I/O and CPU costs, respectively. From here on, all optimizations are used in all of the proposed methods.


Comparison of the proposed methods on the two-dimensional image database for different values of k.

Scheduling the pages is known as the paging problem [63]. Chan [25] proposed heuristic-based O((Rp + Sp)²) algorithms (Rp and Sp are the numbers of pages in the two databases) for index-based joins. For large databases, however, these heuristics are not efficient. An online scheduling algorithm can be evaluated using competitive analysis [1]. In competitive analysis, an online algorithm is compared with an optimal off-line algorithm which knows all the candidate pages in advance. An algorithm is c-competitive if, for all sequences of page requests, C_A ≤ c·C + b, where C_A is the cost of the given algorithm, C is the cost of the off-line algorithm, b is a constant, and c is the competitive ratio.

The performance of the online methods is compared with an off-line version, named Oracle. For each MBR r ∈ R, Oracle provides the set of MBRs from S such that every MBR in this set contains at least one k-NN of at least one object in r. Then the number of I/Os of Oracle is optimized using the heuristic discussed in Section 6.4.


Table 6-1 compares the performance of Oracle with the proposed methods. Since each database has 5,064 pages, the lower bound on the number of disk I/Os is computed as 10,128 (5,064 + 5,064). The competitive ratio of FA is smallest for large buffer sizes (1.008 for a 40% buffer) and that of FO is smallest for small buffer sizes (1.3 for a 5% buffer). FD has a small competitive ratio across buffer sizes (from 1.5 for a 5% buffer to 1.05 for a 40% buffer). It can be concluded that our methods perform very close to the off-line method.

Table 6-2. Comparison with GORDER. Grid size: 1000, 500, 200, 100.

Table 6-3. Comparison with RkNN.

The I/O time and the running time of our methods are given in Figure 6-8. For lower buffer sizes, FA retrieves all the candidate MBRs for every row and hence the I/O cost takes up most of the total time. We can observe this from the performance of FA at a buffer size of 5%, which is dominated by the I/O cost. As the buffer size increases, the cost of all three strategies drops since more pages can be kept in the buffer at a time.


Comparison with other methods on the two-dimensional image database for different buffer sizes.

For small buffer sizes FO has the lowest cost since it does not load unnecessary candidates. As the buffer size increases, FA has the lowest cost since it keeps almost the entire S in the buffer. However, in all these experiments, the cost of FD is either the lowest or very close to the lower of FA and FO. This means that FD can adapt to the available buffer size.

The I/O and the running times are given in Figure 6-9. The costs of all these methods increase as k increases. For different values of k, FO has the lowest cost and FA has the highest cost, due to the small buffer size (10%). Even when it does not have the lowest cost, FD is very close to FO. This means that FD can adapt to the parameter k.


Comparison with other methods on the protein database for different buffer sizes.

The experimental results comparing FD with well-known methods for the special cases are presented next. Table 6-2 gives the memory usage and running times (seconds) of GORDER on the image database with varying grid sizes. FD runs in only 11.03 seconds for the same database using a 20% buffer.


Comparison of SS, RT, Mux and FD on the two-dimensional image database for different values of k.

The running times of GORDER varied from 10 to 13 seconds (see Table 6-2). According to these experiments, FD runs an order of magnitude faster than GORDER even when it uses a much smaller buffer. It is impossible to reduce the actual memory usage of GORDER to 20% in its current implementation. Therefore, in order to be fair, it is not included in the remaining experiments.

The I/O and the running times of SS, RT, Mux-Index [19] and FD for different buffer sizes on the two-dimensional image and protein databases are given in Figures 6-10 and 6-11. k = 10 and t = 100 are used. FD is the fastest method in all settings. It can be seen that for small buffer sizes RT is dominated by the I/O cost. As the buffer size increases, the CPU cost of RT dominates. Sequential scan is dominated by the CPU cost in all the experiments. The I/O cost of FD is a fraction of that of RT. FD also reduces the CPU cost aggressively through Optimizations 1 to 3 and partitioning. In all the experiments, the total time of FD is less than the I/O time of RT or SS alone. Mux-Index is dominated by I/O costs in all experiments. This is because for each block in R it fills the buffer with blocks from S. Because of the nature of GNN queries, one needs to load pages multiple times while working with a limited amount of memory, independent of the method used, naive (sequential scan) or more sophisticated (RT and Mux-Index).


Comparison of SS, RT, Mux and FD on the protein database for different values of k.

FD performs only the necessary leaf comparisons and uses a near-optimal buffering schedule, thus reducing both the CPU and I/O costs effectively.

The I/O and the running times are given in Figures 6-12 and 6-13. The cost of SS is almost the same for all values of k. It increases slightly as k increases due to the cost of maintaining the top k closest objects. The costs of RT, Mux and FD increase as k increases since their pruning power drops for large values of k. The running times of RT, Mux and FD do not exceed that of SS as k increases. FD runs significantly faster than the others. Depending on the value of k, FD runs orders of magnitude faster than RT, SS and Mux. The I/O cost increases much more slowly for FD. This is because FD adapts to different parameter settings quickly to minimize the amount of disk reads. Table 6-3 presents the running times of FD and RkNN for 100 query points.


Comparison with other methods on the two-dimensional image database for varying database sizes.

While the running time of RkNN increases at a faster rate and is not scalable for higher values of k, the running times of FD for the same query set, including the time taken for the creation of the priority table for each k, are almost constant and are orders of magnitude faster than RkNN.

The I/O and the running times are given in Figure 6-14. As R and S grow, the running time of FD increases almost linearly. This is because when both databases are doubled, the average number of candidate pages per row in the PT stays almost the same. On the other hand, the total running time of SS increases quadratically since it has to compare all pairs of data points. The running time of RT is dominated by the I/O cost and increases faster than that of FD and slower than that of SS. Like SS, the running time of Mux increases quadratically since it fills the buffer with blocks from S and is dominated by I/O costs.


Comparison with other methods on the image database for varying numbers of dimensions.

Thus, the speedup of FD over SS, Mux and RT increases as the database size increases. This means that the proposed method scales better with increasing database size.

The I/O and running times for varying numbers of dimensions are given in Figure 6-15. As the number of dimensions increases, the running time of SS increases linearly. On the other hand, the running times of RT and Mux increase faster. This is also known as the dimensionality curse. For all the methods, the CPU time increases with the increase in dimension and is significantly larger for 16 dimensions. However, even at 16 dimensions, FD is 1.3 times faster than the sequential scan, up to 3.5 times faster than RT, and up to 1.2 times faster than Mux-Index.


Similarity search in database systems is becoming an increasingly important task in modern application domains such as artificial intelligence, computational biology, pattern recognition and data mining. With the evolution of information, applications with new data types such as text, images, videos, audio, DNA and protein sequences have begun to appear. Despite extensive research and the development of a plethora of index structures, similarity search is still too costly in many application domains, especially when measuring the similarity between a pair of objects is expensive.

In this dissertation, new indexing techniques to improve the performance of similarity search are proposed. Given a metric database and a similarity measure, the queries we consider are classified under two categories: similarity search and similarity join queries. Several novel search and indexing strategies are presented for each category.

Chapter 4 considered the problem of similarity search in static databases with complex similarity measures. A family of reference-based indexing techniques were developed. Two novel strategies were proposed for selecting references. Unlike existing methods, these methods select references that represent all parts of the database. The first one, Maximum Variance (MV), maximizes the spread of the database around the references. The second one, Maximum Pruning (MP), optimizes pruning based on a set of sample queries. Sampling methods were used to improve the running times of the index construction. A novel approach to assign the selected references to database objects was also proposed. The method maps a different set of references to each database object dynamically rather than using the same references for all objects. According to our experiments, our methods perform much better than existing strategies. Among our methods, Maximum Pruning with dynamic assignment of reference sequences performed the best. The total cost (number of sequence comparisons) of our methods was up to 30 times less than that of its competitors.

Chapter 5 considered the problem of similarity search in dynamic databases with complex similarity measures. For dynamic databases with frequent updates, two incremental versions of the reference selection algorithm, the Single Pass (SP) and the Three Pass (TP), were proposed.


Experiments suggested that, depending upon the application characteristics, either MP or one of the incremental methods may be superior. For distance metrics of intermediate computational cost such as the ED, the incremental methods such as SP and TP seemed preferable. On the other hand, for distance metrics of high computational cost such as EMD, applying MP is preferred. If an incremental algorithm is selected, choosing from among SP and TP is somewhat difficult. The latter generally had superior query performance, though not over the DNA dataset with a random ordering. The former had better insert processing performance. A rule of thumb might be that if inserts are more common, choose SP; if queries are more common, choose TP.

Chapter 6 considered the problem of similarity join queries in large databases. A new database primitive called the Generalized Nearest Neighbor (GNN) was proposed. GNN queries can answer a much broader range of problems than the k-Nearest Neighbor query and its variants, the Reverse Nearest Neighbor and the All Nearest Neighbor queries. Based on the available memory and the number of nearest neighbors, either CPU or I/O time can dominate the computations. Thus, one has to optimize both I/O and CPU cost for this problem.

Three methods were proposed to solve GNN queries. These methods arrange the given two databases into pages and compute a priority table for each page. The priority table ranks the candidate pages of one database based on their distances to the pages from the other database. The first algorithm, FA, uses a pessimistic approach; it fetches as many candidate pages as possible into the available buffer. The second algorithm, FO, uses an optimistic approach; it fetches one candidate page at a time. The third algorithm, FD, dynamically computes the number of pages that needs to be fetched by analyzing past experience. Three optimizations, the column filter, the row filter and the adaptive filter, were proposed to reduce the solution space of the priority table. Packing and partitioning strategies which provide significant performance gains were also developed.


According to the experiments, FA is the best method when the buffer size is large and FO is the best method when the buffer size is small. FD was the fastest method in most of the parameter settings. Even when it was not the fastest, the running time of FD was very close to that of the faster of FA and FO. FD was significantly faster compared to sequential scan and the standard R-tree based branch-and-bound k-NN solution to the GNN problem.


[1] Albers, S.: Competitive Online Algorithms. Tech. Rep. LS-96-2, BRICS (1996)
[2] Altschul, S., Gish, W., Miller, W., Meyers, E.W., Lipman, D.J.: Basic Local Alignment Search Tool. Journal of Molecular Biology 215(3), 403 (1990)
[3] Atkinson, M.P., Bancilhon, F., DeWitt, D.J., Dittrich, K.R., Maier, D., Zdonik, S.B.: The Object-Oriented Database System Manifesto. In: SIGMOD Conference, p. 395 (1990)
[4] Baeza-Yates, R., Perleberg, C.: Fast and Practical Approximate String Matching. In: CPM, pp. 185 (1992)
[5] Baeza-Yates, R.A., Cunto, W., Manber, U., Wu, S.: Proximity Matching Using Fixed-Queries Trees. In: CPM '94: Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching, pp. 198. Springer-Verlag, London, UK (1994)
[6] Baeza-Yates, R.A., Navarro, G.: Faster Approximate String Matching. Algorithmica 23(2), 127 (1999)
[7] Bairoch, A., Boeckmann, B., Ferro, S., Gasteiger, E.: Swiss-Prot: Juggling between evolution and stability. Briefings in Bioinformatics 1, 39 (2004)
[8] Bayer, R., McCreight, E.M.: Organization and Maintenance of Large Ordered Indices. Acta Informatica 1, 173 (1972)
[9] Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B.: The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. In: International Conference on Management of Data (SIGMOD), pp. 322 (1990)
[10] Benson, D., Karsch-Mizrachi, I., Lipman, D., Ostell, J., Rapp, B., Wheeler, D.: GenBank. Nucleic Acids Research 28(1), 15 (2000)
[11] Bentley, J.L.: Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), 509 (1975)
[12] Bentley, J.L.: Multidimensional binary search trees in database applications. IEEE Transactions on Software Engineering 5, 333 (1979)
[13] Berchtold, S., Ertl, B., Keim, D., Kriegel, H.P., Seidl, T.: Fast Nearest Neighbor Search in High-dimensional Space. In: International Conference on Data Engineering (ICDE), pp. 209 (1998)
[14] Berchtold, S., Keim, D.A., Kriegel, H.P.: The X-tree: An Index Structure for High-Dimensional Data. In: VLDB '96: Proceedings of the 22nd International Conference on Very Large Data Bases, pp. 28. Morgan Kaufmann, Mumbai, India (1996)
[15] Beyer, K.S., Goldstein, J., Ramakrishnan, R., Shaft, U.: When Is Nearest Neighbor Meaningful? In: International Conference on Database Theory (ICDT), pp. 217 (1999)


[16] Bhattacharya, A., Ljosa, V., Pan, J.Y., Verardo, M.R., Yang, H., Faloutsos, C., Singh, A.K.: ViVo: Visual Vocabulary Construction for Mining Biomedical Images. In: ICDM, pp. 50 (2005)
[17] Bially, T.: Space-Filling Curves: Their generation and their application to bandwidth reduction. IEEE Transactions on Information Theory 15(6), 658 (1969)
[18] Böhm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys 33(3), 322 (2001)
[19] Böhm, C., Krebs, F.: The k-Nearest Neighbour Join: Turbo Charging the KDD Process. Knowledge and Information Systems (KAIS) 6(6) (2004)
[20] Bozkaya, T., Ozsoyoglu, M.: Distance-based indexing for high-dimensional metric spaces. In: ACM SIGMOD, pp. 357 (1997)
[21] Brisaboa, N.R., Farina, A., Pedreira, O., Reyes, N.: Similarity Search Using Sparse Pivots for Efficient Multimedia Information Retrieval. In: ISM '06: Proceedings of the Eighth IEEE International Symposium on Multimedia (2006)
[22] Burkhard, W.A., Keller, R.M.: Some approaches to best-match file searching. Commun. ACM 16(4), 230 (1973)
[23] Bustos, B., Navarro, G., Chávez, E.: Pivot selection techniques for proximity searching in metric spaces. Pattern Recogn. Lett. 24(14), 2357 (2003)
[24] Camoglu, O., Kahveci, T., Singh, A.K.: Towards Index-based Similarity Search for Protein Structure Databases. Journal of Bioinformatics and Computational Biology (JBCB) 2(1), 99 (2004)
[25] Chan, C.Y., Ooi, B.C.: Efficient Scheduling of Page Access in Index-Based Join Processing. IEEE Transactions on Knowledge and Data Engineering (TKDE) 9(6), 1005 (1997)
[26] Chávez, E., Marroquin, J.L., Baeza-Yates, R.: Spaghettis: An Array Based Algorithm for Similarity Queries in Metric Spaces. In: SPIRE '99: Proceedings of the String Processing and Information Retrieval Symposium & International Workshop on Groupware, p. 38. IEEE Computer Society, Washington, DC, USA (1999)
[27] Chávez, E., Marroquin, J.L., Navarro, G.: Fixed Queries Array: A Fast and Economical Data Structure for Proximity Searching. Multimedia Tools Appl. 14(2), 113 (2001)
[28] Chávez, E., Navarro, G., Baeza-Yates, R., Marroquin, J.L.: Searching in metric spaces. ACM Comput. Surv. 33(3), 273 (2001)
[29] Ciaccia, P., Patella, M., Zezula, P.: M-Tree: An Efficient Access Method for Similarity Search in Metric Spaces. In: The VLDB Journal, pp. 426 (1997)


[30] Codd, E.F.: A relational model of data for large shared data banks. Communications of the ACM 26(1), 64 (1983)
[31] Dantzig, G.B.: Application of the simplex method to a transportation problem. In: Activity Analysis of Production and Allocation, pp. 359-373 (1951)
[32] Delcher, A., Kasif, S., Fleischmann, R., Peterson, J., White, O., Salzberg, S.: Alignment of Whole Genomes. Nucleic Acids Research 27(11), 2369 (1999)
[33] DeWitt, D.J., Katz, R.H., Olken, F., Shapiro, L.D., Stonebraker, M.R., Wood, D.: Implementation techniques for main memory database systems. In: SIGMOD '84: Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, pp. 1 (1984)
[34] Ferragina, P., Grossi, R.: The String B-tree: A New Data Structure for String Search in External Memory and Its Applications. JACM 46(2), 236 (1999)
[35] Filho, R.F.S., Traina, A.J.M., Traina, C., Faloutsos, C.: Similarity Search without Tears: The OMNI Family of All-purpose Access Methods. In: ICDE, pp. 623 (2001)
[36] Finkel, R.A., Bentley, J.L.: Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Informatica 4, 1 (1974)
[37] Gaede, V., Günther, O.: Multidimensional Access Methods. ACM Computing Surveys 30(2), 170 (1998)
[38] Gao, D., Jensen, S., Snodgrass, T., Soo, D.: Join operations in temporal databases. The VLDB Journal 14(1), 2 (2005)
[39] Giladi, E., Walker, M., Wang, J., Volkmuth, W.: SST: An Algorithm for Finding Near-Exact Sequence Matches in Time Proportional to the Logarithm of the Database Size. Bioinformatics 18(6), 873 (2002)
[40] Gravano, L., Ipeirotis, P., Jagadish, H., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491 (2001)
[41] Gumbel, E.J.: Statistics of Extremes. Columbia University Press, New York, NY, USA (1958)
[42] Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, 1st edn. Cambridge University Press (1997)
[43] Guttman, A.: R-Trees: A Dynamic Index Structure for Spatial Searching. In: SIGMOD, pp. 47. ACM Press (1984)
[44] Hjaltason, G., Samet, H.: Ranking in Spatial Databases. In: Symposium on Spatial Databases, pp. 83. Portland, Maine (1995)
[45] Hjaltason, G.R., Samet, H.: Index-driven similarity search in metric spaces. ACM Trans. Database Syst. 28(4), 517 (2003)


[46] Huang, X., Madan, A.: CAP3: A DNA Sequence Assembly Program. Genome Research 9(9), 868 (1999)
[47] Hunt, E., Atkinson, M.P., Irving, R.W.: A Database Index to Large Biological Sequences. In: VLDB, pp. 139. Rome, Italy (2001)
[48] Jagadish, H.V., Ooi, B.C., Tan, K.L., Yu, C., Zhang, R.: iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst. 30(2), 364 (2005)
[49] Kahveci, T., Ljosa, V., Singh, A.: Speeding up whole-genome alignment by indexing frequency vectors. Bioinformatics (2004). To appear
[50] Kahveci, T., Singh, A.: An Efficient Index Structure for String Databases. In: VLDB, pp. 351. Rome, Italy (2001)
[51] Kamel, I., Faloutsos, C.: Hilbert R-tree: An Improved R-tree using Fractals. In: VLDB, pp. 500 (1994)
[52] Kärkkäinen, J.: Suffix Cactus: A Cross between Suffix Tree and Suffix Array. In: CPM (1995)
[53] Katayama, N., Satoh, S.: The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. In: SIGMOD Conference, pp. 369 (1997)
[54] Knuth, D.E.: The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA (1998)
[55] Korn, F., Muthukrishnan, S.: Influence sets based on reverse nearest neighbor queries. In: International Conference on Management of Data (SIGMOD), pp. 201 (2000)
[56] Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., Protopapas, Z.: Fast Nearest Neighbor Search in Medical Databases. In: International Conference on Very Large Databases (VLDB), pp. 215. India (1996)
[57] Leuken, R.H.V., Veltkamp, R.C., Typke, R.: Selecting vantage objects for similarity indexing. In: ICPR '06: Proceedings of the 18th International Conference on Pattern Recognition, pp. 453. IEEE Computer Society, Washington, DC, USA (2006)
[58] Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Tech. Rep. 8 (1966)
[59] Ljosa, V., Bhattacharya, A., Singh, A.K.: Indexing Spatially Sensitive Distance Measures Using Multi-resolution Lower Bounds. In: EDBT, pp. 865 (2006)
[60] Manber, U., Myers, E.: Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing 22(5), 935 (1993)
[61] McCreight, E.: A Space-Economical Suffix Tree Construction Algorithm. JACM 23(2), 262 (1976)


[62] Meek, C., Patel, J.M., Kasetty, S.: OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences. In: VLDB (2003)
[63] Merrett, T.H., Kambayashi, Y., Yasuura, H.: Scheduling of Page-Fetches in Join Operations. In: International Conference on Very Large Databases (VLDB), pp. 488 (1981)
[64] Micó, M.L., Oncina, J., Vidal, E.: A new version of the Nearest-Neighbour Approximating and Eliminating Search Algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recognition Letters 15, 9 (1994)
[65] Myers, E.W.: An O(ND) difference algorithm and its variations. Algorithmica 1(2), 251 (1986)
[66] Navarro, G., Baeza-Yates, R.: A Hybrid Indexing Method for Approximate String Matching. J. Discret. Algorithms 1(1), 205 (2000)
[67] Needleman, S.B., Wunsch, C.D.: A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. JMB 48, 443 (1970)
[68] Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B.: The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. In: SIGMOD Conference, pp. 322 (1990)
[69] Pearson, W., Lipman, D.: Improved Tools for Biological Sequence Comparison. PNAS 85, 2444 (1988)
[70] Roussopoulos, N., Kelley, S., Vincent, F.: Nearest Neighbor Queries. In: International Conference on Management of Data (SIGMOD). San Jose, CA (1995)
[71] Rubner, Y., Tomasi, C., Guibas, L.J.: A Metric for Distributions with Applications to Image Databases. In: ICCV: Proceedings of the Sixth International Conference on Computer Vision, p. 59. IEEE Computer Society, Washington, DC, USA (1998)
[72] Ruiz, E.V.: An algorithm for finding nearest neighbours in (approximately) constant average time. Pattern Recogn. Lett. 4(3), 145 (1986)
[73] Sagan, H.: Space-Filling Curves. Springer Verlag, New York, NY, USA (1994)
[74] Samet, H.: The Quadtree and Related Hierarchical Data Structures. ACM Computing Surveys 16(2), 187 (1984)
[75] Samet, H.: The Design and Analysis of Spatial Data Structures. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (1990)
[76] Leutenegger, S., Lopez, M., Edgington, J.: STR: A Simple and Efficient Algorithm for R-Tree Packing. In: International Conference on Data Engineering (ICDE), pp. 497 (1997)
[77] Seeger, B.: An analysis of schedules for performing multi-page requests. Information Systems 21(5), 387 (1996)


[78] Seidl, T., Kriegel, H.: Optimal Multi-Step k-Nearest Neighbor Search. In: International Conference on Management of Data (SIGMOD) (1998)
[79] Sellis, T.K., Roussopoulos, N., Faloutsos, C.: The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. In: VLDB '87: Proceedings of the 13th International Conference on Very Large Data Bases, pp. 507. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1987)
[80] Skopal, T., Pokorny, J., Snasel, V.: PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases. In: ADBIS (Local Proceedings) (2004)
[81] Smith, T., Waterman, M.: Identification of Common Molecular Subsequences. Journal of Molecular Biology (1981)
[82] Stanoi, I., Riedewald, M., Agrawal, D., Abbadi, A.: Discovery of Influence Sets in Frequently Updated Databases. In: International Conference on Very Large Databases (VLDB), pp. 99 (2001)
[83] Stonebraker, M., Moore, D.: Object-Relational DBMSs: The Next Great Wave. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1995)
[84] Tao, Y., Papadias, D., Lian, X.: Reverse kNN Search in Arbitrary Dimensionality. In: International Conference on Very Large Databases (VLDB) (2004)
[85] Traina, C., Traina, A.J.M., Filho, R.F.S., Faloutsos, C.: How to improve the pruning ability of dynamic metric access methods. In: CIKM, pp. 219 (2002)
[86] Traina, C., Traina, A.J.M., Seeger, B., Faloutsos, C.: Slim-Trees: High Performance Metric Trees Minimizing Overlap Between Nodes. In: EDBT, pp. 51 (2000)
[87] Ukkonen, E.: Algorithms for Approximate String Matching. Information and Control 64, 100 (1985)
[88] Ukkonen, E.: On-line Construction of Suffix-trees. Algorithmica 14, 249 (1995)
[89] Venkateswaran, J., Kahveci, T., Camoglu, O.: Finding Data Broadness Via Generalized Nearest Neighbors. In: EDBT, pp. 645 (2006)
[90] Venkateswaran, J., Kahveci, T., Jermaine, C.M., Lachwani, D.: Reference-based indexing for metric spaces with costly distance measures. VLDB Journal (2007)
[91] Venkateswaran, J., Lachwani, D., Kahveci, T., Jermaine, C.M.: Reference-based indexing of sequence databases. In: VLDB, pp. 906 (2006)
[92] Vieira, M.R., Traina, C., Chino, F.J.T., Traina, A.J.M.: DBM-Tree: A Dynamic Metric Access Method Sensitive to Local Density Data. In: SBBD, pp. 163 (2004)
[93] Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37 (1985)


[94] Vleugels, J., Veltkamp, R.: Efficient image retrieval through vantage objects. In: VISUAL, pp. 575. Springer (1999)
[95] Weiner, P.: Linear Pattern Matching Algorithms. In: IEEE Symposium on Switching and Automata Theory, pp. 1 (1973)
[96] White, D.A., Jain, R.: Similarity Indexing with the SS-tree. In: ICDE, pp. 516 (1996)
[97] Xia, C., Lu, H., Ooi, B., Hu, J.: GORDER: An Efficient Method for KNN Join Processing. In: International Conference on Very Large Databases (VLDB) (2004)
[98] Yang, C., Lin, K.I.: An Index Structure for Efficient Reverse Nearest Neighbor Queries. In: International Conference on Data Engineering (ICDE), pp. 485 (2001)
[99] Yianilos, P.: Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces. In: SODA, pp. 311 (1993)


Jayendra Gnanaskandan Venkateswaran was born and brought up in Chennai, India. He received his Bachelor of Engineering from Coimbatore Institute of Technology (CIT), one of the most prestigious and oldest engineering colleges in India, in 2001. Jayendra majored in computer engineering and obtained a distinguished record. Jayendra received his Master of Science from the University of Missouri-Rolla in May 2003. He majored in computer science. His master's thesis was on packing methods for the SR-Tree index structure. Along with his master's advisor, Dr. S. R. Subramanya, he has published his work at the 21st Annual ACM Symposium on Applied Computing (SAC), Dijon, France, April 2006. Jayendra joined the Doctor of Philosophy (Ph.D.) program in Computer and Information Science and Engineering at the University of Florida, Gainesville, in the fall of 2003. While pursuing his graduate degree, Jayendra worked as a graduate research assistant. He received his Doctor of Philosophy in Computer Engineering from the University of Florida in December 2007. Along with his advisors Dr. Tamer Kahveci and Dr. Christopher Jermaine, he has published his research at the top database conferences Very Large Data Bases (VLDB) and Extended Database Technology (EDBT) and in the VLDB Journal. Jayendra's research focus is in the area of database searches, with special interests in database indexing and querying, text mining, algorithms and bioinformatics.