
A Study of Joint Classifier and Feature Optimization

Permanent Link: http://ufdc.ufl.edu/UFE0021627/00001

Material Information

Title: A Study of Joint Classifier and Feature Optimization: Theory and Analysis
Physical Description: 1 online resource (47 p.)
Language: english
Creator: Mao, Fan
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2007

Subjects

Subjects / Keywords: ard, feature, irrelevancy, jcfo, redundancy, selection, sparsity
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, M.S.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Feature selection is a major focus in modern-day processing of high-dimensional datasets that have many irrelevant and redundant features and variables, such as text mining and gene expression array analysis. An ideal feature selection algorithm extracts the most representative features while eliminating the non-informative ones, which both identifies the significant features and improves computational efficiency. Furthermore, feature selection is also important for avoiding over-fitting and reducing the generalization error in regression estimation, where feature sparsity is preferred. In the first half of this thesis, we provide a thorough analysis of an existing state-of-the-art Bayesian feature selection method by presenting its theoretical background, implementation details and computational complexity. In the second half we analyze its performance on several specific experiments we design with real and synthetic datasets, pointing out certain limitations in practice and finally proposing a modification.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Fan Mao.
Thesis: Thesis (M.S.)--University of Florida, 2007.
Local: Adviser: Gader, Paul D.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2007
System ID: UFE0021627:00001

A STUDY OF JOINT CLASSIFIER AND FEATURE OPTIMIZATION:
THEORY AND ANALYSIS





















By

FAN MAO


A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2007





























© 2007 Fan Mao


































To my Mom,
for her enormous patience and unfailing love.









ACKNOWLEDGMENTS

I would like to thank my thesis adviser, Dr. Paul Gader, for his encouragement and valuable advice, both on the general research direction and on the specific experimental details. As a beginner, I sincerely appreciate the opportunity to get direct help from an expert in this field. I also thank Dr. Arunava Banerjee and Dr. Joseph Wilson for being my committee members and reading my thesis.

Special thanks go to Xuping Zhang. We had a lot of interesting discussions in the last half year, many of which greatly inspired me and turned out to bear on my experiment designs.









TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

LIST OF SYMBOLS

ABSTRACT

CHAPTER

1 VARIABLE SELECTION VIA JCFO

    1.1 Introduction
    1.2 Origin
        1.2.1 LASSO
        1.2.2 The Prior of β
    1.3 JCFO
    1.4 EM Algorithm for JCFO
    1.5 Complexity Analysis
    1.6 Alternative Approaches
        1.6.1 Sparse SVM
        1.6.2 Relevance Vector Machine

2 ANALYSIS OF JCFO

    2.1 Brief Introduction
    2.2 Irrelevancy
        2.2.1 Uniformly Distributed Features
        2.2.2 Non-informative Noise Features
    2.3 Redundancy
        2.3.1 Oracle Features
        2.3.2 Duplicate Features
        2.3.3 Similar Features
        2.3.4 Gradually More Informative Features
        2.3.5 Indispensable Features
    2.4 Nonlinearly Separable Datasets
    2.5 Discussion and Modification
    2.6 Conclusion

APPENDIX

A SOME IMPORTANT DERIVATIONS OF EQUATIONS

B THE PSEUDOCODES FOR EXPERIMENTS DESIGN

LIST OF REFERENCES

BIOGRAPHICAL SKETCH




























































LIST OF TABLES

1.1. Comparisons of computation time of θ and other parts

2.1. Feature divergence of HH

2.2. Feature divergence of Crabs

2.3. Uniformly distributed feature in Crabs

2.4. Uniformly distributed feature in HH

2.5. Non-informative Noise in HH and Crabs

2.6. Oracle feature

2.7. Duplicate feature weights on HH

2.8. Duplicate feature weights on Crabs

2.9. Percentage of either of two identical features' weights being set to zero in HH

2.10. Percentage of either of two identical features' weights being set to zero in Crabs

2.11. Five features

2.12. Ten features

2.13. Fifteen features

2.14. Comparisons of JCFO, non-kernelized ARD and kernelized ARD

2.15. Comparisons of JCFO and kernelized ARD with an added non-informative irrelevant feature

2.16. Comparisons of JCFO and kernelized ARD with an added similar redundant feature

2.17. Comparisons of the three ARD methods





















LIST OF FIGURES


1.1. Gaussian (dotted) vs. Laplacian (solid) prior

1.2. Logarithm of gamma distribution. From top down: a=1e-2, b=1e2; a=1e-3, b=1e3; a=1e-4, b=1e4

2.1. Two Gaussian classes that can be classified by either the x or the y axis

2.2. Weights assigned by JCFO (from top to bottom: x, y and z axes)

2.3. Weights assigned by ARD (from top to bottom: x, y and z axes)

2.4. Two Gaussians that can only be classified by both x and y axes

2.5. Weights assigned by JCFO (from top to bottom: x, y and z axes)

2.6. Weights assigned by ARD (from top to bottom: x, y and z axes)

2.7. Cross data

2.8. Ellipse data









LIST OF SYMBOLS

Symbols          Meanings

x^i              the ith input object vector

x^i_j            the jth element of the ith input object vector

β                weight vector

||·||_p          p-norm

I                identity matrix

0                zero vector

N(v | 0, 1)      zero-mean, unit-variance normal distribution evaluated at v

sgn(·)           sign function

H                design matrix

(·)_+            positive part operator

Φ(z)             Gaussian CDF, ∫_{-∞}^{z} N(x | 0, 1) dx

E[z]             the expectation of z

β^(t)            the estimate of β in the tth iteration

O(·)             big-O notation for complexity

∘                element-wise (Hadamard) matrix multiplication









LIST OF ABBREVIATIONS

JCFO: Joint Classifier and Feature Optimization

LASSO: Least Absolute Shrinkage and Selection Operator

RBF: Radial Basis Function

SVM: Support Vector Machine

RVM: Relevance Vector Machine

ARD: Automatic Relevance Determination










Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science

A STUDY OF JOINT CLASSIFIER AND FEATURE OPTIMIZATION:
THEORY AND ANALYSIS

By

Fan Mao

December 2007

Chair: Paul Gader
Major: Computer Engineering

Feature selection is a major focus in modern-day processing of high-dimensional datasets that have many irrelevant and redundant features and variables, such as text mining and gene expression array analysis. An ideal feature selection algorithm extracts the most representative features while eliminating the non-informative ones, which both identifies the significant features and improves computational efficiency. Furthermore, feature selection is also important for avoiding over-fitting and reducing the generalization error in regression estimation, where feature sparsity is preferred. In the first half of this thesis, we provide a thorough analysis of an existing state-of-the-art Bayesian feature selection method by presenting its theoretical background, implementation details and computational complexity. In the second half we analyze its performance on several specific experiments we design with real and synthetic datasets, pointing out certain limitations in practice and finally proposing a modification.









CHAPTER 1
VARIABLE SELECTION VIA JCFO

1.1 Introduction

Joint Classifier and Feature Optimization (JCFO) was first introduced by (Krishnapuram et al., 2004). It is directly inspired by (Figueiredo, 2003), achieving sparsity in feature selection by driving feature weights to zero. Compared to traditional ridge regression, which shrinks the range of the regression parameters (or simply the weight coefficients of a linear classifier) but rarely sets them directly to zero, JCFO inherits the spirit of LASSO (Tibshirani, 1996) by driving some of the parameters exactly to zero, which is equivalent to removing the corresponding features. In this way, it is claimed that JCFO eliminates redundant and irrelevant features. The remainder of this chapter is arranged as follows: Section 1.2 takes a close look at how this idea was derived; Section 1.3 gives the mathematical structure of JCFO; Sections 1.4 and 1.5 illustrate the EM algorithm used to derive the learning procedure for JCFO and analyze its complexity. The last section briefly introduces two other approaches with similar functionality.

1.2 Origin

1.2.1 LASSO

Suppose we have a data set Z = {(x^i, y_i)}, i = 1, 2, ..., n, where x^i = (x^i_1, x^i_2, ..., x^i_k) is the ith input variable vector, whose elements are called features, and the y_i are the responses or the class labels (this thesis only considers y_i ∈ {0, 1}). The ordinary least squares regression would be to find β = (β_1, β_2, ..., β_k)^T which minimizes

    \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{k} \beta_j x_j^i \Big)^2    (1.1)









The solution is the well-known ordinary least squares estimate $\hat\beta^0 = (X^TX)^{-1}X^T\mathbf{y}$. As mentioned in (Hastie et al., 2001), it often has low bias but large variance when $X^TX$ is invertible, which tends to incur over-fitting. To improve the prediction accuracy we usually shrink or set some elements of $\beta$ to zero in order to achieve parameter sparsity. If this is done appropriately, we also perform an implicit feature selection on our dataset, getting rid of the insignificant features and extracting those that are more informative. In addition to simplifying the structure of our estimation functions, according to (Herbrich, 2002), the sparsity of the weight vector $\beta$ also plays a key role in controlling the generalization error, which we will discuss in the next section. Ridge regression, a revised approach that penalizes large $\beta_j$, is

$$\hat\beta = \arg\min_{\beta}\ \sum_{i=1}^{N}\Big(y_i - \sum_{j=1}^{k}\beta_j x^i_j\Big)^2 + \lambda\sum_{j=1}^{k}\beta_j^2. \qquad (1.2)$$

Here $\lambda$ is a shrinkage coefficient that adjusts the ratio of the squared $\ell_2$-norm of $\beta$ to the residual sum of squares in the objective function. The solution of (1.2) is $(1+\gamma)^{-1}\hat\beta^0$ (Tibshirani, 1996), where $\gamma$ depends on $\lambda$ and $X$. This shows that ridge regression shrinks $\hat\beta$ to a fraction of $\hat\beta^0$ but rarely sets its elements exactly to zero, hence it cannot completely achieve the goal of feature selection. As an alternative approach, (Tibshirani, 1996) proposed changing (1.2) to

$$\sum_{i=1}^{N}\Big(y_i - \sum_{j=1}^{k}\beta_j x^i_j\Big)^2 + \lambda\sum_{j=1}^{k}|\beta_j| \qquad (1.3)$$

or, equivalently,

$$(\mathbf{y} - X\beta)^T(\mathbf{y} - X\beta) + \lambda\|\beta\|_1, \qquad (1.4)$$

where $\|\cdot\|_1$ denotes the $\ell_1$-norm. This is called the Least Absolute Shrinkage and Selection Operator (LASSO). To see why the $\ell_1$-norm favors more sparsity of $\beta$, note that

$$\|(1/\sqrt{2},\,1/\sqrt{2})\|_2 = \|(1,0)\|_2 = 1, \quad\text{but}\quad \|(1/\sqrt{2},\,1/\sqrt{2})\|_1 = \sqrt{2} > \|(1,0)\|_1 = 1.$$

Hence it tends to set more elements exactly to zero. By (Tibshirani, 1996), the solution of (1.4) for an orthogonal design $X$ is

$$\hat\beta_j = \mathrm{sgn}(\hat\beta^0_j)\big(|\hat\beta^0_j| - \gamma\big)_+, \qquad (1.5)$$

where $\mathrm{sgn}(\cdot)$ denotes the sign function and $(a)_+$ is defined as $(a)_+ = a$ if $a > 0$ and $0$ otherwise; $\gamma$ depends on $\lambda$. Intuitively, $\gamma$ can be seen as a threshold that filters out those $\hat\beta^0_j$ that fall below a certain range and truncates them to zero. This is a very nice attribute by which our purpose of obtaining sparsity is fulfilled.
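To make the thresholding behavior of (1.5) concrete, the following is a minimal sketch (not from the original thesis) that applies the soft-threshold rule to an ordinary least squares estimate; the threshold value `gamma` is an assumed input.

```python
import numpy as np

def soft_threshold(beta_ols, gamma):
    """Apply the LASSO rule of (1.5): sgn(b) * max(|b| - gamma, 0)."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - gamma, 0.0)

# Example: coefficients smaller than the threshold are driven exactly to zero.
beta_ols = np.array([2.5, -0.3, 0.05, -1.2])
print(soft_threshold(beta_ols, gamma=0.5))   # approximately [ 2.0, 0.0, 0.0, -0.7]
```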

1.2.2 The Prior of $\beta$

The ARD (Automatic Relevance Determination) approach proposed by (Figueiredo, 2003) inherits this idea of promoting sparsity from LASSO through a Bayesian approach. It considers the regression functional $h$ as linear with respect to $\beta$, so our estimate function is

$$f(x,\beta) = \sum_{j=1}^{d}\beta_j h_j(x) = \mathbf{h}(x)^T\beta, \qquad (1.6)$$

where $\mathbf{h}(x)$ could be a vector of linear transformations of $x$, nonlinear fixed basis functions, or kernel functions, which constitute a so-called design matrix $H$ such that $H_{ij} = h_j(x^i)$. Further, it assumes that the error $\epsilon_i = y_i - \beta^T\mathbf{h}(x^i)$ is zero-mean Gaussian, $\mathcal{N}(0,\sigma^2)$. Hence the likelihood is

$$p(\mathbf{y}\mid\beta) = \mathcal{N}(\mathbf{y}\mid H\beta,\ \sigma^2 I), \qquad (1.7)$$

where $I$ is an $N\times N$ identity matrix. Note that since we assume the samples are i.i.d. Gaussian, there is no correlation among them. More importantly, this method also assigns a Laplacian prior to $\beta$:

$$p(\beta\mid\alpha) = \prod_{i=1}^{k}\frac{\alpha}{2}\exp\{-\alpha|\beta_i|\} = \Big(\frac{\alpha}{2}\Big)^k\exp\{-\alpha\|\beta\|_1\}.$$











The influence that a prior exerts on the sparsity of $\beta$ was discussed in (Herbrich, 2002), where a $\mathcal{N}(\mathbf{0}, I_k)$ prior was used to illustrate that $\beta$'s log-density is proportional to $-\frac{1}{2}\sum_{i=1}^{k}\beta_i^2$, whose highest value is attained at $\beta = \mathbf{0}$. For a comparison of the Gaussian and Laplacian priors, we plot both of their density functions in Figure 1.1. The latter is much more peaked at the origin and therefore favors more of $\beta$'s elements being zero.




Figure 1.1. Gaussian (dotted) vs. Laplacian (solid) prior

The MAP estimate of $\beta$ is given by

$$\hat\beta = \arg\min_{\beta}\ \big\{\|\mathbf{y} - H\beta\|_2^2 + 2\sigma^2\alpha\|\beta\|_1\big\}. \qquad (1.8)$$

It can easily be seen that this is essentially the same as (1.3). If $H$ is an orthogonal matrix, (1.8) can be solved separately for each $\beta_i$ (see Appendix A for the detailed derivation):

$$\hat\beta_i = \arg\min_{\beta_i}\ \big\{\beta_i^2 - 2\beta_i(H^T\mathbf{y})_i + 2\sigma^2\alpha|\beta_i|\big\} = \mathrm{sgn}\big((H^T\mathbf{y})_i\big)\big(|(H^T\mathbf{y})_i| - \sigma^2\alpha\big)_+. \qquad (1.9)$$

Unfortunately, in the general case $\beta$ cannot be solved directly from (1.8) due to its non-differentiability at the origin. As a modification, (Figueiredo, 2003) presents a











hierarchical-Bayes view of the Laplacian prior, showing that it is nothing but a two-level Bayes model: zero mean with independent, exponentially distributed variance, $p(\beta_i\mid\tau_i) = \mathcal{N}(0,\tau_i)$ and $p(\tau_i\mid\gamma) = (\gamma/2)\exp\{-(\gamma/2)\tau_i\}$, such that

$$p(\beta_i\mid\gamma) = \int_0^{\infty} p(\beta_i\mid\tau_i)\,p(\tau_i\mid\gamma)\,d\tau_i = \frac{\sqrt{\gamma}}{2}\exp\{-\sqrt{\gamma}\,|\beta_i|\}, \qquad (1.10)$$

where $\tau_i$ can be considered a hidden variable calculated by an EM algorithm, while $\gamma$ is the real hyper-parameter we need to specify. (This integration can be found in Appendix A.)
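As a quick numerical sanity check of (1.10) (my own sketch, not part of the original text), one can sample the two-level model and compare the result with a direct Laplacian; the choice gamma = 4 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 4.0
n = 200_000

# Two-level model of (1.10): tau ~ Exp(rate = gamma/2), beta | tau ~ N(0, tau).
tau = rng.exponential(scale=2.0 / gamma, size=n)     # mean 2/gamma, i.e. rate gamma/2
beta = rng.normal(0.0, np.sqrt(tau))

# Direct Laplacian with density (sqrt(gamma)/2) * exp(-sqrt(gamma) * |b|).
lap = rng.laplace(0.0, 1.0 / np.sqrt(gamma), size=n)

# The two samples should agree in distribution (compare a few moments).
print(np.mean(np.abs(beta)), np.mean(np.abs(lap)))   # both close to 1/sqrt(gamma) = 0.5
print(np.var(beta), np.var(lap))                     # both close to 2/gamma = 0.5
```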

1.3 JCFO

From the same Bayesian viewpoint, JCFO (Krishnapuram et al., 2004) applies a Gaussian cumulative distribution function (probit link) to (1.6) to get a probability measure of how likely an input object belongs to a class $c\in\{0,1\}$. To be more precise:

$$P(y = 1\mid x) = \Phi\Big(\beta_0 + \sum_{i=1}^{N}\beta_i K_{\boldsymbol\theta}(x, x^i)\Big), \qquad (1.11)$$

where $\Phi(z) = \int_{-\infty}^{z}\mathcal{N}(t\mid 0,1)\,dt$. (Note that $\Phi(-z) = 1 - \Phi(z)$.) $K_{\boldsymbol\theta}(x, x^j)$ is a symmetric measure of the similarity of two input objects. For example,

$$K_{\boldsymbol\theta}(x, x^j) = \Big(1 + \sum_{i=1}^{k}\theta_i\,x_i x^j_i\Big)^r \qquad (1.12)$$

for polynomial kernels, and

$$K_{\boldsymbol\theta}(x, x^j) = \exp\Big\{-\sum_{i=1}^{k}\theta_i\,(x_i - x^j_i)^2\Big\} \qquad (1.13)$$

for Radial Basis Functions. (The good attributes of RBF functions will be discussed in Section 2.4.)
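The feature-weighted RBF kernel of (1.13) can be written compactly as below; this is an illustrative sketch rather than the authors' MATLAB code, and the array shapes are assumptions.

```python
import numpy as np

def weighted_rbf_kernel(X, theta):
    """K[i, j] = exp(-sum_d theta[d] * (X[i, d] - X[j, d])**2), as in (1.13).

    X     : (N, k) array of input objects
    theta : (k,) array of non-negative feature weights
    """
    diff = X[:, None, :] - X[None, :, :]            # (N, N, k) pairwise differences
    return np.exp(-np.einsum('ijk,k->ij', diff**2, theta))

# A feature with theta[d] = 0 drops out of the similarity measure entirely.
X = np.random.default_rng(1).normal(size=(5, 3))
K = weighted_rbf_kernel(X, theta=np.array([1.0, 0.5, 0.0]))
print(K.shape, np.allclose(K, K.T))                 # (5, 5) True
```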









An apparent disadvantage of (1.6) is that if we do not choose $h$ as a linear function of $x$ (or simply the input vector $x$ itself), the calculation of $\beta$ amounts to selecting the kernel functions of $x$ rather than selecting the features of $x$. This is also why it is conceptually an ARD method. As a major modification, to select each feature explicitly, JCFO assigns each $x_i$ a corresponding parameter $\theta_i$ in (1.12) and (1.13). Since the same $\theta_i$ is applied to the $i$th element of each input $x^j$, this is equivalent to weighing the significance of the $i$th feature of the input object. This is how feature selection is incorporated. Another point that needs to be mentioned is that all $\theta_i$ should be non-negative, because $K_{\boldsymbol\theta}(x, x^j)$ measures the similarity of two objects as a whole, i.e., by accumulating the differences between each pair of corresponding elements. Had $\theta_i = -\theta_j$ been allowed, in (1.13) the differences in the $i$th and $j$th elements of two objects would have been cancelled.

Then, as was done for $\beta$, each $\theta_k$ is given a Laplacian prior, too:

$$p(\theta_k\mid\rho_k) = \begin{cases} 2\,\mathcal{N}(\theta_k\mid 0,\rho_k), & \theta_k > 0,\\ 0, & \theta_k \le 0, \end{cases} \qquad (1.14)$$

where $p(\rho_k\mid\gamma_2) = (\gamma_2/2)\exp\{-(\gamma_2/2)\rho_k\}$, thus:

$$p(\theta_k\mid\gamma_2) = \int_0^{\infty} p(\theta_k\mid\rho_k)\,p(\rho_k\mid\gamma_2)\,d\rho_k = \begin{cases} \sqrt{\gamma_2}\,\exp\{-\sqrt{\gamma_2}\,\theta_k\}, & \theta_k > 0,\\ 0, & \theta_k \le 0. \end{cases} \qquad (1.15)$$

1.4 EM Algorithm for JCFO

Now, from our i.i.d. assumption on the training data, its likelihood function becomes

$$p(D\mid\beta,\boldsymbol\theta) = \prod_{i=1}^{N}\Phi\Big(\beta_0 + \sum_{j=1}^{N}\beta_j K_{\boldsymbol\theta}(x^i, x^j)\Big)^{y_i}\,\Phi\Big(-\beta_0 - \sum_{j=1}^{N}\beta_j K_{\boldsymbol\theta}(x^i, x^j)\Big)^{1-y_i}. \qquad (1.16)$$

Rather than maximizing this likelihood directly, JCFO provides an EM algorithm by first assuming a random function $z(x,\beta,\boldsymbol\theta) = \mathbf{h}_{\boldsymbol\theta}(x)^T\beta + \epsilon$, where $\mathbf{h}_{\boldsymbol\theta}(x) = [1, K_{\boldsymbol\theta}(x,x^1),\dots,K_{\boldsymbol\theta}(x,x^N)]^T$ and $\epsilon\sim\mathcal{N}(0,1)$. It then treats $\mathbf{z} = [z^1, z^2,\dots,z^N]$, $\boldsymbol\tau$, and $\boldsymbol\rho$ as the missing variables whose expectations are calculated in the E step.

E step

The complete log-posterior is

$$\log p(\beta,\boldsymbol\theta\mid D,\mathbf{z},\boldsymbol\tau,\boldsymbol\rho) \propto \log p(\mathbf{z}\mid\beta,\boldsymbol\theta,D) + \log p(\beta\mid\boldsymbol\tau) + \log p(\boldsymbol\theta\mid\boldsymbol\rho)
\propto -\mathbf{z}^T\mathbf{z} - \beta^T H_{\boldsymbol\theta}^T(H_{\boldsymbol\theta}\beta - 2\mathbf{z}) - \beta^T\Upsilon\beta - \boldsymbol\theta^T R\,\boldsymbol\theta,$$

where $\Upsilon = \mathrm{diag}(\tau_0^{-1},\tau_1^{-1},\dots,\tau_N^{-1})$ and $R = \mathrm{diag}(\rho_1^{-1},\rho_2^{-1},\dots,\rho_k^{-1})$. We want to calculate the expectation of this log-posterior, i.e.,

$$Q(\beta,\boldsymbol\theta\mid\hat\beta^{(t)},\hat{\boldsymbol\theta}^{(t)}) = E\big[-\mathbf{z}^T\mathbf{z} - \beta^T H_{\boldsymbol\theta}^T(H_{\boldsymbol\theta}\beta - 2\mathbf{z}) - \beta^T\Upsilon\beta - \boldsymbol\theta^T R\,\boldsymbol\theta\big]. \qquad (1.17)$$

The expectations with respect to the missing variables are

$$v_i = E\big[z^i\mid D,\hat\beta^{(t)},\hat{\boldsymbol\theta}^{(t)}\big] = \mathbf{h}_{\hat{\boldsymbol\theta}^{(t)}}(x^i)^T\hat\beta^{(t)} + (2y_i - 1)\,\frac{\mathcal{N}\big(\mathbf{h}_{\hat{\boldsymbol\theta}^{(t)}}(x^i)^T\hat\beta^{(t)}\mid 0,1\big)}{\Phi\big((2y_i - 1)\,\mathbf{h}_{\hat{\boldsymbol\theta}^{(t)}}(x^i)^T\hat\beta^{(t)}\big)}, \qquad (1.18)$$

$$\omega_i = E\big[\tau_i^{-1}\mid D,\hat\beta^{(t)},\gamma_1\big] = \gamma_1\,|\hat\beta_i^{(t)}|^{-1},$$

and

$$\delta_i = E\big[\rho_i^{-1}\mid D,\hat{\boldsymbol\theta}^{(t)},\gamma_2\big] = \gamma_2\,|\hat\theta_i^{(t)}|^{-1},$$

where $\hat\beta_i^{(t)}$ and $\hat\theta_i^{(t)}$ denote the estimates of $\beta_i$ and $\theta_i$ in the $t$th iteration. (The derivation of (1.18) can be found in Appendix A.)
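A minimal sketch (my own, not the authors' implementation) of the E-step expectation in (1.18), using the standard truncated-Gaussian mean; `H` is assumed to be the design matrix built from h_theta and `y` holds the 0/1 labels.

```python
import numpy as np
from scipy.stats import norm

def e_step_v(H, beta, y):
    """v_i of (1.18): mean of z^i ~ N(h_theta(x^i)^T beta, 1) truncated to the side given by y_i."""
    m = H @ beta                      # h_theta(x^i)^T beta for every sample
    s = 2.0 * y - 1.0                 # +1 for class 1, -1 for class 0
    return m + s * norm.pdf(m) / norm.cdf(s * m)
```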









M step

Now, with the expectations of these missing variables at hand, we can apply MAP estimation to the Q function of (1.17). After dropping all terms irrelevant to $\beta$, $\boldsymbol\theta$ and setting $\mathbf{v} = [v_1, v_2,\dots,v_N]^T$, $\Omega = \mathrm{diag}[\omega_1,\omega_2,\dots,\omega_N]$ and $\Lambda = \mathrm{diag}[\delta_1,\delta_2,\dots,\delta_k]$, we get

$$Q(\beta,\boldsymbol\theta\mid\hat\beta^{(t)},\hat{\boldsymbol\theta}^{(t)}) = -\beta^T H_{\boldsymbol\theta}^T H_{\boldsymbol\theta}\beta + 2\beta^T H_{\boldsymbol\theta}^T\mathbf{v} - \beta^T\Omega\beta - \boldsymbol\theta^T\Lambda\boldsymbol\theta. \qquad (1.19)$$

Taking the derivatives of (1.19) with respect to $\beta$ and $\theta_k$ gives

$$\nabla_{\beta} Q = -2H_{\boldsymbol\theta}^T H_{\boldsymbol\theta}\beta + 2H_{\boldsymbol\theta}^T\mathbf{v} - 2\Omega\beta, \qquad (1.20)$$

$$\nabla_{\theta_k} Q = -2\delta_k\theta_k - 2\sum_{i=1}^{N}\sum_{j=1}^{N}\big[(H_{\boldsymbol\theta}\beta - \mathbf{v})\beta^T \circ \nabla_{\theta_k}H_{\boldsymbol\theta}\big]_{ij}, \qquad (1.21)$$

where $\circ$ represents the element-wise Hadamard matrix product. Now, by jointly maximizing (1.19) with respect to both $\beta$ and $\boldsymbol\theta$, we not only select the kernel functions but also the features of the input objects, provided both $\beta$ and $\boldsymbol\theta$ are parsimonious. This is the core motivation of JCFO. Something that needs to be stressed is that $\beta$ can be solved for directly in each iteration, while $\theta_k$ can only be computed by an approximate numerical method. This in fact increases the computational complexity and, even worse, affects the sparsity of $\boldsymbol\theta$. We will repeatedly come back to this issue in Chapter 2.
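To make the two halves of the M step concrete, here is a schematic sketch of my own (assumptions: `H` and its gradient with respect to theta_k are supplied by the kernel code, and a generic gradient routine stands in for the conjugate-gradient call used by the authors).

```python
import numpy as np

def m_step_beta(H, v, omega):
    """Closed-form beta update obtained by setting (1.20) to zero, i.e. (1.22)."""
    Omega = np.diag(omega)                           # one entry per component of beta
    return np.linalg.solve(Omega + H.T @ H, H.T @ v)

def grad_theta_k(H, dH_dtheta_k, beta, v, delta_k, theta_k):
    """Gradient (1.21) for a single theta_k; the full theta update must be done numerically."""
    residual = H @ beta - v                          # (N,)
    outer = residual[:, None] * beta[None, :]        # (H beta - v) beta^T
    return -2.0 * delta_k * theta_k - 2.0 * np.sum(outer * dH_dtheta_k)
```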

1.5 Complexity Analysis

What is the complexity of the above EM algorithm in the general case? Since in each iteration this algorithm alternately calculates $\beta$ and $\boldsymbol\theta$, we investigate these two scenarios respectively.

Complexity of estimating $\beta$

Let us first take a look at the search for $\beta$, i.e., setting (1.20) to zero:

$$\hat\beta = \big(\Omega + H_{\boldsymbol\theta}^T H_{\boldsymbol\theta}\big)^{-1} H_{\boldsymbol\theta}^T\mathbf{v}. \qquad (1.22)$$

i) If kernel functions are applied, the calculation of each inner product is $O(k)$, while the common matrix inversion and multiplication of an $N\times N$ matrix in MATLAB are $O(N^c)$, $2 < c < 3$, and $O(N^3)$ respectively. Therefore the whole complexity of (1.22) is $O(N^2 k + N^3 + N^c)$. For low-dimensional datasets with $k < N$, we can simplify this to $O(N^3)$.

ii) If only linear regression is used, there is no need to calculate $H_{\boldsymbol\theta}$, so the complexity is determined by the larger of matrix inversion and multiplication, i.e., $\max(N^c, N^2 k)$, $2 < c < 3$, or $O(N^c)$ if we assume $k < N^{c-2}$. Compared with the kernel version, this does not seem to reduce many computations. However, if the constants in the O-notation are taken into account, the difference becomes conspicuous. Meanwhile, without the kernel we get rid of the calculation of $\boldsymbol\theta$. This also saves enormous computation time, which we are going to show in the subsequent section.

As an improvement, we can exploit the fact that after a few iterations more and more $\beta_i$ vanish, which means there is no use in re-calculating their corresponding kernel functions $K_{\boldsymbol\theta}(x^i, x^j),\ j = 1,2,\dots,N$. Hence we can delete the corresponding row from $H_{\boldsymbol\theta}$ and shrink the size of the design matrix significantly. This desirable attribute can also be applied to $\boldsymbol\theta$, especially in the case of high dimensional datasets, to alleviate the computational pressure incurred by kernel methods.
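The pruning idea described above can be expressed in a couple of lines; this is an illustrative sketch (the tolerance and the convention that vanished coefficients index columns of `H` are my assumptions, not the authors').

```python
import numpy as np

def prune_design(H, beta, tol=1e-8):
    """Drop basis functions whose beta has (numerically) vanished, shrinking H and beta."""
    keep = np.abs(beta) > tol
    return H[:, keep], beta[keep]
```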

Complexity of estimating $\boldsymbol\theta$

After getting the value of $\beta$ in the current iteration, $\boldsymbol\theta$ is solved for by plugging this value into (1.21) and approximating the solution numerically. (The authors of JCFO choose a conjugate gradient method.) Depending on the dataset, the number of iterations required to estimate the minimum of the target function varies remarkably, which makes it difficult to quantify the exact time complexity of calculating $\boldsymbol\theta$. Here we provide a comparison of the time used to compute $\boldsymbol\theta$ with that of all other computations, by averaging 50 and 100 runs of JCFO on two datasets. The JCFO MATLAB source code is from Krishnapuram, one of the inventors of JCFO. The two hyper-parameters are chosen as $\gamma_1 = 0.002$ and $\gamma_2 = 4$; the $\theta_i$'s are initialized to $1/k$, where $k$ is the number of features.


Table 1.1. Comparison of the computation time of $\boldsymbol\theta$ with that of all other parts.
Dataset              HH (50 runs), sec    Crabs (100 runs), sec
theta     mean       200.3                16.27
          std        55.70                0.4921
Other     mean       129.7                9.524
          std        37.97                0.3577


We can see that nearly two-thirds of the computation time is used to compute $\boldsymbol\theta$. Furthermore, by choosing a conjugate gradient method, as the authors did with fmincon in MATLAB, we cannot assume the minimum will be attained by a parsimonious $\boldsymbol\theta$. In other words, since the optimization itself is not sparsity-oriented, enforcing this requirement sometimes leads to its terminating badly. This is a major weakness of JCFO. In order to verify our hypotheses, we will look at more behaviors of $\boldsymbol\theta$ in Chapter 2.

1.6 Alternative Approaches

1.6.1 Sparse SVM

In the following two sections we briefly discuss two other approaches, derived from LASSO, that have a property very similar to JCFO's. (B. Scholkopf and A.J. Smola, 2002) introduce an $\varepsilon$-insensitive SVM with the following objective function:

$$\min_{\mathbf{w}\in\mathcal{H},\,\boldsymbol\xi^{(*)}\in\mathbb{R}^N,\,\varepsilon,b\in\mathbb{R}}\ \tau(\mathbf{w},\boldsymbol\xi^{(*)},\varepsilon) = \frac{1}{2}\|\mathbf{w}\|^2 + C\Big(\nu\varepsilon + \frac{1}{N}\sum_{i=1}^{N}(\xi_i + \xi_i^*)\Big),$$
$$\text{s.t. }\ (\langle\mathbf{w},x_i\rangle + b) - y_i \le \varepsilon + \xi_i,$$
$$\qquad y_i - (\langle\mathbf{w},x_i\rangle + b) \le \varepsilon + \xi_i^*,$$
$$\qquad \xi_i^{(*)} \ge 0,\ \varepsilon \ge 0. \qquad (1.23)$$

From the constraints we know that if a sample falls inside the $\varepsilon$-tube, it will not be penalized in the objective function. The parameter $\xi_i^{(*)}$ denotes the real distance of a point to the $\varepsilon$-tube. $C$ is a coefficient that controls the ratio of the total $\xi^{(*)}$ distance to the norm of the weight vector, both of which we want to minimize, or, more generally, the balance between sparsity and prediction accuracy. The variable $\nu\in[0,1]$ is a hyper-parameter that behaves as an upper bound on the fraction of errors (the number of points outside the $\varepsilon$-tube) as well as a lower bound on the fraction of SVs. Now let us take the derivatives of the Lagrangian of (1.23) with respect to $\mathbf{w}$, $\boldsymbol\xi^{(*)}$, $b$ and set them to zero. We get

$$\max_{\boldsymbol\alpha^{(*)}\in\mathbb{R}^N}\ W(\boldsymbol\alpha^{(*)}) = \sum_{i=1}^{N}(\alpha_i^* - \alpha_i)y_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)K(x_i, x_j),$$
$$\text{s.t. }\ \sum_{i=1}^{N}(\alpha_i - \alpha_i^*) = 0,$$
$$\qquad \alpha_i^{(*)}\in[0,\ C/N],$$
$$\qquad \sum_{i=1}^{N}(\alpha_i + \alpha_i^*) \le C\nu. \qquad (1.24)$$


This is the standard form of the $\nu$-SVM. Since the points inside the $\varepsilon$-tube cannot be used as SVs, intuitively it attains more weight-vector sparsity than the ordinary SVM by setting more $\alpha_i^{(*)}$ to zero. (J. Bi et al., 2003) propose a modification by adding an $\ell_1$-type penalty on $\boldsymbol\alpha$ to the loss function (this revision is called Sparse-SVM) in order to get an even more parsimonious set of SVs. This also inherits the idea from LASSO. Of course, the optimization involves a search algorithm to find both the priors $C$ and $\nu$.
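For readers who want to experiment with the $\nu$-formulation, scikit-learn exposes it directly; the toy data and parameter values below are arbitrary choices of mine, not from the thesis.

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# nu upper-bounds the fraction of points outside the tube and
# lower-bounds the fraction of support vectors, as discussed above.
model = NuSVR(kernel='rbf', nu=0.2, C=1.0).fit(X, y)
print(len(model.support_), 'support vectors out of', len(X))
```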

1.6.2 Relevance Vector Machine

The Relevance Vector Machine (RVM) (M. Tipping, 2000) has a design structure very similar to JCFO's. Indeed, it also assumes a Gaussian prior $\boldsymbol\omega\sim\mathcal{N}(\mathbf{0}, T)$ and $p(\mathbf{t}\mid\boldsymbol\omega) = \mathcal{N}(X\boldsymbol\omega,\ \sigma^2 I)$, where $T = \mathrm{diag}(\tau_1,\tau_2,\dots,\tau_N)$, $\tau_i$ can be considered the hyper-parameter of $\omega_i$, and $\sigma^2$ is known. These assumptions are analogous to (1.7) and (1.14). However, instead of calculating the expectations of the missing variables and plugging them back to maximize the posterior, as JCFO does in (1.17) and (1.19), RVM integrates out $\boldsymbol\omega$ to get the marginal likelihood:

$$p(\mathbf{t}\mid T,\sigma^2) = \int p(\mathbf{t}\mid\boldsymbol\omega,\sigma^2)\,p(\boldsymbol\omega\mid T)\,d\boldsymbol\omega = (2\pi)^{-N/2}\,\big|\sigma^2 I + XTX^T\big|^{-1/2}\exp\Big\{-\frac{1}{2}\mathbf{t}^T\big(\sigma^2 I + XTX^T\big)^{-1}\mathbf{t}\Big\}. \qquad (1.25)$$

Henceforth we can take the derivatives with respect to $\tau_i$ and $\sigma^2$ in order to maximize (1.25). Interestingly, the logarithm of the Gamma distribution $\Gamma(\tau_i\mid a, b)$ assigned by (Herbrich, 2002) as the prior of $\tau_i$ has the same effect as the Laplacian prior JCFO uses in promoting the sparsity of $\boldsymbol\omega$. We plot the shapes of this prior for different $a, b$ in Figure 1.2.
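A small sketch of my own (with made-up dimensions) for evaluating the type-II log-likelihood in (1.25), which is the quantity RVM maximizes with respect to the $\tau_i$ and $\sigma^2$:

```python
import numpy as np

def rvm_marginal_loglik(t, X, tau, sigma2):
    """log p(t | T, sigma^2) from (1.25), with T = diag(tau)."""
    N = len(t)
    C = sigma2 * np.eye(N) + X @ np.diag(tau) @ X.T
    _, logdet = np.linalg.slogdet(C)
    quad = t @ np.linalg.solve(C, t)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)
```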












Figure 1.2. Logarithm of the gamma distribution. From top down: a=1e-2, b=1e2; a=1e-3, b=1e3; a=1e-4, b=1e4.









CHAPTER 2
ANALYSIS OF JCFO


2.1 Brief Introduction

As the inventors of JCFO claim, its major objectives are twofold:

i To learn a function that most accurately predicts the class of a new example (Classifier
Design) and,

ii To identify a subset of the features that is most informative about the class distinction
(Feature Selection).

Within this chapter we investigate these two attributes of JCFO on several real and synthetic datasets. From Experiment 2.2.1 to Experiment 2.3.3.1, we train our samples on the 220-sample, 7-feature HH landmine data and the 200-sample, 5-feature Crabs data as our basic datasets, both of which are separable by an appropriate classification algorithm, and we perform 4-fold cross-validation on them. For the rest of the experiments, we generate several special Gaussian distributions. From the performance evaluation's perspective, we compare both of JCFO's prediction and feature selection abilities, using the RBF kernel, with the non-kernelized (linear) ARD method presented by (Figueiredo, 2003), which is its direct ancestor. Regarding our interest, we devote more effort to feature selection, primarily to the elimination of irrelevant and redundant features. In each scenario we contrive a couple of small experiments that test a certain aspect. Then a more extensive result is provided together with an explanation of how this result is derived. The pseudocode of each experiment can be found in Appendix B.

2.2 Irrelevancy

Intuitively, if some features do not vary much from one class to another, which means they are not sufficiently informative in differentiating the classes, we call them irrelevant features. To be more precise, it is convenient to introduce the following between-class divergence:

$$d_k = \frac{1}{2}\Big(\frac{\sigma_1^2}{\sigma_2^2} + \frac{\sigma_2^2}{\sigma_1^2} - 2\Big) + \frac{1}{2}(\mu_1 - \mu_2)^2\Big(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\Big),$$

where $k$ denotes the $k$th feature and $\mu_i$, $\sigma_i^2$ are the mean and the variance of feature $k$ in class $i$. Note that $d_k$ is inversely related to the similarity of the two classes: the more similar the means and variances are, the smaller $d_k$ becomes. In the extreme case where the two distributions represented by the $k$th feature totally overlap, $d_k$ drops to zero. This is equivalent to saying that the $k$th feature is completely irrelevant. We will use this concise measure for comparison with the feature weight $\theta_k$ assigned by JCFO in the following experiments. The divergences of each feature in HH and Crabs are given in Table 2.1 and Table 2.2.
feature in HH and Crabs are given by Table 2.1 and Table 2.2.


Table 2.1. Feature divergence of HH
Feature 1 2 3 4 5 6 7
Divergence 0.3300 0.5101 32.7210 0.2304 1.3924 3.0806 3.9373

Table 2.2. Feature divergence of Crabs
Feature 1 2 3 4 5
Divergence 0.0068 0.3756 0.0587 0.0439 0.0326
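The divergence values in Tables 2.1 and 2.2 can be recomputed from per-class means and variances with a one-line function; this sketch of mine simply implements the formula given above.

```python
import numpy as np

def divergence(x_class1, x_class2):
    """Between-class divergence d_k of a single feature, as defined above."""
    m1, v1 = np.mean(x_class1), np.var(x_class1)
    m2, v2 = np.mean(x_class2), np.var(x_class2)
    return 0.5 * (v1 / v2 + v2 / v1 - 2.0) + 0.5 * (m1 - m2) ** 2 * (1.0 / v1 + 1.0 / v2)
```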

2.2.1 Uniformly Distributed Features

The distribution of this kind of feature is exactly the same in both classes. Obviously, features like this should be removed. We examine the performance of JCFO and ARD in eliminating uniformly distributed features by adding a constant column as a new feature to Crabs and HH. Test results averaged over 50 runs are shown in Table 2.3 and Table 2.4. Only ARD completely sets the interfering feature weight to zero on both datasets, while JCFO does not fulfill the goal of getting rid of the invariant feature on HH. Moreover, the time consumption of JCFO is enormous.









Table 2.3. Uniformly distributed feature in Crabs
JCFO ARD
Train error rate mean (%) 2 4
Train error rate std (%) 0 0
Test error rate mean (%) 6 6
Test error rate std (%) 0 0
Percentage of the noise weights set to zero 100% 100%
Running Time 2 hours 1 min

Table 2.4. Uniformly distributed feature in HH
JCFO ARD
Train error rate mean (%) 1.49 14.5
Train error rate std (%) 1.48 2.8e-15
Test error rate mean (%) 21.7 20
Test error rate std (%) 4.63 8.5e-15
Percentage of the noise weights set to zero 0% 100%
Running Time 4 hours 1 min

2.2.2 Non-informative Noise Features

While a uniformly distributed feature is rarely seen in realistic datasets, most irrelevancies lie under the cover of various kinds of random noise. We can emulate this by adding class-independent Gaussian noise as a feature to our datasets. Although different means and variances can be chosen, this feature itself is still non-informative in terms of classification. Our purpose is to see whether JCFO can remove this noise feature. By adding unit-variance Gaussian noise with means of 5, 10 and 15 each time as a new feature and training each configuration 50 times, we get an average divergence of the noise feature of 0.0001. The test results are provided in Table 2.5.


Table 2.5. Non-informative Noise in HH and Crabs
JCFO ARD
HH Crabs HH Crabs
Train error rate mean (%) 1.55 1.82 14.55 2.40
Train error rate std (%) 1.32 0.83 0.1 1.96
Test error rate mean (%) 27.41 5.00 20.03 3.60
Test error rate std (%) 6.1 1.83 0.47 2.95
Percentage of the noise weight being set to zero 0 8.87 96 99.3









It can be seen that almost none of the corresponding noise weights are set to zero by JCFO. By contrast, ARD almost always sets the noise weights to zero. This implies that JCFO does not eliminate irrelevant Gaussian features well compared to the ARD method. As mentioned in Section 1.5, the reason is likely that although $\beta$ in (1.10) and $\boldsymbol\theta$ in (1.14) have the same sparsity-promoting mechanism, in the implementation $\boldsymbol\theta$ can only be derived approximately. This leaves the values of its elements tiny but still nonzero, which in turn compromises $\boldsymbol\theta$'s sparsity. A similar situation will be encountered again in the next section. Furthermore, JCFO does not beat ARD in terms of the test error. This result can also be explained as a failure to realize the sparsity-related reduction of generalization error with respect to $\boldsymbol\theta$, since we know the two methods share the same functionality in the $\beta$ part.

2.3 Redundancy

A redundant feature is one that has a similar, but not necessarily identical, structure to another feature. Here we can employ a well-known measure to assess the correlation between two features:

$$\rho_{ij} = \frac{\sum_{n=1}^{N} x^n_i\, x^n_j}{\sqrt{\sum_{n=1}^{N}\big(x^n_i\big)^2}\ \sqrt{\sum_{n=1}^{N}\big(x^n_j\big)^2}},$$

where $i$ and $j$ denote the $i$th and $j$th features and $n$ indexes the samples (each feature has been normalized to zero mean). The larger $\rho_{ij}$ is, the more correlated the two features are. The extreme case is 1, when they are identical.
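For zero-mean features this measure coincides with the Pearson correlation, so it can be computed directly; a minimal sketch:

```python
import numpy as np

def feature_correlation(xi, xj):
    """rho_ij for two zero-mean feature columns, as defined above."""
    return np.sum(xi * xj) / (np.sqrt(np.sum(xi ** 2)) * np.sqrt(np.sum(xj ** 2)))

# For centered features this is equivalent to np.corrcoef(xi, xj)[0, 1].
```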

2.3.1 Oracle Features

First let's look at a peculiar example: if we add the class label itself as a feature, it can obviously be considered the most informative attribute in distinguishing the classes, and all the other features are redundant compared to it. We shall check whether JCFO can directly select this 'oracle' feature and get rid of all the others, i.e., whether only the $\theta_i$ corresponding to this feature is nonzero. We test JCFO and ARD 100 times each on HH and Crabs with the 'label feature' added.

Table 2.6. Oracle feature
JCFO ARD
Train error rate (%) 0 0
Only the label feature weight is nonzero? Yes Yes

The purpose of this toy experiment is to determine whether, as feature selection approaches, both methods have the ability to identify the most representative feature among other, relatively subtle ones. This should be considered a basic requirement for feature selection.

2.3.2 Duplicate Features

A duplicate feature is identical to another feature and thus is completely redundant. Of

course, here the correlation value of the two features is exactly 1. This experiment will examine

whether JCFO can identify a duplicate feature and remove it. We achieve this by simply

replicating a feature in our original dataset then adding it back as a new feature. We repeat this

process on each feature and test whether it or its replica can be eliminated.


Table 2.7. Duplicate feature weights on HH
Feature 1 2 3 4 5 6 7 Error(%)
Weight __Train/Test
JCFO Itself 0.2157 0.1438 0.3130 0.2134 0.4208 0.2502 0.2453 4.082/
Duplicate 0.2300 0.1520 0.3539 0.2142 0.4010 0.2454 0.2604 20.32
ARD Itself 0 0 0 0 0 0 -0.0094 14.63/
Duplicate 0 0 0 0 0 0.3172 0 20.25

Table 2.8. Duplicate feature weights on Crabs
Feature 1 2 3 4 5 Error (%)
Weight _Train/Test
JCFO Itself 0 0.0981 0.1091 0 0.0037 2/
Duplicate 0 0.0846 0.1076 0 0.0032 6
ARD Itself 0 -0.9558 0.9050 0 0 4/
Duplicate 0 0 0 0 0 6










From the above tables, JCFO apparently assigns each feature approximately the same weight as its duplicate, while ARD sets either the feature or its duplicate (or both) to zero. Meanwhile, considering their test errors, we encounter the situation of Section 2.2.2 again.

2.3.3 Similar Features

Now we examine the features that are not identical but highly correlated. This section is

separated into two parts:

2.3.3.1

This experiment is similar to the duplicate-feature experiment. We create a new feature by replicating an original feature and mixing it with unit-variance Gaussian noise. A verification is performed on each feature to check whether either this feature or its counterpart is removed, by running the test 50 times on HH and 100 times on Crabs and averaging how often either of their weights is set to zero.


Table 2.9. Percentage of either two identical features' weights being set to zero in HH
Feature 1 2 3 4 5 6 7 Error(%)
p (0.4652) (0.8951) (0.9971) (0.9968) (0.9992) (0.9992) (0.1767) Train/Test
Weight
JCFO 30 30 20 10 10 10 20 3.853/21.4
ARD 100 100 100 100 99 100 92 14.51/20.11

Table 2.10. Percentage of either two identical features' weights being set to zero in Crabs
Feature 1 2 3 4 5 Error (%)
P (0.9980) (0.9970) (0.9995) (0.9996) (0.9977) Train/Test
Weight
JCFO 90 10 30 90 50 2.013/4.840
ARD 100 91 97 99 100 4.001/5.976

The results on two datasets still imply that JCFO can't perform as well as ARD in terms of

the redundancy elimination.












Figure 2.1. Two Gaussian classes that can be classified by either x or y axes

2.3.3.2

In Figure 2.1 above, the two ellipsoids can be differentiated using either the x or the y variable, so either one can be considered redundant given the other. However, z is completely irrelevant. We generate 200 samples each time, half in each class, and run both methods 150 times. We plot the weights corresponding to the three axis features in the following figures.




Figure 2.2. Weights assigned by JCFO (from top to down: x, y and z axis)










Figure 2.3. Weights assigned by ARD (from top to down: x, y and z axis)

It can be seen that both methods discover the irrelevancy of the z axis, though the performance of ARD is much better. Regarding the redundancy, unfortunately neither of them eliminates the x or the y axis. More interestingly, the weights assigned by ARD to x and y are almost identical in each run, while those assigned by JCFO are much more chaotic. Note that the redundancy here is different from that in the previous experiments, since it is more implicit from a machine's perspective. We will discuss this further in Section 2.5.

2.3.4 Gradually More Informative Features

In this experiment we generate samples from two multivariate Gaussians. The corresponding elements of the two mean vectors have gradually larger differences as their index increases, e.g., $\mu_1 = (0, 0, \dots, 0)$ and $\mu_2 = (10, 20, \dots, 10k)$. The variance of each dimension is fixed to one fourth of the largest element difference, so in the above case it is $10k/4$. By treating each dimension as a feature, we contrive a set of gradually more informative features, since the feature corresponding to the largest index has the most distant means and the least overlapping variances and is thus the most separable. Compared to this feature, the rest could be deemed redundant. We do this experiment with 5-, 10- and 15-dimensional data respectively. In each experiment we generate 200 samples and perform 4-fold cross-validation. This process is repeated 50 times for JCFO and 100 times for ARD; then










we look at whether the most informative feature is kept while the others are deleted. Results are


listed in the following three tables.


Table 2.11. Five features
Feature      JCFO                        ARD
             mean     std     zero(%)    mean      std     zero(%)
1 0.0001 0.0008 14 -0.0057 0.0072 57
2 0.0041 0.0013 0 -0.0352 0.0155 8
3 0.0090 0.0020 0 -0.0819 0.0168 0
4 0.0158 0.0021 0 -0.1486 0.0193 0
5 0.0256 0.0027 0 -0.2339 0.0224 0

Table 2.12. Ten features
Feature      JCFO                        ARD
             mean     std     zero(%)    mean      std     zero(%)
1 0 0 100 -0.0001 0.0010 95
2 0.0001 0.0003 80 -0.0026 0.0045 72
3 0.0011 0.0019 50 -0.0070 0.0082 52
4 0.0027 0.0028 34 -0.0187 0.0125 24
5 0.0040 0.0039 16 -0.0286 0.0145 14
6 0.0053 0.0049 6 -0.0460 0.0146 4
7 0.0099 0.0071 0 -0.0668 0.0171 0
8 0.0133 0.0075 0 -0.0856 0.0156 0
9 0.0144 0.0087 2 -0.1125 0.0190 0
10 0.0214 0.0106 0 -0.1356 0.0197 0

Table 2.13. Fifteen features
Feature      JCFO                        ARD
             mean     std     zero(%)    mean      std     zero(%)
1 0 0 100 -0.0002 0.0009 94
2 0.0001 0.0004 92 -0.0009 0.0024 86
3 0.0002 0.0006 84 -0.0022 0.0038 71
4 0.0009 0.0017 68 -0.0046 0.0063 61
5 0.0015 0.0027 62 -0.0062 0.0076 56
6 0.0021 0.0036 52 -0.0118 0.0099 36
7 0.0030 0.0050 44 -0.0143 0.0120 35
8 0.038 0.0050 30 -0.0242 0.0143 18
9 0.0044 0.0069 34 -0.0318 0.0156 10
10 0.0068 0.0089 30 -0.0413 0.0138 2
11 0.0057 0.0075 30 -0.0497 0.0157 2
12 0.0109 0.0120 16 -0.0588 0.0176 2
13 0.0117 0.0110 8 -0.0735 0.0159 0
14 0.0118 0.0136 20 -0.0845 0.0173 0
15 0.0151 0.0134 12 -0.0992 0.0193 0
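Before discussing these results, here is a minimal sketch of the sampling scheme described above (the shapes and the 0/1 label encoding are my assumptions):

```python
import numpy as np

def gradually_informative_data(k, n_per_class=100, seed=0):
    """Two k-dimensional Gaussians whose mean gap grows with the feature index."""
    rng = np.random.default_rng(seed)
    mu1 = np.zeros(k)
    mu2 = 10.0 * np.arange(1, k + 1)          # (10, 20, ..., 10k)
    var = 10.0 * k / 4.0                      # variance: a quarter of the largest mean gap
    X1 = rng.normal(mu1, np.sqrt(var), size=(n_per_class, k))
    X2 = rng.normal(mu2, np.sqrt(var), size=(n_per_class, k))
    X = np.vstack([X1, X2])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y
```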









The above tables indicate that although the most informative features are assigned the largest weights by both methods, the rest of the features are not removed but rather ranked in terms of their informativeness. Generally, JCFO eliminates more redundant features in the 2nd and 3rd cases. However, Table 2.13 also shows that JCFO sometimes incorrectly deletes the most informative feature (12% of the runs for Feature 15). In some scenarios a feature selection method cannot eliminate the redundancy, but it gives a ranking according to how much each feature contributes to the classification, which is also an acceptable alternative (I. Guyon, 2003).

2.3.5 Indispensable Features

In classification, if a set of features works as a whole, i.e., no proper subset of them can fulfill the classification, we call them indispensable features. Eliminating a feature from an indispensable feature set will cause a selection error. We are going to check how JCFO and ARD deal with this kind of feature. If we treat the x and y variables as two features in Figure 2.4, they obviously constitute an indispensable feature set, since neither of them alone can determine the classes.


Figure 2.4. Two Gaussians that can only be classified by both x and y axes

As in Experiment 2.3.3.2, we generate 200 samples each time, half in each class, and run both methods 150 times. Results are listed below:
















Figure 2.5. Weights assigned by JCFO (from top to down: x, y and z axis)





Figure 2.6. Weights assigned by ARD (from top to down: x, y and z axis)

The fact that both JCFO and ARD do not remove either the x or the y feature is a good result. However, the weights assigned by ARD are much more stable. Regarding the irrelevant z axis, ARD does a clean job, as it did in 2.3.3.2. Therefore, we can still evaluate its performance as better than JCFO's.









2.4 Non-linearly Separable Datasets

The experiments with synthetic data in the previous two sections concentrated on classifying linearly separable features, which is not JCFO's strong suit. In the following two tests, we examine JCFO's performance on two non-linearly separable datasets compared with that of the ARD method.

The 2-D Cross data comprises two crossing Gaussian ellipses as one class and four Gaussians surrounding this cross as the other class, each class having 100 samples (Figure 2.7). The 2-D Ellipse data has two equal-mean Gaussian ellipses as the two classes, each with 50 samples. One of them has a much wider variance than the other, which makes the distribution look like one class surrounded by the other (Figure 2.8). Clearly, there is no single hyperplane that can classify either of the two datasets correctly. We run JCFO, non-kernelized ARD and kernelized ARD on both datasets 50 times each. A comparison of their performance is shown in Table 2.14. In the experiment summarized in Table 2.15, a non-informative irrelevant feature was added exactly as described in Section 2.2.2. In the experiment summarized in Table 2.16, one similar redundant feature was added as described in Section 2.3.3.1.


Table 2.14. Comparisons of JCFO, non-kernelized ARD and kernelized ARD
Test error (%)            Non-kernelized ARD    Kernelized ARD    JCFO
Cross     mean            50                    8.1               4.0
          std             0                     1e-14             1e-15
Ellipse   mean            48                    8.4               12
          std             4e-14                 3e-15             1e-14

Table 2.15. Comparisons of JCFO and kernelized ARD with an added non-informative irrelevant feature
Test error (%)            Kernelized ARD    JCFO
Cross     mean            17.7              7.32
          std             17.6              3.20
Ellipse   mean            15.4              13.4
          std             5.70              5.03












Table 2.16. Comparisons of JCFO and kernelized ARD with an added similar redundant feature
Test error (%)            Kernelized ARD    JCFO
Cross     mean            15.3              8.76
          std             4.01              3.77
Ellipse   mean            13.4              12.3
          std             5.21              4.41

Since non-kernelized ARD performs a linear mapping, it cannot produce a reasonable classification in this scenario. Kernelized ARD and JCFO each perform a little better on one of the original datasets. JCFO outperforms ARD on both datasets when they are mixed with irrelevant or redundant features, especially on the Cross data. However, the time consumption for training JCFO still dominates that of kernelized ARD, and none of the parameters corresponding to the noise feature are set to zero by JCFO in these two scenarios.




Figure 2.7. Cross data













Figure 2.8. Ellipse data

2.5 Discussion and Modification

From the trouble JCFO suffered in the experiments we designed, we reiterate what we discussed in Section 2.2.2:

That is, JCFO does not reliably achieve sparsity on the feature set and does not, therefore, realize the advantage of low generalization error due to reduced model complexity. Hence, it does not achieve both of the two advantages it claims.

Therefore, a more effective approximation method for calculating $\boldsymbol\theta$ in (1.21) needs to be proposed in order to achieve JCFO's theoretical superiority. Without such a method at hand, we suggest simply dropping the $\boldsymbol\theta$ part and returning to a kernelized ARD model, e.g., by plugging in the RBF kernel function of (1.13). According to (Herbrich, 2002), this function has the appealing property that each linear combination of kernel functions (see (1.6)) over the training objects $(x^1, x^2, \dots, x^N)$ can be viewed as a density estimator in the input space, because it effectively puts a Gaussian on each $x^i$ and weights its contribution to the final density by $\beta_i$ in (1.6). Regarding feature selection, we can apply a non-kernelized (linear) ARD to weigh each feature directly. (However, this requires that our datasets not be non-linearly separable.) As we have seen in 2.3.3.2, the x and y variables are each time assigned the same absolute value (with different signs), which might help us detect the very implicit redundancy in that scenario. By combining these two approaches, we first use the latter to pre-select the more representative features, then feed them back to the former for more precise learning. This will increase the efficiency of the whole learning process. We show comparisons of the performance of non-kernelized ARD, kernelized ARD and the combination of the two on the Crabs data and HH below, training 100 times.


Table 2.17. Comparisons of the three ARD methods
Test error (%)            Non-kernelized    Kernelized    Combo
Crabs     mean            3.60              2.2           1-30
          std             2.95              0             1-10
          iterations      100               5             2
HH        mean            20.03             12.63         9.17
          std             0.47              6.26          2.11
          iterations      100               5             3

Note that after performing feature selection on the Crabs data, only two features remain. Kernelized ARD will make the $\beta_i$ vanish very fast on low-dimensional datasets (here, even after 2 iterations only two $\beta_i$ are nonzero), which causes the training result to be unstable (see the mean and std of the combination method on the Crabs data). Therefore, we also need to realize that there are certain cases in which the combination method is not appropriate, and it also needs to balance the tradeoff between parameter sparsity and prediction accuracy.
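A schematic sketch of the two-stage combination described above; the function names `linear_ard` and `kernel_ard` are placeholders for the ARD implementations, not real library calls.

```python
import numpy as np

def combo_ard(X_train, y_train, X_test, tol=1e-8):
    """Stage 1: linear ARD weighs the raw features; stage 2: kernelized ARD
    is trained only on the features that survived stage 1."""
    beta = linear_ard(X_train, y_train)              # placeholder: non-kernelized (linear) ARD
    keep = np.abs(beta) > tol                        # pre-selected features
    model = kernel_ard(X_train[:, keep], y_train)    # placeholder: kernelized (RBF) ARD
    return model.predict(X_test[:, keep]), keep
```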

2.6 Conclusion

In this thesis, we introduced the theoretical background of an existing Bayesian feature selection and classification method in the first chapter, from its origin to its implementation details and complexity. Then, in the second chapter, we systematically analyzed its performance in a series of specially designed experiments. Several comparisons with its direct ancestor approach were provided. From these experimental results we have seen that even though JCFO is theoretically more appealing in achieving sparsity in both the features and the basis functions, the lack of an effective implementation technique seriously restricts its performance. As an alternative, we suggest returning to the original ARD method, jointly using its kernelized and non-kernelized versions to exploit both feature selection and class prediction. Though our model thereby becomes less ambitious, its simplicity in practice and its time-efficiency are still preserved, in keeping with our original design purpose.









APPENDIX A
SOME IMPORTANT DERIVATIONS OF EQUATIONS

Equation (1.9)

Consider the general case of minimizing

$$f(x) = x^2 - 2ax + 2b|x| \quad\text{w.r.t. } x,$$

where $a, b$ are constants. This is equivalent to

$$\min_x f(x) = \begin{cases} x^2 - 2ax + 2bx, & x \ge 0,\\ x^2 - 2ax - 2bx, & x < 0. \end{cases}$$

When $x \ge 0$:

$$\arg\min_x f(x) = \begin{cases} a - b, & a - b > 0 \quad (1)\\ 0, & a - b \le 0 \quad (2) \end{cases}$$

When $x < 0$:

$$\arg\min_x f(x) = \begin{cases} a + b, & a + b < 0 \quad (3)\\ 0, & a + b \ge 0 \quad (4) \end{cases}$$

(1) & (3): $b < 0$ and $x = \arg\min\big(f(a-b), f(a+b)\big)$, so $x = a + b$ if $a < 0$ and $x = a - b$ if $a > 0$, i.e., $x = \mathrm{sgn}(a)(|a| - b)_+$.

(1) & (4): $a > 0$ and $x = a - b > 0$, so $x = \mathrm{sgn}(a)(|a| - b)_+$.

(2) & (3): $a < 0$ and $x = a + b < 0$, so $x = \mathrm{sgn}(a)(|a| - b)_+$.

(2) & (4): $b \ge 0$ and $x = 0$, so $x = \mathrm{sgn}(a)(|a| - b)_+$.

Summarizing the above four scenarios yields

$$x = \mathrm{sgn}(a)(|a| - b)_+.$$


Integration (1.10)

$$p(\beta_i\mid\gamma) = \int_0^{\infty} \frac{1}{\sqrt{2\pi\tau_i}}\exp\Big\{-\frac{\beta_i^2}{2\tau_i}\Big\}\cdot\frac{\gamma}{2}\exp\Big\{-\frac{\gamma}{2}\tau_i\Big\}\,d\tau_i.$$

Substituting $x = \tau_i^{1/2}$ (so that $d\tau_i = 2x\,dx$):

$$= \frac{\gamma}{\sqrt{2\pi}}\int_0^{\infty}\exp\Big\{-\frac{\gamma}{2}x^2 - \frac{\beta_i^2}{2x^2}\Big\}\,dx.$$

Using the tabulated integral $\int_0^{\infty}\exp\{-a^2x^2 - b^2/x^2\}\,dx = \frac{\sqrt{\pi}}{2a}e^{-2ab}$ (Beyer, 1979) with $a = \sqrt{\gamma/2}$ and $b = |\beta_i|/\sqrt{2}$:

$$= \frac{\gamma}{\sqrt{2\pi}}\cdot\frac{\sqrt{2\pi}}{2\sqrt{\gamma}}\exp\{-\sqrt{\gamma}\,|\beta_i|\} = \frac{\sqrt{\gamma}}{2}\exp\{-\sqrt{\gamma}\,|\beta_i|\}.$$


Expectation (1.18)

Since $p(z^i\mid D,\hat\beta^{(t)},\hat{\boldsymbol\theta}^{(t)}) \propto \mathcal{N}\big(z^i\mid \mathbf{h}_{\hat{\boldsymbol\theta}^{(t)}}(x^i)^T\hat\beta^{(t)},\,1\big)$ restricted to the half-line determined by $y_i$, let us consider $y_i = 1$, so that $z^i > 0$, i.e.,

$$p(z^i\mid D,\hat\beta^{(t)},\hat{\boldsymbol\theta}^{(t)}) = \begin{cases}\dfrac{\mathcal{N}\big(z^i\mid \mathbf{h}_{\hat{\boldsymbol\theta}^{(t)}}(x^i)^T\hat\beta^{(t)},\,1\big)}{\Phi\big(\mathbf{h}_{\hat{\boldsymbol\theta}^{(t)}}(x^i)^T\hat\beta^{(t)}\big)}, & z^i > 0,\\[2ex] 0, & z^i \le 0.\end{cases}$$

Hence

$$v_i = E\big[z^i\mid D,\hat\beta^{(t)},\hat{\boldsymbol\theta}^{(t)}\big] = \int_0^{\infty} z^i\, p(z^i\mid D,\hat\beta^{(t)},\hat{\boldsymbol\theta}^{(t)})\,dz^i = \mathbf{h}_{\hat{\boldsymbol\theta}^{(t)}}(x^i)^T\hat\beta^{(t)} + \frac{\mathcal{N}\big(\mathbf{h}_{\hat{\boldsymbol\theta}^{(t)}}(x^i)^T\hat\beta^{(t)}\mid 0,1\big)}{\Phi\big(\mathbf{h}_{\hat{\boldsymbol\theta}^{(t)}}(x^i)^T\hat\beta^{(t)}\big)},$$

which is (1.18) with $y_i = 1$. The case $y_i = 0$ follows in a similar way.










APPENDIX B
THE PSEUDOCODES FOR EXPERIMENTS DESIGN

2.2 Irrelevancy

NOTES: In this testing unit, our basic datasets are randomly generated, linearly separable. Each

feature is normalized to zero-mean, unit variance.

2.2.1 Uniformly Distributed Feature

counter <- 0;
For 1:50
    ds <- load dataset;
    (n,k) <- get the matrix size from ds;
    ds <- append a column of 1s as the (k+1)th feature;
    Run JCFO(ds);
    If theta(k+1) = 0
        counter <- counter + 1;
    End
End
Display(counter/50)

2.2.2 Noninformative Noise Features

counter <- 0;
For mu = 5, 10, 15
    For 1:50
        ds <- load dataset;
        (n,k) <- get the matrix size from ds;
        ds <- append a column of mu-mean, unit-variance noise as the (k+1)th feature;
        Run JCFO(ds);
        If theta(k+1) = 0
            counter <- counter + 1;
        End
    End
End
Display(counter/150)

2.3 Redundancy

NOTES: In this testing unit, our basic datasets come from the randomly generated, linearly

separable datasets or Gaussian distributions. Each feature is normalized to zero-mean, unit

variance.










2.3.1 Oracle Features


counter <- 0;
For 1:100
    ds <- load dataset;
    (n,k) <- get the matrix size from ds;
    ds <- append the class label as the (k+1)th feature;
    Run JCFO(ds);
    If theta(k+1) > 0 and theta(1 through k) = 0
        counter <- counter + 1;
    End
End
Display(counter/100)

2.3.2 Duplicate Features

counter <- 0;
For 1:50
    ds <- load dataset;
    (n,k) <- get the matrix size from ds;
    For i <- 1:k
        ds <- replicate and append the ith feature;
        Run JCFO(ds);
        If theta(i) = 0 or theta(k+1) = 0
            counter <- counter + 1;
        End
    End
End
Display(counter/(50*k))

2.3.3 Similar Features

2.3.3.1

counter <- 0;
For 1:50
    ds <- load dataset;
    (n,k) <- get the matrix size from ds;
    For i <- 1:k
        ds <- mix the ith feature with unit-variance noise and append it;
        Run JCFO(ds);
        If theta(i) = 0 or theta(k+1) = 0
            counter <- counter + 1;
        End
    End
End
Display(counter/(50*k))

2.3.3.2










mu1 <- [10 0 0];
mu2 <- [0 10 0];
sig1 <- [7 0 0; 0 1 0; 0 0 1];
sig2 <- [1 0 0; 0 7 0; 0 0 1];
counter <- 0;
For 1:150
    ds <- randomly generate 200 samples from the two Gaussians with class labels;
    Run JCFO(ds);
    If either Feature X or Feature Y has been removed
        counter <- counter + 1;
    End
End
Display(counter/150)

2.3.4 Gradually More Informative Features

counter <- 0;
For k = 5, 10, 15
    For 1:100
        mu1 <- a vector of k zeros;
        mu2 <- [10 20 30 ... 10*k];
        sig <- (10*k/4) * (k-by-k identity matrix);
        ds <- randomly generate 200 samples from the two Gaussians with class labels;
        Run JCFO(ds);
        If theta(k) > 0 and theta(1 through k-1) = 0
            counter <- counter + 1;
        End
    End
End
Display(counter/300)

2.3.5 Indispensable Features

mu1 <- [5 5 0];
mu2 <- [8 8 0];
sig <- [4 0 0; 0 1 0; 0 0 1];
counter <- 0;
For 1:150
    ds <- randomly generate 200 samples from the two Gaussians with class labels,
          then rotate them 45 degrees clockwise around the class centers;
    Run JCFO(ds);
    If theta(1) > 0 and theta(2) > 0
        counter <- counter + 1;
    End
End
Display(counter/150)










LIST OF REFERENCES

W. Beyer. CRC Standard Mathematical Tables, 25th ed. CRC Press, 1979.

J. Bi, K.P. Bennett, M. Embrechts, C.M. Breneman, and M. Song. Dimensionality reduction via sparse
support vector machines. JMLR, 3: 1229-1243, 2003.

M. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 25(9): 1150-1159, 2003.

I. Guyon and Andre Elisseeff. An introduction to variable and feature selection. JMLR, 3:1157-1182,
2003.

T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning. Springer Verlag, 2001.

R. Herbrich. Learning Kernel Classifiers. MIT Press, 2002.

B. Krishnapuram, A.J. Hartemink and M.A.T. Figueiredo. A Bayesian approach to joint feature selection
and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):
1105-1111, 2004.

B. Krishnapuram, D. Williams, Y. Xue, A. Hartemink, L. Carin and M. Figueiredo. On semi-supervised
classification. In L. K. Saul, Y.Weiss and L. Bottou, editors, Advances in Neural Information
Processing Systems 17. MIT Press, 2005.

B. Scholkopf and A.J. Smola. Learning with Kernels: Regularization, Optimization, and Beyond. MIT
Press, 2002.

R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Statistical Soc. (B), 58: 267-288,
1996.

M. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen and K. R. Muller, editors, Advances
in Neural Information Processing Systems 11, pp. 218-224. MIT Press, 2000.









BIOGRAPHICAL SKETCH


Fan Mao was born in Chengdu, China, in 1983. He received his bachelor's degree in computer science and technology from Shanghai Maritime University, Shanghai, China, in 2006. He then came to the University of Florida, Gainesville, FL. In December 2007, he received his M.S. in computer science under the supervision of Dr. Paul Gader.





PAGE 1

A STUDY OF JOINT CLASSIFIER AND FEATURE OPTIMIZATION: THEORY AND ANALYSIS By FAN MAO A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLOR IDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE UNIVERSITY OF FLORIDA 2007 1

PAGE 2

2007 Fan Mao 2

PAGE 3

To my Mom, for her enormous patience and unfailing love. 3

PAGE 4

ACKNOWLEDGMENTS I would like to thank my th esis adviser, Dr. Paul Gade r, for his encouragement and valuable adviceboth on the general research direc tion and the specific experi mental details. As a starter, I sincerely appreciate the opportunity of getting th e direct help from an expert in this field. I also thank Dr. Arunava Banerjee and Dr. Jose ph Wilson for being my committee members and reading my thesis. Special thanks go to Xuping Zhang. We had a lot of interesting disc ussions in the last half year, many of which greatly inspired me a nd turned out to be on my experiment designs. 4

PAGE 5

TABLE OF CONTENTS page ACKNOWLEDGMENTS ...............................................................................................................4 LIST OF TABLES ...........................................................................................................................7 LIST OF FIGURES .........................................................................................................................8 LIST OF SYMBOLS .......................................................................................................................9 ABSTRACT ...................................................................................................................................11 CHAPTER 1 VARIABLE SELECTION VIA JCFO...................................................................................12 1.1 Introduction ..................................................................................................................12 1.2 Origin ...........................................................................................................................12 1.2.1 LASSO ...............................................................................................................12 1.2.2 The Prior of ...................................................................................................14 1.3 JCFO ............................................................................................................................16 1.4 EM Algorithm for JCFO ..............................................................................................17 1.5 Complexity Analysis ....................................................................................................19 1.6 Alternative Approaches ...............................................................................................21 1.6.1 Sparse SVM .......................................................................................................21 1.6.2 Relevance Vector Machine ................................................................................23 2 ANALYSIS OF JCFO............................................................................................................24 2.1 Brief Introduction .........................................................................................................24 2.2 Irrelevancy ...................................................................................................................24 2.2.1 Uniformly Distributed Features .........................................................................25 2.2.2 Non-informative Noise Features ........................................................................26 2.3 Redundancy ..................................................................................................................27 2.3.1 Oracle Features ..................................................................................................27 2.3.2 Duplicate Features .............................................................................................28 2.3.3 Similar Features .................................................................................................29 2.3.4 Gradually More Informative Features ...............................................................31 2.3.5 Indispensable Features .......................................................................................33 2.4 Nonlinearly Separable Datasets 
...................................................................................35 2.5 Discussion and Modification .......................................................................................37 2.6 Conclusion ...................................................................................................................38 APPENDIX A SOME IMPORTANT DERIVATIONS OF EQUATIONS...................................................40 5

PAGE 6

B THE PSEUDOCODES FOR EXPERIMENTS DESIGN......................................................43 LIST OF REFERENCES...............................................................................................................46 BIOGRAPHICAL SKETCH.........................................................................................................47 6

PAGE 7

LIST OF TABLES Table page 1.1. Comparisons of computation time of and other parts.......................................................21 2.1. Feature divergence of HH.................................................................................................. .....25 2.2. Feature divergence of Crabs............................................................................................... ....25 2.3. Uniformly distributed feature in Crabs...................................................................................2 6 2.4. Uniformly distributed feature in HH.......................................................................................26 2.5. Non-informative Noise in HH and Crabs...............................................................................26 2.6. Oracle feature............................................................................................................ ..............28 2.7. Duplicate feature weights on HH........................................................................................... .28 2.8. Duplicate feature weights on Crabs........................................................................................ 28 2.9. Percentage of either two identical f eatures weights being set to zero in HH........................29 2.10. Percentage of either two identical features weights being set to zero in Crabs...................29 2.11. Five features............................................................................................................ ..............32 2.12. Ten features............................................................................................................. ..............32 2.13. Fifteen features......................................................................................................................32 2.14. Comparisons of JCFO, non-kernelized ARD and kernelized ARD.....................................35 2.15. Comparisons of JCFO and kernelized ARD with an added non-informative irrelevant feature................................................................................................................................35 2.16. Comparisons of JCFO and kernelized ARD with an added similar redundant feature........36 2.17. Comparisons of the three ARD methods..............................................................................38 7

PAGE 8

LIST OF FIGURES Figure page 1.1. Gaussian (dotted) vs. Laplacian (solid) prior..........................................................................15 1.2. Logarithm of gamma distribution. From t op down: a=1e-2 b=1e2; a=1e-3 b=1e3; a=1e-4 b=1e4.....................................................................................................................23 2.1. Two Gaussian classes that can be classified by either x or y axes.........................................30 2.2. Weights assigned by JCFO (from top to down: x, y and z axis)............................................30 2.3. Weights assigned by ARD (from top to down: x, y and z axis).............................................31 2.4. Two Gaussians that can only be classified by both x and y axes...........................................33 2.5. Weights assigned by JCFO (from top to down: x, y and z axis)............................................34 2.6. Weights assigned by ARD (from top to down: x, y and z axis).............................................34 2.7. Cross data................................................................................................................ ................36 2.8. Ellipse data.............................................................................................................. ................37 8

PAGE 9

LIST OF SYMBOLS Symbols Meanings ix the i th input object vector i j x the j th element of the i th input object vector weight vector p p-norm I identity matrix 0 zero vector (|0,1) v N 0 mean unit variance nor mal distribution evaluated on v sgn() sign function H design matrix () positive part operator () z Gaussian cdf (|0,1)z x dxN [] z zs expectation ()t the estimate of in the t th iteration () O big O notation of complexity element-wise Hadamard matrix multiplication 9

PAGE 10

LIST OF ABBREVIATIONS JCFO: Joint Classifier and Feature Optimization LASSO: Least Absolute Shrinkage and Selection Operator RBF: Radial Basis Function SVM: Support Vector Machine RVM: Relevance Vector Machine ARD: Automatic Relevance Determination 10

PAGE 11

Abstract of Thesis Presen ted to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science A STUDY OF JOINT CLASSIFIER AND FEATURE OPTIMIZATION PAGE: THEORY AND ANALYSIS By Fan Mao December 2007 Chair: Paul Gader Major: Computer Engineering Feature selection is a major focus in modern day processing of high dimensional datasets that have many irrelevant and redundant features and variables, such as text mining and gene expression array analysis. An id eal feature selection algorithm ex tracts the most representative features while eliminating other non-informativ e ones, which achieves feature significance identification as well as the computational effici ency. Furthermore, feature selection is also important in avoiding over-fitti ng and reducing the generalization error in regression estimation where feature sparsity is prefe rred. In this paper, we provide a thorough analysis of an existing state-of-the-art Bayesian f eature selection method by showi ng its theoretical background, implementation details and computational complexity in the first half. Then in the second half we analyze its performance on se veral specific experiments we design with real and synthetic datasets, pointing out its certa in limitations in practice a nd finally giving a modification. 11

PAGE 12

CHAPTER 1 VARIABLE SELECTION VIA JCFO 1.1 Introduction Joint Classifier and Feature Optimization (JCFO) was first intr oduced by (Krishnapuram et al, 2004). It's directly in spired by (Figureiredo, 2003) for achieving sparsity in feature selections, by driving feature weights to zero. Compared to the tradit ional ridge regression, which shrinks the range of the regression paramete rs (or simply the weight coefficients of a linear classifier) but rarely sets them directly to zero, JCFO inherits the spirit of LASSO (Tibshirani, 1996) for driving some of the parame ters exactly to zero, which is equivalent to removing those corresponding featur es. In this way, it is claimed that JCFO eliminates redundant and irrelevant features. The remainder of this ch apter will be arranged as follows: Section 2 takes a close look at how this idea was derived; Section 3 gives the ma thematical structure of JCFO; Section 4 and 5 further illustrate an EM algorithm which is used to deri ve the learning algorithm for JCFO and analyze its complexity. The last pa rt briefly introduces two other approaches that have a similar functionality. 1.2 Origin 1.2.1 LASSO Suppose we have a data set {()},1,2,..., Z yi i ix, n where 12(,,...)iiii k x xx x is the i th input variable vector whose elements are called features and y i are the responses or the class labels (this thesis will only consider {0,1} y i ). The ordinary least squares regression would be to find 12(,,...,)T k which minimizes 2 11(Nk i ij ijyx )j (1.1) 12

PAGE 13

The solution is the well-known original least squares estimate As mentioned in (Hastie et al, 2001), it ofte n has low bias but large variance when 01 ()T XXXy T T X X is invertible, which tends to incur over-fitting. To improve th e prediction accuracy we usually shrink or set some elements of to zero in order to achieve parameter sparsity. If this is appropriately done, we actually also employ an implicit feature select ion process on our dataset getting rid of the insignificant features and extracting those that are more informative. In addition to simplifying the structure of our estimation functions, according to (Herbrich, 2002), the sparsity of weight vector also plays a key role in controlling the gene ralization error which we will discuss in the next section. The ridge regr ession as a revised approach that penalizes the large j would be 2 111argmin()Nkk i ijj ijjyx 2 j (1.2) Here is a shrinking coefficient that ad justs the ratio of the squared -norm of 2 to the residual sum squares in our objectiv e function. The solution of (1.2) is 10 (1) (Tibshirani, 1996) where depends on and This shows ridge regression reduces X to a fraction of 0 but rarely sets their elements exactly to zero hence can't completely obtain the goal of feature selection. For an alternative approach, (Tib shirani, 1996) proposed changing (1.2) to 2 111()Nkk i ijj ijjyx j (1.3) Or equivalently, 1 T TT y Xy X (1.4) 1 denotes the -norm. This is called Least Absolute Shrinkage and Selection Operator (LASSO). To see why -norm favors more sparsity of 1 1 note that 13

PAGE 14

2 2(1/2,1/2)(1,0)1 but 1 1(1/2,1/2)2(1,0)1 .Hence it tends to set more elements exactly to zero. By (Tibshirani, 1996), the solution of (1.4) for orthogonal design X is 00 sgn()()jjj (1.5) where denotes the sign function and sgn() () a is defined as () aa if ; 0 otherwise. 0 a depends on Intuitively, can be seen like a thres hold that filters those 0j s that are below a certain range and truncates them off. This is a very nice attribute by which our purpose of getting sparsity has been fulfilled. 1.2.2 The Prior of The ARD ( Automatic Relevance Determination ) approach proposed by (Figureiredo, 2003) inherits this idea of promoting sparsity from LASSO through a Bayesian approach. It considers the regression functional h as linear with respect to so our estimate function would be 1(,)()()d T jj jfhx x hx (1.6) where h ( x) could be a vector of linea r transformations of x, nonlinea r fixed basis functions or kernel functions which consist of a so-called design matrix H such that Further, it assumes that the error is a zero mean Gaussian, ,()ijjiHhx ()T iy hx 2(0,) N Hence the likelihood would be 2(|)(|,) p y yH I N (1.7) where I is a identity matrix. Note that since we assume samples are i.i.d. Gaussian, theres no correlation among them. More importantly, th is method also assigns a Laplacian prior to kk : 1 1(|)exp||exp{} 22k k i ip 14

The influence that a prior exerts on the sparsity of β was discussed in (Herbrich, 2002), where a prior β ~ N(0, I_k) was used to illustrate that its log-density is proportional to -\sum_{i=1}^{k}\beta_i^{2}, whose highest value is attained at β = 0. To compare the Gaussian and Laplacian priors, we plot both density functions in Figure 1.1; the latter is much more peaked at the origin and therefore favors more of β's elements being zero.

Figure 1.1. Gaussian (dotted) vs. Laplacian (solid) prior.

The MAP estimate of β is given by

\hat{\beta} = \arg\min_{\beta}\Big\{\|\mathbf{y} - H\beta\|_{2}^{2} + 2\sigma^{2}\sqrt{\gamma}\,\|\beta\|_{1}\Big\}    (1.8)

It is easy to see that this has essentially the same form as (1.3). If H is an orthogonal matrix, (1.8) can be solved separately for each β_i (see Appendix A for the detailed derivation):

\hat{\beta}_i = \arg\min_{\beta_i}\Big\{\beta_i^{2} - 2\beta_i(H^{T}\mathbf{y})_i + 2\sigma^{2}\sqrt{\gamma}\,|\beta_i|\Big\} = \operatorname{sgn}\big((H^{T}\mathbf{y})_i\big)\Big(\big|(H^{T}\mathbf{y})_i\big| - \sigma^{2}\sqrt{\gamma}\Big)_{+}    (1.9)

Unfortunately, in the general case β_i cannot be solved for directly from (1.8), owing to the non-differentiability of the objective at the origin.

As a modification, (Figueiredo, 2003) presents a hierarchical-Bayes view of the Laplacian prior, showing that it is nothing but a two-level Bayes model: zero mean with independent, exponentially distributed variance, p(β_i | τ_i) = N(β_i | 0, τ_i) and p(τ_i | γ) = (γ/2) exp{-(γ/2)τ_i}, such that

p(\beta_i\mid\gamma) = \int_{0}^{\infty} p(\beta_i\mid\tau_i)\,p(\tau_i\mid\gamma)\,d\tau_i = \frac{\sqrt{\gamma}}{2}\exp\{-\sqrt{\gamma}\,|\beta_i|\}    (1.10)

Here τ_i can be treated as a hidden variable handled by an EM algorithm, while γ is the real hyper-parameter we need to specify. (This integration can be found in Appendix A.)

1.3 JCFO

From the same Bayesian viewpoint, JCFO (Krishnapuram et al., 2004) applies a Gaussian cumulative distribution function (the probit link) to (1.6) to obtain a probability measure of how likely an input object x is to belong to class c in {0, 1}. More precisely,

P(y = 1\mid\mathbf{x}) = \Phi\Big(\beta_0 + \sum_{i=1}^{N}\beta_i K_{\theta}(\mathbf{x}, \mathbf{x}_i)\Big)    (1.11)

where \Phi(z) = \int_{-\infty}^{z}\mathcal{N}(z'\mid 0, 1)\,dz'. (Note that Φ(-z) = 1 - Φ(z).) K_θ(x, x_j) is a symmetric measure of the similarity of two input objects. For example,

K_{\theta}(\mathbf{x}, \mathbf{x}_j) = \Big(1 + \sum_{i=1}^{k}\theta_i\,x_i\,x_{ji}\Big)^{r}    (1.12)

for polynomial kernels, and

K_{\theta}(\mathbf{x}, \mathbf{x}_j) = \exp\Big(-\sum_{i=1}^{k}\theta_i^{2}\,(x_i - x_{ji})^{2}\Big)    (1.13)

for radial basis functions (the good attributes of RBF kernels will be discussed in Section 2.4).
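As a concrete illustration of (1.13), the sketch below (Python/NumPy, not the thesis's Matlab implementation; the function name weighted_rbf_kernel is hypothetical) builds the kernel matrix with one scaling parameter θ_i per feature. Setting a θ_i to zero removes feature i from every kernel evaluation, which is exactly the mechanism JCFO relies on for feature selection.

import numpy as np

def weighted_rbf_kernel(A, B, theta):
    """K(a, b) = exp(-sum_i theta_i^2 (a_i - b_i)^2), as in (1.13)."""
    diff = A[:, None, :] - B[None, :, :]              # pairwise feature differences
    return np.exp(-np.sum((theta ** 2) * diff ** 2, axis=2))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))                       # 5 samples, 3 features
theta = np.array([1.0, 0.5, 0.0])                     # feature 3 carries zero weight

K = weighted_rbf_kernel(X, X, theta)
# Dropping the third feature yields the identical kernel matrix:
K_dropped = weighted_rbf_kernel(X[:, :2], X[:, :2], theta[:2])
print(np.allclose(K, K_dropped))                      # True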

An apparent disadvantage of (1.6) is that if we do not choose h as a linear function of x (or simply the input vector x itself), the calculation of β amounts to selecting the kernel functions of x rather than selecting the features of x. This is also why it is conceptually an ARD method. As a major modification that selects each feature explicitly, JCFO assigns each x_i a corresponding parameter θ_i in (1.12) and (1.13). Since the same θ_i is applied to the i-th element of every input x_j, this is equivalent to weighing the significance of the i-th feature of the input objects; this is how feature selection is incorporated. Another point worth mentioning is that all θ_i should be non-negative, because K_θ(x, x_j) measures the similarity of two objects as a whole, i.e. it accumulates the differences of each pair of corresponding elements. Had θ_i = -θ_j been allowed in (1.13), the differences of the i-th and j-th elements of the two objects could have cancelled each other. Then, just as for the β_i, each θ_k is given a Laplacian prior as well:

p(\theta_k\mid\tau_k) = \begin{cases} 2\,\mathcal{N}(\theta_k\mid 0, \tau_k) & \text{if } \theta_k \ge 0 \\ 0 & \text{if } \theta_k < 0 \end{cases}    (1.14)

where p(τ_k | γ_2) = (γ_2/2) exp{-(γ_2/2)τ_k}, so that

p(\theta_k\mid\gamma_2) = \int_{0}^{\infty} p(\theta_k\mid\tau_k)\,p(\tau_k\mid\gamma_2)\,d\tau_k = \begin{cases} \sqrt{\gamma_2}\,\exp(-\sqrt{\gamma_2}\,\theta_k) & \text{if } \theta_k \ge 0 \\ 0 & \text{if } \theta_k < 0 \end{cases}    (1.15)

1.4 EM Algorithm for JCFO

Now, from the i.i.d. assumption on the training data, its likelihood function becomes

p(D\mid\beta, \theta) = \prod_{i=1}^{N}\Phi\Big(\beta_0 + \sum_{j=1}^{N}\beta_j K_{\theta}(\mathbf{x}_i, \mathbf{x}_j)\Big)^{y_i}\Big[1 - \Phi\Big(\beta_0 + \sum_{j=1}^{N}\beta_j K_{\theta}(\mathbf{x}_i, \mathbf{x}_j)\Big)\Big]^{1 - y_i}    (1.16)
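For completeness, a small sketch of how (1.11) and (1.16) are evaluated once β and θ are given (Python with SciPy; the β and θ values here are arbitrary placeholders rather than trained parameters, and the kernel is the same illustrative routine as in the previous sketch):

import numpy as np
from scipy.stats import norm

def weighted_rbf_kernel(A, B, theta):
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-np.sum((theta ** 2) * diff ** 2, axis=2))

def class_probability(x, X_train, beta0, beta, theta):
    """P(y = 1 | x) from (1.11): probit link applied to the kernel expansion."""
    k = weighted_rbf_kernel(x[None, :], X_train, theta)[0]   # K(x, x_i), i = 1..N
    return norm.cdf(beta0 + k @ beta)

def log_likelihood(y, X_train, beta0, beta, theta):
    """Logarithm of (1.16) evaluated on the training set itself."""
    p = np.array([class_probability(x, X_train, beta0, beta, theta) for x in X_train])
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(1)
X_train = rng.standard_normal((20, 4))
y = (X_train[:, 0] + X_train[:, 1] > 0).astype(float)
beta0, beta = 0.0, rng.standard_normal(20) * 0.1             # placeholder parameters
theta = np.ones(4)
print(class_probability(X_train[0], X_train, beta0, beta, theta))
print(log_likelihood(y, X_train, beta0, beta, theta))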

Rather than maximizing this likelihood directly, JCFO derives an EM algorithm by first introducing a latent variable z(x, β, θ) = h_θ(x)^T β + ε, where h_θ(x) = [1, K_θ(x, x_1), ..., K_θ(x, x_N)]^T and ε ~ N(0, 1). It then treats z = [z_1, z_2, ..., z_N]^T as the missing variables whose expectations are calculated in the E step.

E step. The complete log-posterior is

\log p(\beta, \theta\mid D, \mathbf{z}, \tau, \lambda) = \log p(\mathbf{z}\mid D, \beta, \theta) + \log p(\beta\mid\tau) + \log p(\theta\mid\lambda) \propto -(\mathbf{z} - H\beta)^{T}(\mathbf{z} - H\beta) - \beta^{T}\Upsilon\beta - \theta^{T}R\,\theta    (1.17)

where Υ = diag(τ_0^{-1}, τ_1^{-1}, ..., τ_N^{-1}) and R = diag(λ_1^{-1}, ..., λ_k^{-1}). We need the expectation of (1.17) with respect to the missing variables, i.e. the Q function Q(β, θ | β^{(t)}, θ^{(t)}). The required expectations are

v_i^{(t)} \equiv E\big[z_i\mid D, \beta^{(t)}, \theta^{(t)}\big] = \mathbf{h}_{\theta^{(t)}}(\mathbf{x}_i)^{T}\beta^{(t)} + \frac{(2y_i - 1)\,\mathcal{N}\big(\mathbf{h}_{\theta^{(t)}}(\mathbf{x}_i)^{T}\beta^{(t)}\mid 0, 1\big)}{\Phi\big((2y_i - 1)\,\mathbf{h}_{\theta^{(t)}}(\mathbf{x}_i)^{T}\beta^{(t)}\big)}    (1.18)

together with

E\big[\tau_i^{-1}\mid D, \beta^{(t)}, \theta^{(t)}\big] = \sqrt{\gamma_1}\,\big|\beta_i^{(t)}\big|^{-1} \quad\text{and}\quad E\big[\lambda_i^{-1}\mid D, \beta^{(t)}, \theta^{(t)}\big] = \sqrt{\gamma_2}\,\big(\theta_i^{(t)}\big)^{-1}

where β_i^{(t)} denotes the estimated value of β_i in the t-th iteration. (The derivation of (1.18) can be found in Appendix A.)
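The truncated-Gaussian expectation (1.18) has a simple closed form; the sketch below (Python/SciPy, only an illustration of the formula and not the authors' code) computes it and checks it against a Monte Carlo estimate for a single case with y_i = 1.

import numpy as np
from scipy.stats import norm

def e_step_v(H, beta, y):
    """E[z_i | D, beta, theta] from (1.18); H has rows h_theta(x_i)^T."""
    mu = H @ beta                       # h_theta(x_i)^T beta
    s = 2 * y - 1                       # +1 for y_i = 1, -1 for y_i = 0
    return mu + s * norm.pdf(mu) / norm.cdf(s * mu)

# Monte Carlo check for a single mu and y = 1 (z truncated to z >= 0):
rng = np.random.default_rng(0)
mu = 0.7
z = mu + rng.standard_normal(2_000_000)
print(z[z >= 0].mean())                                                  # empirical mean
print(e_step_v(np.array([[1.0]]), np.array([mu]), np.array([1]))[0])     # closed form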

M step. With the expectations of the missing variables at hand, we can apply MAP to the Q function. After dropping all terms irrelevant to β and θ and collecting the expectations into v = [v_1, ..., v_N]^T and the diagonal matrices Υ^{(t)} and R^{(t)}, we get

Q(\beta, \theta\mid\beta^{(t)}, \theta^{(t)}) = 2\beta^{T}H^{T}\mathbf{v} - \beta^{T}H^{T}H\beta - \beta^{T}\Upsilon^{(t)}\beta - \theta^{T}R^{(t)}\theta    (1.19)

Taking the derivatives of (1.19) with respect to β and θ_k gives

\frac{\partial Q}{\partial\beta} = 2H^{T}\mathbf{v} - 2\big(H^{T}H + \Upsilon^{(t)}\big)\beta    (1.20)

\frac{\partial Q}{\partial\theta_k} = 2(\mathbf{v} - H\beta)^{T}\frac{\partial H}{\partial\theta_k}\beta - 2\big(R^{(t)}\theta\big)_k    (1.21)

where, for the RBF kernel (1.13), ∂H/∂θ_k = -2θ_k Δ_k ∘ H with (Δ_k)_{ij} = (x_{ik} - x_{jk})² and ∘ the element-wise Hadamard matrix product. Now, by jointly maximizing (1.19) with respect to both β and θ, we not only select the kernel functions but also the features of the input objects, provided both β and θ are parsimonious. This is the core motivation of JCFO. Something that needs to be stressed is that β can be solved for directly in each iteration, while θ can only be computed by an approximation method. This increases the computational complexity and, even worse, affects the sparsity of θ. We return to this issue repeatedly in Chapter 2.

1.5 Complexity Analysis

What is the complexity of the above EM algorithm in the general case? Since in each iteration the algorithm alternately calculates β and θ, we investigate the two scenarios separately.

Complexity of estimating β. Let us first look at the search for β, i.e. setting (1.20) to zero:

\hat{\beta} = \big(H^{T}H + \Upsilon\big)^{-1}H^{T}\mathbf{v}    (1.22)

i) If kernel functions are applied, computing the kernel (inner-product) matrix costs O(kN²), while inversion and multiplication of an N x N matrix in Matlab cost O(N^c) with 2 < c <= 3 and O(N³), respectively. Therefore the whole complexity of (1.22) is O(N^c + kN² + N³); for low-dimensional datasets with k < N this simplifies to O(N³).

ii) If only linear regression is used, there is no need to compute H, so the complexity is determined by the larger of the matrix inversion and the multiplication, i.e. O(max(N^c, N²k)) with 2 < c <= 3, or O(N^c) if we assume N^c >= kN². Compared with the kernel version this does not seem to reduce much computation; however, once the constants hidden in the O-notation are taken into account, the difference is conspicuous. Moreover, without the kernel we avoid the calculation of θ altogether, which saves enormous computation time, as we show in the subsequent section.

As an improvement, we can exploit the fact that after a few iterations more and more β_j vanish, so there is no point in re-calculating their corresponding kernel functions K_θ(x_i, x_j), i = 1, 2, ..., N. Hence we can delete the corresponding row of H and shrink the size of the design matrix significantly. The same trick can be applied to θ, especially for high-dimensional datasets, to alleviate the computational pressure incurred by kernel methods.

Complexity of estimating θ. After obtaining the value of β in the current iteration, θ is found by plugging this value into (1.21) and approximating the maximizer numerically (the authors of JCFO choose a conjugate gradient method).

Depending on the dataset, the number of iterations required to approach the optimum of the target function varies remarkably, which makes it difficult to quantify the exact time complexity of computing θ. Here we provide a comparison of the time used to compute θ with that of the other computations, averaged over 50 and 100 runs of JCFO on two datasets. The JCFO Matlab source code is from Krishnapuram, one of the inventors of JCFO. The two hyper-parameters are chosen as γ_1 = 0.002 and γ_2 = 4; the θ_i are initialized to 1/k, where k is the number of features.

Table 1.1. Comparison of the computation time for θ with that of the other parts.
Part     Statistic   HH (50 runs), sec   Crabs (100 runs), sec
θ        mean        200.3               16.27
θ        std         55.70               0.4921
Other    mean        129.7               9.524
Other    std         37.97               0.3577

We can see that nearly two thirds of the computation time is spent on θ. Furthermore, by choosing a conjugate gradient method, as the authors do with fmincon in Matlab, we cannot assume that the optimum is attained by a parsimonious θ. In other words, since the optimization itself is not sparsity-oriented, enforcing this requirement sometimes makes it terminate badly. This is a major weakness of JCFO. To verify this hypothesis, we examine the behavior of θ more closely in Chapter 2.

1.6 Alternative Approaches

1.6.1 Sparse SVM

In the following two sections we briefly discuss two other approaches derived from the LASSO that have very similar properties to JCFO. (B. Schölkopf and A.J. Smola, 2002) introduce an ε-insensitive SVM (the ν-SVR) with the following objective function:

\min_{\mathbf{w}, \xi^{(*)}, \varepsilon, b}\ \frac{1}{2}\|\mathbf{w}\|^{2} + C\Big(\nu\varepsilon + \frac{1}{N}\sum_{i=1}^{N}(\xi_i + \xi_i^{*})\Big)
\text{s.t.}\quad (\langle\mathbf{w}, \mathbf{x}_i\rangle + b) - y_i \le \varepsilon + \xi_i,\qquad y_i - (\langle\mathbf{w}, \mathbf{x}_i\rangle + b) \le \varepsilon + \xi_i^{*},\qquad \xi_i^{(*)} \ge 0,\ \varepsilon \ge 0.    (1.23)

From the constraints we know that a sample falling inside the ε-tube incurs no penalty in the objective function. The slack ξ_i^{(*)} denotes the actual distance of a point from the tube; C is a coefficient that controls the balance between the total ξ_i^{(*)} distance and the norm of the weight vector, both of which we want to minimize, or, more generally, the balance between sparsity and prediction accuracy. The parameter ν in [0, 1] acts as an upper bound on the fraction of errors (the points lying outside the tube) and as a lower bound on the fraction of SVs. Taking derivatives with respect to w, ξ^{(*)}, ε and b in the Lagrangian of (1.23) and setting them to zero yields the dual

\max_{\alpha^{(*)}\in\mathbb{R}^{N}}\ W(\alpha^{(*)}) = \sum_{i=1}^{N}(\alpha_i^{*} - \alpha_i)y_i - \frac{1}{2}\sum_{i,j=1}^{N}(\alpha_i^{*} - \alpha_i)(\alpha_j^{*} - \alpha_j)K(\mathbf{x}_i, \mathbf{x}_j)
\text{s.t.}\quad \sum_{i=1}^{N}(\alpha_i - \alpha_i^{*}) = 0,\qquad \alpha_i^{(*)}\in\Big[0, \frac{C}{N}\Big],\qquad \sum_{i=1}^{N}(\alpha_i + \alpha_i^{*}) \le C\nu.    (1.24)

This is the standard form of the ν-SVM. Since points inside the tube cannot become SVs, it intuitively attains more weight-vector sparsity than the ordinary SVM by setting more of the α_i^{(*)} to zero. (J. Bi et al., 2003) propose a modification that adds a term \sum_{i=1}^{N}\alpha_i to the loss function (this revision is called the sparse SVM) in order to obtain an even more parsimonious set of SVs. This also inherits the idea from the LASSO. Of course, the optimization involves a search over both of the priors C and ν.

1.6.2 Relevance Vector Machine

The Relevance Vector Machine (RVM) (M. Tipping, 2000) has a design structure very similar to JCFO's. Indeed, it also assumes a Gaussian prior on β, β ~ N(0, Υ), and p(t | X, β) ~ N(Xβ, σ²I), where Υ = diag(υ_1, ..., υ_k); the υ_i can be regarded as the hyper-parameters of β, and σ² is assumed known. These assumptions parallel (1.7) and (1.14). However, instead of computing the expectations of τ and λ and plugging them back in to maximize the posterior, as JCFO does in (1.17) and (1.19), the RVM integrates β out to obtain the marginal likelihood

p(\mathbf{t}\mid\upsilon, \sigma^{2}) = \int p(\mathbf{t}\mid X, \beta, \sigma^{2})\,p(\beta\mid\upsilon)\,d\beta = (2\pi)^{-m/2}\big|\sigma^{2}I + X\Upsilon X^{T}\big|^{-1/2}\exp\Big(-\tfrac{1}{2}\mathbf{t}^{T}\big(\sigma^{2}I + X\Upsilon X^{T}\big)^{-1}\mathbf{t}\Big)    (1.25)

We can then take derivatives with respect to the υ_i and σ² in order to maximize (1.25). Interestingly, the logarithm of the Gamma distribution Γ(υ_i | a, b) assigned by (Herbrich, 2002) as the prior of υ_i has the same sparsity-promoting effect on υ as the Laplacian prior used by JCFO. We plot the shape of this prior for different a, b in Figure 1.2.

Figure 1.2. Logarithm of the Gamma distribution. From top down: a = 1e-2, b = 1e2; a = 1e-3, b = 1e3; a = 1e-4, b = 1e4.
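As an illustration of (1.25), the sketch below evaluates the RVM log marginal likelihood for a given set of prior variances υ (Python/SciPy; the υ and σ² values are arbitrary, and an actual RVM would re-estimate them iteratively rather than evaluate them once as done here).

import numpy as np
from scipy.stats import multivariate_normal

def rvm_log_marginal(t, X, upsilon, sigma2):
    """log p(t | upsilon, sigma^2) from (1.25), with Upsilon = diag(upsilon)."""
    C = sigma2 * np.eye(len(t)) + X @ np.diag(upsilon) @ X.T
    return multivariate_normal(mean=np.zeros(len(t)), cov=C).logpdf(t)

rng = np.random.default_rng(0)
N, k = 30, 5
X = rng.standard_normal((N, k))
t = X @ np.array([1.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.standard_normal(N)

# Compare a sparse prior concentrated on the generating columns with a diffuse one:
print(rvm_log_marginal(t, X, np.array([1.0, 1e-6, 1e-6, 4.0, 1e-6]), 0.01))
print(rvm_log_marginal(t, X, np.ones(k), 0.01))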

CHAPTER 2
ANALYSIS OF JCFO

2.1 Brief Introduction

As the inventors of JCFO claim, its major objectives are twofold:

i. to learn a function that most accurately predicts the class of a new example (classifier design), and
ii. to identify a subset of the features that is most informative about the class distinction (feature selection).

In this chapter we investigate these two attributes of JCFO on several real and synthetic datasets. From Experiment 2.2.1 through 2.3.3.1, we train on the 220-sample, 7-feature HH landmine data and the 200-sample, 5-feature Crabs data as our basic datasets, both of which are separable by an appropriate classification algorithm, and we use 4-fold cross-validation. For the remaining experiments, we generate several special Gaussian distributions. For performance evaluation, we compare JCFO's prediction and feature selection abilities, using the RBF kernel, with those of the non-kernelized (linear) ARD method presented by (Figueiredo, 2003), which is its direct predecessor. Given our interest, we devote most of the effort to feature selection, primarily the elimination of irrelevant and redundant features. In each scenario we contrive a couple of small experiments that test a certain aspect; a more extensive result is then provided together with an explanation of how it was derived. The pseudocode of each experiment can be found in Appendix B.

2.2 Irrelevancy

Intuitively, if some features do not vary much from one class to another, meaning they are not sufficiently informative for differentiating the classes, we call them irrelevant features. To be more precise, it is convenient to introduce the between-class divergence:

d_k = \frac{1}{2}\Big(\frac{\sigma_1^{2}}{\sigma_2^{2}} + \frac{\sigma_2^{2}}{\sigma_1^{2}} - 2\Big) + \frac{1}{2}(\mu_1 - \mu_2)^{2}\Big(\frac{1}{\sigma_1^{2}} + \frac{1}{\sigma_2^{2}}\Big)

where k denotes the k-th feature and μ_i, σ_i² are the mean and variance of feature k in class i. Note that d varies inversely with the similarity of the two classes: the more similar the means and variances, the smaller d becomes. In the extreme case where the two class-conditional distributions of the k-th feature overlap completely, d drops to zero, which is equivalent to saying that the k-th feature is completely irrelevant. We use this concise measure for comparison with the feature weight θ_k assigned by JCFO in the following experiments. The divergences of each feature in HH and Crabs are given in Table 2.1 and Table 2.2.

Table 2.1. Feature divergence of HH
Feature      1        2        3         4        5        6        7
Divergence   0.3300   0.5101   32.7210   0.2304   1.3924   3.0806   3.9373

Table 2.2. Feature divergence of Crabs
Feature      1        2        3        4        5
Divergence   0.0068   0.3756   0.0587   0.0439   0.0326

2.2.1 Uniformly Distributed Features

The distribution of this kind of feature is exactly the same in the two classes; obviously, such features should be removed. We examine the performance of JCFO and ARD in eliminating a uniformly distributed feature by adding a constant column as a new feature to Crabs and HH. Testing results averaged over 50 runs are shown in Table 2.3 and Table 2.4. Only ARD completely sets the interfering feature weight to zero on both datasets, while JCFO fails to get rid of the invariant feature on HH. The time consumption of JCFO is, moreover, enormous.
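As a side note before the results tables, the kind of divergence values reported in Tables 2.1 and 2.2 can be computed with a routine along the lines of the sketch below (Python/NumPy; X and y stand for any dataset with binary labels, and the formula is the one defined at the start of this section).

import numpy as np

def between_class_divergence(X, y):
    """d_k for every feature k, using the divergence defined in Section 2.2."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    v0, v1 = X0.var(axis=0), X1.var(axis=0)
    return 0.5 * (v0 / v1 + v1 / v0 - 2) + 0.5 * (m0 - m1) ** 2 * (1 / v0 + 1 / v1)

# Toy check: a feature with well-separated class means gets a large divergence,
# while a class-independent noise feature gets a divergence near zero.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
informative = np.where(y == 0, 0.0, 5.0) + rng.standard_normal(200)
noise = rng.standard_normal(200)
X = np.column_stack([informative, noise])
print(np.round(between_class_divergence(X, y), 4))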

Table 2.3. Uniformly distributed feature in Crabs
                                              JCFO      ARD
Train error rate mean (%)                     2         4
Train error rate std (%)                      0         0
Test error rate mean (%)                      6         6
Test error rate std (%)                       0         0
Percentage of the noise weights set to zero   100%      100%
Running time                                  2 hours   1 min

Table 2.4. Uniformly distributed feature in HH
                                              JCFO      ARD
Train error rate mean (%)                     1.49      14.5
Train error rate std (%)                      1.48      2.8e-15
Test error rate mean (%)                      21.7      20
Test error rate std (%)                       4.63      8.5e-15
Percentage of the noise weights set to zero   0%        100%
Running time                                  4 hours   1 min

2.2.2 Non-informative Noise Features

While a uniformly distributed feature is rarely seen in realistic datasets, most irrelevancies hide under various kinds of random noise. We can emulate this by adding class-independent Gaussian noise as a feature to our datasets. Although different means and variances can be chosen, such a feature is still non-informative for classification. Our purpose is to see whether JCFO can remove this noise feature. Adding unit-variance Gaussian noise with mean 5, 10 or 15 each time as a new feature and training each configuration 50 times, we obtain an average divergence of 0.0001 for the noise feature. The testing results are given in Table 2.5.

Table 2.5. Non-informative noise in HH and Crabs
                                             JCFO               ARD
                                             HH       Crabs     HH       Crabs
Train error rate mean (%)                    1.55     1.82      14.55    2.40
Train error rate std (%)                     1.32     0.83      0.1      1.96
Test error rate mean (%)                     27.41    5.00      20.03    3.60
Test error rate std (%)                      6.1      1.83      0.47     2.95
Percentage of the noise weight set to zero   0        8.87      96       99.3

It can be seen that almost none of the corresponding noise weights are set to zero by JCFO, whereas ARD almost always sets them to zero. This implies that JCFO does not eliminate irrelevant Gaussian features well compared with the ARD method. As mentioned in Section 1.5, the likely reason is that, although τ in (1.10) and in (1.14) provide the same sparsity-promoting mechanism, in the implementation θ can only be computed approximately. This leaves the values of its elements tiny but still nonzero, and thus compromises sparsity. A similar situation is encountered in the next section. Moreover, JCFO does not beat ARD in terms of test error. This result can also be explained by the unfulfilled sparsity-related reduction of generalization error with respect to θ, since we know the two methods share the same functionality in the β part.

2.3 Redundancy

A redundant feature has a structure similar, but not necessarily identical, to that of another feature. Here we employ a well-known measure to assess the correlation between two features:

\rho_{ij} = \frac{\sum_{k=1}^{N} x_{ki}\,x_{kj}}{\sqrt{\sum_{k=1}^{N} x_{ki}^{2}\;\sum_{k=1}^{N} x_{kj}^{2}}}

where i and j denote the i-th and j-th features. The larger ρ is, the more correlated they are; the extreme case is ρ = 1, when they are identical.

2.3.1 Oracle Features

First, let us look at a peculiar example: if we insert the class label itself as a feature, it is obviously the most informative attribute for distinguishing the classes, and all the other features are redundant compared with it.

We check whether JCFO can directly select this oracle feature and get rid of all the others, i.e. whether only the θ_i corresponding to this feature is nonzero. We test JCFO and ARD 100 times on HH and Crabs respectively with the label feature added.

Table 2.6. Oracle feature
                                            JCFO   ARD
Train error rate (%)                        0      0
Only the label feature weight is nonzero?   Yes    Yes

The purpose of this toy experiment is to determine whether, as feature selection approaches, both methods can identify the most representative feature among other, relatively subtle ones. This should be considered the basic requirement for feature selection.

2.3.2 Duplicate Features

A duplicate feature is identical to another feature and is thus completely redundant; here the correlation of the two features is exactly 1. This experiment examines whether JCFO can identify a duplicate feature and remove it. We achieve this by replicating a feature of the original dataset and adding it back as a new feature. We repeat this process for each feature and test whether the feature or its replica is eliminated.

Table 2.7. Duplicate feature weights on HH
Method   Weight      1        2        3        4        5        6        7         Error (%) train/test
JCFO     Itself      0.2157   0.1438   0.3130   0.2134   0.4208   0.2502   0.2453    4.082/20.32
JCFO     Duplicate   0.2300   0.1520   0.3539   0.2142   0.4010   0.2454   0.2604
ARD      Itself      0        0        0        0        0        0        -0.0094   14.63/20.25
ARD      Duplicate   0        0        0        0        0        0.3172   0

Table 2.8. Duplicate feature weights on Crabs
Method   Weight      1   2         3        4   5        Error (%) train/test
JCFO     Itself      0   0.0981    0.1091   0   0.0037   2/6
JCFO     Duplicate   0   0.0846    0.1076   0   0.0032
ARD      Itself      0   -0.9558   0.9050   0   0        4/6
ARD      Duplicate   0   0         0        0   0

From the above tables, JCFO apparently assigns each feature approximately the same weight as its duplicate, while ARD sets either the feature or its duplicate (or both) to zero. Meanwhile, considering the test errors, we arrive at the situation of Section 2.2.2 again.

2.3.3 Similar Features

We now examine features that are not identical but highly correlated. This section has two parts.

2.3.3.1

This experiment is similar to the duplicate-feature experiment. We coin a new feature by replicating an original feature and mixing it with unit-variance Gaussian noise. Each feature is checked in turn to see whether it or its counterpart is removed, by running the test 50 times on HH and 100 times on Crabs and averaging how often either of their weights is set to zero.

Table 2.9. Percentage of runs in which either of the two similar features' weights is set to zero, HH (correlation ρ in parentheses)
Feature (ρ)   1 (0.4652)   2 (0.8951)   3 (0.9971)   4 (0.9968)   5 (0.9992)   6 (0.9992)   7 (0.1767)   Error (%) train/test
JCFO          30           30           20           10           10           10           20           3.853/21.4
ARD           100          100          100          100          99           100          92           14.51/20.11

Table 2.10. Percentage of runs in which either of the two similar features' weights is set to zero, Crabs (correlation ρ in parentheses)
Feature (ρ)   1 (0.9980)   2 (0.9970)   3 (0.9995)   4 (0.9996)   5 (0.9977)   Error (%) train/test
JCFO          90           10           30           90           50           2.013/4.840
ARD           100          91           97           99           100          4.001/5.976

The results on the two datasets again imply that JCFO cannot perform as well as ARD in terms of redundancy elimination.
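The correlation values reported in parentheses in Tables 2.9 and 2.10 correspond to the ρ defined in Section 2.3 between a feature and its noise-corrupted copy. A minimal sketch of that construction (Python/NumPy, illustrative only; the scale of the original feature is an arbitrary choice):

import numpy as np

def rho(a, b):
    """Correlation measure from Section 2.3 (features assumed zero-mean)."""
    return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))

rng = np.random.default_rng(0)
feature = rng.standard_normal(200) * 3.0          # an original feature
replica = feature + rng.standard_normal(200)      # same feature plus unit-variance noise

a = feature - feature.mean()
b = replica - replica.mean()
print(round(rho(a, b), 4))    # close to 1: the replica is highly redundant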

Figure 2.1. Two Gaussian classes that can be classified by either the x or the y axis.

2.3.3.2

In Figure 2.1 above, the two ellipsoids can be differentiated by either the x or the y variable, so either can be considered redundant with respect to the other; z, however, is completely irrelevant. We generate 200 samples each time, half in each class, and run both methods 150 times. We plot the weights corresponding to the three axis-features in the following figures.

Figure 2.2. Weights assigned by JCFO (from top to bottom: x, y and z axes).

Figure 2.3. Weights assigned by ARD (from top to bottom: x, y and z axes).

It can be seen that both methods discover the irrelevancy of the z axis, though ARD's performance is much better. As for the redundancy, unfortunately neither of them eliminates the x or the y axis. More interestingly, the weights assigned by ARD to x and y are almost identical in every run, while those assigned by JCFO are much more chaotic. Note that the redundancy here differs from that of the previous experiments, since it is more implicit from a machine's perspective. We discuss this further in Section 2.5.

2.3.4 Gradually More Informative Features

In this experiment we generate samples from two multivariate Gaussians whose corresponding mean elements have gradually larger differences as the feature index increases, e.g. μ_1 = (0, 0, ..., 0) and μ_2 = (10, 20, ..., 10k). The variance of each dimension is fixed to one fourth of the largest element difference, so in the above case it is 10k/4. By treating each dimension as a feature, we contrive a set of gradually more informative features, since the feature with the largest index has the most distant means and the least overlapping variances and is thus the most separable; compared with it, the rest could be deemed redundant. We run this experiment with 5-, 10- and 15-dimensional data. In each experiment we generate 200 samples and perform 4-fold cross-validation.

This process is repeated 50 times for JCFO and 100 times for ARD; we then check whether the most informative feature is kept while the others are deleted. Results are listed in the following three tables.

Table 2.11. Five features
          JCFO                            ARD
Feature   mean     std      zero (%)     mean      std      zero (%)
1         0.0001   0.0008   14           -0.0057   0.0072   57
2         0.0041   0.0013   0            -0.0352   0.0155   8
3         0.0090   0.0020   0            -0.0819   0.0168   0
4         0.0158   0.0021   0            -0.1486   0.0193   0
5         0.0256   0.0027   0            -0.2339   0.0224   0

Table 2.12. Ten features
          JCFO                            ARD
Feature   mean     std      zero (%)     mean      std      zero (%)
1         0        0        100          -0.0001   0.0010   95
2         0.0001   0.0003   80           -0.0026   0.0045   72
3         0.0011   0.0019   50           -0.0070   0.0082   52
4         0.0027   0.0028   34           -0.0187   0.0125   24
5         0.0040   0.0039   16           -0.0286   0.0145   14
6         0.0053   0.0049   6            -0.0460   0.0146   4
7         0.0099   0.0071   0            -0.0668   0.0171   0
8         0.0133   0.0075   0            -0.0856   0.0156   0
9         0.0144   0.0087   2            -0.1125   0.0190   0
10        0.0214   0.0106   0            -0.1356   0.0197   0

Table 2.13. Fifteen features
          JCFO                            ARD
Feature   mean     std      zero (%)     mean      std      zero (%)
1         0        0        100          -0.0002   0.0009   94
2         0.0001   0.0004   92           -0.0009   0.0024   86
3         0.0002   0.0006   84           -0.0022   0.0038   71
4         0.0009   0.0017   68           -0.0046   0.0063   61
5         0.0015   0.0027   62           -0.0062   0.0076   56
6         0.0021   0.0036   52           -0.0118   0.0099   36
7         0.0030   0.0050   44           -0.0143   0.0120   35
8         0.038    0.0050   30           -0.0242   0.0143   18
9         0.0044   0.0069   34           -0.0318   0.0156   10
10        0.0068   0.0089   30           -0.0413   0.0138   2
11        0.0057   0.0075   30           -0.0497   0.0157   2
12        0.0109   0.0120   16           -0.0588   0.0176   2
13        0.0117   0.0110   8            -0.0735   0.0159   0
14        0.0118   0.0136   20           -0.0845   0.0173   0
15        0.0151   0.0134   12           -0.0992   0.0193   0

The above tables indicate that, although both methods assign the largest weights to the most significant features, the remaining features are not removed but are instead ranked according to their significance. Generally, JCFO eliminates more redundant features in the second and third cases. However, Table 2.13 also shows that JCFO sometimes incorrectly deletes the most informative feature (in 12% of the runs for Feature 15). In some scenarios a feature selection method cannot completely identify the redundancy but instead provides a ranking of how much each feature contributes to the classification, which is also an acceptable alternative (I. Guyon, 2003).

2.3.5 Indispensable Features

In pattern classification, if a set of features operates only as a whole, i.e. no proper subset of them can accomplish the classification, we call them indispensable features. Eliminating a feature from an indispensable feature set causes a selection error. We check how JCFO and ARD deal with this kind of feature. If we treat the x and y variables in Figure 2.4 as two features, they apparently form an indispensable feature set, since neither of them alone can determine the classes.

Figure 2.4. Two Gaussians that can only be classified by using the x and y axes jointly.

As in Experiment 2.3.3.2, we generate 200 samples each time, half in each class, and run both methods 150 times. Results are listed below:

Figure 2.5. Weights assigned by JCFO (from top to bottom: x, y and z axes).

Figure 2.6. Weights assigned by ARD (from top to bottom: x, y and z axes).

The fact that neither JCFO nor ARD removes the x or the y axis-feature is a good result. However, the weights assigned by ARD are much more stable. Regarding the irrelevant z axis, ARD does a clean job, as it did in 2.3.3.2. We can therefore still rate its performance as better than JCFO's.
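For reference, the synthetic three-dimensional Gaussian data behind Figures 2.1 and 2.4 can be generated along the lines of the sketch below (Python/NumPy), following the means and covariances listed in Appendix B. The 45-degree rotation in the second case is what makes x and y indispensable as a pair; rotating around the midpoint of the two means is our reading of "the class center" in Appendix B, so that detail is an assumption.

import numpy as np

rng = np.random.default_rng(0)
n = 100   # samples per class

# Redundant case (Figure 2.1): either x or y separates the classes on its own.
mu1, mu2 = [10, 0, 0], [0, 10, 0]
sig1 = np.diag([7.0, 1.0, 1.0])
sig2 = np.diag([1.0, 7.0, 1.0])
X_red = np.vstack([rng.multivariate_normal(mu1, sig1, n),
                   rng.multivariate_normal(mu2, sig2, n)])

# Indispensable case (Figure 2.4): rotate two elongated Gaussians by 45 degrees
# in the x-y plane so that neither axis alone separates them.
mu1, mu2 = np.array([5.0, 5.0, 0.0]), np.array([8.0, 8.0, 0.0])
sig = np.diag([4.0, 1.0, 1.0])
A = np.vstack([rng.multivariate_normal(mu1, sig, n),
               rng.multivariate_normal(mu2, sig, n)])
c, s = np.cos(-np.pi / 4), np.sin(-np.pi / 4)        # clockwise rotation
R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
center = np.vstack([mu1, mu2]).mean(axis=0)
X_ind = (A - center) @ R.T + center

y = np.repeat([0, 1], n)
print(X_red.shape, X_ind.shape, y.shape)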

2.4 Non-linearly Separable Datasets

The experiments with synthetic data in the previous two sections concentrated on classifying linearly separable features, which is not JCFO's strong suit. In the following two tests, we examine JCFO's performance on two non-linearly separable datasets and compare it with the ARD method. The 2-D Cross data comprises two crossing Gaussian ellipses as one class and four Gaussians surrounding this cross as the other class, with 100 samples in each class (Figure 2.7). The 2-D Ellipse data consists of two Gaussian ellipses with equal means as the two classes, each with 50 samples; one of them has a much wider variance than the other, which makes the distribution look as if one class were surrounded by the other (Figure 2.8). Clearly, no single hyperplane can classify either of the two datasets correctly. We run JCFO, non-kernelized ARD and kernelized ARD on both datasets 50 times each. A comparison of their performance is shown in Table 2.14. In the experiment summarized in Table 2.15, a non-informative irrelevant feature was added exactly as described in Section 2.2.2; in the experiment summarized in Table 2.16, a similar redundant feature was added as described in Section 2.3.3.1.

Table 2.14. Comparison of JCFO, non-kernelized ARD and kernelized ARD
Dataset   Test error (%)   Non-kernelized ARD   Kernelized ARD   JCFO
Cross     mean             50                   8.1              4.0
Cross     std              0                    1e-14            1e-15
Ellipse   mean             48                   8.4              12
Ellipse   std              4e-14                3e-15            1e-14

Table 2.15. Comparison of JCFO and kernelized ARD with an added non-informative irrelevant feature
Dataset   Test error (%)   Kernelized ARD   JCFO
Cross     mean             17.7             7.32
Cross     std              17.6             3.20
Ellipse   mean             15.4             13.4
Ellipse   std              5.70             5.03

Table 2.16. Comparison of JCFO and kernelized ARD with an added similar redundant feature
Dataset   Test error (%)   Kernelized ARD   JCFO
Cross     mean             15.3             8.76
Cross     std              4.01             3.77
Ellipse   mean             13.4             12.3
Ellipse   std              5.21             4.41

Table 2.16. Comparisons of JCFO and kernelized ARD with an added sim ilar redundant feature Methods Test error (%) Kernelized ARD JCFO mean 15.3 8.76 Cross std 4.01 3.77 mean 13.4 12.3 Ellipse std 5.21 4.41 Since non-kernelized ARD performs a linear mapping, it cannot perf orm a reasonable classification in this scenario. Each of kerne lized ARD and JCFO performs a little better in one of the original datasets. JCFO outperforms AR D on both of the two datasets mixed with irrelevant or redundant features, especially on th e Cross data. However, the time consumption for training JCFO still dominates that of kernelized ARD and none of the parameters corresponding to the noise feature are set to zero by JCFO in these two scenarios. 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 2.7. Cross data 36

Figure 2.8. Ellipse data.

2.5 Discussion and Modification

From the trouble JCFO ran into in the experiments we designed, we reiterate the point made in Section 2.2.2: JCFO does not reliably achieve sparsity on the feature set and therefore does not realize the advantage of low generalization error through reduced model complexity. Hence it does not deliver both of the advantages it claims. A more effective approximation method for computing θ in (1.21) is therefore needed before JCFO's theoretical superiority can be realized. Without such a method at hand, we suggest simply dropping the θ part and returning to a kernelized ARD model, e.g. by plugging the RBF kernel (1.13) into (1.6). According to (Herbrich, 2002), this kernel has the appealing property that a linear combination of kernel functions of the training objects (see (1.6)) can be viewed as a density estimator in the input space, because it effectively puts a Gaussian on each x_i in {x_1, x_2, ..., x_N} and weights its contribution to the final density by the β_i of (1.6).

Regarding feature selection, we can apply a non-kernelized (linear) ARD to weigh each feature directly. (This, however, requires that the dataset not be non-linearly separable.) As we saw in 2.3.3.2, the x and y variables are assigned the same absolute value (with different signs) in every run, which might help us detect the very implicit redundancy in that scenario. Combining these two approaches, we first use the latter to pre-select the more representative features and then hand them back to the former for more precise learning. This increases the efficiency of the whole learning process. We compare the performance of non-kernelized ARD, kernelized ARD and their combination on the Crabs data and HH, training 100 times.

Table 2.17. Comparison of the three ARD methods
Dataset   Test error (%)   Non-kernelized   Kernelized   Combo
Crabs     mean             3.60             2.2          1~30
Crabs     std              2.95             0            1~10
Crabs     iterations       100              5            2
HH        mean             20.03            12.63        9.17
HH        std              0.47             6.26         2.11
HH        iterations       100              5            3

Note that after feature selection on the Crabs data only two features remain. Kernelized ARD makes the β_i vanish very fast on low-dimensional datasets (here, even after 2 iterations only two β_i are nonzero), which makes the training result unstable (see the mean and std of the combined method on the Crabs data). We therefore also need to recognize that there are cases in which the combination method is not appropriate, and that it, too, must balance the tradeoff between parameter sparsity and prediction accuracy.

2.6 Conclusion

performances in a series of particularly design ed experiments. A couple of comparisons with its direct ascendant approach are provided. From these experimental results weve seen that even if JCFO is theoretically more ideal in achieving both the features and basis functions sparsity, the lack of an effective implementation technique seriously restricts its performance. As an alternative, we suggest returni ng to the original ARD method, jo intly using the kernelized and non-kernelized versions of it to exploit both fe ature selections and class predictions. Though our model would become less ambitious henceforth, its simplicity in practice and time-efficiency are still preserved from our original design purpose. 39

APPENDIX A
SOME IMPORTANT DERIVATIONS OF EQUATIONS

Equation (1.9)

Consider the general case of minimizing f(x) = x² - 2ax + 2b|x| with respect to x, where a and b >= 0 are constants. This is equivalent to

f(x) = x² - 2ax + 2bx,  x >= 0
f(x) = x² - 2ax - 2bx,  x < 0

For the branch x >= 0:
  argmin f(x) = a - b  if a - b >= 0,   (1)
  argmin f(x) = 0      if a - b < 0.    (2)
For the branch x < 0:
  argmin f(x) = a + b  if a + b < 0,    (3)
  argmin f(x) = 0      if a + b >= 0.   (4)

Combining the branches:
  (1) & (3): both interior minimizers exist, so x̂ = argmin(f(a - b), f(a + b)), i.e. x̂ = a - b if a > 0 and x̂ = a + b if a < 0; in either case x̂ = sgn(a)(|a| - b);
  (1) & (4): a - b >= 0 and a + b >= 0, so x̂ = a - b = sgn(a)(|a| - b);
  (2) & (3): a - b < 0 and a + b < 0, so x̂ = a + b = sgn(a)(|a| - b);
  (2) & (4): a - b < 0 and a + b >= 0, so |a| <= b and x̂ = 0.

Summarizing the above four scenarios yields

x̂ = sgn(a)(|a| - b)_+ .

Integration (1.10)

\int_{0}^{\infty}\mathcal{N}(\beta\mid 0, \tau)\,\frac{\gamma}{2}e^{-\gamma\tau/2}\,d\tau
= \frac{\gamma}{2}\int_{0}^{\infty}\frac{1}{\sqrt{2\pi\tau}}\exp\Big(-\frac{\beta^{2}}{2\tau} - \frac{\gamma\tau}{2}\Big)d\tau \qquad (\text{let } \tau = t^{2},\ d\tau = 2t\,dt)
= \frac{\gamma}{\sqrt{2\pi}}\int_{0}^{\infty}\exp\Big(-\frac{\gamma}{2}t^{2} - \frac{\beta^{2}}{2t^{2}}\Big)dt
= \frac{\gamma}{\sqrt{2\pi}}\cdot\frac{\sqrt{\pi}}{2\sqrt{\gamma/2}}\,\exp\Big(-2\sqrt{\tfrac{\gamma}{2}\cdot\tfrac{\beta^{2}}{2}}\Big) \qquad \Big(\text{using }\int_{0}^{\infty}e^{-a^{2}t^{2} - b^{2}/t^{2}}dt = \frac{\sqrt{\pi}}{2a}e^{-2ab};\ \text{Beyer, 1979}\Big)
= \frac{\sqrt{\gamma}}{2}\exp\{-\sqrt{\gamma}\,|\beta|\}

Expectation (1.18)

Since p(z_i | β^{(t)}, θ^{(t)}) ~ N(h_{θ^{(t)}}(x_i)^T β^{(t)}, 1), consider the case y_i = 1, which constrains z_i >= 0. Writing μ_i^{(t)} = h_{θ^{(t)}}(x_i)^T β^{(t)}, the posterior of z_i is the truncated Gaussian

p(z_i | D, β^{(t)}, θ^{(t)}) = N(z_i | μ_i^{(t)}, 1) / Φ(μ_i^{(t)})  for z_i >= 0,  and 0 for z_i < 0.

The expectation is therefore

v_i^{(t)} = E[z_i | D, β^{(t)}, θ^{(t)}] = \int z_i\,p(z_i | D, β^{(t)}, θ^{(t)})\,dz_i = μ_i^{(t)} + N(μ_i^{(t)} | 0, 1) / Φ(μ_i^{(t)}),

which is (1.18) with y_i = 1. The case y_i = 0 follows in the same way, with z_i constrained to be negative.
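A quick numerical check of Integration (1.10), i.e. that the zero-mean Gaussian with exponentially distributed variance integrates to the Laplacian prior, is sketched below (Python/SciPy; this is purely a sanity check and not part of the thesis derivation; γ and β values are arbitrary).

import numpy as np
from scipy import integrate
from scipy.stats import norm

gamma = 4.0
beta = 0.8

# Left-hand side of (1.10): integrate N(beta | 0, tau) * (gamma/2) exp(-gamma*tau/2) over tau.
lhs, _ = integrate.quad(
    lambda tau: norm.pdf(beta, loc=0.0, scale=np.sqrt(tau)) * (gamma / 2) * np.exp(-gamma * tau / 2),
    0.0, np.inf)

# Right-hand side: (sqrt(gamma)/2) exp(-sqrt(gamma) |beta|).
rhs = (np.sqrt(gamma) / 2) * np.exp(-np.sqrt(gamma) * abs(beta))

print(lhs, rhs)    # the two values agree to numerical precision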

APPENDIX B
THE PSEUDOCODE FOR THE EXPERIMENT DESIGNS

2.2 Irrelevancy

NOTES: In this testing unit, the basic datasets are randomly generated and linearly separable. Each feature is normalized to zero mean and unit variance.

2.2.1 Uniformly Distributed Features

counter <- 0
for 1:50
    ds <- load dataset
    (n, k) <- get the matrix size from ds
    ds <- append a column of 1s as the (k+1)-th feature
    run JCFO(ds)
    if theta(k+1) = 0
        counter <- counter + 1
    end
end
display(counter / 50)

2.2.2 Non-informative Noise Features

counter <- 0
for mu = 5, 10, 15
    for 1:50
        ds <- load dataset
        (n, k) <- get the matrix size from ds
        ds <- append a column of mu-mean, unit-variance noise as the (k+1)-th feature
        run JCFO(ds)
        if theta(k+1) = 0
            counter <- counter + 1
        end
    end
end
display(counter / 150)

2.3 Redundancy

NOTES: In this testing unit, the basic datasets are either randomly generated, linearly separable datasets or Gaussian distributions. Each feature is normalized to zero mean and unit variance.

2.3.1 Oracle Features

counter <- 0
for 1:100
    ds <- load dataset
    (n, k) <- get the matrix size from ds
    ds <- append the class label as the (k+1)-th feature
    run JCFO(ds)
    if theta(k+1) > 0 and theta(1 through k) = 0
        counter <- counter + 1
    end
end
display(counter / 100)

2.3.2 Duplicate Features

counter <- 0
for 1:50
    ds <- load dataset
    (n, k) <- get the matrix size from ds
    for i = 1:k
        ds <- replicate the i-th feature and append it
        run JCFO(ds)
        if theta(i) = 0 or theta(k+1) = 0
            counter <- counter + 1
        end
    end
end
display(counter / (50*k))

2.3.3 Similar Features

2.3.3.1

counter <- 0
for 1:50
    ds <- load dataset
    (n, k) <- get the matrix size from ds
    for i = 1:k
        ds <- mix the i-th feature with unit-variance noise and append it
        run JCFO(ds)
        if theta(i) = 0 or theta(k+1) = 0
            counter <- counter + 1
        end
    end
end
display(counter / (50*k))

2.3.3.2

mu1 <- [10 0 0]; mu2 <- [0 10 0]
sig1 <- [7 0 0; 0 1 0; 0 0 1]; sig2 <- [1 0 0; 0 7 0; 0 0 1]
counter <- 0
for 1:150
    ds <- randomly generate 200 samples from the two Gaussians, with class labels
    run JCFO(ds)
    if either feature x or feature y has been removed
        counter <- counter + 1
    end
end
display(counter / 150)

2.3.4 Gradually More Informative Features

counter <- 0
for k = 5, 10, 15
    for 1:100
        mu1 <- a vector of k zeros
        mu2 <- [10 20 30 ... 10*k]
        sig <- (10*k/4) * (k-by-k identity matrix)
        ds <- randomly generate 200 samples from the two Gaussians, with class labels
        run JCFO(ds)
        if theta(k) > 0 and theta(1 through k-1) = 0
            counter <- counter + 1
        end
    end
end
display(counter / 300)

2.3.5 Indispensable Features

mu1 <- [5 5 0]; mu2 <- [8 8 0]
sig <- [4 0 0; 0 1 0; 0 0 1]
counter <- 0
for 1:150
    ds <- randomly generate 200 samples from the two Gaussians, with class labels, then rotate them 45 degrees clockwise around the class center
    run JCFO(ds)
    if theta(1) > 0 and theta(2) > 0
        counter <- counter + 1
    end
end
display(counter / 150)

LIST OF REFERENCES

W. Beyer. CRC Standard Mathematical Tables, 25th ed. CRC Press, 1979.

J. Bi, K. P. Bennett, M. Embrechts, C. M. Breneman and M. Song. Dimensionality reduction via sparse support vector machines. JMLR, 3: 1229-1243, 2003.

M. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9): 1150-1159, 2003.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. JMLR, 3: 1157-1182, 2003.

T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning. Springer Verlag, 2001.

R. Herbrich. Learning Kernel Classifiers. MIT Press, 2002.

B. Krishnapuram, A. J. Hartemink and M. A. T. Figueiredo. A Bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9): 1105-1111, 2004.

B. Krishnapuram, D. Williams, Y. Xue, A. Hartemink, L. Carin and M. Figueiredo. On semi-supervised classification. In L. K. Saul, Y. Weiss and L. Bottou, editors, Advances in Neural Information Processing Systems 17. MIT Press, 2005.

B. Schölkopf and A. J. Smola. Learning with Kernels: Regularization, Optimization and Beyond. MIT Press, 2002.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58: 267-288, 1996.

M. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen and K. R. Müller, editors, Advances in Neural Information Processing Systems 11, pp. 218-224. MIT Press, 2000.

BIOGRAPHICAL SKETCH

Fan Mao was born in Chengdu, China, in 1983. He received his bachelor's degree in computer science and technology from Shanghai Maritime University, Shanghai, China, in 2006. He then came to the University of Florida, Gainesville, FL. In December 2007, he received his M.S. in computer science under the supervision of Dr. Paul Gader.