A Hierarchical Dynamic Model for Object Recognition


Material Information

Title:
A Hierarchical Dynamic Model for Object Recognition
Physical Description:
1 online resource (115 p.)
Language:
english
Creator:
Chalasani, Rakesh
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:
2013
Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
PRINCIPE,JOSE C
Committee Co-Chair:
MEYN,SEAN PETER
Committee Members:
RANGARAJAN,ANAND
CHEN,YUNMEI

Subjects

Subjects / Keywords:
cortex -- deep -- hierarchy -- learning -- recognition
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre:
Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
This work focuses on building a hierarchical dynamic model for object recognition in video. It is inspired by the predictive coding framework that is used to explain sensory signal processing in the brain. Using this framework, we propose a new architecture that embodies one of the important characteristics of biological vision, namely, finding a causal representation of the visual inputs while using contextual information coming in from various sources. The proposed model is a deep network that processes an input video sequence in a hierarchical and distributive manner, and it includes several innovations. At the core of the model is a dynamic system at each level of the hierarchy that encodes time-varying observations. We propose a novel procedure called dynamic sparse coding to infer the sparse states of a state-space model and extend it further to model locally invariant representations. These dynamic models endow the network with long-term memory (parameters) and short-term contextual information and lead to invariant representations of the observations over time. Another important part of the proposed model is the bidirectional flow of information in the hierarchy. The representation at each level in the hierarchy is obtained from data-driven bottom-up signals and top-down expectations, which are both driving and modulatory. The top-down expectations are usually task specific; they bias and modulate the representations at the lower layers to extract relevant information from the milieu of noisy observations. We also propose a convolutional dynamic model that allows us to scale the proposed architecture to large problems. We show that this model decomposes the inputs in a hierarchical manner, starting from low-level representations and moving to higher-level abstractions, mimicking the processing of information in the early visual cortex. We evaluate the performance of the proposed model on several benchmark datasets for object and face recognition and show that it performs better than several existing methods.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Rakesh Chalasani.
Thesis:
Thesis (Ph.D.)--University of Florida, 2013.
Local:
Adviser: PRINCIPE,JOSE C.
Local:
Co-adviser: MEYN,SEAN PETER.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2014-06-30

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2013
System ID:
UFE0046204:00001




Full Text

A HIERARCHICAL DYNAMIC MODEL FOR OBJECT RECOGNITION

By

RAKESH CHALASANI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2013

© 2013 Rakesh Chalasani

In memory of my grandfather

ACKNOWLEDGMENTS

I would like to take this opportunity to thank the many people who helped and supported my journey to pursue a PhD degree. I thank my advisor, Dr. Jose C. Principe, for giving me the opportunity and means to pursue this work; without his constant support and guidance this work would not have been possible. I am very indebted to Dr. Yunmei Chen for many patient and insightful discussions that helped shape this work significantly and Dr. Anand Rangarajan for stimulating my intellectual curiosity through his coursework. I would also like to thank Dr. Sean Meyn and Dr. John Harris for serving on my committee and for many constructive comments.

I feel fortunate to be in the company of some amazing people, who made these last five years very memorable. I thank Austin Brockmeier, Rosha Pokharel, Sohan Seth, Luis Sanchez Giraldo and Alexander Singh Alvarado for their friendship, sharing, and many interesting discussions. I thank Evan Kriminger, Matthew Emigh, Erion Hasanbelliu, Kittipat `Bot' Kampa, Goktug Cinar, Stefan Craciun, and Eder Santana for all the fun and laughter, and In Jun Park, Jihye Bae, Songlin Zhao, Miguel Teixeira, Bilal Fadlallah, Gabriel Nallathambi, Pingping Zhu, Jongmin Lee and all the CNELers for the good times. I thank Anusha Vallurupalli and Krishna Chaitanya for being the family away from home.

Finally, I thank my parents for all the love, support and faith, my brother and sister-in-law for constant care and encouragement, and Swathi Sri Kondagari for being there and bringing so much joy into my life.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    1.1 Motivation
    1.2 Outline

2 THEORETICAL BACKGROUND
    2.1 Deep Networks and Representation Learning
    2.2 Predictive Coding Models
        2.2.1 Hierarchy of Attractors
        2.2.2 Relationship with Other Methods

3 FEATURE EXTRACTION FROM TIME-VARYING SIGNALS
    3.1 Model and Energy Function
    3.2 Learning
        3.2.1 Inference
        3.2.2 Parameter Updates
    3.3 Experiments
        3.3.1 Inferring Sparse States with Known Parameters
        3.3.2 Learning from Natural Image Sequences
            3.3.2.1 Visualizing invariance
            3.3.2.2 Learning temporal structure
        3.3.3 Role of Temporal Connections: Toy Data

4 DEEP PREDICTIVE CODING NETWORKS
    4.1 Multi-Layered Architecture
    4.2 Inference in Multi-Layered Network with Top-Down Connections
    4.3 Experiments
        4.3.1 Receptive Fields in the Hierarchical Model
        4.3.2 Role of Top-Down Information

5 CONVOLUTIONAL DYNAMIC NETWORK FOR LARGE SCALE OBJECT RECOGNITION
    5.1 Model Architecture
        5.1.1 Single Layer Model
        5.1.2 Building a Hierarchy
        5.1.3 Implementation Details
    5.2 Inference
        5.2.1 Procedure
        5.2.2 Approximate Inference with Top-Down Connections
    5.3 Learning
    5.4 Experiments
        5.4.1 Learning from Natural Video Sequences
        5.4.2 Object Recognition - Caltech-101 Dataset
        5.4.3 Recognition with Context
            5.4.3.1 Role of contextual information during inference
            5.4.3.2 Sequential labeling
            5.4.3.3 Analysis of temporal and top-down connections
        5.4.4 Learning Hierarchy of Attractors
            5.4.4.1 Learning parts of object from unlabeled sequences
            5.4.4.2 Denoising videos using top-down information
    5.5 Discussion
        5.5.1 Relationship with Feed-Forward Networks
        5.5.2 Comparison with Other Methods
    5.6 Summary

6 CONCLUSION
    6.1 Summary
    6.2 Avenues for Future Work

APPENDIX

A A FAST PROXIMAL METHOD OF CONVOLUTIONAL SPARSE CODING
    A.1 Convolutional FISTA and Dictionary Learning
        A.1.1 Inference
        A.1.2 Dictionary Learning
        A.1.3 Extension to PSD
    A.2 Experiments
        A.2.1 Dictionary Learning
        A.2.2 Convergence Rate
        A.2.3 Predictive Sparse Decomposition
    A.3 Summary

B ADDITIONAL RESULTS FOR MODEL VISUALIZATION
    B.1 Visualizing Invariance Encoded by Layer 1 Causes
    B.2 Hierarchical Decomposition Obtained from the YouTube Dataset

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Computational time of different methods on synthetic data (per time sample)
5-1 Classification performance over Caltech-101 dataset with only a single bottom-up inference
5-2 Classification performance over COIL-100 dataset with various configurations
5-3 Recognition rate for face recognition in Honda/UCSD dataset
5-4 Classification performance over YouTube Celebrities dataset

LIST OF FIGURES

1-1 The visual pathway in the brain responsible for object recognition
2-1 Comparison between shallow and deep networks
3-1 A dynamic network consisting of two distinctive processing units: states for extracting features from the inputs and causes that pool the states to form a more complex representation
3-2 The combined prior acting on the states in dynamic sparse coding
3-3 Effect of smoothing on the cost function
3-4 Performance of dynamic sparse coding with sparse innovations while tracking sparse states
3-5 Performance of the dynamic sparse coding with varying number of observation dimensions
3-6 Performance of the dynamic sparse coding with varying (sparse) noise levels
3-7 Receptive fields of the model at different levels
3-8 Visualizing the observation matrix learned from natural videos using Gabor fits
3-9 Visualizing invariance represented by the causes
3-10 Visualizing the temporal structure learned by the model
3-11 Role of temporal connections during inference
4-1 A two-layered hierarchical model constructed by stacking several state-space models
4-2 Block diagram showing the flow of bottom-up and top-down information in the model
4-3 Visualization of the receptive fields of the invariant units learned from natural videos
4-4 Clean and corrupted video sequences constructed using three different shapes
4-5 The scatter plot of the 3-dimensional causes at the top layer obtained from the toy dataset
5-1 Block diagram of a single layer convolutional dynamic network
5-2 Block diagram of the inference procedure, with arrows indicating the flow of information during inference
5-3 Receptive fields of the two-layered convolutional network learned on natural video sequences
5-4 Examples from the COIL-100 dataset
5-5 Part of the face sequence belonging to three different subjects extracted from Honda/UCSD dataset
5-6 Example sequences from the YouTube celebrities dataset
5-7 Classification performance on noisy Honda/UCSD dataset with 50 frames per sequence
5-8 The performance on noisy Honda/UCSD dataset for various values of the two hyper-parameters
5-9 Hierarchical decomposition of object parts learned by the model from face videos of 16 different subjects in the VidTIMIT dataset
5-10 Video denoising with temporal and top-down connections
5-11 The PCA projections of layer two causes in the denoising experiment
A-1 Learned dictionary and image reconstruction obtained using convolutional FISTA
A-2 Comparison of the convergence rate between convolutional CoD (ConvCoD) and convolutional FISTA (ConvFISTA)
A-3 Encoder weights and instantaneous total loss in predictive sparse decomposition
B-1 Visualizing grouping of the dictionary elements encoded by layer 1 causes
B-2 Visualizing invariance encoded by the layer 1 causes using frequency-orientation polar plot
B-3 Visualizing invariance encoded by the layer 1 causes using center-orientation scatter plot
B-4 Receptive fields in a two-layered network learned from the YouTube celebrities dataset

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

A HIERARCHICAL DYNAMIC MODEL FOR OBJECT RECOGNITION

By

Rakesh Chalasani

December 2013

Chair: Jose C. Principe
Major: Electrical and Computer Engineering

This work focuses on building a hierarchical dynamic model for object recognition in video. It is inspired by the predictive coding framework that is used to explain sensory signal processing in the brain. Using this framework, we propose a new architecture that embodies one of the important characteristics of biological vision, namely, finding a causal representation of the visual inputs while using contextual information coming in from various sources.

The proposed model is a deep network that processes an input video sequence in a hierarchical and distributive manner, and it includes several innovations. At the core of the model is a dynamic system at each level of the hierarchy that encodes time-varying observations. We propose a novel procedure called dynamic sparse coding to infer the sparse states of a state-space model and extend it further to model locally invariant representations. These dynamic models endow the network with long-term memory (parameters) and short-term contextual information and lead to invariant representations of the observations over time. Another important part of the proposed model is the bidirectional flow of information in the hierarchy. The representation at each level in the hierarchy is obtained from data-driven bottom-up signals and top-down expectations, which are both driving and modulatory. The top-down expectations are usually task specific; they bias and modulate the representations at the lower layers to extract relevant information from the milieu of noisy observations. We also propose a convolutional dynamic model that allows us to scale the proposed architecture to large problems. We show that this model decomposes the inputs in a hierarchical manner, starting from low-level representations and moving to higher-level abstractions, mimicking the processing of information in the early visual cortex. We evaluate the performance of the proposed model on several benchmark datasets for object and face recognition and show that it performs better than several existing methods.

CHAPTER 1
INTRODUCTION

1.1 Motivation

Recognizing objects and patterns in an environment is one of the important functions of the human visual system, and it is capable of doing this remarkably well. However, building a computational model for machine vision that can emulate this functional block of the brain has turned out to be an extremely difficult task. The complexity of this task lies in our ability to deal with the large number of variations an object can cast onto a 2D sensory space like images, be it in scale, rotation, position, etc. [Pinto et al. 2008]. So, how does the brain solve this complex problem?

From our current understanding, the brain is a large distributed network with several units, called neurons, working in parallel to perform complex tasks like object recognition. In particular, the visual system, like other parts of the cerebral cortex, is arranged as a hierarchical network [Felleman and Van Essen 1991]. Figure 1-1 shows the visual pathway in the brain that is responsible for determining what an object is. There is growing convergence within the neuroscience community that the role of this hierarchical distributive network is to maintain some internal representations of the environment and use these internal representations to constantly predict the time-varying changes in the sensory inputs [Clark 2013]. Perception (or recognition) can then simply be considered as measuring how accurate these predictions are.

Biological vision uses such internal representations for a variety of tasks: recognizing objects, dealing with missing or occluded parts of an object, attention, tracking and segmentation of the scene, to name a few. It has two important characteristics that help it obtain such rich representations: firstly, it decomposes the inputs in a hierarchical fashion [Felleman and Van Essen 1991] and secondly, it acts as a dynamic process and uses contextual information very efficiently [Schwartz et al. 2007; Bar 2004]. In other words, it has the ability to obtain information from various sources and combine them at different levels: bottom-up information from the sensory inputs (from the LGN in Figure 1-1), spatio-temporal relationships over space and time (through lateral connections at each level) and the top-down expectations coming from the higher levels (V4 and IT in Figure 1-1) carrying prior knowledge gained about the external environment through experience.

Figure 1-1. The visual pathway in the brain responsible for object recognition. It is made up of four major blocks, viz., V1-V2-V4-IT (inferior temporal cortex), arranged in a hierarchy. The blocks are interconnected with both forward and backward connections, while the LGN projects the external environment onto V1 through the retina.

These functional characteristics of the brain, namely the idea of analysis-by-synthesis, the bidirectional flow of information fusing bottom-up and top-down signals, and the spatio-temporal recurrent connections conveying contextual information, can be modeled, in simple mathematical terms, as a generative model [Friston 2005]. Predictive coding, proposed by Rao and Ballard [1997] and Friston [2008], explores this idea of using a generative model and provides a unified mathematical framework. In this work, inspired by predictive coding, we propose a novel hierarchical dynamic model for robust object recognition in video sequences that is consistent with biological vision.

Keeping in mind these properties of the visual system, we restrict our model to have certain important properties:

- The model has to consider the inputs as a dynamic process, i.e., it has to maintain the temporal relationships during inference. This leads to incorporating contextual information into the model and may be beneficial for robust recognition [Schwartz et al. 2007].
- It has to decompose the inputs in a hierarchical and distributive fashion. Such a deep representation should extract abstract information from the inputs, leading to better generalization and robustness to several variations in the inputs [Bengio 2009].
- It has to maintain sparsity at every level. This helps not only to obtain an efficient representation of the observations but also to learn better discriminability [Rigamonti et al. 2011].
- It should combine the top-down feedback information from the higher layers in the hierarchy during bottom-up inference. Such bidirectional information flow can be helpful at the lower layers to reconstruct the missing parts of an object or to disambiguate between different objects in a noisy environment [Gilbert and Li 2013].

Previously, many methods have been proposed to model visual perception, and they encompass some of the above mentioned properties. For example, deep learning methods like deep belief networks [Hinton et al. 2006], stacked auto-encoders [Vincent et al. 2010], HMAX [Riesenhuber and Poggio 1999; Serre et al. 2007], convolutional neural networks [LeCun et al. 1989], etc., encode the inputs as a hierarchical representation. It is observed that increasingly invariant representations can be obtained with the depth of the hierarchical models [Goodfellow et al. 2009]. These models focus mainly on rapid object recognition with feed-forward networks, without considering either temporal or top-down connections. On the other hand, temporal coherence was found to be useful to learn complex invariances from the data; see [Wiskott and Sejnowski 2002; Mobahi et al. 2009; Zou et al. 2012]. Also, models like the deep Boltzmann machine [Salakhutdinov and Hinton 2012] and convolutional deep belief networks [Lee et al. 2009] consider the top-down influences using undirected graphical models.

Rao and Ballard [1997] previously proposed a predictive coding model based on Kalman filter like update rules, while Friston [2008] proposed a similar hierarchical dynamic model with generalized coordinates based on empirical priors and presented a unified theory based on free energy principles. Lee and Mumford [2003] proposed a particle filtering based approach to build a model for visual perception. Though these models are successful in incorporating top-down connections and temporal relationships, neither of these models considers sparsity on the hierarchical representations, and they do not scale well to large images and videos. In spite of these limitations, predictive coding models are able to explain the importance of context in sensory perception [Rao and Ballard 1999; Kiebel et al. 2008a, 2009] and make it possible to explore the use of context (or empirical priors) in deep networks for visual object recognition.

From a machine vision point of view, imposing some prior knowledge on the model can lead to a better representation of the inputs. Such prior knowledge can be domain specific, as in SIFT [Lowe 1999] and HOG [Dalal and Triggs 2005], or can be the use of generic fixed priors like sparsity [Olshausen and Field 1996] and temporal coherence [Wiskott and Sejnowski 2002]. The use of these fixed priors has become particularly useful while learning deep networks [Zou et al. 2012; Lee et al. 2007a; Kavukcuoglu et al. 2010a]. However, contextual information might contain more task or objective specific information that cannot easily be incorporated while using fixed priors. So, one way to incorporate this information is by empirically altering the priors on the model depending on the context. In this work, we use this idea of empirical priors to incorporate contextual information, obtained from the temporal relationships in the video and from top-down abstract information about the objects, to build a robust recognition system.

1.2 Outline

The following is the outline of the rest of the dissertation. In Chapter 2 we discuss the theoretical background for the rest of this work. In this chapter, we first discuss the importance of learning representations in a distributive and hierarchical manner. Then we motivate the idea of predictive coding as a unifying framework and discuss the idea of building a hierarchical dynamic model. We review several existing methods and their relationship with the proposed model.

In Chapter 3, we propose dynamic sparse coding (DSC), where the observation sequences are modeled using a dynamic system with appropriate sparsity constraints on the states and the innovations. We further extend this to model invariant representations of the observations. This forms the basic block for building a hierarchical model later.

In Chapter 4 we propose the deep predictive coding network (DPCN), which is a hierarchical dynamic model with the state-space model developed in Chapter 3 as the building block. We propose an efficient approximate inference procedure for this hierarchical network that combines the bottom-up data driven information coming into the lower layers with the top-down expectations from the higher layers.

In Chapter 5, we propose a convolutional model for the dynamic networks discussed in the aforementioned chapters. This allows us to scale the proposed architecture to large scale images and videos. Here we compare the proposed model on several benchmark object and face recognition datasets and show that it performs better than several existing methods.

Finally, in Chapter 6, we conclude the dissertation by summarizing the contributions of this work and discuss some future avenues to explore.

CHAPTER 2
THEORETICAL BACKGROUND

In this chapter, we first consider the problem of visual object recognition from a machine learning perspective and understand the complexity of the task. We discuss the importance of the hierarchical and distributive nature of the representations to perform better classification and argue that such representations can disentangle the underlying discriminative factors of an object from low-level sensory inputs. We then focus on one such framework called predictive coding to construct a hierarchical network. It provides a unified theory to construct deep dynamic models and forms the theoretical underpinning for the rest of the models proposed in this work.

2.1 Deep Networks and Representation Learning

One of the important pre-requisites for good classification is to find a representation of the data such that the classification becomes easier in that projected space. For example, consider the simple two class problem shown in Figure 2-1A, where the samples in a 2D space belonging to two different classes are not linearly separable. Generally, popular learning algorithms like kernel (or non-linear) support vector machines [Vapnik 1998], radial basis networks [Haykin 1994] and the multilayer perceptron with a single hidden layer [Haykin 1994] find a non-linear mapping such that these samples are well separated in some projected space. Such models, which perform a single non-linear transformation on the inputs, are called shallow networks.

Though these methods perform relatively well in simple tasks like the one above, such models have limited capacity when the variations within a class are large and require learning some complex mapping to find good representations [Bengio and LeCun 2007], such as in the case of object recognition. The reason being, as the complexity of the decision boundary between classes becomes large, shallow networks require a larger number of computational units to represent it. In fact, Bengio [2009] has shown theoretically that such shallow models require an exponentially large number of computational units and memory proportional to the complexity of the tasks. Instead, deep networks that use multiple hierarchical non-linear mappings are shown to have much more flexibility to model complex mappings in a compact manner [Bengio and LeCun 2007; Hinton and Salakhutdinov 2006].

Figure 2-1. Comparison between shallow and deep networks. (A) Shallow networks use a single non-linear function to map the observations into a space where they are linearly separable. (B) However, when the decision boundary becomes complicated (as in the case of the two class object recognition shown here), a single non-linear mapping might not suffice. Deep networks use multiple non-linear transformations on the observations and progressively extract abstract information. After these multiple non-linear mappings, different classes become more distinguishable.

The key to such models is their distributive and hierarchical nature while representing the inputs/observations. The observations are represented such that simple features extracted at the lower layers are combined in the higher layers through complex relationships and become progressively more abstract with the depth of the model (Figure 2-1B). The important point here is the re-usability of the lower layer feature extractors for creating distinctive and abstract higher layer features. Theoretically, every distinct path traversing from low-level features to the higher levels forms a different representation. Such re-usability of feature extractors common to several distinctive classes not only leads to compact representations but also to better generalization for some unknown classes.

But how do we know these complex higher-level relationships? As we will show later in this work, such relationships can be learned from the observations themselves; this is called representation learning. Representation learning allows us to leverage the structure present in the data, even when there is no label information [Bengio et al. 2013]. It has already been shown [Hinton et al. 2006; Lee et al. 2009; Jarrett et al. 2009a; Boureau et al. 2010] to outperform some handcrafted techniques like SIFT [Lowe 1999] and HOG [Dalal and Triggs 2005]. Another advantage of representation learning is its use in transfer learning, where it allows a model learned on particular data (or task) to easily generalize to other data (or tasks).

2.2 Predictive Coding Models

This idea of deep and distributive representations discussed above is in fact loosely inspired by the working of the visual cortex (or neocortex, in general) in the brain. As discussed in Chapter 1, the visual cortex uses similar multiple layers of non-linear transformations on the sensory inputs/observations. We also discussed how generative models could potentially be used to design such hierarchical models. The key assumption here is that the sensory signals coming from the external environment unfold as a dynamic process with a distinctive causal event. One of the most prominent theories for modeling such sensory inputs, following the idea of Helmholtz, is to consider the perceptual system as an inference engine whose function is to infer the probable causes of the sensory input [Dayan et al. 1995]. In other words, the goal of perception is to solve an inverse problem to find the underlying signal that might have caused the sensory inputs. To elaborate, one would like to build a generative model that tries to learn the external environment using a set of parameters. Using these parameters, the model then tries to infer the underlying cause that might have generated the observations. This idea of analyzing the observations by synthesis is shown to be very useful and several methods were proposed to solve this inverse problem [Friston 2005; Riesenhuber and Poggio 1999; Dayan et al. 1995; George and Hawkins 2005].

Here we are interested in one such framework proposed in [Friston 2005; Rao and Ballard 1999], called predictive coding (or empirical Bayes). The basic idea of predictive coding is to model the visual system as a generative model that tries to predict the external responses using some internal states. The prediction error, from a generative model perspective, is then the difference between the actual observation and the input generated from the model and the underlying causes (also called latent or hidden variables) [Friston 2005]. Mathematically, if y_t is an observation at time t, then it is described by the underlying cause (u_t) as follows:

    y_t = F(u_t) + v_t                                            (2-1)

where F(.) is some observation (or measurement) function. In addition, since we assume that the observations are time-varying, intermediate hidden states (x_t) can be considered to encode the dynamics over time. Hence, a unified model that encodes a sequence of observations can be written as a generalized state-space model of the form [Friston 2008]:

    y_t = F(x_t, u_t) + n_t
    x_t = G(x_{t-1}, u_t) + v_t                                   (2-2)

where G is called the state transition function. We assume that F and G can be parameterized by some set of parameters, θ. The terms u_t are called the unknown causes and, since we are usually interested in obtaining abstract information from the observations, they are encouraged to have a non-linear relationship with the observations. The hidden states, x_t, then mediate the influence of the causes on the observations and endow the system with memory [Friston 2008]. The terms v_t and n_t are stochastic and model uncertainty in the predictions.

Now, to build a multi-layer hierarchical network, several such state-space models can be stacked such that the output from one acts as the observations to the model in the layer above. Mathematically, an L-layered network of this form can be written as:

    y_t = f(x_t^{(1)}, u_t^{(1)}) + n_t^{(1)}
    x_t^{(1)} = g(x_{t-1}^{(1)}, u_t^{(1)}) + v_t^{(1)}
    ...
    u_t^{(l-1)} = f(x_t^{(l)}, u_t^{(l)}) + n_t^{(l)}
    x_t^{(l)} = g(x_{t-1}^{(l)}, u_t^{(l)}) + v_t^{(l)}
    ...
    u_t^{(L-1)} = f(x_t^{(L)}, u_t^{(L)}) + n_t^{(L)}
    x_t^{(L)} = g(x_{t-1}^{(L)}, u_t^{(L)}) + v_t^{(L)}           (2-3)

The terms v_t^{(l)} and n_t^{(l)} are stochastic fluctuations at the higher layers and enter each layer independently. In other words, this model forms a Markov chain across the layers and the latent variables at any layer are now only dependent on the observations coming from the layer below and the predictions from the layer above. Notice how the causes at a lower layer form the observations to the layer above, i.e., the causes form the link between the layers while the states link the dynamics over time.
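To make the stacked construction in (2-3) concrete, the following is a minimal illustrative sketch (not from the dissertation) of such a hierarchy in Python/NumPy. It assumes linear choices for f and g, a small random parameterization, and Gaussian noise purely for illustration; the layer sizes, the matrix D that couples the causes into the state dynamics, and all names are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_layer(obs_dim, state_dim, cause_dim):
        # Random linear parameters for one layer: y = C x + n,  x = A x_prev + D u + v.
        return {
            "C": rng.standard_normal((obs_dim, state_dim)) / np.sqrt(state_dim),
            "A": rng.standard_normal((state_dim, state_dim)) / np.sqrt(state_dim),
            "D": rng.standard_normal((state_dim, cause_dim)) / np.sqrt(cause_dim),
        }

    def generate(layers, top_causes, noise=0.01):
        # Run the hierarchy top-down: the output of layer l acts as the causes of layer l-1.
        u = top_causes
        for layer in reversed(layers):          # from the top layer down to the bottom
            x = np.zeros(layer["A"].shape[0])
            outputs = []
            for u_t in u:
                # x_t = g(x_{t-1}, u_t) + v_t   (linear g assumed here)
                x = layer["A"] @ x + layer["D"] @ u_t + noise * rng.standard_normal(x.shape)
                # y_t = f(x_t, u_t) + n_t       (linear f assumed here)
                outputs.append(layer["C"] @ x + noise * rng.standard_normal(layer["C"].shape[0]))
            u = outputs                         # becomes the observation sequence of the layer below
        return np.stack(u)                      # bottom-layer observation sequence

    layers = [make_layer(64, 32, 8), make_layer(8, 6, 3)]    # layer 1 (bottom), layer 2 (top)
    top_causes = [rng.standard_normal(3) for _ in range(10)]
    video_like = generate(layers, top_causes)
    print(video_like.shape)                                  # (10, 64)

Sampling in this top-down direction is only meant to show how the causes of one layer become the observations of the layer below; inference in the following chapters runs in the opposite, bottom-up direction.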

The important point in this design is that the higher-level predictions influence the representation at lower levels. These predictions from the higher layer non-linearly enter into the state-space model at the bottom layer by empirically altering the prior on the causes of the bottom layer. Hence, the top-down connections, along with the horizontal (or recurrent) connections in the state space, directly influence the inference in the bottom layers. As we will discuss later, these two connections together provide contextual awareness to the model while encoding the observations.

Learning in this hierarchical generative model is tantamount to building internal models that can explain observations in terms of some underlying causes. Paraphrasing the same, if the model is able to synthesize an observation accurately, then it means the model has previously seen a similar observation. Recognition then is simply solving an inverse problem by jointly minimizing the prediction error at all levels [Friston 2005]. This framework forms the theoretical underpinning of the methods proposed in this work.

2.2.1 Hierarchy of Attractors

So, how does this dynamic (or predictive coding) model relate to the hierarchical representations discussed in Section 2.1? The key idea while designing this dynamic generative model is that the observations coming into the model unfold as an ordered sequence of spatio-temporal dynamics [George and Hawkins 2005; Kiebel et al. 2008b], and these are generated as trajectories in some underlying attractor manifolds, while the causes govern the position of the attractors in the manifold [Friston and Kiebel 2009]. Critically, in a hierarchical setting, the shape of the attractors in the lower layers is governed by other dynamic models in the higher layers, which themselves can have their own attractors. This coupling across the layers in the hierarchy allows the dynamics in the higher layers to influence the lower-layer trajectories. From an object recognition perspective, such hierarchical dynamic models have some important characteristics [Friston and Kiebel 2009].

Firstly, since we assume that the dynamic models at each layer have their own attractors, each can generate, and hence encode, its own spatio-temporal sequence. While the dynamics within each model determine where in the trajectory the sequence is from, the shape determines which attractor is encoding the observations. In other words, each dynamic model at any layer encodes a particular shape (or part of the object) and is invariant to certain transformations (or fluctuations) of that shape.

Secondly, as higher layer models influence the manifold of the lower layer attractors, the higher layers encode the category of the sequence represented by the lower layers. That is to say, it is possible to generate sequences of sequences, as the causes in the higher layer control how the attractors in the lower-layer manifold are mixed together (this is equivalent to re-using the lower layer feature extractors described in Section 2.1). More importantly, while encoding, the higher layers determine (rather, guide the lower layers to decide) which sequence (or object) to encode in the milieu of noisy observations. This entails a non-linear interaction of the top-down effect between the higher layer attractor and the lower layer manifold.

Finally, the higher layers encode more abstract knowledge and evolve more slowly over time. The reason being, as the lower layers are guided by the higher layers to encode a particular object, the lower layers encode the transformations in the object and pass on only an invariant representation to the layer above. Since the objects in a scene change more slowly than the transformations they undergo, the higher layers are more invariant and, hence, much slower to evolve over time. In the following chapters, we show that the proposed model encompasses all these characteristics and leads to a robust object recognition system.

2.2.2 Relationship with Other Methods

One can also argue that the above state-space model in (2-2), and its corresponding hierarchical model (2-3), encompass several existing methods that are popular for object recognition [Friston 2008; Roweis and Ghahramani 1999]. The ability of these models to generalize several models comes from considering different priors and designs while encoding the inputs. For example, sparse coding [Olshausen and Field 1996], which is popularly used in object recognition, can be considered a special case of the state-space model in (2-2), where we do not have the hidden states and a sparsity inducing prior is considered over the causes. Even more complex methods like independent subspace analysis (ISA) (or topographic ICA) [Hyvarinen et al. 2009] can also be considered as a case where we have both the hidden states and the causes, without any dynamical state-space equation.

Similarly, many other popular deep networks used for object recognition contain max or average pooling functions along with feature extraction methods [Riesenhuber and Poggio 1999; Jarrett et al. 2009b]. This can also be subsumed into the proposed model by considering that the states encode the features, while the non-linear function between the states and the causes is simply a max or average pooling function; a small sketch of this pooling view follows at the end of this section. Recently, restricted Boltzmann machines [Hinton et al. 2006], auto-encoders [Bengio et al. 2007], encoder-decoder models [Ranzato et al. 2007], etc., have become popular for building deep networks. The key to these models is to learn both encoding and decoding concurrently and to build the deep network as a feed-forward model using only the encoder while discarding the decoder. Though these models appear to be different from the hierarchical models described before, the function of the encoder is only to approximate the decoding (or inference) and hence, they can be subsumed in the above framework as well. (However, see [Salakhutdinov and Hinton 2012; Lee et al. 2009], which construct deep Boltzmann machines using undirected graphical models and can also process bidirectional information flow in hierarchical models. Also see [Bengio et al. 2013] for a more thorough comparison between directed and undirected graphical models for learning deep networks.)

We would also like to remark that there are several popular handcrafted features, like SIFT [Lowe 1999], SURF [Bay et al. 2008], HOG [Dalal and Triggs 2005], etc., that are very successful in object recognition. These methods do not fall into the adaptive learning paradigm and hence, are not biologically plausible. But they are very successful in object recognition tasks and are worth taking note of.
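As a small illustration of the pooling case mentioned above (a sketch, not this work's formulation), the non-linear map from states to causes can be taken to be a fixed max or average pooling over groups of state components; the group size and the names below are arbitrary choices.

    import numpy as np

    def causes_from_states(states, group_size=4, mode="max"):
        # Pooling as the fixed non-linearity between states and causes:
        # every group of `group_size` state components collapses into one cause.
        groups = np.abs(states).reshape(-1, group_size)
        return groups.max(axis=1) if mode == "max" else groups.mean(axis=1)

    x = np.array([0.0, 0.9, 0.0, 0.1,  0.0, 0.0, 0.0, 0.0])   # a sparse 8-dimensional state
    print(causes_from_states(x))                              # [0.9 0.]: one cause per group

In the models proposed in the following chapters this state-to-cause mapping is not fixed: it is learned from data and modulated by temporal and top-down context.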

CHAPTER 3
FEATURE EXTRACTION FROM TIME-VARYING SIGNALS

In this chapter, we consider a dynamic network to extract features from a small part of a video sequence. This forms the basic block for building hierarchical models later. The centerpiece of the proposed model is extracting sparse features from time-varying observation sequences using a dynamic model. To this end, we propose a novel procedure based on proximal methods to infer sparse states (or features) of a dynamic system. We then extend this feature extraction (or sparse coding) block to introduce a pooling strategy to learn invariant feature representations from the data. We show that this two stage model first extracts simple features and then combines them to form more complex representations, similar to simple and complex cells in the V1 region of the visual cortex.

3.1 Model and Energy Function

To begin with, let {y_1, y_2, ..., y_t, ...} ∈ R^P be a P-dimensional sequence of a patch extracted from the same location across all the frames in a video (here y_t is a vectorized form of a √P × √P square patch extracted from a frame at time t). To process this, our network consists of two distinctive parts (Figure 3-1): feature extraction (inferring states) and pooling (inferring causes).

Figure 3-1. A dynamic network consisting of two distinctive processing units: states for extracting features from the inputs and causes that pool the states to form a more complex representation.

For the first part, sparse coding is used in conjunction with a linear state-space model to map the inputs y_t at time t onto an over-complete dictionary of K filters, C ∈ R^{P×K} (K > P), to get sparse states x_t ∈ R^K. To keep track of the dynamics in the latent states we use a linear function with state-transition matrix A ∈ R^{K×K}. More formally, we assume that the inputs are synthesized using the following generative model with an ℓ1 sparsity regularization on the states x_t:

    y_t = C x_t + n_t
    x_t = A x_{t-1} + v_t                                          (3-1)

We call this model dynamic sparse coding (DSC). To infer the states x_t in this model, we minimize the following energy function:

    E_1(x_t, y_t, C, A) = ||y_t - C x_t||_2^2 + λ ||x_t - A x_{t-1}||_1 + γ ||x_t||_1        (3-2)

Notice that the second term involving the state transition is also constrained to be sparse, implying that the number of changes in the features over time is small. This not only makes sense in practice, where the visual inputs we usually encounter change slowly over time, but it also makes the state-space representation more consistent and leads to a sparser solution. Figure 3-2 shows the comparison between modeling the innovations as sparse versus dense (or Gaussian distributed) (similar to Kalman filtering but without updating the covariance matrix over time). Notice that the shape of the combined regularizer over the states around the solution is sharper with sparse innovations, indicating that it promotes better sparsity than when the innovations are modeled as a Gaussian distribution.
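For reference, here is a minimal NumPy sketch (an illustration, not the code used in this work) that evaluates the DSC energy E_1 of (3-2) for a given state; the symbols λ and γ follow the reconstruction above, and the dimensions and values are arbitrary.

    import numpy as np

    def dsc_energy(y_t, x_t, x_prev, C, A, lam=10.0, gamma=10.0):
        # E_1(x_t) = ||y_t - C x_t||_2^2 + lam*||x_t - A x_prev||_1 + gamma*||x_t||_1
        reconstruction = np.sum((y_t - C @ x_t) ** 2)           # data fidelity
        innovation = lam * np.sum(np.abs(x_t - A @ x_prev))     # sparse state transition
        sparsity = gamma * np.sum(np.abs(x_t))                  # sparse state prior
        return reconstruction + innovation + sparsity

    rng = np.random.default_rng(0)
    P, K = 16, 32                                  # observation size and over-complete state size
    C, A = rng.standard_normal((P, K)), np.eye(K)
    x_prev = np.zeros(K)
    x_t = np.zeros(K); x_t[3] = 1.0                # a single active state component
    y_t = C @ x_t                                  # an exactly reconstructable observation
    print(dsc_energy(y_t, x_t, x_prev, C, A))      # 20.0: lam*1 + gamma*1 with zero residual

Minimizing this energy over x_t, rather than merely evaluating it, is the inference problem addressed in Section 3.2.1.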

Figure 3-2. The combined prior acting on the states in dynamic sparse coding. (A) In the case of Gaussian innovations, the combined prior is smooth around the solution and might not maintain the sparsity of the output solution. (B) In the case of sparse innovations, the combined prior acting on the states is sharper and leads to a much sparser solution.

Now, to take advantage of the spatial relationships in a local neighborhood, a small group of states x_t^{(n)}, where n ∈ {1, 2, ..., N} represents a set of contiguous patches w.r.t. the position in the image space, are added (or sum pooled) together. Such pooling of the states may lead to local translation invariance. On top of this, D-dimensional causes u_t ∈ R^D are inferred from the pooled states to obtain a representation that is invariant to more complex local transformations like rotation, spatial frequency, etc. In line with [Karklin and Lewicki 2005], this invariant function is learned such that it can capture the dependencies between the components in the pooled states. Specifically, the causes u_t are inferred by minimizing the energy function:

    E_2(u_t, x_t, B) = Σ_{n=1}^{N} Σ_{k=1}^{K} |γ_k x_{t,k}^{(n)}| + β ||u_t||_1              (3-3)
    γ_k = γ_0 [1 + exp(-[B u_t]_k)] / 2

where γ_0 > 0 is some constant. Notice that here u_t multiplicatively interacts with the accumulated states through B, modeling the shape of the sparse prior on the states. Essentially, the invariant matrix B is adapted such that each component of u_t connects to a group of components in the accumulated states that co-occur frequently. In other words, whenever a component in u_t is active it lowers the coefficient of a set of components in x_t^{(n)}, ∀n, making them more likely to be active. Since co-occurring components typically share some common statistical regularity, such activity of u_t typically leads to a locally invariant representation [Karklin and Lewicki 2005].

Though the two cost functions are presented separately above, we can combine both to devise a unified energy function of the form:

    E(x_t, u_t, θ) = Σ_{n=1}^{N} [ (1/2) ||y_t^{(n)} - C x_t^{(n)}||_2^2 + λ ||x_t^{(n)} - A x_{t-1}^{(n)}||_1
                     + Σ_{k=1}^{K} |γ_{t,k} x_{t,k}^{(n)}| ] + β ||u_t||_1                    (3-4)
    γ_{t,k} = γ_0 [1 + exp(-[B u_t]_k)] / 2

where θ = {A, B, C}. As we will discuss next, both x_t and u_t can be inferred concurrently from (3-4) by alternately updating one while keeping the other fixed using an efficient proximal gradient method.

3.2 Learning

To learn the parameters in (3-4), we alternately minimize E(x_t, u_t, θ) using a procedure similar to block coordinate descent. We first infer the latent variables (x_t, u_t) while keeping the parameters fixed and then update the parameters while keeping the variables fixed. This is done until the parameters converge. We now discuss separately the inference procedure and how we update the parameters using a gradient descent method with the variables fixed.

3.2.1 Inference

We jointly infer both x_t and u_t from (3-4) using proximal gradient methods, taking alternating gradient descent steps to update one while holding the other fixed. In other words, we alternate between updating x_t and u_t using a single update step to minimize E_1 and E_2, respectively. However, updating x_t is relatively more involved. So, keeping aside the causes, we first focus on inferring the sparse states alone from E_1, and then go back to discuss the joint inference of both the states and the causes.

Inferring States

Inferring sparse states, given the parameters, from a linear dynamical system forms the crux of our model. This is performed by finding the solution that minimizes the energy function E_1 in (3-2) with respect to the states x_t (while keeping the sparsity parameter fixed). Here there are two priors on the states: the temporal dependence and the sparsity term. Although this energy function E_1 is convex in x_t, the presence of two non-smooth terms makes it hard to use the standard optimization techniques used for sparse coding alone.

Previously, many methods have been proposed to explore the possibility of using dynamics to recover sparse, time-varying signals. Notably, some modifications to the Kalman filter have been proposed based on selecting a constrained subset of basis elements [Vaswani 2008] or using hierarchical Bayesian sparse coding [Karseras et al. 2013]. Others addressed the problem using dynamic programming [Angelosante et al. 2009], homotopy [Charles et al. 2011], a modified compressive sensing problem [Vaswani and Lu 2010] or modeling the state innovations as a Gauss-Bernoulli signal while using sampling methods [Sejdinovic et al. 2010]. Finally, Charles and Rozell [2012] proposed re-weighted ℓ1 dynamic filtering (RWL1-DF), where the dynamics over time are used to model the weighted sparsity prior on the states. However, it requires performing an ℓ1 optimization multiple
times for each time instance. Also, as we will show later, it becomes unstable when the noise in the observations becomes large. To summarize, the optimization used in all these methods is either computationally expensive and/or unstable for use in large scale problems like object recognition.

In this work, inspired by the method proposed by Chen et al. [2012a] for structured sparsity, we propose a smooth proximal gradient method that can approximate the energy function E_1 and is able to use efficient solvers like the fast iterative shrinkage thresholding algorithm (FISTA) [Beck and Teboulle 2009]. The key to our approach is to first use Nesterov's smoothness method [Nesterov 2005] to approximate the non-smooth state transition term. The resulting energy function is a convex and continuously differentiable function in x_t with a sparsity constraint, and hence, can be efficiently solved using proximal methods like FISTA.

Smooth approximation of sparse innovations: To begin, let Ω(x_t) = ||e_t||_1 where e_t = (x_t - A x_{t-1}). The idea is to find a smooth approximation to this function Ω(x_t) in e_t. Notice that, since e_t is a linear function of x_t, the approximation will also be smooth w.r.t. x_t. Now, we can re-write Ω(x_t) using the dual norm of ℓ1 as Ω(x_t) = max_{||α||_∞ ≤ 1} α^T e_t, where α ∈ R^K. Using the smoothing approximation from Nesterov [2005] on Ω(x_t):

    Ω(x_t) ≈ f_μ(e_t) = max_{||α||_∞ ≤ 1} [ α^T e_t - μ d(α) ]                               (3-5)

where d(α) = (1/2)||α||_2^2 is a smoothing function and μ is a smoothness parameter. From Nesterov's theorem [Nesterov 2005], it can be shown that f_μ(e_t) is convex and continuously differentiable in e_t and the gradient of f_μ(e_t) with respect to e_t takes the form

    ∇_{e_t} f_μ(e_t) = α*                                                                    (3-6)

where α* is the optimal solution to f_μ(e_t) = max_{||α||_∞ ≤ 1} [α^T e_t - μ d(α)]. This optimal solution for α in (3-5) is given by

    α* = argmax_{||α||_∞ ≤ 1} [ α^T e_t - (μ/2)||α||_2^2 ] = argmin_{||α||_∞ ≤ 1} ||α - e_t/μ||_2^2 = S(e_t/μ)    (3-7)

where S(·) is a function projecting e_t/μ onto the ℓ∞-ball. This is of the form:

    S(x) =  x,   -1 ≤ x ≤ 1
            1,    x > 1
           -1,    x < -1

Now, by using the chain rule and since f_μ(e_t) is also convex and continuously differentiable in x_t, the gradient of f_μ(e_t) w.r.t. x_t also turns out to be the same.

Effect of smoothing: To visualize the effect of the above described smoothing operation, we plot the function f_μ(e_t) for a one-dimensional error signal e_t ∈ R for various values of μ. Note that μ determines the maximum value of α in (3-5) (i.e., α*) corresponding to each error value. Figure 3-3 shows the resulting plots. As it indicates, the sharp point of the ℓ1-norm around the origin is smoothed in the approximated function f_μ(e_t). Also note that, as the value of μ increases, the approximation, though smoother, starts to deviate more from the ℓ1-norm. In fact, one can show that, given the desired accuracy ε of the solution, the convergence result from Theorem 2 in Chen et al. [2012a] suggests that μ = ε/K, where K is the dimension of the states, leads to the best convergence rate. We refer the reader to Chen et al. [2012a] for details.

Figure 3-3. Effect of smoothing on the cost function. The plot shows the smooth function f_μ(e_t) versus a one-dimensional error signal e_t for various values of the smoothness parameter μ.

Smoothing proximal gradient descent for DSC: With this smoothing approximation, the overall cost function from (3-2) can now be re-written as

    x_t = argmin_{x_t} (1/2) ||y_t - C x_t||_2^2 + λ f_μ(e_t) + γ ||x_t||_1                  (3-8)

with the smooth part h(x_t) = (1/2)||y_t - C x_t||_2^2 + λ f_μ(e_t), whose gradient with respect to x_t is given by

    ∇_{x_t} h(x_t) = -C^T (y_t - C x_t) + λ α*                                               (3-9)

Using the gradient information in (3-9), we solve for x_t from (3-8) using FISTA [Beck and Teboulle 2009].

Inferring Causes

Given a group of state vectors, u_t can be inferred by minimizing E_2, where we define a generative model that modulates the sparsity of the pooled state vector, Σ_n |x^{(n)}|. Here we observe that FISTA can be readily applied to infer u_t, as the smooth part of the
function E_2:

    h(u_t) = Σ_{k=1}^{K} γ_0 [1 + exp(-[B u_t]_k)] / 2 · Σ_{n=1}^{N} |x_{t,k}^{(n)}|         (3-10)

is convex and continuously differentiable. We note here that the matrix B is initialized with non-negative entries and continues to be non-negative without any additional constraints [Gregor and LeCun 2011]. This allows the gradient of h(u_t), given by:

    ∇_{u_t} h(u_t) = -B^T [ (γ_0/2) exp(-[B u_t]_k) Σ_{n=1}^{N} |x_{t,k}^{(n)}| ]_{k=1,...,K}    (3-11)

to be Lipschitz continuous and hence, guarantees convergence with a bound on the convergence rate of the solution [Beck and Teboulle 2009].

Joint Inference

We showed thus far that both x_t and u_t can be inferred from their respective energy functions using a first-order proximal method called FISTA. However, for joint inference we have to minimize the combined energy function in (3-4) over both x_t and u_t. We do this by alternately updating x_t and u_t while holding the other fixed and using a single FISTA update step at each iteration. It is important to point out that the internal FISTA step size parameters are maintained between iterations. This procedure is equivalent to alternating minimization using gradient descent. Although this procedure no longer guarantees convergence of both x_t and u_t to the optimal solution, in all of our simulations it leads to a reasonably good solution. Please refer to Algorithm 1 for details. Note that, with the alternating update procedure, each x_t is now influenced by the feed-forward observations, the temporal predictions and the feedback connections from the causes.

Algorithm 1. Updating x_t and u_t simultaneously using a FISTA-like procedure.

Require: Take L_{x_{0,n}} > 0 ∀n ∈ {1, 2, ..., N}, L_{u_0} > 0 and some η > 1.
1: Initialize x_{0,n} ∈ R^K ∀n ∈ {1, 2, ..., N}, u_0 ∈ R^D and set ν_1 = u_0, z_{1,n} = x_{0,n}.
2: Set step-size parameter: τ_1 = 1.
3: while no convergence do
4:   Update γ = γ_0 (1 + exp(-[B u_i])) / 2
5:   for n ∈ {1, 2, ..., N} do
6:     Line search: find the best step size L_{x_{i,n}}.
7:     Compute α* from (3-7).
8:     Update x_{i,n} using the gradient from (3-9) with a soft-thresholding function.
9:     Update internal variables z_{i+1} with step size parameter τ_i as in [Beck and Teboulle 2009].
10:  end for
11:  Compute Σ_{n=1}^{N} |x_{i,n}|
12:  Line search: find the best step size L_{u_i}.
13:  Update u_i using the gradient of (3-11) with a soft-thresholding function.
14:  Update internal variables ν_{i+1} with step size parameter τ_i as in [Beck and Teboulle 2009].
15:  Update τ_{i+1} = (1 + sqrt(4 τ_i^2 + 1)) / 2
16:  Check for convergence.
17:  i = i + 1
18: end while
19: return x_{i,n} ∀n ∈ {1, 2, ..., N} and u_i

3.2.2 Parameter Updates

With x_t and u_t fixed, we update the parameters by minimizing E in (3-4) with respect to θ. Since the inputs here are a time-varying sequence, the parameters are updated using dual estimation filtering [Nelson 2000]; i.e., we put an additional constraint on the parameters such that they follow a state-space equation of the form:

    θ_t = θ_{t-1} + z_t                                                                      (3-12)

where z_t is Gaussian transition noise over the parameters. This keeps track of their temporal relationships. Along with this constraint, we update the parameters using
gradient descent. Notice that with fixed x_t and u_t, each of the parameter matrices can be updated independently, with gradients obtained as follows:

    ∇_A E(θ) = -λ sign(x_t - A_t x_{t-1}) x_{t-1}^T + ρ (A_t - A_{t-1})
    ∇_C E(θ) = -(y_t - C_t x_t) x_t^T + ρ (C_t - C_{t-1})                                    (3-13)
    ∇_B E(θ) = -(γ_0/2) ( exp{-[B u_t]} ⊙ |x_t| ) u_t^T + ρ (B_t - B_{t-1})

where ρ acts as a forgetting factor. Matrices C and B are column normalized after the update to avoid any trivial solution.

Mini-Batch Update: To get faster convergence, the parameters are updated after performing inference over a large sequence of inputs instead of at every time instance. With this batch of signals, more sophisticated gradient methods, like conjugate gradient, can be used and, hence, can lead to more accurate and faster convergence.

3.3 Experiments

3.3.1 Inferring Sparse States with Known Parameters

We consider an experimental set-up similar to the one used by Charles and Rozell [2012] with synthetic data and compare the performance of the proposed dynamic sparse coding (DSC) with other methods: sparse coding using FISTA (SC) [Beck and Teboulle 2009], the Kalman filter [Kalman 1960], and re-weighted ℓ1 dynamic filtering (RWL1-DF) [Charles and Rozell 2012]. We also compare our method while considering the state innovations in (3-1) as Gaussian (SC-L2 Innov.), as depicted in Figure 3-2 (we perform inference on this model using FISTA [Beck and Teboulle 2009]).

Specifically, the experimental set-up is as follows: we simulate a state sequence with only 20 non-zero elements in a 500-dimensional state vector evolving with a permutation matrix (note that this keeps the number of non-zero elements the same over time), which is
different for every time instant. This state sequence is then passed through a Gaussian scaling matrix to generate a sequence of observations. We vary the observation dimensions (p) depending on the experiment, which will be specified later. We consider that both the permutation and the scaling matrices are known a priori. The observation noise is Gaussian with zero mean and variance σ² = 0.001. We consider sparse state-transition noise, which is simulated by choosing a subset of active elements (n) in the state vector randomly and switching each of them with a randomly chosen element (with uniform probability over the state vector). This resembles a sparse innovation in the states with 2n wrongly placed elements, one missing element and one additional element.

We use these generated observation sequences as inputs and use the a priori known parameters to infer the states x_t. To set the hyper-parameters, we perform a parameter sweep to find the best configuration for each method. We compare the inferred states from different methods with the true states in terms of relative mean squared error (rMSE), defined as ||x_t^{est} - x_t^{true}|| / ||x_t^{true}||.

Table 3-1. Computational time of different methods on synthetic data (per time sample).
Method              DSC     SC-L2 Innov.   SC      RWL1-DF
Time (in seconds)   0.17    0.16           0.27    0.54

Figure 3-4 shows the tracking performance of the different methods; see the caption for details about the model used. Also, Table 3-1 shows the computation time (per time instance) for all the methods (all computations are done on an 8-core Intel Xeon, 2.4 GHz processor).

Figure 3-4. Performance of dynamic sparse coding with sparse innovations while tracking sparse states. The plot shows the relative mean square error (rMSE) versus time and each plot is an average over 40 runs. The experiment is performed with p = 70, n = 3 and we set the parameters λ = 10 and γ = 10 in (3-2). The Kalman filter completely failed to track the states and is not shown here.

We observe that dynamic sparse coding (DSC) is able to track the states over time more accurately than sparse coding (SC), which does not use any dynamics. The dynamic model with Gaussian innovations (SC-L2 Innov.), though it performs better than the sparse coding model at times, is not able to track the states accurately, re-asserting our argument that considering sparse innovations makes the model more stable and consistent. Finally, RWL1-DF is able to track the states as accurately as our model, but it requires several observations before reaching a steady state and is computationally more expensive. In fact, we observed that RWL1-DF becomes very unstable when the observations have inadequate information, as a result of very noisy observations or when the number of observation dimensions is small. We discuss more about this in the following experiments.

Figure 3-5. Performance of the dynamic sparse coding with varying number of observation dimensions. We use a similar set of parameters as before, λ = 10, γ = 10 and n = 3.

Figure 3-6. Performance of the dynamic sparse coding with varying (sparse) noise levels (n). We use a similar set of parameters as before, λ = 10, γ = 10 and p = 70.

Figure 3-5 shows the steady state error (rMSE) after 50 time instances versus the dimensionality of the observation sequence (p). Each point is obtained after averaging over 50 runs. We observe that DSC is able to track the true states even for low dimensional observations, when the other methods fail. This shows that the temporal relations adopted in the model provide contextual information necessary to track the changes in the observations, even when the information provided by the instantaneous observations is not sufficient. Notice also that RWL1-DF becomes very unstable when the dimensions of the observations are small.

The same can be extrapolated to the case of noisy observation sequences, where the essential information in the time sequence is scarce. Figure 3-6 shows the
PAGE 41

performanceofallthemethodsversusvaryingsparsenoiselevels(n).Again,weobservethatDSCoutperformsothermethods,particularlywhenthenoiselevelsarehigh.Also,noticethattheRWL1-DFbecomesveryunstablewhenthenoiselevelsarehigh. 3.3.2LearningfromNaturalImageSequences AReceptiveeldsofthestates BReceptiveeldsofthecauses Figure3-7. Receptiveeldsofthemodelatdifferentlevels.(A)Showsthereceptiveeldsofthestates,i.e.,thecolumnsoftheobservationmatrixC(1).(B)showsthereceptiveeldsofthecauses.Thereceptivehereareconstructedasaweightedcombinationofthecolumnsofthelayer1observationmatrixC(1). Inthefollowingexperimentsweshowthattheworkingofthestatesandcausesresemblethatofsimpleandcomplexcellsinthevisualcortex.Thestatesactassimplefeaturedetectors,whilecausesencodecomplexinvariances.However,thekeytoourmodelisthattheresponsesofboththestatesandthecausesareinuencedbythecontext,comingfrombothtemporalandtop-downconnections,makingthemcapabletorepresentobservationsthatarebeyondtheircharacteristicreceptiveelds.Firstly,weconsiderlearningamodelfromnaturalvideosequencesobtainedfromtheVanHateren'svideodatabase[ vanHaterenandRuderman 1998 ].Thisdatabasecontainsseveralvideoclipsofnaturalscenescontaininganimals,tree,etc.andeachframeofthesevideoclipsispreprocessedusinglocalcontrastnormalization 41

PAGE 42

AScatterplotofthecenters BPolarplotoforientationversusfrequency Figure3-8. VisualizingtheobservationmatrixlearnedfromnaturalvideosusingGabort.(A)Showsthescatterplotofthecenterpositionsofthedictionaryelementsalongwiththeorientationsand(B)showsthepolarplotoforientationsversusspatialfrequency. asdescribedby Jarrettetal. [ 2009a ].Sequencesofpatchesarethenextractedfromthepreprocessedvideosequencestolearntheparametersofthemodel.Weuse1717pixelpatchestolearn400dimensionalstatesand100dimensionalcauses.Weconsiderthepoolingbetweenthestatesandthecausestobe22,i.e.,wefurtherdivideeachofthe1717patchesinto4overlapping1515pixelpatchesandthestatesextractedfromeachofthesesubdividedpatchesarepooledtoobtainthecauses(refertoFigure 3-1 ).Figure 3-7 showsthereceptiveeldsofthestates/causesatdifferentlevelsofthelearnedmodel.ThereceptiveeldsofthestatesaresimpleinclinedlinesresemblingGaborfunction,whilethatofthecausesresemblegroupingoftheseprimitivefeature,localizedinorientationand/orspatialposition(moreaboutthisisSection 3.3.2.1 ). 3.3.2.1VisualizinginvarianceTogetabetterunderstandingoftheinvariancelearnedbythemodel,wevisualizetheconnectionsbetweentherstlayerstatesandthecauses.Figure 3-9 showstheresultsobtainedseecaptionformoredetails.Weobservethatmostofthecolumnsoftheinvariancematrixgrouptogetherdictionaryelementsthathavesimilarorientationandfrequency,whilebeinginvarianttotheotherpropertiesliketranslation.However, 42


Figure 3-9. Visualizing the invariance represented by the causes. First, each dictionary element (with a receptive field of 15x15 pixels) in the observation matrix C(1) is fitted with a Gabor function and is parametrized in terms of the center position, spatial frequency and orientation of the Gabor function, as shown in Figure 3-8. Then the connection strength between the invariance matrix B(1) and the observation matrix C(1) is plotted, i.e., the subset of the dictionary elements that are most likely to be active when a column of the invariance matrix is active are shown. Each box represents one column of the invariance matrix B(1), and 10 out of 100 columns are randomly selected. The figure shows the center and orientation scatter plots (A) and the corresponding spatial frequency and orientation polar plots (B), highlighting the subset of most active dictionary elements for selected columns of B(1) (darker colors indicate stronger connections). Notice that, for each active column in B(1), a subset of the dictionary elements (not unique) are grouped together according to orientation and/or spatial position, indicating invariance to other properties like spatial frequency and center position.


Figure 3-10. Visualizing the temporal structure learned by the model. (A) depicts the connection strength (of matrix A(1)) between the layer 1 state elements over time. Read it as follows: if the basis on the left is active at time t (presynaptic), then the corresponding set of bases on the right indicates the prediction for time t+1 (postsynaptic). This indicates that certain properties, like orientation and spatial position, change smoothly over time. (B) The scatter plot of the 15 strongest connections per element in the matrix A(1), arranged according to the orientation selectivity of the pre- and postsynaptic elements. Notice that most points lie within pi/6 of the diagonal, indicated by the black lines.

However, there are other types of invariances as well, where the dictionary elements are grouped only by spatial location while being invariant to the other properties (Appendix B.1 shows a visualization of all the causes).

3.3.2.2 Learning temporal structure

As shown above, the receptive fields of the bottom layer states in our model resemble those of simple cells in the V1 area of the visual cortex. It is now well known that these cells act as simple oriented filters and strongly respond to particular inputs [Olshausen and Field 1996].


However, recent studies show that their influence extends beyond their receptive fields [Rust et al. 2005], modulating the response of other cells in both excitatory and inhibitory ways, depending on the spatial and temporal contextual information. In our approach, such temporal context at each layer is modeled using the parameter matrix A(l), for all l. Figure 3-10 shows a visualization of this matrix at the bottom layer. We observe that the model learns to maintain certain properties, like orientation and spatial position, over time. In other words, given that a basis is active at a particular time, it has excitatory connections with a group of bases (sometimes with a strong self-recurrent connection), making them more likely to be active at the next time instance. On the other hand, along with the sparsity regularization, it also inhibits the response of other elements that are not strongly connected with the active basis over time.

3.3.3 Role of Temporal Connections: Toy Data

In the previous experiment we have shown that the states at any time t are closely related to the states at time t-1. But we left out an important question: how does the previous state of the system affect our perception (or encoding) of the present observations? In other words, does the context help us to disambiguate an aliased observation, i.e., can we differentiate a similar pattern occurring in two different sequences? We try to answer this here by considering a toy data example. This problem was previously studied in the predictive coding framework as well [Rao 1999].

In this experiment, an observation sequence is made up of patches, with each patch containing parallel lines (the number of lines chosen uniformly between 1 and 5) with the same orientation (also chosen uniformly from four different orientations), such that from one frame to the next the patch is shifted by only one pixel. Figure 3-11A shows a set of sequences, each row representing one sequence, generated using the above procedure. Again, such sequences are concatenated to form a longer sequence of observations.


Figure 3-11. Role of temporal connections during inference. (A) Toy sequences used in the experiment, each row indicating one sequence. (B) The basis (or matrix C) learned from these inputs. (C) The inferred state vector at time t = 3 when the same sequence is presented in two different orders. We observe that, depending on the ordering, the same pattern has a different representation.


Since we are interested here only in the temporal connections, we do not consider any causes (i.e., we fix u_t = 0) during this experiment. After learning the system on the observation sequences, we fix the parameters and present to the system two sequences: a sequence of a particular shape and the same sequence in reverse order, as shown in Figure 3-11C. Note that at time t = 3 the input is the same but in a different context; in the first case the pattern moves from top to bottom, while in the second case it moves in the opposite direction. We observe that the inferred states at time t = 3 are different, i.e., the context in which a particular pattern is observed changes its representation. Note that this is not possible when only the instantaneous observation is available at each time. In addition to this, the system is also capable of learning a representation such that it can still generate the observations back using the basis, even though with different representations.


CHAPTER 4
DEEP PREDICTIVE CODING NETWORKS

In this chapter we discuss a hierarchical dynamic model called deep predictive coding networks. The feature extraction block discussed in Chapter 3 is used as the basic building block to construct this deep network. In line with other deep learning methods, we use these basic building blocks to construct a hierarchical model using greedy layer-wise unsupervised learning. The hierarchical model is built such that the output from one layer acts as an input to the layer above. In other words, the layers are arranged in a Markov chain, such that the states at any layer are only dependent on the representations in the layers below and above and are independent of the rest of the model. The overall goal of the dynamical system at any layer is to make the best prediction of the representation in the layer below, using the top-down information from the layers above and the temporal information from the previous states. Hence the name deep predictive coding networks (DPCN).

4.1 Multi-Layered Architecture

The architecture of the multilayered processing model is a tree structure, with the simple encoding module described in Chapter 3 replicated at each node of the tree (Figure 4-1). At the bottom layer, the nodes are arranged as a tiling of the entire visual scene and the parameters across all the nodes are tied, resembling a convolution over the input frame. Each node encodes a small patch of the input video sequence, which is useful for parallelizing the computation. The computational model is uniform within a layer and across layers, albeit with different dimensions; the only thing that changes is the nature of the input data. Note that within each block the features extracted from a spatial neighborhood are pooled together, indicating a progressively increasing receptive field size of the nodes with the depth of the network. For this reason, we can also expect that the time scale of a given feature's activation slows down with the depth of the module. Parameter learning at each layer uses a greedy layer-wise procedure, i.e., the parameters of the bottom layer modules are learned first, from a sequence of small patches extracted from the input video sequences, and only then does the learning of the next layer start.


Figure 4-1. A two-layered hierarchical model constructed by stacking several state-space models. For visualization, no overlapping is shown between the image patches here, but overlapping patches are considered during the actual implementation.

Figure 4-2 exemplifies the inference on a two-layer network, with a single module in each layer for simplicity. Here the layers in the hierarchy are arranged in a Markov chain, i.e., the variables at any layer are only influenced by the variables in the layer below and the layer above. Specifically, at the bottom layer, sequences of patches (y_t) extracted from fixed spatial locations spread across the entire 2D space of the video frames are fed as inputs to the first layer modules. On the other hand, the top-down predictions of the first layer causes coming from the second layer try to modulate the representations. The bidirectional nature of the model is apparent in this figure, and in general there may be an extra top-down prediction as input to provide context for the analysis. Next we will include the modifications in the general equations to exploit this extra information.


Figure 4-2. Block diagram showing the flow of bottom-up and top-down information in the model.

4.2 Inference in Multi-Layered Network with Top-Down Connections

With the parameters fixed, inferring the latent variables at any intermediate layer involves obtaining a useful representation of the data-driven bottom-up information while combining the top-down influences from the higher layers. In other words, while the dynamic network at each layer tries to extract useful information from the inputs for recognition, the top-down connections modulate the representations at each level with abstract knowledge from the higher layers. As we will show next, the top-down connections convey contextual information that endows the model with prior knowledge for extracting task-specific information from noisy inputs. More formally, at any layer (l), the energy function that needs to be minimized to infer x_t^{(l)} and u_t^{(l)} is given by

E_l(x_t^{(l)}, u_t^{(l)}, \theta^{(l)}) = \sum_{n=1}^{N} \Big[ \tfrac{1}{2}\|u_t^{(l-1,n)} - C^{(l)} x_t^{(l,n)}\|_2^2 + \lambda \|x_t^{(l,n)} - A^{(l)} x_{t-1}^{(l,n)}\|_1 + \sum_{k=1}^{K} |\gamma_{t,k}^{(l)} x_{t,k}^{(l,n)}| \Big] + \beta \|u_t^{(l)}\|_1 + \tfrac{1}{2}\|u_t^{(l)} - \hat{u}_t^{(l+1)}\|_2^2    (4)

\gamma_{t,k}^{(l)} = \gamma_0 \Big[ \frac{1 + \exp(-[B^{(l)} u_t^{(l)}]_k)}{2} \Big]


where \hat{u}_t^{(l)} = C^{(l+1)} x_t^{(l+1)} is the top-down prediction of the causes coming from the state-space model in the layer above. This additional term involving \hat{u}_t influences the representation at the l-th layer by reducing the top-down prediction error. In other words, the goal is to match the representation of the inputs from the layer below with the belief of the layer above about that same representation.

Ideally, to perform inference in this hierarchical model, all the states and the causes would have to be updated simultaneously, depending on the present state of all the other layers, until the model reaches equilibrium [Friston 2008]. However, such a procedure can be very slow in practice. Instead, we propose an approximate inference procedure that only requires a single top-down flow of information and then a single bottom-up inference using this top-down information. Specifically, before the arrival of a new observation at time t, at each layer (l) (starting from the top layer) we first propagate the most likely causes to the layer below, using the state at the previous time instance x_{t-1}^{(l)} and the predicted causes \hat{u}_t^{(l+1)}. More formally, the top-down prediction at layer l is obtained as

\hat{u}_t^{(l)} = C^{(l)} \hat{x}_t^{(l)}, \quad \text{where} \quad \hat{x}_t^{(l)} = \arg\min_{x_t^{(l)}} \; \lambda^{(l)} \|x_t^{(l)} - A^{(l)} x_{t-1}^{(l)}\|_1 + \gamma_0 \sum_{k=1}^{K} |\hat{\gamma}_{t,k} x_{t,k}^{(l)}|    (4)

and \hat{\gamma}_{t,k} = \big(1 + \exp(-[B^{(l)} \hat{u}_t^{(l+1)}]_k)\big)/2. At the topmost layer, L, a bias is set such that \hat{u}_t^{(L)} = \hat{u}_{t-1}^{(L)}, i.e., the top layer induces some temporal coherence on the final outputs. From ( 4 ), it is easy to show that the predicted states for layer l can be obtained as

\hat{x}_{t,k}^{(l)} = \begin{cases} [A^{(l)} x_{t-1}^{(l)}]_k, & \gamma_0 \hat{\gamma}_{t,k} < \lambda^{(l)} \\ 0, & \gamma_0 \hat{\gamma}_{t,k} \ge \lambda^{(l)} \end{cases}    (4)
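A minimal sketch of this prediction step for a single layer, assuming dense numpy arrays for A^(l), B^(l), C^(l) and vector-valued states and causes; the function and variable names are illustrative, not the implementation used in this work:

import numpy as np

def topdown_predict(x_prev, u_hat_above, A, B, C, lam, gamma0=1.0):
    # Modulated sparsity weights induced by the higher-layer prediction of the causes.
    gamma_hat = gamma0 * (1.0 + np.exp(-(B @ u_hat_above))) / 2.0
    # Keep the temporal prediction A x_{t-1} only where the modulated penalty
    # stays below the temporal weight lambda; zero the remaining components.
    x_hat = np.where(gamma_hat < lam, A @ x_prev, 0.0)
    # Top-down prediction of the causes in the layer below.
    return C @ x_hat, x_hat

# toy usage with random parameters (dimensions are arbitrary)
rng = np.random.default_rng(0)
k, d, p = 8, 4, 6
A = rng.standard_normal((k, k)); B = rng.random((k, d)); C = rng.standard_normal((p, k))
u_hat_below, x_hat = topdown_predict(rng.standard_normal(k), rng.random(d), A, B, C, lam=0.8)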


These predicted causes \hat{u}_t^{(l)}, for all l in {1, 2, ..., L}, are substituted into ( 4 ) and a single layer-wise bottom-up inference is performed as described in Section 3.2.1 (the additional term \tfrac{1}{2}\|u_t^{(l)} - \hat{u}_t^{(l+1)}\|_2^2 in the energy function only leads to a minor modification of the inference procedure, namely that it has to be added to h(u_t) in ( 3 )). The combined prior now imposed on the causes, \beta\|u_t^{(l)}\|_1 + \tfrac{1}{2}\|u_t^{(l)} - \hat{u}_t^{(l+1)}\|_2^2, is similar to the elastic net prior discussed in [Zou and Hastie 2005], leading to a smoother and biased estimate of the causes.

4.3 Experiments

4.3.1 Receptive Fields in the Hierarchical Model

Firstly, we would like to test the ability of the proposed model to learn complex features in the higher layers of the model. For this we train a two-layered network on a natural video. Each frame in the video was first contrast normalized as described in [Kavukcuoglu et al. 2010b]. Then, we train the first layer of the model on 4 overlapping contiguous 15x15 pixel patches from this video; this layer has 400-dimensional states and 100-dimensional causes. The causes pool the states related to all 4 patches. The separation between the overlapping patches here was 2 pixels, implying that the receptive field of the causes in the first layer is 17x17 pixels. Similarly, the second layer is trained on 4 causes from the first layer, obtained from 4 overlapping 17x17 pixel patches from the video. The separation between the patches here is 3 pixels, implying that the receptive field of the causes in the second layer is 20x20 pixels. The second layer contains 200-dimensional states and 50-dimensional causes that pool the states related to all 4 patches.

Figure 4-3 shows the visualization of the receptive fields of the invariant units (columns of matrix B) at each layer. We observe that each dimension of the causes in the first layer represents a group of primitive features (like inclined lines) which are localized in orientation or position (please refer to the supplementary material for more results).


Figure 4-3. Visualization of the receptive fields of the invariant units learned when trained on natural videos: (A) layer 1 states, (B) layer 1 causes, (C) layer 2 causes. The receptive fields are constructed as a weighted combination of the dictionary of filters at the bottom layer.

The causes in the second layer, on the other hand, represent more complex features, like corners, angles, etc. These filters are consistent with previously proposed methods like Lee et al. [2009] and Zeiler et al. [2010].

4.3.2 Role of Top-Down Information

In this section, we show the role of the top-down information during inference, particularly in the presence of structured noise. Video sequences consisting of objects of three different shapes (refer to Figure 4-4A) were constructed. The objective is to classify each frame as coming from one of the three different classes. For this, several 32x32 pixel, 100-frame-long sequences were made using two objects of the same shape bouncing off each other and the walls. Several such sequences were then concatenated to form a 30,000-frame-long sequence. We train a two-layer network using this sequence. First, we divided each frame into 12x12 patches with neighboring patches overlapping by 4 pixels; each frame is divided into 16 patches. The bottom layer was trained such that the 12x12 patches were used as inputs and were encoded using a 100-dimensional state vector.


Figure 4-4. Part of the (A) clean and (B) corrupted video sequences constructed using three different shapes. Each row indicates one sequence.

The states from 4 contiguous neighboring patches were pooled to infer the causes, which have 40 dimensions. The second layer was trained with 4 first-layer causes as inputs, which were themselves inferred from 20x20 contiguous overlapping blocks of the video frames. The states here are 60-dimensional and the causes have only 3 dimensions. It is important to note that the receptive field of the second layer causes encompasses the entire frame.

We test the performance of the DPCN under two conditions. The first case is 300 frames of clean video, with 100 frames per shape, constructed as described above. We consider this as a single video, without considering any discontinuities. In the second case, we corrupt the clean video with structured noise, where we randomly pick a number of objects from the same three shapes (with a Poisson distribution with mean 1.5) and add them to each frame independently at random locations. There is no correlation between any two consecutive frames regarding where the noisy objects are added (see Figure 4-4B).

First we consider the clean video and perform inference with only bottom-up inference, i.e., during inference we consider \hat{u}_t^{(l)} = 0, for all l in {1, 2}. Figure 4-5A shows the scatter plot of the three-dimensional causes at the top layer. Clearly, there are 3 clusters recognizing the three different shapes in the video sequence. Figure 4-5B shows the scatter plot when the same procedure is applied to the noisy video.


Figure 4-5. The scatter plot of the 3-dimensional causes at the top layer for (A) clean video with only bottom-up inference, (B) corrupted video with only bottom-up inference and (C) corrupted video with top-down flow along with bottom-up inference. At each point, the shape of the marker indicates the true shape of the object in the frame.

We observe that the 3 shapes cannot be clearly distinguished here. Finally, we use top-down information along with the bottom-up inference, as described in Section 4.2, on the noisy data. We argue that, since the second layer learned class-specific information, the top-down information can help the bottom layer units to disambiguate the noisy objects from the true objects. Figure 4-5C shows the scatter plot for this case. Clearly, with the top-down information, in spite of the largely corrupted sequence, the DPCN is able to separate the frames belonging to the three shapes (the trace from one cluster to the other is because of the temporal coherence imposed on the causes at the top layer).


In this work we proposed the deep predictive coding network, a generative model that empirically alters the priors in a dynamic and context-sensitive manner. This model is composed of two main components: (a) linear dynamical models with sparse states used for feature extraction, and (b) top-down information to adapt the empirical priors. The dynamic model captures the temporal dependencies and reduces the instability usually associated with sparse coding, while the task-specific information from the top layers helps to resolve ambiguities in the lower layers, improving data representation in the presence of noise. Our approach can be extended with convolutional methods, paving the way for the implementation of high-level tasks, like object recognition, on large-scale videos or images. This will be discussed in the next chapter.


CHAPTER 5
CONVOLUTIONAL DYNAMIC NETWORK FOR LARGE SCALE OBJECT RECOGNITION

In this chapter, we consider a specific architecture based on spatial convolution for the state-space model discussed in the previous chapters, designed to extract information that is invariant to transformations of the objects in the input scene. We show that this convolutional dynamic network (CDN) can effectively combine the bottom-up, top-down and lateral (or temporal) influences in the network. More importantly, it can scale to large images/frames and learn a decomposition of object parts in a hierarchical fashion. We show the performance of the model on several benchmark object and face recognition tasks and show that the proposed model is better than or comparable to other methods in the literature. We also discuss, using several experiments, the influence that the top-down and temporal connections have on the performance of the proposed model.

5.1 Model Architecture

5.1.1 Single Layer Model

We first consider a single layer model, shown in Figure 5-1, to process a video sequence. Here the inputs/observations to the model are a sequence of video frames I_t, for all t in {1, 2, ..., T}, and each frame is composed of M color channels, denoted {I^1_t, I^2_t, ..., I^M_t}. We assume that each channel I^m_t is modeled as an observation of a state-space model, with the same set of states used across all the channels. Specifically, each channel I^m_t is modeled as a linear combination of K matrices, X^k_t, for all k in {1, 2, ..., K}, convolved with filters C^{m,k}. The state-space equations for this model can be written as

I^m_t = \sum_{k=1}^{K} C^{m,k} * X^k_t + N^m_t, \quad \forall m \in \{1, 2, ..., M\}
X^k_t(i,j) = \sum_{\tilde{k}=1}^{K} a_{k,\tilde{k}} X^{\tilde{k}}_{t-1}(i,j) + V^k_t(i,j)    (5)


where * denotes convolution. If I^m_t is a w x h frame and C^{m,k} is an s x s pixel filter, then X^k_t is a matrix of size (w+s-1) x (h+s-1). We refer to X_t = {X^k_t} as state maps (or sometimes simply as states).

Figure 5-1. Block diagram of a single layer convolutional dynamic network. The inputs here contain 3 channels (denoted in RGB colors) and each channel is modeled as a combination of the state maps (black) convolved with filters C (blue). The pooled state maps (orange) are decomposed using the cause maps (purple) convolved with filters B (blue). During inference there is a two-way interaction between the state and the cause mappings through pooling/unpooling operations, which is left implicit here.

Also, a_{k,\tilde{k}} indicates the lateral connections between the state maps over time. Since we are only interested in object recognition in this work, we assume

a_{k,\tilde{k}} = \begin{cases} 1, & k = \tilde{k} \\ 0, & \text{otherwise} \end{cases}

i.e., we consider only self-recurrent connections between the state maps, which encourages temporal coherence. However, one could also model the motion in the input sequences by learning the coefficients a_{k,\tilde{k}} along with the rest of the model parameters [Cadieu and Olshausen 2008].
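As an illustration of the observation equation, the sketch below reconstructs a single channel from a set of state maps using scipy's 2-D convolution; the choice of 'valid' convolution is what makes the stated sizes consistent (all names and shapes here are illustrative):

import numpy as np
from scipy.signal import convolve2d

def reconstruct_channel(X, C_m):
    # I^m_t = sum_k C^{m,k} * X^k_t, with 'valid' convolution so that
    # (w+s-1) x (h+s-1) state maps and s x s filters give back a w x h frame.
    return sum(convolve2d(Xk, Ck, mode="valid") for Xk, Ck in zip(X, C_m))

# toy shapes: K = 4 state maps for a 32 x 32 frame and 7 x 7 filters
rng = np.random.default_rng(0)
w = h = 32; s = 7; K = 4
X = rng.standard_normal((K, w + s - 1, h + s - 1))
C_m = rng.standard_normal((K, s, s))
frame = reconstruct_channel(X, C_m)
assert frame.shape == (w, h)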


Since ( 5 ) is an under-determined model, we regularize it with a sparsity constraint on the states to obtain a unique solution. Hence, the combined energy function for the state-space model in ( 5 ) can be written as follows:

E_x(X_t, C) = \sum_{m=1}^{M} \|I^m_t - \sum_{k=1}^{K} C^{m,k} * X^k_t\|_2^2 + \lambda \|X_t - X_{t-1}\|_1 + \sum_{k=1}^{K} \gamma_k \cdot |X^k_t|    (5)

Notice that we consider the state transition noise V_t in ( 5 ) to also be sparse, so that it is consistent with the sparsity of the states. This makes practical sense, as the number of changes between two consecutive frames in a typical video sequence is small. In ( 5 ), \gamma_k is a sparsity parameter on the k-th state map. Instead of assuming that the sparsity of the states is constant (or that the prior distribution over the states is stationary), as in [Zeiler et al. 2010], here we consider that the cause maps (or causes) U_t modulate the activity of the states through the sparsity parameter. In line with the model proposed by Karklin and Lewicki [2005], we express the sparsity parameter \gamma \in R^{(w+s-1)(h+s-1)K} in terms of the causes U_t \in R^{(w+s-p)(h+s-p)D} as

\gamma_k = \frac{\gamma_0}{2}\Big[1 + \exp\Big(-\sum_{d=1}^{D} B^{k,d} * U^d_t\Big)\Big]    (5)

where \gamma_0 > 0 is a constant. This non-linear multiplicative interaction between the state and the cause mappings leads to extracting information that is invariant to several transformations of the inputs. Essentially, through the filters B^{k,d} \in R^{p \times p}, the U^d_t learn to group together states that co-occur frequently. Since co-occurring components typically share some common statistical regularity, such activity typically leads to a locally invariant representation [Karklin and Lewicki 2005].
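A sketch of how the cause maps could set the element-wise sparsity weights of the states, following the expression above (using 'full' convolution so that each gamma_k matches the size of its state map; names and shapes are illustrative):

import numpy as np
from scipy.signal import convolve2d

def sparsity_maps(U, B, gamma0):
    # gamma_k = (gamma0 / 2) * (1 + exp(-sum_d B^{k,d} * U^d_t)); the 'full'
    # convolution of a (w+s-p) x (h+s-p) cause map with a p x p filter gives a
    # map of the same size as the state maps, (w+s-1) x (h+s-1).
    K = B.shape[0]
    drive = [sum(convolve2d(Ud, B[k, d], mode="full") for d, Ud in enumerate(U)) for k in range(K)]
    return (gamma0 / 2.0) * (1.0 + np.exp(-np.stack(drive)))

# toy shapes: D = 2 cause maps, K = 4 state maps, p = 5 pixel grouping filters
rng = np.random.default_rng(0)
K, D, p = 4, 2, 5
U = rng.random((D, 34, 34))
B = rng.random((K, D, p, p))      # non-negative grouping filters
gamma = sparsity_maps(U, B, gamma0=0.3)
assert gamma.shape == (K, 38, 38)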


More importantly, unlike many other deep learning methods [Hinton et al. 2006; Zeiler et al. 2010], the activity of the causes influences the states directly through the top-down connections (B^{k,d}), and the statistical grouping is learned from the data instead of using pre-determined topographic connections [Hyvarinen and Hoyer 2001].

Given fixed state maps, the energy function that needs to be minimized to obtain the causes is

E_u(U_t, B) = \sum_{k=1}^{K} \frac{\gamma_0}{2}\Big[1 + \exp\Big(-\sum_{d=1}^{D} B^{k,d} * U^d_t\Big)\Big] \cdot |X^k_t| + \beta \|U_t\|_1    (5)

where we regularize the solution using an l1 sparsity penalty. We note here that all the elements of B are initialized to be non-negative and they remain so without any additional constraint. This ensures that the gradient of the smooth part (the first term) of E_u(.) is Lipschitz continuous, allowing us to use proximal methods to infer U_t with guaranteed convergence [Beck and Teboulle 2009], as discussed in Section 5.2.

5.1.2 Building a Hierarchy

Several of these single-layer models can easily be stacked to form a hierarchical model. As in many other deep architectures, such as deep belief networks [Hinton et al. 2006], stacked auto-encoders [Vincent et al. 2010], etc., the outputs (or cause maps) from one layer act as inputs to the layer above. However, unlike those models, each layer receives, along with the bottom-up inputs, a top-down prediction of its output causes. The goal during inference of the states and the causes at any layer is then to come up with representations that best predict the inputs while reducing the top-down prediction error.


More formally, combining the top-down predictions into the single-layer model described in Section 5.1.1, the energy function at the l-th layer in the hierarchical model can be written as

E_l(X^l_t, U^l_t, C_l, A_l, B_l) = \sum_{m=1}^{D_{l-1}} \|U^{m,l-1}_t - \sum_{k=1}^{K_l} C^{m,k}_l * X^{k,l}_t\|_2^2 + \lambda_l \|X^l_t - X^l_{t-1}\|_1 + \sum_{k=1}^{K_l} \gamma_k \cdot |X^{k,l}_t| + \beta_l \|U^l_t\|_1 + \eta_l \|U^l_t - \hat{U}^l_t\|_2^2    (5)

\gamma_k = \frac{\gamma_0}{2}\Big[1 + \exp\Big(-\sum_{d=1}^{D_l} B^{k,d}_l * U^{d,l}_t\Big)\Big]

where U^{l-1}_t are the causes coming from the layer below and \hat{U}^l_t is the top-down prediction coming from the state-space model in the layer above (here \lambda_l, \beta_l and \eta_l denote the weights on the temporal, cause-sparsity and top-down terms, respectively). As indicated by the energy function in ( 5 ), the architecture at each layer is similar to the single-layer model described before, though the number of states (K_l) and causes (D_l) may vary over the layers.

5.1.3 Implementation Details

To make the implementation more efficient, we introduce some restrictions on the architecture. Firstly, like other convolutional models [LeCun et al. 1989; Lee et al. 2009; Kavukcuoglu et al. 2010a; Zeiler et al. 2010], we assume sparse connectivity both between the inputs and the states and between the states and the causes. This not only increases the efficiency during inference but also breaks the symmetry between layers and helps to learn complex relationships. Secondly, we shrink the size of the states using max pooling between the states and the causes. Correspondingly, the sparsity parameters (\gamma) obtained from the causes are unpooled during inference of the states (see Section 5.2 for details). This reduces the size of the inputs going into the higher layers and hence makes inference more efficient. Also, pooling is known to produce better invariant representations [Boureau et al. 2010].

5.2 Inference

At any layer l, inference involves finding the states X^l_t and the causes U^l_t that minimize the energy function E_l in ( 5 ). To perform this joint inference, we alternately update the states with the causes fixed and then update the causes with the states fixed, until convergence.


Updating either of them involves solving an l1 convolutional sparse coding problem, and we use a proximal gradient based method called FISTA [Beck and Teboulle 2009; Chalasani et al. 2013] (and some variations of it [Chalasani and Principe 2013]), where each update step involves computing a gradient, followed by a soft thresholding function to obtain a sparse solution.

5.2.1 Procedure

Algorithm 2 shows the steps involved in this iterative inference procedure; below we elucidate each of the steps in detail.

Algorithm 2 Inference in Convolutional Dynamic Network
Require: Inputs I_{1:T}, N (number of FISTA iterations), L (number of layers), parameters C^{1:L}, B^{1:L}
Require: Hyper-parameters \lambda^{1:L}, \beta^{1:L}, \eta^{1:L}, \gamma_0^{1:L}
Require: Initialize states X^{1:L}_0 = 0
 1: for t = 1 : T do                              // loop over time
    // Top-down predictions
 2:   for l = L : -1 : 1 do                       // loop over layers
 3:     Compute \hat{X}^l_t using ( 5 )
 4:     Predict \hat{U}^l_t using ( 5 )
 5:   end for
    // Bottom-up inference
 6:   Initialize: X^l_t = X^l_{t-1}, U^l_t = \hat{U}^l_t
 7:   for l = 1 : L do                            // loop over layers
 8:     for n = 1 : N do                          // FISTA iterations
 9:       Compute the state prediction term
10:       Update states X^l_t using ( 5 ) and ( 5 )
11:       Max pooling: [down(X^{k,l}_t), p^{k,l}_t] = pool(X^{k,l}_t)
12:       Update causes U^l_t using ( 5 )
13:       Unpool and re-compute \gamma^l using ( 5 )
14:     end for
15:   end for
16: end for


Updating States. Firstly, with the causes fixed, updating the states involves finding the gradient of all the terms other than the sparsity penalty in E_l with respect to X^l_t. For convenience, we re-write these terms as

h(X^l_t) = \sum_{m=1}^{D_{l-1}} \|U^{m,l-1}_t - \sum_{k=1}^{K_l} C^{m,k}_l * X^{k,l}_t\|_2^2 + \lambda \|X^l_t - X^l_{t-1}\|_1    (5)

Since h(X^l_t) is non-smooth (the second term, involving the state transitions, has an l1 penalty), it is not possible to find its exact gradient. In order to approximately compute it, we first use the idea of Nesterov's smoothing [Nesterov 2005] to approximate the non-smooth state transition term in h(X^l_t) with a smooth function. To begin, consider \psi(X^l_t) = \|e_t\|_1, where e_t = vec(X^l_t) - vec(X^l_{t-1}). The idea is to approximate \psi(X^l_t) with a smooth function and compute its gradient with respect to e_t. Since e_t is a linear function of X^l_t, computing the gradient of \psi(X^l_t) with respect to X^l_t then becomes straightforward. Using the dual of the l1 norm, we can re-write \psi(X^l_t) as

\psi(X^l_t) = \max_{\|\alpha\|_\infty \le 1} \; \alpha^T e_t    (5)

where \alpha \in R^{card(e_t)}. Using Nesterov's smoothness property, we can approximate \psi(X^l_t) with a smooth function of the form

\psi(X^l_t) \approx f_\mu(e_t) = \max_{\|\alpha\|_\infty \le 1} \; \big[\alpha^T e_t - \mu \, d(\alpha)\big]    (5)

where d(\alpha) = \|\alpha\|_2^2 / 2 is a smoothness function and \mu is the smoothness parameter. Following Theorem 1 in Chen et al. [2012a], we can show that f_\mu(e_t) is convex and smooth and, moreover, that the gradient of f_\mu(e_t) with respect to e_t is given by

\nabla_{e_t} f_\mu(e_t) = \alpha^*    (5)


where \alpha^* is the optimal solution of ( 5 ). A closed-form solution for \alpha^* can be obtained as (for a proof refer to Chalasani and Principe [2013])

\alpha^* = S\big(e_t / \mu\big)    (5)

where S(.) is a projection operator applied over every element of its argument and is defined as follows:

S(x) = \begin{cases} x, & -1 \le x \le 1 \\ 1, & x > 1 \\ -1, & x < -1 \end{cases}

As discussed above, using the chain rule, f_\mu(e_t) is also convex and smooth in X^l_t, and its gradient \nabla_{X^l_t} f_\mu(e_t) is the same as in ( 5 ). Given this smooth approximation of the non-smooth state transition term and its gradient, we now apply the iterative shrinkage-thresholding algorithm [Beck and Teboulle 2009] to the convolutional state-space model with a sparsity constraint. The gradient of the re-formulated h(X^l_t) with respect to X^l_t is given as follows:

\nabla_{X^{\hat{k},l}_t} h(X^l_t) = -\sum_{m=1}^{D_{l-1}} \tilde{C}^{\hat{k},m} * \Big(U^{m,l-1}_t - \sum_{k=1}^{K_l} C^{k,m} * X^{k,l}_t\Big) + \lambda M^{\hat{k}}    (5)

where \tilde{C}^{\hat{k},m} indicates that the matrix C^{\hat{k},m} is flipped vertically and horizontally, and M^{\hat{k}} is the \hat{k}-th map of the matrix obtained by reshaping \alpha^*.
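The clipping operator S makes the smoothed gradient of the temporal term easy to compute; a minimal numpy sketch (the mu and lambda values here are arbitrary):

import numpy as np

def smoothed_temporal_grad(X, X_prev, mu, lam):
    # Gradient of the Nesterov-smoothed surrogate of lam * ||X - X_prev||_1:
    # alpha* = clip((X - X_prev) / mu, -1, 1), so the gradient is lam * alpha*.
    return lam * np.clip((X - X_prev) / mu, -1.0, 1.0)

# toy usage: this supplies the lam * M term in the state gradient above
rng = np.random.default_rng(0)
X, X_prev = rng.standard_normal((4, 38, 38)), rng.standard_normal((4, 38, 38))
grad_temporal = smoothed_temporal_grad(X, X_prev, mu=0.05, lam=0.1)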


Once we obtain the gradient, the states can be updated as

X^l_t \leftarrow X^l_t - \tau_l \nabla_{X^l_t} h(X^l_t)    (5)

where \tau_l is a step size for the gradient descent update. (FISTA [Beck and Teboulle 2009] adds a momentum term to this gradient update, leading to much faster convergence; we use this in our implementation. Please refer to Chalasani et al. [2013] for convolutional sparse coding with FISTA.) Following this, we pass the updated states through a soft thresholding function that clamps the smaller values, leading to a sparse solution:

X^l_t \leftarrow \text{sign}(X^l_t) \, \max\big(|X^l_t| - \tau_l \gamma^l, \; 0\big)    (5)

Max Pooling. We then perform spatial max pooling [Jarrett et al. 2009a] over small neighborhoods across each and every 2D state map:

[\text{down}(X^{k,l}_t), \; p^{k,l}_t] = \text{pool}(X^{k,l}_t)

where p^{k,l}_t denotes the pooling indexes. We do not pool across the state maps, so the number of state maps remains the same, while the resolution of each map decreases (denoted as down(X^{k,l}_t)). We use non-overlapping spatial windows for the pooling operation.

Update Causes. Similar to the state updates described above, we fix the states and compute the gradient of only the smooth part of the energy function E_l (denoted h(U^l_t)) with respect to U^l_t. Given the pooled states, the gradient can be computed as follows:

\nabla_{U^{\hat{d},l}_t} h(U^l_t) = -\frac{\gamma_0}{2} \sum_{k=1}^{K_l} \tilde{B}^{k,\hat{d}} * \Big[\exp\Big(-\sum_{d=1}^{D_l} B^{k,d} * U^{d,l}_t\Big) \odot \text{down}(X^{k,l}_t)\Big] + 2\eta\,(U^{\hat{d},l}_t - \hat{U}^{\hat{d},l}_t)    (5)


Using this gradient information, we update the causes as we did the states, by first taking a gradient step, followed by a soft thresholding function:

U^l_t \leftarrow U^l_t - \tau_l \nabla_{U^l_t} h(U^l_t)
U^l_t \leftarrow \text{sign}(U^l_t) \, \max\big(|U^l_t| - \tau_l \beta_l, \; 0\big)    (5)

Unpooling. After updating the causes, we re-evaluate the sparsity parameter for the next iteration. We do this as follows:

\gamma^{k,l} = \frac{\gamma_0}{2}\Big[1 + \exp\Big(-\text{unpool}_{p^{k,l}_t}\Big(\sum_{d=1}^{D_l} B^{k,d}_l * U^{d,l}_t\Big)\Big)\Big]    (5)

where unpool_{p^{k,l}_t}(.) indicates reversing the pooling operation using the indexes p^{k,l}_t obtained during the max pooling operation described above [LeCun et al. 1998]. Notice that, while the inputs to the pooling operation are the inferred states, the inputs to the unpooling operation are the likely states generated by the causes.

Overall Iteration. A single iteration consists of the above-mentioned four steps: update the states using a single FISTA step, perform max pooling over the states, update the causes using a single FISTA step and, finally, re-evaluate the sparsity parameter for the next iteration. All the computations during inference involve only basic operations such as convolution, summation, pooling and unpooling. All of these can be efficiently implemented on a GPU with parallelization, making the overall process very fast.
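The four steps above involve only element-wise shrinkage plus pooling and unpooling with remembered switch locations; below is a self-contained numpy sketch of these pieces, an illustration rather than the GPU implementation referred to above:

import numpy as np

def soft_threshold(Z, thresh):
    # Element-wise shrinkage: sign(Z) * max(|Z| - thresh, 0).
    return np.sign(Z) * np.maximum(np.abs(Z) - thresh, 0.0)

def max_pool(Xk, r):
    # Non-overlapping r x r max pooling of one 2-D map; also returns the
    # within-block argmax indexes needed to reverse the operation.
    H, W = Xk.shape
    H2, W2 = H // r, W // r
    blocks = Xk[:H2 * r, :W2 * r].reshape(H2, r, W2, r).transpose(0, 2, 1, 3).reshape(H2, W2, r * r)
    idx = blocks.argmax(axis=-1)
    pooled = np.take_along_axis(blocks, idx[..., None], axis=-1)[..., 0]
    return pooled, idx

def unpool(Pk, idx, shape, r):
    # Place each value back at its recorded argmax location; other entries stay zero.
    H2, W2 = Pk.shape
    blocks = np.zeros((H2, W2, r * r))
    np.put_along_axis(blocks, idx[..., None], Pk[..., None], axis=-1)
    out = np.zeros(shape)
    out[:H2 * r, :W2 * r] = blocks.reshape(H2, W2, r, r).transpose(0, 2, 1, 3).reshape(H2 * r, W2 * r)
    return out

# toy usage for one state map: gradient step, shrinkage, pooling, unpooling
rng = np.random.default_rng(0)
X = rng.standard_normal((38, 38))
grad, gamma_map, tau = rng.standard_normal((38, 38)), 0.2 * np.ones((38, 38)), 0.1
X = soft_threshold(X - tau * grad, tau * gamma_map)
pooled, idx = max_pool(X, r=2)
restored = unpool(pooled, idx, X.shape, r=2)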


5.2.2 Approximate Inference with Top-Down Connections

Figure 5-2. Block diagram of the inference procedure, with arrows indicating the flow of information during inference.

In the inference procedure described above, while updating the causes, we assumed that the top-down predictions \hat{U}^l_t are already available and remain constant throughout the inference procedure. Ideally, however, this should not be the case. Since the layers are arranged in a Markov chain, all the layers would have to be updated concurrently, passing top-down and bottom-up information, until the system reaches an equilibrium. In practice, this can be very slow to converge. In order to avoid this, we perform an approximate inference, where we make a single approximate top-down prediction at each time step using the states from the previous time instance, and then perform a single bottom-up inference with fixed top-down predictions, starting from the bottom layer.

More formally, at every time step, using the state-space model at each layer we predict the most likely causes at the layer below (\hat{U}^{l-1}_t), given only the previous states and the predicted causes from the layer above. Mathematically, the top-down prediction at layer l can be written as

\hat{U}^{m,l-1}_t = \sum_{k=1}^{K_l} C^{m,k}_l * \hat{X}^{k,l}_t, \quad \forall m \in \{1, 2, ..., D_{l-1}\}

\text{where} \quad \hat{X}^l_t = \arg\min_{X^l_t} \; \lambda_l \|X^l_t - X^l_{t-1}\|_1 + \sum_{k} \hat{\gamma}_k \cdot |X^{k,l}_t|    (5)

\hat{\gamma}_k = \frac{\gamma_0}{2}\Big[1 + \exp\Big(-\text{unpool}_{p^{k,l}_{t-1}}\Big(\sum_{d=1}^{D_l} B^{k,d}_l * \hat{U}^{d,l}_t\Big)\Big)\Big]

and \hat{U}^l_t is itself the top-down prediction coming from layer l+1. At the top layer, we consider the output from the previous time step as the predicted causes, i.e., \hat{U}^L_t = U^L_{t-1}, allowing temporal smoothness over the outputs of the model.


A simple analytic solution can be obtained for \hat{X}^l_t in ( 5 ) and is given by

\hat{X}^{k,l}_t(i,j) = \begin{cases} X^{k,l}_{t-1}(i,j), & \hat{\gamma}_k(i,j) < \lambda_l \\ 0, & \text{otherwise} \end{cases}    (5)

5.3 Learning

During learning, we do not consider any top-down connections, setting \eta_l = 0 for all l in ( 5 ) while inferring the representations. At layer l, after inferring X^l_t and U^l_t and fixing them, we update the filters C_l and B_l using gradient descent (with momentum), minimizing the cost function E_l(.). The gradient of E_l(.) with respect to C_l can be computed as

\nabla_{C^{\hat{m},\hat{k}}_l} E_l = -2\, \tilde{X}^{\hat{k},l}_t * \Big(I^{\hat{m}}_t - \sum_{k=1}^{K_l} C^{k,\hat{m}} * X^{k,l}_t\Big)    (5)

and the gradient of E_l(.) with respect to B_l can be computed as

\nabla_{B^{\hat{k},\hat{d}}} E_l = -\tilde{U}^{\hat{d},l}_t * \Big[\exp\Big(-\sum_{d=1}^{D_l} B^{\hat{k},d}_l * U^{d,l}_t\Big) \odot \text{down}(X^{\hat{k},l}_t)\Big]    (5)

After updating the filters, we normalize each filter to unit norm to avoid a trivial solution.

5.4 Experiments

In this section, we evaluate the performance of the proposed model on various tasks: (i) its ability to learn hierarchical representations and object parts from unlabeled video sequences, (ii) object recognition with contextual information, (iii) sequential labeling of video frames for recognition and (iv) its robustness in noisy environments.

Preprocessing: In all the experiments we perform the same pre-processing on the inputs. Each frame in a video sequence (or each image) is converted into gray-scale. We then normalize each frame to be zero mean and unit norm, followed by local contrast normalization as described by Jarrett et al. [2009a].

Classification: The feature vectors used for the classification tasks described below are the vectorized form of the causes extracted from the video frames. Also, in some cases we use different kinds of pooling on the causes, depending on the dataset, before feeding them to the classifier; this will be specified later. Given these feature vectors, we use a linear L2-SVM (from the LibLinear package [Fan et al. 2008]) for all the classification tasks described below.


5.4.1 Learning from Natural Video Sequences

Figure 5-3. Receptive fields of the two-layered network learned on natural video sequences. (Top) Receptive fields of layer-1 states. (Bottom) Receptive fields of layer-2 causes, constructed as a linear combination of the bottom layer filters.

Firstly, to visualize what internal representations the model can learn, we construct a two-layered network using the Hans van Hateren natural scene videos [van Hateren and Ruderman 1998]. Here each frame is 128x128 pixels in size and is pre-processed as described above. The first layer consists of 16 states of 7x7 filters and 32 causes of 6x6 filters, while the second layer is made up of 64 states of 7x7 filters and 128 causes of 6x6 filters. The pooling size between the states and the causes for both layers is 2x2. Figure 5-3 shows the receptive fields of the first layer states and the second layer causes.


Table 5-1. Classification performance on the Caltech-101 dataset with only a single bottom-up inference

Method                                              Accuracy
Our method - Layer 1                                62.1 +/- 1.1%
Our method - Layer 1+2                              66.8 +/- 0.5%
Zeiler et al. [2010] - Layer 1+2 (DN)               66.9 +/- 1.1%
Lee et al. [2009] - Layer 1+2 (CDBN)                65.4 +/- 0.5%
Kavukcuoglu et al. [2010a] (ConvPSD)                65.7 +/- 0.7%
Chen et al. [2011] - Layer 1+2 (ConvFA)             65.7 +/- 0.7%
Jarrett et al. [2009a] (PSD)                        65.6 +/- 1.0%
Boureau et al. [2010] (Macrofeatures)               70.9 +/- 1.0%
Lazebnik et al. [2006] (SPM)                        64.6 +/- 0.7%

We observe that the receptive fields of the first layer states (Figure 5-3, top) resemble simple oriented filters, similar to those obtained from sparse coding methods [Olshausen and Field 1996]. The receptive fields of the second layer causes, shown in Figure 5-3 (bottom), contain more complex structures like edge junctions and curves. These are constructed as weighted combinations of the lower layer filters.

5.4.2 Object Recognition - Caltech-101 Dataset

One advantage of distributive models like ours is their ability to transfer a model learned on unlabeled data to extract features for generic object recognition, so-called self-taught learning [Raina et al. 2007]. We use this to assess the quality of the learning procedure and perform object recognition on static images from the Caltech-101 dataset [Fei-Fei et al. 2007]. Each image in the dataset is re-sized to 152x152 pixels (zero padded to preserve the aspect ratio) and pre-processed as described above. We use the same two-layered model learned from natural video sequences as above and extract features for each image using a single bottom-up inference (i.e., without any temporal or top-down information, by setting \lambda = 0 and \eta = 0 for both layers in ( 5 )). The output causes from layer 1 and layer 2 are made into a three-level spatial pyramid for each layer output [Lazebnik et al. 2006]; both are then concatenated to form a feature vector for each image and fed as input to a linear classifier.
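A sketch of turning a set of cause maps into a three-level spatial-pyramid feature vector before the linear classifier; the per-cell max pooling used here is an illustrative choice, since the text does not specify the pooling inside each pyramid cell:

import numpy as np

def spatial_pyramid(cause_maps, levels=(1, 2, 4)):
    # Concatenate per-cell pooled responses of each cause map over 1x1, 2x2
    # and 4x4 grids into a single feature vector (max pooling per cell).
    D, H, W = cause_maps.shape
    feats = []
    for g in levels:
        for i in range(g):
            for j in range(g):
                cell = cause_maps[:, i * H // g:(i + 1) * H // g, j * W // g:(j + 1) * W // g]
                feats.append(cell.max(axis=(1, 2)))
    return np.concatenate(feats)

# toy usage: 100 cause maps from one image -> a 100 * (1 + 4 + 16) dimensional vector
rng = np.random.default_rng(0)
feature_vector = spatial_pyramid(rng.random((100, 40, 40)))
assert feature_vector.shape == (100 * 21,)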


Table 5-1 shows the results obtained when 30 images per class are used for training and testing, following the standard protocol, averaged over 10 runs. The parameters of the model are set through cross validation. We observe that using the layer 1 causes alone leads to an accuracy of 62.1%, while using the causes from both layers improves the performance to 66.9%. These results are comparable to other similar methods that use a convolutional architecture [Lee et al. 2009; Kavukcuoglu et al. 2010a; Zeiler et al. 2010; Chen et al. 2011] and slightly better than using hand-designed features like SIFT [Lazebnik et al. 2006].

5.4.3 Recognition with Context

As discussed by Schwartz et al. [2007], visual perception is not static and uses contextual information from both space and time. We argue that our model can effectively utilize this contextual information and produce a robust representation of the objects in video sequences. While the temporal relationships are encoded through the state-space model at each layer, the spatial context modulates the representation through two different mechanisms: (i) spatial convolution along with sparsity ensures that there is competition between elements, leading to interaction across space, and (ii) the top-down modulations coming from higher-layer representations first accumulate information from the lower layers and then try to predict the response over larger receptive fields.

In order to test this hypothesis, we show the performance of the model on two different tasks. Firstly, we show that using contextual information during inference can lead to a consistent representation of the objects, even in cases where there are large transformations of the object over time. We use the COIL-100 dataset [Nene et al. 1996] for this task. Secondly, we use the model for a sequence labeling task, where we assign a class to each frame in a sequence before classifying the entire sequence as a whole. The goal is to show the extent of invariance the model can encode, particularly in cases of corrupted inputs. We test the performance on the Honda/UCSD face video dataset [Lee et al. 2005], on both clean and corrupted sequences.


Figure 5-4. Examples from the COIL-100 dataset.

5.4.3.1 Role of contextual information during inference

For this experiment we consider the COIL-100 dataset, which contains 100 different objects (or classes); Figure 5-4 shows some examples. For each object there is a sequence obtained by placing the object on a turntable and taking a picture for every 5 degree turn, resulting in a 72-frame-long video per object. Each frame is re-sized to 128x128 pixels and pre-processed as described above. We use the same two-layered network described in Section 5.4.1 and perform inference with top-down connections over each of the sequences. We combine the causes from both layers for each frame and use them to train a linear SVM for classification. As per the protocol described in Nene et al. [1996], we consider 4 frames per object, at viewing angles 0, 90, 180 and 270 degrees, as labeled data used for training the classifier, and the rest are used for testing. Notice that we assume that we have access to the test samples during training. This resembles the transductive learning setting described by Mobahi et al. [2009].


Table 5-2. Classification performance on the COIL-100 dataset with various configurations.

Method                                                      Accuracy
View-tuned network (VTU) [Wersing and Korner 2003]          79.10%
Stacked ISA + temporal [Zou et al. 2012]                    87.00%
ConvNets + temporal [Mobahi et al. 2009]                    92.25%
CDN without context                                         79.45%
CDN + temporal (no top-down)                                94.41%
CDN + temporal + top-down                                   98.34%

Here we compare the proposed method with other deep learning models: a two-stage hierarchical model built using more biologically plausible feature detectors, called the view-tuned network (VTU) [Wersing and Korner 2003], stacked independent subspace analysis learned with temporal regularization (Stacked ISA + Temporal) [Zou et al. 2012] and convolutional networks trained with temporal regularization (ConvNet + Temporal) [Mobahi et al. 2009]. While the first two methods do not utilize contextual information while training the classifier (Zou et al. [2012] use temporal regularization only while learning the model, with an unrelated video sequence), Mobahi et al. [2009] use a setting similar to ours, where the entire object sequence is considered during training. Also, we consider three different settings during inference in our model: (i) each frame is processed independently, without any contextual information, i.e., no temporal or top-down connections (CDN without context), (ii) with only temporal connections (CDN + temporal (no top-down)), and (iii) with both the temporal and top-down connections (CDN + temporal + top-down).

As shown in Table 5-2, our method performs much better than the other methods when contextual information is used. While using temporal connections alone proved sufficient to obtain good performance, adding top-down connections improved the performance further.


Figure 5-5. Part of the face sequences (from left to right) belonging to three different subjects, extracted from the Honda/UCSD dataset (after histogram equalization).

On the other hand, not using any contextual information leads to a significant drop in performance. We would also like to emphasize that the model is learned on video sequences completely unrelated to the task, indicating that using the contextual information during inference is more important than using it just for training the classifier, as in Mobahi et al. [2009]. The reason for this could be that contextual information pushes the representations from each sequence into a well-defined attractor, separating it from the other classes (more about this in Section 5.4.4).

5.4.3.2 Sequential labeling

While the above experiment shows the role of context during inference, it does not tell much about the discriminability of the model itself. For this, in the following experiment we test the performance of the proposed model on a sequence labeling task, where the goal is to classify a probe sequence given a set of labeled training sequences. Here we conduct this experiment for face recognition on the Honda/UCSD dataset [Lee et al. 2005] and the YouTube celebrities dataset [Kim et al. 2008]. The Honda dataset contains 59 videos of 20 different subjects, while the YouTube dataset contains 1910 videos of 47 subjects. We note that, while the Honda dataset is obtained from a controlled environment, the YouTube dataset is obtained from a more natural setting, with very noisy and low-resolution videos, making the task very challenging. Figure 5-6A shows some example videos from the YouTube dataset.


Figure 5-6. Example sequences from the YouTube celebrities dataset. (A) Original videos. (B) Extracted face sequences.

First, from every video, the faces in each frame are detected using Viola-Jones face detection [Viola and Jones 2001] and then re-sized to 20x20 pixels for the Honda dataset and 30x30 pixels for the YouTube dataset. Figure 5-5 and Figure 5-6B show some example face sequences obtained from the Honda and the YouTube datasets, respectively. Each set of faces detected from a video is then considered as an observation sequence. Next, in addition to the pre-processing described above, we also perform histogram equalization on each frame to remove illumination variations. Finally, following the standard protocol [Lee et al. 2005], for the Honda dataset we consider 20 face sequences for training and the remaining 39 sequences for testing. We report results using a varying number of frames per sequence (N), as defined in [Hu et al. 2011]: 50, 100 and full length. When the length of a sequence is less than N, all the frames in the sequence are used. For the YouTube dataset, we first randomly partition the dataset into 10 subsets of 9 videos each, then divide each subset into 3 videos for training and the remaining 6 for testing. We report the average performance over all 10 subsets.


Table 5-3. Recognition rates for face recognition on the Honda/UCSD dataset

Method                              50 frames   100 frames   Full length   Average
MDA [Wang and Chen 2009]            74.36       94.87        97.44         88.89
AHISD [Cevikalp and Triggs 2010]    87.18       84.74        89.74         87.18
CHSID [Cevikalp and Triggs 2010]    82.05       84.62        92.31         86.33
SANP [Hu et al. 2011]               84.62       92.31        100           92.31
DFRV [Chen et al. 2012b]            89.74       97.44        97.44         94.87
CDN w/o context                     89.74       97.44        97.44         94.87
CDN with context                    92.31       100          100           97.43

For the Honda dataset, we use the 20 training sequences to learn a two-layered network, with the first layer made up of 16 states and 48 causes and the second layer made up of 64 states and 100 causes. All the filters are 5x5 in size and the pooling size in both layers is 2x2. We use a similar architecture for the YouTube dataset as well, but with filter size 7x7, and the model parameters are learned by randomly sampling from all the sequences in the dataset. We emphasize that the learning is completely unsupervised. During classification, for the Honda dataset, the inferred causes from both layers for each frame are concatenated and used as feature vectors. For the YouTube dataset, we make a 3-level spatial pyramid of the causes from both layers [Lazebnik et al. 2006] and use it as a feature vector. Any probe sequence is assigned a class based on the maximally polled predicted label across all its frames. All the parameters are set after performing a parameter sweep to find the best performance (on the YouTube dataset, the parameter sweep is done on a single subset and the same parameters are used for the rest of the subsets).
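The sequence-level decision described above amounts to a vote over the per-frame classifier outputs; a minimal sketch (the SVM itself is omitted, and frame_labels stands in for its per-frame predictions):

import numpy as np

def label_sequence(frame_labels):
    # Assign the class that is predicted most often across the frames of a probe sequence.
    labels, counts = np.unique(np.asarray(frame_labels), return_counts=True)
    return labels[np.argmax(counts)]

# toy usage: per-frame classifier outputs for one probe face sequence
print(label_sequence([3, 3, 7, 3, 3, 1, 3]))   # prints 3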


Table 5-4. Classification performance on the YouTube celebrities dataset.

Method                                  Accuracy
MDA [Wang and Chen 2009]                65.3%
SANP [Hu et al. 2011]                   68.4%
COV+PLS [Wang et al. 2012]              70.0%
COV+KL [Vemulapalli et al. 2013]        73.2%
Proj.+KL [Vemulapalli et al. 2013]      70.8%
CDN w/o context                         69.5%
CDN with context                        71.4%

Table 5-3 summarizes the results obtained on the Honda/UCSD dataset. We compare our method with manifold discriminant analysis (MDA) [Wang and Chen 2009], set-based face recognition methods (AHISD and CHSID) [Cevikalp and Triggs 2010], sparse approximated nearest points (SANP) [Hu et al. 2011] and dictionary-based face recognition from video (DFRV) [Chen et al. 2012b]. Our method (CDN with context) clearly outperforms all these methods, across all the sequence lengths considered. We also note that the performance of the proposed model drops when temporal and top-down connections are not considered (CDN w/o context).

On the YouTube dataset, we compare our method, in addition to SANP [Hu et al. 2011] and MDA [Wang and Chen 2009], with other methods that use covariance features (COV+PLS) [Wang et al. 2012] and kernel learning (COV+KL and Proj.+KL) [Vemulapalli et al. 2013]. As shown in Table 5-4, the proposed model is competitive with the other state-of-the-art methods. We note that most of the methods mentioned above (particularly COV+PLS, Proj.+PLS and COV+KL) consider all the frames in a sequence to extract features before performing classification. We, on the other hand, perform sequential labeling, utilizing knowledge only from the past frames to extract the features. Also, without either the temporal or the top-down connections, the performance of the proposed method again drops, to around 69.5% (CDN w/o context).


Figure 5-7. Classification performance on the noisy Honda/UCSD dataset with 50 frames per sequence. The plots show the recognition rates, (A) per sequence and (B) per frame, of different methods with clean (green/dark) and noisy (yellow/light) sequences. For noisy sequences, the performance shown is averaged over 5 runs.

Finally, to evaluate the performance of the model with noisy observations, we corrupt the Honda/UCSD sequences with structured noise in the above experiment (while maintaining the same parameters learned from the clean sequences). We make the noisy sequences as follows: one half of each frame of every sequence is corrupted by adding one half of a randomly chosen frame of a random subject. We repeat this a number of times per frame (the number is drawn from a Poisson distribution with mean 2). Figure 5-7 summarizes the classification results, per sequence (Figure 5-7A) and per frame (Figure 5-7B), obtained with a sequence length of 50 frames. While the performance of the proposed model drops in both cases, i.e., with and without temporal and top-down connections (denoted as CDN with context and CDN w/o context, respectively), the drop is steeper when contextual information is not used. The difference is more prominent in the classification accuracy per frame. For comparison, we also show the performance of SANP, whose performance drops significantly with noise.


Figure 5-8. Performance on the noisy Honda/UCSD dataset for various values of the temporal and top-down connection parameters. (A) and (C) show recognition rates versus the temporal connection parameter, where each colored curve corresponds to a particular value of the top-down parameter, while (B) and (D) show recognition rates versus the top-down connection parameter, where each colored curve corresponds to a particular value of the temporal parameter. (A) and (B) show recognition rates per sequence, and (C) and (D) show recognition rates per frame. Note that a higher recognition rate per frame does not always translate into a higher recognition rate per sequence.

5.4.3.3 Analysis of temporal and top-down connections

To understand the extent of influence the temporal and top-down connections have on the representations, we vary the corresponding hyper-parameters in ( 5 ), which determine their respective contributions during inference. We use the same experimental setup as above with the noisy Honda/UCSD sequences and record the classification performance (per sequence and per frame) for different values of these two parameters.


To make the visualization easier, we use the same set of hyper-parameters for both layers, with the sparsity parameters fixed at \gamma_0 = 0.3 and \beta = 0.05, which were obtained after performing a parameter sweep for best performance. Figure 5-8 shows the recognition rates on the noisy Honda/UCSD dataset as a function of both the temporal connection parameter and the top-down connection parameter. We observe that the performance depends on both parameters, and both should be set reasonably (neither too high nor too low) to obtain good results. While these plots show the effective contribution of the temporal and top-down connections, they also show something more interesting: although the performance improves with either temporal or top-down connections, the best performance is obtained only when both are available. This indicates that both temporal and top-down connections play an equally important role during inference. This is in accordance with other predictive coding models used for detecting bird songs [Friston and Kiebel 2009].

5.4.4 Learning Hierarchy of Attractors

In this section, we further analyze the model from a slightly different perspective. The aim here is to visualize and understand the representations learned in the hierarchical model and to get some insight into the working of the top-down and temporal connections. The key assumption in our model is that any visual input sequence unfolds with well-defined spatio-temporal dynamics [George and Hawkins 2005; Friston and Kiebel 2009] and that these dynamics can be modeled as trajectories in some underlying attractor manifold. In the hierarchical setting, we further assume that the shape of the manifold that describes the inputs is itself modulated by the dynamics in an even higher-level attractor manifold. From a generative model perspective, this is equivalent to saying that a sequence of causes in a higher layer non-linearly modulates the dynamics of the lower-layer representations, which in turn represent an input sequence.


In other words, as succinctly described by Friston and Kiebel [2009], such a hierarchical dynamic model represents the inputs as sequences of sequences. In the following experiments, we show that the model can learn a hierarchy of attractors, such that the complexity of the representation increases with the depth of the model. We also show that the temporal and top-down connections (or empirical priors) lead the representations into stable attractors, making them robust to noise.

5.4.4.1 Learning parts of objects from unlabeled sequences

We show that the model can learn a hierarchical composition of objects from the data itself, in a completely unsupervised manner. For this, we consider the VidTIMIT dataset [Sanderson 2008], where face videos of 16 different people with different facial expressions are used as inputs. We learn a two-layered network with 16 first-layer states, 36 first-layer causes, 36 second-layer states and 16 second-layer causes. We further use 3x3 non-overlapping pooling regions in the first layer and 2x2 non-overlapping pooling regions in the second layer. We construct the receptive fields of the layer 1 and layer 2 causes as linear combinations of the basis in the layers below; these are shown in Figure 5-9.

We observe that the model is able to learn a hierarchical structure of the faces. While the first-layer states represent primitive features like edges, the first-layer causes learn parts of the faces. The second-layer causes, where the model combines the responses of the first-layer causes, are able to represent an entire face. More importantly, we observe that each cause unit in the second layer is specific to a particular face (or object), increasing the discriminability between faces.

5.4.4.2 Denoising videos using top-down information

We next show that the top-down information can be used to denoise a highly corrupted video by using information from the context. To show this, we use the same model as above on the face video sequences.


Figure 5-9. Hierarchical decomposition of object parts learned by the model from face videos of 16 different subjects in the VidTIMIT dataset. (A) The receptive fields of layer 1 causes. (B) The receptive fields of layer 2 causes. Both are constructed as weighted linear combinations of the filters in the layers below. Layer 1 states have receptive fields similar to those shown in Figure 5-3 (top).


Figure 5-10. Video denoising with temporal and top-down connections. (A) and (B) show examples with two different video sequences. For each example: (top) the corrupted video sequence, where in every frame one fourth of the frame is occluded with an unrelated image; (middle) the linear projection of the layer 2 states onto the image space when inference is performed with temporal and top-down connections; (bottom) the linear projection of the layer 2 states when inference is performed without temporal or top-down connections.


Figure 5-11. The PCA projections of the layer two causes in the denoising experiment (A) without and (B) with temporal and top-down connections.


We corrupt a face video sequence (different from the one used to learn the model) with structured noise, where one fourth of each frame is occluded with a completely unrelated image. There is no correlation between the occlusions in two consecutive frames. Figure 5-10 and Figure 5-11 show the results obtained (these results are also available as videos at http://cnel.ufl.edu/~rakesh/face_video_2013.html). In Figure 5-10 we project the response of the layer two states onto the input space to understand the underlying representation of the model. Since the layer two states receive information from the bottom layer as well as top-down information from the second-layer causes, they should be able to resolve the occluded portion of the video sequence using the contextual information over time and space. We observe that with the top-down information the representation over time gets stabilized and the model is able to resolve the occluded part of the input video sequence. On the other hand, without the contextual information the representations do not converge to a stable solution. Figure 5-11 shows the 2D PCA projections of the layer 2 causes. Again, we observe that the representations obtained with temporal and top-down connections for each subject are stable and mapped into well-defined attractors, separated from one another (Figure 5-11B). On the other hand, without these connections the representations are not stable and cannot be well separated (Figure 5-11A).

5.5 Discussion

5.5.1 Relationship with Feed-Forward Networks

Many deep learning methods, like deep belief networks [Hinton et al. 2006], stacked auto-encoders [Vincent et al. 2010], convolutional neural networks [LeCun et al. 1989], etc., encode the inputs as a hierarchical representation. It has been observed that increasingly invariant representations can be obtained with the depth of these hierarchical models [Goodfellow et al. 2009]. However, in contrast to our model, these methods neither perform explaining away nor consider temporal and top-down connections, and they focus only on feed-forward rapid recognition without context.


In fact, the proposed model can also be written as a feed-forward network by performing approximate inference. Starting from initial rest (i.e., all the variables initialized to zero) and considering only a single FISTA iteration, the states and the causes can be (approximately) inferred as [Denil and de Freitas 2012]

X^{l,k}_t = \frac{1}{L} \, T_{\gamma_0}\Big(\sum_{m=1}^{D_{l-1}} \tilde{C}^{k,m} * U^{m,l-1}_t\Big), \qquad U^{l,d}_t = \frac{1}{L} \, T_{\beta}\Big(\sum_{k=1}^{K_l} \tilde{B}^{k,d} * X^{k,l}_t\Big)    (5)

where T(.) is a soft thresholding function and L determines the step size. But such representations have only a limited capacity, as there is no competition between the elements to explain the inputs. On the Caltech-101 experiment described in Section 5.4.2, such approximate inference produced only a modest recognition rate of 46% (chance is below 1%); it should be noted that we do not perform any local contrast normalization between layers, which is reported to produce better performance in feed-forward networks [Boureau et al. 2010].
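A sketch of this one-pass, feed-forward reading of a single layer (step size, thresholds and shapes are arbitrary; flipping a filter and convolving is used here to implement the correlation with the C-tilde and B-tilde filters):

import numpy as np
from scipy.signal import convolve2d

def soft_threshold(Z, t):
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def feedforward_layer(inputs, C, B, step, gamma0, beta):
    # Single-pass reading of one layer: a correlation with the flipped filters
    # followed by a fixed soft threshold, with no iterative explaining away.
    K, M = C.shape[:2]
    D = B.shape[1]
    X = np.stack([soft_threshold(step * sum(convolve2d(inputs[m], C[k, m][::-1, ::-1], mode="full")
                                            for m in range(M)), gamma0) for k in range(K)])
    U = np.stack([soft_threshold(step * sum(convolve2d(X[k], B[k, d][::-1, ::-1], mode="valid")
                                            for k in range(K)), beta) for d in range(D)])
    return X, U

# toy usage: one gray-scale frame, K = 4 state maps, D = 2 cause maps
rng = np.random.default_rng(0)
I = rng.standard_normal((1, 32, 32))
C = rng.standard_normal((4, 1, 7, 7))
B = rng.random((4, 2, 5, 5))
X, U = feedforward_layer(I, C, B, step=0.1, gamma0=0.1, beta=0.1)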


5.5.2 Comparison with Other Methods

The methods that are closest to our approach are those involving convolutional sparse coding [Zeiler et al. 2010, 2011]. Deconvolutional networks (DN) use a similar hierarchical sparse coding but do not consider temporal or top-down connections. Our method can be considered a generalization of DN to a broader class of dynamic systems. Also, pooling with switch settings [Zeiler et al. 2011] can be incorporated into our model without any significant changes.

Deep Boltzmann machines (DBM) [Salakhutdinov and Hinton 2012] and convolutional deep belief networks (CDBN) [Lee et al. 2009] use undirected graphical models to construct deep networks. Similar to our model, they also incorporate top-down connections, though they do not consider temporal connections. However, they rely on sampling methods to perform inference and require several iterations across the layers before a stable solution is obtained. In contrast, in our method we only perform a single top-down and a single bottom-up pass. Also, learning in these models is slow.

5.6 Summary

In this chapter, we proposed a novel convolutional dynamic model based on the predictive coding framework. The crux of our approach is to build a hierarchical generative model by stacking several state-space models with sparse states and causes. The temporal (or recurrent) connections within each layer, and the interaction between top-down and bottom-up connections across the layers, allow the model to incorporate contextual information while extracting features from a sequence of sensory signals. We have shown that the extracted features are stable and robust to transformations of, and noise on, the objects in the input sequence. The performance of the model on object recognition and sequence labeling datasets shows that using contextual information can lead to significant gains in recognition rates.


CHAPTER 6
CONCLUSION

6.1 Summary

In this work, we proposed a new architecture for object recognition in video sequences based on the predictive coding framework. The proposed hierarchical dynamic model, learned in a completely unsupervised manner, is able to extract rich structure from the data and to self-organize it in a hierarchy. The key property of the proposed model is its ability to leverage contextual information to find a good and robust representation of the sensory signals. The experimental results show that the proposed model outperforms other methods in recognizing objects in video sequences and is competitive in the case of static images. We also show that the model is able to obtain a robust representation of corrupted signals, leading to better recognition and denoising of the signals in challenging environments.

We emphasize that these results are possible only because of four important characteristics of the proposed architecture. Firstly, it models the sensory signals using a dynamic system. This provides a short-term memory to keep track of the time-varying changes in the input and maintains the contextual information during inference. Secondly, the model decomposes the signal in a hierarchical and distributive way, such that low-level features are combined to form high-level abstractions. This re-usability of the low- and intermediate-level feature extractors leads to compact representations and better generalization to unknown input signals. Thirdly, we impose sparsity on the representations at each level, which helps to obtain more discriminative representations. Lastly, it includes bottom-up and top-down components to provide context at the lower levels of processing. This bidirectional processing also allows the use of past information to stabilize the internal representations, and potentially modulates the lower-level feature extractors to represent relevant information from the inputs.


In Chapter 3, we proposed dynamic sparse coding (DSC), where the input sequences are modeled using a linear dynamical system with appropriate sparsity constraints on the states and the innovations. We further extended this to model invariant representations in the input sequences. Our experiments have shown that the analysis of the inputs using such state-space models leads to a more accurate and invariant representation of the inputs. In Chapter 4, we introduced deep predictive coding networks (DPCN), which use the state-space model proposed in Chapter 3 as a basic building block to construct a hierarchical network. In this hierarchical network we consider both bottom-up and top-down influences using the idea of empirical Bayes. We then proposed an efficient approximate inference procedure for this hierarchical network that combines the bottom-up, data-driven information coming into the lower layers with the top-down expectations from the higher layers. We have shown that this bidirectional flow of information leads to robust representations extracted from noisy video sequences. In Chapter 5, we proposed a convolutional architecture for the DPCN. This convolutional dynamic network allows the inference and learning algorithms to scale to large images/frames and alleviates the problem of replicating the inference procedure multiple times at each level of the DPCN. We have shown that this model is able to learn parts of objects and arrange them in a hierarchy from unlabeled video sequences. We have quantitatively shown the role that temporal and top-down connections play in obtaining robust representations for both recognition and de-noising tasks. In summary, we designed a hierarchical dynamic model inspired by predictive coding. This generative model embodies one of the key characteristics of biological vision, namely finding representations of the sensory signals based on contextual information coming from spatio-temporal relationships (short-term memory) and top-down expectations (long-term memory). Such a model is shown to be very useful in several object recognition tasks.


6.2 Avenues for Future Work
While in this work we focused on building an object recognition system inspired by biological vision, we believe it can be further extended to perform other important functions of the human visual system. For example, visual (or spatial) attention plays an important role during perception, allowing one to ignore irrelevant information while attending to important information in the scene [Gilbert and Li 2013; Carrasco 2011]. Several computational methods have been proposed for modeling such visual attention, either using a bottom-up approach [Borji and Itti 2013] or using hand-crafted architectures [Chikkerur et al. 2010]; however, autonomously attending to relevant aspects of the scene during perception still remains elusive. We believe that the top-down expectations, along with the multiplicative interaction between the states and the causes in the proposed model, allow us to model such attention mechanisms. Also, in its current form, the proposed model has several limitations: (i) In spite of the efficient GPU implementations, the major bottleneck of the proposed model is the inference time. However, because of the use of proximal methods, we believe we can build a recurrent network that efficiently approximates this inference procedure [Gregor and LeCun 2010]. (ii) The layer-wise learning approach adopted here does not allow the lower layers to adapt to the more relevant information encoded in the higher layers. Such two-way interaction, even during learning, could produce better representations. (iii) While the state-space models allowed us to capture short-term temporal dependencies, encoding long-term dependencies is still not possible within the proposed framework.


APPENDIX A
A FAST PROXIMAL METHOD FOR CONVOLUTIONAL SPARSE CODING

Generally, sparse coding is based on the idea that an observation, y \in R^p, can be encoded using an over-complete dictionary of filters, C \in R^{p \times k} (k > p), and a sparse vector x \in R^k. More formally, this can be written as

    \hat{x} = \arg\min_x \frac{1}{2}\|y - Cx\|_2^2 + \lambda\|x\|_1    (A)

The \ell_1-norm on x ensures that the latent vector is sparse. Several efficient solvers, like coordinate descent (CoD) [Li and Osher 2009], the fast iterative shrinkage-thresholding algorithm (FISTA) [Beck and Teboulle 2009], the feature-sign algorithm [Lee et al. 2007b], etc., can be readily applied to solve the above optimization problem. The dictionary C can also be learned from the data [Lee et al. 2007b; Mairal et al. 2009]. However, in most applications of sparse coding, many overlapping patches across the image are processed separately. This is often too slow in practice, making it difficult to scale to large images. Moreover, sparse coding alone is not capable of encoding translations in the observations. Learning the dictionary in this context produces several shifted versions of the same filter, such that each patch can be reconstructed individually [Kavukcuoglu et al. 2010a]. During inference, when performed on all the overlapping patches, this can lead to a very redundant representation. To overcome these limitations, convolutional sparse coding was proposed [Kavukcuoglu et al. 2010a; Zeiler et al. 2010]. Here sparse coding is applied over the entire image and the dictionary is a convolutional filter bank with M kernels, such that

    x = \arg\min_x \frac{1}{2}\Big\|I - \sum_{m=1}^{M} C_m \ast x_m\Big\|_2^2 + \lambda\sum_{m=1}^{M}\|x_m\|_1    (A)


where I is an image of size (w x h), C_m is a filter kernel of size (s x s) in the dictionary, x_m is a sparse matrix of size (w + s - 1) x (h + s - 1), \lambda is the sparsity parameter and '\ast' represents a 2D convolution operator.^1

^1 All the variables henceforth represent matrices, unless otherwise stated. Also, the convolution operator is applied in 'full' or 'valid' mode, depending on the context.

In this work, we propose an algorithm that solves the optimization problem in (A) efficiently and scales to large images. It is an extension of the fast iterative shrinkage-thresholding algorithm (FISTA) [Beck and Teboulle 2009] and is solved using proximal gradients. In addition, we also extend our method to include a feed-forward predictor (predictive sparse decomposition [Kavukcuoglu et al. 2010b]) in the cost for joint optimization during inference and learning. Convolutional sparse coding has also been studied previously in [Kavukcuoglu et al. 2010a; Zeiler et al. 2010]. Zeiler et al. [2010] proposed a method to solve the optimization problem in (A) by introducing additional auxiliary variables, which demands solving a large linear system (whose size is proportional to the size of the image) at every iteration; although the complexity can be reduced by using conjugate gradients, their approach does not scale to large images. On the other hand, Kavukcuoglu et al. [2010a] proposed a convolutional extension of CoD where the computation per iteration is small compared to the method proposed by Zeiler et al. [2010], but the number of iterations required for convergence becomes large for large images. In the following sections, we compare and contrast convolutional CoD with our method to show the performance improvements achieved.

A.1 Convolutional FISTA and Dictionary Learning
Since the cost function in (A) is not jointly convex in both C and x, learning the dictionary of filters involves a block-coordinate-descent kind of optimization [Lee et al. 2007b; Mairal et al. 2009], wherein the objective in (A) is alternately minimized with respect to x and C while keeping one of them fixed.
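Before describing the algorithm, the following small sketch shows how the objective in (A) can be evaluated; the shapes follow the 'full'/'valid' convention described in the footnote, and the function name and the use of SciPy are only illustrative assumptions.

    import numpy as np
    from scipy.signal import convolve2d

    def csc_objective(I, C, x, lam):
        # I: (w, h) image; C: list of M (s, s) kernels;
        # x: list of M (w+s-1, h+s-1) feature maps.
        # 'valid' convolution of a map with its kernel returns a (w, h) array,
        # so the reconstruction has the same size as the image.
        recon = sum(convolve2d(xm, Cm, mode='valid') for Cm, xm in zip(C, x))
        data_term = 0.5 * np.sum((I - recon) ** 2)
        sparsity_term = lam * sum(np.sum(np.abs(xm)) for xm in x)
        return data_term + sparsity_term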


This section describes the convolutional generalization of the popular sparse coding algorithm FISTA for inferring the latent variable x while C is held fixed. Then, a procedure for updating the parameters in C with x fixed is described. In addition, the ability of the proposed method to generalize to other cost functions is shown using predictive sparse decomposition (PSD) [Kavukcuoglu et al. 2010a].

A.1.1 Inference
ISTA is a popular first-order proximal method for inferring sparse codes in the linear inverse problem (A). That is, it has the advantage of being a simple gradient-based algorithm involving simple computations, like matrix multiplications with C and C^T, followed by a soft-thresholding function. However, it tends to have a very slow convergence rate. To overcome this, Beck and Teboulle [2009] proposed FISTA, which has a significantly better global convergence rate while preserving the computational simplicity of ISTA. In fact, FISTA can be generalized to any non-smooth convex optimization problem with a cost function of the form

    F(x) = f(x) + g(x)    (A)

where f and g are convex and g is possibly non-smooth [Beck and Teboulle 2009]. The most notable advantage of FISTA is that it keeps ISTA's number of gradient evaluations (just one per iteration) but involves finding a point that is smartly chosen for updating the latent variable x. This additional point is a function of the difference between the previous two updates of x. This momentum term introduced into the update helps to obtain faster convergence and is shown to yield a convergence rate of order O(1/k^2), where k is the number of iterations. A straightforward way to use FISTA (or any sparse coding method) to solve (A) is to replace the convolution operator by a matrix-vector product. The matrix, equivalent to the dictionary in (A), is made by first constructing a Toeplitz matrix for each filter


containing all its shifted versions and then concatenating all such matrices. The resulting problem is similar to (A) and FISTA is guaranteed to converge for such a problem. However, in practice, working with such large matrices can be computationally expensive and might be unnecessary if one can take advantage of the convolutional nature of the formulation.

As discussed before, FISTA involves computing the gradient of the convex part of the cost function in (A). So, the key to computational simplicity in this work comes from computing this gradient efficiently, which can be done as follows: the derivative of f with respect to z_n (the map corresponding to the n-th filter) is given by

    \nabla f(z_n) = -C_n' \ast \Big(I - \sum_{m=1}^{M} C_m \ast z_m\Big)    (A)

where C_n' is equivalent to a 180-degree rotation of the matrix C_n. This leads to a very efficient computation of the gradient, while avoiding the construction of any large matrices. Equipped with this efficient gradient computation, we can easily extend FISTA to convolutional sparse coding; Algorithm 3 below describes this convolutional generalization.

Algorithm 3: Convolutional extension of the fast iterative shrinkage-thresholding algorithm (FISTA).
Require: input image I, L^{(0)} > 0, \eta > 1 and x^{(0)} \in R^{(w+s-1) \times (h+s-1) \times M}
 1: Initialize z^{(1)} = x^{(0)}, t^{(1)} = 1 and k = 0.
 2: while not converged do
 3:   k = k + 1
 4:   Line search: find the smallest non-negative integer i^{(k)} such that, with L = \eta^{i^{(k)}} L^{(k-1)},  F(p_L(z^{(k)})) \le Q_L(p_L(z^{(k)}), z^{(k)})
 5:   Set L^{(k)} = \eta^{i^{(k)}} L^{(k-1)}
 6:   x^{(k)} = p_{L^{(k)}}(z^{(k)})
 7:   t^{(k+1)} = (1 + \sqrt{1 + 4 (t^{(k)})^2}) / 2
 8:   z^{(k+1)} = x^{(k)} + ((t^{(k)} - 1) / t^{(k+1)}) (x^{(k)} - x^{(k-1)})
 9: end while
10: return x^{(k)}
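A minimal sketch of the two computations that dominate each iteration of Algorithm 3 is given below: the gradient in (A), computed with ordinary 2-D convolutions, and the proximal step p_L used in step 6 (its closed form is given next). The use of SciPy's convolve2d/correlate2d and the fixed step size (the backtracking line search of steps 4-5 is omitted) are assumptions for illustration.

    import numpy as np
    from scipy.signal import convolve2d, correlate2d

    def grad_f(I, C, z):
        # Gradient of the quadratic term in (A) w.r.t. each map z_n.
        # correlate2d(residual, C_n, 'full') equals convolution of the residual
        # with the 180-degree-rotated kernel C'_n, and has the size of z_n.
        residual = I - sum(convolve2d(zm, Cm, mode='valid') for Cm, zm in zip(C, z))
        return [-correlate2d(residual, Cn, mode='full') for Cn in C]

    def prox_step(z, grads, lam, L):
        # p_L(z): gradient step on the smooth term, followed by element-wise
        # soft thresholding with threshold lam / L.
        def shrink(v, t):
            return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
        return [shrink(zn - gn / L, lam / L) for zn, gn in zip(z, grads)]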


In Algorithm 3, p_{L^{(k)}}(.), for the \ell_1 regularization in the cost function, is given by

    p_{L^{(k)}}(z^{(k)}) = T_{\lambda/L^{(k)}}\Big(z^{(k)} - \frac{1}{L^{(k)}}\nabla f(z^{(k)})\Big)    (A)

where T_\alpha(.) is an element-wise soft-thresholding function

    T_\alpha(x_i) = (|x_i| - \alpha)_+ \, \mathrm{sgn}(x_i)    (A)

where x_i indicates the i-th element of a point x in the latent space. \nabla f(z) is the gradient of the quadratic term in (A), denoted f, at a point z in the latent space and is computed from (A). Also, the line search to find the appropriate step size L_k requires computing F(x), the cost function in (A), and Q(p_L(x), x), given by

    Q_L(p_L(x), x) = f(x) + \langle p_L(x) - x, \nabla f(x)\rangle + \frac{L}{2}\|p_L(x) - x\|^2 + \lambda\|p_L(x)\|_1

where f(x) is the quadratic loss in (A) and \langle . , . \rangle is the inner product between the two vectorized quantities. Note that, compared to convolutional CoD [Kavukcuoglu et al. 2010a], more computation is required per iteration here. However, because of the momentum term introduced in steps 7-8 of Algorithm 3, the number of iterations required for convergence is much smaller.

A.1.2 Dictionary Learning
With the inferred latent variable x fixed, the dictionary of filters is updated using a gradient-based method. For faster convergence, limited-memory BFGS [Nocedal 1980] is used to update the filters over a batch of images. The gradient w.r.t. a single filter C_n is

    \nabla f(C_n) = -\sum_{b=1}^{B} x_{b,n}' \ast \Big(I_b - \sum_{m=1}^{M} C_m \ast x_{b,m}\Big)    (A)

where B is the number of images in a batch and x_{b,n}' is computed in a manner similar to C_n' described in (A). Each filter is normalized to have unit norm after the update to avoid redundant solutions.
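A sketch of the batch filter gradient in (A) follows; the function and variable names are illustrative, and only the raw gradient is shown (the thesis uses limited-memory BFGS over a batch, and re-normalizes each filter to unit norm after the update).

    import numpy as np
    from scipy.signal import convolve2d

    def dict_gradient(images, codes, C):
        # Gradient w.r.t. each filter C_n, summed over a batch: the reconstruction
        # residual of every image is convolved ('valid' mode) with the
        # 180-degree-rotated feature map, giving an (s, s) array per filter.
        grads = [np.zeros_like(Cn) for Cn in C]
        for I_b, x_b in zip(images, codes):
            residual = I_b - sum(convolve2d(xm, Cm, mode='valid')
                                 for Cm, xm in zip(C, x_b))
            for n, xn in enumerate(x_b):
                # x'_n * residual, with x'_n the flipped feature map
                grads[n] -= convolve2d(xn[::-1, ::-1], residual, mode='valid')
        return grads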


The above-described inference and learning procedure can be readily applied for image denoising [Elad and Aharon 2006] and for constructing deep networks like the deconvolutional network [Zeiler et al. 2010]. Moreover, unlike convolutional CoD, FISTA can be easily generalized to other cost functions, such as the predictive sparse decomposition (PSD) described below.

A.1.3 Extension to PSD
The cost function used in PSD [Kavukcuoglu et al. 2010b] contains two parts: a decoder, containing a quadratic cost function along with a sparsity constraint similar to (A), and a feed-forward, non-linear encoder that approximates the sparse code. The cost function with this additional prediction term is

    x = \arg\min_x \frac{1}{2}\Big\|I - \sum_{m=1}^{M} C_m \ast x_m\Big\|_2^2 + \sum_{m=1}^{M}\big\|x_m - \varphi(W_m \ast I)\big\|_2^2 + \lambda\sum_{m=1}^{M}\|x_m\|_1    (A)

where W_m \in R^{s \times s} is a feed-forward filter and \varphi(.) is a non-linear function that approximates the sparse codes. For further details, refer to Kavukcuoglu et al. [2010a]. Now, the first two quadratic terms in (A), referred to as f below, are convex and continuously differentiable with respect to x, and the problem can be readily solved using the convolutional FISTA described above, with the gradient w.r.t. x_n computed as

    \nabla f(x_n) = -C_n' \ast \Big(I - \sum_{m=1}^{M} C_m \ast x_m\Big) + \big(x_n - \varphi(W_n \ast I)\big)

Note that convolutional CoD cannot deal with this cost function without losing its efficiency.
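A sketch of this PSD gradient, again assuming SciPy convolutions; the 'full' mode used for the encoder filtering and the unit weight on the prediction term are assumptions for illustration, and `phi` stands for the encoder non-linearity.

    import numpy as np
    from scipy.signal import convolve2d, correlate2d

    def psd_gradient(I, C, x, W, phi):
        # Gradient of the smooth part of the PSD cost w.r.t. each map x_n:
        # the reconstruction term (as in convolutional FISTA) plus the
        # prediction residual x_n - phi(W_n * I).
        residual = I - sum(convolve2d(xm, Cm, mode='valid') for Cm, xm in zip(C, x))
        grads = []
        for Cn, xn, Wn in zip(C, x, W):
            pred = phi(convolve2d(I, Wn, mode='full'))  # encoder output, same size as x_n
            grads.append(-correlate2d(residual, Cn, mode='full') + (xn - pred))
        return grads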


Figure A-1. (A) The dictionary of filters learned from the face images, (B) the original contrast-normalized image and (C) the reconstruction of the image from the inferred variables.

A.2 Experiments
In this section, convolutional FISTA (ConvFISTA) is compared with other existing algorithms. Grayscale images of different sizes are used for testing. All the images are first pre-processed such that their mean is subtracted; they are then contrast normalized using a 5 x 5 average filter. This step ensures that the low-frequency components are removed and the high-frequency edges are made prominent, which helps to speed up learning and inference during sparse coding [Olshausen and Field 1996].

A.2.1 Dictionary Learning
Face images from the AT&T database [Samaria and Harter 1994] are used for learning the dictionary. Each grayscale image is first resized to 64 x 64 pixels and pre-processed as stated above. From these, 8 convolutional filters of size 16 x 16 are learned using a batch of 20 images per epoch. Figure A-1 shows the results obtained. It is important to note that taking 8 convolutional filters makes the system that many times over-complete at every point.
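One plausible reading of the pre-processing step mentioned above is sketched below: subtract the global mean and then remove the local average computed with a 5 x 5 box filter. The boundary handling and the exact form of the contrast normalization are assumptions.

    import numpy as np
    from scipy.signal import convolve2d

    def preprocess(image):
        # Subtract the global mean, then remove low frequencies by subtracting
        # a local average computed with a 5x5 box filter, leaving edges prominent.
        img = image.astype(float) - image.mean()
        box = np.ones((5, 5)) / 25.0
        local_mean = convolve2d(img, box, mode='same', boundary='symm')
        return img - local_mean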


Figure A-2. Comparison of the convergence rate between convolutional CoD (ConvCoD) and convolutional FISTA (ConvFISTA). Two different image sizes, (A) 50 x 50 pixels and (B) 150 x 150 pixels, are used for comparison.

This typically leads to a sparser representation than what would be achieved from patch-level sparse coding, because each element in the latent variables competes with the other filters to explain not just one point but a local neighborhood in the image.

A.2.2 Convergence Rate
As discussed before, one of the important advantages of ConvFISTA is its faster convergence rate during inference. To compare the performance of the proposed model with convolutional CoD (ConvCoD) [Kavukcuoglu et al. 2010a], the ability of both methods to reconstruct the image, with a pre-learned dictionary, is considered as the metric. In other words, the relative error between the true (contrast-normalized) image and the reconstructed image obtained from the inferred latent representation,

    M = \frac{\|I - \hat{I}\|_2^2}{\|I\|_2^2}    (A)

where \hat{I} is the reconstructed image, is used as the metric. To make a fair comparison, the same model is used for both methods, i.e., the sparsity parameter is held fixed (\lambda = 0.1) and a fixed dictionary of 32 filters of size 9 x 9, learned from the Berkeley segmentation dataset [Martin et al. 2001], is used.
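The metric M in (A) is straightforward to compute; a small sketch (the function name is illustrative):

    import numpy as np

    def relative_error(I, I_hat):
        # Relative squared error between the contrast-normalized image I and
        # its reconstruction I_hat, as in (A).
        return np.sum((I - I_hat) ** 2) / np.sum(I ** 2)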


Images of varying size from the same dataset are used to compare the performance with scale. Figure A-2 shows the comparison. It is observed that ConvFISTA outperforms ConvCoD in terms of convergence rate. The difference in performance becomes more pronounced as the size of the images increases, because ConvCoD, which updates only one carefully chosen element at every iteration, requires more iterations for convergence.

A.2.3 Predictive Sparse Decomposition

Figure A-3. (A) The 36 encoder filter weights of size 11 x 11 learned using ConvFISTA; they resemble the learned dictionary filters (not shown). (B) The instantaneous total loss at every epoch for both ConvFISTA and ConvCoD.

As described in Section A.1.3, predictive sparse decomposition (PSD), used as a basic block for building deep networks, has a decoder (involving C) and a feed-forward encoder (involving W) in the cost function, as in (A). However, once the model is trained, the decoder is discarded and the encoder alone is used for approximate inference. Therefore, it is important to minimize the joint cost function, so that the encoder can approximate the decoder function more accurately. Here, ConvFISTA is used to infer the latent representation from the joint cost function involving both the decoder and the encoder.


The non-linear function used in the encoder is taken as in [Gregor and LeCun 2010], with

    \varphi_m(I) = \gamma_m\big(\tanh(W_m \ast I - \beta) + \tanh(W_m \ast I + \beta)\big)

The parameters of the encoder, \gamma, \beta and W, are also learned along with the convolution filters C (each one is updated while all the others are held fixed). For comparison, we use the procedure described in [Kavukcuoglu et al. 2010a], where the encoder is learned separately as a feed-forward model, similar to convolutional networks [LeCun et al. 1989], after obtaining the sparse representation using ConvCoD. The results obtained are shown in Figure A-3. Similar to the previous experiment, the Berkeley segmentation dataset [Martin et al. 2001] is used, with each image resized to 150 x 150 pixels and pre-processed. Figure A-3A shows the learned encoder weights corresponding to 36 convolutional filters of size 11 x 11. We observe (Figure A-3B) that the total loss, the quadratic term plus the prediction term in (A), reaches a lower value when ConvFISTA is used to jointly optimize the cost than when ConvCoD is used to train the model separately.

A.3 Summary
In this work, we proposed a convolutional extension of the FISTA algorithm for solving the sparse coding problem. The convolutional FISTA proposed here has two major advantages: (1) unlike ConvCoD, where only one carefully chosen element is updated at every iteration, ConvFISTA updates all the elements in parallel and achieves a faster convergence rate, and (2) because of its simple first-order updates, ConvFISTA can easily be generalized to other convex loss functions. With the help of recent advances in using proximal methods for structured sparsity models [Chen et al. 2012a], it might be possible to extend the method proposed here to learn more complex structures within the convolutional filters; for example, introducing a tree-structured relationship between the filters might better represent the variations between different features within an image. Such extensions remain to be studied.


APPENDIX B
ADDITIONAL RESULTS FOR MODEL VISUALIZATION

B.1 Visualizing Invariance Encoded by Layer 1 Causes
Following Section 3.3.2.1, we show here the invariance encoded by all 100 layer 1 causes. In addition to the frequency-orientation polar plots and the center-orientation scatter plots, we also show the grouping of the dictionary elements corresponding to each column of the invariance matrix B.

Figure B-1. Visualizing the grouping of the dictionary elements encoded by the layer 1 causes. From left to right and top to bottom, every 5 blocks indicate a group of layer 1 dictionary elements (or states) that are strongly connected to a cause dimension through the invariance matrix B.


Figure B-2. Visualizing the invariance encoded by the layer 1 causes using frequency-orientation polar plots. Each subplot here indicates one column of the invariance matrix B.


Figure B-3. Visualizing the invariance encoded by the layer 1 causes using center-orientation scatter plots. Each subplot here indicates one column of the invariance matrix B.

B.2 Hierarchical Decomposition Obtained from the YouTube Dataset
In Section 5.4.3.2, we discussed the classification performance of the convolutional dynamic model on the YouTube celebrities dataset. Here we visualize the model learned


for this task and show that the model learns a hierarchical decomposition of the faces in the dataset.

Figure B-4. Receptive fields in a two-layered network learned from the YouTube celebrities dataset. (Top) Receptive fields of a subset of layer 1 causes. (Bottom) Receptive fields of a subset of layer 2 causes. Both are constructed as linear combinations of lower-layer filters.


REFERENCES

Nicolas Pinto, David D Cox, and James J DiCarlo. Why is real-world visual object recognition hard? PLoS Computational Biology, 4(1):e27, 2008.

D. J. Felleman and D. C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex (New York, N.Y.: 1991), 1(1):1, January 1991. ISSN 1047-3211. doi: 10.1093/cercor/1.1.1-a. URL http://dx.doi.org/10.1093/cercor/1.1.1-a.

A. Clark. Whatever next? Predictive brains, situated agents, and the future of cognitive science. The Behavioral and Brain Sciences, pages 1, 2013.

Odelia Schwartz, Anne Hsu, and Peter Dayan. Space and time in visual context. Nature Reviews Neuroscience, 8(7):522, July 2007. ISSN 1471-003X. doi: 10.1038/nrn2155. URL http://dx.doi.org/10.1038/nrn2155.

Moshe Bar. Visual objects in context. Nature Reviews Neuroscience, 5(8):617, 2004.

Karl Friston. A theory of cortical responses. Philosophical Transactions of The Royal Society B: Biological Sciences, 360:815, 2005.

Rajesh P. N. Rao and Dana H. Ballard. Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9:721, 1997.

Karl Friston. Hierarchical models in the brain. PLoS Comput Biol, 4(11):e1000211, 2008.

Yoshua Bengio. Learning deep architectures for AI. Found. Trends Mach. Learn., 2(1):1, January 2009. ISSN 1935-8237. doi: 10.1561/2200000006. URL http://dx.doi.org/10.1561/2200000006.

Roberto Rigamonti, Matthew A Brown, and Vincent Lepetit. Are sparse representations really relevant for image classification? In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1545. IEEE, 2011.

Charles D Gilbert and Wu Li. Top-down influences on visual processing. Nature Reviews Neuroscience, 14(5):350, 2013.

Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527, July 2006.

P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371, 2010.


Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2:1019, 1999.

T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(3):411, 2007. ISSN 0162-8828. doi: 10.1109/TPAMI.2007.56.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541, December 1989. ISSN 0899-7667. doi: 10.1162/neco.1989.1.4.541. URL http://dx.doi.org/10.1162/neco.1989.1.4.541.

Ian Goodfellow, Honglak Lee, Quoc V Le, Andrew Saxe, and Andrew Y Ng. Measuring invariances in deep networks. In Advances in Neural Information Processing Systems, pages 646, 2009.

L. Wiskott and T. J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715, 2002.

Hossein Mobahi, Ronan Collobert, and Jason Weston. Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 737, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. doi: 10.1145/1553374.1553469. URL http://doi.acm.org/10.1145/1553374.1553469.

Will Zou, Andrew Ng, Shenghuo Zhu, and Kai Yu. Deep learning of invariant features via simulated fixations in video. In Advances in Neural Information Processing Systems 25, pages 3212, 2012.

Ruslan Salakhutdinov and Geoffrey Hinton. An efficient learning procedure for deep Boltzmann machines. Neural Computation, 24(8):1967, 2012.

Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 609, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. doi: 10.1145/1553374.1553453.

Tai Sing S. Lee and David Mumford. Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 20(7):1434, July 2003. ISSN 1084-7529.

Rajesh P. N. Rao and Dana H. Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79, January 1999. ISSN 1097-6256. doi: 10.1038/4580. URL http://dx.doi.org/10.1038/4580.


Stefan J. Kiebel, Jean Daunizeau, and Karl J. Friston. A hierarchy of time-scales and the brain. PLoS Comput Biol, 4(11):e1000209, 2008a.

Stefan J. Kiebel, Katharina von Kriegstein, Jean Daunizeau, and Karl J. Friston. Recognizing sequences of sequences. PLoS Comput Biol, 5(8):e1000464, August 2009. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1000464. URL http://dx.doi.org/10.1371/journal.pcbi.1000464.

David G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision - Volume 2, ICCV '99, pages 1150, Washington, DC, USA, 1999. IEEE Computer Society. ISBN 0-7695-0164-8.

Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05) - Volume 1, CVPR '05, pages 886, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2372-2. doi: 10.1109/CVPR.2005.177.

B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607, June 1996. ISSN 0028-0836.

Honglak Lee, Chaitanya Ekanadham, and Andrew Ng. Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems, pages 873, 2007a.

K. Kavukcuoglu, P. Sermanet, Y. L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. Advances in Neural Information Processing Systems, pages 1090, 2010a.

Vladimir N Vapnik. Statistical learning theory. 1998.

Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, 1994.

Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. Large-Scale Kernel Machines, 34, 2007.

Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504, 2006.

Y Bengio, A Courville, and P Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

K. Jarrett, K. Kavukcuoglu, M. A. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on, pages 2146. IEEE, 2009a.


Y-L Boureau, Francis Bach, Yann LeCun, and Jean Ponce. Learning mid-level features for recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2559. IEEE, 2010.

Peter Dayan, Geoffrey E. Hinton, Radford M. Neal, and Richard S. Zemel. The Helmholtz machine. Neural Computation, 7(5):889, September 1995. doi: 10.1162/neco.1995.7.5.889. URL http://dx.doi.org/10.1162/neco.1995.7.5.889.

D. George and J. Hawkins. A hierarchical Bayesian model of invariant pattern recognition in the visual cortex. In Proceedings 2005 IEEE International Joint Conference on Neural Networks, volume 3, pages 1812, December 2005. doi: 10.1109/IJCNN.2005.1556155. URL http://dx.doi.org/10.1109/IJCNN.2005.1556155.

Stefan J Kiebel, Jean Daunizeau, and Karl J Friston. A hierarchy of time-scales and the brain. PLoS Computational Biology, 4(11):e1000209, 2008b.

Karl Friston and Stefan Kiebel. Cortical circuits for perceptual inference. Neural Networks, 22(8):1093, 2009.

Sam Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305, February 1999. ISSN 0899-7667. doi: 10.1162/089976699300016674. URL http://dx.doi.org/10.1162/089976699300016674.

Aapo Hyvarinen, Jarmo Hurri, and Patrick O. Hoyer. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision (Computational Imaging and Vision). Springer, 2nd printing edition, June 2009. ISBN 1848824904. URL http://www.worldcat.org/isbn/1848824904.

Kevin Jarrett, Koray Kavukcuoglu, Marc'A. Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision, pages 2146. IEEE, September 2009b. ISBN 978-1-4244-4420-5. doi: 10.1109/ICCV.2009.5459469. URL http://dx.doi.org/10.1109/ICCV.2009.5459469.

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2007.

M. Ranzato, Fu Jie Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pages 1, 2007.

Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346, June 2008. ISSN 1077-3142. doi: 10.1016/j.cviu.2007.09.014. URL http://dx.doi.org/10.1016/j.cviu.2007.09.014.


Yan Karklin and Michael S. Lewicki. A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural Computation, 17:397, 2005.

N. Vaswani. Kalman filtered compressed sensing. In Image Processing, 2008. ICIP 2008. 15th IEEE International Conference on, pages 893, October 2008.

Evripidis Karseras, Kin Leung, and Wei Dai. Tracking dynamic sparse signals using hierarchical Bayesian Kalman filters. In Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013.

D. Angelosante, G. B. Giannakis, and E. Grossi. Compressed sensing of time-varying signals. In Digital Signal Processing, 2009 16th International Conference on, pages 1, July 2009.

A. Charles, M. S. Asif, J. Romberg, and C. Rozell. Sparsity penalties in dynamical system estimation. In Information Sciences and Systems (CISS), 2011 45th Annual Conference on, pages 1, March 2011.

Namrata Vaswani and Wei Lu. Modified-CS: Modifying compressive sensing for problems with partially known support. Signal Processing, IEEE Transactions on, 58(9):4595, 2010.

D. Sejdinovic, C. Andrieu, and R. Piechocki. Bayesian sequential compressed sensing in sparse dynamical systems. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 1730, 2010. doi: 10.1109/ALLERTON.2010.5707125.

Adam S Charles and Christopher J Rozell. Re-weighted l1 dynamic filtering for time-varying sparse signal estimation. arXiv preprint arXiv:1208.0325, 2012.

X. Chen, Q. Lin, S. Kim, J. G. Carbonell, and E. P. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719, 2012a.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183, March 2009. ISSN 1936-4954. doi: 10.1137/080716542.

Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127, 2005.

Karol Gregor and Yann LeCun. Efficient learning of sparse invariant representations. CoRR, abs/1105.5307, 2011.

Alex Nelson. Nonlinear estimation and modeling of noisy time-series by dual Kalman filtering methods. PhD thesis, 2000.


Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82(Series D):35, 1960.

J Hans van Hateren and Dan L Ruderman. Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proceedings of the Royal Society of London. Series B: Biological Sciences, 265(1412):2315, 1998.

Nicole C Rust, Odelia Schwartz, J Anthony Movshon, and Eero P Simoncelli. Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46(6):945, 2005.

R. P. Rao. An optimal estimation approach to visual perception and learning. Vision Research, 39(11):1963, June 1999. ISSN 0042-6989.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301, 2005.

Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. Fast inference in sparse coding algorithms with applications to object recognition. CoRR, abs/1010.3467, 2010b.

M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2528. IEEE, 2010.

Charles Cadieu and Bruno A Olshausen. Learning transformational invariants from natural movies. In Advances in Neural Information Processing Systems, pages 209, 2008.

A. Hyvarinen and P. O. Hoyer. A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41(18):2413, August 2001. ISSN 0042-6989.

Rakesh Chalasani, Jose C Principe, and Naveen Ramakrishnan. A fast proximal method for convolutional sparse coding. In Neural Networks, IJCNN '13, 2013 IEEE International Joint Conference on, 2013.

R. Chalasani and J. C. Principe. Deep predictive coding networks. In Workshop at International Conference on Learning Representations (ICLR 2013), April 2013.

Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278, 1998.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871, 2008.


Bo Chen, Gungor Polatkan, Guillermo Sapiro, Lawrence Carin, and David B Dunson. The hierarchical beta process for convolutional factor analysis and deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 361, 2011.

Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2169. IEEE, 2006.

Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y Ng. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, pages 759. ACM, 2007.

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59, 2007.

Sameer A. Nene, Shree K. Nayar, and Hiroshi Murase. Columbia Object Image Library (COIL-20). 1996.

Kuang-Chih Lee, Jeffrey Ho, Ming-Hsuan Yang, and David Kriegman. Visual tracking and recognition using probabilistic appearance manifolds. Computer Vision and Image Understanding, 99(3):303, 2005.

Heiko Wersing and Edgar Korner. Learning optimized features for hierarchical models of invariant object recognition. Neural Computation, 15(7):1559, 2003.

Minyoung Kim, Sanjiv Kumar, Vladimir Pavlovic, and Henry Rowley. Face tracking and recognition with visual constraints in real-world videos. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1. IEEE, 2008.

Paul Viola and Michael Jones. Robust real-time object detection. International Journal of Computer Vision, 4, 2001.

Yiqun Hu, Ajmal S Mian, and Robyn Owens. Sparse approximated nearest points for image set classification. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 121. IEEE, 2011.

Ruiping Wang and Xilin Chen. Manifold discriminant analysis. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 429. IEEE, 2009.

Hakan Cevikalp and Bill Triggs. Face recognition based on image sets. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2567. IEEE, 2010.


Yi-Chen Chen, Vishal M Patel, P Jonathon Phillips, and Rama Chellappa. Dictionary-based face recognition from video. In Computer Vision - ECCV 2012, pages 766. Springer, 2012b.

Ruiping Wang, Huimin Guo, Larry S Davis, and Qionghai Dai. Covariance discriminative learning: A natural and efficient approach to image set classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2496. IEEE, 2012.

Raviteja Vemulapalli, Jaishanker K Pillai, and Rama Chellappa. Kernel learning for extrinsic classification of manifold features. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1782. IEEE, 2013.

Conrad Sanderson. Biometric Person Recognition: Face, Speech and Fusion. VDM Verlag, 2008.

Misha Denil and Nando de Freitas. Recklessly approximate sparse coding. arXiv preprint arXiv:1208.0959, 2012.

Matthew D Zeiler, Graham W Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2018. IEEE, 2011.

Marisa Carrasco. Visual attention: The past 25 years. Vision Research, 51(13):1484, 2011.

Ali Borji and Laurent Itti. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185, 2013.

Sharat Chikkerur, Thomas Serre, Cheston Tan, and Tomaso Poggio. What and where: A Bayesian inference theory of attention. Vision Research, 50(22):2233, 2010.

Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Johannes Furnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 399, Haifa, Israel, June 2010. Omnipress. URL http://www.icml2010.org/papers/449.pdf.

Y. Li and S. Osher. Coordinate descent optimization for l1 minimization with application to compressed sensing; a greedy algorithm. Inverse Problems and Imaging, 3(3):487, 2009.

H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. Advances in Neural Information Processing Systems, 19:801, 2007b.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 689. ACM, 2009.


J. Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773, 1980.

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. Image Processing, IEEE Transactions on, 15(12):3736, 2006.

Ferdinando S Samaria and Andy C Harter. Parameterisation of a stochastic model for human face identification. In Applications of Computer Vision, 1994, Proceedings of the Second IEEE Workshop on, pages 138. IEEE, 1994.

D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int'l Conf. Computer Vision, volume 2, pages 416, July 2001.


BIOGRAPHICAL SKETCH

Rakesh Chalasani was born in Mopidevi, India, in 1986. He received his Bachelor of Technology degree in electronics and communication engineering from the National Institute of Technology, Nagpur, India, in 2008. He received his Master of Science and Ph.D. degrees in electrical and computer engineering from the University of Florida in 2010 and 2013, respectively. He also worked as a research intern at the Bosch Research and Technology Center during the summer of 2012. His research interests include machine learning, pattern recognition, unsupervised learning, kernel methods, information-theoretic learning and computer vision.