<%BANNER%>

Efficient Algorithms for Sparse Singular Value Decomposition

Permanent Link: http://ufdc.ufl.edu/UFE0041153/00001

Material Information

Title: Efficient Algorithms for Sparse Singular Value Decomposition
Physical Description: 1 online resource (97 p.)
Language: english
Creator: Rajamanickam, Sivasank
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009

Subjects

Subjects / Keywords: band, bidiagonalization, blocked, reduction, sparse, svd
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Singular value decomposition is a problem that is used in a wide variety of applications like latent semantic indexing, collaborative filtering and gene expression analysis. In this study, we consider the singular value decomposition problem for band and sparse matrices. Linear algebraic algorithms for modern computer architectures are designed to extract maximum performance by exploiting modern memory hierarchies, even though this can sometimes lead to algorithms with higher memory requirements and more floating point operations. We propose blocked algorithms for sparse and band bidiagonal reduction. The blocked algorithms are designed to exploit the memory hierarchy, but they perform nearly the same number of floating point operations as the non-blocked algorithms. We introduce efficient blocked band reduction algorithms that utilize the cache correctly and perform better than competing methods in terms of the number of floating point operations and the amount of required workspace. Our band reduction methods are several times faster than existing methods. The theory and algorithms for sparse singular value decomposition, especially algorithms for reducing a sparse upper triangular matrix to a bidiagonal matrix are proposed here. The bidiagonal reduction algorithms use a dynamic blocking method to reduce more than one entry at a time. They limit the sub-diagonal fill to one scalar by pipelining the blocked plane rotations. A symbolic factorization algorithm for computing the time and memory requirements for the bidiagonal reduction of a sparse matrix helps the numerical reduction step. Our sparse singular value decomposition algorithm computes all the singular values at the same amount of time it takes to compute a few singular values using existing methods. It performs much faster than existing methods when more singular values are required. The features of the software implementing the band and sparse bidiagonal reduction algorithms are also presented.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Sivasank Rajamanickam.
Thesis: Thesis (Ph.D.)--University of Florida, 2009.
Local: Adviser: Davis, Timothy A.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0041153:00001

Permanent Link: http://ufdc.ufl.edu/UFE0041153/00001

Material Information

Title: Efficient Algorithms for Sparse Singular Value Decomposition
Physical Description: 1 online resource (97 p.)
Language: english
Creator: Rajamanickam, Sivasank
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009

Subjects

Subjects / Keywords: band, bidiagonalization, blocked, reduction, sparse, svd
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Singular value decomposition is a problem that is used in a wide variety of applications like latent semantic indexing, collaborative filtering and gene expression analysis. In this study, we consider the singular value decomposition problem for band and sparse matrices. Linear algebraic algorithms for modern computer architectures are designed to extract maximum performance by exploiting modern memory hierarchies, even though this can sometimes lead to algorithms with higher memory requirements and more floating point operations. We propose blocked algorithms for sparse and band bidiagonal reduction. The blocked algorithms are designed to exploit the memory hierarchy, but they perform nearly the same number of floating point operations as the non-blocked algorithms. We introduce efficient blocked band reduction algorithms that utilize the cache correctly and perform better than competing methods in terms of the number of floating point operations and the amount of required workspace. Our band reduction methods are several times faster than existing methods. The theory and algorithms for sparse singular value decomposition, especially algorithms for reducing a sparse upper triangular matrix to a bidiagonal matrix are proposed here. The bidiagonal reduction algorithms use a dynamic blocking method to reduce more than one entry at a time. They limit the sub-diagonal fill to one scalar by pipelining the blocked plane rotations. A symbolic factorization algorithm for computing the time and memory requirements for the bidiagonal reduction of a sparse matrix helps the numerical reduction step. Our sparse singular value decomposition algorithm computes all the singular values at the same amount of time it takes to compute a few singular values using existing methods. It performs much faster than existing methods when more singular values are required. The features of the software implementing the band and sparse bidiagonal reduction algorithms are also presented.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Sivasank Rajamanickam.
Thesis: Thesis (Ph.D.)--University of Florida, 2009.
Local: Adviser: Davis, Timothy A.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0041153:00001


This item has the following downloads:


Full Text

PAGE 1

EFFICIENTALGORITHMSFORSPARSESINGULARVALUEDECOMPOSIT ION By SIVASANKARANRAJAMANICKAM ADISSERTATIONPRESENTEDTOTHEGRADUATESCHOOL OFTHEUNIVERSITYOFFLORIDAINPARTIALFULFILLMENT OFTHEREQUIREMENTSFORTHEDEGREEOF DOCTOROFPHILOSOPHY UNIVERSITYOFFLORIDA 2009

PAGE 2

c r 2009SivasankaranRajamanickam 2

PAGE 3

ToAarthi,Akilan,andSowmya 3

PAGE 4

ACKNOWLEDGMENTS Iamfortunatetohavethesupportofsometrulywonderfulpeo ple.Firstand foremost,IwouldliketothankmyadvisorDr.TimothyDavisw hohasbeenanexcellent mentor.Iwillbeindebtedtohimforeverforthetrustheplac edinmeevenbeforeI startedgraduatestudiesandwhiledoingthisstudy.Hehasn otonlytaughtmetechnical nuances,butalsohelpedmeappreciatethebeautyineveryth ingfromperfectsoftwareto poetry.Iamgratefulforhisvaluableadvice,patienceandf ormakingmeabetterperson. IwouldliketoacknowledgeDr.GeneGolubwhowasinmyPhDcom mitteeforhis kindandencouragingwords.Hegavethemotivationtosolvet hesparsesingularvalue decompositionproblem.IwouldalsoliketothankDr.AlinDo bra,Dr.WilliamHager, Dr.JorgPeters,andDr.MeeraSitharamforservinginmyPhDc ommittee. Onthepersonalside,IwouldliketothankmywifeSrideviwho sesacricesand constantsupportmadethiswholeeortpossible.Shehasalw aysbeenwithmewhenever Ineededher.Icouldhavenevercompletedthisstudywithout herbesideme.Myson AkilananddaughterSowmyadeserveaspecialmentionforthe timespentwiththem rejuvenatedmewhileworkingonthisdissertation. IamgratefulfortheloveandsupportofmyparentsG.Rajaman ickamandR.Anusuya. Theyprovidedmewiththebestresourcestheyknewandgaveme thefoundationthat madegraduatestudiespossible.MybrotherSarangandeserv esaspecialthanksforacting asasoundingboardofideasbothinthepersonalandprofessi onalside.Iwouldliketo thankmyfriendsandcolleaguesSrijithandVenkatforallth eirhelpduringthegraduate studies.IwouldalsoliketoacknowledgemyfriendsBalraj, Baranidharan,Dhamodaran, KarthikeyaArasan,Ramasubbu,Seenivasan,Senthilkumara ndSimpsonFernandofor cheeringmeonforthepastfteenyears. 4

PAGE 5

TABLEOFCONTENTS page ACKNOWLEDGMENTS .................................4 LISTOFTABLES .....................................7 LISTOFFIGURES ....................................8 ABSTRACT ........................................10 CHAPTER 1INTRODUCTION ..................................12 1.1SingularValueDecomposition .........................12 1.1.1ComputingtheSingularValueDecomposition ............13 1.1.2OrthogonalTransformations ......................13 1.2SingularValueDecompositionofSparseandBandMatrice s ........14 2LITERATUREREVIEW ..............................17 2.1GivensRotationsandBandReduction ....................17 2.2MethodstoComputetheSingularValueDecomposition ..........20 3BANDBIDIAGONALIZATIONUSINGGIVENSROTATIONS .........22 3.1Overview ....................................22 3.2BlockingforBandReduction .........................23 3.3PipeliningtheGivensRotations ........................25 3.3.1SeedBlock ................................26 3.3.2ColumnBlock ..............................27 3.3.3DiagonalBlock .............................29 3.3.4RowBlock ...............................30 3.3.5FillBlock ................................31 3.3.6FloatingPointOperationsforBandReduction ............32 3.4AccumulatingthePlaneRotations ......................34 3.4.1Structureof U ..............................35 3.4.2FloatingPointOperationsforUpdate .................42 3.5PerformanceResults ..............................43 3.6Summary ....................................47 4PIRO BAND:PIPELINEDPLANEROTATIONSFORBANDREDUCTION .49 4.1Overview ....................................49 4.2Features .....................................49 4.2.1PIRO BANDInterface .........................50 4.2.2LAPACKStyleInterface ........................52 4.2.3MATLABInterface ...........................52 5

PAGE 6

4.3PerformanceResults ..............................53 4.3.1ChoosingaBlockSize ..........................53 4.3.2PIRO BAND svd and qr Performance ................56 5SPARSER-BIDIAGONALIZATIONUSINGGIVENSROTATIONS ......58 5.1Introduction ...................................58 5.2TheoryforSparseBidiagonalization ......................58 5.2.1ProleandCornerDenitions .....................58 5.2.2Mountain,SlopeandSink .......................62 5.2.3PropertiesofSkylineandMountainviewProles ...........64 5.2.4PropertiesforBlockingtheBidiagonalReduction ..........70 5.3BidiagonalizationAlgorithms .........................73 5.3.1BlockinginMountainviewProle ...................74 5.3.2BlockinginSkylineProle .......................75 5.4SymbolicFactorization .............................80 5.5PIRO SKY:PipelinedPlaneRotationsforSkylineReduction ........84 5.6PerformanceResults ..............................87 6CONCLUSION ....................................90 REFERENCES .......................................92 BIOGRAPHICALSKETCH ................................97 6

PAGE 7

LISTOFTABLES Table page 4-1MATLABinterfacetoPIRO BAND ........................52 4-2PIRO BAND svd vsMATLABdense svd .....................56 4-3PIRO BAND qr vsMATLABdenseandsparse qr ................57 5-1MATLABinterfacetoPIRO SKY ..........................86 5-2CinterfacetoPIRO SKY ..............................86 7

PAGE 8

LISTOFFIGURES Figure page 2-1Schwarz'smethodsforbandreduction .......................18 3-1Firsttwoiterationsofblockedbandreduction ...................23 3-2FindingandapplyingcolumnrotationsintheS-block ...............27 3-3ApplyingcolumnrotationstotheC-block. .....................27 3-4Applyingcolumnrotationsandndingrowrotationsinth eD-block .......28 3-5ApplyingrowrotationstoonecolumnintheR-block. ...............29 3-6Applyingrowrotationsandndingcolumnrotationsinth eF-block .......31 3-7Updateof U forsymmetric A with r =1 ......................34 3-8Updateof U forsymmetric A with r =3 ......................35 3-9Updateof U forunsymmeteric A with r =1 ....................36 3-10Updateof U forunsymmetric A with r =2 ....................39 3-11Performanceoftheblockedreductionalgorithmwithde faultblocksizes .....44 3-12Performanceof piro band reduce vsSBRandLAPACK .............44 3-13Performanceof piro band reduce vsSBRandLAPACKwithupdate ......45 3-14Performanceof piro band reduce vsDGBBRDofLAPACK ...........45 4-1Performanceof piro band reduce fordierentblocksizes ............54 4-2Performanceof piro band reduce fordierentblocksizes ............55 5-1Asparse R ,skylineproleof R andmountainviewproleof R ..........59 5-2Cornerdenitionintheproleof R forcornercolumns j and n .........60 5-3Cornercolumns j and n aftercolumnrotations ..................60 5-4Cornersforthesparse R ...............................61 5-5Typesofcolumnsinthemountainviewprole ...................63 5-6Endcolumnlemma ..................................64 5-7Toprowlemma ....................................65 5-8Rubblerowlemma ..................................66 8

PAGE 9

5-9Dierentconditionstoestablishprogressinreduction ...............68 5-10Potentialcolumnsforblockinginthemountainviewpro le ............70 5-11Problemwithsimpleblockingintheskylineprole ................76 5-12Firstthreeblocksandtencornersoftheskylinereduct ionalgorithm .......79 5-13Classicationofcolumnsintheskylineprole ...................80 5-14Symbolicfactorizationoftheskylineprole ....................81 5-15PerformanceofPIRO SKYvsMATLAB svds ...................87 5-16PerformanceofPIRO SKYvsMATLAB svds ...................88 9

PAGE 10

AbstractofDissertationPresentedtotheGraduateSchool oftheUniversityofFloridainPartialFulllmentofthe RequirementsfortheDegreeofDoctorofPhilosophy EFFICIENTALGORITHMSFORSPARSESINGULARVALUEDECOMPOSIT ION By SivasankaranRajamanickam December2009 Chair:TimothyA.DavisMajor:ComputerEngineering Singularvaluedecompositionisaproblemthatisusedinawi devarietyofapplications likelatentsemanticindexing,collaborativelteringand geneexpressionanalysis.Inthis study,weconsiderthesingularvaluedecompositionproble mforbandandsparsematrices. Linearalgebraicalgorithmsformoderncomputerarchitect uresaredesignedtoextract maximumperformancebyexploitingmodernmemoryhierarchi es,eventhoughthiscan sometimesleadtoalgorithmswithhighermemoryrequiremen tsandmoreroatingpoint operations.Weproposeblockedalgorithmsforsparseandba ndbidiagonalreduction. Theblockedalgorithmsaredesignedtoexploitthememoryhi erarchy,butthey performnearlythesamenumberofroatingpointoperationsa sthenon-blocked algorithms.Weintroduceecientblockedbandreductional gorithmsthatutilizethe cachecorrectlyandperformbetterthancompetingmethodsi ntermsofthenumberof roatingpointoperationsandtheamountofrequiredworkspa ce.Ourbandreduction methodsareseveraltimesfasterthanexistingmethods. Thetheoryandalgorithmsforsparsesingularvaluedecompo sition,especially algorithmsforreducingasparseuppertriangularmatrixto abidiagonalmatrixare proposedhere.Thebidiagonalreductionalgorithmsuseady namicblockingmethodto reducemorethanoneentryatatime.Theylimitthesub-diago nallltoonescalarby pipeliningtheblockedplanerotations.Asymbolicfactori zationalgorithmforcomputing 10

PAGE 11

thetimeandmemoryrequirementsforthebidiagonalreducti onofasparsematrixhelps thenumericalreductionstep. Oursparsesingularvaluedecompositionalgorithmcompute sallthesingularvaluesat thesameamountoftimeittakestocomputeafewsingularvalu esusingexistingmethods. Itperformsmuchfasterthanexistingmethodswhenmoresing ularvaluesarerequired. Thefeaturesofthesoftwareimplementingthebandandspars ebidiagonalreduction algorithmsarealsopresented. 11

PAGE 12

CHAPTER1 INTRODUCTION TheSingularValueDecomposition(SVD)iscalledthe\swiss armyknife"ofmatrix decompositions.ApplicationsoftheSVDtendtobevariedfr omLatentSemanticIndexing (LSI)tocollaborativeltering.TheSVDisalsooneofthemo stexpensivedecompositions, bothintermsofcomputationandmemoryusage.Wepresentec ientalgorithmsfor ndingthesingularvaluedecompositionfortwospecialcla ssesofmatrices:sparse matricesandbandmatrices. Thischapterintroducesthesingularvaluedecompositioni nSection 1.1 .Section 1.1.1 describeshowtocomputetheSVDandtheorthogonaltransfor mationsusedincomputing theSVD.Section 1.2 considersthespecialproblemswithbandandsparseSVD. 1.1SingularValueDecomposition TheSingularvaluedecompositionofamatrix A 2R m n isdenedas A = U V T where U 2R m m and V 2R n n and U and V areorthogonalmatrices. 2R m n isadiagonalmatrixwithrealpositiveentries.Thisformul ationoftheSVDisgenerally referredtoasthe full SVD[ GolubandvanLoan1996 ; TrefethenandBauIII1997 ]. ThemorecommonlyusedformofSVDis A 2R m n and m n isdenedas A = ^ U ^ V T where ^ U 2R m n and V 2R n n and ^ U and V haveorthonormalcolumns. ^ 2R n n isa diagonalmatrixwithrealpositiveentries.Thisisgeneral lyreferredtoasthe thin SVDor the economy SVD[ TrefethenandBauIII1997 ].Thediscussioninthisdissertationapplies toboththe thin and full SVD.WewillusethetermSVDtorefertobothofthemand dierentiatewhereverrequired. 12

PAGE 13

1.1.1ComputingtheSingularValueDecomposition ThemostgeneralmethodtocomputetheSVDof A willbeasfollows[ GolubandvanLoan 1996 ; TrefethenandBauIII1997 ] 1.Computethe QR factorizationofthematrix A .( A = QR ) 2.Reducetheuppertriangularmatrix R toabidiagonalmatrix B usingorthogonal transformations.( R = U 1 BV 1 ) 3.Reducethebidiagonalmatrix B toadiagonalmatrixusinganiterative method.( B = U 2 V 2 ) ThecomputationoftheSVDcanthenberepresentedas A = QR = Q ( U 1 BV 1 ) = Q ( U 1 ( U 2 V 2 ) V 1 ) =( QU 1 U 2 )( V 2 V 1 ) = U V Thereareseveralvariationstothesesteps.The QR factorizationcanbeskippedentirely if A isuppertriangular.If A issparsewecanapplyallreducingordering[ Davisetal. 2004a b ]beforethe QR factorizationtominimizethenumberofnon-zerosin R orusea prolereductionorderingof R [ CuthillandMcKee1969 ; Hager2002 ; Gibbsetal.1976 ; ReidandScott2002 1998 ].SomeofthedierentchoicesforcomputingtheSVDare discussedinSection 2.2 1.1.2OrthogonalTransformations Whenreducingtheuppertriangular R toabidiagonalmatrixinstep2ofSection 1.1.1 weuseorthogonaltransformationssuchastheGivensrotati onsorHouseholdertransformations. Thesetransformationspreservethesingularvaluesof R .Householder-rerectionbased bidiagonalizationwasrstsuggestedin[ GolubandKahan1965 ].Theyreducethe 13

PAGE 14

sub-diagonalpartofanentirecolumninadensematrixtozer owithoneHouseholder transformation. InthecaseofbandmatricesbothGivensrotations[ Schwarz1968 ; Rutishauser1963 ; Kaufman2000 ]andHouseholdertransformations[ Bischofetal.2000b ; MurataandHorikoshi 1975 ]havebeenusedinthepasttoreduceittoabidiagonalmatrix .WhileHouseholder transformationsoperateonentirecolumnsatatimewewould liketozeroentries selectivelyinasparseorbandbidiagonalizationstep,sow ewilluseGivensrotations inboththecases. 1.2SingularValueDecompositionofSparseandBandMatrice s TheSVDofsparseandbandmatricesposesvariouschallenges .Someofthese problemsarealsofacedbymanyothernumericalalgorithmsa ndsomeproblemsare uniquetothisproblem.Wediscusstheseissueshere. Datastructure .Weassumethatthebandmatricesarestoredinapackedcolum n bandformati.eif A 2R m n andthelowerbandwidthis l andtheupperbandwidthis u thenwestore A ina( l + u +1) n matrix B suchthat A ( i;j )= B ( i j + u +1 ;j ) [ GolubandvanLoan1996 ].Besidestheadvantageofreducedspace,thiscolumnbased datastructure,enablestheuseofcolumnbasedalgorithmsw hichcanthenbeadapted tothesparsebidiagonalreduction.Sparsematricesareusu allystoredinacompressed columnformatwithonlythenon-zeroesstored.However,wen eedmorespacethanthat forsparsebidiagonalreductionalgorithm.Wepresenttwop roledatastructuresforthe storingasparsematrixinordertocomputetheSVDeciently Fill-in .Almostallsparsematrixalgorithmshavesomewaystohandl ell-in:trying toreducethell-inbyreorderingthematrixordoingasymbo licfactorizationtond theexactll-inetc.Theformerensuresthattheamountofwo rkandmemoryusageis reduced.Thelaterensuresastaticdatastructurewhichlea dstopredictablememory accessesinturnimprovingtheperformance.WeapplyourGiv ensrotationsinapipelined schemeandavoidanyll-ininthebandreductionalgorithms .SeeChapter 3 formore 14

PAGE 15

details.Wecannotavoidthell-ininthesparsecase,butwe trytominimizethell-inby choosingtoreducethecolumnsthatgeneratetheleastamoun tofllrst.SeeChapter 5 formoredetails.ItisbettertoavoidtheHouseholdertrans formationsinthesparsecase becauseofthecatastrophiclltheymightgenerate. FLOPS .FloatingpointoperationsorFLOPSmeasurestheamountofw ork performedbyanumericalalgorithm.Generally,anincrease inthell-inleadstomore roatingpointoperations.Ourbandreductionalgorithmsdo nomoreworkthanasimple non-blockedbandreductionalgorithmduetoll.However,t heydomoreroatingpoint operationsthanthenon-blockedalgorithmsduetotheorder inwechoosetodothe reduction.Theincreaseintheroatingpointoperationsisd uetoblocking,butthe advantageofblockingoverwhelmsthesmallincreaseinFLOP S.Wealsoshowhowto accumulatetheleftandrightsingularvectorsbyoperating onjustthenon-zeroesthereby reducingthenumberofroatingpointoperationsrequired. Thesparsecase,usesadynamicblockingschemeforndingab lockofplanerotations thatcanbeappliedtogether.Thealgorithmforsparsebidia gonalreductiontriestodo theleastamountofwork,butthereisnoguaranteefordoingt heleastamountofFLOPS forsparsematrices.However,givenabandmatrixthesparse algorithmwilldothesame amountofFLOPSastheblockedbandreductionalgorithm. Memoryaccess .Performanceofanyalgorithmgetsimpactedbythenumber oftimestheyaccessthememoryandtheorderinwhichtheyacc essthememory. ThedevelopmentoftheBasicLinearAlgebraSubroutines(BL AS)[ Blackfordetal. 2002 ; Dongarraetal.1990 ]andalgorithmsthatusetheBLASmainlyfocusonthis aspect.Unfortunately,generatingGivensrotationsandap plyingthemareBLAS-1style operationsleadingtopoorperformance.Blockedalgorithm stendtoworkaroundthis problem.Wepresentblockedalgorithmsforboththesparsea ndbandbidiagonalization cases.Thesymbolicfactorizationalgorithmforbidiagona lreductionofsparsematrices helpsusallocateastaticdatastructureforthenumericalr eductionstep. 15

PAGE 16

Software .Availabilityofusablesoftwarefortheselinearalgebrai calgorithmsplay acrucialroleingettingthealgorithmsadapted.Wehavedev elopedsoftwareforband reductionandforcomputingtheSVDofasparsematrix. Inshort,wehavedevelopedalgorithmsforblockedbidiagon alreductionofband matricesandsparsematricesthattakeintoconsiderationm emoryusage,operationcount, caching,symbolicfactorizationanddatastructureissues .Wealsopresentrobustsoftware implementingthesealgorithms. ModiedversionsofChapters 3 and 5 ,whichdiscussblockedalgorithmsforband reductionandsparseSVDrespectively,willbesubmittedto ACMTransactionson MathematicalSoftwareastwotheorypapers[ RajamanickamandDavis2009a ].The softwareforbandreduction(discussedinChapter 4 andin[ RajamanickamandDavis 2009b ])andthesoftwareforsparsebidiagonalreductionwillbes ubmittedtoACM TransactionsonMathematicalSoftwareasCollectedAlgori thmsoftheACM. 16

PAGE 17

CHAPTER2 LITERATUREREVIEW Wesummarizepastresearchrelatedtosingularvaluedecomp osition,bandreduction, sparsesingularvaluedecompositioninthischapter.Thepr oblemofbandreduction, especiallywithrespecttothesymmetricreductiontotridi agonalmatrixcase,hasbeen studiedfornearly40years.Section 2.1 presentsthepastworkinbandreduction.The singularvaluedecompositionproblemfordensematricesha sbeenstudiedforseveral yearsandvariousiterativemethodsareknownforndingthe sparsesingularvalue decompositiontoo.ThesearedescribedinSection 2.2 2.1GivensRotationsandBandReduction Givensrotations(orplanerotationsasitiscallednow)and Jacobirotations[ Schwarz 1968 ; Rutishauser1963 ]areusedinthesymmetriceigenvalueproblemofbandmatric es, especiallytoreducethematrixtothebidiagonalform.TheG ivensrotationsthemselves, intheirpresentunitaryform,aredenedin[ Givens1958 ].Ittakesasquarerootand 5roatingpointoperations(rops)tocomputetheGivensrota tionsandtwovaluesto savethem. Stewart [ 1976 ]showedhowwecansavejustonevalue,theangleofthe rotation,andsavesomespace.Thecomputationofthesquare rootisoneoftheexpensive operationsinmoderncomputers.SquarerootfreeGivensrot ationsarediscussedin [ Hammarling1974 ; Gentleman1973 ; AndaandPark1994 ]. Bindeletal. [ 2002 ]showed thatitwaspossibletoaccuratelyndtheGivensrotationwi thoutoverroworunderrow. Rutishauser [ 1963 ]gavetherstalgorithmforbandreductionusingapentadia gonal matrix.Thisalgorithmremovesthellcausedbyaplanerota tion[ Givens1958 ] immediatelywithanotherrotation.Schwarzgeneralizedth ismethodfortridiagonalizing asymmetricbandmatrix[ Schwarz1963 1968 ].Bothalgorithmsuseplanerotations(or Jacobirotations)tozeroonenon-zeroentryofthebandresu ltinginascalarll.The algorithmsthenusenewplanerotationstoreducethell,cr eatingnewllfurtherdown 17

PAGE 18

(c) xxxxx xx x xxxxxx xx xxxxxx xx xxxxxx xx xxxx1 x xxxxxxx xxxxxx xx xxxx xx x x xxxxxx xxxxx xxxx xxx xxxxxx xxxxxxxx x xxxxx xx x xxxxxx xx xxxxxx xx xxxxxx xx xx420 3 xxxxxxx xxxxxx xx xxxx xx x x xxxxxx xxxxx xxxx xxx xxxxxx xxxxxxxx x xxxxx x3 x xxxxxx x4 xxxxxx xx xxxxxx xx xxxx0 x xxxxxx2 xxxxxx xx xxxx xx x x xxxxxx xxxxx xxxx xxx xxxxxx xxxxxxxx x (a) (b) Figure2-1.Schwarz'smethodsforbandreductionthematrix,untilthereisnoll.Thisprocessofmovingthe lldownthematrixiscalled chasingthell. WegiveabriefideaofSchwarz'sgeneralizations[ Schwarz1963 1968 ]ofRutishauser's method[ Rutishauser1963 ]asourblockedmethodforbandreductionusesamixed approachofthesetwomethods.Werefertoabandmatrixwith l subdiagonalsand u superdiagonalsasan( l;u ) -bandmatrix .AstepinSchwarz'salgorithmsisreducingan entryinthebandandchasingtheresultantlldownthematri x.Figure 2-1 (a)showsthe rststepinthereductionofa14-by-14(2 ; 5)-bandmatrixwhenusingSchwarz'smethods. Therstplanerotationtriestoreducetheentrymarkedwith 1.Thiscausesallinthe thirdsubdiagonal(markedas ).Themethodsrotaterows7and8andcolumns12and13 tocompletethechase.However,Schwarz'smethodsdierinw hetheritreducesanentire rowtobidiagonalrst([ Schwarz1968 ])oritreducesonediagonalatatime([ Schwarz 1963 ])inthesubsequentsteps.Wewillcalltheformertherowred uctionmethodandthe laterthediagonalreductionmethod.Figure 2-1 (b)-(c)showsthedierencebetweenthe twomethods.Insteps2 ::: 4,therowreductionmethodwillreducetheentriesmarked 2 ::: 4inFigure 2-1 (b).Insteps2 ::: 4,thediagonalreductionmethodwillreducethe entriesmarked2 ::: 4inFigure 2-1 (c). Givenasymmetric( b;b )-bandmatrix A oftheorder n ,thetwoalgorithmsofSchwarz areshowninAlgorithms 2.1 and 2.2 .Schwarz'sbandreductionalgorithmsarenot 18

PAGE 19

blockedalgorithms. Rutishauser [ 1963 ]modiedtheGivensmethod[ Givens1958 ]for reducingafullsymmetricmatrixtotridiagonalformtouseb locking.Thestructureof theblock,aparallelogram,willbethesameforthemodiedG ivensmethodandour algorithm,butwewillallowthenumberofcolumnsandnumber ofrowsintheblockto bedierent.Furthermore,[ Rutishauser1963 ]ledtoabulge-likellwhichisexpensiveto chase,whereasourmethodresultsinlessintermediatellw hichtakesmuchlessworkto chase. Algorithm2.1 Schwarz'srowreductionalgorithmforsymmetricbandreduc tion for i suchthat1 i n 2 do for j = MIN ( i + b;n ) 1: 1: i +1 do Rotatecolumns j and j +1toreduce A [ i;j +1]. Chasethell. endfor endfor Algorithm2.2 Schwarz'sdiagonalreductionalgorithmforsymmetricband reduction for j = b 1: 1:1 do for i =1: n ( j +1) do Rotatecolumns i + j and i + j +1toreduce A [ i;i + j +1]. Chasethell. endfor endfor Kaufman [ 1984 ]suggestedanapproachtoreduceandchasemorethanoneentr y atatimeusingvectoroperations.Thealgorithmusesplaner otationsandreducesone diagonalatatime.ThisisthemethodintheLAPACK[ Andersonetal.1999 ]library forbidiagonalandtridiagonalreduction(xGBBRDandxSBTR Droutines).Thelatest symmetricreductionroutinesinLAPACKusearevisedalgori thmbyKaufman[ Kaufman 2000 ].ThoughKaufman'salgorithmsreducemorethanoneentryat atimeitisnota blockedalgorithmaseachoftheentrieshastobeseparatedb yadistance. Rutishauser [ 1963 ]alsosuggestedusingHouseholdertransformationsforban dreductionandchasing thetriangularll. MurataandHorikoshi [ 1975 ]alsousedHouseholdertransformations tointroducethezeroesinthebandbutchosetochaseonlypar tofthell.Thisidea 19

PAGE 20

reducedtheropsrequired,butincreasedtheworkspace.Rec ently, Lang [ 1993 ]usedthis ideaforparallelbandreduction.Theideaisalsousedinthe SBRtoolbox[ Bischofetal. 2000b ].However,theyuseQRfactorizationofthesubdiagonalsan dstoreitas WY transformations[ VanLoanandBischof1987 ]sothattheycantakeadvantageoflevel 3styleBLASoperations.TheSBRframeworkcanchoosetoopti mizeforroatingpoint operations,availableworkspaceandusingtheBLAS. Inordertocomputethesingularvectorswecanstartwithani dentitymatrix andapplytheplanerotationstoit.Therearetwotypesofmet hodstodothis:we applytheplanerotationswhentheyareappliedtotheorigin albandmatrix(forward accumulation)orwesavealltherotationsandapplyittothe identitymatriceslater (backwardaccumulation)[ Smithetal.1976 ].Eventhoughsimpleforwardaccumulation requiresmoreroatingpointoperationsthanbackwardaccum ulation, Kaufman [ 1984 ] showedthatwecandoforwardaccumulationecientlybyexpl oitingthenonzeropattern whenapplyingtherotationstotheidentity.Kaufman'saccu mulationalgorithmwilldo thesamenumberofroatingpointoperationsasbackwardaccu mulation.Thenonzero patternof U or V willbedependentupontheorderinwhichwereducetheentrie sinthe originalbandmatrix.Theaccumulationalgorithmexploits thenonzeropatternof U or V whileusingourblockedreduction. 2.2MethodstoComputetheSingularValueDecomposition GolubandKahan [ 1965 ]introducedthetwostepapproachtocomputethesingular valuedecomposition.InordertocomputetheSVDof A theyreduce A toabidiagonal matrixandthenreducethebidiagonalmatrixtoadiagonalma trix(thesingularvalues). GolubandReinsch [ 1970 ]introducedthe QR algorithmtocomputethesecondstep. DemmelandKahan [ 1990 ]showedthatthesingularvaluesofthebidiagonalmatrixca nbe computedaccurately.Thetwostepmethodwasmodiedtoathr eestepmethoddescribed inSection 1.1.1 byChan[ Chan1982 ].ThebidiagonalizationwasprecededbyaQR factorizationstepfollowedbythebidiagonalizationofR. Thismethodleadstolesswork 20

PAGE 21

whenever m 5 n= 3.[ TrefethenandBauIII1997 ]suggestadynamicapproachwhereone startswithbidiagonalizationrstandswitchesto QR factorizationwhenappropriate.We willusethethreestepapproachforcomputingtheSVD. SparseSVD .IterativemethodsliketheLanczosorsubspaceiterationb ased algorithmsinSVDPACK[ Berry1992 ]andPROPACK[ Larsen1998 ]areavailablefor ndingthesparseSVD.Themainadvantageoftheseiterative methodsistheirspeed whenonlyafewsingularvaluesarerequired.However,wehav etoguesstherequired numberofsingularvalueswhenweusetheiterativealgorith ms.Thereisnoknowndirect methodtocomputeallthesingularvaluesofasparsematrix. Othermethodsandapplications .TheSingularvaluedecompositionisoneof thefactorizationsthathasitsusesinadiversegroupofapp lications.LatentSemantic IndexingbasedapplicationsusetheSVDoflargesparsematr ices[ Berryetal.1994 ; Deerwesteretal.1990 ].TheSVDisalsousedincollaborativeltering[ Pryor1998 ; Paterek2007 ; Goldbergetal.2001 ; BillsusandPazzani1998 ]andrecommendersystems [ Sarwaretal.2002 ; Koren2008 ].Sparsegraphminingapplicationsthatneedtond patternscandosoecientlywiththeSVD.CompactMatrixDec omposition(CMD) [ Sunetal.2008 ]ortheapproximateversions[ Drineasetal.2004 ]dobetterthanSVD now.Otherapplicationsincludegeneexpressionanalysis[ Walletal.2003 ],ndingthe pseudoinverseofamatrix[ GolubandKahan1965 ]andsolvingthelinearleastsquares problem[ GolubandReinsch1970 ].Thereareotheralternativemethodstocompute thebidiagonalreduction.Onesidedbidiagonalreductions arediscussedin[ Ralha2003 ; Barlowetal.2005 ; BosnerandBarlow2007 ]andthesemidiscretedecomposition(SDD) [ KoldaandO'leary1998 ]wasproposedasanalternativetoSVDitself. 21

PAGE 22

CHAPTER3 BANDBIDIAGONALIZATIONUSINGGIVENSROTATIONS 3.1Overview Eigenvaluecomputationsofsymmetricbandmatricesdepend onareductionto tridiagonalform[ GolubandvanLoan1996 ].Givenasymmetricbandmatrix A ofsize n -byn with l subdiagonalsandsuperdiagonals,thetridiagonalreducti oncanbewritten as U T AU = T (3{1) where T isatridiagonalmatrixand U isorthogonal.Singularvaluedecompositionofan m -byn unsymmetricbandmatrix A with l subdiagonalsand u superdiagonalsdepends onareductiontobidiagonalform.Wecanwritethebidiagona lreductionof A as U T AV = B (3{2) where B isabidiagonalmatrixand U and V areorthogonal. Someofthelimitationsofexistingbandreductionmethods[ Bischofetal.2000a ; Kaufman1984 ]canbesummarizedasfollows: theyleadtopoorcacheaccessandtheyuseexpensivenon-str ide-oneaccessrow operationsduringthechase(or) theyusemorememoryand/ormoreworktotakeadvantageofcac hefriendlyBLAS3 stylealgorithms. Inthischapter,wedescribeanalgorithmthatusesplanerot ations,blocksthe rotationsforecientuseofcache,doesnotgeneratemorel lthannonblocking algorithms,performsonlyslightlymoreworkthanthescala rversions,andavoids non-stride-oneaccessasmuchaspossiblewhenapplyingthe blockedrotations. Section 3.2 introducestheblockingidea.Blockingcombinedwithpipel inedrotations helpsusachieveminimalll,asdiscussedinSection 3.3 .Weintroducethemethodto accumulatetheplanerotationsecientlyinSection 3.4 .Theperformanceresultsforthe newalgorithmsareinSection 3.5 22

PAGE 23

D block S blockC blockR blockF block (c) xxxxx xx x xxxxxx xx xxxxxx xx xxxxxx xx xxx nn n xxxx nnn xxxxxx xx xxxx xx x x xxxxxx xxxxx xxxx xxx xxxxxx xxxxxxxx x xxxxx xx x xxxxxx xx xxx nn n xxxx nnn xxxx xx x x xxxxxx xxxxx xxxxxx x xxxxx x xx xxxx x x xx xxxx xxx xxxxxx x x xxxxxx xx xxxx n nn x xxxx nn x n xxxxxx xx xxxxxx xx xxx00 0 xxxx000 xxxxxx xx xxxx xx x x xxxxxx xxxxx xxxx xxx xxxxxx xxxxxxxx x (a) (b) Figure3-1.Firsttwoiterationsofblockedbandreduction. Werefertoaplanerotationbetweentwocolumns(rows)ascol umn(row)rotations. Weuseanunsymmetricbandmatrixasanexampleeventhoughth eoriginalalgorithms areforthesymmetricbandcaseasouralgorithmappliestobo thsymmetricand unsymmetricmatrices.Theterm k -lowerHessenbergmatrix referstoamatrixwith k superdiagonalsandthelowertriangularpart. 3.2BlockingforBandReduction OurblockingschemecombinesSchwarz'srowreductionmetho d[ Schwarz1968 ]and Rutishauser'smodiedGivensmethod[ Rutishauser1963 ].A step inourblockedreduction algorithmreducesan r -byc blockofentriesandchasestheresultantll.Ateachstep,w e choosean r -byc parallelogramasourblocksuchthatitsreductionandthech aseavoids thebulge-likellwhencombinedwithourpipeliningscheme (describedinSection 3.3 ). Therst c entriesreducedbytherowreductionmethodandtherst r entriesreduced bythediagonalreductionmethodaretheoutermostentrieso ftheparallelogramforthe rststepintheblockedreduction.Figure 3-1 (a)showstheselectedblockfortherststep whentheblockisofsize2-by-3. Ateachstep,onceweselecttheblock,wepartitiontheentri esinthebandthat changeinthecurrentstepintoblocks.Theblockscanbeoneo fvetypes: 23

PAGE 24

S-block TheS( seed )blockconsistsofthe r -byc parallelogramtobereducedandthe entriesinthesame r rowsastheparallelogramtowhichtherotationsareapplied TheS-blockintheupper(lower)triangularpartisreducedw ithcolumn(row) rotations. C-block TheC( columnrotation )blockhastheentriesthataremodiedonlybycolumn rotationsinthecurrentstep. D-block TheD( diagonal )blockconsistsoftheentriesthataremodiedbybothcolum n androwrotations(inthatorder)inthecurrentstep.Applyi ngcolumnrotations inthisblockcausesllwhicharechasedbynewrowrotations .Thisisatmosta ( c + r )-by-( c + r )block. R-block TheR( rowrotation )blockhastheentriesthataremodiedonlybyrow rotationsinthecurrentstep. F-block TheF( ll )blockconsistsoftheentriesthataremodiedbybothrowan d columnrotations(inthatorder)inthecurrentstep.Applyi ngrowrotationsin thisblockcausesllwhichischasedbynewcolumnrotations .Thisisatmosta ( c + r )-by-( c + r )block. Generallythepartitioningresultsinoneseedblockandpot entiallymorethanone oftheotherblocks.Theblocksotherthantheseedblockrepe atinthesameorder.The orderoftheblocksisC,D,RandF-blockwhentheseedblockis intheuppertriangular part.Figure 3-1 (b)showsallthedierentblocksfortherststep.Thegure showsmore thanonell,butnomorethanonellispresentatthesametim e.Thellislimitedto asinglescalar.Thepipelining(explainedinSection 3.3 )helpsusrestrictthell.Asa result,ourmethodusesverylittleworkspaceotherthanthe matrixbeingreducedandan r -byc datastructuretoholdthecoecientsforthependingrotati onsbeingappliedina singlesweep. Oncewecompletethechaseinalltheblocks,subsequentstep sreducethenext r -byc entriesinthesamesetof r rowsaslongastherearemorethan r diagonalsleftinthem. 24

PAGE 25

Thisresultsinthetwopassalgorithmwherewereducetheban dwidthto r rstandthen reducethethinbandtoabidiagonalmatrix.Figure 3-1 (c)highlightstheentriesforthe secondstepwith n .ThetwopassbandreductionisgiveninAlgorithm 3.1 .Thewhile loopinLine 1 reducesthebandwidthto r andistakenatmosttwice. Algorithm3.1 Twopassalgorithmforbandreduction 1: while lowerbandwidth> 0or upperbandwidth> 1 do 2: for eachsetofrrows/columns do 3: reducethebandinlowertriangularpart 4: reducethebandinuppertriangularpart 5: endfor 6: set c to r 1 7: set r to1. 8: endwhile Therearetwoexceptionswithrespecttothethenumberofdia gonalsleft.The numberofdiagonalsleftinthelowertriangularpartis r 1insteadof r ,when r isone(to avoidtheseconditerationofthewhileloopintheabovealgo rithm)andwhentheinput matrixisanunsymmetric( l; 1)-bandmatrix(formorepredictableaccumulationofthe planerotations).SeeSection 3.4 formoredetailsonthelater. Thealgorithmsofarassumesthenumberofrowsintheblockfo rtheupper triangularpartandthenumberofcolumnsintheblockforthe lowertriangularpart arethesame.Wereducetheupperandlowertriangularpartse paratelyifthatisnot thecase.Algorithm 3.2 reducesthebandintheuppertriangularpart.Thealgorithm forreducingthelowertriangularpartissimilartotheonef ortheuppertriangularpart exceptthattheorderoftheblocksinthechasingisdierent .Intheinnerwhileloopthe rotationsareappliedtotheR,F,CandD-blocks,inthatorde r. 3.3PipeliningtheGivensRotations Thealgorithmforthereductionoftheuppertriangularpart inSection 3.2 relies onndingandapplyingtheplanerotationsontheblocks.Our pipeliningschemedoes notcausetriangularll-inandavoidsnon-stride-oneacce ssofthematrix.Thegeneral ideaistondandstore r -byc rotationsintheworkspaceandapplythemaseciently 25

PAGE 26

Algorithm3.2 Algorithmforbandreductionintheuppertriangularpart 1: for eachsetof c columnsincurrentsetofrows do 2: Findthetrapezoidalsetofentriestozerointhisiteration 3: Dividethematrixintoblocks 4: Findcolumnrotationstozero r -byc entriesintheS-blockandapplytheminthe S-block 5: while columnrotations do 6: ApplycolumnrotationstotheC-block 7: ApplycolumnrotationstotheD-block,Findrowrotationsto chasell 8: ApplyrowrotationstotheR-block 9: ApplyrowrotationstotheF-block,Findcolumnrotationsto chasell 10: Readjustthefourblockstocontinuethechase. 11: endwhile 12: endfor aspossibleineachoftheblockstoextractmaximumperforma nce.Figures 3-2 3-6 inthissectionshowtherstveblocksfromFigure 3-1 (b)andtheoperationsinallof themaresequentiallynumberedfrom1 :: 96.Asolidarcbetweentwoentriesrepresents thegenerationofarow(orcolumn)rotationbetweenadjacen trows(orcolumns)ofthe matrixtoannihilateoneofthetwoentriesinalltheguresi nthissection.Adottedarc representstheapplicationoftheserotationstootherpair sofentriesinthesamepairof rows(orcolumns)ofthematrix.3.3.1SeedBlock The r -byc entriestobereducedinthecurrentstepareintheS-block.T henumber ofrowsinthisblockis r andthenumberofcolumnsisatmost c + r .Wendnewcolumn rotationstoreducetheentriesonerowatatime.Therotatio nsgeneratedfromtheentries inthesamerow(column)intheupper(lower)triangularpart arecalledawave.Given arowintheseedblock,weapplyrotationsfrompreviousrows beforendingthenew rotationstoreducetheentriesinthecurrentrow.Thisisil lustratedintheFigure 3-2 .We reducethethreeentries(markedas n )intherstrowwithcolumnrotations G 1 :::G 3 andsavetherotationsinworkspace.Thiswaveofrotationsi salsoappliedtothesecond rowasoperations4 ::: 6.Wethenreducethethreeentriesinthesecondrowwithnew setofrotations G 4 :::G 6 andsavetherotations.Thestepstogenerateandapplythe 26

PAGE 27

000 x 4 5 6 xxxxx G 3 G 2 G 1 000 xxx n n n G 4 G 5 G 6 x n n nxxxx x G 3 G 2 G 1 (b) (c) (a) Figure3-2.FindingandapplyingcolumnrotationsintheS-b lock (a) (b) G 4 G 5 G 6 xx x xx 16 18 20 xx x xx 17 19 21 xx x xx xx x xx 10 12 14 11 13 15 G 1 G 2 G 3 Figure3-3.ApplyingcolumnrotationstotheC-block.rotationsintheS-blockareshowninAlgorithm 3.3 .Findingandapplyingrotationsinthe S-blockistheonlyplaceinthealgorithmwhereweuseexplic itnon-stride-oneaccessto accesstheentriesacrosstherows. Algorithm3.3GeneratingandapplyingrotationsintheS-bl ock 1: for allrowsintheS-block do 2: if nottherstrow then 3: applycolumnrotationsfromthepreviousrows. 4: endif 5: Findthecolumnrotationstozeroentriesincurrentrow(ift hereareany) 6: endfor 3.3.2ColumnBlock Weapply r wavesofrotationstotheC-block.Eachwaveconsistsof c column rotations.WeapplyeachcolumnrotationtotheentireC-blo ckoneatatime(andone waveatatime),asthisleadstoecientstride-oneaccessof memory,assumingthematrix isstoredincolumn-majororder. Acolumnrotationoperatesonpairsofentriesinadjacentco lumns(columns i 1and i ,say).Thenextcolumnrotationinthesamewaveoperatesonc olumns i 2and i 1, reusingthe i 1thcolumnwhichpresumablywouldremainincache.Finally, whenthis 27

PAGE 28

RG 1 RG 2 RG 3 G 1 G 2 G 3 22232425 26 272829 30 3132 33 xxx xxx x x xxx x xx x RG 2 RG 3 RG 1 xxx x x xxx xxxx xx x 36 37 38 3942 40 41 35 34 RG 4 RG 5 RG 6 G 4 G 5 G 6 xxx x x x xx xxx xxx x 48 53 454647 505152 5556 43 49 54 44 57 RG 5 RG 6 RG 4 xxx x x xxx xxxx xx x 62 63 6059 61 58 (b) (a) Figure3-4.Applyingcolumnrotationsandndingrowrotati onsintheD-block. rstwaveisdonewerepeatandapplyalltheremaining r 1wavestothecorresponding columnsoftheC-block.Thecolumnrotationsinanywave j +1useallbutonecolumn usedbythewave j leadingtopossiblecachereuse.Figure 3-3 (a)-(b)illustratestheorder inwhichweapplythetwowavesofrotations( G 1 :::G 3 and G 4 :::G 6 )tothe2rowsinthe C-block.Algorithm 3.4 describesthesestepstoapplytherotationsintheC-block. Algorithm3.4ApplyingrotationsintheC-block 1: for eachwaveofcolumnrotations do 2: for eachcolumninthiswaveofrotations do 3: Applytherotationforthecurrentcolumnfromcurrentwavet oalltherowsinthe C-block. 4: endfor 5: Shiftcolumnindicesfornextwave. 6: endfor 28

PAGE 29

xxx 666564 x RG 3 RG 2 RG 1 RG 4 RG 5 RG 6 x xxxxx 696867 Figure3-5.ApplyingrowrotationstoonecolumnintheR-blo ck. 3.3.3DiagonalBlock Weapply r waveseachwith c columnrotationstotheD-block.Eachcolumnrotation G i (saybetweencolumns k and k 1)cangenerateallinthisblock.Inordertorestrict thelltoascalar,wereducethellimmediatelywitharowro tation( RG i )beforewe applythenextcolumnrotation G i +1 (tocolumns k 1and k 2).Wesavetherow rotationafterreducingthellandcontinuewiththenextco lumnrotation G i +1 ,instead ofapplyingtherowrotationtotherestoftherowasthatwoul dleadtonon-stride-one access.Whenallthecolumnrotationsinthecurrentwaveare done,weapplythenew waveofrowrotationstoeachcolumnintheD-block. ThisalgorithmleadstotwosweepsontheentriesoftheD-blo ckforeachwaveof columnrotations:onefromrighttoleftwhenweapplycolumn rotationsthatgenerate ll,ndrowrotations,reducethellandsavetherowrotati ons,andthenextfromleft torightwhenweapplythepipelinedrowrotationstoallthec olumnsintheD-block. Werepeatthetwosweepsforeachofthesubsequent r 1wavesofcolumnrotations. Figure 3-4 (a)showstheorderinwhichweapplythecolumnrotations G 1 :::G 3 ,ndthe correspondingrowrotationsandapplythemtoourexample.T heoperations22 ::: 42 showthesesteps.Operations22 ::: 33showtherighttoleftsweepand34 ::: 42show thelefttorightsweep.Figure 3-4 (b)showstheorderinwhichweapplyasecondwave ofrotations G 4 :::G 6toourexampleandhandlethellgeneratedbythem(numbere d 43 ::: 63).Algorithm 3.5 showsbothapplyingandndingrotationsintheD-block. 29

PAGE 30

Algorithm3.5ApplyingandndingrotationsintheD-block 1: for eachwaveofcolumnrotations do 2: for eachcolumninthiswaveofrotations(right-left) do 3: Applytherotationforthecurrentcolumnfromcurrentwavet oallthe rowsintheD-block,generatell. 4: Findrowrotationtoremovell. 5: Applyrowrotationtoremovell.(nottoentirerow) 6: endfor 7: for eachcolumninthiswaveofrotations(left-right)andaddit ionalcolumnsinthe D-blocktotherightofthesecolumns do 8: Applyallpendingrowrotationstocurrentcolumn. 9: endfor 10: Shiftcolumnindicesfornextwave. 11: endfor 3.3.4RowBlock Weapply r wavesofrotationstotheR-block.Eachwaveconsistsof c rowrotations. WecouldapplyeachrowrotationtotheentireR-blockoneata time(andonewaveata time),butthiswouldrequireaninecientnon-stride-onea ccessofmemory,assumingthe matrixisstoredincolumn-majororder. Instead,weapplyallthe c rowrotationstojusttherstcolumnoftheR-block,for therstwave.Arowrotationoperatesonadjacententriesin thecolumn(rows i 1and i say).Thenextrowrotationinthiswaveoperatesonrows i 2and i 1,reusingthecache fortheentryfromrow i 1.Finally,whenthisrstwaveisdonewerepeatandapplyall theremaining r 1wavestotherstcolumnoftheR-block,whichpresumablywo uld remainincacheforallthe r waves.Thisentireprocessisthenrepeatedforthesecondan d subsequentcolumnsoftheR-block.Thisorderofoperations leadstoecientcachereuse andpurelystride-oneaccessofthememoryintheR-block. Figure 3-5 showstwowavesofrowrotations RG 1 :::RG 3 and RG 4 :::RG 6 applied toonecolumnintheR-block.Theorderinwhichweapplythero tationstothecolumn isspeciedbynumbers64 ::: 69.Algorithm 3.6 givesthestepstoapplytherotationsin theR-block.ThereisnollintheR-block.Wedonotshowtheo perations70 ::: 75in 30

PAGE 31

RG 3 RG 1 RG 2 xxxx x x x x x x xx x x x 76 77 78 79 80 81 xxxx x xx x x x x x x x x 90 89 88 87 9293949596 838485 82 86 91 G 1 G 2 G 3 RG 3 RG 1 RG 2 Figure3-6.Applyingrowrotationsandndingcolumnrotati onsintheF-block thesecondcolumnoftheR-block(fromFigure 3-1 (b))intheFigure 3-5 .Buttherow rotationsareappliedinthesamepatternasintherstcolum n. Algorithm3.6 ApplyingrotationsintheR-block 1: for eachcolumnintheR-block do 2: for eachwaveofrowrotations do 3: Applyalltherotationsincurrentwavetoappropriaterowsi ncurrentcolumn. 4: Shiftrowindicesfornextwave. 5: endfor 6: endfor 3.3.5FillBlock Weapply r waveseachwith c rowrotationstotheF-block.Eachrowrotation RG p (betweenrows i and i 1say)cancreateallinthisblock.Wecouldadoptthesame approachasintheD-block,completingtherowrotation RG p inthisblocktogeneratethe llandusingacolumnrotationtoreducethellimmediately beforewedorowrotation RG p +1 ,torestrictthelltoascalar,butthatwouldrequireustoa ccessalltheentriesin arow. Instead,foreachcolumnintheF-blockweapplyalltherelev antrowrotationsfrom thecurrentwave,butjuststopshortofcreatinganyll.Not ethatweneednotapplyall c rowrotationstoallthecolumnsoftheF-block.Whenapplyin gtherowrotationstoa column,successiverowrotationsshareentriesbetweenthe mformaximumcachereuseas intheR-block.Thisconstitutesoneleft-to-rightsweepof alltheentriesintheF-block. 31

PAGE 32

Wethenaccessthecolumnsfromrighttoleftbyapplyingther owrotationsay RG p (betweenrows i and i 1say)thatgeneratesthellinthatcolumnoftheF-block(sa y k ),removethellimmediatelywithcolumnrotation G p andapplythecolumnrotationto restoftheentriesincolumns k and k 1intheF-block. Figure 3-6 showstheorderinwhichweapplytherowrotations(operatio ns76 ::: 81), generatell(operations82 ; 86 ; 91),andapplythecolumnrotations.Operations76 ::: 81 constitutetheleft-to-rightsweep.Operations82 ::: 96constitutetheright-to-leftsweep. Werepeatthisforallthe r 1pendingwavesofrotations.Algorithm 3.7 showsthetwo phasesinapplyingtherotationsintheF-block. Algorithm3.7 ApplyingrowrotationsandndingcolumnrotationsintheFblock 1: for eachwaveofrowrotations do 2: for eachcolumnintheF-block(left-right) do 3: Applyallrotationsfromcurrentwave,exceptthosethatgen eratell-ininthe currentcolumn,tocurrentcolumn. 4: endfor 5: for eachcolumnintheF-block(right-left) do 6: Applythependingrotationthatcreatesll. 7: Findcolumnrotationtoremovell. 8: Applycolumnrotationtoentirecolumn. 9: endfor 10: Shiftrowindicesfornextwave. 11: endfor Tosummarize,theonlynon-stride-oneaccesswedointheent irealgorithmisinthe seedblockwhenwetrytondtherotations.Wegeneratenomor ellthantheanyof thescalaralgorithmsbypipeliningourplanerotations.Th eamountofworkwedovaries basedontheblocksizewhichisexplainedinthenextsubsect ion. 3.3.6FloatingPointOperationsforBandReduction Forthepurposesofcomputingtheexactnumberofroatingpoi ntoperations,we assumethatndingaplanerotationtakessixoperations,as intherealcasewithno scalingoftheinputparameters(twomultiplications,twod ivides,oneadditionanda 32

PAGE 33

squareroot).Wealsoassumethatapplyingtheplanerotatio nstotwoentriesrequiressix roatingpointoperationsasintherealcase(fourmultiplic ationsandtwoadditions). Letusconsiderthecaseofreducinga( u;u )-symmetricbandmatrixoforder n to tridiagonalformbyoperatingonlyontheuppertriangularp artofthesymmetricmatrix. Thenumberofroatingpointoperationsrequiredtoreduceth esymmetricmatrixusing Schwarz'sdiagonalreductionmethodis f d =6 b X d =3 n X i = d d +2 max( i d; 0) d 1 ( d +1)+rem max( i d; 0) d 1 +2(3{3) whereremistheremainderfunctionand b = u +1.Thenumberofroatingpoint operationsrequiredtoreducethesymmetricmatrixusingSc hwarz'srowreductionmethod is f r =6 n 2 X i =1 min ( b;n i +1) X d =3 d +2 max( i d; 0) b 1 ( b +1)+rem max( i d; 0) b 1 +2(3{4) Inordertondthenumberofofroatingpointoperationsforo urblockedreduction, whenblocksizeis r -byc wenotethatweperformthesamenumberofoperationsas usingatwostepSchwarz'srowreductionmethodwherewerst reducethematrixto r diagonalsusingSchwarz'srowreductionmethodandthenred ucethe r diagonalsusing Schwarz'srowreductionagain.Thenwecanobtaintheroatin gpointoperationsrequired bytheblockedreductiontoreducethematrixbyusingtheequ ation 3{4 for f r f b 1 = n r 1 X i =1 min ( b;n i +1) X d = r +2 d +2 max( i d; 0) b 1 ( b +1)+rem max( i d; 0) b 1 +2(3{5) f b 2 = n 2 X i =1 min ( r +1 ;n i +1) X d =3 d +2 max( i d; 0) r ( r +2)+rem max( i d; 0) r +2(3{6) f b =6( f b 1 + f b 2 )(3{7) 33

PAGE 34

XX XXXXXXX XXXXXXX XXXXXXX XXXXXXX XXXXX XXXXXXX XXXXXXX XXXXXXX XXXXXXX XXXXXX XXXX XXX XX X X5321 X4 X XXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXX 54321X 5432X1 543X2 54X3 5X4 X5 X XXXXXX XX XXXXXX XX XXX XXXX XXXXX XXXXXX XXXXX XXXX XXX XX XXXXXX XXX XXX (a) (b) (c) Figure3-7.Updateof U forsymmetric A with r =1 Asymptotically,allthreemethodsresultinthesameamount ofwork.Butthework diersbyaconstantfactor.Allthreemethodsrequirethesa meamountofworkwhen u< 3.Givena1000-by-1000matrixand u =3,Schwarz'srowreductionmethodcan computethebidiagonalreductionwith10%fewerroatingpoi ntoperations(rops)than thediagonalreductionmethod.Dependingontheblocksize, theblockedmethoddoes thesamenumberofroatingpointoperationsasoneoftheothe rtwomethods.When weincreasethebandwidthto u =150forthesamematrix,therowreductionmethod resultsin10%fewerropsthanthediagonalreductionmethod and3%fewerropsthan theblockedreductionmethod(withdefaultblocksize).Whe ntheinputmatrixisdense ( u =999),therowreductionmethodperforms25%fewerropsthan thediagonal reductionmethodand5%fewerropsthantheblockedmethod(w ithdefaultblocksize). Therowreductionmethodinthisexampleleadstotheleastro ps,buttheblocked reductionisonly3-5%worseintermsoftherops.Weshowthat thiscanbecompensated bytheperformanceadvantageofblocking. 3.4AccumulatingthePlaneRotations Ourblockedapproachtothebandreductionleadstoadieren tnon-zeropattern in U and V thantheonedescribedby Kaufman [ 2000 ].Inthissection,wedescribethe 34

PAGE 35

XXXX321 XXXX654 XXXXXXX XXXXX XXXX XXXXXXX XXXXXXX XXXXXXX XXXXXX XXXX XXX XX XXXX987 XXX XXXX X X X X X XXXXXXXXXX XXXXXXXXXXX XXXXX XXXX X 987X 321X47 32X147 3X258 X369 654X7 X X X XXXX XXXXX XXXXXX XXXXX XXXX XXXXXX X XXXX XXXXX XXXXXX XXXXXX XXXXX XXXX Figure3-8.Updateof U forsymmetric A with r =3 llpatternin U and V forourblockedreductionmethod.Theaccumulationalgorit hm usesthefactthatapplyingplanerotationsbetweenanytwoc olumnsin U resultsinthe non-zeropatternofboththecolumnstobeequaltotheuniono ftheoriginalnon-zero patternofthecolumns.Givenanidentitymatrix U tostartwithandtheorderofthe planerotations,asdictatedbytheblockedreduction,wene edtotracktherstandlast nonzeroentriesinanycolumnof U toapplytherotationsonlytothenon-zeros.Itiseasy todothisfor U and V ifwehavetwoarraysofsize n ,butweuseconstantspacetodothe same.Inthissection,weignorethenumericalvaluesanduse onlythesymbolicstructure of U .Weomit V fromthediscussionhereafterasthellpatternin V willbethesameas thellpatternin U 3.4.1Structureof U AstepinourblockedreductionAlgorithm 3.1 reducesablockofentries.Wedene a majorstep asthereductionfromoneiterationofthe for loopinLine 2 ofAlgorithm 3.1 .A minorstep isdenedasthereductioninoneiterationofthe while loopinline 5 of Algorithm 3.2 .Inotherwords,amajorstepcaninvolvemanystepstoreduce onesetof r columnsand/orrowstoabandwidthof r .Astepcaninvolvemanyminorstepstochase thelltotheendofthematrix. 35

PAGE 36

3XXXXXXXX XXXXXXXXXXX XXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXX XXXXXXX XXXXXX XXXXX 1XXXXXXXXXX 2XXXXXXXXX 4XXXXXXX Xedcba X X443X2432X14321X X XXX X X X X XXXXXXXXXXXXXXXXX X 4X3 XXXXXXXXXXXXXXXXXXX X XXX X X X XX XXXX XXXXX XXXXXX X X (a) (b) (c) Figure3-9.Updateof U forunsymmetric A with r =1 Givenabandmatrix A anda r -byc block,thealgorithmtondthenonzeropattern in U diersbasedonfourpossiblecases A isasymmetricor A isanunsymmetric(0 ;u )-bandmatrix. A isanunsymmetric( l; 1)-bandmatrix. A isanunsymmetric( l;u )-bandmatrixwhere l> 0, u> 1and r =1. A isanunsymmetric( l;u )-bandmatrixwhere l> 0, u> 1and r> 1. Thestructureofthealgorithmisthesameinallfourcases.A pplyingtheplane rotationsfromtherstmajorstep,i.e,reductionofthers t r columnsand/orrowsof A ,totheidentitymatrix U leadstoaspecicnonzeropatternintheupperandlower triangularpartof U .Thisnonzeropatternin U thendictatesthepatternofthellin U whileapplyingtheplanerotationsfromthesubsequentmajo rsteps. Symmetric A :Weconsiderasymmetric( u;u )-bandmatrixandanunsymmetric (0 ;u )-bandmatrixassymbolicallyequivalentinthissectionas weoperateonlyonthe uppertriangularpartofthesymmetricmatrix.When A isasymmetric( u;u )-band matrixand r =1,ourreductionAlgorithm 3.1 reduceseachrowtobidiagonalform beforereducinganyentriesinthesubsequentrows.Whenred ucingtherstrowof A to bidiagonalform,the u 1planerotationsfromtherstminorstepappliedtocolumns 36

PAGE 37

2 :::u +1of U ,resultsinablockllin U wheretheblockisalowerHessenbergmatrix ofthetheorder u .Thellin U ,fromthenext u 1planerotations(fromthenext minorstepofthechasein A ),isinthesubsequent u columnsofUandhasthesamelower Hessenbergstructure.Whenweapplyalltheplanerotations fromtherstmajorstep inthereductionof A wegetablockdiagonal U ,whereeachblockislowerHessenberg oforder u .Forexample,considerreducingtherstrowofentriesmark ed1 ::: 5inthe Figure 3-7 (a).Figure 3-7 (b)showsthellpatternafterapplyingto U theplanerotations thatweregeneratedtoreducetheseentriesandchasetheres ultantll.Thellin U generatedfromtherstminorstepofthereductionof A ismarkedwiththecorresponding numberoftheentryfromFigure 3-7 (a)thatcausedthell.Forexample,llfromentry marked3inFigure 3-7 (a)ismarked3inFigure 3-7 (b).Thellin U fromthesubsequent minorstepsaremarked x Applyingtheplanerotationsfromreducingeveryotherrowo f A (andthecorresponding chase)resultsina u -by-( u 1)blockllforeachminorstepinthelowertriangularpart of U (untilitisfull)andanincreaseinonesuperdiagonal.Figu re 3-7 (c)showsthellin U causedbytheplanerotationsfromthesecondmajorstep.The llfromtherstminor stepismarkedwith andthellfromsubsequentminorstepsaremarkedwith .Note thatkeepingtrackofthellintheuppertriangularpartisr elativelyeasyasallplane rotationsthatreduceentriesfromrow k andthecorrespondingchasegeneratellinthe k-thsuperdiagonalof U .Thisllpatternin U isexactlysameasthellpatternin U if wereduced A withSchwarz'srowreductionmethod(Algorithm 2.1 ),asblockedreduction followsthesameorderofrotationswhen r isone. When r> 1eachblockintheblockdiagonal U ,afterapplyingtherotationsfromthe rstmajorstepto U ,isa( u r;r )-bandmatrixoftheorder u .Eachsubsequentminor stepgeneratesablockllin U thatconsistsofa( u r +1)-by-( u r )blockandbelow thatablockof(0 ;u r )-bandmatrixwithazerodiagonal.Theexamplegivenbelow illustratesthiscaseclearly.Asdiscussedinthe r =1case,thereisalsoanadditionof r 37

PAGE 38

superdiagonalsin U ,forevery r rowsthatarereducedin A .Figure 3-8 showsthellin U forthersttwomajorstepswhen r =3.Theconventionsaresameasbefore.Intherst majorstep(Figure 3-8 (b))thellfromrstminorstepisnumberedfortheentriesi n A (Figure 3-8 (a))thatcausedthell,andfurtherllfromtheotherminor stepsareshown as x .Inthesecondmajorstep(Figure 3-8 (c)),thellfromtwodierentminorstepsare dierentiatedwith and .Furthermore,thetwoblocksofllinthelowertriangularp art of U foroneminorstepareboundedwithrectanglesinFigure 3-8 (c). ( l; 1) -band A :Givena( l; 1)-bandmatrixthebandreductionalgorithmreduces thelowertriangularpartto r 1diagonalstokeepthepatternofthellin U simple. When r =1,applyingtherotationsfromtherstmajorstepofthered uctionleads toablockdiagonal U ,witheachblocklowerHessenbergoftheorder l +1.Applying rotationsfromeachsubsequentmajorstep,resultsinrecta ngularblocksoflloftheorder of( l +1)-byl inthelowertriangularpartof U ,andoneadditionalsuperdiagonalinthe uppertriangularpartof U .Thebasicstructureofthellissimilartotheoneshownin Figure 3-7 (b)-(c)exceptthesizeoftheblocksisdierent. When r> 1,applyingrotationsfromtherstmajorstepresultsinabl ockdiagonal U ,whereeachblockis( l r +2 ;r )-bandoftheorder( l +1).Applyingrotationsfromeach subsequentmajorstepresultsinablockofllthatconsists ofa( l r +2)-by-( l r +1) blockandbelowthatablockintheshapeof(0 ;l r +1)-bandoftheorder( r 1)-byl withazerodiagonalinthelowertriangularpartof U ,and r additionalsuperdiagonalsin theuppertriangularpartof U .Thebasicstructureofthellissimilartotheoneshown inFigure 3-8 (b)-(c)exceptthesizeoftheblocksisdierent. Unsymmetric A r =1:Whenweconsiderthecaseoftheunsymmetric( l;u )-band matrix A ,tondthenonzeropatternof U ,weneedtoonlylookattheinteraction betweentheplanerotationstoreducetheentriesinlowertr iangularpartandtheplane rotationstochasethellfromlowertriangularpart(event houghtheoriginalentrieswere fromtheuppertriangularpart),asthesearetheonlyrotati onsappliedto U .Wecan 38

PAGE 39

XXXXXXXXX2XXXXXXXXX14XXXXXXXXX XXXXXXXXXXX XXXXXXXXX 3XXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXX XXXXXXX XXXXXX XXXXX XXXdcbaXXXXhgfe X X X242X1321X3 X X 43X X X X X XXXXXXXXXXX XXX X X XXXXXXXXXXX XXX XXXXXXXXXXX XXX hgfeX dcbaXe dcbXae dcXbf dXcg Xdh XXXX0000XXXXX XXXXXXXXXXX XXXXXXXXX nn XXXXXXXXX n XXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXX XXXXXXX XXXXXX XXXXX 00XXXXXXXXX 0XXXXX XXX0000 0 n XXXXXXXXX X X XXXXXXXXXXXX XXX XXX XXXX XXXXX XXXXXX XXXXXX XXXXX XXX XXXX XXXX XXX X X XXXXXXXXXXXXXXXXX XXXXXXX XXXXX XXXXXX XXXXXXX XXXX XXXX XXXX XXX XXXXXXXX XXXXXXXX XXXXX (a) (b) (c) (f) (e) (d) Figure3-10.Updateof U forunsymmetric A with r =2 simplysaythatweneedtolookatalltherowrotationstomatr ix A .Letusconsiderthe casewhenthethenumberofrowsintheblock r isone.Thereductionalgorithmfrom Section 3.2 reducestherstcolumnandrowpairtothebidiagonalformbe forereducing subsequentcolumnandrows. Afterapplyingtherotationsfromtherstmajorstepofther eductionof A U will beablockdiagonalmatrixwheretherstblockis( l +1)-by-( l +1)andtherestofthe blocks( l + u )-by-( l + u )alllowerHessenberg.Figure 3-9 (a)usesanexample(6 ; 4)-band matrix A ofthesize16-by-16.Figure 3-9 (b)showsthellpatternin U afterreducingthe rstcolumnof A andtheresultantchase.Thellfromtherstminorstepisnu mbered 39

PAGE 40

forthecorrespondingentriesthatcausedit.Figure 3-9 (c)showsthellpatternin U after therstmajorstep.Allthenewllismarkedwith Eachsubsequentminorstepresultsinablockllofsize( l + u )-by-( l + u 1)in U in thelowertriangularpart,excepttherstblockwhichisofs ize( l + u )-byl .Thellinthe uppertriangularpartof U isstillonesuperdiagonalforaeverycolumnandrowpair.Th e structureofthellin U issimilartoFigure 3-7 (c)exceptthesizeoftheblocksislarger. ThereasonfortheblockHessenbergstructurewhenapplying theplanerotations intherstmajorstepisnotstraightforwardasinthesymmet riccase,butwedescribe ithereforcompleteness.Theplanerotationstoreducethee ntriesintherstcolumnis betweenrows l +1 ::: 1resultinginalowerHessenbergblockofthesize( l +1)-by-( l +1) in U aswesawinthesymmetriccase.Theplanerotationsfromthec orrespondingchase resultsinthesamelowerHessenbergblockstructurellin U ,buteachoftheseblocks isseparatedby u 1columns(not u ).Whenwereducetherstrowlater,theplane rotationsarebetweencolumns( u +1) ::: 2of A .Notethatcolumn u +1of A shouldhave beeninvolvedinthechasetoreducethellgeneratedfromre ducingtheentry A (1 ; 2)(as l> 0).Whenwereducetheentryin A (1 ;u +1)andchasetheresultantsubdiagonalll, applyingtheplanerotationto U resultsinllofsize l +2inthelowertriangularpartof U .ThislliscausedbytheexistingHessenbergblocksin U oforder l +1.Inotherwords, ascolumn u +1in A ispartofthechasewhenreducingtherstcolumnof A andpartof theactualreductionwhenreducingtheentriesinrstrowof A theinteractionofthell leadstoabiggerHessenbergblockoforder( l + u ). Unsymmetric A r> 1:When r> 1thereductionAlgorithm 3.1 leaves r diagonals intheupperandlowertriangularpart,whereaswhen r =1abovewereducethematrix tobidiagonalinoneiteration,whichisequivalenttoleavi ng r 1diagonalsinthelower triangularpart.Weneedtoleave r diagonalsinthelowertriangularpartinsteadof r 1 diagonalsconsistentwiththe r =1casetoavoidtheonecolumninteractionbetweenthe planerotationsoftheloweranduppertriangularpartasexp lainedabovein r =1case,so 40

PAGE 41

thatthellpatternin U remainssimplein r> 1case.Withoutthissmallchange,thell patternin U isstillpredictable,butisvastlydierentfromtheprevio usthreecases.We donotdiscussthatllpatternhere. Applyingtheplanerotationsfromreducingandchasinganl loftherst r columns of A ,to U ,resultsindiagonalblocksin U ,whereeachblockis( l r;r )-bandmatrixof theorder l .Eachoftheseblocksisseparatedbyadistanceof u .Thenextsetofplane rotations,toreducetheentriesin r rowsintheuppertriangularpartof A ,causesll in U inthe u columnsbetweentheexistingHessenbergblocksin U .Theblockllisa ( u r;r )-bandmatrixoftheorder u .Thusafterapplyingtherotationsintherstmajor step U isblockdiagonalwithtwodierenttypeofblocks(asdescri bedabove)alternating inthediagonal.Figure 3-10 (b)-(c)showsthellpatternafterreducingtheentriesint he rst r columnsandrowsrespectively.Thellfromtherstminorst episnumberedfor thecorrespondingentriesthatcreatedthellinboththeg ures. Reducing r columnsin A aftertherstmajorstepcreatesablockllinthelower triangularpartof U thatconsistofarectangularblockofthesize( u r +1)-by-( l r )and belowthatablockof(0 ;l r )-bandmatrixofsize( r 1)-by-( l 1)withazerodiagonal. Figure 3-10 (e)showsthellwhilereducingtheentriesfromthethirdan dfourthcolumns marked inFigure 3-10 (d).Reducingthesecondsetof r rowscreatesablockllinthe lowertriangularpartof U thatconsistofarectangularblockofthesize( l r +1)-by-( u r ) andbelowthatablockof(0 ;u r )-bandmatrixofsize( r 1)-by-( u 1)withazero diagonal.Figure 3-10 (f)showsthellwhilereducingtheentriesfromthethirdan dfourth rows(marked )ofFigure 3-10 (d). Fromtheabovedescriptionofthellforthefourcases,itis straightforwardtokeep trackoftherstandlastrowofanycolumnof U .Foranygivencolumn k ,therstrow in U towhichaplanerotationthatreducedanentryinrow i andtheplanerotations tochaseitsllisappliedis k i .Thisholdstrueinallfourcasesabove.Forkeeping trackofthelastrowforanycolumnin U ,weonlyneedtokeeptrackofthelastrowof 41

PAGE 42

therstblockof U and V andupdateitaftereachmajorstepbasedontheabovefour cases.Whilereducingaparticularsetof r columnandrowsandchasingthellwecan obtainthelastrowforagivencolumnfromthelastrowofthe rstblockandthenumber ofminorstepscompletedandthellsizebasedontheaboveca ses.Weleavetheindexing calculationsoutofthechapterforasimplerpresentation.3.4.2FloatingPointOperationsforUpdate Letusconsiderthecaseofreducing( l;u )-symmetricbandmatrixoforder n to tridiagonalformagain.Thenumberofplanerotationsrequi redtoreducethesymmetric matrixusingSchwarz'sdiagonalreductionmethodis g d = b X d =3 n X i = d 2+2 max( i d; 0) d 1 (3{8) where b = u +1.Thenumberofplanerotationsrequiredtoreducethesymm etricmatrix usingSchwarz'srowreductionmethodis g r = n 2 X i =1 min ( b;n i +1) X d =3 2+2 max( i d; 0) b 1 (3{9) Inordertondthenumberofofroatingpointoperationsforo urblockedreduction, whenblocksizeis r -byc weusethetwostepapproachasbeforetoget g b 1 = n r 1 X i =1 min ( b;n i +1) X d = r +2 2+2 max( i d; 0) b 1 (3{10) g b 2 = n 2 X i =1 min ( r +1 ;n i +1) X d =3 2+2 max( i d; 0) r (3{11) g b = g b 1 + g b 2 (3{12) Asymptotically,numberofrotationsinallthethreemethod sarethesame,butthere isasignicantdierenceintheconstantfactor.Letusassu methatwedonotexploitthe structureof U .Alloftheplanerotationsareappliedtoonecolumnof U (eachofsize n ) 42

PAGE 43

then.Reducingthenumberofplanerotationsplaysasignic antroleinreducingtherops foraccumulationoftheplanerotations. Givena1000-by-1000matrix,and l =150,Schwarz'srowreductionmethodcomputes thebidiagonalreductionusing78%fewerplanerotationsth anthediagonalreduction method.Itperforms43%fewerplanerotationsthantheblock edreductionmethod(with thedefaultblocksize).When l =999,thetherowreductionmethodcomputes83%and 48%fewerplanerotationsthanthediagonalreductionandbl ockedreductionmethods respectively.Theincreaseinthenumberofplanerotations fortheblockedmethodisdue totherotationstoreducethe r diagonalsasaseconditeration( g b 2 inEquation 3{11 ). Theserotationsareespeciallycostlytheyareappliedtoaf ull U WecanexactlymimictheSchwarz'srowreductionandstilldo blockingbychoosing r =1,leadingtoa50%reductionintheropcount.Notethatthec ostofsmallincrease intheropswhenwereduce A withouttheaccumulationwasosetbytheblocking,but inthiscase,whentheropsalmostdoublesitisimportanttou se r =1.Thedefault blocksizewilluse r =1whentheplanerotationsareaccumulated.Unlesstherear esome compellingreasons,sayimprovementslikeblockingtheacc umulationin U itselfwhich osetsthecostinrops r> 1casewillbepracticallyslowerwhen U and V arerequired. 3.5PerformanceResults Wecompareourbandreductionroutinepiro band reduceagainstthatofSBR's driverroutine DGBRDD [ Bischofetal.2000a ]andLAPACK'ssymmetricandunsymmetric routines DSBTRD and DGBBRD [ Andersonetal.1999 ]inthissection.Allthetestswere donewithdoubleprecisionarithmeticwithrealmatrices.T heperformancemeasurements weredoneonamachinewith8dualcore2.2GHzAMDOpteronproc essors,with64GB ofmainmemory.PIRO BANDwascompiledwith gcc4.1.1 withtheoptions -O3 LAPACKandSBRarecompiledwith g77 withtheoptions -O3-funroll-all-loops Figure 3-11 showstherawperformanceofouralgorithmswiththedefault block sizeusedbyourroutines.Weestimatethedefaultblocksize basedontheproblemsize. 43

PAGE 44

10 7 10 8 10 9 10 10 10 11 10 12 5 6 7 8 9 10 11 x 10 8 PIRO_BAND : Performance FLOPSFLOPS/sec bw=150 bw=200 bw=300 bw=400 bw=500 Figure3-11.Performanceoftheblockedreductionalgorith mwithdefaultblocksizes 10 7 10 8 10 9 10 10 10 11 0.8 1 1.5 2 3 4 PIRO_BAND vs SBR performance FLOPSSBR time/PIRO_BAND time bw=150 bw=200 bw=300 bw=400 bw=500 10 7 10 8 10 9 10 10 10 11 0.8 1 1.5 2 3 4 PIRO_BAND vs LAPACK performance FLOPSLAPACK time/PIRO_BAND time bw=150 bw=200 bw=300 bw=400 bw=500 Figure3-12.Performanceof piro band reduce vsSBRandLAPACK 44

PAGE 45

0 2000 4000 6000 8000 10000 0.5 0.8 1 1.5 2 3 PIRO_BAND vs LAPACK performance w/ update NLAPACK time/PIRO_BAND time bw=150 bw=200 bw=300 bw=400 bw=500 0 1000 2000 3000 4000 5000 6000 0.5 0.8 1 1.5 2 3 PIRO_BAND vs SBR performance w/ update NSBR time/PIRO_BAND time bw=150 bw=200 bw=300 bw=400 bw=500 Figure3-13.Performanceof piro band reduce vsSBRandLAPACKwithupdate 10 8 10 9 10 10 10 11 1 2 3 4 5 6 7 8 9 PIRO_BAND vs DGBBRD performance FLOPSDGBBRD time/PIRO_BAND time bw=150 bw=200 bw=300 bw=400 bw=500 0 500 1000 1500 2000 2500 3000 1 2 3 4 5 6 7 8 9 PIRO_BAND vs DGBBRD performance w/ update NDGBBRD time/PIRO_BAND time bw=150 bw=200 bw=300 bw=400 bw=500 Figure3-14.Performanceof piro band reduce vsDGBBRDofLAPACK 45

PAGE 46

Theblocksizeestimateisnotbasedonhardwareorcompilerp arameters.Performance ofouralgorithmsdependtosomeextentontheblocksize.The optimalblocksizes canvarydependingonthearchitecture,compileroptimizat ionsandproblemsize.The defaultblocksizeisnottheoptimalblocksizeforallthepr oblems,eveninthisparticular machine.Nevertheless,theresultsfromdefaultblocksize isusedforalltheperformance comparisonsinthissection.Theeectsofthedierentbloc ksizesontheperformanceare discussedin[ RajamanickamandDavis2009b ]andinSection 4.3.1 Figure 3-11 showstheeectofblockingfordierentbandwidths.Thetes tswere runonsymmetricmatricesoftheorder200to10000withbandw idths150to500. Thebandreductionalgorithmperformsslightlybetterforp roblemswithnon-narrow bandwidthsthanfortheproblemswithnarrowbandwidths.Th egureshowstheeect ofcache:helpingtheperformanceforsmallermatricesandt hennegativelyaectingthe performanceastheproblemsizeincreases,butblockingsta bilizestheperformanceasthe problemsizeincreases. Givenaproblemsize,weusetheroatingpointoperationsfro mtheSchwarz'srow reductionmethodastheropsrequiredforthatproblem.Thed ierentalgorithmswe comparemaydomoreroatingpointoperationsthanSchwarz's rowreductionmethod(no algorithmwilldoless).Weusethesmallestpossibleropsfo ragivenproblem(therops fromSchwarz'srowreductionmethod)asabaselinetocompar eallthealgorithms.The graphsinFigures 3-11 3-12 3-14 usetheFLOPSfromSchwarz'srowreductionmethod. Figure 3-12 comparestheperformanceofourblockedalgorithmandSBR's band reductionroutines,withouttheaccumulationofplanerota tions.WeprovideSBRthe maximumworkspaceitrequires n u foramatrixofsize( u;u )-bandmatrixoforder n andletitchoosethebestpossiblealgorithm.Forsmallerma trices,SBRisabout20% fasterthanouralgorithm.Whentheproblemsgetlarger,for smallerbandwidths,the performanceofboththealgorithmsarecomparableeventhou ghtheworkspacerequired byouralgorithmisconsiderablysmaller(32-by-32).Asthe bandwidthincreasesour 46

PAGE 47

blockedalgorithmisabout25%to2xfasterthanSBRalgorith mswhileusingonlyfraction oftheworkspace. TheresultsoftheperformancecomparisonagainstLAPACK's DSBTRD isshownin Figure 3-12 .LAPACK'sroutinesdoesnothaveanytuningparametersexce ptthesize n workspacerequiredbytheroutines. DSBTRD performsslightlybetterthanouralgorithm forsmallerproblemsizes.Astheproblemsizesincrease,ou ralgorithmisfasterthan DSBTRD evenforsmallerbandwidths.Theperformanceimprovementi sintheorderof1.5x to3.5x. DSBTRD usestherevisedalgorithmasdescribedin[ Kaufman1984 ]. Whenweaccumulatetheplanerotationsweuseoneentirerowo fthesymmetric matrixasourblockasthiscutstheroatingpointoperations requiredbynearlyhalfwhen comparedtoablocksizeof32-by-32.Theperformanceofoura lgorithmagainstSBRand LAPACKalgorithmsareshownintheFigure 3-13 .OuralgorithmisfasterthanSBR byafactor2.5forlargermatrices.Itis20%to2timesfaster thanLAPACK's DSBTRD implementation.Asinthenoaccumulationcase,thelargert hebandwidththebetterthe performanceimprovementforouralgorithm. WecompareouralgorithmagainstLAPACK's DGBBRD fortheunsymmetriccase. DGBBRD doesnotusetherevisedalgorithmofKaufmanyet,soitcando betterinthe updatecase.Theresultswhenwecompareouralgorithmagain sttheexistingversionof DGBBRD areinFigures 3-14 withoutandwiththeaccumulationofplanerotations.Our algorithmisabout4to7timesfasterintheformercaseandab out2to8timesfasterin thelatercase. 3.6Summary Theproblemofbandreductionhadtwopossiblestyleofsolut ions:avectorized solutionusingplanerotationsimplementedinLAPACKlibra riesandaBLAS3based solutionusing QR factorizationandHouseholdertransformationsimplement edin SBRtoolbox.Wepresentedablockedalgorithminthischapte rthatisbetterthan thecompetitivemethodsintermsoftherequiredworkspace, thenumberofroatingpoint 47

PAGE 48

operationsandrealperformance.Thesoftwarepackagethat implementsthisalgorithm, PIRO BAND,isdescribedindetailinChapter 4 48

PAGE 49

CHAPTER4 PIRO BAND:PIPELINEDPLANEROTATIONSFORBANDREDUCTION 4.1Overview Bandreductionmethodsareanimportantpartofalgorithmsf oreigenvalue computationsofsymmetricbandmatricesandthesingularva luedecompositionsof unsymmetricbandmatrices.Weintroducednewblockedalgor ithmsforbandreductionin Chapter 3 .ThesoftwareimplementingthealgorithmsPIRO BAND(pronounced\pyro band")isdescribedinthischapter. PIRO BANDisalibraryfortridiagonalreductionofsymmetricmat ricesand bidiagonalreductionofunsymmetricbandmatrices.Wealso providefunctionstocompute thebandsingularvaluedecompositionusingthebandreduct ionfunctionsandasimplicial left-lookingband QR factorization.Wecallthis QR factorizationsimplicialfactorization asitdoesnotuseanysupernodesorfrontstotakeadvantageo fBLAS3styleoperations. See[ Davis2009b ]foradescriptionofamultifrontal QR factorization. Section 4.2 introducesthefeaturesinthePIRO BANDlibrary.Thecomparisonof thebandreductionalgorithmsinPIRO BANDagainstotherbandreductionalgorithmsis giveninSection 3.5 .WepresentsomeadditionalperformanceresultsofPIRO BANDin Section 4.3 4.2Features PIRO BANDisalibrarywithbothMATLABandCinterfacesforbandre duction. TheCinterfacesaredescribedinSections 4.2.1 and 4.2.2 .TheMATLABinterfacesare describedinSection 4.2.3 PIRO BANDacceptsgeneralsparseanddensematricesintheMATLAB interface. TheClibraryrequirestheinputmatricestobeinthepackedb andformat.Givenan m -byn matrix A with bl diagonalsintheuppertriangularsideand bu diagonalsinthe uppertriangularside,thepackedbanddatastructureisoft hesize( bl + bu +1)-byn whereentry A ( i;j )isstoredinentry( i j + bu;j ).TheMATLABinterfacewillaccept 49

PAGE 50

bothsparseanddensematricesandconvertthemtobandformu singthestructureofthe matrixwhentheinputissparseandthenumericalvalueswhen theinputisdense. PIRO BAND'sCinterfaceandtheMATLABinterfacesupportrealand complex matrices.ComplexmatricesareexpectedtobestoredintheC 99stylecomplexformat wheretherealandimaginarypartofeveryentryisstoredinc onsecutivememory locations.WedonotrequireC99forcompilingandusingthel ibraries,butthetest coverageforthelibraryrequiresC99support.PIRO BANDClibrariesdonotsupport complexmatriceswheretherealandimaginarypartarestore dinseparatearrays. 4.2.1PIRO BANDInterface Theprimaryfunctionforbandreductionis piro band reduce .Thereareeight dierentversionsofthisfunctionforallthecombinations ofdoubleprecisionandsingle precisionarithmetic,realandcomplexmatricesand32-bit and64-bitintegers.Theeight versionsofthefunctionscanbedenedusingthefollowingt emplate. piro_band_reduce_x:='d'|'s'(fordoubleorsingleprecision)y:='r'|'c'(forrealofcomplexmatrices)z:='i'|'l'(for32-bitor64-bitintegers) Forexample, piro band reduce dri ,isthefunctionnameforusingdoubleprecision arithmeticforreducingarealbandmatrixwith32-bitinteg ers.Theprototypeforthis functionisintpiro_band_reduce_dri( intblks[],intm,intn,intnrc,intbl,intbu,double*A,intldab,double*B1,double*B2,double*U,intldu,double* VT, intldv,double*C,intldc,double*dws,intsym ) The piro band reduce functionsdestroytheirinputmatrix A astheyreduceit tobidiagonal/tridiagonalform.Thetwodiagonalsareretu rnedonsuccess.Theplane rotationsareaccumulatedintherightandleftonlyiftheya rerequired.Wedonot provideaseparateinterfaceforsymmetricandunsymmetric matrices.Instead,weusean 50

PAGE 51

inputragtodierentiatebetweenthetwo:thelastparamete rtothefunction sym istrue forasymmetricmatrix. Alltheeightversionsofthe piro band reduce functionsrequireanoptionalblock size( blks )andaworkspace( dws )asaparameter. blks isanarrayofsizefourwhere thersttwoentriesspecifythenumberofcolumns( c u )androws( r u )intheblockforthe reducingthesuperdiagonals.Thenexttwoentriesin blks specifythenumberofrows ( r l )andcolumns( c l )intheblockforreducingthesubdiagonals.Thenumberofro wsand columnsintheblocksarelimitedby c u + r u
PAGE 52

Table4-1.MATLABinterfacetoPIRO BAND MATLABfunctionDescriptionofthefunction piro band Bidiagonalreductionroutineforbandmatrices piro band qr QR factorizationofabandmatrix piro band svd SVDofabandmatrix piro band lapack MATLABinterfacetotheLAPACKstyleinterfaces storeband Storeasparse/fullmatrixinpackedbandformat PIRO BANDnds C T U insteadof U T C ,and V insteadof V T ,asinLAPACK. Thesearemoreecienttocompute. TheLAPACKstyleinterfacesgivenbelowdonothaveanyofthe serestrictionsandmimic thefunctionalityofLAPACKsubroutines.4.2.2LAPACKStyleInterface TheLAPACKstyleinterfacesupportsalleightfunctionsfor bandreductionin theLAPACKlibrary.ThecorrespondingfunctionnamesinPIR O BANDhaveaprex piro band addedtoit.Forexample,LAPACK'sbandreductionfunction DSBTRD is piro band dsbtrd inourinterface.Fortheversionthatsupports64-bitinteg ersweadda sux l tothefunctionname.Theinterfacehastheexactfunctional ityastheLAPACK librariesexceptforonedierence:thereturnvaluesofthe functionsaredierentin certaincases.LikeLAPACK'sfunctions,areturnvalueofze roissuccessandanegative valuemeansfailure,buttheexacterrorcodesaredierentf romLAPACK'sfunctionsas weuseonecommonfunctionforsymmetricandunsymmetricmat rices.Areturnvalueof -imaynotindicatetheithargumentisinvalidintheLAPACKs tyleinterfaces. 4.2.3MATLABInterface TheMATLABinterfaceinPIRO BANDprovidesthefunctionslistedinTable 4-1 TheMATLABfunctions piro band and piro band lapack areprovideasimpleinterfaces forthebandreductionalgorithmsbothinPIRO BANDstyleandLAPACKstyle. storeband ndsabandstructure,ifoneexists,andstoresthematrixin bandformat. Itusesthenumericalvalueswhentheinputisadensematrix. Itusesthestructureofthe matrixiftheinputissparsetondthebandmatrix. 52

PAGE 53

Thefunction piro band svd computesthesingularvaluedecompositionofaband matrix.Itcancomputeboththefullandeconomysingularval uedecompositionofthe bandmatrix.Theusagealsosupportsndingonlythesingula rvalues.Thesingular valuedecompositionfunctionusesourbandreductionalgor ithmtoreducethematrix tobidiagonal/tridiagonalform.ItthenusesLAPACK(bundl edwithinMATLAB) todiagonalizebidiagonal/tridiagonalmatrixtocomputet hesingularvalues.The piro band svd functioncomputestheeconomysingularvaluedecompositio ninthree steps.Itusesthesimplicialleft-looking QR factorizationtocomputeanuppertriangular bandedmatrix R .Itthenreduces R usingthebidiagonalreductionalgorithmandthe LAPACKdiagonalreductionalgorithm. piro band qr providesaMATLABinterfacetotheleft-lookingband QR factorization. Itprovidesanoptiontondboth Q and R andanoptiontondtheHouseholdervectors insteadof Q .SeePIRO BANDuserguide[ RajamanickamandDavis2009c ]fortheexact usageofallthesefunctions. 4.3PerformanceResults Alltheperformanceresultshereweretakenonamachinewith eightdualcore2.2 GHzOpteronprocessorswith64GBofmainmemory.WeusedMATL AB7.6.0.324 (R2008a).PIRO BANDwascompiledwith gcc version4.1.1withtheoption -O3 4.3.1ChoosingaBlockSize Choosingtherightblocksizefordierentmachinesisadic ultproblem.Weprovide afunctioninCthatestimatestherecommendedblocksizebas edonthenumberof roatingpointoperations.Thefunctionwillnotoptimizeth eblocksizeforanyspecic hardwareorcompiler.Thesimpleruleistouseablocksizeof 8-by-8whentherequired ropsislessthan10 10 andtheblocksizeof32-by-32otherwise.Thisisourdefault block size.Thefunctionpiro band get blocksizeuses6 kn 2 asanestimateforthenumberof FLOPSforthebandreductionalgorithm.Figure 4-1 usestheestimatedFLOPSforthe x-axis. 53

PAGE 54

10 8 10 9 10 10 10 11 10 12 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 PIRO_BAND Relative Block Size Performance FLOPSblk time/default time blk=row blk=8 blk=16 blk=32 blk=64 Figure4-1.Performanceof piro band reduce fordierentblocksizes Toverifythissimpleblocksizeselectionwecomparevedi erentoptions:anentire rowastheblock,blocksizesof8-by-8,16-by-16,32-by-32a nd64-by-64.Whentheentire rowisusedastheblocktheblockedreductionalgorithmdoes exactlythesameamount ofropsasSchwarz'rowreductionmethod.Itdoesmorerops(b yaconstantfactor)for theotherfourblockingsizes.Figure 4-1 showstheperformancecomparisonofvarious blocksizesinrelationtothedefaultblocksize.Whenthero pcountissmaller8-by-8is theidealblocksize.Astheropsreach10 10 thereareafewcaseswhere16-by-16isbetter, but32-by-32isidealwhentheropcountishigher.Notethate xceptwhenchoosingthe entirerowastheblock,alltheotherblocksizesperformwit hin30%ofthedefaultsize. Wecanalsoseethat64-by-64isalmostasgoodas32-by-32whe nwedolotofroating 54

PAGE 55

10 8 10 9 10 10 10 11 10 12 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 PIRO_BAND Relative Block Size Performance FLOPSblk time/default time blk=row blk=8 blk=16 blk=32 blk=64 Figure4-2.Performanceof piro band reduce fordierentblocksizes pointoperations.Therecanbeacut-opointwhere64-by-64 maystartperformingbetter than32-by-32.Wechosetoleavethedefaultat32-by-32andl ettheuserexperimentfor theircommonproblemsizes. Astheblocksizesareplatformdependentwerepeatedthisex perimentonamachine withaa3.2GHzPentium4and4GBofmemory.Theresultsaresho wninFigure 4-2 Thoughtheperformanceofvariousblocksizesaresomewhatd ierentfromFigure 4-1 the defaultblocksizeisnevermorethan15%worsethanthebestb locksizewhichinthiscase is16-by-16forsomeproblems. 55

PAGE 56

Table4-2.PIRO BAND svd vsMATLABdense svd NBandwidthPIRO BANDSVDMATLABDENSESVD (seconds)(seconds) 10001001.747.1200020011.2149.3300030034.9306.0400030064.2540.5 4.3.2PIRO BAND svd and qr Performance WecomparetheperformanceresultsofPIRO BAND'ssingularvaluedecomposition andPIRO BAND's QR factorizationwiththeirMATLAB'sequivalentsinthissect ion. DetailedperformanceresultsofPIRO BANDpackageagainsttheLAPACKandSBR bandreductionroutinescanbefoundinSection 3.5 MATLABdoesnothaveabandsingularvaluedecomposition.Th esparsesingular valuedecompositionistooslowwhencomparedwiththedense singularvaluedecomposition, especiallywhenallthesingularvaluesarerequired.Wecom pareourbandsingularvalue deompositionagainstthedensealgorithmwhichisbetterwh enoneneedstocompute allthesingularvalues.Asweoperateonlyontheband,oural gorithmisasymptotically fasterthanthedenseSVD.Thetimingresultsforndingonly thesingularvaluesare summarizedinTable 4-2 WecomparePIRO BAND'ssimplicialleft-looking QR factorizationagainst MATLAB'sdense QR factorizationandMATLAB'ssparse QR factorization.The performanceresultsareprovidedinTable 4-3 .WeshouldnotethatMATLAB'ssparse QR isnotaveryfastalgorithm.Asour QR factorizationisasupportroutineusedonly foreconomysingularvaluedecompositionweuseasimplicia lalgorithm.Anecient QR factorizationalgorithmlikethemultifrontalalgorithmS PQR[ Davis2009a ]willbeableto dobetterthanthesimplicial QR factorizationalgorithmwithasmalladditionalcostin theintegerworkspace.ThelatestreleaseofMATLABR2009bi ncludesSPQRasits QR factorizationandshouldhavebetterresultsthantheonesp resentedhereforthesparse 56

PAGE 57

Table4-3.PIRO BAND qr vsMATLABdenseandsparse qr NBandwidthPIRO BANDQRDENSEQRSPARSEQR (seconds)(seconds)(seconds) 10001001.55.311.9200020014.734.0159.9300030048.4106.01907.3400030087.0239.04145.9 case. QR factorizationisusedbyPIRO BANDonlywhentheeconomysingularvalue decompositionisrequired. 57

PAGE 58

CHAPTER5 SPARSER-BIDIAGONALIZATIONUSINGGIVENSROTATIONS 5.1Introduction Givenasparsematrix A wefollowthethreestepprocessdescribedinSection 1.1.1 tocomputetheSVDof A .Therearevarioussparse QR factorizationalgorithmswecan choosefromtodotherststepeciently.Reducingabidiago nalmatrixtoadiagonal matrixisaproblemwherenumericalconsiderationsaremore importantthansparsity considerations.Thesubjectofthischaptercoverstheonly otherpieceleftinthepuzzle: bidiagonalizationoftheuppertriangularsparsematrix R fromthe QR factorization. Whiletryingtosolvethesparseproblem,weusethelessonsf romthebandreduction algorithm.Wedonotgeneratemorethanonellforanyrotati on.Weblocktherotations forbettercacheaccessandpipelinetherotationstoavoidm orethanonell.Thischapter isorganizedasfollows.Section 5.2 coversthetheoryforsparsebidiagonalization.We discussvariousalgorithmsforsparsebidiagonalizationi nSection 5.3 .Thesymbolic factorizationforsparsebidiagonalreductionispresente dinSection 5.4 .Section 5.5 details thesoftwareimplementingthereductionalgorithmsandits performanceiscoveredin Section 5.6 5.2TheoryforSparseBidiagonalization 5.2.1ProleandCornerDenitions Let R 2R n n bethesparseuppertriangularmatrixfromthe QR factorization.We usethefollowingdenitionsthroughoutthischapter.Some ofthedenitionsandlemmas inthissectionarederivedfrom[ DavisandGolub2003 ]. Denition5.1 (SkylineProle) SkylineproleofRisdenedasfollows. S ( R )= f ( i;j ):1 j n; ( f j ( R ) i j ) ( j i =0) ( j i =1) g where f j ( R )= min f i :1 i j;r ij 6 =0 g 58

PAGE 59

XXX XXXXX X XXX XX X XXX XXXX XXXX XXXX XXXX XXX XX XX XX X X X XX XXXX XXXXXX X XXXXXX XXXXX XXXXX XXXXX XXXXXX XXXXX XXXXX XXXXX XXXX XXX XX X X X XX X X XXXXX XXXXXX X XXXXXX XXXXX XXXXX XXXXX XXXXXXX XXXXXX XXXXXX XXXXX XXXX XXX XX X X X XXXX X X SparseRSkyline(R)Mountainview(R) Figure5-1.Asparse R ,skylineproleof R andmountainviewproleof R orsimplytheskylineproleof R isthesetofindexpairs( i;j )thatsatisfythe followingcriteria: Allentriesintheupperbidiagonalareintheskylineprole if r ij 6 =0then( i;j )isintheskylineprole. if( i;j )isintheproleand i 6 = j then( i +1 ;j )isintheskylineprole. Denition5.2 (MountainviewProle) TheMountainviewproleof R isdenedas follows. M ( R )= f ( i;j ):1 i n;i j g i ( R ) g where g i ( R )= max f k : i k n; ( i;k ) 2S ( R ) g orsimplythemountainviewproleof R isthesetofindexpairsthatsatisfythe followingcriteria: Allentriesintheskylineproleareinthemountainviewpro le. if( i;j )isinthemountainviewproleand i 6 = j then( i;j 1)isalsointhe mountainviewprole. 59

PAGE 60

x xx xxxx n-1n (d) j j-1 x xx xx x x x x x j+1 (a) x j j-1 (b) xx x xxx x xx j+1x n xxx xxxx x x n-1 (c) Figure5-2.Cornerdenitionintheproleof R forcornercolumns j and n rotateswaprotateswap j j-1 x x0 xx x x x x x j+1 (a) x j j-1 (b) xx xxx 0 xx j+1x x n xx 0x x x n-1 (c) xx x xx 0xx n-1n (d) 0 x x Figure5-3.Cornercolumns j and n aftercolumnrotations Thisstyleofdeningtheproleissimilartotheenvelopede nitionin[ Hager2002 ]. Figure 5-1 showsasmallsparse R itsskylineproleandmountainviewprole. Denition5.3 (Toprow) Toprowofacolumn j trow ( j ) ,isthesmallestrowindexin theproleofcolumn j Denition5.4 (Endcolumn) Endcolumnofarow i ecol ( i ) ,isthelargestcolumnindex intheproleofrow i Denition5.5 (Corner) Acolumnjisacornerif trow ( j 1) trow ( j ) when j = n; trow ( j 1) trow ( j )
PAGE 61

trow(5) = row 4 column 5 ecol(2) = column 7 row 2 XXXX XXXXXX X XXXXXX XXXXX XXXXX XXXXX XXXXX XXXXX XXXXX XXXX XXX XX X X X X X X L XXXXX R Figure5-4.Cornersforthesparse R WeshowallthefourpossibilitiesforacornerintheFigure 5-2 .Figures 5-2 (a)and (c)showthecasewhen trow ( j 1)= trow ( j )andFigure 5-2 (b)and(d)showthecase when trow ( j 1) >trow ( j ).Wedenethecornersinsuchawaythatifcolumn j isa cornerthenrotatingcolumns j and j 1toreducetheentry( trow ( j ) ;j )resultsinexactly onellinthesubdiagonalofcolumn j 1.Figure 5-3 (a)and(c)showthecasewhen weapplyaplanerotationtoreducetheentry( trow ( j ) ;j ).As trow ( j 1)= trow ( j ) theproleofcolumns j 1and j isthesameinrows trow ( j ) :::j 1resultinginone subdiagonalllintheentry( j;j 1)(markedas ).Figure 5-3 (b)and(d)illustratethe casewhen trow ( j 1) >trow ( j ).Weuseaswapofthecolumns j 1and j inthis caseastheswapisaspecialcaseofaplanerotationreducing theentry( trow ( j ) ;j ).The swapresultsinnoroatingpointoperationsandresultsinon esubdiagonalll(marked a ).Theentryin( j;j )(markedwith )isnumericallyzero,butsymbolicallynon-zero (inthesensethatisanentrystoredintheproledatastruct ure).Theswapbetweenthe columnscanbedoneinplaceasthetotalnumberofentriesint heproleremainthesame (ignoringthesubdiagonalll). Itfollowsfromthedenitionthatanymatrixthatisnotbidi agonalhasatleastone cornerandtherearenocornersinabidiagonalmatrix.Ourgo althenistoreduceallthe 61

PAGE 62

cornersinanuppertriangularmatrix(andchasethell)tog etabidiagonalmatrixwith nocorners.Thefactthatifcolumn j isacornerthencolumns j 1and j +1arenot cornershelpsussowedonotrotatetwocornerswithoneanoth er. Denition5.6 (Rightmost/LeftmostCorner) Column k iscalledtherightmost (leftmost)cornerif k islargest(smallest)columnindexforwhich k iscorner. Denition5.7 (RightmostCornerEntry) If k istherightmostcornerwecalltheentry ( trow ( k ) ;k ) astherightmostcolumnentry. Denition5.8 (Rubble) If k istherightmostcornerthenthesetofcolumns k +1 :::n is calledtherubble.Denition5.9 (Heightofacolumn) Heightofacolumn j is j trow ( j ) Thecorners, trow and ecol valuesareshowninFigure 5-4 .Thecornersfortheskyline matrixRfromFigure 5-1 are f 4 ; 7 ; 10 ; 13 g .The trow and ecol valuesfortheskylineprole of R fromFigure 5-1 are trow ( S ( R ))= f 1 ; 1 ; 2 ; 2 ; 4 ; 3 ; 2 ; 4 ; 4 ; 4 ; 7 ; 10 ; 7 ; 9 ; 11 ; 12 g ecol ( S ( R ))= f 2 ; 7 ; 7 ; 10 ; 10 ; 10 ; 13 ; 13 ; 14 ; 14 ; 15 ; 16 ; 16 ; 16 ; 16 ; 16 g Therightmostcorneriscolumn13andtheleftmostcornerisc olumn4.The rightmostcolumnentryis(7 ; 13).TherightmostcornerentryismarkedwithaRand theleftmostcornerismarkedwithaL.Columns14 ::: 16arecalledtherubble. 5.2.2Mountain,SlopeandSink Themountainviewprolehasspecialpropertiesthatwecane xploitwhiledoing thereduction.Thefollowingdenitionsapply both tothemountainviewandtheskyline proles,butaredenedusingthe mountainviewprole of R .Weclassifyeverycolumn ofthemountainviewproleintherange3 :::n asoneofthefollowing3cases:mcolumn, sinkorslope.Denition5.10 (Mcolumn) Acolumn j isanmcolumn(acolumninamountain)if trow ( j 1)= trow ( j ) inthemountainviewproleof R 62

PAGE 63

Slope mcolumn Sink Mountain XXXXX XXXXXX X XXXXXX XXXXX XXXXX XXXXX XXXXXXX XXXXXX XXXXXX XXXXX XXXX XXX XX X X X XXXX X X Mountainview(R) Figure5-5.TypesofcolumnsinthemountainviewproleDenition5.11 (Mountain) Asetofmcolumnsallwiththesametoprowvalueinthe mountainviewproleof R Denition5.12 (Slope) Acolumn j isaslopeif trow ( j 1)+1= trow ( j ) inthe mountainviewproleof R Denition5.13 (Sink) Acolumn j isasinkif trow ( j 1)+1
PAGE 64

i i-1 ecol(i-1) .. .. XX XX XX XX X X X XX? XX L Figure5-6.Endcolumnlemma Acornerwillalwaysbeanmcolumn. Therightmostcornerisintherightmostmountain. Foranycolumn j anditstoprow i = trow ( j ),rotatingrows i and i 1generates morethanonellif j isanmcolumn,atleastonellthatcannotbechased immediatelyif j isasinkandonellthatcanbechasedif j isaslope. Mcolumnheightproperty .Inamountainviewproletheheightofthecolumns inanymountainmonotonouslyincreasesfromthelowestnumb eredcolumntothehighest numberedcolumn. Proof .Everycolumn j inthemountainhasthesame trow ( j )valuefromthe denition.Thesmallestnumberedcolumnsetsthebaseheigh t.As j increasesforevery columninthemountaintheheight, j trow ( j )increasestoo. 2 5.2.3PropertiesofSkylineandMountainviewProles Therearepotentiallymorethanonecornerintheproleofas parsematrix R that wecanreducetogetallinthesubdiagonal.Furthermore,th echasetoreducethe subdiagonalllhastobehandledaswell.Weneedtocharacte rizethellandtheproles bettertoarriveatanalgorithmforreducingtheuppertrian gularmatrix.Thefollowing propertieshelpusdoit.Lemma1 ( End-columnlemma). Giventheproleof R ecol ( i 1) ecol ( i ) forall i intherange 2 :::n 64

PAGE 65

righmost corner j-1 j n is a corner. Contradiction XX .. .. XXX XXXXX XX XXXXX XXXXX XXXX XX XXXX XXX XX X X XX X XX Figure5-7.Toprowlemma Proof .Assume ecol ( i 1) >ecol ( i )forsome i intherange2 :::n .Thismeans ( i 1 ;ecol ( i 1))(entrymarkedLinFigure 5-6 )isintheproleand( i;ecol ( i 1)) (entrymarked?inFigure 5-6 )isnotintheprole.Fromthedenitionoftheprole,if ( i 1 ;ecol ( i 1))isinprole( i;ecol ( i 1))isintheprole.Acontradiction. 2 Lemma2 ( Mountainviewlemma). Giventhemountainviewproleof R trow ( j 1) trow ( j ) forall j intherange 2 :::n Proof .TheproofissimilartotheproofoftheEnd-columnlemma.As sume trow ( j 1) >trow ( j )forsome j intherange2 :::n .Thismeans( trow ( j ) ;j )isintheproleand ( trow ( j ) ;j 1)isnotintheprole.Fromthedenitionofthemountainvie wprole,if ( trow ( j ) ;j )isinprole( trow ( j ) ;j 1)isintheprole.Acontradiction. 2 Lemma3 ( Top-rowlemma). If k istherightmostcornerthen trow ( j 1) k .If j = k +1wegetastraightforwardcontradictionas k cannotbea cornerif trow ( k ) trow ( k +1).If j = n and trow ( j 1) trow ( j )then n willbethe rightmostcorner.Acontradiction. 65

PAGE 66

righmost corner i i-1 ecol(i-1) ecol(i) j XX .. .. XXX XXXXX XX XXXXX XXXXX XXXX XX XXXX XXX XX X X XX Figure5-8.Rubblerowlemma Westillhavethecaseof j intherange k +2 :::n 1.Asweassumed trow ( j 1) trow ( j ),toavoidjbeingacorner(andresultinacontradictionimm ediately) trow ( j ) trow ( j +1)mustbetrue.Asweproceedfromcolumns j 1 :::n wehavethe inequality trow ( j 1) trow ( j ) trow ( j +1) ::: trow ( n 1) trow ( n ) Thisresultsincolumn n beingtherightmostcorner.Acontradiction.Thiscaseissh own inFigure 5-7 2 Lemma4 ( Rubblecolumnlemma). If k istherightmostcornerthenforall j inthe range k:::n theskylineproleisthesameasthemountainviewprole. Proof .Assumethatthereexistsanentry( i;j ) ;ki .Fromourassumption trow ( j ) i .Combiningthe twoweget trow ( j 1) >trow ( j )contradictingtheTop-rowlemma. 2 Lemma5 ( Rubblerowlemma). If k istherightmostcornerforall i intherange trow ( k )+1 :::n ecol ( i ) ecol ( i 1) = 0 or 1 66

PAGE 67

Proof .FromtheEnd-columnlemma, ecol ( i ) ecol ( i 1) 0.Wejustneedto provethat ecol ( i ) ecol ( i 1) < 2.Assumethereexiststworows i 1and i intherange trow ( k )+1 :::n suchthat ecol ( i ) ecol ( i 1) 2.Figure 5-8 showstherightmostcorner andtherows i 1and i .Let j beacolumnintherange ecol ( i 1)+1 :::ecol ( i ) 1.There isatleastonesuchcolumn.Columnjsatises trow ( j )= trow ( j +1)= i contradictingthe Top-rowlemma. 2 Thecolumnstotherightoftherightmostcolumnwillnothave amountainasshown intheTop-rowlemma.Theysatisfythepropertiesofamounta inviewprole.Hencethe namerubble.Lemma 4 andLemma 5 fullydenethecharacteristicsoftherubble. Theorem5.1 ( RightmostCornerTheorem). Reducingtherightmostcolumnentry, ( trow ( k ) ;k ) ,where k istherightmostcorner,usingapairofGivensrotationscau seszero oronellintheprole. Proof .Fromthepropertyofthecornerwegetonellintheentry( k;k 1)whenwe applyaGivensrotationstocolumns k and k 1. Weapplyarowrotationbetweenrows k and k 1toremovethesubdiagonalll.As therowrotationsareappliedtocolumnsintherange k:::n therubblecolumnlemmaand mountainviewlemmaensurethatthetotalnumberofll-inwi llbe ecol ( i ) ecol ( i 1) (liketherotationbetweenanytworowsinthemountainviewp role).Thoughcolumn k 1isinvolvedinthisrotation,towhichtherubblerowlemmad oesnotapply,weare chasingthellincolumn k 1.Wecannotgetalltherebythatrotation.Asrows k 1and k areintherange trow ( k )+1 :::n ,therubblerowlemmaensuresthatthe ecol ( i ) ecol ( i 1)willbeeitherzeroorone. 2 Lemma6 ( RubbleHeightLemma). Let k betherightmostcornerintheprole,for allcolumns j intherange k +1 :::n ,theheightofthecolumnsismonotonouslynon increasing. Proof .Asthecolumnnumberincreasesfrom k to n ,fromthetop-rowlemma (Lemma 3 ),the trow valueshavetoincreasebyatleast1foreachnewcolumnwhich 67

PAGE 68

k-1k ecol(k-1) ecol(k) rotaterotate ecol(k-1) ecol(k) swap swap k xx 0x x x k-1 x x x xx (a) x x x x x x x x k xx 0x x x k-1 x x x xx (c) x x x x x x x x x n k-1k ecol(k-1) ecol(k) k xx 0x 0 x k-1 x x x xx (d) x x x x x x x x x n k-1k k xx 0x 0 x k-1 x x x xx (b) x x x x x x x x k-1k ecol(k-1) ecol(k) Figure5-9.Dierentconditionstoestablishprogressinre duction ensuresthattheheightofthecolumnsremainatleastthesam e.When trow values increasebymorethanonetheheightofthecorrespondingcol umnsdecreasestoo. 2 Progress .Therearefourcasestoconsiderwhenwetrytoestablishpro gresswhen usingtheTheorem 5.1 .Weuseeitheraswaporanormalrotationtoreducetherightm ost cornerandcreateasubdiagonalll.Wethenreducethesubdi agonalllwitharow rotation.Therowrotationmaycreatezerooronellasestab lishedbytherightmost cornertheorem(Theorem 5.1 ).Thefourcasesfromthecombinationsofthesetwo conditionsare 68

PAGE 69

1.rotatetherightmostcorner,rowrotationcreatesnoll.2.swaptherightmostcorner,rowrotationcreatesnoll.3.rotatetherightmostcorner,rowrotationcreatesonell 4.swaptherightmostcorner,rowrotationcreatesonell. Notethatthellgeneratedcannotbechasedallthetime.Wen eedtoestablish progressirrespectiveofthechasabilityofthell.Weneed torememberthefactthat thecolumnrotationreducesthenumberofentriesinthepro lebyoneandtheswap leavesthenumberofentriesintheprolethesame(ignoring thesubdiagonalll).Incase 1,thecolumnrotationreducestheprolebyoneentryandthe rowrotationleavesthe numberofentriesintheproleunchanged.Wehavemadeprogr essbyreducingoneentry withoutanyll.Incase2,thenumberofentriesintheprole remainunchanged,butthe rightmostcornerentrymovedclosertothediagonal(fromco lumn k tocolumn k 1) makingprogress.Figure 5-9 (a)-(b)showsthesersttwocases.The0'sshowtheentries thatwerejustmadezeroeitherbyarotateoraswap.Thesubdi agonalllismarkedwith a .Therecouldhavebeenmorethanoneentrythatgotswappedre ducingtheprole evenfaster.Forexample,Figure 5-9 (b)showstwo0'sincolumn k forthetwoentriesthat gotswapped.Boththeentriesgotclosertothediagonalmaki ngprogress. Incase3,thecolumnrotationreducedoneentryintheprole andtherowrotation createdall.Therubbleheightlemma(Lemma 6 )ensuresthateitherallthecolumns from k to ecol ( k )areofthesameheight(asthatof k )orthecolumn ecol ( k )isshorter than k itself.Intheformercasewhentheallthecolumnsfrom k:::ecol ( k )areofthe sameheightthellcreatedbytherowrotationisawayfromth emaindiagonalby onemorethantherightmostcolumnentryandwillbechasable as trow ( ecol ( k ))= trow ( ecol ( k 1))whenthellisatentry( k 1 ;ecol ( k ).Thentheprogresshastobe establishedbyoneoftheothercases.TheFigure 5-9 (c)showsthellfromrowrotation as 69

PAGE 70

XXXXX XXXXXX X XXXXXX XXXXX XXXXX XXXXX XXXXX 21 XXXXX 3 XXXXX 4 XXXXX XXXX XXX XX X X X XXX X X X (b) XXXXX XXXXX 7 XXXXX XXXXX XXXXX XXXXX XXXXX XXXXX XXXXX XXXX XXX XX X X X XXX X X X XXXXX 6 5 (c) XXXXX XXXXXX X XXXXXX XXXXX XXXXX XXXXX XXXXXX X XXXXXX XXXXXX XXXXX XXXX XXX XX X X X XXX X X X (a) Figure5-10.Potentialcolumnsforblockinginthemountain viewprole Inthelatercasewhencolumn ecol ( k )isshorterthancolumn k thellwillbeatthe samedistanceastherightmostcolumnentryorclosertothed iagonalthantherightmost columnentry.Progressisestablishedinthelatercaseasth ellisclosertothediagonal andtheformercasebecausethellisclosertobechasedout. Figure 5-9 showstheswapcasewithll.Theswapleavesthenumberofent ries unchangedandtherowrotationcreatesonell.Thelltempo rarilyincreasesthe numberofentriesintheskylineprole.Butthenewentryisc losertocolumn n and willeventuallybechasedoutofthematrix.Thuswemakeclea rprogressinthreecases andinonecasewetemporarilyincreasetheprolewithanent rywhichiseasiertoreduce andchaseoutofthematrix.5.2.4PropertiesforBlockingtheBidiagonalReduction AllthepropertieswehavediscussedinSection 5.2.3 leadsustoacorrectalgorithm forsparse R -bidiagonalization.However,thealgorithmwillnotbeabl ockedalgorithm. Weshowthatblockingtheplanerotationsispossiblewhenre ducingtherightmostcorner inthissection. 70

PAGE 71

Theorem5.2. Reducinganycornerintherangeofcolumns trow ( k )+1 :::k ,where k is therightmostcorner,usingaGivensrotationandchasingth ellcauseszerooronellin themountainviewprole. Proof .InthemountainviewprolewedonotneedtheRubble-column lemmaas themountainviewlemmaguaranteesthattheGivensrotation sareappliedbetweentwo rows,say i 1and i ,toeliminateasub-diagonalll,thenewllcanonlybeinen tries ( i 1 ;ecol ( i 1)+1) ::: ( i 1 ;ecol ( i )).Anycorner,bydenition,generatesonlyone sub-diagonalllwhenitscornerentryisreducedtozerowit haGivensrotation.The rubble-rowlemmastillappliestorows trow ( k )+1 :::k .Soanycornerintherange trow ( k )+1 :::k willcausezerooronellinthemountainviewprole. 2 Theprogressstatementfortherightmostcornerapplieseve nwhenweuseTheorem 5.2 However,Theorem 5.2 doesnotapplytotheskylineproleasthemountainviewLemm a 2 doesnotapplytoskylineproleandthellneednotjustbein ( i 1 ;ecol ( i 1)+1) ::: ( i 1 ;ecol ( i ))whenaGivensrotationsisappliedtorows i 1and i .Thistheoremshowsus thattherearemultiplecornersthatcanbereducedwithzero oronell.Whentheyare withinafewcolumns trow ( k )+1 :::k apartthistheoremshowsthatblockingmightbe possible,atleastinthemountainviewprole.Figure 5-10 (a)showsthecolumnsfrom wherewecanchooseourcornersothatwecaneectivelyblock theplanerotations.We canseefromthegurethattherearetwocornersthatcanbere ducedwithzerooronell. Wecanchoosethesecolumnsastheblockandreducethecorner sastheyappearinone ofthesecolumnsafterreducingtheinitialcorners.Butwec ancharacterizethealgorithm betterusingtherightmostmountaintheoremgivenbelow.Theorem5.3 ( RightmostMountainTheorem). Let j betheleastcolumninthe rightmostmountainand k betherightmostcorner,ifwealwaysreducetherightmost cornerandchasethell,thenwhenacolumn m intherange 3 :::j 1 getschosenasthe rightmostcornerforthersttime,theheightofallthecolu mnsintherange j:::n willbe lessthantheoriginalheightof j 71

PAGE 72

Proof .Let m betherstcornerchosenasrightmostcornerintherange3 :::j 1. Column j 1willstillbeatitsoriginalheight,onelessthantheheigh tof j ,as m isthe rstcornerchosenintherange1 :::j 1.Letusassumethereexistsacolumn p inthe range j:::n suchthatheight( p ) (original)height( j ). Asweareconsideringthecolumnsintherubble(totherighto f m ,thecurrent rightmostcorner)wecanusetherubbleheightlemma(Lemma 6 )toseethatheightof p 1shouldremainthesameorincreasefromtheheightof p .Thisleadsto originalheight( j ) height( p ) height( p 1) ::: height(j) Alltheinequalitiesintheaboveequationshouldbeequalit iesasheight(j)cannotincrease as m istherstcornerintherange3 :::j 1.Iftheheightsofallthecolumnsintherange p 1 :::j remainthesame,i.etheinequalitiesaboveareallequaliti es,thenheight( j )= height( j 1)resultinginacornerin j whichistotherightoftherightmostcorner m .A contradiction. 2 Theorem 5.3 showsthatifwereducetherightmostcorneralwaysandchase the llthenwereducetherightmostmountainandtherubbletoah eightlessthanthe baseheightoftherightmostmountainbeforewereduceanyco rnerfromtheleftofthe mountain.Inshort,alwaysreducingtherightmostcornerac tuallyisamethodthatalways reducestherightmostmountain.Wedidnotuseanyofthespec ialpropertiesofthe mountainviewprolehereexceptthedenitions.Theorem 5.3 holdsfortheskylineprole withasmallchange:Allthecolumnsintherange j:::n willhaveheightlessthanor equaltotheoriginalheightof j 1.Theproofremainsthesame. Thisshowsonecaneectivelyblockthereductioninthemoun tainviewprole.The sizeoftheblockdependsonthesizeofthemountain.Thesize oftheblockisdynamic sincethesizeofthemountainchanges.Figure 5-10 (b)showstheblockandrstfour cornersthatwillbereducedtogetherasoneblock.Notethat thoughtherstthree cornersareinthemountain,onecolumnintherubblecontrib utestotheblock.Figure 72

PAGE 73

5-10 (c)showsthesecondmountainandthethreecornersthatwill bereducedasone block.Thereisnorubblecolumnthatcanbeaddedtothisbloc kforthereduction.The cornerentries1 ::: 7intheseguresarechoseninthatorderbecausetherightmo stcorner methodwouldhavechosenthemtoo. Theblockingintheskylineproleisalittlemoretrickier. Itispresentedindetailin Section 5.3.2 .Weseeallthealgorithmstoreducethesparse R inSection 5.3 5.3BidiagonalizationAlgorithms Thesimplestalgorithmtoreduce R usestheresultofTheorem 5.1 thatreducingthe rightmostcornerresultsinzerooronell.Algorithm 5.1 reducestherightmostcornerand usesaplanerotationtoreducethesubdiagonalll.Thisalg orithmdoesnotuseblocking orpipelining. Algorithm5.1 Rightmostcorneralgorithmforbidiagonalization 1: Findallthecornersfrom1 :::n andpushthemintoastack. 2: while therearemorecorners do 3: Poptherightmostcorner k 4: Reducetherightmostcornerentrybyapplyingarotationbet weencolumns k and k 1. 5: Reducethellbyapplyingarotationbetweenrows k and k 1. 6: Inspectcolumns k 1 ;k +1 ; and ecol ( k )fornewcornersandpushthenewcornersto stack. 7: endwhile Wedonotdealwiththellexplicitlyinthisalgorithm.Ifth ellincolumn ecol ( k ) canbechaseditwillbethenewrightmostcornerandchased.I fthellcannotbe chasedthenwereducethenextrightmostcorner.Givenaband matrixinsparseformthe rightmostcorneralgorithmreducesitexactlylikeSchwarz 'sdiagonalreductionmethod (Algorithm 2.2 ).Justlikethediagonalreductionmethodtherightmostcor neralgorithmis notblocked,doesmoreworkandresultsinpoorperformance. Whentherightmostcornermethodissimilartothediagonalr eductionmethod [ Schwarz1963 ]itisnaturaltolookforamethodtoreduce R liketheSchwarz'srow reductionmethod[ Schwarz1968 ].Weneedtochoosetheleftmostcornerandreduceitto 73

PAGE 74

mimicSchwarz'srowreductionmethod.However,theleftmos tcornermaygeneratemore thanonellintheprole.Wecanalterthedenitiontoalway srotatea`validcorner', onethatgeneratesonlyonell,andgivepreferencetothele ftmostvalidcorner.Though wecanproveprogressinthiscase,andthatthereexistsatle astonevalidcorneralways -therightmostcorner,theleftmostcorneralgorithmdoesn otworkverywellinpractice. Thellgeneratedandaccessingcolumnsinacomplexfashion leadtopoorperformance. Itdoesnothavethenicepropertiesoftherightmostcornera lgorithm,soblockingthe leftmostcornermethodturnsouttobedicult.Wedonotdesc ribetheleftmostcorner algorithmindetailhere.Theideaoftheleftmostcorneralg orithmwasfullygivenshape in[ DavisandGolub2003 ]. 5.3.1BlockinginMountainviewProle Algorithm5.2 Blockedrightmostmountainalgorithmformountainview bidiagonalization 1: Findallthecornersfrom1 :::n andpushthemintoastack. 2: while therearemorecorners do 3: Poptherightmostcorner k andndtherightmostmountain 4: while therearemorecornersintherightmostmountain/rubble do 5: Usethe( trow ( k ) ;k )asananchorpointandndablockinthemountain 6: Dividethematrixintoblocksbasedontheblock. 7: Findcolumnrotationstoreducethecornersintheblockanda pplythemtothe columnsinthisblock 8: while therearecolumnrotations do 9: ApplycolumnrotationstoC-block. 10: ApplycolumnrotationstoD-blockinapipelinedorderandge neraterow rotationstochasethell. 11: ApplyrowrotationstoR-block. 12: ApplyrowrotationstoF-blockinapipelinedorderandgener atecolumn rotationstochasethellwhereverpossible. 13: Readjustthefourblockstocontinuethechase 14: endwhile 15: endwhile 16: endwhile Ablockedalgorithmcanbederivedfromtherightmostmounta intheorem( 5.3 )and Theorem 5.2 toreducethemountainviewproleof R .Thebasicideaistoreducethe 74

PAGE 75

mountainviewproleonemountainatatimestartingfromthe rightmostmountain.When thenumberofcolumnsinthemountainbecomestoolargewecan limittheblocksizeand reduceonemountaininmultiplesteps. Wechoosetherightmostcornerentryasouranchorpointforo urblockedrightmost mountainalgorithmandndablockaroundthatanchorpoint. Thisblockservesasour seedblocktondalltheentriesthatcanbereducedinoneste p.Findingablockaround theanchorpointinthemountainviewproleissimilartond ingtheblockintheband reductionalgorithminChapter 3 .Oncewehavetheseedblockdenedwecanreduce thecornerswithintheseedblock,dividetherestofthematr ixintoblocksandpipeline theplanerotationssimilartothealgorithmsinSections 3.2 and 3.3 ,butthereare somedierencesasweoperateonasparse R .Whenwechasethellsomellischasable andsomellisleftbehind,sotheindexingisnotassmoothas itwasinthebandcase. Instead,weneedtostoretheindicesofcolumnsandrowsalon gwiththeplanerotations andapplyittothecorrespondingcolumns/rows. Algorithm 5.2 reducesthemountainviewproleinablockedfashionbyredu cing therightmostmountaininoneormorestepsdependingontheb locksizeandthesize ofthemountain.Theblockedrightmostmountainalgorithme stablishesthatblocking andpipeliningispossibletoreduceasparseuppertriangul ar R .However,itoperateson mountainviewproleof R andthesizeofthemountainviewproleisquitelarge.The blockedalgorithmmustworkontheskylineproletobeofpra cticaluse. 5.3.2BlockinginSkylineProle Amongthetheoremsweusedtoarriveatablockedalgorithmfo rthemountainview proleTheorem 5.3 andLemma 6 applytotheskylineprole.Butthemcolumnheight propertyandTheorem 5.2 donotapplytotheskylineprole.Thismeanswestillreduce onemountain(therightmostone)torubbleintheskylinepro lewhenweusethe rightmostcorneralgorithm,butwecannotconsidertheenti remountainasoneblock 75

PAGE 76

pk XXXX XXXX XXX XXXXX XXXXX XX XXX XX XXXXX XXXX XXXXX XXXX XXX XX X X X X X X X Figure5-11.Problemwithsimpleblockingintheskylinepro le asthellneednotbejustintherubble.Thellcouldbeinone ofthecolumnsofthe skylinethatcorrespondtothemcolumnsoftherightmostmou ntain. Saythemcolumnsoftherightmostmountainare j:::k where k istherightmost corner.Giventheskylineprole,wecanndacolumn p suchthat j
PAGE 77

weinspectthecolumnstotheleftofit,rowsbelowitandcolu mnstotherightofitand expandtheS-blocktogetagoodnumberofcornerstobeblocke dandreducedtogether. However,afewrestrictionsmustbeplacedontheS-blocksot hatwealwayslimitthell totherubbleandweareabletopartitiontherestofthematri xcorrectlyintoblocks. Algorithm5.3 FindingtheS-blockforblockedskylinereduction 1: Settheinitialvaluesof lrow to trow ( k 1), fcol to k 1and lcol to k 2: SettheinitialvaluesfortherstrowofD-blkto k 1andrstcolumnofF-blkto ecol ( k 1). 3: InspectcolumnstotheleftofS-blockandincludecolumnssa tisfyingtheconditionsfor S-block. 4: InspectrowbelowtheS-blockandincluderowsatisfyingthe conditionsforS-block 5: InspectcolumnstotherightofS-blockandincludecolumnss atisfyingtheconditions forS-block. 1.The trow ofallthecolumnsintheS-blockshouldbeintheS-block.Ino therwords lrow shouldbeatleastequaltothemaximumof trow valuesofallthecolumnsin S-block. 2.The fcol islimitedbytheleastcolumninthemountain. 3.TheS-blockandtheD-blockcannotsharearow.4.TheS-blockandtheF-blockcannotshareacolumn. ConditiononeavoidstheproblemshowninFigure 5-11 aswedonotdoanyrow rotationsintheS-blockwecannotcreateanyllinthemcolu mns.Conditiontwois notstrictlynecessary.Wecanrelaxittosay fcol islimitedbythecolumn trow ( k ),but thatgreatlycomplicatestheblockstructure,sowerestric ttheblocktothemountain. Conditionthreeandfourensurewecanpartitionthematrixi ntoblockscorrectly.The algorithmtondtheS-blockchecksfortheseconditionsand increasestheblocksize toutilizepipeliningandcachingforperformance.Thealgo rithmtondtheS-blockis listedin 5.3 .OncewendtheS-blockthealgorithmtoreducetheskylinei ssimilar totheblockedrightmostmountainalgorithmformountainvi ewreductionalgorithm (Algorithm 5.2 ).ThealgorithmtoreducetheskylineisshowninAlgorithm 5.4 77

PAGE 78

Algorithm5.4 Blockedrightmostmountainalgorithmforskylinebidiagon alization 1: Findallthecornersfrom1 :::n andpushthemintoastack. 2: while therearemorecorners do 3: Poptherightmostcorner k andndtherightmostmountaincolumns 4: while therearemorecornersintherightmostmountain/rubble do 5: Usethe( trow ( k ) ;k )asananchorpointandndablockusingAlgorithm 5.3 6: Dividethematrixintoblocksbasedontheblock. 7: Findcolumnrotationstoreducethecornersintheblockanda pplythemtothe columnsinthisblock 8: if therearenewcornersthatgotgeneratedwhicharenotwithin theboundaries oftheseedblock then 9: pushitinthestackfornextiteration. 10: endif 11: while therearecolumnrotations do 12: ApplycolumnrotationstoC-block. 13: ApplycolumnrotationstoD-blockinapipelinedorderandge neraterow rotationstochasethell. 14: ApplyrowrotationstoR-block. 15: ApplyrowrotationstoF-blockinapipelinedorderandgener atecolumn rotationstochasethellwhereverpossible. 16: Readjustthefourblockstocontinuethechase 17: endwhile 18: endwhile 19: endwhile Weillustratetheblockedskylinealgorithmwithanexample skyline R inFigure 5-12 TherstS-blockasdeterminedbythealgorithmtondtheS-b lockisshownin Figure 5-12 (a).Allthecornersarehighlightedandthecornerthatisgo ingtobereduced isnumbered.Corner1resultsinaswap.Figure 5-12 (b)showstheresultoftheswapwith 0sintheswappedcornerandthetwonewcornersthatwerecrea tedasaresultofthe swap.Thethreecorners1 ; 2 ; 3inFigure 5-12 (a)-(b)arereducedinonestepbytheskyline algorithm.Thecornernumbered3isfromtherubblebutwasin cludedintheS-block tohelpblocking.Noticethatthecorrespondingblockinthe mountainviewprolewas reducedwithfourrotationsbythemountainviewalgorithmi nFigure 5-10 Figure 5-12 (c)showsthenewlycreatedzerosandnextS-blockforthered uction. ThethirdS-blockisshowninFigure 5-12 (d).Column4wasnotincludedaspartofthis S-blockeventhoughitispartofthesamemountainbecauseit wouldhavecreateda 78

PAGE 79

(d) (a) (b) (c) (e) (f) XXXX XXXXX 0 0 XXXXX 0XXXXX XXXXX XXXXX XXXXX XXX XX XXXXX XXXX XXX XX X X X XXX X X 7 X X X XXXX XXXXXX X XXXXXX XXXXX XXXXX XXXXX XXXXX 1 XXXXX XXXXX XXXXX XXXX XXX XX X X X X X X X XXXX XXXXXX X XXXXXX XXXXX XXXXX XXXXX XXXXX 2 0 XX 0 X XX X 3 XX X XXXXX XXXX XXX XX X X X X X X X 0 XXXX XXXXX 5 4 XXXXX 6 XXXXX XXXXX XXXXX XXXXX 0 XXX XX XXXXX XXXX XXX XX X X X X X X X X 0 XX XX XX 9 X XXXXX XXXXX XXXXX XXXXX XXXXX XXXXX XXX XX XXXXX XXXX XXX XX X X X XXX X X 0 X X X X XX X 0 XXXX 10 XXXXX XXXXX XXXXX XXXXX XXXXX XXX XX XXXXX XXXX XXX XX X X X XXX X XX X X X Figure5-12.Firstthreeblocksandtencornersoftheskylin ereductionalgorithm llincolumn5.Ascorner7resultsintwoswaps(corner8fort hesecondswapisnot shown)wegettwonewcorners(showninFigure 5-12 (e)),butthecornerincolumn5is notreducedinthecurrentS-blockascolumn4isnotinthecur rentS-block.Thisstepis handledinLine 8 ofAlgorithm 5.4 .Wereducecorner9andthencorner10asshownin Figure 5-12 (e)-(f).Corners7 ::: 10arepartofonestepintheblockedreduction. Notethateventhoughweusethemcolumnstondthepossibles izeoftheS-block, wenarrowtheS-blockinAlgorithm 5.3 ,sotheblocksaresmallerthantheblocksinthe mountainviewreductionalgorithm.However,weoperateont heskylineprolewhich ismuchmoreeconomicalthanthemountainviewproleandasa resultthenumberof operationsisreduced. 79

PAGE 80

not a sink, a mcolumn Sink k Slope Figure5-13.Classicationofcolumnsintheskylineprole IntheskylinealgorithmtheS-blockcanhavemorethanoneco rnerinitasinthe Figure 5-12 (b).Thiscannothappeninthemountainviewprole.Eachcor nerinthe S-blockcausesatmostonellbasedonthewaywefoundtheS-b lock,sowecanchoose toreducetheminanyorder.Currentlyweeliminatetheleftm ostcornerwithintheblock sothatwecanreducetheoperationcount.Wecallthisthe rightmostmountainleft corner algorithm.Givenabandmatrixthisalgorithmmimicsthe`ro wmode'describedin Section 3.5 .Insteadofchoosingtheleftmostcornerallthetimeifwech ooseonerightmost cornerafterevery c (say32)leftmostcornerswemimicthe`blockmode'fromSect ion 3.5 Thereisanimplementationissuewehavenotaddressedinthi ssectionsofar.The bandreductionAlgorithm 3.1 chasesallthelltoendofthematrix.Ontheother handtheskylinereductionalgorithmhastoleavesomeofthe llthatisnotchasable intheprolewhichrequiresadynamicdatastructure.Asymb olicfactorizationforthe skylinereductionthatestimatesthespacerequirementsfo reachcolumnispresentedin Section 5.4 5.4SymbolicFactorization WeusethedenitionsfromSection 5.2.2 toarriveatthesymbolicfactorization algorithm.Noticethatthedenitionareusingthemountain viewprolethoughtheyapply 80

PAGE 81

k j p k k-1 k k-1 j j-1 j j-1 sink p p-1 if ecol(k) = j and trow(j) = k Figure5-14.Symbolicfactorizationoftheskylineproletoboththeproles.Figure 5-13 illustratestheneedforusingthemountainviewproleto denethemcolumnandsinksintheskylineprole.Column k 3inthegureisnota sinkbutanmcolumn.Weclassifyitassuchbecausecolumn k 3cannotgetlledwhen weuseAlgorithm 5.4 forreducingtheskyline.Itcannotbeasink.Wecanalsowrit e thisas,column j inskylineproleisasinkifandonlyif trow ( j 1)+1
PAGE 82

there,isin ecol ( k )whichimpliesifthellisincolumn j then ecol ( k )= j .Wealsoknow thatthellisinrow k 1(astherowrotationisbetweencolumns k and k 1)which impliesthat trow ( j )= k ,sowecansayfromthedenitionoftheedge( k;j )ispresentin F Ontheotherhand,iftheedge( k;trow ( k ))ispresentthenweknow ecol ( k )= j and trow ( j )= k .Weknow trow ( j 1)
PAGE 83

1.Ifwereducetherightmostcornerandfollowthepathusing trow and ecol values andreachasinkwehaveanallthatisnotchasableandthecol umnheighthasto grow. 2.Iftherightmostsinkis s ,thenthecolumnsfrom s +1 :::n cannotgrowaslongas thesinkremainsthesame. Asimplesymbolicfactorizationalgorithmistosimulateth eentirefactorization butwithoutdoinganynumericalwork.Asitoperatesonlyon trow and ecol values thisalgorithmtakesonlyasmanyasstepsasthenumberofpla nerotationswhichis considerablylessthantheactualroatingpointoperations .However,givenabandmatrix, whichdoesnotneedanyadditionalspaceatall,thesymbolic factorizationwouldreduce theentirematrixsymbolicallytondthatitdoesnotrequir eanyspace. Therightmostsinkconditiongivesusagoodstoppingcriter ionforthesymbolic factorization.Notethatthebandmatrixhasnosinksandifw euseitasastopping criterionthesymbolicfactorizationstopsimmediatelyaf terndingthattherenosinks. Algorithm 5.5 modiesthesimplesymbolicfactorizationtousethisstopp ingcriterion. Algorithm5.5 Symbolicfactorizationfortherightmostcornermethod 1: Findallthecornersfrom1 :::n andpushthemintoastack. 2: Inspectallthecolumnsfrom1 :::n andndtherightmostsink.( s r ) 3: while therearemorecornersandthereexistsatleastonesink do 4: Poptherightmostcorner k 5: Followthepathfromcolumn k ingraph F untilasink( s )oruntilnoll. 6: Inspectcolumns k 1 ;k +1 ; fornewcornersandpushthenewcornerstostack. 7: Inspectcolumns k;k +1 ; and s;s +1forsinkstatusandupdatetherightmostsinkif necessary. 8: if rightmostsink( s r )isnolongerasink then 9: scancolumns s r 1 ::: 1untilyoundthenextrightmostsink. 10: endif 11: endwhile NotethatLine 12 inAlgorithm 5.5 tomaintaintherightmostsinkmayappearcostly, butfromthewaythesinksgetupdatedwecanshowthatthecost tondtherightmost sinkis O ( n )andamortizedcosttoupdatetherightmostsinkis O (1).Itiseasytoseethat theinitialcosttondtherightmostsinkis O ( n )aswecheckthe trow valuesforallthe 83

PAGE 84

columns.Notethatiftherightmostsink s r isnolongerasink(sayifitbecameaslope) wescanthecolumnsright-to-lefttondthenewrightmostsi nk.Ifthisistheonlytypeof changetothesinksthenwewillmovebacktocolumn1withanot her O ( n )cost. However,asaresultofofthellcreatedbyarotationandthe swapsnewsinksget created.Asinkcreatedbyaswapcannotbetherightmostsink .Ifanewrightmostsink getscreatedasresultofllthenithastobein s r +1.Wecanthenupdatetherightmost sinkin O (1)time.Themovetowards n willpayforthemovebackto1againwhenwedo theright-to-leftscantondthenextrightmostsink.Wedon otmaintainalistofallthe sinks.Instead,wetestitusingthe trow datastructureeverytimeweneeditasweaccess thedatastructuretotraversethellgraph. Algorithm 5.5 modiedthenon-blockedrightmostcornermethodforthesym bolic factorization.Wecanapplythesamechangestotheblockeda lgorithmsaswelltogeta simplesymbolicfactorizationalgorithmfortheblockedsk ylinereduction.Algorithm 5.6 computesasymbolicfactorizationfortheblockedskyliner eduction.Notethatwehave removedthechasefromAlgorithms 5.5 and 5.6 asfollowingthepathinthellgraph isequivalenttothechase.Thellfromthesetwoalgorithms canbedierentasthethe blockedskylinereductionalgorithmwillreducetheleftmo stcornerwithintheseedblock. 5.5PIRO SKY:PipelinedPlaneRotationsforSkylineReduction PIRO SKYisthelibrarythatimplementsthesparsebidiagonalred uctionalgorithms discussedinthischapter.PIRO SKYstandsforPIpelinedplaneROtationsofsparse SKYlinematrices.PIRO SKYhasbothaMATLABandCinterface.PIRO SKYworks forrealsparsematrices,withnativeinteger(32-bitand64 -bit)support,anddouble precisionarithmetic.UnlikePIRO BAND,PIRO SKYexpectsC99compatibility. TheMATLABinterfaceofPIRO SKYndsthesingularvaluedecompositionof sparsematricesusingourblockedalgorithms.Tocomputeth esingularvaluedecomposition wedependon symrcm [ CuthillandMcKee1969 ]ortheimplementationsoftheGPS algorithm[ Gibbsetal.1976 ]forndingaprolereductionorderingforthematrix A T A 84

PAGE 85

Algorithm5.6 Symbolicfactorizationfortheblockedrightmostmountain algorithmfor skylinebidiagonalization 1: Findallthecornersfrom1 :::n andpushthemintoastack. 2: Inspectallthecolumnsfrom1 :::n andndtherightmostsink.( s r ) 3: while therearemorecornersandthereexistsatleastonesink do 4: Poptherightmostcorner k andndtherightmostmountaincolumns 5: while therearemorecornersintherightmostmountain/rubble do 6: Usethe( trow ( k ) ;k )asananchorpointandndablockusingAlgorithm 5.3 7: Foreachcorner k c intheS-blockfollowthepathfrom k c ingraph F valuesuntila sink( s )oruntilnoll. 8: if therearenewcornersthatgotgeneratedwhicharenotwithin theboundaries oftheseedblock then 9: pushitinthestackfornextiteration. 10: endif 11: Inspectcolumns k;k +1 ; and s;s +1forsinkstatusandupdatetherightmostsink ifnecessary. 12: if rightmostsink( s r )isnolongerasink then 13: scancolumns s r 1 ::: 1untilyoundthenextrightmostsink. 14: endif 15: endwhile 16: endwhile andthenusethatforthecolumnorderingof A .WedependonSPQR[ Davis2009a b ]for computingthe QR factorization.PIRO SKYthenreducesthesparseuppertriangular R tobidiagonalform.WecanthenuseLAPACKtoreducethebidia gonalmatrixto diagonalform[ GolubandReinsch1970 ; DemmelandKahan1990 ; Deiftetal.1991 ].We provideoneinterface piro sky svd whichdoesalltheabovestepstocomputethesingular valuedecomposition.Wealsoprovidetwocustominterfaces forcomputingthesymbolic factorizationofthesparseuppertriangular R ( piro sky symbolic )andforcomputingthe bidiagonalreductionitself( piro sky reduce ). piro sky reduce canoperateintwomodes.Inthestaticmode,itcancomputeth e symbolicfactorization,allocatetherequiredamountofme moryanddothereductionin place.Inthedynamicmode,itcanallocatethespaceforthes kylinematrixdynamically andcomputethereduction.Thedynamicmodeisusefulwhenth espacerequiredby thefactorization(ascomputedbythesymbolicfactorizati on)isunavailable.PIRO SKY managesthememorydynamicallyinthelatercaseandrealloc atesmemoryasdesired. 85

PAGE 86

Table5-1.MATLABinterfacetoPIRO SKY MATLABfunctionDescriptionofthefunction piro sky reduce Bidiagonalreductionroutineforsparsematrices piro sky svd ComputestheSVDofasparsematrix piro sky symbolic Symbolicfactorizationforthebidiagonalreduction Table5-2.CinterfacetoPIRO SKY CfunctionDescriptionofthefunction piro sky set common: InitializesPIRO SKYwithdefaultvalues. piro sky allocate symbolic: Allocatesthedatastructureforthesymbolicfactorization. piro sky symbolic: Computesthesymbolicfactorizationforagivensparsematrix. piro sky allocate skyline: Allocatestheskylinematrixbasedonthecountsfromsymbolicfactorization. piro sky sparse to skyline: Convertsacompressedcolumnformatsparsematrixtoaskylinematrix. piro sky reduce: Computesthebidiagonalreductionoftheskylinematrix. piro sky free skyline: Freesspaceusedbytheskylinedatastructure. piro sky free symbolic: Freesspaceusedbythesymbolicdatastructure. Thereductionalsoacceptsanapproximatebandwidthandcan reducetheskylinematrix toabandmatrix.Itthenconvertstheskylinetoabanddatast ructureandusestheband reductionalgorithm(PIRO BAND)tocomputethebidiagonals. ThesymbolicfactorizationinPIRO SKYcomputesthecolumncountsrequiredfor sparseskylinematrices.Italsoreportsthetotalnumberof planerotationsthatwould berequiredtodothereductionandthenumberofswapsthatwo uldtakeplace.The swapswhichlookcheaperbecausetheydonotdoanyroatingpo intworkareactually expensive.Whentwocolumnsarenotadjacenttoeachotherin memorytheswapcannot bedoneinplaceandwemaynotbeabletotonecolumninthepla ceofanother.This couldhappenwhenweskipsymbolicfactorizationandoperat einthedynamicmode. PIRO SKYdynamicallyreallocatesspacetocontinuewiththeredu ctionthen. TheMATLABinterfacesforPIRO SKYaresummarizedinTable 5-1 .TheC interfacetoPIRO SKYprovidesasetoffunctions.Thefunctionsandabriefdes cription ofthemareshowninTable 5-2 .TheCfunctionsarelistedintheexactorderasthey 86

PAGE 87

10 0 10 1 10 2 10 3 0 20 40 60 80 100 120 relative run time, vs. bestnumber of matricesPIRO_SKY Relative Performance (profile) n/5 MATLAB svds PIRO_SKY 10 0 10 1 10 2 10 3 0 20 40 60 80 100 120 relative run time, vs. bestnumber of matricesPIRO_SKY Relative Performance (profile) n/10 MATLAB svds PIRO_SKY Figure5-15.PerformanceofPIRO SKYvsMATLAB svds willbeusedinanexample.Theuserscansetanarrayofparame tersusingthedata structuresintheCinterfacetonetunetheperformanceofP IRO SKYincludingsizeof theworkspace,blocksize,static/dynamicmemoryallocati on,elbowroomforcolumns, approximatebandwidthandfunctionsusedtoallocateandfr eememory.PIRO SKYis notpubliclyavailableyet.Weplantomakeitpubliclyavail ableasanACMTransactions ofMathematicalSoftwarecollectedalgorithm. 5.6PerformanceResults WecompareourPIRO SKYimplementationto svds inMATLAB.MATLABuses ARPACK[ Lehoucqetal.1997 ; LehoucqandSorensen1996 ]tocomputethesingular valuesofasparsematrix.AsMATLAB svds isnotdesignedcomputeallthesingular valuesofasparsematrixandouralgorithmisdesignedtocom puteallthesingularvalues wedonotcomparetheminthecaseofcomputingallthesingula rvalues.Instead,givena sparsematrix A ofsize n -byn ,wecompute n k singularvaluesforvariousvaluesof k using 87

PAGE 88

10 0 10 1 10 2 0 20 40 60 80 100 120 relative run time, vs. bestnumber of matricesPIRO_SKY Relative Performance (profile) n/25 MATLAB svds PIRO_SKY 10 0 10 1 10 2 0 20 40 60 80 100 120 relative run time, vs. bestnumber of matricesPIRO_SKY Relative Performance (profile) n/50 MATLAB svds PIRO_SKY Figure5-16.PerformanceofPIRO SKYvsMATLAB svds MATLAB svds andcompute all thesingularvaluesusingourmethodandcomparethe results.Thisshouldgiveabetterideaofwhenwecanusethei terativemethodandwhen wecanswitchtoourdirectmethodandcomputeallthesingula rvaluesinstead. Allthetestsweredonewithdoubleprecisionarithmeticwit hrealmatricesfrom theUFsparsematrixcollection[ Davis2009c ].Weusedrealmatriceswith n< 5000. Theperformancemeasurementsweredoneonamachinewith8du alcore2.2GHzAMD Opteronprocessorsand64GBofmainmemory.PIRO BANDwascompiledwith gcc 4.1.1 withtheoptions -O3 .Weused svds inMATLAB7.6.0.324(R2008a). PIRO SKYwasconguredtoalwaysusethesymbolicfactorization, allocatethe spacefortheskylineandoperateonastaticdatastructure. Furthermore,weallowed50% morespacethantheoriginalnumberofnon-zerosin R (notintheskylineof R )togetan approximatebandwidthin R belowwhichmostentrieslie.Wethenusethebandwidthin thesymbolicandnumericalreductiontoreducethematrixto abandmatrix.Wethenuse PIRO BANDlibrarytoreducethebandmatrixtobidiagonalform.We useMATLAB's 88

PAGE 89

sparse qr and symrcm forthe QR factorizationandprolereductionsteps.Wereduced thebidiagonalmatrixtodiagonalformusingLAPACK.PIRO SKY'stimetocomputethe SVDincludesthetimetocomputeallthesesteps. Theresultswhenwecompute n 5 and n 10 singularvaluesarepresentedinFigure 5-15 Weuseaperformanceproletocomparethetwomethods.Inbot hcasesPIRO SKY isfasterthanMATLAB's svd forallthematricesandcomputesallthesingularvalues. Figure 5-16 showstheresultsforcomputing n 25 and n 50 singularvalues.When n 25 singular valuesarecomputedbythe svds theperformanceofbothalgorithmsarecompetitive exceptinafewmatriceswherePIRO SKYdoesnotperformverywell.Whenwe compute n 50 singularvalues svds performsbetterinmorethanhalfthematricesand theperformanceissimilarintheotherhalfofthematrices. Tosummarize,PIRO SKY computes all thesingularvaluesinthetimeittakestocomputejust4%ofs ingularvalues using svds anddoesmuchbetterwhenmoresingularvaluesarerequired. 89

PAGE 90

CHAPTER6 CONCLUSION Thesingularvaluedecompositionisaproblemthatisusedin widerangeof applications.Westudiedthisproblemfortwospecicclass ofmatrices:abandmatrix andasparsematrix.Ourmajorcontributionstotheseproble msare: Wedevelopedablockedalgorithmforbandreductionthatgen eratesnomorell thananyscalarbandreductionmethodusingpipelining.The workspacerequired forouralgorithmismuchsmaller(equaltothesizeoftheblo ck)thanthework spacerequiredbycompetingmethods( O ( n )to O ( nnz )).Ourblockedmethods performslightlymoreworkthantheSchwarz'srowreduction methodbutlessthan bothLAPACKandSBRbandreductionalgorithms.Whentheplan erotationsare notaccumulated,ouralgorithmisabout2timesfasterthanS BRand3.5times fasterthanLAPACK.Ourblockedalgorithmmanagestobalanc eallthethree requirements,cacheutilization,numberofroatingpointo perationsandworkspace requirement. Wedevelopedanalgorithmtoaccumulatetheplanerotations whileutilizingthe nonzerostructureof U and V forourblockedalgorithm.Ouralgorithmreducesthe numberofplanerotationsbyhalf,hencethenumberofropsin accumulationsby half,ascomparedtoLAPACK'saccumulationalgorithm.When theplanerotations areaccumulatedourspeedupisabout2xandand2.5xagainstL APACKandSBR. Wedevelopedalibraryforbandreductionthatsupportsallt herequireddatatypes withbothCandMATLABinterfaces.PIRO BANDwillbemadeavailableasa collectedACMTOMSalgorithm. Wedevelopedthetheoryforsparseskylinebidiagonalreduc tion,andbothblocked andnon-blockedalgorithmsfortheproblem.Wealsodevelop edthetheoryand thealgorithmforsymbolicfactorizationforthebidiagona lreductionmethod.Our bidiagonalreductionmethodusesecientsparsedatastruc turesandcomputesall thesingularvaluesofasparsematrixinthesametimeittake sfortheiterative methodtocomputejust4%ofthesingularvalues.Itperforms farbetterwhenmore singularvaluesarerequired.Noprioralgorithmexistsfor ecientlycomputing all thesingularvaluesofasparsematrix. Wedevelopedalibraryforbidiagonalreductionofsparsema tricesandndingthe sparsesingularvaluedecomposition.PIRO SKYsupportsbothCandMATLAB interfaces.PIRO SKYwillbemadeavailableasacollectedACMTOMSalgorithm. Ourlibraryforsparsebidiagonalreductionstilldoesnots upportcomplexmatrices. Complexsupportwillbepartofthelibrarywhenitismadepub liclyavailable. 90

PAGE 91

Itisnotexpensivetocomputethesymbolicfactorizational gorithmsdescribedin Section 5.4 ,butatheoreticalquestionstillremainsopen.Giventhel lgraph F itwill beinterestingtocomputethesymbolicfactorizationwitho utsimulatingtherotations ofnumericalreduction.However,forallpracticalpurpose soursymbolicfactorization algorithmworkswell. Thetechniquestoadoptourblockedreductionmethodstomul ticorearchitectures isanotherinterestingproblemthatstillremainstobestud ied.Theproblemhasbeen studiedinthecontextofreducingadensematrixtobandform [ Ltaiefetal.2009 ].The bandreductionalgorithmcouldcoupletheblockingwiththe techniqueofmovingthe blocksaxeddistancebeforereducingthenextblock,asitw asdonein[ Kaufman2000 ] forsingleentries.However,thesynchronizationbetweent heblockshavetobeworkedout carefullyforadoptingtheblockedalgorithmtomulticorea rchitectures.Bandreductionin parallelarchitectureshasbeenstudiedforsometimethoug h[ Lang1993 1996 ].Ecient bidiagonalreductionofthesparseskylinematrixinmultic orearchitecturesstillremainsan openproblem. 91

PAGE 92

REFERENCES Anda,A.A.andPark,H. 1994.Fastplanerotationswithdynamicscaling. SIAMJ. MatrixAnal.Appl.15, 1,162{174. Anderson,E. Bai,Z. Bischof,C. Blackford,S. Demmel,J. Dongarra,J. DuCroz,J. Greenbaum,A. Hammarling,S. McKenney,A. andSorensen, D. 1999. LAPACKUsers'Guide ,Thirded.SocietyforIndustrialandApplied Mathematics,Philadelphia,PA. Barlow,J.L. Bosner,N. andDrmac,Z. 2005.Anewstablebidiagonalreduction algorithm. LinearAlgebraAppl.397 ,35{84. Berry,M.W. 1992.Largescalesparsesingularvaluedecompositions. Inter.J.of SupercomputerAppl.6, 1,13{49. Berry,M.W. Dumais,S.T. andO.Brien,G.W. 1994.Usinglinearalgebrafor intelligentinformationretrieval. SIAMRev.37, 4,573{595. Billsus,D.andPazzani,M.J. 1998.Learningcollaborativeinformationlters.In ICML'98:Proc.FifteenthInter.Conf.Mach.Learn. MorganKaufmannPublishers Inc.,SanFrancisco,CA,USA,46{54. Bindel,D. Demmel,J. Kahan,W. andMarques,O. 2002.OncomputingGivens rotationsreliablyandeciently. ACMTrans.Math.Soft.28, 4,206{238. Bischof,C.H. Lang,B. andSun,X. 2000a.Algorithm807:TheSBR toolbox|softwareforsuccessivebandreduction. ACMTrans.Math.Softw.26, 4, 602{616. Bischof,C.H. Lang,B. andSun,X. 2000b.Aframeworkforsymmetricband reduction. ACMTrans.Math.Soft.26, 4,581{601. Blackford,L.S. Demmel,J. Dongarra,J. Duff,I. Hammarling,S. Henry, G. Heroux,M. Kaufman,L. Lumsdaine,A. Petitet,A. Pozo,R. Remington,K. andWhaley,R.C. 2002.Anupdatedsetofbasiclinearalgebra subprograms(BLAS). ACMTrans.Math.Softw.28, 2,135{151. Bosner,N.andBarlow,J.L. 2007.Blockandparallelversionsofone-sided bidiagonalization. SIAMJ.MatrixAnal.Appl.29, 3,927{953. Chan,T.F. 1982.Animprovedalgorithmforcomputingthesingularvalu e decomposition. ACMTrans.Math.Soft.8, 1,72{83. Cuthill,E.andMcKee,J. 1969.Reducingthebandwidthofsparsesymmetric matrices.In Proc.196924thNat.Conf. ACM,NewYork,NY,USA,157{172. 92

PAGE 93

Davis,T.A. 2009a.Algorithm8xx:SuiteSparseQR,amultifrontalmulti threadedsparse QRfactorizationpackage.Submittedto ACMTrans.Math.Softw. ,RetrievedOctober 7,2009from http://www.cise.ufl.edu/ ~ davis/ Davis,T.A. 2009b.Multifrontalmultithreadedrank-revealingsparse QR factorization.Submittedto ACMTrans.Math.Softw. ,RetrievedOctober7,2009 from http://www.cise.ufl.edu/ ~ davis/ Davis,T.A. 2009c.TheUniversityofFloridasparsematrixcollection. Submittedto ACMTrans.Math.Softw. ,RetrievedOctober7,2009from http://www.cise.ufl.edu/ ~ davis/ Davis,T.A. Gilbert,J.R. Larimore,S.I. andNg,E.G. 2004a.Algorithm836: COLAMD,acolumnapproximateminimumdegreeorderingalgor ithm. ACMTrans. Math.Softw.30, 3,377{380. Davis,T.A. Gilbert,J.R. Larimore,S.I. andNg,E.G. 2004b.Acolumn approximateminimumdegreeorderingalgorithm. ACMTrans.Math.Softw.30, 3, 353{376. Davis,T.A.andGolub,G.H. 2003.Personalcommunications. Deerwester,S. Dumais,S.T. Furnas,G.W. Landauer,T.K. andHarshman,R. 1990.Indexingbylatentsemanticanalysis. J.Amer.Soc.Inf.Sci.41 391{407. Deift,P. Demmel,J. Li,C. andTomei,C. 1991.Thebidiagonalsingularvalue decompositionandhamiltonianmechanics. SIAMJ.Numer.Anal28 ,1463{1516. Demmel,J.andKahan,W. 1990.Accuratesingularvaluesofbidiagonalmatrices. SIAMJ.Sci.Stat.Comput11 ,873{912. Dongarra,J.J. DuCroz,J. Hammarling,S. andDuff,I.S. 1990.Asetof level3basiclinearalgebrasubprograms. ACMTrans.Math.Softw.16, 1,1{17. Drineas,P. Kannan,R. andMahoney,M.W. 2004.Fastmontecarloalgorithms formatricesII:Computingalow-rankapproximationtoamat rix. SIAMJ.Comp.36 2006. Gentleman,W.M. 1973.LeastsquarescomputationsbyGivenstransformation s withoutsquareroots. IMAJ.Appl.Math.12 ,329{336. Gibbs,N.E. Poole,WilliamG.,J. andStockmeyer,P.K. 1976.Analgorithm forreducingthebandwidthandproleofasparsematrix. SIAMJ.Numer.Anal.13, 2, 236{250. Givens,W. 1958.Computationofplaneunitaryrotationstransforming generalmatrixto triangularform. J.SIAM6, 1,26{50. 93

PAGE 94

Goldberg,K. Roeder,T. Gupta,D. andPerkins,C. 2001.Eigentaste:A constanttimecollaborativelteringalgorithm. Info.Retrieval4, 2,135{151. Golub,G.H.andKahan,W. 1965.Calculatingthesingularvaluesandpseudo-inverse ofamatrix. SIAMJ.Numer.Anal.2, 2,205{224. Golub,G.H.andReinsch,C. 1970.Singularvaluedecompositionandleastsquare solutions. Numer.Math.14, 5,403{420. Golub,G.H.andvanLoan,C.F. 1996. MatrixComputations ,3rded.TheJohns HopkinsUniversityPress. Hager,W.W. 2002.Minimizingtheproleofasymmetricmatrix. SIAMJ.Sci. Comput.23, 5,1799{1816. Hammarling,S. 1974.AnoteonmodicationstotheGivensplanerotation. IMAJ. Appl.Math.13 ,215{218. Kaufman,L. 1984.Bandedeigenvaluesolversonvectormachines. ACMTrans.Math. Softw.10, 1,73{85. Kaufman,L. 2000.Bandreductionalgorithmsrevisited. ACMTrans.Math.Soft.26, 4, 551{567. Kolda,T.G.andO'leary,D.P. 1998.Asemidiscretematrixdecompositionfor latentsemanticindexingininformationretrieval. ACMTrans.onInf.Syst.16 ,322{346. Koren,Y. 2008.Factorizationmeetstheneighborhood:amultifacete dcollaborative lteringmodel.In KDD'08:Proc.14thACMSIGKDDInt.Conf.KnowledgeDiscover y DataMin. ACM,NewYork,NY,USA,426{434. Lang,B. 1993.Aparallelalgorithmforreducingsymmetricbandedma tricesto tridiagonalform. SIAMJ.Sci.Comput.14, 6,1320{1338. Lang,B. 1996.Parallelreductionofbandedmatricestobidiagonalf orm. Parallel Comput.22, 1,1{18. Larsen,R.M. 1998.Lanczosbidiagonalizationwithpartialreorthogona lization. RetrievedOctober7,2009from http://soi.stanford.edu/ ~ rmunk/PROPACK/ Lehoucq,R.andSorensen,D.C. 1996.Derationtechniquesforanimplicitly re-startedarnoldiiteration. SIAMJ.MatrixAnal.Appl17 ,789{821. Lehoucq,R.B. Sorensen,D.C. andYang,C. 1997.ARPACK users'guide:Solutionoflargescaleeigenvalueproblemsw ithimplicitly restartedArnoldimethods.RetrievedOctober7,2009from http://www.caam.rice.edu/software/ARPACK/UG/ug.html 94

PAGE 95

Ltaief,H. Kurzak,J. andDongarra,J. 2009.Parallelbandtwo-sidedmatrix bidiagonalizationformulticorearchitectures. IEEETrans.ParallelandDist.Sys. (to appear),RetrievedOctober7,2009from http://icl.cs.utk.edu/ ~ ltaief/ Murata,K.andHorikoshi,K. 1975.Anewmethodforthetridiagonalizationofthe symmetricbandmatrix. Infor.Proc.Japan15 ,108{112. Paterek,A. 2007.Improvingregularizedsingularvaluedecomposition forcollaborative ltering.In Proc.KDDCupandWorkshop Pryor,M.H. 1998.TheEectsofSingularValueDecompositiononCollabo rative Filtering.Tech.Rep.PCS-TR98-338,DartmouthCollege,Co mputerScience,Hanover, NH.June. Rajamanickam,S.andDavis,T.A. 2009a.Blockedbandreductionforsymmetric andunsymmetricmatrices.Tobesubmittedto ACMTrans.Math.Softw. ,Retrieved October7,2009from http://www.cise.ufl.edu/ ~ srajaman/ Rajamanickam,S.andDavis,T.A. 2009b.PIRO BAND:Pipelinedplanerotations forblockedbandreduction.Tobesubmittedto ACMTrans.Math.Softw. ,Retrieved October7,2009from http://www.cise.ufl.edu/ ~ srajaman/ Rajamanickam,S.andDavis,T.A. 2009c.PIRO BANDuserguide.Retrieved October7,2009from http://www.cise.ufl.edu/ ~ srajaman/ Ralha,R. 2003.One-sidedreductiontobidiagonalform. LinearAlgebraAppl.358, 1-3, 219{238. Reid,J.K.andScott,J.A. 1998.Orderingsymmetricsparsematricesforsmall proleandwavefront.Tech.Rep.RAL-TR-98-016,Rutherfor dAppletonLaboratory. Reid,J.K.andScott,J.A. 2002.ImplementingHager'sexchangemethodsfor matrixprolereduction. ACMTrans.Math.Softw.28, 4,377{391. Rutishauser,H. 1963.OnJacobirotationpatterns. Proc.ofSymp.inAppl.Math.15 219{239. Sarwar,B. Karypis,G. Konstan,J. andRiedl,J. 2002.Incrementalsingular valuedecompositionalgorithmsforhighlyscalablerecomm endersystems.In FifthInt. Conf.Comput.Inf.Sci. 27{28. Schwarz,H.R. 1963.Algorithm183:Reductionofasymmetricbandmatrixto triple diagonalform. Commun.ACM6, 6,315{316. Schwarz,H.R. 1968.Tridiagonalizationofasymmetricbandmatrix. Numer. Math.12, 4,231{241. 95

PAGE 96

Smith,B.T. Boyle,J.M. Dongarra,J. Garbow,B.S. Ikebe,Y. Klema, V.C. andMoler,C.B. 1976. MatrixEigensystemRoutines-EISPACKGuide, SecondEdition .LectureNotesinComputerScience,vol.6.Springer. Stewart,G.W. 1976.Theeconomicalstorageofplanerotations. Numer.Math.25 137{138. Sun,J. Xie,Y. Zhang,H. andFaloutsos,C. 2008.Lessismore:Sparsegraph miningwithcompactmatrixdecomposition. Stat.Anal.DataMin.1, 1,6{22. Trefethen,L.N.andBauIII,D. 1997. NumericalLinearAlgebra ,1sted.Societyfor IndustrialandAppliedMathematics. VanLoan,C.andBischof,C. 1987.TheWYrepresentationforproductsof householdermatrices. SIAMJ.Sci.Stat.Comput.8, 1. Wall,M. Rechtsteiner,A. andRocha,L. 2003.Singularvaluedecompositionand principalcomponentanalysis.In ApracticalApproachtoMicroarrayDataAnalysis D.P.Berrar,W.Dubitzky,andM.Granzow,Eds.91{109. 96

PAGE 97

BIOGRAPHICALSKETCH SivaRajamanickam'sresearchinterestsareinsparsematri xalgorithms.Specically, heworksondirectmethodsforsparseeigenvalueandsingula rvaluedecompositions.He hasalsoworkedonbandreductionmethodsandorderingmetho dsforsparseCholesky andsparseLUfactorizations.HeisoneoftheauthorsofCCOL AMD(Constrained ColumnApproximateMinimumDegreeOrdering)librarywhich ispartoftheCHOLMOD (CholeskyModication)package.Hewasalsopartofthedeve lopmentteamsinSun MicrosystemsandTataConsultancyServices. 97