PAGE 114

So, $L \ge t_w + t_p + T_r$. Also, since $T_{ps}(0) = t_w$, the earliest time the processing of the last segment can begin is $t_w + (s-1)t_p$. Hence, $L \ge t_w + T_p + t_r$. Combining these lower bounds, we get $L \ge \max\{T_w + t_p + t_r,\ t_w + T_p + t_r,\ t_w + t_p + T_r\}$. Since $T_A$ equals the derived lower bound, the lower bound is tight, $L = \max\{T_w + t_p + t_r,\ t_w + T_p + t_r,\ t_w + t_p + T_r\}$, and $T_A$ is the minimum possible completion time.

Theorem 6.8. When $s > 1$, the completion time $T_B$ of strategy B for the enhanced GPU model is
$$T_B = t_w + \max\{t_w, t_p\} + (s-2)\max\{t_w, t_r, t_p\} + \max\{t_p, t_r\} + t_r$$

Proof. When the for loop index $i = 1$, the read within this loop begins at $t_w + \max\{t_w, t_p\}$. For $2 \le i$ [...] $t_r > t_p$, $T_B = t_w + t_w + (s-2)t_w + t_r + t_r = T_w + 2t_r > T_w + t_p + t_r = T_A = L$.

6.5 Experimental Results

6.5.1 GPU-to-GPU

For all versions of our GPU-to-GPU CUDA code, we set maxL = 17, T = 64, and $S_{block} = 14592$. Consequently, $S_{thread} = S_{block}/T = 228$ and $tWord = S_{thread}/4 = 57$. Note that since tWord is odd, we will not have shared-memory bank conflicts (Theorem 6.1). We note that since our code is written using a 1-dimensional grid of blocks and since a grid dimension is required to be less than 65536 [13], our GPU-to-GPU code can handle at most

PAGE 115

65535 blocks. With the chosen block size, n must be less than 912MB. For larger n, we can rewrite the code using a two-dimensional indexing scheme for blocks.

For our experiments, we used a pattern dictionary from [44] that has 40 patterns. The target search strings were extracted from a disk image and we used n = 10MB, 100MB, and 904MB.

6.5.1.1 Aho-Corasick algorithm

We evaluated the performance of the following versions of our GPU-to-GPU AC algorithm:

AC0 This is Algorithm basic (Figure 6-4) with the DFA stored in device memory.

AC1 This differs from AC0 only in that the DFA is stored in texture memory.

AC2 The AC1 code is enhanced so that each thread reads 16 characters at a time from device memory rather than 1. This reading is done using a variable of type uint4. The read data is stored in shared memory. The processing of the read data is done by reading it one character at a time from shared memory and writing the resulting state to device memory directly.

AC3 The AC2 code is further enhanced so that threads cooperatively read data from device memory to shared memory as in Figure 6-5. The read data is processed as in AC2.

AC4 This is the AC3 code with deficiency D2 eliminated using a register array to save the input and cooperative writes as described in Section 6.3.2.2.

We experimented with a variant of AC3 in which data was read from shared memory as uints, the 4 characters encoded in a uint were extracted using shifts and masks, and DFA transitions were done on these 4 characters. This variant took about 1% to 2% more time than AC3 and is not reported on further. Also, we considered variants of AC4 in which tWord = 48 and 56 and these, respectively, took approximately 14.78% and 7.8% more time than AC4. We do not report on these variants further either.
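The parameter choices above can be checked with a little arithmetic. The Python sketch below is illustrative only: it assumes 16 shared-memory banks of 32-bit words serving a half-warp of 16 threads (a property of GT200-class GPUs that this section does not state) and a simplified access pattern in which thread t touches word t * tWord. The helper name max_bank_conflict is hypothetical.

```python
from collections import Counter

# Assumptions (not stated in this section): 16 shared-memory banks of
# 32-bit words, accessed by a half-warp of 16 threads.
NUM_BANKS = 16
HALF_WARP = 16


def max_bank_conflict(stride_words, num_banks=NUM_BANKS, threads=HALF_WARP):
    """Largest number of threads in a half-warp that map to the same bank
    when thread t accesses word t * stride_words."""
    hits = Counter((t * stride_words) % num_banks for t in range(threads))
    return max(hits.values())


# Values taken from the text.
S_BLOCK = 14592      # bytes of input handled by one block
T = 64               # threads per block
MAX_BLOCKS = 65535   # largest 1-dimensional grid

s_thread = S_BLOCK // T              # bytes per thread
t_word = s_thread // 4               # 32-bit words per thread
max_input_bytes = MAX_BLOCKS * S_BLOCK

if __name__ == "__main__":
    print("S_thread =", s_thread, "tWord =", t_word)
    print("largest input in MB:", max_input_bytes / 2**20)
    for stride in (57, 56, 48):
        print("tWord =", stride, "worst-case conflict degree:",
              max_bank_conflict(stride))
```

Under these assumptions an odd stride such as 57 spreads the 16 threads over 16 distinct banks, whereas strides of 56 and 48 fold them onto 2 banks and 1 bank, respectively, which is in line with the ordering of the slowdowns reported for those two variants.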

PAGE 116

Table 6-1. Run time for AC versions

Optimization Step   10MB       100MB       904MB
AC0                 22.92ms    227.12ms    2158.31ms
AC1                 11.85ms    118.14ms    1106.75ms
AC2                 8.19ms     80.34ms     747.73ms
AC3                 5.57ms     53.33ms     434.03ms
AC4                 2.88ms     26.48ms     248.71ms

Figure 6-13. Graphical representation of speedup relative to AC0

Table 6-1 gives the run time for each of our AC versions. As can be seen, the run time decreases noticeably with each enhancement made to the code. Table 6-2 gives the speedup attained by each version relative to AC0 and Figure 6-13 is a plot of this speedup. Simply relocating the DFA from device memory to texture memory as is done in AC1 results in a speedup of almost 2. Performing all of the enhancements yields a speedup of almost 8 when n = 10MB and almost 9 when n = 904MB.

Table 6-2. Speedup of AC1, AC2, AC3, and AC4 relative to AC0

Optimization Step   10MB    100MB   904MB
AC0                 1       1       1
AC1                 1.93    1.92    1.95
AC2                 2.80    2.83    2.89
AC3                 4.11    4.26    4.97
AC4                 7.71    8.58    8.68

6.5.1.2 Multipattern Boyer-Moore algorithm

For the multipattern Boyer-Moore method, we considered only the versions mBM0 and mBM1 that correspond, respectively, to AC0 and AC1. In both mBM0 and mBM1,

PAGE 117

the bad character function and the shift1 and shift2 functions were stored in shared memory. Table 6-3 gives the run times for mBM0 and mBM1. Once again, relocating the reverse trie from device memory to texture memory resulted in a speedup of almost 2. Note that mBM1 takes between 7% and 10% more time than is taken by AC1. Since the multipattern Boyer-Moore algorithm has a somewhat more complex memory access pattern than used by AC, it is unlikely that the remaining code enhancements will be as effective as they were in the case of AC. So, we do not expect versions mBM2 through mBM4 to outperform their AC counterparts. Therefore, we did not consider further refinements to mBM1.

Table 6-3. Run time for mBM versions

Optimization Step   10MB       100MB       904MB
mBM0                25.40ms    251.86ms    2342.69ms
mBM1                13.07ms    127.26ms    1184.22ms

6.5.1.3 Comparison with multicore computing on host

For benchmarking purposes, we also programmed a multithreaded version of the AC algorithm and ran it on the quad-core Xeon host that our GPU is attached to. The multithreaded version replicated the AC DFA so that each thread had its own copy to work with. For n = 10MB and 100MB we obtained best performance using 8 threads while for n = 500MB and 904MB best performance was obtained using 4 threads. The 8-thread code delivered a speedup of 2.67 and 3.59, respectively, for n = 10MB and 100MB relative to the single-threaded code. For n = 500MB and 904MB, the speedup achieved by the 4-thread code was, respectively, 3.88 and 3.92, which is very close to the maximum speedup of 4 that a quad-core can deliver.

Table 6-4. Speedup of mBM1 relative to mBM0

Optimization Step   10MB    100MB   904MB
mBM0                1       1       1
mBM1                1.94    1.98    1.98
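The entries of Table 6-4 and the 7% to 10% gap between mBM1 and AC1 follow directly from the measured run times. The short Python sketch below recomputes both from the values in Tables 6-1 and 6-3; the function and variable names are illustrative and do not come from the dissertation's code.

```python
# Run times in milliseconds, copied from Table 6-1 (AC1) and Table 6-3 (mBM0, mBM1).
SIZES = ("10MB", "100MB", "904MB")
MBM0 = dict(zip(SIZES, (25.40, 251.86, 2342.69)))
MBM1 = dict(zip(SIZES, (13.07, 127.26, 1184.22)))
AC1 = dict(zip(SIZES, (11.85, 118.14, 1106.75)))


def speedup(base_ms, new_ms):
    """Speedup of the new version relative to the base version."""
    return base_ms / new_ms


def extra_time_percent(slow_ms, fast_ms):
    """How much more time, in percent, the slower code takes."""
    return 100.0 * (slow_ms - fast_ms) / fast_ms


# Speedup of mBM1 relative to mBM0, rounded as in Table 6-4.
mbm1_speedup = {n: round(speedup(MBM0[n], MBM1[n]), 2) for n in SIZES}

# Extra time taken by mBM1 relative to AC1, in percent.
mbm1_vs_ac1 = {n: extra_time_percent(MBM1[n], AC1[n]) for n in SIZES}

if __name__ == "__main__":
    for n in SIZES:
        print(n, "speedup:", mbm1_speedup[n],
              "extra time vs AC1 (%):", round(mbm1_vs_ac1[n], 1))
```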

PAGE 118

Table 6-5. Run time for multithreaded AC on quad-core host

Number of Threads   10MB       Speedup   100MB       Speedup
1                   24.48ms    1         243.47ms    1
2                   13.52ms    1.81      125.52ms    1.94
4                   11.28ms    2.17      68.74ms     3.54
8                   9.18ms     2.67      67.77ms     3.59
16                  10.64ms    2.30      68.07ms     3.58

Number of Threads   500MB       Speedup   904MB       Speedup
1                   1237.64ms   1         2369.85ms   1
2                   617.44ms    2.00      1206.21ms   1.96
4                   319.23ms    3.88      604.54ms    3.92
8                   367.32ms    3.37      677.16ms    3.50
16                  356.48ms    3.47      620.99ms    3.82

AC4 offers speedups of 8.5, 9.2, and 9.5 relative to the single-thread CPU code for n = 10MB, 100MB, and 904MB, respectively. The speedups relative to the best multithreaded quad-core codes were 3.2, 2.6, and 2.4, respectively.

6.5.2 Host-to-Host

We used AC3 with the parameters stated in Section 6.5.1 to process each segment of data on the GPU. The target string to be searched was partitioned into equal size segments. As a result, the time to write a segment to device memory was (approximately) the same for all segments, as was the time to process each segment in the GPU and to read the results back to host memory. So, the assumptions made in the analysis of Section 6.4.2 apply. From Theorem 6.3, we know that host-to-host strategy A will give optimal performance (independent of the relative values of $t_w$, $t_p$, and $t_r$) though at the expense of requiring as much device memory as needed to store the entire input and the entire output. However, strategy B, while more efficient on memory when the number of segments is more than 2, does not guarantee minimum run time. The values of $t_w$, $t_p$, and $t_r$ for a segment of size 10MB were determined to be 1.87ms, 2.73ms, and 3.63ms, respectively. So, $t_w$ [...]
PAGE 119

Table 6-6. Run time for strategy A host-to-host code

Number of Segments   Segment Size   GPU time    CPU time    Speedup
100                  9.04MB         816.80ms    604.54ms    0.74
10                   90.4MB         785.55ms    604.54ms    0.77
2                    452MB          788.63ms    604.54ms    0.77
1                    904MB          770.13ms    604.54ms    0.78
50                   10MB           412.55ms    319.23ms    0.82
10                   50MB           387.78ms    319.23ms    0.82
5                    100MB          385.17ms    319.23ms    0.83
1                    500MB          396.42ms    319.23ms    0.81

$T_B = T_w + T_r + t_p - t_w$, and strategy B is suboptimal. Strategy B is expected to take $t_p - t_w = 0.86$ms more time than taken by strategy A when the segment size is 10MB. Since $t_w$, $t_p$, and $t_r$ scale roughly linearly with segment size, strategy B will be slower by about 8.6ms when the segment size is 100MB and by 77.7ms when the segment size is 904MB. Unless the value of n is sufficiently large to make strategy A infeasible because of insufficient device memory, we should use strategy A. We experimented with strategy A and Table 6-6 gives the time taken when n = 500MB and 904MB using a different number of segments. This table also gives the speedup obtained by host-to-host strategy A relative to doing the multipattern search on the quad-core host using 4 threads (note that 4 threads give the fastest quad-core performance for the chosen values of n). Although the GPU delivers no speedup relative to our quad-core host, the speedup could be quite substantial when the GPU is a slave of a much slower host. In fact, when operating as a slave of a single-core host running at the same clock rate as our Xeon host, the CPU times would be about the same as for our single-threaded version and the GPU host-to-host code would deliver a speedup of 3.1 when n = 904MB and 500MB and the number of segments is 1.
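The completion-time expressions of Section 6.4.3 lend themselves to a quick numerical check. The Python sketch below evaluates the lower bound L (which strategy A attains) and the strategy B time of Theorem 6.8, both for the enhanced GPU model; it does not model the Section 6.4.2 analysis that applies to the measurements above. It assumes s equal segments, so that $T_w = s t_w$, $T_p = s t_p$, and $T_r = s t_r$. The function names and the sample values are illustrative assumptions.

```python
def lower_bound_enhanced(s, tw, tp, tr):
    """Lower bound L on the completion time for the enhanced GPU model.

    s is the number of equal segments; tw, tp, tr are the per-segment
    write, process, and read times. Strategy A attains this bound.
    """
    Tw, Tp, Tr = s * tw, s * tp, s * tr
    return max(Tw + tp + tr, tw + Tp + tr, tw + tp + Tr)


def strategy_b_enhanced(s, tw, tp, tr):
    """Completion time T_B of strategy B as given by Theorem 6.8."""
    if not s > 1:
        raise ValueError("Theorem 6.8 is stated for s > 1")
    return tw + max(tw, tp) + (s - 2) * max(tw, tr, tp) + max(tp, tr) + tr


if __name__ == "__main__":
    # Hypothetical times with tw >= tr > tp, the case in which the proof
    # shows that strategy B is strictly slower than strategy A.
    s, tw, tp, tr = 4, 3, 1, 2
    print("L   =", lower_bound_enhanced(s, tw, tp, tr))
    print("T_B =", strategy_b_enhanced(s, tw, tp, tr))
```

For the hypothetical values shown, $T_B$ equals $T_w + 2t_r$ and exceeds the lower bound by $t_r - t_p$, which matches the closing step of the proof of Theorem 6.8.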

PAGE 120

CHAPTER 7
CONCLUSION

The focus of this dissertation is multi-pattern matching. Our contributions are summarized below:

1. We have proposed the use of 2- and 3-level summaries for efficient popcount computation and have suggested ways to minimize the size of the lookup table associated with the popcount scheme of Munro [31]. As near as we can tell, we are the first to use more than 1 level of summaries for popcount computation in network applications. Using the summaries proposed here, the number of additions required to compute popcount is between 7% and 13% of that required by the scheme of [57].

We also have proposed an aggressive compression scheme. When this scheme is used on our test sets, the memory required by the search structure is between 24% and 31% less than that required when the compression scheme of [57] is used. Although a search using our structure makes more memory accesses than when the structure of [57] is used, the two schemes make almost the same number of memory accesses when the memory bandwidth is sufficiently large.

2. We have developed compressed AC automata algorithms to perform high-throughput multi-pattern string matching on the IBM Cell Broadband Engine. We have presented a detailed overview of the algorithmic-level and implementation-level optimizations that we applied in order to improve the algorithm's performance.

Our solution delivers impressive compression ratios in experiment scenarios representative of natural language processing and network security applications: respectively, 1:34 on dictionaries containing English words, and 1:58 on dictionaries containing random binary patterns. Also, our solution provides a remarkable throughput between 0.90 and 2.35 Gbps per Cell blade, depending on the statistical properties of dictionary and input.

PAGE 121

3. We are exploring multi-pattern string matching algorithms on Graphics Processing Units (GPUs) that comprise a large number of processors. The target applications are digital forensics and intrusion detection. We have analyzed the performance of the popular file-carving software Scalpel 1.6 and determined that this software spends almost all of its time reading from disk and searching for headers and footers. The time spent on the latter activity may be drastically reduced (by a factor of 17 when we have 48 rules) by replacing Scalpel's current search algorithm (Boyer-Moore) by the Aho-Corasick algorithm. Further, by using asynchronous disk reads, we can fully mask the search time by the read time and do in-place carving in essentially the time it takes to read the target disk. FastScalpel is an enhanced version of Scalpel 1.6 that uses asynchronous reads and the Aho-Corasick multipattern search algorithm. FastScalpel achieves a speedup of about 2.4 over Scalpel 1.6 with rule sets of size 48. Larger rule sets will result in a larger speedup. Further, our analysis and experiments show that the time to do in-place carving cannot be reduced through the use of multicores and GPUs as suggested in [41]. This is because the bottleneck is disk read and not header and footer search. The use of multicores, GPUs, and other accelerators can reduce only the search time. Improving the performance of in-place carving beyond that achieved by FastScalpel requires a reduction in the disk read time.

4. We focus on multistring pattern matching using a GPU. AC and mBM adaptations for the host-to-host and GPU-to-GPU cases were considered. For the host-to-host case we suggested two strategies to communicate data between the host and GPU and showed that while strategy A was optimal with respect to run time (under suitable assumptions), strategy B required less device memory (when the number of segments is more than 2). Experiments show that the GPU-to-GPU adaptation of AC achieves speedups between 8.5 and 9.5 relative to a single-thread CPU code and speedups between 2.4 and 3.2 relative to a multithreaded code that uses all cores of our quad-core host. For the host-to-host case, the GPU adaptation achieves a speedup of 3.1 relative to a

PAGE 122

single-thread code running on the host. However, for this case, a multithreaded code running on the quad-core is faster. Of course, performance relative to the host is quite dependent on the speed of the host and using a slower or faster host with fewer or more cores will change the relative performance values.
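The claim in contribution 3 that asynchronous reads can mask the search time is easy to see in a simple pipeline model. The Python sketch below is an illustration under stated assumptions, not the FastScalpel implementation: it assumes the disk is processed in equal-size buffers, that every buffer takes the same time to read and the same time to search, and that at least two buffers are available so that one can be read while another is searched. All names and numbers are hypothetical.

```python
def synchronous_time(num_buffers, read_ms, search_ms):
    """Read a buffer, then search it, one buffer after another."""
    return num_buffers * (read_ms + search_ms)


def asynchronous_time(num_buffers, read_ms, search_ms):
    """Two-stage pipeline: buffer i+1 is read while buffer i is searched."""
    if num_buffers == 0:
        return 0
    # The first read and the last search cannot be overlapped; every step
    # in between is paced by the slower of the two stages.
    return read_ms + (num_buffers - 1) * max(read_ms, search_ms) + search_ms


if __name__ == "__main__":
    # Hypothetical values: 100 buffers, 10 ms to read and 2 ms to search each.
    n, read_ms, search_ms = 100, 10, 2
    print("synchronous :", synchronous_time(n, read_ms, search_ms), "ms")
    print("asynchronous:", asynchronous_time(n, read_ms, search_ms), "ms")
    print("read only   :", n * read_ms, "ms")
```

Whenever the search of a buffer is no slower than its read, the asynchronous total exceeds the pure read time only by the search time of the final buffer, which is why a faster search engine cannot help further in this model.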

PAGE 123

REFERENCES

[1] A. Aho and M. Corasick, Efficient String Matching: An Aid to Bibliographic Search, CACM, 18, 6, pp. 333-340, 1975.
[2] S. Antonatos, K. Anagnostakis and E. Markatos, Generating Realistic Workloads for Network Intrusion Detection Systems, ACM Workshop on Software and Performance, 2004.
[3] IBM Cell Development Team, IBM Assembly Visualizer for Cell Broadband Engine, http://www.alphaworks.ibm.com/tech/asmvis, [updated 30 Sep 2009; cited 30 Sep 2009].
[4] R. Bace and P. Mell, Intrusion Detection Systems, NIST Special Publication on IDSs.
[5] R. Baeza-Yates, Improved String Searching, Software-Practice and Experience, 19, pp. 257-271, 1989.
[6] R. Baeza-Yates and G. Gonnet, A New Approach to Text Searching, CACM, 35, 10, pp. 74-82, 1992.
[7] Beate Commentz-Walter, A String Matching Algorithm Fast on the Average, Book chapter, Lecture Notes in Computer Science, 1979.
[8] R. Boyer and J. Moore, A Fast String Searching Algorithm, CACM, 20, 10, pp. 262-272, 1977.
[9] D. Brokenshire, Maximizing the Power of the Cell Broadband Engine Processor: 25 Tips to Optimal Application Performance, Technical Report, IBM STI Design Center, 2006.
[10] IBM Cell Development Team, Cell Broadband Engine Resource Center, http://www-128.ibm.com/developerworks/power/cell, [updated 30 Sep 2009; cited 30 Sep 2009].
[11] S. Che, M. Boyer, J. Meng et al., A performance study of general-purpose applications on graphics processors using CUDA, Journal of Parallel and Distributed Computing, 2008.
[12] CUDA Programming Guide Version 2.2.1.
[13] NVIDIA Development Team, NVIDIA CUDA manual reference, http://developer.nvidia.com/object/gpucomputing.html, [updated 15 June 2010; cited 30 June 2010].
[14] M. Degermark, A. Brodnik, S. Carlsson, and S. Pink, Small Forwarding Tables for Fast Routing Lookups, ACM SIGCOMM, pp. 3-14, 1997.

PAGE 124

[15] S. Dharamapurikar and J. Lockwood, Fast and Scalable Pattern Matching for Content Filtering, ANCS, 2005.
[16] W. Eatherton, G. Varghese, Z. Dittia, Tree Bitmap: Hardware/Software IP Lookups with Incremental Updates, Computer Communication Review, 34(2), pp. 97-122, 2004.
[17] Y. Fang, R. Katz and T. Lakshman, Gigabit Rate Packet Pattern-matching using TCAM, ICNP, 2004.
[18] F. Baboescu, S. Singh and G. Varghese, Packet Classification for Core Routers: Is there an alternative to CAMs, INFOCOM, 2003.
[19] Namikus, The Foremost File Carver, http://foremost.sourceforge.net/, [updated 30 April 2010; cited 30 April 2010].
[20] Z. Galil, On Improving the Worst Case Running Time of Boyer-Moore String Matching Algorithm, 5th Colloquia on Automata, Languages and Programming, EATCS, 1978.
[21] N. Horspool, Practical Fast Searching in Strings, Software-Practice and Experience, 10, 1980.
[22] N. Huang, H. Hung, S. Lai, Y. Chu, W. Tsai, A GPU-based Multiple-pattern Matching Algorithm for Network Intrusion Detection Systems, IEEE Computer Society, 2008.
[23] N. Jacob, C. Brodley, Offloading IDS Computation to the GPU, The 22nd Annual Computer Security Applications Conference, 2006.
[24] G. Jacobson, Succinct Static Data Structure, Carnegie Mellon University Ph.D. Thesis, 1998.
[25] D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt, Fast pattern matching in strings, SIAM J. Computing, 6, 323-350, 1977.
[26] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym, NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Computer Society, 2008.
[27] J. Lockwood, C. Neely, and C. Zuver, An Extensible System-On-Programmable-Chip, content-aware Internet firewall.
[28] H. Lu and S. Sahni, O(log W) Multidimensional Packet Classification, IEEE/ACM Transactions on Networking, 15, 2, pp. 462, 2007.
[29] L. Marziale, G. Richard III, V. Roussev, Massive Threading: Using GPUs to Increase the Performance of Digital Forensics Tools, Science Direct, 2007.
[30] Mike Fisk, George Varghese, Applying Fast String Matching to Intrusion Detection, Los Alamos National Lab NM, 2002.

PAGE 125

[31] J. Munro, Tables, Foundations of Software Technology and Theoretical Computer Science, LNCS, 1180, pp. 37-42, 1996.
[32] J. Munro and S. Rao, Succinct Representation of Data Structures, in Handbook of Data Structures and Applications, D. Mehta and S. Sahni ed., Chapman & Hall/CRC, 2005.
[33] Paul Wegener, An Object Oriented Approach to Parallel Pattern Matching, Great East Software, 2009.
[34] A. Pal and N. Memon, The Evolution of File Carving, IEEE Signal Processing Magazine, pp. 59-72, 2009.
[35] V. Paxson, Bro: A system for Detecting Network Intruders in Real-time, Computer Networks, 31, pp. 2435, 1999.
[36] H. Dreger, C. Kreibach, V. Paxson, and R. Sommer, Enhancing the Accuracy of Network-based Intrusion Detection with Host-based Context, DIMVA, 2005.
[37] J. Gonzalez and V. Paxson, Enhancing Network Intrusion Detection with Integrated Sampling and Filtering, RAID, 2006.
[38] R. Sommer and V. Paxson, Exploiting Independent State for Network Intrusion Detection, ACSAC, 2005.
[39] H. Dreger, A. Feldmann, M. Mai, V. Paxson and R. Sommer, Dynamic Application-Layer Protocol Analysis for Network Intrusion Detection, USENIX Security Symposium, 2006.
[40] G. Richard III, V. Roussev, Scalpel: A Frugal, High Performance File Carver, Digital Forensics Research Workshop, 2005.
[41] G. Richard III, V. Roussev, L. Marziale, In-Place File Carving, Science Direct, 2007.
[42] S. Sahni, Data structures, Algorithms, and Applications in C++, Second Edition, Silicon Press, 2005.
[43] S. Sahni, Scheduling master-slave multiprocessor systems, IEEE Trans. on Computers, 45, 10, 1195-1199, 1996.
[44] Golden G. Richard III, The Scalpel file carver, http://www.digitalforensicssolutions.com/Scalpel/, [updated 30 April 2010; cited 30 April 2010].
[45] D. Scarpazza, O. Villa, F. Petrini, Peak-Performance DFA-based String Matching on the Cell Processor, Third IEEE/ACM Intl. Workshop on System Management Techniques, Processes, and Services, within IEEE/ACM Intl. Parallel and Distributed Processing Symposium, 2007.

PAGE 126

[46] D. Scarpazza, O. Villa, F. Petrini, Accelerating Real-Time String Searching with Multicore Processors, IEEE Computer Society, 2008.
[47] D. Scarpazza, G. Russell, High-performance Regular Expression Scanning on the Cell/B.E. processor, 23rd International Conference on Supercomputing, 2009.
[48] S. Singh, F. Baboescu, G. Varghese, and J. Wang, Packet Classification using Multidimensional Cutting, ACM Sigcomm, 8, 2003.
[49] R. Smith, N. Goyal, J. Ormont et al., Evaluating GPUs for Network Packet Signature Matching, International Symposium on Performance Analysis of Systems and Software, 2009.
[50] Snort Users Manual 2.6.0, 2006.
[51] Marty Roesch, Snort Intrusion Detection System, http://www.snort.org, [updated 2 Dec 2007; cited 2 Dec 2007].
[52] H. Song, J. Turner, and J. Lockwood, Shape Shifting Tries for Faster IP Route Lookup, ICNP, 2005.
[53] H. Song, et al., Snort Offloader: A Reconfigurable Hardware NIDS Filter, FPL, 2005.
[54] H. Song and J. Lockwood, Efficient Packet Classification for Network Intrusion Detection, FPGA, 2005.
[55] D. Taylor and J. Turner, ClassBench: A Packet Classification Benchmark, INFOCOM, 2005.
[56] NVIDIA Development Team, NVIDIA Tesla architecture, http://www.lostcircuits.com/graphics, [updated 15 June 2010; cited 15 June 2010].
[57] N. Tuck, T. Sherwood, B. Calder and G. Varghese, Deterministic Memory-efficient String Matching Algorithms for Intrusion Detection, INFOCOM, 2004.
[58] G. Vasiliadis, S. Antonatos, M. Polychronakis, E. Markatos and S. Ioannidis, Gnort: High Performance Network Intrusion Detection Using Graphics Processors, In Proceedings of the 11th International Symposium On Recent Advances In Intrusion Detection (RAID), 2008.
[59] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner, Scalable High-speed Prefix Matching, ACM Trans. on Computer Systems, 19, 4, pp. 440-482, 2001.
[60] W. Lu and S. Sahni, Packet Classification using Two-Dimensional Multibit Tries, IEEE Symposium on Computers and Communications, 2005.
[61] W. Lu and S. Sahni, Packet Classification using Pipelined Two-dimensional Multibit Tries, IEEE Symposium on Computers and Communications, 2006.

PAGE 127

[62] W. Lu and S. Sahni, Succinct Representation of Static Packet Classifiers, IEEE Symposium on Computers and Communications, 2007.
[63] Y. Won and S. Sahni, A balanced bin sort for hypercube multicomputers, Jr. of Supercomputing, 2, 1988, 435-448.
[64] Y. Won and S. Sahni, Hypercube-to-host sorting, Jr. of Supercomputing, 3, 1989, 41-61.
[65] Y. Won and S. Sahni, Host-to-hypercube sorting, Computer Systems: Science and Engineering, 4, 3, 1989, 161-168.
[66] S. Wu and U. Manber, Agrep - A Fast Algorithm for Multi-pattern Searching, Technical Report, Department of Computer Science, University of Arizona, 1994.
[67] M. Yazdani, W. Fraczak, F. Welfeld, and I. Lambadaris, Two Level State Machine Architecture for Content Inspection Engines, INFOCOM, 2006.
[68] X. Zha and S. Sahni, Highly compressed Aho-Corasick automata for Efficient Intrusion Detection, IEEE Symposium on Computers and Communications, 2008.
[69] X. Zha and S. Sahni, Fast in-place file carving for digital forensics, University of Florida, 2010.

PAGE 128

BIOGRAPHICAL SKETCH

Xinyan Zha received her Ph.D. from the Computer and Information Science and Engineering Department at the University of Florida in the summer of 2010. Xinyan Zha worked under the supervision of Dr. Sartaj Sahni. Her research interests are data structures and algorithms, network intrusion detection systems, and parallel computing. Xinyan Zha received her Bachelor of Science from the Computer Science Department at Nanjing University, China in 2004 and received her Master of Science from the Computer Science Department at the University of Central Florida in 2006.


Citation
Multipattern String Matching Algorithms

Material Information

Title:
Multipattern String Matching Algorithms
Creator:
Zha, Xinyan
Place of Publication:
[Gainesville, Fla.]
Publisher:
University of Florida
Publication Date:
Language:
english
Physical Description:
1 online resource (128 p.)

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering
Computer and Information Science and Engineering
Committee Chair:
Sahni, Sartaj
Committee Members:
Peir, Jih-Kwon
Chen, Shigang
Xia, Ye
Wu, Dapeng
Graduation Date:
8/7/2010

Subjects

Subjects / Keywords:
Automata ( jstor )
Bitmapped images ( jstor )
Bytes ( jstor )
Carving ( jstor )
Computer memory ( jstor )
Engines ( jstor )
Intrusion detection systems ( jstor )
Run time ( jstor )
Scalpels ( jstor )
Search time ( jstor )
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
aho, boyer, digital, ibm, in, multi, multicore, network, nvidia, scalpel
Genre:
Electronic Thesis or Dissertation
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
Computer Engineering thesis, Ph.D.

Notes

Abstract:
Multi-pattern string matching is widely used in applications such as network intrusion detection, digital forensics and full text search. In this dissertation, we focus on space efficient multi-pattern string matching as well as on time efficient multicore algorithms. We develop a highly compressed Aho-Corasick automaton for efficient intrusion detection. Our method uses bitmaps with multiple levels of summaries as well as aggressive path compaction. Our compressed automaton takes 24% to 31% less memory than the compressed automaton of Tuck et al. and the number of additions required to compute popcounts is reduced by about 90%. We propose a technique to perform high performance exact multi-pattern string matching on the IBM Cell/Broadband Engine (Cell) architecture, which has 9 cores per chip: one control unit and eight computation units. Our technique guarantees the remarkable compression factors of 1:34 and 1:58, respectively, on the memory representation of English language dictionaries and random binary string dictionaries. Our memory-based implementation delivers a sustained throughput between 0.90 and 2.35 Gbps per Cell blade, while supporting dictionary sizes up to 9260000 average patterns per Gbyte of main memory. We focus on Scalpel, a popular open source file recovery tool which performs file carving using the Boyer-Moore string search algorithm to locate headers and footers in a disk image. We show that the time required for file carving may be reduced significantly by employing multi-pattern search algorithms such as the multipattern Boyer-Moore and Aho-Corasick algorithms as well as asynchronous disk reads and multithreading as typically supported on multicore commodity PCs. Using these methods, we are able to do in-place file carving in essentially the time it takes to read the disk whose files are being carved.
Since, using our methods, the limiting factor for performance is the disk read time, there is no advantage to using accelerators such as GPUs as has been proposed by others. To further speed in-place file carving, we would need a mechanism to read the disk faster. Furthermore, we develop GPU adaptations of the Aho-Corasick and multipattern Boyer-Moore string matching algorithms for the two cases GPU-to-GPU and host-to-host. For the GPU-to-GPU case, we consider several refinements to a base GPU implementation and measure the performance gain from each refinement. For the host-to-host case, we analyze two strategies to communicate between the host and the GPU and show that one is optimal with respect to run time while the other requires less device memory. Experiments conducted on an NVIDIA Tesla GT200 GPU that has 240 cores running off of a Xeon 2.8GHz quad-core host CPU show that, for the GPU-to-GPU case, our Aho-Corasick GPU adaptation achieves a speedup between 8.5 and 9.5 relative to a single-thread CPU implementation and between 2.4 and 3.2 relative to the best multithreaded implementation. For the host-to-host case, the GPU AC code achieves a speedup of 3.1 relative to a single-threaded CPU implementation. However, the GPU is unable to deliver any speedup relative to the best multithreaded code running on the quad-core host. In fact, the measured speedups for the latter case ranged between 0.74 and 0.83. Early versions of our multipattern Boyer-Moore adaptations ran 7% to 10% slower than corresponding versions of the AC adaptations and we did not refine the multipattern Boyer-Moore codes further. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2010.
Local:
Adviser: Sahni, Sartaj.
Statement of Responsibility:
by Xinyan Zha.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Embargo Date:
10/9/2010
Resource Identifier:
004979739 ( ALEPH )
705932506 ( OCLC )
Classification:
LD1780 2010 ( lcc )


Full Text

PAGE 1

MULTI-PATTERN STRING MATCHING ALGORITHMS

By
XINYAN ZHA

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2010

PAGE 2

© 2010 Xinyan Zha

PAGE 3

I dedicate this Ph.D. dissertation to my parents. Thank you for your support these years.

PAGE 4

ACKNOWLEDGMENTS

Thanks to all who have helped in writing this dissertation. Special thanks to my advisor Dr. Sahni for all the research guidance these years. Thanks to the National Science Foundation for the funding support.

PAGE 5

TABLE OF CONTENTS

                                                                        page
ACKNOWLEDGMENTS  4
LIST OF TABLES  8
LIST OF FIGURES  10
ABSTRACT  13

CHAPTER
1 INTRODUCTION  15
  1.1 Overview of the Contributions  17
  1.2 Outline of the Dissertation  19
2 A HIGHLY COMPRESSED AHO-CORASICK AUTOMATA FOR EFFICIENT INTRUSION DETECTION  20
  2.1 The Aho-Corasick Automaton  20
  2.2 The Method of Tuck et al. [32] To Compress Non-Optimized Automaton  22
  2.3 Popcounts With Fewer Additions  26
  2.4 Our Method to Compress the Non-Optimized Aho-Corasick Automaton  32
    2.4.1 Classification of Automaton States  32
    2.4.2 Node Types  33
      2.4.2.1 Bitmap  33
      2.4.2.2 Low degree node  33
      2.4.2.3 Path compressed node  34
    2.4.3 Memory Accesses  35
    2.4.4 Bitmap Node with Type I Summaries, W = 32  36
    2.4.5 Low Degree Node, W = 32  36
    2.4.6 Ol, 1 ≤ l ≤ 5, Nodes, W = 32  37
    2.4.7 O Nodes, W = 32 and 1024  38
    2.4.8 Path Compressed Node of Tuck, W = 32 and 1024  39
    2.4.9 Summary  40
    2.4.10 Mapping States to Nodes  40
  2.5 Experimental Results  42
    2.5.1 Number of Nodes  42
    2.5.2 Memory Requirement  42
    2.5.3 Popcount  43
3 COMPRESSED OBJECT ORIENTED NFA FOR MULTI-PATTERN MATCHING  46
  3.1 The Object Oriented NFA for Multi-pattern Matching  46
  3.2 Compressed OO NFA  49
  3.3 Experimental Results  50

PAGE 6

4 THE COMPRESSED AHO-CORASICK AUTOMATA ON IBM CELL PROCESSOR  54
  4.1 The Cell/Broadband Engine Architecture  54
  4.2 Cell-oriented Algorithm Design  55
    4.2.1 Step (2): Branch Replacement and Hinting  58
    4.2.2 Step (3): Loop Unrolling, Data Alignment  58
    4.2.3 Step (4): Branch Removal, Select-bits Intrinsics  60
    4.2.4 Step (5): Strength Reduction  61
    4.2.5 Step (6): Horizontal Unrolling  61
  4.3 Experimental Results  62
  4.4 Related Work  63
5 FAST IN-PLACE FILE CARVING FOR DIGITAL FORENSICS  70
  5.1 In-place Carving Using Scalpel 1.6  70
  5.2 Multipattern Boyer-Moore Algorithm  76
  5.3 Multicore Searching  76
  5.4 Asynchronous Read  77
  5.5 Multicore In-place Carving  78
  5.6 Experimental Results  81
    5.6.1 Run Time of Scalpel 1.6  81
    5.6.2 Buffer Size  82
    5.6.3 Multipattern Matching  83
    5.6.4 Multicore Searching  84
    5.6.5 Asynchronous Read  85
    5.6.6 Multicore In-place Carving  86
    5.6.7 Scalpel 1.6 vs. FastScalpel  86
6 MULTI-PATTERN MATCHING ON MULTICORES AND GPUS  89
  6.1 The NVIDIA Tesla Architecture  89
  6.2 Multipattern Boyer-Moore Algorithm  91
  6.3 GPU-to-GPU  94
    6.3.1 Strategy  94
    6.3.2 Addressing the Deficiencies  99
      6.3.2.1 Deficiency D1 - reading from device memory  99
      6.3.2.2 Deficiency D2 - writing to device memory  102
  6.4 Host-to-Host  105
    6.4.1 Strategies  105
    6.4.2 Completion Time  106
    6.4.3 Completion Time Using Enhanced GPUs  112
  6.5 Experimental Results  114
    6.5.1 GPU-to-GPU  114
      6.5.1.1 Aho-Corasick algorithm  115
      6.5.1.2 Multipattern Boyer-Moore algorithm  116

PAGE 7

      6.5.1.3 Comparison with multicore computing on host  117
    6.5.2 Host-to-Host  118
7 CONCLUSION  120
REFERENCES  123
BIOGRAPHICAL SKETCH  128

PAGE 8

LIST OF TABLES

Table                                                                   page
2-1 Lookup table for 4-bit blocks  28
2-2 Distribution of states in a 3000 string Snort database  32
2-3 Memory accesses to process a node for W = 32 and W = 64  40
2-4 Memory accesses to process a node for W = 128 and W = 1024  41
2-5 Number of nodes of each type, Ol and O counts are for Type I summaries  42
2-6 Number of Ol O nodes for Type II and Type III summaries  42
2-7 Memory requirement for data set 1284  43
2-8 Memory requirement for data set 2430  43
2-9 Number of popcount additions, data set 1284  45
2-10 Number of popcount additions, data set 2430  45
3-1 Search times (milliseconds) for English patterns  53
3-2 Memory required (bytes) for English patterns  53
3-3 Number of different type nodes for English patterns  53
3-4 Compression ratio for English patterns  53
4-1 Compression ratios obtained by our technique on two sample dictionaries of comparable uncompressed size  55
4-2 The impact of the optimization steps on the performance of our compressed AC NFA algorithm  57
4-3 Aggregate throughput on an IBM Cell chip with 8 SPUs (Gbps)  62
5-1 Example headers and footers in Scalpel's configuration file  71
5-2 Examples of in-place file carving output  72
5-3 In-place carving time by Scalpel 1.6 for a 16GB flash disk  81
5-4 In-place carving time by Scalpel 1.6 with different buffer size with 48 carving rules  83
5-5 Search time for a 16GB flash drive  83
5-6 Speedup in search time relative to Boyer-Moore  84

PAGE 9

5-7 Time to search using dual core strategy with 24 rules ..... 85
5-8 In-place carving time using Algorithm Asynchronous ..... 85
5-9 In-place carving time using SRMS ..... 86
5-10 In-place carving time using SRSS ..... 86
5-11 In-place carving time using MARS2 ..... 87
5-12 In-place carving time and speedup using FastScalpel and Scalpel 1.6 ..... 88
6-1 Run time for AC versions ..... 116
6-2 Speedup of AC1, AC2, AC3, and AC4 relative to AC0 ..... 116
6-3 Run time for mBM versions ..... 117
6-4 Speedup of mBM1 relative to mBM0 ..... 117
6-5 Run time for multithreaded AC on quad-core host ..... 118
6-6 Run time for strategy A host-to-host code ..... 119

PAGE 10

LIST OF FIGURES

Figure page
2-1 An example string set ..... 21
2-2 Unoptimized Aho-Corasick automata for strings of Figure 2-1 ..... 21
2-3 Optimized Aho-Corasick automata for strings of Figure 2-1 ..... 22
2-4 A bitmap node of [57] ..... 25
2-5 A path compressed node of [57] ..... 26
2-6 Type I summaries ..... 29
2-7 Our bitmap node ..... 34
2-8 Our low degree node ..... 34
2-9 Our path compressed node ..... 35
2-10 Normalized memory requirement ..... 44
2-11 Normalized additions for popcount ..... 44
3-1 Algorithm 1. Completion of the OO graph ..... 47
3-2 Algorithm 2. Add an edge to the graph ..... 47
3-3 Algorithm 3. Object Oriented NFA Search Function ..... 48
3-4 Aho-Corasick NFA with failure pointers (all failure pointers point to state 0) ..... 49
3-5 The Object Oriented NFA (state cards representation) ..... 50
3-6 OO NFA (all states that have no matched character will return back to state 0) ..... 51
3-7 The DFA for our set of patterns ..... 51
3-8 OO bitmap node ..... 52
3-9 OO path compressed node ..... 52
3-10 OO copy node ..... 52
4-1 Chip layout of the Cell/Broadband Engine Architecture ..... 65
4-2 Bitmap node layout ..... 65
4-3 Path-compressed node layout with packing factor equal to four ..... 65
4-4 How two automata overlap the computation part with their DMA transfer wait time ..... 66

PAGE 11

4-5 The number of cycles processed per character with different vertical unrolling factors ..... 66
4-6 How the throughput grows with each optimization step ..... 67
4-7 Utilization of clock cycles following each optimization step ..... 67
4-8 DMA inter-arrival transfer delay from main memory to local store when 8 SPEs are used concurrently ..... 68
4-9 Aggregate throughput of our algorithm on an IBM QS22 blade (16 SPEs) ..... 68
4-10 How the percentage of matched patterns affects the aggregate throughput ..... 69
4-11 The trade-off between the compression ratio and the throughput in a Pareto space ..... 69
5-1 Control flow Scalpel 1.6 (a) ..... 74
5-2 Control flow Scalpel 1.6 (b) ..... 74
5-3 Control flow for 2-threaded search ..... 77
5-4 In-place carving using asynchronous reads ..... 78
5-5 Control flow for single core read and single core search (SRSS) ..... 79
5-6 Control flow for multicore asynchronous read and search (MARS1) ..... 80
5-7 Another control flow for multicore asynchronous read and search (MARS2) ..... 80
5-8 Multi-Pattern Search Algorithms Speedup ..... 84
5-9 Speedup of FastScalpel relative to Scalpel 1.6 ..... 88
6-1 NVIDIA GT200 Architecture [56] ..... 90
6-2 Reverse trie for cac, acbacc, cba, bbaca, and cbaca (shift1(node), shift2(node) values are shown beside each node) ..... 93
6-3 GPU-to-GPU notation ..... 95
6-4 Overall GPU-to-GPU strategy using AC ..... 96
6-5 T threads collectively read a block and save in shared memory ..... 100
6-6 Host-to-host strategy A ..... 105
6-7 Host-to-host strategy B ..... 106
6-8 Notation used in completion time analysis ..... 107
6-9 Strategy A, twtp, s = 4 (cases 1a and 2) ..... 109

PAGE 12

6-10 Strategy A, twtp, s = 4 (cases 1b, 1c, and 4b) ..... 110
6-12 Strategy A, enhanced GPU, s = 4 ..... 113
6-13 Graphical representation of speedup relative to AC0 ..... 116

PAGE 13

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

MULTI-PATTERN STRING MATCHING ALGORITHMS

By Xinyan Zha

August 2010

Chair: Sartaj Sahni
Major: Computer Engineering

Multi-pattern string matching is widely used in applications such as network intrusion detection, digital forensics and full text search. In this dissertation, we focus on space efficient multi-pattern string matching as well as on time efficient multicore algorithms.

We develop a highly compressed Aho-Corasick automaton for efficient intrusion detection. Our method uses bitmaps with multiple levels of summaries as well as aggressive path compaction. Our compressed automaton takes 24% to 31% less memory than taken by the compressed automaton of Tuck et al. [57] and the number of additions required to compute popcounts is reduced by about 90%.

We propose a technique to perform high performance exact multi-pattern string matching on the IBM Cell/Broadband Engine (Cell) architecture, which has 9 cores per chip, one control unit and eight computation units. Our technique guarantees the remarkable compression factors of 1:34 and 1:58, respectively, on the memory representation of English language dictionaries and random binary string dictionaries. Our memory-based implementation delivers a sustained throughput between 0.90 and 2.35 Gbps per Cell blade, while supporting dictionary sizes up to 9,260,000 average patterns per Gbyte of main memory.

We focus on Scalpel, a popular open source file recovery tool which performs file carving using the Boyer-Moore string search algorithm to locate headers and footers in a

PAGE 14

disk image. We show that the time required for file carving may be reduced significantly by employing multi-pattern search algorithms such as the multipattern Boyer-Moore and Aho-Corasick algorithms as well as asynchronous disk reads and multithreading as typically supported on multicore commodity PCs. Using these methods, we are able to do in-place file carving in essentially the time it takes to read the disk whose files are being carved. Since, using our methods, the limiting factor for performance is the disk read time, there is no advantage to using accelerators such as GPUs as has been proposed by others. To further speed in-place file carving, we would need a mechanism to read disk faster.

Furthermore, we develop GPU adaptations of the Aho-Corasick and multipattern Boyer-Moore string matching algorithms for the two cases GPU-to-GPU and host-to-host. For the GPU-to-GPU case, we consider several refinements to a base GPU implementation and measure the performance gain from each refinement. For the host-to-host case, we analyze two strategies to communicate between the host and the GPU and show that one is optimal with respect to run time while the other requires less device memory. Experiments conducted on an NVIDIA Tesla GT200 GPU that has 240 cores running off of a Xeon 2.8GHz quad-core host CPU show that, for the GPU-to-GPU case, our Aho-Corasick GPU adaptation achieves a speedup between 8.5 and 9.5 relative to a single-thread CPU implementation and between 2.4 and 3.2 relative to the best multithreaded implementation. For the host-to-host case, the GPU AC code achieves a speedup of 3.1 relative to a single-threaded CPU implementation. However, the GPU is unable to deliver any speedup relative to the best multithreaded code running on the quad-core host. In fact, the measured speedups for the latter case ranged between 0.74 and 0.83. Early versions of our multipattern Boyer-Moore adaptations ran 7% to 10% slower than corresponding versions of the AC adaptations and we did not refine the multipattern Boyer-Moore codes further.

PAGE 15

CHAPTER 1
INTRODUCTION

Intrusion detection systems (IDS) monitor events within a network or computer system with the objective of detecting attempts to compromise the confidentiality, integrity, availability, or to bypass the security mechanisms of a computer or network [4]. The intrusion detected by an IDS may manifest itself as a denial of service, unauthorized login, a user performing tasks that he/she is not authorized to do (e.g., access secure files, create new accounts, etc.), execution of malware such as viruses and worms, and so on. An IDS accomplishes its objective by analyzing data gathered from the network, host computer, or application that is being monitored. The analysis usually takes one of two forms: misuse (or signature) detection and anomaly detection. In misuse detection, the IDS maintains a database of signatures (patterns of events) that correspond to known attacks and searches the gathered data for these signatures. In anomaly detection the IDS maintains statistics that describe normal usage and checks for deviations from these statistics in the monitored data. While misuse detection usually has a low rate of false positives, it is able to detect only known attacks. Anomaly detection usually has a higher rate of false positives (because users keep changing their usage pattern thereby invalidating the stored statistics) but is able to detect new attacks never seen before.

Network intrusion detection systems (NIDS) examine network traffic (both in- and out-bound packets) looking for traffic patterns that indicate attempts to break into a target computer, port scans, denial of service attacks, and other malicious behavior. Host intrusion detection systems (HIDS) monitor the activity within a computing system looking for activity that violates the computing system's internal security policy (e.g., a program attempting to access an unauthorized resource). Application intrusion detection systems (AIDS) monitor the activity of a specific application while protocol intrusion detection systems (PIDS) ensure that specific protocols such as HTTP behave as they

PAGE 16

should. Each type of IDS has its capabilities and limitations and attempts have been made to put together hybrid IDSs that combine the capabilities of the described base IDSs.

The evolution of Web 2.0 applications and business analytics applications is showing a more and more prevalent production and use of unstructured data. Natural Language Processing (NLP) applications can determine the language in which a document is written. E-mail web applications extract semantically tagged information (dates, places, delivery tracking numbers, etc.) from messages. Business analytics applications can automatically detect business events like the merger of two companies. Digital forensics tools can recover files from raw disk images by detecting streams of bytes that match file headers and footers.

In the above applications (and many others), it is crucial to process huge amounts of sequential text to extract matches against a predetermined set of strings (the dictionary). Arguably, the most popular way to perform this exact, multi-pattern string matching task is the Aho-Corasick [1] (AC) algorithm. However, AC, especially in its optimized form based on a Deterministic Finite Automaton (DFA), is not space-efficient. In fact, the state-transition table that its DFAs use can be highly redundant. Uncompressed DFAs have a low transition cost (and therefore a high throughput) but also a large footprint and, consequently, a low dictionary capacity per unit of memory. For example, a dictionary of 200,000 patterns with average length 15 bytes occupies 1 Gbyte of memory when encoded for an uncompressed AC DFA. Low space efficiency limits the algorithm's applicability to domains that require very large dictionaries, like automatic language identification, which employs dictionaries with millions of entries coming from hundreds of distinct natural languages.

In this dissertation, we address the space inefficiency of the AC DFA by exploring a variant of AC that employs compressed paths. Our work is inspired by that of Tuck et al. [57], is based on the Non-deterministic Finite Automaton (NFA) version of AC, and achieves significant memory reduction. We also provide our IBM Cell adaptations

PAGE 17

of the compressed Aho-Corasick string matching algorithm. Furthermore, we propose fast in-place carving techniques for the popular digital forensics tool Scalpel as well as our GPU adaptations of the Aho-Corasick and multipattern Boyer-Moore string matching algorithms for the two cases GPU-to-GPU and host-to-host.

1.1 Overview of the Contributions

First, we develop a method to compress the unoptimized (also known as nondeterministic) Aho-Corasick automaton that is used widely in intrusion detection systems. Our method uses bitmaps with multiple levels of summaries as well as aggressive path compaction. By using multiple levels of summaries, we are able to determine a popcount with as few as 1 addition. On Snort string databases, our compressed automata take 24% to 31% less memory than is taken by the compressed automata of Tuck et al. [57], and the number of additions required to compute popcounts is reduced by about 90%.

Next, we choose an established multi-core architecture, the IBM Cell Broadband Engine (CBE), and explore its potential for string matching applications. The IBM Cell presents software designers with non-trivial challenges that are representative of the next generation of multi-core architectures. With its 9 cores per chip, the IBM Cell/Broadband Engine (Cell) can deliver an impressive amount of compute power and benefit the string-matching kernels of network security, business analytics and natural language processing applications. However, the available amount of main memory on the system limits the maximum size of the dictionary supported. To counter this, we propose a technique that employs compressed Aho-Corasick automata to perform fast, exact multi-pattern string matching with very large dictionaries. Our technique achieves the remarkable compression factors of 1:34 and 1:58, respectively, on the memory representation of English-language dictionaries and random binary string dictionaries. We demonstrate a parallel implementation for the Cell processor that delivers a sustained throughput between 0.90 and 2.35 Gbps per Cell blade, while supporting dictionary sizes up to 9.2 million average patterns per Gbyte of main

PAGE 18

memory, and exhibiting resilience to content-based attacks. This high dictionary density enables natural language applications of an unprecedented scale to run on a single server blade.

Thirdly, we focus on the popular open source file recovery tool Scalpel, which performs file carving using the Boyer-Moore string search algorithm to locate headers and footers in a disk image. We show that the time required for file carving may be reduced significantly by employing multi-pattern search algorithms such as the multipattern Boyer-Moore and Aho-Corasick algorithms as well as asynchronous disk reads and multithreading as typically supported on multicore commodity PCs. Using these methods, we are able to do in-place file carving in essentially the time it takes to read the disk whose files are being carved. Since, using our methods, the limiting factor for performance is the disk read time, there is no advantage to using accelerators such as GPUs as has been proposed by others. To further speed in-place file carving, we would need a mechanism to read disk faster.

Last, we develop GPU adaptations of the Aho-Corasick and multipattern Boyer-Moore string matching algorithms for the two cases GPU-to-GPU and host-to-host. Experiments conducted on an NVIDIA Tesla GT200 GPU that has 240 cores running off of a Xeon 2.8GHz quad-core host CPU show that, for the GPU-to-GPU case, our Aho-Corasick GPU adaptation achieves a speedup of almost 8.50 to 9.53 relative to its CPU counterpart; the corresponding speedup attained by our multipattern Boyer-Moore adaptation is almost 3.19 to 3.40. For the host-to-host case, we achieve a speedup of almost 2.90 to 3.12 utilizing our Aho-Corasick GPU adaptation versus doing the multipattern matching on the CPU itself using a single-core CPU code for the Aho-Corasick algorithm. The Aho-Corasick algorithm runs faster than the multipattern Boyer-Moore algorithm both in its single-core CPU version and in its GPU adaptation.

PAGE 19

1.2 Outline of the Dissertation

The remainder of this dissertation is organized as follows. Chapter 2 presents our highly compressed Aho-Corasick automata for efficient network intrusion detection. Chapter 3 develops a compressed version of the Object Oriented NFA for multi-pattern matching. Chapter 4 presents our compressed Aho-Corasick automata algorithm for high performance exact multi-pattern string matching on an IBM Cell Broadband Engine. Chapter 5 shows fast in-place file carving techniques for digital forensics. Chapter 6 describes our adaptations of multipattern string matching algorithms on the Graphics Processing Units (GPUs). We conclude in Chapter 7.

PAGE 20

CHAPTER 2
A HIGHLY COMPRESSED AHO-CORASICK AUTOMATA FOR EFFICIENT INTRUSION DETECTION

2.1 The Aho-Corasick Automaton

The Aho-Corasick finite state automaton [1] for multi-string matching is widely used in IDSs. There are two versions of this automaton: unoptimized and optimized. Both versions are finite state machines. In the unoptimized version, which we use in this paper, there is a failure pointer for each state, while in the optimized version no state has a failure pointer. In both versions, each state has success pointers; each success pointer has a label, which is a character from the string alphabet, associated with it. Also, each state has a list of strings/rules (from the string database) that are matched when that state is reached by following a success pointer. This is the list of matched rules.

In the unoptimized version, the search starts with the automaton start state designated as the current state and the first character in the text string, S, that is being searched designated as the current character. At each step, a state transition is made by examining the current character of S. If the current state has a success pointer labeled by the current character, a transition to the state pointed at by this success pointer is made and the next character of S becomes the current character. When there is no corresponding success pointer, a transition to the state pointed at by the failure pointer is made and the current character is not changed. Whenever a state is reached by following a success pointer, the rules in the list of matched rules for the reached state are output along with the position in S of the current character. This output is sufficient to identify all occurrences, in S, of all database strings. Aho and Corasick [1] have shown that when their unoptimized automaton is used, the number of state transitions is at most 2n, where n is the length of S.

In the optimized version, each state has a success pointer for every character in the alphabet and so, there is no failure pointer. Aho and Corasick [1] show how to compute the success pointer for pairs of states and characters for which there is no success

PAGE 21

abcaabb
abcaabbcc
acb
acbccabb
ccabb
bccabc
bbccabca

Figure 2-1. An example string set

Figure 2-2. Unoptimized Aho-Corasick automata for strings of Figure 2-1

pointer in the unoptimized automaton thereby transforming an unoptimized automaton into an optimized one. The number of state transitions made by an optimized automaton when searching for matches in a string of length n is n.

Figure 2-1 shows an example string set drawn from the 3-letter alphabet {a, b, c}, which we take to be the string alphabet for this example. Figures 2-2 and 2-3, respectively, show its unoptimized and optimized Aho-Corasick automata.

It is important to note that when we remove the failure pointers from an unoptimized Aho-Corasick automaton, the resulting structure is a trie [42] rooted at the automaton start node. However, an optimized automaton has the structure of a graph that may not

PAGE 22

Figure 2-3. Optimized Aho-Corasick automata for strings of Figure 2-1

be a trie. This difference in the structure defined by the success pointers has a profound impact on our ability to compress unoptimized automata versus optimized automata.

2.2 The Method of Tuck et al. [57] To Compress Non-Optimized Automaton

Assume that the alphabet size is 256 (e.g., ASCII characters). Although the development is generalized readily to any alphabet size, it is more convenient to do the development using a fixed and realistic alphabet size. A natural way to store the Aho-Corasick automaton, for a given database D of strings, is to represent each state of the unoptimized automaton by a node that has 256 success pointers, a failure pointer, and a list of rules that are matched when this state is reached via a success pointer. Assuming that a pointer takes 4 bytes and the rule list is simply pointed at by the node, each state node is 1032 bytes. Using bitmap and path compression, we may use nodes whose size is 52 bytes [57]. An uncompressed state node has the following fields:

PAGE 23

1. Success[0:255], where Success[i] gives the state to transition to when the ASCII code for the current character is i (Success[i] is null in case there is no success pointer for the current state when the current character is i).
2. RuleList ... a list of rules that are matched when this state is reached via a success pointer.
3. Failure ... the transition to make when there is no success transition, for the current character, from the current state.

Assume that each pointer requires 4 bytes. So, each node requires 1024 bytes for the Success array and 4 bytes for the failure pointer. In keeping with Tuck et al. [57], when accounting for the memory required for RuleList, we shall assume that only a 4-byte pointer to this list is stored in the node and ignore the memory required by the list itself. Hence, the size of a state node for an unoptimized automaton is 1032 bytes. Using bitmap and path compression, the size of a node becomes 52 bytes [57].

In the optimized version, the Failure field is omitted and the memory required by a node is 1028 bytes. While each node of the optimized automaton requires 4 bytes less than required by each node of the unoptimized automaton, there is little opportunity to compress an optimized node as each of its 256 success pointers is non-null and the automaton does not have a tree structure. However, many of the success pointers in the nodes of an unoptimized automaton are null and the structure defined by the success pointers is a trie. Therefore, there is significant opportunity to compress these nodes. Following up on this observation, Tuck et al. [57] propose two transformations to compress the nodes in an unoptimized automaton:

1. Bitmap Compression.

In its simplest form, bitmap compression replaces each 1032-byte node of an unoptimized automaton with a 44-byte node. Of these 44 bytes, 8 are used for the failure and rule list pointers. Another 32 bytes are used to maintain a 256-bit bitmap with the property that bit i of this map is 1 iff Success[i] $\ne$ null. The nodes corresponding to the non-null success pointers are stored in contiguous memory and a pointer (firstChild)

PAGE 24

to the first of these is stored in the 44-byte node. To make a state transition when the ASCII code for the current character is i, we first determine whether Success[i] is null by examining bit i of the map. In case this bit is 0, the failure pointer is used. When this bit is 1, we determine the number of bits (popcount or rank) in bitmap positions less than i that are 1 and, using this count, the size of a node (44 bytes), and the value of the first child pointer, determine the location of the node to transition to. Since determining the popcount involves examining up to 255 bits, this operation is quite expensive (at least in software). To reduce the cost of determining the popcount, Tuck et al. [57] propose the use of summaries that give the popcount for the first $32j$, $1 \le j < 8$, bits of the bitmap. Using these summaries the popcount for any i may be determined by adding together a summary popcount and up to 31 bit values. Each summary needs to be 8 bits long (the maximum value is 255) and 7 summaries are needed. The size of a bit compressed node with summaries is, therefore, 51 bytes.

We note that the notion of using bitmaps and summaries for the compact representation of data structures (in particular, trees) was first advanced by Jacobson [24, 32] and has been used frequently in the context of data structures for network applications (see [14, 16, 52, 62], for example). While Jacobson [24, 32] suggests using several levels of summaries, [14, 57] use a single level. Also, Munro [32] has proposed a scheme that uses 3 levels of summaries, requires O(m) space, where m is the size of the bitmap, and enables the computation of the popcount by adding three summaries, one from each level.

The size of a bitmap node becomes 52 bytes when we add in the node type and failure pointer offset fields that are needed to support path compression (Figure 2-4).

2. Path Compression.

Path compression is similar to end-node optimization [16, 62]. An end-node sequence is a sequence of states at the bottom of the automaton (the start state is at the top of the automaton) that is comprised of states that have a single non-null

PAGE 25

Figure 2-4. A bitmap node of [57]

success transition (except the last state in the sequence, which has no non-null success transition). States in the same end-node sequence are packed together into one or more path compressed nodes. The number of these states that may be packed into a compressed node is limited by the capacity of a path compressed node. So, for example, if there is an end-node sequence s1, s2, ..., s6 and if the capacity of a path compressed node is 4 states, then s1, ..., s4 are packed into one node (say A) and s5 and s6 into another (say B). For each si packed into a path compressed node in this way, we need to store the 1-byte character for the transition plus the failure and rule list pointers for si. Since several automaton states are packed into a single compressed node, a 4-byte failure pointer that points to a compressed node isn't sufficient. In addition, we need an offset value that tells us which state within the compressed node we need to transition to. Using 3 bits for the offset, we can handle nodes with capacity $c \le 8$. Note that now, $\lceil 3c/8 \rceil$ bytes are needed for the offsets. Hence, a path compressed node whose capacity is $c \le 8$ needs $9c + \lceil 3c/8 \rceil$ bytes for the state information. Another 4 bytes are needed for a pointer to the next node (if any) in the sequence of path compressed nodes (i.e., a pointer from A to B). An additional byte is required to identify the node type (bitmap or compressed) and the size (number of states packed into this compressed node). So, the size of a compressed node is $9c + \lceil 3c/8 \rceil + 5$ bytes. The node type bit is now required in bitmap nodes as well, as is an offset for the failure pointer. Accounting for these fields, the size of a bitmap node becomes 52 bytes. Since a compressed node may be a sibling (states/nodes reachable by following a single

PAGE 26

Figure 2-5. A path compressed node of [57]

success pointer from any given state/node are siblings) of a bitmap node, we need to keep the size of both bitmap and path compressed nodes the same so that we can access easily the jth child of a bitmap node by performing arithmetic on the first child pointer. This requirement limits us to c = 5 and a path compressed node size that is 52 bytes. Figure 2-5 shows a path compressed node.

On the 1533-string Snort database of 2003, the memory required by the bitmapped-path compressed automaton using 1 level of summaries is about 1/50 that required by the optimized automaton, about 1/27 that required by the Wu-Manber data structure, and about 10% less than that required by the SFK search data structure [57]. However, the average search time, using a software implementation, is increased by between 10% and 20% relative to that for the optimized automaton, by between 30% and 100% relative to the Wu-Manber algorithm, and is about the same as for SFK search. The real payoff from the Aho-Corasick automaton comes with respect to worst-case search time. The worst-case search time using the Aho-Corasick automaton is between 1/4 and 1/3 that when the Wu-Manber or SFK search algorithms are used. The worst-case search time for the bitmapped-path compressed unoptimized automaton is between 50% and 100% more than for the optimized automaton [57].

2.3 Popcounts With Fewer Additions

A serious deficiency of the compression method of [57] is the need to perform up to 31 additions at each bitmap node. This seriously degrades worst-case performance and increases the clamor for hardware support for a popcount in network processors

PAGE 27

[57]. Since popcounts are used in a variety of network algorithms ([14, 16, 52, 62], for example) in addition to those for intrusion detection, we consider, in this section, the problem of determining the popcount independent of the application. This problem has been studied extensively by the algorithms community ([24, 31, 32], for example). In the algorithms community, the popcount problem is referred to as the bit-vector-rank problem, where the terms bitmap and bit vector are synonyms and popcount and rank are synonyms. We recast the best result for the bit-vector-rank problem using the bitmap-popcount terminology.

Munro [31, 32] has proposed a method to determine the popcount for an m-bit bitmap using 3 levels of summaries that together take o(m) bits of space. The popcount is determined by adding together 3 O(log m)-bit numbers, one from each of the 3 levels of summaries. Munro's method is described below:

1. Level 1 Summaries
Partition the bitmap into blocks of $s_1 = \lceil \log_2^2 m \rceil$ bits. The number of such blocks is $n_1 = \lceil m/s_1 \rceil$. Compute the level 1 summaries S1(1 : $n_1$), where S1(i) is the number of 1s in blocks 0 through $i - 1$, $1 \le i \le n_1$.

2. Level 2 Summaries
Each level 1 block j is partitioned into subblocks of $s_2 = \lceil \frac{1}{2} \log_2 m \rceil$ bits. The number of such subblocks is $n_2 = \lceil s_1/s_2 \rceil$. S2(j, i) is the number of 1s in subblocks 0 through $i - 1$ of block j, $0 \le j$
PAGE 28

Table 2-1. Lookup table for 4-bit blocks

i    in binary   T4(i,0)   T4(i,1)   T4(i,2)   T4(i,3)
0    0000        0         0         0         0
1    0001        0         0         0         0
2    0010        0         0         0         1
3    0011        0         0         0         1
4    0100        0         0         1         1
5    0101        0         0         1         1
6    0110        0         0         1         2
7    0111        0         0         1         2
8    1000        0         1         1         1
9    1001        0         1         1         1
10   1010        0         1         1         2
11   1011        0         1         1         2
12   1100        0         1         2         2
13   1101        0         1         2         2
14   1110        0         1         2         3
15   1111        0         1         2         3

representation of i; positions are numbered left to right beginning with 0 and a 4-bit representation of i is used.

One may verify that the total space required by the summaries is o(m) bits and that a popcount may be determined by adding one summary from each of the three levels. For a 256-bit bitmap, using Munro's method [31, 32], the level-1 blocks are $s_1 = 64$ bits long and there are $n_1 = 4$ of these; each level-1 block is partitioned into $n_2 = 16$ subblocks of size $s_2 = 4$; and the lookup table $T_{s_2}$ is T4.

Motivated by the work of Munro [31, 32], we propose 3 designs for summaries for a 256-bit bitmap. The first two of these use 3 levels of summaries and the third uses 2 levels.

1. Type I Summaries

Level 1 Summaries. For the level 1 summaries, the 256-bit bitmap is partitioned into 4 blocks of 64 bits each. S1(i) is the number of 1s in blocks 0 through $i - 1$, $1 \le i \le 3$.

Level 2 Summaries. For each block j of 64 bits, we keep a collection of level 2 summaries. For this purpose, the 64-bit block is partitioned into 16 4-bit subblocks.

PAGE 29

Figure 2-6. Type I summaries

S2(j, i) is the number of 1s in subblocks 0 through $i - 1$ of block j, $0 \le j \le 3$, $1 \le i \le 15$.

Level 3 Summaries. Each 4-bit subblock is partitioned into 2 2-bit subsubblocks. S3(j, i, 1) is the number of 1s in subsubblock 0 of the ith 4-bit subblock of the jth 64-bit block, $0 \le j \le 3$, $0 \le i \le 15$.

Figure 2-6 shows the setup for Type I summaries. When Type I summaries are used, the popcount for position q (i.e., the number of 1s preceding position q), $0 \le q < 256$, of the bitmap is obtained as follows:

Position q is in subblock $sb = \lfloor (q \bmod 64)/4 \rfloor$ of block $b = \lfloor q/64 \rfloor$. The subsubblock ssb is 0 when $q \bmod 4 < 2$ and 1 otherwise.

The popcount for position q is S1(b) + S2(b, sb) + S3(b, sb, ssb) + bit(q - 1), where bit(q - 1) is 0 if $q \bmod 2 = 0$ and is bit $q - 1$ of the bitmap otherwise; S1(0), S2(b, 0) and S3(b, sb, 0) are all 0.

As an example, consider the case q = 203. This bit is in subblock $sb = \lfloor (203 \bmod 64)/4 \rfloor = \lfloor 11/4 \rfloor = 2$ of block $b = \lfloor 203/64 \rfloor = 3$. Since $203 \bmod 4 = 3$, the subsubblock ssb is 1. The popcount for bit 203 is the number of 1s in positions 0 through 191 + the number in positions 192 through 199 + those in positions 200 through 201 + the number in position 202 = S1(3) + S2(3, 2) + S3(3, 2, 1) + bit(202).
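The Type I lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dissertation's implementation: the names `build_type1`, `locate`, and `popcount_type1` are hypothetical, the bitmap is held as a plain list of 0/1 values with position 0 first, and the always-zero entries S1(0), S2(b, 0), and S3(b, sb, 0) are stored explicitly to keep the indexing simple.

```python
import random


def build_type1(bits):
    """Build Type I summaries for a 256-bit bitmap given as a list of 0/1."""
    assert len(bits) == 256
    # S1[b]: number of 1s in 64-bit blocks 0 .. b-1
    S1 = [sum(bits[:64 * b]) for b in range(4)]
    # S2[b][sb]: number of 1s in 4-bit subblocks 0 .. sb-1 of block b
    S2 = [[sum(bits[64 * b:64 * b + 4 * sb]) for sb in range(16)]
          for b in range(4)]
    # S3[b][sb][ssb]: number of 1s in 2-bit subsubblocks 0 .. ssb-1 of
    # subblock sb of block b, so S3[b][sb][0] is always 0
    S3 = [[[sum(bits[64 * b + 4 * sb:64 * b + 4 * sb + 2 * ssb])
            for ssb in range(2)]
           for sb in range(16)]
          for b in range(4)]
    return S1, S2, S3


def locate(q):
    """Return (block, subblock, subsubblock) indices for bitmap position q."""
    b = q // 64
    sb = (q % 64) // 4
    ssb = 0 if q % 4 < 2 else 1
    return b, sb, ssb


def popcount_type1(bits, summaries, q):
    """Number of 1s in positions 0 .. q-1, using at most 3 additions."""
    S1, S2, S3 = summaries
    b, sb, ssb = locate(q)
    count = S1[b] + S2[b][sb] + S3[b][sb][ssb]
    if q % 2 == 1:          # bit(q-1) contributes only when q is odd
        count += bits[q - 1]
    return count


if __name__ == "__main__":
    rng = random.Random(0)
    bits = [rng.randint(0, 1) for _ in range(256)]
    summaries = build_type1(bits)
    # Cross-check every position against a direct count.
    for q in range(256):
        assert popcount_type1(bits, summaries, q) == sum(bits[:q])
    print(locate(203))  # (3, 2, 1), matching the worked example for q = 203
```

A list of Python integers is far less compact than the packed 8-, 6-, and 2-bit fields the text assumes; the sketch is meant to show the indexing arithmetic, not the memory layout.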

PAGE 30

Since we do not store summaries for b, sb, and ssb equal to zero, the code to compute the popcount takes the form

    if (b) popcount = S1(b); else popcount = 0;
    if (sb) popcount += S2(b, sb);
    if (ssb) popcount += S3(b, sb, ssb);
    if (q) popcount += bit(q - 1);

So, using Type I summaries, we can determine a popcount with at most 3 additions whereas using only 1 level of summaries as in [57], up to 31 additions are required. This reduction in the number of additions comes at the expense of memory. An S1() value lies between 0 and 192 and so requires 8 bits; an S2 value requires 6 bits and an S3 value requires 2 bits. So, we need $8 \times 3 = 24$ bits for the level-1 summaries, $6 \times 15 \times 4 = 360$ bits for the level-2 summaries, and $2 \times 1 \times 16 \times 4 = 128$ bits for the level-3 summaries. Therefore, 512 bits (or 64 bytes) are needed for the summaries. In contrast, the summaries of the 1-level scheme of [57] require only 56 bits (or 7 bytes).

2. Type II Summaries

These are exactly what is prescribed by Munro [31, 32]. S1 and S2 are as for Type I summaries. However, the S3 summaries are replaced by a summary table (Table 2-1) T4(0:15, 0:3) such that T4(i, j) is the number of 1s in positions 0 through $j - 1$ of the binary representation of i. The popcount for position q of a bitmap is S1(b) + S2(b, sb) + T4(d, e), where d is the integer whose binary representation is the bits in subblock sb of block b of the bitmap and e is the position of q within this subblock; S1 and S2 are for the current state/bitmap.

Since T4(i, j) $\le 3$, we need 2 bits for each entry of T4 for a total of 128 bits for the entire table. Recognizing that rows 2j and 2j + 1 are the same for every j, we may store only the even rows and reduce storage cost to 64 bits. A further reduction in storage cost for T4 is possible by noticing that all values in column 0 of this array are 0 and so we need not store this column explicitly. Actually, since only 1 copy of this table is needed, there seems to be little value (for our intrusion detection system application) to

PAGE 31

the suggested optimizations and we may store the entire table at a storage cost of 128 bits.

The memory required for the level 1 and 2 summaries is 24 + 360 = 384 bits (48 bytes), a reduction of 16 bytes compared to Type I summaries. When Type II summaries are used, a popcount is determined with 2 additions rather than 3 using Type I summaries and 31 using the 1-level summaries of [57].

3. Type III Summaries

These are 2-level summaries and, using these, the number of additions needed to compute a popcount is reduced to 1. Level-1 summaries are kept for the bitmap and a lookup table is used for the second level. For the level-1 summaries, we partition the bitmap into 16 blocks of 16 bits each. S1(i) is the number of 1s in blocks 0 through $i - 1$, $1 \le i \le 15$. The lookup table T16(i, j) gives the number of 1s in positions 0 through $j - 1$ of the binary representation of i, $0 \le i < 65{,}536 = 2^{16}$, $0 \le j < 16$. The popcount for position q of the bitmap is S1($\lfloor q/16 \rfloor$) + T16(d, e), where d is the integer whose binary representation is the bits in block $\lfloor q/16 \rfloor$ of the bitmap and e is the position of q within this block; S1 is for the current state/bitmap.

$8 \times 15 = 120$ bits (or 15 bytes) of memory are required for the level-1 summaries of a bitmap compared to 7 bytes in [57]. The lookup table T16 requires $2^{16} \times 16 \times 4$ bits as each table entry lies between 0 and 15 and so requires 4 bits. The total memory for T16 is 512KB. For a table of this size, it is worth considering the optimizations mentioned earlier in connection with T4. Since rows 2j and 2j + 1 are the same for all j, we may reduce table size to 256KB by storing explicitly only the even rows of T16. Another 16KB may be saved by not storing column 0 explicitly. Yet another 16KB reduction is achieved by splitting the optimized table into 2. Now, column 0 of one of them is all 0 and is all 1 in the other. So, column 0 may be eliminated. We note that optimization below 256KB may not be of much value as the increased complexity of using the table will outweigh the small reduction in storage.
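The table-driven Type II and Type III lookups can be sketched as follows. This is a minimal Python illustration under stated assumptions rather than the dissertation's code: the helper names (`prefix_table`, `bits_to_int`, `block_summaries`, `popcount_type2`, `popcount_type3`) are hypothetical, the bitmap is a list of 0/1 values with position 0 first, the zero-valued summaries are stored explicitly, and the tables are ordinary Python lists rather than packed 2-bit or 4-bit entries.

```python
def prefix_table(width):
    """T[i][j] = number of 1s in positions 0 .. j-1 of the width-bit binary
    representation of i, with position 0 the leftmost (most significant) bit."""
    table = []
    for i in range(1 << width):
        row, ones = [], 0
        for j in range(width):
            row.append(ones)
            ones += (i >> (width - 1 - j)) & 1
        table.append(row)
    return table


def bits_to_int(bits):
    """Interpret a list of 0/1 values as a binary number, leftmost bit first."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def block_summaries(bits, size):
    """S[k] = number of 1s in the first k blocks of `size` bits (S[0] = 0)."""
    return [sum(bits[:size * k]) for k in range(len(bits) // size)]


def popcount_type2(bits, S1, S2, T4, q):
    """Type II: two summaries plus a T4 lookup, i.e. 2 additions."""
    b, sb, e = q // 64, (q % 64) // 4, q % 4
    start = 64 * b + 4 * sb
    d = bits_to_int(bits[start:start + 4])
    return S1[b] + S2[b][sb] + T4[d][e]


def popcount_type3(bits, S1, T16, q):
    """Type III: one summary plus a T16 lookup, i.e. 1 addition."""
    blk, e = q // 16, q % 16
    d = bits_to_int(bits[16 * blk:16 * blk + 16])
    return S1[blk] + T16[d][e]


T4 = prefix_table(4)    # 16 x 4 entries, shared by all bitmap nodes
T16 = prefix_table(16)  # 65,536 x 16 entries, shared by all bitmap nodes

if __name__ == "__main__":
    bits = [1 if (i * i) % 7 < 3 else 0 for i in range(256)]
    S1_64 = block_summaries(bits, 64)
    S2 = [block_summaries(bits[64 * b:64 * b + 64], 4) for b in range(4)]
    S1_16 = block_summaries(bits, 16)
    for q in range(256):
        assert popcount_type2(bits, S1_64, S2, T4, q) == sum(bits[:q])
        assert popcount_type3(bits, S1_16, T16, q) == sum(bits[:q])
    print(T4[15])  # [0, 1, 2, 3], the last row of Table 2-1
    # Unoptimized T16 at 4 bits per entry: 2^16 * 16 * 4 bits = 512KB
    print(len(T16) * 16 * 4 // 8 // 1024)  # 512
```

In a packed implementation d would be read directly as a 4-bit or 16-bit field of the bitmap, so `bits_to_int` stands in for a single shift-and-mask; only the final additions counted in the text remain.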

Table 2-2. Distribution of states in a 3000 string Snort database

Degree               Number of Nodes   Percentage
0                    1964              7.75
1                    22453             88.6
2                    591               2.33
3                    149               0.58
4                    43                0.17
5                    35                0.14
6                    14                0.055
7                    23                0.090
8                    14                0.055
9                    8                 0.031
10                   6                 <0.03
11                   3                 <0.03
12                   4                 <0.03
13                   5                 <0.03
14                   3                 <0.03
15                   2                 <0.03
17, 18, 21, 51, 78   1 each            <0.03

2.4 Our Method to Compress the Non-Optimized Aho-Corasick Automaton

2.4.1 Classification of Automaton States

The Snort database had 3,578 strings in April, 2006. Table 2-2 profiles the states in the corresponding unoptimized Aho-Corasick automaton by degree (i.e., number of non-null success pointers in a state). As can be seen, there are only 36 states whose degree is more than 8, and the number of states whose degree is between 2 and 8 is 869. An overwhelming number of states (24,417) have a degree that is less than 2. However, 1639 of these 24,417 states are not in end-node sequences. This profile motivated us to classify the states into 3 categories: B (states whose degree is more than 8), L (states whose degree is between 2 and 8), and O (all other states). B states are those that will be represented using a bitmap, L states are low degree states, and O states are states whose degree is one or zero. In case the distribution of states in future string databases changes significantly, we can use a different classification of states.

Next, a finer (2 letter) state classification is done as below and in the stated order.

BB  All B states are reclassified as BB states.
BL  All L states that have a sibling BB state are reclassified as BL states.
BO  All O states that have a BB sibling are reclassified as BO states.
LL  All remaining L states are reclassified as LL states.
LO  All remaining O states that have an LL sibling are reclassified as LO states.
OO  All remaining O states are reclassified as OO states.

2.4.2 Node Types

Our compressed representation uses three node types: bitmap, low degree, and path compressed. These are described below.

2.4.2.1 Bitmap

A bitmap node has a 256-bit bitmap together with summaries; any of the three summary types described in Section 2.3 may be used. We note that when Type II or Type III summaries are used, only one copy of the lookup table (T4 or T16) is needed for the entire automaton. All bitmap nodes may share this single copy of the lookup table. When Type II summaries are used, the 128 bits needed by the unoptimized T4 are insignificant compared to the storage required by the remainder of the automaton. For Type III summaries, however, using a 512KB unoptimized T16 is quite wasteful of memory and it is desirable to go down to at least the 256KB version.

The memory required for a bitmap node depends on the summary type that is used. When Type I summaries are used, each bitmap node (Figure 2-7) is 110 bytes (we need 57 extra bytes compared to the 52-byte nodes of [57] for the larger summaries and an additional extra byte because we use larger failure pointer offsets). When Type II summaries are used, each bitmap node is 94 bytes, and the node size is 61 bytes when Type III summaries are used.

2.4.2.2 Low degree node

Low degree nodes are used for states that have between 2 and 8 success transitions. Figure 2-8 shows the format of such a node. In addition to fields for the
node type, failure pointer, failure pointer offset, rule list pointer, and first child pointer, a low degree node has the fields char1, ..., char8 for the up to 8 characters for which the state has a non-null success transition, and size, which gives us the number of these characters stored in the node. Since this number is between 2 and 8, 3 bits are sufficient for the size field. Although it is sufficient to allocate 22 bytes to a low degree node, we allocate 25 bytes, as this allows us to pack a path compressed node with up to 2 characters (i.e., an O2 node as described later) into a low degree node.

Figure 2-7. Our bitmap node

Figure 2-8. Our low degree node

2.4.2.3 Path compressed node

Unlike [57], we do not limit path compression to end-node sequences. Instead, we path compress any sequence of states whose degree is either 1 or 0. Further, we use variable-size path compressed nodes so that both short and long sequences may be compressed into a single node with no waste. In the path compression scheme of [57], an end-node sequence with 31 states will use 7 nodes, and in one of these the capacity utilization is only 20% (only one of the available 5 slots is used). Additionally, the overhead of the type, next node, and size fields is incurred for each of the path compressed nodes. By using variable-size path compressed nodes, all the space in such a node is utilized and the node overhead is paid just once. In our implementation, we limit the capacity of a path compressed node to 256 states. This requires that the failure pointer offsets in all nodes be at least 8 bits. A path compressed node whose
capacity is c, $c \le 256$, has c character fields, c failure pointers, c failure pointer offsets, c rule list pointers, 1 type field, 1 size field, and 1 next node field (Figure 2-9). We refer to the path compressed node of Figure 2-9 as an O node. Five special types of O nodes, O1 through O5, also are used by us. An Ol node, $1 \le l \le 5$, is simply an O node whose capacity is exactly l characters. For these special O-node types, we may dispense with the capacity field, as the capacity may be inferred from the node type.

Figure 2-9. Our path compressed node

The type fields (node type and first child type) are 3 bits. We use Type = 000 for a bitmap node, Type = 111 for a low degree node, and Type = 110 for an O node. The remaining 5 values for Type are assigned to Ol nodes. Since the capacity of an O node must be at least 6, we actually store the node's true capacity minus 6 in its capacity field. As a result, an 8-bit capacity field suffices for capacities up to 261. However, since failure pointer offsets are 8 bits, using an O node with capacity between 257 and 261 isn't possible. So, the limit on O node capacity is 256. The total size of a path compressed node O is 10c + 6 bytes, where c is the capacity of the O node. The size of an Ol node is 10l + 5 bytes, as we do not need the capacity field in such a node.

2.4.3 Memory Accesses

The number of memory accesses needed to process a node depends on the memory bandwidth W, how the node's fields are mapped to memory, and whether or not we get a match at the node. We provide the access analysis primarily for the case W = 32 bits.
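
The size and capacity-field arithmetic above can be captured in a few helper functions. This is only an illustrative C sketch: the function and constant names are hypothetical, and the assignment of the five remaining 3-bit type codes to O1 through O5 is not specified in the text, so it is left out.

```c
#include <assert.h>

enum { O_MIN_CAPACITY = 6, O_MAX_CAPACITY = 256, OL_MAX_CAPACITY = 5 };

/* Size in bytes of a general O node of capacity c: 10c + 6. */
static int o_node_bytes(int c) {
    assert(c >= O_MIN_CAPACITY && c <= O_MAX_CAPACITY);
    return 10 * c + 6;
}

/* Size in bytes of an Ol node (no capacity field): 10l + 5. */
static int ol_node_bytes(int l) {
    assert(l >= 1 && l <= OL_MAX_CAPACITY);
    return 10 * l + 5;
}

/* The 8-bit capacity field stores the true capacity minus 6. */
static unsigned char encode_capacity(int c) {
    assert(c >= O_MIN_CAPACITY && c <= O_MAX_CAPACITY);
    return (unsigned char)(c - O_MIN_CAPACITY);
}

/* Recover the true capacity from the stored field. */
static int decode_capacity(unsigned char field) {
    int c = field + O_MIN_CAPACITY;
    assert(c <= O_MAX_CAPACITY);  /* 257..261 are representable but unused */
    return c;
}
```

Note that ol_node_bytes(2) evaluates to 25, which agrees with the 25 bytes allocated to a low degree node so that an O2 node can be packed into it.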

2.4.4 Bitmap Node with Type I Summaries, W = 32

We map our bitmap node into memory by packing the node type, first child type, and failure pointer offset fields, as well as 2 of the 3 L1 summaries, into a 32-bit block; 2 bits of this block are unused. The remaining L1 summary (S1(3)) together with S2(0,) are placed into another 32-bit block. The remaining L2 summaries are packed into 32-bit blocks, 5 summaries per block; 2 bits per block are unused. The L3 summaries occupy 4 memory blocks, the bitmap takes 8 blocks, and each of the 3 pointers takes a block.

When a bitmap node is reached, the memory block with type fields is accessed to determine the node's actual type. The rule pointer is accessed so we can list all matching rules. A bitmap block is accessed to determine whether we have a match with the input string character. If the examined bit is 0, the failure pointer is accessed and we proceed to the node pointed to by this pointer; the failure pointer offset, which was retrieved from memory when the block with type fields was accessed, is used to position us at the proper place in the node pointed at by the failure pointer in case this node is a path compressed node. So, the total number of memory accesses when we do not have a match is 4. When the examined bit of the bitmap is 1, we compute a popcount. This may require between 0 and 3 memory accesses (for example, 0 are needed when bit 0 of the bitmap is examined or when the only summary required is S1(1) or S1(2)). Using the computed popcount, the first child pointer (another memory access), and the first child type (cannot be that of an O node), we move to the next node in our data structure. A total of 4 to 7 memory accesses are made.

2.4.5 Low Degree Node, W = 32

Next consider the case of a low degree node. We pack the type fields, size field, failure pointer offset field, and the char1 field into a memory block; 7 bits are unused. The remaining 7 char fields are packed into 2 blocks, leaving 8 bits unused. Each of the pointer fields occupies a memory block. When a low degree node is reached, we must access the memory block with type fields as well as the rule pointer. To determine
whether we have a match at this node, we do an ordered sequential search of the up to 8 characters stored in the node. Let i denote the number of characters examined. For i = 1, no additional memory access is required; one additional access is required when $2 \le i \le 5$, and 2 accesses are required when $6 \le i \le 8$. In case of no match, we need to access also the failure pointer; the first child pointer is retrieved in case of a match. The total number of memory accesses to process a low degree node is 3 to 5, regardless of whether there is a match.

2.4.6 Ol, $1 \le l \le 5$, Nodes, W = 32

For an O1 node, we place the type, failure pointer offset, and char1 fields into a memory block; the rule, failure, and first child pointers are placed into individual memory blocks. To process an O1 node, we first retrieve the type block and then the rule pointer. The rule pointer is used to list the matching rules. Then, we compare with char1, which is in the retrieved type block. If there is a match, we retrieve the first child pointer and proceed to the node pointed at. In case of no match, we retrieve the failure pointer, which together with the offset in the type block leads us to the next node. So, 3 accesses are needed when an O1 node is reached.

The mapping for an O2 node is similar to that used for an O1 node. This time, the type block contains char1 and char2; the additional rule pointer and failure offset pointers are placed in separate blocks. The number of memory accesses needed to process such a node is 3 when only char1 is examined (this happens when there is a mismatch at char1). When char2 also is examined, an additional rule pointer is retrieved. For a mismatch, we must retrieve the second failure pointer as well as its failure pointer offset. So, 5 accesses are needed. For a match, 4 accesses are required. So, in case of a mismatch in an O2 node, 3 or 5 accesses are needed; otherwise, 4 are needed.

For O3 nodes, we place char3 and its associated failure pointer offset into the memory block of O2 that contains the second failure pointer offset. The associated rule and failure pointers are placed in separate memory blocks. When all 3 characters are
matched, we need 6 memory accesses. When a mismatch occurs at char1, there are 3 accesses; at char2, there are 5 accesses; and at char3, there are 6 accesses.

An alternative mapping for an O3 node places the data fields into memory in the following order: node and first child type fields (1 byte total), pairs of character and rule pointer fields ((charj, rule pointerj), 5 bytes per pair), first child pointer (4 bytes), and pairs of failure pointer and failure pointer offsets (5 bytes per pair). When i characters are examined, we retrieve $\lceil (1+5i)/4 \rceil$ blocks to process the characters and their rule pointers. In case of a mismatch at character i, 2 additional accesses are needed to retrieve the corresponding failure pointer and its offset. In case of a match, a single additional memory access gets us the first child pointer. So, the total number of memory accesses is $\lceil (1+5i)/4 \rceil + 2$ when there is a mismatch and $\lceil (1+5i)/4 \rceil + 1$ when all characters in the node are matched. When this alternative mapping is used, a mismatch at character i, $1 \le i \le 3$, takes 4, 5, and 6 memory accesses, respectively. When there is no mismatch, 5 memory accesses are required.

For an O4 node, we extend the original O3 mapping by placing char3, char4, and offset pointers 3 and 4 in one memory block, and offset pointer 2 in another. Rule and failure pointers occupy one block each. When all 4 characters are matched, we need 7 memory accesses. A mismatch at character i, $1 \le i \le 4$, results in 3, 5, 6, and 7 accesses, respectively.

An O5 node is mapped with chars 3, 4, 5 and offset pointer 3 in a memory block and offset pointers 2, 4, and 5 in another. When all 5 characters in an O5 node are matched, there are 8 memory accesses. When there is a mismatch at character i, $1 \le i \le 5$, the number of memory accesses is 3, 5, 6, 8, and 9, respectively.

2.4.7 O Nodes, W = 32 and 1024

For simplicity, we extend the alternative mapping described above for O3 nodes. Fields are mapped to memory in the order: node type, first child type, and capacity fields (2 bytes total), pairs of character and rule pointer fields ((charj, rule pointerj), 5 bytes
per pair), first child pointer (4 bytes), and pairs of failure pointer and failure pointer offsets (5 bytes per pair). The memory access analysis is similar to that for O3 nodes, and the total number of memory accesses, when W = 32, is $\lceil (2+5i)/4 \rceil + 2$ when there is a mismatch and $\lceil (2+5i)/4 \rceil + 1$ when all characters in the node are matched.

When W = 1024, an O node fits into a single memory block provided its capacity, c, is no more than 12. Hence, for $c \le 12$, a single memory access suffices to process this node. When $c > 12$, the memory access count using the above mapping is $\lceil (2+5i)/128 \rceil + 1$. Since $i \le c \le 256$, at most 12 memory accesses are needed to process an O node when W = 1024.

2.4.8 Path Compressed Node of Tuck, W = 32 and 1024

When W = 32, the type, size, failure offset 1, and char 1 through 3 fields of the path compressed node of [57] may be mapped into a single memory block. The char 4 and 5 fields together with the 4 remaining failure pointer offset fields may be mapped into another memory block. For a mismatch at char 1, we need to access block 1, rule pointer 1, and failure pointer 1, for a total of 3 memory accesses. For a failure at char i, $2 \le i \le$ size, we must access also block 2 and an additional $i-1$ rule pointers. The memory access count is 3 + i. Notice that since [57] path compresses end-node sequences only, a failure must occur whenever we process a path compressed node whose size is less than 5, as the last state in such a node has no success transition (i.e., its degree is 0 in the Aho-Corasick automaton). Hence, for a match at this node, we may assume that the size is 5. The two blocks, 5 rule pointers, and the first child pointer are accessed. The total number of memory accesses is 8.

When W = 1024, all 52 bytes of the path compressed node fit in a memory block. So, only 1 memory access is needed to process the node. Note that for an end-node sequence with 256 states, 53 path compressed nodes are used. The worst-case number of accesses to go through this end-node sequence is 53. Using our O node, 12 memory accesses are made in the worst case.
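
The access counts derived above for the alternative O3 mapping and for general O nodes reduce to ceiling divisions, which the following C sketch evaluates. The function names are hypothetical, and the flag all_matched stands for the case in which every character in the node is matched; this is a restatement of the formulas in the text, not code from the evaluated system.

```c
#include <assert.h>

/* Ceiling of a / b for non-negative a and positive b. */
static int ceil_div(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Alternative O3 mapping, W = 32 bits: 1 header byte, then 5 bytes per
   (char, rule pointer) pair; i = number of characters examined. */
static int o3_alt_accesses(int i, int all_matched) {
    assert(i >= 1 && i <= 3);
    return ceil_div(1 + 5 * i, 4) + (all_matched ? 1 : 2);
}

/* General O node, W = 32 bits: 2 header bytes instead of 1. */
static int o_accesses_w32(int i, int all_matched) {
    assert(i >= 1 && i <= 256);
    return ceil_div(2 + 5 * i, 4) + (all_matched ? 1 : 2);
}

/* General O node of capacity c, W = 1024 bits (128-byte blocks). */
static int o_accesses_w1024(int c, int i) {
    assert(i >= 1 && i <= c && c <= 256);
    if (c <= 12) return 1;            /* whole node fits in one block */
    return ceil_div(2 + 5 * i, 128) + 1;
}
```

For example, o3_alt_accesses(i, 0) gives 4, 5, and 6 for i = 1, 2, 3, and o_accesses_w1024(256, 256) gives the worst-case count of 12 quoted above.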

Table 2-3. Memory accesses to process a node for W = 32 and W = 64

           W = 32                                                  W = 64
Node       Match                          Mismatch                         Match                          Mismatch
B (I)      4 to 7                         4                                3 to 6                         3
B (II)     4 to 6                         4                                3 to 5                         3
B (III)    4 to 5                         4                                3 to 4                         3
L          3 to 5                         3 to 5                           2 to 3                         2 to 3
O1         3                              3                                2                              2
O2         4                              3 or 5                           2                              2 or 3
O3         6                              3, 5, or 6                       3                              2 or 3
O4         7                              3 or 5 to 7                      4                              2 to 4
O5         8                              3, 5, 6, 8, or 9                 4                              2, 3 to 5
O          3, $\lceil (2+5i)/4 \rceil+1$  3, $\lceil (2+5i)/4 \rceil+2$    2, $\lceil (2+5i)/8 \rceil+1$  2, $\lceil (2+5i)/8 \rceil+1$
TB [32]    4 to 5                         4                                3                              3
TO [32]    1+i, 6, 8                      3, 3+i                           1+i, 7                         $2+\lceil i/2 \rceil$, 4

2.4.9 Summary

Using a similar analysis, we can derive the memory access counts for different values of the memory bandwidth W, other summary types, and other node types. Tables 2-3 and 2-4 give the access counts for the different node and summary types for a few sample values of W. The rows labeled B (bitmap), L (low degree), Ol (O1 through O5), and O refer to node types for our structure, while those labeled TB (bitmap) and TO (one degree) refer to node types in the structure of Tuck et al. [57]. We note that the counts of Tables 2-3 and 2-4 are specific to a certain mapping of the fields of a node to memory. Using a different mapping will change the memory access count. However, we believe that the mappings used in our analysis are quite reasonable and that using alternative mappings will not improve these counts in any significant manner.

2.4.10 Mapping States to Nodes

We map states to nodes as follows and in the stated order.

1. Category BX, $X \in \{B, L, O\}$, states are mapped to 1 bitmap node each; sibling states are mapped to nodes that are contiguous in memory. Note that in the case of BL and BO states, only a portion of a bitmap node is used.

2. Maximal sets of LX, $X \in \{L, O\}$, states that are siblings are packed into unused space in a bitmap node created in (1), using 25 bytes per LX state and the low degree structure of Figure 2-8. By this, we mean that if there are (say) 3 LX states that are siblings and there is a bitmap node with at least 75 bytes of unused space, all 3 siblings are packed into this unused space. If there is no bitmap node with this much unutilized space, none of the 3 siblings is packed into a bitmap node. The packing of sibling LX nodes is done in non-increasing order of the number of siblings. Note that by packing all siblings into a single bitmap node, we make it possible to access any child of a bitmap node using its first child pointer, the child's rank (i.e., index in the layout of contiguous siblings), and the size of the first child (this is determined by the type of the first child). Note that when an LO state whose child is an OO state is mapped in this way, it is mapped together with its lone OO-state child into a single 25-byte O2 node, which is the same size as a low degree node.

3. The remaining LX states are mapped into low degree nodes (LL states) or O2 nodes (LO states). LL states are mapped one state per low degree node. As before, when an LO state whose child is an OO state is mapped in this way, it is mapped together with its lone OO-state child into a single 25-byte O2 node. Sibling states are mapped to nodes that are contiguous in memory.

4. The chains of remaining OO states are handled in groups, where a group is comprised of chains whose first nodes are siblings. In each group, we find the length, l, of the shortest chain. If l > 5, set l = 5. Each chain is mapped to an Ol node followed by an O node. The Ol nodes for the group are in contiguous memory. Note that an O node can only be the child of an Ol node or another O node.

Table 2-4. Memory accesses to process a node for W = 128 and W = 1024

           W = 128                                                          W = 1024
Node       Match                           Mismatch                         Match                            Mismatch
B (I)      2 to 5                          2                                1                                1
B (II)     2 to 4                          2                                1                                1
B (III)    2 to 3                          2                                1                                1
L          1 to 2                          1 to 2                           1                                1
O1         1                               1                                1                                1
O2         1                               2                                1                                1
O3         2                               2                                1                                1
O4         2                               2                                1                                1
O5         2                               2 or 3                           1                                1
O          1, $\lceil (2+5i)/16 \rceil+1$  1, $\lceil (2+5i)/16 \rceil+1$   1, $\lceil (2+5i)/128 \rceil+1$  1, $\lceil (2+5i)/128 \rceil+1$
TB [32]    2                               3                                1                                1
TO [32]    2                               2                                1                                1
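
Three small pieces of the mapping procedure lend themselves to a short C sketch: the degree-based B/L/O classification of Section 2.4.1, the choice of the shared Ol length in step 4, and the address computation for contiguous, equal-sized siblings. All names below are hypothetical, and the sketch deliberately says nothing about how the remainder of a chain is laid out in O nodes, since the text does not spell that out.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CAT_O, CAT_L, CAT_B } Category;

/* B: degree above 8; L: degree 2 through 8; O: degree 0 or 1. */
static Category classify(int degree) {
    assert(degree >= 0 && degree <= 256);
    if (degree > 8) return CAT_B;
    if (degree >= 2) return CAT_L;
    return CAT_O;
}

/* Step 4: the Ol length shared by a group of sibling chains is the
   length of the shortest chain in the group, capped at 5. */
static int group_prefix_length(const int *chain_len, size_t n) {
    assert(n > 0);
    int l = chain_len[0];
    for (size_t k = 1; k < n; ++k)
        if (chain_len[k] < l) l = chain_len[k];
    return l > 5 ? 5 : l;
}

/* Siblings are contiguous and of equal size, so a child is located from
   the first child pointer, its rank, and the size of the first child. */
static size_t child_address(size_t first_child, size_t rank, size_t child_bytes) {
    return first_child + rank * child_bytes;
}
```

With 25-byte siblings, for instance, the child of rank 2 sits 50 bytes past the first child.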

2.5 Experimental Results

We benchmarked our compression method of Section 2.4 against that proposed by Tuck et al. [57] using two data sets of strings extracted from Snort [51] rule sets. The first data set has 1284 strings and the second has 2430 strings. We name each data set by the number of strings in the data set.

2.5.1 Number of Nodes

Tables 2-5 and 2-6 give the number of nodes of type I, type II, type III, and Tuck et al. [57] in the compressed Aho-Corasick structure for each of our string sets. The maximum capacity of an allocated O node was 141 for data set 1284 and 256 for data set 2430.

Table 2-5. Number of nodes of each type; Ol and O counts are for Type I summaries

Node Type        B      L      Ol     O      TB     TO
Data Set 1284    133    595    850    454    1057   2955
Data Set 2430    100    769    938    576    1527   3310

Table 2-6. Number of Ol and O nodes for Type II and Type III summaries

Node Type        Ol (II)   O (II)   Ol (III)   O (III)
Data Set 1284    848       456      851        464
Data Set 2430    938       576      940        578

2.5.2 Memory Requirement

Although the total number of nodes used by us is less than that used by Tuck et al. [57], our nodes are larger, and so the potential remains that we actually use more memory than used by the structure of Tuck et al. [57]. Tables 2-7 and 2-8 give the number of bytes of memory used by the structure of [57] as well as that used by our structure for each of the different summary types of Section 2.3. Recall that the size of a B node depends on the summary type that is used. As stated in Section 2.4, the B node size is 110 bytes for Type I summaries, 94 bytes for Type II summaries, and 61 bytes for Type III summaries. The memory numbers given in Tables 2-7 and 2-8 do not include the 16 bytes (or less) needed for the single T4 table used by Type II summaries or the 256KB needed by the T16 table used by Type III summaries. In the case of Type II
summaries, adding in the 16 bytes needed by T4 doesn't materially affect the numbers reported in Tables 2-7 and 2-8. For Type III summaries, the 256KB needed for T16 is more than what is needed for the rest of the data structure. However, as the data set size increases, this cost remains fixed at 256KB. The row labeled Normalized gives the memory required normalized by that required by the structure of Tuck et al. [57]. The normalized values are plotted in Figure 2-10. As can be seen, our structures take between 24% and 31% less memory than is required by the structure of [57]. With the 256KB required by T16 added in for Type III summaries, the Type III representation takes twice as much memory as does [57] for the 1284 data set and 75% more for the 2430 data set. As the size of the data set increases, we expect Type II summaries to be more competitive than [57] on total memory required.

Table 2-7. Memory requirement for data set 1284

Methods           [57]      Type I    Type II   Type III
Memory (bytes)    208624    157549    155603    152237
Normalized        1         0.76      0.75      0.73

* Excludes memory for T4 and T16

Table 2-8. Memory requirement for data set 2430

Methods           [57]      Type I    Type II   Type III
Memory (bytes)    251524    177061    175511    172523
Normalized        1         0.70      0.70      0.69

* Excludes memory for T4 and T16

Figure 2-10. Normalized memory requirement

2.5.3 Popcount

Tables 2-9 and 2-10 give the total number of additions required to compute popcounts when using each of the data structures. For this experiment, we used 5 query strings obtained by concatenating a differing number of real emails that were classified as spam by our spam filter. The string lengths varied from 1MB to 5MB, and we counted the number of additions needed to report all occurrences of all strings in the Snort data sets (1284 or 2430) in each of the query strings. The last row of each table is the total number of adds for all query strings normalized by the total for the structure of [57]. The normalized values are plotted in Figure 2-11. When Type III summaries are used, the number of popcount additions is only 7% of that used by the structure of [57]. Type I and Type II summaries require about 13% and 12%, respectively, of the number of additions required by [57].

Table 2-9. Number of popcount additions, data set 1284

Methods             [57]       Type I    Type II   Type III
strlen = 1002832    10.61M     1.37M     1.25M     0.76M
strlen = 2032131    32.21M     4.15M     3.79M     2.29M
strlen = 3002665    64.26M     8.25M     7.51M     4.55M
strlen = 4006579    107.21M    13.74M    12.49M    7.56M
strlen = 5035666    161.76M    20.75M    18.82M    11.37M
Normalized          1          0.128     0.117     0.071

Table 2-10. Number of popcount additions, data set 2430

Methods             [57]       Type I    Type II   Type III
strlen = 1002832    11.54M     1.46M     1.33M     0.79M
strlen = 2032131    34.97M     4.43M     4.02M     2.42M
strlen = 3002665    69.54M     8.78M     7.96M     4.80M
strlen = 4006579    116.11M    14.67M    13.28M    8.00M
strlen = 5035666    175.60M    22.25M    20.09M    12.08M
Normalized          1          0.127     0.114     0.069

Figure 2-11. Normalized additions for popcount

CHAPTER 3
COMPRESSED OBJECT ORIENTED NFA FOR MULTI-PATTERN MATCHING

3.1 The Object Oriented NFA for Multi-pattern Matching

The construction of an Object Oriented Nondeterministic Finite Automaton (NFA) for multi-pattern matching starts with the Aho-Corasick NFA. Failure transitions in the AC NFA are eliminated and null transitions are processed as described below. The following discussion follows closely the development in [33].

Figures 3-1 through 3-3 show the three functions: Completion of the OO graph, Add an edge to the graph, and Object Oriented NFA Search Function.

For the completion of the NFA graph algorithm, for each pattern we perform the following steps.

1. Starting from the root, move to the next state according to the pattern's first character.
2. Move to the next state using the pattern's second character; call it the current state.
3. Repeat the following until the end of the pattern is reached.
   (a) Check the root state to see if there is a transition on the current ith character. If so, we place a state card representing that transition next to the current state and copy all the transitions from that state to the current state (except on characters for which the current state already has an exiting transition).
   (b) Move to the next state using the pattern's next character. If we have placed state cards next to our previous state, check if there is a transition on our current character. If so, place the state card for that transition next to our current state card and copy all the transitions from that state to the current state (except on characters for which the current state already has an exiting transition).

Algorithm 1. Completion of the OO graph
Input. List of patterns, P
Output. Completion of the OO graph with starting node root.
begin
  queue <- empty
  currentstate = root->next[p[0]]
  for i <- 1 until strlen(p)
  begin
    currentstate = currentstate->next[p[i]]
    while queue <> empty do
    begin
      let temp be the next state in queue
      queue <- queue - {temp}
      temp = temp->next[p[i]]
      if (temp <> NULL)
      begin
        queue <- queue U {temp}
        addEdges(currentstate, temp)
      end
    end while
    temp = root->next[p[i]]
    if (temp <> NULL)
    begin
      queue <- queue U {temp}
      addEdges(currentstate, temp)
    end
  end for
end

Figure 3-1. Algorithm 1. Completion of the OO graph

Algorithm 2. Add an edge to the graph
Input. startstate, endstate
begin
  for i <- 0 until 255
  begin
    temp = endstate->next[i]
    if temp <> NULL
      startstate->next[i] = temp
  end for
end

Figure 3-2. Algorithm 2. Add an edge to the graph

Algorithm 3. Object Oriented NFA Search Function
Input. input text string
begin
  for i <- 1 until strlen(text)
  begin
    if (currentstate <> NULL)
    begin
      currentstate = currentstate->next[text[i]]
      if (currentstate <> NULL)
        check matched rule
      else
        currentstate = root->next[text[i]]
    else
      currentstate = root->next[text[i]]
    end
  end for
end

Figure 3-3. Algorithm 3. Object Oriented NFA Search Function

* The search function of the OO NFA is quite simple. It makes at most 2 transitions on an input text character, and every time there is no match it goes back to the transition starting from the root state.

The AC NFA (with failure transitions) for the pattern set {hers, his, she} is shown in Figure 3-4. To complete the OO NFA, we start with the NFA of Figure 3-4 and process the patterns one by one using the algorithm of Figure 3-1.

When processing the first pattern hers, we start from root state 0 and move to state 1 using the pattern's 1st character h. Then we move to state 2 using character e. We check the initial root state to see whether there is a transition on our current character e. There is none. So we move to the next state 3 using character r. Since there is no transition on r, we move to the next state 4 using the last character s. Now there exists a transition from the root state 0 to state 7 on s. So we copy the state 7 card over next to state 4.

For the second pattern his, when we move from state 5 to state 6 on s, we find there exists a transition from the root state 0 to state 7 on s. So we copy the state 7 card over next to state 6.

For the third pattern she, when we move from state 7 to state 8 on character h, we copy state card 1 next to state 8 because there is a transition from the root state 0 to state 1 on h. Then we move to the next state 9 on character e. Because we placed state card 1 next to our previous state 8, we need to check whether there exists a transition from that card 1 on our current character e. There is one, to state 2. So we copy state card 2 next to state 9.

Figure 3-5 demonstrates what is described above.
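
The three algorithms can be exercised on the pattern set {hers, his, she} with the following self-contained C sketch. It is an illustration under several assumptions rather than the implementation benchmarked in this chapter: nodes come from a small static pool, each node records at most one pattern index instead of a rule list, the edge copy follows the prose rule (transitions the state already has are kept) rather than the unconditional overwrite written in Algorithm 2, the state cards of the current step are collected in a second array so that the loop over the previous cards terminates, and a match is checked after every landing, including a restart from the root. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ALPHABET 256
#define MAX_NODES 64   /* enough for the three-pattern example */

typedef struct Node {
    struct Node *next[ALPHABET]; /* success transitions (own and copied) */
    int pattern;                 /* index of the pattern ending here, or -1 */
} Node;

static Node pool[MAX_NODES];     /* static storage: next[] starts out NULL */
static int used = 0;

static Node *new_node(void) {
    assert(used < MAX_NODES);
    Node *n = &pool[used++];
    n->pattern = -1;
    return n;
}

/* Insert one pattern into the trie rooted at root. */
static void insert(Node *root, const char *p, int id) {
    Node *cur = root;
    for (; *p; ++p) {
        unsigned char c = (unsigned char)*p;
        if (cur->next[c] == NULL) cur->next[c] = new_node();
        cur = cur->next[c];
    }
    cur->pattern = id;
}

/* Copy the transitions of card into state, keeping any transition
   that state already has. */
static void add_edges(Node *state, const Node *card) {
    for (int c = 0; c < ALPHABET; ++c)
        if (state->next[c] == NULL && card->next[c] != NULL)
            state->next[c] = card->next[c];
}

/* Complete the graph for one pattern; the full trie must exist already. */
static void complete(Node *root, const char *p) {
    Node *cards[MAX_NODES], *fresh[MAX_NODES];
    size_t ncards = 0, len = strlen(p);
    assert(len < MAX_NODES);
    if (len == 0) return;
    Node *cur = root->next[(unsigned char)p[0]];
    for (size_t i = 1; i < len; ++i) {
        unsigned char c = (unsigned char)p[i];
        size_t nfresh = 0;
        cur = cur->next[c];                  /* own trie edge, never replaced */
        for (size_t k = 0; k < ncards; ++k) {
            Node *t = cards[k]->next[c];     /* advance each state card */
            if (t != NULL) { fresh[nfresh++] = t; add_edges(cur, t); }
        }
        Node *t = root->next[c];             /* a card may also start here */
        if (t != NULL) { fresh[nfresh++] = t; add_edges(cur, t); }
        memcpy(cards, fresh, nfresh * sizeof fresh[0]);
        ncards = nfresh;
    }
}

/* Build the trie for all patterns, then complete it pattern by pattern. */
static Node *build(const char *const patterns[], int n) {
    Node *root = new_node();
    for (int i = 0; i < n; ++i) insert(root, patterns[i], i);
    for (int i = 0; i < n; ++i) complete(root, patterns[i]);
    return root;
}

/* Scan text and return the number of pattern occurrences reported. */
static int search(const Node *root, const char *text) {
    const Node *cur = NULL;
    int hits = 0;
    for (; *text; ++text) {
        unsigned char c = (unsigned char)*text;
        const Node *nxt = (cur != NULL) ? cur->next[c] : NULL;
        if (nxt == NULL) nxt = root->next[c];   /* restart from the root */
        cur = nxt;
        if (cur != NULL && cur->pattern >= 0) ++hits;
    }
    return hits;
}
```

On this pattern set the completion adds exactly the four copied transitions described in the walk-through (state 4 on h, state 6 on h, state 8 on i, and state 9 on r), so scanning the text ushers reports she and then hers without any failure transition.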

Figure 3-4. Aho-Corasick NFA with failure pointers (all failure pointers point to state 0)

Figure 3-6 converts this state card representation to a Nondeterministic Finite Automaton (NFA). As we can see, four additional transitions have been added to complete the initial NFA of Figure 3-4.

Figure 3-4 is the Aho-Corasick NFA with failure pointers on the same set of patterns.

Figure 3-7 is the DFA obtained from both Figures 3-4 and 3-6 for our set of patterns.

From the above figures, one observation we can make is that the OO NFA is a partially transformed automaton on the way from the Aho-Corasick NFA to the final DFA. It keeps all the DFA transitions that start from states of depth $\ge 2$ and end at states of depth $\ge 2$.

3.2 Compressed OO NFA

The Object Oriented NFA may be compressed to obtain a compressed OO trie using the method of Tuck et al. [57]. Three types of nodes are employed in a compressed OO trie. Besides bitmap and path compressed nodes, we add COPY nodes that simply play the role of a soft link to an original node that has been copied over next to the current node. Figure 3-10 shows the format of a COPY node. Bitmap and path compressed nodes in compressed OO tries use the same format as in [57], except that the failure pointer and failure pointer offset fields are omitted (Figures 3-8 and 3-9).

Figure 3-5. The Object Oriented NFA (state cards representation)

3.3 Experimental Results

We benchmarked the OO method of Section 3.1 and our compression method of Section 3.2 against that proposed by Tuck et al. [57] and the Aho-Corasick automata [1] using six data sets of 1000, 2000, 3000, 4000, 5000, and 6000 English words. The experiments were performed in a Fedora 7 Linux environment and all the programs are in C++.

Table 3-1 gives search times for English patterns.

Tables 3-2 and 3-3 give the memory required for the different multi-pattern data structures and the node distribution.

Figure 3-6. OO NFA (all states that have no matched character will return back to state 0)

Figure 3-7. The DFA for our set of patterns

Table 3-4 gives the compression ratio, relative to the original uncompressed data structure, achieved by a compressed data structure.

Although the OO method of Section 3.1 is faster than the Aho-Corasick automata [1] by 25% to 29% for search, the compressed OO trie method is slower than the compressed Aho-Corasick trie [57] by 8% to 21%. Also, the compression ratio for the OO method is not as good as that of the Aho-Corasick compression method [57]. The OO compression ratio is about 1.6 to 3.3, much less than the Aho-Corasick compression ratio (37.3 to 40.8). The reason for this is that the OO structure has more non-null next node pointers. These extra non-null next node pointers point to a large number of COPY nodes, which do not exist in the Aho-Corasick compression method [57].

While the OO structure is preferred over the other structures considered in this chapter in applications that are not memory constrained, the compressed Aho-Corasick trie is preferred when memory is severely limited.

Figure 3-8. OO bitmap node

Figure 3-9. OO path compressed node

Figure 3-10. OO copy node

Table 3-1. Search times (milliseconds) for English patterns

Length    OO      AC      OOC       ACC
10000     615     813     44151     48346
20000     1215    2075    87279     80387
30000     2079    2289    141087    117884
40000     2308    2957    183864    151289
50000     2667    3737    268681    184761
60000     3137    4373    282647    224662

Table 3-2. Memory required (bytes) for English patterns

Num of Patterns    OOC           OO            ACC        AC
1000               1,691,460     5,642,240     138,772    5,664,280
2000               5,794,312     11,313,152    278,380    11,357,344
3000               8,944,576     15,733,760    396,540    15,795,220
4000               11,499,232    19,766,272    515,764    19,843,484
5000               13,989,200    23,769,088    632,524    23,861,936
6000               16,587,328    28,075,008    754,716    28,184,676

Table 3-3. Number of nodes of each type for English patterns

Num of Patterns    OO-Bitmap    OO-Compressed    OO-COPY    AC-Bitmap    AC-Compressed
1000               3,392        1,182            27,955     914          1,564
2000               10,792       249              100,389    1,820        3,151
3000               15,250       112              156,650    2,759        4,322
4000               19,155       144              201,841    3,848        5,362
5000               23,049       159              245,816    4,927        6,368
6000               27,205       208              291,575    6,031        7,446

Table 3-4. Compression ratio for English patterns

Num of Patterns    OO/OOC    AC/ACC    OOC/ACC
1000               3.34      40.81     12.18
2000               1.95      40.80     20.81
3000               1.76      39.83     22.56
4000               1.72      38.47     22.30
5000               1.70      37.72     22.17
6000               1.69      37.34     21.98

OO = OO method of Section 3.1, AC = Aho-Corasick automata [1], OOC = our compression method of Section 3.2, ACC = Aho-Corasick compression method [57]

CHAPTER 4
THE COMPRESSED AHO-CORASICK AUTOMATA ON IBM CELL PROCESSOR

In this chapter, we develop a multicore algorithm for multi-pattern matching. Specifically, we choose an established multi-core architecture, the IBM Cell/Broadband Engine (Cell), for our work because it is a prominent architecture in the high-performance computing community, it has shown potential in string matching applications, and it presents software designers with non-trivial challenges that are representative of the next generations of multi-core architectures.

With our proposed algorithm, we achieve an average compression ratio of 1:34 for English words and 1:58 for random binary patterns. Our implementation provides a sustained throughput between 0.90 and 2.35 Gbps per Cell blade in different application scenarios, while supporting dictionary densities up to 9.26 million average patterns per Gbyte of main memory.

4.1 The Cell/Broadband Engine Architecture

The Cell processor [10] contains 9 heterogeneous cores on a silicon die. One of them is a traditional 64-bit processor with cache memories and 2-way simultaneous multi-threading, called the Power Processor Element (PPE), which is capable of running a full-featured operating system and traditional PowerPC applications. The other 8 cores are called Synergistic Processor Elements (SPEs). They have no caches, but rather a small amount of scratch-pad memory (256 kbyte) that the programmer must manage explicitly by issuing DMA transfers from and to the main memory. The cores are connected with each other via the Element Interconnect Bus (EIB), a fast double-ring on-chip network.

Figure 4-1 shows the chip layout of the Cell architecture.

The Cell delivers its best performance when the SPEs are kept highly utilized by streaming tasks that load data from main memory, process data locally, and commit the results back to main memory. These tasks exhibit a regular, predictable memory access
Table4-1. CompressionRatiosobtainedbyourtechniqueontwosampledictionariesofcomparableuncompressedsize.Dictionary(1)containsthe20,000mostcommonwordsintheEnglishlanguage.Dictionary(2)contains8,000randombinarypatternsofsameaveragelengthasinDictionary(1). DictionaryOriginalPackingCompressedCompressionACSizeFactorACSizeRatio (1)English48.86Mbytes41.41Mbytes34.7881.83Mbytes26.65(2)Binary52.37Mbytes40.90Mbytes58.0780.85Mbytes61.53120.86Mbytes60.83 patternthattheprogrammercanexploittoimplementdoublebuffering,andoverlapcomputationanddata-transferovertime. AchievinghighperformanceontheCellwithnon-streamingapplicationsisallbuttrivial,andalgorithmsbasedonDFAslikeoursarearguablythemostdifculttoport.Infact,thesealgorithmsexhibitunpredictablememoryaccesspatternsandacomplexlatencyinteractionbetweencomputecodeanddata-transfercode.Thesecircumstancesmakeitdifculttodeterminewhatrepresentsthecriticalpathinthecode,andhowtooptimizeit. Figure 4-3 showsapath-compressednode. 4.2Cell-orientedAlgorithmDesign ThissectiondescribestheimplementationchoiceswemadetoadaptourACNFAalgorithmtotheCellprocessor. Tocomputepopcountsefciently,weemploytheCNTBandSUMBinstructions(availableattheClevelviathespu cntb()andspu sumb()intrinsics).Thesereducethenumberofoperationstocomputethepopcountfrom31additions(summary+bit0+bit1+...+bit30)totwospuinstructionsplusonesummaryaddition.Samplecodetocomputethepopcountforchildnodei(0i255)ofacompressedAho-Corasicknodeisgivenbelow. popcount=get_summary(i); 55

PAGE 56

bitblock=get_bitmapblock(i);charvector=spu_promote(bitblock,0);countbyteones=spu_cntb((charvector);countblockones=spu_sumb(countbyteones,countbyteones);popcount=popcount+spu_extract(countblockones,0); Also,weemployvectorcomparisoninstructionstogetthelongestmatchbetweentheinputandcompressedpaths. Foralignmentreasons,weonlyconsiderpath-compressednodeswithpackingfactors(c)of4,8and12.Table 4-1 showsthecorrespondingcompressionratio.Notethat4isthebestchoicefortheEnglishdictionaryand8isbestforrandombinarypatterns.Forsimplicity,weconsiderapackingfactorof4intheexperimentsthatfollow.Thedifferenceincompressiongainobtainedwithapackingfactorof8isnotsignicantenoughtojustifytheincreaseinalgorithmcomplexity.Byusingthiscompressedautomata,wecancompressdictionarieswithanaveragecompressionratioof1:34forEnglishdictionariesand1:58forrandombinarypatterns. WenowdescribetheoptimizationsweemployedtomapourcompressedACalgorithmtoCellarchitectureandtheirimpact.ResultswereobtainedwiththeIBMCellSDK3.0onIBMQS22blades.Table 4-2 showstheimpactoftheoptimizationstepsontheperformanceandqualityofcode.WestartedfromanavecompressedACimplementationandweappliedbranchhinting,branchreplacementwithconditionalexpressions,verticalunrolling,datastructurerealignment,branchremoval,arithmeticstrengthreductionandhorizontalunrolling.Theaggregateeffectoftheseoptimizationsistoincreasethethroughput(byreducingthenumberofcyclesabsorbedpercharacter),reducingthecyclesperinstruction(CPI),reducingstallsandincreasingthedualissuerate(i.e.clockcyclesinwhichbothpipelineinanSPEissueanewinstruction). ThesetechniqueshelptodecreasetheCPI,thebranchstallcyclesrate,thedependencystallcycles.Theyalsodecreasethesingleinstructionissuerateand 56

PAGE 57

Table 4-2. The impact of the optimization steps on the performance of our compressed AC NFA algorithm when evaluated in the four application scenarios presented in Section 4.3. The packing factor for compressed-path nodes is 4.

Columns: Thr. = typical throughput (Gbps); Cyc/ch = cycles per character (1 SPE); CPI; Inst/ch = instructions per character; Regs = used registers; NOP = NOP rate; Branch = branch stall rate; Dep. = dependency stall rate; Single = single issue rate; Dual = dual issue rate; Speedup.

Optimization step                               Thr.    Cyc/ch  CPI   Inst/ch  Regs  NOP   Branch  Dep.   Single  Dual   Speedup

Scenario A: Full Text Search
(0) Unoptimized PPE baseline implementation     0.082   -       -     -        -     -     -       -      -       -      1.0
(1) Naive implementation on 8 SPEs              1.440   142.2   1.42  99.9     81    2.3%  10.4%   27.9%  47.3%   11.5%  17.1
(2) 1 Engine, branch hints, conditional expr.   1.518   134.9   1.46  92.1     82    1.5%  23.7%   25.2%  38.0%   11.0%  18.5
(3) 4 Engines, loops unrolling, alignment       1.616   126.8   1.46  86.7     92    1.9%  19.5%   29.0%  37.7%   11.7%  19.7
(4) 1 Engine, branch removal                    1.768   115.8   0.94  122.7    99    2.3%  2.7%    26.1%  44.9%   24.0%  21.5
(5) 1 Engine, cheaper pointer arithmetics       1.771   115.6   0.97  118.9    99    1.8%  1.8%    26.0%  46.3%   23.4%  21.6
(6) 4 Engines, horizontal unrolling             2.058   99.5    0.83  120.3    125   1.7%  3.2%    17.4%  43.8%   33.7%  25.1

Scenario B: Network Content Monitoring
(0) Unoptimized PPE baseline implementation     0.082   -       -     -        -     -     -       -      -       -      1.0
(1) Naive implementation on 8 SPEs              0.655   312.8   1.23  307.2    84    1.8%  15.0%   20.7%  52.0%   10.0%  8.0
(2) 1 Engine, branch hints, conditional expr.   0.882   232.3   1.18  231.2    83    2.4%  8.6%    27.6%  48.7%   12.6%  10.8
(3) 4 Engines, loops unrolling, alignment       0.992   206.4   1.16  225.18   86    2.7%  20.0%   20.0%  40.9%   16.3%  12.1
(4) 1 Engine, branch removal                    1.018   201.1   1.00  163.8    128   1.8%  1.0%    27.8%  48.5%   20.7%  12.4
(5) 1 Engine, cheaper pointer arithmetics       1.464   139.9   0.92  120.22   128   2.3%  1.4%    23.2%  44.6%   28.4%  17.9
(6) 4 Engines, horizontal unrolling             2.071   98.9    0.92  107.04   128   1.9%  2.4%    22.6%  44.9%   27.9%  25.3

Scenario C: Network Intrusion Detection
(0) Unoptimized PPE baseline implementation     0.082   -       -     -        -     -     -       -      -       -      1.0
(1) Naive implementation on 8 SPEs              0.512   388.0   1.43  387.10   84    2.5%  15.4%   27.7%  49.1%   5.3%   6.2
(2) 1 Engine, branch hints, conditional expr.   0.576   355.3   1.14  354.41   86    2.9%  14.1%   21.8%  43.4%   17.8%  7.0
(3) 4 Engines, loops unrolling, alignment       0.636   321.9   1.28  343.40   83    1.7%  16.2%   25.3%  44.8%   11.6%  7.8
(4) 1 Engine, branch removal                    0.650   315.3   1.00  281.75   128   1.8%  1.2%    27.6%  48.5%   20.8%  7.9
(5) 1 Engine, cheaper pointer arithmetics       0.801   255.7   0.94  199.77   128   2.2%  3.4%    22.9%  42.9%   28.4%  9.8
(6) 4 Engines, horizontal unrolling             1.318   155.3   0.92  165.73   128   1.9%  2.5%    22.6%  44.7%   28.0%  16.1

Scenario D: Anti-Virus Scanning
(0) Unoptimized PPE baseline implementation     0.092   -       -     -        -     -     -       -      -       -      1.0
(1) Naive implementation on 8 SPEs              0.451   453.6   1.30  441.3    84    2.1%  12.8%   26.0%  46.5%   12.2%  4.9
(2) 1 Engine, branch hints, conditional expr.   0.560   365.4   1.16  314.57   86    2.8%  14.8%   22.0%  42.7%   17.4%  6.1
(3) 4 Engines, loops unrolling, alignment       0.703   291.5   1.28  227.19   83    1.7%  16.4%   25.4%  44.5%   11.7%  7.6
(4) 1 Engine, branch removal                    1.588   129.0   1.06  121.14   128   1.5%  7.6%    24.7%  45.4%   19.9%  17.3
(5) 1 Engine, cheaper pointer arithmetics       1.694   120.9   0.96  126.50   128   2.3%  1.2%    23.1%  44.9%   28.3%  18.4
(6) 4 Engines, horizontal unrolling             2.204   92.9    0.92  100.65   128   1.7%  7.9%    20.6%  41.8%   27.3%  24.0

PAGE 58

increase the dual instruction issue rate. Overall, the optimization effort results in a 16 to 25 times throughput speedup against the unoptimized PPE baseline implementation.

4.2.1 Step (2): Branch Replacement and Hinting

Whenever possible, we restructure the control flow so as to replace if statements with conditional expressions. We inspect the assembly output to make sure that the compiler renders conditional expressions with select-bits instructions rather than branches.

One major if statement in the compressed AC NFA kernel does not benefit from this strategy: the one that branches depending on whether the node type is bitmap or path-compressed. The two branches are too different to reduce to conditional expressions. We reduce the misprediction penalty associated with this branch by hinting, marking the bitmap case as the more likely one, as suggested by our profiling on realistic data.

4.2.2 Step (3): Loop Unrolling, Data Alignment

We apply unrolling to a few relevant bounded innermost loops, and we apply data structure alignment. Our algorithm consists of two major parts: a compute part and a memory access part. Since the compressed AC is too large to fit entirely in the SPEs' local stores, we store it in main memory.

We safely ignore the impact of memory accesses required to load input text from main memory to local store and write back matches in the opposite direction. In fact, we implement both transfers in a double-buffered way, overlapping computation and data transfer in time. The pseudocode below shows the major part of the vertical unrolling method in the algorithm.

    while () {
        // Automaton 1:
        wait DMA
        update current node
        if (type == BITMAP)
            Process BITMAP NODE

PAGE 59

        else // type == PATHCOMPRESSED
            Process PATHCOMPRESSED NODE
        DMA transfer request
        // Automaton 2:
        wait DMA
        update current node
        if (type == BITMAP)
            Process BITMAP NODE
        else // type == PATHCOMPRESSED
            Process PATHCOMPRESSED NODE
        DMA transfer request
        ...
    }

When a single instance of an AC NFA runs, it computes its next-iteration node pointer and then fetches this node via a DMA transfer from main memory. DMA transfers have a round-trip time of hundreds of clock cycles. To utilize these cycles, we run multiple concurrent automata, each checking matches in different segments of the input, unrolling their code together vertically. Multiple automata can pipeline memory accesses, overlapping the DMA transfer delays. Figure 4-4 shows how two automata overlap their computation part with their DMA transfer wait time. Figure 4-5 illustrates how different vertical unrolling factors affect the performance. We choose a vertical unrolling factor of 8 in our implementation as it gives the minimal DMA transfer delay.

We also performed an experiment to find out the best DMA transfer size to make full use of the bandwidth and minimize the DMA transfer delay. The pseudocode below shows how to measure the DMA transfer time with different DMA transfer sizes.

    i = 0

PAGE 60

    DMA transfer request (transfer_size)
    record time1
    while(ib)?a:b;
    <==>
    select = spu_cmpgt(a, b);
    c_plus_1 = spu_add(c, 1);
    a_plus_b = spu_add(a, b);
    c = spu_sel(c, c_plus_1, select);
    d = spu_sel(a_plus_b, d, select);
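The select idiom in the SPU code above can be mimicked in portable scalar C. The following is a minimal sketch, not the SPE kernel: the helpers `cmpgt_mask`, `sel32` and `select_update` are hypothetical names, and the scalar meaning assumed here (increment `c` when `a > b`, otherwise set `d = a + b`) is read off the operand order of `spu_sel`, which takes bits from its second operand wherever the mask is 1.

```c
#include <assert.h>
#include <stdint.h>

/* All-ones when a > b, all-zeros otherwise (scalar stand-in for spu_cmpgt). */
static uint32_t cmpgt_mask(int32_t a, int32_t b) {
    return (uint32_t)0 - (uint32_t)(a > b);
}

/* Bitwise select with the operand order of spu_sel:
   bits of x where mask is 0, bits of y where mask is 1. */
static uint32_t sel32(uint32_t x, uint32_t y, uint32_t mask) {
    return (x & ~mask) | (y & mask);
}

/* Branch-free form of: if (a > b) *c += 1; else *d = a + b;
   Both candidate results are computed, then one is kept per variable. */
void select_update(int32_t a, int32_t b, int32_t *c, int32_t *d) {
    assert(c != 0 && d != 0);
    uint32_t select   = cmpgt_mask(a, b);
    uint32_t c_plus_1 = (uint32_t)*c + 1u;
    uint32_t a_plus_b = (uint32_t)a + (uint32_t)b;
    *c = (int32_t)sel32((uint32_t)*c, c_plus_1, select);
    *d = (int32_t)sel32(a_plus_b, (uint32_t)*d, select);
}
```

The cost of this form is that both results are always computed; it pays off only when the avoided misprediction penalty is larger than the extra arithmetic.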

PAGE 61

The basic idea is to compute the two possible results for both branches and select one of the results using a select-bits instruction. For example, the transformation reduces branch miss stalls from 19.5% to 2.7% of the cycle count for the full-text search scenario.

4.2.4 Step (5): Strength Reduction

We manually apply operator strength reduction (i.e., replacing multiplications and divisions with shifts and additions) where the compiler did not. In addition, we use cheap pointer arithmetic to load four adjacent integer elements into a 128-bit vector. This reduces the load overhead. For example, manual strength reduction reduces the overall clock cycles by 3% for the full-text search scenario.

4.2.5 Step (6): Horizontal Unrolling

After Steps 1 through 5, dependency stalls occupy about 25% of the computation time. Within the NFA compute code, one branch handles bitmap nodes, while the other one handles path-compressed nodes. In the code of both cases, there are frequent read-after-write data dependencies.

To reduce the dependency stalls, we interleave the codes of multiple, distinct automata; we call this operation horizontal unrolling. These multiple automata process independent input streams against the same dictionary. They have distinct states and input/output buffers, and they require multiple, distinct DMA operations to perform the associated streamed double buffering. The buffer size is 4096 bytes in our experiments.

The horizontal unroll factor must be chosen carefully to reflect the trade-off between the decreased dependency stalls and the potentially increased branch stalls. Our experiments show that unrolling 2 NFAs achieves the highest performance improvement, 10%. For example, for the full-text search scenario, dependency stalls decrease from 26.0% to 17.4%, while branch stalls increase from 1.8% to 3.2%.
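The interleaving behind horizontal unrolling can be illustrated with a small portable C sketch. It is an illustration under stated assumptions rather than the SPE kernel: the three-state table `delta` is a toy DFA for the single pattern "ab", and `build_toy_dfa`, `count_one` and `count_two_interleaved` are hypothetical names. The point is that the two state updates in the loop body do not depend on each other, so a dual-issue pipeline can overlap them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy DFA for the pattern "ab": state 1 = just saw 'a', state 2 = just saw "ab". */
static uint8_t delta[3][256];

void build_toy_dfa(void) {
    for (int s = 0; s < 3; s++)
        for (int ch = 0; ch < 256; ch++)
            delta[s][ch] = (ch == 'a') ? 1 : 0;
    delta[1]['b'] = 2;
}

/* One automaton: every transition depends on the previous one
   (a read-after-write chain on s). */
size_t count_one(const uint8_t *in, size_t n) {
    uint8_t s = 0;
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        s = delta[s][in[i]];
        hits += (s == 2);
    }
    return hits;
}

/* Two automata over independent streams, interleaved in one loop body.
   The s0 chain and the s1 chain are independent of each other. */
void count_two_interleaved(const uint8_t *in0, const uint8_t *in1, size_t n,
                           size_t *hits0, size_t *hits1) {
    assert(hits0 != NULL && hits1 != NULL);
    uint8_t s0 = 0, s1 = 0;
    size_t c0 = 0, c1 = 0;
    for (size_t i = 0; i < n; i++) {
        s0 = delta[s0][in0[i]];
        s1 = delta[s1][in1[i]];
        c0 += (s0 == 2);
        c1 += (s1 == 2);
    }
    *hits0 = c0;
    *hits1 = c1;
}
```

Both automata share the same transition table, mirroring the text's setup of independent input streams searched against one dictionary.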

PAGE 62

Table 4-3. Aggregate throughput on an IBM Cell chip with 8 SPUs (Gbps).

Scenario                             Throughput (Gbps)
Full-text search                     1.14
Network content monitoring           1.43
Network intrusion detection          0.90
Anti-virus scanning                  1.25
Full-text search (100% match)        1.69
Anti-virus scanning (100% match)     2.35

4.3 Experimental Results

In this section, we benchmark our software design in a set of representative scenarios.

We use two dictionaries to generate compressed AC automata: Dictionary 1 contains the 20,000 most common words in the English language, while Dictionary 2 contains 8000 random binary patterns. We benchmark the algorithm on three input files: the King James Bible, a tcpdump stream of captured network traffic and a randomly generated binary file.

Figure 4-9 and Table 4-3 show the aggregate throughput of our algorithm on a dual-chip blade (16 SPEs) in the six scenarios described below. Scenario A (Dictionary 1 against the Bible) is representative of full-text search systems. Scenario B (Dictionary 1 against the network dump) is representative of content monitoring systems. Scenario C (Dictionary 2 against the network dump) is representative of Network Intrusion Detection Systems (NIDSs). Scenario D (Dictionary 2 against binary patterns) is representative of anti-virus scanners.

The last two scenarios in the figure are representative of systems (with Dictionary 1 and 2, respectively) under a malicious, content-based attack. In fact, a system whose performance degrades dramatically when the input exhibits frequent matches with the dictionary is subject to content-based attacks. An attacker that gains partial or full knowledge of the dictionary could provide the system with traffic specifically designed to overflow it. In scenarios five and six we provide our system with inputs entirely

PAGE 63

composed of words from the dictionary. Our experiments show a desirable property of our algorithm: its performance actually increases when hits are frequent.

The reason is that our NFA spends a similar amount of time to process a bitmap node or a path-compressed node, so a mismatch takes an amount of time comparable to the match of an entire path. Consequently, the cycles spent per input character decrease when more input characters match the dictionary. Path-compressed nodes pack as many as 4 or 8 original AC nodes, and allow a multi-character match at one time. Figure 4-10 shows how the percentage of matched patterns affects the aggregate throughput on the IBM Cell blade with 16 SPUs for the virus scanning scenario. As the percentage of matched patterns increases, the aggregate throughput increases as well.

We explore the trade-offs between the AC compression ratio and the throughput in a Pareto space. We choose the English dictionary as the compression object and choose packing factors of 4, 8 and 12 for path-compressed nodes. As shown in Figure 4-11, the compression ratio decreases as the packing factor increases. However, the throughput is better with a packing factor of 8 than with one of 4. The reason is that the input data is English text that has a 100% match against the dictionary, so matching 8 nodes of a path-compressed node at one time gives better performance than matching 4. A packing factor of 12, however, shows some throughput degradation compared to a packing factor of 8. One conclusion we draw from this Pareto chart is that the compression ratio affects the throughput: in order to get a better compression ratio, we have to sacrifice throughput.

4.4 Related Work

Snort [50] and Bro [35, 39] are two of the more popular public domain Network Intrusion Detection Systems (NIDSs). The current implementation of Snort uses the optimized version of the AC automaton [1]. Snort also uses SFK search and the Wu-Manber [66] multi-string search algorithm.

PAGE 64

To reduce the memory requirement of the AC automaton, Tuck et al. [57] have proposed starting with the non-deterministic AC automaton and using bitmaps and path compression.

In the network security domain, bitmaps have also been used in the tree bitmap scheme [16] and in shape-shifting and hybrid shape-shifting tries(1) [52, 62]. Path compression has been used in several IP address lookup structures including tree bitmap [16] and hybrid shape-shifting tries [62]. These compression methods reduce the memory required to about 1/30/50 of that required by an AC DFA or a Wu-Manber structure, and to slightly less than what is required by SFK search [57]. However, lookups on path-compressed data require more computation at search time, e.g., more additions at each node to compute popcounts, thus requiring hardware support to achieve competitive performance.

Zha and Sahni [68] have suggested a compressed AC trie inspired by the work of Tuck et al. [57]: they use bitmaps with multiple levels of summaries, as well as an aggressive path compaction. Zha and Sahni's technique requires 90% fewer additions to compute popcounts than that of Tuck et al. [57], and occupies 24% less memory. Scarpazza et al. [46] propose a memory-based implementation of the deterministic AC algorithm that is capable of supporting dictionaries as large as the available main memory, and achieves a search performance of 1.5.2 Gbps per Cell chip. Scarpazza et al. [47] also propose regular expression matching against small rule sets (which suits the needs of search engine tokenizers) delivering 8-14 Gbps per Cell chip.

(1) A trie is a tree-based data structure frequently used to represent dictionaries and associative arrays that have strings as keys.
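The popcount step that bitmap-based nodes rely on can be sketched in portable C as follows. This is a generic illustration and not the layout used by any of the cited schemes: the `bitmap_node` structure, its `first_child` field and the helper names are assumptions made for the example. A 256-bit bitmap records which byte values have a child, and the child's position is the number of set bits below the queried byte.

```c
#include <assert.h>
#include <stdint.h>

/* Portable population count of one 32-bit word. */
static unsigned popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (x * 0x01010101u) >> 24;
}

/* Hypothetical node layout: bit c of the bitmap is set when a child exists
   for byte value c; children are stored contiguously in increasing c order,
   starting at index first_child. */
typedef struct {
    uint32_t bitmap[8];
    uint32_t first_child;
} bitmap_node;

void set_child_bit(bitmap_node *n, uint8_t c) {
    assert(n != 0);
    n->bitmap[c >> 5] |= 1u << (c & 31u);
}

/* Index of the child for byte c, or -1 when the node has no such child.
   Each word before the one holding c costs one popcount and one addition. */
long child_index(const bitmap_node *n, uint8_t c) {
    unsigned word = c >> 5, bit = c & 31u;
    if (((n->bitmap[word] >> bit) & 1u) == 0)
        return -1;
    unsigned rank = 0;
    for (unsigned w = 0; w < word; w++)
        rank += popcount32(n->bitmap[w]);
    rank += popcount32(n->bitmap[word] & ((1u << bit) - 1u));
    return (long)n->first_child + (long)rank;
}
```

The loop over preceding words is the source of the per-node additions mentioned above; summary counts stored alongside the bitmap would shorten it.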

PAGE 65

Figure 4-1. Chip layout of the Cell/Broadband Engine Architecture.

Figure 4-2. Bitmap node layout.

Figure 4-3. Path-compressed node layout with packing factor equal to four.

PAGE 66

Figure 4-4. How two automata overlap the computation part with their DMA transfer wait time.

Figure 4-5. The number of cycles processed per character with different vertical unrolling factors (full-text search scenario).

PAGE 67

Figure 4-6. How the throughput grows with each optimization step (full-text search scenario).

Figure 4-7. Utilization of clock cycles following each optimization step (full-text search scenario).

PAGE 68

Figure 4-8. DMA inter-arrival transfer delay from main memory to local store when 8 SPEs are used concurrently.

Figure 4-9. Aggregate throughput of our algorithm on an IBM QS22 blade (16 SPEs).

PAGE 69

Figure 4-10. How the percentage of matched patterns affects the aggregate throughput. The input here is English input data, with the English dictionary.

Figure 4-11. The trade-off between the compression ratio and the throughput in a Pareto space. The input here is English input data, with the English dictionary.

PAGE 70

CHAPTER 5
FAST IN-PLACE FILE CARVING FOR DIGITAL FORENSICS

In this chapter, we focus on a popular open source file recovery tool, Scalpel, which performs file carving using the Boyer-Moore string search algorithm to locate headers and footers in a disk image. We show that the time required for file carving may be reduced significantly by employing multi-pattern search algorithms such as the multipattern Boyer-Moore and Aho-Corasick algorithms as well as asynchronous disk reads and multithreading as typically supported on multicore commodity PCs. Using these methods, we are able to do in-place file carving in essentially the time it takes to read the disk whose files are being carved. Since, using our methods, the limiting factor for performance is the disk read time, there is no advantage to using accelerators such as GPUs as has been proposed by others. To further speed in-place file carving, we would need a mechanism to read the disk faster.

5.1 In-place Carving Using Scalpel 1.6

The normal way to retrieve a file from a disk is to search the disk directory, obtain the file's metadata (e.g., location on disk) from the directory, and then use this information to fetch the file from the disk. Often, even when a file has been deleted, it is possible to retrieve the file using this method because, typically, when a file is deleted, a delete flag is set in the disk directory and the remainder of the directory metadata associated with the deleted file is left unaltered. Of course, the creation of new files or changes to remaining files following a delete may make it impossible to retrieve the deleted file using the disk directory, as the new files' metadata may overwrite the deleted file's metadata in the directory and changes to the remaining files may use the disk blocks previously used by the deleted file.

In file carving, we attempt to recover files from a target disk whose directory entries have been corrupted. In the extreme case the entire directory is corrupted and all files on the disk are to be recovered using no metadata. The recovery of disk files in the

PAGE 71

Table 5-1. Example headers and footers in Scalpel's configuration file

File type   Header                      Footer
gif         \x47\x49\x46\x38\x37\x61    \x00\x3b
gif         \x47\x49\x46\x38\x39\x61    \x00\x3b
jpg         \xff\xd8\xff\xe0\x00\x10    \xff\xd9
htm
txt         BEGIN\040PGP
zip         PK\x03\x04                  \x3c\xac

absence of directory metadata is done using header and footer information for the file types we wish to recover. Table 5-1 gives the header and footer for a few popular file types. This information was obtained from the Scalpel configuration file [40]. \x[0-f][0-f] denotes a hexadecimal value while \[0-3][0-7][0-7] is an octal value. So, for example, \x4F\123\I\sCCI decodes to OSI CCI. In file carving, we view a disk as being serial storage (the serialization being done by sequentializing disk blocks) and extract all disk segments that lie between a header and its corresponding footer as being candidates for the files to be recovered. For example, a disk segment that begins with the string is carved into an htm file.

Since a file may not actually reside in a consecutive sequence of disk blocks, the recovery process employed in file carving is clearly prone to error. Nonetheless, file carving recovers disk segments delimited by a header and its corresponding footer that potentially represent a file. These recovered segments may be analyzed later using some other process to eliminate false positives. Notice that some file types may have no associated footer (e.g., txt files have a header specified in Table 5-1 but no footer). Additionally, even when a file type has a specified header and footer, one of these may be absent from the disk because of disk corruption (for example). So, additional information (such as the maximum length of the file to be carved for each file type) is used in the file carving process. See [34] for a review of file carving methods.

Scalpel [40] is an improved version of the file carver Foremost [19]. At present, Scalpel is the most popular open source file carver available. Scalpel carves files in two phases. In the first phase, Scalpel searches the disk image to determine the location

PAGE 72

Table 5-2. Examples of in-place file carving output

File name         Start      Truncated   Length    Image
gif/0000001.gif   27465839   NO          2746      /tmp/linux-image
gif/0000006.gif   45496392   NO          4234      /tmp/linux-image
jpg/0000047.jpg   55645747   NO          675       /tmp/linux-image
htm/0000013.htm   23123244   NO          823       /tmp/linux-image
txt/0000021.txt   34235233   NO          56        /tmp/linux-image
zip/0000008.zip   76452352   NO          1423646   /tmp/linux-image

of headers and footers. This phase results in a database with entries such as those shown in Table 5-2. This database contains the metadata (i.e., start location of file, file length, file type, etc.) for the files to be carved. Since the names of the files cannot be recovered (as these are typically stored only in the disk directory, which is presumed to be unavailable), synthetic names are assigned to the carved files in the generated metadata database.

The second phase of Scalpel uses the metadata database created in the first phase to carve files from the corrupted disk and write these carved files to a new disk. Even with maximum file length limits placed on the size of files to be recovered, a very large amount of disk space may be needed to store the carved files. For example, Richard et al. [41] report a recovery case in which carving a wide range of file types for a modest 8GB target yielded over 1.1 million files, with a total size exceeding the capacity of one of our 250GB drives.

As observed by Richard et al. [41], because of the very large number of false positives generated by the file carving process, file carving can be very expensive both in terms of the time taken and the amount of disk space required to store the carved files. To overcome these deficiencies of file carving, Richard et al. [41] propose in-place file carving, which essentially generates only the metadata database of Table 5-2. The metadata database can be examined by an expert and many of the false positives eliminated. The remaining entries in the metadata database may be examined further to recover only desired files. Since the run time of a file carver is typically dominated

PAGE 73

by the time for phase 2, in-place file carvers take much less time than do file carvers. Additionally, the size of even a 1 million entry metadata database is less than 60MB [41]. So, in-place carving requires less disk space as well.

Although in-place file carving is considerably faster than file carving, it still takes a large amount of time. For example, in-place file carving of a 16GB flash drive with a set of 48 rules (header and footer combinations) using the first phase of Scalpel 1.6 takes more than 30 minutes on an AMD Athlon PC equipped with a 2.6GHz Core2Duo processor and 2GB RAM. Marziale et al. [29] have proposed the use of massive threads as supported by a GPU to improve the performance of an in-place file carver. In this paper, we demonstrate that hardware accelerators such as GPUs are of little benefit when doing in-place file carving. Specifically, by replacing the search algorithm used in Scalpel 1.6 with a multipattern search algorithm such as the multipattern Boyer-Moore [7, 30, 66] and Aho-Corasick [1] algorithms and doing disk reads asynchronously, the overall time for in-place file carving using Scalpel 1.6 becomes comparable to the time taken to just read the target disk that is being carved. So, the limiting factor is disk I/O and not CPU processing. Further reduction in the time spent searching the target disk for footers and headers, as possibly attainable using a GPU, cannot reduce the overall time to below the time needed to just read the target disk. To get further improvement in performance, we need improvement in disk I/O.

There are essentially two tasks associated with in-place carving: (a) identify the location of specified headers and footers in the target disk and (b) pair headers and corresponding footers while respecting the additional constraints (e.g., maximum file length) specified by the user. The time required for (b) is insignificant compared to that required for (a). So, we focus on (a).

Scalpel 1.6 locates headers and footers by searching the target disk using a buffer of size 10MB. Figure 5-1 gives the high-level control flow of Scalpel 1.6. A 10MB buffer is filled from disk and then searched for headers and footers. This process is repeated

PAGE 74

Figure 5-1. Control flow of Scalpel 1.6 (a)

Figure 5-2. Control flow of Scalpel 1.6 (b)

until the entire disk has been searched. When the search moves from one buffer to the next, care is exercised to ensure that headers/footers that span a buffer boundary are detected. Searching within a buffer is done using the algorithm of Figure 5-2. In each buffer, we first search for headers. The search for headers is followed by a search for footers. Only non-null footers that are within the maximum carving length of an already found header are searched for.

To search a buffer for an individual header or footer, Scalpel 1.6 uses the Boyer-Moore pattern matching algorithm [8], which was developed to find all occurrences of a pattern P in a string S. This algorithm begins by positioning the first character of P at the first

PAGE 75

character of S. This results in a pairing of the first $|P|$ characters of S with characters of P. The characters in each pair are compared beginning with those in the rightmost pair. If all pairs of characters match, we have found an occurrence of P in S and P is shifted right by 1 character (or by $|P|$ if only non-overlapping matches are to be found). Otherwise, we stop at the rightmost pair (or first pair since we compare right to left) where there is a mismatch and use the bad character function for P to determine how many characters to shift P right before re-examining pairs of characters from P and S for a match. More specifically, the bad character function for P gives the distance from the end of P of the last occurrence of each possible character that may appear in S. So, for example, if the characters of S are drawn from the alphabet $\{a, b, c, d\}$, the bad character function, B, for P = abcabcd has B(a) = 4, B(b) = 3, B(c) = 2, and B(d) = 1. In practice, many of the shifts in the bad character function of a pattern are close to the length, $|P|$, of the pattern P, making the Boyer-Moore algorithm a very fast search algorithm. In fact, when the alphabet size is large, the average run time of the Boyer-Moore algorithm is $O(|S|/|P|)$. Galil [20] has proposed a variation for which the worst-case run time is $O(|S|)$. Horspool [21] proposes a simplification to the Boyer-Moore algorithm whose performance is about the same as that of the Boyer-Moore algorithm.

Even though the Boyer-Moore algorithm is a very fast way to find all occurrences of a pattern in a string, using it in our in-place carving application isn't optimal because we must use the algorithm once for each pattern (header/footer) to be searched. So, the time to search for all patterns grows linearly in the number of patterns. Locating headers and footers using the Boyer-Moore algorithm, as is done in Scalpel 1.6, takes $O(mn)$ time where m is the number of file types being searched and n is the size of the target disk. Consequently, the run time for in-place carving grows linearly with both the number of file types and the size of the target disk. Doubling either the number of file types or the disk size will double the expected run time; doubling both will quadruple the

PAGE 76

run time. However, when a multipattern search algorithm is used, the run time is $O(n)$ (both expected and worst case). That is, the time is independent of the number of file types. Whether we are searching for 20 file types or 40, the time to find the locations of all headers and footers is the same!

5.2 Multipattern Boyer-Moore Algorithm

Several multipattern extensions to the Boyer-Moore search algorithm have been proposed [5, 7, 30, 66]. All of these multipattern search algorithms extend the basic bad character function employed by the Boyer-Moore algorithm to a bad character function for a set of patterns. This is done by combining the bad character functions for the individual patterns to be searched into a single bad character function for the entire set of patterns. The combined bad character function B for a set of p patterns has

$$B(c) = \min\{B_i(c),\ 1 \le i \le p\}$$

for each character c in the alphabet. Here $B_i$ is the bad character function for the ith pattern. The Set-wise Boyer-Moore algorithm of [30] performs multipattern matching using this combined bad character function. The multipattern search algorithms of [5, 7, 66] employ additional techniques to speed the search further. The average run time of the algorithms of [5, 7, 66] is $O(|S|/minL)$, where minL is the length of the shortest pattern. Baeza and Gonnet [6] extend multipattern matching to allow for don't cares and complements in patterns. This extension isn't required for our in-place file carving application.

5.3 Multicore Searching

Contemporary commodity PCs have either a dual core or quad core processor. We may exploit the availability of more than one core to speed the search for headers and footers. This is done by creating as many threads as the number of cores (experiments indicate that there is no performance gain when we use more threads than the number of cores). Each thread searches a portion of the string S. So, if the number of threads is

PAGE 77

Figure 5-3. Control flow for 2-threaded search

t, each thread searches a substring of size $|S|/t$ plus the length of the longest pattern minus 1. Figure 5-3 shows the control flow when two threads are used to do the search.

5.4 Asynchronous Read

Scalpel 1.6 fills its search buffer using synchronous (or blocking) reads of the target disk. In a synchronous read, the CPU is unable to do any computing while the read is in progress. Contemporary PCs, however, permit asynchronous (or non-blocking) reads of disk. When an asynchronous read is done, the CPU is able to perform computations that do not involve the data being read from disk while the disk read is in progress. When asynchronous reads are used, we need two buffers, active and inactive. In the steady state, our computer is doing an asynchronous read into the inactive buffer while simultaneously searching the active buffer. When the search of the active buffer completes, we wait for the ongoing asynchronous read to complete, swap the roles of the active and inactive buffers, initiate a new asynchronous read into the current inactive buffer, and proceed to search the current active buffer. This is stated more formally in Figure 5-4.

Let $T_{read}$ be the time needed to read the target disk and let $T_{search}$ be the time needed to search for headers and footers (exclusive of the time to read from disk). When synchronous reads are used as in Figures 5-1 and 5-2, the total time for in-place

PAGE 78

Algorithm Asynchronous
begin
    read active buffer
    repeat
        if there is more input
            asynchronous read inactive buffer
        search active buffer
        wait for asynchronous read (if any) to complete
        swap the roles of the 2 buffers
    until done
end

Figure 5-4. In-place carving using asynchronous reads

carving is approximately $T_{read} + T_{search}$ (note that the time required for task (b) of in-place carving is relatively small). When asynchronous reads are used, all but the first buffer is read concurrently with the search of another buffer. So, the time for each iteration of the repeat-until loop is the larger of the time to read a buffer and that to search the buffer. When the buffer read time is consistently larger than the buffer search time or when the buffer search time is consistently larger than the buffer read time, the total in-place carving time using asynchronous reads is approximately $\max\{T_{read}, T_{search}\}$. Therefore, using asynchronous reads rather than synchronous reads has the potential to reduce run time by as much as 50%. The search algorithms of Sections 5.1 and 5.2, other than the Aho-Corasick algorithm, employ heuristics whose effectiveness depends on both the rule set and the actual contents of the buffer being searched. As a result, it is entirely possible that when we search one buffer, the read time exceeds the search time while when another buffer is searched, the search time exceeds the read time. So, when these search methods are used, it is possible that the in-place carving time is somewhat more than $\max\{T_{read}, T_{search}\}$.

5.5 Multicore In-place Carving

In Section 5.3 we saw how to use multiple cores to speed the search for headers and footers. Task (a) of in-place carving, however, needs to both read data from disk and search the data that is read. There are several ways in which we can utilize the

PAGE 79

Figure 5-5. Control flow for single core read and single core search (SRSS)

available cores to perform both these tasks. The first is to use synchronous reads followed by multicore searching as described in Section 5.3. We refer to this strategy as SRMS (synchronous read multicore search). Extension to a larger number of cores is straightforward.

The second possibility is to use one thread to read a buffer using a synchronous read and the second to do the search (Figure 5-5). We refer to this strategy as SRSS (single core read and single core search).

A third possibility is to use 4 buffers and have each thread run the asynchronous read algorithm of Figure 5-4 as shown in Figures 5-6 and 5-7. In Figure 5-6 the threads are synchronized for every pair of buffers searched while in Figure 5-7, the synchronization is done only when the entire disk has been searched. So, using the strategy of Figure 5-6, each thread processes the same number of buffers (except when the number of buffers of data is odd). When the time to fill a buffer from disk consistently exceeds the time to search that buffer, the strategy of Figure 5-7 also processes the same number of buffers per thread. However, when the buffer fill time is less than the search time and there is sufficient variability in the time to search a buffer, it is possible, using the strategy of Figure 5-7, for one thread to process many more buffers than

PAGE 80

Figure 5-6. Control flow for multicore asynchronous read and search (MARS1)

Figure 5-7. Another control flow for multicore asynchronous read and search (MARS2)

processed by the other thread. In this case, the strategy of Figure 5-7 will outperform that of Figure 5-6. For our application, the time to fill a buffer exceeds the time to search it except when the number of rules is large (more than 30) and the search is done using an algorithm such as Boyer-Moore (as is the case in Scalpel 1.6), which is not designed for multipattern search. Hence, we expect both strategies to have similar performance. We refer to these strategies as MARS1 (multicore asynchronous read and search) and MARS2, respectively.
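The difference between the two synchronization policies can be made concrete with a small timing model in C. This is a sketch under simplifying assumptions that are not in the text: each buffer has a single processing time `t[i]` (the search-bound case, with read time ignored), there are exactly two threads, and in the free-running policy the next buffer goes to whichever thread becomes idle first. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* MARS1-like policy: the two threads synchronize after every pair of
   buffers, so each round costs as much as the slower of the two buffers. */
double lockstep_time(const double *t, size_t n) {
    assert(t != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i += 2) {
        double a = t[i];
        double b = (i + 1 < n) ? t[i + 1] : 0.0;
        total += (a > b) ? a : b;
    }
    return total;
}

/* MARS2-like policy: no per-pair barrier. The next buffer is taken by the
   thread that is free first; the threads meet only at the very end. */
double free_running_time(const double *t, size_t n) {
    assert(t != NULL || n == 0);
    double busy0 = 0.0, busy1 = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (busy0 <= busy1)
            busy0 += t[i];
        else
            busy1 += t[i];
    }
    return (busy0 > busy1) ? busy0 : busy1;
}
```

In this model, per-buffer times of 4, 1, 1, 1, 1, 1 give a lockstep time of 6 and a free-running time of 5, while uniform times give the same total for both policies, which matches the expectation stated above that the two strategies behave similarly when buffer times vary little.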

PAGE 81

Table 5-3. In-place carving time by Scalpel 1.6 for a 16GB flash disk

Number of carving rules   6      12      24      36      48
Total time                967s   1069s   1532s   1788s   1905s
Disk read                 833s   833s    833s    833s    833s
Search                    133s   232s    693s    947s    1063s
Other                     1s     4s      6s      8s      9s

5.6 Experimental Results

We evaluated the strategies for in-place carving proposed in this paper using a dual processor, dual core AMD Athlon (2.6GHz Core2Duo processor, 2GB RAM). We started with Scalpel 1.6 and shut off its second phase so that it stopped as soon as the metadata database of carved files was created. All our experiments used pattern/rule sets derived from the 48 rules in the configuration file in [44]. From this rule set we generated rule sets of smaller size by selecting the desired number of rules randomly from this set of 48 rules. We used the following search strategies: Boyer-Moore as used in Scalpel 1.6 (BM); SBM-S (set-wise Boyer-Moore-simple), which uses the combined bad character function given in Section 5.2 and the search algorithm employed in [30]; SBM-C (set-wise Boyer-Moore-complex) [7]; WuM [66]; and Aho-Corasick (AC). Our experiments were designed to first measure the impact of each strategy proposed in the paper. These experiments were done using a 16GB flash drive as our target disk. All times reported in this paper are the average from repeating the experiment five times. A final experiment was conducted by coupling several strategies to obtain a new best-performance Scalpel in-place carving program. This program is called FastScalpel. For this final experiment, we used flash drives and hard disks of varying capacity.

5.6.1 Run Time of Scalpel 1.6

Our first experiment analyzed the run time of in-place carving. Table 5-3 shows the overall time to do an in-place carve of our 16GB flash drive as well as the time spent to read the disk and that spent to search the disk for headers and footers. The time spent on other tasks (this is the difference between the total time and the sum of the read

PAGE 82

and search times) also is shown. As can be seen, the search time increases with the number of rules. However, the increase in search time isn't quite linear in the number of rules because the effectiveness of the bad character function varies from one rule to the next. For small rule sets (approximately 30 or less), the input time (time to read from disk) exceeds the search time while for larger rule sets, the search time exceeds the input time. The time spent on activities other than input and search is very small compared to that spent on search and input for all rule sets. So, to reduce overall time, we need to focus on reducing the time spent reading data from the disk and the time spent searching for headers and footers.

5.6.2 Buffer Size

Scalpel 1.6 spends almost all of its time reading the disk and searching for headers and footers (Table 5-3). The time to read the disk is independent of the size of the processing buffer as this time depends on the disk block size used rather than the number of blocks per buffer. The search time too is relatively insensitive to the buffer size as changing the buffer size affects only the number of times the overhead of processing buffer boundaries is incurred. For large buffer sizes (say 100K and more), this overhead is negligible. Although the time spent on other tasks is relatively small when the buffer size is 10MB (as used in Scalpel 1.6), this time increases as the buffer size is reduced. For example, Scalpel 1.6 refreshes the progress bar following the processing of each buffer load. When the buffer size is reduced from 10MB to 100KB, this refresh is done 100 times as often. The variation in time spent on other activities results in a variation in the run time of Scalpel 1.6 with changing buffer size. Table 5-4 shows the in-place carving time by Scalpel 1.6 with different buffer sizes and 48 carving rules. This variation may be virtually eliminated by altering the code for the other components to (say) refresh the progress bar after every (say) 10MB of data has been processed, thereby eliminating the dependency on buffer size. So, we can get the same performance using a much smaller buffer size.
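The suggested fix, refreshing the progress bar per fixed amount of data rather than per buffer, can be sketched as follows. This is not Scalpel's code: `progress_state`, `progress_account` and the 10MB constant are hypothetical names and values chosen to mirror the paragraph above.

```c
#include <assert.h>
#include <stdint.h>

/* Refresh once per 10MB of processed data, regardless of buffer size. */
#define REFRESH_INTERVAL_BYTES (10u * 1024u * 1024u)

typedef struct {
    uint64_t pending_bytes;   /* bytes processed since the last refresh */
    uint64_t refresh_count;   /* total refreshes triggered so far */
} progress_state;

/* Account for one processed buffer and return how many refreshes it
   triggered. A real carver would redraw the progress bar where the
   counter is incremented. */
unsigned progress_account(progress_state *p, uint64_t buffer_bytes) {
    assert(p != 0);
    unsigned fired = 0;
    p->pending_bytes += buffer_bytes;
    while (p->pending_bytes >= REFRESH_INTERVAL_BYTES) {
        p->pending_bytes -= REFRESH_INTERVAL_BYTES;
        p->refresh_count++;
        fired++;
    }
    return fired;
}
```

With this accounting, 100MB of input triggers the same ten refreshes whether it arrives as ten 10MB buffers or as 1024 buffers of 100KB.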

PAGE 83

Table 5-4. In-place carving time by Scalpel 1.6 with different buffer sizes and 48 carving rules

Buffer size   100KB   1MB     10MB    20MB
Time          2030s   1895s   1905s   1916s

Table 5-5. Search time for a 16GB flash drive

Number of carving rules   6      12     24     36     48
BM                        133s   232s   693s   947s   1063s
SBM-S                     99s    108s   124s   132s   158s
SBM-C                     107s   117s   142s   155s   178s
WuM                       206s   205s   201s   219s   212s
AC                        63s    62s    64s    65s    64s

5.6.3 Multipattern Matching

Table 5-5 shows the time required to search our 16GB flash drive for headers and footers using different search methods. This time does not include the time needed to read from disk to buffer or the time to do other activities (see Table 5-3). Table 5-6 and Figure 5-8 give the speedup achieved by the various multipattern search algorithms relative to the Boyer-Moore search algorithm that is used in Scalpel 1.6. As can be seen, the run time is fairly independent of the number of rules when the Aho-Corasick (AC) multipattern search algorithm is used. Although the theoretical expected run time of the remaining multipattern search algorithms (SBM-S, SBM-C, and WuM) is independent of the number of search patterns, the observed run time shows some increase with the increase in number of patterns. This is because of the variability in the effectiveness of the heuristics employed by these methods and the fact that our experiment is limited to a single rule set for each rule set size. Employing a large number of rule sets for each rule set size and searching over many different disks should result in an average time that does not increase with rule set size. The Aho-Corasick multipattern search algorithm is the clear winner for all rule set sizes. The speedup in search time when this method is used ranges from a low of 2.1 when we have 6 rules to a high of 17 when we have 48 rules.
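The combined bad character function of Section 5.2, on which SBM-S is built, can be sketched in C as follows. The per-pattern table follows the definition given in Section 5.1 (distance from the end of the pattern to the last occurrence of each character). The value used for characters that do not occur in a pattern is not specified in the text; this sketch assumes the pattern length, a conservative choice. Function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { ALPHABET_SIZE = 256 };

/* B(c) for a single pattern: distance from the end of p to the last
   occurrence of c, so the final character has distance 1.
   Assumption of this sketch: absent characters get |p|. */
void bad_char_table(const char *p, size_t table[ALPHABET_SIZE]) {
    size_t m = strlen(p);
    for (int c = 0; c < ALPHABET_SIZE; c++)
        table[c] = m;
    for (size_t i = 0; i < m; i++)
        table[(unsigned char)p[i]] = m - i;   /* later occurrences overwrite earlier ones */
}

/* Combined function for a set of patterns: B(c) = min over i of B_i(c). */
void combined_bad_char(const char *const *patterns, size_t count,
                       size_t combined[ALPHABET_SIZE]) {
    assert(count > 0);
    bad_char_table(patterns[0], combined);
    for (size_t i = 1; i < count; i++) {
        size_t bi[ALPHABET_SIZE];
        bad_char_table(patterns[i], bi);
        for (int c = 0; c < ALPHABET_SIZE; c++)
            if (bi[c] < combined[c])
                combined[c] = bi[c];
    }
}
```

For P = abcabcd this reproduces B(a) = 4, B(b) = 3, B(c) = 2 and B(d) = 1 from Section 5.1. Because the combined value is a minimum, every added pattern can only shorten the shifts, which is one way to see why the observed SBM-S time creeps up with the rule count.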

PAGE 84

Table 5-6. Speedup in search time relative to Boyer-Moore

Number of carving rules   6      12     24      36      48
SBM-S                     1.34   2.15   5.59    7.17    6.73
SBM-C                     1.24   1.98   4.88    6.09    5.97
WuM                       0.64   1.13   3.45    4.32    5.01
AC                        2.11   3.74   10.83   14.57   16.61

Figure 5-8. Multi-pattern search algorithms speedup.

5.6.4 Multicore Searching

Table 5-7 gives the time to search our 16GB flash drive (exclusive of the time to read from the drive to the buffer and exclusive of the time spent on other activities) using 24 rules and the dual core search strategy of Section 5.3. The column labeled unthreaded is the same as that labeled 24 in Table 5-5. Although the search task is easily partitioned into 2 or more threads with little extra work required to ensure that matches that cross partition boundaries are not missed, the observed speedup from using 2 threads on a dual core processor is quite a bit less than 2. This is due to the overhead associated with spawning and synchronizing threads. The impact of this

PAGE 85

overhead is very noticeable when the search time for each thread launch is relatively small, as in the case of AC, and less noticeable when this search time is large, as in the case of BM. In the case of AC, we get virtually no speedup in total search time using a dual core search, while for BM the speedup is 1.8.

Table 5-7. Time to search using dual core strategy with 24 rules

Algorithm    Unthreaded    2 threads    Speedup
BM           693s          380s         1.82
SBM-S        124s          88s          1.41
SBM-C        142s          99s          1.43
WuM          201s          149s         1.35
AC           64s           58s          1.10

Table 5-8. In-place carving time using Algorithm Asynchronous

Number of carving rules    6       12      24      36      48
BM                         843s    855s    968s    966s    1100s
SBM-S                      838s    837s    839s    888s    847s
SBM-C                      832s    843s    837s    829s    847s
WuM                        840s    841s    840s    843s    842s
AC                         832s    834s    828s    833s    828s

5.6.5 Asynchronous Read

Table 5-8 gives the time taken to do an in-place carving of our 16GB disk using Algorithm Asynchronous (Figure 5-4). The measured time is generally quite close to the expected time of $\max\{T_{read}, T_{search}\}$. A notable exception is the time for BM with 24 rules, where the in-place carving time is substantially more than $\max\{833, 693\} = 833$ (see Table 5-3). This discrepancy has to do with variation in the effectiveness of the bad character heuristic used in BM from one buffer to the next, as explained at the end of Section 5.4. Although, using asynchronous reads, we are able to speed up Scalpel 1.6 by a factor of about 1.7 when the number of rules is 48 (from 1905s to 1100s), this is not sufficient to overcome the inherent inefficiency of using the Boyer-Moore search algorithm in this application over using one of the stated multipattern search algorithms.
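The expected time $\max\{T_{read}, T_{search}\}$ used above can be written as a small helper. This is only a sketch: the function names are illustrative assumptions, and the non-overlapped variant is a simplification that ignores the other activities of the carver.

```c
#include <assert.h>

/* Expected carving time when buffer reads overlap with searching:
 * the slower of the two activities determines the total. */
double expected_async_time(double t_read, double t_search)
{
    assert(t_read >= 0.0 && t_search >= 0.0);
    return (t_read > t_search) ? t_read : t_search;
}

/* For comparison only: reads and searches done one after the other
 * with no overlap (assumed simplification, other activities ignored). */
double expected_serial_time(double t_read, double t_search)
{
    assert(t_read >= 0.0 && t_search >= 0.0);
    return t_read + t_search;
}
```

For BM with 24 rules, `expected_async_time(833, 693)` returns 833, the value the measured 968s of Table 5-8 is compared against.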

PAGE 86

Table 5-9. In-place carving time using SRMS

Number of carving rules    6       12      24      36      48
BM                         961s    987s    1217s   1338s   1393s
SBM-S                      942s    944s    953s    958s    944s
SBM-C                      948s    937s    928s    935s    979s
WuM                        978s    977s    975s    987s    1042s
AC                         924s    925s    929s    927s    973s

Table 5-10. In-place carving time using SRSS

Number of carving rules    6       12      24      36      48
BM                         846s    826s    937s    932s    1006s
SBM-S                      849s    850s    849s    844s    881s
SBM-C                      852s    847s    844s    854s    845s
WuM                        843s    837s    870s    843s    833s
AC                         850s    852s    852s    852s    849s

5.6.6 Multicore In-place Carving

Tables 5-9 through 5-11, respectively, give the time taken by the multicore carving strategies SRMS, SRSS, and MARS2 of Section 5.5. When the Boyer-Moore search algorithm is used, a multicore strategy results in some improvement over Algorithm Asynchronous only when we have a large number of rules (in our experiments, 24 or more rules); when the number of rules is small, the total time is dominated by the read time and the overhead of spawning and synchronizing threads. When a multipattern search algorithm is used, no performance improvement results from the use of multiple cores. Although we experimented only with a dual core, this conclusion applies to a large number of cores, GPUs, and other accelerators, as the bottleneck is the read time from disk and not the time spent searching for headers and footers.

5.6.7 Scalpel 1.6 vs. FastScalpel

Based on our preliminary experiments, we modified the first phase of Scalpel 1.6 in the following way:

1. Replace the synchronous buffer reads of Scalpel 1.6 by asynchronous reads.

PAGE 87

2. Replace the Boyer-Moore search algorithm used in Scalpel 1.6 by the Aho-Corasick multipattern search algorithm.

We refer to this modified version as FastScalpel. Although FastScalpel uses the same buffer size (10MB) as used by Scalpel 1.6, we can reduce the buffer size to tens of KBs without impacting performance, provided we modify the code for the other components of Scalpel 1.6 as described in Section 5.6.2. The performance of FastScalpel relative to Scalpel 1.6 was measured using a variety of target disks. Table 5-12 gives the measured in-place carving time as well as the speedup achieved by FastScalpel relative to Scalpel 1.6. Figure 5-9 plots the measured speedup. The 16GB disk used in these experiments is a flash disk, while the 32GB and 75GB disks are hard drives. While speedup increases as we increase the size of the rule set, the speedup is relatively independent of the disk size and type. The speedup ranged from about 1.1 when the rule set size is 6 to about 2.4 when the rule set size is 48. For larger rule sets, we expect even greater speedup. Since the total time taken by FastScalpel is approximately equal to the time to read the disk being carved, further speedup is possible only by reducing the time to read the disk. This would require a higher bandwidth between the disk and buffer.

Table 5-11. In-place carving time using MARS2

Number of carving rules    6       12      24      36      48
BM                         909s    912s    943s    938s    1011s
SBM-S                      907s    907s    908s    908s    909s
SBM-C                      904s    906s    905s    907s    917s
WuM                        906s    906s    907s    908s    908s
AC                         904s    903s    902s    904s    904s

PAGE 88

Table 5-12. In-place carving time and speedup using FastScalpel and Scalpel 1.6

Number of carving rules    6        12       24       36       48
Scalpel 1.6 (16GB)         967s     1069s    1532s    1788s    1905s
FastScalpel (16GB)         832s     834s     828s     833s     828s
Speedup (16GB)             1.16     1.28     1.85     2.15     2.31
Scalpel 1.6 (32GB)         1581s    1737s    2573s    3263s    3386s
FastScalpel (32GB)         1443s    1460s    1448s    1447s    1438s
Speedup (32GB)             1.10     1.19     1.78     2.26     2.35
Scalpel 1.6 (75GB)         3766s    4150s    6348s    7801s    8307s
FastScalpel (75GB)         3376s    3393s    3386s    3375s    3396s
Speedup (75GB)             1.12     1.22     1.87     2.31     2.45

Figure 5-9. Speedup of FastScalpel relative to Scalpel 1.6

PAGE 89

CHAPTER 6
MULTI-PATTERN MATCHING ON MULTICORES AND GPUS

Our focus in this chapter is accelerating the Aho-Corasick and Boyer-Moore multipattern string matching algorithms through the use of a GPU. A GPU operates in traditional master-slave fashion (see [43], for example) in which the GPU is a slave that is attached to a master or host processor under whose direction it operates. Algorithm development for master-slave systems is affected by the location of the input data and where the results are to be left. Generally, four cases arise [63, 65] as below.

1. Slave-to-slave. In this case the inputs and outputs for the algorithm are on the slave memory. This case arises, for example, when an earlier computation produced results that were left in slave memory and these results are the inputs to the algorithm being developed; further, the results from the algorithm being developed are to be used for subsequent computation by the slave.

2. Host-to-host. Here the inputs to the algorithm are on the host and the results are to be left on the host. So, the algorithm needs to account for the time it takes to move the inputs to the slave and the time to bring the results back to the host.

3. Host-to-slave. The inputs are in the host but the results are to be left in the slave.

4. Slave-to-host. The inputs are in the slave and the results are to be left in the host.

In this chapter, we address the first two cases only. In our context, we refer to the first case (slave-to-slave) as GPU-to-GPU.

6.1 The NVIDIA Tesla Architecture

Figure 6-1 gives the architecture of the NVIDIA GT200 Tesla GPU, which is an example of NVIDIA's general purpose parallel computing architecture CUDA (Compute Unified Device Architecture) [13]. This GPU comprises 240 scalar processors (SP) or cores that are organized into 30 streaming multiprocessors (SM), each comprising 8 SPs. Each SM has 16KB of on-chip shared memory, 16384 32-bit registers, and constant and texture cache. Each SM supports up to 1024 active threads. There also is 4GB of global or device memory that is accessible to all 240 SPs. The Tesla, like other

PAGE 90

GPUs, operates as a slave processor to an attached host. In our experimental setup, the host is a 2.8GHz Xeon quad-core processor with 16GB of memory.

Figure 6-1. NVIDIA GT200 Architecture [56]

A CUDA program typically is a C program written for the host. C extensions supported by the CUDA programming environment allow the host to send and receive data to/from the GPU's device memory as well as to invoke C functions (called kernels) that run on the GPU cores. The GPU programming model is Single Instruction Multiple Thread (SIMT). When a kernel is invoked, the user must specify the number of threads to be invoked. This is done by specifying explicitly the number of thread blocks and the number of threads per block. CUDA further organizes the threads of a block into warps of 32 threads each, each block of threads is assigned to a single SM, and thread warps are executed synchronously on SMs. While thread divergence within a warp is permitted, when the threads of a warp diverge, the divergent paths are executed serially until they converge.

A CUDA kernel may access different types of memory, each having different capacity, latency, and caching properties. We summarize the memory hierarchy below.

PAGE 91

Device memory: 4GB of device memory is available. This memory can be read and written directly by all threads. However, device memory access entails a high latency (400 to 600 clock cycles). The thread scheduler attempts to hide this latency by scheduling arithmetic operations that are ready to be performed while waiting for the access to device memory to complete [13]. Device memory is not cached.

Constant memory: Constant memory is a read-only memory space that is shared by all threads. Constant memory is cached and is limited to 64KB.

Shared memory: Each SM has 16KB of shared memory. Shared memory is divided into 16 banks of 32-bit words. When there are no bank conflicts, the threads of a warp can access shared memory as fast as they can access registers [13].

Texture memory: Texture memory, like constant memory, is a read-only memory space that is accessible to all threads using device functions called texture fetches. Texture memory is initialized at the host side and is read at the device side. Texture memory is cached.

Pinned memory (also known as Page-Locked Host Memory): This is part of the host memory. Data transfer between pinned and device memory is faster than between pageable host memory and device memory. Also, this data transfer can be done concurrently with kernel execution. However, since allocating part of the host memory as pinned reduces the amount of physical memory available to the operating system for paging, allocating too much page-locked memory reduces overall system performance [13].

6.2 Multipattern Boyer-Moore Algorithm

The Boyer-Moore pattern matching algorithm [8] was developed to find all occurrences of a pattern $P$ in a string $S$. This algorithm begins by positioning the first character of $P$ at the first character of $S$. This results in a pairing of the first $|P|$ characters of $S$ with characters of $P$. The characters in each pair are compared beginning with those in the rightmost pair. If all pairs of characters match, we have found an occurrence of $P$ in $S$ and $P$ is shifted right by 1 character (or by $|P|$ if only non-overlapping matches are to be found). Otherwise, we stop at the rightmost pair (or first pair, since we compare right to left) where there is a mismatch and use the bad character function for $P$ to determine how many characters to shift $P$ right before re-examining pairs of characters from $P$ and $S$ for a match. More specifically, the bad

PAGE 92

character function for $P$ gives the distance from the end of $P$ of the last occurrence of each possible character that may appear in $S$. So, for example, if the characters of $S$ are drawn from the alphabet $\{a, b, c, d\}$, the bad character function, $B$, for $P = abcabcd$ has $B(a) = 4$, $B(b) = 3$, $B(c) = 2$, and $B(d) = 1$. In practice, many of the shifts in the bad character function of a pattern are close to the length, $|P|$, of the pattern $P$, making the Boyer-Moore algorithm a very fast search algorithm.

In fact, when the alphabet size is large, the average run time of the Boyer-Moore algorithm is $O(|S|/|P|)$. Galil [20] has proposed a variation for which the worst-case run time is $O(|S|)$. Horspool [21] proposes a simplification to the Boyer-Moore algorithm whose performance is about the same as that of the Boyer-Moore algorithm.

Several multipattern extensions to the Boyer-Moore search algorithm have been proposed [5, 7, 30, 66]. All of these multipattern search algorithms extend the basic bad character function employed by the Boyer-Moore algorithm to a bad character function for a set of patterns. This is done by combining the bad character functions for the individual patterns to be searched into a single bad character function for the entire set of patterns. The combined bad character function $B$ for a set of $p$ patterns has

$$B(c) = \min\{B_i(c),\ 1 \le i \le p\}$$

for each character $c$ in the alphabet. Here $B_i$ is the bad character function for the $i$th pattern.

The Set-wise Boyer-Moore algorithm of [30] performs multipattern matching using this combined bad character function. The multipattern search algorithms of [5, 7, 66] employ additional techniques to speed the search further. The average run time of the algorithms of [5, 7, 66] is $O(|S|/minL)$, where $minL$ is the length of the shortest pattern.

The multipattern Boyer-Moore algorithm used by us is due to [7]. This algorithm employs two additional functions, $shift1$ and $shift2$. Let $P_1, P_2, \ldots, P_p$ be the patterns in the dictionary. First, we represent the reverse of these patterns as a trie. Figure 6-2 shows the trie corresponding to the patterns cac, acbacc, cba, bbaca, and cbaca.
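The bad character function and its combination over a pattern set, as defined above, can be sketched in C as follows. The function names and the 256-symbol alphabet are illustrative assumptions. The value assigned to characters that do not occur in a pattern ($|P| + 1$) is also an assumption, since the text only gives values for characters that occur in $P$.

```c
#include <assert.h>
#include <string.h>

#define ALPHABET 256

/* Bad character function of one pattern p of length m >= 1:
 * B[c] = distance from the end of p of the last occurrence of c,
 * so the last character of p gets 1. Characters absent from p
 * get m + 1 (assumed convention, not stated in the text). */
void bad_char(const char *p, int B[ALPHABET])
{
    int m = (int)strlen(p);
    assert(m >= 1);
    for (int c = 0; c < ALPHABET; c++)
        B[c] = m + 1;
    for (int j = 0; j < m; j++)
        B[(unsigned char)p[j]] = m - j;  /* later occurrences overwrite earlier ones */
}

/* Combined function for a set of patterns: B[c] = min over i of B_i[c]. */
void combined_bad_char(const char *const *patterns, int count, int B[ALPHABET])
{
    int Bi[ALPHABET];
    assert(count >= 1);
    bad_char(patterns[0], B);
    for (int i = 1; i < count; i++) {
        bad_char(patterns[i], Bi);
        for (int c = 0; c < ALPHABET; c++)
            if (Bi[c] < B[c])
                B[c] = Bi[c];
    }
}
```

For $P = abcabcd$, `bad_char` reproduces the values $B(a) = 4$, $B(b) = 3$, $B(c) = 2$, and $B(d) = 1$ given in the text.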

PAGE 93

Figure 6-2. Reverse trie for cac, acbacc, cba, bbaca, and cbaca ($shift1(node)$, $shift2(node)$ values are shown beside each node)

Let $P(node_i) = p_{i,1}, p_{i,2}, \ldots, p_{i,|P_i|}$ be the path from the root of the trie to $node_i$ of the trie. Let $set1$ and $set2$ be as below.

$$set1(node) = \{node' : P(node) \text{ is a proper suffix of } P(node'), \text{ i.e., } P(node') = qP(node) \text{ for some nonempty string } q\}$$

$$set2(node) = \{node' : node' \in set1(node) \text{ and } matched\_pattern(node') \ne \emptyset\}$$

The two shift functions are defined in terms of $set1$ and $set2$ as below.

$$shift1(node) = \begin{cases} 1 & node = root \\ \min(\{d(node') - d(node) : node' \in set1(node)\} \cup \{minL\}) & \text{otherwise} \end{cases}$$

PAGE 94

$$shift2(node) = \begin{cases} minL & node = root \\ \min(\{d(node') - d(node) : node' \in set2(node)\} \cup \{shift2(node.parent)\}) & \text{otherwise} \end{cases}$$

where $d(node)$ is the depth of the node in the trie and is defined as:

$$d(node) = \begin{cases} 1 & node = root \\ d(node') + 1 & \text{if } node \text{ is a child of } node' \end{cases}$$

The multipattern Boyer-Moore algorithm of [7] uses the following shift function.

$$shift(c, node) = \max\{B(c), shift1(node), shift2(node)\}$$

6.3 GPU-to-GPU

6.3.1 Strategy

The input to the multipattern matcher is a character array $input$ and the output is an array $output$ of states or reverse-trie node indexes (or pointers). Both arrays reside in device memory. When the Aho-Corasick (AC) algorithm is used, $output[i]$ gives the state of the AC DFA following the processing of $input[i]$. Since every state of the AC DFA contains a list of patterns that are matched when this state is reached, $output[i]$ enables us to determine all matching patterns that end at input character $i$. When the multipattern Boyer-Moore (mBM) algorithm is used, $output[i]$ is the last reverse trie node visited over all examinations of $input[i]$. Using this information and the pattern list stored in the trie node, we may determine all pattern matches that begin at $input[i]$. If we assume that the number of states in the AC DFA as well as the number of nodes in the mBM reverse trie is no more than 65536, a state/node index can be encoded using two bytes and the size of the output array is twice that of the input array.

Our computational strategy is to partition the output array into blocks of size $S_{block}$ (Figure 6-3 summarizes the notation used in this section). The blocks are numbered (indexed) 0 through $n/S_{block} - 1$, where $n$ is the number of output values to be computed.

PAGE 95

$n$             number of characters in string to be searched
$maxL$          length of longest pattern
$S_{block}$     number of input characters for which a thread block computes output
$B$             number of blocks $= n/S_{block}$
$T$             number of threads in a thread block
$S_{thread}$    number of input characters for which a thread computes output $= S_{block}/T$
$tWord$         $S_{thread}/4$
$TW$            total work = effective string length processed

Figure 6-3. GPU-to-GPU notation

Note that $n$ equals the number of input characters as well. $output[bS_{block} : (b+1)S_{block} - 1]$ comprises the $b$th output block. To compute the $b$th output block, it is sufficient for us to use AC on $input[bS_{block} - maxL + 1 : (b+1)S_{block} - 1]$, where $maxL$ is the length of the longest pattern (for simplicity, we assume that there is a character that is not the first character of any pattern and set $input[-maxL + 1 : -1]$ equal to this character), or mBM on $input[bS_{block} : (b+1)S_{block} + maxL - 2]$ (we may assume that $input[n : n + maxL - 2]$ equals a character that is not the last character of any pattern). So, a block actually processes a string whose length is $S_{block} + maxL - 1$ and produces $S_{block}$ elements of the output. The number of blocks is $B = n/S_{block}$.

Suppose that an output block is computed using $T$ threads. Then, each thread could compute $S_{thread} = S_{block}/T$ of the output values to be computed by the block. So, thread $t$ (thread indexes begin at 0) of block $b$ could compute $output[bS_{block} + tS_{thread} : bS_{block} + tS_{thread} + S_{thread} - 1]$. For this, thread $t$ of block $b$ would need to process the substring $input[bS_{block} + tS_{thread} - maxL + 1 : bS_{block} + tS_{thread} + S_{thread} - 1]$ when AC is used and $input[bS_{block} + tS_{thread} : bS_{block} + tS_{thread} + S_{thread} + maxL - 2]$ when mBM is used. Figure 6-4 gives the pseudocode for a $T$-thread computation of block $b$ of the output using the AC DFA. The variables used are self-explanatory and the correctness of the pseudocode follows from the preceding discussion.
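The index ranges in the preceding paragraph can be checked with a few lines of host-side C. This is a sketch under the assumption that $S_{block}$ is a multiple of $S_{thread}$; the type `range_t` and the function names are hypothetical and do not come from the original code.

```c
#include <assert.h>

/* Inclusive index range [first, last]. */
typedef struct { long first; long last; } range_t;

/* Output positions computed by thread t of block b. */
range_t output_range(long b, long t, long s_block, long s_thread)
{
    range_t r;
    assert(b >= 0 && t >= 0 && s_thread > 0 && s_block % s_thread == 0);
    r.first = b * s_block + t * s_thread;
    r.last = r.first + s_thread - 1;
    return r;
}

/* Input positions examined by the same thread when AC is used:
 * maxL - 1 extra characters before the thread's own segment.
 * Negative indexes refer to the padding input[-maxL+1 : -1]. */
range_t ac_input_range(long b, long t, long s_block, long s_thread, long max_l)
{
    range_t r = output_range(b, t, s_block, s_thread);
    r.first -= max_l - 1;
    return r;
}

/* Input positions examined when mBM is used:
 * maxL - 1 extra characters after the thread's own segment. */
range_t mbm_input_range(long b, long t, long s_block, long s_thread, long max_l)
{
    range_t r = output_range(b, t, s_block, s_thread);
    r.last += max_l - 1;
    return r;
}
```

With $S_{block} = 14592$, $S_{thread} = 228$, and $maxL = 17$, thread 2 of block 1 computes outputs 15048 through 15275 and, under AC, reads inputs 15032 through 15275, a string of $S_{thread} + maxL - 1 = 244$ characters.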
As discussed earlier, the arrays $input$ and $output$ reside in device memory. The AC DFA (or the mBM reverse tries in case the mBM algorithm is used) resides in texture

PAGE 96

Algorithm basic
// compute block b of the output array using T threads and AC
// following is the code for a single thread, thread t, 0 <= t < T
PAGE 97

A nice feature of Algorithm basic is that all $T$ threads that work on a single block can execute in lock-step fashion, as there is no divergence in the execution paths of these $T$ threads. This makes it possible for an SM of a GPU to efficiently compute an output block using $T$ threads. With 30 SMs, we can compute 30 output blocks at a time. The pseudocode of Figure 6-4 does, however, have several deficiencies that are expected to result in non-optimal performance on a GPU. These deficiencies are listed below.

D1: Since the input array resides in device memory, every reference to the array $input$ requires a device memory transaction (in this case a read). There are two sources of inefficiency when the read accesses to $input$ are actually made on the Tesla GPU.

1. Our Tesla GPU performs device-memory transactions for a half-warp (16) of threads at a time. The available bandwidth for a single transaction is 128 bytes. Each thread of our code reads 1 byte. So, a half warp reads 16 bytes. Hence, barring any other limitation of our GPU, our code will utilize 1/8th of the available bandwidth between device memory and an SM.

2. The Tesla is able to coalesce the device memory transactions from several threads of a half warp into a single transaction. However, coalescing occurs only when the device-memory accesses of two or more threads in a half-warp lie in the same 128-byte segment of device memory. When $S_{thread} > 128$, the values of inputStartIndex for consecutive threads in a half-warp (note that two threads $t_1$ and $t_2$ are in the same half warp iff $\lfloor t_1/16 \rfloor = \lfloor t_2/16 \rfloor$) are more than 128 bytes apart. Consequently, for any given value of the loop index $i$, the read accesses made to the array $input$ by the threads of a half warp lie in different 128-byte segments and so no coalescing occurs. Although the pseudocode is written to enable all threads to simultaneously access the needed input character from device memory, an actual implementation on the Tesla GPU will serialize these accesses and, in fact,

PAGE 98

every read from device memory will transmit exactly 1 byte to an SM, resulting in a 1/128 utilization of the available bandwidth.

D2: The writes to the array $output$ suffer from deficiencies similar to those identified for the reads from the array $input$. Assuming that our DFA has no more than $2^{16} = 65536$ states, each state can be encoded using 2 bytes. So, a half-warp writes 32 bytes when the available bandwidth for a half warp is 128 bytes. Further, no coalescing takes place, as no two threads of a half warp write to the same 128-byte segment. Hence, the writes get serialized and the utilized bandwidth is 2 bytes, which is 1/64th of the available bandwidth.

Analysis of Total Work

Using the GPU-to-GPU strategy of Figure 6-4, we essentially do multipattern searches on $B \cdot T$ strings of length $S_{thread} + maxL - 1$ each. With a linear complexity for multipattern search, the total work, $TW$, is roughly equivalent to that done by a sequential algorithm working on an input string of length

$$TW = B \cdot T \cdot (S_{thread} + maxL - 1) = \frac{n}{S_{block}} T (S_{thread} + maxL - 1) = \frac{n}{S_{block}} T \left(\frac{S_{block}}{T} + maxL - 1\right) = n\left(1 + \frac{T}{S_{block}}(maxL - 1)\right) = n\left(1 + \frac{1}{S_{thread}}(maxL - 1)\right)$$

So, our GPU-to-GPU strategy incurs an overhead of $\frac{1}{S_{thread}}(maxL - 1) \times 100\%$ in terms of the effective length of the string that is to be searched. Clearly, this overhead varies substantially with the parameters $maxL$ and $S_{thread}$. Suppose that $maxL = 17$, $S_{block} = 14592$, and $T = 64$ (as in our experiments of Section 6.5). Then, $S_{thread} = 228$ and $TW = 1.07n$. The overhead is 7%.
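The total work formula above is easy to evaluate for other parameter choices. A minimal sketch in C follows; the function names are illustrative assumptions.

```c
#include <assert.h>

/* Overhead as a fraction of n: (maxL - 1) / Sthread. */
double work_overhead(int max_l, int s_thread)
{
    assert(max_l >= 1 && s_thread >= 1);
    return (double)(max_l - 1) / (double)s_thread;
}

/* Total work TW = n * (1 + (maxL - 1) / Sthread),
 * expressed as an effective string length. */
double total_work(double n, int max_l, int s_thread)
{
    return n * (1.0 + work_overhead(max_l, s_thread));
}
```

For $maxL = 17$ and $S_{thread} = 228$, `work_overhead` returns $16/228 \approx 0.070$, which is the 7% overhead quoted above.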

PAGE 99

6.3.2 Addressing the Deficiencies

6.3.2.1 Deficiency D1: reading from device memory

A simple way to improve the utilization of available bandwidth between the device memory and an SM is to have each thread input 16 characters at a time, process these 16 characters, and write the output values for these 16 characters to device memory. For this, we will need to cast the input array from its native data type unsigned char to the data type uint4 as below:

    uint4 *inputUint4 = (uint4 *) input;

A variable var of type uint4 is comprised of 4 unsigned 4-byte integers var.x, var.y, var.z, and var.w. The statement

    uint4 in4 = inputUint4[i];

reads the 16 bytes input[16*i : 16*i+15] and stores these in the variable in4, which is assigned space in shared memory. Since the Tesla is able to read up to 128 bits (16 bytes) at a time for each thread, this simple change increases bandwidth utilization for the reading of the input data from 1/128 of capacity to 1/8 of capacity! However, this increase in bandwidth utilization comes with some cost. To extract the characters from in4 so they may be processed one at a time by our algorithm, we need to do a shift and mask operation on the 4 components of in4. We shall see later that this cost may be avoided by doing a recast to unsigned char.

Since a Tesla thread cannot read more than 128 bits at a time, the only way to improve bandwidth utilization further is to coalesce the accesses of multiple threads in a half warp. To get full bandwidth utilization, at least 8 threads in a half warp will need to read uint4s that lie in the same 128-byte segment. However, the data to be processed by different threads do not lie in the same segment. To get around this problem, threads cooperatively read all the data needed to process a block, store this data in shared memory, and finally read and process the data from shared memory. In the pseudocode of Figure 6-5, $T$ threads cooperatively read the input data for block $b$.
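The shift and mask extraction mentioned above can be illustrated on the host in plain C. This is a sketch only: it models the four components of a uint4 as an array of four 32-bit words in the order x, y, z, w, and it assumes little-endian packing (the first character of each group of four sits in the low-order byte). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Character k (0..3) of a 32-bit word, assuming little-endian packing:
 * character 0 is in the low-order 8 bits. */
unsigned char char_from_word(uint32_t word, int k)
{
    assert(k >= 0 && k < 4);
    return (unsigned char)((word >> (8 * k)) & 0xFFu);
}

/* Character j (0..15) of a 16-byte group held as four 32-bit words,
 * standing in for the x, y, z, w components of a uint4. */
unsigned char char_from_quad(const uint32_t quad[4], int j)
{
    assert(j >= 0 && j < 16);
    return char_from_word(quad[j / 4], j % 4);
}
```

Each extracted character costs one shift and one mask, which is the per-character overhead that the recast to unsigned char avoids.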

PAGE 100

// define space in shared memory to store the input data
__shared__ unsigned char sInput[Sblock + maxL - 1];
// typecast to uint4
uint4 *sInputUint4 = (uint4 *) sInput;
// read as uint4s, assume Sblock and maxL - 1 are divisible by 16
int numToRead = (Sblock + maxL - 1) / 16;
int next = b * Sblock / 16 - (maxL - 1) / 16 + t;
// T threads collectively input a block
for (int i = t; i
PAGE 101

or as uint4s. When the latter is done, we need to do shifts and masks to extract the characters from the 4 unsigned integer components of a uint4.

Although the input scheme of Figure 6-5 succeeds in reading in the data utilizing 100% of the bandwidth between device memory and an SM, there is potential for shared-memory bank conflicts when the threads read the data from shared memory. Shared memory is partitioned into 16 banks. The $i$th 32-bit word of shared memory is in bank $i \bmod 16$. For maximum performance, the threads of a half warp should access data from different banks. Suppose that $S_{thread} = 224$ and sInput begins at a 32-bit word boundary. Let $tWord = S_{thread}/4$ ($tWord = 224/4 = 56$ for our example) denote the number of 32-bit words processed by a thread exclusive of the additional $maxL - 1$ characters needed to properly handle the boundary. In the first iteration of the data processing loop, thread $t$ needs $sInput[tS_{thread}]$, $0 \le t <$ [...] $> 0$,

PAGE 102

$j - i$ must be divisible by 16. However, $j - i < 16$ and so cannot be divisible by 16. This contradiction implies that our assumption is invalid and the theorem is proved.

It should be noted that even when $tWord$ is odd, the input for every block begins at a 128-byte segment of device memory (assuming that the input for the first block begins at a 128-byte segment) provided $T$ is a multiple of 32. To see this, observe that $S_{block} = 4 \cdot T \cdot tWord$, which is a multiple of 128 whenever $T$ is a multiple of 32. As noted earlier, since the Tesla schedules threads in warps of size 32, we normally would choose $T$ to be a multiple of 32.

6.3.2.2 Deficiency D2: writing to device memory

We could use the same strategy used to overcome deficiency D1 to improve bandwidth utilization when writing the results to device memory. This would require us to first have each thread write the results it computes to shared memory and then have all threads collectively write the computed results from shared memory to device memory using uint4s. Since the results take twice the space taken by the input, such a strategy would necessitate a reduction in $S_{block}$ by two-thirds. For example, when $maxL = 17$ and $S_{block} = 14592$, we need 14592 bytes of shared memory for the array sInput. This leaves us with only a small amount of the 16KB shared memory to store any other data that we may need to. If we wish to store the results in shared memory as well, we must use a smaller value for $S_{block}$. So, we must reduce $S_{block}$ to about 14592/3, or 4864, to keep the amount of shared memory used the same. When $T = 64$, this reduction in block size increases the total work overhead from approximately 7% to approximately 21%. We can avoid this increase in total work overhead by doing the following:

1. First, each thread processes the first $maxL - 1$ characters it is to process. The processing of these characters generates no output and so we need no memory to store output.

2. Next, each thread reads the remaining $S_{thread}$ characters of input data it needs from shared memory to registers. For this, we declare a register array of unsigned

PAGE 103

integers and typecast sInput to unsigned integer. Since the $T$ threads have a total of 16,384 registers, we have sufficient registers provided $S_{block} \le 4 \cdot 16384 = 64K$ (in reality, $S_{block}$ would need to be slightly smaller than 64K, as registers are needed to store other values such as loop variables). Since total register memory exceeds the size of shared memory, we always have enough register space to save the input data that is in shared memory.

Unless $S_{block} \le 4864$, we cannot store all the results in shared memory. However, to do a 128-byte write transaction to device memory, we need only sets of 64 adjacent results (recall that each result is 2 bytes). So, the shared memory needed to store the results is $128T$ bytes. Since we are contemplating $T = 64$, we need only 8K of shared memory to store the results from the processing of 64 characters per thread. Once each thread has processed 64 characters and stored the results in shared memory, we may write the results to device memory. The total number of outputs generated by a thread is $S_{thread} = 4 \cdot tWord$. These outputs take a total of $8 \cdot tWord$ bytes. So, when $tWord$ is odd (as required by Theorem 6.1), the output generated by a thread is a non-integral number of uint4s (recall that each uint4 is 16 bytes). Hence, the output for some of the threads does not begin at the start of a uint4 boundary of the device array $output$ and we cannot write the results to device memory as uint4s. Rather, we need to write as uint2s (a thread generates an integral number $tWord$ of uint2s). With each thread writing a uint2, it takes 16 threads to write 128 bytes of output from that thread. So, $T$ threads can write the output generated from the processing of 64 characters/thread in 16 rounds of uint2 writes. One difficulty is that, as noted earlier, when $tWord$ is odd, even though the segment of device memory to which the output from a thread is to be written begins at a uint2 boundary, it does not begin at a uint4 boundary. This means also that this segment does not begin at a 128-byte boundary (note that every 128-byte boundary is also a uint4 boundary). So, even though a half-warp of 16 threads is writing

PAGE 104

to 128 bytes of contiguous device memory, these 128 bytes may not fall within a single 128-byte segment. When this happens, the write is done as two memory transactions.

The described procedure to handle 64 characters of input per thread is repeated $\lceil S_{thread}/64 \rceil$ times to complete the processing of the entire input block. In case $S_{thread}$ is not divisible by 64, each thread produces fewer than 64 results in the last round. For example, when $S_{thread} = 228$, we have a total of 4 rounds. In each of the first three rounds, each thread processes 64 input characters and produces 64 results. In the last round, each thread processes 36 characters and produces 36 results. In the last round, groups of threads write to contiguous device memory segments of size either 64 or 8 bytes, and some of these segments may span two 128-byte segments of device memory.

As we can see, using an odd $tWord$ is required to avoid shared-memory bank conflicts, but using an odd $tWord$ (actually, using a $tWord$ value that is not a multiple of 16) results in suboptimal writes of the results to device memory. To optimize writes to device memory, we need to use a $tWord$ value that is a multiple of 16. Since the Tesla executes threads on an SM in warps of size 32, $T$ would normally be a multiple of 32. Further, to hide memory latency, it is recommended that $T$ be at least 64. With $T = 64$ and a 16KB shared memory, $S_{thread}$ can be at most $16 \cdot 1024/64 = 256$ and so $tWord$ can be at most 64. However, since a small amount of shared memory is needed for other purposes, $tWord < 64$. The largest value possible for $tWord$ that is a multiple of 16 is therefore 48. The total work, $TW$, when $tWord = 48$ and $maxL = 17$ is $n(1 + \frac{1}{4 \cdot 48} \cdot 16) = 1.083n$. Compared to the case $tWord = 57$, the total work overhead increases from 7% to 8.3%. Whether we are better off using $tWord = 48$, which results in optimized writes to device memory but shared-memory bank conflicts and larger work overhead, or $tWord = 57$, which has no shared-memory bank conflicts and lower work overhead but suboptimal writes to device memory, can be determined experimentally.
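The effect of $tWord$ on shared-memory bank conflicts can be explored with a short host-side simulation. The sketch below assumes the access pattern described in the text for the first iteration of the processing loop (thread $t$ of a half warp reads the 32-bit word at index $t \cdot tWord$, which lies in bank $(t \cdot tWord) \bmod 16$). The function name is an illustrative assumption.

```c
#include <assert.h>

#define NUM_BANKS 16   /* shared-memory banks on the GT200 */
#define HALF_WARP 16   /* threads that access shared memory together */

/* Largest number of threads of a half warp that hit the same bank when
 * thread t reads the 32-bit word at index t * t_word.
 * A return value of 1 means the access is free of bank conflicts. */
int max_bank_conflict_degree(int t_word)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    assert(t_word > 0);
    for (int t = 0; t < HALF_WARP; t++) {
        int bank = (t * t_word) % NUM_BANKS;
        hits[bank]++;
        if (hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Under this model, $tWord = 57$ gives a degree of 1 (no conflicts), $tWord = 56$ gives 8, and $tWord = 48$ gives 16, since all 16 threads then map to bank 0.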

PAGE 105

for(inti=0;i
PAGE 106

Write segment 0 from host to device buffer IN0;
for (int i = 1; i
PAGE 107

$s$            number of segments
$t_w$          time to write an input data segment from host to device memory
$t_r$          time to read an output data segment from device to host memory
$t_p$          time taken by GPU to process an input data segment and create corresponding output segment
$T_w$          $\sum_{i=0}^{s-1} t_w = s \cdot t_w$
$T_r$          $\sum_{i=0}^{s-1} t_r = s \cdot t_r$
$T_p$          $\sum_{i=0}^{s-1} t_p = s \cdot t_p$
$T_{ws}(i)$    time at which the writing of input segment $i$ to device memory starts
$T_{ps}(i)$    time at which the processing of segment $i$ by the GPU starts
$T_{rs}(i)$    time at which the reading of output segment $i$ to host memory starts
$T_{wf}(i)$    time at which the writing of input segment $i$ to device memory finishes
$T_{pf}(i)$    time at which the processing of segment $i$ by the GPU finishes
$T_{rf}(i)$    time at which the reading of output segment $i$ to host memory finishes
$T_A$          completion time using strategy A
$T_B$          completion time using strategy B
$L$            lower bound on completion time

Figure 6-8. Notation used in completion time analysis

5. In every feasible strategy, the relative order of segment writes, processing, and reads is the same and is segment 0, followed by segment 1, ..., and ending with segment $s - 1$, where $s$ is the number of segments.

Since writing from the host memory to the device memory uses the same I/O channel/bus as used to read from the device memory to the host memory, and since the GPU is necessarily idle when the first input segment is being written to the device memory and when the last output segment is being read from this memory, $t_w + \max\{(s-1)(t_w + t_r), s t_p\} + t_r$ is a lower bound on the completion time of any host-to-host computing strategy.

It is easy to see that when the number of segments $s$ is 1, the completion time for both strategies A and B is $t_w + t_p + t_r$, which equals the lower bound. Actually, when $s = 1$, both strategies are identical and optimal. The analysis of the two strategies for $s > 1$ is more complex and is done below in Theorems 6.2 to 6.5. We note that assumption 4 implies that $T_{wf}(i) = T_{ws}(i) + t_w$, $T_{pf}(i) = T_{ps}(i) + t_p$, and $T_{rf}(i) = T_{rs}(i) + t_r$, $0 \le i < s$.

Theorem 6.2. When $s > 1$, the completion time, $T_A$, of strategy A is:

PAGE 108

1. Tw+Trwheneveranyoffollowingholds: (a) twtp^tpTr)]TJ /F3 11.955 Tf 11.95 0 Td[(tr (b) twtp^trtp (c) twtp^trTw+itr] 2. Tw+tp+trwhentwtp^tp>Tr)]TJ /F3 11.955 Tf 11.95 0 Td[(tr 3. tw+tp+Trwhentwtp 4. tw+Tp+trwheneitherofthefollowingholds: (a) twtp^trTw+itr] Proof. Itshouldbeeasytoseethattheconditionslistedinthetheoremexhaustallpossibilities.WhenstrategyAisused,allthewritestodevicememorycompletebeforetherstreadbegins(i.e.,Trs(0)Twf(s)]TJ /F4 11.955 Tf 12.75 0 Td[(1)),Tws(i)=itw,Twf(i)=(i+1)tw,0iTr)]TJ /F3 11.955 Tf 12.05 0 Td[(tr,TA=Tw+tp+tr(Figure 6-9 (b)).Thisprovescases1aand2ofthetheorem. Whentwtp.Therstofthesehastwosubsubcasesofitsowntrtp(theoremcase4a)andtr>tp(theoremcase3).ThesesubsubcasesareshowninFigure 6-10 forthecaseof4segments.ItiseasytoseethatTA=tw+Tp+trwhen 108

PAGE 109

Acase1a Bcase2 Figure6-9. StrategyA,twtp,s=4(cases1aand2) trtpandTA=tw+tp+Trwhentr>tp.Thesecondsubcase(twtp)hastwosubsubcasesaswelltrtpandtrTw+itr](theoremcase1c),69i,0iTrf(i)]TJ /F4 11.955 Tf 13.02 0 Td[(1)],whereTrf()]TJ /F4 11.955 Tf 9.3 0 Td[(1)isdenedtobeTw.So,TA=Trf(s)]TJ /F4 11.955 Tf 12.73 0 Td[(1)+tr=Tw+Tr(Figure 6-11 (b)).WhentrTw+itr](theoremcase4b),9i,0iTrf(i)]TJ /F4 11.955 Tf 12.45 0 Td[(1)]andTA=tw+Tp+tr(Figure 6-11 (c)). Theorem6.3. ThecompletiontimeusingstrategyAistheminimumpossiblecompletiontimeforeverycombinationoftw,tp,andtr. Proof. First,weobtainatighterlowerboundLthantheboundtw+maxf(s)]TJ /F4 11.955 Tf 12.45 0 Td[(1)(tw+tr),stpgprovidedatthebeginningofthissection.Since,writesandreadsaredoneseriallyonthesameI/Ochannel,LTw+Tr.Since,Twf(s)]TJ /F4 11.955 Tf 11.32 0 Td[(1)Twforeverystrategy,theprocessingofthelastsegmentcannotbeginbeforeTw.Hence,LTw+tp+tr.Sincetheprocessingoftherstsegmentcannotbeginningbeforetw,thereadoftherstsegment'soutputcannotcompletebeforetw+tp+tr.Theremainingreadsrequire 109

PAGE 110

Acase4a Bcase3 Figure6-10. StrategyA,twtp,s=4(cases1b,1c,and4b) 110

PAGE 111

$(s-1)t_r$ time and are done after the first read completes. So, the last read cannot complete before $t_w + t_p + T_r$. Hence, $L \ge t_w + t_p + T_r$. Also, since the processing of the first segment cannot begin before $t_w$, $T_{pf}(s-1) \ge t_w + T_p$. Hence, $L \ge t_w + T_p + t_r$. Combining all of these bounds on $L$, we get $L \ge \max\{T_w + T_r, T_w + t_p + t_r, t_w + t_p + T_r, t_w + T_p + t_r\}$.

From Theorem 6.2, we see that, in all cases, $T_A$ equals one of the expressions $T_w + T_r$, $T_w + t_p + t_r$, $t_w + t_p + T_r$, $t_w + T_p + t_r$. Hence a tight lower bound on the completion time of every strategy is $L = \max\{T_w + T_r, T_w + t_p + t_r, t_w + t_p + T_r, t_w + T_p + t_r\}$. Strategy A achieves this tight lower bound (Theorem 6.2) and so obtains the minimum completion time possible.

Theorem 6.4. When $s > 1$, the completion time $T_B$ of strategy B is

$$T_B = t_w + \max\{t_w, t_p\} + (s-2)\max\{t_w + t_r, t_p\} + \max\{t_p, t_r\} + t_r$$

Proof. When the for loop index $i = 1$, the read within the loop begins at $t_w + \max\{t_w, t_p\}$. For $2 \le i <$ [...] $t_p > t_r > 0$, $T_B = T_w + (s-1)t_r + t_p = T_w + T_r + t_p - t_r = T_w + t_p + 2t_r$. When strategy A is used with this data, we are either in case 1a of Theorem 6.2 and $T_A = T_w + T_r = L$
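The lower bound $L$ of Theorem 6.3 and the strategy B completion time of Theorem 6.4 can be compared numerically. The sketch below is a direct transcription of the two formulas into C; the function names are illustrative assumptions, and times are plain doubles in arbitrary units.

```c
#include <assert.h>

static double max2(double a, double b) { return a > b ? a : b; }

/* Tight lower bound L of Theorem 6.3:
 * max{Tw + Tr, Tw + tp + tr, tw + tp + Tr, tw + Tp + tr},
 * where Tw = s*tw, Tp = s*tp, Tr = s*tr. */
double completion_lower_bound(int s, double tw, double tp, double tr)
{
    double Tw = s * tw, Tp = s * tp, Tr = s * tr;
    assert(s >= 1);
    return max2(max2(Tw + Tr, Tw + tp + tr),
                max2(tw + tp + Tr, tw + Tp + tr));
}

/* Completion time of strategy B (Theorem 6.4); valid only for s > 1. */
double strategy_b_time(int s, double tw, double tp, double tr)
{
    assert(s > 1);
    return tw + max2(tw, tp) + (s - 2) * max2(tw + tr, tp)
         + max2(tp, tr) + tr;
}
```

For example, with $s = 4$, $t_w = 3$, $t_p = 2$, and $t_r = 1$, the lower bound is $T_w + T_r = 16$ while strategy B takes $T_w + (s-1)t_r + t_p = 17$.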
PAGE 112

6.4.3 Completion Time Using Enhanced GPUs

In this section, we analyze the completion times of strategies A and B under the assumption that we are using a GPU system that is enhanced so that there are two I/O channels/buses between the host CPU and the GPU and the CPU has a dual-port memory that supports simultaneous reads and writes. In this case the writing of an input data segment to device memory can be overlapped with the reading of an output data segment from device memory. When $s = 1$, the enhanced GPU is unable to perform any better than the original GPU and $T_A = T_B = t_w + t_p + t_r$. Theorems 6.6 through 6.9 are the enhanced GPU analogs of Theorems 6.2 through 6.5 for the case $s > 1$.

Theorem 6.6. When $s > 1$, the completion time, $T_A$, of strategy A for the enhanced GPU model is $T_A = T_w + t_p + t_r$ when $t_w \ge t_p \wedge t_w \ge t_r$, and $T_A = t_w + T_p + t_r$ when $t_w <$ [...]
PAGE 113

Atwtpandtwtr Btwtpandtw