
Multimedia Communication with Cluster Computing and Wireless WCDMA Network



MULTIMEDIA COMMUNICATION WITH CLUSTER COMPUTING AND WIRELESS WCDMA NETWORK

By

JU WANG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2003


Copyright 2003 by Ju Wang


This work is dedicated to my beloved wife, Qingyan Zhang.


ACKNOWLEDGEMENTS

I would like to thank my dissertation advisor, Dr. Jonathan Liu, for his guidance and support, without which it would never have been possible to complete this work. Dr. Liu's invaluable advice on both academic and nonacademic life made me more mature and scholastic. I would also like to thank Dr. Randy Chow and Dr. Jih-kwon Peir for sharing their intelligence and experience with me. Thanks also go to Dr. Sahni and Dr. Fang, who provided many constructive questions and witty suggestions on my research work.

Special thanks go to my parents. Their continued support has always provided me the courage and the spirit to move forward. Finally, words will never be enough to express my gratitude to my beloved wife, Qingyan, for her constant love, patience, encouragement, and inspiration during the difficult times.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS

ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 High Performance MPEG-2 Decoding
  1.2 Multimedia Communication over WCDMA Wireless Network
  1.3 Dissertation Outline

2 MPEG-2 BACKGROUND AND RELATED STUDY

3 DATA-PARTITION PARALLEL SCHEME
  3.1 Performance Model of Data-Partition Scheme
  3.2 Communication Analysis in Data-Partition Parallel Scheme
  3.3 Partition Example and Communication Calculation
  3.4 Communication Minimization
  3.5 Heuristic Data-Partition Algorithm
  3.6 Experimental Results
    3.6.1 Performance over a 100-Mbps Network
    3.6.2 Performance over a 680-Mbps Environment

4 PIPELINE PARALLEL SCHEME
  4.1 Design Issues of Pipeline Scheme
  4.2 Performance Analysis
  4.3 Experimental Results
  4.4 Experiment Design
  4.5 Performance over a 100-Mbps Network
  4.6 Performance over a 680-Mbps SMP Environment
  4.7 Towards the High Resolution MPEG-2 Video
    4.7.1 Efficient Buffering Schemes
    4.7.2 Further Optimization in the Slave Nodes
    4.7.3 Implementation and Experimental Results
  4.8 Summary


5 MULTIMEDIA SUPPORT IN CDMA WIRELESS NETWORK
  5.1 Performance-Guaranteed Call Processing
    5.1.1 Proposed Admission Protocol in a Mobile Station
    5.1.2 Proposed Admission Protocol in a Base Station
    5.1.3 A Performance-Guaranteed System
    5.1.4 Processing Time at Base Station
  5.2 Dynamic Scheduling for Multimedia Integration
    5.2.1 Design Issue of Traffic Scheduling for Wireless CDMA Uplink Channel
    5.2.2 Traffic Scheduling with Fixed Spreading Factor
    5.2.3 Dynamic Scheduling to Improve the Ts
  5.3 Summary

6 SOFT HANDOFF WITH DYNAMIC SPREADING
  6.1 Related Study
  6.2 The Framework of Soft Handoff Algorithm with Dynamic Spreading
    6.2.1 Stationary System Behavior
    6.2.2 Handoff Time Analysis
  6.3 Determine Spreading Factor for Handoff Mobile
    6.3.1 Design Factor
    6.3.2 Performance of OR-ON-DOWN Power Control
    6.3.3 Main Path Power Control
  6.4 Performance Optimization in Handoff Period
    6.4.1 BER Model in Handoff Area
    6.4.2 Simplification of the Original Problem
    6.4.3 Proposed Sub-Optimal SF/POWER Decision Algorithm
    6.4.4 Numerical Results and Performance Discussion

7 CONCLUSIONS AND FUTURE WORK
  7.1 Improve Transportation Layer Performance in WCDMA Downlink
  7.2 Parallel MPEG-4 Decoding
  7.3 Optimal Rate Distortion Control in Video Encoding

APPENDIXES

A C SOURCE CODE FOR PARALLEL MPEG-2 DECODER

B MATLAB SOURCE CODE FOR WCDMA HANDOFF ALGORITHM

REFERENCES

BIOGRAPHICAL SKETCH


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

MULTIMEDIA COMMUNICATION WITH CLUSTER COMPUTING AND WIRELESS WCDMA NETWORK

By

Ju Wang

August 2003

Chairman: Jonathan C. L. Liu
Major Department: Computer and Information Science and Engineering

Distributed multimedia is a multidisciplinary area of research that involves networking, video/audio processing, storage, and high performance computing. Recent advances in video compression and networking technology have brought increasing attention to this area. Nevertheless, providing high-quality digital video on a large scale remains a challenging task.

The primary focus of this dissertation is twofold. (1) We investigated a pure software-based parallel MPEG-2 decompression scheme. To achieve a high level of scalability, we proposed a data pipeline scheme based on a master/slave architecture. The proposed scheme is very efficient due to the low overhead in the master and slave nodes. Our experimental results showed that the proposed parallel algorithm can deliver a close-to-linear speedup for high quality compressed video. With a 100-Mbps network, the 30-fps decompression frame rate can be approached using a Linux cluster. With a 680-Mbps SMP environment, we observed 60-fps HDTV quality with 13 nodes. (2) The second topic of this dissertation addressed how multimedia


applications can be effectively supported in the wireless wideband CDMA network. The major challenge lies in the fact that multimedia traffic usually demands a high data rate and low bit error rate, while the wireless network often experiences high bit error caused by multi-user interference over the air medium. Based on the observation that better channel quality can be obtained by using longer spreading codes, a new media access control (MAC) scheme was proposed such that the spreading code for each mobile is dynamically adjusted based on the network load and the desired traffic QoS. With this new MAC protocol, the radio resource can be better utilized and more data traffic can be supported.

We further extend the dynamic spreading scheme to the multi-cell scenario to handle mobile handover. Based on the system load of the local and neighboring cells, the new handoff algorithm decides when and what spreading factors should be used for a mobile during the handoff period. The algorithm also optimizes the assignment of spreading codes and mobile transmitting power such that the overall throughput can be maximized. The new handoff algorithm is also optimized to allow batch processing of mobile requests to reduce the handoff delay at heavy traffic.


CHAPTER 1
INTRODUCTION

Over the past decade, a tremendous amount of research and development effort has been undertaken on high performance distributed multimedia systems. Digital video with decent quality (e.g., DVD quality) has already become affordable to the general public with off-the-shelf workstations and various consumer electronics products. Some important multimedia applications, such as high quality video conferencing and video-on-demand, are becoming realistic over the wireline network. Nevertheless, it is only human nature to seek even higher video quality (e.g., HDTV) and ubiquitous access to multimedia information. To this end, it is essential that the following two key issues be resolved: (1) a high performance, generic yet scalable software video encoding/decoding solution that does not rely on any particular hardware design, and (2) a high performance mobile communication network that can support large-scale multimedia traffic. In this dissertation, we proposed and investigated various technologies to address these issues. The primary focus of this dissertation is thus twofold: (1) to investigate the design issues of a software-based, parallel MPEG-2 decoder (MPEG-2 is a widely used video format) and (2) to evaluate a new wideband CDMA communication protocol which supports multimedia traffic and significantly increases the link-level QoS for the uplink channel and the system performance during mobile handoff.

1.1 High Performance MPEG-2 Decoding

The MPEG-2 standard has been widely accepted as a platform to provide broadcast-quality digital video. Using a powerful DCT transform based compression


algorithm and the motion compensation technique, a great compression ratio can be obtained while preserving good video quality. The technique, however, requires intensive processing during the encoding and decoding phases. The computation requirement on the decoding side is particularly critical, because the decoder must be fast enough to maintain a continuous playback.

Real-time MPEG-2 decoding has been implemented via a hybrid approach (general purpose processor with multimedia extensions) for broadcast-level video quality. Nevertheless, high-performance, scalable, portable software decoder schemes for high-level video quality are still under investigation. We investigated how a generic high-performance software decoder can be constructed by parallelizing the decoding process over a workstation cluster or multiple processors in an SMP machine. Two parallel approaches, namely the data-partition scheme [1] and the data pipeline scheme [2], are studied in this dissertation.

With the data-partition scheme, a dedicated master node is in charge of parsing the raw MPEG-2 bitstream (either from a network in the case of streaming video or from a local file system), dividing each frame into sets of macroblocks, distributing subtasks, and collecting the results. In the slave nodes, each macroblock goes through VLC decoding, IDCT, and MC to produce the pixel information. This scheme requires the transmission of reference data when decoding non-intra coded macroblocks (in P- and B-type frames). Our analysis reveals that the communication cost related to reference data can vary significantly with different partition methods. Thus an optimal partition should be found to maximize system performance. We compared four partition schemes based on the reference data communication pattern. The Quick-Partition scheme produced the least communication overhead. Our Quick-Partition algorithm subdivides a given frame into two parts each time, along the shorter dimension. Then the two subparts are further divided using the same strategy. This


procedure repeats until the desired number of parts is achieved. With Quick-Partition, our parallel decoder can provide a peak decoding rate of 44-fps under a 15-node SMP system with a measured inter-node bandwidth of 680 Mbps. The corresponding speedup is about 8.

However, it is found that the data-partition method causes high computation overhead at the master node when the number of partitions becomes large, which results in a low speedup gain. To achieve high system performance at large-scale parallel configurations, we investigated the second parallel approach, the data pipeline scheme. The data pipeline scheme is more scalable than the data-partition scheme by using a frame-level parallelization. By doing this, not only is the computation overhead in the master node reduced, the communication cost is also significantly reduced, as the cost of transferring reference data can be virtually eliminated.

The performance of our pipeline algorithm is determined by two design factors: (1) the block size D (i.e., the number of frames decoded in each slave) and (2) the number of slave nodes N. The determination of the optimal values of these two factors becomes nontrivial due to the inter-frame data dependence among I-, P- and B-frames. The internal data dependence causes the transmission of reference picture data among different nodes and even limits the degree of parallelism. We find that increasing D can significantly reduce the communication for the reference picture.
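As an illustration, the Quick-Partition subdivision described earlier can be sketched in a few lines of Python. This is my own illustrative reconstruction under stated assumptions, not the dissertation's implementation; the function and variable names are invented, and the tie-breaking (rounding the cut position proportionally to the part counts) is an assumption.

```python
def quick_partition(x, y, w, h, n):
    """Split a w-by-h macroblock grid into n rectangular parts by
    repeatedly halving the part count; each cut runs along the
    shorter dimension, i.e., the longer side is the one divided."""
    if n == 1:
        return [(x, y, w, h)]
    n_left = n // 2
    n_right = n - n_left
    if w >= h:
        # cut the width; the cut line runs along the (shorter) height
        w_left = round(w * n_left / n)
        return (quick_partition(x, y, w_left, h, n_left) +
                quick_partition(x + w_left, y, w - w_left, h, n_right))
    else:
        # cut the height; the cut line runs along the (shorter) width
        h_top = round(h * n_left / n)
        return (quick_partition(x, y, w, h_top, n_left) +
                quick_partition(x, y + h_top, w, h - h_top, n_right))

# Example: a 45x30 macroblock frame (720x480 pixels) split for 7 slave nodes.
parts = quick_partition(0, 0, 45, 30, 7)
```

Every macroblock is covered exactly once, and because each cut is as short as possible, the total boundary length between parts, and hence the reference-data traffic crossing partition borders, stays small.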
(Footnote: The speed-up gain is defined as the ratio of decompression time between the parallel decoder and its corresponding serial version. Since the majority of the decompression work is done in the slave nodes, we exclude the master node when calculating the speed-up; that is, the maximum speed-up for an 8-node configuration (1 master node and 7 slave nodes) will be 7. We use this definition throughout the discussion unless explicitly stated otherwise.)

When N is fixed, this reduction is close to linear until D equals the size of a GOP, where the minimum communication overhead is achieved. Nevertheless, as the performance analysis model will demonstrate, network bandwidth turns out to be a critical factor for the overall system performance. We have found that there is a single saturation point for a given multiple-node computation environment. Before the saturation point, the system round time (i.e., the overall time for one pipeline running cycle) is


dominated by the CPU processing time, and increasing N increases the system performance close to linearly. However, a higher decoding rate also causes more network traffic. Eventually the growing communication traffic will saturate the system when it reaches the system bandwidth limitation.

Experiments over different networking and OS platforms were performed to validate our proposed data pipeline scheme. The actual decompression rate is recorded for these experiments, with a particular focus on 30-fps real-time video and 60-fps HDTV quality. The experimental results indicate that our scheme can provide a real-time decoding rate and that the system can be scaled up in both the 100-Mbps and the 680-Mbps environments. With a 100-Mbps network, we are able to deliver 30-fps with 6 slave nodes (each node can decompress at 5.6-fps), with a speed-up of 5.7.

With the 680-Mbps SMP environment equipped with 15 low-end processors (Sparc 248 MHz), a rate of 30-fps was achieved when using 7 nodes (single-node decompression rate of 5-fps). A rate of 60-fps (HDTV quality) was achieved when using 13 and 14 nodes. Our analysis shows that a 270-fps decoding rate could be expected with a full configuration of our SMP testing environment.

In order to evaluate the scalability performance of the data pipeline parallel algorithm, further experiments were conducted with high resolution video formats [3]. However, when attempting to decode (1024*1024) and (1404*960) high-level MPEG-2 video, we observed a severe performance degradation (e.g., dropping from 18 or 20 fps to 2.5 fps) when more than 10 slave nodes are used. By analyzing the runtime system resource utilization, we found that the system memory is quickly exhausted when increasing the number of slave nodes. When decoding a video file with high spatial resolution, the increase of memory usage eventually becomes a system bottleneck. To address this challenge and obtain highly scalable decoding for high resolution video, we proposed and implemented two revised memory management approaches to reduce the buffer requirement. The first is Minimum Transmission


Buffer in Slave Node (ST scheme), where we reduce the transmission buffer size in the slave nodes to three frames. In the second approach, we propose a dynamic buffer scheme which adapts to the current frame type. For the I-frame, the system will only request one frame buffer. When a P- or B-image is to be decoded, 2 frames will be allocated. The effective number of frames per buffer is only 85% of the 3-frame buffer. The experimental results show that the buffer space is significantly reduced, and we observed well-scaled decoding performance for the high resolution MPEG-2 video.

1.2 Multimedia Communication over WCDMA Wireless Network

Our investigation on parallelizing the MPEG-2 decoding algorithm indicates that real-time decompression of high quality video can be obtained via a pure software solution. It is our belief that powerful computation capability in the future will be available via massive integration of low-end CPUs onto one chip/board. Thus our proposed data pipeline parallel decoder is particularly appealing for those systems equipped with multiple general purpose processors. Furthermore, our parallel MPEG-2 decoder also provides valuable suggestions for the design of a high performance parallel video encoder. With a proper inter-node communication protocol and enough CPUs, software-based real-time encoding for high-quality video is possible.

However, the capability of real-time video encoding/decoding at the end systems alone does not guarantee the quality along the whole path. We must also have a quality-guaranteed communication network to deliver the multimedia contents in a timely manner and with high fidelity. Multimedia data, especially streaming audio and video data, should be transported with a delay guarantee in order to have a smooth playback. There has been a considerable amount of research in this area. Some efforts try to provide QoS over a best-effort network, such as RSVP [4] over the INTERNET and VoIP [5]. There are also solutions with built-in QoS consideration at the network layer, such as ATM [6] and IPv6 [7]. These research efforts have shown that it is


becoming realistic to support a large number of multimedia connections with desired QoS over the wireline network.

Furthermore, a distributed multimedia system should expand its coverage through a wireless communication link as well. The future generation wireless system needs to support a seamless integration (i.e., transparent application switching) of voice, audio and conventional data (e.g., e-mails and ftp). It should also support many users with guaranteed quality. However, multimedia communication in the wireless network remains a technical challenge, due to the low information bandwidth and high transmission error in the physical layer.

The first generation wireless cellular network was designed to carry voice communication. These networks are analog systems, and use frequency division multiple access (FDMA) to support multiple users. In an FDMA system, the whole radio spectrum is divided into several isolated physical sub-channels, and each sub-channel is dedicated to a particular user once assigned by the base station. In order to reduce the cross-channel interference, sub-channels are often spaced apart by a sufficient distance. Furthermore, neighboring cells are not allowed to use the same frequency band. However, these measures reduce the radio efficiency, and severely limit the number of concurrent users.

The second generation system is characterized by digital modulation and time division multiple access (TDMA) technology, which separates users in time. These systems started to provide limited support for data services, such as email and short messages. However, TDMA is subject to multi-path interference, where the received signal might come from several directions with different delays. Similar to the FDMA technology, neighboring cells in TDMA must use different frequency bands to reduce inter-cell interference.

Nevertheless, the past decade has witnessed a rapid growth of the 2G system. Encouraged by the success of the second generation cellular wireless network, researchers


are now pushing the 3G standard to support a seamless integration of multimedia data services, as well as voice service. However, multimedia data traffic is more demanding than voice service, both in data rate and in maximum tolerated bit error rate (BER). For example, streaming video requires a low BER, less than 10^-4, and a moderate data rate (usually higher than 64 kbps). The bottom line is that the minimum BER and data rate requirements for all admitted connections must be satisfied at all times. Otherwise the system cannot guarantee the performance.

However, traffic channels in a traditional TDMA based system have a low data rate, and the bit error rate (BER) is often higher than required by multimedia traffic. Directly providing multimedia services on the TDMA system, though possible, results in severe degradation in system capacity [8]. Thus, providing quality-guaranteed service in a wireless network needs new design and functionality in the MAC layer.

Another competing technology, code division multiple access (CDMA), uses a totally different paradigm to share the radio resource among different users. In a CDMA system, the whole spectrum is used as the carrying band for all users at all times. Channelization is achieved by assigning different spreading codes to users. Each information bit is spread into the baseband before transmission. The receiving side despreads the multiple occurrences of the baseband signal back into the original bit. The interference caused by simultaneous transmission is reduced by the spreading/despreading process. For this purpose, the spreading codes are selected such that their cross-correlation is small.

CDMA technology is proven superior to FDMA and TDMA in many aspects [9]. For example, the multi-path signals that plague TDMA can be exploited in CDMA to increase the overall signal quality by using a multi-finger Rake receiver. Several advantages are listed here: First of all, the capacity of CDMA is much higher than in FDMA and TDMA systems, and the same frequency band can be reused by the neighboring cells. Secondly,


channels in CDMA are much more secure, since the user's data are automatically encrypted during the spreading/de-spreading process. Thirdly, and most importantly for multimedia traffic, is its capability of offering different levels of BER and data rate by manipulating the spreading/despreading parameter, i.e., a dynamic spreading factor. Therefore, different types of traffic can co-exist in the system with minimal impact on each other. Based on the above considerations, we choose WCDMA as the potential wireless solution to provide multimedia service and as the direction for further research.

In this dissertation, we investigate effective protocol design with dynamic spreading factors such that various QoS levels based on different traffic types can be provided. Increasing the spreading factor benefits the system because it increases the desired signal strength linearly. The measured bit error rate can be reduced 75 times with a long spreading factor. Taking advantage of this benefit, we propose middleware solutions to monitor the network load and switch the spreading factors dynamically based on the current load of multimedia traffic. These middleware solutions are implemented in mobile and base stations, and experiments are performed to measure the actual system performance. The preliminary results indicated that our proposed system can always maintain the desired quality for all the voice connections. We further extended our protocol to guarantee balanced support among different traffic types. While the voice communication is still guaranteed to be non-interrupted, the data traffic is proved to be served with reasonable response time by our proposed system.

We further extend our dynamic spreading admission control scheme to a multi-cell situation by proposing a dynamic spreading enabled soft handoff framework. The processing time of the handoff request was analyzed. We found that the update process caused by handoff is the major component of the delay. Further investigation shows that the update associated with handoff might consume too much access channel


time and increase the delay of handoff, especially when handoff traffic is heavy. We therefore adopted a batch mechanism such that multiple handoff requests can be processed simultaneously. The average delay is reduced from 1.12 seconds to 800 ms at a heavy handoff rate.

We also proposed a new Handoff Mobile Resource Allocation algorithm (HMRA) to optimize the performance of the mobiles in the handoff area. The spreading factor and transmission power for the handoff mobiles are jointly considered to maximize the throughput; meanwhile the algorithm maintains the BER requirements for the handoff mobiles and the target cell. The original problem is formulated in a nonlinear programming format. We proposed a procedure to simplify it into a linearly constrained problem, which is solved by a revised simplex method. Numerical results show a 25% increase in throughput for WWW traffic, and a 26% improvement for the video traffic.

1.3 Dissertation Outline

The rest of this dissertation is organized as follows. Chapter 2 outlines related studies of high performance MPEG-2 video encoding and decoding. This research addressed different methods, including pure hardware based methods where a specialized hardware architecture is used, hybrid methods which build multimedia operations into the general processor, and software parallel solutions.

In Chapter 3, we describe the framework of the proposed pure software based parallel MPEG-2 decoder. A data-partition based parallel scheme is presented, in which the decoding of each video frame is subdivided among multiple slave nodes. The video frame is physically partitioned into several disjoint parts, and each part is decoded at a slave node. The performance results for the data-partition algorithm with different partition methods are presented.

Chapter 4 presents the data pipeline based parallel algorithm. This method attempts to exploit the frame structure of the MPEG-2 bitstream. In order to reduce


the communication overhead and eliminate the inter-frame decoding dependency, we use frame-level parallelization in the data pipeline scheme. The number of frames assigned to each slave node and the number of slave nodes become design issues. Our analytical model and experimental results show that the data pipeline can provide close-to-linear speedup.

Chapter 5 and Chapter 6 constitute the second part of this thesis. In Chapter 5, we present a novel media access control (MAC) protocol for the WCDMA wireless network. The proposed protocol is designed to handle multimedia traffic. We demonstrate how the new protocol can guarantee the BER requirement. A new multimedia scheduling algorithm is described to utilize the dynamic spreading capability. Chapter 6 describes how the work can be extended to a multi-cell environment. A new handoff procedure is studied which is designed to coordinate the change of spreading factor when a mobile moves into a new cell. The handoff delay is analyzed for different traffic types. The analysis shows that considerable delay could be caused by frequent handoffs. We thus use a batching process to reduce the number of unnecessary updates, which in turn reduces the handoff delay. The resource allocation for the handoff mobile is also discussed.

Chapter 7 summarizes the main contributions of this dissertation and presents some directions for my future research.
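As a concrete illustration of the spreading/despreading operation that the protocols of Chapters 5 and 6 build on, the following sketch spreads two users' bits with orthogonal codes, superposes the transmissions as the shared air medium would, and recovers each stream by correlation. This is my own illustrative Python, not code from the dissertation; the length-4 Walsh codes and the names spread/despread are assumptions.

```python
def spread(bits, code):
    # each +1/-1 data bit is replaced by (bit * chip) for every chip of the code
    return [b * c for b in bits for c in code]

def despread(chips, code):
    # correlate each code-length window with the code and take the sign
    sf = len(code)
    out = []
    for i in range(0, len(chips), sf):
        corr = sum(chips[i + j] * code[j] for j in range(sf))
        out.append(1 if corr > 0 else -1)
    return out

# Two users share the band with orthogonal length-4 (Walsh) spreading codes.
code_a, code_b = [1, 1, 1, 1], [1, -1, 1, -1]
bits_a, bits_b = [1, -1, 1], [-1, -1, 1]
channel = [ca + cb for ca, cb in zip(spread(bits_a, code_a),
                                     spread(bits_b, code_b))]
```

Here despread(channel, code_a) recovers bits_a, and likewise for user B: the cross-correlation of the two codes is zero, so each user's signal survives the superposition. A longer code (larger spreading factor) averages interference over more chips, which is the effect the dynamic-spreading protocol exploits.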


CHAPTER 2
MPEG-2 BACKGROUND AND RELATED STUDY

MPEG-2 [10] is one of the dominant digital video compression standards. Since its inception back in the early 90s, the standard has been widely accepted and implemented by various hardware and software solutions. The standard defined the syntax of a valid MPEG-2 bitstream, which covers a wide range of video quality (e.g., from the 352*240 QCIF video format to 1404*960 HDTV). Among various MPEG-2 applications, DVD (corresponding to the 720*480 resolution in the MPEG-2 family) is probably the most commercially successful. However, we expect that high quality video formats will become more desirable and will need to be supported in the near future. For example, HDTV and even higher resolution video is being proposed for the next generation of consumer electronics products. Therefore, a successful digital video solution should have good scalability performance. Specifically, the solution of interest should be able to be extended to, and provide satisfactory performance for, the high-end, high-quality video formats. Our proposed software based parallel MPEG-2 decoding scheme is evaluated with a set of video streams, from low-quality, low bit-rate to high-end, high-resolution videos. As will be shown in Chapter 4, our investigation indicates that with some necessary revisions to the memory management algorithm, the data-pipeline scheme can produce real-time decoding performance for high-end video streams.

In the MPEG-2 video encoding process, the raw video source is compressed by exploiting the spatial and temporal redundancy within the time-continuous video frames. Two key techniques used are the DCT transformation and motion compensation. To reduce the spatial redundancy, the video frame is segmented into 8x8 pixel blocks, followed by the 8x8 2-D DCT transformation. The DCT coefficients are then


quantized to reduce the number of representing bits. The quantization step is a non-reversible process where part of the pixel information is lost. Thus, the quantization table is designed to minimize the degradation of image visual quality. Studies in this field have shown that low frequency DCT coefficients should be assigned more bits than the high frequency part. Thus many of the high frequency elements will become zero after quantization. The quantized DCT coefficient matrix is serialized through a zig-zag scanning to concentrate the nonzero low-frequency DCT coefficients. Such a representation can be further compacted using a run-length encoding.

The MPEG-2 motion compensation technique further reduces the encoded bitrate. This is based on the observation that there is a lot of correlation between consecutive video frames. Using the concept of the Group of Pictures (GOP), video frames are encoded into three frame types: I-frame, P-frame and B-frame. Each GOP contains one I-frame, which is completely self-encoded without referring to other frames. The I-frame is presumably of the highest picture quality and serves as a reference frame for the P-frames and B-frames in the same GOP. The pixel blocks in the P-frames are encoded with forward motion prediction, where the reference frame is searched to provide the pixel block with the highest match as the prediction block, and only the residue is DCT-transformed and quantized. For the B-frame, two reference frames (I/P frames), one from the previous frames and another from the following ones, are used to perform a bi-directional motion prediction.

Thus the decompression should go through the following steps: First, the DCT coefficients and motion information are extracted from the run-length encoded bit-stream; this is followed by a de-quantization and an inverse DCT transform; then, if the block is predicted from other block data, a motion compensation needs to be done to form the recovered image; finally the dithering procedure maps the image into the suitable color space of a particular display system.
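The zig-zag serialization and run-length step described above can be sketched as follows. This is an illustrative Python reconstruction of the general technique, not the dissertation's code (real MPEG-2 entropy coding also Huffman-codes the run/value pairs); the function names and the sample block values are mine.

```python
def zigzag_order(n=8):
    # Scan diagonal by diagonal (d = i + j); odd diagonals run
    # top-to-bottom, even diagonals bottom-to-top.
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else -p[0]))

def run_length(coeffs):
    # encode as (zero-run, nonzero value) pairs, terminated by 'EOB'
    pairs, run = [], 0
    for v in coeffs:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append('EOB')
    return pairs

# A quantized 8x8 block where only three low-frequency coefficients survived.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0] = 12, 5, -3
scan = [block[i][j] for i, j in zigzag_order()]
```

run_length(scan) yields [(0, 12), (0, 5), (0, -3), 'EOB']: the zig-zag ordering gathers the nonzero low-frequency coefficients at the front of the sequence, so the 64 values compress to three pairs and an end-of-block marker.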


High-performance software decoding for high-quality MPEG-2 video is becoming increasingly desirable for a wide range of multimedia applications. In the past few years, significant progress has been made in microprocessor technology and software decoder optimization, bringing real-time software-based MPEG-2 decoding [11, 12] into reality. Lee [12] implemented a real-time MPEG-1 decoder by fine-tuning the Huffman decoding and the IDCT algorithm and flattening the code to reduce the cost of procedure call/return. She also used shift and add operations to replace multiplication, which can be executed more efficiently with PA-RISC multimedia instructions. With the built-in multi-ALU and super-scalar features of PA-RISC, the cost of some decoding procedures can be reduced by up to 4 times. More dedicated multimedia instructions were introduced in many processor architectures and were used to accelerate the MPEG-2 decoding process [13-15]. Bhargava et al. [13] conducted a complete evaluation of MMX technology for filtering, FFT, vector arithmetic and JPEG compression applications. Tung et al. [15] proposed an MMX-enhanced version of a software decoder by optimizing the IDCT and MC using Intel's MMX technology. They reported real-time MP@ML decoding; however, the decoder was not fully compliant with the MPEG-2 standards. In Zhou et al. [16], a theoretical computation analysis of MPEG decoding was presented and the authors suggested using VIS (Visual Instruction Set) to achieve real-time decoding. However, their analysis only took into account the fewest arithmetic operations during the decompression stage, and the implementation results were yet to be reported. There are some public domain and commercial software decoders capable of real-time DVD (MP@ML) quality decoding [11, 16]. These products should not be cataloged as pure-software solutions, since they are optimized with specific hardware multimedia support (e.g., Intel's MMX instruction set); thus they are still hybrid decoders. Another hardware-based approach takes advantage of a redundant DSP unit and a very wide internal


bus design, which make it possible to exploit instruction-level parallelism (such as VLIW); some of this work is reported in references [11, 17].

In summary, all of the solutions discussed above depend in differing degrees on some specific hardware features, either multimedia instructions in the CPU or a video display card. Thus these solutions are very difficult to port to different hardware platforms. Furthermore, these solutions usually cannot support the scalability features of high profile MPEG-2 video. For example, in most of these schemes, the 30 frames per second (fps) frame rate is used as the target performance. In some high-end profiles, however, a higher frame rate and resolution are required (e.g., the MP@HL progressive video format defines a spatial resolution of 1920*1152 at a sampling rate of 60-fps [10]).

On the other hand, without the support of dedicated hardware, the computation of MPEG-2 decompression can be very demanding, especially for the high profile/level streams. An MP@ML MPEG-2 stream with one base-layer video stream needs at least a 2-Mbps bit-rate (an average of 6-Mbps is used in DVD video). For some high-profile, high-quality MPEG-2 videos, up to 40-Mbps could be necessary to support MPEG-2 extension layers (with MPEG-2 scalability features). The extension layers provide additional information that can be used in the decoding procedure to produce more vivid video quality. They also require additional computation in the decompression procedure. The decoding of an extension layer has to go through a procedure similar to that of the base layer; thus the decoding time of a scalable MPEG-2 video stream will increase proportionally to the number of layers. For a stream with two scalability features (temporal scalability and SNR scalability) at the spatial resolution 1920*1152 (an image size roughly six times that of MP@ML quality video), the required computation is roughly 3*6 = 18 times that of MP@ML MPEG-2 video decoding. This huge computation requirement makes real-time playback very difficult for a software-only decoder, even with


the fastest CPU available. Hence, functionality-based or data-based parallelization should be considered to boost the decoding performance.

Various parallel schemes have been studied for video encoding in the literature [18-20]. In Akramullah et al. [19], a data-parallel MPEG-2 encoder is implemented on an Intel Paragon platform, reporting real-time MPEG-2 encoding. Gong and Rowe [20] proposed a coarse-grained parallel version of an MPEG-1 encoder. A close-to-linear speed-up was observed. Comparable work on a parallel MPEG-4 encoder has been reported [21]. The authors used different algorithms to schedule the encoding of multiple video streams over a cluster of workstations. A parallel MPEG-2 decoder based on a shared-memory SMP machine was reported [22]; however, the authors did not address how real-time decoding should be supported for the high-profile and high-level video source. In Ahmad et al. [18], the performance of a parallel MPEG encoder based on Intel's Paragon system was evaluated. In their work, the I/O node provides several raw video frames to the compute nodes in a round fashion, and each compute node performs the MPEG encoding, sends back the encoded bitstream, and requests more video frames from the I/O node. The I/O node serves the requests from the compute nodes in a FCFS manner, which may reduce the idle time in the I/O node. However, this scheme allows out-of-order arrival of encoded frame data, and the I/O node has to track the dynamics of the assigned frames for each compute node. This increases the complexity in the I/O node and requires much more buffer space to preserve the original sequence of video frames. In our case, since the master needs to display the decoded video in exact order, it must buffer all the decoded frames that arrive before their preceding video frames. The build-up of the frame buffer might exceed the physical memory on the master side if the same scheduling were used. Thus we use an in-order data distribution between the master node and the slave nodes.

Yung and Leung [23] implemented an H.261 encoder on an IBM SP2 system. Both spatial and temporal parallelization were investigated. Their results showed


that MB-level parallelism could provide a speed-up as high as frame/GOP-level parallelization. However, this claim does not hold for parallel decoding. Our results show that a data-partition scheme (MB level) provides only limited performance gain as the number of slave nodes increases. Furthermore, the degree of scalability depends highly on the communication bandwidth, which was not fully explored in their study. By carefully investigating the impact of the master/slave communication pattern, we found that the communication pattern in parallel decoding can be more complicated than in the encoding case, and it eventually becomes the key factor determining peak system performance.
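The rough computation estimate given earlier in this chapter (a 1920*1152 stream with two extension layers costing roughly 3 x 6 = 18 times an MP@ML decode) can be sketched as a back-of-envelope calculation. The function below is our own illustration, not part of any decoder; it assumes decoding cost scales linearly with pixel count and with the number of layers:

```python
def decode_cost_multiplier(width, height, layers,
                           base_width=720, base_height=480):
    """Rough relative decoding cost versus an MP@ML base-layer stream.

    Assumes cost scales linearly with the pixel count and with the
    number of layers (base + extensions), as argued in the text.
    """
    pixel_ratio = (width * height) / (base_width * base_height)
    return layers * pixel_ratio

# 1920*1152 with base + temporal + SNR scalability layers:
multiplier = decode_cost_multiplier(1920, 1152, 3)
```

Under these assumptions the multiplier comes out in the high teens, consistent with the "roughly 18 times" figure quoted above.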


CHAPTER 3
DATA-PARTITION PARALLEL SCHEME
Our initial investigation focused on the data-partition scheme. We consider data partition at the macroblock level, since the majority of MPEG-2 decompression computation is spent on block-level IDCT and motion compensation. This scheme allows low complexity in each computing node, since only part of the MPEG-2 decoding procedure needs to be implemented there. The data-partition scheme also has the advantage of potentially quick response to end-user commands. As explained in the rest of this chapter, a video frame is divided into several groups of macroblocks that are decompressed in the slave nodes simultaneously. Thus the system can respond very quickly (always in less than the decoding time of one frame for non-preemptive scheduling). In fact, the response time is inversely proportional to the parallel gain of the data-partition scheme. An important design factor in the data-partition algorithm is the partition granularity. It is possible to divide a frame at the pixel level; however, such a fine-grained scheme may introduce too much overhead. We believe that macroblock-level data partition is a natural choice for our initial investigation. In the following discussion, all partition algorithms are based on macroblock-level data decomposition.
To maximize decoding performance, the following issues should be considered:
• In general, the percentage of parallelized computing components determines the upper bound of the potential performance, which indicates that we need to parallelize as many decompression steps as possible. Our preliminary experiments show that the processing of macroblocks takes about 95% of the total computation. Thus, by adopting macroblock-level parallelization, we expect the majority of the computation to be parallelized.
• New overhead caused by the parallel scheme should be analyzed. Specifically, due to the inter-frame and intra-frame dependence, our data-partition scheme


will introduce additional communication cost. For example, motion compensation may require reference pixel blocks from a previous frame. Reducing this data communication cost becomes critical. As we will see later, different partition algorithms result in different amounts of additional reference data; finding the optimal partition that minimizes the communication overhead becomes a main challenge.

Figure 3.1: Parallel decoding architecture (master node and slave nodes on a switched server-side LAN, receiving the video stream from a remote server over an ATM backbone)

Figure 3.1 depicts the system architecture of our parallel decoder. The system consists of a master node, which receives the MPEG-2 encoded video stream from a remote video server, and several slave nodes that perform the decompression. To distribute the decompression workload to the slave nodes, the master node splits the bitstream into frames. Each frame, as a set of macroblocks, is further divided into N parts for the N slave nodes. These N sets of raw data, together with the necessary reference picture data, are sent to the slave nodes for decompression. The master node waits until all slave nodes finish their decoding jobs and send the decoded macroblocks back to the master node. The decoded macroblocks are then merged into one complete image for display. The majority of the MPEG-2 decompression steps (e.g., IDCT, motion compensation, dithering) are performed at the slave nodes. The master node's main duties are bitstream parsing and subtask distribution for the slave nodes. These procedures are illustrated by the following pseudo-code:
(1) Control algorithm in the master node. The master node receives the bitstream from the server, extracts frame data, partitions it into subtasks packed with


reference frame data, and sends them to the slave nodes. It then waits for the decompressed data.

Procedure Master_Control
  Initialize internal buffers and start slave nodes
  WHILE (there is more data in the input bit-stream from the server)
    Rbuff = buffer for the current frame data
    Perform a partition over Rbuff, resulting in a set of N sub-task buffers:
      P1, P2, ..., PN   /* refer to the partition algorithms in the next section */
    Update the new reference data Ref_i for each slave
    FOR (i = 1 to N)
      Send P_i and reference data Ref_i to slave i
    ENDFOR
    Wait to receive the decoded macroblocks from each slave node
    Combine the partial frames into the whole picture
    IF (the current frame is on the top of the display buffer)
      Call the display routine
    ENDIF
  ENDWHILE
EndProcedure

(2) System-level algorithm of the slave node for the data-partition scheme. Slave nodes receive their portion of the bitstream, perform the MPEG-2 decoding procedure, and send the decompressed macroblocks back to the master node.

Procedure Slave_Decode
  Initialize internal buffers and set up the session decoding parameters (from the master node)
  WHILE (no terminate signal from the master node)
    Receive raw data from the master
    /* perform MPEG-2 decoding over the given portion, including: */
    Inverse run-length decoding to restore DCT coefficients / motion vectors
    Inverse DCT
    Perform motion compensation
    Perform dithering
    Send the decoded macroblocks to the master
  ENDWHILE
EndProcedure
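The interaction between the two procedures above can be condensed into a minimal single-process sketch. This is our own illustration: the stub functions stand in for real bitstream parsing and macroblock decoding, and the network transfers are replaced by direct calls:

```python
def parse_frame(bitstream):
    """Stub for bitstream parsing: pop one frame's worth of data."""
    return bitstream.pop(0) if bitstream else None

def partition_frame(frame, n):
    """Stub partition: split a frame (a list of macroblocks) into n parts."""
    size = (len(frame) + n - 1) // n
    return [frame[i * size:(i + 1) * size] for i in range(n)]

def slave_decode(part):
    """Stub for the slave-side IDCT / motion compensation / dithering."""
    return [mb.upper() for mb in part]   # stand-in transformation

def master_control(bitstream, n_slaves):
    """Mirror of Master_Control: partition, farm out, merge, in order."""
    decoded_frames = []
    while True:
        frame = parse_frame(bitstream)
        if frame is None:
            break
        parts = partition_frame(frame, n_slaves)
        decoded = [slave_decode(p) for p in parts]   # concurrent in reality
        decoded_frames.append([mb for part in decoded for mb in part])
    return decoded_frames
```

The sequential list comprehension marked "concurrent in reality" is where the real system overlaps the N slave decompressions; the in-order merge corresponds to the display-buffer check in the pseudo-code.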


3.1 Performance Model of Data-Partition Scheme
The performance of our data-partition scheme can be represented by the frame decompression time T_f(P) with respect to a given partition P. T_f(P) depends on (1) the computation time in the slave nodes, T_c; (2) the transmission time of the inter-processor communication, T_t; and (3) the housekeeping computation in the master node, T_m. Since the decompression in the slave nodes and the communication can be performed simultaneously, T_f is dominated by the longer computation path. We have

    T_f = T_m + max{T_c, T_t}    (3.1)

Table 3.1: Notation used for decoding modeling with the data-partition scheme

  H         number of macroblocks of a frame in the vertical dimension
  W         number of macroblocks of a frame in the horizontal dimension
  F         = {m_{i,j} : 0 <= i < H, 0 <= j < W}, the macroblocks of a frame
  K         number of slave nodes
  P         a partition of F into K subsets
  S_{P,k}   the subset of macroblocks assigned to the kth slave node under P
  T_k       per-macroblock decompression time of the kth slave node
  C_P(k)    data exchanged between the master and the kth slave node under P
  R_P(k)    reference data sent to the kth slave node under P
  B         communication bandwidth between the master and the slave nodes

The per-macroblock time T_k depends on the system architecture

and CPU power (i.e., CPU clock rate). Therefore, from the view of the master node, the slowest slave node determines the overall decompression time.
The communication costs between the master and the slave nodes include the transmission of raw data and decoded data. Aside from raw data, the master node must send reference data to the slave nodes, because of the motion-prediction technique used in MPEG-2.[1] Given a frame partition P, the frame decompression time can thus be expressed as

    T_f(P) = T_m + max{ max_k(|S_{P,k}| * T_k), (sum_k C_P(k)) / B }    (3.2)

The decompression time is thus a joint effect of the system hardware parameters (T_k, B, etc.) and the partition P (both S_{P,k} and C_P(k) are functions of P). Given a hardware environment, an optimal partition should be found to minimize the decompression time. In other words, the optimal partition minimizes T_f over all feasible partitions:

    P* = argmin_{P in {all feasible partitions}} T_f(P)    (3.3)

It can be shown that, to minimize the first term inside the max of equation (3.2), we should assign the subtasks S_{P,k} in proportion to the processing power of the kth slave node. For example, in a two-slave-node situation where T_1 : T_2 = 1 : 4 (i.e., slave 1 decodes a macroblock four times faster), the optimal partitions should satisfy |S_1| : |S_2| = 4 : 1, which minimizes max_k(|S_k| * T_k). However, partitions with this property might not minimize the second term of (3.2). For example, the second term is always minimal when all macroblocks are assigned to one slave node; this partition, unfortunately, results in the longest overall computation time. The trade-off between computation and communication makes it nontrivial to find the optimal (slave number, partition) pair. Before we further analyze equation (3.2), we should establish a direct relation between the data communication C_P(k) and the partition P.

[1] Macroblocks of P- and B-frames may be encoded using motion prediction. They are predicted from macroblocks in the search area of a previous I- or P-frame. The search area is defined as a rectangle centered at the predicted macroblock.
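The model of equations (3.1)-(3.2) can be evaluated directly. The following sketch uses hypothetical parameter names of our own; it only illustrates how the slowest-slave computation term and the communication term compete inside the max:

```python
def frame_decode_time(subtask_sizes, per_mb_times, comm_bits,
                      bandwidth_bps, master_time):
    """T_f(P) per equation (3.2): master-node housekeeping plus the
    longer of (a) the slowest slave's computation |S_k| * T_k and
    (b) the total master<->slave communication time sum(C_P(k)) / B.
    """
    slave_time = max(n * t for n, t in zip(subtask_sizes, per_mb_times))
    comm_time = comm_bits / bandwidth_bps
    return master_time + max(slave_time, comm_time)

# Two equal slaves, 600 macroblocks each at 0.1 ms per macroblock,
# 4 Mbits of traffic over a 100-Mbps link, 5 ms of master overhead:
t_f = frame_decode_time([600, 600], [1e-4, 1e-4], 4e6, 100e6, 0.005)
```

In this hypothetical configuration the slave computation (60 ms) still dominates the communication (40 ms), so adding bandwidth would not help until the partition count grows.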


3.2 Communication Analysis in the Data-Partition Parallel Scheme
For a given partition P, the communication between the kth slave node and the master, C_P(k), can be divided into three parts. (1) The raw data from the master node to the kth slave node, approximately |S_{P,k}| * s_m bits (assuming that a macroblock in the MPEG-2 bitstream requires an average of s_m bits). (2) Because of motion compensation, MPEG-2 has a strong inter-frame data dependency: the encoding of P- and B-frames requires other (previous) I- or P-frames as references. This is the very reason for the additional inter-node data communication. We use R_P(k) to represent the reference data at the kth slave node under partition P. (3) The decoded macroblocks from the kth slave node require |S_{P,k}| * 16^2 * 8 bits, assuming one byte per pixel in a 16-by-16 macroblock.
Therefore the communication time (the second term of equation (3.2)) can be rewritten as

    T_{C_P} = sum_{k in {1..K}} [ |S_{P,k}| * s_m + |S_{P,k}| * 16^2 * 8 + R_P(k) ] / B    (3.4)

Notice that sum_k |S_{P,k}| * s_m is exactly the size of one compressed MPEG-2 frame, and sum_k |S_{P,k}| * 16^2 * 8 is the size of a decoded frame. We have

    T_{C_P} = [ S_e + S_d + sum_{k in {1..K}} R_P(k) ] / B    (3.5)

where S_e and S_d represent the compressed frame size and the original picture size, respectively. Given an MPEG-2 video stream, the average S_e is obtained from the encoded bit rate, and S_d follows from the image horizontal and vertical sizes encoded in the bitstream. The total reference area at the kth slave node, R_P(k), is the union of the reference areas of each macroblock in S_P(k). According to MPEG-2, the predicting macroblock is selected from a search window, defined as a square area centered at the target macroblock. The dimension of the search area is usually fixed throughout a given video title. Let s be the search window size predefined at encoding time; the reference area for one macroblock is (16 + s) * (16 + s) = 16^2 + 32s + s^2


pixels. The difference between R_P(k) of the current frame and S_P(k) of the referenced frame has to be transmitted from the master to the slave node.
To simplify our discussion, we assume a static partition scheme, where the partition is determined in the initial phase and fixed throughout the decoding session. The fixed partition in fact provides a chance to reduce R_P(k): since the same area of each frame is decompressed in the same slave node, the difference between R_P(k) and S_P(k) can be reduced significantly. An upper bound of R_P(k) is given by

    R_P(k) <= |S_P(k)| * (s^2 + 32s)    (3.6)

With a static partition, the reference area only needs to be transferred once per GOP (Group of Pictures), since it can be reused by the successive B-frames (not separated by an I- or P-frame). Therefore the transmitted reference data can be further reduced. Assume that the system GOP pattern is I : P : B = 1 : b : c, so there are (1 + b + c) frames in one GOP. The referenced frames can only be I- and P-frames; the required reference transmission ratio is therefore alpha = (1 + b) / (1 + b + c), and the amount of reference data should be averaged over a GOP. T_{C_P} is revised as

    T_{C_P} = [ S_e + S_d + alpha * sum_{k in {1..K}} R_P(k) ] / B    (3.7)

Substituting equation (3.6) into (3.7), the total communication is bounded by

    T_{C_P} <= [ S_e + S_d + alpha * (s^2 + 32s) * sum_{k in {1..K}} |S_P(k)| ] / B
            =  [ S_e + S_d + alpha * (s^2 + 32s) * H * W ] / B

3.3 Partition Example and Communication Calculation
Our first partition example (P1) results from the Horizontal-Vertical (HV) partition algorithm, which is defined by the following recursive procedure: if the previous partition operation was performed along the horizontal dimension, then we partition


the image vertically this time, and vice versa. If the partition number K is even, we separate the working area into two parts of equal size along the dimension decided above; otherwise, the whole area is partitioned into two pieces with an area ratio of 1 : (K - 1). For each of the two resulting parts, we perform the above steps recursively.
By applying the HV-Partition algorithm with input K = 4 (i.e., a 4-way partition), we obtain our first partition P1, shown in Figure 3.2. P1 separates the original picture into 4 adjacent areas: A, B, C and D. The reference area R_{P1}(k) consists of the belt-shaped area along the internal edges; the width of the belt is the search window size s. A simple way to calculate the total reference area is to add the lengths of all internal edges.

Figure 3.2: Horizontal-vertical partition into four pieces A, B, C and D

For a 720*480 picture, each part contains two internal edges, with a total length of 360 + 240 = 600 pixels. The total internal-edge length for this partition is thus 4 * 600 = 2400 pixels. Each pixel contains 3 color components (8 bits each). With a search window size of 32, the total amount of reference data is R_{P1} = 4 * (600 * 32 * 8 * 3) bits, approximately 1.8 Mbits. For a GOP structure of IBBPBBPBB,


alpha = 1/3. The total communication time under an 80-Mbps (sustained) network bandwidth is

    T_{C_{P1}} = [ S_e + S_d + alpha * sum_k R_P(k) ] / B = (0.5 + 2.76 + 0.6) / 80 = 48 msec

Figure 3.3: Another way to partition a picture into four parts ((a) an alternate 4-piece partition; (b) the reference area of piece A)

An alternate partition is shown in Figure 3.3(a). There are 8 internal edge segments, separating the picture into four identical pieces. The reference area of the upper-left piece is shown as the gray area in Figure 3.3(b). Using the same calculation method as in the example above, the total reference data increases to 3600 * 32 * 8 * 3 bits, approximately 2.8 Mbits. Compared to the previous partition, this partition requires 50% more reference data, and the total communication cost increases to 4.19 Mbits/frame.

3.4 Communication Minimization
To obtain the minimum communication cost, the best partition must be found, given the image shape and the number of slave nodes. One approach is brute-force search. However, the size of the search space is prohibitively large even if we enforce equal-size partitions. The number of possible partitions can be counted as follows. Let H * W be the number of macroblocks of a frame and let there be K slave nodes. The problem is the same as putting H * W objects into K boxes. This is equivalent to choosing X = H*W/K objects


from the H * W objects, followed by choosing another X from what is left, until no more remain. The total number of combinations is

    C(H*W, X) * C(H*W - X, X) * ... * C(2X, X) * C(X, X) = (H*W)! / (X!)^K = (H*W)! / ((H*W/K)!)^K

For a 1K-by-1K picture, there are 4096 macroblocks. When partitioned onto four slave nodes, the number of partitions is on the order of 10^690. For K = 8, it is about 10^1922; for K = 16, this number easily exceeds 10^3155. The size of the search space also increases exponentially with picture size. To avoid this explosive search space, heuristic algorithms are desired.
The partition problem can also be cast as an instance of the quadratic assignment problem, a classical combinatorial optimization problem that is provably NP-hard. Using the notation in Table 3.1, an equal-size partition implies that, for the K subsets S_{P,k} of frame F, the following properties hold:

    |S_{P,i}| = |S_{P,j}|,  S_{P,i} intersect S_{P,j} = empty, and union_i S_{P,i} = F, for 0 <= i, j < K

With 0-1 assignment variables X_{i,j}, the problem is to

    minimize  sum_{i,j} sum_{k,h} r_{i,j} * d_{k,h} * X_{i,k} * X_{j,h}    (3.10)
    subject to  sum_j X_{i,j} = 1 for each macroblock i, and sum_i X_{i,j} = H*W/K for each slave j    (3.11)

Here X_{i,j} = 1 when macroblock i is assigned to the jth slave node. Equation (3.11) states that each slave node is assigned H*W/K macroblocks and that each macroblock is assigned to exactly one slave node. d_{k,h} is the transmission time to send a unit of decoded data from the hth slave node to the kth slave node, or vice versa. We assume that the communication bandwidth between any pair of nodes is identical (i.e., a homogeneous environment). r_{i,j} represents how much data in the ith macroblock happens to be reference data for the jth macroblock. This is determined by the relative positions of the two macroblocks in the image. Clearly r_{i,j} is symmetric; it is in fact the intersection between s_i (the search window area of macroblock m_i) and m_j. When s <= 16, r_{i,j} is defined by

    r_{i,j} = 0      if i = j
              16*s   if m_i and m_j are neighbors
              s^2    if m_i and m_j intersect at a corner
              0      otherwise

3.5 Heuristic Data Partition Algorithm
For general QAP problems, genetic algorithms and simulated annealing (SA) algorithms have been studied intensively [24, 25]. Unfortunately, when applied to our problem, these methods fail to produce satisfactory performance. The reference data obtained from the SA algorithm with input sizes of 100, 300 and 500 macroblocks are 8448, 35712 and 56448 pixels, respectively, and the SA algorithm takes 1 minute, 30 minutes and 5 hours, respectively. Using the HV partition algorithm on the same input sizes, the costs are 7680, 15360 and 17280 pixels, respectively, with running times on the order of seconds. The HV algorithm takes advantage of the nature of the problem by grouping macroblocks in their natural layout. Furthermore, we propose a more dynamic partition algorithm, named Quick-Partition, that exploits more information about the shape of the partitioned area:


1. Given input (K, hd, wd, S), where K is the number of partitions to create, S is the rectangular area to be partitioned, hd is the length of S in the horizontal dimension, and wd is the length of S in the vertical dimension.
2. If K is less than 2, no further partition is required.
3. Calculate k1 = floor(K/2) and k2 = ceil(K/2) as the relative areas of the two smaller rectangles to be generated from S.
4. If hd >= wd, find a vertical line that separates the rectangle S into two sub-rectangles with an area ratio of k1 : k2. Otherwise, a horizontal line is found similarly.
5. Calculate the width and height of the two sub-rectangles; name them S1 and S2.
6. For each of S1 and S2, perform the same procedure from step 1.

Figure 3.4: Performance comparisons of different partition algorithms (amount of reference data per picture vs. number of partitions for the Horizontal, Vertical, HV and Quick partition algorithms; search window size 32 pixels, picture size 720*480)

Figure 3.4 shows the amount of reference data as a function of the partition number K for the different partition algorithms. We compared the performance of four algorithms: Horizontal-Partition, Vertical-Partition, HV-Partition and our


Quick-Partition algorithm. The Horizontal-Partition always divides the frame along the horizontal dimension, while the Vertical-Partition partitions with vertical lines. The HV algorithm partitions the frame with interleaved horizontal and vertical lines. These algorithms were tested on a 720*480 picture with the search window size set to 32. For the Horizontal-Partition algorithm, the reference data increases linearly as the number of partitions K increases. A similar (linearly increasing) R(K) trend is observed for the Vertical-Partition algorithm; the slope of its curve, however, is smaller, because the width of a video frame is larger than its height. At K = 4, the Vertical-Partition algorithm results in about 2.8 Mbits/frame of reference data, which is 33% less than that of the Horizontal-Partition; the same percentage of reduction holds for other values of K. The HV-Partition algorithm generates a much better result in terms of reducing the reference data. For partition numbers K = 4, 8, 16 and 32, the corresponding R(K) are 1.8432, 4.055, 5.5296 and 9.9533 Mbits/frame, respectively. Compared with the Horizontal or the Vertical partition, there is a remarkable reduction in R(K), especially for high partition numbers. For example, the reductions in reference data (compared with the Vertical-Partition) at the above points are 16.67%, 21.43%, 50% and 56.45%, respectively. Our Quick-Partition algorithm produces the least reference data of the four algorithms. In general, Quick-Partition performs either equally to or better than the HV algorithm; its resulting reference data for K = 2, 8 and 32 is 33.33%, 18.18% and 14.81% less than that of HV-Partition, respectively.

3.6 Experimental Results
The performance of the data-partition scheme is verified by experiments over two hardware/software configurations:
• 100-Mbps LAN: a 100-Mbps Ethernet, where each node is equipped with a Pentium II 450-MHz CPU and 128 MB of memory. The operating system is Red Hat Linux 6.2.


Figure 3.5: Decoding rate under a 100-Mbps network (expected vs. actual frame rates for the Vertical-Partition and Quick-Partition algorithms; test stream 720*480, search window 32)

• 680-Mbps SMP: this environment is a symmetric multi-processor server machine. Each processor is a 248-MHz UltraSPARC. The actual inter-node communication bandwidth can reach up to 680 Mbps, based on our measurements. This machine is used to emulate a Gigabit local network.

3.6.1 Performance over a 100-Mbps Network
The 100-Mbps network has a total of 16 slave nodes equipped with Pentium 450-MHz processors. Each node is capable of decoding at 10 fps (with compiler optimization). This configuration is widely available in many organizations and companies. The performance of our partition algorithms is depicted in Figure 3.5: Figure 3.5(a) shows the performance when the Vertical-Partition algorithm is used, and Figure 3.5(b) that of the Quick-Partition algorithm. The following observations can be made:
• For small values of K, both partition algorithms produce close-to-linear performance improvements. For the Vertical-Partition algorithm, the predicted decompression rates for K = 1 and 2 are 8 fps and 14 fps, respectively. The actual frame rates are 7 and 12 fps, very close to our expectation.
• The Quick-Partition algorithm yields results similar to the Vertical-Partition when K is small (K <= 3); the predicted and observed decompression rates are nearly identical. This is because, when the number of partitions is small, the slave decompression time still dominates the decoding cost; the difference in reference data between the two partition algorithms is overwhelmed by the decompression time at the slave nodes.
• At K = 3, the cost of communication begins to play a determinant role. The benefit of the reduced node-decoding time is canceled by the increase in communication time (mainly caused by the reference data). The decoding rate at K = 3 is only 14 fps, just 16% higher than that at K = 2.


Figure 3.6: Decoding rate under the 680-Mbps environment (expected vs. measured frame rates for the HV-Partition and Quick-Partition algorithms; test stream 720*480, search window 32)

• K = 3 represents the peak decompression rate of this configuration. Further increasing K degrades the performance. For the Vertical-Partition algorithm, we observed 12, 9 and 6 fps for K = 4, 8 and 16, respectively. The performance degradation of the Quick-Partition algorithm is not as severe: its decompression frame rates are 16, 13 and 11 fps, respectively. The performance improvement is limited because the network is not fast enough.

3.6.2 Performance over a 680-Mbps Environment
This experiment was performed on a 15-node SMP server. Each node contains an UltraSPARC 248-MHz CPU; used alone, each CPU can decode at 5 fps. The inter-node communication bandwidth is up to 680 Mbps. This environment is used to emulate a Gigabit network. The performance results for the HV-Partition and Quick-Partition algorithms are presented in Figure 3.6.
It is observed that the overall performance of our data-partition parallel scheme improved significantly even though the processing power of the individual nodes decreased. Detailed observations follow:
• When using HV-Partition, we observe a close-to-linear increase in the decompression rate until K = 8. The predicted frame rates are 4.8, 9.3, 16.7 and 29 fps for K = 1, 2, 4 and 8, respectively. The actual frame rates are only slightly lower than the predicted values (the difference is less than 10%). When K is in the range of 8 to 14, the decompression rate keeps increasing, but more slowly. For example, when the number of slave nodes doubles


from 7 to 14, the actual decompression rate increases only from 23 fps to 30 fps (a 30% performance gain).
• For the Quick-Partition algorithm, we observed considerable performance improvement. The close-to-linear speedup is sustained until K = 16, where a theoretical peak decompression rate of 44 fps is expected. The actual decoding rates match our model closely: for K = 1, 2, 4, 8 and 14 they are 4.6, 9.0, 15, 25 and 40.4 fps, respectively.
• At K = 16, the decoding performance reaches its peak for the Quick-Partition and Vertical-Partition algorithms, and the overall system performance starts going down. Again, this is because the communication cost of the reference data overwhelms the system's communication capacity. However, due to the lack of more processing nodes in our SMP server, we were unable to verify the performance trend after K = 14.
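The partition machinery of this chapter can be sketched compactly. The following is our own illustration of the Quick-Partition recursion (Section 3.5) together with the internal-edge reference-data accounting of Section 3.3; the helper names and the rounding of cut positions are our choices:

```python
def quick_partition(k, x, y, w, h, out=None):
    """Quick-Partition sketch: recursively split the rectangle
    (x, y, w, h) into k parts, cutting across the longer dimension
    with an area ratio of floor(k/2) : ceil(k/2) at each step."""
    if out is None:
        out = []
    if k < 2:
        out.append((x, y, w, h))
        return out
    k1, k2 = k // 2, k - k // 2
    if w >= h:                       # vertical cut line
        w1 = round(w * k1 / k)
        quick_partition(k1, x, y, w1, h, out)
        quick_partition(k2, x + w1, y, w - w1, h, out)
    else:                            # horizontal cut line
        h1 = round(h * k1 / k)
        quick_partition(k1, x, y, w, h1, out)
        quick_partition(k2, x, y + h1, w, h - h1, out)
    return out

def internal_edge_pixels(rects, width, height):
    """Total internal-edge length, counted once per piece (i.e., both
    sides of every shared edge), as in the Section 3.3 example."""
    total = 0
    for x, y, w, h in rects:
        if x > 0:
            total += h
        if y > 0:
            total += w
        if x + w < width:
            total += h
        if y + h < height:
            total += w
    return total

def reference_mbits(rects, width, height, s=32, bytes_per_pixel=3):
    """Reference data (Mbits) for the belt of width s along the
    internal edges: edge length * s * 8 bits * color components."""
    return internal_edge_pixels(rects, width, height) * s * 8 * bytes_per_pixel / 1e6
```

For K = 4 on a 720*480 picture this recursion produces four 360*240 quadrants, whose 2400 pixels of internal edges reproduce the roughly 1.8-Mbit reference-data figure computed in Section 3.3.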


CHAPTER 4
PIPELINE PARALLEL SCHEME
The major overhead of the data-partition scheme lies in the master node, where the compressed video stream must be parsed down to the macroblock level. This portion of computation becomes significant as the number of slave nodes increases, preventing further improvement of the decoding rate. Unfortunately, multiple master nodes (as used by Yung et al. [23]) will not work, since the bitstream is VLC-encoded and highly auto-correlated. An alternate scheme for further performance enhancement relies on enlarging the decoding unit, by distributing a block of frames to each workstation.

4.1 Design Issues of the Pipeline Scheme
Several design issues must be addressed in order to obtain a high parallel processing gain.
• The idle time at the slave nodes should be minimized. A slave node would have to wait if it could not receive a new block of frames from the master node once the decompression of the previous block is finished. For this purpose, the master node is designed to feed new data to the other slave nodes during the computation time of a particular slave node, so the waiting time at the slave nodes is minimized. Coupled with this concern is load balancing between slave nodes: a successful parallel scheme should avoid overwhelming some nodes while starving others. Our experimental results show that the standard deviation of CPU usage across the slave nodes is low, dropping below 0.03 when the block size is set to one GOP. We believe this will hold regardless of the video content being tested, as long as the underlying slave nodes are homogeneous. A more adaptive task scheduling would be favored in a heterogeneous environment; there, a static assignment of block size based on the CPU speed of each individual slave node should be sufficient. We do not discuss this dimension further for the sake of space.
• The workload in the master node should be minimized. In fact, the master node represents the only sequential link in the pipeline scheme. The master node must perform MPEG-2 header parsing so that the frame data can be extracted for each slave node. By developing a quick bitstream parser that


scans only the GOP start codes and picture start codes, the computation of this procedure is made negligible compared to the other MPEG-2 decoding steps.
• An important design factor is the proper block size for each subtask. Should the master node deliver 2 frames of raw data to a slave node each time? How about 5 frames, or 10? We have found that a proper block size does influence overall system performance significantly. In fact, the overall system communication is mainly determined by the block size, due to the inter-frame data dependency and the I-P-B frame structure of the encoded bit-stream. For example, a slave node that decodes a P-frame will need an I-frame as the reference. If the block size is not chosen carefully, the referenced I-frame can be in another node, and one full decoded I-frame must then be transmitted to the target node. Notice that this also introduces additional delay, which can significantly affect the pipeline efficiency. In the worst case, a previously decoded I-frame may be required by all slave nodes, if the following P- or B-frames happen to be distributed over all of them.
• Another important design issue is the determination of the optimal number of slave nodes. This is particularly important in a practical environment. The question can be put as follows: given a certain hardware configuration, how many nodes does the system need in order to reach peak performance? We have found that the network bandwidth is one physical limit. Given a certain network, we are able to predict and verify the maximum number of slave nodes that delivers the best decoding rate; further increases in the number of slave nodes will not have a positive effect on the decoding rate.
In the following discussion, we present the pseudo-code of our proposed parallel algorithms. A detailed analysis of the communication overhead and a performance prediction will be addressed later. For simplicity, a homogeneous environment is assumed, where the slave nodes have identical computing capability.
Algorithm 1 (Master Node): For each of the slave nodes, the master node extracts a block of frames from the incoming bit-stream and delivers it to that slave node. This procedure repeats until the end of the video session. D is the block size; N is the number of slave nodes.
Procedure Pipeline_Master(D)
  Initialize internal buffers and start slave nodes
  rnd = 1   /* set rnd to 1 for the first round */
  FOR (j = 1 to N)   /* pipeline initialization: fill the slave nodes with raw data to start the pipeline */
    Send a block of D frames to the jth


slave node for decoding
  ENDFOR
  WHILE (there is data in the incoming buffer)
    FOR (j = 1 to N)
      Receive decompressed data, frames ((rnd-1)*N*D + (j-1)*D) through ((rnd-1)*N*D + j*D), from the jth slave
      Prepare a block of raw data F: extract frames (rnd*N*D + (j-1)*D) through (rnd*N*D + j*D) from the incoming buffer
      R = the reference I-frame(s) needed
      Send F and R to the jth slave
    ENDFOR
    rnd = rnd + 1
  ENDWHILE
EndProcedure

Algorithm 1 depicts the control flow on the master side. Before entering stable pipeline operation, the system runs a start-up procedure to establish the pipeline. The master node then enters the round operation represented by the FOR-loop block. The interaction between the master node and the slave nodes in each round is synchronized to maintain a continuous pipeline. During each normal round, the master node exchanges data with the slave nodes in a round-robin manner. For each of the N slave nodes, the master node first receives the D frames of the previous round that were decompressed at that slave node; it then sends a block of D compressed frames to the slave node. While the master moves on to serve the next one, the slave node receives the new data and performs the decompression simultaneously.

Algorithm 2 (Slave Node): Slave nodes receive a block of D frames at a time from the master node and decompress it.

Procedure Pipeline_Slave()
  Initialize internal buffers; receive the decoding parameters and the D, N values from the master
  WHILE (no terminate signal from the master node)


    Receive D frames of raw data from the master
    /* perform MPEG-2 decoding for the given frame data, including: */
    FOR (each of the D frames)
      Run-length decoding to restore DCT coefficients and motion vectors
      Perform inverse DCT
      Perform motion compensation
      Perform dithering
    ENDFOR
    Send the decoded frames to the master
  ENDWHILE
EndProcedure

As mentioned earlier, it is possible that the required reference frames (i.e., I- and/or P-frames) are not available locally. These extra reference frames must then be transmitted to the proper nodes, in addition to the raw video block. How much extra transmission do these reference frames require? What are the optimal values of D and N such that these extra transmissions are minimized? The following example demonstrates the need for a proper D value. We assume that the video is encoded with a fixed GOP (Group of Pictures) length L and a fixed frame pattern within a GOP (e.g., I:P:B = 1:2:12). The following sequence depicts the GOP pattern: G = {I1, B1, B2, B3, B4, P1, B5, B6, B7, B8, P2, B9, B10, B11, B12}.
• If D = 1, with two slave nodes (N = 2), the first slave node has a frame pattern G2_1 = {I1, B2, B4, B5, B7, P2, B10, B12}, and the second slave node has the pattern G2_2 = {B1, B3, P1, B6, B8, B9, B11}. From G2_1, there are the following data dependencies:
  - B2 and B4 need I1 and P1 as reference frames;
  - B5 and B7 require frames P1 and P2;
  - P2 needs P1;
  - B10 and B12 need P2 and I2.
Taking the union of all required reference frames (i.e., I1, P1, P2 and I2) and eliminating those residing locally (i.e., I1 and P2), the remaining set (i.e., P1 and I2) consists of the extra reference frames that must be transmitted to the first slave node.
• By continuing the same procedure, the extra reference frames for slave node 2 are I1 and P2.


• Thus, the total communication overhead for D = 1 with N = 2 consists of 4 extra frames (about 2 Mbits for our test video file).
Similarly, we can calculate the extra reference frames for other values of N. For example, when N = 3, the frame patterns at the three slave nodes are G3_1 = {I1, B3, B5, B8, B10}, G3_2 = {B1, B4, B6, P2, B11} and G3_3 = {B2, P1, B7, B9, B12}. The extra reference frames are {P1, P2} for node 1, {I1, P1, I2} for node 2, and {I1, P2, I2} for node 3. The total communication overhead now increases to 8 extra frames (about 4.5 Mbits) with the D = 1, N = 3 configuration.

Figure 4.1: Communication overhead with different (D, N) combinations (extra reference data per picture vs. number of nodes, for block sizes D = 1, 2, 4, 8 and 16; GOP = 1:2:12)

Figure 4.1 depicts the calculated amount of extra communication as the values of N and D increase. We use the same GOP frame pattern of 1:2:12 as in the example above. The frame size is 1024*1024 with one byte per pixel, which results in one megabyte per frame. The purpose of this analysis is to identify the performance trend of the extra communication. Several trends can be observed:
• When D = 1 and N is relatively small, the extra communication increases rapidly with N. It causes about 2.6 Mbits per frame of extra communication when 2 nodes are used. Doubling the value of N from 2 to 4 will


increase the amount of extra communication from 2.6 Mbits to 7.3 Mbits (nearly a factor of 2.8). An additional 64% of extra communication (i.e., from 7.3 Mbits to 12 Mbits) is expected when the number of nodes increases to 8. However, when the number of nodes increases from 8 to 16, the extra communication is expected to saturate at 14 Mbits per frame (a 16.7% increase). We believe the peak overhead on the extra communication is reached because the GOP pattern repeats itself among these 16 nodes. When the number of nodes is greater than the number of frames within a GOP, only a small variation in the communication overhead is expected.
• When D = 2, a similar performance trend is expected. We expect a communication overhead of 2.6 Mbits, 5.7 Mbits and 6.8 Mbits per frame when N is 2, 4 and 8; the traffic increases are thus 120% and 20%. Less communication overhead is expected, and saturation occurs earlier compared to the D = 1 case. This is plausible, since a larger D provides the opportunity to share the same reference frames within the same node, which is impossible when D = 1. For instance, a node that is assigned two consecutive B-frames needs only one extra I- and P-frame.
• This trend is further depicted for D = 4, 8 and 16. Especially when D = 16, the minimum communication overhead can be approached. Furthermore, the communication cost remains unchanged as the number of slave nodes increases, because each node is now assigned a block consisting of a whole GOP and possibly the I-frame from the next GOP. Therefore, the extra communication for other reference frames is expected to be minimal.
• Given a fixed D, the slope of the curve decreases as N becomes large; in fact, the curve always converges to a constant value when N is large enough.
• Increasing D can affect the extra communication significantly. While keeping the value of N fixed, increasing D results in a reduction of the extra communication cost.
For N = 8, a 44% reduction is achieved if the value of D changes from 1 to 2. A further increase of D to 4 and 8 produces an additional 46% and 56% performance gain, respectively. Increasing the value of D means a bigger granularity, so some B- and/or P-type frames can share the same reference frames within one slave node. Therefore, in terms of reducing the communication overhead, increasing the D value is quite effective.

The above examples are based on the GOP structure where I:P:B = 1:2:12. For other GOP structures, a similar communication analysis can be conducted. For instance, in the case of I:P:B = 1:4:10, we have more P-frames and fewer B-frames. The communication cost may decrease due to this change, since each B-frame needs two reference frames while each P-frame needs only one. Though interesting, it is beyond the scope of this work to further discuss how the GOP structure affects the total communication overhead. Nevertheless, the general trends of communication with respect to the block size D will hold even if a different GOP setting is


assumed. Specifically, when the block size equals the length of a GOP, a minimal communication overhead can be obtained in all cases.

4.2 Performance Analysis

To analyze the pipeline performance, we focus in particular on several events where the master and slave nodes synchronize with each other. To simplify the discussion, we assume a homogeneous environment where each node is identical. As described in the previous sections, the system behavior can be discussed with the round concept, observed from the master's point of view. Put simply, a round is the period of time the master node experiences between two successive "pollings" of a particular slave node. The round time is determined by the decompression time of the slave node in question or by the master node's interaction with the other slave nodes, whichever is longer. With the homogeneity assumption, we can expect each round (cycle) to take about the same time for each slave node after the pipeline runs into a stable state. Thus, tracing the events that happen in one round is enough to demonstrate the overall system behavior. To describe the timing relations between different events, Table 4.1 lists the notations to be used. Note that the following analysis of the system timing is applicable to both cluster and SMP environments; that is, a 10-processor SMP machine with 1-Gbps internal data-communication bandwidth is treated the same as a 10-workstation cluster with a Gigabit network. As an example, Figure 4.2 illustrates the events of the i-th round in the two-slave-node configuration. Our pipeline design puts the majority of the computation workload on the slave nodes. Nevertheless, a certain part of the computation has to be done in the master node, such as bit-stream reading, parsing, and segmentation. Our experiments show that this part represents only a tiny portion of the overall decoding time (less than 5% of the total time). In the following performance analysis, we use a fixed value c to include this cost.


Table 4.1: Notations used to describe system events for the pipeline scheme

  T^M_{i,1}     time when the master node starts to receive decompressed frames in round i
  T^M_{i,2}     time when the master node finishes round i
  T^{Sk}_{i,1}  time when slave k starts to decompress in round i
  T^{Sk}_{i,2}  time when slave k finishes the decompression of round i
  T_single      average time needed for a single node to decode a frame
  T_sm          time for data transmission from slave to master each round
  T_ms          time for data transmission from master to slave each round
  T_round       the round time for a complete decoding cycle
  AS_f          average frame size in an MPEG-2 bit-stream
  R(D)          the amount of reference-frame data, w.r.t. a given block size D
  c             a small constant representing the miscellaneous software cost
  T_p           the overall decoding time for an MPEG-2 stream
  x             the number of frames in an MPEG-2 stream

We define T^M_{i,1} as the beginning of the i-th round on the master side, when it starts to receive the results from the first slave. T^M_{i,2} is defined as the end of the i-th round. Two conditions must hold before the master can enter the next round: (1) the master side has polled all slave nodes for this round, and (2) the first slave node has completed decoding for this round. This is shown in Equation (4.1),

  T^M_{i,1} = max{ T^{S1}_{i-1,2}, T^M_{i-1,2} }          (4.1)

Here i = 1, 2, ... is the round counter and N is the number of slaves. The master's activity in the i-th round finishes after it sends a block of raw frames to the N-th slave node, which is also the last slave. Thus T^M_{i,2} and T^{SN}_{i,1} occur simultaneously:

  T^M_{i,2} = T^{SN}_{i,1}          (4.2)

The time-stamps at the slave side are depicted in Equations (4.3) to (4.6) as follows:

  T^{S1}_{i,1} = T^M_{i,1} + T_sm + T_ms + c          (4.3)


Figure 4.2: Events in one round with two slave nodes

At T^{S1}_{i,1}, the first slave can start its decompression work immediately after it receives a complete data set from the master. The delay is represented in (4.3) as the communication delay (T_sm + T_ms for the mutual data exchange) and other software overhead (2c, assuming the same cost on both sides).

The decompression at slave node 1 needs D * T_single seconds on average for D frames (Equation (4.4)):

  T^{S1}_{i,2} = T^{S1}_{i,1} + D * T_single          (4.4)

For the k-th slave node (1 < k <= N), the start time is offset from that of the preceding slave by the master's interaction with it, and the finish time follows the same pattern as Equation (4.4):

  T^{Sk}_{i,1} = T^{S(k-1)}_{i,1} + T_sm + T_ms + 2c          (4.5)

  T^{Sk}_{i,2} = T^{Sk}_{i,1} + D * T_single          (4.6)

Equations (4.7) and (4.8) show the communication cost between the master and slave nodes, which is determined by the frame size h*v, the network bandwidth B, and the extra communication of reference frames R(D). The average compressed frame size AS_f can vary according to different encoding parameters. We assume AS_f = Frame_Size / r, where r is the compression ratio (in our test stream, r = 10).

  T_sm = D * h * v * 8 / B          (4.7)

  T_ms = (D * AS_f + R(D)) / B          (4.8)

By substituting T^{S1}_{i-1,2} in (4.1) with (4.4), we have (T^{S1}_{i-1,1} + D * T_single). Then T^{S1}_{i-1,1} can be replaced by (T^M_{i-1,1} + T_sm + T_ms + c). Thus T^{S1}_{i-1,2} in (4.1) can be replaced by (T^M_{i-1,1} + D * T_single + T_ms + T_sm + c). Similarly, T^M_{i-1,2} in (4.1) is first replaced by (4.2), which is T^{SN}_{i-1,1}. Then, by applying (4.5) k times, T^M_{i-1,2} is finally represented in terms of T^M_{i-1,1}:

  T^M_{i,1} = max{ T^M_{i-1,1} + D * T_single + T_ms + T_sm + c,
                   T^M_{i-1,1} + N * (T_sm + T_ms + 2c) }

Therefore, the round time becomes

  T_round = T^M_{i,1} - T^M_{i-1,1}
          = max{ D * T_single + T_ms + T_sm + 2c,  N * (T_sm + T_ms + 2c) }

It is now clear that the round time T_round is a joint effect of the single-slave decompression time, the total per-round communication time, and other system overhead. The total run time for an x-frame video stream can be expressed as:

  T_p = T_start + (x / (N * D) - 1) * T_round
      = T_start + (x / (N * D) - 1) * max{ N * (T_sm + T_ms + 2c),
                                           D * T_single + T_sm + T_ms + 2c }

The start-up time T_start is the latency of the system in accepting a user's request. It corresponds to the time period from the master's receiving of the first byte to the first frame being displayed. The pipeline runs (number-of-rounds * T_round) seconds.


Here the number of rounds is calculated as (x / (N * D) - 1), where x is the total number of frames. The average frame rate of decompression (FRD) can be estimated as

  FRD = x / T_p = x / (T_start + (x / (N * D) - 1) * T_round)          (4.9)

Equation (4.9) predicts the expected decoding frame rate. In order to have a clear view of the system performance with respect to D and N, we further assume that the video sequence is long enough (i.e., x is large). Thus the start-up time T_start is negligible compared to the total decoding time. With (4.7) and (4.8), we are able to rewrite T_round:

  T_round = max{ N * ((D * AS_f + R(D) + D * h * v * 8) / B + 2c),
                 D * T_single + (D * AS_f + R(D) + D * h * v * 8) / B + 2c }
          = (D * AS_f + R(D) + D * h * v * 8) / B + 2c
            + max{ D * T_single,
                   (N - 1) * ((D * AS_f + R(D) + D * h * v * 8) / B + 2c) }
          = (D * (1 + 1/r) * h * v * 8 + R(D)) / B + 2c
            + max{ D * T_single,
                   (N - 1) * ((D * (1 + 1/r) * h * v * 8 + R(D)) / B + 2c) }          (4.10)

Further observation reveals that the round time T_round is a nonlinear function of the block size D and the slave number N. Both (N - 1) * ((D * (1 + 1/r) * h * v * 8 + R(D)) / B + 2c) and D * T_single increase when D and N increase. However, with a fixed D, the former still increases with N while the latter does not change accordingly. When using fewer slave nodes (i.e., small N), the round time T_round is dominated by the single-node decoding time T_single. This is expressed by Equation (4.11):

  D * T_single > (N - 1) * ((D * (1 + 1/r) * h * v * 8 + R(D)) / B + 2c)          (4.11)

By substituting (4.11) into (4.10), the round time is further simplified as:

  T_round = (D * (1 + 1/r) * h * v * 8 + R(D)) / B + 2c + D * T_single          (4.12)
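Equations (4.9) through (4.12) can be turned into a small calculator. This is a sketch of the analytical model only; the default parameter values below (720*480 frames, r = 10, T_single = 0.24 s, c = 5 ms, R(D) = 0) are illustrative, not measurements.

```python
def round_time(D, N, B, h=720, v=480, r=10.0, T_single=0.24, c=0.005, R=0.0):
    """Equation (4.10): round time of the pipeline, in seconds."""
    # Per-round data exchange for one slave: compressed frames in,
    # raw frames out, plus the 2c software overhead.
    comm = (D * (1 + 1.0 / r) * h * v * 8 + R) / B + 2 * c
    # Decompression overlaps with the master polling the other N-1
    # slaves; the longer of the two dominates the round.
    return comm + max(D * T_single, (N - 1) * comm)

def frd(D, N, B, x=3000, T_start=0.0, **kw):
    """Equation (4.9): frame rate of decompression for an x-frame stream."""
    return x / (T_start + (x / (N * D) - 1.0) * round_time(D, N, B, **kw))
```

With these defaults the sketch reproduces the qualitative behavior of Figure 4.3: near-linear speed-up at small N over a 100-Mbps network, flattening once communication dominates.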


Figure 4.3: Expected performance of the pipeline decompression scheme, with D equal to one GOP (picture size 480*720, GOP = 15, T_single = 0.24 sec; FRD vs. number of processors for B = 10, 100, 155, 622, and 1000 Mbps)

With the simplified T_round, we can predict the system performance FRD. Figure 4.3 demonstrates the theoretical frame rate with respect to N for different network bandwidths. These network bandwidths include 10-Mbps and 100-Mbps switched Ethernet, 155-Mbps ATM OC-3, 622-Mbps ATM OC-12, and Gigabit Ethernet. We notice that the estimated FRD initially relates linearly to N. For example,

- Using 10-Mbps switched Ethernet, our model predicts the FRD should increase from 3 fps to 6 fps when N is increased from 1 to 2.

- Similarly, with 100-Mbps switched Ethernet, the system expects the FRD to reach 36 fps with N = 8.

- We also expect that, with 155-Mbps ATM OC-3, 55 fps can be reached by using 16 slave nodes.

- When the network bandwidth is more than 622 Mbps, the system expects an almost-linear improvement, which can possibly reach hundreds of frames per second.

However, the improvement is expected to be of much less significance beyond the above thresholds for these different networks. We realize that, when N keeps increasing, the communication time will eventually dominate the round time. When


Equation (4.11) no longer holds, we have

  D * T_single < (N - 1) * ((D * (1 + 1/r) * h * v * 8 + R(D)) / B + 2c)          (4.13)

and T_round changes to

  T_round = ((D * (1 + 1/r) * h * v * 8 + R(D)) / B + 2c)
            + (N - 1) * ((D * (1 + 1/r) * h * v * 8 + R(D)) / B + 2c)
          = N * ((D * (1 + 1/r) * h * v * 8 + R(D)) / B + 2c)

Now the round time no longer relates to the single-slave decoding time (T_single). Accordingly, the frame rate becomes

  FRD ~ x / ((x / (N * D) - 1) * N * ((D * (1 + 1/r) * h * v * 8 + R(D)) / B + 2c))
      ~ x / ((x / D) * ((D * (1 + 1/r) * h * v * 8 + R(D)) / B + 2c))
      = B / (8 * h * v * (1 + 1/r) + (R(D) + 2c * B) / D)

Therefore, the frame rate will not increase even if more slave nodes are deployed (i.e., the number of slave nodes N does not appear in the above approximation). The system has reached its saturation point.

Due to the pipelined operation and the relatively heavy subtask in the slave nodes, the data-pipeline scheme is expected to have a longer start-up latency T_start than the data-partition scheme. In the data-pipeline scheme, a long waiting time is required to establish the system pipeline. According to the control algorithm in the master node, the following must be accomplished before the first frame can be displayed: (1) the master node has sent out D blocks of compressed frame data to each of the N slave nodes, (2) the first slave node has finished the decompression of its assigned frames (i.e., from the first frame to the D-th frame), and (3) the bitmaps of the decoded D frames have been sent back to the master node. Therefore we have:

  T_start = max{ N * (T_sm + T_ms + 2c),  D * T_single + T_sm + T_ms + 2c }
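The start-up expression above can be evaluated directly; the following is an illustrative sketch using Equations (4.7) and (4.8), with 720*480 frames, r = 10, and T_single = 0.2 s as assumed parameter values.

```python
def t_start(D, N, B, h=720, v=480, r=10.0, T_single=0.2, c=0.0, R=0.0):
    """Start-up latency of the pipeline: the longer of (a) the master
    serving all N slaves once and (b) the first slave's decode plus its
    own data exchange."""
    T_sm = D * h * v * 8 / B            # raw frames, slave -> master (4.7)
    T_ms = (D * h * v * 8 / r + R) / B  # compressed frames, master -> slave (4.8)
    return max(N * (T_sm + T_ms + 2 * c),
               D * T_single + T_sm + T_ms + 2 * c)
```

With D = 15 over a 100-Mbps network, the sketch yields a latency of roughly 3.5 seconds, matching the example discussed next.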


It is clear that the latency is also a function of (D, N) and other system parameters. Before the saturation point, it is determined by the decompression time for D frames. Assuming D = 15, T_single = 0.2 seconds, and a network bandwidth of 100 Mbps, we expect a 3.5-second delay (for 720*480 video). When saturated, T_start increases as N increases. This is caused by the enlarged duration of the pipeline cycle. The cycle time increases by T_sm + T_ms + 2c seconds for each of the extra nodes beyond the saturation point. For the 100-Mbps network, this is about 0.48 seconds. For the 1000-Mbps network, we expect 0.05 seconds for each additional node.

With a more sophisticated design of the master and slave control algorithms, we can reduce the latency to (T_single + T_ms + T_sm). While the master node is distributing raw data to one slave node, it can receive decompressed image data from another slave node. When the pipeline is stabilized, the slave nodes run at different phases of the decompression, thus maximizing the parallel gain.

4.3 Experimental Results

In order to evaluate the data-pipeline scheme and verify the correctness of the performance model, experimental results were collected from different hardware/software configurations. The testing environment of the data-partition scheme was reused, and we also added a 100-Mbps LAN equipped with low-end client nodes for a more complete comparison. The experimental results indicate that our proposed system delivers a close-to-linear speed-up when high-speed networks are used. For example, with a 100-Mbps switched Ethernet, a 30-fps decompression is accomplished with 6 PCs (the corresponding speed-up is about 5). For higher display rates, we have observed up to 73 fps using a Sun SMP server with a network bandwidth of 680 Mbps.


4.4 Experiment Design

Table 4.2 lists the hardware/software configurations of the testing environments. The three test cases are labeled NT100M, LINUX100M, and SMP680M, respectively. Here, NIC indicates the speed of the network interface card, which might differ from the speed of the network switch (as in the case of NT100M). From our experience, the major factor determining the scalability of parallel decompression is the achieved network bandwidth. The achieved bandwidth is less than the peak bandwidth specified by the network or NIC hardware. Note that the achieved network bandwidth can be affected by the CPU speed, the NIC, and the operating system. Thus, this observation indicates an end-to-end, application-level measurement combining all of the above factors.

Table 4.2: Experimental configurations and environments

  Parameter            NT100M         LINUX100M           SMP680M
  Processor            Pentium        Pentium II          15 UltraSparc
                       133 MHz        450 MHz             248 MHz
  Memory               64 MByte       128 MByte           2 GByte
  NIC                  10/100 Mbps    10/100 Mbps         10/100 Mbps
  Network switch       100 Mbps       100 Mbps            100 Mbps
  Operating system     NT 4.0         Red Hat Linux 6.2   Sun 5.6 SMP
  Achieved end-to-end
  bandwidth            80 Mbps        90 Mbps             680 Mbps
  T_single (sec)       0.58           0.18                0.21

The 100-Mbps switched Ethernets (i.e., NT100M and LINUX100M) provide a basic environment for our experiments. LINUX100M delivers a network efficiency of 90%. For the Sun SMP server, the actual communication between two processors is performed at the system-bus level. The equivalent bandwidth is much higher than that of the network interface. This explains why the achieved bandwidth (i.e., 680 Mbps) in the SMP system is much higher than its NIC speed. Note also that the CPU speed affects T_single to a significant degree. A fast CPU results in a short


T_single. For instance, a 450-MHz Pentium II reduces the decompression time (within a slave node) from 0.58 seconds (133-MHz Pentium) to 0.18 seconds.

Our testing video stream is an MP@ML MPEG-2 bit-stream with 720*480 image resolution, encoded at an average bit rate of 6 Mbps. Our parallel decoder is a revised version based on a public-domain sequential MPEG-2 encoder/decoder [26]. We only investigate the decoder part in this chapter. The inter-node communication is implemented with the synchronous MPI (Message Passing Interface) protocol. Although asynchronous message passing could be used to improve the network efficiency, the potential benefit may be marginal since our experiments are performed in a clean environment.

We measure the following performance metrics in each experiment:

- FRD: the actual frame rate of decompression, which is compared to the expected frame rate from our analytical model.

- Speed-up: a scalability measurement of the achieved FRD when the number of slave nodes is increased.

- CPU usage: the utilization of the CPU in each slave node.

4.5 Performance over a 100-Mbps Network

Figure 4.4 shows the performance results for the NT100M environment. Note that the slave nodes are equipped with 133-MHz Pentiums. Used alone, such a node is only capable of decoding at 2 fps with the sequential decoder. With our proposed software solution, we have demonstrated that the decompression rate can be scaled to at least 15 fps. The 15-fps performance usually produces an animation-like quality in which the video is continuous (instead of the 2-fps slide-show quality). A detailed analysis of the performance results follows:

- Our experimental results in Figure 4.4 show a close-to-linear improvement of the decompression rate as the number of slave nodes increases. For one slave node, we observed 1.8 fps. With two slave nodes, a 3.5-fps decompression rate is measured (a speed-up of 1.94). A further increase to 4 slave nodes results in a 6-fps decoding rate (about 3.33 times higher). At 8 nodes, the system can deliver a throughput of 14 fps, which represents a 7.78 speed-up.


Figure 4.4: Pipeline decoding experiment on a cluster of Pentium-133 PC workstations with 100-Mbps fast switched Ethernet (NT cluster, GOP = 15, B = 100 Mbps; expected vs. actual frame rate)

- The actual system performance conforms closely with the analytical model. For 1, 2, 4, and 8 slave nodes, there are only 11%, 12.5%, 14%, and 6% differences between the expected and observed decompression rates. The small differences between the analytical and experimental results indicate a great consistency in the system behavior before the saturation point (as we pointed out in the last section).

The expected frame rate from our model is 35 fps, which can be approached with 24 slave nodes according to our prediction. Due to the lack of sufficient slave nodes in the NT100M platform, we were not able to verify the behavior beyond this performance saturation point (e.g., N = 16 or 24). Fortunately, another set of experiments performed on a cluster of Linux machines (i.e., the LINUX100M environment) gives us more insight.

Within the LINUX100M configuration, each node is a Pentium II 450-MHz PC. The single-node decoding speed is 5 fps. The highest frame rate predicted under this cluster environment is 36 fps, at an 8-slave-node configuration. Detailed results are discussed as follows:

- Again, we observed that the decoding rate increases close-to-linearly when the number of nodes grows from one to five. For 1 slave node, we have 4.2 fps. Two


Figure 4.5: Pipeline decoding experiment on a Pentium II PC Linux cluster with 100-Mbps fast switched Ethernet (LIN100M, GOP = 15, B = 100 Mbps; expected vs. actual frame rate)

slave nodes produce 8.3 fps, which is a 97% improvement. Further increasing to 4 slave nodes brings the frame rate to 21.5 fps, nearly 4 times higher than the one-slave-node case.

- (Saturation point) The system reaches its peak performance at 7 slave nodes, with a frame rate of 30 fps. The frame rate then becomes flat, with only a tiny fluctuation around 30 fps. In fact, the performance gain becomes insignificant after 5 slave nodes. The saturation point comes earlier than expected (e.g., the analytical model predicts the system saturation point at 8 slave nodes).

- Nevertheless, the observed results still show an acceptable match with the analytical model. For one and two slave nodes, the mismatches are less than 5%. After the saturation point, the observed throughput is about 14% less than expected (e.g., the analytical model predicts 35 fps, while the experimental result is 30 fps after saturation).

The LINUX100M configuration, along with our proposed parallel software decoder, represents a reasonable solution for providing real-time MPEG-2 decompression. The achieved quality can be as good as that of any hardware-based solution, since it reaches up to 30 fps. However, if the system is to support flexible user interactions (e.g., fast-motion display) or 60-fps HDTV video quality, decompression at a higher-than-30-fps rate is required.

Due to the network bandwidth limitation, LINUX100M is not capable of providing this flexibility. Therefore, we investigated the performance over the 680-Mbps


Sun SMP platform. Since we used MPI as part of the communication mechanism, our proposed software solution did not require any modification for the SMP environment. Note that this kind of high bandwidth is not easily achieved in today's networking environment. Early results on Gigabit Ethernet have just started revealing similar performance in the range of 600 Mbps over high-end servers [27].

4.6 Performance over a 680-Mbps SMP Environment

Figure 4.6 shows our high-performance result using an SMP machine with 15 UltraSparc 248-MHz CPUs. As mentioned in Table 4.2, the inter-process communication is 680 Mbps. We tested three video titles with different content: Tennis, Flower, and Mobl. These three videos are encoded with the same encoding parameters to allow a fair comparison. They are chosen to represent different degrees of image complexity and object motion. Nevertheless, our experiments are quite consistent for all the testing streams. The difference in decoding performance for these video titles is almost negligible. Part of the reason for this performance consistency comes from the high encoded bit rate. With a 6-Mbps encoding bit rate, few macroblocks are skipped; thus the numbers of macroblocks decoded for the 3 video streams are very close to each other. This is not the case when the encoding bit rate is small, where many macroblocks are skipped during encoding. In the following we only discuss the performance for Tennis. We have the following observations:

- A 30-fps real-time decompression rate is achieved at 7 slave nodes. With 13 slave nodes, the system is able to provide 60-fps HDTV quality. The highest measured system performance is 68 fps, using 14 slave nodes out of the total of 15 physical CPUs.

- The actual frame rate also indicates a close-to-linear speed-up. Starting from 5 fps at one slave node, we observe 10 fps at 2 slave nodes, implying a 100% speed-up. The speed-ups for 4, 8, and 14 slave nodes are 400%, 780%, and 1360%, respectively. This indicates that our pipeline scheme scales well for a high-demand video-decompression scenario. After 14 slave nodes, the decompression rate becomes flat (not shown in the plot). Further increasing the number of slave nodes could not bring more performance gain, simply because all of the 14 CPUs have been fully loaded.
- The system shows precise behavior as predicted by our analytical model. The deviation of the decompression rate is controlled within a 10% error range.


Figure 4.6: Pipeline decoding experiment on a SUN eclipse server with 680-Mbps sustained bandwidth (SUN SMP, B = 680 Mbps, GOP = 15; expected and actual frame rates for Tennis, Flower, and Mobl)

According to our analytical model, the system should saturate at 55 slave nodes, which should possibly provide a 270-fps decompression rate.

Table 4.3 shows the CPU usage for the master and slave nodes. In the SMP machine, the slave processes are dynamically scheduled onto 14 physical processors by the operating system. Thus the statistics are collected directly from the master and slave nodes. It is observed that the slave nodes keep a high CPU utilization throughout all the experiments. On average, 90% of the CPU time in the slave nodes is used in the user space for computation, and the rest is spent on communication and miscellaneous system cost. The system also runs in a highly balanced manner, and the standard deviation of the slave-node load is very low (less than 7%). The waiting time in the slave nodes is also kept at a low level, ranging between 5% and 8%.

Unlike the case of the data-partition scheme, where the overhead in the master node becomes significant at high system configurations, the data-pipeline scheme has a low master-node complexity. The computation in the master node is kept low (15% at two slave nodes), and increases only slightly when more slave nodes are adopted (refer to the Master Load row in Table 4.3). This indicates that the computation overhead


in the master node is not significant. At the 14-slave-node configuration, there is still 70% idle time in the master node, indicating that the master node is not yet saturated. Further performance improvement could be obtained if more than 14 slave nodes were used.

Table 4.3: CPU utilization of the master and slave nodes

                          Number of slave nodes
  Parameter           2     3     4     6     8     10    12    14
  Slave CPU load     92%   89%   90%   91%   90%   90%   90%   91%
  (average)
  STD of slave load   7%    6%    6%    5%    6%    4%    3%    4%
  (average)
  Slave waiting       5%    5%    6%    7%    7%    6%    8%    8%
  (average)
  Master load        15%   18%   20%   22%   25%   28%   30%   32%

The improvement of the decoding performance is further verified by the cumulative system CPU utilization. With two slave nodes, 10% of the total computing power of the 15-node SMP machine is used. The number becomes 16% with 3 slave nodes, and increases by about 7% for each additional slave node afterward. With the full configuration of 14 slave nodes, 90% of the overall processing power is used by the parallel decoder, and it remains at this level when further increasing the number of slave processes.

4.7 Towards High-Resolution MPEG-2 Video

The achieved frame rates for the low- and main-level MPEG-2 video are very close to our prediction. However, the scalability results for the high-resolution MPEG-2 videos are not satisfactory. In Figure 4.7, the decoding rates for (1404*960) MPEG-2 files are illustrated. Starting with 2 fps at the single-node configuration, a linear increase can be observed. However, the decoding performance of "flower" suddenly dropped to 2.5 fps at 10 slave nodes, and continued deteriorating


with a small rebound at 11 slave nodes. For "tennis" and "calendar", a similar performance degradation is observed at 12 slave nodes, right after the peak performance point. Similar performance degradation is observed for the (1024*1024) video format.

Figure 4.7: Decoding frame rate for 1404x960 (calendar, flower, tennis, and the model prediction)

In order to identify the system bottleneck that causes the degradation of the decoding performance for high-resolution video, we recorded the utilization of system resources during the decoding process. The evidence from the system runtime statistics can be collected from the CPU-time distribution and the number of page faults to support this unique observation.(1) Figure 4.8 illustrates the measured number of page faults versus the number of slave nodes, and Figure 4.9 the CPU statistics. For the sake of brevity, we only present the results for "tennis".

For the 352x240 video, the number of page faults remains virtually unchanged, and is kept at a low level (1010 page faults/frame). Increasing the video resolution to

(1) The CPU utilization is obtained by the system call time() in the UNIX system, and the page faults are recorded by a utility process truss spawned by the decoding processes.


704x480 is reflected in a rise of the page-fault number; a fourfold jump is observed. Nevertheless, the 704x480 case still shows a flat curve as the number of slave nodes increases, indicating that the system is running steadily. For the 1024x1024 video, the number of page faults increases considerably. It is noticeable that the page faults increase significantly at 10 to 12 slave nodes, reaching 3500 page faults per frame. Compared to the decoding performance in Figure 4.7(b), the period with high page faults coincides with the collapse of the decoding rate. This indicates that the excessive page faults had driven the system into a thrashing state. The page-fault behavior of the 1404x960 video shows the same pattern as the 1024x1024 case.

Figure 4.8: Page faults vs. number of slave nodes (352*240, 704*480, 1024*1024, and 1404*960)

The excessive increase in page faults is also indicated by the CPU usage. With one slave node, 90% of the system time is idle, 8% of the CPU time is used in the user space, and the remaining 2% goes to other system maintenance. When increasing the slave nodes, the user-space time increases proportionally, and the system idle time decreases. After 8 slave nodes, however, both the system idle time and the user-space time drop significantly, while the system overhead shows a major increase. About 90% of


the CPU time is used by the operating system, while the user space occupies only 5% of the CPU time. Recalling that the page-fault number increases suddenly at 9 slave nodes (see Figure 4.8), we conclude that the system spends most of its CPU time swapping pages in and out, which drops the decoding performance.

Figure 4.9: CPU usage for tennis40 (user space, system idle, and OS processing)

Realizing that page faults are directly related to the shortage of system memory, we believe that the buffer management of the parallel decoder should be investigated. A non-optimized buffer scheme will aggravate the competition between user processes (e.g., our communication and decompression software) and system processes (e.g., the demand-paging mechanisms of the OS). Because of the shortage of overall memory, the system process will generate a significant number of page faults, which in turn slows down the decompression speed due to the lack of CPU.


The memory requirements for the master node and the slave nodes can be expressed as:

  M_m = m_c + m_streambuffer + m_outbuffer + m_inbuffer
  M_s = m_c + m_compressedbuffer + m_transmissionbuffer
      = m_c + m_outbuffer + 1.5 * m_inbuffer

Here m_c is the size of the executable code for the master node, about 500 KB. m_streambuffer is the streaming buffer that receives the compressed video packets from the video server; we currently fix it at 1 MB. m_outbuffer and m_inbuffer are dedicated to information exchange in the parallel decoding. m_outbuffer equals one GOP of MPEG-2 compressed frames, and m_inbuffer needs to accommodate two GOPs of decompressed frames (one GOP for displaying and another for the incoming traffic).

For the test stream tennis40 (1404*960), the corresponding memory requirements on the master/slave sides are M_m = 42 MB and M_s = 30.8 MB. The accumulated buffering space grows quickly when using a large-scale slave-node configuration, which causes unsatisfactory scalability when the number of slave nodes is large. For instance, let N be the number of slave nodes; the total memory requirement becomes

  M_t = M_m + N * M_s

Using the parameters of the testing MPEG-2 video, the total memory used can be estimated from the above equation. For the video tennis40, we need about 73 MB, 104 MB, 165 MB, 319 MB, and 381.5 MB, respectively, when the number of slave nodes is 1, 2, 4, 9, and 11. In the next section, we discuss several techniques to reduce the buffer space.

4.7.1 Efficient Buffering Schemes

It is noted that the slave node originally allocated a GOP-length frame buffer, which can be further optimized toward the minimal buffer space. However, due


to the decoding dependency inside the MPEG-2 video structure, we are not able to use only one frame buffer. To decode a B-frame, we need two reference frames and one decoding working frame, resulting in a total of 3 frames. With a careful redesign of the master-slave communication protocol, using a 3-frame transmission buffer on the slave side is possible, which we call the ST scheme. When the picture size is 1024*1024 and GOP = 15, we save about 12 MB of buffer space per slave node, about an 80% reduction on the slave side.

The minimum required frame buffer can be further decreased from 3 frames to 2 frames. For the I- or P-frames, we need one buffer for the prediction picture and another buffer for the working frame. The two buffers change their roles after decoding an I- or P-frame, so that the most recently decoded I- or P-frame is used as the prediction frame for the next P-frame. For the B-type frames, since a decoded B-frame will not be used as a reference frame, we can directly send the decoded blocks to the master node without storing them. The above discussion assumes that the reference frame for a current P-frame is always the last decoded P-frame, and the reference frames for a current B-frame are the last two P-frames. Nevertheless, this approach also works if the B- and P-frames always refer to the I-frame, with a corresponding change in the reference buffer.

With this scheme, the expected memory requirement becomes

  M'_t = M_m + N * M'_s = M_m + N * (m_c + 1.5 * 3 * m_frame + m_inbuffer)

To find the maximum number of slave nodes before exhausting the system memory for the 1404*960 single-layered MPEG-2 video, we solve (47 + (N - 1) * 6) < 300 MB (using 300 MB as a system threshold). This gives N = 43 slave nodes, and a higher-than-60-fps decoding rate can be expected.
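The memory model and the threshold calculation above can be sketched as follows; the constants are the figures quoted in the text for the 1404*960 stream (42 MB for the master and 30.8 MB per slave originally; 47 MB plus 6 MB per extra slave under the ST scheme, against the 300-MB threshold).

```python
def total_memory_mb(N, M_m=42.0, M_s=30.8):
    """Original scheme: M_t = M_m + N * M_s, in MB."""
    return M_m + N * M_s

def max_slaves_st(limit_mb=300.0, base_mb=47.0, per_slave_mb=6.0):
    """Largest N satisfying base + (N - 1) * per_slave < limit (ST scheme)."""
    N = 1
    while base_mb + N * per_slave_mb < limit_mb:
        N += 1
    return N
```

The original scheme reproduces the quoted totals (73 MB at one slave, 165 MB at four), and the ST scheme yields 43 slave nodes under the 300-MB threshold.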


4.7.2 Further Optimization in the Slave Nodes

It is further observed that the decoding procedure in the slave nodes might not use three frame buffers all the time. More specifically, an I-frame needs only one frame buffer, while a P-frame can be decoded with two frame buffers. Only a B-frame needs all three frame buffers. Thus the total amount of buffer can vary during the lifetime of the slave node. By allocating frame buffers dynamically according to the frame type, it can be expected that the total buffer can be significantly reduced for high-quality video.

This is particularly true when I- and P-frames represent a considerable portion of the frames. Let the ratio of I, P, B frames in a GOP structure be a:b:c; the effective buffer space for one layer is expressed by

  M = (1*a + 2*b + 3*c) / (a + b + c)

In a typical GOP structure of "IBBPBBPBBPBBPBB", we have a:b:c = 1:4:10. This results in an effective buffer number of 39/15 = 2.6, which is about 85% of the 3-frame buffer scheme.

The concept of dynamic buffer allocation can be applied inside the decoding of each frame. Since the decompression of each frame is based on a serial decompression of macroblocks, the overall buffer space can be reduced by dynamically allocating buffers for macroblocks. For example, when decoding the first macroblock, we only need to allocate a 16*16 block space. The buffers for the other macroblocks are assigned when they are needed. With this dynamic memory allocation, we expect an additional buffer reduction of 0.5 frame for the working frame. Notice that this scheme cannot reduce the amount of buffer for the reference frames, which must stay in the system during the decoding process. The effective buffer requirement becomes

  M = ((1 - 0.5)*a + (2 - 0.5)*b + (3 - 0.5)*c) / (a + b + c)

Using the same GOP structure as above, the effective frame number of the buffer in a slave node is 2.1, which is 70% of the 3-frame buffer scheme.
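The two effective-buffer averages above reduce to one formula; the `working_saving` parameter (0 for per-frame allocation, 0.5 for the per-macroblock scheme) is our naming for illustration.

```python
def effective_buffers(a, b, c, working_saving=0.0):
    """Average frame buffers per slave node for an I:P:B ratio of a:b:c.

    I-frames use 1 buffer, P-frames 2, B-frames 3; per-macroblock
    allocation of the working frame saves ~working_saving frames each.
    """
    total = ((1 - working_saving) * a
             + (2 - working_saving) * b
             + (3 - working_saving) * c)
    return total / (a + b + c)
```

For the 1:4:10 GOP this gives 2.6 frames with static allocation and 2.1 frames with the per-macroblock working-frame scheme, matching the figures above.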


The dynamic allocation of buffers in the slave node is an application-level memory-management scheme, which is closely embedded in the decoding process. The current implementation relies on system-provided routines (e.g., malloc and free). We speculate that a customized buffer-management routine (with direct access to the system memory) should be able to further increase the decoding performance, which will be discussed in our future research. The trade-off here is the additional CPU cost introduced by the dynamic memory management. For each macroblock, the additional cost includes at least two system calls (for memory allocation/deallocation) and some other miscellaneous operations. It has been shown that the cost associated with dynamic memory allocation is significant for database servers and Web servers, where thousands of processes may coexist to process user requests. In our case, the number of slave nodes/processes is usually below 20, and the memory-management activity is expected to be far less frequent; thus the overhead introduced should be limited. This is confirmed by our experimental results comparing the performance of decoding with and without dynamic memory allocation. With dynamic buffer allocation enabled, the overall decoding time increases by less than 7% over the static memory allocation case.

4.7.3 Implementation and Experimental Results

Experiments were performed for the high-resolution video formats with the revised memory-management scheme. Our results show that the two improvements bring a significant memory reduction, and the phenomenon of paging panic is eliminated.

For the 1404*960 video, the total buffer size is 53.5 MB with one slave node, which is 27% less than the original. For the 4-slave-node case, the ST scheme uses 85 MB instead of the original 165 MB, almost a 50% memory saving. For the 1404*960 case, the number of page faults shrinks from 1500 to 1200 at the one-slave-node configuration. For the 1024x1024 video, the page faults are now 943, 25% less


Figure 4.10: (a) Decoding frame rate for the revised memory management, and (b) user space time vs. kernel space time for the first buffer optimization scheme

than before. For all of the video streams, the number of page faults remains almost unchanged when increasing the number of slave nodes.

Figure 4.10(a) shows the scalable decoding frame rate for the 1404*960 video with our revised ST scheme. We observed a close-to-linear increase of the frame rate. The peak decoding rate is obtained at 14 slave nodes, where 20 fps is observed. The decoding performance for 1024*1024 video files shows a similar behavior. Thus our revised buffering scheme has successfully solved the memory shortage problem, and works well for high quality video up to MP@HL video.

Figure 4.10(b) shows the overall CPU time distribution of slave nodes when decoding high-resolution video formats with the revised buffer scheme. The user space time component represents the computation time for the MPEG-2 decoding procedure; the kernel space time is the system level overhead, including time spent in the network layer, system calls, and other costs. It is observed that the user space time increases linearly as the number of slave nodes increases, accompanied by a corresponding decrease in the system idle time. Meanwhile the operating system level cost is maintained at a low level (between 5% and 10% of total CPU time). For the large scale experiments (more than 11 slave nodes deployed), the abnormal cross-over of the user space time and system overhead observed in the original decoding


experiments no longer exists. This further proves the effectiveness of the ST scheme in solving the memory shortage.

4.8 Summary

Up to this chapter, we have discussed how a generic and scalable MPEG-2 decoder is implemented via a pure-software-based parallel decoding scheme. The MPEG-2 decompression algorithm is parallelized in data-partition and data-pipeline manners built on a master/slave architecture. The data-partition scheme is shown to be not as scalable as the data-pipeline scheme, due to the overhead in the master node and the high bandwidth requirement. The data-pipeline scheme is able to produce high gain from parallel processing with little overhead. We analyzed the effect of different block sizes and the speedups with an increasing number of slave nodes. Using a block size of one GOP, the communication overhead is reduced to a minimum by reducing the inter-frame dependence as much as possible.

Promising results show that the data-pipeline scheme performs well on various hardware/software platforms, given the necessary network bandwidth. The reported highest decoding rate is more than 70 f/s for MP@ML video. We further investigated the scalability performance for a wide range of video formats, especially for high-resolution MPEG-2 video (e.g., HDTV). However, it is found that the original data-pipeline scheme suffers significant performance degradation when decoding high-level MPEG-2 video with the full system configuration, due to inefficient management of memory space in the decoder. The shortage of system memory is also confirmed by the outbreak of page faults when the system is fully loaded. We propose an efficient buffer management mechanism such that the memory requirement can be reduced by 50%. The revised parallel decoder significantly relieves the memory shortage problem, and shows a satisfactory scale-up performance when decoding the high-resolution video formats (close to a 24 f/s decoding rate is achievable for high resolution quality video).


Our analysis implies that up to 270 f/s is possible by increasing the number of slave nodes in the SMP environment.

Our investigation on parallelizing the MPEG-2 decoding algorithm indicates that real-time decompression of high quality video can be obtained via a pure software solution. We also believe that computation power in the future will be improved significantly via cost-effective MPP technology. Therefore, video decoding capability at the end-user could be regarded as solved.

In the following chapters, we discuss another critical link in distributed multimedia systems: how to provide quality-guaranteed communication service in wireless networks. As mentioned in Chapter 1, delivering multimedia content timely and with high fidelity in a wireless network presents a great technical challenge, not to mention that the system should also support a large number of users. Among several competing technologies (e.g., FDMA, TDMA, and CDMA), we choose the WCDMA wireless network as the target platform, due to the many advantages CDMA has over the other systems. In the next chapter, we discuss a dynamic spreading factor scheme in the WCDMA system such that multimedia traffic with strict BER requirements can be supported by dynamically updating its spreading factor. The protocol provides guaranteed QoS for all accepted call requests. Our discussion also includes a time analysis for the protocol execution, and a traffic scheduler utilizing the dynamic spreading capability.

In Chapter 7, we further extend the dynamic spreading factor scheme into the multiple-cell environment. Specifically, we address how the soft handoff algorithm in WCDMA should be revised to work with dynamic spreading factor control. Our studies show that the decision of the spreading factor in the handoff period should be considered together with the power control for mobile stations. With a heuristic algorithm to optimize the spreading factor and power control, we will show that the overall throughput during the handoff period can be improved by 25%.


CHAPTER 5
MULTIMEDIA SUPPORT IN CDMA WIRELESS NETWORK

Providing multimedia service through wireless networks is becoming the major battlefield for service providers and technology vendors. Recent technology advances are increasing multimedia capabilities in mobile devices. Cellular phones and notebooks are converging into a single mini-device which is capable of both computing and communicating. They are becoming more competitive with desktop PCs in terms of computing power. However, the communication quality supported by current wireless networks still needs major improvement to meet the rigorous demands of multimedia applications, where higher data rates and lower bit error rates are desired.

In traditional wireless cellular networks, traffic channels are designed to support voice conversation. These channels have the same data rate, and have the same bit error rate (BER). These systems only have limited support for data traffic, such as in Pour and Liu [28] where the silent period of voice is utilized for data transmission. Thus, providing quality-of-service guarantees in wireless networks needs new design and functionality in the MAC layer [8]. Choi and Shin [29] discussed QoS guarantees in a wireless LAN with Dynamic Time-Division Duplexed (D-TDD) transmission. In paper [30], an admission control protocol for multi-service CDMA is developed based on interference estimation. The authors used a log-normal distribution to approximate the effect of random user location, shadowing, and imperfect power control.

A major effort toward multimedia support in cellular wireless networks is the so-called 3rd generation system, such as WCDMA. CDMA-based systems are particularly appealing for multimedia applications, primarily due to their capability of providing different levels of link quality (BER) for different traffic types, thus making it


possible to better utilize the radio resource. The spreading factor of CDMA is the key variable in determining the user data rate and the associated BER. Theoretically, the spreading factor makes CDMA possible by repeating users' data signals such that they can be reconstructed at the receiving mobile stations. Increasing the spreading factor can benefit BER because it will increase the desired signal strength linearly. The mean MAI (Multi-Access Interference) caused by other users will decrease accordingly, and will approach zero as the spreading factor approaches infinity.

The evaluation of the BER has been studied intensively [9, 31-33]. It is widely agreed that the BER is largely determined by MAI. In Fukumasa et al. [9], the design of PN sequences is discussed to reduce MAI. In Geraniotis and Wu [33], the probability of successful packet transmission is analyzed for DS-CDMA systems. In Choi and Cho [34], a power control scheme is proposed to minimize the interference of high data-rate users to the neighboring cells. The result shows that the number of high data-rate users for data communication should be less than 6 in order to support enough voice users. However, these works did not cover the system BER with dynamic spreading factors. A new middleware protocol is needed to integrate different services (i.e., voice, data, video) more efficiently.

The unique BER behavior and its correlation to the spreading factor in CDMA systems allow additional flexibility in transmitting data traffic. Akyildiz proposed a packet scheduling protocol for slotted CDMA [35]. The scheduler can maximize system throughput based on BER requirements. In Oh and Wasserman [36], the system performance of a DS-CDMA system set to different spreading gains is studied. Two kinds of traffic are considered. The authors show that the optimal spreading gain increases linearly when the MAI increases. However, their work did not relate the control of the dynamic spreading gain to the system load.

To have a clear picture of how spreading factors and the number of active users affect the BER, we performed link level simulation for the asynchronous uplink


channels in IS-95. Figure 5.1 depicts the BER under different numbers of active users and spreading factors. The experiments were performed within a single cell without interference from neighboring cells. Multi-path effects and thermal noise are not taken into account since we emphasize the effect of the spreading factor.

Figure 5.1: BER vs. user number vs. SF

The performance results clearly indicate that the increase of spreading factors can effectively decrease the BER for a given number of users. For example, with 10 users, increasing the spreading factor from 64 to 96 will reduce the BER from 0.0008 to 0.0003 (e.g., a 62.5% reduction). Increasing the spreading factor further to 128 results in a BER of 0.000004, which is 75 times less. In order to support a variety of BERs, many hardware manufacturers are considering bringing in the dynamic capability of changing spreading factors in next-generation mobile and base stations.

Because of the possible adaptation of spreading factors, novel admission protocols (i.e., state diagrams) are proposed for mobile and base stations. With the new protocol, it is possible that a mobile user is notified to change its spreading factor.
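The relative improvements quoted above can be verified directly from the three BER values read off Figure 5.1 for 10 users:

```python
# BER at 10 users for SF = 64, 96, 128 (values read from Figure 5.1).
ber_sf64, ber_sf96, ber_sf128 = 8e-4, 3e-4, 4e-6

reduction_64_to_96 = 1 - ber_sf96 / ber_sf64   # 0.625, i.e., a 62.5% reduction
factor_96_to_128 = ber_sf96 / ber_sf128        # 75 times lower
print(reduction_64_to_96, factor_96_to_128)
```

Both figures match the text: a 62.5% reduction from SF 64 to 96, and a further 75-fold reduction at SF 128.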


This usually happens when a new mobile presents an OPEN request. The preliminary results indicate that our proposed system always maintains the desired BER for all the connections (including the existing and newly-arriving ones).

Starting from this chapter, we present our ongoing study of the new protocol design in WCDMA to support multimedia traffic and the associated performance issues. Our discussion in the first part of this chapter will focus on the baseline protocol design in a single cell situation. We also provide a detailed analysis of the timing components when the protocol is executed and propose improved schemes to reduce the end-to-end connection setup time. In the second part of this chapter, we address the multimedia traffic scheduling schemes based on the new dynamic spreading factor protocol. In addition to guaranteeing the voice communication, our proposed scheduling scheme reduces the turn-around time significantly for conventional data traffic (i.e., e-mails). These discussions can also be found in [37, 38].

The dynamic spreading factor protocol is further extended to handle multiple cells in the next chapter. We add handoff capability into the dynamic spreading protocol such that a mobile station can take advantage of dynamic spreading even during the handoff period. To optimize the overall system throughput, we propose a new resource allocation algorithm to assign spreading factors and transmitting power to handoff mobiles. The majority of this content is also reported in references [39] and [40].

5.1 Performance-Guaranteed Call Processing

The ultimate goal of an admission control protocol is to support as many users as possible while still satisfying the BER requirements for all existing connections. Conventional FDMA schemes divide the frequency spectrum into multiple channels. Every user's connection needs to allocate one channel from the base station before the voice


conversation can take place. When all the channels are allocated, no more connections can be admitted. Therefore, admission control with FDMA schemes is straightforward.

However, CDMA-based schemes (along with the multimedia support and dynamic spreading factors) make the admission decision nontrivial. One advantage of a CDMA system is that the whole spectrum is used for communication. Connections are separated by Pseudo-Noise (PN) codes assigned at the base station. Thus determining an exact point at which to block newly-arriving connections is difficult. During a typical call lifetime of a voice conversation, the system will accept and terminate a number of calls, thus the number of active users will change rather frequently. In an interference-limited CDMA system, the variation of the system load is the key factor determining the fluctuation of the link quality of the traffic channels. Thus our admission control protocol needs to be adaptive to the changes in the system load. Our proposed admission control protocol is able to monitor and assure that performance remains almost the same by taking necessary steps whenever needed.

Both mobile and base stations need to participate in the call admission process. Mobile stations are the ones that make the requests and/or update their parameters following various commands from the base station. The base station should process requests from mobile stations, monitor the changes in the environment, and decide the key transmitting commands to mobile stations (such as power command, spreading factor, and PN codes). The environment changes include increases/decreases in the number of users and alterations in traffic types. As an example, a station may start a connection with voice, halt the voice without terminating the connection, send email, and go back to voice communication. This scheme eliminates the significant overhead (in terms of tens of seconds) caused by terminating and re-opening a new connection. Therefore, the admission control protocol that we propose here is flexible and capable of guaranteeing overall system performance.
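A minimal sketch of the adaptive admission check described in this section: pick the smallest spreading factor whose predicted BER meets the requesting traffic's requirement, then verify that the data rate this SF allows still covers the application's minimum. The function name and the predicted-BER table are illustrative, not the dissertation's implementation or simulated data.

```python
# Sketch of the base station's first admission check. The predicted-BER
# table maps spreading factor -> expected BER under the current load
# (illustrative values; a larger SF yields a lower BER).

CHIP_RATE = 4.096e6  # chips/s, the WCDMA rate used later in Table 5.1

def admission_check(required_ber, min_rate_bps, predicted_ber_by_sf):
    """Return the chosen SF, or None if the request must be denied."""
    for sf in sorted(predicted_ber_by_sf):           # try small SFs first
        if predicted_ber_by_sf[sf] <= required_ber:  # BER satisfied
            max_rate = CHIP_RATE / sf                # rate this SF allows
            # Any larger SF only lowers the rate further, so deny here:
            return sf if max_rate >= min_rate_bps else None
    return None                                      # no SF meets the BER

ber_table = {32: 1e-1, 64: 1e-2, 96: 1e-3, 128: 1e-5, 160: 1e-7}
print(admission_check(1e-2, 8e3, ber_table))   # voice-like request -> 64
```

A denied mobile would then retry after a random wait, as described in the mobile-side protocol below.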


5.1.1 Proposed Admission Protocol in a Mobile Station

The following Figure 5.2 depicts the state diagram (i.e., protocol) for the mobile stations.

Figure 5.2: The state diagram and protocol of a mobile station.

Unlike traditional CDMA systems (IS-95), mobile stations have three types of requests: OPEN a new connection, ALTER the traffic type, and CLOSE the connection. The first two request types follow similar steps except for the fact that altering the traffic type does not change the number of users. The application in the mobile unit sends an OPEN request to the base station through the access channel. Along with the request is the type of the traffic (and the desired minimum data rate, if required). Our protocol defines 4 traffic types: VOICE, AUDIO, VIDEO, and DATA. Multiple data rate options are available for each traffic type. IS-95 does not allow a mobile to specify the desired traffic type.
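The mobile-side request loop of Figure 5.2 can be sketched as follows. This is a hedged illustration: the base-station reply is abstracted as a callable, and `MAX_RETRY` and the backoff range are assumed values (Figure 5.2 only shows a retry counter bounded by some limit M).

```python
import random

# Sketch of the mobile-side OPEN/ALTER handling: send the request, and on
# denial wait a random backoff and retry, up to an assumed retry limit.

MAX_RETRY = 3  # illustrative stand-in for the limit M in Figure 5.2

def request_connection(send_request, max_backoff_s=1.0):
    """Issue a request; retry with random backoff while denied.

    Returns the attempt index on ACCEPT, or None when service is denied."""
    for attempt in range(MAX_RETRY + 1):
        if send_request() == "ACCEPT":
            return attempt  # the connection parameters would be set here
        # Stand-in for the random wait (no real sleep in this sketch):
        _backoff = random.uniform(0, max_backoff_s)
    return None

# A base station that denies twice, then accepts:
replies = iter(["DENY", "DENY", "ACCEPT"])
print(request_connection(lambda: next(replies)))   # -> 2
```

The random backoff plays the same role as in other contention protocols: it spreads out retries from mobiles that were denied at the same time.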


To establish a connection to the base station, the mobile sends an OPEN request, specifying the traffic type and data rate options. Once the base station receives that request, it first checks whether the new request can be satisfied independent of other connections. The satisfaction is determined based on two things: the type of the traffic to be carried on the new connection (i.e., the BER requirement and minimum data rate), and the interference by other users. The minimum spreading factor that will satisfy the BER is used for the new connection. The maximum data rate allowed by that spreading factor is calculated meanwhile, which is compared to the minimum data rate required by the specific application, and the first decision is made. The decision is either to continue with the remaining steps or to deny the request. If the request is denied, the mobile is allowed to retry after waiting a random time period.

5.1.2 Proposed Admission Protocol in a Base Station

Figure 5.3 depicts the state diagram (i.e., protocol) for the base stations. If it is determined that the new connection can be satisfied, the base station moves further to check whether the existing connections can still be satisfied once the new connection is up. This facility is not available in IS-95 CDMA. However, it is added to our protocol to ensure quality for existing users. For each existing connection, the system calculates the expected average BER corresponding to the increased system load. If the expected average BER is too high, we try to increase the spreading factor and check the maximum data rate in order to see if the data rate requirement can also be satisfied with an increased spreading factor. This step is repeated until all existing connections are reviewed.

There are two possible situations after the review. First, no existing connection requires an update. The destination mobile is immediately informed of an OPEN request and an ACK is sent back to the requesting mobile. The second situation is that some connections may require an update. For each traffic type requiring an


Figure 5.3: The state diagram and protocol of a base station.

update, the base station broadcasts an UPDATE message. All mobiles using the same traffic type must send an ACK back to the base station to confirm that they have updated their parameters. This is necessary to make sure that no one suffers bad performance once the new connection is active. The update procedure is a major improvement over IS-95. It is the key to providing the QoS guarantee to all calls. This procedure allows the mobiles to adjust their spreading factors dynamically. When all ACKs are received, the base station sends an ACK to the requesting mobile, and the mobile may start to transmit its data.

Another important functionality of the protocol is its flexibility to alter the traffic type. One cannot switch from one traffic type to another in IS-95 without terminating and re-establishing the connection. This function is provided only if a connection is already established. If the mobile decides to change the traffic type, it will send an UPDATE message to the base station. The message has to specify the requested


new traffic type. The base station will follow the same steps as in the case of the OPEN request, but this time it will not consider an increase in the number of users while doing its computations.

5.1.3 A Performance-Guaranteed System

We performed an emulation of voice traffic with up to 50 users to measure the performance of our admission control protocol. Practical parameters associated with the hardware and software environment in the mobile and base stations are listed in the following Table 5.1.

Table 5.1: Practical parameters used in the wireless WCDMA environment

  Chip Rate                  4.096 Mcps
  Packet Size                128 bits
  Spreading Factors          32, 64, 96, 128, 160
  Paging Channel Data Rate   32 kbps
  Access Channel Data Rate   16 kbps
  BER for voice              10^-2
  Min Data Rate for voice    8 kbps

The spreading factors used in the experiments are 32, 64, 96, 128, and 160. The experiments presented in this section only tested homogeneous voice traffic. The integration of multimedia traffic (e.g., additional support of e-mail, ftp and audio/music streams) will be presented in the next section. The BER requirement for voice traffic is assumed to be 10^-2. The minimum data rate required by voice is 8 kbps.[1] For simplicity, the packet size is fixed to 128 bits.

The system chip rate is set to 4.096 Mcps as proposed in the WCDMA standard. The maximum traffic channel data rate in this case will be 128 kbps with a spreading factor of 32. The data rates of the Paging and Access channels are 32 kbps and 16 kbps

[1] Our latest study indicated that this is possible with advanced compression schemes such as GSM.


respectively. The BER was measured in every mobile station, and the average BER among all the voice streams was calculated and is illustrated in the following Figure 5.4.

Figure 5.4: Performance guarantee with the proposed admission control.

As depicted by the curve labeled "SF32," fixed-spreading-factor schemes did not guarantee the overall performance among all the users. When there were fewer than five connections, the average BER was acceptable (i.e., less than 10^-2). However, when the connection number exceeded five, every connection (including the existing and the newly-accepted connections) suffered a BER above 10^-2.

On the other hand, by using our proposed admission control protocol, the average BER was always maintained below the 10^-2 threshold for voice even as the number of users increased. The curve labeled with the sequence (SF64, SF96, SF128, SF160) in Figure 5.4 marks the time instances at which our CDMA system reacted to the increasing demand of users and new spreading factors were adopted. The system never exceeded the upper limit for the voice traffic (10^-2).

Though the experimental results are very promising, our proposed admission control protocol does introduce a few consequences in two design tradeoffs. One of the tradeoffs occurs in the prolonged base station processing. The other tradeoff occurs


in the contention time for all mobile stations to acknowledge the accomplishment of changing spreading factors. These two timing factors thus result in a longer end-to-end connection setup time. In the next subsections, we describe these trade-offs in detail and propose methods for further improvement.

5.1.4 Processing Time at the Base Station

The end-to-end delay is defined as the total time between a user making a connection request and the user becoming ready to transmit data. The two dominant time components are Tp, the base station processing time, and Tupdate, the waiting time for UPDATE ACKs from mobile stations.

In the normal situation, to process a new connection request, the base station needs to review the link status for each existing connection. Thus Tp is roughly proportional to the number of users in the system. The admission protocol takes approximately 1 msec to review one connection if it is determined that the connection can be satisfied. However, when a new spreading factor is required, the required time is much higher. For instance, when the number of existing connections reaches 5, where an UPDATE should be undertaken, the processing time Tp increases to 22 msec. We realized that the connections belonging to the same traffic type always change to the same spreading factor since they have the same BER requirement. Thus, connections with the same BER requirement (same traffic type) could use the same spreading factor once for all.

Contention Time for Acknowledgment

When an UPDATE decision is made at the base station, our protocol requires that, before accepting the new connection, all active mobiles switch to the new spreading factor so that the BER never increases above the acceptable point. The current system design enforces that mobiles send acknowledgments back to the base station after they change their spreading factors. Since the Access Channel of a CDMA system only has a common channel and is operated in slotted ALOHA, there


will be extra delays due to possible access collisions if more than one mobile station needs to access the common channel. The average number of slots per contention is 1/A, where A is the probability that some station acquires the channel in a specific slot. When A = 1/e, the minimum contention is reached. This results in a large contention delay when an UPDATE is needed. When the number of existing connections is 9, the end-to-end delay becomes 448.4323 msec, and Tupdate is 395.432 msec.

Many approaches can potentially decrease the contention period. For example, the goal can be accomplished if an improved collision prevention/resolution algorithm is used (i.e., other than CSMA methods). Increasing the number of access channels is another alternative. Without any technology preference, we first investigated the second approach by increasing the number of access channels.

The base station now should distribute access channels among all users equally, so that the overall Tupdate can be reduced evenly across all the mobile stations. A hash function implemented in the base station should be sufficient for this purpose. By hashing on a mobile station's serial number or PN, we can balance the number of mobiles using each access channel. The following Figure 5.5 depicts the preliminary results corresponding to 1, 2, and 4 access channels.

The preliminary results demonstrate a very promising reduction of the Tupdate component. By using 2 access channels, the average Tupdate can be reduced by 49% stably, independent of the number of existing connections. With 4 access channels, a further reduction of about 48% to 58% of Tupdate is accomplished.

However, when multiple access channels are deployed, we observed that the average BER increases at the updating points. This is caused by the additional interference from multiple active access channels. The BER simulation with 1, 2, and 4 access channels shows that the BER corresponding to 2 and 4 access channels is higher than the case with one access channel. For two access channels, the overall BER performance is still below the 10^-2 guaranteed quality. Therefore, using two access


Figure 5.5: Improved Tupdate by using multiple access channels.

channels proved to be a good method to balance the shorter Tupdate time and the BER quality guarantee. When we use 4 access channels, the system experiences a sharp increase in BER and violates the BER requirement at 5 users.

5.2 Dynamic Scheduling for Multimedia Integration

With a solid understanding of using dynamic spreading factors to support many voice-only users with guaranteed quality, a fundamental question needs to be answered: how can the system integrate multimedia traffic (with the best system performance)? In this section, we will illustrate how a dynamic scheduling algorithm can be proposed to support this mission. To simplify the discussion, our example assumes a simple traffic pattern mixing e-mails and voice streams. A typical traffic pattern consists of K voice sessions as background traffic, and 8 large e-mail requests (with a minimum BER = 0.00007). We usually emulate each e-mail as a 384-Kbits data amount coming with a few attachments.


5.2.1 Design Issues of Traffic Scheduling for the Wireless CDMA Uplink Channel

The traffic scheduling problem in the CDMA uplink brings some new issues that are not present in scheduling for wireline networks. Briefly speaking, the capacity of the wireless CDMA uplink is a variable of the traffic types and their spreading factors, which makes the scheduling more difficult than in the wireline network case. Let us consider a simple scheduling goal: maximize the instant throughput without any fairness requirement among different flows. Such a goal can easily be met in time-division based wireline networks by simply transmitting whatever is in the transmission buffer. As long as the transmission buffer is kept non-empty, the network is under full utilization, which is the link speed. An often forgotten assumption behind this scheduling method is the high fidelity of the communication media, with very low transmission error. Nevertheless, this assumption remains true for most wireline media. With a very low transmission error, the difference among traffic types in terms of BER requirements is basically invisible to the network. Thus, as far as the above simple scheduling goal is concerned, it does not matter which traffic type is scheduled earlier.

Unfortunately, the transmission error in the wireless CDMA uplink is much higher than in any wireline media used today. With a given channel situation and spreading factor, some traffic might not be able to be transmitted due to the violation of its BER requirements. Furthermore, CDMA allows several concurrent transmissions to be undertaken, thus the throughput is the summation of the data rates of all active channels. From the discussion in the first part of this chapter, it is generally true that when high spreading factors are used, the number of concurrent active channels increases. However, the data rate for a channel will decrease. Thus, even considering one traffic type, the decision of the spreading factor is not trivial.

In fact, the traffic scheduling in the wireless CDMA uplink must decide two things: the order in which connections will be activated, and the spreading factor


to be used for each connection. The decision of these scheduling parameters should also satisfy the BER requirements for all traffic, and maximize the overall throughput. Notice that frame-by-frame scheduling is not considered here due to the signaling overhead between the base station and the mobile stations.

Therefore our simple traffic scheduler will be invoked when one of the following events occurs: a new connection request, a connection termination, or a traffic profile update. Each connection is associated with a traffic type as discussed earlier, and the amount of data to be transmitted. For streaming traffic, a default of 500 kbits is assumed for scheduling purposes. When this quota is finished and the connection is still alive, another 500 kbits will be assigned. For non-streaming traffic such as ftp and email requests, the transmission quota is the actual amount of data for the request.

For the rest of this chapter, we will discuss the performance of such a traffic scheduler with fixed spreading factors and with dynamic spreading factors. The performance is evaluated in terms of the overall turnaround time instead of the instant throughput.

5.2.2 Traffic Scheduling with Fixed Spreading Factor

The performance metric to be measured for data communication is the total turn-around time to fulfill all the data traffic, denoted as Ts. The goal of system design is thus targeted at a shorter Ts (while maintaining the performance guarantee for all existing voice connections). If a traditional CDMA system with a fixed spreading factor (e.g., SF=64) is used, data traffic will experience the same BER as voice traffic. Therefore, when the number of background voice connections K increases, it may not be effective to transmit e-mails since the BER becomes too high. Since the accuracy of e-mails needs to be guaranteed, the transmission of e-mail data traffic under this situation will cause a high probability of packet damage (due to the BER being over acceptable limits).


The high packet damage rate will require a much higher retransmission rate from the upper network layers. Thus the total Ts will become even higher. If the network load sustains for a long time, even the constant retransmissions will not guarantee a definite success of packet delivery. Therefore it not only introduces a long delay for the retransmitting user, it also generates negative interference to other users.

Thus, in a wireless CDMA environment, perhaps a non-retransmission policy can benefit the overall system, where data transmission occurs only under an acceptable BER level. For example, when K = 10, the predicted average BER is 0.0001, which is higher than the BER requirement of e-mail traffic; thus an email request has to wait until the voice load decreases. Based on this policy and fixed spreading factors, a possible traffic scheduler to support integrated multimedia communication may work as follows:

Algorithm FSF_Scheduling: This algorithm delays the transmission of e-mail communication until the BER in the environment is acceptable. This algorithm will reduce the frequency of the re-transmission from the upper layers.

Algorithm FSF_Scheduling
    Contact the base station for the current traffic load.
    Select an unfinished e-mail request r(i).
    Check to see if the addition of this request will
    still satisfy the BER requirements.
    IF (the predicted BER exceeds that of any of the existing
        connections, or the BER of r(i)),
        r(i) is not accepted for the next time frame.
    ENDIF
    GOTO (3) and check for other traffic.
    Otherwise, r(i) is scheduled at the next time frame.
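The FSF_Scheduling idea above can be made concrete with a small runnable sketch. This is an illustration, not the dissertation's simulator: the `predict_ber` model and the function name are assumptions; the BER thresholds are the voice (10^-2) and e-mail (7*10^-5) requirements from the text.

```python
# Sketch of FSF_Scheduling: an e-mail request is admitted into the next
# time frame only if the predicted BER (a function of the active-user
# count at the fixed SF) stays within every connection's requirement.

def fsf_schedule(pending, n_voice, predict_ber,
                 voice_ber_req=1e-2, email_ber_req=7e-5):
    """Return the subset of pending e-mail requests admitted this frame."""
    admitted = []
    for req in pending:
        n_active = n_voice + len(admitted) + 1   # load if req is added
        ber = predict_ber(n_active)
        if ber <= email_ber_req and ber <= voice_ber_req:
            admitted.append(req)                 # schedule next time frame
        # else: req waits until the voice load decreases
    return admitted

# Illustrative monotone BER model for a fixed SF:
predict = lambda n: 1e-6 * n ** 2
print(fsf_schedule(list(range(8)), 5, predict))   # -> [0, 1, 2]
```

With this toy BER model, only three of the eight e-mails fit under a light voice load, and none fit under a heavy one, which mirrors the "wait for the voice load to decrease" behavior of the algorithm.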


We have conducted experiments with the FSF_Scheduling algorithm to collect the performance results. The focus of the performance is the total turn-around time, Ts, for 8 e-mail requests. Figure 5.6 shows the turnaround time Ts with fixed spreading factors.

Figure 5.6: Turnaround time for data traffic (scheduling under fixed SF for 8 e-mails, each 384 Kbits)

The following observations can be made:

- For SF=64, the data traffic takes 6 seconds when k = 2, 3, 4. It increases to 12 seconds for k = 5, 6 and 7. At k = 11, the turn-around time becomes 48 seconds. Further increase of the voice background will result in an unacceptably high BER for the e-mail traffic; thus the transmission time is N/A. Similar performance results are observed for SF=96.

- With SF=128, Ts equals 12 seconds when fewer than 16 voice connections exist. Ts increases to 24 and 36 seconds when the number of background voice connections is 19 and 22. With 24 background voice connections, only one e-mail connection can be transmitted in order to satisfy the BER requirement, which results in a turnaround time of 96 seconds. No e-mail transmission is allowed after 25 voice users.

- For SF=160, Ts equals 15 seconds when fewer than 20 voice users are present, which is 25% higher than that of SF=128. The increase of Ts is caused by the decrease of the data rate; the data rate at SF=160 is about 25 Kbits, lower than that of SF=128. When k = 27, Ts increases to 30 seconds, then 45 seconds at k = 31, 60 seconds at k = 32, and 120 seconds at k = 33. After k = 34, we find that the system BER is already too high to accept any DATA traffic.
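The baseline turnaround figures in the observations above follow directly from the per-channel data rate chip_rate/SF, assuming all 8 e-mails can be active concurrently under a light voice load (the concurrency assumption is ours; the chip rate and e-mail size are from Table 5.1 and the traffic pattern):

```python
# Each 384-Kbit e-mail on its own channel finishes in
# 384 Kbits / (chip_rate / SF) seconds.
chip_rate = 4.096e6   # chips/s, from Table 5.1
email_bits = 384e3    # one e-mail request

for sf in (64, 128, 160):
    rate = chip_rate / sf             # 64, 32, and 25.6 kbps
    print(sf, email_bits / rate)      # 6.0, 12.0, and 15.0 seconds
```

These match the light-load values quoted above: 6 s at SF=64, 12 s at SF=128, and 15 s at SF=160.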


In summary, with the FSF_Scheduling algorithm, the system throughput suffers in order to satisfy the high BER requirement of DATA communication. Moreover, with small spreading factors, the voice background can affect the transmission of DATA traffic significantly. Under a light network load, the longer spreading factors (e.g., 128 and 160) actually generate longer turn-around times than the shorter spreading factors (e.g., 64 and 96). However, longer spreading factors do have the capability to support more concurrent voice streams with reasonable response times for conventional data requests. How to take advantage of these two design alternatives (with a balanced performance between continuous and conventional data service) becomes the focus of our next proposed scheduling algorithm.

5.2.3 Dynamic Scheduling to Improve Ts

In order to be rescued from such an undesirable situation, a scheduling algorithm based on dynamic spreading factor assignment can be adopted. Intuitively, when the current system BER does not satisfy the BER requirement of DATA traffic, we should find a longer SF. When there are multiple data transmission requests, we should select a feasible SF for each of them. Since some data sessions may last several time frames, the spreading factor should be determined on a per-time-frame basis. Therefore, minimizing the turn-around time Ts for a given set of data traffic becomes a global optimization problem over all feasible SF combinations through the lifetime of the data sessions.

However, the searching space will grow exponentially. For example, assume the data traffic consists of one FTP session and one AUDIO session, and there are 10 background voice sessions; we will have 3 candidate SFs (SF = 96, 128 or 160) for AUDIO and 2 for FTP (SF = 128 or 160). Further assume the AUDIO session requires 256 Kbits and the FTP session requires 64 Kbits. With the maximum channel data rate of 128 Kbps (achieved if SF = 32), the AUDIO session will last at least 2 seconds, which corresponds to 100 time-frames (each time-frame is 20 msec). With similar

PAGE 90

82 arguments,theleastnumberoftime-framefortheFTPsessionis25.Thusthetotal combinationforthisexamplewillbeatleast(2 3)min { 100 25 }.Ingeneral,assumethe feasibleSFfor kthtractype Tkis Nk,andtheestimatedframenumberof Tkis Fk, theapproximatesizeofsearchingspaceis( kTk)min { Nk}. Thishugesearchingspacemakesbruteforcesearchingimpracticalsincethe schedulingisdesiredto“nishwithinafra me-time.Inaddition,thebackground voicetracanddatarequestsmaychangefromtimetotime,whichalsomakesit lessworthyto“ndastaticglobaloptimizationbasedonthehistorysystemload information.Therefore,itisourbeliefthataspreadingfactorcombinationthat maximizethethroughputforcurrenttimeslotismorereasonable. Theremainingissueiswhichtracshouldbeconsidered“rstwithwhat spreadingfactor.WeobservedthatthosewithalowBERrequirementhaveabetterchancetobesatis“edthanthosehaves trictBERrequirement,whichindicates thatasmallspreadingfactorismorelikelytobeacceptedbythelow-BER-trac. Therefore,toimprovetheoverallthroug hput,weshouldinspectthelow-BER-trac “rst.Forexample,with40voicebackground,wecaneitherschedule5FTPtracs withSF=160,or5AudiotracswithSF=96,resultinginatotalof45activetracs (includingvoice)2.Obviouslythelatteronehashigherthroughput. Thusourschedulingalgorithmworksinfourphasescorrespondingtothefour datatracs,fromlow-BER-tractostrict -BER-trac.Namely,thetractypes areprocessedintheorderofAudio,Image,FTP,andEmail.Thefollowingpseudo codeillustratesthealgorithmforprocessingaudiotrac.Thealgorithmsprocessing othertractypesaresimilar. AlgorithmDSF Scheduling /*ThisalgorithmimprovestheFSF SchedulingAlgorithm byassigningdifferenttraffictypesmoredynamically*/ 2Othercombinationswhichresult6ormoredatatracs(suchas3FTPand3Audio)will resultmorethan46connections.TheresultedvoiceBERwillnotbesatis“ed.
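The search-space estimate above can be reproduced directly; the exponent min_k{Fk} (the shortest session length in frames, reconstructed here from the worked example) is what makes exhaustive enumeration hopeless within a 20 ms frame.

```python
# Lower bound on the number of SF schedules for the FTP + AUDIO example:
# 3 candidate SFs for AUDIO, 2 for FTP, and min(100, 25) = 25 frames.

def search_space(candidates_per_traffic, frames_per_traffic):
    """Lower bound (prod_k Nk) ** min_k{Fk} on the number of SF schedules."""
    product = 1
    for n in candidates_per_traffic:
        product *= n
    return product ** min(frames_per_traffic)

combos = search_space([3, 2], [100, 25])   # AUDIO: 3 SFs, FTP: 2 SFs
assert combos == 6 ** 25
print(f"{combos:.2e} candidate schedules")  # far too many per 20 ms frame
```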


INPUT PARAMETERS:
    Nf, Ne, Na, Ni: the numbers of pending FTP, e-mail,
        AUDIO, and Image data requests, respectively.
    FTP[1..Nf]: pending FTP requests; FTP[k] represents the
        remaining data amount yet to be transmitted.
    EMAIL[1..Ne]: pending e-mail requests.
    AUDIO[1..Na]: pending Audio requests.
    IMAGE[1..Ni]: pending Image requests.
    RBER[4]: BER requirements of the four traffic types.
    BER[5][1..50]: the predicted BER given the number of
        active users and the spreading factor.

Find the minimum spreading factors SFv and SFa:
    BER[SFv][k + Na] ...
    ...
    delete finished requests
ENDFOR
Continue with the other traffic types.

The performance of this scheduling algorithm for the sample traffic pattern is demonstrated in Figure 5.7, and the following observations are made:

[Figure 5.7: Turn-around time under dynamic-SF scheduling for 8 e-mails (each 384 Kbits) versus the number of existing voice connections.]

• A fixed spreading factor of 64 or 96 cannot support the data traffic with a reasonable turn-around time (due to the high packet-damage probability). Our DSF_Scheduling automatically changes to a higher spreading factor and provides continuous transmission of data traffic. In Figure 5.7, under a heavy system load of 30 users, we can still fulfill the sample traffic pattern in 40 seconds.

• Dynamically changing the spreading factor based on different system loads can significantly improve system throughput. For example, with 8 background voice connections, Ts is reduced from 24 sec to 12 sec by simply turning the SF from 64 to 160, which cuts the data traffic time in half.

In summary, our dynamic spreading factor assignment scheme can adjust to the system load and balance the requirements of different traffic. Voice communication is guaranteed to be non-interrupted, and data traffic is served with reasonable fairness and response time. Our scheme can guarantee no starvation for data traffic, even when the base station is under high load.

5.3 Summary

In this chapter, we discussed a new MAC protocol for W-CDMA. The protocol can support different traffic classes that have a variety of BER requirements. Since the BER grows with the number of users in a wireless communication system, we designed a protocol that integrates different traffic classes with guaranteed quality while maximizing the number of users supported.

The presented protocol admits new calls only if the quality of the new connections, as well as of all the existing connections, can be guaranteed. Our protocol uses a dynamic spreading factor scheme. In order to satisfy the BER requirement, the protocol tries to dynamically change the SF used by connections that may experience a high BER. The connections may alter their SF several times depending on the changes in the number of users in the system.

We have evaluated the BER performance and admission time of the protocol under an increasing number of voice connections. We have also compared our protocol to a regular CDMA protocol that uses a fixed spreading factor. The results show that the DSF protocol provides significant improvement in BER satisfaction


compared to CDMA with a fixed SF. Even though the number of users in the system increases, the average BER of all connections is maintained below the upper limit for their traffic type. New schemes are proposed and evaluated to reduce the admission delays. Furthermore, in order to guarantee balanced support among different traffic types, we proposed a scheduling algorithm and measured the response time for data traffic under the proposed system.

As further study, we evaluate the performance of the protocol under more complex conditions (i.e., mixed traffic with streaming video). We also expand the protocol to cover more than one cell (i.e., the so-called soft handoff). These studies are described in the next chapter.


CHAPTER 6
SOFT HANDOFF WITH DYNAMIC SPREADING

In our previous study [38], we proposed a dynamic spreading scheme such that the mobile user can change the length of the assigned spreading code for more efficient utilization of the radio frequency. With the new dynamic spreading capability, the system can support BER-sensitive multimedia traffic (i.e., combined voice and traditional text information) even when the system load is high.

It is the task of this chapter to extend the dynamic spreading scheme to the multi-cell scenario. More specifically, we need to take mobility and handover¹ [41, 42] into consideration. In one envisioned application scenario, we can imagine how Tedd and Tom can benefit from this scheme when driving from Gainesville to Orlando (a total distance of about 80 miles). Assuming a base station is placed along the highway every 4 miles, the mobile will drive across 20 cells through the trip. It can be expected that there are fewer mobiles in the rural area than in the urban area. In the rural area, Tedd can have a video-phone conversation with his friend in Orlando when there are no mobile users around. When he comes into a town, where more mobile users co-exist, Tedd can switch to a regular voice conversation by using a longer spreading code (thus decreasing the data rate requirement and the BER requirement). After driving out of the town, Tedd can resume the video session by changing back to a short spreading factor, since a shorter spreading factor is enough to provide signal quality when the number of mobile users decreases (and thus there is less interference).

The fundamental question yet to be answered is: how to integrate soft handoff into the dynamic spreading scheme? In the dynamic-spreading-enabled CDMA system,

¹ We use the terms handover and handoff interchangeably in the context of this chapter.


a mobile might travel through a series of cells during a call session. It is likely that these cells have different numbers of active in-cell connections, and therefore might assign different spreading factors in order to maintain the BER quality. Thus a mobile might have to change its spreading factor from time to time as it enters a new cell. A basic requirement for a handoff algorithm under this scenario is to make sure that the update of the spreading factor, in both the handing-over mobile and the target base station, finishes within the allowed handoff time.

We propose a novel soft-handoff scheme to address the above problems. Our scheme uses a methodology similar to WCDMA to determine the addition and dropping of cells into and from the active set. The basic functionality of the proposed handoff scheme has three parts: (1) collect information on the surrounding cells and the mobile profile, (2) update the spreading factor (SF) in the active set (base stations in range) to maintain the BER level for the mobiles in these cells, and (3) decide the right spreading factor and power level for the handover mobiles.

The probability of updates as a result of handoff is an important factor describing the performance of our proposed framework. We present an analytical evaluation of the update probability. The preliminary results show that mobiles with voice traffic have a relatively small update probability (also see Wang et al. [39]). However, video traffic and WWW traffic have a high update probability, due to their high BER requirements.

The time components of the proposed algorithm were analyzed. We found that the handoff algorithm could consume significant system resources and result in long handoff delay, especially when the system needs to execute a large number of updates. We therefore revised the handoff procedure such that handing-over mobiles can be processed in batches, and several update operations caused by consecutive handoffs can be combined. Numerical results show that the average handoff delay can be reduced by almost 29%.
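A minimal sketch of the batching idea described above is given below. The helper `evaluate_cell` and the data shapes are hypothetical, not the dissertation's implementation; the point it illustrates is that one BER re-evaluation per target cell covers every request that arrived in the same batching window, instead of one evaluation per request.

```python
# Sketch of batched handoff processing (names and shapes are assumed).
from collections import defaultdict

def process_batch(requests, cell_load, evaluate_cell):
    """requests: list of (mobile_id, target_cell); returns per-cell plan."""
    by_cell = defaultdict(list)
    for mobile, cell in requests:
        by_cell[cell].append(mobile)
    plan = {}
    for cell, mobiles in by_cell.items():
        # one evaluation per cell per batch, not one per handoff request
        new_load = cell_load[cell] + len(mobiles)
        plan[cell] = (evaluate_cell(cell, new_load), mobiles)
    return plan

# toy evaluation rule: switch to a longer SF once the cell exceeds 20 users
toy_eval = lambda cell, load: 64 if load > 20 else 32
plan = process_batch([(1, "A"), (2, "A"), (3, "B")], {"A": 20, "B": 5}, toy_eval)
assert plan["A"][0] == 64 and plan["B"][0] == 32
```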


The core process executed in our proposed soft handoff with dynamic spreading scheme is to determine the right spreading factor and transmitted-signal power level for the handoff mobile. The decision on SF and power level should be made such that the following conditions are satisfied: (1) the link quality of the mobile stations in the neighboring cells is not sacrificed, (2) the minimum BER requirement of the handing-over mobile is not violated, and (3) the maximum data rate of the handing-over mobile is achieved.

We propose a sub-optimal method (called HYB) with the goal of joint optimization of SF and power level for all mobiles in the handover area. The HYB algorithm models the BER constraints of all mobiles as a function of various factors, including the (SF, power) pair of each handoff mobile, the system load of the active set, and the distances from the active cells. The decision variables are the (SF, power) pairs for all handoff mobiles, chosen to maximize the throughput of the handoff mobiles. Our modeling results in a nonlinear programming problem. Further investigation shows that the problem can be simplified to a linear programming problem with a slight modification of the constraints. The reduced problem contains only half the number of decision variables and can be solved efficiently using linear programming methods. The solution of the reduced problem is then used to derive the desired SFs from the original constraints. Numerical experiments show that the HYB algorithm outperforms a greedy strategy (we call it the MaD algorithm) significantly. The BER of the mobiles is preserved during the whole handoff period, while the interference to the surrounding cells is controlled. The throughput of the HYB scheme for WWW traffic is 25% higher than that of the conservative strategy; a 26% increase for video traffic was also observed.

6.1 Related Study

Traditional handoff algorithms are based on the voice-only network, carrying homogeneous traffic. The major performance focus of these algorithms is to guarantee


the connectivity of live calls and to keep the failure rate as small as possible. Therefore some important metrics and criteria used include received signal strength (RSS), signal-to-interference ratio (SIR), distance, and traffic load. Given different handoff decision schemes, the corresponding handoff probability can be analyzed [43]. A general survey of handoff protocols was presented in Marichamy and Maskara [44].

The CDMA system supports a unique handoff scheme, soft handoff [41, 42]. Soft handoff allows more than one base station to communicate with the mobile station during the handover process, so there is no break period when the mobile moves to another cell. It has been shown that soft handoff can reduce the transmission power of the mobile and increase the system capacity [41, 45].

Due to the shortage of radio resources, it is essential to make the best use of the system resources to support multiple users and multimedia traffic. Resource allocation for the CDMA system, and CAC in general, has been discussed in many works [46-48]. In Naghshineh and Schwartz [47], a distributed call admission scheme with consideration of the neighboring cells is reported. The movement of mobiles, as well as the system load in nearby cells, is used to make the admission decision.

In Zhou et al. [46], a forward-link power control scheme was discussed in a two-cell CDMA network supporting different QoS requirements. The scheme is optimized in the sense of minimizing base station transmitting power and maximizing system throughput. However, the study is based on the assumption of a fixed spreading factor, and did not address the resource assignment for the reverse link.

Lee and Wang [48] presented an optimum power assignment for mobiles supporting different QoS requirements. They formulated the admission problem as a set of inequalities on the desired SIR for all the active mobiles and provided a method to derive a feasible solution. However, in this study the processing gain is fixed to a predefined value for each traffic type, so the admissible region is strictly limited.


However, only a few works have attempted to optimize the performance of handoff mobiles by reallocating the radio resources. Furthermore, the effects of dynamically changing the spreading factor have not been reported in these studies. Our study is among the first attempts to take the dynamic spreading factor into the handoff algorithm. We also address the resource allocation during the handoff period as a joint optimization of both the spreading factor and the mobile power control. To the best of our knowledge, this approach has not been reported in the literature.

The rest of this chapter is organized as follows. Section 6.2 discusses our proposed dynamic spreading soft handoff scheme and the system response time. Section 6.3 presents a baseline algorithm using a greedy strategy in deciding the spreading factor. In Section 6.4, we show how a closer estimation of the mobile/cell condition can provide high throughput while preserving a low interference level.

6.2 The Framework of the Soft Handoff Algorithm with Dynamic Spreading

Our proposed handoff algorithm is based on the traditional soft handoff algorithm. The mobile station, when moving around in the multicell environment, can be either in the handoff state or the non-handoff state. When in the non-handoff state, the mobile is solely controlled by the resident cell, as described in Wang et al. [38]. The mobile enters the handoff state when the active set of this particular mobile contains more than one cell site. The maintenance of the active set is similar to the one used in IS-95, where the relative pilot signals of the surrounding cells are reported by the mobile, and a decision on adding/dropping/ignoring a cell into/from the active set is made.

Figure 6.1 shows the state diagram of the handoff procedure that is executed on the BS side. The logic of the mobile station is relatively simple and is not further elaborated here. Notice that Bn needs to send its forward Walsh code to the mobile so the mobile can decode the signal from the new base station.
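The active-set maintenance described above can be sketched as a simple hysteresis rule in the spirit of the IS-95 procedure: a cell whose filtered pilot exceeds an adding threshold joins the set, and one that falls below a dropping threshold leaves it. The threshold values below are illustrative assumptions, not taken from the dissertation.

```python
# Hysteresis-style active-set maintenance (thresholds are assumed values).

T_ADD, T_DROP = -14.0, -18.0   # filtered pilot Ec/Io thresholds in dB

def update_active_set(active, pilot_db):
    """active: set of cell ids; pilot_db: {cell: filtered pilot strength}."""
    for cell, ecio in pilot_db.items():
        if cell not in active and ecio > T_ADD:
            active.add(cell)        # candidate pilot strong enough to add
        elif cell in active and ecio < T_DROP:
            active.discard(cell)    # faded below the dropping threshold
    return active

active = update_active_set({"B0"}, {"B0": -10.0, "B1": -12.5, "B2": -20.0})
assert active == {"B0", "B1"}       # B1 added, B2 never admitted
```

The gap between T_ADD and T_DROP prevents a cell hovering near one threshold from flapping in and out of the set; the mobile is in the handoff state whenever the set holds more than one cell.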


It can be observed that there are three parties participating in the handoff period. The role of the mobile station in the handoff processing is similar to that in the existing soft handoff procedure. The mobile is responsible for pilot signal measurement, and compares the detected pilot signals to an adding threshold and a dropping threshold (the signal is filtered to eliminate the random variation caused by fast fading). The mobile then reports any change of signal strength to the BS at a fixed time interval. The current serving base station will update the active set of the mobile and decide whether or not the mobile should be in the handoff period. Once the mobile is in the handoff state, the current BS will gather all necessary information and decide a spreading factor for the handing-over mobile.

However, the handoff algorithm should also guarantee that the target cell (and the other nearby cells) can still operate with performance guarantees. In fact, the handing-over mobile station is usually relatively close to the BSs in the active set, so its signal is strong enough to become the dominant interference if not carefully controlled. Thus it is the duty of the handoff algorithm to maintain the desired BER in the target cell at all times; e.g., an update in the target cell should be taken immediately. The need to update the spreading factors in the target cell during the handoff procedure can prolong the total handoff time. This makes the time factor more important than in the traditional system, where handoff can be finished within a matter of 100 ms. In the case of heavy handoff traffic, efficient handoff processing becomes nontrivial, since the target cell might have to trigger update procedures several times in a limited time period.

The performance goals of the handoff algorithm with dynamic spreading factor are summarized as follows. (1) All handoff requests should be processed within a given time period, such that the mobile can smoothly migrate to the other cell. (2) The algorithm should be able to handle stress handoff requests, which is important when the system is under heavy load. (3) The handoff algorithm should seek the highest


throughput while maintaining the BER performance of the handoff mobile during the handoff period. The performance of the active cells must not be jeopardized; the BER performance of these cells cannot be violated at any time.

[Figure 6.1: Framework of the soft handoff scheme at the BS with dynamic spreading. The flowchart shows the BS collecting load information from the active set, deciding the spreading factor and power for the mobiles, asking the active set to update and waiting for ACKs, adding Bn into the active set, sending Bn's forward Walsh code to the mobile, and re-evaluating on each handoff trigger event until the mobile leaves the handoff period.]

6.2.1 Stationary System Behavior

Basically, the handoff algorithm needs to decide: (1) whether or not the relevant active set should perform an update process, and (2) whether or not the handing-over mobile should change its spreading factor. In order to study the stationary update behavior, we define the mobile state by the spreading factor it is using. The possible values of the spreading factor are limited to {16k | k = 1, 2, ..., 8}. We will study the updating probability for the handoff mobiles using a state transition diagram, as shown in Figure 6.2.
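The update decision itself reduces to a table lookup: each cell-load interval maps to a desired SF, and a transition fires when the target cell's load lands in a different interval than the one matching the current SF. The load-to-SF table below is purely illustrative and is not taken from the curves in [38].

```python
# Illustrative load -> SF update check (interval bounds are assumed).

LOAD_TO_SF = [(10, 16), (20, 32), (46, 64), (70, 96), (87, 128)]

def desired_sf(load: int) -> int:
    """SF whose load interval covers `load` (upper bounds inclusive)."""
    for upper, sf in LOAD_TO_SF:
        if load <= upper:
            return sf
    raise ValueError("cell load exceeds admissible range")

def needs_update(current_sf: int, target_cell_load: int) -> bool:
    """A state transition fires when the target cell demands another SF."""
    return desired_sf(target_cell_load) != current_sf

assert desired_sf(15) == 32
assert needs_update(32, 50)      # load 50 demands SF 96, so update
assert not needs_update(32, 12)  # still inside the SF-32 interval
```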


[Figure 6.2: State transition diagram for a handoff mobile carrying video traffic. States SF32 through SF256 are connected by double-arrow transitions labeled with cell-load trigger conditions such as m = [22..33] / m = [9..21].]

For a mobile to transit to a new state, certain transition conditions must be satisfied. Specifically, a transition occurs when the target SF is different from the current one. Thus the driving event of a state transition is a change in the number of active mobiles in the target cell. We use a double arrow to represent the forward and backward transitions between two states. For the sake of convenience, the two states associated with each transition are called the low state and the high state, with the low state having the shorter spreading factor. The transition is labeled by a trigger-condition pair (x/y), where x represents the transition condition from the low state to the high state, and y the reverse direction. The transition condition can be derived from the admission criteria for a particular traffic type. For instance, we can use Figure 4 in Wang et al. [38] to obtain the situations in which an update of the SF will occur. Due to the complexity of the graph, we only labeled a few transitions in Figure 6.2.

Let Prob_i represent the probability of the mobile being in state i. The probability of update for the handoff mobile can be expressed by:

    Prob^v_{u,m} = sum_i Prob^v_i * sum_{j != i} Prob^v_j = sum_i Prob^v_i (1 - Prob^v_i) = 1 - sum_i (Prob^v_i)^2


Here the subscripts u, m indicate an update in the mobile unit, and the superscript v represents voice traffic. We will use subscripts u, c to represent an update in the target cell.

Prob^v_{u,m} can now be calculated from the probability of each state for the voice traffic. Assuming the system load in the target cell is uniformly distributed over the admissible range, we can use the frequency of the i-th SF as an approximation of Prob_i. The relationship between the SF and the system load for different traffic types can be derived from the BER-SF-system-load curves shown in Wang et al. [38]. The following discussion assumes that a particular correspondence between SF and system load has already been obtained. For example, assume the maximum number of voice connections is fixed at 87; the SF update policy will use SF = 32 for voice traffic when the system load is from 11 to 20. This indicates that (20 - 11) of the 87 possible cell-load situations result in SF = 32. Similarly, the frequencies for the other mobile states can be obtained. The approximated probabilities for spreading factors (SF16, SF32, SF64, SF96, SF128, SF160, SF192, SF224, SF256) are (0.115, 0.126, 0.23, 0.068, 0.046, 0, 0, 0, 0), respectively. The corresponding probability of update for a voice mobile is thus

    Prob^v_{u,m} = 1 - 0.115^2 - 0.126^2 - 0.23^2 - 0.068^2 - 0.046^2 = 0.702.

We can derive the probabilities of update for video and WWW traffic similarly: Prob^www_{u,m} = 0.82 and Prob^video_{u,m} = 0.87.

The probability of update in the target cell can be decided from the update policy too. A cell load m is referred to as an unstable point if one of the traffic types needs to update to a new SF when the system load becomes m + 1. We found that there are 18 unstable points among the 87 possible cell loads; thus the probability of update in the target cell is Prob_{u,c} = (4 + 7 + 7)/87 ≈ 0.21.

Now we can derive the probabilities of the four situations based on the updating decisions for the mobile and the target cell, 1) (unchanged, unchanged), 2) (unchanged, update), 3) (update, unchanged), and 4) (update, update), by the following equations:

    P1 = Prob(mobile update, cell update)
       = (1/3)(Prob^v_{u,m} + Prob^video_{u,m} + Prob^www_{u,m}) * Prob_{u,c}               (6.1)
       = 16.6%                                                                              (6.2)

    P2 = Prob(mobile update, cell unchanged)
       = (1/3)(Prob^v_{u,m} + Prob^video_{u,m} + Prob^www_{u,m}) * (1 - Prob_{u,c})         (6.3)
       = 62.5%                                                                              (6.4)

    P3 = Prob(mobile unchanged, cell update)
       = (1/3)((1 - Prob^v_{u,m}) + (1 - Prob^video_{u,m}) + (1 - Prob^www_{u,m})) * Prob_{u,c}        (6.5)
       = 4.9%                                                                               (6.6)

    P4 = Prob(mobile unchanged, cell unchanged)
       = (1/3)((1 - Prob^v_{u,m}) + (1 - Prob^video_{u,m}) + (1 - Prob^www_{u,m})) * (1 - Prob_{u,c})  (6.7)
       = 16%                                                                                (6.8)

Substituting the update probabilities for the handoff mobile, Prob^v_{u,m}, Prob^video_{u,m}, and Prob^www_{u,m}, and the target cell, Prob_{u,c}, into equations (6.1)-(6.8), we have: P1 = (1/3)(0.702 + 0.82 + 0.87) * 0.21 = 0.166, P2 = 0.625, P3 = 0.049, and P4 = 0.16. The above results show that 63% of the time there will be an update for the handoff mobile with no change in the target cell. The probability of no update on either side is only 0.16. This indicates that our proposed handoff scheme can lead to a considerable number of updates when handoff occurs. In the following discussion, the handoff timing is analyzed for dynamic system behavior.

6.2.2 Handoff Time Analysis

The handoff processing time is defined as the total time between a mobile entering the handoff state and the user becoming ready to communicate with the active set. We can formulate this as:

    T_handoff = T_i + T_u + T_e


where T_i is the time to exchange information among the active set. This includes three parts: the time T_m to send the link parameters of the handing-over mobile to the new member of the active set, the time T_load for the system load information of each cell to be sent to the master cell, and a time delay T_target for the target cell to acquire the handoff mobile. T_m and T_load are mainly caused by the communication delay between the master cell and the other cells in the active set. We use a round-trip delay of 10 ms as a reference value (this is usually observed between two hosts in the same wired network). T_target can be approximated by the soft handoff time in the traditional CDMA system, which includes the time for user detection, resource allocation, etc. A practical upper bound of T_target is in the 40 ms range.

T_u is the time to evaluate the impact on the target cell caused by the handoff mobile. During this time period, it must be decided whether or not an update in the target cell should be performed. If so, the update must be completed before proceeding further. The duration of this part depends on the number of active users in the target cell, and on how many of these need to be updated.² As shown in Wang et al. [38], this part can be further divided into the processing time to evaluate the BER for the target cell and the updating time. Table 6.1 illustrates the increase of T_u as the number of existing mobile users grows. Each column of the table represents a point at which a change of the spreading factor is necessary compared to the previous column. For instance, assume the target cell already has two video users and one WWW user. When a new mobile arrives, it will take 9 ms of searching time and require two update operations for the two video users (corresponding to 20 ms). Overall, we have T_u = 29 ms. This case corresponds to the column with 4 total users.

² The evaluation of the BER is performed for all active users, and only part of them need to update their SF (e.g., all the users carrying video traffic, or all WWW users in the next instance). Notice also that here we assume the updating is performed sequentially.
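The worked T_u example above can be recomputed directly. The 10 ms per-update cost is inferred from the 20 ms total quoted for two updates, and the function names here are illustrative, not the dissertation's notation.

```python
# Recomputes the worked Tu example: a target cell with two video users
# and one WWW user admits a new mobile, needing 9 ms of BER searching
# plus one sequential SF-update operation per affected video user.

SEARCH_MS = {4: 9}   # searching time by total users (from Table 6.1)
UPDATE_MS = 10       # assumed cost of one sequential SF update, in ms

def t_update(total_users: int, n_updates: int) -> int:
    """Tu in ms: BER search plus sequential SF updates."""
    return SEARCH_MS[total_users] + UPDATE_MS * n_updates

def t_handoff(t_info: int, t_u: int, t_exec: int) -> int:
    """Total handoff time T_handoff = Ti + Tu + Te (all in ms)."""
    return t_info + t_u + t_exec

assert t_update(4, 2) == 29   # matches the 29 ms figure in the text
```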


Table 6.1: Processing time in the target cell at update points (in ms)

  Total users            2   3   4   6   8  11  15  16  19  22
  SF used: WWW          16  32  32  48  48  64  80  80  96  96
  SF used: Video        16  16  32  32  48  48  48  64  64  80
  SF used: Voice        16  16  16  16  16  32  32  32  32  64
  Searching time         7   8   9  11  13  16  20  21  24  27
  Update time           30  20  20  20  30  40  50  70  80  80
  Overall Tu            37  28  29  31  43  56  70  91 104 107

  Total users           23  27  32  34  36  39  40  44  45
  SF used: WWW         112 128 144 144 160 160 176 192 192
  SF used: Video        80  96  96 112 112 128 128 128 144
  SF used: Voice        64  64  64  64  64  96  96  96 128
  Searching time        28  32  37  39  41  45  46  50  51
  Update time           90 180 110 120 120 130 130 140 140
  Overall Tu           118 212 147 159 161 175 176 190 191

Finally, T_e is the time to choose the new spreading factor and signal power for the handoff mobile, plus the update time of the handoff mobile if needed. This procedure needs to be re-executed after a period of time so that the spreading factor and power level of the handoff mobile can be adjusted according to the mobile's location.

It is our concern that the handoff process described in Figure 6.1 will have efficiency problems, due to the fact that it processes handoff requests sequentially. The numerical results of the overall handoff time corresponding to different handoff rates are plotted in Figure 6.3. At a moderate handoff rate (λ = 2), the handoff time is well controlled. The base station spends about 100 ms on each handoff mobile when there is no update in the target cell. When there is an update in the target cell, we observe a significant increase of processing time, represented by the spurs. However, for a high handoff rate (λ = 8), the overall processing time starts to build up for the late arrivals. The slope of the peaks of the response time increases steadily after 36 users. The highest processing time (1800 ms) was required when there were already 87 mobiles in the queue. To reduce the processing time of the handoff procedure,


one immediate improvement is to allow the handoff algorithm to handle multiple handoff requests at a time.

[Figure 6.3: Handoff time under stress arrivals; handoff time T_handoff (ms) versus the number of handoff mobiles, for λ = 2 and λ = 8.]

Our proposed batching algorithm is invoked by a timer, or by the event of new handoff requests, whichever comes first. The revised handoff algorithm is shown below:

Procedure HandoffProcess
Global variables:
    hr[];        /* handoff request queue */
    nhr[];       /* new handoff request queue */
    act_union;   /* the union of the active sets */
    act_set[i];  /* active set of the i-th handoff mobile */
while (1) {
    if (nhr[] is empty) then
        sleep(10 ms);
        goto oldqueue;
    else
        m = number of handoff requests in nhr[];
    end
    /* gather the system status of the active cells for all
       handoff requests */
    for (i = 0; i < m; i++)
        receive_cell_stat();
    end
    /* re-evaluate the BER conditions for the cells in act_union,
       decide whether an update is needed */
    for (each i-th cell in act_union)
        load = act_union[i].load + handoffnumber[i];
        for (each traffic type ty[j])
            SF[i][j] = calculate the desired spreading factor;
            if SF[i][j] != the previous spreading factor
                update[i] = true;
        end
    end
    send the new SF[i][j] to the cells in act_union
        whose update[i] == true;
    wait for ACKs from all updating cells;
    /* evaluate the BER and spreading factor for the handover
       mobiles considered in this batch */
oldqueue:
    for (each hr[k] in this batch)
        hr[k].master = decide the current master cell;
        hr[k].SF = calculate the desired SF;
        hr[k].pw = calculate the desired power level;
    end
    execute the decisions for the hr[]'s;
    wait for ACKs from the hr[]'s;
    /* each mobile will send an ACK to its master cell */
}  // end while

One design parameter of the batching scheme is how to choose the right time-out value for the timer, i.e., the batching period. A small batching period might not help much in the case of heavy handoff traffic. On the other hand, a large batching period will increase the handoff delay in general. Due to space limitations, we will discuss this issue in future publications. The batching period used here is set to 10 ms, which is the same frequency as the SNR reports from the mobile stations.

The performance of the improved scheme is plotted in Figure 6.4. The system response time for light handoff traffic is not changed. At λ = 8, the processing time at update points is reduced significantly, mostly concentrated between 600 and 800


ms. The maximum delay is 1090 ms, which is 60% of the original one. Furthermore, the build-up of the waiting time is eliminated.

[Figure 6.4: Improved handoff time under stress arrivals; system response time with batch processing versus the number of handoff mobiles, for λ = 2 and λ = 8.]

6.3 Determining the Spreading Factor for the Handoff Mobile

6.3.1 Design Factors

For the mobiles in the handoff area, the system should decide two critical parameters dynamically: the spreading factor and the transmitted signal strength. Since the determination of the spreading factor should be based on the BER performance, we need to consider all the different radio links between the mobile and the active set.

We first consider a simple decision scheme in which the decisions on the SF and the power level are separated. The SF is decided using a greedy method, called the Maximize-Data-rate (MaD) scheme, which always chooses the shortest possible SF that can be supported in the active cells. MaD can be implemented by performing the admission control algorithm in the single-cell situation for each of the active cells. However, this can result in a degradation of the link quality. If the base station advertising the short SF code is under deep fading, the reverse link will suffer a great deal, since the other BSs presumably cannot guarantee the desired BER with


a short SF. The other extreme takes the opposite choice: it always selects the longest spreading factor. This certainly can guarantee BER quality in most cases; however, radio efficiency is decreased in this rather conservative scheme.

We examine the performance of the MaD scheme coupled with two different power controls: (1) the traditional OR-ON-DOWN algorithm, and (2) following the power control commands of the master cell. We find that neither of the two examined power control methods can provide good performance in all cases. The former fails the BER requirements of the handoff mobile, and the latter causes too much interference to the other cells. The problem is solved in Section 6.4, where we propose an optimized approach based on a more precise estimation of the BER, in which the SF and power level of the mobile are derived from a constrained inequality array specifying our BER model in the handoff area. The simulation results show that the optimized scheme can provide good throughput while satisfying the necessary BER requirements for the other parties.

6.3.2 Performance of OR-ON-DOWN Power Control

The OR-ON-DOWN scheme is a closed-loop power control used in IS-95. The base stations in the control set can issue a power control command to the mobile based on the received signal strength. The power command is 1 bit of information embedded in the forward channel every 1.25 ms. An UP command means that the mobile should increase its transmission power by a fixed step, and a DOWN command means to do the opposite.

As described in numerous research works, this power control method requires only one vote to decrease the mobile transmission power, while all positive votes are required to increase it. This method is proven successful for voice-only traffic where a fixed SF scheme is used. However, in the dynamic-spreading-factor-enabled system, the mobile might not be able to communicate well with all cells in the active set, particularly with a highly loaded cell. For instance, assume a mobile


hands off from cell_0, with a load of 10 (desired SF = 32), towards cell_1, with a load of 20 (desired SF = 64). The SF assignment algorithm described above will take cell_0 as the main path, and thus choose a short spreading factor of 32. Now if cell_1 issues a DOWN command to the mobile (since the mobile is moving towards cell_1 and cell_1 finds that the signal is becoming too strong), the mobile will follow the command as required by the OR-ON-DOWN rule. This will in turn reduce the received signal strength in cell_0, and thus may increase the link BER. For the reverse link between the handing-over mobile and cell_1, the BER is also high, since the assigned spreading factor of 32 is only adequate for cell_0, and is not long enough in the heavily loaded cell_1. Therefore the overall link quality will be below expectations, since neither link can provide good performance.

Using the above example, Figure 6.5 illustrates the BER of the handoff mobile in the handoff area. The target BER is set to 2% for voice conversation. Figure 6.5(a) shows the scenario in which the mobile moves from a lightly loaded system (cell_0) to a more heavily loaded system (cell_1). The right figure shows the result when the mobile moves in the reverse direction.

In the case of forward movement (cell_0 to cell_1), the mobile can maintain an acceptable BER of 0.008 in the range of 800 meters to 1000 meters. After passing the midpoint between cell_0 and cell_1, which is at a distance of 1000 meters, the BER increases rapidly. At around the 1040-meter point, it exceeds the 0.02 threshold and continues to increase. Then the mobile is solely controlled by cell_1 and uses a new spreading factor of 64, and the BER drops to 0.01, an acceptable level.

When the mobile moves from cell_1 to cell_0, we observe the opposite BER behavior. At a range of 800 meters, when the mobile enters the handoff area, there is a sudden increase in BER. At that point, the spreading factor is adjusted to 32, according to our greedy algorithm. Meanwhile, the OR-ON-DOWN power control will limit the signal power such that cell_1, which is closer to the mobile, receives the


signal at the necessary strength. Though the received signal level is maintained in cell_1, the change of SF from 64 to 32 significantly increases the BER in cell_1. As for cell_0, which receives a weaker-than-expected signal due to the DOWN power commands issued to the mobile by cell_1, it also cannot provide the expected link quality. Overall, the BER in this period exceeds the allowed value.

[Figure 6.5: The handing-off mobile's BER under MaD with OR-ON-DOWN power control: (a) moving towards the highly loaded cell, (b) moving to the lightly loaded cell. The desired BER of 0.02 is violated inside the handoff area, where SF = 32 is used.]

6.3.3 Main-Path Power Control

A lesson from the OR-ON-DOWN scheme is that at least one of the radio links should be strong enough to provide the required BER. We denote the radio channel between the mobile and the cell allowing the shortest spreading factor as the main path. To guarantee that the main path stays good through the handoff period, we let the cell on the main path carry out the power control. The same closed-loop control is performed by issuing 1-bit power commands. The BER performance of this method is plotted in Figure 6.6a. We only show the BER observations for SF = 32, 64, and 92. The system setting is the same as in the example of the previous section.

The simulation results show that the required BER for the mobile is satisfied through the handoff period. We particularly notice that the BER decreases considerably in the half of the handoff area close to cell_1. The lowest BER of 0.00001 is observed around the right edge of the handoff region. Then the mobile changes to


SF = 64 and is controlled by cell 1. The BER also rebounds back to the 0.01 level. The low BER within the handoff area is caused by two factors: (1) during the handoff period, the mobile's link with cell 0 is the main path; this alone guarantees that the BER observed in cell 0 is not higher than 0.008; (2) as the mobile moves away from cell 0 (towards cell 1), the distance from cell 0 increases, which demands an increase of the mobile's transmitting power. Meanwhile, the distance to cell 1 is shrinking; together with the increasing transmitted signal level, the received signal at cell 1 grows very rapidly (the received signal is roughly (r_0/r_1)^a times the expected level). The high received signal strength in cell 1 compensates for the high interference in cell 1, and results in an even better link quality at cell 1.

Figure 6.6: (a) Handing-off mobile's BER for MaD/main-path power control, and (b) the received power of the handing-off mobile in the neighbor cell using the MaD scheme.

To further verify the effect of the main path method, we show the strength of the received signal at cell 1 in Figure 6.6b. As can be observed, the envelope of the signal strength increases polynomially, with random variation caused by fast fading. At the end of the handoff area (around 400 meters), the received signal is 32 times higher than the normal level.

However, the strong signal of the handoff mobile also contributes to the interference level for the other users in cell 1. The BER of the voice users (using SF = 64) in cell 1 increases rapidly as the handoff mobile approaches the target cell. The


bottom-line BER (0.02) is violated after 1100 meters, and the highest error rate is 0.37 before the end of the handoff period.

6.4 Performance Optimization in the Handoff Period

To optimize the performance in the handoff area, we propose an open-loop spreading factor and power control algorithm based on an accurate estimation of the link BER for both the handing-off mobile and the cells in the active set. The algorithm explicitly calculates the desired SF and transmitting power level for all handoff mobiles. To better understand our algorithm, we shall discuss how the BER is estimated, and by which means it is related to our target parameters of SF and signal power. Our modeling of the BER for the handoff mobile and the active set results in a series of inequalities that impose constraints on the possible values of SF and power level.

6.4.1 BER Model in the Handoff Area

The channel BER can be approximated as a function of E_b/I (SIR, or energy-per-bit-to-interference ratio). Let SIR_j,i be the SIR of the jth handoff mobile at the ith cell; we have:

    SIR_j,i = E_j/I = p_j SF_j / (m_i + I_h + f I_o)
            = P_j r_j,i^(-a) 10^(x_j/10) SF_j / (m_i + f I_o + I_h)        (6.9)

    I_o = (1/|A|) sum_{u in A} m_u                                        (6.10)

    I_h,j = sum_{k != j} P_k r_k,i^(-a) 10^(x_k/10)                       (6.11)

Here p_j is the received signal power at cell i for mobile j, SF_j is the spreading factor of the jth mobile, m_u is the system load of cell u, f represents the other-cell interference factor [41], and A is the set of active cells. In estimating the interference from the other cells (I_o), we use the average number of mobile users in the surrounding cells as the mobile density (number of mobiles per cell). However, the interference caused by the handoff mobiles, I_h, needs to be considered as a source separate from the general


other-cell-interference term, since these mobiles might transmit stronger signals than expected (as shown in Section 4).

The received signal strength is modeled by a path-loss model: p_j,i = P_j r_j,i^(-a) 10^(x/10), where P_j is the mobile's transmitting power, r_j,i is the distance between the mobile and the base station, a (= 3.5) is the path-loss order, and x is a zero-mean random power fluctuation. We also assume that each non-handoff mobile is under perfect power control of its resident cell, with a normalized received power level of 1.(3)

Using diversity decoding for the uplink channel, the best link among the active set is chosen for the jth handoff mobile. Let A_j be the active set of the jth mobile; we have SIR_j = max_{i in A_j} SIR_j,i.

Substituting (6.10) and (6.11) into equation (6.9), the SIR for the jth handoff mobile can be expressed as:

    SIR_j = max_{i in A_j}  P_j r_j,i^(-a) 10^(x_j/10) SF_j /
            ( m_i + (f/|N_i|) sum_{u in N_i} m_u + sum_{k != j} P_k r_k,i^(-a) 10^(x_k/10) )    (6.12)

The SIR for the in-cell mobiles in each active cell can be obtained similarly. Note that the interference from all the handoff mobiles is now regarded as noise, and the regular in-cell mobile users have a unit received signal strength. For traffic type t in the ith active cell with spreading factor SF_t, the SIR is approximated by:

    SIR(i, t) = SF_t / ( m_i + sum_j P_j r_j,i^(-a) 10^(x/10) + f I_o )    (6.13)

With the BER requirements for both the handoff mobiles and the in-cell mobiles (equations (6.12) and (6.13)) satisfied, we should choose the SFs such that the maximum throughput is obtained. Since the data rate is inversely proportional to the SF, the overall throughput is the sum of B_c/SF_j over all handoff mobiles. Here B_c is the chip rate of the system (5 Mcps for WCDMA). Let H denote the set of handoff mobiles at any time;

(3) In practice, the received powers of the different traffic types are not uniformly assigned; thus they cause different levels of interference.


the optimal SF/power assignments for the handoff mobiles can be obtained by solving the following problem:

    max z = sum_{j in H} B_c / SF_j                                       (6.14)
    s.t.  SIR_j >= b_j,  for all j in H                                   (6.15)
          SIR(i, t) >= b_t,  for all i in A and t in {traffic-type set}   (6.16)

6.4.2 Simplification of the Original Problem

The nonlinear optimization problem described by (6.14)-(6.16) could certainly be solved by many well-known methods (e.g., Newton's method or a gradient algorithm). However, these methods may not be efficient enough, especially when the number of variables to be optimized is large. It is therefore in our interest to find an efficient algorithm that delivers sub-optimal results.

We start with a series of simplifications of the original problem, so that the problem becomes tractable. The first step is to eliminate the maximization in (6.12). For each handoff mobile with given SF_j and P_j, the best path, according to (6.12), is the one with the shortest distance r_j,i and the lowest load m_i. However, the two factors (r_j,i and m_i) may conflict with each other, in that the closest cell might not be the one with the least in-cell traffic. We therefore define the variable w_i = m_i r_j,i^a to determine the main path for mobile j. Thus (6.12) can be simplified by the following rule:

    SIR_j = P_j r_j,i(j)^(-a) 10^(x_j/10) SF_j /
            ( m_i(j) + f I_o + sum_{k != j} P_k r_k,i(j)^(-a) 10^(x_k/10) ),
    i(j) = argmin_{k in A_j} w_k                                          (6.17)

After neglecting the random fluctuations of the path loss and rearranging (6.15), the SIR constraint for the handoff mobile becomes:

    P_j r_j,i(j)^(-a) SF_j / ( m_i(j) + f I_o + sum_{k != j} P_k r_k,i(j)^(-a) ) >= b_j    (6.18)


To further simplify (6.18), notice that P_k r_k,i(j)^(-a) is nonnegative; thus

    m_i(j) + f I_o + sum_{k in H} P_k r_k,i(j)^(-a)
        >= m_i(j) + f I_o + sum_{k != j} P_k r_k,i(j)^(-a).

Therefore, (6.18) can be replaced by the stronger condition

    P_j r_j,i(j)^(-a) SF_j / ( m_i(j) + f I_o + sum_{k in H} P_k r_k,i(j)^(-a) ) >= b_j.

Removing the random variable and rearranging the expression(4) in (6.16), the simplified problem now becomes:

    max z = sum_{j in H} B_c / SF_j                                       (6.19)
    s.t.  P_j r_j,i(j)^(-a) SF_j / ( m_i(j) + f I_o + sum_{k in H} P_k r_k,i(j)^(-a) ) >= b_j    (6.20)
          sum_{k in H} P_k r_k,i^(-a) <= SF_t/b_t - m_i - f I_o,  for all i in A                 (6.21)
          SF_j, P_j >= 0,  for all j in H                                                        (6.22)

It can easily be seen that the above constraints are feasible only if SF_t/b_t - m_i - f I_o > 0 for each active cell i. Any positive power vector satisfying (6.21) can be used to derive a corresponding SF_j from (6.20), since given a positive P_j we can select an SF_j big enough to exceed the right-hand side of the constraint in (6.20).

It is now time to rewrite the objective function as a linear expression. Our basic strategy is to replace the term 1/SF_j in (6.19) by a linear combination of P_j. We notice that the first-order partial derivative of the objective function is always negative (dz/dSF_i = -B_c/SF_i^2), which means that the optimal SF_i must lie on the boundary defined by the constraints. Once the constraints and the objective function are transformed into linear form, we can use a linear programming method to solve the problem.

For the sake of convenience, we define

    c_i = SF_t/b_t - m_i - f I_o,  for all i in A                         (6.23)

(4) By ignoring the random components present in the actual system model, our solution is only valid as a stationary approximation.


the constraints in (6.20) can be rewritten and, using sum_{k in H} P_k r_k,i^(-a) <= c_i from (6.21), tightened to

    1/SF_j <= P_j r_j,i(j)^(-a) / ( b_j ( m_i(j) + f I_o + c_i(j) ) )     (6.24)

Now define a_j,i(j) = r_j,i(j)^(-a) / ( ( m_i(j) + f I_o + c_i(j) ) b_j ) in (6.24). Replacing the revised (6.24) into the objective function z in (6.19) eliminates the dependence of the objective function on the decision variable SF_j, and the problem is reduced to:

    max z = sum_{j in H} B_c a_j,i(j) P_j
    s.t.  sum_{k in H} P_k r_k,i^(-a) <= c_i,  for all i in A
          P_j >= 0,  for all j in H                                       (6.25)

(6.25) is a standard linear programming problem and can be solved by the popular simplex method. After the optimal power vector {P_j} is obtained, we can derive the corresponding {SF_j} from (6.20) by turning the inequality into an equality. With (6.25), it is easily seen that a handoff request should be denied if the search for a feasible solution fails, which means that the new mobile cannot be supported without violating the BER requirements of the other mobiles.

Unfortunately, the optimal solution of (6.25) only provides an upper bound for the problem described in (6.14)-(6.16), and sometimes cannot be reached. The optimal power vector for the linear programming problem in (6.25) may contain zeros for some P_j; this is very common when the number of constraints is less than the number of decision variables. However, according to the constraints in (6.20), the product SF_j P_j must be greater than a positive value, which is impossible if P_j = 0. Thus some additional steps must be taken in such cases. In the next section, we discuss the heuristic algorithm that seeks the sub-optimal solution of (SF_j, P_j) for all the handoff mobiles.


6.4.3 Proposed Sub-Optimal SF/Power Decision Algorithm

From the previous discussion, we know that the optimal solution of the linear programming problem (6.25) might not result in a feasible solution for the original problem, due to the possible zero-value power assignments. Furthermore, the allowed spreading factors in reality can only be taken from a limited set of positive integers. This necessitates a post-processing step that converts the optimal vector into a feasible solution. For example, if a spreading factor has a value of 35.5, we should replace it with the closest greater value that is in the feasible solution space. Besides this, due to the background noise, we must also make sure that the signal-to-noise ratio P_j/n is greater than a threshold, say 10-15 dB. These considerations lead us to a hybrid algorithm that combines the linear programming method with a re-evaluation of the constraints of the original problem. The algorithm is named HYB.

Given the input of the original problem in (6.14)-(6.16) and the reduced linear programming problem in (6.25), the algorithm goes through a series of steps. First, we find the optimal power vector for problem (6.25) using a standard simplex method. This is followed by a check to see whether any P_j in the solution vector equals zero, or whether its respective SNR is below the threshold.

If neither condition occurs, we can move ahead and obtain the corresponding SF_j using (6.20). The resulting SF_j is usually a non-integer positive value, which is further processed to yield a larger SF_j in the feasible space. The new vector of SF_j is substituted into (6.20) again, resulting in a new P_j vector. The new P_j vector is smaller than the first power vector we obtained for (6.25), since we are actually using a larger SF_j. At this stage, we have all the decision variables needed and the algorithm terminates.

Otherwise, a transform is applied that further simplifies the linear problem in (6.25). We then fix the problematic decision variables to the minimal


power level and substitute them back into (6.25). This generates a new linear programming problem with a reduced number of decision variables. The revised problem is processed again using the simplex method, and the same check is applied to the new optimal vector. It can be seen that the algorithm eventually terminates, and the maximum number of iterations is less than the number of handoff mobiles. The pseudocode for the above procedure is given below:

Procedure Non_Linear_Simplex
Input:
    b_j,    the desired SIR for the jth handoff mobile;
    r_j,i,  the estimated distance between the jth mobile and the ith cell;
    m_i,    the load of cell i;
    SF_t,   the least SF for traffic type t;
    b_t,    the least SIR for traffic type t;
Known system parameters:
    u:      path-loss degree;
    SF_max: the maximum possible spreading factor;
    f:      other-cell interference ratio;
    eps:    the minimum possible power level;

/* calculate some convenient parameters: */
I_o = (1/|A|) * sum_{u in A} m_u
FOR i in A and j in H DO
    c_i   = SF_t/b_t - m_i - f*I_o;
    a_j,i = r_j,i^(-u) / ((m_i + f*I_o + c_i) * b_j);
ENDFOR
The decision vectors are P = {P_j} and SF = {SF_j};
iteration:
    use the simplex method on
        max z = sum_{j in H} B_c * a_j,i * P_j             (*)
        s.t.  sum_{k in H} P_k * r_k,i^(-u) <= c_i         (**)
              P_j >= 0 for all j in H
    and let P = {P_1, P_2, ..., P_J} be the result.
    calculate P~ = { j | P_j < eps, P_j in P }
    IF P~ == emptyset THEN
        case 1:


            calculate SF_j using the equation SF_j = 1/(P_j * a_j,i) for j in P;
            GOTO end;
    ELSE
        case 2:
            FOR each j in P~ DO
                P_j = max{ eps, 1/(a_j,i * SF_max) }
            ENDFOR
            substitute the values of the vector P_{P~} into (*) and (**);
            reduce the decision vectors: P = P - P_{P~}; SF = {SF_j | j in P};
            GOTO iteration;
    ENDIF
end:
    /* finalize the spreading factor and power level vectors */
    FOR j in H DO
        SF'_j = max{ round(SF_j), 16 * ceil(SF_j/16) };
        calculate P'_j = 1/(SF'_j * a_j,i);
    ENDFOR
    calculate the objective function as in (*);
EndProcedure

6.4.4 Numerical Results and Performance Discussion

To evaluate the performance of the HYB algorithm, we examined handoff scenarios with different traffic. Our numerical experiments assume a two-cell handoff situation, where the handoff traffic moves from the original cell (C_o) to the target cell (C_t). C_o and C_t have 5 and 10 in-cell mobiles kept active all the time, respectively. We assume that at least one mobile of each traffic type exists in each cell; thus the sum of the transmission power of the handoff traffic is most strictly limited. The handoff traffic contains 10 mobiles, distributed over the 450-meter handover area at equal distances.

The performance results to be examined have 4 parts: (1) the overall throughput of the handoff mobiles, (2) the assignment of SF, (3) the power assignments, and (4) the BER of each handoff mobile. To investigate how the performance depends on the location of the handoff mobiles, we studied the results for different location offsets


of the handoff mobiles. A location configuration of offset 10 means that the first handoff mobile is 10 meters from the leftmost point of the handoff area, the next mobile is 50 meters to its right, and so on. In our experiments, up to fifty different offset configurations are studied.

In the first case of our experiments, all handoff mobiles carry WWW traffic. The numerical results are shown in Figures 6.7a and 6.7c. In the SF assignment for offset 0, all handoff mobiles are assigned SF = 256 except for the 8th mobile, which is assigned SF = 32. For the handoff mobiles with the higher SF value, the corresponding power level is low (in the range of 1 to 3). The 8th mobile has a power level of 11, which compensates for the potentially high BER when using such a small SF. The other location configurations show similar results, except that the first mobile is assigned a spreading factor between 32 and 256. The BER requirement of the handoff mobiles is satisfied at all mobile locations within the handoff area. Most of the BERs concentrate in the range between 0.00001 and 0.000045, and the worst BER is observed at the 8th mobile. The close-to-desired BER performance shows that the system resources are shared equally. The overall throughput in the handoff area is maintained between 380 and 400 kbps. This is about a 20% increase compared to the case without the HYB algorithm, where the SF of each handoff mobile is fixed to 160, resulting in a throughput of 312 kbps.

The second experiment was performed with all-video handoff traffic. We observed that the first and the 8th handoff mobiles are assigned small spreading factors. The other mobiles still use the largest SF (= 256). This corresponds to the two peak power levels in the power assignment. For instance, the offset 0 configuration has an SF assignment of (32, 256, 256, 256, 256, 256, 256, 32, 256, 256) and a power vector of (6.4, 1, 1.3, 1.4, 1.5, 1.4, 1.5, 15.2, 2.4, 2).

The BER levels of the handoff mobiles are also controlled below the maximum allowed value. The highest BER is again reached at the 8th mobile, with BER = 0.00031.


The lowest BER is 0.0001. The throughput for video traffic can vary from 600 kbps to 950 kbps, depending on the location of the handoff mobiles. Without the HYB algorithm, the handoff algorithm has to use a more conservative strategy and choose the SF by treating the handoff mobiles as in-cell traffic in the target cell. This results in an SF of 112 for all handoff mobiles, corresponding to a throughput of 446.4 kbps. Thus, even the worst case of HYB benefits from a 26% performance gain.

Figure 6.7: Performance for WWW and video traffic: (a) SF assignment for WWW traffic, (b) SF assignment for video traffic, (c) power level for WWW traffic, and (d) power level for video traffic.


CHAPTER 7
CONCLUSIONS AND FUTURE WORK

The success of distributed multimedia systems relies on solutions to three key aspects: (1) video encoding algorithms, (2) video decoding algorithms, and (3) network transport protocols that guarantee timely and error-free delivery of the video stream. In the first part of this dissertation, we investigated generic, pure-software-based parallel decoding algorithms. Built on a master/slave architecture, we parallelized the sequential MPEG-2 decoding algorithm in a data-partition and a data-pipeline manner.

In the data-partition scheme, the decoding task is parallelized at the frame level. In order to reduce the communication overhead, we proposed a quick partition algorithm, which finds a close-to-optimal partition efficiently. The data-partition scheme is shown to perform well in small and moderate settings, but it is not as scalable as the data-pipeline scheme, due to the overhead in the master node and the high bandwidth requirement. The data-pipeline scheme is able to produce high gain from parallel processing with little overhead. This is made possible by the frame-level parallelism and the reduction of referencing data through optimizing the block size. Meanwhile, the overhead in the master node is kept low by adopting a quick bitstream analysis.

In the second part of this dissertation, we proposed a MAC-layer protocol for WCDMA systems that supports multimedia traffic with guaranteed quality. The proposed protocol also maximizes the number of supported users. The major contributions are:

• The admission control is investigated in detail.


• The results show that even though the number of users increases, the average BER is maintained below the upper limits.
• Improvement schemes are proposed to reduce the admission delays.
• A new scheduling algorithm is discussed to improve the system throughput.
• The dynamic spreading factor protocol is extended to the multi-cell environment, and a new soft handoff algorithm is proposed.
• A batching method is studied to reduce the handoff delay, particularly the update time.
• A joint-optimization algorithm is proposed for the allocation of spreading factor and transmitting power for the handoff mobile stations.

Our studies provide solutions and recommendations for some important aspects of distributed multimedia systems. However, there is still much work ahead to fulfill the ultimate goal: instant access to high-quality multimedia information from anywhere, at any time. In my future research, I will continue my efforts towards this goal. Among the many potential topics in distributed multimedia systems, I believe that the following three issues need to be addressed: (1) how to provide video-on-demand in the WCDMA downlink and support a large number of users, (2) how to decode in real time an MPEG-4 document that consists of several media streams, and (3) how to make the video encoding process more accurate in bit allocation and adaptive to bit-rate fluctuation. Further elaboration on these topics is given in the following sections.

7.1 Improving Transport Layer Performance in the WCDMA Downlink

HSDPA (High Speed Downlink Packet Access) of 3GPP (Third Generation Partnership Project) initiates the effort to provide a high-speed downlink for multimedia traffic, in particular video traffic. With dynamic spreading factors, adaptive modulation and coding (AMC), and link-level retransmission, the achievable bandwidth could be up to 10 Mbps. With HSDPA, many video applications are becoming more and more realistic, such as news-on-demand and video-on-demand. However, studies also show


that the system performance degrades significantly when mobile units move away from the base station, and mobiles in the boundary area of the cell suffer the most.

From the transport point of view, link-level retransmission directly increases the delay of the upper-layer ACK, which can trigger the rate control algorithm built into the transport layer. For example, if the delay of the upper-layer ACK is larger than the TCP timeout value, the source transmission window will be reduced by half (as in TCP Tahoe). Due to the congestion avoidance mechanism at the source, the transmitting rate will not recover immediately even if the link condition improves shortly afterwards.

We believe that the transport layer should be isolated from link-level fluctuation. This is particularly important for delay-sensitive applications, such as streaming video, audio on demand, and interactive multimedia applications. More effective methods can be investigated to provide a smooth transport layer on top of the dynamics of the MAC-layer service. In particular, the feasibility of transport-layer buffering schemes in the base station or mobile switching center can be investigated in the near future.

7.2 Parallel MPEG-4 Decoding

The parallel decompression of MPEG-4 streams introduces another set of problems. Unlike MPEG-2, an MPEG-4 video consists of several sub-streams that require different decoding procedures, with different decoding complexities. It can contain graphic objects, natural video, animation, texture background, and other media streams. These sub-streams are synchronized both internally and with each other. Thus a parallel MPEG-4 decoder needs to schedule the decoding subtasks such that the system load is balanced. Moreover, MPEG-4 introduces much more interactivity between the end user and the content provider, which makes a quick response time important. As one can imagine, there are many ways to parallelize the


decoding of MPEG-4; how it can be approached effectively will be explored in our future work.

7.3 Optimal Rate-Distortion Control in Video Encoding

Ideally, QoS-guaranteed communication (e.g., ATM [6] and IPv6 [7]) should be used for multimedia transport. However, we speculate that the majority of streaming video will still be delivered over best-effort networks (e.g., the Internet) in the near future. The end-to-end video quality can be greatly affected by the dynamics of the available bandwidth. Thus it is essential that the video encoder make the best use of the available bandwidth and generate a compressed bitstream of the highest possible quality. Bandwidth-adaptive video encoding technology, which dynamically updates the video encoding parameters according to the current network situation, is expected to deliver better video quality with limited bandwidth. Some related work in this area can be found in [49-52]. These works focus on adjusting the quantization parameter of the DCT coefficients based on particular rate-distortion models. However, their models assume a fixed image resolution, which imposes an unnecessary limitation on further exploiting rate-distortion theory and results in poor video quality when the bit budget is low. When the available data rate is too low, a large quantization scale is used in the traditional rate control methods in order to maintain the desired bit rate. The larger quantization scale degrades the image quality severely, since detailed image structure cannot survive a coarse quantization.

From our preliminary experimental results, we observed that video quality can be significantly improved by re-rendering the image at a smaller resolution and using a relatively precise quantization scale. The trade-off here is: when the bit budget runs low, we can either down-sample the image to a smaller resolution and use a


rather precise quantization parameter, or directly adopt a coarse quantization scale with the original image size. Therefore a big challenge is: for a given bit rate, how should we decide the best image resizing factor and the quantization parameters to reconstruct the optimal video quality? The answer to this question requires a new rate-distortion model that can capture the video content and the resizing factor. We are currently investigating the accuracy of several candidate models, and experiments are under development to verify them.


APPENDIX A
C SOURCE CODE FOR PARALLEL MPEG-2 DECODER

/********************************************
 * data partition algorithm
 ********************************************/
extern int totalpicturenum;
extern int workstationnum;
extern int First_Picture;
extern int myrank;

void DistributeData_Partition()
{
    int i, j;
    int code;

    totalpicturenum++;
    frameptr[0]->lpFile = frameptr[0]->rawData;
    while (1) {
        Global_next_start_code();
        code = Global_Show_Bits32();
        if (code == PICTURE_START_CODE)
            break;
    }
    j = 0;
    do {
        frameptr[0]->lpFile[j] = Global_Get_Byte();
        j++;
    } while (Global_Show_Bits32() != PICTURE_START_CODE);


122 frameptr[0]->ld->Incnt=32; frameptr[0]->ld->Rdptr=frameptr[0]->ld->Rdbfr+2048; frameptr[0]->ld->Rdmax=frameptr[0]->ld->Rdptr; frameptr[0]->ld->Bfr=0; frameptr[0]->rawsize=j; frameptr[0]->picturenum=totalpicturenum; frameptr[0]->Float_Counter=0; totalpicturenum++; printf("thesizeofrawsendingpictureis%dtheactual sizeofthispacketsis%d\n",sizeof(structPictureBuffer),j); for(i=0;i

123 Sequence_Framenum=0; RCVsize=(frameptr[0]->horizontal_size*frameptr[0]->vertical_size) /workstationnum; Acc_Time=0; First_Picture=1; for(;;){ intj=0; wj_tmp1=times(&tbuf); DistributeData_Partition(); if(Sequence_Framenum==0)decodeframe(frameptr[0],0); for(j=0;j

124 intstatus; char*tmp; i=0; tmp=frameptr[0]; MPI_Recv(tmp,sizeof(structPictureBuffer)/workstationnum, MPI_BYTE,0,100,MPI_COMM_WORLD,&status); if(((structPictureBuffer*)frameptr[0])->mb_width==-1) { frameptr[0]->mb_width=-1; return; } //receiverawdatafromthemaster frameptr[0]->lpFile=frameptr[0]->rawData; frameptr[0]->ld=&(frameptr[0]->base); frameptr[0]->ld->Incnt=32; frameptr[0]->ld->Rdptr=frameptr[0]->ld->Rdbfr+2048; frameptr[0]->ld->Rdmax=frameptr[0]->ld->Rdptr; frameptr[0]->ld->Bfr=0; frameptr[0]->Float_Counter=0; } voidsendPictureData_Partition() { inti,j; intcode; intsizeL; intsizeC; i=0;


    sizeL = (frameptr[0]->Coded_Picture_Width *
             frameptr[0]->Coded_Picture_Height) / workstationnum;
    MPI_Send(Dithered_Image[0], sizeL, MPI_BYTE, 0, 100, MPI_COMM_WORLD);
}

int decodeframe_Partition(struct PictureBuffer *frame, int mrank)
{
    int Return_Value;

    /* assume the frame rate for one node is 5 f/s; one picture takes
       200 ms, so sleep 200/workstationnum ms */
    Return_Value = Headers(frame);  /* ret == 1 means a picture header was met */
    if (Return_Value == 0)
        return -1;
    else if (Return_Value != 1)
        return -1;
    if (First_Picture) {
        Initialize_Sequence(frame);
        First_Picture = 0;
        if (myrank == 0)
            ;  /* statement truncated in the original listing */
    }
    return 0;
}

/***********************************************
 * master node main(), data pipeline scheme
 ***********************************************/
void main(argc, argv)
int argc;
char *argv[];
{
    int ret, code;


126 intcount; inti,j; intindex; intsos=1; intdummyargc; intsize; FILE*tmpfp,*grpfp; grpfp=fopen("/amd/birch/export/class00/cen4500fa00/yhe/ mpeg2dec12/tm.pg","r"); printf("grpfile:%d\n",grpfp); fscanf(grpfp,"#%d%d\n",&FAKESCALE_BMP,&FAKESCALE_NET); fclose(grpfp); decodestate=START; MPI_Init(&argc,&argv); MPI_Comm_rank(MPI_COMM_WORLD,&myrank); MPI_Comm_size(MPI_COMM_WORLD,&workstationnum); workstationnum--; //themasterwillshouldbeexculdedfromtheslavenum; sprintf(logfilename,"e:\\temp\\logfile%d.txt",myrank); logfp=fopen(logfilename,"w"); if(myrank==0)/*Iamthemaster*/ { printf("MAsterprocessingup!\n"); D_Float_Counter=0; D_Info_Flag=1; for(count=0;count

127 count,count,argv[count]); } Clear_Options(); /*decodecommandlinearguments*/ Process_Options(dummyargc,dummyargv); for(i=0;ild=&(frameptr[i]->base); /*selectbaselayercontext*/ /*openMPEGbaselayerbitstreamfile(s)*/ lpFile=(unsignedchar*)malloc(6553600); tmpfp=fopen("mei60f.m2v","r"); fread(lpFile,1,SUPER_BUF_SIZE,tmpfp); fclose(tmpfp); Float_Counter=0;/*SUPERGLOBAL*/ Last_Round=0; Last_Frame=0; D_Info_Flag=1;/*Wesetthistoletthe displayerhavetheinforatthefirsttime*/ InitiateTime(); Global_Bfr=0; Global_Incnt=0; Float_Counter=0;


128 do { Global_Bfr|=((unsignedchar)lpFile[Float_Counter])<< (24-Global_Incnt); Float_Counter++; Global_Incnt+=8; } while(Global_Incnt<=24); if(Global_Show_Byte()==0x47) { sprintf(Error_Text,"Decodercurrentlydoesnot parsetransportstreams\n"); Error(Error_Text); } Global_next_start_code(); code=Global_Show_Bits32(); switch(code) { caseSEQUENCE_HEADER_CODE: break; casePACK_START_CODE: /*System_Stream_Flag=1;*/ printf("wedon’tsupportsystemstream\n"); caseVIDEO_ELEMENTARY_STREAM: /*System_Stream_Flag=1;*/ printf("wedon’tsupportsystemstream\n"); break;


129 default: sprintf(Error_Text,"Unabletorecognizestreamtype\n"); Error(Error_Text); break; } Initialize_Decoder(); ret=Decode_Bitstream(); }/*mainprocessend*/ else{ intmydecodednum; structtmstbuf; clock_twj_tmp1,wj_tmp2; printf("thisisslaverwithrank%d\n",myrank); fflush(NULL); Clear_Options(); /*decodecommandlinearguments*/ Process_Options(dummyargc,dummyargv); for(i=0;ild= &(frameptr[i]->base);/*selectbaselayercontext*/ Float_Counter=0;/*SUPERGLOBAL*/ Last_Round=0; Last_Frame=0;


130 D_Info_Flag=1;/*Wesetthistolet thedisplayerhavetheinforatthefirsttime*/ first_dither=1; InitiateTime(); Initialize_Decoder(); basenum=0; rfp=fopen(rfpfilename,"wb"); pfp=fopen(pfpfilename,"rb"); First_Picture=1; picorder=0; mydecodednum=0; receiveDistributeData_Partition(); for(j=0;j<60;j++){ picorder=j*PARALLELSIZE*workstationnum+(myrank-1)*PARALLELSIZE; mydecodednum=j*PARALLELSIZE; wj_tmp1=times(&tbuf); printf("dothe%dto%d\n",picorder,picorder+PARALLELSIZE); //ExtractData decodeframe_Partition(frameptr[0],myrank); wj_tmp2=times(&tbuf); sExtractAcc_Time+=(wj_tmp2-wj_tmp1); printf("Theextractacctimeis%d\n",sExtractAcc_Time); //Sendresultback if(first_dither) { first_dither=0; dither_initial();


131 } //Ditherthegroupofpicture.theresultiswindowbmppicstoredin //globalDithered_Image if(frameptr[0]->progressive_sequence){ Dither_Frame_Partition(myrank); }else{ if((frameptr[0]->picture_structure==FRAME_PICTURE&& frameptr[0]->top_field_first)|| frameptr[0]->picture_structure==BOTTOM_FIELD) { Dither_Top(i,Dithered_Image[i]); Dither_Bottom(i,Dithered_Image2[i]); }else{ Dither_Bottom(i,Dithered_Image[i]); Dither_Top(i,Dithered_Image2[i]); } } wj_tmp2=times(&tbuf); sDithAcc_Time+=(wj_tmp2-wj_tmp1); printf("TheDitheracctimeis%d\n",sDithAcc_Time); wj_tmp1=times(&tbuf); sendPictureData_Partition(); wj_tmp2=times(&tbuf); sSendAcc_Time+=(wj_tmp2-wj_tmp1); //RecieveRawData if(frameptr[0]->mb_width==-1)break; receiveDistributeData_Partition();


132 //RECEIVErawdatafrommaster,aftergetthedata,it fflush(NULL); } fclose(pfp); fclose(rfp); fclose(logfp); } fclose(logfp); MPI_Finalize(); //ExitProcess(0); return; } voidInitialize_Sequence(frame) structPictureBuffer*frame; { intcc,size; inti,k; intMBAmax; staticintTable_6_20[3]={6,8,12}; /*forceMPEG-1parametersforproperdecoderbehavior*/ /*seeISO/IEC13818-2sectionD.9.14*/ for(i=0;ibase.MPEG2_Flag) { frameptr[i]->progressive_sequence=1;


133 frameptr[i]->progressive_frame=1; frameptr[i]->picture_structure=FRAME_PICTURE; frameptr[i]->frame_pred_frame_dct=1; frameptr[i]->chroma_format=CHROMA420; frameptr[i]->matrix_coefficients=5; } /*roundtonearestmultipleofcodedmacroblocks*/ frameptr[i]->mb_width=(frameptr[i]->horizontal_size+15)/16; frameptr[i]->mb_height=(frameptr[i]->base.MPEG2_Flag&& !frameptr[i]->progressive_sequence)? 2*((frameptr[i]->vertical_size+31)/32) :(frameptr[i]->vertical_size+15)/16; frameptr[i]->Coded_Picture_Width=16*frameptr[i]->mb_width; frameptr[i]->Coded_Picture_Height=16*frameptr[i]->mb_height; /*ISO/IEC13818-2sections6.1.1.8,6.1.1.9,and6.1.1.10*/ frameptr[i]->Chroma_Width=(frameptr[i]-> chroma_format==CHROMA444)? frameptr[i]->Coded_Picture_Width :frameptr[i]->Coded_Picture_Width>>1; frameptr[i]->Chroma_Height=(frameptr[i]-> chroma_format!=CHROMA420)? frameptr[i]->Coded_Picture_Height :frameptr[i]->Coded_Picture_Height>>1; frameptr[i]->block_count=Table_6_20[frameptr[i]-> chroma_format-1]; for(cc=0;cc<3;cc++) {


134 if(cc==0) size=frameptr[i]->Coded_Picture_Width* frameptr[i]->Coded_Picture_Height; else size=frameptr[i]->Chroma_Width*frameptr[i]->Chroma_Height; if(!(frame_pool[i][cc]=(unsignedchar*)malloc(size))) Error("frame_pool[K]mallocfailed\n"); } MBAmax=frameptr[i]->mb_width*frameptr[i]->mb_height; if(!(frameptr[i]->data=malloc(MBAmax*sizeof(structMBlock)))) Error("framptr->datamallocerror\n"); global_microblocks[i]=frameptr[i]->data; } for(cc=0;cc<3;cc++) { forward_reference_frame[cc]=frame_pool[0][cc]; backward_reference_frame[cc]=frame_pool[0][cc]; if(frame->base.scalable_mode==SC_SPAT) { /*thisassumeslowerlayeris4:2:0*/ if(!(llframe0[cc]=(unsignedchar*) malloc((lower_layer_prediction_horizontal_size* lower_layer_prediction_vertical_size)/(cc?4:1)))) Error("llframe0mallocfailed\n"); if(!(llframe1[cc]=(unsignedchar*) malloc((lower_layer_prediction_horizontal_size* lower_layer_prediction_vertical_size)/(cc?4:1))))


                Error("llframe1 malloc failed\n");
        }
    }

    /* SCALABILITY: Spatial */
    if (frame->base.scalable_mode == SC_SPAT)
    {
        if (!(lltmp = (short *)malloc(
                lower_layer_prediction_horizontal_size *
                ((lower_layer_prediction_vertical_size *
                  vertical_subsampling_factor_n) /
                 vertical_subsampling_factor_m) * sizeof(short))))
            Error("lltmp malloc failed\n");
    }
}

/* master func: data pipeline version */
void DistributeData()
{
    int i, j;
    int code;

    for (i = 0; i < PARALLELSIZE; i++) {
        frameptr[i]->lpFile = frameptr[i]->rawData;
        while (1) {
            Global_next_start_code();
            code = Global_Show_Bits32();
            if (code == PICTURE_START_CODE) break;
        }
        j = 0;
        do {
            frameptr[i]->lpFile[j] = Global_Get_Byte();
            j++;
        } while (Global_Show_Bits32() != PICTURE_START_CODE);
        frameptr[i]->ld->Incnt = 32;
        frameptr[i]->ld->Rdptr = frameptr[i]->ld->Rdbfr + 2048;
        frameptr[i]->ld->Rdmax = frameptr[i]->ld->Rdptr;
        frameptr[i]->ld->Bfr = 0;
        frameptr[i]->rawsize = j;
        frameptr[i]->picturenum = totalpicturenum;
        frameptr[i]->Float_Counter = 0;
        totalpicturenum++;
        MPI_Send(frameptr[i], sizeof(struct PictureBuffer),
                 MPI_BYTE, ((basenum / PARALLELSIZE) % workstationnum) + 1,
                 100, MPI_COMM_WORLD);
    }
}

void TerminateSlave()
{
    int i;

    printf("TerminateSlave!\n");
    /* getch(); */
    frameptr[0]->mb_width = -1;
    for (i = 1; i <= workstationnum; i++) {
        printf("Terminate slave %d\n", i);
        MPI_Send(frameptr[0], sizeof(struct PictureBuffer),
                 MPI_BYTE, i, 100, MPI_COMM_WORLD);
    }
}

/* slave func */
void receiveDistributeData()
{
    int i, j, actread;
    int code;
    int status;
    char *tmp;

    i = 0;
    while (1)   /* for(i=0;i<PARALLELSIZE;i++) */
    {
        if (i >= PARALLELSIZE) break;
        jjj = i;
        tmp = frameptr[i];
        MPI_Recv(tmp, sizeof(struct PictureBuffer),
                 MPI_BYTE, 0, 100, MPI_COMM_WORLD, &status);
        i = jjj;
        if (((struct PictureBuffer *)frameptr[i])->mb_width == -1)
        {
            frameptr[0]->mb_width = -1;
            printf("jjj=%d\n", jjj);
            return;
        }
        /* receive raw data from the master */
        frameptr[i]->lpFile = frameptr[i]->rawData;
        frameptr[i]->ld = &(frameptr[i]->base);
        frameptr[i]->ld->Incnt = 32;
        frameptr[i]->ld->Rdptr = frameptr[i]->ld->Rdbfr + 2048;
        frameptr[i]->ld->Rdmax = frameptr[i]->ld->Rdptr;
        frameptr[i]->ld->Bfr = 0;
        frameptr[i]->Float_Counter = 0;
        i++;
    }
}

char picparam[100];

/* slave func */
/* make the bmp picture to send, instead of y,u,v data */
void sendPictureData()
{
    int i, j;
    int code;
    int sizeL;
    int sizeC;

    i = 0;
    while (1)   /* for(i=0;i<PARALLELSIZE;i++) */
    {
        if (i >= PARALLELSIZE) break;
        sizeL = (frameptr[i]->Coded_Picture_Width *
                 frameptr[i]->Coded_Picture_Height) / FAKESCALE_NET;
        sizeC = (frameptr[i]->Chroma_Width *
                 frameptr[i]->Chroma_Height) / FAKESCALE_NET;
        sprintf(picparam, "%d %d %d %d %d %d %d %d %d %d",
                frameptr[i]->matrix_coefficients,
                frameptr[i]->Coded_Picture_Width,
                frameptr[i]->Coded_Picture_Height,
                frameptr[i]->Chroma_Width,
                frameptr[i]->Chroma_Height, frameptr[i]->chroma_format,
                frameptr[i]->progressive_sequence,
                frameptr[i]->picture_structure,
                frameptr[i]->top_field_first, frameptr[i]->picture_coding_type);
        /* FIRST send the picture param, then send picture bmp data */
        printf("send picture %d\n", i);
        if (frameptr[i]->progressive_sequence)
        {
            MPI_Send(picparam, 100, MPI_CHAR, 0, 100, MPI_COMM_WORLD);
            MPI_Send(Dithered_Image[i], sizeL, MPI_BYTE,
                     0, 100, MPI_COMM_WORLD);
        }
        else
        {
            rett = MPI_Send(picparam, 100, MPI_CHAR, 0, 100, MPI_COMM_WORLD);
            rett = MPI_Send(Dithered_Image[i], sizeL,
                            MPI_BYTE, 0, 100, MPI_COMM_WORLD);
        }
        i++;
    }
}

int decodeframe(struct PictureBuffer *frame, int basenum)
{
    int Return_Value;

    Return_Value = Headers(frame);   /* ret=1 means picture_head met */
    if (Return_Value == 0) return -1;
    else if (Return_Value != 1) return -1;
    if (First_Picture) {
        Initialize_Sequence(frame);
        First_Picture = 0;
        if (myrank == 0)
    }
    getframe(frame, basenum);
    return 0;
}

void showdecodepicture(int basenum)
{
    decodehighhalf(frameptr[(basenum) % PARALLELSIZE]);
    decodelowhalf(frameptr[(basenum) % PARALLELSIZE]);
    if (myrank == 0) frame_reorder(frameptr[(basenum) % PARALLELSIZE],
                                   Bitstream_Framenum, Sequence_Framenum);
    if (!frameptr[(basenum) % PARALLELSIZE]->Second_Field)
    {
        Bitstream_Framenum++;
        Sequence_Framenum++;
    }
}
static int video_sequence(Bitstream_Framenumber)
int *Bitstream_Framenumber;
{
    int Return_Value, i;
    int sizeL, sizeC;
    int lastframe = 0;
    int status;
    int rett;
    clock_t wj_tmp1, wj_tmp2;
    struct tms tbuf;

    Bitstream_Framenum = *Bitstream_Framenumber;
    Sequence_Framenum = 0;
    pfp = fopen(pfpfilename, "wb");
    DistributeData();
    decodeframe(frameptr[basenum % PARALLELSIZE], basenum);
    basenum += PARALLELSIZE;
    for (i = 0; i
    {
        if (i >= PARALLELSIZE) break;
        jjj = i;
        printf("receive picture %d\n", i + basenum);
        /* Receive a picture */
        rett = MPI_Recv(picparam, 100, MPI_CHAR,
                        (((basenum - (workstationnum) * PARALLELSIZE) /
                          PARALLELSIZE) % workstationnum) + 1, 100,
                        MPI_COMM_WORLD, &status);
        sscanf(picparam, "%d %d %d %d %d %d %d %d %d %d",
               &(frameptr[0]->matrix_coefficients),
               &(frameptr[0]->Coded_Picture_Width),
               &(frameptr[0]->Coded_Picture_Height),
               &(frameptr[0]->Chroma_Width),
               &(frameptr[0]->Chroma_Height),
               &(frameptr[0]->chroma_format),
               &(frameptr[0]->progressive_sequence),
               &(frameptr[0]->picture_structure),
               &(frameptr[0]->top_field_first),
               &frameptr[0]->picture_coding_type);
        sizeL = (frameptr[0]->Coded_Picture_Width *
                 frameptr[0]->Coded_Picture_Height) / FAKESCALE_NET;
        sizeC = (frameptr[0]->Chroma_Width *
                 frameptr[0]->Chroma_Height) / FAKESCALE_NET;
        if (frameptr[0]->progressive_sequence)
        {
            if (frameptr[0]->picture_coding_type == B_TYPE)
                MPI_Recv(frame_pool[0][0], sizeL, MPI_BYTE,
                         (((basenum - (workstationnum) * PARALLELSIZE) /
                           PARALLELSIZE) % workstationnum) + 1,
                         100, MPI_COMM_WORLD, &status);
            else {
                i = jjj;
                MPI_Recv(frame_pool[1][0], sizeL, MPI_BYTE,
                         (((basenum - (workstationnum) * PARALLELSIZE) /
                           PARALLELSIZE) % workstationnum) + 1,
                         100, MPI_COMM_WORLD, &status);
            }
        } else
        {
            rett = MPI_Recv(frame_pool[0][0], sizeL,
                            MPI_BYTE, (((basenum - (workstationnum) * PARALLELSIZE)
                            / PARALLELSIZE) % workstationnum) + 1, 100,
                            MPI_COMM_WORLD, &status);
        }
        printf("Master Acc Waiting Time is %d\n", mRecieveAcc_Time);
        /* Display a picture */
        printf("Master Acc Display Time is %d\n", DisAcc_Time);
        i++;
    }
    /* Count the acc time */
    wj_tmp2 = times(&tbuf);
    Acc_Time += (wj_tmp2 - wj_tmp1);
    printf("PARALLELSIZE=%d interval=%f", PARALLELSIZE,
           ((float)(wj_tmp2 - wj_tmp1)) / CLK_TCK);
    printf("Master Acc Time is %f secs, FPR=%f\n",
           (float)Acc_Time / CLK_TCK,
           (float)(PARALLELSIZE) * CLK_TCK / (wj_tmp2 - wj_tmp1));
    /* Send raw data */
    DistributeData();
    basenum += PARALLELSIZE;
    }
    fclose(pfp);
    if (Sequence_Framenum != 0)
    {
        printf("output the last frame\n");
        Output_Last_Frame_of_Sequence(frameptr[basenum],
                                      Bitstream_Framenum);
    }
    Deinitialize_Sequence(frameptr[0]);
    *Bitstream_Framenumber = Bitstream_Framenum;
    return (Return_Value);
errhead:
    exit(-1);
}

void gameover()
{
    TerminateSlave();
    fclose(pfp);
    Deinitialize_Sequence(frameptr[0]);
    fclose(logfp);
    printf("wait for finalize\n");
    MPI_Finalize();
    /* ExitProcess(0); */
    exit(0);
}
APPENDIX B
MATLAB SOURCE CODE FOR WCDMA HANDOFF ALGORITHM

function [theoBER, theoBER_o, bs1ber, bs2ber] = MaD(nuser1, nuser2, ...
    activeset, pw1, pw2, f, d)
% BER changing from 5 --> 10 users during handoff
% power controled by BS1
% nuser1 = 5;
% nuser2 = 10;
% return value:
%   theoBER,   the combined BER of handover mobile using
%              main path power control
%   theoBER_o, the combined BER for handover mobile using
%              is-95 'or for down' power control
%   bs1ber     the ber observed in bs1 using main path power
%              control
SF = [32:16:256];   % 15 SFs
for i = 1:15,
    if nuser1 < nuser2
        bs1ber(i) = ber(nuser1, nuser2, 64, 1, f);
        bs2ber(i) = ber(nuser2, nuser1, 64, 1, f);
        if activeset == 1
            theoBER(i) = ber(nuser1, nuser2, SF(i), 1, f);
            theoBER_o(i) = theoBER(i);
        elseif activeset == 3
            if d > 1000
                ppww1 = ((2000-d)/d)^4;
                ppww2 = 1;
            else
                ppww1 = 1;
                ppww2 = (d/(2000-d))^4;
            end
            theoBER_o(i) = min(ber(nuser1, nuser2, SF(i), ppww1, f), ...
                ber(nuser2, nuser1, SF(i), ppww2, f));
            theoBER(i) = min(ber(nuser1, nuser2, SF(i), 1, f), ...
                ber(nuser2, nuser1, SF(i), pw2, f));
            bs2ber(i) = ber(nuser2+pw2, nuser1, 64, 1, f);
        else   % activeset = 2
            theoBER(i) = ber(nuser2, nuser1, SF(i), 1, f);
            theoBER_o(i) = theoBER(i);
        end
    elseif nuser1 > nuser2
        bs1ber(i) = ber(nuser1, nuser2, 64, 1, f);
        bs2ber(i) = ber(nuser2, nuser1, 64, 1, f);
        if activeset == 1
            theoBER(i) = ber(nuser1, nuser2, SF(i), 1, f);
            theoBER_o(i) = theoBER(i);
        elseif activeset == 3
            if d > 1000
                ppww1 = ((2000-d)/d)^4;
                ppww2 = 1;
            else
                ppww1 = 1;
                ppww2 = (d/(2000-d))^4;
            end
            theoBER_o(i) = min(ber(nuser1, nuser2, SF(i), ppww1, f), ...
                ber(nuser2, nuser1, SF(i), ppww2, f));
            theoBER(i) = min(ber(nuser1, nuser2, SF(i), pw1, f), ...
                ber(nuser2, nuser1, SF(i), 1, f));
            bs1ber(i) = ber(nuser1+pw1, nuser2, 64, 1, f);
        else
            theoBER(i) = ber(nuser2, nuser1, SF(i), 1, f);
            theoBER_o(i) = theoBER(i);
        end
    end
end

function BER = ber(nuser1, nuser2, sf, pw, f);
% calculate the BER given:
%   local cell load        --- nuser1
%   other cell load        --- nuser2
%   interested mobile's SF --- sf
%   interested mobile's relative power level
%   the relative ratio of other cell interference --- f
% [BER] = ber(5, 10, 64, 2, 0.3)
SIR = pw*sf/(nuser1 + f*nuser2);
tem1 = (nuser2*f + nuser1 - 1)/(3*sf);
term2 = (nuser1 + f*nuser2)/(2*sf*pw) + nuser1/(3*sf);
sf*pw/(nuser1 + f*nuser2)
BER = normcdf(-term2^(-1/2), 0, 1);

% the improved time for updating in target cell
T_u = zeros(100);
T_u([2 3 4 6 8 11 15 16 19 22]) = [37 28 29 31 43 56 70 91 104 107]
T_u([23 27 32 34 36 39 40 44 45]) = ...
    [118 212 147 159 161 175 176 190 191]
T_u([49 51 53 57 61 63 69 75 81 87]) = ...
    [215 217 229 423 247 259 295 331 367 403]
T_i = 60
T_e = 20
lambda(1) = 500
lambda(3) = 250
lambda(2) = 125
waiting = 0;
for k = 1:2
    i = 1;
    while i < 100
        if waiting == 0
            T_h(k,i) = T_u(i) + T_i + T_e;
            waiting = max(T_h(k,i) - lambda(k), 0);
            i = i + 1
        else
            T_h(k,i) = T_u(i) + T_i + T_e + waiting;
            cnt = 1;
            tmp = T_h(k,i);
            while (tmp > cnt*lambda(k)),
                tmp = max(T_u(i:i+cnt)) + T_i + T_e*cnt;
                cnt = cnt + 1;
            end
            if cnt > 1
                T_h(k,i) = tmp;
                for ss = 1:cnt
                    T_h(k,i+ss) = T_h(k,i) + (cnt-ss)*lambda(k)
                end
            end
            i = i + cnt
            % waiting = max(T_h(k,i) - lambda(k), 0);
        end
    end
end
figure
hold
plot([1:99], T_h(1,1:99), 'y-', [1:99], T_h(2,1:99), 'g-')
plot([1:99], T_h(1,1:99), 'yo', [1:99], T_h(2,1:99), 'g*')
% ,[1:100], T_h(3,:), 'r+')
text(10, 1000, '-o- lambda1=2')
text(10, 900, '-*- lambda2=8')
xlabel('Handoff Mobile to the Target Cell', 'FontSize', 20)
ylabel('the Handoff Time T_{handoff} (in ms)', 'FontSize', 20)
title('System Response Time with Batching Processing', 'FontSize', 20)

function invm = invers(A)
m = size(A);
m = m(1);
d = deter(A);
for i = 1:m
    for j = 1:m
        invm(i,j) = (-1)^(i+j)*deter(aug(A,j,i));
        invm(i,j) = invm(i,j)/d;
    end
end

% model: z = aj*x'
% A1*x >= m1 + fIo,  x1~x6 follow cell 1
% A2*x >= m2 + fIo,  x7~x10 follow cell 2
% r1mu*p' <= 10
% r2mu*p' <= 10
% set bj to desired SIRs for traffic mixture
m1 = 5
m2 = 10
powervector = zeros(400, 10);
throughput = zeros(1, 400);
r1mu = zeros(400, 10);
r2mu = zeros(400, 10);
d = 1
% the offset pattern repeats every 50 meters
for d = 1:1:50
    r1 = mod(([800:50:1250] + d - 800), 400) + 800;
    r2 = 2000 - r1
    r1mu(d,:) = 10^10 * r1.^(-3.5)
    r2mu(d,:) = 10^10 * r2.^(-3.5)
    bj = [5 5 5 5 5 5 5 5 5 5] + 4
    aj1 = r1mu(d,:) ./ (m1 + 0.3*m2 + 10) .* bj
    aj2 = r2mu(d,:) ./ (m2 + 0.3*m1 + 10) .* bj
    aj(1:6) = aj1(1:6)
    aj(7:10) = aj2(7:10)
    % minimum SF = 256
    % constraint for handoff mobile on cell 1
    A1(1,:) = [r1mu(d,1)*256/bj(1)  -r1mu(d,2:10)]
    A1(2,:) = [-r1mu(d,1)  r1mu(d,2)*256/bj(2)  -r1mu(d,3:10)]
    A1(3,:) = [-r1mu(d,1:2)  r1mu(d,3)*256/bj(3)  -r1mu(d,4:10)]
    A1(4,:) = [-r1mu(d,1:3)  r1mu(d,4)*256/bj(4)  -r1mu(d,5:10)]
    A1(5,:) = [-r1mu(d,1:4)  r1mu(d,5)*256/bj(5)  -r1mu(d,6:10)]
    A1(6,:) = [-r1mu(d,1:5)  r1mu(d,6)*256/bj(6)  -r1mu(d,7:10)]
    % constraint for handoff mobile on cell 2
    A1(7,:) = [-r2mu(d,1:6)  r2mu(d,7)*256/bj(7)  -r2mu(d,8:10)]
    A1(8,:) = [-r2mu(d,1:7)  r2mu(d,8)*256/bj(8)  -r2mu(d,9:10)]
    A1(9,:) = [-r2mu(d,1:8)  r2mu(d,9)*256/bj(9)  -r2mu(d,10)]
    A1(10,:) = [-r2mu(d,1:9)  r2mu(d,10)*256/bj(10)]
    % right hand of handoff mobile
    b(1:6) = m1 + 0.3*m2
    b(7:10) = m2 + 0.3*m1
    % other cell constraint
    A1(11,:) = r1mu(d,:)
    A1(12,:) = r2mu(d,:)
    b(11) = 12
    b(12) = 12
    type = 'max'
    c = aj
    rel = '>>>>>>>>>><<'
    pp = simplex2p(type, c, A1, rel, b, 0)
    sf(1,d) = bj(1)*(b(1) + r1mu(d,2:10)*pp(2:10)) / (r1mu(d,1)*pp(1))
    sf(2,d) = bj(2)*(b(2) + r1mu(d,1)*pp(1) + ...
        r1mu(d,3:10)*pp(3:10)) / (r1mu(d,2)*pp(2))
    sf(3,d) = bj(3)*(b(3) + r1mu(d,1:2)*pp(1:2) + ...
        r1mu(d,4:10)*pp(4:10)) / (r1mu(d,3)*pp(3))
    sf(4,d) = bj(4)*(b(4) + r1mu(d,1:3)*pp(1:3) + ...
        r1mu(d,5:10)*pp(5:10)) / (r1mu(d,4)*pp(4))
    sf(5,d) = bj(5)*(b(5) + r1mu(d,1:4)*pp(1:4) + ...
        r1mu(d,6:10)*pp(6:10)) / (r1mu(d,5)*pp(5))
    sf(6,d) = bj(6)*(b(6) + r1mu(d,1:5)*pp(1:5) + ...
        r1mu(d,7:10)*pp(7:10)) / (r1mu(d,6)*pp(6))
    sf(7,d) = bj(7)*(b(7) + r2mu(d,1:6)*pp(1:6) + ...
        r2mu(d,8:10)*pp(8:10)) / (r2mu(d,7)*pp(7))
    sf(8,d) = bj(8)*(b(8) + r2mu(d,1:7)*pp(1:7) + ...
        r2mu(d,9:10)*pp(9:10)) / (r2mu(d,8)*pp(8))
    sf(9,d) = bj(9)*(b(9) + r2mu(d,1:8)*pp(1:8) + ...
        r2mu(d,10)*pp(10)) / (r2mu(d,9)*pp(9))
    sf(10,d) = bj(10)*(b(10) + r2mu(d,1:9)*pp(1:9)) / (r2mu(d,10)*pp(10))
    powervector(d,:) = pp';
    throughput(d) = 5000*([1 1 1 1 1 1 1 1 1 1]*(1./sf(:,d)))
end

% wwwpower.eps
figure
hold
plot([0:50:450]+10, powervector(1,:), 'r-', ...
    [0:50:450]+10, powervector(1,:), 'r*')
plot([0:50:450]+20, powervector(11,:), 'y-', ...
    [0:50:450]+20, powervector(11,:), 'yo')
plot([0:50:450]+30, powervector(21,:), 'g-', ...
    [0:50:450]+30, powervector(21,:), 'gd')
plot([0:50:450]+40, powervector(31,:), 'b-', ...
    [0:50:450]+40, powervector(31,:), 'b^')
title('the assignment of Power for WWW traffic')
ylabel('Normalized Power Level')
xlabel('Offset in the Handoff area')
text(100, 11, '-*- offset 0')
text(100, 10.5, '-o- offset 10')
text(100, 10, '-d- offset 20')
text(100, 9.5, '-^- offset 20')

function [activeset, s0_d, s1_d, diff1, diff2, ...
    theBER, theBER_o, pl, bs1ber, bs2ber] = ...
    sho(k1, k2, d, r0, T_add, T_drop, timer, nUser1, nUser2, speed);
% IS-95 soft handoff algorithm
% sho(k1, k2, d, r0, T_add, T_drop, timer)
% Usage: [activeset s1 s2 diff1 diff2 theBER theBER_o ...
%   pl bs1ber bs2ber] = sho(0, 30, 2000, 2000, -215, -217, 10, 10, 10, 10);
% approximate adding distance is 700m, drop at 1400m.
% k1     the BS transmitting power
% k2     the path loss parameter, usually 20-40
% d      the distance between two BSs
% r0     the distance from BS0
% T_add  adding threshold, -217 corresponds 600 meters for adding
% T_drop droping threshold, suggesting 1db less than T_add
% timer  hyperthesis time period
% speed  meter/sec
% u(d), v(d) is zero mean Gaussian process having the exponential
% correlation function as E(u(d)u(d+delta)) = sigma^2*exp(delta/d0)
% d0 is a constant of decay. We assume sigma=8db, d0=20m
% ds is the sampling distance, use 1 meter here
d0 = 20;
ds = 20;
sigma = 12;
alpha = exp(-ds/d0);
sigmavi = (1 - alpha^2)*sigma;
u = zeros(1, d);
v = zeros(1, d);
us = random('norm', 0, sigmavi, 1, d);
vs = random('norm', 0, sigmavi, 1, d);
u = us;
v = vs;
t0 = 0
t1 = 0
active(1) = 1;
%% active(i) = 1 BS1, = 2 BS2, = 3 BS1+BS2
for i = 2:d
    s0_d(i) = k1 - k2*log(i) + u(i);
    s1_d(i) = k1 - k2*log(d-i) + v(i);
    diff1(i) = s0_d(i) - s1_d(i);
    diff2(i) = s1_d(i) - s0_d(i);
    active(i) = active(i-1);
    if (s1_d(i) > T_add) & s0_d(i) > T_add,
        active(i) = 3;
    end
    if (s0_d(i)
    SF1 = ones(1, nUser1)*64;
    SF2 = ones(1, nUser2)*64;
    pl(i,:) = [10^(s0_d(i)/10)/10^(s1_d(i)/10) ...
        10^(s1_d(i)/10)/10^(s0_d(i)/10)];
    % pl(i,:) = [s0_d(i)/s1_d(i)  s1_d(i)/s0_d(i)];
    link = [nUser1*100  SF1(1)  nUser1];
    physchan = [3000];   % [SNR];
    multi = [1000];      % multipath
    misc = [110];        % reverse link
    % [sum, BaseSignal, subsignal] = cdma2(link, physchan, multi, misc, SF1, SF2, pl);
end
figure
plot([1:2000], theBER(:,2), 'yo', [1:2000], theBER(:,3), 'g+');
title('BER level using MaD scheme');
xlabel('mobile distance (meter)');
ylabel('BER theoretically');
text(10, 0.005, 'o SF=32 \n + SF=64');
figure
plot(pl(800:1200, 2))
xlabel('Mobile location in Handoff Area (meter)');
ylabel('Normalized Received Power Level in the Neighbor Cell');
figure
plot(bs1ber);
plot(bs2ber);
hold
plot([1:2000], 0.2, 'r-');
text(200, 0.22, 'desired BER for voice')
xlabel('Mobile location in Handoff Area (meter)');
ylabel('BER for voice traffic in the other cells');
figure
plot([1:2000], diff1, 'yo', [1:2000], diff2, 'g+')
xlabel('mobile distance in meter')
ylabel('signal strength in db')
text(100, diff1(100), 'pilot signal from BS1')
text(1100, diff2(1100), 'pilot signal from BS2')
activeset = active;
figure
plot(activeset)
xlabel('distance')
ylabel('active set')
axis([0 2000 0 4])
text(400, 3.7, 'active set, T_add=-15db, T_drop=-16db')
text(400, 3.5, 'T_tdrop=5sec')
figure
%%% theBER_of.eps
hold
theBER_of = ones(2000);
theBER_of(700:1200) = theBER_o((700:1200), 1);
theBER_of(1200:1300) = theBER_o((1200:1300), 2);
plot([700:1300], theBER_of(700:1300));
plot([700:1300], 0.02, '--')
text(850, 0.02, 'desired BER of mobile')
xlabel('mobile distance')
ylabel('BER')
title('BER with Maximum Data rate and OR ON DOWN power control (towards high loaded cell)')
text(1000, 0.015, 'SF=32')
text(1200, 0.007, 'Cell 1 SF changed to 64')
line([800 1200], [0.007 0.007])
line([800 800], [0.006 0.009])
line([1200 1200], [0.006 0.009])
text(900, 0.007, 'handoff area, SF=32')
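For reference, the core computation in the ber routine of the listing above is the standard Gaussian approximation for the uplink bit error rate of a DS-CDMA user. Writing the code's variables as local-cell load N1 (nuser1), other-cell load N2 (nuser2), spreading factor G (sf), relative power p (pw), and other-cell interference ratio f, the returned value is

```latex
\mathrm{BER} \;=\; Q\!\left(\left[\frac{N_1 + f N_2}{2\,G\,p} + \frac{N_1}{3\,G}\right]^{-1/2}\right),
\qquad Q(x) = 1 - \Phi(x),
```

where Phi is the standard normal CDF; the code evaluates Q(x) as normcdf(-x, 0, 1). (The intermediate tem1 is computed but does not enter the returned value.)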
REFERENCES

[1] J. Wang, J. Liu, and Y. Lin. A partition-based parallel MPEG-2 software decoder. Proc. of Joint Conference of Information Systems, pages 1009-1012, Durham, North Carolina, Mar 2002.

[2] J. Wang and J. Liu. Parallel MPEG-2 decoding with high-speed networks. Proc. of International Conference on Multimedia and Exposition 2001 (ICME 2001), pages 449-452, Tokyo, Japan, Mar 2001.

[3] J. Wang, Y. He, and J. Liu. Efficient buffering control for a software-only, high-level, high-profile MPEG-2 decoder. To appear in IEEE International Conference on Multimedia and Exposition 2003 (ICME 2003).

[4] L. Zhang, S. Deering, D. Estrin, S. Shenker, and D. Zappala. RSVP: a new resource reservation protocol. IEEE Network, 7(5):8-18, Sep 1993.

[5] N. Kamat, J. Wang, and J. Liu. An efficient re-routing scheme for voice over IP. To appear in IEEE International Conference on Multimedia and Exposition 2003 (ICME 2003).

[6] J. M. McManus and K. W. Ross. Video-on-demand over ATM: constant-rate transmission and transport. IEEE Journal on Selected Areas in Communications, 14(9):1087-1097, Aug 1996.

[7] A. Durand. Deploying IPv6. IEEE Internet Computing, 5(1):79-81, Feb 2001.

[8] I. F. Akyildiz, J. McNair, L. Carrasco, R. Puigjaner, and Y. Yesha. Medium access control protocols for multimedia traffic in wireless networks. IEEE Network, pages 39-47, July/August 1999.

[9] H. Fukumasa, R. Kohno, and H. Imai. Design of pseudo-noise sequences with good odd and even correlation properties for DS/CDMA. IEEE Journal on Selected Areas in Communications, 12(5):828-836, June 1994.

[10] ISO/IEC 13818-2. Information technology - generic coding of moving pictures and audio information: Video. 1998.

[11] T. Akiyama, H. Aono, R. Aoki, K. W. Ler, B. Wilson, T. Araki, T. Takahashi, H. Takeno, C. Boon, A. Sato, S. Nakatani, K. Horiike, and T. Senoh. MPEG2 video codec using image compression DSP. IEEE Trans. on Consumer Electronics, 40(3):466-472, Aug 1994.

[12] R. Lee. Real-time MPEG video via software decompression on a PA-RISC processor. 40th IEEE Computer Society International Conference (COMPCON 95), pages 186-192, March 1995.

[13] R. Bhargava, L. John, B. L. Evans, and R. Radhakrishnan. Evaluating MMX technology using DSP and multimedia applications. Proc. of the 31st IEEE International Symposium on Microarchitecture, pages 37-46, 1998.

[14] A. Peleg, S. Wilkie, and U. Weiser. Intel MMX for multimedia PCs. Communications of the ACM, 40(1):25-40, Jan 1997.

[15] T. Tung, C. Ho, and J. Wu. MMX-based DCT and MC algorithms for real-time pure software MPEG decoding. IEEE Multimedia Conference 99, pages 357-362, July 1999.

[16] C.-G. Zhou, L. Hohn, D. Rice, I. Kabir, A. Jabbi, and X. Hu. MPEG video decoding with the UltraSPARC visual instruction set. Compcon 95, pages 470-475, Spring 1995.

[17] S. Sriram and C.-Y. Hung. MPEG-2 video decoding on the TMS320C6X DSP architecture. Proc. of the 1998 IEEE Asilomar Conf. on Signals, Systems and Computers, pages 1735-1739, 1998.

[18] I. Ahmad, S. M. Akramullah, M. L. Liou, and M. Kafil. A scalable off-line MPEG-2 video encoding scheme using a multiprocessor system. Parallel Computing, 27(6):823-846, 2001.

[19] S. Akramullah, I. Ahmad, and M. Liou. A data-parallel approach for real-time MPEG-2 video encoding. Journal of Parallel and Distributed Computing, 30(2):129-146, Nov 1995.

[20] K. Gong and L. Rowe. Parallel MPEG-1 encoding. Proc. of the 1994 Picture Coding Symposium, Sep 1994.

[21] Y. He, I. Ahmad, and M. Liou. Real-time interactive MPEG-4 system encoder using a cluster of workstations. IEEE Trans. on Multimedia, 1(2):217-233, June 1999.

[22] A. Bilas, J. Fritts, and J. Singh. Real-time parallel MPEG-2 decoding in software. Proc. of the 11th International Parallel Processing Symposium, pages 37-46, 1997.

[23] N. H. C. Yung and K. K. Leung. Spatial and temporal data parallelization of the H.261 video coding algorithm. IEEE Trans. on Circuits and Systems for Video Technology, 11(1):91-104, Jan 2002.

[24] S. H. Bokhari. On the mapping problem. IEEE Trans. on Computers, C-30(3):207-214, Mar 1981.

[25] N. Bowen and C. Nikolaou. On the assignment problem of arbitrary process systems to heterogeneous distributed computer systems. IEEE Trans. on Computers, 41(3):257-273, Mar 1993.

[26] MPEG Software Simulation Group. MPEG-2 video codec version 1.2. http://www.mpeg.org/tristan/MPEG/MSSG, 04/25/2003.

[27] B. Daines, J. Liu, and K. Sivalingam. Supporting multimedia communication over a gigabit Ethernet network. International Journal of Parallel and Distributed Systems and Networks, 4(2):102-115, Jun 2001.

[28] M. Naraghi-Pour and H. Liu. Integrated voice-data transmission in CDMA packet PCNs. Proc. of 2000 IEEE ICC, pages 1085-1089, 2000.

[29] S. Choi and K. G. Shin. A unified wireless LAN architecture for real-time and non-real-time communication services. IEEE/ACM Trans. on Networking, 8(1):44-59, 1999.

[30] L. Zhuge and V. Li. Interference estimation for admission control in multi-service DS-CDMA cellular systems. Proc. of 2000 IEEE GLOBECOM, pages 1509-1514, 2000.

[31] M. Lops, G. Ricci, and A. M. Tulino. Narrow-band-interference suppression in multiuser CDMA systems. IEEE Trans. on Communications, 46(9):1163-1175, 1998.

[32] P. Chang and C. Lin. Design of spread spectrum multi-code CDMA transport architecture for multimedia services. IEEE Journal on Selected Areas in Communications, 18(1):99-111, January 2000.

[33] E. Geraniotis and J. Wu. The probability of multiple correct packet receptions in direct-sequence spread-spectrum networks. IEEE Journal on Selected Areas in Communications, 12(5):871-884, June 1994.

[34] S. Choi and D. Cho. Capacity evaluation of forward link in a CDMA system supporting high data-rate service. Proc. of 2000 IEEE GLOBECOM, pages 123-127, 2000.

[35] I. F. Akyildiz, D. A. Levine, and I. Joe. A slotted CDMA protocol with BER scheduling for wireless multimedia networks. IEEE/ACM Trans. on Networking, 7(2):146-157, April 1999.

[36] S. Oh and K. M. Wasserman. Dynamic spreading gain control in multi-service CDMA networks. IEEE Journal on Selected Areas in Communications, 17(5):918-927, May 1999.

[37] M. Elicin, J. Wang, and J. Liu. Support multimedia traffic over W-CDMA with dynamic spreading. Proc. of GLOBECOM 2001, 4:2384-2388, 2001.

[38] J. Wang, M. Elicin, and J. Liu. Multimedia support for wireless W-CDMA with dynamic spreading. ACM Journal of Wireless Networks, 8:355-370, 2002.

[39] J. Wang, Y. Cen, and J. Liu. Analysis of soft handoff algorithms with dynamic spreading WCDMA system. Proc. of the 17th International Conference on Advanced Information Networking and Applications (AINA 2003), 2003.

[40] J. Wang, J. Liu, and Yuehao Cen. Handoff algorithms in dynamic spreading WCDMA system supporting multimedia traffic. IEEE Journal on Selected Areas in Communications, to appear in 2003.

[41] K. S. Gilhousen, A. J. Viterbi, A. M. Viterbi, and E. Zehavi. Soft handoff extends CDMA cell coverage and increases reverse link capacity. IEEE Journal on Selected Areas in Communications, 12:1281-1288, Oct 1994.

[42] N. Zhang and J. M. Holtzman. Analysis of a CDMA soft handoff algorithm. IEEE Trans. on Vehicular Technology, 47(2):710-714, May 1998.

[43] Y. Fang and I. Chlamtac. Analytical generalized results for the handoff probability in wireless networks. IEEE Trans. on Communications, 50(3):396-399, March 2002.

[44] S. Chakrabarti, P. Marichamy, and S. L. Maskara. Overview of handoff schemes in cellular mobile networks and their comparative performance evaluation. Proc. of IEEE VTC 99, 3:1486-1490, 1999.

[45] C. C. Lee and Raymond Steele. Effect of soft and softer handoffs on CDMA system capacity. IEEE Trans. on Vehicular Technology, 47(3):830-841, Aug 1998.

[46] C. Zhou, L. Honig, and S. Jordan. Two-cell utility-based resource allocation for a CDMA voice service. IEEE Proc. of 54th Vehicular Technology Conference, 1:27-31, 2001.

[47] M. Naghshineh and M. Schwartz. Distributed call admission control in mobile/wireless networks. IEEE Journal on Selected Areas in Communications, 14(4):711-717, May 1996.

[48] T. Lee and J. Wang. Admission control for variable spreading gain CDMA wireless packet networks. IEEE Trans. on Vehicular Technology, 49(2):565-575, March 2000.

[49] T. Chiang and Y.-Q. Zhang. A new rate control scheme using quadratic rate distortion model. IEEE Trans. on Circuits and Systems for Video Technology, 7(1):246-250, Feb 1997.

[50] W. Ding and B. Liu. Rate control of MPEG video coding and recording by rate-quantization modeling. IEEE Trans. on Circuits and Systems for Video Technology, 6(1):12-20, Feb 1996.

[51] H. Hang and J. Chen. Source model for transform video coder and its application - Part I: fundamental theory. IEEE Trans. on Circuits and Systems for Video Technology, 7(2):287-298, April 1997.

[52] H. Hang and J. Chen. Source model for transform video coder and its application - Part II: variable frame rate coding. IEEE Trans. on Circuits and Systems for Video Technology, 7(2):299-311, April 1997.


BIOGRAPHICAL SKETCH

Ju Wang was born in FengHua, P.R. China. He received his Bachelor of Science degree from the Computer Science Department at Ocean University of Qingdao, P.R. China, in 1995. In 1998, he earned his Master of Science degree from the Computer Science Department at the Institute of Computing Technology, Chinese Academy of Sciences, P.R. China. He is expected to receive his Doctor of Philosophy degree in computer and information science and engineering from the University of Florida in August 2003. He has accepted an assistant professor position with Virginia Commonwealth University, starting August 2003. His research interests include distributed multimedia systems, digital video processing, computer networking, wireless networks and security.
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Jonathan C.L. Liu, Chairman
Associate Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Randy Y. C. Chow
Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Jih-kwon Peir
Associate Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Sartaj Sahni
Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Yuguang "Michael" Fang
Assistant Professor of Electrical and Computer Engineering
This dissertation was submitted to the Graduate Faculty of the College of Engineering and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

August 2003

Dean, College of Engineering

Dean, Graduate School


Permanent Link: http://ufdc.ufl.edu/UFE0000944/00001

Material Information

Title: Multimedia Communication with Cluster Computing and Wireless WCDMA Network
Physical Description: Mixed Material
Copyright Date: 2008

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0000944:00001


MULTIMEDIA COMMUNICATION WITH CLUSTER COMPUTING AND
WIRELESS WCDMA NETWORK














By

JU WANG


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY


UNIVERSITY OF FLORIDA


2003































Copyright 2003


by


Ju Wang
















This work is dedicated to my beloved wife Qingyan Zhang.















ACKNOWLEDGEMENTS


I would like to thank my dissertation advisor, Dr. Jonathan Liu, for his guidance and support, without which it would never have been possible to complete this work. Dr. Liu's invaluable advice on both academic and nonacademic life made me more mature and scholastic. I would also like to thank Dr. Randy Chow and Dr. Jih-kwon Peir for sharing their intelligence and experience with me. Thanks also go to Dr. Sahni and Dr. Fang, who provided many constructive questions and witty suggestions on my research work.

Special thanks go to my parents. Their continued support has always provided me the courage and the spirit to move forward. Finally, words will never be enough to express my gratitude to my beloved wife, Qingyan, for her constant love, patience, encouragement, and inspiration during the difficult times.
















TABLE OF CONTENTS
Page


ACKNOWLEDGEMENTS ................................................. iv

ABSTRACT ......................................................... vii

CHAPTER

1 INTRODUCTION .................................................... 1

  1.1 High Performance MPEG-2 Decoding ............................ 1
  1.2 Multimedia Communication over WCDMA Wireless Network ........ 5
  1.3 Dissertation Outline ........................................ 9

2 MPEG-2 BACKGROUND AND RELATED STUDY ............................. 11

3 DATA-PARTITION PARALLEL SCHEME .................................. 17

  3.1 Performance Model of Data-Partition Scheme .................. 20
  3.2 Communication Analysis in Data Partition Parallel Scheme .... 22
  3.3 Partition Example and Communication Calculation ............. 23
  3.4 Communication Minimization .................................. 25
  3.5 Heuristic Data Partition Algorithm .......................... 27
  3.6 Experimental Results ........................................ 29
      3.6.1 Performance over A 100-Mbps Network ................... 30
      3.6.2 Performance over a 680-Mbps Environment ............... 31

4 PIPELINE PARALLEL SCHEME ........................................ 33

  4.1 Design Issues of Pipeline Scheme ............................ 33
  4.2 Performance Analysis ........................................ 39
  4.3 Experimental Results ........................................ 46
  4.4 Experiment Design ........................................... 47
  4.5 Performance over A 100-Mbps Network ......................... 48
  4.6 Performance over A 680-Mbps SMP Environment ................. 51
  4.7 Towards the High Resolution MPEG-2 Video .................... 53
      4.7.1 Efficient Buffering Schemes ........................... 57
      4.7.2 Further Optimization in the Slave Nodes ............... 59
      4.7.3 Implementation and Experimental Results ............... 60
  4.8 Summary ..................................................... 62

5 MULTIMEDIA SUPPORT IN CDMA WIRELESS NETWORK ..................... 64

  5.1 Performance-Guaranteed Call Processing ...................... 67
      5.1.1 Proposed Admission Protocol in A Mobile Station ....... 69
      5.1.2 Proposed Admission Protocol in a Base Station ......... 70
      5.1.3 A Performance-Guaranteed System ....................... 72
      5.1.4 Processing Time at Base Station ....................... 74
  5.2 Dynamic Scheduling for Multimedia Integration ............... 76
      5.2.1 Design Issue of Traffic Scheduling for Wireless CDMA
            Uplink Channel ........................................ 77
      5.2.2 Traffic Scheduling with Fixed Spreading Factor ........ 78
      5.2.3 Dynamic Scheduling to Improve the T ................... 81
  5.3 Summary ..................................................... 85

6 SOFT HANDOFF WITH DYNAMIC SPREADING ............................. 87

  6.1 Related Study ............................................... 89
  6.2 The Framework of Soft Handoff Algorithm with Dynamic Spreading 91
      6.2.1 Stationary System Behavior ............................ 93
      6.2.2 Handoff Time Analysis ................................. 96
  6.3 Determine Spreading Factor for Handoff Mobile ............... 101
      6.3.1 Design Factor ......................................... 101
      6.3.2 Performance of OR-ON-DOWN Power Control ............... 102
      6.3.3 Main Path Power Control ............................... 104
  6.4 Performance Optimization in Handoff Period .................. 106
      6.4.1 BER Model in Handoff Area ............................. 106
      6.4.2 Simplification of the Original Problem ................ 108
      6.4.3 Proposed Sub-Optimal SF/POWER Decision Algorithm ...... 111
      6.4.4 Numerical Results and Performance Discussion .......... 113

7 CONCLUSIONS AND FUTURE WORK ..................................... 116

  7.1 Improve Transportation Layer Performance in WCDMA downlink .. 117
  7.2 Parallel MPEG-4 Decoding .................................... 118
  7.3 Optimal Rate Distortion Control in Video Encoding ........... 119

APPENDIXES

A C SOURCE CODE FOR PARALLEL MPEG-2 DECODER ....................... 121

B MATLAB SOURCE CODE FOR WCDMA HANDOFF ALGORITHM .................. 146

REFERENCES ....................................................... 160

BIOGRAPHICAL SKETCH .............................................. 164















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

MULTIMEDIA COMMUNICATION WITH CLUSTER COMPUTING AND
WIRELESS WCDMA NETWORK

By

Ju Wang

August 2003


Chairman: Jonathan C.L. Liu
Major Department: Computer and Information Science and Engineering

Distributed multimedia is a multidisciplinary area of research which involves

networking, video/audio processing, storaging and high performance computing. Re-

cent advances in video compression and networking technology has brought increasing

attention to this area. Nevertheless, providing high-quality digital video in large scale

remains a challenging task.

The primary focus of this dissertation is twofold: (1) We investigated a pure-software parallel MPEG-2 decompression scheme. To achieve a high level of scalability, we proposed the data pipeline scheme based on a master/slave architecture. The proposed scheme is very efficient due to the low overhead in the master and slave nodes. Our experimental results showed that the proposed parallel algorithm can deliver a close-to-linear speedup for high-quality compressed video. With a 100-Mbps network, the 30-fps decompression frame rate can be approached using a Linux cluster. With a 680-Mbps SMP environment, we observed 60-fps HDTV quality with 13 nodes. (2) The second topic of this dissertation addressed how multimedia

applications can be effectively supported in the wireless wideband CDMA network. The major challenge lies in the fact that multimedia traffic usually demands a high data rate and a low bit error rate, while the wireless network often experiences high bit error rates caused by multi-user interference over the air medium. Based on the observation that better channel quality can be obtained by using longer spreading codes, a new media access control (MAC) scheme was proposed in which the spreading code for each mobile is dynamically adjusted based on the network load and the desired traffic QoS. With this new MAC protocol, the radio resource can be better utilized and more data traffic can be supported.

We further extended the dynamic spreading scheme to the multi-cell scenario to handle mobile handover. Based on the system load of the local and neighboring cells, the new handoff algorithm decides when and which spreading factors should be used for a mobile during the handoff period. The algorithm also optimizes the assignment of spreading codes and mobile transmitting power such that the overall throughput is maximized. The new handoff algorithm further allows batch processing of mobile requests to reduce the handoff delay under heavy traffic.


CHAPTER 1
INTRODUCTION

Over the past decade, a tremendous amount of research and development effort

has been undertaken on high performance distributed multimedia systems. Digital

video with decent quality (e.g., DVD quality) has already become affordable to the

general public with off-the-shelf workstations and various electronic consumer prod-

ucts. Some important multimedia applications, such as high quality video conferenc-

ing and video-on-demand, are becoming realistic over wireline network. Nevertheless,

it is only human nature to seek even higher video quality (e.g., HDTV) and ubiquitous

access to multimedia information. To this end, it is essential that the following two

key issues be resolved: (1) a high performance, generic yet scalable software video

encoding/decoding solution that does not rely on any particular hardware design

and (2) a high performance mobile communication network that can support a large-

scale multimedia traffic. In this dissertation, we proposed and investigated various

technologies to address these issues. The primary focus of this dissertation is thus

twofold: (1) to investigate the design issues of a software-based, parallel MPEG-2 de-

coder (MPEG-2 is a widely used video format) and (2) to evaluate a new wideband

CDMA communication protocol which supports multimedia traffic and significantly

increases the link-level QoS for the uplink channel and system performance during

the mobile handoff.

1.1 High Performance MPEG-2 Decoding

MPEG-2 standard has been widely accepted as a platform to provide broad-

casting quality digital video. Using a powerful DCT transform based compression

algorithm and the motion compensation technique, great compression ratio can be

obtained while preserving good video quality. The technique, however, requires intensive processing during the encoding and decoding phases. The computation requirement on the decoding side is particularly critical, because the decoder must be fast enough to maintain continuous playback.

Real-time MPEG-2 decoding has been implemented via a hybrid approach

(general purpose processor with multimedia extension) for the broadcasting-level

video quality. Nevertheless, high-performance, scalable, portable software decoder

schemes for the high-level video quality are still under investigation. We investigated

how a generic high-performance software decoder can be constructed by parallelizing

the decoding process over a workstation cluster or multiple processors in an SMP ma-

chine. Two parallel approaches, namely data-partition scheme [1] and data pipeline

scheme [2], are studied in this dissertation.

With the data-partition scheme, a dedicated master node is in charge of pars-

ing the raw MPEG-2 bitstream (either from a network in the case of streaming video

or from a local file system), dividing each frame into sets of macroblocks, distributing

subtasks, and collecting the results. In the slave nodes, each macroblock goes through

VLC decoding, IDCT, and MC to produce the pixel information. This scheme re-

quires the transmission of reference data when decoding non-intra coded macroblocks

(in P- and B-type frames). Our analysis reveals that the communication cost related

to reference data can vary significantly with different partition methods. Thus an op-

timal partition should be found to maximize system performance. We compared four

partition schemes based on the reference data communication pattern. The Quick-

Partition scheme produced the least communication overhead. Our Quick-Partition

algorithm subdivides a given frame into two parts every time, along the shorter di-

mension. Then the two subparts are further divided using the same strategy. This

procedure repeats until the desired number of parts is reached. With Quick-Partition, our parallel decoder can provide a peak decoding rate of 44-fps on a 15-node SMP system with a measured inter-node bandwidth of 680 Mbps. The corresponding speed-up1 is about 8.
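The recursive bisection behind Quick-Partition can be sketched as follows. This is an illustrative reconstruction rather than the dissertation's implementation: the Region type is ours, and we interpret "along the shorter dimension" to mean that each cut line runs parallel to the shorter side, keeping the shared boundary, and hence the reference-data traffic, small.

```c
typedef struct { int x, y, w, h; } Region;   /* measured in macroblocks */

/* Recursively bisect region r into n parts.  Each cut crosses the longer
 * side, so the cut line itself has the length of the shorter side and the
 * boundary shared between the two halves stays as short as possible. */
void quick_partition(Region r, int n, Region *out, int *count)
{
    if (n == 1) { out[(*count)++] = r; return; }
    int n1 = n / 2, n2 = n - n1;             /* handles non-powers of two */
    Region a = r, b = r;
    if (r.w >= r.h) {                        /* vertical cut, boundary = h */
        a.w = r.w * n1 / n;
        b.x = r.x + a.w;  b.w = r.w - a.w;
    } else {                                 /* horizontal cut, boundary = w */
        a.h = r.h * n1 / n;
        b.y = r.y + a.h;  b.h = r.h - a.h;
    }
    quick_partition(a, n1, out, count);
    quick_partition(b, n2, out, count);
}
```

For a 720*480 frame (45*30 macroblocks) and eight slaves, this produces eight compact tiles whose total perimeter, and thus reference-data exchange, is smaller than that of eight full-width horizontal strips.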

However, the data-partition method causes high computation overhead at the master node when the number of partitions becomes large, which results in a low speedup gain. To achieve high system performance in large-scale parallel configurations, we investigated the second parallel approach, the data pipeline scheme. The data pipeline scheme is more scalable than the data-partition scheme because it parallelizes at the frame level. This not only reduces the computation overhead in the master node but also significantly reduces the communication cost, since the transfer of reference data can be virtually eliminated.

The performance of our pipeline algorithm is determined by two design factors:

(1) the block size D (i.e., the number of frames decoded in each slave) and (2) the

number of slave nodes N. The determination of the optimal values of these two

factors becomes nontrivial due to the inter-frame data dependence among I-, P- and

B-frames. The internal data dependence causes the transmission of reference picture

data among different nodes and even limits the degree of parallelism. We find that

increasing D can significantly reduce the communication required for reference pictures. When N is fixed, this reduction is close to linear until D equals the GOP size, where

the minimum communication overhead is achieved. Nevertheless, as the performance

analysis model will demonstrate, network bandwidth turns out to be a critical factor

for the overall system performance. We have found that there is a single saturation

point for a given multiple-node computation environment. Before the saturation

point, the system round time (i.e., overall time for one pipeline running cycle) is

1 The speed-up gain is defined as the ratio of decompression time between the parallel decoder and its corresponding serial version. Since the majority of the decompression work is done in the slave nodes, we exclude the master node when calculating the speed-up. That is, the maximum speed-up for an 8-node configuration (1 master node and 7 slave nodes) is 7. We use this definition throughout the discussion unless explicitly stated otherwise.

dominated by the CPU processing time, and increasing N improves the system performance close to linearly. However, a higher decoding rate also causes more network

traffic. Eventually the growing communication traffic will saturate the system when

it reaches the system bandwidth limitation.
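The saturation behavior can be captured in a toy model (our own sketch for illustration; the analytical model of Chapter 4 is more detailed). One pipeline round costs whichever is larger: the per-node CPU time for its block of frames, or the time to move every node's data across the shared network.

```c
/* Toy round-time model.  t_cpu: seconds for one slave to decode its block
 * of D frames; bits_per_slave: data each slave moves per round; bw_bps:
 * shared network bandwidth.  All parameter values are hypothetical. */
double round_time(double t_cpu, double bits_per_slave, int n_slaves, double bw_bps)
{
    double t_comm = n_slaves * bits_per_slave / bw_bps;   /* shared medium */
    return t_cpu > t_comm ? t_cpu : t_comm;
}
```

Below saturation t_cpu dominates, so the delivered frame rate N*D/round_time grows close to linearly in N; once N * bits_per_slave / bw_bps exceeds t_cpu, additional slaves only deepen the network bottleneck.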

Experiments over different networking and OS platforms were performed to

validate our proposed data pipeline scheme. The actual decompression rate is recorded

for these experiments, with a particular focus on 30-fps real time video and 60-fps

HDTV quality. The experimental results indicate that our scheme can provide a

real-time decoding rate and the system can be scaled up with the 100-Mbps and the

680-Mbps environments. With a 100-Mbps network, we are able to deliver 30-fps

with 6 slave nodes (each node can decompress at 5.6-fps), with speed-up of 5.7.

With the 680-Mbps SMP environment equipped with 15 low end processors

(Sparc 248MHz), a rate of 30-fps was achieved when using 7 nodes (single node

decompression rate of 5-fps). A rate of 60-fps (HDTV quality) was achieved when

using 13 and 14 nodes. Our analysis shows that a 270-fps decoding rate could be

expected with a full configuration with our SMP testing environment.

In order to evaluate the scalability of the data pipeline parallel algorithm, further experiments were conducted with high-resolution video formats [3].

However, when attempting to decode (1024*1024) and (1404*960) high-level MPEG-

2 video, we have observed a severe performance degradation (e.g., dropping from 18 or

20 fps to 2.5 fps) when more than 10 slave nodes are used. By analyzing the runtime

system resource utilization, we found that the system memory is quickly exhausted

when increasing the number of slave nodes. When decoding the video file with

high spatial resolution, the increase of memory usage eventually becomes a system

bottleneck. To address the challenge and obtain high scalable decoding for high

resolution video, we proposed and implemented two revised memory management

approaches to reduce the buffer requirement. The first is Minimum Transmission

Buffer in Slave Node (ST scheme), where we reduce the transmission buffer size

in the slave nodes to three frames. In the second approach, we proposed a dynamic

buffer scheme that adapts to the current frame type. For an I-frame, the system requests only one frame buffer; when a P- or B-picture is to be decoded, two frame buffers are allocated. The effective buffer requirement averages only 85% of the fixed three-frame buffer. The experimental results show that the buffer space is significantly reduced, and we observed well-scaled decoding performance for high-resolution MPEG-2 video.
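The frame-type-dependent allocation can be sketched as follows (a minimal illustration of the idea only; the function names are ours, and the real decoder manages actual pixel buffers rather than counts):

```c
/* Per-picture buffer demand under the adaptive scheme: an I-picture is
 * self-contained and needs a single frame buffer, while P- and B-pictures
 * need a second buffer to hold the reconstruction next to reference data. */
enum frame_type { I_FRAME, P_FRAME, B_FRAME };

int buffers_needed(enum frame_type t)
{
    return (t == I_FRAME) ? 1 : 2;
}

/* Average demand over a GOP with the given frame-type mix, for comparison
 * against a fixed three-buffer allocation per slave. */
double avg_buffers(int n_i, int n_p, int n_b)
{
    return (1.0 * n_i + 2.0 * (n_p + n_b)) / (n_i + n_p + n_b);
}
```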

1.2 Multimedia Communication over WCDMA Wireless Network

Our investigation of parallelizing the MPEG-2 decoding algorithm indicates that real-time decompression of high-quality video can be obtained via a pure-software solution. It is our belief that massive computation power will become available in the future through the integration of many low-end CPUs on a single chip or board. Thus our proposed data pipeline parallel decoder is particularly appealing for systems equipped with multiple general-purpose processors. Furthermore, our parallel MPEG-2 decoder also provides valuable guidance for the design of a high-performance parallel video encoder. With a proper inter-node communication protocol and enough CPUs, software-based real-time encoding of high-quality video is possible.

However, the capability of real-time video encoding/decoding at the end systems alone does not guarantee quality along the whole path. We must also have a quality-guaranteed communication network to deliver the multimedia content in a timely manner and with high fidelity. Multimedia data, especially streaming audio and video, should be transported with delay guarantees in order to achieve smooth playback. There has been a considerable amount of research in this area. Some approaches attempt to provide QoS over best-effort networks, such as RSVP [4] over the Internet and VoIP [5]. There are also solutions with built-in QoS support at the network layer, such as ATM [6] and IPv6 [7]. These research efforts have shown that it is

becoming realistic to support a large number of multimedia connections with desired

QoS over the wireline network.

Furthermore, a distributed multimedia system should expand its coverage through wireless communication links as well. The future generation of wireless systems needs to support seamless integration (i.e., transparent application switching) of voice, audio, and conventional data (e.g., e-mail and FTP). It should also support many users with guaranteed quality. However, multimedia communication in the wireless network remains a technical challenge, due to the low information bandwidth and high transmission error rates in the physical layer.

The first-generation wireless cellular networks were designed to carry voice communication. These networks were analog systems and used frequency division multiple access (FDMA) to support multiple users. In an FDMA system, the whole radio spectrum is divided into several isolated physical sub-channels, and each sub-channel is dedicated to a particular user once assigned by the base station. In order to reduce cross-channel interference, sub-channels are often spaced apart by a sufficient distance. Furthermore, neighboring cells are not allowed to use the same frequency band. These measures reduce radio efficiency, however, and severely limit the number of concurrent users.

The second-generation systems are characterized by digital modulation and time division multiple access (TDMA) technology, which separates users in time. These systems began to provide limited support for data services, such as e-mail and short messages. However, TDMA is subject to multi-path interference, where the received signal may arrive from several directions with different delays. Similar to FDMA technology, neighboring cells in TDMA must use different frequency bands to reduce inter-cell interference.

Nevertheless, the past decade has witnessed rapid growth of 2G systems. Encouraged by the success of second-generation cellular wireless networks, researchers

are now pushing the 3G standard to support seamless integration of multimedia data services as well as voice service. However, multimedia data traffic is more demanding than voice service, both in data rate and in maximum tolerated bit error rate (BER). For example, streaming video requires a low BER, less than 10^-4, and a moderate data rate (usually higher than 64 kbps). The bottom line is that the minimum BER and data rate requirements of all admitted connections must be satisfied at all times; otherwise, the system cannot guarantee performance.

However, traffic channels in traditional TDMA-based systems have low data rates, and the bit error rate (BER) is often higher than multimedia traffic requires. Directly providing multimedia services over a TDMA system, though possible, results in severe degradation of system capacity [8]. Thus, providing quality-guaranteed service in wireless networks requires new design and functionality in the MAC layer.

Another competing technology, code division multiple access (CDMA), uses a totally different paradigm to share the radio resource among users. In a CDMA system, the whole spectrum serves as the carrier band for all users at all times. Channelization is achieved by assigning different spreading codes to users. Each information bit is spread into the baseband signal before transmission, and the receiving side despreads the multiple occurrences of the baseband signal back into the original bit. The interference caused by simultaneous transmissions is reduced by the spreading/despreading process; for this purpose, the spreading codes are selected such that their cross-correlation is small.
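The spreading/despreading mechanism can be made concrete with a toy two-user example. The 4-chip Walsh codes below are hypothetical stand-ins; actual WCDMA uses OVSF codes with spreading factors from 4 to 512:

```c
#define SF 4   /* chips per bit in this toy example */

/* Spread one data bit (+1 or -1) into SF chips using the user's code. */
void spread(int bit, const int code[SF], int chips[SF])
{
    for (int i = 0; i < SF; i++)
        chips[i] = bit * code[i];
}

/* Correlate the received (superimposed) chips with one user's code.
 * Codes with zero cross-correlation make the other users cancel exactly. */
int despread(const int rx[SF], const int code[SF])
{
    int acc = 0;
    for (int i = 0; i < SF; i++)
        acc += rx[i] * code[i];
    return acc > 0 ? +1 : -1;
}
```

Summing both users' chip streams on the air and despreading with either code recovers that user's bit, which is exactly why low cross-correlation is the code-selection criterion.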

CDMA technology has proven superior to FDMA and TDMA in many aspects [9]. For example, the multi-path signals that degrade TDMA performance can be exploited in CDMA to improve overall signal quality using a multi-finger Rake receiver. Several advantages are listed here. First, the capacity of CDMA is much higher than that of FDMA and TDMA systems, and the same frequency band can be reused by neighboring cells. Second, channels in CDMA are more secure, since the user's data are inherently scrambled by the spreading/despreading process. Third, and most important for multimedia traffic, CDMA can offer different levels of BER and data rate by manipulating the spreading/despreading parameter, i.e., by using a dynamic spreading factor. Therefore, different types of traffic can co-exist in the system with minimal impact on each other. Based on the above considerations, we chose WCDMA as the potential wireless solution for multimedia service and as the direction for further research.

In this dissertation, we investigate an effective protocol design with dynamic spreading factors such that different QoS levels can be provided for different traffic types. Increasing the spreading factor benefits the system because it increases the desired signal strength linearly; the measured bit error rate can be reduced by up to 75 times with a long spreading factor. Taking advantage of this, we propose middleware solutions that monitor the network load and switch spreading factors dynamically based on the current multimedia traffic load. These middleware solutions are implemented in the mobile and base stations, and experiments were performed to measure the actual system performance. The preliminary results indicate that our proposed system can always maintain the desired quality for all voice connections. We further extended our protocol to guarantee balanced support among different traffic types: while voice communication is still guaranteed to be uninterrupted, data traffic is shown to be served with reasonable response times by our proposed system.
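Why a longer code helps can be seen from a textbook single-cell approximation (a standard back-of-the-envelope model, not the measurement setup used in our experiments): with K equal-power users and spreading factor SF, the per-bit signal-to-interference ratio after despreading is roughly SF/(K-1).

```c
/* Approximate per-bit SIR after despreading, ignoring thermal noise and
 * assuming perfect power control (all num_users received at equal power). */
double sir_per_bit(int spreading_factor, int num_users)
{
    if (num_users <= 1)
        return 1e30;   /* no multi-user interference at all */
    return (double)spreading_factor / (num_users - 1);
}
```

Doubling SF doubles the SIR at the cost of halving that user's data rate; trading rate for BER in this way under load is exactly the lever the proposed middleware pulls.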

We further extend our dynamic spreading admission control scheme to a multi-

cell situation by proposing a dynamic spreading enabled soft handoff framework. The

processing time of the handoff request was analyzed. We found that the update pro-

cess caused by handoff is the major component of delay. Further investigation shows

that the update associated with handoff might consume too much access channel

time and increase the delay of handoff, especially when handoff traffic is heavy. We,

therefore, adopted a batch mechanism such that multiple handoff requests could be

processed simultaneously. The average delay is reduced from 1.12 seconds to 800 ms under heavy handoff rates.

We also proposed a new Handoff Mobile Resource Allocation algorithm (HMRA)

to optimize the performance of the mobiles in the handoff area. The spreading factor

and transmission power for the handoff mobiles are jointly considered to maximize

the throughput; meanwhile the algorithm maintains the BER requirements for the

handoff mobiles and the target cell. The original problem is formulated in a nonlinear

programming format. We proposed a procedure to simplify it into a linear constraint

problem, which is solved by a revised simplex method. Numerical results show a

25% increase in throughput for WWW traffic, and a 26% improvement for the video

traffic.

1.3 Dissertation Outline

The rest of this dissertation is organized as follows. Chapter 2 outlines related

studies of high performance MPEG-2 video encoding and decoding. This research

addressed different methods, including pure hardware based methods where special-

ized hardware architecture is used, hybrid methods which build multimedia operation

into the general processor, and software parallel solutions.

In Chapter 3, we describe the framework of the proposed pure software based

parallel MPEG-2 decoder. A data partition based parallel scheme is presented, in

which the decoding of each video frame is subdivided to multiple slave nodes. The

video frame is physically partitioned into several disjoint parts, and each part is decoded at a slave node. The performance results for the data-partition algorithm with different partition methods are presented.

Chapter 4 presents the data pipeline based parallel algorithm. This method

attempts to exploit the frame structure of the MPEG-2 bitstream. In order to reduce

the communication overhead and eliminate the inter-frame decoding dependency, we use frame-level parallelization in the data pipeline scheme. The number of frames assigned to each slave node and the number of slave nodes then become design issues. Our analytical model and experimental results show that the data pipeline scheme can provide close-to-linear speed-up.

Chapter 5 and Chapter 6 constitute the second part of this thesis. In Chapter

5, we present a novel media access control (MAC) protocol for the WCDMA wireless

network. The proposed protocol is designed to handle multimedia traffic. We demon-

strate how the new protocol can guarantee the BER requirement. A new multimedia

scheduling algorithm is described to utilize the dynamic spreading capability. Chap-

ter 6 describes how the work can be extended to a multi-cell environment. A new

handoff procedure is studied which is designed to coordinate the change of spreading

factor when a mobile moves into a new cell. The handoff delay is analyzed for dif-

ferent traffic types. The analysis shows that a considerable delay could be caused by

frequent handoffs. We thus use a watching process to reduce the number of unnec-

essary updates, which in turn reduces the handoff delay. The resource allocation for

the handoff mobile is also discussed.

Chapter 7 summarizes the main contributions of this dissertation and presents

some directions for my future research.

CHAPTER 2
MPEG-2 BACKGROUND AND RELATED STUDY

MPEG-2 [10] is one of the dominant digital video compression standards.

Since its inception in the early 1990s, the standard has been widely accepted and implemented in various hardware and software solutions. The standard defines the syntax of a valid MPEG-2 bitstream, covering a wide range of video quality (e.g., from the 352*240 SIF format to 1404*960 HDTV). Among the various MPEG-2 applications, DVD (corresponding to the 720*480 resolution in the MPEG-2 family) is probably the most commercially successful. However, we expect that higher-quality video formats

will become more desirable and need to be supported in the near future. For example,

HDTV and even higher-resolution video is being proposed for the next generation of electronic consumer products. Therefore, a successful digital video solution should have

good scalability performance. Specifically, the solution of interest should be able to

be extended and provide satisfactory performance for the high-end high-quality video

format. Our proposed software based parallel MPEG-2 decoding scheme is evaluated

with a set of video streams, from low-quality, low bit-rate to high-end, high-resolution

videos. As will be shown in Chapter 4, our investigation indicates that with some

necessary revisions at the memory management algorithm, the data-pipeline scheme

can produce a real-time decoding performance for high-end video streams.

In the MPEG-2 video encoding process, the raw video source is compressed

by exploiting the spatial and temporal redundancy within the time-continuous video

frames. Two key techniques used are DCT transformation and motion compensa-

tion. To reduce the spatial redundancy, the video frame is segmented into 8x8 pixel

blocks, followed by the 8x8 2-D DCT transformation. The DCT coefficients are then

quantized to reduce the number of representing bits. The quantization step is a

non-reversible process where part of pixel information is lost. Thus, the quantization

table is designed to minimize the degradation of image visual quality. Studies in this

field have shown that low frequency DCT coefficients should be assigned more bits

than the high frequency part. Thus many of the high frequency elements will become

zero after quantization. The quantized DCT coefficient matrix is serialized through

a zig-zag scanning to concentrate the nonzero low-frequency DCT coefficients. Such

a representation can be further compacted using a run-length encoding.
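The zig-zag serialization can be written out directly (a reference formulation; production decoders simply index a precomputed 64-entry scan table):

```c
/* Walk the 8x8 block along anti-diagonals, alternating direction, so the
 * output vector starts with the low-frequency coefficients and ends in the
 * long run of zeros that quantization leaves at high frequencies. */
void zigzag_scan(int block[8][8], int out[64])
{
    int i = 0, j = 0;
    for (int k = 0; k < 64; k++) {
        out[k] = block[i][j];
        if ((i + j) % 2 == 0) {          /* moving up-right */
            if (j == 7)      i++;
            else if (i == 0) j++;
            else           { i--; j++; }
        } else {                         /* moving down-left */
            if (i == 7)      j++;
            else if (j == 0) i++;
            else           { i++; j--; }
        }
    }
}
```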

The MPEG-2 motion compensation technique further reduces the encoded

bit rate. This is based on the observation that there is a lot of correlation between

consecutive video frames. Using the concept of Group of Picture (GOP), video frames

are encoded into three frame types: I-frame, P-frame and B-frame. Each GOP

contains one I-frame, which is completely self-encoded without referring to other

frames. The I-frame presumably has the highest picture quality and serves as a reference frame for the P-frames and B-frames in the same GOP. The pixel blocks in a P-frame are encoded with forward motion prediction, where the reference frame is searched for the best-matching pixel block to serve as the prediction, and only the residue is DCT-transformed and quantized. For a B-frame, two reference frames (I- or P-frames), one preceding and one following the current frame, are used to perform bi-directional motion prediction.

Thus decompression goes through the following steps. First, the DCT coefficients and motion information are extracted from the run-length encoded bitstream; this is followed by de-quantization and an inverse DCT transform. Then, if the block is predicted from other block data, motion compensation is performed to form the recovered image. Finally, a dithering procedure maps the image into the color space of the particular display system.









High-performance software decoding for high-quality MPEG-2 video is becom-

ing increasingly desirable for a wide range of multimedia applications. In the past

few years, significant progress has been made in microprocessor technology and software decoder optimization, bringing real-time software-based MPEG-2 decoding [11, 12] into reality. Lee [12] implemented a real-time MPEG-1 decoder by fine-tuning

Huffman decoding, IDCT algorithm and flattening the code to reduce the cost of pro-

cedure call/return. She also used shift and add operations to replace multiplication,

which can be executed more efficiently with PA-RISC multimedia instructions. With

the built-in multi-ALU and super-scalar features of PA-RISC, the cost of some de-

coding procedures can be reduced by up to 4 times. More dedicated multimedia

instructions were introduced in many processor architectures and were used to accel-

erate MPEG-2 decoding process [13-15]. Bhargava et al. [13] conducted a complete

evaluation of MMX technology for filtering, FFT, vector arithmetic and JPEG com-

pression applications. Tung et al. [15] proposed an MMX-enhanced version of a

software decoder by optimizing the IDCT and MC using Intel's MMX technology.

They reported a real-time MP@ML decoding; however, the decoder was not fully

compliant with the MPEG-2 standards. In Zhou et al. [16], a theoretical computa-

tion analysis of MPEG decoding was presented and the authors suggested using VIS

(Visual Instruction Set) to achieve real-time decoding. However, their analysis only

took into account the least arithmetic operations during the decompression stage,

and the implementation results were yet to be reported. There are some public do-

main and commercial software decoders capable of real-time DVD (MP@ML) quality

decoding [11, 16]. These products should not be categorized as pure-software solutions, since they are optimized with specific hardware multimedia support (e.g., Intel's MMX instruction set); thus they are still hybrid decoders. Another hardware-based approach takes advantage of a redundant DSP unit and the very wide internal

bus design, which makes it possible to exploit instruction-level parallelism (such as

VLIW); some of the work is reported in references [11, 17].

In summary, all of the solutions discussed above depend, to differing degrees, on specific hardware features, either multimedia instructions in the CPU or a video display card. Thus these solutions are very difficult to port to different hardware platforms. Furthermore, these solutions usually cannot support the scalability features of high-profile MPEG-2 video. For example, in most

of these schemes, the 30 frame per second (FPS) frame rate is used as the target

performance. In some high-end profiles, however, a higher frame rate and resolution

are required (e.g., MP@HL progressive video format defined a spatial resolution of

1920*1152 at a sampling rate of 60-fps [10]).

On the other hand, without the support of dedicated hardware, the compu-

tation of MPEG-2 decompression could be very demanding, especially for the high

profile/level streams. The MP@ML MPEG-2 stream with one base-layer video stream

needs at least 2-Mbps bit-rate (an average of 6-Mbps is used in DVD video). For

some high-profile high-quality MPEG-2 videos, up to 40-Mbps could be necessary

to support MPEG-2 extension layers (with MPEG-2 scalability features). The ex-

tension layers provide additional information that could be used in the decoding

procedure to produce more vivid video quality. They also require additional com-

putation in the decompression procedure. The decoding of the extension layer has

to go through a procedure similar to the base layer; thus the decoding time of a

scalable MPEG-2 video stream will increase proportionally to the number of layers.

For a stream with two scalability features (temporal scalability and SNR scalability)

at the spatial resolution 1920*1152 (the image size is six times that of

MP@ML quality video), the required computation is roughly 3*6 = 18 times that of

MP@ML MPEG-2 video decoding. This huge amount of computation requirement

makes it very difficult for the real-time playback of a software-only decoder, even with

the fastest CPU available. Hence, functionality-based or data-based parallelization

should be considered to boost the decoding performance.

Various parallel schemes have been studied for video encoding in the literature

[18-20]. In Akramullah et al. [19], a data-parallel MPEG-2 encoder is implemented

on an Intel Paragon platform, reporting a real-time MPEG-2 encoding. Gong and

Rowe [20] proposed a coarse-grained parallel version of a MPEG-1 encoder. A close-

to-linear speed-up was observed. Comparable work of a parallel MPEG-4 encoder has

been reported [21]. The authors used different algorithms to schedule the encoding

of multiple video streams over a cluster of workstations. A parallel MPEG-2 decoder

based on shared-memory SMP machine was reported [22]; however, they did not

address how real-time decoding should be supported for the high-profile and high-

level video source. In Ahmad et al. [18], the performance of the parallel MPEG

encoder based on Intel's Paragon system was evaluated. In their work, the I/O node provides several raw video frames to the compute nodes in a round-robin fashion, and

each compute node performs the MPEG encoding, sends back the encoded bitstream,

and requests more video frames from I/O node. The I/O node serves the requests

from compute node in a FCFS manner, which may reduce the idle time in the I/O

node. However, this scheme allows out-of-order arrival of encoded frame data and the

I/O node has to track the dynamics of the assigned frames for each compute node.

This increases the complexity in the I/O node and requires much higher buffer space

to preserve the original sequence of video frames. In our case, since the master needs

to display the decoded video in exact order, it must buffer decoded frames that arrive before their preceding frames. The build-up of the frame buffer might exceed the physical memory on the master side if the same scheduling were used. Thus we use an in-order data distribution between the master node and the slave nodes.

Yung and Leung [23] implemented an H.261 encoder on an IBM SP2 system.

Both spatial and temporal parallelization were investigated. Their results showed








that MB-level parallelism could provide a speed-up as high as frame/GOP-level

parallelization. However, this claim does not hold for parallel decoding. Our results

show that a data-partition scheme (MB level) can only provide limited performance

gain when the number of slave nodes increases. Furthermore, the degree of scalability

highly depends on the communication bandwidth, which was not fully exploited in

their study. By carefully investigating the impact of a Master/Slave communication

pattern, we found that the communication pattern in parallel decoding can be more

complicated than in the case of encoding, which eventually becomes the key factor

determining the peak system performance.















CHAPTER 3
DATA-PARTITION PARALLEL SCHEME


Our initial investigation focused on the data-partition scheme. We consider

data partition on the macroblock level since the majority of MPEG-2 decompression

computation is spent on block level IDCT and motion compensation. This scheme

allows a low complexity in each computing node, since only part of the MPEG-2

decoding procedure needs to be implemented in the computing node. The data-

partition scheme also has the advantage of a potentially quick response to end-

user commands. As explained in the rest of this chapter, a video frame is divided into

several groups of macroblocks that are decompressed in slave nodes simultaneously.

Thus the system can respond very quickly (always in less than the decoding time of

one frame for non-preemptive scheduling). In fact, the response time is inversely

proportional to the parallel gain in the data-partition scheme. An important design

factor in the data partition algorithm is the parallel (partition) granularity. It is

possible to divide a frame at the pixel level; however, such a fine-grain scheme may

introduce too much overhead. We believe that the macroblock level data partition

scheme should be a natural choice for our initial investigation. In the following

discussion, all partition algorithms are based on macroblock level data decomposition.

To maximize the decoding performance, the following issues should be considered:

In general, the percentage of the parallelized computing components determines
the upper bound of the potential performance, which indicates that we need to
parallelize as many decompression steps as possible. Our preliminary experi-
ments show that the processing of macroblocks takes about 95% of the total
computation. Thus, by adopting macroblock-level parallelization, we expect
that the majority of the computation is parallelized.

New overhead caused by the parallel scheme should be analyzed. Specifically,
due to the inter-frame and intra-frame dependence, our data-partition scheme









will introduce additional communication cost. For example, motion compen-
sation may require reference pixel blocks from a previous frame. How to reduce
the data communication cost becomes critical. As we will see later, differ-
ent partition algorithms can result in different amounts of additional reference
data. Finding the optimal partition to minimize the communication overhead
becomes a main challenge.
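The first consideration above is essentially Amdahl's law: if a fraction p of the decoding work is parallelized, the speed-up over n slave nodes is bounded by 1/((1-p) + p/n). A minimal sketch (illustrative only; the function name is ours), using the 95% figure from our measurements:

```python
def amdahl_speedup(p, n):
    """Upper bound on speed-up when a fraction p of the work runs on n nodes."""
    return 1.0 / ((1.0 - p) + p / n)

# With 95% of the decoding parallelized (macroblock processing):
print(amdahl_speedup(0.95, 8))     # about 5.9x on 8 slave nodes
print(amdahl_speedup(0.95, 1e9))   # approaches the 1/(1 - 0.95) = 20x ceiling
```

Even with every macroblock-level step parallelized, the remaining sequential 5% caps the achievable gain before any communication overhead is counted.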


Figure 3.1: Parallel decoding architecture. The video stream travels from the remote
server over an ATM backbone to the master node, which exchanges data with the
slave nodes over a server-side local area network.


Figure 3.1 depicts the system architecture of our parallel decoder. The system

consists of a master node, which receives the MPEG-2 encoded video stream from a

remote video server, and several slave nodes to do the decompression. To distribute

the whole decompression workload to the slave nodes, the master node splits the

bitstream into frames. Each frame, as a set of macroblocks, is further divided into

N parts, for the N slave nodes. These N sets of raw data, together with necessary

reference picture data, will be sent to the slave nodes for decompression. The master

node has to wait until all slave nodes finish their decoding job, and send the decoded

macroblocks back to the master node. The decoded macroblocks are then merged

into one complete image for display. The majority of the MPEG-2 decompression

steps (e.g., IDCT, MC, Dithering) are performed at slave nodes. The master node

has as its main duties bitstream parsing and subtask distribution to the slave nodes. The

above procedures are illustrated by the following pseudo-codes:

(1) Control algorithm in the master node. The master node receives the bit-

stream from the server, extracts frame data, partitions it into subtasks, packs them with










reference frame data, and sends them to the slave nodes. It then waits for the decompressed

data.

Procedure MasterControl
Initialize internal buffers and start slave nodes
WHILE (there is more data in input bit-stream
from server)
Rbuff = buffer for the current frame data
Perform a partition over Rbuff, resulting in a set of N
sub-task buffers:
P1, P2, ..., PN /* refer to the
partition algorithms in the next section */
Update new reference data Ref_i for each slave
FOR (i = 1 to N)
Send to slave i the sub-task Pi and reference data Ref_i
END FOR
Wait to receive decoded macroblocks from each slave node
Combine the partial frames into the whole picture
IF (current frame is on the top of display buffer)
call display routine
END IF
END WHILE
End Procedure


(2) System-level algorithm of the slave node for the data-partition scheme: Each slave node
receives its portion of the data from the master, performs the MPEG-2 decoding procedure, and sends
the decompressed macroblocks back to the master node.


Procedure SlaveDecode
Initialize internal buffers and
set up session decoding parameters (from master node)
WHILE (no terminate signal from master node)
Receive raw data from master
/* perform MPEG-2 decoding over the given portion of data,
including: */
Inverse run-length decoding to restore DCT
coefficients/motion vectors
Inverse DCT
Perform motion compensation
Perform dithering
Send the decoded macroblocks to the master
END WHILE
End Procedure









3.1 Performance Model of Data-Partition Scheme

The performance of our data-partition scheme can be represented by the frame

decompression time Tf(P) with respect to a given partition P. The determination

of Tf(P) depends on (1) the computation time in the slave nodes, Ts; (2) the transmission time

for the interprocessor communication, Tt; and (3) the housekeeping computation in

the master node, Tm. Since the decompression in the slave nodes and the communication can

be performed simultaneously, Tf is dominated by the longer computation

path. We have


Tf = Tm + max{Ts, Tt}    (3.1)



Table 3.1: Notation used for decoding modeling with the data-partition scheme

H          number of macroblocks of a frame in the vertical dimension
W          number of macroblocks of a frame in the horizontal dimension
F = {m_ij : 0 <= i < H, 0 <= j < W}    the set of macroblocks of a frame
K          number of total slave nodes
P          a particular partition of F into K parts (subsets)
Tf(P)      decoding time of one frame with respect to partition P
TC_P       communication time per frame for a given partition P
S_P,k      subset of F assigned to the kth slave node
T_k        per-macroblock decoding time at the kth slave node
C_P(k)     the amount of data communication per frame at the
           kth slave node
B          network bandwidth


Statistics show that Tm could vary from 5% to 10% of the sequential decoding

time, and we use the higher bound for the sake of brevity. Table 3.1 lists the

terminology used to calculate the other two costs. Partition P consists of K disjoint

subsets of macroblocks S_P,1, S_P,2, ..., S_P,K, whose union is F. The computation is

performed simultaneously on the slave nodes. For each slave node, a different amount of

decompression time is required, based on the decoding workload (the size of each S_P,k)









and CPU power (i.e., CPU clock rate). Therefore, from the view of the master node,

the slowest slave node determines the overall decompression time.

The communication costs between master and slave nodes should include the

transmission of raw data and decoded data. Aside from raw data, the master node

must send reference data to slave nodes, because of the motion prediction technique

used in MPEG-2 (see footnote 1). Given a frame partition P, the frame decompression time can

thus be expressed as

Tf(P) = Tm + max{ max_{k in {1...K}} ( ||S_P,k|| * T_k ),  (1/B) * sum_{k in {1...K}} C_P(k) }    (3.2)

Thus the decompression time is a joint effect of the system hardware parameters

(T_k, B, etc.) and the partition P (both S_P,k and C_P(k) are functions of P). Given

a hardware environment, an optimal partition should be found to minimize the de-

compression time. In other words, the optimal partition should minimize Tf over

all feasible partitions:

P* = argmin_{P over all feasible partitions} Tf(P)    (3.3)

It can be shown that, to minimize the first term on the right side of equation

(3.2), we should assign the subtasks S_P,k in proportion to the processing power of

the kth slave node. For example, in a two-slave-node situation where T_1 : T_2 = 1 : 4

(i.e., the first node decodes a macroblock four times faster), the optimal partitions should have the property ||S_1|| : ||S_2|| = 4 : 1, where

max_k(||S_k|| * T_k) is minimized. However, partitions with this property might not result

in a minimum for the second term in (3.2). For example, the second term is always

minimal when all the macroblocks are assigned to one slave node. This partition

unfortunately results in the longest overall computation time. The trade-off between

computation and communication makes it nontrivial to find the optimal (slave

number, partition) pair. Before we further analyze equation (3.2), we should establish

a direct relation between the data communication Cp(k) and the partition P.
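Equations (3.1)-(3.3) can be evaluated mechanically once the workloads and communication volumes are known. A minimal sketch of the model (the function and variable names are ours, not from our implementation):

```python
def frame_time(Tm, workloads, per_mb_time, comm_bits, B):
    """Frame decompression time Tf per equations (3.1) and (3.2).

    Tm          -- master housekeeping time per frame (seconds)
    workloads   -- ||S_P,k||: macroblocks assigned to each slave node
    per_mb_time -- T_k: per-macroblock decoding time of each slave (seconds)
    comm_bits   -- C_P(k): bits exchanged with each slave per frame
    B           -- network bandwidth (bits per second)
    """
    Ts = max(w * t for w, t in zip(workloads, per_mb_time))  # slowest slave
    Tt = sum(comm_bits) / B                                  # transfer time
    return Tm + max(Ts, Tt)  # computation and communication overlap
```

Minimizing Tf over all feasible partitions, as in equation (3.3), then amounts to searching over the workload/communication pairs induced by each candidate partition.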
1. Macroblocks of P and B frames may be encoded using motion prediction. They are predicted by macroblocks
from the searching area of a previous I or P frame. The search area is defined as a rectangle centered at the
predicted macroblock.









3.2 Communication Analysis in Data Partition Parallel Scheme

For a given partition P, the communication between the kth slave node and

master Cp(k) can be further divided into three parts: (1) the amount of raw data from

the master node to the kth slave node, which is approximately ||S_P,k|| * s_m (assuming

that a macroblock in the MPEG-2 bitstream requires an average of s_m bits); (2) because

of motion compensation, MPEG-2 has a strong inter-frame data dependency: the

encoding of P-frames and B-frames requires other (previous) I or P frames as references.

This is the very reason for the additional inter-node data communication. We

use R_P(k) to represent the reference data at the kth slave node under partition P. (3)

The decoded macroblocks in the kth slave node require ||S_P,k|| * 16^2 * 8 bits, assuming

one byte per pixel in the 16 by 16 macroblock.

Therefore the communication time (second term of equation (3.2)) can be

rewritten as

TC_P = (1/B) * sum_{k in {1...K}} [ (||S_P,k|| * s_m) + (||S_P,k|| * 16^2 * 8) + R_P(k) ]    (3.4)
It is noticed that sum_{k in {1...K}} ||S_P,k|| * s_m is exactly the size of one frame in the

MPEG-2 bitstream, and sum_{k} (||S_P,k|| * 16^2 * 8) is the size of a decoded frame. We have

TC_P = ( S_e + S_d + sum_{k in {1...K}} R_P(k) ) / B    (3.5)

Here S_e and S_d represent the compressed frame size and the original picture size, re-

spectively. Given an MPEG-2 video stream, the average S_e is obtained from the en-

coded bit-rate, and S_d comes from the image horizontal and vertical sizes encoded

in the bitstream. The total reference area in the kth slave node, R_P(k), is the union of

the reference areas for each macroblock in S_P(k). According to MPEG-2, the predict-

ing macroblock is selected from a searching window, which is defined as a square area

centered at the target macroblock. The dimension of the searching area is usually

fixed throughout a given video title. Let s be the searching window size predefined at en-

coding time; the reference area for one macroblock is (16+s)*(16+s) = 16^2 + 32*s + s^2









pixels. The difference between R_P(k) of the current frame and S_P(k) of the referenced frame

has to be transmitted from the master to the slave node.

To simplify our discussion, we assume a static partition scheme where the

partition is determined in the initial phase and fixed throughout the decoding session.

The fixed partition in fact provides a chance to reduce R_P(k): since the same

area of each frame is decompressed in the same slave node, the difference between

R_P(k) and S_P(k) can be reduced significantly. An upper bound of R_P(k) is given by

R_P(k) <= ||S_P(k)|| * (s^2 + 32 * s)    (3.6)

With a static partition, the reference area only needs to be transferred once

per GOP (Group of Pictures), since it can be reused by the successive B-frames (not

separated by an I- or P-frame). Therefore the transmitted reference data can be further

reduced. Assume that the system GOP pattern is I : P : B = 1 : b : c; there will be

a total of (1 + b + c) frames in one GOP. The referenced frames can only be I and P

frames. Thus the required reference transmission ratio is gamma = (1 + b) / (1 + b + c). The amount of

reference data should be averaged over a GOP. TC_P is revised as

TC_P = ( S_e + S_d + gamma * sum_{k in {1...K}} R_P(k) ) / B    (3.7)


Substituting equation (3.6) into (3.7), the total communication time is

bounded by

TC_P <= ( S_e + S_d + gamma * (s^2 + 32 * s) * sum_{k in {1...K}} ||S_P(k)|| ) / B
     = ( S_e + S_d + gamma * (s^2 + 32 * s) * H * W ) / B
3.3 Partition Example and Communication Calculation

Our first partition example (P1) is a result of the Horizontal-Vertical(HV)

Partition algorithm, which is defined by the following recursive procedure: if the pre-

vious partition operation was performed along the horizontal dimension, then we partition










the image vertically this time. If the partition number K is even, then we sep-

arate the working area into two parts with equal size, with the partition dimension

decided above. Otherwise, the whole area is partitioned into two pieces with area

ratio of 1 : (K - 1). For each of the two resultant parts, we perform the above steps

correspondingly.

By applying the HV Partition algorithm with input K = 4 (i.e., 4-way

partition), we have our first partition P1, shown in Figure 3.2. P1 separates the

original picture into 4 adjacent areas: A, B, C, and D. We found that the reference

area R_P1(k) consists of the belt-shaped areas along the internal edges. The width of

the belt is the searching window size s. A simple way to calculate the total reference

area is to add the length of all internal edges.



Figure 3.2: Horizontal-vertical partition into four pieces A, B, C, and D.


For a 720*480 picture, each part contains two internal edges, resulting in a

total length of 360 + 240 = 600 pixels. The total internal edge length for this partition

is thus 4 * 600 = 2400 pixels. Each pixel contains 3 color elements (each requiring 8

bits). With a searching window size of 32, the total amount of reference data is:

sum_k R_P(k) = 4 * (600 * s * 8 * 3) = 1.8 Mbits. For a GOP structure of IBBPBBPBB,












gamma = 1/3. The total communication time under an 80-Mbps (sustained) network bandwidth

should be TC_P = ( S_e + S_d + gamma * sum_{k in {1...K}} R_P(k) ) / B = (0.5 + 2.76 + 0.6) / 80 = 48 msec.
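The arithmetic above is easy to check mechanically. Using the per-frame sizes from this example (S_e = 0.5 Mbits, S_d = 2.76 Mbits, reference data 1.8 Mbits, gamma = 1/3, B = 80 Mbps), a short computation (function name ours) reproduces the 48-msec figure:

```python
def comm_time_ms(Se, Sd, R_total, gamma, B):
    """Per-frame communication time TC_P in msec; sizes in Mbits, B in Mbps."""
    return (Se + Sd + gamma * R_total) / B * 1000.0

# GOP pattern IBBPBBPBB gives I:P:B = 1:2:6, so gamma = (1 + 2) / (1 + 2 + 6)
gamma = 3.0 / 9.0
print(comm_time_ms(0.5, 2.76, 1.8, gamma, 80.0))  # 48.25 msec
```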


Figure 3.3: Another way to partition a picture into 4 parts: (a) another 4-piece
partition; (b) the reference area of piece A.


An alternate partition is shown in Figure 3.3.(a). There are 8 internal edge

segments, separating the picture into four identical pieces. The reference area of the

upper-left piece is shown as the gray area in Figure 3.3(b). Using the same calculation

method as in the above example, the total reference data increases to 3600*32*8*3 =

2.8 Mbits. Compared to the previous partition, this partition requires 50% more

reference data. The total communication cost increases to 4.19 Mbits/frame.

3.4 Communication Minimization

To obtain the minimum communication cost, the best partition should be

found, given the image shape and the number of slave nodes. One approach is to

use brute-force searching. However, the size of the search space is prohibitively large

even if we enforce equal-size partitions. The number of possible partitions can

be counted as follows: let H * W be the number of macroblocks in a frame,

and let there be K slave nodes. The problem is the same as putting H * W objects

into K equal-size boxes. This is equivalent to the combination of choosing X = (H * W) / K objects









from the H * W objects, followed by choosing another X from what is left, until no more

objects remain. The total number of combinations is

C(H*W, X) * C(H*W - X, X) * ... * C(2X, X) * C(X, X)
= [(H*W)! / (X! (H*W - X)!)] * ... * [(2X)! / (X! X!)] * [X! / (X! 0!)]
= (H*W)! / (X!)^K


For a 1K by 1K picture, there are 4096 macroblocks. When partitioned among

four slave nodes, the number of partitions is on the order of 10^690. For K = 8, it

is about 10^1922. For K = 16, this number easily exceeds 10^3155. The size of the

searching space increases exponentially with picture size. To avoid this explosive

searching space, some heuristic algorithms are desired.
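The magnitude of (H*W)!/(X!)^K can be gauged without enumerating anything, using log-factorials. A small sketch (function name ours; only the small exact case in the comment is checked against the closed form):

```python
import math

def log10_partition_count(n_mb, K):
    """log10 of n_mb! / (X!)**K with X = n_mb // K (equal-size partitions)."""
    X = n_mb // K
    nats = math.lgamma(n_mb + 1) - K * math.lgamma(X + 1)  # ln of the count
    return nats / math.log(10)

# Sanity check: 8 macroblocks over 2 nodes -> 8! / (4! * 4!) = 70 partitions
print(round(10 ** log10_partition_count(8, 2)))  # 70
```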

The partition problem can also be addressed as an instance of the quadratic as-

signment problem (QAP), a classical combinatorial optimization problem, which is NP-

hard. Using the notation in Table 3.1, the equal-size partition implies: for the K

subsets S_P,k of frame F, the following properties hold: ||S_P,i|| = ||S_P,j||, S_P,i and S_P,j are disjoint, and

the union of all S_P,i equals F, for 0 <= i, j < K. Under this assumption, the optimization of Tf is
equivalent to the optimization of Tt (see equation (3.2)). Let the total N (N = H * W)

macroblocks be indexed from 1 through N. The objective function is:


min  sum_{i,j in {1...N}} sum_{k,h in {1...K}}  X_{i,k} * X_{j,h} * d_{k,h} * r_{i,j}    (3.8)
The problem becomes the determination of the N * K variables X_{i,j}, 1 <= i <= N

and 1 <= j <= K, that minimize equation (3.8). X_{i,j} is subject to the following

constraints:

X_{i,j} in {0, 1}    (3.9)

sum_{i=1}^{H*W} X_{i,j} = (H * W) / K    (3.10)

sum_{j=1}^{K} X_{i,j} = 1    (3.11)









Here X_{i,j} = 1 when macroblock i is assigned to the jth slave node. Equations

(3.10) and (3.11) state that each slave node is assigned (H * W) / K macroblocks, and each macroblock

can be assigned to only one slave node. d_{k,h} is the transmission time to send a unit of

decoded data from the hth slave node to the kth slave node, or vice versa. We assume

that the communication bandwidth between any pair of nodes is identical (i.e.,

a homogeneous environment). r_{i,j} represents how much data in the ith macroblock

happens to be the reference data for the jth macroblock. This is determined by

the relative position of the two macroblocks in the image. It is obvious that r_{i,j}

is symmetric, and is in fact the intersection between s_i (the searching window area of

macroblock m_i) and m_j. When s < 16, r_{i,j} is defined by

r_{i,j} = 0         if i = j
r_{i,j} = 16 * s    if m_i and m_j are neighbors
r_{i,j} = s^2       if m_i and m_j intersect at a corner
r_{i,j} = 0         otherwise
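Treating macroblocks as cells of the H x W grid, the piecewise definition of r_{i,j} can be written directly (a sketch; representing macroblocks by (row, column) coordinates is our choice, not notation from the text):

```python
def r(mi, mj, s):
    """Reference overlap r_{i,j} between macroblocks at grid coords mi, mj (s < 16)."""
    di, dj = abs(mi[0] - mj[0]), abs(mi[1] - mj[1])
    if (di, dj) == (0, 0):
        return 0          # same macroblock: no inter-node reference needed
    if di + dj == 1:
        return 16 * s     # edge-adjacent neighbors
    if di == 1 and dj == 1:
        return s * s      # blocks that meet at a corner
    return 0              # too far apart to fall inside the searching window
```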
3.5 Heuristic Data Partition Algorithm

For general QAP problems, genetic algorithms and simulated annealing algo-

rithms (SA) have been studied intensively [24, 25]. Unfortunately, when applied to

our problem, these methods fail to produce satisfactory performance. The reference

data obtained from SA algorithm with input sizes of 100, 300 and 500 macroblocks

are 8448, 35712 and 56448 pixels respectively. The SA algorithm takes 1 minute, 30

minutes and 5 hours respectively. When using the HV partition algorithm, for the

same input size, the costs are 7680, 15360 and 17280 pixels respectively, and the run-

ning time is on the order of several seconds. The HV algorithm takes advantage of

the problem's nature by grouping macroblocks in their natural layout. Furthermore, we

proposed a more dynamic partition algorithm that exploits more information from

the shape of the partitioned area, named Quick-Partition:










1. Given input (K, hd, wd, S), where K is the number of partitions required, S is a

rectangular area to be partitioned, hd is the length of S in the horizontal dimension,

and wd is the length of S in the vertical dimension.


2. If K is less than 2, then no further partition is required.


3. Calculate k1 = floor(K/2) and k2 = ceil(K/2) as the numbers of parts for the two smaller rectangles to be

generated from S.


4. If hd > wd, we find a vertical line that separates the rectangle S into two

sub-rectangles with an area ratio of k1 : k2. Otherwise, a horizontal line should be

found similarly.


5. Calculate the width and height of the two sub-rectangles; name them S1 and

S2.


6. For each of S1 and S2, we perform the same procedure from step 1 (with K

replaced by k1 and k2, respectively).
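The recursion above can be transcribed as follows (an illustrative sketch, not the dissertation's implementation; rectangles are (x, y, hd, wd) tuples in macroblock units, and integer division makes the k1 : k2 area ratio approximate):

```python
def quick_partition(K, x, y, hd, wd):
    """Recursively split the rectangle at (x, y) with horizontal extent hd and
    vertical extent wd into K parts of roughly equal area (Quick-Partition)."""
    if K < 2:
        return [(x, y, hd, wd)]
    k1, k2 = K // 2, K - K // 2           # floor(K/2) and ceil(K/2)
    if hd > wd:                            # cut the longer side: vertical line
        h1 = hd * k1 // K                  # area ratio approximately k1 : k2
        return (quick_partition(k1, x, y, h1, wd) +
                quick_partition(k2, x + h1, y, hd - h1, wd))
    w1 = wd * k1 // K                      # otherwise a horizontal line
    return (quick_partition(k1, x, y, hd, w1) +
            quick_partition(k2, x, y + w1, hd, wd - w1))

parts = quick_partition(4, 0, 0, 45, 30)   # a 720*480 frame is 45x30 macroblocks
print(len(parts))                          # 4
```

Because each split covers its parent rectangle exactly, the K returned rectangles tile the frame with no overlap.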


Figure 3.4: Performance comparisons of different partition algorithms (picture size
720*480, search window size 32 pixels).


Figure 3.4 shows the amount of reference data as a function of the par-

tition number K and different partition algorithms. We compared the performance

of four algorithms: Horizontal-Partition, Vertical-Partition, HV-Partition and our









Quick-Partition algorithm. The Horizontal-Partition algorithm always divides the frame in

the horizontal dimension, while the Vertical-Partition algorithm partitions using vertical lines. The

HV-Partition algorithm partitions the frame with interleaved horizontal and vertical lines.

These algorithms are tested on a 720*480 picture, with the searching window size

set to 32. For the Horizontal-Partition algorithm, the reference data increases linearly

as the number of partition K increases. A similar (linear increasing) R(K) trend

is observed for the Vertical-Partition algorithm. The slope of the curve, however, is

less than that of Horizontal-Partition. This is because the width of a video frame is

actually larger than the height. At K = 4, the Vertical-Partition algorithm results

in an about 2.8 M-bits/frame reference data, which is 33 % less than that of the

Horizontal partition. The same percentage of reduction holds for other values of K.

The HV-Partition algorithm generates a much better result in terms of reducing the

reference data. For partition numbers K = 4, 8, 16, and 32, the corresponding R(K)

are 1.8432, 4.055, 5.5296, and 9.9533 Mbits/frame, respectively. Compared with the

Horizontal or the Vertical partition, there is a remarkable reduction in R(K), espe-

cially for high partition numbers. For example, the reductions of the reference data

(compared with the Vertical-Partition) at the above points are 16.67%, 21.43%, 50%,

and 56.45%, respectively. Our Quick-Partition algorithm produces the least amount

of reference data among the four algorithms. In general, the Quick-Partition performs

as well as or better than the HV algorithm. The resultant reference data of the

Quick-Partition for K = 2, 8, and 32 are 33.33%, 18.18%, and 14.81% less than that

of the HV-Partition, respectively.

3.6 Experimental Results

The performance of the data-partition scheme is verified by experiments over

two hardware/software configurations:

100-Mbps LAN: A 100-Mbps Ethernet, where each node is equipped with a Pentium-
II 450-MHz CPU and 128 MB of memory. The operating system is Red Hat
Linux 6.2.











Figure 3.5: Decoding rate under a 100-Mbps network (test stream 720*480, searching
window 32): (a) Vertical-Partition; (b) Quick-Partition. Each panel plots the expected
versus the actual frame rate.


680-Mbps SMP: This environment is a symmetric multi-processor (SMP)
server machine. Each processor is a 248-MHz UltraSPARC. The actual
inter-node communication bandwidth can reach up to 680 Mbps, based
on our measurements. This machine is used to emulate a Gigabit local network.

3.6.1 Performance over A 100-Mbps Network

The 100-Mbps network has a total of 16 slave nodes, each equipped with a Pentium

450-MHz processor. Each node is capable of decoding at 10 fps (with compiler

optimization). This configuration is widely available in many organizations and com-

panies. The performance of our partition algorithm is depicted by Figure 3.5. Figure

3.5(a) shows the performance when the Vertical-Partition algorithm is used, and Figure

3.5(b) shows the case of the Quick-Partition algorithm. The following observations can

be made:

For small values of K, both partition algorithms produce close-to-linear perfor-
mance improvements. For the Vertical-Partition algorithm, the predicted de-
compression rates for K = 1 and 2 are 8 and 14 fps, respectively. The actual
frame rates are 7 and 12 fps, which is very close to our expectation.

The Quick-Partition algorithm has a result similar to that of the Vertical-
Partition when K is small (K <= 3). The predicted and observed decompres-
sion rates are nearly identical. This is because when the number of partition
is small, the slave decompression time still dominates the decoding cost. The
difference in reference data of the two partition algorithms is overwhelmed by
the decompression time at slave nodes.

At K = 3, the cost of communication begins to play a determinant role. The
benefit of the reduced node-decoding time is canceled by the increase in com-
munication time (which is mainly caused by the reference data). The decoding
rate at K = 3 is only 14 fps, which is only 16% higher than that of K = 2.










Figure 3.6: Decoding rate under the 680-Mbps environment (HV-Partition and
Quick-Partition; test stream 720*480, searching window 32).


K = 3 represents the peak decompression rate of this configuration. Further in-
creasing K results in degraded performance. For the Vertical-Partition
algorithm, we observed 12, 9, and 6 fps for K = 4, 8, and 16, respectively. The per-
formance degradation of the Quick-Partition algorithm is not as severe: the
decompression frame rates are 16, 13, and 11 fps, respectively. The performance
improvement is limited, since the network is not fast enough.

3.6.2 Performance over a 680-Mbps Environment

This experiment is performed with a 15-node SMP server. Each node contains

an UltraSPARC 248-MHz CPU. When used alone, each CPU can decode at 5 fps.

The inter-node communication bandwidth is up to 680 Mbps. This environment

is used to emulate a Gigabit network. The performance results for the HV-Partition

and the Quick-Partition algorithm are presented in Figure 3.6.

It is observed that the overall performance of our data partition parallel scheme

improved significantly even though the processing power at the individual node de-

creased. Detailed discussions are listed below:

When using HV-Partition, we observe a close-to-linear increase of the decom-
pression rate until K = 8. The predicted frame rates are 4.8, 9.3, 16.7, and 29 fps
for K = 1, 2, 4, and 8, respectively. The actual frame rates are only slightly less
than the predicted values (the difference is less than 10%). When K is in the
range of 8 to 14, the decompression rate keeps increasing; however, the rate of
increase slows down. For example, when the number of slave nodes doubles








from 7 to 14, the decompression rate (actual) increases from 23 fps to 30 fps
(only a 30% performance gain).
* For the Quick-Partition algorithm, we observed considerable performance im-
provement. The close-to-linear speedup is sustained until K = 16, where a
theoretical peak decompression rate of 44 fps is expected. The actual decoding
rates match our model closely; the decoding rates for K = 1, 2, 4, 8, and 14
are 4.6, 9.0, 15, 25, and 40.4 fps, respectively.

* At K = 16, the decoding performance reaches its peak for the Quick-Partition
and Vertical-Partition algorithms, and the overall system performance starts go-
ing down. Again, this is because the communication cost of the reference data
overwhelms the capacity of the system communication. However, due to the
lack of more processing nodes in our SMP server, we are unable to verify the
performance trend after K = 14.















CHAPTER 4
PIPELINE PARALLEL SCHEME


The major overhead of the data-partition scheme exists in the master node,

where the compressed video stream needs to be parsed down to the macroblock level.

This portion of computation becomes significant when the number of slave nodes

increases, thus preventing further improvement of the decoding rate. Unfortunately, multiple

master nodes (as used by Yung et al. [23]) will not work, since the bitstream

is VLC encoded and highly auto-correlated. An alternate scheme for further perfor-

mance enhancement relies on increasing the decoding unit, by distributing a block

of frames to each workstation.

4.1 Design Issues of Pipeline Scheme

Several design issues should be addressed in order to have a high parallel

processing gain.

The idle time at the slave nodes should be minimized. There might be a waiting
time if a slave node cannot receive a new block of frames from the master node
once the decompression of the previous block is finished. For this purpose, the
master node is designed such that it feeds new data to the other slave nodes during
the computation time of a particular slave node. Therefore, the waiting time in
the slave nodes is minimized. Coupled with this concern is the load balancing
between slave nodes. A successful parallel scheme should avoid overwhelming
some nodes while starving others. Our experimental results show
that the standard deviation of CPU usage across the slave nodes is low, reduced to less
than 0.03 when the block size is set to one GOP. We believe this fact will hold
regardless of the video contents being tested, as long as the underlying slave nodes
are homogeneous. A more adaptive task scheduling would be favored in the
case of a heterogeneous environment; it seems that a static assignment of block
size based on the CPU speed of the individual slave node should be sufficient.
We will not discuss this dimension, for the sake of space.
The workload in the master node should be minimized. In fact, the master
node represents the only sequential link in the pipeline scheme. The master
node must perform MPEG-2 header parsing so that the frame data can be
extracted for each slave node. By developing a quick bitstream parser which









scans only the GOP start code and picture start code, the computation of this
procedure is negligible compared to the other MPEG-2 decoding steps.

An important design factor is the proper block size for each subtask. Should
the master node deliver 2 frames of raw data to a slave node each time? How
about 5 frames or 10 frames? We have found that a proper block size does
influence overall system performance significantly. In fact, the overall system
communication is mainly determined by the block size, due to the inter-frame
data dependency and the I-P-B frame structure in the encoded bit-stream.
For example, a slave node that decodes a P-frame will need an I-frame as the
reference. If the block size is not chosen carefully, the referred I-frame can be in
another node, thus we need to transmit one full decoded I-frame to the target
node. Notice that this also introduces additional delay which can significantly
affect the pipeline efficiency. In the worst case, a previously decoded I-frame
may be required by all slave nodes, if the following P- or B-frames happen to
be distributed across all of these slave nodes.

Another important design issue is the determination of the optimal number of
slave nodes. This is particularly important in a practical environment. The
question can be put as follows: given a certain hardware configuration, how
many nodes will the system need in order to reach the peak performance? We
have found that the network bandwidth is one physical limit. Given a certain
network, we are able to predict and verify the maximum number of the slave
nodes that will deliver the best decoding rate. Further increases in the number
of slave nodes will not improve the decoding rate.
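As a rough back-of-the-envelope illustration of this limit (our own estimate, not the analysis developed later): each busy slave exchanges roughly (S_e + S_d) bits per frame with the master, so at f frames per second per node the shared link saturates near N = B / ((S_e + S_d) * f). Reference-frame traffic, ignored here, lowers the practical limit further:

```python
def max_slaves(B_mbps, Se_mbits, Sd_mbits, fps_per_node):
    """Estimated slave count at which the shared link saturates (upper estimate,
    ignoring reference-frame transfers)."""
    per_node_mbps = (Se_mbits + Sd_mbits) * fps_per_node  # traffic per busy slave
    return int(B_mbps / per_node_mbps)

# 680-Mbps link, 0.5 + 2.76 Mbits per frame, 5 fps per node:
print(max_slaves(680, 0.5, 2.76, 5))  # 41
```

Because reference-data traffic is ignored, the observed peaks occur at smaller node counts in practice.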


In the following discussion, we present the pseudo code of our proposed parallel

algorithms. The detailed analysis on the communication overhead and performance

prediction will be addressed later. For simplicity, a homogeneous environment is

assumed where slave nodes have identical computing capability.


Algorithm 1 (Master Node): The Master node extracts a block of frames from the
incoming bit-stream and delivers it to a slave node, serving the slave nodes in turn.
This procedure repeats until the end of a video session. D is the block size, N is the
number of slave nodes.


Procedure Pipeline-Master(D)
    Initialize internal buffers and start slave nodes.
    rnd = 1    /* set rnd to 1 for the first round */
    FOR (j = 1 to N)
        /* Pipeline initialization: fill the slave nodes with
           raw data to start the pipeline. */
        send a block of D frames to the jth slave node for decoding
    END FOR

    WHILE (there is data in the incoming buffer)
        FOR (j = 1 to N)
            Receive decompressed data:
                frames (rnd-1).N.D + (j-1).D + 1 through
                (rnd-1).N.D + j.D from the jth slave
            Prepare a block of raw data:
                F = raw data of frames rnd.N.D + (j-1).D + 1
                through rnd.N.D + j.D from the incoming buffer
            R = the reference I-frame needed
            send F and R to the jth slave
        END FOR
        rnd = rnd + 1
    END WHILE
END Procedure
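The block arithmetic in the master loop can be sanity-checked with a short sketch. This is an illustration only (hypothetical values N = 3, D = 5; 0-based frame indices are used here, whereas the pseudocode above counts from 1):

```python
# Sketch: verify that Pipeline-Master's index arithmetic partitions the
# stream. In round rnd, slave j (1..N) receives the D frames starting at
# (rnd-1)*N*D + (j-1)*D (0-based indices).
N, D = 3, 5

def block(rnd, j):
    """Frames sent to slave j in round rnd (rnd = 1 is the initial fill)."""
    base = (rnd - 1) * N * D + (j - 1) * D
    return list(range(base, base + D))

rounds = 4
sent = [f for rnd in range(1, rounds + 1) for j in range(1, N + 1)
        for f in block(rnd, j)]
# Every frame is delivered exactly once, in stream order.
print(sent == list(range(rounds * N * D)))   # True
```

Because consecutive blocks are handed to consecutive slaves, the stream is covered with no gaps or duplicates, which is what allows the master to collect the previous round's output in the same round-robin order.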


Algorithm 1 depicts the control flow on the master side. Before entering stable

pipeline operation, the system has a start-up procedure to establish the pipeline.

The master node then enters round-based operation, represented by the for-loop block.

The interaction between the Master node and the Slave nodes at each round is

synchronized to have a continuous pipeline. During each normal round, the Master

node exchanges data with the Slave nodes in a round-robin manner. For each of the

N Slave nodes, the Master node will first receive the D frames of the previous round

that were decompressed at that Slave node. It will then send a block of D compressed

frames to the slave node. While the master moves on to serve the next slave, the

previous slave node can receive its new data and carry out decompression simultaneously.


Algorithm 2 (Slave Node) : Slave nodes receive a block of D frames each time from

the Master node, and decompress it.


Procedure Pipeline-Slave()
    Initialize internal buffers, receive decoding
    parameters and the D, N values from Master
    WHILE (no terminate signal from Master node)
        Receive D frames of raw data from master
        /* perform MPEG-2 decoding for the given
           frame data, including: */
        FOR (each of the D frames)
            run-length decoding to restore the
                DCT coefficients and motion vectors
            perform inverse DCT
            perform motion compensation
            perform dithering
        END FOR
        send the decoded frames to master
    END WHILE
END Procedure



As mentioned earlier, it is possible that the required reference frames (i.e., I-

and/or P-frames) are not available locally. Thus, these extra reference frames need

to be transmitted to the proper nodes, in addition to the raw video block. How

much extra transmission do these reference frames require? What are the

optimal values of D and N such that these extra transmissions are minimized?

The following example demonstrates the need for a proper D value. We assume that

the video shot is encoded with a fixed GOP (Group of Pictures) length L and a fixed

frame pattern within a GOP (e.g., I:P:B = 1:2:12). The following sequence depicts the

GOP pattern G = {I1, B1, B2, B3, B4, P1, B5, B6, B7, B8, P2, B9, B10, B11, B12}.

If D = 1, with two Slave nodes (N = 2), the first Slave node has a frame
pattern of G2,1 = {I1, B2, B4, B5, B7, P2, B10, B12}. The second slave node
has a pattern of G2,2 = {B1, B3, P1, B6, B8, B9, B11}. From G2,1, there are the
following data dependencies:

B2, B4 need I1, P1 as reference frames;
B5, B7 require frames P1 and P2;
P2 needs P1;
B10, B12 need P2, I2;

With the union of all required reference frames (i.e., I1, P1, P2 and I2) and the
elimination of those frames residing locally (i.e., I1 and P2), the remaining set
(i.e., P1 and I2) consists of those extra reference frames that need to be transmitted
to the first Slave node.

By continuing the same procedure, the extra reference frames for Slave node
2 are I1 and P2.











Thus, the total communication overhead for D = 1 with N = 2 consists of 4
extra frames (e.g., about 2 Mbits from our test video file).


Similarly, we can calculate the extra reference frames for other N's. For

example, when N = 3, we have the frame patterns in the three slave nodes as: G3,1 =

{I1, B3, B5, B8, B10}, G3,2 = {B1, B4, B6, P2, B11}, G3,3 = {B2, P1, B7, B9, B12}. The

extra reference frames of each node are {P1, P2} for node 1, {I1, P1, I2} for node

2, and {I1, P2, I2} for node 3. The total communication overhead now increases

to 8 extra frames (e.g., about 4.5 Mbits) with a D = 1 and N = 3 configuration.
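The bookkeeping in these examples is mechanical enough to automate. The sketch below (an illustration, not part of the original implementation) distributes blocks of D frames round-robin over N nodes, marks each B-frame as depending on its nearest surrounding anchors and each P-frame on its preceding anchor, and counts the reference frames that must cross node boundaries. It reproduces the 4 and 8 extra frames computed above:

```python
# Sketch: count extra reference-frame transmissions for block size D and
# N slave nodes. Frames are listed in display order; anchors are I and P.
GOP = list("IBBBBPBBBBPBBBB")          # I:P:B = 1:2:12, GOP length 15

def extra_refs(D, N):
    frames = GOP * 2                   # a second GOP supplies the next I-frame
    node = [(i // D) % N for i in range(len(frames))]
    anchors = [i for i, t in enumerate(frames) if t in "IP"]
    prev_a = lambda i: max((a for a in anchors if a < i), default=None)
    next_a = lambda i: min((a for a in anchors if a > i), default=None)
    extra = set()                      # (reference frame, destination node)
    for i in range(len(GOP)):          # dependencies of one GOP's frames
        t = frames[i]
        refs = [prev_a(i)] if t == "P" else \
               [prev_a(i), next_a(i)] if t == "B" else []
        for r in refs:
            if r is not None and node[r] != node[i]:
                extra.add((r, node[i]))
    return len(extra)

print(extra_refs(1, 2), extra_refs(1, 3), extra_refs(15, 2))   # 4 8 1
```

The last value shows the effect discussed below: once a block spans a whole GOP, only the next GOP's I-frame ever crosses node boundaries, so the overhead is near its minimum.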

Figure 4.1 depicts the calculated amount of extra communication when the values of

[Figure 4.1 plot: extra reference communication (Mbits per frame) versus the number of nodes, one curve for each block size D = 1, 2, 4, 8, 16; GOP = 1:2:12.]

Figure 4.1: Communication overhead with different (D, N) combinations



N and D increase. We use the same GOP frame pattern of 1:2:12 as discussed in

the above example. The frame size is 1024*1024 with one byte for each pixel, which

results in one megabyte for each frame. The purpose of this analysis is to identify

the performance trend of extra communication. Several trends can be observed as

following:

When D = 1 and N is relatively small, the extra communication increases
rapidly as N increases. It causes about 2.6 Mbits per frame of extra
communication when 2 nodes are used. Doubling the value of N from 2 to 4
increases the amount of extra communication from 2.6 Mbits to 7.3 Mbits (i.e., a
180% increase). An additional 64% (i.e., from 7.3 Mbits to 12 Mbits) of extra
communication is expected when the number of nodes increases to 8. However,
when the number of nodes increases from 8 to 16, the extra communication is
expected to saturate at 14 Mbits per frame (i.e., a 16.7% increase).
We believe the peak overhead on the extra communication is reached because the
GOP pattern repeats itself among these 16 nodes. When the number of
nodes is greater than the number of frames within a GOP, only a small variation
in the communication overhead is expected.
When D = 2, a similar performance trend is expected. We expect a communication
overhead of 2.6 Mbits, 5.7 Mbits, and 6.8 Mbits per frame when N
is 2, 4 and 8, respectively. The traffic increases are thus 120% and 20%. Less
communication overhead is expected, and saturation occurs earlier compared
to the D = 1 case. This is expected since a larger D provides the opportunity
to share the same reference frames within the same node, which is impossible
when D = 1. For instance, a node that is assigned two consecutive B-frames
only needs one extra I-frame and one extra P-frame.
This trend is further confirmed when D = 4, 8 and 16. Especially when D = 16,
the minimum communication overhead can be approached. Furthermore, the
communication cost remains unchanged when the number of slave nodes
increases, because each node is now assigned a block consisting of a whole GOP
and possibly the I-frame from the next GOP. Therefore, the extra
communication for other reference frames is expected to be minimal.
Given a fixed D, the derivative of the curve decreases when N becomes large. In
fact, the curve always converges into a constant value when N is large enough.
The increase of D can affect the extra communication significantly. While
keeping the value of N fixed, increasing D can result in a reduction in extra
communication cost.
For N = 8, a 44% reduction is achieved if the value of D changes from 1
to 2. A further increase of D to 4 and 8 will produce an additional 46%
and 56% performance gain respectively. Increasing the value of D means a
bigger granularity, thus some B- and/or P-type frames can share the same
reference frames within one slave node. Therefore, in terms of reducing the
communication overhead, increasing the D value is quite effective.

The above examples are based on the GOP structure where I:P:B =1:2:12.

For other GOP structures, a similar communication analysis can be conducted. For

instance, in the case of I:P:B = 1:4:10, we have more P-frames and fewer B-frames. The

communication cost may decrease due to this change, since each B-frame needs two

reference frames while each P-frame needs only one. Though interesting, it is

beyond the scope of this work to further discuss how the GOP structure

affects the total communication overhead. Nevertheless, the general trends of

communication with respect to block size D hold even if a different GOP setting is

assumed. Specifically, when the block size equals the length of a GOP, a minimal

communication overhead can be obtained in all cases.

4.2 Performance Analysis

To analyze the pipeline performance, we particularly focus on several events

where the Master and Slave nodes synchronize with each other. To simplify the discussion,

we assume a homogeneous environment where each node is identical. As described

in the previous sections, the system behavior can be discussed with the round

concept, which is observed from the Master's point of view. Put simply,

a round is the period of time the master node experiences between two successive

pollings of a particular slave node. The round time is determined by either the

decompression time of the slave node in question or the master node's interaction

with the other slave nodes, whichever is longer. With the homogeneous assumption, we

can expect that each round (cycle) takes about the same time for each slave node

after the pipeline runs into a stable state. Thus, tracing the events that happen in one

round should be enough to demonstrate the overall system behavior. To describe

the timing relation between different events, Table 4.1 lists some notations to be

used. It is noticed that the following analysis of system timing is applicable to both

cluster and SMP environments. That is, a 10-processor SMP machine with 1-Gbps

internal data communication bandwidth is treated the same as a 10-workstation

cluster with a Gigabit network. As an example, Figure 4.2 illustrates the events of

the ith round in a two-slave-node configuration. Our pipeline design puts the

majority of the computation workload at the slave nodes. Nevertheless, a certain

part of the computation has to be done in the master node, such as bitstream reading,

parsing, and segmentation. Our experiments show that this part represents only a

tiny portion of the overall decoding time (less than 5% of the total time). In the following

performance analysis, we use a fixed value c to include this cost.










Table 4.1: Some notations used to describe system events for pipeline scheme

T^i_{m,1}    time when the Master node starts to receive decompressed frames at round i
T^i_{m,2}    time when the Master node finishes round i
T^i_{sk,1}   time when Slave k starts to decompress at round i
T^i_{sk,2}   time when Slave k finishes decompressing at round i
T_single     average time needed for a single node to decode a frame
T_sm         time for data transmission from Slave to Master each round
T_ms         time for data transmission from Master to Slave each round
T_round      the round time for a complete decoding cycle
AS_f         average frame size in an MPEG-2 bit-stream
R(D)         the amount of reference-frame data w.r.t. a given block size D
c            a small constant representing miscellaneous software cost
T_p          the overall decoding time for an MPEG-2 stream
x            the number of frames in an MPEG-2 stream


We define T^i_{m,1} as the beginning of the ith round on the master side, when it

starts to receive the results from the first slave. T^i_{m,2} is defined as the end of the ith

round. Two conditions hold before the master can enter the next round: (1) the

master side has polled all slave nodes for this round, and (2) the first slave node

has completed decoding for this round. This is shown in Equation (4.1),

T^i_{m,1} = max{ T^{i-1}_{m,2}, T^{i-1}_{s1,2} }          (4.1)


Here i = 1, 2, ... is the round counter and N is the number of Slaves. The master's

activity in the ith round finishes after it sends a block of raw frames to the Nth slave

node, which is also the last slave. Thus T^i_{m,2} and T^i_{sN,1} occur simultaneously:

T^i_{m,2} = T^i_{sN,1}          (4.2)


The time-stamps at the slave side are depicted in Equations (4.3) to (4.6) as follows:

T^i_{s1,1} = T^i_{m,1} + T_sm + T_ms + 2c          (4.3)











[Figure 4.2 diagram: the timeline of one round for the Master and two Slave nodes, marking the master's start and end of the round, each slave's receive, decode, and send phases, and the T_sm/T_ms transfers.]

Figure 4.2: Events in one round with two slave nodes


At T^i_{s1,1} the first slave can start its decompressing work immediately after it receives

a complete data set from the master. The delay is represented in (4.3) as the

communication delay (T_sm + T_ms for the mutual data exchange) and other software

overhead (2c, assuming the same cost on both sides).

The decompression at slave node 1 needs D.T_single seconds on average for D frames

(Equation (4.4)):

T^i_{s1,2} = T^i_{s1,1} + D.T_single          (4.4)


For the kth slave node (1 < k <= N), we define the start-time and end-time pair

(T^i_{sk,1}, T^i_{sk,2}) in a similar manner. Since the master node polls the slave nodes in a

round-robin order, node k will start later than node (k-1). The delay time is

T_sm + T_ms + c (see Equation (4.5)).

T^i_{sk,1} = T^i_{s(k-1),1} + T_sm + T_ms + c          (4.5)

T^i_{sk,2} = T^i_{sk,1} + D.T_single          (4.6)









Equations (4.7) and (4.8) show the communication cost between the master and slave

nodes, which is determined by the frame size h.v, the network bandwidth B, and the extra

communication of reference frames R(D). The average compressed frame size AS_f can

vary according to different encoding parameters. We assume AS_f = FrameSize/r, where

r is the compression ratio (in our test stream, r = 10).

T_sm = D.h.v.8 / B          (4.7)

T_ms = (D.AS_f + R(D)) / B          (4.8)

By substituting T^{i-1}_{s1,2} in (4.1) with (4.4), we have (T^{i-1}_{s1,1} + D.T_single). Then T^{i-1}_{s1,1}

can be replaced, via (4.3), by (T^{i-1}_{m,1} + T_sm + T_ms + 2c). Thus T^{i-1}_{s1,2} in (4.1) can be

replaced by (T^{i-1}_{m,1} + D.T_single + T_ms + T_sm + 2c). Similarly, T^{i-1}_{m,2} in (4.1) is replaced

by (4.2) at first, which gives T^{i-1}_{sN,1}. Then, by applying (4.5) repeatedly, T^{i-1}_{m,2} is finally

represented in terms of T^{i-1}_{m,1}:

T^i_{m,1} = max{ T^{i-1}_{m,1} + D.T_single + T_sm + T_ms + 2c,  T^{i-1}_{m,1} + N.(T_sm + T_ms + 2c) }

Therefore, the round time becomes

T_round = T^i_{m,1} - T^{i-1}_{m,1} = max{ D.T_single + T_sm + T_ms + 2c,  N.(T_sm + T_ms + 2c) }

It is now clear that the round time Tround is a joint effect of single-slave-

decompression-time, total round communication time, and other system overhead.

The total run time for an x-frame video stream can be expressed as:

T_p = T_start + (x/(N.D) - 1).T_round

    = T_start + (x/(N.D) - 1).max{ N.(T_sm + T_ms + 2c), D.T_single + T_sm + T_ms + 2c }

The start-up time Tstart is the latency of the system accepting a user's request.

It corresponds to the time period from the master's receiving of the first byte, to the

first frame being displayed. The pipeline then runs for (number of rounds x T_round) seconds.









Here the number of rounds is calculated as (x/(N.D) - 1), where x is the total number of

frames. The average frame rate of decompression (FRD) can be estimated as

FRD = x / T_p = x / ( T_start + (x/(N.D) - 1).T_round )          (4.9)

Equation (4.9) predicts the expected decoding frame rate. In order to have a clear

view of system performance with respect to D and N, we further assume that the

video sequence is long enough (i.e., x is large). Thus the start-up time Tstart is

negligible compared to the total decoding time. With (4.7) and (4.8), we are able to

rewrite T_round:

T_round = max{ N.((D.AS_f + R(D) + D.h.v.8)/B + 2c),
               D.T_single + (D.AS_f + R(D) + D.h.v.8)/B + 2c }

        = ((D.(1 + 1/r).h.v.8 + R(D))/B + 2c)
          + max{ D.T_single, (N - 1).((D.(1 + 1/r).h.v.8 + R(D))/B + 2c) }          (4.10)

Further observations reveal that the round time T_round is a nonlinear function

of the block size D and the Slave number N. Both (N-1).((D.(1+1/r).h.v.8 + R(D))/B + 2c)

and D.T_single increase when D and N increase. However, with a fixed D, the former still

increases with N while the latter does not change accordingly. When fewer slave

nodes are used (i.e., small N), the round time T_round will be dominated by the single-node

decoding time T_single. This is expressed by Equation (4.11):

D.T_single > (N - 1).((D.(1 + 1/r).h.v.8 + R(D))/B + 2c)          (4.11)

By substituting (4.11) into (4.10), the round time is further simplified as:

T_round = (D.(1 + 1/r).h.v.8 + R(D))/B + 2c + D.T_single          (4.12)
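The two regimes of the round-time expression can be explored numerically. The sketch below evaluates T_round in its max form and the resulting long-stream frame rate N.D/T_round; all parameter values (720*480 frames, r = 10, c = 10 ms, and R(D) taken as one decoded frame) are illustrative assumptions, so the exact numbers depend on them:

```python
# Sketch: round time T_round (Eq. 4.10 form) and long-stream frame rate.
def round_time(N, D, B, T_single, h=720, v=480, r=10, c=0.01):
    decoded = h * v * 8                        # bits in one decoded frame
    R = decoded                                # reference-frame data (assumed)
    comm = (D * decoded + D * decoded / r + R) / B   # T_sm + T_ms per slave
    return max(D * T_single + comm + 2 * c, N * (comm + 2 * c))

def frame_rate(N, D, B, T_single):
    return N * D / round_time(N, D, B, T_single)     # FRD for large x

for N in (2, 4, 8, 16):                        # B = 100 Mbps, D = one GOP
    print(N, round(frame_rate(N, 15, 100e6, 0.24), 1))
```

Under these assumptions the rate climbs almost linearly up to about N = 8 and then flattens near 30 fps, mirroring the saturation analysis that follows.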











[Figure 4.3 plot: expected frame rate versus processor number for B = 100, 155, 622 and 1000 Mbps; picture size 480*720, GOP = 15, T_single = 0.24 sec. The B = 1000-Mbps curve stays in the linear speed-up region, while the 155-Mbps and 100-Mbps curves saturate.]

Figure 4.3: Expected performance of the pipeline decompression scheme (with D equal to
one GOP)


With the simplified T_round, we can predict the system performance FRD. Figure

4.3 demonstrates our theoretical frame rate with respect to N and different

network bandwidths: 10-Mbps and 100-Mbps switched Ethernet, 155-Mbps ATM OC-3,

622-Mbps ATM OC-12, and Gigabit Ethernet. We notice that the estimated FRD

relates linearly to N initially. For example,

Using 10-Mbps switched Ethernet, our model predicts the FRD should increase
from 3-fps to 6-fps when N is increased from 1 to 2.

Similarly, with 100-Mbps switched Ethernet, the FRD is expected to reach
36-fps with N = 8.

We also expect that, with 155-Mbps ATM OC-3, 55-fps can be reached using 16
slave nodes.

When the network bandwidth is more than 622-Mbps, the system expects an
almost-linear improvement, which can possibly reach hundreds of frames per
second.



However, the improvement is expected to be of much less significance beyond

the above thresholds for these different networks. We realize that, when N keeps

increasing, the communication time will eventually dominate the round time. When









Equation (4.11) no longer holds, we have

D.T_single < (N - 1).((D.(1 + 1/r).h.v.8 + R(D))/B + 2c)          (4.13)

and T_round changes to

T_round = ((D.(1 + 1/r).h.v.8 + R(D))/B + 2c) + (N - 1).((D.(1 + 1/r).h.v.8 + R(D))/B + 2c)

        = N.((D.(1 + 1/r).h.v.8 + R(D))/B + 2c)

Now the round time no longer relates to the single-slave decoding time (T_single).

Accordingly, the frame rate becomes

FRD = N.D / T_round = N.D / ( N.((D.(1 + 1/r).h.v.8 + R(D))/B + 2c) )

    = D.B / ( D.(1 + 1/r).h.v.8 + R(D) + 2c.B )

Therefore, the frame rate will not increase even if more slave nodes are de-

ployed (i.e., the number of slave nodes N does not appear in the above approximation).

The system has reached its saturation point.
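The saturated rate can be evaluated directly from the closed form above, independent of N. The sketch below uses the same illustrative assumptions as before (720*480 frames, r = 10, c = 10 ms, R(D) of one decoded frame):

```python
# Sketch: saturated frame rate D*B / (D*(1+1/r)*8*h*v + R(D) + 2c*B),
# with assumed parameters (720*480, r = 10, c = 10 ms, R = one frame).
def saturated_fps(D, B, h=720, v=480, r=10, c=0.01):
    decoded = h * v * 8                 # bits in one decoded frame
    return D * B / (D * (1 + 1 / r) * decoded + decoded + 2 * c * B)

print(round(saturated_fps(15, 100e6), 1))   # the ceiling for B = 100 Mbps
```

Because N cancels out, the only levers left at saturation are the block size D and the bandwidth B, which is why a faster network raises the ceiling.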

Due to the pipeline operation and the relatively heavy subtasks in the slave

nodes, the data pipeline scheme is expected to have a longer start-up latency T_start

than the data-partition scheme. In the data pipeline scheme, a long waiting time

is required to establish the system pipeline. According to the control algorithm in

the master node, the following must be accomplished before the first frame can be

displayed: (1) the master node sends out a block of D compressed frames to each

of the N slave nodes, (2) the first slave node finishes the decompression of its assigned

frames (i.e., from the first frame to the Dth frame), and (3) the bitmaps of the decoded

D frames are sent back to the master node. Therefore we have:

T_start = max{ N.(T_sm + T_ms + 2c), D.T_single + T_sm + T_ms + 2c }









It is clear that the latency is also a function of (D, N) and other system

parameters. Before the saturation point, it is determined by the decompression time

for D frames. Assuming D = 15, T_single = 0.2 seconds and a network bandwidth of

100-Mbps, we expect a 3.5-second delay (for 720*480 video). When saturated,

T_start increases as N increases. This is caused by the enlarged duration of the

pipeline cycle. The cycle time increases by T_sm + T_ms + 2c seconds for each of the

extra nodes beyond the saturation point. For the 100-Mbps network, this is about

0.48 seconds. For the 1000-Mbps network, we expect 0.05 seconds for each additional

node.
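The start-up latency formula can be checked under the same illustrative assumptions (D = 15, T_single = 0.2 s, 720*480 frames, r = 10, R(D) of one decoded frame, c = 10 ms; all values assumed, not measured):

```python
# Sketch: start-up latency
#   T_start = max{ N*(T_sm + T_ms + 2c), D*T_single + T_sm + T_ms + 2c }.
def t_start(N, D, B, T_single, h=720, v=480, r=10, c=0.01):
    decoded = h * v * 8
    comm = (D * decoded + D * decoded / r + decoded) / B   # T_sm + T_ms
    return max(N * (comm + 2 * c), D * T_single + comm + 2 * c)

print(round(t_start(6, 15, 100e6, 0.2), 1))   # about 3.5 s before saturation
```

Beyond the saturation point each additional node lengthens T_start by comm + 2c, about 0.5 s at 100 Mbps under these assumptions, consistent with the roughly 0.48-second figure quoted above.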

With a more sophisticated design of the master and slave control algorithms, we

can reduce the latency to (T_single + T_ms + T_sm). While the master node is distributing

raw data to one slave node, it can receive decompressed image data from another slave

node. Once the pipeline is stabilized, the slave nodes run at different phases of

decompression, thus maximizing the parallel gain.

4.3 Experimental Results

In order to evaluate the data pipeline scheme and verify the correctness of

the performance model, experimental results were collected from different hard-

ware/software configurations. The testing environment for the data-partition scheme

was reused, and we also added a 100-Mbps LAN equipped with low-end client nodes for

a more complete comparison. The experimental results indicate that our proposed

system delivers a close-linear speed-up when high-speed networks are used. For ex-

ample, with a 100-Mbps switched Ethernet, a 30-fps decompression is accomplished

with 6 PCs (the corresponding speed-up is about 5). For higher display rates, we

have observed up to 73-fps using a Sun SMP server with a network bandwidth of 680

Mbps.









4.4 Experiment Design

Table 4.2 lists the hardware/software configurations of the testing environments.

The three test cases are labeled NT100M, LINUX100M, and SMP680M,

respectively. Here, NIC indicates the speed of the network interface card, which may

be different from the speed of the network switch (as in the case of NT100M). From

our experience, the major factor determining the scalability of parallel decompression

is the achieved network bandwidth. The achieved bandwidth is less than the peak

bandwidth specified by the network or NIC hardware. Note that the achieved network

bandwidth can be affected by the CPU speed, the NIC and the operating system. Thus,

we use an end-to-end, application-level measurement that combines all of the

above factors.

Table 4.2: Experimental configurations and environments

System Parameters               NT100M            LINUX100M           SMP680M
Processor                       Pentium 133 MHz   PentiumII 450 MHz   15 UltraSparc 248 MHz
Memory                          64 MByte          128 MByte           2 GByte
NIC                             10/100 Mbps       10/100 Mbps         10/100 Mbps
Network Switch                  100 Mbps          100 Mbps            100 Mbps
Operating System                NT4.0             RedHat Linux 6.2    Sun 5.6 SMP
Achieved End-to-End Bandwidth   80 Mbps           90 Mbps             680 Mbps
T_single (sec)                  0.58              0.18                0.21


The 100-Mbps switched Ethernets (i.e., NT100M and LINUX100M) provide

a basic environment for our experiments. The LINUX100M delivers a network

efficiency of 90%. For the Sun SMP server, the actual communication between two

processors is performed at the system bus level. The equivalent bandwidth is much

higher than the network interface. This explains why the actual bandwidth (i.e., 680

Mbps) in the SMP system is much higher than its NIC speed. Note also that the

CPU speed affects T_single to a significant degree. A fast CPU results in a short

T_single. For instance, a 450-MHz Pentium II reduces the decompression time (within

a slave node) from 0.58 seconds (133-MHz Pentium) to 0.18 seconds.

Our testing video stream is an MP@ML MPEG-2 bit-stream with 720*480

image resolution, encoded at an average bit rate of 6 Mbps. Our parallel decoder is a

revised version of a public-domain sequential MPEG-2 encoder/decoder [26].

We only investigate the decoder part in this chapter. The inter-node communication

is implemented with the synchronous MPI (Message Passing Interface) protocol. Although

asynchronous message passing could be used to improve the network efficiency, the

potential benefit may be marginal since our experiments are performed in a clean

environment.

We measure the following performance metrics from each experiment:

FRD: The actual frame rate of decompression, which is compared to an ex-
pected frame rate from our analytical model.

Speed-up: This is a scalability measurement on the achieved FRD when the
number of slave nodes is increased.

CPU usage: The utilization of the CPU in each slave node.

4.5 Performance over A 100-Mbps Network

Figure 4.4 shows the performance results for the NT100M environment. Note

that the slave nodes are equipped with 133-MHz Pentium processors. Used alone, a

single node is only capable of decoding at 2-fps with the sequential decoder. With our

proposed software solution, we have demonstrated that the decompression rate can be

scaled to at least 15-fps. The 15-fps performance usually produces an animation-like

quality in which the video appears continuous (instead of the 2-fps slide-show quality).

Detailed analyses of the performance results are the following:

Our experimental results in Figure 4.4 show a close-to-linear improvement of the
decompression rate as the number of slave nodes increases. For one slave node, we
observed 1.8-fps. With two slave nodes, a 3.5-fps decompression
rate is measured (a speed-up of 1.94). A further increase to 4 slave nodes
results in a 6-fps decoding rate (about 3.33 times higher). At 8 nodes, the system
delivers a throughput of 14-fps, which represents a 7.78 speed-up.










[Figure 4.4 plot: actual vs. expected frame rate versus processor number; NT cluster, GOP = 15, B = 100 Mbps.]

Figure 4.4: Pipeline decoding experiment on a cluster of Pentium 133 PC worksta-
tions with 100-Mbps fast switched Ethernet.


The actual system performance conforms closely with the analytical model. For
1, 2, 4 and 8 slave nodes, there are only 11%, 12.5%, 14% and 6% differences
between the expected and observed decompression rates. The small differences
between the analytical and experimental results indicate a high consistency in
the system behavior before the saturation point (as we pointed out in the last
section).


The expected frame rate from our model is 35-fps, which can be approached

with 24 slave nodes according to our prediction. Due to the lack of sufficient slave

nodes in the NT100M platform, we were not able to verify the behavior after this

performance saturation point (e.g., N=16 or 24). Fortunately, another set of exper-

iments performed on cluster of Linux machines (i.e., LINUX100M environment) to

give us more insight.

Within the LINUX100M configuration, each node is a PentiumII 450-Mhz PC.

The single-node decoding speed is 5-fps. The highest frame rate predicted under this

cluster environment is 36-fps, at an 8-slave-node configuration. Detailed results are

discussed as follows:

Again, we observed that the decoding rate increases close to linearly as the
number of nodes grows from one to five. For 1 slave node, we have 4.2-fps. Two










[Figure 4.5 plot: actual vs. expected frame rate versus processor number; LIN100M, GOP = 15, B = 100 Mbps.]

Figure 4.5: Pipeline decoding experiment on a PentiumII PC Linux cluster with
100-Mbps fast switched Ethernet

slave nodes produce 8.3-fps, which is a 97% improvement. Further increasing
to 4 slave nodes brings the frame rate to 21.5-fps, nearly 4 times higher than
the one-slave-node case.

(saturation point) The system reaches its peak performance at 7 slave nodes,
with a frame rate of 30-fps. The frame rate then becomes flat, with only a tiny
fluctuation around 30-fps. In fact, the performance gain becomes insignificant
after 5 slave nodes. The saturation point comes earlier than expected (e.g., the
analytical model predicts the system saturation point at 8 slave nodes).

Nevertheless, the observed results still show an acceptable match with the
analytical model. For one and two slave nodes, the mismatches are less than 5%.
After the saturation point, the observed throughput is about 14% less than
expected (e.g., the analytical model predicts 35-fps, while the experimental result
is 30-fps after saturation).


The LINUX100M configuration, along with our proposed parallel software

decoder, represents a reasonable solution for providing real-time MPEG-2

decompression. The achieved quality can be as good as that of hardware-based

solutions, since it reaches up to 30-fps. However, if the system is to support flexible user

interactions (e.g., fast-motion display) or 60-fps HDTV video quality, decompression

at a higher-than-30-fps rate is required.

Due to the network bandwidth limitation, LINUX100M is not capable of pro-

viding this flexibility. Therefore, we investigated the performance over the 680-Mbps

Sun SMP platform. Since we used MPI as part of the communication mechanism,

our proposed software solution did not require any modification on the SMP envi-

ronment. Note that this kind of high bandwidth is not easily achieved in today's

networking environment. Early results in Gigabit Ethernet just started revealing

similar performance results in the range of 600 Mbps over high-end servers [27].

4.6 Performance over A 680-Mbps SMP Environment

Figure 4.6 shows our high-performance results using an SMP machine with 15

UltraSparc 248-MHz CPUs. As mentioned in Table 4.2, the inter-process

communication bandwidth is 680-Mbps. We tested three video titles with different

content: Tennis, Flower and Mobl. These three videos are encoded with the same

encoding parameters to allow a fair comparison. They are chosen to represent different

degrees of image complexity and object motion. Nevertheless, our experiments are

quite consistent across all the testing streams. The difference in decoding performance

for these video titles is almost negligible. Part of the reason for this performance

consistency comes from the high encoding bit rate. With a 6-Mbps encoding bit rate,

few macroblocks are skipped; thus the numbers of macroblocks decoded for the 3 video

streams are very close to each other. This is not the case when the encoding bit rate is

small, where many macroblocks are skipped during encoding. In the following we only

discuss the performance for Tennis. We have the following observations:
A 30-fps real-time decompression rate is achieved at 7 slave nodes. With 13
slave nodes, the system is able to provide 60-fps HDTV quality. The highest
measured system performance is 68-fps, using 14 slave nodes out of the total of
15 physical CPUs.
The actual frame rate also indicates a close-to-linear speed-up. Starting from
5-fps at one slave node, we observe 10-fps at 2 slave nodes, implying a 100%
speed-up. The speed-ups for 4, 8 and 14 slave nodes are 400%, 780% and 1360%
respectively. This indicates that our pipeline scheme can be scaled up well for a
high-demand video decompression scenario. After 14 slave nodes, the
decompression rate becomes flat (not shown in the plot). Further increasing the
number of slave nodes could not bring more performance gain, simply
because all of the 14 CPUs have been fully loaded.
The system shows precisely the behavior predicted by our analytical model.
The deviation of the decompression rate is controlled within a 10% error range.











[Figure 4.6 plot: expected vs. actual frame rate versus processor number for the Tennis stream, together with the measured frame rates for Flower and Mobl; SUN SMP, B = 680-Mbps, GOP = 15.]

Figure 4.6: Pipeline decoding experiment on SUN eclipse server with 680Mbps sus-
tained bandwidth.


According to our analytical model, the system should saturate at 55 slave nodes,
which should possibly provide a 270-fps decompression rate.


Table 4.3 shows the CPU usage for the master and slave nodes. In the SMP

machine, the slave processes are dynamically scheduled onto the 14 physical processors

by the operating system. Thus the statistics are collected directly from the master and

slave processes. It is observed that the slave nodes keep a high CPU utilization through

all the experiments. On average, 90% of the CPU time in the slave nodes is used in user

space for computation, and the rest is spent on communication and miscellaneous

system costs. The system also runs in a highly balanced manner, and

the standard deviation of the slave node load is very low (less than 7%). The waiting

time in the slave nodes is also kept at a low level, ranging between 5% and 8%.

Unlike the data-partition scheme, where the overhead in the master node becomes
significant at high system configurations, the data-pipeline scheme has low
master-node complexity. The computation in the master node is kept low (15% with
two slave nodes) and increases only slightly when more slave nodes are adopted
(refer to the Master Load row in Table 4.3). This indicates that the computation
overhead in the master node is not significant. At the 14-slave-node
configuration, there is still 70% idle time in the master node, indicating that
the master node is not yet saturated. Further performance improvement could be
obtained if more than 14 slave nodes were used.

Table 4.3: CPU utilization of master and slave nodes

                              Number of Slave Nodes
Parameters                2     3     4     6     8     10    12    14
Slave CPU Load (avg.)    92%   --    90%   91%   90%   90%   90%   91%
STD of Slave Load         7%    6%    6%    5%    6%    4%    3%    4%
Slave Waiting (avg.)      5%    5%    6%    7%    7%    6%    8%    8%
Master Load              15%   18%   20%   22%   25%   28%   30%   32%


The improvement in decoding performance is further verified by the cumulative
system CPU utilization. With two slave nodes, 10% of the total computing power of
the 15-node SMP machine is used. The number becomes 16% with 3 slave nodes, and
increases by about 7% for each additional slave node afterward. With the full
configuration of 14 slave nodes, 90% of the overall processing power is used by
the parallel decoder, and it remains at this level when the number of slave
processes is increased further.

4.7 Towards the High Resolution MPEG-2 Video

The achieved frame rates for low- and main-level MPEG-2 video are very close to
our prediction. However, the scalability results for high-resolution MPEG-2
videos are not satisfactory. In Figure 4.7, the decoding rates for 1404x960
MPEG-2 files are illustrated. Starting with 2 fps at the single-node
configuration, a linear increase can be observed. However, the decoding
performance of "flower" suddenly drops to 2.5 fps at 10 slave nodes and continues
to deteriorate, with a small rebound at 11 slave nodes. For "tennis" and
"calendar," similar performance degradation is observed at 12 slave nodes, right
after the peak performance point. Similar performance degradation is observed for
the 1024x1024 video format.


Figure 4.7: Decoding frame rate for 1404x960 video (predicted rate and measured rates for tennis, flower, and calendar vs. number of nodes).


In order to identify the system bottleneck that causes the degradation of
decoding performance for high-resolution video, we record the utilization of
system resources during the decoding process. Evidence supporting this unique
observation can be collected from the runtime CPU time distribution and the
number of page faults.1 Figures 4.8 and 4.9 illustrate the measured number of
page faults versus the number of slave nodes, together with the CPU statistics.
For the sake of brevity, we only present the results for "tennis".

For the 352x240 video, the number of page faults remains virtually unchanged and
is kept at a low level (about 1010 page faults/frame). Increasing the video
resolution to 704x480 is reflected by a rise in the page-fault number; a fourfold
jump is observed. Nevertheless, the 704x480 case still shows a flat curve as
slave nodes are added, indicating that the system is running steadily. For the
1024x1024 video, the number of page faults increases considerably. It is noticed
that the page faults increase significantly at 10 to 12 slave nodes, reaching
3500 page faults per frame. Compared with the decoding performance in Figure 4.7,
the period with high page faults coincides with the collapse of the decoding
rate. This indicates that the excessive page faults had driven the system into a
thrashing state. The page-fault behavior of the 1404x960 video shows the same
pattern as the 1024x1024 case.

1The CPU utilization is obtained by the system call time() in the UNIX system,
and the page faults are recorded by a utility process (truss) spawned by the
decoding processes.

Figure 4.8: Page faults per frame vs. number of slave nodes (for the 1404x960, 1024x1024, 704x480, and 352x240 formats).


The excessive increase in page faults is also reflected in the CPU usage. With
one slave node, 90% of the system time is idle, 8% of the CPU time is used in
user space, and the remaining 2% goes to other system maintenance. As slave nodes
are added, the user-space time increases proportionally and the system idle time
decreases. After 8 slave nodes, however, both the system idle time and the
user-space time drop significantly, while the system overhead shows a major
increase. About 90% of the CPU time is used by the operating system, while user
space occupies only 5% of the CPU time. Recalling that the number of page faults
increases suddenly at 9 slave nodes (see Figure 4.8), we conclude that the system
spends most of its CPU time swapping pages in and out, which drops the decoding
performance.

Figure 4.9: CPU usage for the tennis40 stream (operating-system processing, system idle, and user-space decompression time vs. number of nodes).



Realizing that page faults are directly related to a shortage of system memory,
we believe that the buffer management of the parallel decoder should be
investigated. An unoptimized buffer scheme aggravates the competition between
user processes (e.g., our communication and decompression software) and system
processes (e.g., the demand-paging mechanism of the OS). Because of the shortage
of overall memory, the system processes generate a significant number of page
faults, which in turn slows down the decompression due to the lack of CPU time.









The memory requirement for the master node and the slave nodes can be expressed
as:

M_m = m_c + m_streambuffer + m_outbuffer + m_inbuffer

M_s = m_c + m_framebuffer + m_transmissionbuffer
    = m_c + m_outbuffer + 1.5 * m_inbuffer

Here m_c is the size of the executable code for the master node, about 500 KB.
m_streambuffer is the streaming buffer that receives the compressed video packets
from the video server; we currently fix it at 1 MB. m_outbuffer and m_inbuffer
are dedicated to information exchange in the parallel decoding: m_outbuffer holds
one GOP of MPEG-2 compressed frames, and m_inbuffer needs to accommodate two GOPs
of decompressed frames (one GOP for displaying and another for the incoming
traffic).

For the test stream tennis40 (1404x960), the corresponding memory requirements on
the master and slave sides are M_m = 42 MB and M_s = 30.8 MB. The accumulated
buffering space grows quickly when a large-scale slave-node configuration is
used, which causes unsatisfactory scalability when the number of slave nodes is
large. For instance, let N be the number of slave nodes; the total memory
requirement becomes

M = M_m + N * M_s

Using the parameters of the testing MPEG-2 video, the total memory used can be
estimated from the above equation. For the video tennis40, we need about 73 MB,
104 MB, 165 MB, 319 MB, and 381.5 MB when the number of slave nodes is 1, 2, 4,
9, and 11, respectively. In the next section, we will discuss several techniques
to reduce the buffer space.
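As a quick sanity check, the total-memory estimate above can be sketched in a few lines (the 42 MB and 30.8 MB per-node figures are taken from the text; the helper name is ours):

```python
# Estimate total memory for the pipeline decoder on tennis40 (1404x960).
# M_m and M_s are the per-node requirements quoted in the text.
M_MASTER_MB = 42.0   # master node: code + stream buffer + in/out buffers
M_SLAVE_MB = 30.8    # each slave node: code + frame buffer + transmission buffer

def total_memory_mb(num_slaves: int) -> float:
    """Total memory requirement M = M_m + N * M_s, in MB."""
    return M_MASTER_MB + num_slaves * M_SLAVE_MB

for n in (1, 2, 4, 9, 11):
    print(n, round(total_memory_mb(n), 1))
```

For N = 1, 2, 4, 9, and 11 this yields roughly the 73 MB, 104 MB, 165 MB, 319 MB, and 381 MB reported above.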
4.7.1 Efficient Buffering Schemes

It is noticed that the slave node originally allocates a GOP-length frame buffer,
which can be further optimized toward the minimal buffer space. However, due to
the decoding dependencies inside the MPEG-2 video structure, we are not able to
use only one frame buffer. To decode a B-frame, we need two reference frames and
one working frame for decoding, for a total of 3 frames. With a careful redesign
of the master-slave communication protocol, using a 3-frame transmission buffer
on the slave side is possible; we call this the ST scheme. When the picture size
is 1024x1024 and GOP = 15, we save about 12 MB of buffer space per slave node,
about an 80% reduction on the slave side.

The minimum required frame buffer can be further decreased from 3 frames to 2
frames. For I- or P-frames, we need one buffer for the prediction picture and
another buffer for the working frame. The two buffers swap roles after decoding
an I- or P-frame, so that the most recently decoded I- or P-frame is used as the
prediction frame for the next P-frame. For B-frames, since a decoded B-frame will
not be used as a reference frame, we can send the decoded blocks directly to the
master node without storing them. The above discussion assumes that the reference
frame for the current P-frame is always the last decoded P-frame, and the
reference frames for the current B-frame are the last two P-frames. Nevertheless,
this approach also works if the B- and P-frames always refer to the I-frame, with
a corresponding change in the reference buffer.
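The buffer-role swap for I- and P-frames can be sketched as follows (a simplified model in which frame contents are stand-in strings rather than decoded pixel planes; the function and variable names are ours):

```python
# Minimal model of the 2-buffer role swap: after each I- or P-frame is decoded,
# it becomes the reference for the next predicted frame.
def decode_sequence(frame_types):
    reference = None   # holds the most recently decoded I- or P-frame
    refs_used = []     # records which frame each I/P frame was predicted from
    for i, ftype in enumerate(frame_types):
        if ftype == "B":
            continue   # B-frames are streamed to the master directly, never stored
        working = f"{ftype}{i}"          # decode into the working buffer
        refs_used.append((working, reference))
        reference = working              # swap roles: new frame becomes reference
    return refs_used

print(decode_sequence("IPPP"))
```

Running the sketch on "IPPP" shows each P-frame referencing the immediately preceding I- or P-frame, as assumed in the text.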

With this scheme, the expected memory requirement becomes

M' = M'_m + N * M'_s
   = M_m + N * (m_c + 1.5 * 3 * m_frame + m_inbuffer)

To find the maximum number of slave nodes before exhausting system memory for the
1404x960 single-layered MPEG-2 video, we solve (47 + (N - 1) * 6) < 300 MB (using
300 MB as a system threshold). This gives N = 43 slave nodes, and a decoding rate
higher than 60 fps can be expected.
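The threshold calculation above can be reproduced directly (the 47 MB base, 6 MB per additional slave, and the 300 MB cap are the figures stated in the text):

```python
# Total memory under the ST scheme: M(N) = 47 + (N - 1) * 6  (MB).
def memory_mb(n_slaves: int) -> float:
    return 47.0 + (n_slaves - 1) * 6.0

def max_slaves(threshold_mb: float = 300.0) -> int:
    """Largest N whose total memory stays under the threshold."""
    n = 1
    while memory_mb(n + 1) < threshold_mb:
        n += 1
    return n

print(max_slaves())
```

The sketch confirms N = 43: memory_mb(43) is 299 MB, while adding a 44th slave would exceed the 300 MB threshold.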









4.7.2 Further Optimization in the Slave Nodes

It is further observed that the decoding procedure in the slave nodes might not
use three frame buffers all the time. More specifically, an I-frame needs only
one frame buffer, while a P-frame can be decoded with two frame buffers. Only a
B-frame needs all three frame buffers. Thus the total amount of buffer space can
vary during the lifetime of the slave node. By allocating frame buffers
dynamically according to the frame type, the total buffer space can be expected
to shrink significantly for high-quality video.

This is particularly true when I- and P-frames represent a considerable portion
of the frames. Let the ratio of I-, P-, and B-frames in a GOP structure be a:b:c;
the effective buffer space for one layer is then
M = (1*a + 2*b + 3*c)/(a + b + c) frames. For a typical GOP structure of
"IBBPBBPBBPBBPBB", we have a:b:c = 1:4:10. This results in an effective buffer
size of 39/15 = 2.6 frames, which is about 85% of the 3-frame buffer scheme.

The concept of dynamic buffer allocation can also be applied inside the decoding
of each frame. Since the decompression of each frame is based on a serial
decompression of macroblocks, the overall buffer space can be reduced by
dynamically allocating buffer space per macroblock. For example, when decoding
the first macroblock, we only need to allocate a 16x16 block. The buffers for the
other macroblocks are assigned as needed. With this dynamic memory allocation, we
expect an additional buffer reduction of 0.5 frame for the working frame. Notice
that this scheme cannot reduce the buffer space for the reference frames, which
must stay in memory during the decoding process. The effective buffer requirement
becomes M = ((1 - 0.5)*a + (2 - 0.5)*b + (3 - 0.5)*c)/(a + b + c). Using the same
GOP structure as above, the effective buffer size in a slave node is 2.1 frames,
which is 70% of the 3-frame buffer scheme.
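Both effective-buffer formulas can be checked numerically for the IBBPBBPBBPBBPBB GOP (the per-type buffer counts follow the text; the function name is ours):

```python
# Effective (average) number of resident frame buffers per decoded frame,
# weighted by the mix of frame types in the GOP structure.
def effective_buffers(gop: str, working_saving: float = 0.0) -> float:
    cost = {"I": 1.0, "P": 2.0, "B": 3.0}  # buffers needed per frame type
    total = sum(cost[f] - working_saving for f in gop)
    return total / len(gop)

gop = "IBBPBBPBBPBBPBB"
print(effective_buffers(gop))        # per-frame-type allocation only
print(effective_buffers(gop, 0.5))   # plus per-macroblock working-frame saving
```

For this GOP the first call gives 2.6 frames and the second 2.1 frames, matching the two derivations above.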









The dynamic allocation of buffers in the slave node is an application-level
memory management scheme, closely embedded in the decoding process. The current
implementation relies on system-provided routines (e.g., malloc and free). We
speculate that a customized buffer management routine (with direct access to
system memory) could further increase the decoding performance; this will be
discussed in our future research. The tradeoff here is the additional CPU cost
introduced by dynamic memory management. For each macroblock, the additional cost
includes at least two system calls (for memory allocation/deallocation) and some
other miscellaneous operations. It has been shown that the cost associated with
dynamic memory allocation is significant for database servers and Web servers,
where thousands of processes may co-exist to process user requests. In our case,
the number of slave nodes/processes is usually below 20, and memory management
activity is expected to be far less frequent, so the overhead introduced should
be limited. This is confirmed by our experimental results comparing the
performance of decoding with and without dynamic memory allocation: with dynamic
buffer allocation enabled, the overall decoding time increases by less than 7%
over the static memory allocation case.
4.7.3 Implementation and Experimental Results

Experiments are performed for the high-resolution video formats with the revised
memory management scheme. Our results show that the two improvements bring
significant memory reduction, and the paging-panic phenomenon is eliminated.

For the 1404x960 video, the total buffer size is 53.5 MB with one slave node,
which is 27% less than the original. For the 4-slave-node case, the ST scheme
uses 85 MB instead of the original 165 MB, almost a 50% memory saving. For the
1404x960 case, the number of page faults shrinks from 1500 to 1200 at the
one-slave-node configuration. For the 1024x1024 video, the page faults are now
943, 25% less
















Figure 4.10: (a) Decoding frame rate for the revised memory management, and
(b) user-space time vs. kernel-space time for the first buffer optimization
scheme.


than before. For all of the video streams, the number of page faults remains
nearly unchanged as the number of slave nodes increases.

Figure 4.10(a) shows the scalable decoding frame rate for the 1404x960 video with
our revised ST scheme. We observe a close-to-linear increase of the frame rate.
The peak decoding rate is obtained at 14 slave nodes, where 20 fps is observed.
The decoding performance for the 1024x1024 video files shows similar behavior.
Thus our revised buffering scheme has successfully solved the memory shortage
problem and works well for high-quality video up to MP@HL.

Figure 4.10(b) shows the overall CPU time distribution of the slave nodes when
decoding high-resolution video formats with the revised buffer scheme. The
user-space time component represents the computation time of the MPEG-2 decoding
procedure; the kernel-space time covers system-level overhead, including time
spent in the network layer, system calls, and other costs. It is observed that
the user-space time increases linearly with the number of slave nodes,
accompanied by a corresponding decrease in system idle time. Meanwhile, the
operating-system-level cost stays at a low level (between 5% and 10% of the total
CPU time). For the large-scale experiments (more than 11 slave nodes deployed),
the anomalous crossover of user-space time and system overhead observed in the
original decoding experiments no longer exists. This further proves the
effectiveness of the ST scheme in solving the memory shortage.

4.8 Summary

Up to this chapter, we have discussed how a generic and scalable MPEG-2 decoder
is implemented via a pure-software parallel decoding scheme. The MPEG-2
decompression algorithm is parallelized in data-partition and data-pipeline
manners built on a master/slave architecture. The data-partition scheme is shown
to be less scalable than the data-pipeline scheme, due to the overhead in the
master node and its high-bandwidth requirement. The data-pipeline scheme is able
to produce high gains from parallel processing with little overhead. We analyzed
the effect of different block sizes and the speed-ups with an increasing number
of slave nodes. Using a block size of one GOP, the communication overhead is
reduced to a minimum by reducing the inter-frame dependence as much as possible.

Promising results show that the data-pipeline scheme performs well on various
hardware/software platforms, given the necessary network bandwidth. The reported
highest decoding rate is more than 70 f/s for MP@ML video. We further
investigated the scalability for a wide range of video formats, especially
high-resolution MPEG-2 video (e.g., HDTV). However, it is found that the original
data-pipeline scheme suffers significant performance degradation when decoding
high-level MPEG-2 video at the full system configuration, due to inefficient
management of memory space in the decoder. The shortage of system memory is also
confirmed by the outbreak of page faults when the system is fully loaded. We
propose an efficient buffer management mechanism such that the memory requirement
can be reduced by 50%. The revised parallel decoder significantly relieves the
memory shortage problem and shows satisfactory scale-up performance when decoding
the high-resolution video formats (a decoding rate close to 24 f/s is achievable
for high-resolution quality video).









Our analysis implies that up to 270 f/s is possible by increasing the number of
slave nodes in the SMP environment.

Our investigation of parallelizing the MPEG-2 decoding algorithm indicates that
real-time decompression of high-quality video can be obtained via a pure-software
solution. We also believe that computing power will improve significantly in the
future via cost-effective MPP technology. Therefore, video decoding capability at
the end-user side can be regarded as solved.

In the following chapters, we discuss another critical link in distributed
multimedia systems: how to provide quality-guaranteed communication service in
wireless networks. As mentioned in Chapter 1, delivering multimedia content
timely and with high fidelity over a wireless network presents a great technical
challenge, not to mention that the system should also support a large number of
users. Among several competing technologies (e.g., FDMA, TDMA, and CDMA), we
choose the WCDMA wireless network as the target platform, due to the many
advantages CDMA has over the other systems. In the next chapter, we discuss a
dynamic spreading factor scheme for WCDMA systems such that multimedia traffic
with strict BER requirements can be supported by dynamically updating the
spreading factor. The protocol provides guaranteed QoS for all accepted call
requests. Our discussion also includes a time analysis of the protocol execution,
and a traffic scheduler utilizing the dynamic spreading capability.

In Chapter 7, we further extend the dynamic spreading factor scheme to the
multiple-cell environment. Specifically, we address how the soft handoff
algorithm in WCDMA should be revised to work with dynamic spreading factor
control. Our studies show that the choice of spreading factor in the handoff
period should be considered together with the power control for mobile stations.
With a heuristic algorithm to optimize the spreading factor and power control, we
will show that the overall throughput during the handoff period can be improved
by 25%.















CHAPTER 5
MULTIMEDIA SUPPORT IN CDMA WIRELESS NETWORK

Providing multimedia service through wireless networks is becoming the major
battlefield for service providers and technology vendors. Recent technology
advances are increasing the multimedia capabilities of mobile devices. Cellular
phones and notebooks are converging into a single mini-device capable of both
computing and communicating, and they are becoming competitive with desktop PCs
in terms of computing power. However, the communication quality supported by
current wireless networks still needs major improvement to meet the rigorous
demands of multimedia applications, where higher data rates and lower bit error
rates are desired.

In traditional wireless cellular networks, traffic channels are designed to
support voice conversation. These channels have the same data rate and the same
bit error rate (BER). Such systems have only limited support for data traffic;
for example, in Pour and Liu [28] the silent periods of voice are utilized for
data transmission. Thus, providing quality-guaranteed service in wireless
networks requires new design and functionality in the MAC layer [8]. Choi and
Shin [29] discussed QoS guarantees in a wireless LAN with Dynamic Time-Division
Duplexed (D-TDD) transmission. In [30], an admission control protocol for
multi-service CDMA is developed based on interference estimation; the authors
used a log-normal distribution to approximate the effect of random user location,
shadowing, and imperfect power control.

A major effort toward multimedia support in cellular wireless networks is the
so-called 3rd-generation system, such as WCDMA. CDMA-based systems are
particularly appealing for multimedia applications, primarily due to their
capability of providing different levels of link quality (BER) for different
traffic types, thus making it possible to better utilize the radio resource. The
spreading factor of CDMA is the key variable determining the user data rate and
the associated BER. Theoretically, the spreading factor makes CDMA possible by
repeating each user's data signal such that it can be reconstructed at the
receiving mobile station. Increasing the spreading factor benefits the BER
because it increases the desired signal strength linearly; the mean MAI
(Multiple-Access Interference) caused by other users decreases accordingly,
approaching zero as the spreading factor approaches infinity.
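To illustrate the repetition idea, consider a toy baseband sketch (not the dissertation's link-level simulator; the signal model and function names are ours): each ±1 data bit is multiplied across the chips of a ±1 PN code, and the receiver recovers it by correlating against the same code.

```python
import random

def spread(bits, code):
    """Spread each +/-1 data bit over the chips of a +/-1 PN code."""
    return [b * c for b in bits for c in code]

def despread(chips, code):
    """Correlate each chip block with the code and take the sign."""
    sf = len(code)
    out = []
    for i in range(0, len(chips), sf):
        corr = sum(chips[i + j] * code[j] for j in range(sf))
        out.append(1 if corr >= 0 else -1)
    return out

random.seed(0)
sf = 64
code = [random.choice((-1, 1)) for _ in range(sf)]
bits = [random.choice((-1, 1)) for _ in range(100)]
assert despread(spread(bits, code), code) == bits  # perfect recovery, no MAI
```

With interfering users, the cross-correlation terms add noise to `corr`; a larger spreading factor makes the desired term (which grows with SF) dominate, which is the effect exploited throughout this chapter.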

The evaluation of the BER has been studied intensively [9, 31-33]. It is widely
agreed that the BER is largely determined by the MAI. In Fukumasa et al. [9], the
design of PN sequences to reduce MAI is discussed. In Geraniotis and Wu [33], the
probability of successful packet transmission is analyzed for a DS-CDMA system.
In Choi and Cho [34], a power control scheme is proposed to minimize the
interference of high data-rate users to the neighboring cells; the results show
that the number of high data-rate users should be kept below 6 in order to
support enough voice users. However, these works did not cover the system BER
under a dynamic spreading factor. A new middleware protocol is needed to
integrate different services (i.e., voice, data, video) more efficiently.

The unique BER behavior and its correlation with the spreading factor in CDMA
systems allow additional flexibility in transmitting data traffic. Akyildiz
proposed a packet scheduling protocol for slotted CDMA [35]; the scheduler
maximizes system throughput based on BER requirements. In Oh and Wasserman [36],
the system performance of DS-CDMA under different spreading gains is studied with
two kinds of traffic. The authors show that the optimal spreading gain increases
linearly as the MAI increases. However, their work did not relate the control of
the dynamic spreading gain to the system load.

To have a clear picture of how spreading factors and the number of active users
affect the BER, we performed link-level simulations for the synchronized uplink
channels in IS-95. Figure 5.1 depicts the BER under different numbers of active
users and spreading factors. The experiments were performed within a single cell,
without interference from neighboring cells. Multi-path effects and thermal noise
are not taken into account, since we emphasize the effect of the spreading
factor.

Figure 5.1: BER vs. number of users vs. spreading factor (SF = 64, 96, 128, and 160).



The performance results clearly indicate that increasing the spreading factor can
effectively decrease the BER for a given number of users. For example, with 10
users, increasing the spreading factor from 64 to 96 reduces the BER from 0.0008
to 0.0003 (a 62.5% reduction). Increasing the spreading factor further to 128
results in a BER of 0.000004, which is 75 times lower. In order to support a
variety of BERs, many hardware manufacturers are considering bringing the
capability of dynamically changing spreading factors into next-generation mobile
and base stations.
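The qualitative trend in Figure 5.1 can be reproduced with the standard Gaussian approximation for asynchronous DS-CDMA with random spreading sequences, BER ~= Q(sqrt(3*SF/(K-1))) for K users and spreading factor SF. This closed form is a textbook approximation, not our link-level simulator, so the absolute numbers differ from Figure 5.1; only the monotone behavior carries over.

```python
import math

def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def ber_sga(sf: int, num_users: int) -> float:
    """Standard Gaussian approximation for asynchronous DS-CDMA:
    BER ~= Q(sqrt(3 * SF / (K - 1))) with K active users."""
    if num_users < 2:
        return 0.0  # no multiple-access interference in this noise-free model
    return q_function(math.sqrt(3.0 * sf / (num_users - 1)))

for sf in (64, 96, 128, 160):
    print(sf, ber_sga(sf, 40))
```

As in Figure 5.1, the approximation falls sharply with SF at a fixed user count and rises with the number of users at a fixed SF.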

Because of the possible adaptation of spreading factors, novel admission
protocols (i.e., state diagrams) are proposed for mobile and base stations. With
the new protocol, a mobile user may be notified to change its spreading factor;
this usually happens when a new mobile presents an OPEN request. The preliminary
results indicate that our proposed system always maintains the desired BER for
all connections (including the existing and newly-arriving ones).

Starting from this chapter, we present our on-going study of the new protocol
design in WCDMA to support multimedia traffic, and the associated performance
issues. The first part of this chapter focuses on the baseline protocol design in
a single-cell situation. We also provide a detailed analysis of the timing
components of the protocol execution and propose improved schemes to reduce the
end-to-end connection setup time. In the second part of this chapter, we address
multimedia traffic scheduling schemes based on the new dynamic spreading factor
protocol. In addition to guaranteeing voice communication, our proposed
scheduling scheme reduces the turn-around time significantly for conventional
data traffic (e.g., e-mails). This discussion can also be found in [37, 38].

The dynamic spreading factor protocol is further extended to handle multiple
cells in the next chapter. We add handoff capability to the dynamic spreading
protocol such that mobile stations can take advantage of dynamic spreading even
during the handoff period. To optimize the overall system throughput, we propose
a new resource allocation algorithm that assigns spreading factors and
transmitting power to handoff mobiles. The majority of this content is also
reported in [39] and [40].

5.1 Performance-Guaranteed Call Processing

The ultimate goal of an admission control protocol is to support as many users as
possible while still satisfying the BER requirements of all existing connections.
Conventional FDMA schemes divide the frequency spectrum into multiple channels.
Every user's connection needs to be allocated one channel from the base station
before the voice conversation can take place. When all the channels are
allocated, no more connections can be admitted. Therefore, admission control in
FDMA schemes is straightforward.

However, CDMA-based schemes (along with multimedia support and dynamic spreading
factors) make the admission decision nontrivial. One advantage of a CDMA system
is that the whole spectrum is used for communication; connections are separated
by Pseudo-Noise (PN) codes assigned at the base station. Thus determining an
exact point at which to block newly-arriving connections is difficult. During a
typical call lifetime of a voice conversation, the system will accept and
terminate a number of calls, so the number of active users changes rather
frequently. In an interference-limited CDMA system, the variation of the system
load is the key factor determining the fluctuation of the link quality of the
traffic channels. Thus our admission control protocol needs to be adaptive to
changes in the system load. Our proposed admission control protocol is able to
monitor performance and ensure that it remains almost unchanged by taking
necessary steps whenever needed.

Both the mobile and base stations participate in the call admission process.
Mobile stations make the requests and/or update their parameters following
commands from the base station. The base station processes requests from mobile
stations, monitors changes in the environment, and decides the key transmission
commands for mobile stations (such as power commands, spreading factors, and PN
codes). Environment changes include increases/decreases in the number of users
and alterations in traffic types. As an example, a station may start a connection
with voice, halt the voice without terminating the connection, send e-mail, and
go back to voice communication. This scheme eliminates the significant overhead
(on the order of tens of seconds) caused by terminating and re-opening a
connection. Therefore, the admission control protocol that we propose here is
flexible and capable of guaranteeing overall system performance.









5.1.1 Proposed Admission Protocol in a Mobile Station


Figure 5.2 depicts the state diagram (i.e., protocol) for the mobile stations.

Figure 5.2: The state diagram and protocol of a mobile station.

Unlike traditional CDMA systems (IS-95), mobile stations have three types of
requests: OPEN a new connection, ALTER the traffic type, and CLOSE the
connection. The first two request types follow similar steps, except that
altering the traffic type does not change the number of users. The application in
the mobile unit sends an OPEN request to the base station through the access
channel. Along with the request comes the type of the traffic (and the desired
minimum data rate, if required). Our protocol defines 4 traffic types: VOICE,
AUDIO, VIDEO, and DATA. Multiple data rate options are available for each traffic
type. IS-95 does not allow a mobile to specify the desired traffic type.









To establish a connection to the base station, the mobile sends an OPEN request
specifying the traffic type and data rate options. Once the base station receives
the request, it first checks whether the new request can be satisfied
independently of the other connections. The satisfaction is determined by two
things: the type of traffic to be carried on the new connection (i.e., its BER
requirement and minimum data rate), and the interference from other users. The
minimum spreading factor that satisfies the BER requirement is used for the new
connection. The maximum data rate allowed by that spreading factor is calculated
at the same time and compared with the minimum data rate required by the
application, and the first decision is made. The decision is either to continue
with the remaining steps or to deny the request. If the request is denied, the
mobile is allowed to retry after waiting a random time period.
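The first admission decision can be sketched as follows. This is a simplified model: the chip rate, the SF set, the BER table, and the rate formula chip_rate/SF are illustrative assumptions, not the protocol's actual parameters.

```python
# First-stage admission check: pick the minimum spreading factor meeting the
# BER target, then verify the resulting data rate meets the request.
CHIP_RATE = 3_840_000  # chips/s, a typical WCDMA figure (assumed here)

# Hypothetical table: spreading factor -> achievable BER at the current load.
BER_AT_SF = {64: 8e-4, 96: 3e-4, 128: 4e-6, 160: 1e-6}

def admit(ber_target: float, min_rate_bps: float):
    """Return the chosen SF, or None if the request must be denied."""
    for sf in sorted(BER_AT_SF):           # smallest SF = highest data rate
        if BER_AT_SF[sf] <= ber_target:    # BER requirement met at this SF
            max_rate = CHIP_RATE / sf      # rate allowed by this SF
            return sf if max_rate >= min_rate_bps else None
    return None                            # no SF can satisfy the BER target

print(admit(5e-4, 12_000))   # voice-like request
print(admit(1e-7, 12_000))   # BER target too strict for any listed SF
```

Note that a larger SF only lowers the achievable rate, so once the minimum BER-satisfying SF fails the rate check, the request is denied.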
5.1.2 Proposed Admission Protocol in a Base Station

Figure 5.3 depicts the state diagram (i.e., protocol) for the base stations. If
it is determined that the new connection can be satisfied, the base station moves
on to check whether the existing connections can still be satisfied once the new
connection is up. This facility is not available in IS-95 CDMA; it is added to
our protocol to ensure quality for existing users. For each existing connection,
the system calculates the expected average BER corresponding to the increased
system load. If the expected average BER is too high, we try to increase the
spreading factor and check the maximum data rate to see whether the data rate
requirement can still be satisfied with the increased spreading factor. This step
is repeated until all existing connections have been reviewed.
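The review of existing connections can be sketched in the same spirit. Again this is illustrative: the load-dependent BER model, the doubling of SF, the SF cap, and the rate check are our simplifications, not the protocol's actual rules.

```python
# Re-check every existing connection at the increased load; raise a
# connection's SF (within limits) until its BER target is met again.
MAX_SF = 256

def expected_ber(sf: int, num_users: int) -> float:
    """Placeholder load model: BER grows with users and shrinks with SF."""
    return (num_users - 1) / (sf * 1000.0)

def review(connections, num_users):
    """connections: dicts with 'sf', 'ber_target', 'min_rate', 'chip_rate'.
    Returns the list of (index, new_sf) updates, or None to deny the request."""
    updates = []
    for i, conn in enumerate(connections):
        sf = conn["sf"]
        while expected_ber(sf, num_users) > conn["ber_target"]:
            sf *= 2                                   # try a larger spreading factor
            if sf > MAX_SF or conn["chip_rate"] / sf < conn["min_rate"]:
                return None                           # cannot protect this connection
        if sf != conn["sf"]:
            updates.append((i, sf))
    return updates
```

An empty update list corresponds to the first situation below (no UPDATE broadcast needed); a non-empty list triggers UPDATE messages for the affected connections.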

There are two possible situations after the review. First, no existing connection
requires an update; the destination mobile is immediately informed of the OPEN
request, and an ACK is sent back to the requesting mobile. The second situation
is that some connections may require an update. For each traffic type requiring
an












Figure 5.3: The state diagram and protocol of a base station.



update, the base station broadcasts an UPDATE message. All mobiles using the same traffic type must send an ACK back to the base station to confirm that they have updated their parameters. This is necessary to make sure that no one suffers bad performance once the new connection is active. The update procedure is a major improvement over IS-95; it is the key to providing the QoS guarantee to all calls. This procedure allows the mobiles to adjust their spreading factors dynamically. When all ACKs are received, the base station sends an ACK to the requesting mobile, and the mobile may start to transmit its data.


Another important functionality of the protocol is its flexibility to alter the traffic type. One cannot switch from one traffic type to another in IS-95 without terminating and re-establishing the connection. This function is provided only if a connection is already established. If the mobile decides to change the traffic type, it sends an UPDATE message to the base station. The message has to specify the requested new traffic type. The base station will follow the same steps as in the case of an OPEN request, but this time it will not consider an increase in the number of users while doing its computations.
5.1.3 A Performance-Guaranteed System

We performed an emulation of voice traffic with up to 50 users to measure the performance of our admission control protocol. Practical parameters associated with the hardware and software environment in the mobile and base stations are listed in Table 5.1.


Table 5.1: Practical parameters used in the wireless WCDMA environment

Chip Rate                   4.096 Mcps
Packet Size                 128 bits
Spreading Factors           32, 64, 96, 128, 160
Paging Channel Data Rate    32 kbps
Access Channel Data Rate    16 kbps
BER for Voice               10^-2
Min Data Rate for Voice     8 kbps
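Given these parameters, the maximum traffic-channel data rate for each spreading factor follows directly from chip rate / SF. A quick sketch of this relationship:

```python
# Maximum traffic-channel data rate per spreading factor, derived from
# the chip rate in Table 5.1 (rate = chip rate / SF).
chip_rate_kbps = 4096  # 4.096 Mcps
max_rate = {sf: chip_rate_kbps / sf for sf in (32, 64, 96, 128, 160)}
# SF=32 gives the 128-kbps maximum; SF=160 gives 25.6 kbps.
```

These values match the 128-kbps maximum quoted in the text and the approximately 25-kbps rate later observed for SF=160.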


The spreading factors used in the experiments are 32, 64, 96, 128, and 160. The experiments presented in this section only tested homogeneous voice traffic. The integration of multimedia traffic (e.g., additional support of e-mail, ftp and audio/music streams) will be presented in the next section. The BER requirement for voice traffic is assumed to be 10^-2. The minimum data rate required by voice is 8 kbps^1. For simplicity, the packet size is fixed to 128 bits.

The system chip rate is set to 4.096 Mcps as proposed in the WCDMA standard. The maximum traffic channel data rate in this case will be 128 kbps with a spreading factor of 32. The data rates of the Paging and Access channels are 32 kbps and 16 kbps
^1 Our latest study indicated that this is possible with advanced compression schemes.










respectively. The BER was measured in every mobile station, and the average BER among all the voice streams was calculated, as illustrated in Figure 5.4.


Figure 5.4: Performance guarantee with the proposed admission control.


As depicted by the curve labeled "SF32," fixed-spreading-factor schemes did not guarantee the overall performance among all the users. When there were fewer than five connections, the average BER was acceptable (i.e., less than 10^-2). However, when the number of connections exceeded five, every connection (including the existing and the newly-accepted connections) suffered a BER above 10^-2.

On the other hand, by using our proposed admission control protocol, the average BER was always maintained below the 10^-2 threshold for voice even as the number of users increased. The curve labeled with the sequence (SF64, SF96, SF128, SF160) in Figure 5.4 indicates the time instances at which our CDMA system reacted to the increasing demand of users and new spreading factors were adopted. The system never exceeded the upper limit for the voice traffic (10^-2).

Though the experimental results are very promising, our proposed admission control protocol does introduce two design tradeoffs. One tradeoff occurs in the prolonged base station processing. The other occurs in the contention time for all mobile stations to acknowledge the completion of spreading factor changes. These two timing factors thus result in a longer end-to-end connection setup time. In the next subsections, we describe these tradeoffs in detail and propose methods for further improvement.
5.1.4 Processing Time at Base Station

The end-to-end delay is defined as the total time between a user making a connection request and the user becoming ready to transmit data. Two dominant time components are Tp, the base station processing time, and Tupdate, the waiting time for UPDATE ACKs from mobile stations.

In a normal situation, to process a new connection request, the base station needs to review the link status for each existing connection. Thus Tp is roughly proportional to the number of users in the system. The admission protocol takes approximately 1 msec to review one connection if it is determined that the connection can be satisfied. However, when a new spreading factor is required, the processing time is much higher. For instance, when the number of existing connections reaches 5, where an UPDATE should be undertaken, the processing time Tp increases to 22 msec. We realized that the connections belonging to the same traffic type always change to the same spreading factor since they have the same BER requirement. Thus, connections with the same BER requirement (same traffic type) could use the same spreading factor once and for all.
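The observation above amounts to caching one spreading factor per traffic type instead of recomputing it per connection. A minimal sketch of this "once for all" shortcut (the function and variable names are hypothetical):

```python
def review_connections(connections, find_sf_for_type):
    """connections: iterable of (conn_id, traffic_type) pairs.
    find_sf_for_type: the expensive per-type SF computation."""
    sf_cache = {}   # one SF per traffic type, computed only once
    updates = {}
    for conn_id, ttype in connections:
        if ttype not in sf_cache:
            sf_cache[ttype] = find_sf_for_type(ttype)
        updates[conn_id] = sf_cache[ttype]
    return updates

# Three connections, but only two SF computations (one per type):
plan = review_connections(
    [(1, "voice"), (2, "voice"), (3, "email")],
    {"voice": 96, "email": 128}.get)
```

The per-type cache turns the review loop from one SF computation per connection into one per traffic type, which is the source of the Tp reduction discussed above.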
5.1.5 Contention Time for Acknowledgment

When an UPDATE decision is made at the base station, our protocol requires that, before accepting the new connection, all active mobiles switch to the new spreading factor so that the BER never rises above the acceptable point. The current system design enforces that mobiles send acknowledgments back to the base station after they change their spreading factors. Since the Access Channel of a CDMA system is a single common channel operated in slotted ALOHA, there will be extra delays due to possible access collisions if more than one mobile station needs to access the common channel. The average number of slots per contention is 1/A, where A is the probability that some station acquires the channel in a specific slot. When A = 1/e, the minimum contention is reached. This results in a large contention delay when an UPDATE is needed. When the number of existing connections is 9, the end-to-end delay becomes 448.4323 msec, and Tupdate is 395.432 msec.
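The contention figure above follows the classic slotted-ALOHA result: with per-slot acquisition probability A, a contention lasts 1/A slots on average, which is minimized at e slots when A = 1/e. A one-line check:

```python
import math

def mean_contention_slots(a: float) -> float:
    """Expected number of slots per contention in slotted ALOHA."""
    return 1.0 / a

best = mean_contention_slots(1 / math.e)   # minimum: e, about 2.72 slots
```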

Many approaches can potentially decrease the contention period. For example, the goal can be accomplished if an improved collision prevention/resolution algorithm is used (i.e., other than CSMA methods). Increasing the number of access channels is another alternative. Without any technology preference, we have first investigated the second approach by increasing the number of access channels.

The base station should now distribute access channels equally among all users, so that the overall Tupdate can be reduced evenly across all the mobile stations. A hash function implemented in the base station should be sufficient for this purpose. By hashing a mobile station's serial number or PN code, we can balance the number of mobiles using each access channel. Figure 5.5 depicts the preliminary results corresponding to 1, 2, and 4 access channels.
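A minimal sketch of the hash-based balancing; the modulo hash is an assumption (any uniform hash of the serial number or PN code would do):

```python
def access_channel(serial_number: int, n_channels: int) -> int:
    """Map a mobile to one of n access channels by hashing its serial."""
    return serial_number % n_channels   # simple modulo hash

# With 4 access channels, consecutive serial numbers spread evenly:
assignment = [access_channel(s, 4) for s in range(8)]
```

Spreading the mobiles evenly means each channel carries roughly 1/n of the ACK contention load, which is what drives the Tupdate reduction shown in Figure 5.5.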

The preliminary results demonstrate a very promising reduction of the Tupdate component. By using 2 access channels, the average Tupdate can be reduced by a stable 49%, independent of the number of existing connections. With 4 access channels, a further reduction of about 48% to 58% of Tupdate is accomplished.

However, when multiple access channels are deployed, we observed that the average BER increases at the updating points. This is caused by the additional interference from multiple active access channels. The BER simulation with 1, 2, and 4 access channels shows that the BER corresponding to 2 and 4 access channels is higher than in the case of one access channel. For two access channels, the overall BER performance is still below the 10^-2 guaranteed quality. Therefore, using two access










Figure 5.5: Improved Tupdate by using multiple access channels.


channels proved to be a good method to balance the shorter Tupdate time against the BER quality guarantee. When we use 4 access channels, the system experiences a sharp increase in BER and violates the BER requirement at 5 users.


5.2 Dynamic Scheduling for Multimedia Integration

With a solid understanding of using dynamic spreading factors to support many voice-only users with guaranteed quality, a fundamental question needs to be answered: how can the system integrate multimedia traffic (with the best system performance)? In this section, we illustrate how a dynamic scheduling algorithm can be designed to support this mission. To simplify the discussion, our example assumes a simple traffic pattern mixing e-mails and voice streams. A typical traffic pattern consists of K voice sessions as background traffic and 8 large e-mail requests (with a BER requirement of 0.00007). Each e-mail is emulated as a 384-Kbit data amount, corresponding to a message with a few attachments.









5.2.1 Design Issues of Traffic Scheduling for the Wireless CDMA Uplink Channel

The traffic scheduling problem in the CDMA uplink brings some new issues that are not present in wireline network scheduling. Briefly speaking, the capacity of the wireless CDMA uplink varies with the traffic types and their spreading factors, which makes scheduling more difficult than in the wireline network case. Let us consider a simple scheduling goal: to maximize the instantaneous throughput without any fairness requirement among different flows. Such a goal can easily be met in time-division based wireline networks by simply transmitting whatever is in the transmission buffer. As long as the transmission buffer is kept non-empty, the network will be under full utilization, which is the link speed. An often-forgotten assumption behind this scheduling method is the high fidelity of the communication media, with a very low transmission error rate. Fortunately, this assumption remains true for most wireline media. With a very low transmission error rate, the differences among traffic types in terms of BER requirements are basically invisible to the network. Thus, as far as the above simple scheduling goal is concerned, it does not matter which traffic type is scheduled earlier.

Unfortunately, the transmission error rate in the wireless CDMA uplink is much higher than that of any wireline media used today. With a given channel situation and spreading factor, some traffic might not be transmittable due to the violation of its BER requirements. Furthermore, CDMA allows several concurrent transmissions to be undertaken, so the throughput is the summation of the data rates of all active channels. From the discussion in the first part of this chapter, it is generally true that when high spreading factors are used, the number of concurrent active channels increases; however, the data rate of each channel decreases. Thus, even considering only one traffic type, the decision on the spreading factor is not trivial.

In fact, traffic scheduling in the wireless CDMA uplink must decide two things: the order in which connections will be activated, and the spreading factor to be used for each connection. The decision on these scheduling parameters should also satisfy the BER requirement for all traffic and maximize the overall throughput. Notice that frame-by-frame scheduling is not considered here due to the signaling overhead between the base station and mobile stations.

Therefore, our simple traffic scheduler will be invoked when one of the following events occurs: a new connection request, a connection termination, or a traffic profile update. Each connection is associated with a traffic type, as discussed earlier, and the amount of data to be transmitted. For streaming traffic, a default of 500 kbits is assumed for scheduling purposes. When this quota is exhausted and the connection is still alive, another 500 kbits is assigned. For non-streaming traffic such as ftp and e-mail requests, the transmission quota is the actual amount of data in the request.
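The quota rule above can be sketched as follows; the traffic-type labels are illustrative assumptions:

```python
STREAM_QUOTA_KBITS = 500  # renewable quota for streaming traffic

def transmission_quota(traffic_type: str, request_kbits: int) -> int:
    """Quota assigned when the scheduler is invoked for a connection."""
    if traffic_type in ("audio", "video"):  # streaming: renewable 500 kbits
        return STREAM_QUOTA_KBITS
    return request_kbits                    # ftp / e-mail: actual size
```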

For the rest of this chapter, we discuss the performance of such a traffic scheduler with fixed and dynamic spreading factors. The performance is evaluated in terms of the overall turn-around time instead of the instantaneous throughput.
5.2.2 Traffic Scheduling with Fixed Spreading Factor

The performance metric to be measured for data communication is the total turn-around time to fulfill all the data traffic, denoted as Tt. The goal of system design is thus targeted at a shorter Tt (while maintaining the performance guarantee for all existing voice connections). If a traditional CDMA system with a fixed spreading factor (e.g., SF=64) is used, data traffic will experience the same BER as voice traffic. Therefore, when the number of background voice connections K increases, it may not be effective to transmit e-mails since the BER becomes too high. Since the accuracy of e-mails needs to be guaranteed, the transmission of e-mail data traffic under this situation will cause a high probability of packet damage (due to the BER being over the acceptable limits).










The high packet damage rate will require a much higher retransmission rate from the upper network layer. Thus the total Tt will become even higher. If the network load is sustained for a long time, even constant retransmissions will not guarantee a definite success of packet delivery. Therefore it not only introduces a long delay for the retransmitting user, but also generates negative interference to other users.

Thus, in a wireless CDMA environment, perhaps a non-retransmission policy can benefit the overall system, where data transmission occurs only under an acceptable BER level. For example, when K = 10, the predicted average BER is 0.0001, which is higher than the BER requirement of e-mail traffic; thus the e-mail request has to wait until the voice load decreases. Based on this policy and a fixed spreading factor, a possible traffic scheduler to support integrated multimedia communication may work as follows:

Algorithm FSF_Scheduling: This algorithm delays the transmission of e-mail communication until the BER in the environment is acceptable. This algorithm will reduce the frequency of retransmissions from the upper layers.


Algorithm FSF_Scheduling

(1) Contact the base station for the current traffic load.

(2) Select an unfinished e-mail request r(i).

(3) Check whether the addition of this request will still
    satisfy the BER requirements.

    IF (the predicted BER exceeds the requirement of any
        existing connection, or that of r(i))
        r(i) is not accepted for the next time frame;
        GOTO (2) and check other pending traffic.
    ELSE
        r(i) is scheduled at the next time frame.
    END IF
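An executable rendering of this policy might look as follows; the predicted-BER model passed in is a toy stand-in for the base station's actual prediction, and the function name is an assumption:

```python
def fsf_schedule(pending, n_active, fixed_sf, ber_req, predicted_ber):
    """Select e-mail requests for the next time frame under a fixed SF.

    pending: pending request ids; n_active: currently active connections;
    predicted_ber(sf, n): predicted system BER with n active users.
    """
    accepted = []
    for req in pending:
        # Admit the request only if one more active connection still
        # keeps the predicted BER within the e-mail requirement.
        if predicted_ber(fixed_sf, n_active + len(accepted) + 1) <= ber_req:
            accepted.append(req)
        # otherwise the request waits for a lighter-loaded frame
    return accepted

# Toy model: BER grows linearly with load and shrinks with the SF.
plan = fsf_schedule([0, 1, 2, 3], 4, 64, 0.01, lambda sf, n: n / (10 * sf))
```

With 4 active connections, the toy model admits only the first two e-mails; the rest are deferred, which is exactly the delaying behavior the algorithm describes.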









We have conducted experiments with the FSF_Scheduling algorithm to collect performance results. The focus of the performance is the total turn-around time, Tt, for 8 e-mail requests. Figure 5.6 shows the turn-around time Tt with a fixed spreading factor.


Figure 5.6: Turn-around time for data traffic under fixed SF for 8 e-mails (each 384 Kbits), versus the number of existing voice connections.


The following observations can be made:

* For SF=64, the data traffic takes 6 seconds when k = 2, 3, 4. It increases to 12 seconds for k = 5, 6, and 7. At k = 11, the turn-around time becomes 48 seconds. Further increase of the voice background will result in an unacceptably high BER for the e-mail traffic; thus the transmission time is N/A. Similar performance results are observed for SF=96.

* With SF=128, Tt equals 12 seconds when fewer than 16 voice connections exist. Tt increases to 24 and 36 seconds when the number of background voice connections is 19 and 22, respectively. With 24 background voice connections, only one e-mail connection can be transmitted in order to satisfy the BER requirement, which results in a turn-around time of 96 seconds. No e-mail transmission is allowed after 25 voice users.

* For SF=160, Tt equals 15 seconds when fewer than 20 voice users are present, which is 25% higher than that of SF=128. The increase of Tt is caused by the decrease of the data rate: the data rate at SF=160 is about 25 kbps, which is about 25% less than that of SF=128. When k = 27, Tt increases to 30 seconds, then 45 seconds at k = 31, 60 seconds at k = 32, and 120 seconds at k = 33. After k = 34, we find that the system BER is already too high to accept any data traffic.
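These turn-around times are consistent with rate = chip rate / SF: a 384-Kbit e-mail transmitted at the full channel rate takes 384 / (4096 / SF) seconds, giving the 6-, 12-, and 15-second plateaus observed above. A quick check:

```python
# Per-e-mail transmission time at full channel rate for each SF.
email_kbits, chip_rate_kbps = 384, 4096
seconds = {sf: email_kbits / (chip_rate_kbps / sf) for sf in (64, 128, 160)}
# seconds == {64: 6.0, 128: 12.0, 160: 15.0}
```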










In summary, with the FSF_Scheduling algorithm, the system throughput suffers in order to satisfy the high BER requirement of data communication. In particular, with small spreading factors, the voice background can affect the transmission of data traffic significantly. Under light network load, the longer spreading factors (e.g., 128 and 160) actually generate a longer turn-around time than the shorter spreading factors (e.g., 64 and 96). However, longer spreading factors do have the capability to support more concurrent voice streams with reasonable response time for conventional data requests. How to take advantage of these two design alternatives (with a balanced performance between continuous and conventional data services) becomes the focus of our next proposed scheduling algorithm.

5.2.3 Dynamic Scheduling to Improve Tt

To escape such an undesirable situation, a scheduling algorithm based on dynamic spreading factor assignment can be adopted. Intuitively, when the current system BER does not satisfy the BER requirement of data traffic, we should find a longer SF. When there are multiple data transmission requests, we should select a feasible SF for each of them. Since some data sessions may last several time frames, the spreading factor should be determined on a per-time-frame basis. Therefore, minimizing the turn-around time Tt for a given set of data traffic becomes a global optimization problem over all feasible SF combinations throughout the lifetime of the data sessions.

However, the search space grows exponentially. For example, assume the data traffic consists of one FTP session and one AUDIO session, and there are 10 background voice sessions; we will have 3 candidate SFs (SF=96, 128, or 160) for AUDIO and 2 for FTP (SF=128 or 160). Further assume the AUDIO session requires 256 Kbits and the FTP session requires 64 Kbits. With the maximum channel data rate of 128 kbps (achieved with SF=32), the AUDIO session will last at least 2 seconds, which corresponds to 100 time frames (each time frame is 20 msec). By a similar argument, the least number of time frames for the FTP session is 25. Thus the total number of combinations for this example will be at least (2 x 3)^min{100, 25} = 6^25. In general, assuming the number of feasible SFs for the kth traffic type Tk is Nk, and the estimated frame number of Tk is Fk, the approximate size of the search space is (PROD_k Nk)^(min_k Fk).
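For the example above, the count can be checked directly:

```python
combos_per_frame = 3 * 2     # SF choices for (AUDIO, FTP) in one frame
frames = min(100, 25)        # the shorter session bounds the joint span
search_space = combos_per_frame ** frames   # 6**25, roughly 2.8e19
```

Even this two-session example yields on the order of 10^19 combinations, which motivates the per-time-slot heuristic adopted next.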

This huge search space makes brute-force searching impractical, since the scheduling is desired to finish within a frame time. In addition, the background voice traffic and data requests may change from time to time, which also makes it less worthwhile to find a static global optimum based on historical system load information. Therefore, it is our belief that a spreading factor combination that maximizes the throughput for the current time slot is more reasonable.

The remaining issue is which traffic should be considered first, and with what spreading factor. We observed that traffic with a low (i.e., less strict) BER requirement has a better chance to be satisfied than traffic with a strict BER requirement, which indicates that a small spreading factor is more likely to be acceptable for the low-BER-traffic. Therefore, to improve the overall throughput, we should inspect the low-BER-traffic first. For example, with 40 background voice connections, we can either schedule 5 FTP traffics with SF=160, or 5 Audio traffics with SF=96, resulting in a total of 45 active traffics (including voice)^2. Obviously the latter has higher throughput.

Thus our scheduling algorithm works in four phases corresponding to the four data traffic types, from low-BER-traffic to strict-BER-traffic. Namely, the traffic types are processed in the order of Audio, Image, FTP, and E-mail. The following pseudo code illustrates the algorithm for processing audio traffic. The algorithms for the other traffic types are similar.


Algorithm DSF_Scheduling
/* This algorithm improves the FSF_Scheduling algorithm
   by assigning spreading factors to the different traffic
   types more dynamically. */

INPUT PARAMETERS:
  Nf, Ne, Na, Ni: the numbers of pending FTP, e-mail,
      AUDIO, and Image data requests respectively.
  FTP[1..Nf]: pending FTP requests; FTP[k] represents the
      remaining data amount yet to be transmitted.
  EMAIL[1..Ne]: pending e-mail requests.
  AUDIO[1..Na]: pending audio requests.
  IMAGE[1..Ni]: pending image requests.
  RBER[1..4]: BER requirements of the four traffic types.
  BER[SF][1..50]: the predicted BER given the spreading
      factor and the number of active users.

(1) Find the minimum spreading factors SFv and SFa such that
    (BER[SFv][k + Na] < RBER[1]) and (BER[SFa][k + Na] < RBER[4]).

(2) IF such SFv and SFa are found
        assign SFa to every AUDIO[i], i <= Na;
        GOTO (execute)
    END IF

(3) IF no such SFv exists
        use SFv = 160 as the voice spreading factor
    END IF

(4) Locate the maximum admissible audio traffic number Na':
        Na' = max{N : BER[SFv][k + N] < RBER[1]}

(5) Find the minimum SFa that satisfies
    (BER[SFa][k + Na'] < RBER[4]).

(6) If Na' < Na, decide which subset of audio traffics will be
    chosen; this should be based on a fair strategy so that
    there is an equal chance for all traffics.

(execute)
FOR (each selected audio traffic i)
    Let Tf be the length of the time frame;
    reduce the remaining data amount:
        AUDIO[i] -= Tf * 4.096 / SFa;
    update array AUDIO[] and Na by deleting finished requests.
END FOR

Continue with the other traffic types.

^2 Other combinations that result in 6 or more data traffics (such as 3 FTP and 3 Audio) will result in more than 46 connections; the resulting voice BER will not be satisfied.
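A compact executable sketch of the audio phase follows. The BER table is replaced by a toy function and steps (3)-(6) are collapsed; it is meant only to illustrate the control flow, not to reproduce the implementation above:

```python
def schedule_audio(n_audio, k, ber, rber_voice, rber_audio,
                   sfs=(32, 64, 96, 128, 160)):
    """Return (SFa, number of admitted audio requests).

    n_audio: pending audio requests; k: background voice users;
    ber(sf, n): predicted BER with n active users at spreading factor sf.
    """
    # Steps (1)-(2): try to admit every audio request at the minimum SF.
    for sf in sfs:
        if ber(sf, k + n_audio) <= rber_audio:
            return sf, n_audio
    # Steps (3)-(5): fall back to the longest SF for voice and cap the
    # number of admitted audio requests to keep voice below its limit.
    sf_v = sfs[-1]
    na = max((n for n in range(n_audio + 1)
              if ber(sf_v, k + n) <= rber_voice), default=0)
    sf_a = next((sf for sf in sfs if ber(sf, k + na) <= rber_audio), sfs[-1])
    return sf_a, na  # step (6): a fair subset of size na would be chosen

# 5 audio requests over 10 background voice users, with a toy BER model:
sf_a, admitted = schedule_audio(5, 10, lambda sf, n: n / (10 * sf),
                                rber_voice=0.02, rber_audio=0.01)
```

In this toy run the algorithm walks up to SF=160 before the audio BER requirement is met, at which point all five requests are admitted.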


The performance of this scheduling algorithm for the sample traffic pattern is

demonstrated in Figure 5.7, and the following observations are made:


Figure 5.7: Turn-around time for data traffic under dynamic SF scheduling for 8 e-mails (each 384 Kbits), versus the number of existing voice connections.


* When the system is under low load (fewer than 3 voice connections), the BER of e-mail traffic can be satisfied with SF=64, with all e-mail traffics being transmitted simultaneously. The data rate is 64 kbps, and the data traffic transmission is completed within 6 seconds.

* For k between 4 and 7, the numbers of concurrently transmitted e-mails are 7, 6, 5, and 4 respectively. This is because with SF=64, at most 11 active traffics are allowed to maintain the e-mail BER requirement. The resulting turn-around time increases to 12 seconds. Nevertheless, SF=64 still provides the shortest Tt compared to the other spreading factors.

* At k = 8, it is observed that Tt would increase to 18 seconds if we used SF=64, while with SF=128 we need only 12 seconds. Thus the spreading factor is switched to 128 at both the mobile and base stations. When k keeps increasing, the FSF_Scheduling algorithm results in a turn-around time in the range of 30 to 75 seconds (refer to Figure 5.6), which is much higher than the results from DSF_Scheduling.

* Similarly, we find that after k = 17, the best turn-around time of 12 seconds can be approached with SF=160. In fact, after k = 10, FSF_Scheduling with a spreading factor of 64 or 96 cannot support the data traffic with a reasonable turn-around time (due to the high packet damage probability). Our DSF_Scheduling automatically changes to a higher spreading factor and provides continuous transmission of data traffic. In Figure 5.7, under a heavy system load of 30 users, we can still fulfill the sample traffic pattern in 40 seconds.

* Dynamically changing the spreading factor based on the system load can significantly improve system throughput. For example, with 8 background voice connections, Tt is reduced from 24 sec to 12 sec by simply turning the SF from 64 to 160, which halves the data traffic time.


In summary, our dynamic spreading factor assignment scheme can adjust to the system load and balance the requirements of different traffic. Voice communication is guaranteed to be uninterrupted, and data traffics are served with reasonable fairness and response time. Our scheme can guarantee no starvation for data traffics, even when the base station is under high load.

5.3 Summary

In this chapter, we discussed a new MAC protocol for W-CDMA. The protocol can support different traffic classes that have a variety of BER requirements. Since the BER grows with the number of users in a wireless communication system, we designed a protocol that integrates different traffic classes with guaranteed quality as well as maximizing the number of users supported.

The presented protocol admits new calls only if the quality of the new connection, as well as of all the existing connections, can be guaranteed. Our protocol uses a dynamic spreading factor scheme. In order to satisfy the BER requirement, the protocol tries to dynamically change the SF used by connections that may experience a high BER. The connections may alter their SF several times depending on the changes in the number of users in the system.

We have evaluated the BER performance and admission time of the protocol under an increasing number of voice connections. We have also compared our protocol to a regular CDMA protocol that uses a fixed spreading factor. The results show that the DSF protocol provides a significant improvement in BER satisfaction compared to CDMA with a fixed SF. Even as the number of users in the system increases, the average BER of all connections is maintained below the upper limit for their traffic type. New schemes are proposed and evaluated to reduce the admission delays. Furthermore, in order to guarantee balanced support among different traffic types, we proposed a scheduling algorithm and measured the response time for data traffic in our proposed system.

As further study, we evaluate the performance of the protocol under more complex conditions (i.e., mixed traffic with streaming video). We also expand the protocol to cover more than one cell (i.e., so-called soft handoff). These studies are described in the next chapter.















CHAPTER 6
SOFT HANDOFF WITH DYNAMIC SPREADING


In our previous study [38], we proposed a dynamic spreading scheme such that the mobile user can change the length of the assigned spreading code for more efficient utilization of the radio frequency. With the new dynamic spreading capability, the system can support BER-sensitive multimedia traffic (i.e., combined voice and traditional text information) even when the system load is high.

It is the task of this chapter to extend the dynamic spreading scheme to the multi-cell scenario. More specifically, we need to take mobility and handover^1 [41, 42] into consideration. In one envisioned application scenario, we can imagine how Tedd and Tom can benefit from this scheme when driving from Gainesville to Orlando (a total distance of about 80 miles). Assuming a base station is placed along the highway every 4 miles, the mobile will cross 20 cells during the trip. It can be expected that there are fewer mobiles in rural areas than in urban areas. In a rural area, Tedd can have a video phone conversation with his friend in Orlando when there are no mobile users around. When he comes into a town, where more mobile users co-exist, Tedd can switch to a regular voice conversation by using a longer spreading code (thus decreasing the data rate and BER requirements). After driving out of the town, Tedd can resume the video session by changing back to a short spreading factor, since a shorter spreading factor is enough to provide signal quality when the number of mobile users decreases (and thus there is less interference).

The fundamental question yet to be answered is: how do we integrate soft handoff into the dynamic spreading scheme? In the dynamic-spreading-enabled CDMA system,
^1 We use the terms handover and handoff interchangeably in the context of this chapter.









a mobile might travel through a series of cells during a call session. It is likely that these cells have different numbers of active in-cell connections, and therefore might assign different spreading factors in order to maintain the BER quality. Thus a mobile might have to change its spreading factor from time to time as it enters a new cell. A basic requirement for a handoff algorithm under this scenario is to make sure that the update of the spreading factor in both the handovering mobile and the target base station is finished within the allowed handoff time.

We propose a novel soft-handoff scheme to address the above problems. Our scheme uses a methodology similar to WCDMA's to determine the addition and dropping of cells into and from the active set. The basic functionality of the proposed handoff scheme has three parts: (1) collect information about the surrounding cells and the mobile profile, (2) update the spreading factor (SF) in the active set (base stations in range) to maintain the BER level for the mobiles in these cells, and (3) decide the right spreading factor and power level for the handover mobiles.

The probability of updates as a result of handoff is an important factor describing the performance of our proposed framework. We present an analytical evaluation of the update probability. The preliminary results show that mobiles with voice traffic have a relatively small update probability (also see Wang et al. [39]). However, video traffic and WWW traffic have a high update probability, due to their strict BER requirements.

The time components of the proposed algorithm were analyzed. We found that the handoff algorithm could consume significant system resources and result in a long handoff delay, especially when the system needs to execute a large number of updates. We therefore revised the handoff procedure such that handovering mobiles can be processed in batches, and several update operations caused by consecutive handoffs can be combined. Numerical results show that the average handoff delay can be reduced by almost 29%.









The core process executed in our proposed soft handoff with dynamic spreading scheme is to determine the right spreading factor and the power level of the transmitting signal for the handoff mobile. The decision on the SF and power level should be made such that the following conditions are satisfied: (1) the link quality of the mobile stations in the neighboring cells is not sacrificed, (2) the minimum BER requirements of the handovering mobile are not violated, and (3) the maximum data rate of the handovering mobile is achieved.

We proposed a sub-optimal method (called HYB) that jointly optimizes the SF and power level for all mobiles in the handover area. The HYB algorithm models the BER constraint of each mobile as a function of several factors, including the (SF, power) pair of the handoff mobile, the system load of the active set, and the distance from the active cells. The decision variables are the (SF, power) pairs of all handoff mobiles, chosen to maximize the throughput of the handoff mobiles. This modeling results in a nonlinear programming problem. Further investigation shows that the problem can be simplified to a linear programming problem with a slight modification of the constraints. The reduced problem contains only half the number of decision variables and can be solved efficiently with linear programming methods. The solution of the reduced problem is then used to derive the desired SFs from the original constraints.

Numerical experiments show that the HYB algorithm significantly outperforms a greedy strategy (which we call the MaD algorithm). The BER of the mobiles is preserved during the whole handoff period, while the interference to the surrounding cells is kept under control. For WWW traffic, the throughput of the HYB scheme is 25% higher than that of the conservative strategy; a 26% increase was also observed for video traffic.
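The flavor of the joint (SF, power) decision can be illustrated with a toy model. Instead of the linear-programming reduction used by HYB, this sketch simply enumerates the discrete WCDMA SF set for two handoff mobiles, computes the minimum power meeting a per-mobile SIR target in closed form, and keeps the highest-rate assignment whose total received power stays under an interference budget. The path gains, SIR target, and budget are made-up numbers.

```python
from itertools import product

SF_SET = (4, 8, 16, 32, 64, 128, 256)
CHIP_RATE = 3.84e6                       # WCDMA chip rate, chips/s

def min_power(sf, path_gain, noise, gamma):
    """Smallest transmit power meeting the SIR target gamma, given
    processing gain sf:  sf * path_gain * p / noise >= gamma."""
    return gamma * noise / (sf * path_gain)

def joint_decision(gains, noise, gamma, interference_budget):
    """Maximize total data rate (CHIP_RATE / SF, summed over mobiles)
    subject to the sum of received powers staying under the budget."""
    best = None
    for sfs in product(SF_SET, repeat=len(gains)):
        powers = [min_power(sf, g, noise, gamma)
                  for sf, g in zip(sfs, gains)]
        received = sum(p * g for p, g in zip(powers, gains))
        if received > interference_budget:
            continue                      # would hurt the target cell
        rate = sum(CHIP_RATE / sf for sf in sfs)
        if best is None or rate > best[0]:
            best = (rate, sfs, powers)
    return best

result = joint_decision(gains=[0.5, 0.25], noise=1e-3, gamma=5.0,
                        interference_budget=2e-3)
```

The exhaustive search grows exponentially with the number of handoff mobiles, which is exactly why HYB's linear-programming reduction matters at scale; the toy only shows the trade-off being optimized.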

6.1 Related Study

Traditional handoff algorithms were designed for voice-only networks carrying homogeneous traffic. The major performance focus of these algorithms is to guarantee the connectivity of live calls and to keep the failure rate as small as possible. Important metrics and criteria therefore include received signal strength (RSS), signal-to-interference ratio (SIR), distance, and traffic load. The handoff probability corresponding to different handoff decision schemes can be analyzed [43]. A general survey of handoff protocols was presented by Marichamy and Maskara [44].

The CDMA system supports a unique handoff scheme, soft handoff [41, 42]. Soft handoff allows more than one base station to communicate with the mobile station during the handover process, so there is no break period when the mobile moves to another cell. It has been shown that soft handoff can reduce the transmission power of the mobile and increase the system capacity [41, 45].

Because radio resources are scarce, it is essential to make the best use of the system resources while supporting multiple users and multimedia traffic. Resource allocation for CDMA systems, and call admission control (CAC) in general, has been discussed in many works [46-48]. Naghshineh and Schwartz [47] reported a distributed call admission scheme that takes neighboring cells into consideration; the movement of mobiles and the system load in nearby cells are used to make the admission decision.

Zhou et al. [46] discussed a forward-link power control scheme in a two-cell CDMA network supporting different QoS requirements. The scheme is optimized in the sense of minimizing the base station transmit power and maximizing the system throughput. However, the study assumes a fixed spreading factor and does not address resource assignment for the reverse link.

Lee and Wang [48] presented an optimum power assignment for mobiles with different QoS requirements. They formulated the admission problem as a set of inequalities on the desired SIR of all active mobiles and provided a method to derive a feasible solution. However, in this study the processing gain is fixed to a predefined value for each traffic type, so the admissible region is strictly limited.









However, only a few studies have attempted to optimize the performance of handoff mobiles by reallocating radio resources, and the effects of dynamically changing the spreading factor have not been reported in these studies. Our study is among the first attempts to incorporate a dynamic spreading factor into the handoff algorithm. We also address resource allocation during the handoff period as a joint optimization of both the spreading factor and the mobile power control. To the best of our knowledge, this approach has not been reported in the literature.

The rest of this chapter is organized as follows. Section 6.2 discusses our proposed dynamic-spreading soft handoff scheme and the system response time. Section 6.3 presents a baseline algorithm that uses a greedy strategy to decide the spreading factor. In Section 6.4, we show how a closer estimation of the mobile/cell condition can provide high throughput while preserving a low interference level.

6.2 The Framework of Soft Handoff Algorithm with Dynamic Spreading

Our proposed handoff algorithm is based on the traditional soft handoff algorithm. A mobile station moving around in the multicell environment is either in the handoff state or in the non-handoff state. In the non-handoff state, the mobile is solely controlled by its resident cell, as described in Wang et al. [38]. The mobile enters the handoff state when its active set contains more than one cell site. The maintenance of the active set is similar to that used in IS-95: the relative pilot signals of the surrounding cells are reported by the mobile, and a decision is made to add, drop, or ignore each cell with respect to the active set.
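The IS-95-style add/drop decision can be sketched as below: the mobile reports filtered pilot strengths, and cells are compared against an adding threshold and a lower dropping threshold. The threshold values (in dB) are illustrative only, and a real system would also apply a drop timer before removing a cell.

```python
T_ADD = -14.0    # add a pilot stronger than this (dB)
T_DROP = -18.0   # drop an active pilot weaker than this (dB)

def update_active_set(active_set, pilot_reports):
    """active_set: set of cell ids; pilot_reports: {cell_id: Ec/Io in dB}.
    Returns the new active set.  Pilots between the two thresholds leave
    the set unchanged (hysteresis), which avoids ping-ponging on fading
    signals."""
    new_set = set(active_set)
    for cell, ecio in pilot_reports.items():
        if cell not in new_set and ecio > T_ADD:
            new_set.add(cell)                 # strong new pilot: add
        elif cell in new_set and ecio < T_DROP:
            new_set.discard(cell)             # weak active pilot: drop
    return new_set

active = update_active_set({"A"}, {"A": -16.0, "B": -12.0, "C": -20.0})
# "A" stays (between thresholds), "B" is added, "C" is ignored
```

Under this rule, the mobile is in the handoff state exactly while the returned set contains more than one cell.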

Figure 6.1 shows the state diagram of the handoff procedure executed on the BS side. The logic of the mobile station is relatively simple and is not further elaborated here. Notice that the new base station needs to send its forward Walsh code to the mobile so that the mobile can decode its signal.









It can be observed that three parties participate during the handoff period. The role of the mobile station in the handoff processing is similar to that in the existing soft handoff procedure. The mobile is responsible for pilot signal measurement and compares the detected pilot signals to an adding threshold and a dropping threshold (the signal is filtered to eliminate the random variation caused by fast fading). The mobile then reports any change in signal strength to the BS at fixed time intervals. The current serving base station updates the active set of the mobile and decides whether or not the mobile should enter the handoff period. Once the mobile is in the handoff state, the current BS gathers all necessary information and decides a spreading factor for the handovering mobile.

However, the handoff algorithm should also guarantee that the target cell (and other nearby cells) can still operate with performance guarantees. In fact, the handovering mobile station is usually relatively close to the BSs in the active set, so its signal is strong enough to become the dominant interference if not carefully controlled. It is thus the duty of the handoff algorithm to maintain the desired BER in the target cell at all times, i.e., an update in the target cell should take effect immediately. The need to update the spreading factors in the target cell during the handoff procedure can prolong the total handoff time. This makes the time factor more important than in traditional systems, where a handoff can be finished within about 100 ms. Under heavy handoff traffic, efficient handoff processing becomes nontrivial, since the target cell might have to trigger update procedures several times in a limited time period.

The performance goals of the handoff algorithm with dynamic spreading factor are summarized as follows. (1) All handoff requests should be processed within a given time period, so that the mobile can smoothly migrate to the other cell. (2) The algorithm should be able to handle bursts of handoff requests, which is important when the system is under heavy load. (3) The handoff algorithm should seek the highest