Content-Aware Approaches for Digital Video Adaptation, Summarization and Compression

Permanent Link: http://ufdc.ufl.edu/UFE0042462/00001

Material Information

Title: Content-Aware Approaches for Digital Video Adaptation, Summarization and Compression
Physical Description: 1 online resource (127 p.)
Language: english
Creator: Lu, Taoran
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2010

Subjects

Subjects / Keywords: adaptation, aware, compression, content, summarization, video
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: In this dissertation, we present our work on three challenging problems in digital video applications: video compression, summarization and adaptation. Unlike conventional techniques, we focus on video content modeling and investigate how the characteristics of human attention help solve these problems. We denote these approaches 'content-aware' approaches and present our innovations in the main body of this dissertation. The first problem is content-aware video adaptation. We employ saliency analysis, which generates a saliency map to indicate the relative importance of pixels within a frame, for human attention modeling. We also propose a nonlinear saliency map fusing approach that considers human perceptual characteristics. To effectively map the important content from the source to the target display, we propose to have both intra-frame and inter-frame visual considerations, where intra-frame considerations focus on measuring the information loss within a frame, and inter-frame considerations emphasize the visual smoothness between frames. The mapping problem is formulated as a shortest path problem and is solved with dynamic programming. The second problem is content-aware video summarization. We introduce an automatic video summarization approach that includes unsupervised learning of original video-audio concept primitives and hierarchical (both frame and shot levels) skimming. For video concept mining, we propose a novel model using bag-of-words (BoW) shot features. We further design a hierarchical video summarization framework which jointly considers content completeness, saliency, smoothness and scalability. Another problem that we investigate is content-aware video compression. We study this problem in two aspects. In one aspect, we propose a content-aware framework and formulate the constrained rate-distortion optimization as a resource allocation problem, where bit allocation is adjusted differently at two levels: region-of-interest (ROI) and non-ROI, and intra frames and inter frames. The results exhibit better visual quality as well as objective quality improvement. In the other aspect, we aim at improving the coding efficiency of existing H.264 intra prediction. We incorporate a reverse encoding order with geometric analysis of binary transition points on block boundaries to explicitly derive the prediction direction. In addition, we design and implement a video coding parameter analyzer to facilitate the development of new coding tools for state-of-the-art and next-generation video compression standards.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Taoran Lu.
Thesis: Thesis (Ph.D.)--University of Florida, 2010.
Local: Adviser: Wu, Dapeng.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2011-06-30

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2010
System ID: UFE0042462:00001

Full Text

CONTENT-AWARE APPROACHES FOR DIGITAL VIDEO ADAPTATION, SUMMARIZATION AND COMPRESSION

By

TAORAN LU

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2010

© 2010 Taoran Lu

To everyone that motivated, helped and encouraged me on the way towards my PhD

ACKNOWLEDGMENTS

First and foremost, I express my sincere gratitude to my Ph.D. advisor Dr. Dapeng Oliver Wu for his support that made possible the completion of this work. Without his guidance, imagination, enthusiasm and passion, which I admire, this dissertation would not have been possible. I also thank my committee members Dr. John Harris, Dr. Scott Banks and Dr. Yijun Sun for their valuable discussions and comments to improve the quality of this dissertation.

My deepest appreciation also goes to Dr. Thomas Deselaers and Philippe Dreuw from Rheinisch-Westfälische Technische Hochschule (RWTH) Aachen University, Germany, for their kind help with my experiments and subjective evaluations on video adaptation. I also would like to thank the compression group at Technicolor Research and Innovation: Jill Boyce, Peng Yin, Xiaoan Lu, Qian Xu, Yunfei Zheng, Liwei Guo, Dong Tian and Polin Lai, for their support and guidance on my research on the H.264 video codec and next-generation video coding.

During the course of my Ph.D. research, I interacted with my colleagues and benefited a lot from valuable discussions on research and life at large. In particular, I thank former and current group members Dr. Xiaochen Li, Dr. Jun Xu, Dr. Bing Han, Dr. Xihua Dong, Zhifeng Chen, Zheng Yuan, Huanghuang Li, Qian Chen, Zongrui Ding, Lei Yang, and Yuejia He. Last but not least, I thank my parents for their love and support throughout my life.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
   1.1 Motivation
       1.1.1 Video Adaptation
       1.1.2 Video Summarization
       1.1.3 Video Compression
   1.2 Outline

2 CONTENT-AWARE VIDEO ADAPTATION
   2.1 Introduction
       2.1.1 Content-Unaware Video Adaptation
       2.1.2 Previous Approaches of Content-Aware Video Adaptation
       2.1.3 Overview of the Proposed Approach
   2.2 Spatial-Temporal Saliency Map
       2.2.1 Spatial Saliency Map
       2.2.2 Motion Saliency Map
       2.2.3 Nonlinear Fusion of Spatial-Temporal Saliency Map
   2.3 Intra-Frame Visual Consideration
       2.3.1 Content-Aware Information Loss Metrics
       2.3.2 Hierarchical Search for Optimal Single-Frame Cropping Window
   2.4 Inter-Frame Visual Consideration
       2.4.1 Dynamic Programming Solution for Optimization of Cropping Window Parameters
       2.4.2 Optimize Scale in a Shot
   2.5 Experimental Results
       2.5.1 Figure Illustrations
       2.5.2 Subjective Evaluations
   2.6 Summary

3 CONTENT-AWARE VIDEO SUMMARIZATION
   3.1 Introduction
       3.1.1 Content-Unaware Video Skimming
       3.1.2 Previous Approaches of Content-Aware Video Skimming

       3.1.3 Overview of the Proposed Approach
   3.2 Feature Extraction of Shots
       3.2.1 Video-Audio De-Interleaving and Temporal Segmentation
       3.2.2 Content-Aware Attention Modeling
       3.2.3 Concept Primitives and Bag-of-Words Features
   3.3 Video Skimming by Concept Reconstruction
       3.3.1 Spectral-Clustering Solution for Concept Learning
       3.3.2 Audio-Visual Concept Alignment and Consistence Checking
       3.3.3 Skimming Algorithm and Post Processing
   3.4 Experimental Results
       3.4.1 Figure Illustrations
       3.4.2 Subjective Evaluation
   3.5 Summary

4 A GENERIC FRAMEWORK FOR CONTENT-AWARE VIDEO CODING
   4.1 Introduction
       4.1.1 Content-Unaware Video Coding
       4.1.2 Existing Works of Content-Aware Video Coding
   4.2 Group-of-Picture Based Bit Allocation Framework
   4.3 Intra-Frame ROI Identification and Bit-Allocation
   4.4 Inter-Frame ROI Identification and Bit-Allocation
   4.5 Experimental Results
   4.6 Summary

5 TRANSITION-BASED INTRA CODING
   5.1 Introduction
       5.1.1 Existing Works
       5.1.2 Overview of Our Approach
   5.2 Transition Cases and Interpolation Schemes
       5.2.1 Flat/Zero Transition Case
       5.2.2 Two Transitions Case
       5.2.3 Four Transitions Case
       5.2.4 Six or More Transitions Case
   5.3 New Encoding Order
   5.4 Experimental Results
   5.5 A Coding Parameter Analyzer
       5.5.1 Introduction
       5.5.2 The Proposed Analyzer
       5.5.3 Coding Parameter Files
       5.5.4 Analyzer Modules
           5.5.4.1 Data parser
           5.5.4.2 Graphical user interface
       5.5.5 Examples
           5.5.5.1 Main graphical-user-interface

           5.5.5.2 Parameter display
   5.6 Summary

6 CONCLUSION AND FUTURE WORK

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Hypothesis testing results for subjective evaluation on retargeting algorithms.
2-2 Hypothesis testing results for subjective evaluation on retargeting algorithms.
3-1 Basic statistics of subjective testing scores of Big Buck Bunny.
3-2 Basic statistics of subjective testing scores of Lord of the Ring.
4-1 Performance evaluation on benchmark sequences Carphone and Akiyo.
5-1 Experimental results of TIP over JM.
5-2 Four categories of coding parameters.

LIST OF FIGURES

1-1 Successful online video service providers: Youtube and Hulu.
1-2 Copious digital video applications and conventional video adaptation methods.
1-3 Content-aware video summarization vs. conventional video summarization.
1-4 Block-based conventional video coding technique gives equal weight to every macroblock.
2-1 Numerous digital video applications seek a smart content adaptation approach.
2-2 Examples of content-unaware video retargeting methods.
2-3 Examples of content-aware retargeting methods.
2-4 Proposed saliency detection and single frame retargeting method.
2-5 Proposed video retargeting framework using dynamic programming.
2-6 Comparison of spatial saliency detection by multi-channel PFFT vs. PQFT.
2-7 Comparison of fusing schemes.
2-8 An example of the hierarchical searching space of Avatar.
2-9 Graph model for optimizing the cropping window trace.
2-10 Retargeting performances on natural images.
2-11 Comparison on video retargeting of baseline-PQFT and our approach.
2-12 Retargeting results for visual smoothness.
2-13 Comparison of saliency detection on images.
2-14 Statistical analysis for saliency maps.
2-15 Statistical analysis for retargeting algorithms.
2-16 Demos.
3-1 Video summarization techniques: static storyboard and video skimming.
3-2 Visual saliency masking on Big Buck Bunny.
3-3 SIFT feature detection on active regions.
3-4 The histogram representation of the visual BoW feature for a shot.
3-5 Same semantic concepts in different scales and locations.

3-6 The flowchart for extracting the visual BoW feature.
3-7 Audio saliency masking.
3-8 The flowchart for extracting the audio BoW feature.
3-9 An RRT with must-in shots, optional shots, virtual shots and shot table.
3-10 Post processing by saliency thresholding.
3-11 Shot detection, saliency masking and SIFT-feature extraction for sequence The Big Bang Theory.
3-12 The concept mining by spectral clustering of bag-of-words shot features of sequence The Big Bang Theory.
3-13 Reconstruction reference tree of The Big Bang Theory.
3-14 The histograms of the enjoyability and informativeness scores.
3-15 Statistical analysis results of the scores by subjective evaluation.
4-1 Two-level bit allocation.
4-2 ROI-based encoder diagram.
4-3 PSNR comparison under low rate case for Carphone.
4-4 PSNR gain for Carphone.
4-5 PSNR comparison under low rate case for Akiyo.
4-6 PSNR gain for Akiyo.
4-7 Reconstructed video frames of Carphone and Akiyo by the ROI scheme compared to JM14.
5-1 H.264 intra prediction modes for 8x8 and 4x4 blocks.
5-2 Transition points on inner layer and outer layer.
5-3 Two transitions case: an edge goes through.
5-4 Two transitions case: a streak or corner.
5-5 Four transitions case: transition point 0 is connected to point 3.
5-6 Four transitions case: an edge and a streak.
5-7 Distribution of transition cases of 8x8 blocks for BasketballDrill 832x480.
5-8 Raster coding order vs. reverse coding order.

5-9 Intra prediction modes for BR, UR, BL and UL blocks with raster and reverse coding order.
5-10 Rate-distortion curve for TIP vs. JM.
5-11 The motivation for using an analyzer instead of checking log files.
5-12 Workflow of a conventional H.264 bitstream analyzer.
5-13 Workflow of the proposed video coding analyzer.
5-14 Represent block partitions with binary strings.
5-15 An example GUI screenshot.
5-16 An example QALF filter information.
5-17 An example mode distribution plot.
5-18 An example QP variation plot.
5-19 An example partitioning and mode overlay plot.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

CONTENT-AWARE APPROACHES FOR DIGITAL VIDEO ADAPTATION, SUMMARIZATION AND COMPRESSION

By

Taoran Lu

December 2010

Chair: Dapeng Oliver Wu
Major: Electrical and Computer Engineering

In this dissertation, we present our work on three challenging problems in digital video applications: video compression, summarization and adaptation. Unlike conventional techniques, we focus on video content modeling and investigate how the characteristics of human attention help solve these problems. We denote these approaches content-aware approaches and present our innovations in the main body of this dissertation.

The first problem is content-aware video adaptation. We employ saliency analysis, which generates a saliency map to indicate the relative importance of pixels within a frame, for human attention modeling. We also propose a nonlinear saliency map fusing approach that considers human perceptual characteristics. To effectively map the important content from the source to the target display, we propose to have both intra-frame and inter-frame visual considerations, where intra-frame considerations focus on measuring the information loss within a frame, and inter-frame considerations emphasize the visual smoothness between frames. The mapping problem is formulated as a shortest path problem and is solved with dynamic programming.

The second problem is content-aware video summarization. We introduce an automatic video summarization approach that includes unsupervised learning of original video-audio concept primitives and hierarchical (both frame and shot levels) skimming. For video concept mining, we propose a novel model using bag-of-words (BoW) shot features.

We further design a hierarchical video summarization framework which jointly considers content completeness, saliency, smoothness and scalability.

Another problem that we investigate is content-aware video compression. We study this problem in two aspects. In one aspect, we propose a content-aware framework and formulate the constrained rate-distortion optimization as a resource allocation problem, where bit allocation is adjusted differently at two levels: region-of-interest (ROI) and non-ROI, and intra frames and inter frames. The results exhibit better visual quality as well as objective quality improvement. In the other aspect, we aim at improving the coding efficiency of existing H.264 intra prediction. We incorporate a reverse encoding order with geometric analysis of binary transition points on block boundaries to explicitly derive the prediction direction. In addition, we design and implement a video coding parameter analyzer to facilitate the development of new coding tools for state-of-the-art and next-generation video compression standards.

CHAPTER 1
INTRODUCTION

1.1 Motivation

This decade is the decade of multimedia. Thanks to the fast development of the Internet, the network volume has increased dramatically, and the number of supported users has also grown apace. In the consumer electronics industry, the advanced techniques in digital cameras and camcorders have made it easy and pleasant for everyone to capture and store a digital video. There is an obvious trend that more and more people start to watch and share digital videos online with their family, friends or even strangers worldwide. At the same time, conventional media distribution companies, including FOX, NBC, Discovery, etc., also notice this trend, and start to provide online video services to help people find and enjoy the world's premium video content whenever and wherever they want it. The success of Youtube and Hulu (Fig. 1-1) has shown the tremendous impact of the digital video industry on our daily lives.

This rapid evolution of the digital video industry has brought many new applications and techniques. Among them, video adaptation, video summarization and video compression are the rudimentary techniques. Sophisticated applications, like video indexing, browsing, archiving, cataloging, retrieving and rendering, are based on those fundamental ones. To date, the contradiction between the huge amount of video data and the limited manpower is an urgent problem to be solved. Thus, researchers and developers are seeking new approaches in these areas to improve efficiency and reduce the cost of human manipulation.

1.1.1 Video Adaptation

Video adaptation is the process to fit and render a source video onto a target display, when the resolution and/or aspect ratio of source and target are different. Due to the copious source resolutions and aspect ratios (Fig. 1-2), video adaptation is a fundamental and important technique, and is widely used in current consumer electronic markets.

Video adaptation can be performed without difficulty by applying some basic operations like center cropping, squeezing, or black-padding; the drawbacks of these approaches are obvious: fixed-window cropping will cause information loss on the frame boundary; squeezing will cause unpleasant geometric distortion; black-padding, while free of distortion and loss of boundary information, is a waste of the scarce display pixels. Meanwhile, as wide-screen video sources and displaying devices are gradually taking the place of the standard 4:3 videos and devices, some film companies employ human beings to manually adapt their films into many versions to fit different displays. This endeavor is obviously tedious, inefficient and costly. Thus, there emerges a strong desire for an automatic video adaptation algorithm which can intelligently adapt to the video contents.

In recent years, content-aware video adaptation has become a hot research topic that continually attracts researchers' interest. There exist two challenges for content-aware video adaptation. The first is how to establish a model to indicate the importance or saliency of video contents. The second is how to map the important contents such that users can simultaneously experience the least information loss and the best visual-friendliness.

1.1.2 Video Summarization

Video summarization is the process to extract an abstract of a video sequence and present it in a compact manner, which helps to enable a quick browsing of the video sequence and to achieve efficient content access and representation. Consider these situations: most people will be bored by an unedited and long home video captured by a non-professional; people may want to quickly grab the story progress of a TV series which has 40 episodes; or people would like to choose from a huge database of movies to quickly decide which one to watch. A tool that can automatically shorten the original video while intelligently preserving the most important and exciting video segments will be highly in demand.

Most of the earlier works do not consider the variation of video contents. By randomly or uniformly sampling the original video, a simple summarization can be obtained. However, such an arrangement may fail to capture the real video content, especially when it is highly dynamic. It can neither remove the redundancy in the story, nor keep the most salient segments. Thus, more sophisticated research is conducted to seek content-aware video summarization techniques. Fig. 1-3 shows an example where the conventional summarization technique fails to capture the dynamics of the video content. Like video adaptation, video summarization is also a challenging task for its highly subjective nature. Our research yields a promising solution for content modeling and video summarization.

1.1.3 Video Compression

Video compression, as the name suggests, is the technique to exploit the redundancy of a video sequence, thus to store or transmit it more efficiently. For decades, people have proposed many techniques for exploiting the redundancy both spatially and temporally by employing predictive coding, transform coding, entropy coding, etc. The state-of-the-art video compression standard, H.264/AVC, achieves huge rate-distortion gain over its preceding standards like H.263 and MPEG-2, by incorporating a number of sophisticated techniques including rate-distortion optimization (RDO), quarter-pel motion estimation, etc. However, it still lacks the ability to adjust the coding parameters for various contents, and thus sometimes becomes inflexible and sub-optimal. An example is shown in Fig. 1-4: although people usually pay higher attention to the red region (human face) than to the blue region (trees outside), they are treated equally in an H.264 encoder; and since the tree area has a rich texture, it may consume considerable coding bits. This is undesirable, especially when resources are limited. To overcome this problem, we need to address challenges such as content modeling and resource allocation. Our research provides a generic framework for content-aware video coding. Also, an improved intra coding scheme is proposed to improve the coding efficiency of H.264 intra prediction.

1.2 Outline

The outline of this dissertation is presented as follows, along with a summary of our contributions in these areas. Chapter 2 presents content-aware video adaptation techniques. In this chapter, we first overview traditional video adaptation methods and representative available approaches for content-aware video adaptation. Then, we address the two challenges of video adaptation respectively. For the first challenge, content modeling, we propose a nonlinear spatial-temporal saliency fusing approach that considers human perceptual characteristics. We incorporate features from both the spatial and temporal domains. The saliency maps, which are indicators of content importance, are fused nonlinearly to imitate the human perception process. For the second challenge, content mapping, we propose to take both intra-frame visual considerations and inter-frame visual considerations into account, where intra-frame considerations focus on measuring the information loss within a frame, and inter-frame considerations emphasize the visual smoothness between frames. We segment the whole video on a shot/subshot basis, where the subshots are fixed-length groups of consecutive video frames. The boundary frames in a shot/subshot are intra-frames, for which a novel content-aware cropping and scaling metric is proposed and the best mapping parameters are found by a hierarchical brute-force search within that frame. For the rest of the inter frames in the subshot, we minimize visual information loss accumulation under the constraint of visual consistency (inter-frame consideration). The optimization is formulated as a shortest-path problem in graph theory and we use dynamic programming to yield the temporal transition trace of cropping windows.

Chapter 3 presents content-aware video summarization approaches. In this chapter, we first overview some previous approaches of video summarization, both content-unaware and content-aware. Then, after analyzing the challenges of this problem and the limitations of existing works, we propose a novel approach to perform the video summarization task.

This approach includes unsupervised learning of original video-audio concepts and hierarchical (both frame and shot levels) skimming. We first define an intermediate cognitive-level term, concept primitive, to extract the structure of the original video by concept primitive mining. Viewing it as a clustering problem, we propose to use the bag-of-words (BoW) model for shot feature extraction, and use the scale-invariant feature transform (SIFT) to get the visual words and the matching-pursuit (MP) decomposition to generate the audio words, from both visual and audio sensory channels filtered with saliency masking; we then cluster them into several groups by spectral clustering. Each cluster represents a certain concept primitive. Next, we summarize the original video from a reconstruction point of view based on the learned concept primitives. While most researchers regard video summarization as a subtraction process, we regard it as a summation process. We propose a reconstruction reference tree (RRT) structure to efficiently represent the characteristics of a video sequence. By keeping at least one shot for each concept primitive, the concept integrity of the summarized video is guaranteed, offering viewers the capability of context recovery. In addition, given a specified skimming ratio, we generate a video that also contains the maximum achievable saliency accumulation (scalability). The summarization process is conducted in an iterative fashion, allowing flexible control of summarized video information richness vs. skimming ratio. Finally, to meet the skimming ratio specification and keep smooth transitions in the summarized video, we add a frame-level saliency thresholding followed by a temporally morphological operation as post processing.

Chapter 4 presents content-aware video compression techniques. In this chapter, we briefly overview some existing works in this field, and propose a generic content-aware coding framework for low bitrate video applications. The key idea of this framework is to regard content-aware coding as a resource allocation problem, where resources are allocated between intra and inter frames within a group-of-picture (GOP) structure.

We propose to treat different types of frames differently and thus adjust the quantization parameters correspondingly.

Chapter 5 presents a new intra coding algorithm to improve H.264/AVC intra prediction. We propose to analyze the binary transition points on a block boundary to implicitly derive the prediction direction. Meanwhile, a new encoding order scheme is incorporated. The transition-based intra coding can achieve up to 10% bitrate savings over the H.264 reference software JM. In addition, a novel video coding parameter analyzer is designed to assist the development of new coding algorithms for the next-generation video compression standard. The design philosophy is also introduced in this chapter.

Figure 1-1. Successful online video service providers: Youtube and Hulu.

Figure 1-2. Copious digital video applications and conventional video adaptation methods.

Figure 1-3. Content-aware video summarization vs. conventional video summarization. Bottom left: content-aware video summarization that extracts all concepts from the original video and removes the structure redundancy. Bottom right: conventional video summarization by uniform sampling, which fails to cover all concepts from the original video while still having redundancy.

Figure 1-4. Block-based conventional video coding technique gives equal weight to every macroblock. Red ellipse: human face. Blue rectangle: trees outside.

CHAPTER 2
CONTENT-AWARE VIDEO ADAPTATION

2.1 Introduction

Nowadays, the development of digital video applications has two opposite trends. On one hand, people savor the fantastic video contents delivered in cinemas and on high definition televisions (HDTVs) with resolutions higher than 1920x1080. On the other hand, people enjoy the flexibility of playing videos on their portable devices (iPhone, BlackBerry, etc.) with resolutions smaller than 480x320. These two trends evoke growing interest in automatic video adaptation that seeks to change the resolution (as well as aspect ratio) of video sources, generally from the larger to the smaller, while faithfully conveying the original information. The task of adapting and re-rendering a video onto an arbitrary display size or aspect ratio is termed video adaptation, or video retargeting. A compelling retargeting algorithm aims at preserving the viewers' experience when the resolution and/or aspect ratio changes. Video retargeting is naturally a challenging problem, as it is a very subjective task to map human cognition into an automated process.

2.1.1 Content-Unaware Video Adaptation

A straightforward way to perform this task is to fit an original video into a target display via resizing with black-padding. When the aspect ratio of the target display varies from the original display, two black bands are padded on the boundary of the target display to make the retargeted video free of geometric distortion. This approach, which is most often adopted in the current consumer electronics market, cannot efficiently utilize the scarce screen pixels of a small device. Another naive approach is to crop the original display via a fixed window (generally located in the center of the frame) of the target display size. This approach will fully utilize the screen of the target device, but risks losing important contents on the boundaries. Another approach is squeezing: when the aspect ratios of the original and target devices are different, the frame will experience some extent of geometric distortion, which makes viewers uncomfortable.

Fig. 2-2 presents examples of the content-unaware retargeting methods.

2.1.2 Previous Approaches of Content-Aware Video Adaptation

Although content-unaware video retargeting methods have the least computation costs and generally produce acceptable results, people are seeking more sophisticated solutions that can intelligently select the contents to present. However, there are two major challenges to deal with.

The first challenge for content-aware video retargeting approaches is: how to identify important contents? Although human perception is a biological and psychological process that is not yet fully understood, studies on the user-attention model have shown that such a model can be used to simplify the behaviors of the very complex human visual system. These studies, also referred to as saliency analysis, can be classified in the literature as top-down (object or event driven) or bottom-up, i.e. feature-driven [1, 2]. Features may come from both spatial and temporal domains. In the latter class, the spectral residue approach [2] suggests utilizing the spectral residue of an image to obtain the attention model. Later, Guo et al. showed that the phase spectrum alone is good enough as a spatial feature, and they extended [2] to video by incorporating motion features under a phase spectrum of quaternion Fourier transform (PQFT) framework [3] (we call it baseline-PQFT). This work provides a new insight for spatial-temporal saliency detection. However, putting the motion feature into a channel of a quaternion image lacks physical interpretation, as the motion channel is actually a derivative image, which is not commensurable with the other three color channels in the original image domain. Besides, the naive motion feature obtained by frame differencing does not truly reflect humans' perceptual experience: it is the local motion, not the global motion, that mostly triggers human interest. Considering this, many people use optical-flow based approaches [4] to factor out the global motion. While these methods are more reliable than naive differencing, they have to deal with the heavy computation burden and the aperture problem.

Besides, those approaches usually adopt a linear combination scheme to fuse features in different domains, where the weighting factors need to be carefully selected. The saliency analysis results in an output named the saliency map, which represents the conspicuity (or saliency) at every location in the original image by a scalar quantity and guides the attended locations. Given the saliency map, the content importance can be computed.

The second challenge for content-aware video retargeting approaches is: how to map important contents from the source to the target display? There are many possible solutions. To utilize the entire target display efficiently, [5], [6], [7] proposed content-aware retargeting methods. Based on the generated saliency map, they rearrange pixels in target frames: the original geometric layout is faithfully maintained for adjacent pixels with higher visual saliency, while other less salient pixels are morphologically squeezed to make up for the original-target display size difference. These methods work well on still images because viewers tend to concentrate on the salient areas and generally neglect other areas of little interest. They are most successful for images with natural sceneries. For example, curves (e.g. the profile of a mountain, streams, trees) are robust to geometric distortion. However, for images with objects whose shapes can be precisely expected (like buildings), these methods become disastrous, as their anamorphic strategy often results in unjustified object shape distortion. In order to avoid this problem, single frame optimization methods [4][8] are proposed, which apply a cropping window to pan throughout each original frame to yield a region of interest with the degradation inside minimized. Visual consistency is claimed by smoothing the optimal cropping window parameters of each frame. This endeavor, nevertheless, does little to remove visual inconsistency along the temporal axis, because the cropping window parameters of adjacent frames are optimized independently, and many twists and turns still exist on the window trace after smoothing. This produces obvious frame jumps back and forth, and zooms in and out, in the retargeted video, which leads to viewer vertigo very soon.

To carefully consider the visual experience along the temporal axis, a back-tracing method [9] is presented to dynamically determine the cropping window trace. This method adds another constraint to bound the possible shift of cropping windows among adjacent frames when optimizing the window trace. It produces a retargeted video with frame inconsistency removed and thus avoids viewers' discomfort. However, this method unfairly favors the initial location of the cropping window and clamps subsequent cropping window locations near the initial value. Thus, the window soon fails to crop the salient objects as frames go further, when the location of the salient objects is quite different from that of the first frame. Consequently, this method cannot handle videos with frequent content change. Some examples of existing content-aware retargeting methods are shown in Fig. 2-3.

2.1.3 Overview of the Proposed Approach

We address the two challenges respectively. For saliency detection, we propose a novel spatial-temporal saliency map based on nonlinear fusion. In our scheme, spatial saliency is detected by the phase spectrum of quaternion Fourier transform (PQFT) on a color image (video frame), which utilizes the multiple image channels as a vector field to exploit conspicuous spatial features (color, intensity, etc.); motion saliency is measured by local motion (global motion residue), where the global motion parameters are estimated by robust affine fitting with least median of squares (LMedS) [10], from a set of matched feature points detected by the Kanade-Lucas-Tomasi (KLT) feature tracker [11]. Unlike the dense optical flow approaches, the KLT tracker works on sparse feature points, and thus is more efficient. Then, the spatial and temporal saliency maps are nonlinearly fused. The innovation of this nonlinear fusion is based on human perceptual properties: 1) When excitation is absent (texture uniformly distributed), people tend to focus on the center of the frame, instead of the borders. 2) The human perception process consists of a stimulating phase and a tracking phase, defined as saccade and pursuit [12] in human vision theory. First, spatial-salient regions pop up as primitive stimuli.

If a spatial-salient region has significant local motion activities, this motion stimulus will strengthen the spatial stimulus and cause higher attention. Otherwise, lazy human eyes will continually focus on the spatial-salient regions. 3) Occasionally, spatial and motion saliency regions are not consistent. Our scheme treats this as a prohibited case, since the motion stimulus will distract the spatial stimulus and make a rapid change of focus points, which will cause eye fatigue. Thus, professional photographers and movie makers make efforts to avoid this situation.

For content mapping, we propose a framework to adapt real-life videos, which can be as long as an entire movie rather than merely a clip. Our framework originates from the cropping-and-scaling method developed by [4], but with new features. Note that viewers are not sensitive to abrupt cropping window changes for adjacent frames with rapid scene change at shot boundaries. We first detect shots [13][14] and then perform the task on each shot independently. A shot is then decomposed into subshots for visual comfort and computational efficiency. For each frame, a 3-parameter (scale and location) rigid cropping window is determined to select a region of interest as the retargeted frame. Within a shot, we propose a motion-prediction method to find an optimal fixed scale for the cropping window, as otherwise even a mild scale variation may cause significant visual degradation. Regarding the optimal location of the cropping window, we first process the two boundary frames of each subshot. Aiming at keeping as much fidelity to the original frame as possible, a cropping window is selected to minimize an information loss function due to cropping and resizing, which yields the source and destination locations of cropping windows in the subshot. For the other frames within, we address viewer visual expectations as contradictory intra-frame fidelity and inter-frame visual-inertness, and minimize an accumulative loss function including both information loss accumulation and visual-inertness loss accumulation. Then the dynamic trace of rigid cropping windows from the source to the destination locations is optimized as a shortest path problem with a dynamic programming solution. As subshots alternate, the destination locations are updated to synchronize with the content of the frames at that time.

Thus, as frames go further, the cropping window is not clamped onto the adjacency of the source location as in [9] and is still capable of cropping salient objects of interest. Our approach can be applied to any type and any length of video, no matter how fast the video content changes. Our retargeting results are free of shape distortion, have no annoying zoom in/out artifacts within the same scene, preserve the salient objects of interest throughout, and keep visual consistency as well. The computational load of our method includes the brute-force search at boundary frames and dynamic programming for the other frames. Besides, with C++ implementations, our approach can perform the task in real time.

Our innovation points also include content-aware information loss metrics, and a hierarchical search to find the optimal retargeting parameters on a single frame. Compared to the content-independent scaling penalties [4], our metric can adjust the scaling factor corresponding to different contents. Our scaling metric also outperforms the content-aware scaling metric in [8], as we take into account not only the anti-aliasing filter in [8] but also the true resizing process. The hierarchical search can greatly save computation costs.

Fig. 2-4 and Fig. 2-5 illustrate the procedure of our approach. Fig. 2-4 shows the saliency detection and optimal single frame adaptation by finding the best cropping window parameters, and Fig. 2-5 shows the dynamic programming approach for optimizing the trace of the cropping window dynamically. In the following sections, we will discuss our content modeling and content mapping algorithms in detail.

2.2 Spatial-Temporal Saliency Map

2.2.1 Spatial Saliency Map

Denote the n-th frame in the video sequence F^n. The frame can be represented as a quaternion image [15] which has four channels,

q^n = Ch_1^n + Ch_2^n \mu_1 + Ch_3^n \mu_2 + Ch_4^n \mu_3,

where \mu_i, i = 1, 2, 3 satisfy \mu_i^2 = -1, \mu_1 \perp \mu_2, \mu_2 \perp \mu_3, \mu_1 \perp \mu_3, \mu_3 = \mu_1 \mu_2, and Ch_j^n, j = 1, 2, 3, 4 are the channels of the quaternion image.

If choosing \mu_1 along the luminance axis, i.e., \mu_1 = (i + j + k)/\sqrt{3}, the color image is thus decomposed into luminance and chrominance components Y^n, C_b^n and C_r^n, and the quaternion image is pure (Ch_1 = 0) [15]. We can further represent q^n in symplectic form:

q^n = q_1^n + q_2^n \mu_2,   q_1^n = Ch_1^n + Ch_2^n \mu_1,   q_2^n = Ch_3^n + Ch_4^n \mu_1.

The quaternion Fourier transform (QFT) of the quaternion image q^n(x, y) can be calculated by two complex Fourier transforms of the symplectic parts:

Q^n[u, v] = Q_1^n[u, v] + Q_2^n[u, v] \mu_2.

The forward and inverse Fourier transforms of each part are

Q_i^n[u, v] = \frac{1}{\sqrt{WH}} \sum_{y=0}^{W-1} \sum_{x=0}^{H-1} e^{-\mu_1 2\pi (yv/W + xu/H)} q_i^n(x, y)     (2-1)

q_i^n(x, y) = \frac{1}{\sqrt{WH}} \sum_{v=0}^{W-1} \sum_{u=0}^{H-1} e^{\mu_1 2\pi (yv/W + xu/H)} Q_i^n[u, v]     (2-2)

where (x, y) is the spatial location of each pixel, W and H are the image's width and height, and [u, v] is the frequency.

The phase spectrum of Q^n[u, v] (Q for abbreviation) can be calculated by Q_P = Q / \|Q\|. Taking the inverse transform of the phase spectrum Q_P as in Eq. (2-2), the spatial saliency map is obtained by smoothing the squared L_2 norm of q_P with a two-dimensional Gaussian smoothing filter g:

SM_s = g * \|q_P\|^2     (2-3)

The advantage of the PQFT approach over the traditional multi-channel phase spectrum of the 2D Fourier transform (PFFT) is shown in Fig. 2-6. PQFT not only achieves better saliency detection results by treating a color image as a vector field, but also consumes less computation time, since only two complex 2D Fourier transforms are conducted for the symplectic parts, while PFFT needs three (one for each channel).
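
To make the phase-only computation concrete, the following is a minimal C++/OpenCV sketch of a single-channel (luminance) phase-spectrum saliency map. It is a PFFT-style simplification for illustration, not the full quaternion transform: the PQFT applies the same keep-the-phase, drop-the-magnitude step to the two complex FFTs of the symplectic parts q_1^n and q_2^n. The smoothing sigma and the final normalization are illustrative choices.

```cpp
// Phase-spectrum saliency on a single luminance channel (a sketch of Eq. 2-3).
#include <opencv2/opencv.hpp>

cv::Mat phaseSpectrumSaliency(const cv::Mat& frameBGR) {
    cv::Mat gray, f;
    cv::cvtColor(frameBGR, gray, cv::COLOR_BGR2GRAY);
    gray.convertTo(f, CV_32F);

    // Forward FFT: pack the real image into a 2-channel (re, im) array.
    cv::Mat planes[] = {f, cv::Mat::zeros(f.size(), CV_32F)};
    cv::Mat freq;
    cv::merge(planes, 2, freq);
    cv::dft(freq, freq);

    // Keep only the phase: divide every coefficient by its magnitude.
    cv::split(freq, planes);
    cv::Mat mag;
    cv::magnitude(planes[0], planes[1], mag);
    mag += 1e-8f;                         // avoid division by zero
    planes[0] /= mag;
    planes[1] /= mag;
    cv::merge(planes, 2, freq);

    // Inverse FFT, squared L2 norm, then Gaussian smoothing.
    cv::idft(freq, freq, cv::DFT_SCALE);
    cv::split(freq, planes);
    cv::Mat sal = planes[0].mul(planes[0]) + planes[1].mul(planes[1]);
    cv::GaussianBlur(sal, sal, cv::Size(0, 0), 8.0);
    cv::normalize(sal, sal, 0.0, 1.0, cv::NORM_MINMAX);
    return sal;                           // CV_32F saliency map in [0, 1]
}
```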

2.2.2 Motion Saliency Map

There are two steps to obtain the motion saliency map: a) the Kanade-Lucas-Tomasi (KLT) tracker to get a set of matched good feature points [11], and b) robust affine parameter estimation by least median of squares (LMedS) [16]. Denote the displacement of a point x = (x, y)^T from the previous frame F^{n-1} to the current frame F^n as d = (d_x, d_y)^T. A six-parameter affine model is adopted to estimate the global motion: d = Dx + t, where t is the translation vector t = (t_x, t_y)^T and D is a 2x2 deformation matrix. Good features are located by checking the minimum eigenvalue of each gradient matrix, and the good features are tracked using a Newton-Raphson method. The point x in F^{n-1} moves to the point x' = Ax + t in F^n, where A = I + D and I is a 2x2 identity matrix. The model parameters are estimated by minimizing the dissimilarity in each feature window W:

\epsilon = \iint_W (F^n(Ax + t) - F^{n-1}(x))^2 w(x) \, dx     (2-4)

where w(x) is a weighting function. We adopt least median of squares to estimate the affine parameters robustly [16]. The globally compensated image is generated by warping with the estimated \hat{A} and \hat{t}. The absolute difference of the original frame with its globally compensated version is used to generate the motion saliency map:

SM_m = g * |F^{n-1}(x) - F^n(\hat{A}^{-1}[x - \hat{t}])|     (2-5)
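
A minimal sketch of this two-step procedure with OpenCV (3.x or later) is given below. cv::goodFeaturesToTrack and cv::calcOpticalFlowPyrLK stand in for the KLT tracker, and cv::estimateAffine2D with the cv::LMEDS flag provides the robust affine fit; the feature count, quality threshold and smoothing sigma are illustrative.

```cpp
// Motion saliency by global-motion compensation (a sketch of Eq. 2-5).
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat motionSaliency(const cv::Mat& prevGray, const cv::Mat& currGray) {
    // a) Track sparse feature points from the previous to the current frame.
    std::vector<cv::Point2f> prevPts, currPts;
    cv::goodFeaturesToTrack(prevGray, prevPts, 400, 0.01, 8);
    std::vector<uchar> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, currGray, prevPts, currPts, status, err);

    std::vector<cv::Point2f> from, to;
    for (size_t i = 0; i < status.size(); ++i)
        if (status[i]) { from.push_back(prevPts[i]); to.push_back(currPts[i]); }
    if (from.size() < 3) return cv::Mat::zeros(prevGray.size(), CV_32F);

    // b) Robust six-parameter affine fit (LMedS) for the global motion.
    cv::Mat A = cv::estimateAffine2D(from, to, cv::noArray(), cv::LMEDS);
    if (A.empty()) return cv::Mat::zeros(prevGray.size(), CV_32F);

    // Warp the previous frame by the estimated global motion and take the
    // absolute difference against the current frame: what survives is
    // (mostly) local motion, i.e. the global-motion residue.
    cv::Mat warped, diff;
    cv::warpAffine(prevGray, warped, A, prevGray.size());
    cv::absdiff(warped, currGray, diff);
    diff.convertTo(diff, CV_32F);
    cv::GaussianBlur(diff, diff, cv::Size(0, 0), 8.0);
    cv::normalize(diff, diff, 0.0, 1.0, cv::NORM_MINMAX);
    return diff;                          // CV_32F motion saliency map
}
```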

2.2.3 Nonlinear Fusion of Spatial-Temporal Saliency Map

When both the spatial and temporal saliency maps are available, the final saliency map is generated in a spatially masked nonlinear manner which imitates human vision features. First, a 2D Gaussian layer G centered at the frame center is fused into the spatial saliency map: SM_s = SM_s \cdot G. A binary mask M_s of spatial saliency significance is generated by thresholding. The final saliency map SM is obtained by

SM = \max(SM_s, SM_m \cap M_s)     (2-6)

The reasons to use the max operator with a binary-masked motion saliency map and a Gaussian layer are many. 1) The Gaussian layer is used to adjust the descending importance from the center of a frame to the border. 2) The mask is used to exclude the spatial-temporal inconsistent cases (the prohibited cases). 3) The mask enhances the robustness of the spatial-temporal saliency map when the global-motion parameters are not estimated correctly. 4) The max operation avoids the depression of insignificant salient regions caused by renormalization when using a linear combination scheme. 5) The max operation avoids the selection of a weighting factor between the spatial and motion saliency maps. Fig. 2-7 shows the comparison of linear combination, naive nonlinear (unmasked max) fusion and our scheme, where in the first video the global motion parameters are correctly estimated but in the second video they are wrong. The comparison shows the robustness of our scheme.
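
The fusion itself reduces to a few array operations. The sketch below renders Eq. (2-6) under the assumption that the two maps come from the previous sketches and share size and type; the Gaussian-layer width and the mask threshold are illustrative guesses.

```cpp
// Nonlinear fusion of spatial and motion saliency (a sketch of Eq. 2-6).
#include <opencv2/opencv.hpp>

cv::Mat fuseSaliency(const cv::Mat& sms, const cv::Mat& smm) {
    // 2D Gaussian layer centered at the frame center (separable outer product).
    cv::Mat gx = cv::getGaussianKernel(sms.cols, sms.cols / 3.0, CV_32F);
    cv::Mat gy = cv::getGaussianKernel(sms.rows, sms.rows / 3.0, CV_32F);
    cv::Mat G = gy * gx.t();
    cv::normalize(G, G, 0.0, 1.0, cv::NORM_MINMAX);

    cv::Mat smsCentered = sms.mul(G);     // center-weighted SM_s

    // Binary mask M_s of spatially significant pixels (threshold is a guess).
    cv::Mat Ms;
    cv::threshold(smsCentered, Ms, 0.2, 1.0, cv::THRESH_BINARY);

    // SM = max(SM_s, SM_m masked by M_s).
    cv::Mat SM;
    cv::max(smsCentered, smm.mul(Ms), SM);
    return SM;
}
```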

2.3 Intra-Frame Visual Consideration

Our retargeting framework starts with single frame adaptation on the first and last frames of a subshot. In this phase, we only consider intra-frame information loss, and our target is to minimize the information loss caused by retargeting. Thus, the first important step is to objectively model the information losses within one frame. We call this the intra-frame visual consideration.

2.3.1 Content-Aware Information Loss Metrics

To better quantify the information losses within a frame, we consider both the saliency loss due to cropping and the scaling penalty. We propose a content-aware information loss metric L, which consists of two terms. The first term L_c is the content-aware cropping loss and the second term L_s is the content-aware scaling loss:

L = (1 - \lambda) L_c + \lambda L_s     (2-7)

L_c = 1 - \sum_{(x,y) \in W} SM(x, y)     (2-8)

L_s = \sum_{(x,y) \in W} (F(x, y) - \hat{F}(x, y))^2     (2-9)

where SM is normalized such that \sum_{(x,y)} SM(x, y) = 1, W is the cropping area, and \hat{F} = upsizing(g * downsizing(F, s), s). \lambda is a factor to balance the importance of cropping and scaling; it is adjustable to the user's preference. The proposed L_s better measures the scaling loss for different contents than [4, 8], by taking into account the real resizing process. Integral images are used for computing L_c and L_s by a lookup operation on the integral images.
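
As a concrete illustration of this shortcut, the sketch below evaluates the cropping loss L_c of Eq. (2-8) for an arbitrary candidate window in O(1) after a single cv::integral pass over the normalized saliency map; the window parameterization is an assumption for illustration.

```cpp
// Cropping loss L_c via an integral image (a sketch of Eq. 2-8).
#include <opencv2/opencv.hpp>

// Build once per frame: integral image of the normalized saliency map.
cv::Mat saliencyIntegral(const cv::Mat& SM) {       // SM: CV_32F, sums to 1
    cv::Mat I;
    cv::integral(SM, I, CV_64F);                    // (rows+1) x (cols+1)
    return I;
}

// Saliency captured by window (x, y, w, h), looked up in O(1).
double croppingLoss(const cv::Mat& I, int x, int y, int w, int h) {
    double kept = I.at<double>(y + h, x + w) - I.at<double>(y, x + w)
                - I.at<double>(y + h, x)     + I.at<double>(y, x);
    return 1.0 - kept;                              // L_c = 1 - sum(SM in W)
}
```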

2.3.2 Hierarchical Search for Optimal Single-Frame Cropping Window

Under a cropping-scaling framework, the retargeting optimization for a single frame is to find the best parameters (x, y, s) with minimum information loss, where (x, y) is the location of the top-left point of the cropping window over the original frame, and s is the scaling factor. Isotropic scaling is used to avoid geometric distortion. After the information loss metric is well formulated, a hierarchical brute-force search is used to find the best retargeting window parameters (\hat{x}, \hat{y}, \hat{s}):

P(\hat{x}, \hat{y}, \hat{s}) = \arg\min_{x,y,s} L(x, y, sW_t, sH_t)     (2-10)

Note that the search range of (x, y) is constrained by s, and s is constrained by 1 \le s \le \min(W_s/W_t, H_s/H_t), where W_s, W_t, H_s, H_t are the widths and heights of the source and target frames, respectively. The searching range is a group of surfaces in (x, y, L) space. Fig. 2-8 shows the searching space. Each surface corresponds to a particular scaling factor s. The point which yields the minimum L (the lowest point in the space) gives the best parameters (x, y), and the surface it belongs to gives the best s.

For computation saving purposes, the search for (x, y, s) is first done on a coarse (x, y) grid (a 10x10 grid search will save 99% computation over a 1x1 grid search); a target parameter set (x_1, y_1, s) is found after the coarse search. The second search is a fine search within a range around (x_1, y_1) with the fixed scaling factor s found previously. After the hierarchical search, the best parameter set (x_2, y_2, s) is obtained.

Considering that humans are very sensitive to scale variation even of a modest value, we alternatively determine a good scale \hat{s} using the method described in Sec. 2.4.2. The scale parameter is then fixed at \hat{s} throughout a shot. The three-dimensional optimization problem is reduced to a two-dimensional search for the optimal (x, y) instead:

P(\hat{x}, \hat{y}) = \arg\min_{x,y} L(x, y, \hat{s}W_t, \hat{s}H_t)     (2-11)
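
The coarse-to-fine procedure can be sketched as follows; infoLoss() stands for the metric L of Eq. (2-7) evaluated at the fixed scale, and the 10-pixel grid step and search radius mirror the description above.

```cpp
// Two-stage (coarse-then-fine) cropping window search (a sketch of Sec. 2.3.2).
#include <algorithm>
#include <opencv2/opencv.hpp>

struct WindowPos { int x = 0, y = 0; double loss = 1e30; };

WindowPos searchWindow(const cv::Mat& SM, int winW, int winH,
                       double (*infoLoss)(const cv::Mat&, int, int, int, int)) {
    const int maxX = SM.cols - winW, maxY = SM.rows - winH;
    WindowPos best;

    // Stage 1: coarse scan on a 10x10 grid.
    for (int y = 0; y <= maxY; y += 10)
        for (int x = 0; x <= maxX; x += 10) {
            double l = infoLoss(SM, x, y, winW, winH);
            if (l < best.loss) best = {x, y, l};
        }

    // Stage 2: fine 1-pixel scan in a neighborhood of the coarse optimum.
    WindowPos coarse = best;
    for (int y = std::max(0, coarse.y - 10); y <= std::min(maxY, coarse.y + 10); ++y)
        for (int x = std::max(0, coarse.x - 10); x <= std::min(maxX, coarse.x + 10); ++x) {
            double l = infoLoss(SM, x, y, winW, winH);
            if (l < best.loss) best = {x, y, l};
        }
    return best;
}
```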

2.4 Inter-Frame Visual Consideration

A significant difference of the video retargeting task from resizing a still image is the temporal considerations, i.e., if we only minimize intra-frame visual information loss, the resultant video still suffers from annoying jitters due to independent but inconsistent window parameters. Here we also take into account the fact that viewers need a steady and smooth video content transition, known as the visual inertness requirement. Note that across adjacent frames, a shift of cropping windows imposes artificial camera motion on the retargeted frames. On one hand, an absolutely free inter-frame shift makes it possible to crop and then preserve the most salient region of each different frame. On the other hand, visual inertness favors a modest inter-frame shift, or no shift at best. In our approach, we consider these contradictory visual comfort clues together and optimize their total value. To measure the visual performance with respect to the location of the cropping window, we define a function of visual penalty accumulation within a subshot as in Eq. (2-12):

Q(\bar{x}_N, \bar{y}_N) = \sum_{i=1}^{N} L(x_i, y_i) + \sum_{i=2}^{N} EI(x_{i-1}, y_{i-1}, x_i, y_i),
EI(x_{i-1}, y_{i-1}, x_i, y_i) = (x_i - x_{i-1})^2 + (y_i - y_{i-1})^2     (2-12)

where L is the intra-frame visual information loss of frame i, and EI is the temporal penalty that constrains the shift of panes across adjacent frames. (x_i, y_i) is the location of the upper-left corner of the cropping pane of frame i and N is the total number of frames in a subshot. (\bar{x}_N, \bar{y}_N) is a dynamic trace of the upper-left corner of the cropping window over a subshot. Our goal is to find the optimal trace (\hat{x}_N, \hat{y}_N) such that Q is minimized.

2.4.1 Dynamic Programming Solution for Optimization of Cropping Window Parameters

We model the solution space (\bar{x}_N, \bar{y}_N) = \{x_i, y_i\}_{i=1}^{N} by a graph illustrated in Fig. 2-9, where each node (x_i, y_i) denotes the upper-left corner location of a candidate cropping window of frame i and each edge (x_{i-1}, y_{i-1}) -> (x_i, y_i) represents the shift of the cropping window from frame i-1 to frame i. The cost on each node is the visual information loss L(x_i, y_i), and for each edge, the cost corresponds to the temporal penalty EI(x_{i-1}, y_{i-1}, x_i, y_i). Thus minimizing Q in Eq. (2-12) is equivalent to finding the shortest path from node (x_1, y_1) to (x_N, y_N). The optimization can be easily solved by dynamic programming (DP). The recursive form of the objective function is given in Eq. (2-13):

Q(\bar{x}_i^k, \bar{y}_i^k) = \min_j \{Q(\bar{x}_{i-1}^j, \bar{y}_{i-1}^j) + EI(x_{i-1}^j, y_{i-1}^j, x_i^k, y_i^k)\} + L(x_i^k, y_i^k)     (2-13)

where Q(\bar{x}_1^1, \bar{y}_1^1) = 0, and Q(\bar{x}_i^k, \bar{y}_i^k) denotes the minimized cost accumulation, or equivalently the shortest path, from the source node (x_1^1, y_1^1) of frame 1 to the k-th node of frame i. Q(\bar{x}_{i-1}^j, \bar{y}_{i-1}^j) is the shortest path up to the j-th node of frame i-1, EI(x_{i-1}^j, y_{i-1}^j, x_i^k, y_i^k) denotes the cost of the edge connecting the j-th node of frame i-1 to the k-th node of frame i, and L(x_i^k, y_i^k) is the cost of the k-th node of frame i. Algorithm 1 presents the algorithm to find the shortest path between the source and destination nodes.

The remaining question is how to choose the two boundary nodes as the source and destination between which a shortest path is searched. As mentioned before, a shot is divided into equal-length subshots. We assign the destination as the location of the optimized cropping window in Eq. (2-11), and the source as the location of the destination of the previous subshot, to avoid jitter between subshots. By this measure, at every subshot we update the cropping window to a free position that crops the most salient area of the original frame and preserves the most visual information at that time. Meanwhile, the dynamic programming method yields a cropping window transition trace from the start to the end of the subshot, with consideration of both least visual information loss and least visual inconsistency.

Algorithm 1. Optimal cropping window trace within a subshot.

    input : source node (x_1, y_1) and destination node (x_N, y_N) of the cropping window;
            video frames {I_i}, i = 1, ..., N, of the subshot; the number of candidate
            nodes C[i] of frame i
    output: optimal trace (x̂_N, ŷ_N) as the shortest path from source to destination

    x_1^1 ← x_1; y_1^1 ← y_1; Q(x̄_1^1, ȳ_1^1) ← 0; (x̄_1^1, ȳ_1^1) ← (x_1^1, y_1^1)
    for i ← 2 to N do
        extract from the video the i-th frame of the subshot as I[i]
        calculate the saliency map SM[i] of I[i]
        for k ← 1 to C(i) do
            calculate the cost of node (x_i^k, y_i^k) as L(x_i^k, y_i^k)
            T_opt ← ∞
            for j ← 1 to C(i-1) do
                calculate the edge cost EI(x_{i-1}^j, y_{i-1}^j, x_i^k, y_i^k)
                T(x̄_i^k, ȳ_i^k; j) ← Q(x̄_{i-1}^j, ȳ_{i-1}^j) + EI(x_{i-1}^j, y_{i-1}^j, x_i^k, y_i^k)
                if T(x̄_i^k, ȳ_i^k; j) < T_opt then
                    T_opt ← T(x̄_i^k, ȳ_i^k; j); record j as the best predecessor of node k in frame i
                end
            end
            Q(x̄_i^k, ȳ_i^k) ← T_opt + L(x_i^k, y_i^k)
        end
    end
    backtrack from the destination node (x_N, y_N) along the recorded best predecessors to obtain (x̂_N, ŷ_N)
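
A compact C++ rendering of Algorithm 1 is sketched below. The per-frame candidate nodes and their costs L are assumed precomputed; the quadratic edge penalty implements EI. For simplicity the sketch backtracks from the cheapest node of the last frame, whereas the algorithm above fixes the destination node.

```cpp
// Dynamic-programming shortest path over cropping-window candidates
// (a sketch of Algorithm 1). nodeCost[i][k] holds L(x_i^k, y_i^k).
#include <limits>
#include <utility>
#include <vector>

using Node = std::pair<int, int>;   // upper-left corner (x, y)

std::vector<Node> optimalTrace(
        const std::vector<std::vector<Node>>& nodes,       // candidates per frame
        const std::vector<std::vector<double>>& nodeCost)  // L per candidate
{
    const size_t N = nodes.size();
    std::vector<std::vector<double>> Q(N);
    std::vector<std::vector<int>> pred(N);                 // best predecessor

    Q[0] = nodeCost[0];
    pred[0].assign(nodes[0].size(), -1);

    for (size_t i = 1; i < N; ++i) {
        Q[i].assign(nodes[i].size(), std::numeric_limits<double>::infinity());
        pred[i].assign(nodes[i].size(), -1);
        for (size_t k = 0; k < nodes[i].size(); ++k) {
            for (size_t j = 0; j < nodes[i - 1].size(); ++j) {
                double dx = nodes[i][k].first  - nodes[i - 1][j].first;
                double dy = nodes[i][k].second - nodes[i - 1][j].second;
                double t = Q[i - 1][j] + dx * dx + dy * dy;  // EI edge cost
                if (t < Q[i][k]) { Q[i][k] = t; pred[i][k] = (int)j; }
            }
            Q[i][k] += nodeCost[i][k];                       // add node cost L
        }
    }

    // Backtrack from the cheapest node of the last frame.
    int k = 0;
    for (size_t m = 1; m < nodes[N - 1].size(); ++m)
        if (Q[N - 1][m] < Q[N - 1][k]) k = (int)m;
    std::vector<Node> trace(N);
    for (int i = (int)N - 1; i > 0; --i) { trace[i] = nodes[i][k]; k = pred[i][k]; }
    trace[0] = nodes[0][k];
    return trace;
}
```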
2.4.2 Optimize Scale in a Shot

The choice of the cropping window scale reflects the trade-off between cropping and resizing, and the preferred trade-off depends on the scene type. In most close shots, where salient objects occupy large areas, the intent is to portray foreground salient objects in high resolution. So we assume that most viewers expect a cropping window with complete objects inside, as people usually prefer a global view with a broad visible range at the price of resolution, rather than access to only a limited area. Here resizing is preferred over cropping. On the contrary, in most long-distance shot scenes (e.g. sports broadcasting), where salient objects occupy small areas, most viewers would like to focus on and track the object without huge resolution degradation; otherwise, objects become too small to recognize. Here cropping becomes preferable to resizing.

Based on the aesthetic requirement, we specify an initial weight \lambda, find the optimized scales of some sampled frames based on Eq. (2-10), and average them as the scale of the shot. Mostly, this simple method works fine; however, when a salient object is moving fast, the cropping window may not move fast enough to catch up with the object due to the visual consistency constraint. This cuts off some parts of the object, and suggests that a larger scale of the cropping window is needed. We use the velocity of the dynamic cropping window transition within a shot to estimate how fast objects of interest move. Then, based on the velocity estimate, we adjust the weight \lambda in order to obtain a larger scale, which yields a larger cropping window that includes salient objects completely:

\lambda' = \left(1 + \exp\left\{\left|\frac{1}{N}\sum_{i=1}^{N}\frac{(\hat{x}_i - \hat{x}_{i-1})^2 + (\hat{y}_i - \hat{y}_{i-1})^2}{L_i^2} - v\right|\right\}\right)^{-1}     (2-14)

where \frac{1}{N}\sum_{i=1}^{N}\frac{(\hat{x}_i - \hat{x}_{i-1})^2 + (\hat{y}_i - \hat{y}_{i-1})^2}{L_i^2} is the velocity estimate of the cropping window transition, N is the total number of frames in the shot, L_i is the maximum distance a cropping window can move from (x_{i-1}, y_{i-1}), and v denotes a reference velocity. Given the updated weight \lambda', a new scale average is optimized for the shot. Then we start over to find the optimal trace of the cropping window under the new scale.
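
The weight update of Eq. (2-14) is a short computation once the optimized trace is available; a sketch, assuming the per-frame maximum window moves L_i and the reference velocity v are supplied:

```cpp
// Weight update of Eq. (2-14): a sketch. traceX/traceY is the optimized window
// trace of the shot, Lmax[i] the per-frame maximum window move, vRef the
// reference velocity (a tuning constant).
#include <cmath>
#include <vector>

double updateWeight(const std::vector<double>& traceX,
                    const std::vector<double>& traceY,
                    const std::vector<double>& Lmax, double vRef) {
    const size_t N = traceX.size();
    double vel = 0.0;
    for (size_t i = 1; i < N; ++i) {
        double dx = traceX[i] - traceX[i - 1], dy = traceY[i] - traceY[i - 1];
        vel += (dx * dx + dy * dy) / (Lmax[i] * Lmax[i]);
    }
    vel /= N;                                 // velocity estimate of the trace
    return 1.0 / (1.0 + std::exp(std::fabs(vel - vRef)));   // lambda'
}
```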

2.5 Experimental Results

We implement our scheme in C++ with the OpenCV (http://opencv.willowgarage.com/wiki/), FFTW3 (http://www.fftw.org/) and KLT (http://www.ces.clemson.edu/~stb/klt/) libraries, and design a series of experiments to test the performance of our approach compared to the representative existing work on saliency modeling and retargeting. The experiments are carried out on various types of videos, including movie, entertainment, news, sports, etc. Original videos can be of any size and length. Viewers are allowed to specify any retargeted display size or customized aspect ratio. We evaluated our content-aware video adaptation framework by evaluating the performance of our content modeling algorithm first. Then, we evaluate the proposed content mapping algorithm with the saliency maps given. We present the experimental results in two ways. Since content modeling and video adaptation are highly subjective tasks, results are generally shown in figure illustrations without objective performance quantification. Therefore, our first way to present the experimental results is to use figure illustrations, as most of the existing works did to show their results. In addition, we go beyond the previous works by conducting subjective tests on the experimental results to collect mean opinion scores (MOS), which play the role of objectively evaluating the algorithms' performances. By conducting statistical analysis on the MOS of the results, we are able to draw a conclusion by hypothesis testing to quantify the performances of the algorithms with a certain confidence level. We then record the hypothesis testing decision as the second way to present the experimental results.

2.5.1 Figure Illustrations

In the following, we present the experimental results with figure illustrations. The first illustration, Fig. 2-13, shows the content modeling results (saliency maps) generated by humans, by the saliency toolbox (STB, http://www.saliencytoolbox.net/) and by our content modeling algorithm, respectively. We use a collection of images with multiple resolutions and aspect ratios for the experiments. Besides the saliency maps, we also illustrate the so-called proto-regions, which are found by a thresholding method in [2] using the saliency maps, to show the contours of salient regions on the original images, as shown in the red-circled regions on the original images in Fig. 2-13. By comparing the saliency maps generated by STB and our algorithm with the manually generated labels (generated by us as viewers), we may tentatively claim that the proposed algorithm outperforms STB, as it better approximates humans' viewing experiences.

PAGE 37

By comparing the saliency maps generated by STB and by our algorithm with the manually generated labels (generated by us as viewers), we may tentatively claim that the proposed algorithm outperforms STB, as it better approximates humans' viewing experiences. For example, for the fourth image, which shows two children playing at a beach with a sailing boat in the sea, our algorithm successfully extracts the regions of the children and the boat, which are also the regions on which human eyes focus. The STB, on the contrary, is only able to capture the line between the sea and the sky. Later, we conduct subjective tests and statistical analysis to validate this claim. The details are presented in Sec. 2.5.2.

The second illustration, Fig. 2-10, shows the performance of image adaptation algorithms. Regarding image adaptation as single-frame video adaptation, we compare the performance of our adaptation algorithm with two representative image retargeting algorithms: bidirectional similarity (BS) [7], a patch-based approach that maximizes the similarity score of patches from source and target images bidirectionally, and seam carving (SC) [5], which aims at maximizing the total energy on the target image by removing low-energy seams from the original image. We use the test images from [7]. We can clearly see that our adaptation algorithm most faithfully keeps the geometric shapes of objects in the original while presenting the most salient regions -- the dolphin, the building and the house. SC and BS have to bear geometric distortions, which are not desirable.

The third illustration is Fig. 2-11, which presents the comparison of video adaptation by our content mapping algorithm using our saliency map and using the saliency map generated by baseline-PQFT [3], respectively. Although our saliency map is an extension of the baseline-PQFT and shares a lot of similarities with it, the saliency maps of our method incorporate the motion features better, as illustrated in Fig. 2-11 (the moving human faces, tennis player and bee). Therefore, our content mapping algorithm can benefit from the more accurate saliency maps and produce better adapted videos.
The last illustration in this section is Fig. 2-12. It serves the purpose of showing the visual-friendliness characteristic of our adaptation algorithm, compared to two state-of-the-art content-aware video adaptation methods: single frame smoothing (SFS) [4][8] and backtracing (BT) [9]. Generally, SFS suffers from jittering, which causes uncomfortable feelings in viewers. Backtracing is mostly acceptable; however, the adapted video is not always able to preserve the salient regions of interest in the original video. In comparison, our method preserves the salient region throughout as the video advances, and avoids jitter effects as well. Fig. 2-12 presents the result comparison in a static fashion: we illustrate cropping windows on original frames with the frame numbers noted. The original video is of resolution 640x352 and the specified retargeted size is 320x240. An initial weight $\alpha$ (the cropping/resizing preference) of 0.3 is provided, and the subshot length is 120 frames. In the results of SFS, although the lion and zebra are preserved completely, the cropping window shifts back and forth frequently, which means severe jitter effects in the retargeted video. In the results of BT, from frame #238 to #259 the cropping window includes the complete zebra; however, as the video advances to frames #294 and #318, the window is left behind by the zebra due to its fast motion, so most parts of the zebra are lost in the retargeted video. In contrast, our result yields a visually consistent cropping window trace that preserves the zebra completely. In order to draw a convincing conclusion that our algorithm outperforms BT, we conducted subjective tests on video sequences adapted by our algorithm and by BT. The details are presented in Sec. 2.5.2.

2.5.2 Subjective Evaluations

As we mentioned at the beginning of this section, we go beyond the previous works on presenting experimental results by conducting subjective tests and statistical analysis on the results. We carry out the subjective evaluations in the form of an online survey. In this section, we present the subjective tests in detail.

The first subjective evaluation is on the saliency maps in Fig. 2-13. The purpose of this test is to quantitatively measure, within a certain confidence interval, whether our algorithm outperforms the STB in saliency map generation.
We set up a website (http://www.mcn.ece.ufl.edu/public/subjective/) to provide the testing materials for the experiment. On the website, we describe the purpose of the test and explain how to evaluate the saliency maps. For each of the nine original images, the two saliency maps generated by our content-aware algorithm and by the STB are both presented to the participants. Throughout this evaluation process, the participants are blind to the names of the algorithms so as to avoid possible bias. Each participant is instructed to give each saliency map a score ranging from 1 to 5, 1 for the worst and 5 for the best, to indicate how good it is. The participants need not make hasty decisions; they may take their time to make careful comparisons until they are confident about the score they have in mind.

After that, we collect the mean opinion scores (MOS) submitted by 60 participants and conduct statistical analysis on them. We propose to use hypothesis testing, a popular statistical method to scientifically evaluate whether there is enough statistical evidence in the experimental data to make a decision on a hypothesis. Since each participant evaluates the saliency maps of both algorithms, the paired-samples Student t-test is considered most suitable for the analysis. The assumption on the experimental data is that the scores of the two saliency maps by the two algorithms for image $i$ follow normal distributions with the same variance and different means $\mu_{i1}$ and $\mu_{i2}$, where $\mu_{i1}$ is the mean of the scores for our algorithm. We state the null hypothesis as $H_0: \mu_{i1} \le \mu_{i2}$, which is interpreted as our algorithm being no better than the STB; the alternative hypothesis is $H_1: \mu_{i1} > \mu_{i2}$, which is interpreted as our algorithm outperforming the STB. We take the significance level $\alpha = 0.05$, so the confidence level of our tests is 95%. We calculate the p-value for each test, and then decide to reject the null hypothesis if the p-value is smaller than or equal to the significance level $\alpha$.
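The per-image analysis can be reproduced with a few lines of SciPy. The sketch below, with illustrative names, performs the one-tailed paired t-test at $\alpha = 0.05$; note that SciPy's ttest_rel reports a two-tailed p-value, which is halved after checking the sign of t.

    from scipy import stats

    def paired_one_tailed_ttest(scores_ours, scores_stb, alpha=0.05):
        # H0: mu_ours <= mu_stb  vs.  H1: mu_ours > mu_stb, on 60 paired scores.
        t_stat, p_two = stats.ttest_rel(scores_ours, scores_stb)
        p_one = p_two / 2 if t_stat > 0 else 1 - p_two / 2
        reject_h0 = p_one <= alpha        # "Decision 1" in Table 2-1
        return t_stat, p_one, reject_h0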

The results of the hypothesis testing are summarized in Table 2-1, where Decision 1 means rejection of the null hypothesis. It is shown that for all nine images, we reject the null hypothesis that our algorithm is not better than the STB. Thus, we can conclude that our algorithm consistently performs better than the STB with 95% confidence. Fig. 2-14 is the bar chart that depicts the scores' statistics for the saliency maps by the two algorithms. Except for image 5, the other eight images consistently exhibit non-overlapping confidence intervals between STB and ours. The results shown in Fig. 2-14 comply well with the hypothesis testing results from Table 2-1. We also make some interesting observations from Fig. 2-14. For example, the worst saliency maps generated by the STB are for images 4, 6 and 8, which are in fact the ones most dissimilar to the manually generated labels. And for images 1, 5 and 9, where the STB and our algorithm have the closest performances, the two saliency maps and our manual label resemble each other. This phenomenon shows that the participants have viewing experiences similar to ours.

Table 2-1. Hypothesis testing results for subjective evaluation of saliency detection algorithms
(1: our approach; 2: Saliency Toolbox (STB); alpha = 0.05, df = 59, t critical (one-tail) = 1.671)

         Method  Mean   Variance  t Stat   p (one-tail)  Decision
  Img1   1       3.450  0.930     2.755    0.004         1
         2       2.967  0.880
  Img2   1       3.517  0.754     7.521    0.000         1
         2       2.200  0.908
  Img3   1       3.650  0.731     8.050    0.000         1
         2       2.217  0.918
  Img4   1       3.533  0.821     7.800    0.000         1
         2       2.083  0.959
  Img5   1       3.300  1.061     1.691    0.048         1
         2       2.950  1.031
  Img6   1       3.883  0.884     11.697   0.000         1
         2       1.983  0.830
  Img7   1       3.583  1.332     5.319    0.000         1
         2       2.483  1.034
  Img8   1       4.050  0.582     10.619   0.000         1
         2       2.067  0.945
  Img9   1       3.567  1.055     3.587    0.000         1
         2       2.883  1.223
The second subjective test we carried out evaluates the performance of our retargeting algorithm by comparing the output video sequences generated by our algorithm with outputs generated by other schemes. In practice, it is hard to generate or collect video demos of other video retargeting algorithms, since most of the literature presented its results in paper form. In order to make a fair and convincing comparison, we made efforts to contact the authors, asking for their help with the experiments. Fortunately, we received a reply from Dr. Thomas Deselaers of RWTH Aachen University, Germany, offering to release the source code for their algorithm, backtracing (BT), proposed in CVPR 2008 [9]. With his kind help and guidance, we were able to compile and run their retargeting algorithm and generate outputs; the output videos are generated under optimal parameters with his guidance. Considering the interestingness of the subjective test, and to reduce the burden on the participants, we picked the four most interesting video clips for the test group: Avatar, Up, Madagascar and Spain Torres, and provided all the testing materials on our subjective testing website. For each clip, the participants are instructed to first watch the original version. Then, the participants may freely play the outputs of algorithm A and algorithm B (they were blind to the names to avoid bias) in any order and as many times as they like, and give each a score ranging from 1 to 5 once they are confident about it, as they did for the previous test. We collect the final scores from 60 subjects and analyze the statistical significance with hypothesis testing. In this case, the null hypothesis is that our algorithm does not outperform BT.

The results of the hypothesis testing are summarized in Table 2-2, and the statistics of the scores are depicted in Fig. 2-15. Unlike the previous test, the hypothesis testing does not consistently reject all null hypotheses for the four test sequences. We can also observe that the 95% confidence intervals of the MOS of BT and of our algorithm overlap. Thus, we cannot draw a safe conclusion that our adaptation algorithm is statistically better than BT. We would say the two algorithms have similar performances, while ours tends to be slightly better given its higher mean.
Table 2-2. Hypothesis testing results for subjective evaluation of retargeting algorithms
(1: our approach; 2: backtracing (BT); alpha = 0.05, df = 59, t critical (one-tail) = 1.671)

                 Method  Mean   Variance  t Stat  p (one-tail)  Decision
  Avatar         1       3.833  0.480     1.787   0.040         1
                 2       3.583  0.722
  Up             1       4.067  0.640     4.373   0.000         1
                 2       3.575  0.605
  Madagascar     1       3.750  0.767     0.186   0.427         0
                 2       3.725  0.605
  Spain Torres   1       3.767  0.860     1.684   0.049         1
                 2       3.542  0.757

2.6 Summary

In this chapter, we have proposed a nonlinear approach to fuse the spatial and temporal saliency maps for video retargeting, considering the human vision characteristics. We also presented new content-aware information loss metrics and a hierarchical search scheme under a cropping-and-scaling retargeting framework. Meanwhile, a dynamic programming solution is proposed to optimize the temporal trace of the cropping windows. Experimental results are presented not only by figure illustrations, as the existing works did, but also by subjective evaluations with statistical analysis. Results show that our content modeling algorithm is statistically significantly better than the saliency toolbox, and that our retargeting framework is at least similar to, or slightly better than, backtracing. Last, we offer a series of adaptation demos at the website http://plaza.ufl.edu/lvtaoran/demo-all.htm. Fig. 2-16 provides a snapshot of our test videos.
Figure 2-1. Numerous digital video applications seek a smart content adaptation approach.

Figure 2-2. Examples of content-unaware video retargeting methods.
Figure 2-3. Examples of content-aware retargeting methods.

Figure 2-4. Overview of the nonlinear-fused spatial-temporal saliency detection and single-frame retargeting framework.
Figure 2-5. Overview of the cropping window trace optimization framework on a shot/subshot basis; green arrow: search for the optimized trace of the cropping window throughout a subshot using dynamic programming.

Figure 2-6. Comparison of spatial saliency detection by multi-channel PFFT vs. PQFT. a) Original frames, top: foreman, bottom: football. b) The zero, Y, Cb, Cr channels. c) Saliency map detected by PFFT. d) Saliency map detected by PQFT. e) Time consumption.

Figure 2-7. Comparison of linear combination, naive MAX operation and the proposed approach when the global motion is correct or wrong.
Figure 2-8. Left column, top: a frame of AVATAR; bottom: spatial-temporal saliency. Middle column: searching space of brute-force search. Right column, top: cropping region on the original frame; bottom left: retargeting result; bottom right: direct squeezing result.

Figure 2-9. Graph model for optimizing the cropping window trace; green: source and destination nodes; yellow: candidate nodes for each frame; red: shortest path denoting the optimized dynamic trace.

Figure 2-10. Retargeting performances on natural images. Courtesy of [7] for the test images and the results of the comparison groups.
Figure 2-11. Comparison on video retargeting of baseline-PQFT [3] and our approach. For each video sequence, the left column shows the results of baseline-PQFT, the right column shows ours. The first row are the spatial-temporal saliency maps, the second row are the optimal cropping windows and the third row are the retargeting results. The middle figures in the third row are direct squeezing results.

Figure 2-12. Retargeting results. Top: single frame search and smoothing; middle: backtracing; bottom: the proposed approach.
Figure 2-13. Comparison of saliency detection on images. Col. 1: original image. Col. 2: human-labeled salient regions. Col. 3: proto-regions detected by STB. Col. 4: saliency map by STB. Col. 5: proto-regions detected by our method. Col. 6: saliency map of our method.
Figure 2-14. Statistical analysis for saliency maps. Blue: Saliency Toolbox. Green: ours.

Figure 2-15. Statistical analysis for retargeting algorithms. Blue: backtracing. Green: ours.
Figure 2-16. A variety of test sequences.
CHAPTER 3
CONTENT-AWARE VIDEO SUMMARIZATION

3.1 Introduction

The fast development of the digital video industry has brought many new applications. Consequently, research and development of new technologies that lower the costs of video archiving, cataloging and indexing, and that improve the efficiency, usability and accessibility of stored videos, are greatly in demand. Among all hot research areas, one important topic is how to enable quick browsing of a large collection of video data and how to achieve efficient content access and representation. To address these issues, video summarization techniques have emerged and have been attracting more research interest in recent years.

There are two types of video summarization [17]: static storyboard and video skimming (Fig. 3-1). A static storyboard [18][19][20] is a still abstract, i.e., a set of frames (key frames) selected from an original video sequence. Video skimming [21], also called a moving abstract, is a collection of image sequences, along with the corresponding audio signals, from an original video sequence. For the rest of this chapter, we will mainly discuss the moving abstract: video skimming.

The fundamental purpose of video skimming is to epitomize a long video into a succinct synopsis, which allows viewers to quickly grasp the general idea of the original video. The resultant summary provides a compact representation of the original content structure, leading to efficient indexing and retrieval. Although brief, a good summary preserves all necessary hallmarks of the original video, and viewers are sufficiently able to recover the original content through reasoning and imagination.

3.1.1 Content-Unaware Video Skimming

An obsolete method of video skimming is to uniformly sample the frames [22][23][24] to shrink the video size while losing the audio part, like the fast-forward function in digital players. Although this is probably the simplest way to skim a video, the drawback is obvious.
Such an approach may cause some short yet important segments to have no representative frames, while other, longer segments may have multiple frames with similar content. This makes the scheme fail to capture the actual dynamics of the video content. Time compression methods [25][26][27] can compress audio and video at the same time to keep them synchronized, using frame dropping and audio sampling. However, the compression ratio of such methods is limited unless the speech distortion is tolerable.

3.1.2 Previous Approaches of Content-Aware Video Skimming

To date, there are excellent surveys on video skimming [17, 28]. These papers cover many detailed approaches with one common strategy: formulated as an optimization problem, the method selects a subset of video units (either static frames or dynamic shot clips) from all possible units in the original video such that they maximize some metric function of the summary quality.

Based on the cognitive level (from low to high: signal, syntax and semantic) at which a metric function lies, we categorize current video skimming techniques into three types. Methods of Type I utilize signal-level measures to compare the difference of a video summary from its original. Various implementations include the motion trajectory curve [29], visual redundancy [30], visual centroid [31], inter-frame mutual information [32], similarity graph [33] and summarized PSNR [34]. All these metrics are manipulations of pure frame intensities and in essence measure the visual diversity contained in a summary. Hence the maximization leads to the summary with the most content diversity, deviating from the fundamental purpose of video summarization: it is a visually colorful one, but not necessarily the one that presents the most important clues to enhance viewers' understanding.
lear ningfromlabeleddata,variousmethodsinthiscategorydetectseventswithunique meanings.Typicalimplementationsincludetherecognitionoftheemotionaldialogue andviolentaction[ 35],cinematographysemantics[ 36],whoandwhatinquires[ 37], lecturetemplate[ 38],who,what,whereandwhenentities[ 39].Thesemethods makesenseastheyconsiderthefundamentalpurposeofvideosummarizationclosely. However,duetothelimitationofcurrentcomputerintelligence,recognizinganentity asaneventwithexplicitmeaningsisarigorousworkasbecauseofthewell-known semanticgapproblem.Also,thecapacitylimitationsofadenedontologyforcecurrent approachestoexercisesomewhatheuristicrulesonthesemanticentries,whichprove tobepowerfulinadhocsystems,butwithweakgeneralizationability. TypeIIIliesintheintermediatelevel,withmethodsseekingentitieswithimplicit meanings.Thephilosophyisthatimplicitsemanticalentitiesalsosufceviewersto understandandrecoveroriginalplotwhileavoidingtheheuristicattemptsforexplicit semanticrecognition.Someresearchersin[ 4043]assumetheimplicitsemanticsare expressedbypopularhumanperceptionmodelsandtheyyieldsummarieswithmost salient(mostprobableattendedfeaturesbyhumanattention)videounits.Unfortunately, althoughcorrelated,salientfeaturesdonotnecessarilymeansemanticdistinguishable astheybasicallymeasurehowinterestingofavideowhiletheinterestingpartmaybe animportantclueforunderstandingormaybenot. 3.1.3OverviewoftheProposedApproach Wefeatureourcompellingvideoskimmingalgorithmtomeetfourrequirements simultaneously.Thefourrequisitesarestoryskeletonpreservation,appealingand salientsummarization,smoothtransitionandskimmingratioadaptation.These demandsimposehugechallengesforustocomeupwithawell-formulatedsolution. Therstrequirementisstoryskeletonpreservation.Summarysequenceenables viewerstoquicklyandefcientlygraspwhatavideodescribesorpresentsfroma shortersummarizedversion.Tomeetthisneed,itisintuitivetoextractthemainskeleton 53

PAGE 54

The video skeleton can be seen as a queue of concept primitives with certain semantic implications in temporal order. A concept primitive is not as high-level as a real semantic concept, which is learned with human intervention. Here, the concept primitive implicitly encodes the semantic meanings of shots (sets of consecutive similar video frames), symbolizes a shot group that portrays consistent semantic settings, and generally possesses the capability of a hallmark or self-evident clue that hints at the development of the original video. Viewers may well recover the plot by watching and hearing only a handful of shots, as long as all concept primitives are conveyed.

The second requirement is appealing and salient summarization. Obviously, an exciting summary of a video is highly desired by viewers (i.e., highlights). Often, there are various shots conveying the same concept primitive. When selecting one such shot from many, the one with the highest saliency value, or equivalently the one generating the largest stimulus to human attention, would be favored, so that the resultant summarized video not only contains integral concept primitives, but also carefully selects the shot instances with the richest information to reflect these concept primitives, avoiding a plain or even dull summarization.

The third requirement is smooth transition. An evident artifact in the summarized video is an unnatural transition between two adjacent concept primitives, due to the elimination of a number of visually and acoustically similar shots. A compelling video summarization also expects smooth transitions, which requires frame-level summarization besides the concept primitive level.

Last but not least, the video skimming algorithm should have the scalability to adapt to an arbitrary skimming ratio. For example, we may use different skimming ratios for one source video for different applications. It is highly desirable that a video skimming algorithm can generate an informative and attractive sequence that satisfies both the users' needs and the resource budget.
In this chapter, we propose a novel approach to explore the implicit semantics of the original video at the intermediate cognitive level. We pursue a self-explanatory video summary through discovering and preserving concept primitives. The motivation for concept primitives is intuitive: emulating the human cognitive process, a list of key patterned hints, such as characters, settings, actions and their orders, is naturally needed in the short summary for viewers to stitch these hints together logically and use imagination to fill in the omitted part. We extract audiovisual features and use spectral clustering to discover the concept primitives, and we consider the repetition of shot instances that instantiate the same concept primitive as summarization redundancy. We further argue that a good summary should keep the various concept primitives as complete and balanced as possible, so that the summary presents comparable clues from a complete perspective, allowing viewers to make the most reasonable and objective inference. We also propose a greedy algorithm to apply the summarization criteria: based on the Concept-Primitive-Shot-Instance (CPRI) representation, we sort the shot instances primarily by the concept importance and secondarily by the saliency value within each individual concept primitive. Then shot instances are selected in a greedy fashion until the summarization ratio is reached. Finally, to meet the skimming ratio specification and keep the transitions in the summarized video smooth, we add frame-level saliency thresholding followed by a temporally morphological operation as post processing. The main contributions of our work are therefore:

(1) A concept primitive based video summarization: its merits come from the fundamental purpose of helping viewers understand and recover the original plot semantically.

(2) A concept-primitive-shot-instance representation of the video semantic structure: we propose a unique way to discover the concept primitives using spectral clustering.
(3) A greedy approach to solve the summary problem with complete and balanced concept primitives as well as salient shot instances. This method is well suited to the scalability of the summarization.

3.2 Feature Extraction of Shots

3.2.1 Video-Audio De-Interleaving and Temporal Segmentation

A good video summary cannot be achieved without a good understanding of the contents. The most common contents of a typical video sequence are the visual and acoustic channels. Most of the time, visual signals provide the majority of the information for learning latent concept patterns from the original video; but the audio sensory channel can also provide important information about a concept primitive in situations where the visual channel may not, such as in an environment lacking light at night time. In addition, recall that a concept primitive implies that the comprising shots share both visual and audio consistency at the same time. Thus, if we allow independent feature extraction and unsupervised concept learning from both visual and audio sensory data, the learned concept results can be jointly analyzed in a parity-check fashion to enhance co-reliability. Therefore, we extract the audio stream from the raw video and put it through a parallel assembly line, similar to the visual stream, to discover the possible audio concepts.

The temporal segmentation for the video stream is shot detection. We propose a variance-difference based approach to detect shot changes, which is robust in detecting cuts and also achieves good performance in detecting fades. The variance Var_i of frame i is calculated, and the delta variance dVar_i of frame i with respect to its previous frame i-1 is recorded. The algorithm for shot detection is presented as Algorithm 2.

For processing convenience, audio data are segmented into pieces, where each piece has its boundaries synchronized in the time axis to its co-located video shot. Within each shot, audio data are further segmented into pieces with the same time duration as a video frame. We call this kind of segment an audio frame.
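Algorithm 2 below formalizes the procedure; as a rough Python/OpenCV sketch of the same variance-difference test (the thresholds here are illustrative values, not our tuned settings):

    import cv2
    import numpy as np

    def detect_shots(frames, t_tolerance=10.0, t_length=15):
        # Declare a boundary when the luminance-variance difference between
        # consecutive frames exceeds t_tolerance and the running shot is long
        # enough; returns a per-frame shot index.
        labels, shot_id, shot_len, prev_var = [], 0, 0, None
        for f in frames:
            var = float(np.var(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)))
            if prev_var is not None and abs(var - prev_var) > t_tolerance \
                    and shot_len > t_length:
                shot_id += 1          # new shot boundary found
                shot_len = 0
            labels.append(shot_id)
            shot_len += 1
            prev_var = var
        return labels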

    input:  number of frames in the sequence N; thresholds T_stable, T_tolerance, T_length;
            accumulated frame number in a shot N_s
    output: shot index K_i

    K_0 <- 1; Max_Var <- -inf; Min_Var <- +inf; N_s <- 0
    compute Var_0
    for i <- 1 to N-1 do
        compute Var_i
        dVar_i <- Var_i - Var_{i-1}
        if dVar_i > Max_Var then Max_Var <- dVar_i
        if dVar_i > T_tolerance then
            if N_s > T_length then
                new shot boundary found
                K_i <- K_{i-1} + 1
                reset parameters for a new shot: N_s <- 0; Max_Var <- -inf; Min_Var <- +inf
        else
            N_s <- N_s + 1
            K_i <- K_{i-1}

Algorithm 2: Proposed algorithm for shot detection.

3.2.2 Content-Aware Attention Modeling

An appealing summarization requires a content attentiveness (saliency) measurement. The saliency measure should effectively reflect how attractive a shot or a frame is. Based on our research on attention modeling in Chapter 2, we develop a saliency measure system.

There are three levels for visual saliency and two levels for audio saliency: pixel level for visual saliency only, and frame level and shot level for both visual and audio saliency.
Let $SM_t$ be the spatial-temporal saliency map for frame $t$ detected by our algorithm in Eq. 2. Then $SM_t$ is the pixel-level visual saliency indicating how attentive each pixel in the frame is. The frame-level visual saliency is measured as:

\[ Sal^v_t = \frac{1}{W H}\sum_{i=1}^{W}\sum_{j=1}^{H} SM_t(i,j) \]

where $(i,j)$ is the pixel location and $W, H$ are the frame width and height.

The frame-level audio saliency is measured by some low-level audio features [44], including the spectral centroid (SC), root mean square (RMS), absolute value maximum (AVM), zero-crossing ratio (ZCR), and spectral flux (SF). SC is the center of the spectrum; it is computed by treating the spectrum as a distribution whose values are the frequencies and whose probabilities are the normalized amplitudes. RMS is a measure of the short-time energy of a signal from the $L_2$ norm. AVM is a measure of the short-time energy of a signal from the $L_\infty$ norm. ZCR is a measure of the number of times the signal value crosses the zero axis. These low-level features can either be used alone or fused; we thereby obtain the frame-level audio saliency $Sal^a_t$. The frame-level audio-visual saliency is measured by linear weighing with a weighing factor $\lambda$:

\[ Sal_t = \lambda\, Sal^v_t + (1-\lambda)\, Sal^a_t \]

For a shot, the visual and aural conspicuousness are calculated by averaging the frame-level saliency in that shot, respectively:

\[ AvgSal^v_k = \frac{1}{N_k}\sum_t \{ Sal^v_t \mid F_t \in Shot_k \} \]
\[ AvgSal^a_k = \frac{1}{N_k}\sum_t \{ Sal^a_t \mid F_t \in Shot_k \} \]

where $N_k$ is the number of frames in $Shot_k$. The shot-level audio-visual saliency is measured by:

\[ AvgSal_k = \lambda\, AvgSal^v_k + (1-\lambda)\, AvgSal^a_k \]
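A compact sketch of the fusion, assuming the saliency maps and frame-level audio saliencies are already computed and using an illustrative weighing factor:

    import numpy as np

    def shot_saliency(vis_maps, audio_sal, lam=0.5):
        # vis_maps[t]: saliency map SM_t; audio_sal[t]: frame-level audio
        # saliency Sal^a_t; lam: the linear weighing factor (illustrative).
        sal_v = np.array([sm.mean() for sm in vis_maps])        # Sal^v_t
        sal_t = lam * sal_v + (1 - lam) * np.asarray(audio_sal) # fused, per frame
        avg_v, avg_a = sal_v.mean(), float(np.mean(audio_sal))  # shot-level means
        return sal_t, lam * avg_v + (1 - lam) * avg_a           # AvgSal_k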

3.2.3 Concept Primitives and Bag-of-Words Features

Skeleton preservation requires some distinctive feature for shot discrimination; the shot feature should be discriminative enough to represent the video skeleton, as it will be used to find the similarity among shots.

We propose to use the bag-of-words (BoW) model to characterize the shot properties in the visual and aural domains, respectively. The BoW model [45] was initially utilized in natural language processing to represent the feature of a text document. It considers each text document as a collection of certain words belonging to a reference dictionary, but ignores the order and semantic implications of the words. The BoW model uses the occurrence of each dictionary word as the feature of the text; thus it often ends up as a sparse vector. The BoW model can be regarded as a histogram representation based on independent features. In our case, a shot can be regarded as a text document. However, since neither the visual word nor the aural word of a shot is ready for use like the real words in text documents, the words need to be well defined. It usually involves two steps to obtain a word: feature extraction and codeword generation.

1) Visual concept primitives and bag-of-words features

We interpret a video concept primitive as a self-learned set featured by a combination of certain spatially local visual atoms (SLVA), where each SLVA stands for a single visual pattern, found within a localized neighborhood at a particular spatial location, with implicit semantic implications, like a rose, a butterfly, etc. A noticeable property of the video concept primitive is that we only attach importance to the occurrence of SLVAs, without regard to their order (spatial location). For example, in Fig. 3-5, a shot with a far view of a rose and a butterfly and a close-up of the same entities should both carry the same semantic implications, despite the rose and butterfly appearing in different locations and at different scales. The BoW model for visual shots, which gracefully expresses this order-irrelevant property, is thus adopted, with the SLVAs as the visual words.
We adopt the scale-invariant feature transform (SIFT) feature extraction algorithm [46] to obtain the visual words, because the SIFT feature best exhibits the local characteristics within a neighborhood, with the highest matching accuracies under different scales, orientations and affine distortions, and partial invariance to illumination changes. Evaluations strongly suggest that SIFT-based descriptors are the most robust and distinctive, and they are therefore the best candidates for SLVAs.

Consider a regular full-process mode, in which SIFT feature points are detected on every frame in the shot and on every region within a frame. This procedure, although precise, is especially time-consuming. Thus, some pre-processing needs to be conducted before the SIFT feature detection. We adopt key-frames to balance the computation cost and accuracy. Since frames within a shot appear to have minor differences, it is wise to select one frame as the most representative one, i.e., the key-frame. There are many key-frame selection methods. Some straightforward methods include choosing the first/last frame, or the middle frame of a shot. Some motion-based approaches use motion intensity to guide the key-frame selection, as in MPEG-7 [47]. Unlike those approaches, we consider the human attention model of the shot and select the most salient frame as the key-frame $t_k$:

\[ t_k = \arg\max_t \{ Sal^v_t \mid F_t \in Shot_k \} \]

The key-frame selection saves a huge amount of computation at a minor cost in precision, under the assumption that the frames within a shot are similar. In addition, by exploiting the attention model on a single frame, we can further exclude some inattentive regions of the key-frame. We define the active region $AR_{t_k}$ on the key-frame by thresholding the saliency map:

\[ AR_{t_k}(i,j) = \{ F_{t_k}(i,j) \mid SM_{t_k}(i,j) > T,\ 1 \le i \le W,\ 1 \le j \le H \} \]

where $T$ is the active threshold.
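The key-frame selection and active-region masking amount to an argmax and a threshold; a minimal sketch, with illustrative names:

    import numpy as np

    def keyframe_and_active_region(saliency_maps, frames, thresh):
        # saliency_maps[t]: spatial-temporal saliency map SM_t of frame t.
        frame_sal = [sm.mean() for sm in saliency_maps]   # Sal^v_t per frame
        t_k = int(np.argmax(frame_sal))                   # key-frame index
        mask = saliency_maps[t_k] > thresh                # active region AR
        masked = frames[t_k] * mask[..., None]            # zero out background
        return t_k, masked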

The SIFT feature detection on active regions generates prominent and robust SLVAs of the frame. Fig. 3-2 illustrates the results of saliency masking on two shots from the sequence Big Buck Bunny.

We adopt Lowe's algorithm [46] for SIFT feature detection in the active regions of the key-frame. The frame is convolved with Gaussian filters at different scales, and then the differences of successive Gaussian-blurred versions are taken. Key points are located as maxima/minima of the difference of Gaussians (DoG) occurring at multiple scales. Then, low-contrast key points are discarded and high edge responses are eliminated. After that, each key point is assigned one or more orientations based on the local gradient directions. Finally, a highly distinctive 128-dimension vector is generated as the point descriptor, i.e., the SLVA. Fig. 3-3 illustrates the SIFT feature detection results on two shots of the sequence Big Buck Bunny.

After SIFT feature points are found on the key-frame of each shot, the shot, as a bag, has a collection of visual words, each a vector of dimension 128. The number of words is the number of SIFT feature points on the key-frame. A shot bag with its SIFT feature descriptors can now be regarded as a text document with many words. In order to generate the histogram representation as the feature of the shot, a dictionary should be built as the collection of all the words from all the bags, and similar words should be treated as one codeword; just as in text documents "take", "takes", "taken" and "took" would be classified into one group with "take" as the codeword for this group. A codeword can thus be considered a representative of several similar SLVAs. We use K-means clustering over all the SLVAs; the number of clusters is the codebook size (analogous to the number of different words in a text dictionary). The codewords are the centers of the clusters, and each word is mapped to a certain codeword through the clustering process. Thus, each shot can be represented by a histogram of the codewords. Fig. 3-4 shows the histogram-like representation of the BoW feature, and Fig. 3-6 shows the flowchart of visual BoW feature extraction for a shot.
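The visual BoW pipeline can be sketched with OpenCV's SIFT and scikit-learn's K-means as below; the codebook size and the histogram normalization are illustrative choices, not the exact settings of our implementation.

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    def visual_bow(keyframes, codebook_size=64):
        # One saliency-masked key-frame per shot; codebook size illustrative.
        sift = cv2.SIFT_create()
        per_shot = []
        for img in keyframes:
            gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
            _, desc = sift.detectAndCompute(gray, None)  # SLVAs: 128-d vectors
            per_shot.append(desc if desc is not None
                            else np.empty((0, 128), np.float32))
        dictionary = np.vstack(per_shot)                 # all words from all bags
        km = KMeans(n_clusters=codebook_size).fit(dictionary)
        feats = []
        for desc in per_shot:                            # codeword histogram/shot
            if len(desc) == 0:
                feats.append(np.zeros(codebook_size))
                continue
            h = np.bincount(km.predict(desc), minlength=codebook_size)
            feats.append(h / h.sum())
        return np.array(feats)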

2) Audio concept primitives and bag-of-words features

Similar to the analysis for the video channel, we explore the audio structure through audio concept primitives, rather than from the more detailed single-acoustic-source level as in many audio recognition problems, or even from a waveform-level perspective. In general, we interpret an audio concept primitive as an acoustic environment featured by a combination of certain temporally local acoustic atoms (TLAA), where each TLAA stands for a single audio pattern with plausible semantic implications (e.g., the audio concept "conversation between John and Mary at the shore" is featured as a combination of John's short-time voice (a TLAA) switching with Mary's (a TLAA) and the continuous environmental sound of sea waves (a TLAA)). Note that for the purpose of video summarization, we seek an audio skeleton that is usually comprised of self-contained concept primitives. By self-contained, we mean that in the set of shots forming this concept primitive, every shot has TLAAs from the same closed subset of plausible audio patterns, and reshuffling of the audio patterns is allowed. This assumption originates from the fact that humans recognize an audio scene from a macroscopic perspective, which emphasizes the components instead of the exact time and location of every component. As in the example above, if another audio scene also includes John, Mary and sea waves, but this time John talks continuously in the first half and Mary in the second half, without any voice switching, we still consider this scene to belong to the same concept primitive, because it also conveys the semantic implication of John and Mary's conversation at the shore. So we assume that within one audio concept, the shots are subject to consistent TLAA compositions, no matter in what order these TLAAs are arranged. In the context of audio concept clustering, at this level, the feature vectors of different shots may be much closer as long as their acoustic component TLAAs are alike. They are then prone to be clustered into the same group, which captures the underlying common characteristics of an audio scene. In comparison, with the many indicator-like features, which identify a shot as a single acoustic source, each shot ends up as a sparse vector with only one 1-entry indicating which acoustic source the shot belongs to.
This hard-decision-like feature generally contradicts the fact that an audio segment corresponding to a shot usually consists of multiple intervening sources, while this fact is implicitly reflected by the BoW feature. For the indicator-like features, the sparse nature of the shot data highlights the difference between shots by assuming each shot to be a single source with the majority contribution, and these majority sources usually differ. In this way, the clustering may lose much of the opportunity to learn a reasonable concept primitive in which shots have similar acoustic components but different majority sources.

To serve the need of concept primitive mining, which focuses on the components rather than their order, the BoW model is quite suitable for representing the audio feature of a detected shot. If we chop the audio stream of a shot into multiple overlapped short-time audio segments of equal length, we may regard the shot as a bag containing multiple audio segments as audio words. Each word, with its feature extracted by matching pursuit decomposition [48], represents a unique TLAA, i.e., an audio pattern with plausible semantic implications; a shot is consequently considered a bag containing the audio patterns. The histogram of word occurrences is a summarized feature of a shot over all the words within it. Here, an encoding scheme is applied to avoid the over-sparsity of the feature vectors (which negatively impacts the classification result) that direct word-occurrence statistics would produce. We store all audio words from all shots of the raw video in a dictionary, and conduct k-means clustering over the dictionary to produce k codewords. Each word is then assigned to its nearest codeword. The BoW feature of each shot is the occurrence of the codewords inside it.

In order to improve the robustness of the audio BoW feature, we also apply the saliency masking method to take into account only those audio words above an acoustic saliency level, thus avoiding the negative effect on the BoW accuracy exerted by low-salience audio words, whose small values are comparable with noise. Fig. 3-7 shows an example of audio saliency masking, which is a thresholding of the audio saliency curve.
In terms of feature extraction for a word, we use a matching pursuit method similar to [49] to decompose the audio segment corresponding to a word into a series of predefined waveform bases. Although many acoustic features, such as Mel-frequency cepstral coefficients (MFCC) and linear predictive cepstral coefficients (LPCC), are available for recognition purposes, they are only suitable for structured audio streams, such as music or speech. Matching pursuit (MP), however, is able to feature ambient sound and other unstructured sound, and thus accesses much more information to enhance the awareness of a latent concept primitive. For an audio word, i.e., a short-time audio segment of a certain length producing one single TLAA, its unique acoustic characteristic can be encoded by a set of basis functions from a reference dictionary and the corresponding correlation coefficients. Using MP, we obtain an efficient sparse representation of the audio segment. Note that in MP, the bases of a given dictionary are selected by maximizing the energy removed from the residual signal at each step; so the sparse representation resulting from MP is the most efficient, in the sense that the reconstructed signal based on the selected bases takes up a larger percentage than in any other decomposition method. Here we use a Gabor dictionary with Gabor waveform bases for its promising reconstruction efficiency. Each particular Gabor waveform is indexed by its scale, frequency and translation from the origin. For a fixed number of iteration steps, MP selects the Gabor basis from the Gabor dictionary with the maximum similarity to the audio segment residual in terms of correlation coefficients. The Gabor function is defined by:

\[ g_{s,u,\omega,\theta}(n) = \frac{K_{s,u,\omega,\theta}}{\sqrt{s}}\, e^{-\pi (n-u)^2/s^2} \cos[2\pi\omega(n-u)+\theta] \]

where $s$, $u$, $\omega$ and $\theta$ are the scale, translation, frequency and initial phase, respectively.

Note that the bases in the Gabor dictionary are all of 256-point length. To encode a short-time audio segment as a TLAA vector by MP decomposition, we make the length of the short-time audio segment 256 points as well, to neatly align with the Gabor basis functions.
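For concreteness, here is a small sketch of the Gabor atom and the MP loop of Algorithm 3 below, assuming a precomputed dictionary of unit-norm atoms and a fixed number of iterations (both assumptions are illustrative):

    import numpy as np

    def gabor_atom(s, u, w, theta, length=256):
        # Real Gabor atom, normalized to unit norm (cf. the definition above).
        n = np.arange(length)
        g = np.exp(-np.pi * ((n - u) / s) ** 2) * np.cos(2 * np.pi * w * (n - u) + theta)
        return g / np.linalg.norm(g)

    def matching_pursuit(signal, dictionary, n_iter=20):
        # dictionary: (n_atoms, 256) array of unit-norm Gabor atoms; at each
        # step pick the atom most correlated with the residual.
        residual = signal.astype(float).copy()
        coeffs = np.zeros(len(dictionary))
        for _ in range(n_iter):
            corr = dictionary @ residual       # correlation with every atom
            best = int(np.argmax(np.abs(corr)))
            coeffs[best] += corr[best]
            residual -= corr[best] * dictionary[best]
        return coeffs                          # TLAA feature vector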

    input:  signal f(t), Gabor dictionary D
    output: list of coefficients (a_n, g_n)

    Rf_1 <- f(t); n <- 1
    while ||Rf_n|| > epsilon do
        g_n  <- argmax_{g in D} |<Rf_n, g>|
        a_n  <- <Rf_n, g_n>
        Rf_{n+1} <- Rf_n - a_n g_n
        n <- n + 1

Algorithm 3: Matching pursuit algorithm.

Applying MP, a TLAA can be represented by a feature vector, each entry of which symbolizes the coefficient of a selected Gabor basis. The flowchart of audio BoW feature extraction is shown in Fig. 3-8.

3.3 Video Skimming by Concept Reconstruction

3.3.1 Spectral-Clustering Solution for Concept Learning

In the following parts of this chapter, we use "concept" as shorthand for concept primitive. With the feature vector available for each shot in both the visual and audio sensory channels, the shots of the original video are ready for clustering to discover the latent concepts. The visual and audio sensory channels are processed independently so that they can provide mutual reliability to each other. A compelling clustering method should first be able to group the data correctly, even though the numbers of data points in different clusters are considerably different. We incorporate the spectral clustering method [50] to learn the possible concepts from shots. Given shot feature data, spectral clustering provides a state-of-the-art classification approach. Spectral clustering minimizes an objective function that cancels out the negative effect due to an imbalanced division of the number of members across clusters. Thus, even though the original video contains concept patterns consisting of significantly different numbers of shot members, spectral clustering is free of any artificial bias toward a division with uniform membership, and it is capable of dividing the shots correctly as long as the feature measure makes the shots in the same concept consistent. Another benefit of spectral clustering is that it favors classifying locally-correlated data into one cluster,
because it adds another constraint that distinguishes close-located or locally-connected data and increases their similarity so they are divided into one group. With this constraint, the clustering result approaches the human intuition that a cluster with consistent members is generally subject to a concentrated distribution. By virtue of spectral clustering, the latent concepts are independent of the allocation of shot members across clusters; meanwhile, due to the preference for placing locally-connected data into a single cluster, the learned concept tends to be self-contained, which is desirable for representing a video skeleton. The spectral clustering procedure is given as Algorithm 4.

    input:  feature vector set U = {u_1, ..., u_n}, search range [k_min, k_max]
    output: clusters c_j

    form an affinity matrix A: d(u_i, u_j) = ||u_i - u_j||,
        A_ij = exp(-d(u_i, u_j)^2 / (2 sigma^2)) if i != j, and A_ii = 0
    define D as the diagonal matrix whose (i,i) element is the sum of A's i-th row
    construct the normalized affinity matrix L = D^{-1/2} A D^{-1/2}
    eigen-analysis of L: get the k_max + 1 largest eigenvectors (x_1, ..., x_{k_max+1})
        and corresponding eigenvalues (lambda_1, ..., lambda_{k_max+1})
    estimate the optimal cluster number: k = argmax_{i in [k_min, k_max]} (1 - lambda_{i+1}/lambda_i)
    form the matrix X = [x_1 x_2 ... x_k] in R^{n x k} by stacking the first k eigenvectors in columns
    form the matrix Y by renormalizing each of X's rows: Y_ij = X_ij / (sum_j X_ij^2)^{1/2}
    cluster the rows of Y into k clusters via cosine-distance based K-means;
        assign the original data point u_i to cluster c_j if row i of Y is clustered to c_j

Algorithm 4: Spectral clustering algorithm.

Here the feature vector set U is our extracted feature set of visual BoWs and audio BoWs, respectively. The number of clusters k is the number of concepts, which can either be set empirically or adjusted adaptively by iteratively trying different values of k, since the computational complexity of the spectral clustering algorithm is fairly low compared to other clustering algorithms, such as mean shift or K-means, on the original high-dimensional data.
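Algorithm 4 can be sketched directly in Python; the eigengap search over [k_min, k_max] is omitted here, and sigma is an illustrative bandwidth rather than our tuned value.

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering(U, k, sigma=1.0):
        # U: (n_shots, dim) BoW features; returns a cluster label per shot.
        d = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1)
        A = np.exp(-d ** 2 / (2 * sigma ** 2))
        np.fill_diagonal(A, 0.0)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
        L = D_inv_sqrt @ A @ D_inv_sqrt              # normalized affinity
        vals, vecs = np.linalg.eigh(L)               # ascending eigenvalues
        X = vecs[:, -k:]                             # k largest eigenvectors
        Y = X / np.linalg.norm(X, axis=1, keepdims=True)
        return KMeans(n_clusters=k).fit_predict(Y)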

3.3.2 Audio-Visual Concept Alignment and Consistence Checking

After spectral clustering of both the visual and audio sensory channels, the shots are grouped into different visual and audio concepts. Note that for a scene with both aural and visual content conveying a certain concept, its visual layer and audio layer are only physically different carriers; they present the same semantic implications for the same concept. Therefore, we assume a one-to-one mapping from the learned visual concepts to the audio concepts. We propose a number-of-members based method to align the visual and audio concept indexes. Since a visual concept together with its audio counterpart reflects a single concept with semantic implications, the number of shot members in this concept reveals the identity of the concept. The label indexes for the audio and video clustering results are independently generated and randomly assigned; for example, 1 3 1 1 3 for video should represent the same concept clustering result as 2 1 2 2 1 for audio. Thus we need to align the labels for an easy consistency check, i.e., rearrange the labels for audio to be 1 3 1 1 3. This number-of-members method is also feasible because spectral clustering imposes no artificial effect of evenly dividing the data into clusters. The bimodal concept alignment procedure is given as Algorithm 5.

When the audio and visual concepts are aligned, we should check whether the concepts are consistent. Consider that some shots have mismatched audio-visual concepts; for example, in a video of two people A and B talking, most shots will consistently show a person's figure and play that person's voice, but some shots may show A's figure while playing B's voice. The case is rare but possible, and we call it a mismatch. After the concept alignment, the mismatch flag $d_k$ for shot $k = 1 \ldots K$ can easily be found by comparing the aligned spectral clustering results:

\[ d_k = 1 \ \text{if}\ V_k \ne A_k, \ \text{else}\ d_k = 0 \]
    input:  visual concept label V_k and audio concept label A_k for shot k, k = 1...K
    output: aligned concept labels

    VM_l <- 0, AM_l <- 0
    for k <- 1 to K do
        for l <- 1 to N_cluster do
            if V_k = l then VM_l <- VM_l + 1
            if A_k = l then AM_l <- AM_l + 1
    sort VM and AM in descending order and get their index mapping I
    for k <- 1 to K do
        for l <- 1 to N_cluster do
            if A_k = l then A_k <- I_l

Algorithm 5: Proposed audio-visual concept alignment algorithm.

When there is a mismatch, the audio-visual saliency of the shot should be decreased, since keeping such a shot in the skimmed video will cause some misunderstanding for viewers:

\[ AvgSal_k \leftarrow AvgSal_k - \eta\, d_k \]

where $\eta$ is the saliency penalty for an audio-visual concept mismatch.

3.3.3 Skimming Algorithm and Post Processing

We propose a greedy algorithm to progressively generate the summarized video clip by means of collecting shots. In other words, a video skimming process can be regarded as a video reconstruction process; starting from an empty output sequence, a shot is recruited each time into the output sequence, until the target skimming ratio is achieved. The duration of the output video can thus be controlled by recruiting different numbers of video shots to satisfy an arbitrary skimming ratio. A crucial factor is the recruiting order, which plays an important role in the final result.
diff erentamountsofvideoshotstosatisfyarbitraryskimmingratio.Acrucialfactoristhe recruitingorder,whichplaysanimportantroletothenalresult.Giventherequirements, wedesignseveralrulesandproposea reconstructionreferencetree structureforour skimmingalgorithm: Rule1:Conceptintegrityshouldbesatised.Theconceptintegrity,orconcept completeness,isthemajorconcernofourskimmingscheme.Weregardtheultimate goalofvideoskimmingistofaithfullyreectthediversityofconceptsoftheoriginal videothusyieldthemaximumentropy,despitethatsomeconceptsmayseemnotsalient (exciting).Therefore,inourreconstructionframework,werequirethateachconcept shouldcontributeshotstotheskimmedvideo. Rule2:Conceptimportanceshouldbemeasured.Theconceptimportanceisa factorfordecidingtherecruitingorderofdifferentconceptprimitives.Itisnotequivalent totheconceptsaliency.Itisamorehigh-levelargumentthatwillrevealthevideo producer'sintentionfortheconcepts'representation.Mostcommonly,iftheproducer givesalongshotforaconceptprimitive,orrepeatstheconceptinmanyshots,thenthis conceptisofhighimportanceintentionally.Underthisassumption,wecanassignthe conceptimportance Im forconcept C l as: Im l = f X N k jShot k 2 C l g (3) where N k isthenumberofframesinshot k .Inourreconstructionframework,werequire thatashotshouldberstpickfromthemostimportantconcept. Rule3:Theoverallsaliencyvalueoftheoutputsequenceshouldbemaximized. AstheconceptdiversityisguaranteedbyRule1,wecannowfocusonthesaliency requirementmaximizingthesaliency.Forsimilarshotsinoneconceptcluster,theshot withhighestsaliencyshouldbepickuprst.Wedeneseveraltermsforbetterdescribe therulesinouralgorithm. 1) Must-inshot and optionalshot 69

PAGE 70

We define the most salient shot in each concept as a must-in shot. This means that the shot must be recruited into the skimmed video regardless of the skimming ratio, which guarantees the concept integrity. The other shots are optional shots that may or may not be recruited, depending on the target skimming ratio.

2) Reconstruction reference tree

The reconstruction reference tree (RRT) is a data structure we designed to guide the video reconstruction. It is built according to the rules defined above. The root of the RRT is the video concept space, which is the set of visual concepts learned through the spectral clustering process. The first-level leaves are the concepts, sorted in descending order of importance from left to right. The second-level leaves are the shots. Under each concept, the shots are sorted in descending order of saliency from top to bottom. The first child of each concept is the must-in shot, and the rest are optional shots.

3) Virtual shot and shot table

Since each concept may have a different number of shots, we insert some virtual shots with zero saliency to form an array of all shots. The array is called the shot table. The must-in and optional shots are real shots.

An RRT with must-in shots, optional shots, virtual shots and the shot table is illustrated in Fig. 3-9. Given the RRT and the shot table, the reconstruction process is relatively easy. The algorithm operates iteratively: each time, a real shot is picked from the shot table, in raster-scan order, until the actual skimming ratio exceeds the target skimming ratio. Since the reconstruction is based on shots, the actual skimming ratio $R_{act}$ may not perfectly equal the target skimming ratio $R_{tar}$; it is more likely that $R_{act}$ is slightly larger than $R_{tar}$, as the stopping criterion is that $R_{act}$ exceeds $R_{tar}$. In order to precisely control the output video duration, we propose to use pure frame-level skimming, based on the attention model, as post processing. The previously defined audio-visual saliency of every frame that appears in the output sequence is checked again; through thresholding of the saliency curve, the frames with relatively low saliency are discarded, making the final duration of the output video satisfy the target duration.
In addition, the smoothness requirement is also considered, to yield a viewer-friendly skimmed video. A morphological-like operation [40] is adopted (denoted as function morph()): delete curve segments that are shorter than K frames, and join together curve segments that are less than K frames apart, where K is a small, empirically set value. Algorithm 6 describes the video skimming and post processing process. Fig. 3-10 illustrates an example of saliency curve thresholding with curve preserving ratio R = 95%.

3.4 Experimental Results

As content-aware video skimming is a highly subjective task, like content-aware video retargeting, it is also difficult for any mechanical comparison or simulation method to obtain accurate objective evaluations, and there is no standard method to evaluate or quantify the performance. Thus, we also present our results with both figure illustrations and subjective tests, as we did previously for content-aware video adaptation.

3.4.1 Figure Illustrations

We first examine the capability of our algorithm to mine the video concept primitives in a video sequence, by performing shot detection and extracting shot features. Since a very long and complex video may contain too many concept primitives, it is very likely that different human beings would yield different clustering results of shot groups. To avoid the ambiguity of the ground truth, we illustrate the results of concept mining using a 30-second clip, which has clear concept primitives, from the popular sitcom The Big Bang Theory. This sequence shows a typical conversation of four people (Leonard, Sheldon, Howard, Rajesh) in a living room. We analyze this clip manually to generate the ground truth. The clip contains 17 shots and 5 concept primitives: Leonard talking (L), Sheldon talking (S), Howard talking (H), Rajesh talking (R) and all people together (A). The story progressively evolves as AHLHRHLRARARSARAS.
    input:  reconstruction reference tree: concepts C_l, l = 1...L,
            and shot table S_{k,l}, k = 1...K_max; target skimming ratio R_tar
    output: final skimmed video V_o

    V_o <- empty; R_act <- 0; flag_must <- 0
    while R_act < R_tar do
        pick the next real shot from the shot table in raster-scan order
            (virtual shots are skipped)
        add the shot to V_o and update R_act
    // frame-level post processing
    SC <- { t | Sal_t > T }
    SC <- morph(SC)
    update V_o by collecting frames t from SC

Algorithm 6: Proposed algorithm for video skimming by shot reconstruction and post processing.
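The recruiting loop itself is short; a sketch follows, with the frame-level thresholding and morph() smoothing of Algorithm 6 left out and all names illustrative.

    def greedy_skim(shot_table, shot_len, n_frames, r_tar):
        # shot_table: shot ids in raster-scan order over the RRT (must-in shots
        # of every concept first, then optional shots by decreasing saliency),
        # with None entries standing for virtual shots.
        selected, recruited = [], 0
        for shot in shot_table:
            if shot is None:                    # skip virtual shots
                continue
            selected.append(shot)
            recruited += shot_len[shot]
            if recruited / n_frames >= r_tar:   # actual ratio reached target
                break
        return sorted(selected)                 # restore temporal order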

We employ our Algorithm 2 to perform shot detection, and key-frame selection using Eq. 3. Then, the novel saliency masking technique is applied on both the visual and audio channels, and the BoW features of each shot are extracted by saliency-masked SIFT feature detection on the visual frames and MP decomposition on the audio segments.

The first figure illustration, Fig. 3-11, shows the detected 17 shots of The Big Bang Theory by their saliency-masked key-frames. It is shown that our shot detection algorithm successfully detected the 17 shots in this clip. The upper part of Fig. 3-11 shows the key-frames for each of the 17 shots. The key-frames are selected as the most salient frames in each shot. The black regions of each frame are the regions masked by saliency masking, a novel approach that we used to eliminate backgrounds and leave the salient regions for robust feature detection. The bottom part of Fig. 3-11 depicts the SIFT feature points detected on the salient regions. We can see that similar shots have similar SIFT features detected, which are thus very robust for the shot feature clustering in the following procedure.

The second figure illustration is Fig. 3-12. It shows the spectral clustering results for the 17 shots with the proposed BoW features. Five concept primitives are mined by the proposed approach, each consisting of similar shots. For example, concept primitive C_2 contains shots 1, 3 and 5, which show Howard talking (H). The clustering result matches the ground truth perfectly for this sequence. Our shot clustering algorithm is thus shown to be effective for concept mining.

The third illustration, Fig. 3-13, presents the reconstruction reference tree (RRT) built for The Big Bang Theory. The five concept primitives are arranged from left to right with decreasing concept importance. The shots belonging to each concept pattern are arranged from top to bottom under each concept, with decreasing saliency. We can simply read from the RRT the order for picking shots to generate an output by reconstruction: 0, 11, 5, 12, 2 are the must-in shots that will definitely appear in the output sequence. Following them are shots 15, 14, 3, 16, 6, 13, 4, 1, 10, 7, 8, 9, which are the optional shots whose appearance in the output depends on the given skimming ratio.
The skimming process is invoked with Algorithm 6. Finally, the output sequence is generated.

3.4.2 Subjective Evaluation

To qualitatively measure how well our algorithm can generate a skimmed video, we employ a user study, which has been widely used for video summarization evaluation [42][40][51][52][43]. We adopt two metrics, informativeness and enjoyability, proposed in [42] to quantify the quality of the skimmed video under different skimming ratios. Enjoyability reflects the user's satisfaction with his/her viewing experience. Informativeness measures the amount of information of the original video that the skimmed video can preserve.

The subjective test is set up as follows. First, considering not to cause tiredness in the participants and downgrade their viewing experiences, we carefully pick two testing videos. The first is a four-minute clip of Big Buck Bunny (BBB), from www.bigbuckbunny.org, and the other is a seven-minute clip of Lord of the Rings (LoR) from the MUSCLE movie database [53]. Then, we assign two skimming levels to each clip: 20% and 10% for BBB, to test the extreme skimming cases, and 50% and 30% for LoR, to test ordinary skimming cases. Thus, two skimmed videos are generated for each of the testing videos. Next, we provide all six video clips on our subjective testing website (http://www.mcn.ece.ufl.edu/public/subjective/), and give instructions on how to evaluate the outputs by the aforementioned metrics, enjoyability and informativeness. The participants are then asked to give each skimmed video an enjoyability score and an informativeness score as a percentage ranging from 0% to 100%, to express their feelings about the skimmed videos. During the whole evaluation process, the participants may take their time to play each video as many times as they want, until they feel that the scores truthfully reflect their impressions.

We collect the scores submitted by 60 participants after the subjective test. We then plot the histograms of the scores in Fig. 3-14.
As shown in the histograms, the enjoyability and informativeness scores for the 20% skimming of BBB exhibit peaks within the 50%-80% range, which is a very promising result. The scores for the 10% skimming of BBB are more uniformly distributed than the 20% scores, showing that the participants' feelings diverge when the ratio approaches the extreme. For the histograms of the LoR scores with 50% and 30% ratios, the scores are more concentrated than the histograms at the very low skimming ratios.

We calculate some basic statistics of the scores and present them in Table 3-1 and Table 3-2. Accordingly, bar charts with confidence intervals are plotted to illustrate the data from the tables, as shown in Fig. 3-15. The bars show the means of the enjoyability and informativeness scores with 95% confidence level for the two skimmed videos of BBB and LoR, respectively. It is clearly seen from the plot that the scores are significantly higher than the corresponding skimming ratios, which indicates that users agree that our skimming algorithm effectively generates an output video that exploits enjoyability and informativeness.

Table 3-1. Basic statistics of subjective testing scores of Big Buck Bunny

                            Mean   Std.   Std.       Sample    Min  Max  95%
                                   error  deviation  variance            confidence
  20%  Enjoyability         61.25  2.82   21.86      477.65    10   90   5.65
       Informativeness      68.00  2.18   16.9       285.76    20   100  4.37
  10%  Enjoyability         55.83  2.14   21.26      451.84    10   100  5.49
       Informativeness      58.95  2.17   16.78      281.40    10   90   4.33

Table 3-2. Basic statistics of subjective testing scores of Lord of the Rings

                            Mean   Std.   Std.       Sample    Min  Max  95%
                                   error  deviation  variance            confidence
  50%  Enjoyability         64.05  2.35   18.21      331.78    10   100  4.71
       Informativeness      66.08  2.42   18.71      350.07    20   100  4.83
  30%  Enjoyability         57.25  2.56   19.82      392.73    10   100  5.12
       Informativeness      60.92  2.26   17.5       306.35    20   90   4.52

3.5 Summary

We have presented a novel approach for video skimming in this chapter. An audio-visual bag-of-words shot model and spectral clustering are incorporated for video concept mining.
Saliency-masked SIFT feature descriptors on key-frames are taken as the visual BoW features, and matching-pursuit decomposition is adopted to discover the audio BoW features of a shot. Spectral clustering is employed for unsupervised concept primitive learning. Then, from the reconstruction perspective, the skimmed video is progressively generated in a greedy fashion with the reconstruction reference tree, which takes into account both video informativeness and enjoyability under a given skimming ratio. The smoothness requirement is achieved by post processing. Our approach is shown by subjective tests to yield encouraging results, offering summarizations that are both informative and enjoyable. Finally, we provide some content-aware video summarization demos at the website http://plaza.ufl.edu/lvtaoran/skimming.htm.
Figure 3-1. Video summarization techniques: static storyboard and video skimming.

Figure 3-2. Visual saliency masking on Big Buck Bunny.

Figure 3-3. SIFT feature detection on active regions. The arrow length for every point is the L2 norm of the SIFT descriptor.
Figure 3-4. The histogram representation of the visual BoW feature for a shot.

Figure 3-5. The same semantic concepts at different scales and locations.

Figure 3-6. The flowchart for extracting the visual BoW feature.
Figure 3-7. Audio saliency masking.

Figure 3-8. The flowchart for extracting the audio BoW feature.
Figure 3-9. An RRT with must-in shots, optional shots, virtual shots and the shot table.

Figure 3-10. Post processing by saliency thresholding.
Figure 3-11. Top: the saliency masking on the detected 17 shots. Bottom: the SIFT features detected on key-frames of The Big Bang Theory.

Figure 3-12. The concept mining by spectral clustering of bag-of-words shot features of the sequence The Big Bang Theory.
Figure 3-13. Reconstruction reference tree of The Big Bang Theory.

Figure 3-14. The histograms of the enjoyability and informativeness scores.
Figure 3-15. Statistical analysis results of the scores from the subjective evaluation.
CHAPTER 4 AGENERICFRAMEWORKFORCONTENT-AWAREVIDEOCODING 4.1Introduction Withtherapiddevelopmentofvideoandmultimediatechnologies,digitalvideo applicationhasbecomeoneofthehottesttopicswhichaffectpeople'slives.The demandfordigitalvideocommunication,suchasvideo-conferencing,mobilebroadcasting andvideophone,hasincreasedconsiderablythankstothesuccessofadvancedvideo codingtechniques,suchasH.264[ 54],MPEG-4,etc.However,duetothescarceof channelresourceandtherestrictionoftransmissionrates,encodingvideosequencesat verylowbit-ratewithgoodqualityremainsamajorchallenge. 4.1.1Content-UnawareVideoCoding Atthesametime,moststate-of-the-artvideocodingstandards,includingH.264/AVC, treateachoftheircodingunits(i.e.,Macroblocks)equally.Althoughdifferentmacroblocks withinthesameframemaybecodedwithdifferentmodesandbepartitionedinto differentsub-blocks,noonemacroblockismoreimportantthananother,sonoonewill befavoredforresourceallocation.Thismodeliseasyandefcient,butisnotalways desirablewhenresourcesarereallylimited. 4.1.2ExistingWorksofContent-AwareVideoCoding Extensivephycologicalstudiesrevealthat,thehumanperceptiononanimage (or,avideoframe)isnotat,i.e.,someregionsontheimagemayincurhigher humanattentioncomparedtootherregions.Thisnaturalphenomenonmotivates peopletodesigna`smart'strategyforresourceallocation.Thatis,whenresources arelimited,itiswisetosacricethebitsconsumptiononinattentiveregionsandsave thesebitsforsalientregions.Forexample,whentwopeoplearemakingavideo phonecallintwodifferentplaces,underapoolwirelessnetwork,theymaywantto seeeachother'sfacemoreclearly,notthebackground.Thus,toencodethehuman faceswithhigherqualityandthebackgroundwithlowerqualitywillsatisfytheuser's 84


need in such a circumstance. The regions that attract people's interest or convey more information are named regions-of-interest (ROIs). For example, the speaker's face in a video conference is the ROI rather than the background; an anchorman or anchorwoman broadcasting news is the ROI rather than the studio; two table-tennis players in a game are ROIs rather than the playground and spectators; a running boat in a river is the ROI rather than the water and riverbank. Meanwhile, as human eyes are more sensitive to motion, regions with strong motion are very likely to be considered ROIs. There are two aspects in which ROIs can help to improve the performance of the existing coding standards. The first aspect is to improve error resilience capability. The second aspect is to improve coding efficiency, i.e., to achieve better quality under the same bitrate budget or a lower bitrate under the same quality constraint, which is represented as

    max PSNR  s.t.  Rate ≤ R_target,   or   min Rate  s.t.  PSNR ≥ PSNR_target.   (4-1)

In this chapter we mainly address the second aspect. Researchers have proposed many bit-allocation algorithms. Karlsson et al. use spatio-temporal filters [55], which force an undesirable background skip (reducing the background frame rate). Lin et al. use a frame-skipping scheme for resource allocation in video conferences [56]. Chen et al. solve the problem as an optimization problem via Lagrange theory [57]. Wang et al. design an algorithm that updates the R-Q model to adaptively find the best quantization parameter [58]. Chai and Ngan [59] proposed two strategies, namely Maximum Bit Transfer (MBT) and Joint Bit Assignment (JBA). In MBT, the largest QP is assigned to the non-ROI, and the ROIs are optimized with the remaining bits; however, a poor background is not always desirable. JBA overcomes this drawback but still cannot avoid the abrupt quality degradation between ROI and non-ROIs.


4.2 Group-of-Picture Based Bit-Allocation Framework

Considering the problems of existing work, we introduce a new framework for content-aware resource allocation. The coding mode of a macroblock in H.264 can be chosen from the set Mode = {INTRA4x4, INTRA16x16, INTER16x16, INTER16x8, INTER8x16, INTER8x8, INTER8x4, INTER4x8, INTER4x4, SKIP, DIRECT}. For each macroblock S, mode decision is first performed to find the optimal RD cost by minimizing

    RD_cost(S, Mode | QP, λ) = D_REC(S, Mode | QP) + λ · R_REC(S, Mode | QP),   (4-2)

where D_REC is the distortion, generally represented by the sum of squared differences (SSD) or the sum of absolute differences (SAD); R_REC is the bit consumption of entropy coding the residual of the macroblock; λ is the Lagrangian multiplier; and QP is the quantization parameter:

    λ = 0.85 · 2^((QP − 12) / 3).   (4-3)

The distortion D and rate R are also functions of QP, and many models have been proposed to describe their relationship. For example, TMN8 suggests a quadratic model of the rate and distortion in terms of QP, and others have proposed linear models instead [58]. These prior works imply that adjusting the quantization parameter QP is an effective method of bit-allocation.

We consider a two-level bit-allocation scheme based on the GOP (group-of-pictures) structure, as Fig. 4-1 shows. Since the first intra frame is the reference frame for motion estimation of the succeeding inter frames, its quality is the dominant factor in the PSNR of the GOP. Thus, compensating the bit consumption between the I-frame and the P-frames brings a PSNR improvement under the same bit target. At the same time, the unimportant regions of the P-frames (non-ROIs) are sacrificed to an acceptable extent.
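As a minimal illustration of Eq. 4-2 and Eq. 4-3, the sketch below picks the mode with the least RD cost; the per-mode (distortion, rate) pairs are hypothetical placeholders, since in a real encoder they come from actually coding the macroblock:

    def h264_lambda(qp):
        """Lagrangian multiplier of Eq. 4-3: lambda = 0.85 * 2^((QP - 12) / 3)."""
        return 0.85 * 2 ** ((qp - 12) / 3.0)

    def best_mode(candidates, qp):
        """Pick the mode minimizing RD_cost = D_REC + lambda * R_REC (Eq. 4-2).

        `candidates` maps a mode name to its (distortion, rate_bits) pair,
        e.g. the SSD of the reconstruction and the entropy-coded residual bits.
        """
        lam = h264_lambda(qp)
        return min(candidates, key=lambda m: candidates[m][0] + lam * candidates[m][1])

    # Hypothetical per-mode (SSD, bits) measurements for one macroblock at QP = 28.
    candidates = {
        "INTRA4x4": (1500.0, 320),
        "INTER16x16": (1800.0, 190),
        "SKIP": (2600.0, 2),
    }
    print(best_mode(candidates, qp=28))  # SKIP wins at this QP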


Fig. 4-2 shows the encoder diagram of our ROI-based bit-allocation scheme. Compared to the conventional H.264 coding diagram, we add a new module, Encoder Control. In this module, the group-of-picture structure is first formed. Then, through computation on the contents of the intra and inter frames of the GOP respectively, an ROI flag is assigned to each macroblock. When encoding a macroblock, its quantization parameter is adjusted accordingly for quality control.

4.3 Intra-Frame ROI Identification and Bit-Allocation

Intra-frame ROI identification can be regarded as image ROI identification, which can be solved by many techniques, such as skin color detection [60], level set segmentation [61], feature-based saliency detection [62], and transform-based saliency detection [2]. A discussion of image ROI detection can be found in Chapter 2, where the saliency analysis is presented.

In our framework, we assume the ROI of an intra frame is already identified; correspondingly, an ROI mask at the macroblock level is generated. For simplicity, we assume all ROI macroblocks are equally weighted; note that the weight could also be measured by relative importance. The bit-allocation scheme tries to take advantage of the ROI information and allocate more resources to the ROI macroblocks. This is realized by adjusting the QP of the macroblocks: a smaller QP reflects finer quantization and causes a smaller quantization error, but at the same time the bit consumption increases. Notation: the actual QP of the current macroblock is denoted QP_{M_i}, and the initial QP configured by the user is QP_ini. The algorithm is described in Algorithm 7.

4.4 Inter-Frame ROI Identification and Bit-Allocation

Inter-frame ROI, however, is defined differently from intra-frame ROI, because inter frames utilize reference frames to achieve coding efficiency. If the contents of a macroblock do not change much between an inter frame and its reference frame, the advanced video coding algorithm tends to use a SKIP mode


for that macroblock, especially in low-rate cases. This means the macroblock has no residue data, so it does not need to be quantized; for reconstruction, it only needs to copy the data from the reference frame with the corresponding motion vectors. Under such a condition, adjusting the quantization parameter (QP) is improper and meaningless. On the other hand, if the contents of a macroblock change a lot, that macroblock cannot be skipped and its residue data will be quantized. We therefore define macroblocks with high motion activity as ROIs. The motion activity is measured by the statistics of the motion vectors in a macroblock. The encoding QP control algorithm for inter frames is described in Algorithm 8, where n is the pixel index within a macroblock and N is the total number of pixels in a macroblock. The threshold T is updated such that the final bitrate of a GOP is within 3% fluctuation of the JM-coded bitrate.

We should also be aware of quality propagation, the phenomenon that if the reference frame of an inter frame is of good quality (i.e., high PSNR), the corresponding inter frame is likely to be of good quality as well, especially when the SKIP mode is used. This is due to the copying of macroblock data from the reference frame.

    Input: intra frame f_k to be encoded, initial QP QP_ini, number of macroblocks N_MB
    Output: encoded frame f_k, bit consumption B_k of f_k
    do saliency detection on f_k;
    for i ← 1 to N_MB do
        compute the ROI flag R_i of macroblock M_i;
        QP_{M_i} ← QP_ini;
        if R_i == 1 then
            M_i is an ROI;
            QP_{M_i} ← QP_{M_i} − ΔQP;  // ΔQP is the ROI QP offset
        end
        encode M_i with QP_{M_i};
    end
    finish encoding f_k;
    compute B_k;

Algorithm 7. Proposed algorithm for intra-frame ROI identification and bit-allocation.
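A minimal sketch of the per-macroblock QP control of Algorithm 7 above, assuming the saliency detector has already produced the ROI flags; `encode_mb` and the offset `DELTA_QP` are hypothetical placeholders for the encoder integration:

    DELTA_QP = 4  # hypothetical ROI QP offset; the actual value is an encoder choice

    def encode_intra_frame(macroblocks, roi_flags, qp_ini, encode_mb):
        """Per-macroblock QP control of Algorithm 7.

        roi_flags[i] is 1 if saliency detection marks macroblock i as an ROI;
        encode_mb(mb, qp) stands for the actual encoder call and returns the
        bits spent on that macroblock.
        """
        bits_total = 0
        for mb, is_roi in zip(macroblocks, roi_flags):
            qp = qp_ini
            if is_roi == 1:
                qp = max(qp - DELTA_QP, 0)  # finer quantization: more bits for the ROI
            bits_total += encode_mb(mb, qp)
        return bits_total  # B_k, the bit consumption of the intra frame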


    Input: GOP bit resource C, bit consumption B of the intra frame in the GOP, motion-activity threshold T, inter frame f_k to be encoded, initial QP QP_ini, number of macroblocks N_MB
    Output: encoded frame f_k, bit consumption B_k of f_k
    do motion estimation on f_k;
    for k ← 1 to N_{P-frames} do
        for i ← 1 to N_MB do
            compute the motion activity MA_i of macroblock M_i:
                MA_i = (1/N) Σ_{n=1..N} (|MV_{x,i,n}| + |MV_{y,i,n}|);
            QP_{M_i} ← QP_ini;
            if MA_i < T then
                QP_{M_i} ← QP_{M_i} + ΔQP;  // non-ROI: coarser quantization
            end
            encode M_i with QP_{M_i};
        end
    end
    finish encoding f_k;
    compute B_k;

Algorithm 8. Proposed algorithm for inter-frame ROI identification and bit-allocation.
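Likewise, a minimal sketch of the motion-activity test of Algorithm 8; the QP offset and the encoder callback are again hypothetical, and the outer loop that re-tunes the threshold T to keep the GOP bitrate within 3% of the JM bitrate is not shown:

    def motion_activity(motion_vectors):
        """MA_i: mean L1 norm of the motion vectors of one macroblock."""
        return sum(abs(mvx) + abs(mvy) for mvx, mvy in motion_vectors) / len(motion_vectors)

    def encode_inter_frame(macroblocks, mvs_per_mb, qp_ini, threshold, encode_mb, delta_qp=4):
        """Per-macroblock QP control for one inter frame.

        High-motion macroblocks are treated as ROIs and kept at the initial QP;
        low-motion macroblocks are coarsened. SKIP-mode macroblocks carry no
        residue, so their QP value has no effect either way.
        """
        bits_total = 0
        for mb, mvs in zip(macroblocks, mvs_per_mb):
            qp = qp_ini
            if motion_activity(mvs) < threshold:
                qp += delta_qp  # non-ROI: sacrifice quality to save bits
            bits_total += encode_mb(mb, qp)
        return bits_total  # B_k for frame f_k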


4.5 Experimental Results

The proposed scheme is evaluated on the benchmark sequences Carphone and Akiyo at three bitrate levels, from low bitrate to high bitrate (> 256 kbps). For each sequence, 90 frames are tested. The 90 frames are divided into three group-of-pictures (GOPs) with an intra period of 30 frames. Table 4-1 lists the experimental results. It is shown that under the same bitrate level (fluctuation less than 3%), the ROI-based coding scheme achieves better PSNR performance than the JM 14.0 baseline profile with the same GOP structure. The PSNR gain increases as the bitrate decreases.

Table 4-1. Performance evaluation on the benchmark sequences Carphone and Akiyo.

  Test sequence   Bitrate (kbps)   PSNR (dB): JM   PSNR (dB): ROI   Avg. PSNR gain (dB)
  Carphone        530              45.05           45.07            +0.02
                  155              38.52           38.55            +0.03
                  26               30.09           30.29            +0.20
  Akiyo           433              50.98           51.00            +0.02
                  117              44.21           44.33            +0.12
                  12               30.85           31.10            +0.25

Fig. 4-3 shows the per-frame PSNR performance of car.qcif in the low-bitrate case, and Fig. 4-4 depicts the PSNR gain of the ROI-based coding scheme over standard JM; the maximum PSNR gain is larger than 0.5 dB and the average gain is 0.2 dB. Fig. 4-5 shows the per-frame PSNR performance of akiyo.qcif in the low-bitrate case, and Fig. 4-6 depicts the corresponding PSNR gain over standard JM; again, the maximum PSNR gain is larger than 0.6 dB and the average gain is 0.25 dB. While achieving a higher PSNR, our ROI-based coding scheme simultaneously achieves better visual quality, as can be seen from Fig. 4-7. To show the improvement of subjective quality for arbitrary frames, we pick the first frame (an intra frame, the first frame of a GOP), the 45th frame (a P-frame in the middle of a GOP), and the 90th frame (a P-frame, the last frame of a GOP). The first row shows the original frames; the second row shows the frames reconstructed by JM; the third row presents the frames reconstructed by the ROI scheme; and the fourth row is a detail comparison of JM and ROI. Our scheme clearly achieves better quality around the person's eyes and mouth.


4.6 Summary

We have proposed a content-aware video coding framework that considers human vision properties. The framework utilizes the intrinsic relationship between intra and inter frames; resources are smartly allocated to salient regions, resulting in both subjective and objective quality gains over JM 14.0. Our scheme also overcomes the abrupt ROI/non-ROI quality degradation of existing works. Meanwhile, it is easy to implement and fully compatible with standard H.264 decoders. Its simplicity and good performance make it a promising solution for low-rate real-time video applications. A demo of this work can be found at: http://plaza.ufl.edu/lvtaoran/ROI.htm


Figure 4-1. Two-level bit allocation. (a) GOP level. (b) Frame level.


Figure 4-2. ROI-based encoder diagram.
Figure 4-3. PSNR comparison in the low-rate case for Carphone.


Figure 4-4. PSNR gain for Carphone.
Figure 4-5. PSNR comparison in the low-rate case for Akiyo.
Figure 4-6. PSNR gain for Akiyo.


Figure 4-7. Reconstructed video frames of Carphone and Akiyo by the ROI scheme compared to JM 14.0. (a) Carphone: original video frames #0, #44, #89. (b) Carphone: reconstructed frames #0, #44 and #89 by JM 14.0. (c) Carphone: reconstructed frames #0, #44 and #89 by the ROI scheme. (d) Detail comparison of (b) and (c). (e) Akiyo: original video frames #0, #44, #89. (f) Akiyo: reconstructed frames #0, #44 and #89 by JM 14.0. (g) Akiyo: reconstructed frames #0, #44 and #89 by the ROI scheme. (h) Detail comparison of (f) and (g).


CHAPTER 5
TRANSITION-BASED INTRA CODING

5.1 Introduction

How to effectively utilize spatial correlation is fundamental to the efficiency of current video codecs for intra coding. The state-of-the-art compression standard H.264/AVC [63] is the first video coding standard that employs spatial directional prediction for intra coding. It provides a flexible prediction framework, so the coding efficiency is greatly improved over previous standards, where intra prediction was done only in the transform domain. In H.264/AVC, spatial intra prediction is performed using the surrounding available samples, which are the previously reconstructed samples available at the decoder within the same slice (Fig. 5-1). The encoder typically selects the prediction mode that minimizes the difference between the prediction and the original block to be coded. RD costs are calculated for several pre-defined directions, and the best prediction mode is selected as the one with the least RD cost. The selected mode is then coded and transmitted to the decoder.

5.1.1 Existing Works

Although the intra prediction in H.264 can exploit some spatial redundancy within a picture, the prediction relies only on pixels above or to the left of the block, which have already been encoded. The spatial distance between the pixels serving as predictions (which we call predictor pixels) and the pixels being predicted (which we call predicted pixels), especially those at the bottom right of the current block, can be large. With a large spatial distance, the correlation between pixels can be low, and the residue signals after prediction can be large, which hurts coding efficiency. In addition, extrapolation is used instead of interpolation because of the limitation of causality.

In [64], a new encoding method for the planar mode of intra 16x16 is proposed. When a macroblock is coded in planar mode, its bottom-right sample is signaled in the bitstream, the rightmost and bottom samples of the macroblock are linearly interpolated,


and the middle samples are bilinearly interpolated from the border samples. When planar mode is signaled, the same algorithm is applied to the luminance and both chrominance components separately, with individual signaling of the bottom-right samples (16x16-based operation for luminance and 8x8-based for chrominance). The planar mode does not code the residue. Although the new planar prediction method exploits some spatial correlation via the bottom-right sample, the prediction accuracy of the right and bottom pixels is still quite limited.

Figure 5-1. H.264 intra prediction modes for 8x8 and 4x4 blocks.

In [65-67] of the key technical area (KTA) [68], bidirectional intra prediction (BIP) is proposed to improve the intra coding efficiency. Two features are proposed: one is the bidirectional prediction that combines two unidirectional intra prediction modes, and the other is a change of the sub-block coding order within a macroblock. By introducing the bidirectional prediction, BIP increases the total number of prediction modes from 9 to 16. To change the sub-block coding order, it encodes the bottom-right 8x8 (or 4x4) sub-block first, before encoding the other three sub-blocks. Whether to change the coding order is an RD-cost-based decision which needs to be signaled to the decoder. Although the BIP method greatly improves the coding efficiency, the complexity of this algorithm is very high: H.264 loops over 9 modes for 8x8 blocks, while BIP has to loop over 16 x 2 = 32 modes to select the one with the minimum RD cost. BIP also requires more bits to signal the mode and the coding order.

5.1.2 Overview of Our Approach

In this chapter, we propose a new method to encode an intra block. The proposed method derives the intra prediction direction of a block or a block partition using its


surrounding pixels. Therefore, no mode selection is needed, so syntax bits are saved, and prediction directions other than the 9 pre-defined directions are allowed.

We name our algorithm TIP, an abbreviation of transition-based intra prediction. Transition here stands for the alternation between black and white pixels on a binary mask. Zeng [69] proposed a geometric-structure-based directional filtering scheme for error concealment of a missing block, where the boundary information is always available. We simplify Zeng's analysis for the transition cases, extend it for intra prediction, and incorporate it with Shiodera's new coding order framework of [65-67] to make the block boundary available. Experiments show that TIP can achieve up to 10% bitrate savings over JM.

To facilitate the development of new coding tools, we also design a novel video parameter analyzer as a side product of this work. This analyzer is independent of the bitstream syntax, so it can be widely used; we include an introduction to the analyzer in this chapter as well. The rest of this chapter is organized as follows. In Sec. 5.2, we first introduce the method to generate the transition points on block boundaries; we then present the algorithm to analyze the local geometric patterns and show the interpolation schemes of the TIP mode. We discuss the new coding order scheme supporting this new mode in Sec. 5.3. Experimental results of the transition-based intra coding are presented in Sec. 5.4. The design principles and implementation details of the analyzer are presented in Sec. 5.5. Finally, we conclude our work in Sec. 5.6.

5.2 Transition Cases and Interpolation Schemes

A sudden change of neighboring pixel values forms a transition. Huang and Algazi have shown in their work [70, 71] that within a small analysis window of an image/frame, the local geometric structure can often be characterized by a bimodal distribution. Thus, a transition from black to white (or vice versa) reveals the existence of an edge. Given the transition distribution on a block boundary, we can analyze the


local geometric patterns within the block. Intra prediction thus benefits from the local geometric patterns.

As we know, a line is defined by two points. In order to find the local geometric structure along a block boundary, the two nearest surrounding boundary layers are examined. The transition points on the inner layer indicate the location of an edge, and the transition points on the outer layer help to determine the angle of that edge. The two layers are first converted into a binary pattern, with the threshold for binarization calculated adaptively. Several methods can be used to calculate the threshold, including the simplest, the mean pixel value of the boundary layers; the average of the fourth largest and the fourth smallest values, used in [69]; and the most complicated, histogram-based segmentation. After binarization, a three-point median filter is applied to eliminate isolated black or white points.

We define the white point corresponding to a transition (black to white, or white to black) as a transition point. As shown in Fig. 5-2, the red dots on the inner layer and the dark-red dots on the outer layer indicate the transition points. Note that since the boundary is a closed loop, the number of transition points is always even.

A transition point on the inner layer implies a local edge, while its corresponding transition point on the outer layer helps to identify the slope of that edge. Depending on the number of transition points on the inner layer, the situation is classified into four cases: flat (0 transitions), 2, 4, and more than 4. A measure of directional consistency is used to resolve the ambiguity about how the transition points on the inner layer should be matched to each other to capture the local edge structure.
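A minimal sketch of the binarization and transition-point extraction just described, for a single boundary layer treated as a closed loop (the actual scheme runs this on both the inner and outer layers and then matches their transition points; the mean-value threshold is the simplest of the options listed above):

    def binarize(layer):
        """Binarize one boundary layer against its mean pixel value, then apply
        a three-point median filter to remove isolated black/white points."""
        thresh = sum(layer) / len(layer)
        bits = [1 if p >= thresh else 0 for p in layer]
        n = len(bits)
        # Median of three 0/1 values = majority vote; the loop wraps around.
        return [sorted((bits[i - 1], bits[i], bits[(i + 1) % n]))[1] for i in range(n)]

    def transition_points(bits):
        """Indices of the white points where the binary pattern flips. Since the
        boundary is a closed loop, the returned count is always even."""
        n = len(bits)
        points = []
        for i in range(n):
            if bits[i] != bits[i - 1]:
                # Keep the white side of the black<->white transition.
                points.append(i if bits[i] == 1 else (i - 1) % n)
        return points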


In the clockwise direction, for the i-th transition point on the inner layer, denote the angle of the line connecting this point and its corresponding transition point on the outer layer as θ_i (see Fig. 5-2). The angle of the line connecting the i-th and the j-th transition points on the inner layer is denoted θ_ij. An assumption for the local geometric pattern is: if there is an edge passing through transition points i and j, then θ_ij, θ_i and θ_j should be consistent. The measure of consistency is defined as

    C_ij = |θ_i − θ_ij| + |θ_j − θ_ij|.   (5-1)

5.2.1 Flat/Zero-Transition Case

We discuss the decision rules and interpolation schemes for each case in this section. When the binarization threshold is too close to the maximum and minimum values, or the local variance is relatively small, the current block is a smooth block. In this case, the projective interpolation scheme of [72] is used. In this scheme, the edge orientation is classified into eight possible directions, i.e., k · 22.5°, k = 0, 1, ..., 7. Let P_k^1 and P_k^2 represent the two sets of projection data at orientation k. The best orientation is found by minimizing the projection difference:

    k* = argmin_{0≤k≤7} diff(P_k) = argmin_{0≤k≤7} |P_k^1 − P_k^2|^2 / Dim(P_k).   (5-2)

Given the orientation, the intra predictor I(p) at pixel p is generated by bilinear interpolation along that orientation k*:

    I_p = d_2 / (d_1 + d_2) · p_1 + d_1 / (d_1 + d_2) · p_2,   (5-3)

where p_1 and p_2 are linearly interpolated from their two nearest neighboring pixels on the inner layer, and d_1, d_2 are the Euclidean distances from p to p_1 and p_2.

5.2.2 Two-Transitions Case

For two transition points, there are two conditions. The first condition is that an edge goes through the two transition points (Fig. 5-3); this is the most likely case. The other is that a streak or corner exists (Fig. 5-4). The interpolation schemes are slightly different for these two conditions. The decision is based on Eq. 5-1: if C_01 < 3π/4, then an edge exists, and the predictors are generated using bilinear interpolation along θ_01. Otherwise, a streak or corner exists and the interpolation is along θ = (θ_0 + θ_1)/2.
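A minimal sketch of the consistency measure of Eq. 5-1, the bilinear interpolation of Eq. 5-3, and the two-transitions decision rule (angles in radians; the exact threshold value should be read as illustrative):

    import math

    def consistency(theta_i, theta_j, theta_ij):
        """C_ij = |theta_i - theta_ij| + |theta_j - theta_ij| (Eq. 5-1)."""
        return abs(theta_i - theta_ij) + abs(theta_j - theta_ij)

    def bilinear_along_edge(p1, p2, d1, d2):
        """Eq. 5-3: predictor at a pixel from two samples p1, p2 on the
        interpolation line, weighted inversely by their distances d1, d2."""
        return (d2 * p1 + d1 * p2) / (d1 + d2)

    def two_transition_angle(theta_0, theta_1, theta_01):
        """Decision rule of Sec. 5.2.2 for the two-transitions case."""
        if consistency(theta_0, theta_1, theta_01) < 3 * math.pi / 4:
            return theta_01               # an edge passes through the two points
        return (theta_0 + theta_1) / 2    # a streak or corner: average the angles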


5.2.3 Four-Transitions Case

The four-transitions case is more complex than the two-transitions case. Denote the transition points, starting from the top in the clockwise direction, as 0, 1, 2, 3 on the inner layer. There are several situations, distinguished by comparing the consistency measures of the candidate pairings of these points (e.g., C_01 + C_23); the matched pairs then determine whether edges, streaks, or a combination of both pass through the block (Figs. 5-5 and 5-6).

5.3 New Encoding Order

We have discussed the algorithm to generate the intra predictors in the previous section, based on the assumption that all the surrounding pixels of a block are available. However, in the current H.264/AVC coding framework this is not the case: with the raster encoding order, only the blocks at the top or to the left of the current block are available. In order to make all the surrounding pixels available for some blocks, we incorporate the reverse coding order from [67].

We take the encoding process of the four 8x8 blocks of a macroblock as an example to illustrate how the reverse coding order works with TIP. As shown in Fig. 5-8, the left figure shows the raster coding order and the right figure shows the reverse coding order. The bottom-right (BR) block is encoded first, using the top and left neighboring macroblock pixels (the region in grey). Next, the upper-right (UR) block is encoded using the top and left neighboring macroblock pixels as well as the reconstructed BR block. Then, the bottom-left (BL) block is encoded using the top and left neighboring macroblock pixels and the BR and UR blocks. Finally, the upper-left (UL) block is coded in TIP mode, with all of its surrounding pixels available. The prediction modes of each block are shown in Fig. 5-9. The encoder chooses the coding order and the corresponding modes under the rate-distortion optimization criterion

    J(Mode) = D + λ · R,   (5-4)
    Mode* = argmin_i J(i),   (5-5)

where D is the MAD of the source and predicted pixels, which measures the distortion under a mode; R is the bitrate incurred to code the mode syntax and the residue under that mode; and λ is the Lagrange multiplier.

5.4 Experimental Results

We implement the intra prediction algorithm on KTA 1.4 and test its performance using the MPEG new call-for-proposal sequences, which contain three sets of


sequences with resolutions 1920x1080, 832x480 and 416x240. The configuration parameters of the experiments are: frames to be encoded: 50; coding structure: all intra (IIIII...); test points: QP = 22, 27, 32, 37. The results are listed in Table 5-1. The rate-distortion curve of the sequence BasketballDrive (1920x1080, 50 fps) is shown in Fig. 5-10. The results show that the proposed approach achieves an average 4.6% bitrate saving over JM.

Table 5-1. Experimental results of TIP over JM.

  Sequence (name, resolution, frame rate)   Bitrate saving (%)   PSNR gain (dB)
  BasketballDrive, 1920x1080, 50            10.30                0.31
  BQTerrace, 1920x1080, 60                   5.46                0.37
  Cactus, 1920x1080, 50                      4.21                0.18
  Kimono1, 1920x1080, 24                     6.65                0.25
  ParkScene, 1920x1080, 24                   4.04                0.18
  BasketballDrill, 832x480, 50               4.31                0.21
  BQMall, 832x480, 60                        5.47                0.35
  PartyScene, 832x480, 50                    3.49                0.28
  RaceHorses, 832x480, 30                    2.51                0.17
  BasketballPass, 416x240, 50                5.93                0.35
  BlowingBubbles, 416x240, 50                3.19                0.20
  BQSquare, 416x240, 60                      1.47                0.13
  RaceHorses, 416x240, 30                    2.82                0.20
  Average                                    4.60                0.24

5.5 A Coding Parameter Analyzer

5.5.1 Introduction

A video codec analyzer is a powerful tool for designing video coding algorithms. Instead of taking time to read the log files containing the intermediate coding parameters, such as the mode and motion vectors of every macroblock, video codec developers can gain more insight by looking at a graphic display of this information with the analyzer. This enables users to quickly and easily check conformance and encoding algorithm performance. Moreover, it can be very helpful in assisting and guiding encoding algorithm design.

A typical video codec analyzer takes the encoded bitstream as input and decodes it before visualizing the encoding parameters through a graphical user interface (GUI).


Commercial video analyzers have been developed in the past, such as the Sencore CMA1820 [73] and the Elecard Stream Analyzer [74]. These analyzers are designed for syntax analysis and for presentation of the encoding parameters in a visual form. They support the MPEG-2, H.264/AVC [63] and VC-1 video formats, along with various audio formats.

While the various analyzers offer different GUIs, they are all capable of displaying the decoded video, the high-level syntax, and encoding parameters such as QPs, motion vectors, partition modes, reference indexes, and warning or error messages if necessary. For syntax conformance checking, the syntax information is often visualized in a tree structure, which enables developers to easily verify a new encoder's compliance before deployment. In addition, an analyzer allows users to quickly identify potential problems in encoder algorithms, because well-tuned encoding parameters are often highly correlated with the video content. It becomes easier to identify inconsistencies and to confirm the effectiveness of the coding algorithm of interest by displaying the encoding parameters and the video content in the same window. For example, a salient area should choose a smaller QP than an inattentive area to improve visual quality; when we overlay the QP of each macroblock (MB) on top of the picture, we can clearly see whether the QP is adapted to the texture of that MB.

It is common to use a video bitstream analyzer when developing a new encoder or decoder product that conforms to a standard, such as MPEG-2, H.264/AVC or VC-1. The analyzer may provide picture-level coding parameters, such as the entropy coding method and the picture-level QP. Based on the user's request, the analyzer may display block-level coding parameters, such as QPs, modes and motion vectors, on the monitor. An analyzer may also check the conformance of the bitstream to a certain standard and provide detailed warning or error messages if the bitstream is not conformant.


While the aforementioned video bitstream analyzer is highly appreciated during product development after a standard is finalized, there is a strong need for a video analyzer that can be used during the standard development stage and is more resilient to syntax changes. Thus, we propose a video analyzer that does not take the coded bitstream, but takes the encoder/decoder statistical data as its input. Since the analyzer does not take a bitstream as input, it is independent of the bitstream syntax and can be used before the standard is finalized. Because of this independence from the syntax, such an analyzer can also be used for different video encoding formats. This kind of analyzer will be especially useful for developing the future video coding standard H.265.

Fig. 5-12 illustrates a typical video bitstream analyzer for H.264 [73, 74]. It requires an H.264 standard-compatible bitstream as its only input. A conformance check is performed, and a warning or error message is provided if there is any problem in the syntax. The bitstream is then partially or fully decoded to obtain coding parameters such as the motion vectors, mode types, QP values and/or residue sample values. Upon the user's request from the GUI, the analyzer displays the corresponding data on the monitor. For example, a user can request through the GUI to display the motion vectors of a frame.

As shown in Fig. 5-12, these analyzers contain an embedded video decoder that decodes the bitstream before visualizing it. They require the input bitstream to be completely compliant with the decoder syntax specification. This is appropriate for designing standard-compatible encoder products after the standard has been finalized, but it is not desirable at the standard development stage. During the standardization process, many proposals compete for adoption, so the coding tools and syntax definitions change frequently. An ideal analyzer should be robust and flexible enough to accommodate different solutions. Motivated by this, we propose a novel video coding analyzer that decouples from the embedded decoder and instead takes the coding


parameters as the input. The analyzer then only parses the coding parameters and is syntax-insensitive. Thus, compared to the existing stream analyzers, our proposed analyzer offers more flexibility and can accommodate the constantly changing syntax definitions at the development stage of a new video coding algorithm or standard.

Table 5-2. Four categories of coding parameters.

  Frame-level parameters        frame type, QP, resolution, extended MB size, QALF size
  Extended MB-level parameters  MB mode, MB-level QP, CBP, filter flag
  MB partition                  extended MB partition, filter block partition
  Sub-block level parameters    motion vector, reference index, intra prediction mode

5.5.2 The Proposed Analyzer

Fig. 5-13 illustrates the framework of the proposed analyzer. It takes the coding parameter files and the reconstructed YUV video sequence as inputs before displaying them on the GUI. Since the coding parameter files and the YUV can be generated at either the encoder or the decoder without any knowledge of the syntax definitions, the presence of a decoder becomes optional in our analyzer (dashed box in Fig. 5-13), while it is mandatory for the H.264 bitstream analyzer (Fig. 5-12).

5.5.3 Coding Parameter Files

In the following, we explain how the coding parameter files are generated at the encoder (or the decoder) before describing the functionality of each analyzer module. The coding parameter files from the encoder/decoder are taken as the inputs of our analyzer. Given the different proposals [75] for next-generation video coding standards, we use the KTA software as our base codec and consider others to be similar. The essential parameters include the coding modes, motion vectors, reference indexes, block partitioning for each coding unit (e.g., a macroblock), etc. Other parameters are new and may provide higher coding efficiency, such as the extended MB size and the quadtree-based adaptive loop filter (QALF) [76] in KTA. The exact parameter values may still differ


among competing solutions. For example, the largest extended block size is 64 in KTA, and it may be 128 on another platform. Our analyzer is flexible enough to accommodate such differences.

To design an analyzer that is easily extendable, we study the properties of all these parameters and classify them into four categories, summarized in Table 5-2. Parameters of the same category are output at the encoder and parsed at the analyzer following the same methodology, which is explained in the following subsections. When a new coding tool is introduced, we can classify the related parameters into these categories and output them in the coding parameter file in a similar way.

1) Frame-level parameters. The frame-level information includes the frame type, QP, resolution, extended MB size, QALF block size, etc. Such information is global to the frame and has a large impact on the visual quality. These data are written into a text file.

2) Extended MB-level parameters. It appears that the next-generation video coding standard will still use the block as the basic coding unit within a frame. In the case of KTA, the extended MB is the basic coding unit, and there are a few coding parameters at the extended-MB level. For example, the INTRA/INTER coding mode is distinct for each MB (in B frames it supports sub-MB direct modes, so the minimum unit is an 8x8 block). We store the coding mode index in the original scanning order; it can be easily visualized for every MB or sub-MB, either through different colors or by displaying the indexes. Other parameters in this category include the MB-level QP, the coded block pattern (CBP), the block-adaptive filter flag, etc.

3) Partition pattern within an extended MB. In KTA, the partition pattern within each superblock can be different and needs to be input to the coding analyzer. Since the number of possible partitions is very large and it varies with the superblock size, it is not realistic to represent the partition of a


superblock by an index. Instead, we use a hierarchical representation: let '1' stand for the current block having four equal-sized sub-partitions and '0' for the current block not being further partitioned. Assuming 64x64 extended MBs in KTA, the partition in Fig. 5-14(a) is represented by the binary string 10000000, and the partition in Fig. 5-14(b) by its corresponding string following the same rule. Another example is the QALF, which uses a quadtree-based partition over each MB (the size of which needs to be signaled at the frame level), and whose partition pattern needs to be signaled in a similar hierarchical fashion. In our work, the data in this category are saved in a text file; each line of the text file represents the partition parameters of one extended block.

4) Sub-block level parameters. After the partition pattern within an extended MB is signaled, the subpartition-level coding parameters (e.g., the motion vectors and reference indexes of each sub-block) have to be represented. Take the motion vectors in H.264 and KTA as an example: each sub-block has two sets of motion vectors, the list-0 and the list-1 motion vectors. The vectors are stored for each block and can be easily visualized. Other parameters that belong to this category include the INTRA prediction mode. In our work, the data in this category are saved block by block in a binary file.

5.5.4 Analyzer Modules

As shown in Fig. 5-13, our analyzer consists of two modules: a data parser and a GUI. In this section, we explain the functionality of each module.

5.5.4.1 Data parser

The data parser reads the coding parameter data files and the YUV sequence. As the coding parameter data fall into the different categories defined in Section 5.5.3, we have corresponding parsing functions that extract the coding parameters from the formatted data files, which are then organized into a matrix or a bitmap ready to be displayed. For instance, the partition patterns of the extended MBs are organized into a logical bitmap of the same size as the video frame; the lines where partition


happens are indicated by the logic value TRUE (otherwise, FALSE is indicated). For the coding mode parameter, different modes correspond to different intensity values in the bitmap.

Another functionality of the data parser is to compute the distributions/statistics of the coding parameter data. For instance, it can collect statistics such as the frequency with which each mode is selected and the percentage of different MB partition patterns within a frame. Such knowledge is greatly appreciated during algorithm design. Take the mode distribution as an example: the user can design a more efficient codebook based on the mode distribution, with the frequently selected modes using short codewords to save bitrate.

5.5.4.2 Graphical user interface

The GUI displays graphical objects based on the data extracted by the data parser. It also generates statistical plots and offers a set of interactive tools for users. In our work, the GUI presents the following information:

- the coding parameters in Table 5-2
- the statistical distribution of the coding parameters
- the MB grid
- the Y/U/V channels
- the frame preview

The display of each kind of information can be switched on and off independently, and the information can be overlaid to form a comprehensive display, which can be continuously played as a video. Users can access an arbitrary frame by entering the frame number, get detailed data by clicking on a particular MB, or take a snapshot of the analyzed results for the record. All these interactive functionalities make the proposed analyzer a convenient and effective tool in assisting and guiding the design of video coding algorithms.
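As an illustration of the parsing step, a minimal sketch that decodes a hierarchical partition string and collects a per-frame mode histogram; the five-symbol example string and the mode labels are hypothetical, and the dissertation's own files may lay out the bits differently:

    from collections import Counter

    def parse_partition(bits, pos=0):
        """Decode a partition string in preorder: '1' means the current block
        splits into four equal sub-blocks, '0' means it is a leaf. Returns the
        quadtree as nested lists plus the next read position."""
        if bits[pos] == "0":
            return "leaf", pos + 1
        pos += 1
        children = []
        for _ in range(4):
            child, pos = parse_partition(bits, pos)
            children.append(child)
        return children, pos

    def mode_histogram(mode_indices):
        """Frequency of each coding mode within a frame, as computed by the
        data parser for the statistics display."""
        counts = Counter(mode_indices)
        total = len(mode_indices)
        return {mode: count / total for mode, count in counts.items()}

    # Hypothetical inputs: one partition string and the per-MB modes of a frame.
    tree, _ = parse_partition("10000")  # one split into four leaves
    print(tree)                          # ['leaf', 'leaf', 'leaf', 'leaf']
    print(mode_histogram(["INTER", "SKIP", "INTER", "INTRA", "INTER"]))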


5.5.5 Examples

We output the coding parameter files and the decoded YUV from the KTA software and perform the analysis with the proposed analyzer. Note that our analyzer can easily be extended to analyze other competing video coding software. In this section, we illustrate how to use our analyzer through a few screenshots.

5.5.5.1 Main graphical user interface

Fig. 5-15 shows a snapshot of the main GUI window, an example of the KTA mode/partition overlay on the YUV. The control bar at the top provides easy access to all functionalities. The preview window at the right not only allows easy navigation among frames but also gives a simple preview of the neighboring pictures. The text window at the bottom shows the frame-level information together with the mode distribution. When a block is selected, its pertinent motion information is detailed in the text window.

5.5.5.2 Parameter display

Fig. 5-16 shows an example of the KTA QALF flag; note that the overlay transparency can be adjusted from opaque to transparent. Fig. 5-17 illustrates the mode distribution plot generated by our analyzer. The distribution is helpful in designing the codewords: for example, if a mode is frequently selected, a shorter codeword should be assigned. Fig. 5-18 presents an example of the QP variation among MBs, and Fig. 5-19 shows the block partitioning and coding mode overlay.

5.6 Summary

In this chapter, we have proposed a novel intra prediction mode which utilizes the binary transition points on block boundaries to explore the local geometric pattern. The TIP mode derives the intra prediction direction and performs interpolation without transmitting syntax bits for this mode. A reverse coding order is introduced to make the surrounding pixels available for TIP. Experimental results show promising bitrate savings and PSNR gains over JM. However, constrained by the availability requirement of


surrounding pixels, the TIP mode is only allowed on a subset of the blocks, which is its major limitation.

As a side product, we proposed a video coding analyzer that excludes an embedded decoder and does not rely on syntax definitions. As MPEG and VCEG are now actively developing the next-generation video coding standard, a comprehensive video analyzer that visualizes the coding parameters is highly attractive to developers. The analyzer can display the video as well as the coding parameters, including motion vectors, modes, partitions, filter regions, QPs, etc. It can also calculate and display the statistics of the input parameters. By integrating the video content and the coding parameters into one window, the analyzer provides a comprehensive review of the performance of the coding tools. Moreover, since it is insensitive to syntax elements, the analyzer can easily be extended to other video coding software. It is therefore a convenient and powerful tool for the development of the next-generation video coding standards. Demos of the proposed analyzer are provided at: http://plaza.ufl.edu/lvtaoran/analyzer.htm


Figure 5-2. Transition points on the inner layer and the outer layer.
Figure 5-3. Two-transitions case: an edge goes through. Left: original. Middle: transition points on the boundary, with the original pixel values in the block showing the local geometric patterns. Right: original boundary with interpolated block pixels (within the blue rectangle).
Figure 5-4. Two-transitions case: a streak or corner. Left: original. Middle: transition points on the boundary, with the original pixel values in the block showing the local geometric patterns. Right: original boundary with interpolated block pixels (within the blue rectangle).


Figure 5-5. Four-transitions case: transition point 0 is connected to point 3. Left: original. Middle: transition points on the boundary, with the original pixel values in the block showing the local geometric patterns. Right: original boundary with interpolated block pixels (within the blue rectangle).
Figure 5-6. Four-transitions case: an edge and a streak. Left: original. Middle: transition points on the boundary, with the original pixel values in the block showing the local geometric patterns. Right: original boundary with interpolated block pixels (within the blue rectangle).
Figure 5-7. Distribution of transition cases of 8x8 blocks for BasketballDrill (832x480).


Figure 5-8. Raster coding order vs. reverse coding order. Left: raster coding order. Right: reverse coding order.
Figure 5-9. Intra prediction modes for the BR, UR, BL and UL blocks with raster and reverse coding order.
Figure 5-10. Rate-distortion curve for TIP vs. JM.


Figure 5-11. The motivation for using an analyzer instead of checking log files.
Figure 5-12. Workflow of a conventional H.264 bitstream analyzer.
Figure 5-13. Workflow of the proposed video coding analyzer.
Figure 5-14. Block partitions can be represented by binary strings: (a) the string 10000000 and (b) a second partition and its string.


Figure 5-15. An example GUI screenshot.
Figure 5-16. An example of the QALF filter information.


Figure 5-17. An example mode distribution plot.


Figure 5-18. An example QP variation plot.
Figure 5-19. An example partitioning and mode overlay plot.


CHAPTER 6
CONCLUSION AND FUTURE WORK

In this dissertation, we have presented our work and innovations on three challenging problems of digital video applications: video compression, video summarization and video adaptation. We studied the modeling of video content and proposed a nonlinear spatio-temporal saliency map for human attention modeling. With this attention model, we are able to establish content-aware techniques for solving these problems.

Video adaptation is conceptually a compression in the resolution domain. The goal is to fit an existing high-resolution video to an arbitrary low-resolution display. With the saliency maps, we proposed content-aware information loss metrics and formulated the content mapping problem as a shortest-path problem, which is then solved by dynamic programming. The solution corresponds to the optimal parameters (x, y, s) of a cropping-and-scaling window over the whole video sequence, where (x, y) is the location of the window and s is the scaling factor. The cost function of the shortest-path problem has two terms, standing for the intra-frame considerations and the inter-frame considerations, respectively. The weighting factor balancing the two terms is currently adjusted according to the user's preference; finding its optimal choice is one of our future directions. Another piece of future work is to refine the scientific evaluation methodology so as to quantify and compare the performance of algorithms that generate subjective results.

Video summarization can be regarded as a compression in the time domain, where the compression ratio is called the skimming ratio. The objective is to generate a skimmed version of the original video at a high compression ratio while preserving as much content information as possible. We proposed a novel hierarchical framework to progressively generate the summarized video. This framework takes into account the concepts of completeness, saliency, smoothness and ratio flexibility. The hierarchical


framework is shown to be effective in subjective tests. A possible future direction of this work is to exploit the semantic implications in concept primitives with supervised learning, so as to generate higher cognitive-level shot groups for a more sophisticated video structure representation.

For video compression, the goal is to minimize the bitrate for transmission or storage while maximizing the video quality. We introduced a two-level resource allocation framework and conducted bit-allocation with respect to the regions-of-interest in intra and inter frames. Future work may include using rate-quantization modeling and residue distortion estimation to achieve more accurate rate control. For the proposed new intra prediction tool, transition-based intra coding, our future work is to incorporate the current intra coding scheme into our content-aware framework.


REFERENCES

[1] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.

[2] X. Hou and L. Zhang, "Saliency detection: A spectral residual approach," in Proceedings of IEEE Computer Vision and Pattern Recognition, 2007.

[3] C. Guo, Q. Ma, and L. Zhang, "Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform," in Proceedings of IEEE Computer Vision and Pattern Recognition, 2008.

[4] F. Liu and M. Gleicher, "Video retargeting: automating pan and scan," in Proceedings of ACM Multimedia, 2006.

[5] S. Avidan and A. Shamir, "Seam carving for content-aware image resizing," ACM Transactions on Graphics, 2007.

[6] L. Wolf, M. Guttmann, and D. Cohen-Or, "Non-homogeneous content-driven video-retargeting," in Proceedings of IEEE International Conference on Computer Vision, 2007.

[7] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani, "Summarizing visual data using bidirectional similarity," in Proceedings of IEEE Computer Vision and Pattern Recognition, 2008.

[8] G. Hua, C. Zhang, Z. Liu, Z. Zhang, and Y. Shan, "Efficient scale-space spatiotemporal saliency tracking for distortion-free video retargeting," Computer Vision, pp. 182, 2010.

[9] T. Deselaers, P. Dreuw, and H. Ney, "Pan, zoom, scan: time-coherent, trained automatic video cropping," in Proceedings of IEEE Computer Vision and Pattern Recognition, 2008.

[10] L. Taycher, J. W. Fisher III, and T. Darrell, "Combining object and feature dynamics in probabilistic tracking," in Proceedings of IEEE Computer Vision and Pattern Recognition, 2005.

[11] J. Shi and C. Tomasi, "Good features to track," in Proceedings of IEEE Computer Vision and Pattern Recognition, 1994.

[12] S. de Brouwer, M. Missal, G. Barnes, and P. Lefevre, "Quantitative analysis of catch-up saccades during sustained pursuit," Journal of Neurophysiology, vol. 87, no. 4, pp. 1772, 2002.

[13] J. S. Boreczky and L. A. Rowe, "Comparison of video shot boundary detection techniques," Journal of Electronic Imaging, 1996.


[14] N. V. Patel and I. K. Sethi, "Video shot detection and characterization for video databases," Pattern Recognition, 1997.

[15] S. J. Sangwine and T. A. Ell, "Hypercomplex Fourier transforms of color images," in Proceedings of International Conference on Image Processing, 2001.

[16] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection, Wiley-IEEE, 2003.

[17] Y. Li, T. Zhang, and D. Tretter, "An overview of video abstraction techniques," HP Laboratories Palo Alto, Tech. Report No. HPL-2001-191, 2001.

[18] H. S. Chang, S. Sull, and S. U. Lee, "Efficient video indexing scheme for content-based retrieval," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 8, pp. 1269, 1999.

[19] D. DeMenthon, V. Kobla, and D. Doermann, "Video summarization by curve simplification," in Proceedings of ACM Multimedia, ACM, 1998.

[20] A. Hanjalic and H. J. Zhang, "An integrated scheme for automated video abstraction based on unsupervised cluster-validity analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 8, pp. 1280, 1999.

[21] Y. Gong and X. Liu, "Video summarization using singular value decomposition," in Proceedings of IEEE Computer Vision and Pattern Recognition, 2000.

[22] M. Mills, J. Cohen, and Y. Y. Wong, "A magnifier tool for video data," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, 1992.

[23] Y. Taniguchi, A. Akutsu, Y. Tonomura, and H. Hamada, "An intuitive and efficient access interface to real-time incoming video based on automatic indexing," in Proceedings of ACM Multimedia, ACM, 1995.

[24] K. Otsuji, Y. Tonomura, and Y. Ohba, "Video browsing using brightness data," in Proceedings of SPIE, 1991.

[25] N. Omoigui, L. He, A. Gupta, J. Grudin, and E. Sanocki, "Time-compression: systems concerns, usage, and benefits," in Proceedings of ACM Conference on Computer Human Interaction, ACM, 1999.

[26] A. Amir, D. Ponceleon, B. Blanchard, D. Petkovic, S. Srinivasan, and G. Cohen, "Using audio time scale modification for video browsing," in Proceedings of Hawaii International Conference on System Sciences, 2000.

[27] G. W. Heiman, R. J. Leo, and G. Leighbody, "Word intelligibility decrements and the comprehension of time-compressed speech," Perception and Psychophysics, vol. 40, no. 6, pp. 407, 1986.


[28] Y. Li, S. H. Lee, C. H. Yeh, and C. C. J. Kuo, "Techniques for movie content analysis and skimming," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 79, 2006.

[29] D. DeMenthon and D. Doermann, "Video summarization by curve simplification," in Proceedings of ACM Multimedia, 1998, pp. 211.

[30] Y. Gong and X. Liu, "Video summarization using singular value decomposition," in Proceedings of IEEE Computer Vision and Pattern Recognition, 2000, pp. 174.

[31] M. Padmavathi, R. Yong, et al., "Keyframe-based video summarization using Delaunay clustering," International Journal on Digital Libraries, vol. 6, no. 2, pp. 219, 2006.

[32] Z. Cernekova and I. Pitas, "Information theory-based shot cut/fade detection and video summarization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 1, pp. 82, 2005.

[33] Y. Peng and C.-W. Ngo, "Clip-based similarity measure for query-dependent clip retrieval and video summarization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 5, pp. 612, 2006.

[34] Z. Li, G. M. Schuster, et al., "MINMAX optimal video summarization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, pp. 1245, 2005.

[35] J. Nam and A. H. Tewfik, "Dynamic video summarization and visualization," in Proceedings of ACM Multimedia, 1999, pp. 53.

[36] N. Vasconcelos and A. Lippman, "A Bayesian framework for semantic content characterization," in Proceedings of IEEE Computer Vision and Pattern Recognition, 1998, pp. 566.

[37] S. Lu, M. R. Lyu, and I. King, "Semantic video summarization using mutual reinforcement principle and shot arrangement patterns," in Multimedia Modelling Conference, 2005, pp. 60.

[38] D. M. Russell, "A design pattern-based video summarization technique: moving from low-level signals to high-level structure," in International Conference on System Sciences, 2000.

[39] B. W. Chen, J. C. Wang, and J. F. Wang, "A novel video summarization based on mining the story-structure and semantic relations among concept entities," IEEE Transactions on Multimedia, vol. 11, no. 2, pp. 295, 2009.

[40] G. Evangelopoulos, A. Zlatintsi, G. Skoumas, K. Rapantzikos, A. Potamianos, P. Maragos, and Y. Avrithis, "Video event detection and summarization using audio, visual and text saliency," in Proceedings of International Conference on Acoustics, Speech, and Signal Processing, IEEE Computer Society, 2009.


[41] Y. F. Ma, L. Lu, H. J. Zhang, and M. Li, "A user attention model for video summarization," in Proceedings of ACM Multimedia, ACM, New York, NY, USA, 2002.

[42] Y. F. Ma, X. S. Hua, L. Lu, and H. J. Zhang, "A generic framework of user attention model and its application in video summarization," IEEE Transactions on Multimedia, vol. 7, no. 5, pp. 907, 2005.

[43] C. W. Ngo, Y. F. Ma, and H. J. Zhang, "Video summarization and scene detection by graph modeling," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 2, pp. 296, 2005.

[44] G. Peeters, "A large set of audio features for sound description (similarity and classification) in the CUIDADO project," CUIDADO IST Project Report, pp. 1, 2004.

[45] D. Lewis, "Naive (Bayes) at forty: The independence assumption in information retrieval," Machine Learning, pp. 4, 1998.

[46] D. G. Lowe, "Object recognition from local scale-invariant features," in Proceedings of the IEEE International Conference on Computer Vision, 1999.

[47] J. Kim, H. Chang, K. Kang, M. Kim, and H. Kim, "Summarization of news video and its description for content-based access," International Journal on Image System and Technology, vol. 13, no. 5, pp. 267, 2004.

[48] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397, 1993.

[49] S. Chu, S. Narayanan, and J. Kuo, "Environmental sound recognition using MP-based features," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2008.

[50] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395, 2007.

[51] T. Syeda-Mahmood and D. Ponceleon, "Learning video browsing behavior and its application in the generation of video previews," in Proceedings of ACM Multimedia, ACM, 2001.

[52] Y. Gao and Q. Dai, "Shot-based similarity measure for content-based video summarization," in Proceedings of International Conference on Image Processing, 2008.

[53] "Muscle movie database v3.0," http://poseidon.csd.auth.gr/EN/MUSCLE_moviedb, 2007.


[54] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560, 2003.

[55] L. S. Karlsson and M. Sjostrom, "Improved ROI video coding using variable Gaussian pre-filters and variance in intensity," in Proceedings of International Conference on Image Processing, 2005, vol. 2.

[56] C. W. Lin, Y. C. Chen, and M. T. Sun, "Dynamic region of interest transcoding for multipoint video conferencing," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 10, pp. 982, 2003.

[57] M. J. Chen, M. C. Chi, C. T. Hsu, and J. W. Chen, "ROI video coding based on H.263+ with robust skin-color detection technique," in Proceedings of IEEE International Conference on Consumer Electronics, 2003.

[58] H. Wang and S. Kwong, "Rate-distortion optimization of rate control for H.264 with adaptive initial quantization parameter determination," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 1, pp. 140, 2008.

[59] D. Chai, K. N. Ngan, and A. Bouzerdoum, "Foreground/background bit allocation for region-of-interest coding," in Proceedings of IEEE International Conference on Image Processing, 2000, vol. 2, pp. 923.

[60] Y. Liu, Z. G. Li, and Y. C. Soh, "Region-of-interest based resource allocation for conversational video communication of H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 1, pp. 134, 2008.

[61] L. A. Vese and T. F. Chan, "A multiphase level set framework for image segmentation using the Mumford and Shah model," International Journal of Computer Vision, vol. 50, no. 3, pp. 271, 2002.

[62] L. Itti and C. Koch, "A saliency-based search mechanism for overt and covert shifts of visual attention," Vision Research, vol. 40, no. 10-12, pp. 1489, 2000.

[63] Joint Video Team, "Draft of version 4 of H.264/AVC," January 2005.

[64] JCT-VC, "Description of video coding technology proposal by Tandberg, Nokia, Ericsson," April 2010.

[65] T. Shiodera, A. Tanizawa, and T. Chujoh, "Bidirectional intra prediction (VCEG-AE14)," in 31st VCEG Meeting, 2007.

[66] A. Tanizawa, T. Shiodera, and T. Chujoh, "Simulation results of bidirectional intra prediction on KTA software version 1.3 (VCEG-AF06)," in 32nd VCEG Meeting, 2007.

[67] T. Chujoh, T. Shiodera, and A. Tanizawa, "Improvement of bidirectional intra prediction (VCEG-AG08)," in 33rd VCEG Meeting, 2007.


[68] "Key technology area reference software," http://iphome.hhi.de/suehring/tml/download/KTA/, 2002.

[69] W. Zeng and B. Liu, "Geometric-structure-based error concealment with novel applications in block-based low-bit-rate coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 4, pp. 648, 1999.

[70] T. Huang, G. Yang, and G. Tang, "A fast two-dimensional median filtering algorithm," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 1, pp. 13, 1979.

[71] V. R. Algazi, G. E. Ford, and R. Potharlanka, "Directional interpolation of images based on visual properties and rank order filtering," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 1991, pp. 3005.

[72] K. H. Jung, J. H. Chang, and C. W. Lee, "Error concealment technique using projection data for block-based image coding," in Proceedings of SPIE, 1994, vol. 2308, p. 1466.

[73] Sencore, "CMA1820 compressed media analyzer software," 2008.

[74] Elecard, "Elecard stream analyzer v2.0 user guide," 2008.

[75] ISO/IEC, "Draft call for evidence on high-performance video coding (HVC)," MPEG document N10363, January 2009.

[76] Toshiba Corporation, "Quadtree-based adaptive loop filter," January 2009.


BIOGRAPHICAL SKETCH

Taoran Lu received her bachelor's degree in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2006. She received her master's degree from the Department of Electrical and Computer Engineering of the University of Florida in 2007. In December 2010, she received her Ph.D. from the University of Florida. During the summers of 2009 and 2010, she worked as a research intern at Thomson Corporate Research (Technicolor Research and Innovation) in Princeton, New Jersey. Her research interests include advanced video coding (H.264, KTA), video content modeling, video streaming, video processing for video summarization and adaptation, region-of-interest detection, and saliency analysis.