Citation
Efficient buffering control for a software-only, high-level, high-profile, MPEG-2 decoder

Material Information

Title:
Efficient buffering control for a software-only, high-level, high-profile, MPEG-2 decoder
Creator:
He, Yishu
Place of Publication:
[Gainesville, Fla.]
Publisher:
University of Florida
Publication Date:
Language:
English

Subjects

Subjects / Keywords:
Buffer storage ( jstor )
Computer memory ( jstor )
Frame rate ( jstor )
High resolution ( jstor )
Maxims ( jstor )
Pressure reduction ( jstor )
Slavery ( jstor )
Spatial resolution ( jstor )
Streaming ( jstor )
Tennis ( jstor )
Computer and Information Science and Engineering thesis, M.S ( lcsh )
Dissertations, Academic -- Computer and Information Science and Engineering -- UF ( lcsh )
MPEG (Video coding standard) ( lcsh )
Video compression ( lcsh )
Genre:
government publication (state, provincial, territorial, dependent) ( marcgt )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Abstract:
ABSTRACT: There are some common video resolutions available today. Typical ones include: QCIF (352*240), CIF (704*480), (1024*1024) and (1408*960). We believe that a high-quality MPEG-2 software decoder should support good scalability performance across different video resolutions. By using our parallel software-only MPEG-2 decoder, scale-down performance has been proved effective for low-level and main-level MPEG-2 streaming videos. However, the challenge arises when attempts are made to support (1024*1024) and (1404*960) high-level MPEG-2 video. It is found that the existing scheme suffers significant performance degradation when decoding high-level MPEG-2 video with full system configuration. The origin of the problem is traced to the excessive memory usage of the original design of the parallel scheme. Therefore we propose an efficient buffer management mechanism such that the memory requirement can be reduced by 50%. This is approached in two steps: first we use an ST scheme to minimize the transmission buffer in a slave node by allowing dynamic sharing between frames in one GOP; then we further reduce the buffer space by a dynamic on-demand allocation on the slave side. By solving the memory-shortage bottleneck, we have proven that scale-up performance can be successfully achieved with 13 and 14 slave nodes for the high-resolution (1024*1024) and (1404*960) video formats.
Thesis:
Thesis (M.S.)--University of Florida, 2002.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
General Note:
Title from title page of source document.
General Note:
Includes vita.
Statement of Responsibility:
by Yishu He.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright He, Yishu. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
12/1/2003
Resource Identifier:
029834278 ( ALEPH )
53465515 ( OCLC )


Full Text

PAGE 1

EFFICIENT BUFFERING CONTROL FOR A SOFTWARE-ONLY, HIGH-LEVEL, HIGH-PROFILE, MPEG-2 DECODER

By
YISHU HE

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA
2002

PAGE 2

Copyright 2002 by Yishu He

PAGE 3

This is dedicated to my parents.

PAGE 4

ACKNOWLEDGMENTS

I would like to express my gratitude to Dr. Jonathan C Liu and Ju Wang for his constant encouragement and the opportunity to work on this research project. His innovative ideas and encouragement have made this work interesting and challenging. I would also like to thank Dr. Randy Chow and Dr. Jih-Kwon Peir for agreeing to serve on my committee. I thank my parents for their inspiration and support throughout my academic career. I sincerely thank all the people who have helped and supported me directly and indirectly in the course of this thesis work.

PAGE 5

TABLE OF CONTENTS

                                                        page
ACKNOWLEDGMENTS ........................................ iv
ABSTRACT ............................................... vi
CHAPTER
1 INTRODUCTION .......................................... 1
  1.1 Define Scalability ................................ 1
  1.2 State of the Art .................................. 2
  1.3 Experimental Result ............................... 3
  1.4 Revised Buffer Scheme ............................. 4
  1.5 Organization of the Paper ......................... 6
2 RELATED STUDY ......................................... 7
3 PROBLEM NATURE AND ANALYSIS .......................... 10
  3.1 Memory Usage Analysis ............................ 14
  3.2 Impact of Memory Shortage ........................ 16
4 EFFICIENT BUFFERING SCHEMES .......................... 21
  4.1 The First Solution ............................... 21
  4.2 Implementation and Experiment Result ............. 24
  4.3 Further Optimization in the Slave Nodes .......... 29
5 CONCLUSION ........................................... 34
APPENDIX ............................................... 35
REFERENCES ............................................. 44
BIOGRAPHICAL SKETCH .................................... 46

PAGE 6

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

EFFICIENT BUFFERING CONTROL FOR A SOFTWARE-ONLY, HIGH-LEVEL, HIGH-PROFILE, MPEG-2 DECODER

By
Yishu He

December 2002

Chair: Jonathan C.L. Liu
Major Department: Computer and Information Science and Engineering

There are some common video resolutions available today. Typical ones include QCIF (352*240), CIF (704*480), (1024*1024) and (1408*960). We believe that a high-quality MPEG-2 software decoder should support good scalability performance across different video resolutions. By using our parallel software-only MPEG-2 decoder, scale-down performance has been proved effective for low-level and main-level MPEG-2 streaming videos. However, the challenge arises when attempts are made to support (1024*1024) and (1404*960) high-level MPEG-2 video. It is found that the existing scheme suffers significant performance degradation when decoding high-level MPEG-2 video with full system configuration. The origin of the problem is traced to the excessive memory usage of the original design of the parallel scheme. Therefore we propose an efficient buffer management mechanism such that the memory requirement can be reduced by 50%. This is approached in two steps: first we use an ST scheme to minimize the transmission buffer in a slave node by allowing dynamic sharing between frames in one GOP; then we further reduce the buffer space by a dynamic on-demand

PAGE 7

allocation on the slave side. By solving the memory-shortage bottleneck, we have proven that scale-up performance can be successfully achieved with 13 and 14 slave nodes for the high-resolution (1024*1024) and (1404*960) video formats.

PAGE 8

CHAPTER 1
INTRODUCTION

1.1 Define Scalability

Many streaming video formats are used intensively in today's society. Commercial streaming formats such as RealPlayer and Windows Media Player are common tools to display low-resolution (e.g., 352*240 without dithering) video over the Internet. However, due to the rapid deployment of high-speed networks (e.g., ATM networks) and "first-mile" technology (e.g., cable modem and digital subscriber lines), users have the capability of receiving high-quality and high-resolution streaming videos on their desktop or TV sets. Beyond RealPlayer and Media Player, the great success of DVD titles has brought us the widely accepted MPEG-2 video formats.

According to MPEG-2 specifications, a wide range of video resolutions is possible. In reality there are usually five formats that are widely used: 170*120, QCIF (352*240), CIF (704*480), 1024*1024 and 1408*960. The first three formats represent the most dominant applications: video conferencing, low quality streaming video and broadcast level video (DVD). The latter two video formats are projected to be used for future HDTV and high-end video applications.

Therefore, in the near future, we envision that a high-quality MPEG-2 software decoder needs to support good scalability performance across different video resolutions. For example, video resolutions should be supported from medium resolutions (e.g., 704*480) to large resolutions (e.g., 1404*960) with guaranteed display quality. Smooth display rates with more than 24 frames per second (fps) should be guaranteed. Ideally, an MPEG-2 software decoder should also automatically choose the best resolution/quality to adapt to the user's

PAGE 9

environment settings. For instance, if machines are equipped with sufficient CPU power, we should deliver as high a resolution as possible.

MPEG-2 specifications do provide some recommendations for "scalable" coding where video can be reconstructed on the basis of user demand to suit different application scenarios, e.g., for different available communication bandwidths. This is implemented by multi-layer encoding/decoding with the assumption that the video resolution size remains the same. However, the enlargement of the video resolution usually generates a need for more refined quality associated with the frame size. Therefore, MPEG-2 defined a family of video formats using a profile/level combination. The concept of profile in MPEG-2 can be roughly interpreted in terms of pixel precision, and the level corresponds to spatial resolution. In addition to the spatial dimension, video quality can be improved by representing the pixel more precisely. This can be accomplished by allocating more bits for each pixel, i.e., using more color and/or more precise quantization. In this paper, we mainly target high-profile high-level MPEG-2 formats.

The recommendations of MPEG-2 scalable features can potentially increase the decoding complexity. Among these, increasing the size of video frame resolution seems to have the most direct influence on the decoding performance because more macroblocks need to be decoded. We believe the increase in spatial resolution is probably the most effective way to support better video quality. Thus, as part of the long-term investigation, the goal of this paper is to examine how spatial scalability can be supported with different video resolutions.

1.2 State of the Art

We have been researching a generic, portable, pure-software MPEG-2 encoder/decoder for the last few years. We believe a pure-software based MPEG-2

PAGE 10

decoding is still desirable in many situations for its flexibility and scalability. However, software-only MPEG-2 decoding is very computation intensive, especially for high-level video formats. For example, a high-profile (1440*1152) base MPEG-2 video contains 4 times as many macroblocks as the main level DVD video, roughly corresponding to 4 times more decoding computation. With an enhancement layer of the same spatial resolution (SNR scalability), the complexity of the decoding process will be doubled. Thus we expect an 8-fold increase in computation requirements for such video formats. This computing gap will not be covered in the near future according to the current microprocessor evolution trend.

To achieve high performance software MPEG-2 decoding, we designed a parallel MPEG-2 decoder that can run in both cluster and multi-processor environments [1, 2]. With a high-speed network, a parallel decoder can produce high decoding frame rates by distributing the decoding workload over several computing nodes. The pipeline scheme [1] takes a Master/Slave architecture where the master is in charge of data distribution/collection and the slave nodes perform MPEG-2 decompression algorithms for the assigned task. The master also maintains the smooth running of the pipeline to assure the highest overall system throughput.

1.3 Experimental Result

The results were very promising, with 30-fps playback achieved with 4 Pentium 400MHz desktop computers, and a 72-fps HDTV frame rate achieved in a SUN SMP environment. However, only one video resolution was investigated with a main level MPEG-2 format (i.e., 704*480). It remained unclear how our parallel MPEG-2 decoder would support larger (or smaller) video resolutions. Could we expect the same software to be used for (1404*960) without any adaptation? How many slave nodes are required to deliver a 24-fps high-level high-profile MPEG-2 video with (1404*960) resolution?
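The macroblock comparison made in Section 1.2 can be checked with a short calculation. The sketch below is illustrative only: it assumes 16x16-pixel macroblocks, the function name is hypothetical, and the 720*576 main-level frame size used as the counterpart of 1440*1152 is an assumption that the text does not state.

```python
import math

MACROBLOCK_SIZE = 16  # a macroblock covers 16x16 luma samples


def macroblocks_per_frame(width, height):
    """Number of macroblocks needed to cover one frame (partial blocks round up)."""
    return math.ceil(width / MACROBLOCK_SIZE) * math.ceil(height / MACROBLOCK_SIZE)


# Assumption: 720*576 is taken as the main-level size matching the 1440*1152 example.
ratio_spec = macroblocks_per_frame(1440, 1152) / macroblocks_per_frame(720, 576)

# The largest and the main-level resolutions tested in this thesis.
ratio_tested = macroblocks_per_frame(1404, 960) / macroblocks_per_frame(704, 480)

# An SNR enhancement layer of the same spatial resolution doubles the decoding work.
with_snr_layer = 2 * ratio_spec

print(ratio_spec, ratio_tested, with_snr_layer)
```

Under these assumptions both ratios come out to 4, and to 8 once the SNR enhancement layer is included, which agrees with the estimates quoted in the text.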

PAGE 11

By producing several versions of the same video content, we are able to generate different video resolutions from (352*240) to (1404*960). The first two resolutions are (352*240) and (704*480), which roughly correspond to the low-level QCIF and main-level CIF formats of the MPEG-2 standards. Our parallel MPEG-2 decoder performs well on these streaming videos. Three representative video contents with different characteristics are decompressed at more than 200 fps for (352*240) and 75 fps for (704*480) using 14 slave nodes. Therefore, the performance results indicate that our parallel MPEG-2 decoder does scale down well for low-level and main-level MPEG-2 streaming videos.

The challenge arises when attempts are made to support (1024*1024) and (1404*960) high-level MPEG-2 video. We have observed a severe performance degradation (e.g., dropping from 18 or 20 fps to 2.5 fps) when more than 10 slave nodes are used. It is not trivial to us why this behavior happens, and we are perhaps one of the first groups to discover this strange system behavior.

By analyzing the run time system resource utilization, we found that the system memory is quickly exhausted when increasing the number of slave nodes. When decoding a video file with high spatial resolution, the increase of memory usage eventually becomes a system bottleneck. We observed that in the saturated state, the operating system spends most of its CPU time swapping pages between main memory and secondary storage. The analysis of the original data pipeline scheme also indicated that the problem will become more severe when a larger GOP size is used with large video frame sizes. Therefore, in addition to the network bottleneck already found by Wang and Liu [1], we discovered that lack of memory can be another system bottleneck.

1.4 Revised Buffer Scheme

To address the challenge and obtain highly scalable decoding for high resolution video, we proposed and implemented two revised memory management approaches

PAGE 12

to reduce the buffer requirement. The first is the Minimum Transmission Buffer in Slave Node (ST scheme). In our original design, the slave nodes allocate the buffer for the whole GOP. When the number of nodes grows, a lot of memory is needed. To reduce the memory requirement, we reduce the transmission buffer size of the slave nodes to three frames. We can see the benefit of the ST scheme from the decreased page faults of the slave nodes and the increased decoding frame rate.

In the ST scheme, we use a 3-frame transmission buffer for each slave node. For scalable MPEG-2 streams, each L-layer sub-stream requires the same amount of buffer space as that of the base layer. It can be expected that memory will become a bottleneck again. To further reduce the buffer requirement in slave nodes, we proposed a dynamic buffer requirement. It is obvious that only the B-frame needs the whole three-frame buffer. So if we allocate the buffers according to the actual picture need, the effective number of frames per buffer will be only 85% of the 3-frame buffer. Furthermore, dynamic buffer allocation can be applied inside the decoding of each frame. The experimental results show that the buffer space is significantly reduced, and we observed well-scaled decoding performance for the high resolution MPEG-2 video.

With the revised buffer schemes, our parallel decoder is able to deliver high quality scalable decoding performance based on the configuration of slave nodes. In order to achieve the 24-fps target decoding frame rate, we need 2 slave nodes for the (352*240) video resolution, 5 slave nodes for the (704*480) main resolution, and 13 and 14 slave nodes for the high-resolution (1024*1024) and (1404*960) video formats respectively. We also observed that the system resource usage at large scale settings is under control, indicating the system can be easily scaled up, as well as scaled down.
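The two buffer reductions described above can be illustrated with a small calculation. This is a sketch under stated assumptions, not the thesis implementation: it borrows the 12 bits per pixel (4:2:0) and GOP length of 15 from the memory analysis in Chapter 3 and the N=12, M=3 encoding parameters from the test streams, and it assumes that decoding an I, P, or B picture touches 1, 2, or 3 frame buffers respectively (the picture itself plus its references). All function and constant names are hypothetical.

```python
BYTES_PER_PIXEL = 1.5  # 4:2:0 chroma format: 12 bits per pixel


def transmission_buffer_mb(width, height, frames):
    """Size in MB of a transmission buffer holding `frames` decoded frames."""
    return BYTES_PER_PIXEL * width * height * frames / 1e6


def gop_frame_types(n=12, m=3):
    """Frame types of one GOP in display order: I first, P every m frames, B otherwise."""
    return ["I" if i == 0 else "P" if i % m == 0 else "B" for i in range(n)]


# Assumption: frame buffers needed while decoding each picture type
# (the picture being decoded plus its reference pictures).
FRAMES_NEEDED = {"I": 1, "P": 2, "B": 3}


def average_fraction_of_full_buffer(types, full=3):
    """Average buffer demand per picture as a fraction of a fixed `full`-frame buffer."""
    needed = sum(FRAMES_NEEDED[t] for t in types)
    return needed / (full * len(types))


gop_buffer = transmission_buffer_mb(1404, 960, 15)  # original whole-GOP buffer
st_buffer = transmission_buffer_mb(1404, 960, 3)    # ST scheme: three frames
fraction = average_fraction_of_full_buffer(gop_frame_types())

print(f"{gop_buffer:.1f} MB -> {st_buffer:.1f} MB, dynamic fraction {fraction:.2f}")
```

With these assumptions the per-slave transmission buffer for the (1404*960) stream shrinks from roughly 30 MB to roughly 6 MB, and on-demand allocation averages about 86% of the three-frame buffer, which is close to the 85% figure quoted above.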

PAGE 13

1.5 Organization of the Paper

The organization of the paper is as follows: section 2 provides related studies and a brief overview of MPEG-2 scalability. Section 3 describes the preliminary results of scalable decoding performance for various MPEG-2 video formats. The nature of the problem is identified by analysing the original buffer scheme and run time system statistics (CPU usage, memory occupation). In section 4, we present the two improved buffer schemes and report the experimental results. Finally, section 5 gives the conclusion of this paper.

PAGE 14

CHAPTER 2
RELATED STUDY

The optimization of MPEG-2 decoding has been attempted in both software and hardware approaches. On general purpose microprocessors, much of the work has focused on accelerating Huffman decoding, fast IDCT, and other run time costs, such as the work in Lee [3]. In Soderquist and Leeser [4], the memory access pattern of MPEG-2 decoding was analyzed to improve the cache efficiency; their proposed cache-oriented architecture was reported to reduce memory traffic by 50%. However, the real-time performance requirements were not addressed in their work. In Patel [5], the performance of a software decoder was discussed and various enhancements in IDCT, ME and DITHERING were studied. However, only a (320*240) video stream was decoded in real time.

Besides pure software-oriented optimization, many CPU vendors have built multimedia instructions into general purpose processors [6, 7, 8]. Lee [3] reported a 4-fold performance improvement using the PA-RISC multimedia instructions. Recently, Intel's MMX technology has been gaining more interest in MPEG-2 decoding optimization [9, 10, 11]. In our experiments, a 70% reduction of execution time is observed for the IDCT transform. With the Pentium III 700MHz CPU, the main level MPEG-2 video (DVD quality) can be decoded at nearly jitter-free quality.

Some commercial software DVD decoders can operate at a lower CPU clock rate with hardware multimedia support features provided by a video card. For example, most state-of-the-art video card vendors have integrated IDCT and even motion compensation into their chips [12]. These hardware features can significantly relieve the computation load on the host CPU.

PAGE 15

Pure-hardware approaches usually use a redundant DSP unit and a much wider internal bus design, which makes it possible to exploit instruction-level parallelism (such as VLIW). Some of these works are reported in Akiyama and Sriram [13], and Sriram and Hung [14]. In Baum et al. [15], a low-cost, high-performance RISC processor core based chip set is proposed to encapsulate many of the functions required in high quality consumer audio-visual platforms. In Ishiwata et al. [16], a single-chip MPEG-2 MP@ML codec, integrating 3.8M gates on 72 mm², is described. It has a heterogeneous multiprocessor architecture in which six microprocessors with the same instruction set but different customization execute specific tasks such as video, audio etc. concurrently. The microprocessor, developed for digital media processing, provides various extensions such as a VLIW one and a DSP one inherent in its architecture. Making full use of the extensions, the chip executes encoding and decoding of video, audio and system concurrently in real time. However, these approaches did not address high quality scalable MPEG-2 video formats, which will probably become more desirable in future multimedia applications. Moreover, their strong dependence on specific hardware makes them less flexible and reusable. In some cases, a generic pure software solution is more desirable.

As demonstrated in the literature, pure software MPEG-2 encoding/decoding requires large amounts of computation power. Much has been done [17, 18, 19] to parallelize the MPEG-2 encoding process based on SMP environments or clusters of workstations. With Intel's Paragon multiple processor system, Akramullah et al. [18] reported a real-time parallel encoder for low resolution MPEG-2 video. Gong and Rowe [19] proposed a coarse-grained parallel version of an MPEG-1 encoder and showed a very good parallel gain. In He et al. [20], the scheduling algorithms of parallel MPEG-4 encoding were discussed to

PAGE 16

balance the system load when dispatching multiple video streams over a cluster of workstations.

On the other hand, only a few works have been reported regarding parallel MPEG-2 decoding. A parallel MPEG-2 decoder based on a shared-memory SMP machine was reported in Bilas et al. [21]; however, they did not address how real-time decoding could be supported, and whether the system can be scaled up for high-profile and high-level video sources. In Wang and Liu [1], we proposed a data pipeline based scheme toward a pure software, scalable MPEG-2 decoder. The early results show that MP@ML MPEG-2 video can be adequately supported with low end CPUs. However, the decoding performance of high end MPEG-2 video formats with scalable features has not been reported.

The high end MPEG-2 video format usually comes with multiple sub-streams, with a mandatory base layer and additional enhancement layers providing various scalability features. Three scalabilities are defined so far: SNR, Spatial, and Temporal scalability. To enable these features, additional computation resources must be provided. Therefore it is not clear to us that high-profile high-level video could be automatically supported with the existing solutions. It is our goal in this study to verify how scalable decoding of high resolution video can be achieved.

PAGE 17

CHAPTER 3
PROBLEM NATURE AND ANALYSIS

We have long suspected that scalability for high-profile high-level MPEG-2 video could be a challenging issue, but it was not until recently that we found this practical issue does exist. By using a public domain MPEG-2 encoder, we were able to generate a series of MPEG-2 video streams with different resolutions. The video content was encoded with N=12 and M=3, with chroma format 4:2:0. Each video content had seven versions with different resolutions, from 352*240 to 1404*960. The intermediate resolutions were chosen so that continuous performance trends could be observed. We used the same GOP structure, quantization table, color format, and motion search range as the encoding parameters to have a fair comparison. Each encoded video consisted of sixty frames, which is roughly four GOPs.

Three test video sources with different motion activity and picture complexity were chosen. The "flower" video type consists of a static scenario of flowers. The "calendar" title has slow motion and a complex picture. The "tennis" title is the most motion intensive one. Each of the three video titles was encoded into the four different sizes we are interested in. All the encoding parameters were the same except the horizontal and vertical sizes.

Using the performance model in Wang and Liu [1], we can derive the expected decoding performance. To simplify the discussion, we assume a one-layer structured MPEG-2 video file. With the assumption of sufficiently long video sequences, the expected decoding frame rate can be approximated by the following:

$$FR_D = \frac{N \cdot D}{\max\{\, D \cdot T_{single} + T_{ms} + T_{sm} + 2c,\; N \cdot (T_{sm} + T_{ms} + 2c) \,\}}$$

PAGE 18

where $N$ and $D$ denote the number of engaged processors and the length of a GOP (Group of Pictures). $T_{single}$ is the average decoding time of one frame of the given video file on a given CPU. $T_{sm}$ and $T_{ms}$ are the transmission times of a decompressed frame and a raw frame respectively (equations (7) and (8) in Wang and Liu [1]). Using the same hardware configuration as the SUN SMP environment in Wang and Liu [1], the expected decoding performance is shown in Table 3.1.

Table 3.1: Expected Parallel Decoding Performance

Video Spatial Resolution   2 nodes   4 nodes   8 nodes   16 nodes
352*240                    25 fps    55 fps    120 fps   260 fps
704*480                    10 fps    25 fps    48 fps    92 fps
1024*1024                  3 fps     8 fps     20 fps    34 fps
1404*960                   3 fps     6 fps     14 fps    25 fps

Though the expected decoding performance can be predicted via our proposed performance model, we are interested in whether the experimental results will agree with it. By using a SUN SMP machine with 14 248-MHz UltraSparc CPUs, 512 MB of memory and internal communication bandwidth of up to 680 Mbps, we have collected the scalability performance with different video resolutions.

The achieved frame rates for the low- and main-level MPEG-2 video are very close to our prediction. The results showed only a slight difference among the three video contents. For the small resolution video (Figure 3.1.a), we observed a linear increase of frame rate. The maximum frame rate is achieved when 14 nodes are deployed, providing 220 fps for tennis, 231 fps for calendar, and 213 fps for flower. Since the video size is small, the system's theoretical peak could reach 500 fps (at 30 nodes) according to the prediction in Wang and Liu [1]. Our test platform only has 14 nodes, thus the theoretical saturation point will not be reached.

The results for the main-level video (720*480) also conform with our prediction. The performance for the three video titles shows little difference in terms of frame rate. Each of them

PAGE 19

increases close to linearly when more slave nodes are used. The highest frame rate achieved is 70 fps (with 14 nodes).

[Figure 3.1: Decoding Performance For Low Resolution Video. (a) Decoding performance for 352*240 video. (b) Decoding performance for 704*480 video. Axes: number of nodes versus frame rate (FPR); series: calendar, flower, tennis, prediction.]

However, the scalability performance for the high resolution MPEG-2 videos is not satisfactory. In Figure 3.2.b, the decoding rates for (1404*960) MPEG-2 files are illustrated. Starting with 2 fps at a single node configuration, a linear increase can be observed. The highest decompression rate is 20 fps for "flower" at 9 slave nodes, and 22 fps for "tennis" and "calendar" at 11 nodes. With 10 slave nodes, the decoding performance of "flower" suddenly dropped to 2.5 fps, and continued deteriorating with a small rebound at 11 slave nodes. For "tennis" and "calendar", a similar performance degradation is observed at 12 slave nodes, right after the peak performance.
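The expected frame rate formula from the performance model earlier in this chapter can be written as a small function to show how the predicted rate first grows linearly and then saturates. This is only a sketch: the timing values below are hypothetical placeholders chosen for illustration (except that a 0.5 s single-frame decoding time corresponds to the 2 fps single-node rate reported above), and `c` is a constant from the model in Wang and Liu [1] that this text does not define.

```python
def expected_frame_rate(n, d, t_single, t_ms, t_sm, c):
    """FR_D = N*D / max(D*T_single + T_ms + T_sm + 2c, N*(T_sm + T_ms + 2c))."""
    first_term = d * t_single + t_ms + t_sm + 2 * c   # independent of N
    second_term = n * (t_sm + t_ms + 2 * c)           # grows with N
    return n * d / max(first_term, second_term)


# Hypothetical timings in seconds; only t_single is tied to a number in the text.
params = dict(d=15, t_single=0.5, t_ms=0.05, t_sm=0.2, c=0.01)

for n in (1, 2, 4, 8, 14, 40, 60):
    print(n, round(expected_frame_rate(n, **params), 1))
```

While the first term of the maximum dominates, the rate is proportional to N. Once the second term takes over, the rate stays at D / (T_sm + T_ms + 2c) regardless of how many nodes are added, which is the saturation behavior the model predicts. The sharp drops observed in the measurements are not captured by this formula.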

PAGE 20

[Figure 3.2: Decoding Frame Rate For High Resolution Video. (a) 1024 x 1024. (b) 1404 x 960. Axes: number of nodes versus frame rate (FPR); series: calendar, flower, tennis, prediction.]

Note that 20- or 22-fps decoding performance is not considered real-time video display. It is closer to slow motion, which cannot be synchronized with an accompanying audio track. Therefore, it is desirable to achieve at least 24 fps, which is close to a theater's film display. Most importantly, the display rates should be smooth and without sudden drops such as the 2.5 fps that we observed at this time.

The behavior is also observed with other high-profile formats. Similar performance drops are observed for the (1024*1024) format. The system can do well for up to 10 slave nodes, where a peak of 23 fps can be achieved with 10 nodes. However, great degradation occurred after 11 nodes. The frame rate dramatically drops to only 2 fps, which is even worse than a single slave node configuration.

PAGE 21

Further increasing the number of slave nodes does not seem to improve the performance at all. We observed no frame rate improvement after 11 nodes.

3.1 Memory Usage Analysis

The performance degradation shows that our pipeline scheme is bounded by a system bottleneck. We need to find the underlying reason in order to improve the original design; the goal here is to make the system scale up well for high-level high-profile MPEG-2 video.

According to Wang and Liu [1], the data exchange between the master node and slave nodes is based on the GOP, which usually contains 10 to 20 frames. The slave node has to keep a buffer space to accommodate both the compressed and decompressed video frames for each GOP. In the master node, a dedicated buffer space (1 GOP) is reserved for displaying, and another for receiving data from slave nodes. Figure 3.3 depicts the buffering requirements and relations between the master and slave nodes.

The memory requirement for the master node is:

$$M_m = m_c + m_{streambuffer} + m_{outbuffer} + m_{inbuffer}$$

Here $m_c$ is the size of the executable code for the master, about 500 KB. $m_{streambuffer}$ is the streaming buffer that receives the compressed video packets from the video server; we currently fix it at 1 MB. $m_{outbuffer}$ and $m_{inbuffer}$ are dedicated to information exchange in the parallel decoding. $m_{outbuffer}$ equals one GOP of MPEG-2 compressed frames, and $m_{inbuffer}$ needs to accommodate two GOPs of decompressed frames (one GOP for displaying and another for incoming traffic). Notice that in our scheme, the outbuffer and inbuffer are shared by all slave nodes, which is made possible by the master doing round robin polling. Using the horizontal video size $h$ and vertical video size $v$, GOP=15, and the average

PAGE 22

compression ratio is 20, we have:

$$M_m = 0.5 + 1 + [?] \cdot h \cdot v / 10^6 + 2 \cdot GOP \cdot h \cdot v / 10^6 \ \text{(MB)} = 1.5 + ([?] + 2 \cdot GOP) \cdot h \cdot v / 10^6 \ \text{(MB)}$$

[Figure 3.3: Memory Usage Illustration. Master node memory blocks: (1) mem-m-mc, master code; (2) mem-m-sb, stream buffer; (3) mem-m-rb, receiving buffer and displaying buffer; (4) mem-m-out, compressed outbuffer. Slave node memory blocks: (5) mem-s-mc, slave code; (6) mem-s-ib, compressed data buffer; (7) mem-s-tb, transmission buffer. Data flows from the video server to the master node, from the master node to the slave nodes for decoding, back to the master node, and on to the display device.]

For the slave processes, the size of the executable code is also 500 KB. The stream buffer is not used since the slave node does not receive compressed video packets from the video server. A compressed data buffer is used to receive data from the master (the same size as $m_{outbuffer}$ in the master node). The transmission buffer serves two purposes: it is used during the decoding process, thus obviating the need for separate space for the YUV components of each macroblock, and so

PAGE 23

16 that in-place transmission can b e done without mo ving data. Using a 4:2:0 color sc heme, the a v erage bits p er pixel is 12 bits, instead of 8 bits used in the displa y system, th us the transmission buer m t is 1.5 times the size of m inbuf f er . W e ha v e M s = m c + m compr essedbuf f er + m tr ansmissionbuf f er = m c + m outbuf f er + 1 : 5 m inbuf f er Using the ab o v e t w o equations, w e can calculate the memory requiremen t for the master no de and the sla v e no de for eac h video format. F or the test stream tennis40 (1404*960), M m = 42 (MB), and eac h sla v e needs ab out 30.8 MB. The original consideration of pip eline p ar al lel design is to minimize the comm unication cost and reduce the n um b er of high lev el net w ork access times. Ho w ev er the memory buering sc heme is not considered optimized. The accum ulativ e buering space will gro w quic kly when using a large scale sla v e no de conguration, whic h causes unsatisfactory scalabilit y p erformance when the n um b er of sla v e no des is large. F or instance, let N b e the n um b er of sla v e no des, the total memory requiremen t b ecomes M t = M m + N M s Using the parameters of our testing MPEG-2 video, the actual memory used is listed in table 3.2. In Figure 3.2.b and 3.2.a, w e nd that the tennis40 has the frame rate dropp ed when N=9, and tennis60 dropp ed at N=11. The corresp onding amoun t of memory used is 319.9 MB and 296.5 MB resp ectiv ely . The minim um of these t w o should b e used as the indication of p oten tial memory outrage. This amoun t of memory is actually 70% of the system ph ysical memory . 3.2 Impact of Memory Shortage The p erformance impact of a non-optimized buering sc heme will aect on the comp etition b et w een user pro cesses (e.g., our comm unication and decompression soft w are) and system pro cesses (e.g., demand-paging mec hanisms b y OS). Because

PAGE 24

17 T able 3.2: Memory requiremen ts for dieren t no des(MB) Horizon tal V ertical 1 2 4 8 9 10 11 resolution resolution no de no des no des no des no des no des no des T ennis40 1404 960 73.5 104.3 165.9 289.1 319.9 350.7 381.5 T ennis60 1024 1024 59.5 83.2 130.6 225.4 249.1 272.8 296.5 the shortage of the o v erall memory , the system pro cess will generate a signican t n um b er of page faults, whic h in general slo wing do wn the decompression sp eed due to the lac k of CPU. The shortage of system memory will force the op erating system to sw ap some of the memory page out to hard disk, and this activit y in turn will use more CPU time, th us aect the p erformance of all user space pro cesses. The evidence from system run time statistics can b e collected from the CPU time distribution and the n um b er of page faults to supp ort this unique observ ation. Figure 3.4 illustrats our measured n um b er of page faults v ersus the n um b er of sla v e no des. F or the sak e of clarit y , w e only presen t the results for "tennis", the "ro w er" and "calendar" sho ws similar results for this measuremen ts. W e plotted the page faults for the four video resolutions, from QCIF to MPEG-2 high lev el (1404*960). The follo wing observ ation can b e made: F or the 352x240 video, the page faults virtually remain unc hanged, and are k ept at a lo w lev el (1010 page faults/frame). Increasing the video resolution to 704x480 is rerected b y the rise in the n um b er of page fault, a four fold jump is observ ed. Nev ertheless, the 704x480 case still has a rat curv e for the increasing sla v e no de, indicating the system is running steadily . F or the 1024x1024 video, the n um b er of page faults increased considerably at b eginning, but still within a manageable lev el. 1200 page faults p er frame is observ ed for 2 sla v e no des, and remain the same un til 9 sla v e no des. 
This is follo w ed b y a signican t increase at 10 to 12 sla v e no des, reac hing 3500 page faults p er frame at 12 sla v e no des as p eak. Then the gure drops bac k to certain degree, but still main tains a high lev el ( more than 2500). Compared with the deco ding p erformance in Figure3.2.b, the p erio d of high page faults coincides with the collapse of the deco ding rate. This indicates that the excessiv e page faults had driv en the system in to an outrage state.

[Figure 3.4: Page Fault vs. Number of Slave Nodes. Plot title: System Page Fault; x-axis: Number of Slave Nodes; y-axis: Second; series: 352*240, 704*480, 1024*1024, 1404*960.]

The page-fault behavior of the 1404x960 video shows the same pattern as the 1024x1024 case. It has a higher number of page faults than the other cases, due to its highest spatial resolution and thus its high memory requirement. The surge in page faults occurs even earlier than in the 1024x1024 case, with the jump between 9 and 10 slave nodes. The page faults peak at 4700 faults/frame at 10 slave nodes, where the decoding performance drops from 22 fps to 2.5 fps (see Figure 3.2.a).

Figure 3.5 presents the overall CPU usage distribution among the user-space process (our decoding algorithm), the system cost (paging), and the system idle time. With one slave node, 90% of the system time is idle, 8% of the CPU time is used in user space, and the remaining 2% goes to other system maintenance. With increasing slave nodes, the user-space time increases proportionally and the system idle time decreases. During this period, more CPU time is used by the slave nodes, and the decoding frame rate increases linearly. After 8 slave nodes, however, both the system idle time and the user-space time drop significantly, while the system overhead shows a sharp rise. About 90% of the CPU time is used by the operating system, while user space occupies only 5% of the CPU time. Recalling that the number of page faults increases suddenly at 9 slave nodes (see Figure 3.4), we conclude that the system spends most of its CPU time swapping pages in and out, hence the observed drop in decoding performance.

[Figure 3.5: CPU usage for tennis40. x-axis: number of nodes; y-axis: CPU percentage; series: User Space (Decompression), System Idle, System Processing (O.S.).]

The previous sections show that our original parallel decoder can consume a huge amount of memory as frame buffers when decoding high-resolution MPEG-2 video. The always-limited physical memory can be exhausted when a large number of slave nodes are deployed. Moreover, using a large GOP in order to reduce the communication overhead makes the memory over-allocation even worse. The system page fault and CPU time distributions provide strong evidence for this claim. Thus, how to create efficient buffering schemes becomes critical to the success of the data pipeline scheme for supporting high-profile, high-level MPEG-2 video decompression.
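
As a rough illustration of the memory model in Section 3.1, the following C sketch evaluates the master, slave and total memory requirements in MB. The function and macro names are hypothetical and are not part of the decoder source. The sketch assumes that the slave transmission buffer holds one GOP at 12 bits per pixel, and it takes the compressed buffer size as a parameter because that size depends on the stream. With that parameter set to zero, it gives roughly the 42 MB and 30.8 MB quoted for tennis40.

```c
#include <assert.h>

/* Hypothetical helper names; all sizes are in MB (bytes / 10^6). */
#define CODE_MB   0.5   /* executable code, master or slave     */
#define STREAM_MB 1.0   /* streaming buffer, master node only   */

/* One GOP of decompressed frames at 8 bits per pixel (display format). */
double gop_display_mb(int h, int v, int gop)
{
    assert(h > 0 && v > 0 && gop > 0);
    return (double)gop * (double)h * (double)v / 1.0e6;
}

/* Master: code + stream buffer + compressed outbuffer + two GOPs for display. */
double master_mb(int h, int v, int gop, double outbuffer_mb)
{
    assert(outbuffer_mb >= 0.0);
    return CODE_MB + STREAM_MB + outbuffer_mb + 2.0 * gop_display_mb(h, v, gop);
}

/* Slave, original scheme: code + compressed buffer + one GOP at 12 bits
 * per pixel (assumption: 1.5 times one GOP of display-format data). */
double slave_mb(int h, int v, int gop, double outbuffer_mb)
{
    assert(outbuffer_mb >= 0.0);
    return CODE_MB + outbuffer_mb + 1.5 * gop_display_mb(h, v, gop);
}

/* Total for n_slaves slave nodes: master plus n_slaves times slave. */
double total_mb(int h, int v, int gop, double outbuffer_mb, int n_slaves)
{
    assert(n_slaves >= 0);
    return master_mb(h, v, gop, outbuffer_mb)
         + n_slaves * slave_mb(h, v, gop, outbuffer_mb);
}
```

For example, total_mb(1404, 960, 15, 0.0, 9) evaluates to a little over 319 MB, which is close to the 9-node entry for tennis40 in Table 3.2.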

CHAPTER 4
EFFICIENT BUFFERING SCHEMES

In order to propose efficient buffering schemes, we have to analyze our original scheme more deeply. In our original design [1], each slave node allocates its buffer space at the initialization stage. The buffer is big enough to hold a GOP of frames. This buffer space stays statically allocated throughout the decompression of the current GOP. When the decoding of the whole GOP is completed, the data in the buffer (decompressed video frames) is sent into the MPI communication protocol stack. In the master node, a two-buffer scheme is adopted, where one buffer is used for incoming frames from the slave nodes and the other is dedicated to displaying the last GOP of frames. To avoid memory movement (which is undesirable for video display), we swap the incoming frame buffer and the display buffer every time a new GOP of frames is received. Therefore, the memory requirement of each slave node depends on the GOP length and the picture size. The aggregate memory for the whole system increases linearly as the number of nodes grows.

4.1 The First Solution

To reduce the memory requirements in the slave nodes, the buffering scheme should be redesigned. Ideally, a perfect buffering scheme might need only one transmission buffer. However, due to the decoding dependency inside the MPEG-2 video structure, we are not able to use only one frame buffer in reality. To decode a B-frame, we need at least two reference frames; this indicates that the worst case of the minimum buffer should be three frames, two for the reference frames and one for the working B-frame. With a careful redesign of the master-slave communication protocol, using a 3-frame transmission buffer on the slave side is possible, which we call the ST scheme. When the picture size is 1024*1024 and GOP=15, we can save about 12 MB of buffer space per slave node, about an 80% reduction on the slave side.

To adopt the proposed memory-efficient algorithm, the slave decoding process needs to rotate the usage of the two reference buffers, as suggested in the reference serial decoder. Let forwardb point to the forward predicting frame and backwardb to the backward predicting frame, from the viewpoint of B-frame decoding. The following rotation rules must be obeyed:

(1) The first frame, an I-frame, is decompressed into forwardb, which initially points to the first buffer.
(2) The first P-frame refers to forwardb and is decompressed into backwardb, which initially points to the second buffer.
(3) Each following P-frame uses backwardb as its reference frame and is decoded into forwardb. After completion, forwardb and backwardb are switched.
(4) Each B-frame refers to forwardb as the forward predicting frame and to backwardb as the backward predicting frame. The decompressed data is stored in the third buffer.

In addition, the slave has to send a frame back to the master as soon as it is completely decoded, so that the buffer can be reused for the next frame. On the master side, the master should be able to receive the data whenever it appears in the network layer, so that the slave nodes do not have to wait on a blocking transmission. This can be implemented with the UNIX signal mechanism.

The minimum required frame buffer can be further decreased from 3 frames to 2 frames. For I- or P-frames, we need one buffer for the prediction picture and another buffer for the working frame. The two buffers change roles after decoding an I- or P-frame, so that the most recently decoded I- or P-frame is used as the prediction frame for the next P-frame. For B-type frames, since the result is not used as a reference, we can directly send each decoded block into the MPI sending protocol. The above discussion assumes that the reference frame of a P-frame is always the last decoded P-frame, and that the reference frames of a B-frame are the last two P-frames. Nevertheless, this approach also works if the B- and P-frames always refer to the I-frame, given a proper setting of the reference buffer.

With this scheme, the expected memory requirement becomes

$$M'_t = M'_m + N \cdot M'_s = M_m + N \, (m_c + 1.5 \cdot 3 \, m_{frame} + m_{inbuffer})$$

Here the buffer space of the master node remains the same. For different video content, the expected amount of $M'_t$ is plotted in Figure 4.1.

[Figure 4.1: Memory Size Using Minimum Transmission Buffer. x-axis: Number of Nodes; y-axis: Memory Occupied (MB); series: tennis40, tennis60.]

It is observed that the required memory increases with a gentle slope, where each slave node introduces only about 6 MB of additional space for the tennis40 video. The tennis60 video has an even smaller memory requirement. To find the maximum number of slave nodes before the system memory runs out for the 1404*960 single-layered MPEG-2 video, we solve (47 + (N-1) * 6) < 300 MB (using 300 MB as the system threshold). This gives N=43 slave nodes and more than 60 fps.

4.2 Implementation and Experiment Result

We implemented the ST memory allocation scheme and repeated the experiments for the high-level MPEG-2 video with the new buffer management. The measured memory requirement is given in Table 4.1. The ratio of the new memory requirement to the original memory size is given below the actual memory size. The memory requirement of the ST scheme is significantly lower than that of the original scheme, especially when more slave nodes are deployed. For 1 slave node, we need 53.5 MB for 1404*960 video, which is 27% less than the original. For the 4-slave-node case, the ST scheme uses 85 MB instead of the original 165 MB, which is almost a 50% memory saving. This closely matches the analytical estimate in the previous section, since the ST scheme can save 66% of the memory in the slave nodes. The overall saving will always be below 66% in total. A similar improvement in memory requirement is found for the 1024*1024 case. The ST scheme requires a maximum of 200 MB when all 14 nodes are utilized. Using 300 MB as the safe limit for memory allocation, as suggested by Table 3.2, a system outage should be avoided even if a full configuration of slave nodes is used.

Table 4.1: Memory requirement for the ST scheme (MB) and its ratio to the original scheme

                1 node  2 nodes  4 nodes  8 nodes  9 nodes  10 nodes  11 nodes
Tennis40 (MB)   53.5    64.3     85       129.1    146.9    160.7     175.5
  ratio         72.8%   61.7%    51.2%    44.7%    45.9%    45.8%     45.9%
Tennis60 (MB)   47.5    54.2     77.6     110.4    120.1    130.8     141.5
  ratio         79.8%   65.1%    59.4%    49%      48.2%    47.9%     47.6%

The effectiveness of our ST scheme is also confirmed by measuring the number of system page faults after the necessary modifications were made. Figure 4.2 shows the average number of page faults caused by a slave node during the decompression of 60 frames per node. The same video files used in Chapter 3 are tested. We observed the following:

The number of page faults is directly related to the picture size. The small size (352*480) has the lowest number of page faults, while the highest picture resolution (1404*960) corresponds to the highest page-fault rate. This trend is also observed in Figure 3.4 for the original decoder.

The number of page faults for each individual node is significantly reduced compared to the number in Figure 3.4. For the 1404*960 case, the number of page faults shrinks from 1500 to 1200 in the one-slave-node configuration. For the 1024x1024 video, the page faults are now 943, 25% less than before.

For all of the video streams, the number of page faults remains almost unchanged when the number of slave nodes increases. This phenomenon is also observed in Figure 3.2.b before the memory saturation point. The flat curves show that the system memory usage is still under control. The page-fault surge for the two high-resolution video streams is eliminated, showing that the ST scheme has successfully relieved the memory bottleneck.

[Figure 4.2: Memory Page Fault for the Revised Memory Management. x-axis: number of nodes; y-axis: Avg page fault per slave node; series: 704*480, 1024*1024, 1404*960.]
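
The buffer rotation rules (1) to (4) of Section 4.1 can be sketched in C as a small state machine that returns, for each frame in decoding order, the index of the buffer it is decoded into. This is only an illustration under stated assumptions: the type and function names (st_state, st_init, st_target) are hypothetical, the three buffers are numbered 0 to 2, and the actual frame decoding and MPI transmission are left out.

```c
#include <assert.h>

/* Hypothetical names; the three slave buffers are numbered 0, 1, 2. */
enum frame_type { FRAME_I, FRAME_P, FRAME_B };

struct st_state {
    int forwardb;   /* buffer holding the forward reference frame        */
    int backwardb;  /* buffer holding the backward reference frame       */
    int seen_p;     /* nonzero once the first P-frame has been decoded   */
};

/* Reset the rotation state at the start of a GOP. */
void st_init(struct st_state *s)
{
    assert(s != 0);
    s->forwardb = 0;
    s->backwardb = 1;
    s->seen_p = 0;
}

/* Return the buffer index the next frame is decoded into and
 * update the reference pointers according to rules (1) to (4). */
int st_target(struct st_state *s, enum frame_type type)
{
    int target, tmp;

    assert(s != 0);
    switch (type) {
    case FRAME_I:                  /* rule (1): I-frame goes to forwardb */
        target = s->forwardb;
        s->seen_p = 0;
        break;
    case FRAME_P:
        if (!s->seen_p) {          /* rule (2): first P goes to backwardb */
            target = s->backwardb;
            s->seen_p = 1;
        } else {                   /* rule (3): later P goes to forwardb, then swap */
            target = s->forwardb;
            tmp = s->forwardb;
            s->forwardb = s->backwardb;
            s->backwardb = tmp;
        }
        break;
    default:                       /* rule (4): B-frame uses the third buffer */
        target = 2;
        break;
    }
    return target;
}
```

For frames decoded in the order I, P, B, B, P, the targets are 0, 1, 2, 2, 0, and after the second P-frame the two reference pointers have exchanged roles, so backwardb again names the most recently decoded reference frame.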

The obtained decoding frame rate gives the final judgment on the correctness of our analysis. Since our major target is high-quality video, we only show the achieved frame rate for the 1404*960 and 1024*1024 video streams. For each of the video sizes, we compare the performance for the three video titles mentioned earlier. Figure 4.3.b shows the scalable decoding frame rate of our revised ST scheme for 1404*960 video. We observed a close-to-linear increase of the frame rate. For one slave node, we have 1.7 fps for tennis, 1.86 fps for flower, and 1.6 fps for the "mobl" video. For two slave nodes, the performance is nearly doubled in each case, with 3.5 fps for tennis, 3.7 fps for flower, and 3.2 fps for mobl. For the other node configurations, the achieved frame rate increases proportionally, and there are slight differences between the three video titles. The peak decoding rates are obtained at 14 slave nodes, where 20 fps is observed.

The decoding performance for the 1024*1024 video files shows similar behavior. A close-to-linear speed-up is also observed. At one slave node, the frame rate is roughly 1.8 fps. The highest frame rate is 23 fps for calendar when the system is fully loaded. The difference in frame rates among the three video titles is slightly higher than for the 1404*960 size. The biggest performance difference occurs between calendar and tennis at 7 slave nodes, with frame rates of 13 and 10 fps, respectively. Nevertheless, the overall result still matches our expectation quite well, and there is no sign that the performance difference may increase.

[Figure 4.3: Decoding Frame Rate for the Revised Memory Management. (a) 1024*1024, (b) 1404*960. x-axis: number of nodes; y-axis: FPR; series: calendar, flower, tennis, prediction.]

For the two high-resolution video formats, our revised ST scheme has successfully solved the memory shortage problem. The observed near real-time decoding rate shows that our scheme works well for high-quality video up to MP@HL. Our analysis of memory usage indicates that 14 slave nodes still leave enough memory in the system. The theoretical maximum number of slave nodes can be estimated as follows. Still taking 300 MB as the memory budget, we have $M_m = 42$ MB and $M_s = 11$ MB for the 1404*960 case. Thus the maximum number of slave nodes is $(300 - 42)/11 \approx 23$. With this number of slave nodes, and if the near-linear scaling observed above continues, our scheme should produce a decoding rate of up to about 33 fps.

[Figure 4.4: User Space Time vs. Kernel Space Time for the First Buffer Optimization Scheme. Plot title: CPU usage for tennis40; x-axis: number of nodes; y-axis: CPU percentage; series: User Space (Decompression), System Idle, System Processing (O.S.).]

Figure 4.4 shows the overall CPU time distribution of the slave nodes when decoding high-resolution video formats with the revised buffer scheme. The user-space time component represents the computation time of the MPEG-2 decoding procedure; the kernel-space time is the system-level overhead, including time spent in the network layer, system calls, and other costs. It is observed that the user-space time increases linearly as the number of slave nodes increases, accompanied by a corresponding decrease in the system idle time. Meanwhile, the operating-system-level cost is kept at a low level (between 5% and 10% of the total CPU time). In particular, for the large-scale experiments (more than 11 slave nodes deployed), the abnormal crossover of user-space time and system overhead observed in the original decoding experiments no longer exists. This further proves the effectiveness of the ST scheme in solving the memory shortage.

4.3 Further Optimization in the Slave Nodes

In the above section, we used a 3-frame transmission buffer for each slave node, which already gives a significant reduction in the memory requirements. We also showed that, for the 1404*960 case, the memory shortage will not occur until more than 23 slave nodes participate in the parallel decompression. The maximum allowed number of slave nodes for other video sizes can be derived similarly from the memory budget and the size of the video image.

This result can easily be extended to multi-layered MPEG-2 video, where enhancement layers may exist to provide SNR scalability, data-partitioning scalability, and other scalability features. Such scalable MPEG-2 streams usually have L layers of sub-streams, each requiring the same amount of buffer space as the base layer in order to be successfully decoded. Assume a 3-layer MPEG-2 video is to be decoded; the decoding/transmission buffer in a slave node will be nearly tripled. It can be expected that the memory shortage will become a bottleneck again.

To further reduce the buffer requirement in the slave node, we propose a dynamic buffer requirement scheme. We observe that the decoding procedure in a slave node does not need three frames all the time. More specifically, an I-frame does not refer to any other frame, so we need only one frame as the decoding/transmission buffer. Similarly, a P-frame refers to only one frame (an I- or P-frame), so the total need for decoding a P-frame is two. Only a B-frame needs all three frame buffers.
Thus the total amount of buffer needed can vary during the lifetime of the slave node. Since all the slave nodes share the physical memory and perform decompression independently, we can effectively reduce the memory requirement by dynamically allocating buffers in the slave node. Let the ratio of I, P, and B frames in a GOP structure be a:b:c; the effective buffer space for one layer is expressed by

$$M = (1 \cdot a + 2 \cdot b + 3 \cdot c)/(a + b + c) \qquad (4.1)$$

In a typical GOP structure of "IBBPBBPBBPBBPBB", we have a:b:c = 1:4:10, which results in an effective buffer number of 39/15 = 2.6, about 85% of the 3-frame buffer scheme. The effective buffer space is a function of the GOP structure. When the percentage of I-frames increases, the effective buffer space decreases. In the extreme case of an all-I-frame GOP, the effective buffer is 1 frame/GOP, while a long GOP structure with many B-frames makes the effective buffer space approach the limit of 3 frames/GOP.

Assume a two-layered scalable MPEG-2 stream with 1404*960 video size is to be decoded; the memory requirement of the slave node is $M_s = 2 \times 11 \times 0.85 = 18.7$ MB. Still assuming a 300 MB total memory budget and $M_m = 42$ MB, the system can support up to $(300 - 42)/18.7 \approx 14$ slave nodes. Further assume that decoding such a two-layered high-resolution stream takes twice as long as the one-layer stream in a serial pure-software decoder; the 14-slave-node configuration can then produce only up to $14 \times 0.9 = 12.6$ frames/sec. To obtain higher performance, we need to decrease the buffer space further, so that more slave nodes can be supported. Algorithm 1 describes the dynamic buffer allocation scheme.

Algorithm 1: Dynamic Buffer Allocation in Slave Node

/* Three buffers outbuffer[1,2,3] are used repeatedly. */
/* forwardb, reverseb, and currentb point to the forward reference */
/* frame, backward reference frame, and working frame, respectively. */
/* The following steps decode and transmit one GOP of frames. */
ReceiveGOP(&compressedBuffer)
allocate outbuffer[1,2,3]
for each frame f(i) in compressedBuffer do
    if f(i) is I-Frame then
        deallocate outbuffer[2,3]
        currentb = outbuffer[1]
        perform MPEG-2 I-frame decompression
        transmit outbuffer[1] to master
    else if f(i) is P-Frame then
        if (outbuffer[2] == NIL) allocate outbuffer[2]
        forwardb := currentb, currentb := outbuffer[2]
        perform P-Frame decompression
        transmit outbuffer[2] to master
    else if f(i) is B-Frame then
        if (outbuffer[3] == NIL) allocate outbuffer[3]
        reverseb = outbuffer[2]; currentb = outbuffer[3]
        perform B-Frame decompression
        transmit outbuffer[3] and release it
    end if
end for
deallocate outbuffer[1,2,3]

In fact, the concept of dynamic buffer allocation can also be applied inside the decoding of each frame. Since the decompression of each frame is based on a serial decompression of macroblocks, the overall buffer space can be reduced by dynamically allocating buffers for macroblocks. For example, when decoding the first macroblock, we only need to allocate a 16*16 block space. The buffers for the other macroblocks are assigned when they are needed for decoding. The total buffer grows as more macroblocks are decoded, and reaches the full buffer size when the decoding is finished. The slave then keeps the full buffer until the frame can be discarded. After the decoded frame is sent back to the master node, the decoding buffer can be released and a new buffer-growing process starts for the next frame. With this dynamic memory allocation, we can expect an additional buffer reduction of 0.5 frame for the current frame. Notice that this scheme cannot reduce the amount of buffer for the reference frames, which must stay in the system through the whole process. The effective buffer requirement becomes

$$M = ((1 - 0.5) \, a + (2 - 0.5) \, b + (3 - 0.5) \, c)/(a + b + c)$$

Using the same GOP structure as above, the effective number of frame buffers in the slave node is 2.1, which is 70% of the 3-frame buffer scheme.

The tradeoff here is the additional CPU cost introduced by the dynamic memory management. For each macroblock, the additional cost includes at least two system calls (for memory allocation and deallocation) and some other miscellaneous operations. It has been shown that the cost associated with dynamic memory allocation is significant for database servers and Web servers, where thousands of processes may coexist to process user requests. In our case, the number of slave nodes/processes is usually below 20 and memory management activity is expected to be far less frequent, so the overhead introduced should be limited. This is confirmed by our experimental results comparing the decoding performance with and without dynamic memory allocation, given in Table 4.2. It can be seen that the increase in system time is very small. For the 1024*1024 video format, the system time with dynamic allocation enabled is 1.3 seconds, only 0.2 seconds more than with static memory allocation.

Table 4.2: Decoding Performance of Slave Node With Dynamic Memory Allocation (measured time in seconds for the total number of frames)

Video resolution   500*360  704*480  850*750  1024*1024
Dynamic, user      4.37     7.02     10.61    12.08
Dynamic, system    0.5      0.8      1.2      1.3
Original, user     4.14     6.86     10.63    12.00
Original, system   0.4      0.6      1.0      1.1
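
Equation (4.1) and its macroblock-level refinement can be checked with a few lines of C. The sketch below is illustrative only: buffers_for and effective_buffers are hypothetical names, the GOP is given as a string of frame types, and the parameter saving is the per-frame reduction credited to macroblock-level allocation (0 for Equation (4.1), 0.5 for the refined estimate).

```c
#include <assert.h>
#include <stddef.h>

/* Frame buffers needed while decoding one frame of the given type
 * ('I', 'P' or 'B'); returns 0 for any other character. */
int buffers_for(char type)
{
    switch (type) {
    case 'I': return 1;   /* working frame only               */
    case 'P': return 2;   /* one reference + working frame    */
    case 'B': return 3;   /* two references + working frame   */
    default:  return 0;
    }
}

/* Average number of frame buffers over a GOP pattern such as
 * "IBBPBBPBBPBBPBB". `saving` is subtracted once per frame:
 * 0.0 reproduces Equation (4.1), 0.5 the macroblock-level variant. */
double effective_buffers(const char *gop, double saving)
{
    double sum = 0.0;
    size_t n = 0;

    assert(gop != NULL);
    for (; gop[n] != '\0'; n++) {
        int need = buffers_for(gop[n]);
        assert(need > 0);              /* reject unknown frame types */
        sum += (double)need - saving;
    }
    assert(n > 0);                     /* an empty GOP has no average */
    return sum / (double)n;
}
```

For the pattern "IBBPBBPBBPBBPBB" this gives 39/15 = 2.6 buffers without the refinement and 31.5/15 = 2.1 with it, which matches the figures in Section 4.3.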

CHAPTER 5
CONCLUSION

Because limited memory stands in the way of scalable performance for high-level, high-profile MPEG-2 video resolutions, new buffering controls and mechanisms need to be created within our software-only parallel MPEG-2 decoder. We therefore propose an ST buffering scheme with a dynamic allocation algorithm to significantly reduce the memory demands of this parallel decoding software. The results are very promising, with excellent scalability achieved in both down-scaling and up-scaling. Therefore, it is now possible for our software-only parallel MPEG-2 decoder to automatically choose the best video resolution (e.g., with a proper number of slave nodes) according to the hardware and networking settings.

APPENDIX /* Bounce Creates a new thread each time the letter 'a' is typed. * Each thread bounces a happy face of a different color around the screen. * All threads are terminated when the letter 'Q' is entered. ** This program requires the multithread library.For example,compile * with the following command line: * CL /MT BOUNCE.C */ //\#include \#include \#include \#include //\#include //\#include \#include "config.h" \#include "global.h" \#define MAX\_THREADS 32 extern void frame_reorder(struct PictureBuffer *frame,int bitstream_framenum, int sequence_framenum); extern motion_compensation (struct PictureBuffer *frame,short decsrc[6][64],int MBAMax,int MBA, int macroblock_type, 35

PAGE 43

36 int motion_type, int PMV[2][2][2], int motion_vertical_field_select[2][2], int dmvector[2], int stwtype, int dct_type); /* getrandom returns a random number between min and max, which must be in * integer range. */ \#define getrandom(min,max)((rand()%(int)(((max) + 1)-(min)))+ (min))//void main( void ); /* Thread 1: main */ //void KbdFunc( void ); /* Keyboard input */ //void BounceProc( char * MyID ); /* Threads 2 to n:display*/ //void ClearScreen( void ); /* Screen clear */ //void ShutDown( void ); /* Program shutdown */ //void WriteTitle( int ThreadNum ); /* Display title bar */ //HANDLE hConsoleOut; /* Handle to the console */ //HANDLE hRunMutex; /* "Keep Running" mutex */ //HANDLE hScreenMutex; /* "Screen update" mutex */ int ThreadNr; /* Number of threads started */ //CONSOLE_SCREEN_BUFFER_INFO csbiInfo; /* Console information */ /*void mmain() // Thread One { //* Get display screen information & clear the screen.

PAGE 44

37 hConsoleOut = GetStdHandle( STD_OUTPUT_HANDLE ); GetConsoleScreenBufferInfo( hConsoleOut, &csbiInfo ); ClearScreen();WriteTitle( 0 ); //* Create the mutexes and reset thread count. hScreenMutex = CreateMutex( NULL, FALSE, NULL ); // Cleared hRunMutex = CreateMutex( NULL, TRUE, NULL ); // Set ThreadNr = 0; //* Start waiting for keyboard input to dispatch threads or exit. KbdFunc();//* All threads done. Clean up handles. CloseHandle( hScreenMutex ); CloseHandle( hRunMutex ); CloseHandle( hConsoleOut ); }*/int getframe(struct PictureBuffer * frame,int framenum) //resume the frame->data which is covered by mpi receive. //this func call update_picture_buffers to attach a frame buffer. //the relative point is fixed in the receiveDistributeData function. { int MBAmax; int ret; int PsizeVerify; PsizeVerify=PARALLELSIZE;frame->data=global_microblocks[framenum%PARALLELSIZE];

PAGE 45

38 //be overlapped by mpi transmit, the main process //do not care it. if (frame->picture_structure==FRAME_PICTURE && Second_Field) { /* recover from illegal number of field pictures */ // printf("odd number of field pictures\n"); Second_Field = 0; }frame->Second_Field=Second_Field;/* IMPLEMENTATION: update picture buffer pointers */ Update_Picture_Buffers(frame);/* form spatial scalable picture */ /* ISO/IEC 13818-2 section 7.7: Spatial scalability */ if (frame->base.pict_scal && !Second_Field) { printf("spatial_prediction, we don't support\n"); }/* decode picture data ISO/IEC 13818-2 section 6.2.3.7 */ /* number of macroblocks per picture */ MBAmax = frame->mb_width*frame->mb_height; if (frame->picture_structure!=FRAME_PICTURE)

PAGE 46

39 MBAmax>>=1; /* field picture has half as mnay macroblocks as frame */ rMBA=0;frame->pnum=framenum;frame->MBAmax=MBAmax;for(;;){ if((ret=slice(frame,framenum, MBAmax))<0)break; //if slice return -1,it mean we meet the start code for next picture. }//picture_data(frame,bitstream_framenum);if (frame->picture_structure!=FRAME_PICTURE) Second_Field = !Second_Field; return 0; }void decodehighhalf(struct PictureBuffer * frame) //decode the high half part of microblock for a picture {int comp,i; int MBAmax;

PAGE 47

40 MBAmax=frame->MBAmax;for(i=(MBAmax/2);iblock[comp],frame->data[i].blocks[comp], 64); motion_compensation(frame,frame->data[i].blocks,MBAmax,i, frame->data[i].macroblock_type, frame->data[i].motion_type,frame->data[i].PMV, frame->data[i].motion_vertical_field_select, frame->data[i].dmvector, frame->data[i].stwtype, frame->data[i].dct_type); } }void decodelowhalf(struct PictureBuffer * frame) {//int comp; int i; int MBAmax; MBAmax=frame->MBAmax;for(i=0;iblock[comp],frame->data[i].blocks[comp], 64);

    motion_compensation(frame, frame->data[i].blocks,MBAmax,i,
      frame->data[i].macroblock_type,
      frame->data[i].motion_type,
      frame->data[i].PMV,
      frame->data[i].motion_vertical_field_select,
      frame->data[i].dmvector,
      frame->data[i].stwtype,
      frame->data[i].dct_type);
  }
}

void bufferchange(struct PictureBuffer * srcframe, struct PictureBuffer * desframe)
{
  desframe->ld->Incnt=srcframe->ld->Incnt;
  desframe->ld->Rdptr = (srcframe->ld->Rdptr-srcframe->ld->Rdbfr) +desframe->ld->Rdbfr ;
  desframe->ld->Rdmax = srcframe->ld->Rdmax;
  desframe->ld->Bfr = srcframe->ld->Bfr;
  memcpy(desframe->ld->Rdbfr,srcframe->ld->Rdbfr,2048);
}

void dothework(char *MyId)
{
  printf("this is thread 2 running\n");
  // getch();

  while(1){
    if(decodestate==START){
      //WaitForSingleObject( hDecodeStart, INFINITE );
      //ResetEvent(hDecodeStart);
    }
    if(decodestate==MIX){
      //WaitForSingleObject( hDecodeMix, INFINITE );
      //ResetEvent(hDecodeMix);
      decodelowhalf(frameptr[(basenum)%3]);
      decodehighhalf(frameptr[(basenum)%3]);
      frame_reorder(frameptr[basenum%3],Bitstream_Framenum, Sequence_Framenum);
      //getch();
      if (!frameptr[basenum%3]->Second_Field)
      {
        Bitstream_Framenum++;
        Sequence_Framenum++;
      }
      //SetEvent(hDecodeEnd);
      //WaitForSingleObject(hNEXTSTATE,INFINITE);

      //ResetEvent(hNEXTSTATE);
    }
    else if(decodestate==SINGLEDECODE)
    {
      //WaitForSingleObject( hDecodeSingle, INFINITE );
      //ResetEvent(hDecodeSingle);
      //decodelowhalf(frameptr[(basenum+1)%3]);
      decodehighhalf(frameptr[(basenum+1)%3]);
      //getch();
      //SetEvent(hDecodeEnd);
      //WaitForSingleObject(hNEXTSTATE,INFINITE);
      //ResetEvent(hNEXTSTATE);
    }
    else if(decodestate==END)
      break;
    else if(decodestate==ERRHEAD)
      break;
  }
}
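The read-pointer rebasing done in bufferchange can be looked at in isolation. The following is a minimal sketch, not the decoder's actual data structures: `DemoLayer`, `demo_bufferchange`, `demo_offset`, and `RDBFR_SIZE` are hypothetical names introduced only for illustration, and the 2048-byte size is taken from the memcpy call in the listing. It shows that copying the buffer and adding the old offset to the new base keeps the read position pointing at the same byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RDBFR_SIZE 2048  /* assumed: matches the 2048-byte memcpy in bufferchange */

/* Hypothetical, reduced stand-in for the decoder's bitstream layer state. */
struct DemoLayer {
    unsigned char  Rdbfr[RDBFR_SIZE]; /* read buffer */
    unsigned char *Rdptr;             /* current read position inside Rdbfr */
};

/* Copy the buffered bytes and rebase the read pointer so that it keeps
   the same offset inside the destination buffer. */
static void demo_bufferchange(const struct DemoLayer *src, struct DemoLayer *dst)
{
    ptrdiff_t offset = src->Rdptr - src->Rdbfr;

    /* The source pointer must lie inside (or one past) its own buffer. */
    assert(offset >= 0 && offset <= RDBFR_SIZE);

    memcpy(dst->Rdbfr, src->Rdbfr, RDBFR_SIZE);
    dst->Rdptr = dst->Rdbfr + offset;
}

/* Return the read offset of a layer, for inspection. */
static ptrdiff_t demo_offset(const struct DemoLayer *layer)
{
    return layer->Rdptr - layer->Rdbfr;
}
```

A caller would fill one `DemoLayer`, set its `Rdptr` somewhere inside `Rdbfr`, and call `demo_bufferchange` to obtain a second layer whose pointer refers to its own copy of the data rather than to the source buffer.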


BIOGRAPHICAL SKETCH

Yishu He was born in Tianjin, China. She received the Bachelor of Arts degree from Jilin University, China, in July 1998. She will receive her Master of Science degree in computer and information science and engineering from the University of Florida, Gainesville, in August 2002. Her research interests include MPEG-2 video compression.