Wireless/mobile video delivery architecture

Material Information

Wireless/mobile video delivery architecture
Sampath, Latha, 1977- ( Dissertant )
Helal, Abdelsalam A. ( Thesis advisor )
Dankel, Dr. ( Reviewer )
Raj, Dr. ( Reviewer )
Frank, Dr. ( Reviewer )
Place of Publication:
Gainesville, Fla.
Publisher:
University of Florida
Publication Date:
2000
Physical Description:
88 p.


Subjects / Keywords:
Acoustic data ( jstor )
Databases ( jstor )
Digital video files ( jstor )
Downloading ( jstor )
Image compression ( jstor )
Indexing ( jstor )
Multimedia materials ( jstor )
Streaming ( jstor )
Subject terms ( jstor )
Video data ( jstor )
Computer and Information Science and Engineering thesis, M.S ( lcsh )
Dissertations, Academic -- Computer and Information Science and Engineering -- UF ( lcsh )
Wireless communication systems ( lcsh )
government publication (state, provincial, territorial, dependent) ( marcgt )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


A wireless network is a source of concern for any kind of data transfer, because of its limited capacity. When delivering video over a wireless network, the problem is intensified because of the nature of video data. Users of wireless devices are increasingly attempting to access the Web, with its highly visual content, using devices with limited capabilities. The video delivery system must adjust itself to these stringent conditions and must perform to the user's satisfaction. In this thesis we present an architecture aimed at delivering video over a wireless network in the most satisfactory manner possible under the existing circumstances. The architecture makes use of the fact that wireless network devices generally have limited capabilities and will not be able to exhibit the full quality of a high-quality video. We present methods of reducing the quality of the video without making it too obvious to the user. We also present ways and means to make the burden of performance fall on the video server, instead of the client, so that the user is not made aware of the problems that normally occur in such situations. This is achieved using a matrix architecture, and indexing of the video file on the server side, so that the server is able to make decisions at runtime, depending on the network condition and the client device type.
Thesis (M.S.)--University of Florida, 2000.
Includes bibliographical references (p. 76-77).
General Note:
Title from first page of PDF file.
General Note:
Document formatted into pages; contains x, 78 p.; also contains graphics.
Statement of Responsibility:
by Latha Sampath.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
50751314 ( OCLC )
002678741 ( AlephBibNum )
ANE5968 ( NOTIS )









To my family


I express my sincere gratitude to my advisor, Dr. Sumi Helal, for giving me the

opportunity to work on this challenging topic and for providing continuous guidance and

feedback during the course of this work and thesis writing. Without his encouragement

this thesis would not have been possible.

I thank Dr. Dankel and Dr. Raj for agreeing to be on my committee. I also thank

Dr. Frank for agreeing to attend my defense at my last-minute request.

I thank Sharon Grant for making the Database Center a truly great place to work,

and Matt Belcher for providing me with all the computing facilities I needed.

I give special thanks to my friend Yan Huang, who assisted me in the initial

stages of this project.

I especially thank my friend Sangeetha Shekar, for all her support, encouragement

and help during my work. I thank my friends Subha, Harini, Vidyamani, Amit and

Prashant for their support; and for making my stay at the University of Florida a

memorable one.

I thank my parents and my sister and brother-in-law for their constant emotional

support and encouragement throughout my academic career.


ACKNOWLEDGEMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

1. INTRODUCTION

2. REVIEW OF RELATED WORK
2.1. Nature of Video Data
2.2. Video Compression Standards
2.2.1. Existing Standards
2.2.2. Emerging Standards
2.3. Problems with Wireless Video Delivery
2.3.1. Problems Due to the Wireless Network
2.3.2. Problems Due to the Nature of Video Data
2.3.3. Problems Due to Client Device Capabilities
2.4. Transcoding

3. ANALYZING VIDEO STREAMING
3.1. What Is Streaming and How Will It Help?
3.2. Existing Streaming Technology
3.3. Theoretical Achievement of Streaming
3.4. Streaming for Wireless Networks
3.5. Limitations of Streaming

4. VIDEO REDUCTION
4.1. The Need for Reduction
4.2. Color Reduction
4.3. Size Reduction
4.4. Frame Dropping
4.5. Hybrid Reductions
4.6. Is Online Video Reduction Feasible?

5. WIRELESS VIDEO WITH MPEG-7
5.1. Introduction
5.2. What Is MPEG-7?
5.3. Descriptors and Description Schemes
5.4. Scenarios with MPEG-7
5.4.1. Distributed Processing
5.4.2. Content Exchange
5.4.3. Customized Views
5.5. Wireless Video Using MPEG-7

6. UNIVERSAL MULTIMEDIA ACCESS AND MPEG-7
6.1. Overview of UMA
6.1.1. UMA Variations
6.1.2. UMA Attributes
6.1.3. The Importance of UMA
6.2. Conceptual Model
6.2.1. Factors Affecting Delivery
6.2.2. Network Conditions
6.2.3. Client Device Modality
6.2.4. Multimedia Content
6.2.5. User Preferences
6.3. The Reduction Set and the Restrictions on It
6.4. Extensions on the Matrix Theme

7. METHODOLOGY OF IMPLEMENTATION
7.1. The Matrix
7.2. The Secondary Indexing Scheme
7.3. The Introduction of the Packager
7.4. Practical Issues

8.1. The MPEG Player
8.2. Client-Server Setup
8.3. Streaming
8.4. Reduction
8.5. Indexing and the Matrix
8.6. Illustration of the Execution Process
8.6.1. The Client
8.6.2. The Server

9. EXPERIMENTAL EVALUATION
9.1. Description of the Experimental Setup
9.2. Results

10. CONCLUSIONS AND FUTURE WORK
10.1. Achievements of this Thesis
10.2. Proposed Extensions and Future Work

LIST OF REFERENCES

BIOGRAPHICAL SKETCH



Table

6-1: Classification of reductions performed on multimedia data.
6-2: UMA Attributes.
6-3: Classification of networks based on connection quality.
6-4: Classification of devices based on device modality.
6-5: Mapping of (Connections, Modality) to Variations.
9-1: Performance of streaming without indexing, and with no matrix.
9-2: Performance of streaming with index, and with no matrix.
9-3: Performance of server without indexing, with 2 possible levels each of network conditions and device modality.
9-4: Performance of server with indexing, with 2 possible levels each of network conditions and device modality.
9-5: Performance of server without indexing, with 3 possible levels each of network conditions and device modality.
9-6: Performance of server with indexing, with 3 possible levels each of network conditions and device modality.
9-7: Performance of the scheme for a fixed matrix element.


Figure

5-1: A UML-based representation of possible relations between D and DS.
6-1: Three-dimensional view of MPEG-7 and the description schema for UMA.
7-1: Overview of the process.
8-1: The client.
8-2: The client while playing the video.
8-3: The server.
9-1: Performance of the scheme with client device Class 1.
9-2: Performance of the scheme with client device Class 2.
9-3: Performance of the scheme with client device Class 3.
9-4: Performance of the scheme for a fixed matrix element, with variation in buffering time.

Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science



WIRELESS/MOBILE VIDEO DELIVERY ARCHITECTURE

By
Latha Sampath

December, 2000
Chairman: Dr. Sumi Helal
Major Department: Computer and Information Science and Engineering

A wireless network is a source of concern for any kind of data transfer,

because of its limited capacity. When delivering video over a wireless network, the

problem is intensified because of the nature of video data. Users of wireless devices are

increasingly attempting to access the Web, with its highly visual content, using devices

with limited capabilities. The video delivery system must adjust itself to these stringent

conditions and must perform to the user's satisfaction.

In this thesis we present an architecture aimed at delivering video over a wireless

network in the most satisfactory manner possible under the existing circumstances. The

architecture makes use of the fact that wireless network devices generally have limited

capabilities and will not be able to exhibit the full quality of a high-quality video. We

present methods of reducing the quality of the video without making it too obvious to the user.


We also present ways and means to make the burden of performance fall on the

video server, instead of the client, so that the user is not made aware of the problems that

normally occur in such situations. This is achieved using a matrix architecture, and

indexing of the video file on the server side, so that the server is able to make decisions at

runtime, depending on the network condition and the client device type.


1. INTRODUCTION
As the trend toward mobile computing increases, so does the need for multimedia

support for the wireless network. The support of multimedia on wireless systems will

allow many interesting new applications. Because video data is generally bulky, attempts

have been made to compress this data stream into a manageable size that can be

transmitted over a network, without wasting valuable bandwidth. One of the more

important standards of compression that has emerged is MPEG (Moving Picture Experts

Group). This document deals with the various aspects of delivery of video data over a

wireless network.

One of the main concerns of the process of delivering video data over a wireless

network is the poor quality of wireless networks. The wireless network has problems of

bandwidth, latency and disconnection. The various problems of wireless networks are

discussed in detail in Chapter 2. These affect the quality of video delivery to wireless

clients. The problem is aggravated by the nature of the data being transferred over such a

network. Video data is bulky by nature and cannot easily absorb data loss without the user becoming aware of it. The capabilities of the client device are

also less than those of normal fixed network devices, so sending the video to the client in

its entirety does not even make sense. All these concerns and possible solutions are

discussed in Chapter 2.

One alternative to downloading the entire video and then playing it is streaming. Streaming allows the client to buffer a small part of the video initially and

then to begin the playing, while downloading continues in the background. The amount

buffered is just sufficient so that the client application does not starve for data in the

middle of playing the video. Streaming is discussed in Chapter 3.
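The buffering requirement just described can be expressed as a simple calculation. The following sketch is a hypothetical illustration; the function name and parameters are ours, not part of any streaming protocol. It computes the minimum startup buffering time so that playback, once started, never outruns the ongoing download:

```python
def startup_buffer_seconds(video_bytes, duration_s, download_bps):
    """Minimum initial buffering time (seconds) so playback never starves."""
    playback_bps = video_bytes / duration_s      # average consumption rate
    if download_bps >= playback_bps:
        return 0.0                               # network keeps up; start at once
    # We need buffer_bytes + download_bps * duration_s >= video_bytes,
    # so prefetch the deficit before starting playback.
    deficit = video_bytes - download_bps * duration_s
    return deficit / download_bps                # time needed to prefetch the deficit
```

For example, a 1000-byte clip playing for 10 seconds over a 50-byte/s link needs 10 seconds of prebuffering, while a 200-byte/s link needs none.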

However, streaming is not an ultimate solution. Streaming alone is insufficient for

good-quality video delivery over a wireless network as network quality drops. Hence we come to the next step in reducing the bandwidth requirements of

delivering the video: reducing the number of bytes to be transmitted. The way we do this

is by reducing the information stored in the video, by a variety of choices. For example,

we could reduce the color quality from 256-color to 16-color, or further to grayscale. Or

we could reduce the resolution of the video. This is actually helped by the fact that the

client devices would probably not have the capability to exhibit the quality of the original

video anyway, since wireless devices in general compromise on hardware capabilities.

Chapter 4 discusses the various possibilities in terms of reductions of the original video.
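As a rough illustration of the reductions just mentioned, consider the pure-Python sketch below. It operates on uncompressed in-memory pixel data purely for clarity; a real system would apply such reductions to the compressed stream:

```python
def to_grayscale(pixels):
    """Map (r, g, b) tuples to single luma values (ITU-R BT.601 weights)."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in pixels]

def quantize_channel(value, levels):
    """Reduce one 0-255 channel to `levels` evenly spaced values
    (e.g. levels=16 approximates a 16-color reduction per channel)."""
    step = 255 / (levels - 1)
    return round(round(value / step) * step)

def halve_resolution(frame):
    """Drop every other row and column of a 2D frame (nearest-neighbour)."""
    return [row[::2] for row in frame[::2]]
```

Each function discards information the constrained client could not display well anyway, which is exactly the trade-off Chapter 4 explores.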

With all these changes to the normal streaming process, algorithmic complexities

are inevitable, and we shall have to study the MPEG decoding algorithm closely to be

able to decide which option to follow. We can also have a totally different method to

actually access the video, or make changes to the basic MPEG format to accommodate

easy online reductions. These options are discussed in Chapters 6 and 7.

The delivery of video over a wireless network is not an easy task, as we can see. It

involves many issues that may be unseen at first glance, but which occur during

implementation. This document deals with many of these issues. Some issues, discussed

in the concluding chapter, are not resolved. Ultimately, the goal of delivering good

quality video to the user has to be achieved, with the user's approval of the perceived

video quality. We may use arbitrarily complicated data structures to access and process the video data, but if the user is not happy with the reduction chosen by the server, the service is of no use. This document attempts to achieve this goal in part, with some measure of success.


2. REVIEW OF RELATED WORK
Video data modeling is the process of designing the representation for the video

data, based on its characteristics, content and application. Before we analyze the process

of transmitting video data over a wireless network, let us understand the nature of video

data, and the complex issues involved in dealing with this type of data.

2.1. Nature of Video Data

Video data is bulky and very rich in content [1]. Audiovisual content comprises the physical characteristics of the video data, such as color, texture, objects, interrelationships among objects, and camera operation. Textual content may contain embedded information

about the clip, such as a caption, title, names of characters etc. Semantic content is the

data presented to the user. Video data is traditionally represented in the form of a stream

of images, called frames. These frames are displayed to the user at a constant rate, called

frame rate (frames/second). These frames may or may not be accessed independently,

depending on the complexity of the video browser. For most applications, the entire

video stream is too coarse (too high) a level of abstraction. However, a single frame is too

fine a level of abstraction, because a single frame lasts for a very short time, and each

stream contains an unmanageable number of frames. Other, intermediate levels of abstraction are often used for data extraction, and a hierarchy of video stream abstractions can be formed.

The basic element for characterizing video data is the shot. A shot consists of a series of frames generated and recorded contiguously, representing a continuous

action in time and space. A set of shots that are related in time and space is defined as a

scene. Related scenes are grouped together into a sequence. Related sequences are

grouped together into a compound unit, which can recursively form part of a larger compound unit.

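The shot-scene-sequence-compound hierarchy above can be captured directly as a recursive data model. The sketch below is illustrative only; the class names simply follow the terminology of this section:

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Shot:
    """Contiguously recorded frames: the basic unit of video structure."""
    first_frame: int
    last_frame: int

@dataclass
class Scene:
    """A set of shots related in time and space."""
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Sequence:
    """Related scenes grouped together."""
    scenes: List[Scene] = field(default_factory=list)

@dataclass
class CompoundUnit:
    """Related sequences; may recursively contain other compound units."""
    parts: List[Union[Sequence, "CompoundUnit"]] = field(default_factory=list)
```

The recursion in CompoundUnit mirrors the text: a compound unit may itself be part of a larger compound unit.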

A necessary characteristic of a video data representation is its annotation

capability [2]. It should permit the association of content descriptions with the data, to

facilitate easy content-based extraction of data. Annotations of data should be alterable

based on user interpretation and context.

2.2. Video Compression Standards

Video data is generally bulky, and must be compressed to bring it to a

manageable size that can be stored, manipulated and transmitted over a network, without

wasting valuable bandwidth. Video can be compressed both temporally and spatially.

Spatial compression operates within individual frames; temporal compression exploits similarity between adjacent frames. Compression techniques can be either lossless or lossy. Lossless

compression techniques attempt to compress the storage of data using mathematical

functions alone. Lossy compression techniques leave out redundant parts of the

information in the frames (which can be reconstructed later) and perform better

compression at the cost of data loss. Most current compression standards use lossy

techniques. Several compression standards have been devised in the past, and are now in

common use and we discuss some of them here.

2.2.1. Existing Standards

In the past, the focus was mainly on images, and representation of images was

generally in the form of bitmaps and compressed versions of bitmaps. Several

compression schemes emerged for storing image data. The JPEG (Joint Photographic

Experts Group) compression format was one of the most popular formats used for storing

images, as it took up the least amount of space. It performs lossy compression of still images, that is, photographs. It works well on images of real-world scenes, but not on

lettering, cartoons and line drawings. This is because JPEG takes advantage of limitations

of the human eye in perceiving minute differences in color and brightness, for

compression of pictures. With line drawings, there is not much detail that can be left out

in the process of lossy compression, and JPEG cannot make use of the same limitations

for such simple pictures.

The JPEG format is a symmetric codec (COmpressor/DECompressor): encoding and decoding take comparable effort. The coding process involves an RGB-to-YIQ transformation, DCT, quantization, DPCM (Differential Pulse Code Modulation), RLE (Run-Length Encoding) and finally entropy

encoding [3]. The advantage of JPEG is that the encoder can control the trade-off

between compression ratio and image quality, or between compression ratio and

compression speed.
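Two of these stages, the DCT and quantization, can be sketched in a few lines. This is a simplified, unoptimized illustration of the standard 8x8 transform; the color transform, DPCM, RLE and entropy-coding stages are omitted:

```python
import math

def dct_8x8(block):
    """2-D DCT-II of an 8x8 block of level-shifted samples (JPEG convention)."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = sum(block[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / 16)
                    * math.cos((2 * y + 1) * v * math.pi / 16)
                    for x in range(8) for y in range(8))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

def quantize(coeffs, q):
    """Divide each DCT coefficient by its quantization-table entry and round.
    This is the lossy step that discards detail the eye barely notices."""
    return [[round(coeffs[u][v] / q[u][v]) for v in range(8)] for u in range(8)]
```

The encoder's quality/size trade-off mentioned above is controlled precisely by the quantization table `q`: larger entries discard more detail and compress harder.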

The GIF (Graphics Interchange Format) format uses an LZW compression scheme (the same as StuffIt and PKZip), which does not change the quality of the image [4].

Instead, it uses an index color palette, which may degrade the color quality of images, but

effectively cuts the size of an RGB image by two-thirds. It uses a smaller color depth as

compared to JPEG and supports line drawings more than full-color illustrations. This is

because GIF uses a lossless compression scheme and, hence, compromises on the color

depth. If GIF files were to support the same range of color as JPEG images, the file size

would become too large to download.
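The two-thirds figure follows directly from replacing three bytes of RGB per pixel with a one-byte palette index. A quick back-of-the-envelope check (illustrative only):

```python
def rgb_bytes(width, height):
    """Raw 24-bit colour: 3 bytes per pixel."""
    return width * height * 3

def indexed_bytes(width, height, palette_colors=256):
    """One index byte per pixel, plus the palette itself (3 bytes per entry)."""
    return width * height * 1 + palette_colors * 3
```

For a 640x480 image this gives 921,600 bytes raw versus 307,968 bytes indexed: roughly one-third the size even after including the 768-byte palette.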

There are several other compression formats besides these two: the TIFF, TARGA and BMP formats, for example. These do not achieve any significant compression and store the image in nearly its original bitmap form.

Then came the video formats: AVI [5] and QuickTime's MOV. The AVI (Audio-

Video Interleave) format saves audio and video data interleaved. This format uses the keyframe concept: only every 12th to 17th picture is saved in full, depending on the picture contents. The intermediate pictures are saved as their differences from the

preceding frame. The AVI videos cannot be streamed. The entire file needs to be present

in local memory before the video can be played.

We also had the M-JPEG (Motion JPEG) format that extended the JPEG format

[1]. This format compresses frames on a one-by-one basis, without any reference to

neighboring frames. Because every frame is independently decodable, the M-JPEG scheme minimizes the time and computational resources required in a digital editing environment, making it a good fit for video editing.

Finally, we had the emergence of the MPEG (Moving Picture Experts Group) compression format. MPEG provides a very good compression ratio, but the number of calculations required makes software MPEG codecs somewhat slow. All MPEG compression formats use the principle of classifying frames into I (Intra-coded), P (Predictive-coded), B (Bidirectionally predictive-coded) and D (DC-coded) frames. Each

of these different types of frames has a different typical size, quantity of information,

encoding method and importance in the frame sequence. The MPEG group developed a

series of formats for video compression. The MPEG-1 format was the first, with only

progressive video encoding. The MPEG-2 format came next, with interlaced as well as

progressive encoding. Then came MPEG-4,¹ with object-oriented concepts of representing objects in a picture.
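A typical arrangement of these frame types, the group of pictures (GOP), can be generated with a short sketch. The 12-frame IBBPBB pattern shown is a common choice, not something the standard mandates:

```python
def gop_pattern(gop_size=12, p_distance=3):
    """Build a typical MPEG group-of-pictures pattern, e.g. 'IBBPBBPBBPBB':
    one I-frame anchor, a P-frame every `p_distance` slots, B-frames between."""
    frames = []
    for i in range(gop_size):
        if i == 0:
            frames.append("I")          # self-contained anchor frame
        elif i % p_distance == 0:
            frames.append("P")          # predicted from the previous I/P frame
        else:
            frames.append("B")          # predicted from frames on both sides
    return "".join(frames)
```

Because I-frames are the only self-contained frames, their spacing governs both compression ratio and how quickly a decoder can resynchronize after data loss.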

2.2.2. Emerging Standards

The MPEG-4 format had totally different ways to encode a picture. Instead of the

standard bitmap-by-bitmap encoding, MPEG-4 used an object-oriented encoding scheme,

to represent the different objects in a picture [6]. These objects, called Audio-Visual

Objects, were used to put the picture together. There are simple AVOs, such as graphics,

background, text, animation and 3D objects. These can be combined to form other AVOs,

called compound AVOs. Such simple and compound AVOs together compose an

audiovisual scene. More generally, MPEG-4 provides a standardized way to describe a

scene (for example allowing media objects to be placed anywhere within a coordinate

system, allowing the application of transforms to change the physical properties of an

object or changing the user's viewpoint of a scene).

The MPEG-4 format is different from its predecessors in the extent of user

interaction allowed with the scene and the objects contained in it. The MPEG-4 format's

language for describing and manipulating the scene is called BIFS (BInary Format for

Scenes). The BIFS commands are available for addition/deletion of objects, and

alteration of individual properties of individual objects. The BIFS code is binary, so it

occupies minimum space. The MPEG-4 format uses BIFS for real-time streaming (that is,

a scene does not need to be downloaded fully for playing, but can be played on the fly).

¹ MPEG-3 existed once upon a time, but it basically did what MPEG-2 could already do, so it was abandoned.

The MPEG-4 format's representation of data for transport is different too. Its

objects are placed in what are called Elementary Streams (ES). Some objects, such as a

sound track or video, will have a single such stream. Others may have two or more.

Higher level data describing the scene will have its own ES, making it easier to reuse

objects in the production of new multimedia content. If parts of a scene are to be

delivered only under certain conditions, multiple scene description ESs may be used to

describe the same scene under different circumstances. Object descriptors (ODs) are

used to tell the system which ESs belong to which object, and which decoders are needed

to decode a stream.
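Conceptually, an object descriptor is a mapping from an object to its elementary streams and the decoder they require. The toy sketch below illustrates the idea only: all names, stream ids and decoder labels are invented and are not MPEG-4 syntax:

```python
# Hypothetical object-descriptor table; entries are illustrative, not
# drawn from the MPEG-4 Systems specification.
object_descriptors = {
    "scene_graph":    {"elementary_streams": [0],    "decoder": "bifs"},
    "main_video":     {"elementary_streams": [1, 2], "decoder": "mpeg4-video"},
    "narrator_audio": {"elementary_streams": [3],    "decoder": "aac"},
}

def streams_for(object_name):
    """Look up which elementary streams carry a given object."""
    return object_descriptors[object_name]["elementary_streams"]
```

A receiver consults such a table to route each incoming stream to the right decoder, which is exactly the role the text assigns to ODs.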

All this conversion is done in a layer devoted solely to synchronization. Here

elementary streams are split into packets, and timing information is added, before passing

the packets on to the transport layer. Timing information is important for many video

compression schemes, which encode a frame in terms of its adjacent frames. The existing

MPEG-2 transport scheme suffices, along with others such as the Asynchronous Transfer

Mode (ATM) and Real-time Transport Protocol (RTP).

After MPEG-4 came MPEG-7.² The MPEG-7 format is targeted at the problem of

browsing a video file to detect frames of interest to the user. MPEG-7 defines a standard

set of descriptors that can be used to describe various types of multimedia information, in

the form of video, audio, graphics, images, 3D models etc. The MPEG-7 format also

defines other descriptors and structures that will define the relationships among these

descriptors. The MPEG-7 data may be located anywhere within the same stream, the

² After MPEG-1, MPEG-2 and MPEG-4, there was a lot of speculation about which number to use next: 5 (the next number) or 8 (following a binary pattern). MPEG decided not to follow either pattern, and chose MPEG-7 as the name of the new standard.

same storage system or anywhere on the network. Links are used to connect the

descriptors with the associated material. The MPEG-7 format is discussed in greater

detail in Chapter 5, since it is at the center of the work involved in this document.
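The descriptor/description-scheme relationship can be pictured as a small data model. This is an illustrative sketch only; MPEG-7 defines its schemes in a dedicated Description Definition Language, not in Python, and the field names here are ours:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Descriptor:
    """A single MPEG-7-style feature description (names are illustrative)."""
    name: str            # e.g. "dominant_color"
    value: object

@dataclass
class DescriptionScheme:
    """Relates descriptors (and nested schemes) and links them to the media,
    which may live anywhere: same stream, same store, or across the network."""
    descriptors: List[Descriptor] = field(default_factory=list)
    children: List["DescriptionScheme"] = field(default_factory=list)
    media_link: str = ""   # URL or stream offset of the described material
```

The `media_link` field captures the point made above: the description need not travel with the content it describes.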

The latest standard is MPEG-21 [6]. This standard is mainly concerned with

satisfying the user. Consumers are becoming better equipped and more knowledgeable about the technology available to make life easier. At the same time, system administrators must become aware of the various factors

that affect the quality of video delivery and make sure that consumers get exactly what

they ask for, and no less. This holds true for network delivery, quality and flexibility of

service, quality of content delivered and ease of use. This also holds for physical factors

such as format compatibility, platform compatibility and other user-oriented factors such

as ease of subscription. The system is also expected to be capable of search and retrieve

operations, besides the normal operations, and customer rights and privacy have to be

respected. It is expected that the recognition of such factors as important will go a long

way toward standardizing service and making it acceptable to users.

The Dublin Core [7] is a lightweight standard that has attracted much attention. It

is a standard that deals with metadata (data about data), aimed at easy discovery of electronic resources: a resource can often be located and assessed through its description alone, without retrieving the resource itself. The Web has such diverse information that a vast infrastructure is

needed to support all the metadata. Efforts are being made to develop such a standard for

this purpose. The procedures involved in forming the Dublin Core are mainly borrowed

from other standards' committees.

2.3. Problems with Wireless Video Delivery

The delivery of video over a wireless network raises many issues that need
careful attention. These problems arise not just because the network is not

the high-bandwidth fixed network, but also because the data we have chosen to transmit

over such a fickle network has unique requirements [8].

2.3.1. Problems Due to the Wireless Network

First of all, the existing internetworking protocols do not support wireless

communication to any great extent. Wireless communication is closely linked with

mobility and with mobility come the demands of dynamic routing and frequent

disconnection. These were not really taken into consideration before, and different

options to improve this are now being studied. Secondly, wireless communication has

additional shortcomings, such as low bandwidth, high bit error rate and packet loss.

Existing transport protocols cannot distinguish packet loss caused by mobility from
loss caused by congestion; they simply lack mobility awareness. Wireless
networks also suffer from increased latency, which causes additional data loss and a
considerable decrease in perceived video quality.

2.3.2. Problems Due to the Nature of Video Data

Additional technical difficulties arise when multimedia needs to be supported

over this wireless network. The real-time nature of multimedia applications makes time

considerations an important issue, since for playback quality to be good, the delivery

times of audio and video streams have to match exactly. Video compression introduces
additional, variable delay, which shows up as packet jitter. Existing protocols also fail to make intelligent

decisions based on the nature of the application as to the needed quality of data delivery.

Different applications may have different requirements, for example, regarding the

relative importance of audio over video. At the same time, arbitrary data loss can
noticeably degrade the perceived quality.

Additionally, the different streams of data (video and audio) have to be

synchronized at all times or else the perceived video quality is reduced.

2.3.3 Problems Due to Client Device Capabilities

The devices to which the multimedia data has to be downloaded over a wireless

network are generally devices with low capabilities in their display unit, CPU processing

power, audio quality and other factors of importance where multimedia is concerned.

Such devices generally have hardly any processing capability, which limits the ability of

the device to decode the bitstream of multimedia data and display it. The display unit of

such devices is also generally barely functional, with maybe a black and white display,

minimal refresh rate and low picture resolution. It actually becomes pointless to

download the entire content of the multimedia data to such a device. The device would

also have limited local memory to store large multimedia files.

2.4 Transcoding

Many problems arise when transmitting multimedia data over the wireless

network, and it is unnecessary to send all the components of the data. One approach

would be to leave out some component (such as color) or to reduce the quality of the data

(for example, by reducing the resolution) when transmitting the data to a device with

limited capabilities. This practice, called transcoding [8], has been tested for wireless

networks and multimedia data, and also for normal downloading applications.

Different schemes were tried for transcoding. One of the first suggested schemes

was a layered encoding scheme. Each layer would add more quality to the encoded

signal. The individual layers are then transmitted on separate multicast addresses.

Receivers should be able to adapt to changing network conditions by adjusting the

number of levels they subscribe to.

Layered encoding, however, does not provide any significant advantage to speech

codecs, because available speech codecs do not lend themselves easily to layering, and an

exponential relationship between adjacent layers is required for maximum throughput.

For video, the target frame rate cannot be deviated from.

Another scheme that was tried was simulcasting. A group of receivers can adjust

to network conditions by customizing parallel streams transmitted by the sender, to match

their needs. However, this causes congestion near the sender, since all parallel streams

have to emerge from the sender.

Self-organized transcoding [9] is an interesting scheme that uses automatically

configurable transcoders to improve bad reception. When a group of receivers detect a

congested link, an upstream receiver with better reception acts as a transcoder and

provides a customized version of the stream. This new stream is multicast to a different

address, so that affected receivers can switch to that address. The suffering receivers must

elect a representative who attempts to locate a suitable upstream receiver to serve them,

and to coordinate the transcoding process.

The actual method we select to implement the transcoding process without

pressure on the server is discussed in detail in Chapters 6 and 7.


This chapter discusses a common alternative to downloading the entire

multimedia data file before playing it: streaming. With streaming, the playing starts even

though the entire file may not have been downloaded, and downloading continues as a

background process.

3.1. What Is Streaming And How Will It Help?

Streaming is basically downloading and playing at the same time. Instead of

waiting for the entire downloading process to be finished before the client starts working

with (decoding) the file, the client starts playing once it gets enough data to prevent itself

from starving for data because of the low downloading speed. This obviously improves

the whole performance of the client. The client buffers enough data (frames) to be able to

start playing and then performs the downloading as a background process, while playing

the file at the same time.

This reduces the wait time of the user. The user does not have to wait for the

entire file to be downloaded, but can start watching the video/audio playing, while the

rest of the file is downloaded in the background. The client has to buffer enough data to

be able to play the file continuously, without running out of data and pausing in between

to buffer more.
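The interplay of buffering, background downloading, and playback described above can be sketched with two threads sharing a queue. This is a simulation only, under our own assumptions: integer frame numbers stand in for real frames, and decoding and display are elided.

```python
import queue
import threading
import time

FRAME_RATE = 100         # frames per second the simulated player consumes
BUFFER_FRAMES = 30       # frames that must be buffered before playback starts

played = []              # record of the frames the player has displayed

def downloader(buf, total_frames):
    """Background thread: fetch frames and append them to the buffer."""
    for frame_no in range(total_frames):
        buf.put(frame_no)            # blocks if the buffer is full
        time.sleep(0.002)            # simulated per-frame network delay

def player(buf, total_frames):
    """Wait for the initial buffer to fill, then consume frames steadily."""
    while buf.qsize() < BUFFER_FRAMES:
        time.sleep(0.01)             # initial buffering phase
    for _ in range(total_frames):
        played.append(buf.get())     # blocks (re-buffers) if data runs out
        time.sleep(1 / FRAME_RATE)

total = 90
buf = queue.Queue()
t_down = threading.Thread(target=downloader, args=(buf, total))
t_play = threading.Thread(target=player, args=(buf, total))
t_down.start(); t_play.start()
t_down.join(); t_play.join()
```

The queue gives the synchronization between the two threads for free: the player blocks when the buffer runs dry, which is exactly the "pause for buffering" behavior discussed below.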

3.2. Existing Streaming Technology

Many existing multimedia applications deliver video using streaming
technology. In this section we discuss the existing streaming technology.

QuickTime is the Macintosh multimedia application [10]. QuickTime has a long

list of compatible formats, which makes it one of the most flexible applications. The

server converts live media data into packets, which it sends to the client via multicast or

unicast, as the case may be. The streaming, based on RTP (Real-time Transport Protocol)

and RTSP (Real-Time Streaming Protocol), is industry-standard streaming. However,
some network equipment has not yet caught up with the standards QuickTime uses. The application

has a number of attractive features that account for its popularity.

Windows Media Player is Microsoft's media application [11], for the Advanced

Streaming Format. The Advanced Streaming Format supports streaming from a

streaming server or a typical Web server. This application uses a template approach, with

one template for each type of application. It does not work on Macintosh machines, but it

plays AVI, WAV and MP3 files, the most popular formats in use today.

Without doubt, the RealNetworks application RealPlayer is one of the best media

streaming applications in use today [11]. The application supports almost all media types,

and includes a browser plug-in. The RealProducer software generates the multimedia file

in the rm format compatible with the RealPlayer software. The RealPlayer application

sometimes has buffering problems, but otherwise in terms of versatility and ease of use, it

is one of the best in use today.

3.3. Theoretical Achievement of Streaming

To be able to achieve streaming, several issues must be addressed. The first one is

how to let the playing process and the downloading process access the same file at the

same time. A possible solution is to split an MPEG-2 video file into small portions, in

such a way that the two processes are actually accessing different files. However, this

might require big changes to the decoder in the client, which currently might be using

only one whole file, not a stream. As an alternative solution, multiple threads might be

used to act as the play and download processes, which needs thread handling. Another

issue of streaming is to synchronize the playing thread and downloading thread to prevent

the player from starving for data. The problem we need to avoid is when the playing

thread plays much faster than the downloading thread downloads data. As an example,

suppose the player is playing 30 frames per second, while the downloading thread can

only download 10 frames per second. This will cause the player to starve for data and the

playing will get stuck for buffering, when insufficient data has been downloaded. To

solve this problem, we need to do enough buffering before playing starts. As a very
simple example, suppose the player frame rate rp is 30 fps, the downloading frame rate rd is 10

fps, and the video file size S is 900 frames. Downloading the file needs td = 900/10 = 90

seconds, while playing the video needs tp = 900/30 = 30 seconds. We need to find a

buffering time bf to satisfy

(bf + S/rp) x rd = S

In this case, bf should be 60 seconds.
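The relation can be rearranged to compute bf directly: playback lasts S/rp seconds, so the download must supply all S frames within bf + S/rp seconds, giving bf = S/rd - S/rp. A minimal sketch:

```python
def buffering_time(S, rd, rp):
    """Minimum buffering time bf in seconds. Playback lasts S/rp seconds,
    so the download must deliver all S frames within bf + S/rp seconds:
    (bf + S/rp) * rd = S, which gives bf = S/rd - S/rp."""
    return S / rd - S / rp

# The example from the text: S = 900 frames, rd = 10 fps, rp = 30 fps
bf = buffering_time(900, 10, 30)   # 60.0 seconds
```

Note that bf drops to zero once the download rate matches the playback rate, and grows without bound as rd shrinks, which is exactly the wireless-network problem discussed in Section 3.4.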

Buffering the file will also cause the problem of handling the decoder, a very

necessary but time-consuming task.

3.4. Streaming for Wireless Networks

When wireless networks come into the picture, additional issues emerge. The

formula mentioned in Section 3.3 must be followed. But since rd is going to be very

small, bf increases in size, resulting in a longer wait time for wireless device users. This

is not a satisfactory streaming result. Additional problems result from frequent

disconnection: what should be done with what has already been buffered? How should
the server keep track of where to restart? What happens in the case of lost packets, a

probable occurrence? The streaming application must address all these issues and,

accordingly, fix the buffering limits. The rate of download for a wireless network is not

constant. The network quality that the download started with may not be maintained

throughout. Midway, the application may be forced to pause for buffering. To avoid this,

the server must adapt by periodically measuring the network quality and readjusting the buffering limits accordingly.


Packet loss is known to be a big problem where wireless networks are involved,

so this must be taken care of. Standard error-checking procedures work very poorly in

this case, because it is difficult to keep track of the exact chunk in the source file that has

not been transmitted correctly to the client. The same is true for keeping track of how

much has been downloaded and where to restart downloading, in the case of a disconnection.


3.5. Limitations of Streaming

The previous section illustrates the various issues involved in streaming over a

wireless network. Although an application that addresses all of these issues and
performs well might be developed, its quality is difficult to ensure when

multimedia data is involved. Multimedia data is typically very bulky in nature, and the

buffer size becomes too large, thus making the wait time impractical. The user will not

have the patience to wait too long for the client to buffer sufficient data to start playing,

while downloading the rest in the background.

On the other hand, even RealPlayer shows problems while operating over a

wireless network. Thus, streaming multimedia over a wireless network, with all its

accompanying problems, is not really feasible in its raw state, because the buffering
equation of Section 3.3 must still be satisfied.

Of the different variable factors involved in the equation, bf, rd, and rp are the

only ones considered so far. One more factor not yet considered is S, the size of the video

file. But to make S variable, we would have to bring transcoding into the picture, and

reduce the quality of the video in the process. Whether or not the resultant video is

acceptable to the user is a subjective issue. The process and its effect on the video are

discussed in detail in later chapters.


In Section 2.4 we discussed the principle behind transcoding, a means to reduce

the strain on the network bandwidth by reducing the required bit-rate. One way of doing

that is reducing the speed of playing the video, so that rp is reduced, effectively reducing

bf. But this is not a very good solution practically, because the effect is very obvious to

the user, and may not be acceptable. An alternative solution is discussed in this chapter -

reducing the contents of the video so that the number of bytes delivered to the client is

reduced. This obviously reduces the quality of the video. However, because we are

concerned with wireless networks, we assume that the better quality is not absolutely

necessary. In fact, it may not even be perceptible.

4.1 The Need for Reduction

Chapter 2 showed the various problems associated with multimedia data over a

wireless network. Multimedia data is bulky, and the wireless network does not help at all.

When we discussed streaming in Chapter 3, we realized that streaming is an option to be

considered. However, streaming is not helpful in this particular situation, with limited

network capabilities and bulky data. Now we begin to consider the option of reducing

video quality to accommodate lower bit-rates.

The devices used with wireless communication generally have limited

capabilities, as seen in Subsection 2.3.3. We can turn this to our advantage, by not

transmitting the part of the data that is useless to that particular client. For example, we

can have a client device with a monochrome display. Then there is no need to transmit

color information over the network to the client, because that would take up more

bandwidth. An alternative is to remove color information from the video before

transmitting to the client. The user sees no difference in the quality of the video, because

the user's device is not able to display color. So the results are satisfactory.

Other types of reduction are also possible, depending on the different

classifications of client devices and their capabilities. If the client device display has a

small resolution, there is no point in sending the client a 1024 x 768-resolution image,

when the client device probably has only a 2" x 3" display, with a minimal resolution

designed for mostly text display. The solution is to reduce the resolution (or size) of the

image to a size suitable for the client device [12, 13]. Theoretically at least, this reduces
the number of bytes to be transmitted in the same proportion as the reduction in the size of the
video.

We can also make use of the fact that multimedia data, especially MPEG data, is

organized into frames, and each frame is stored in sequential order in the file, so we can

theoretically separate the transmission of each frame. The practical implications of such a

process are discussed in detail in later sections. Now if the video does not contain much

object movement, one frame may not be too different from the next, and we can afford to

skip the transmission of such frames in between. Such frame dropping will not depend

directly on the client device capabilities. We study the effect on perceived video quality

in Section 4.4.

4.2. Color Reduction

Most of the smaller wireless devices have limited display capabilities, with small

color ranges, for which a 256-color display is meaningless. Some devices even have

grayscale or monochrome display units. For such client devices, we can easily reduce the

required bandwidth, by removing the color information from the video. This does not

result in noticeable difference in the perceived video quality, since the client device

capability itself is less.

Of course, such a reduction involves several issues to be considered on the

implementation side. The question of how to achieve this color reduction comes up

immediately. The encoding process for an MPEG video is a complicated process and

does not store the color information in a bitmap format. Instead, to save on storage space,

the RGB information is converted to YUV and stored not directly but as
quantized, motion-compensated differences between frames. We must be able to extract the color

information from the frames, reduce it to whatever level of color is required for the client

device, and transmit it to the client. For this we need to decode the video at the server,

take off the color information, reencode it and finally transmit it to the client.
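Assuming a frame has already been decoded to RGB, the color-dropping step itself is simple. The sketch below converts each pixel to a luminance value using the BT.601 weights that underlie MPEG's RGB-to-YUV conversion; the frame representation (a list of rows of tuples) is a simplification of ours, not an actual MPEG structure.

```python
def to_grayscale(frame):
    """Reduce a decoded RGB frame (a list of rows of (r, g, b) tuples)
    to one luminance value per pixel, using the BT.601 weights that
    underlie MPEG's RGB-to-YUV conversion."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in frame
    ]

frame = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]
gray = to_grayscale(frame)   # [[76, 150], [29, 255]]
```

Since MPEG already stores luminance (Y) separately from chrominance (U, V), a smarter server-side implementation could simply omit the chrominance data rather than recompute luminance from RGB; the point here is only what information survives the reduction.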

The issue then arises of how much time this process would involve. For a long

video, the user can probably go have lunch and come back, and the video still would not

have been downloaded. That is definitely not a good algorithm from the user's point of

view. To enable this option of reducing color, we would hence need to introduce some

simplification, in the form of offline reduction. If the color reduction were done offline

and stored, when the user's request for the video comes in, the reduced version is already

available and ready to be streamed.

However, with a large video database, we must next consider the space involved

in this. The server would need to store different versions of each video, involving a space

overhead of at least 200%. Hence, we need to think of some cheaper method to reduce

color. An easy solution is a compromise between the two extremes of doing everything

offline, and doing everything online. Why not do a part of the reduction offline, the part

that does not require too much storage, and then do the rest on the fly? Thinking of such

alternatives involves a careful study of the actual MPEG encoding algorithms. We might

be able to reduce a part of the work by storing the references to each frame at the server,

and on the fly, picking up each frame, applying the conversion algorithm to it and

sending it to the client. Such methods are discussed in detail in Chapter 7.

4.3. Size Reduction

When smaller display devices enter the picture, the next obvious step to reduction

is decreasing the display size of the video. We can either reduce the resolution of the

image, or reduce the size of the effective display and leave the resolution (pixels/inch) the


Consider a reduction of the video by a factor of n. To do this, the algorithm would

have to pick every nth pixel from the image and send it as the new image. We need to be

able to get at the bitmap version of each frame, pick every nth pixel and reencode the

frame into the video. This will theoretically reduce the number of bits to be transmitted
by a factor of n, since the number of pixels retained, and hence the storage required,
shrinks proportionally. Because of the complexity of the encoding algorithm, however,
the process is a long one and computationally expensive.
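On a decoded frame, the subsampling step itself can be sketched as follows. The frame representation is again our simplified list of rows; note that subsampling both dimensions by n cuts the pixel count by roughly n squared, while subsampling a single dimension gives the factor-of-n reduction discussed above.

```python
def downsample(frame, n):
    """Keep every nth pixel in each dimension of a decoded frame
    (a list of rows of pixels). Subsampling both dimensions cuts
    the pixel count by roughly n * n."""
    return [row[::n] for row in frame[::n]]

# A 6x8 frame of (row, col) markers, reduced by a factor of 2
frame = [[(r, c) for c in range(8)] for r in range(6)]
small = downsample(frame, 2)   # 3 rows x 4 columns remain
```

The real cost, as the text notes, lies not in this selection step but in decoding the MPEG stream to pixels and re-encoding the smaller frames afterwards.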

Again, the issue of online versus offline reduction comes up, with the same

advantages and disadvantages. Online reduction takes a lot of time, and streaming will

not be feasible. Offline reduction takes up a lot of space on the server. The advantage of

offline reduction is that the overhead shifts to the server side, and the client is not aware

of the overhead. Partial reduction can be an option. The advantages and disadvantages of

this will be seen in later chapters, where we shall also discuss whether the tradeoff is

sensible, and whether it is worth the trouble.

4.4. Frame Dropping

Frame dropping is a pretty good and easy way to reduce the bandwidth

requirements of multimedia data, especially those with frame-based formats. Some

multimedia data lends itself easily to frame dropping, because there may not be too much

difference between one frame and the next. For example, with a news report, the image

may mainly show a correspondent reviewing a situation. A generic example is any
video that focuses on a single person or object and does not involve much motion.

Apart from the video type, the client device type would also determine if frame

dropping were suitable for the situation. If the client device does not have a fast processor

that can process and display the frames as fast as they are being sent, there is really no

point in sending all the frames at the same rate as in the original video. It is better to skip

some frames in between, allowing the client device to catch up. Though we are speaking

of the situation as if the main problem is on the client side and not the network, we can

use the situation to our advantage when we have network problems. We are able to make

use of the client device's inferior capabilities by reducing the quality of the video in such

a way that the user is not made aware of it.

It is not difficult to achieve frame dropping, as long as the storage format is a

frame-based format like MPEG. The format would have a code to mark the start and end

of each frame, and all the algorithm has to do is parse the bitstream being sent to the
client, looking for these codes, and appropriately suppressing

the transmission of the frame when necessary. This process can also be simplified using

some data structures, as shown in later chapters.
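In MPEG-1/2 video, each picture begins with the four-byte start code 0x00000100, so a dropper can scan for that code and forward only every kth picture. The sketch below is deliberately naive: real MPEG streams contain sequence and GOP headers, P/B pictures depend on neighboring frames (dropping a reference frame corrupts the pictures predicted from it), and start-code emulation in the payload must be handled.

```python
PICTURE_START = b"\x00\x00\x01\x00"   # MPEG-1/2 picture start code

def drop_pictures(bitstream: bytes, keep_every: int) -> bytes:
    """Naive frame dropper: split the stream at picture start codes and
    keep every keep_every-th picture. Real streams also need headers,
    inter-frame dependencies and start-code emulation handled."""
    parts = bitstream.split(PICTURE_START)
    header, pictures = parts[0], parts[1:]
    kept = [p for i, p in enumerate(pictures) if i % keep_every == 0]
    return header + b"".join(PICTURE_START + p for p in kept)

# Toy stream: a header followed by six 3-byte "pictures" A..F
stream = b"HDR" + b"".join(PICTURE_START + bytes([0x41 + i]) * 3
                           for i in range(6))
thinned = drop_pictures(stream, 2)    # keeps pictures A, C, E
```

The data structures mentioned above (discussed in later chapters) would replace the linear scan here with a precomputed index of frame offsets, so the server never has to search the bitstream at transmission time.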

4.5. Hybrid Reductions

Now that we have studied the reduction of size, color information or the number

of frames of the multimedia data, let us consider what we would do if the client device

were really bad in terms of capabilities, and the network conditions were also bad. The

natural alternative would be to apply more than one reduction algorithm to the

multimedia data. Of course, this would mean further degradation in resulting quality, but

we do not have a choice. Additionally, the client device, rather conveniently, is less

capable of exhibiting the full quality of the delivered video.

Let us consider the various possibilities we have, given the three reduction

strategies we have studied so far. Whether these reductions are done offline or on the fly

plays an important role in deciding which permutations are feasible and which are not

worth consideration [14]. If size reduction were done offline, by storing the
different possible versions of the video, we could not achieve frame dropping before size
reduction; it would have to be done after. This order has to be maintained, though it

would make more sense to get rid of the useless frames before processing each frame for

the size reduction. It would also make sense to drop frames before performing color

reduction, following the same logic. Since frame dropping is the only totally online
(on-the-fly) reduction we have devised, this seems to be a tough thing to achieve.

Between size reduction and color reduction, it would again make more sense to

reduce the size before dealing with color, since then we would have fewer pixels to

process. This works well for us if size reduction is done offline.
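The ordering argument can be sketched as a pipeline over decoded frames. This is a toy model with invented function names; each stage mirrors one of the reductions above, and the composition order reflects the reasoning about cost.

```python
def stage_drop(frames, keep_every):
    """Cheapest stage first: discard whole frames before any pixel work."""
    return frames[::keep_every]

def stage_resize(frames, n):
    """Subsample every nth pixel in each dimension of surviving frames."""
    return [[row[::n] for row in frame[::n]] for frame in frames]

def stage_gray(frames):
    """Per-pixel luminance conversion runs last, on the fewest pixels."""
    return [
        [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
         for row in frame]
        for frame in frames
    ]

def hybrid_reduce(frames, keep_every, n):
    # Frame dropping -> size reduction -> color reduction, so that each
    # successive (more expensive) stage touches as little data as possible.
    return stage_gray(stage_resize(stage_drop(frames, keep_every), n))

# Four 4x4 all-white RGB frames; keep every 2nd frame, every 2nd pixel
frames = [[[(255, 255, 255)] * 4 for _ in range(4)] for _ in range(4)]
out = hybrid_reduce(frames, 2, 2)
```

When size reduction is performed offline, the stage_resize step is replaced by selecting a pre-stored version, which is why frame dropping then has to run after it rather than before.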

4.6. Is Online Video Reduction Feasible?

This issue actually needs a careful study of the multimedia data format. If the

format involved a complicated decoding and encoding algorithm, and the reduction

algorithm involved a lot of processing at the pixel level instead of at the frame level, it

becomes difficult, maybe impossible, to achieve. If the reduction algorithm were to be

performed at the frame level, decoding and reencoding would not be needed at all,

making our task simpler. However, where the reduction can be done largely depends on the
encoding algorithm. With MPEG files, it is difficult to access pixel information at

the frame level, unless the decoding process is carried out.

Thus we have to look at the alternative to online reduction: performing the

reductions offline, and storing the versions of the video at the server, ready to be

downloaded to the client on demand. The downside of this is the space overhead on the

server. It is not a simple double space overhead, since all possible versions of the video

will have to be stored on the server; the ratio of reduction depends on the client device,

the user's specifications and the network condition at the time of download.

It helps to remember that since the complexity is effectively shifted to the server

side of the transfer, the user can be kept oblivious to the overhead. The complexity would

be of the order of O(n) still, which can be borne by the server.


From this chapter onwards, we shall shift our focus to the MPEG compression

standard. The reason why we choose MPEG as the standard to work with will be more

obvious in Chapter 6, where we discuss recent developments that help our work in

wireless video delivery. MPEG is one of the most efficient compression standards in use

today, and suits our purpose admirably.

5.1. Introduction

The Moving Pictures Expert Group (MPEG) is a family of video compression

standards [6]. The existing MPEG standards are MPEG-1, 2 and 4. The MPEG-1 format

provides a mode of storage for video data. The MPEG-2 format provides better user

interaction in deciding the quality of encoding and uses a two-layer approach for

effective encoding. The basic difference between MPEG-4 and previously existing

standards is the object-oriented approach. The MPEG-4 format defines standard ways of

representation of objects in a picture, called audio-visual objects (AVOs). The MPEG-4

format hence provides a standardized way to describe a scene, allowing one, for example,
to place media objects anywhere within a coordinate system, apply transforms to change the

physical properties of an object or change the user's viewpoint of a scene.

Recently MPEG announced a call for proposals for MPEG-7, its newest standard

of video compression. The MPEG committee is now accepting proposals for MPEG-7

and expects the standard to be finalized by the year 2000. The MPEG-7 format does not

aim to be a compression standard on its own, but is designed to work with other

compression standards to provide a way of accessing content of video data.

5.2 What Is MPEG-7?

The MPEG-7 format, formally called "Multimedia Content Description

Interface," standardizes the following:

* A set of description schemes and descriptors,

* A description definition language (DDL) to specify description schemes, and

* A scheme for coding the description.

The MPEG-7 format aims to standardize content descriptions of multimedia data.

The MPEG-7 format extends the limited capabilities of currently existing codecs in

identifying content of video streams, mainly by including more data types. In doing so,

MPEG-7 specifies a standard set of descriptors that can be used to describe various types

of multimedia information. This description allows fast search of audio-visual content for

material of a user's interest. The MPEG-7 format also standardizes a language to specify

structures for the descriptors and their relationships.

The advantage of using MPEG-7 descriptors is that they do not depend on the

format or encoding scheme of the material being described. This makes MPEG-7 a very

flexible standard. MPEG-7 builds on other existing standards to provide references to

suitable portions of them. MPEG-7 will also allow different granularity in its

descriptions, offering the possibility of different levels of discrimination. The content

descriptions can be placed in a separate stream from the actual multimedia stream. They

can be associated with each other through bi-directional links. Hence, data may exist

anywhere on the network.

5.3 Descriptors and Description Schemes

A descriptor is a representation of some definite characteristic of the data, which

signifies something to somebody. A descriptor allows an evaluation of the feature via the

descriptor value. It is possible to have several descriptors describing the same feature to

address different relevant requirements. For example, the color of an image may be
described by a color histogram or by the average of its frequency components, while
motion may be described by a motion field, and the title by its text.

A descriptor value is an instantiation of a descriptor for a given data set.

Descriptor values are combined using a description scheme to form a description.

A description scheme specifies the structure and semantics of the relationships

between its components, which may be both descriptors and description schemes. The

distinction between a descriptor and a description scheme is that a descriptor contains

only basic data types and does not refer to other descriptors or description schemes.

For example, a movie may consist of several scenes and shots, if partitioned

temporally. Such a movie may contain descriptions of the relations between these

component shots and scenes, and may include some textual descriptors at the scene level,

and color, motion and audio descriptors at the shot level.
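The movie example can be sketched as a nested description structure. This is purely illustrative: the field and scheme names below are invented, and real MPEG-7 descriptions are expressed via the DDL, not Python.

```python
# Illustrative only: these field names are invented, not MPEG-7 DDL.
movie_description = {
    "scheme": "MovieDS",                     # hypothetical description scheme
    "scenes": [
        {
            "annotation": "Opening chase",   # textual descriptor (scene level)
            "shots": [
                {                            # descriptor values (shot level)
                    "color_histogram": [0.4, 0.3, 0.2, 0.1],
                    "motion": "fast-pan",
                    "audio": "engine-noise",
                },
            ],
        },
    ],
}

def shots_with_motion(description, motion_type):
    """Search the description, not the video itself, for matching shots."""
    return [
        shot
        for scene in description["scenes"]
        for shot in scene["shots"]
        if shot["motion"] == motion_type
    ]

matches = shots_with_motion(movie_description, "fast-pan")
```

The point of the sketch is that queries like this operate entirely on descriptor values, which is what allows searching audio-visual content without decoding the underlying video.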

A description consists of a description scheme and the set of descriptor values that

describe the data. Depending on the completeness of the set of descriptor values, the

description scheme may be fully or partially instantiated. Whether or not the description

scheme is actually present in the description depends on technical solutions still to be determined.


The Description Definition Language is a language that allows the creation of

new description schemes and possibly descriptors. It also allows the extension and

modification of existing description schemes.

The advantage of MPEG-7 descriptors is their flexibility: MPEG-7 descriptors

can be used to describe data in different formats, such as an MPEG-4 stream, a video

tape, a CD containing music, sound or speech, a picture printed on paper and an

interactive multimedia installation on the web. This gives a great deal of freedom in the

use of particular data formats for the actual video content, since they can all be described

using one common format: MPEG-7. This leads to different applications in which

MPEG-7 can be used.



Figure 5-1: A UML-based representation of possible relations between D and DS

Figure 5-1 describes the relationship between descriptors, description schemes

and the audio-visual data they describe.

5.4. Scenarios with MPEG-7

The MPEG-7 format aims at the improvement of existing applications and the

introduction of completely new ones. Three of the most relevantly impacted scenarios are:


* Distributed processing,

* Content exchange, and

* Customized views.

5.4.1. Distributed Processing

MPEG-7 will permit interchange of descriptions of multimedia material

independent of platform or application. This enables distributed processing of multimedia

content. Data from a variety of sources can be plugged into different distributed

applications such as multimedia processors, editors, retrieval systems, filtering agents etc.

In the future, one may be able to access various content providers' web sites to

download content and associated indexing data, obtained by some high-level or low-level

processing. The user can then proceed to access several tool providers' web sites to

download tools (e.g., Java applets) to manipulate the heterogeneous data descriptions in

particular ways, according to the user's personal interests. An example of such a

multimedia tool will be a video editor. An MPEG-7 compliant video editor will be able to

manipulate and process video content from a variety of sources if the description

associated with each video is MPEG-7 compliant. Each video may come with varying

degrees of description detail such as camera motion, scene cuts, annotations and object segmentation.


5.4.2. Content Exchange

Another scenario that will benefit from a common description standard is the

exchange of multimedia content among heterogeneous audio-visual databases. MPEG-7

will provide the means to express, exchange, translate and reuse existing descriptions of

audio-visual material.

Currently, material used by databases is described manually, using text held in proprietary databases whose whole purpose is such description. Describing audio-visual material is a time-consuming and expensive task, so it is desirable to minimize the re-indexing of data that has already been processed.

Interchange of multimedia content descriptions would be possible if all content providers used the same scheme and system. Since this is impractical, MPEG-7 proposes to adopt a single industry-wide interoperable interchange format that is system- and vendor-independent.

5.4.3. Customized Views

Multimedia players and viewers compliant with the multimedia description

standard will provide users with innovative capabilities such as multiple views of the data

configured by the user. The user could change the display's configuration without

requiring the data to be downloaded again in a different format from the content provider.


The ability to capture and transmit semantic and structural annotations of the

audio-visual data, made possible by MPEG-7, greatly expands the possibilities for client-

side manipulation of the data for display purposes. For example, a browsing system can let users skim quickly through videos if it receives information about their semantic structure: when modeling a tennis match video, the viewer can choose to view only the third game of the second set, all overhead smashes of one player, and so on.

5.5. Wireless Video using MPEG-7

The content-based nature of MPEG-7 makes it highly appealing to designers of

databases. It is also interesting to note the advantages it could have in a wireless network.

When we consider the customization of views, the user could get a view of a video tailored to the capabilities of his wireless device. This could let the user retrieve just that portion of the video in which he is interested, reducing the amount that needs to be downloaded and thus the bandwidth required.

The display can also be adjusted dynamically according to the device constraints. Handheld devices that cannot support, or simply do not need, detailed images can download just enough for the user to get the general picture. In a weather news video, for instance, the user may only be interested in the next day's rain forecast for Florida. This can be arranged easily using content-based indexing.

Storage requirements are also reduced, since the entire video need not be downloaded. These and other issues are discussed in detail in the following chapters.


The Moving Picture Experts Group (MPEG) has started work on a new

standardization effort called the "Multimedia Content Description Interface," also known

as MPEG-7. The goal of MPEG-7 is to enable fast and efficient searching and filtering of

audio-visual material. The effort is being driven by specific requirements gleaned from a

large number of applications related to image, video and audio databases, media filtering

and interactive media services (radio, TV programs), scientific image libraries etc.

Recently, multimedia content description information for enabling Universal

Multimedia Access (UMA) has been proposed [15] as part of the MPEG-7 specifications.

The basic idea of universal multimedia access (UMA) is to enable client devices with

limited communication, processing, storage and display capabilities to access rich

multimedia content [5]. The use of MPEG-7 with UMA descriptors would be ideal for

our efforts towards fast and efficient multimedia data transmission.

6.1. Overview of UMA

The main idea of UMA (Universal Multimedia Access) is that any kind of device

with even minimal capabilities, should be able to access any multimedia data over any

network. To be able to do this, the multimedia content must be modified to suit the

conditions of transfer. UMA describes broad categories into which these reduction

schemes could be classified. These, called variations on the data, are described in

Subsection 6.1.1.

MPEG-7 suggests a specific set of attributes to be used to describe multimedia

content, to ease the task of downloading the video. This has been extended to UMA in

[15] with UMA-specific attributes. These attributes are discussed in Subsection 6.1.2.

6.1.1. UMA Variations

Several entities are defined by [15] for describing variations of multimedia data:

substitution, translation, summarization and extraction. The variation entity is defined as

an abstract entity from which the other different types of entities (substitution, translation,

summarization, extraction and visualization) are derived. The variation entity also

contains attributes that give the selection conditions for which a variation should be

selected as a replacement for a multimedia item [15]. The derived entities are listed in Table 6-1.


In UMA applications, the variations of multimedia material can be selected as

replacement, if necessary, to adapt to client terminal capabilities or network conditions. If

the network conditions are bad, a variation with the least possible effort needed for

transfer may be selected, so that the work is reduced. At the same time, if the device used

by the client has excellent display characteristics and other good resources needed for

display of multimedia, the variation selected must as far as possible have good visible

characteristics (for example, key frame extraction can be selected instead of color

reduction). On the other hand, if the device does not have some necessary quality, such as

a color display, then that quality can be dropped from the multimedia that needs to be

streamed to that client.

In general, the Variation-DS provides important information not only for UMA

applications but also for managing multimedia data archives since in many cases the

multimedia items are variations of others.

Table 6-1: Classification of reductions performed on multimedia data.

Reduction       Description                                    Examples
Substitution    One program substitutes for another when       Text passages used to substitute for
                there need not be any actual derivation        images that the browser is not
                relationship between the two                   capable of handling
Translation     Conversion from one modality to another;       Text-to-speech (TTS), speech-to-text
                the input program generates the output         (speech recognition), and
                program by means of translation                video-to-image (video mosaicing)
Summarization   Input program is summarized to generate
                the output program; may involve compaction
                and possible loss of data
Extraction      Information is extracted from the input        Key-frame extraction from video;
                program to generate the output program,        embedded-text and caption extraction
                involving analysis of the input program        from images and video
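The selection logic described above — choose a variation that fits the network conditions, and rule out variations that require a capability the device lacks — can be sketched as follows. All names, costs, and capability flags here are illustrative assumptions of this sketch, not part of the MPEG-7/UMA specification:

```python
# Hypothetical UMA variation selection. Costs and capability flags
# are illustrative assumptions, not defined by MPEG-7.
VARIATIONS = {
    "substitution":  {"cost": 1, "needs_color": False, "needs_audio": False},
    "translation":   {"cost": 2, "needs_color": False, "needs_audio": False},
    "extraction":    {"cost": 2, "needs_color": True,  "needs_audio": False},
    "summarization": {"cost": 3, "needs_color": True,  "needs_audio": True},
}

def select_variation(bandwidth_kbps, has_color, has_audio):
    """Pick the richest variation the client can still display.

    A missing device capability rules a variation out entirely;
    under a weak link we prefer the cheapest remaining variation.
    """
    candidates = [
        (name, props) for name, props in VARIATIONS.items()
        if (has_color or not props["needs_color"])
        and (has_audio or not props["needs_audio"])
    ]
    if bandwidth_kbps < 20:  # weak wireless link
        return min(candidates, key=lambda nv: nv[1]["cost"])[0]
    return max(candidates, key=lambda nv: nv[1]["cost"])[0]
```

For a monochrome, audio-less device on a slow link this sketch would fall back to substitution, while a capable device on a fast link would receive the richest variation available.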

6.1.2. UMA Attributes

The MPEG-7 specifications document proposed several descriptive attributes for

multimedia data in general, which were used to reduce the capacity requirements of large

MPEG multimedia videos. The MPEG-7 format is targeted at the problem of browsing a

video file to detect frames of interest to the user. These attributes, called descriptors, were

used to describe various types of multimedia information, in the form of video, audio,

graphics, images, 3D models etc.

When UMA was proposed as a new part of the MPEG-7 requirements, several

new attributes were proposed as additions to enable UMA [15]. With these additions, it

becomes simpler to select any of the different possible kinds of variations, suitable for the

nature of multimedia content in any particular video. For example, we may select speech

to text translation for a foreign language film. Thus we are able to take into consideration

the lendability of the multimedia data to particular reductions.

These descriptive attributes, or descriptors, are summarized in Table 6-2. Using

these descriptors, we shall see, in later sections, how multimedia content descriptions are

obtained and used to select the ideal variation for the situation for the multimedia data, to

be sent to the client.

Table 6-2: UMA attributes.

Descriptor Scheme            Purpose
Published Multimedia Data    Describes the need for and the relative importance of a
Hint-DS                      multimedia data item in a presentation, and may contain
                             hints as to the type of reduction the presentation lends
                             itself to most easily
Media-D                      Standardized descriptive information about the image,
                             video and audio material that helps in UMA, such as
                             resource requirements, functions from network conditions
                             to preferred scaling operations, etc.
Meta-D                       Data about the data: rights, ownership, etc.
Spatio-temporal Domain-D     Information about the source, acquisition and use of
                             visual data
Image Type-D                 Describes display characteristics of images
Region Hint-D                Information about the importance of particular regions
                             within an image, relative to each other
Audio Domain-D               Information about the source and usage of audio data
Segment Hint-D               Analogous to the region hint, but applicable to the
                             temporal dimension

6.1.3. The Importance of UMA

UMA attacks the problem of wireless video delivery at the root, by actually

prescribing attributes for the ideal reductions suited for a video. Thus we get a way of

combining our methodology with the MPEG-7 standard, and we can deliver video over

the wireless network to any type of device. With UMA attributes we get a method to

describe the content of any video, to be able to decide the ideal set of reductions for that

video over any network conditions, to any type of device.

Now the application has to be designed in such a way as to provide content

description for a video at the time of its insertion into the database, and this content

description could be used at the time of streaming to select the ideal reduction for the

video. This reduction algorithm could then be applied before or during streaming, as the

case may be.

We need to think of a suitable data structure to store all the information that we

have about the video in the database. This information will include content description in

the form of an MPEG-7 packaging for the video and ideal reduction set for different

conditions of device modality and network conditions. We also need to consider the

actual process that takes place when a client requests a particular video from the server.

This process will need to handle methods for reduction access.
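The data structure called for above — one record per video, holding its MPEG-7 packaging and the ideal reduction sets for different conditions — might be sketched as follows. The field names and container shapes are assumptions of this sketch; the thesis only requires that content description and per-condition reduction sets be stored alongside the bitstream:

```python
from dataclasses import dataclass, field

@dataclass
class VideoRecord:
    """Illustrative server-side record for one registered video.

    Field names are hypothetical; the requirement is only that the
    MPEG-7/UMA descriptors and the reduction sets for each
    (network class, device class) pair live next to the bitstream.
    """
    video_id: str
    bitstream_path: str
    descriptors: dict = field(default_factory=dict)     # MPEG-7/UMA D and DS
    # (network class, device class) -> ordered list of reduction names
    reduction_sets: dict = field(default_factory=dict)

# Example registration of a video with one descriptor and one
# pre-computed reduction set (all values illustrative).
rec = VideoRecord("news-042", "/videos/news-042.mpg")
rec.descriptors["MediaD"] = {"resource": "video", "bitrate_kbps": 1500}
rec.reduction_sets[("A1", "I A")] = ["substitution", "translation"]
```

At streaming time the server would look up `rec.reduction_sets` with the measured network class and reported device class, avoiding any per-request analysis of the video itself.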

6.2. Conceptual Model

In Chapter 2 we saw that the main factors affecting the quality of video delivery

are the network conditions and the client device modality. We also have the content of

the multimedia data and user preferences as additional factors, and we shall take these

into consideration while designing the architecture of the server. Network conditions and

device modality are the main factors that effectively decide the reduction set to be applied

on the video before transmitting it to the client. We shall consider the inter-relationship

between these factors and the reduction set chosen for the video delivery.

6.2.1. Factors Affecting Delivery

To match the terminal device capabilities and connection quality to the different

variations, we consider a three dimensional structure (the three dimensions being device

modality, connection quality and variation type).

Depending on the quality of the specific connection, and the capabilities of the

specific device, the mapping into the three-dimensional space is decided. A number of

specific descriptors describe the possible combinations of media resources available for

different devices and at different connection qualities. The different variations to the

multimedia data can be specified in the form of indices into the multimedia material. We

can also have meta-data that influences the content adaptation process, by describing the

mapping from connection quality and device modality into required variations.

[Figure: the three axes are device modality (display, memory, MFR, CPU), connection quality, and variation type {substitution, translation, summarization, extraction}; a given modality range and connection-quality range map to a set of possible variations.]

Figure 6-1: Three-dimensional view of MPEG-7 and description schema for UMA

The relationship between the different entities guiding the three-dimensional

mapping is illustrated in Figure 6-1. The following subsections examine these factors in detail.


Additionally, we might have a number of descriptors that can be considered to be

transcoding hints; that is, they can be used to guide the adaptation process. We discuss

some issues that arise in the implementation of such a scheme.

6.2.2 Network Conditions

Different variations in connection quality that may be possible in the network can

be described in terms of connection attributes. The quality of connection ranges from the

strong connection quality of a fixed network, to the weak connection quality of a wireless

network, to loss of connection, which is quite possible in mobile networks. There are three connection attributes:

* Bandwidth: the existing bandwidth of the network, measured in bps (bits per second). This can range from 0.0 bps to 50 Mbps.

* Latency: the connectivity of the network in terms of initial connection time and connection-maintenance time, recorded simply as with or without latency. The importance of latency shows when a large amount of video data needs to be streamed over a network and a latency problem that cannot be ignored makes the quality of the streaming poor. Note that when chunking is allowed in a packet-switched network (as in iDEN), the problem can be ignored.

* BER (Bit Error Rate): the accuracy of the network. Accuracy is an important factor in wireless networks, especially in multimedia applications, since errors in the data are highly visible to the user.

Depending on the values of these factors, the various network connections can be

classified into certain classes of connections, as illustrated in Table 6-3. The class of network connection is one of the major inputs to the function that maps the existing conditions to the appropriate variation for the multimedia data that needs to be transferred over that network.

Table 6-3: Classification of networks based on connection quality.

Network class   Bandwidth and latency specs.              Example network
Class A1        2.0 Kbps - 9.6 Kbps, with latency         Mobitex (RAM), early CDMA PCS systems
Class A2        2.0 Kbps - 9.6 Kbps, without latency      GSM phase 1, CDPD
Class B1        9.6 Kbps - 20 Kbps, with latency
...
Class F2        > 10 Mbps, <= 50 Mbps, without latency    W-LAN, HyperLAN
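The classification in Table 6-3 might be implemented as a simple threshold function. Only the boundary classes shown in the table are reproduced here; the intermediate classes (including the "B2" branch) and the generic "mid" bucket are assumptions of this sketch:

```python
def classify_network(bandwidth_kbps, has_latency):
    """Map measured connection quality to a network class.

    Thresholds follow the rows of Table 6-3 that survive in the
    text; intermediate classes are collapsed into 'mid', which is
    an assumption of this sketch.
    """
    if bandwidth_kbps <= 9.6:
        return "A1" if has_latency else "A2"
    if bandwidth_kbps <= 20:
        return "B1" if has_latency else "B2"
    if bandwidth_kbps > 10_000:          # > 10 Mbps, e.g. W-LAN
        return "F1" if has_latency else "F2"
    return "mid"
```

The BER attribute could be folded in the same way, e.g. by demoting a class when the measured error rate exceeds some threshold.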

6.2.3. Client Device Modality

The capabilities of the device that is currently attempting to view the multimedia

data can be described in terms of the variants Display, Audio, Memory, Maximum Frame

Rate (MFR), CPU and Color.

The capabilities of a device range over all possible combinations of all possible

values of these attributes:

* Display: the display type (resolution, size, refresh rate and color capabilities) of the device. This can range from a handheld device's display to that of a desktop, with a domain ranging from B/W to 64 K color or more.

* Audio: the audio capabilities of the device. The benchmark could be a sound present-or-absent flag, but we could also distinguish different audio qualities, such as CD quality, stereo quality, etc. This is useful in determining whether audio information needs to be downloaded to the device at all. The domain of this descriptor would be {absent, CD quality, stereo quality}.

Table 6-4: Classification of devices based on device modality.

Class of     Display             Audio        Memory         CPU           Color       Eg. of device
device
Class I A    Low resolution      Not present  Low (up to     Low           B/W,        Web phones
             (64x64 - 256x128)                32 MB)                       grayscale   (e.g. Nokia)
Class I b    Low resolution      CD           Low (up to     Low           B/W,
             (64x64 - ...)                    32 MB)                       grayscale
...
Class IV B   High (1024x640      Stereo       High (1 GB     Good (above   64 K
             and above)                       and above)     500 MHz)

* Memory: the amount of storage space available to store buffered multimedia data and run the application. The memory used by a mobile device can generally be divided into fast-access memory, such as RAM and flash memory, and slow-access memory, such as microdrives and hard drives. With more flash memory, we can buffer more information for streaming, and with the presence of slow-access memory (which not all mobile devices have), we can store even more. This descriptor has a domain of less than 32 MB to more than 1 GB.

* MFR: the maximum permissible rate of display of frames on the device. The device's capabilities can limit the rate at which the screen can be regenerated each time a new frame needs to be displayed. This may be related to the refresh rate of the device screen, which depends on the type of screen material.

* CPU: the CPU capabilities of the device. This can be measured as the speed of the CPU or, for computational capability, in a fixed unit such as FLOPS. It mainly affects the complexity of computation that is permitted on the device, and hence the extent of decoding possible on the device.

Depending on the values of these attributes, which describe the capabilities of

wireless devices, the various devices can be classified into certain classes, as described in

Table 6-4.
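The device classification of Table 6-4 might be sketched as a function of the attributes above. Only the extreme classes that survive in the table are reproduced; the thresholds and the "intermediate" bucket are assumptions of this sketch:

```python
def classify_device(resolution, audio, memory_mb, color):
    """Coarse device classification after Table 6-4 (sketch).

    Only the two extreme rows of the table are reproduced; the
    thresholds for intermediate classes are assumptions.
    """
    width, height = resolution
    if (width >= 1024 and audio == "stereo"
            and memory_mb >= 1024 and color == "64K"):
        return "IV B"                     # high-end desktop-class device
    if width <= 256 and memory_mb <= 32:
        # Class I A has no sound; Class I b adds CD-quality audio.
        return "I A" if audio == "absent" else "I b"
    return "intermediate"
```

In practice the client would report these attributes (or pick a class from a menu, as discussed in Chapter 7), and the server would use the resulting class as the device-modality key into the matrix.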

6.2.4. Multimedia Content

The next matter to be considered is the importance that has to be given to the

content of multimedia. This is what can be called the "lendability" of the multimedia data

to certain types of reductions. Multimedia data is rich in content, and different types of

data have parts with different importance, depending on the context in which it is used.

For example, the weather channel may give prime importance to the audio, with the video being less important, while the sports channel may prefer video. A foreign-language film may drop the audio and keep the captions. So all that seems to be

required is some means of describing the content of the multimedia and associating this

description with the video at the server.

This is where MPEG-7 and the additional descriptive data detailed in Subsection 6.1.2 come in. Originally MPEG-7 was designed to be a content-descriptive language. This

concept of content description is extended to include content description specifically

useful for UMA [14], so that it gives us an idea about the variations suggested for any

particular multimedia data item, the relative importance of different multimedia

components within the data, its resource requirements and any other information needed

to describe the content absolutely to the server.

When we have this content description with us, it becomes easier to use this to

select the set of variations (reductions) ideal for that particular content. These suggestions

are also included as part of the content description as transcoding hints [14]. These

descriptors can be used to affect the nature of the variations suggested for the multimedia data.


We can express this as a restriction on the usage of reductions by the multimedia

server. So on the implementation side, when the server gets a request for a particular

video from a client, the server uses the network conditions and device modality of the

current situation to decide the full set of variations suitable for that situation. Then the

Multimedia Content (MC) comes into the picture to restrict the full set of available

variations to a smaller set of reductions that are permitted on that content.

6.2.5. User Preferences

The final issue that has to be considered is the user's opinion of all these changes

being made to what had originally been requested. The "ideal" variation may be totally in

contrast with what the user considers essential to make sense out of the data. The user

may, for example, find audio data more important as compared to video data and might

not like the idea of sound being cut out, to fit the result within the network conditions. If

the ideal variation derived from the existing network conditions is speech-to-text

conversion, the user will find the result dissatisfactory.

To take the preferences of the user into consideration, we introduce the concept of

loss profiles. The user gives a set of limits within which the variation of the stream from

the desired quality should stay as much as possible. This would help in making a choice

between different types of reduction possibilities available.

We can extend this concept a little more to include the data relevant to that user,

in the form of a Loss Profile (LP). This would include data in the form of client

preferences, the network conditions that the client normally operates within, and the

modality of the device that the client normally works with.

We can move further on the lines we have followed above for multimedia content,

and impose another restriction on the final result obtained, depending on the user Loss

Profile, to get the final set of restrictions to apply on the multimedia data.

6.3. The Reduction Set and the Restrictions on It

The Transcoding Function describes the mapping between the different static and

dynamic variations available with MPEG-7, and their applicability, i.e., the (Connection,

Modality) combination situation in which each is relevant. The mapping may be

influenced by loss profiles that prescribe user preferences in variation options, and also

by the content of multimedia data. These options are discussed in later sections.

Basically, the mapping of (Connections, Modality) to Variations can be

summarized in Table 6-5. Now we add in the effect of the nature of multimedia content,

and the user loss profiles, in the form of the following equations.

Let C be the range of connection quality and D the range of device modality into which the current conditions fall. The resultant region of permitted variations, V, into which the system maps can be represented as a function of C and D:

    V = f(C, D)

Let MC be a Multimedia Content and VMC be the possible set of applicable variations for this particular multimedia data, given by

    VMC = ∩MC(V)

where V is the set of variations obtained so far and ∩MC is the variation restriction on V imposed by MC.

Table 6-5: Mapping of (connections, modality) to variations.

            Class I A                              ...    Class IV B
Class A1    {substitution} = {text for images},
            {translation} = {video-to-image}
Class A2    {substitution} = {text for images},
            {translation} = {video-to-image}
Class B1    {substitution, translation} = {text for
            images, video-to-image, voice-to-text,
            color thumbnail generation}
...
Class E2    {summarization} = {voice scaling}             No reduction

Let VF represent the region of finally permitted variations when the user loss profile, LP, is taken into consideration. Hence VF is a restricted version of VMC, as specified by LP. This can be expressed using the following equation, where ∩LP is the loss profile restriction on VMC:

    VF = ∩LP(VMC)

Effectively, when a client tries to get some multimedia data from the server, the

server will gauge the connection quality of the network. The server will also try to judge

the modality of the client device. This can be achieved by having the user select some

class of device from a menu.

Then the server will use this range of C and D to map into the C x D table and the

result will be V, a set of variations, which may be applied to the multimedia data to fit it

within the physical conditions.

Then the server will use MC to further restrict the set of possible variations suited

for that particular multimedia data. Finally the server will use LP to choose, from among

the suggested variations, those that are acceptable to the user. This will yield another set

of variations, which may be applied to the multimedia content.
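The three restriction steps just described — map (C, D) into V, restrict by the multimedia content MC, then restrict by the loss profile LP — can be sketched as plain set operations. The container shapes and argument names are assumptions of this sketch:

```python
def permitted_variations(cxd_table, network_class, device_class,
                         mc_allowed, lp_acceptable):
    """Compute VF = ∩LP(∩MC(f(C, D))) as set intersections.

    cxd_table maps (network class, device class) pairs to the set
    of variations V suitable for those physical conditions.
    """
    v = cxd_table[(network_class, device_class)]   # V = f(C, D)
    v_mc = v & mc_allowed                          # restriction by MC
    return v_mc & lp_acceptable                    # VF, restriction by LP

# Illustrative values: on a weak A1 link with a low-end device,
# only a translation survives all three restrictions.
table = {("A1", "I A"): {"substitution", "translation"}}
vf = permitted_variations(table, "A1", "I A",
                          mc_allowed={"translation", "summarization"},
                          lp_acceptable={"translation"})
```

Modelling both restrictions as intersections keeps the order of application irrelevant at this level; ordering only matters later, when the chosen reductions are actually performed on the bitstream.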

On the implementation side, the user LP can be obtained from the user, by having

the user move a set of sliders that indicate the acceptable quality of the multimedia

needed [16]. These sliders can represent the values for different relevant multimedia

information, which can be selected from the descriptors specified for MPEG-7. For

example, we can have sliders for color quality varying from B/W to 64 K color, audio

quality varying from text captions to stereo quality audio etc.

When the user selects the values of these attributes, they form an attribute for user

loss profile (LP). This is used to restrict VMC to VF. The user's loss profile can be

obtained dynamically at the time of connection, and also stored as part of the user's record at the server.


We can actually do most of this process statically at the time of registration of the

video into the database. As the video is being registered into the database, the set of

MPEG-7 descriptors could be created for that video and stored along with the video into

the database. It should also be possible to insert a video with pre-defined descriptors into

the database.

The various variations of the video are also statically created at the same time and

stored along with the video. This raises the issue of the amount of storage space required

for each video, along with its different variations. The solution to this would involve

some sort of complex indexing scheme that does not store the variations separately, but

instead stores indices into the existing video bitstream, to dynamically assemble the particular

bitstream required for any particular variation of the video. These indices can also be

created and stored with the video at the time of registration.

The mapping of all possible combinations of (C x D) to the respective set of V

can also be created and stored statically at registration. Finally, since the content of

multimedia does not change with the user, the restrictions imposed by MC on V can also

be stored statically at registration. So finally, after registration, what we have is a new

video with its accompanying descriptors (MC), table of C x D, and the indices for all

possible variations with that video.

Online, when the client establishes connection with the server, the client's Loss

Profile (LP) is downloaded, along with information about the network conditions (C) and

the device modality (D) of the client. When the client asks for a particular video, the set

of VMC is determined, and then the user's LP comes into the picture, to determine VF for

that video. The ideal order of performing the final set of variations is determined

dynamically, and then the indices are used to select the portions of the bitstream which,

when assembled together and streamed to the client, would construct the perfect video as

per all the restrictions and user requirements.

This scheme raises a number of issues. Firstly, we need to remember that the ultimate goal is not to impose so many restrictions that the final result the user gets falls far short of the original request. There has to be a

way of ensuring that within the given limitations, the maximal quality video is delivered

to the users.

Secondly, imposing all these stages on the encoding and delivery of the data makes the scheme increasingly complex. We shall discuss these issues and possible solutions in the following sections.

6.4. Extensions on the Matrix Theme

We can extend the matrix idea and the concept of having a reduction in each

element of the matrix, to having a set of reductions for each element of the matrix. As we

saw in Chapter 4, Section 4.5, the order of applying the reductions does matter. So how

do we decide the ideal order in which we apply the reductions? Some orders are, of

course, ruled out, based on which reductions are done offline and which are done online.

But we can experiment with the different options available to us, and obtain an ideal

order of performing reductions.

The next issue is how, with a limited number of reductions, we can obtain as many different combinations as Section 6.3 illustrates we will need. The solution is fine-tuning. We can use different degrees of reduction; after all, color reduction does not necessarily mean going from 16 K color directly to grayscale. We can have

different levels of reduction, and combine all these in all possible ways that make sense,

to get different semantics of the reduction set. The first level semantic can have the basic

reductions, with no complexity: a total order semantic. The second level semantic can

have variations on the first level semantic, by combining different possibilities and

different ranges of reductions: a partial order semantic. We can thus tighten the scheme further and further until we have the best possible reduction set for any situation.
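The idea of graded reductions can be sketched as follows: each reduction is offered at several levels, a semantic is an ordered list of (reduction, level) pairs, and tightening steps one reduction down a level. The reduction names and level scales are illustrative assumptions:

```python
# Each reduction offers a scale of levels, from richest to poorest.
# Names and levels are illustrative assumptions of this sketch.
LEVELS = {
    "color": ["64K", "256", "16", "grayscale", "B/W"],
    "frame": ["30fps", "15fps", "5fps"],
}

def tighten(semantic):
    """Produce the next, slightly stronger semantic by stepping the
    first reduction that can still be tightened one level further."""
    out = list(semantic)
    for i, (name, level) in enumerate(out):
        scale = LEVELS[name]
        pos = scale.index(level)
        if pos + 1 < len(scale):
            out[i] = (name, scale[pos + 1])
            return out
    return out                      # already at the tightest setting

s = [("color", "64K"), ("frame", "30fps")]
s = tighten(s)                      # first step: reduce color one level
```

Repeatedly applying `tighten` walks through progressively stronger semantics, so the server can stop at the first one that fits the situation, leaving minimal gap between the delivered and the requested quality.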

This approach will give us something like a Figure of Merit for the application.

After all, our primary goal is not to perform reduction in any way to achieve streaming,

but to satisfy the user. We must not only fit the video into the situation as quickly as

possible, but also as perfectly as possible, to leave minimal gap between the achieved

video and the one required by the user.

We can also attempt to fit these various semantics into the matrix, so that the

decision as to which semantic to use can be taken at runtime. Of course, this complicates

things somewhat and raises various implementation issues. These are discussed in

Chapter 8.


So far we have seen how delivery is to be achieved theoretically, and the needs of

the process. Next it is time to think about the architecture of the MPEG-7 wireless/mobile

video server. We need a way to choose from among several possible reduction-sets,

depending on the conditions of connection quality and client device modality. Next we

need a method to limit the reductions in these reductions sets, based on multimedia

content and user preferences. When we think about this, we also need to think about a

way to package the MPEG-7 descriptors along with each video. Finally, we need to think

about a way to perform the reductions and to store that information also at the server,

ready for streaming to the client.

7.1. The Matrix

We will have an MPEG-7 server that contains the database of multimedia data.

The server will also contain Table 6-3 and Table 6-4. We can have a matrix structure to

determine the ideal reduction set for the situation. This matrix will map (Connection

quality, Device modality) to (Variations). To determine the type of client device, we can

have a menu option on the client that allows the user to pick one of several types, and this

information is sent to the server at connection. The server should also be able to measure

the network connection quality at the time. After ascertaining the class of network and

client device, the server will accordingly use the matrix in Table 6-5 to decide the

appropriate reductions needed to be performed. Each element of the matrix will contain a

pointer to the most appropriate reduction set for the situation specified.

This matrix can contain pointers to the actual data, or references to the functions

that must be applied on the data before sending it to the client.
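The second option — matrix cells holding references to reduction functions — can be sketched directly, since functions are first-class values. The reduction functions and the byte-level stand-in transformations below are hypothetical placeholders:

```python
# Matrix cells hold the reduction pipeline to run on the bitstream
# before streaming. The reductions below are trivial stand-ins.
def drop_color(data):  return data.replace(b"RGB", b"Y")   # placeholder
def drop_audio(data):  return data.replace(b"AUD", b"")    # placeholder

MATRIX = {
    ("A1", "I A"):  [drop_color, drop_audio],   # weak link, low-end device
    ("F2", "IV B"): [],                         # no reduction needed
}

def prepare(data, network_class, device_class):
    """Apply, in order, every reduction the matrix prescribes for
    the current (connection quality, device modality) cell."""
    for reduction in MATRIX[(network_class, device_class)]:
        data = reduction(data)
    return data
```

Storing callables rather than pre-reduced data trades storage for runtime work; the indexing scheme of Section 7.2 is the complementary approach that avoids runtime processing altogether.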

7.2. The Secondary Indexing Scheme

Once we have the matrix, the question then arises of how to store the different

reductions in the matrix elements. Once the server gets the request from the client for a

particular video, the server has no time to wait and search for the appropriate parts of the

video file that need to be streamed with or without preprocessing.

To eliminate the search time, we can have a second level of reference into the

video file represented by an index into the video file. This index will store pointers to

the different chunks or parts of the video file that are relevant for each reduction applied

on that video. So when the request for the video comes in, all the server has to do is

follow the index, using the already determined reduction set to index into it, and obtain

the exact data with which the server has to work. Thus the server does not have to waste

time parsing the file at runtime for start and stop bits.

This index will be secondary to the matrix. The server will primarily use the

matrix to determine the reduction set ideal for the situation at hand. Then the server

merely uses the index to make the job of performing reductions on the file easier. The

main job of the server now reduces to checking the conditions of connection quality and

client device modality, and accordingly deciding the ideal reduction set, and then using

this reduction set as the key for the index into the video file. Then the server is able to

obtain the exact data with which to work.

The question now arises as to when this index will be created, and who will create

it. The same question also crops up in connection to the matrix: who will fill it and when?

Obviously, since the server is the entity that has the connection with the video database,

and the creation of the matrix and the index is a one-time step to be performed before any

client has access to the video, we can see that the onus of the creation of the matrix and

index would lie with the server.

7.3. The Introduction of the Packager

We can shift the responsibility of creating the index and the matrix from the main

server to a new entity, the packager. The server imposes the restriction that before the

clients can see a video on the video list, the video needs to be registered with the

packager. The packager will create a matrix that will be accessed for all video

registrations. When a video is registered with the database, the packager will create

MPEG-7 descriptors for the video and create the index for the video. This index will refer

to the parts of the video file that need to be accessed by the server for each particular

reduction set.

Thus when a client requests a particular video, the server uses the matrix to decide

the ideal reduction set for the video. Then the server uses the index to pick the relevant

parts and optionally process them before sending them to the client. The packager does

the following main tasks offline, at registration:

* Creation of MPEG-7 (and UMA) descriptor values for the video,

* Creation of the ideal matrix for that video in particular, with ideal reductions for each

level of device modality and network conditions,

* Creation of indices for each element of the matrix. Each index is basically a linked

list, with links to each piece of the video that needs to be streamed after optional

processing stages, and

* Creation of the mapping table of (C x D) to V, including the restrictions imposed by

MC, so that we get VMC.
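The packager's offline tasks listed above can be sketched roughly as follows. All helper names, the toy descriptor values, and the toy index format of (offset, size) chunks are hypothetical, not the thesis's actual data structures.

```python
# Offline registration sketch. All names and values are illustrative.

def create_mpeg7_descriptors(video_path):
    # The real packager would prompt for / extract MPEG-7 and UMA
    # descriptor values; here we return a fixed toy description.
    return {"title": video_path, "motion": "high", "color_importance": "low"}

def build_reduction_matrix(descriptors):
    # Ideal reductions per (connection class, device class), limited
    # by the multimedia content (MC) described by the descriptors.
    reductions = ["frame_dropping"]
    if descriptors["color_importance"] == "low":
        reductions.append("color_reduction")
    return {("low", "handheld"): reductions, ("high", "laptop"): []}

def build_index(video_path, reductions):
    # One index entry per reduction set: the chunks of the file to be
    # streamed for that variation, as (offset, size) pairs.
    return [(0, 1024), (4096, 2048)] if reductions else [(0, 8192)]

def register_video(video_path):
    """Run all offline packager steps for one video at registration."""
    descriptors = create_mpeg7_descriptors(video_path)
    matrix = build_reduction_matrix(descriptors)
    indices = {cell: build_index(video_path, r) for cell, r in matrix.items()}
    return descriptors, matrix, indices
```

The key design point is that all of this happens once, at registration, so none of it costs the client anything at request time.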

Figure 7-1: Overview of the process

When the server gets a request from a client for a particular video, the server

checks for the ranges of C (Connection) and D (Device modality), as discussed in Section

3. This gives the server V, the set of variations according to the physical conditions, from

the table, and following from that, VMC, depending on MC (Multimedia Content). Then

the server uses LP (the user Loss Profile) to restrict the variations provided by V, forming



VF. Finally the server gets an acceptable set of variations. The server then uses algorithms

to decide the order in which these variations must be applied to the data, which is used by

the server to call the appropriate function that will stream only the required data, after

applying the variations, to the client. The entire process is summarized in Figure 7-1. The

indexing scheme is used along with this process to assemble portions of the information

needed for any particular variation of the video. When the server chooses a particular

variation to be applied, it will follow the index and stream the appropriate bits to the

client, where these will be reassembled to form the intended reduced video according to

the network quality, device modality, content of the multimedia requested, and the user's loss profile.
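The narrowing of the variation set can be sketched as follows. The set names V, VMC and VF follow the text; the mapping and the example restriction values are hypothetical.

```python
def acceptable_variations(C, D, mapping, MC_restrictions, loss_profile):
    """Narrow the variation set: physical conditions give V, the
    multimedia content restricts it to VMC, and the user's loss
    profile (LP) restricts it further to VF."""
    V = set(mapping[(C, D)])            # variations for (connection, device)
    VMC = V - set(MC_restrictions)      # drop variations the content forbids
    VF = VMC - set(loss_profile)        # drop variations the user rejects
    return VF

# Hypothetical example: a poor connection to a PDA-class device.
mapping = {("low", "pda"): {"frame_dropping", "color_reduction", "size_reduction"}}
VF = acceptable_variations("low", "pda", mapping,
                           MC_restrictions={"color_reduction"},  # color matters here
                           loss_profile={"size_reduction"})      # user refuses tiny video
```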


7.4. Practical Issues

One of the main issues that have to be dealt with in the implementation of this

architecture is the efficiency of storage. The prominent downside of using several

variations, of the same multimedia data, would be the redundancy in storage of the same

data, in different ways, to accommodate all these variations.

One way of reducing, or possibly eliminating redundancy, that was suggested in

the previous section, would be to not store different versions of the same data separately,

but have a sophisticated indexing structure and scheme. This indexing scheme would

have several indices into the base multimedia data file. Depending on which variation

was selected as the best one for the situation, the corresponding index into the data would

be selected. This would then access, in the appropriate order, only that data relevant to

the selected variation.

Another important issue to be considered, when we play around with dynamic

variation of data, would be the quality of the data delivered. It may not be acceptable to

the users to have their perfect video, with CD quality sound, suddenly change into a text

only stream, because of fluctuations in the network.

The solution to this problem would be simply to ask the users what is acceptable

to them, and stay within those limits as much as possible. This may be represented in the

form of a loss-profile. This would include data from the users as to what is acceptable to

them in case of changes in network conditions.

When we mention jumping from one part of the data to another using indices, we

would also have to consider the different streams of data going over the network

simultaneously (audio, video and possibly text captions/translations). These have to be

synchronized throughout the stream. The design may dictate the dropping of audio data if

the network deteriorates. If the network comes back to its earlier high quality, we have to

restart audio data streaming. However the audio data cannot pick up where it left off. It

has to match with the part of video being streamed currently.

We may devise any sophisticated scheme to determine the perfect reduction

strategy for a video under given conditions of network and device modality. However all

this would be of no use if the user is not happy with the results. The acceptability of a

video quality is highly subjective and personal. The measurement of acceptability of

perceived video quality is a difficult task that involves user participation, and was not

performed as part of the tests mentioned in Chapter 9.


Now that we have set the architecture of the application, we need to decide on

implementation issues, such as how we should implement the matrix, the indexing

scheme, the basic client-server structure, the packager, and other parts of the application.


8.1. The MPEG Player

The MPEG player is basically an MPEG-2 decoder freely available off the web

[17]. We would have to add some coding to make it visually attractive and user-friendly,

since it would be on the user's side of the network, but otherwise it would do the main

job of picking up the bytes from the video file and displaying the video.

Again, as mentioned in Chapter 3, the addition of streaming into the equation

would mean modifications to the client. The client would have to be made aware of the

fact that the video file is not entirely present on the client, but is still being brought in.

This means that if at any point the client ran out of data, the client would have to wait and

request the server for more data, and could not proceed until a required number of bytes

were buffered. This might mean poor streaming, but the client really does not have a choice.


Further, we might have to consider the reductions introduced into the video. If

these make the video differ in any way from the original MPEG format, we would have

to make the client aware of this and the client should accordingly adjust the decoding

process to accommodate the changes. For example, with color reduction, we would be

removing a part of the original video before transmitting it from the server. The client

would have to accommodate this by padding the missing bytes with zeroes.

8.2. Client-Server Setup

With the client ready, the focus now shifts to the server. The server will need to

be able to receive requests from the client and process them. This processing will involve

testing the connection quality, and getting information from the client about the client

device modality. Then the server will have to select the appropriate reduction set from the

matrix. A more sophisticated server would also select the appropriate order of

performing the reductions, and the degree of reduction for each type. The next

step is to use the reduction set to index into the original video and process it. If necessary,

the appropriate functions would have to be called to process the video bitstream before

sending it.

The server would work with packets of data. The client would have to calculate

how many bytes to buffer before beginning to play the video. For doing this, the client

would need to know the size of the video file beforehand. Hence the server will have to

estimate the target size of the video file after processing and send that information to the

client before anything else, so that while the server is preparing the data to be streamed,

some of the work can be passed on to the client.

8.3. Streaming

At first glance, streaming does not seem difficult. We use the formula presented in Chapter 3,

to calculate the number of bytes we would need to buffer before we start playing the file.

Problems could occur if the network connection quality does not remain the same

throughout the transfer of data. The client could still starve for data, which is to be

avoided. The best we can do in such a situation is to constantly monitor the network

connection quality, and update the formula accordingly. Since on the client side we are

using up one half of the time for downloading and the rest for playing, we can offset this

ratio in favor of the downloading for a period of time, until the formula gets balanced

again. This would avoid stalling the client in the middle of play for buffering extra data.
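One common form of such a startup-buffer calculation is sketched below; the actual formula from Chapter 3 may differ, and the figures used are illustrative. The idea is to buffer just enough that the bytes still arriving during playback cover the rest of the file.

```python
def bytes_to_buffer(file_size, bandwidth, duration):
    """Startup-buffer sketch: with `bandwidth` in bytes/sec and a
    clip of `duration` seconds, buffer enough up front that the
    bytes downloaded during playback cover the remaining file."""
    return max(0, file_size - bandwidth * duration)

# A 20 MB, 4-minute clip over a 50 KB/s link needs an 8 MB head start.
needed = bytes_to_buffer(20_000_000, 50_000, 240)
```

When the measured bandwidth drops mid-stream, the server can recompute this with the new bandwidth and temporarily shift the download/play ratio in favor of downloading, as described above.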

Another situation that can crop up is disconnection. This is a normal situation

where mobile networks are concerned. If the disconnection were only for a fraction of a

second (long enough to break the client-server connection, but too short for the user to

realize that there was a problem), this means trouble. Would the server have to restart

sending data from the beginning? If not, how would the server know where to restart? We

could probably recognize the packet boundaries, and the number of packets sent at both

the client side and the server side. When the client tries to reconnect to the server, we

should have the client send over the number of packets successfully received, so that the

server can restart the transfer of data.
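The reconnect handshake described above might be sketched as follows; the packet size and helper names are hypothetical.

```python
# Resume-after-disconnection sketch. PACKET_SIZE and all names
# are illustrative, not the actual protocol constants.
PACKET_SIZE = 1024

def resume_offset(packets_received):
    """The client reports how many whole packets it received before
    the disconnection; the server restarts at that byte offset."""
    return packets_received * PACKET_SIZE

def serve_from(data, packets_received):
    """Return the remaining bytes to stream after a reconnect."""
    return data[resume_offset(packets_received):]
```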

8.4. Reduction

Without doubt, reduction is one of the trickiest issues to be handled in

implementation. We have selected three types of reduction to implement, and each of

these should be fine-tunable where possible, so as to achieve as many distinct quality

levels as possible.

The first reduction, and probably the easiest, is frame dropping. Frames in an

MPEG-2 stream are recognized by frame start codes. If we scan the stream for the start

codes, we can access the start of each frame. As we skip from frame to frame by scanning

the stream that we send to the client, we can skip the frames that we choose to skip. We

can also use the index efficiently to reduce the time used up in scanning through the file.
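A minimal sketch of this scan follows. MPEG-2 marks each picture with the start code 0x00000100; a real frame dropper would also respect I/P/B frame dependencies, which this toy version ignores.

```python
def picture_offsets(stream):
    """Find byte offsets of MPEG-2 picture start codes (0x00000100)."""
    code = b"\x00\x00\x01\x00"
    offsets, pos = [], stream.find(code)
    while pos != -1:
        offsets.append(pos)
        pos = stream.find(code, pos + 1)
    return offsets

def drop_every_other_frame(stream):
    """Toy frame dropper: keep pictures 0, 2, 4, ... by copying the
    bytes between alternate start codes (ignores frame dependencies)."""
    offs = picture_offsets(stream) + [len(stream)]
    kept = stream[:offs[0]]  # headers before the first picture
    for i in range(0, len(offs) - 1, 2):
        kept += stream[offs[i]:offs[i + 1]]
    return kept
```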


The next reduction is size reduction, which reduces the visible size of the image

on the screen. To achieve this, we effectively need to skip every nth pixel, or rather, since

we have two dimensions to consider, we need to skip every (√n)th pixel in each

dimension (approximately) to achieve the goal. The encoding process being too

complicated to achieve this on the fly in a reasonable amount of time, we opt to perform

this reduction offline. So during registration, the size reduced files are generated for the

registered video. The process to reduce the size involves decoding the original video at

the server end, taking off every nth pixel, before reencoding the video. The client has no

awareness that the video has been reduced to fit the conditions, and plays the video normally.


Of course, with memory restrictions at the server, we cannot do too much fine-

tuning, and can create only a limited number of versions of the video.
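The offline subsampling step can be sketched on a toy frame as follows; real size reduction operates on the decoded MPEG frames before re-encoding, and this list-of-rows frame representation is purely illustrative.

```python
def downsample(frame, n):
    """Keep every nth pixel in each dimension. `frame` is a toy
    list of rows of pixel values; the real path decodes the video,
    subsamples each frame, and re-encodes, all at registration time."""
    return [row[::n] for row in frame[::n]]
```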

The next to be considered is color reduction. The process of removing color is

reduced to removing all U and V (chrominance) information from the video, so that the

client gets a grayscale video. Of course, the client would normally try to search for the

chrominance information and process it, so the client would have to be made aware of

the change in the video that has been sent. The client would then be able to substitute

neutral values for the chrominance information, or ignore it in calculations.

For the purpose of fine-tuning, we have another option open to us: remove only

part of the chrominance information. This will reduce the information to be sent over, and

also affect the quality of the resultant video accordingly. Again, the client will have to be

made aware of this, so that substitution can be made accordingly.
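A toy sketch of this color-reduction agreement between server and client follows. Real 4:2:0 chrominance planes are quarter-sized; here all planes are the same size for simplicity, and the neutral chrominance value 128 is an assumption of this sketch.

```python
def to_grayscale(y_plane, u_plane, v_plane):
    """Server side: drop the chrominance planes entirely and stream
    only luminance, yielding a grayscale video at the client."""
    return y_plane

def client_reassemble(y_plane):
    """Client side: having been told the chrominance was removed,
    substitute the neutral value 128 so the decoder still has all
    three planes to work with (toy same-size planes)."""
    neutral = [[128] * len(row) for row in y_plane]
    return y_plane, neutral, neutral
```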

Of course, we cannot do all the processing for this online; a partial route will be

chosen to make this happen.

8.5. Indexing and the Matrix

Achieving the matrix does not seem to be a very difficult task. All we have to do

is have a lookup table of some sort (we can use an index to do this) that takes a pair of

conditions of (connection quality, device modality) and returns a set of reductions. This

set of reductions is used to peek into the index and retrieve the data that is to be processed

and streamed to the client.

We can simplify this further once we decide what kind of elements the index will

have. The index is basically implemented as a linked list of pointers into the original

video file (or the reduced video file, if size reduction is performed as the first reduction).

Each element of the linked list contains the position of the next chunk of data in the file

and the size of the chunk. All the server has to do is locate the appropriate index entry,

start picking up the chunks from the file as dictated by the linked list elements in

succession and stream them to the client.
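Following one index entry can be sketched as follows. This is a toy in-memory version; each list element is an (offset, size) pair as described above, standing in for the linked-list nodes.

```python
def stream_chunks(video_file_bytes, chunk_list):
    """Follow one index entry: each element gives the (offset, size)
    of the next chunk to stream for the chosen reduction set."""
    for offset, size in chunk_list:
        yield video_file_bytes[offset:offset + size]
```

With this in place, the server's runtime work really is just matrix lookup, index lookup, and sequential chunk delivery.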

So effectively, the matrix elements will be pointers to the appropriate linked list.

The job of locating the index entry appropriate for the situation is shifted from the main

server to the packager, so that the server does not waste too much time at runtime.

8.6. Illustration of the Execution Process

We shall cover the execution process in two subsections: one for the client and

one for the server.

8.6.1. The Client

The client is not the main part of the implementation, but it requires a good

interface, because that will be visible to the user. Figure 8-1 shows an image of the client.

When the client is started, it shows a list of servers previously connected to on the left of

the display screen. The client can connect to any of these by simply double-clicking on

the server, or can connect to a new server by selecting the corresponding menu option.


Figure 8-1: The client (screenshot). Callouts mark the preferred-reductions options, the
play options toolbar, the playing screen, the server connection options, the server list,
the video list, and the download progress bar.

Then on the right of the display screen we can see the list of videos available at

the server to which we are connected. We can play any of these by double-clicking on

it. The download status can be seen on the progress bar at the

bottom of the client window. When the client buffers up enough data to start playing the

video, the display window is resized to the size of the video and the video starts playing.

The play progress can be seen in the slider bar right below the menu. At this stage the

client looks as in Figure 8-2.

The client menu has options so that the user can select the type of device the

video will be played on and the type of network the device is trying to connect to the

server over. There are other options to control the playing and connect to and disconnect

from the server. There are shortcut buttons available for some of these on the toolbar.

Figure 8-2: The client while playing the video


8.6.2. The Server

The server does not need a fancy interface, but it must at least have the

set of necessary menu options that enable the system administrator to deliver video of

desired quality to the client. The server interface is shown in Figure 8-3.

The server has options on the menu to add new videos to the current video list,

and when this is done, the packager registers the video by creating all indices for the

video. Before registering the video, the video has to be put into the correct location, so

that the packager can find it. This is what is shown in Figure 8-3: the menu option for

adding a new video has been selected, and the dialog box prompting for the name of the

video appears.

Figure 8-3: The server (screenshot). Callouts mark the port number setting, the start
and shut down options, and the dialog box that pops up when the option to add a new
video to the database is selected.


When the client connects to the server, the server sends the video list to the client

and awaits the choice of video. When this is received, the server picks up the appropriate

index and starts delivering the video.

The server also has several menu options for initializing the values of IP address

and port, displaying the current list of videos registered with the server, starting the server

and shutting down the server.


Now that we have completed the implementation details, we shall perform some

experiments to illustrate the performance of the server. The experiments shall serve to

demonstrate not only the performance of the reduction schemes, but also that of the

matrix and the indexing scheme.

9.1. Description of the Experimental Setup

The experimental data was collected in the following manner. A particular video

was selected as a case study; it can be found at [18]. This video was repeatedly

downloaded and played under different circumstances, to measure the performance of the

scheme. The experiments were divided into two sections, each illustrating one aspect of

the scheme.

The first set of experiments illustrated the use of the matrix and the index. These

experiments were divided into two main sections, those with the index and those without.

Within each division, the performance of the server was measured under different

schemes of reduction applied to the video.

The reduction schemes were categorized into frame dropping, size reduction and

color reduction. In the experiments without indexing, the scheme for application of these

reductions was a carefully selected algorithm that yielded the best results for all

possibilities of network conditions and device modality. In the experiments with

indexing, the same algorithm was used to construct the index mentioned in Section 5, and

this index was used later to access and stream the video to the client. The overhead of

time and space in the construction of the index was also measured.

To further illustrate the importance of a good indexing scheme, three separate

levels of indexing were considered:

* A 1 x 1 index, which just performs the basic streaming, without taking into

consideration the network conditions and device modality,

* A 2 x 2 index, which has different combinations of two possible network condition

levels and two possible device modality levels. This index combines different types

of reduction for better streaming at the lower levels of these conditions.

* A 3 x 3 index, which is similar to the 2 x 2 index, with one more level of device

modality as well as network conditions.

The second set of experiments was performed to illustrate the flexibility of the

architecture in terms of planning the performance according to the user's needs. These

experiments zoomed into one cell in the matrix, and varied the buffering time that the

user was forced to wait for before the video started playing. The results compared the

corresponding degradation in the quality of the video.

This set of experiments shows that the actual process of reduction is highly

mathematical and the results are very predictable. Hence the system administrator, in

charge of the server, can accordingly tune the video size to the buffering constraints of

the client, and in the process, the quality of the video changes. If better quality were

desired, the user would have to tolerate the longer buffering time.

9.2. Results

The video selected for the case study was a 4 minute long video clip of 20.361

Mbytes, that contained variations in factors such as extent of movement in the video and

importance given to color and clarity of the video.

The results for the lowest level of sophistication in the reduction scheme are

shown in Table 9-1 and Table 9-2, one without indexing and one with indexing. For both

schemes we used a reduction policy most appropriate for most kinds of videos: 25% size

reduction, followed by frame dropping. The results of Table 9-1 were obtained after a

straightforward application of the available reductions for the current network and device

modality conditions, all being performed at execution time (without the use of the

indexing scheme to get pointers to the relevant frames). Table 9-2 also shows the time

taken to create the index entries for the video, as well as the space overhead required for

storing the index.

Table 9-1: Performance of streaming without indexing, and with no matrix
Total time taken to download video

Table 9-2: Performance of streaming with indexing, and with no matrix
Total download time (sec): 1620
Overhead in index creation (space/time): 45124 bytes / 1360 sec

The results for the next level of sophistication in the reduction scheme are shown

in Table 9-3 and Table 9-4, the former without indexing and the latter with indexing. The

results show that though the indexing adds overhead in terms of time and space, the

improvement in the streaming quality cannot be ignored. The point to remember is that

the index generation is done offline and the index is stored on the server, and hence the

process is not at the client's cost.

Table 9-3: Performance of server without indexing, with 2 possible levels each of
network conditions and device modality
                           Device type 1              Device type 2
Network Class 1
  Download time (sec)      1023 (size reduction)      3685 (frame dropping)
Network Class 2
  Download time (sec)      363 (frame dropping,       3906
                            size reduction)

Table 9-4: Performance of server with indexing, with 2 possible levels each of network
conditions and device modality
                           Device type 1    Device type 2
Network Class 1
  Download time (sec)           934             2986
Network Class 2
  Download time (sec)          1567             3910
Space/time overhead: 49345 bytes / 2106 sec

The results for the highest level of sophistication in the reduction scheme are

shown in Table 9-5 and Table 9-6. These clearly distinguish the scheme with a good

index from the scheme without one. The scheme without the index suffers from the

complexity of the algorithm that has to perform reduction at runtime, and does not have

the helping hand of reduction hints stored previously in the indexing scheme.

These experiments clearly illustrate the importance of not just performing

reductions on the streamed video, but getting aid in performing these reductions with a

sophisticated indexing scheme that will allow reduction of the server's access time to the

video and processing time of the video.

Table 9-5: Performance of server without indexing, with 3 possible levels each of
network conditions and device modality
                        Device type 1          Device type 2          Device type 3
Network Class 1
  Download time (sec)   876 (frame, size,      1253 (frame, size,     3218 (frame, size)
                         color)                 color)
Network Class 2
  Download time (sec)   1106 (frame, size)     1378 (frame, size)     3689 (frame, color)
Network Class 3
  Download time (sec)   1067 (frame, size)     3889 (size)            3938

Table 9-6: Performance of server with indexing, with 3 possible levels each of network
conditions and device modality
                        Device type 1   Device type 2   Device type 3
Network Class 1
  Download time (sec)        556             896            2989
Network Class 2
  Download time (sec)        678             945            3105
Network Class 3
  Download time (sec)        697             801            3916
Space/time overhead: 57089 bytes / 4576 sec

Figures 9-1 through 9-3 show the performance of the scheme with the 3 x 3

matrix. It is evident that the scheme with the matrix performs better than without the

matrix. Also, the indexed scheme in general performs better than the scheme without the

index. Of course, there are a few anomalies, caused by the fact that some reduction types

are helped more by the indexing than others.

Figure 9-1: Performance of the scheme with client device Class 1 (download time in
seconds, with and without the index, for each network class)

Figure 9-2: Performance of the scheme with client device Class 2

Figure 9-3: Performance of the scheme with client device Class 3

The next set of experiments yielded the results shown in Table 9-7 below. The

results are shown graphically in Figure 9-4.

Table 9-7: Performance of the scheme for a fixed matrix element
Percentage of video not delivered (%) Buffering time (seconds)
0 1081.81
10 544.65
20 281.589
30 207.623
40 128.445
50 53.938




Figure 9-4: Performance of the scheme for a fixed matrix element, with variation in
buffering time

In this chapter we illustrated the performance of the scheme. Let us summarize

our achievements, and see in what ways the project can be improved.


Chapter 8 illustrated the implementation of our scheme, and Chapter 9

demonstrated the performance of the implementation. We shall conclude with a summary

of the entire project, and suggestions for future improvements.

10.1. Achievements of this Thesis

We saw how streaming was necessary to obtain a good performance out of video

delivery, without an infinite waiting period. We saw that even streaming did not always

help, especially in the case of huge video files. Next, we started thinking of more ways

to reduce the client's wait time. Here, the fact that wireless client devices do not need

very high quality video anyway helped. We could actually reduce the quality of the

video, in the process reducing the required bandwidth and improving the streaming

performance.


This scheme helped, but there was no attention paid to the user's requirements.

The user simply had to be satisfied with whatever fixed scheme was settled on for the

purpose of the common good. Though the client might actually be able to handle a larger

video, the user might still have to settle for lower quality. To avoid this one-size-fits-all

scheme, we expanded it to a matrix of different schemes, based on the two principal

factors on which the video delivery depended: the network connection quality and the

client device modality. Each cell in the matrix got a different scheme of reductions most

appropriate for those conditions.

We then further improved the scheme, by shifting most of the onus of performing

reductions from the server to the packager, and introducing an index to store information

about the reductions. The server's work was reduced to using the index to access the

video file at runtime.

10.2. Proposed Extensions and Future Work

This project is only a first step toward implementing wireless video delivery. There were

some aspects that could easily be expanded on. For instance, the matrix was reduced to a

3 x 3 matrix. The main reason for this was the small number of reductions we could

implement. Rate reduction, discussed in Chapter 2, could also be added to the general

scheme, allowing room for more divisions in the matrix. The scheme of frame dropping

could have been improved, making it more sensitive to the video data, since the frame

composition changes with the video.

This brings into the picture MPEG-7, which would have been perfect for

specifying the contents of the video. There could have been another descriptor describing

the frame composition of the video, and the packager could have accessed this descriptor

to get information about the video. There could be more UMA descriptors suggesting the

ideal reductions to be performed based on the video content. In fact, the packager could

be assigned the job of initializing MPEG-7 descriptor values for each video as it is being

registered, aided by a set of dialog boxes that would prompt for choices of values for

each descriptor.

While preparing the tables of device classes and connection quality, we were

hampered by the lack of publicly available data. This could be improved on, and a more

sophisticated classification could be achieved, so that minimum work is given to the user.

The indexing scheme selected was also very basic, and deserves improvement.

The basic aim of this project was to illustrate the advantages of the scheme, however

primitively, and extensions on the scheme are not difficult to implement.

Hence a (not necessarily comprehensive) list of future extensions on the project

would be:

* More reduction schemes,

* More sophisticated reduction schemes for the existing reductions,

* Better classification in the tables for the matrix, and

* A more sophisticated indexing scheme.

To conclude, the scheme illustrated in this project attempted to better the

performance of wireless video delivery systems for the users of wireless networks. With

all the suggested extensions, this project could very well become a very interesting and

popular application.


[1] A. Elmagarmid and H. Jiang, Digital Video. In the Encyclopedia of Electrical and
Electronics Engineering, John Wiley & Sons, 1998.

[2] Ahmed K. Elmagarmid, Haitao Jiang, Abdelsalam A. Helal, Anupam Joshi and
Magdy Ahmed. Video Database Systems: Issues, Products and Applications. Kluwer
Academic, 1997.

[3] Compression algorithms.

[4] Still Image Compression Formats.

[5] AVI Audio-Video Interleave.

[6] MPEG home page.

[7] Dublin Core standard.

[8] R. Alonso, Y. Chang and V. Mani. Managing video data in a mobile environment.

[9] I. Kouvelas, V. Hardman and J. Crowcroft. Network-Adaptive Continuous Media
Applications Through Self-Organized Transcoding. Proc. Network and Operating
Systems Support for Digital Audio and Video (NOSSDAV 98, Cambridge, UK), 8-
10 July 1998.

[10] APPLE QuickTime streaming technology.

[11] Streaming Media World Tutorials.

[12] N. Merhav and V. Bhaskaran. Fast Algorithms for DCT-Domain Image Down-
Sampling and for Inverse Motion Compensation. IEEE Transactions on Circuits
and Systems for Video Technology, Vol. 7, No. 3, June 1997.

[13] Oliver Werner. Requantization for Transcoding of MPEG-2 Intraframes. IEEE
Transactions on Image Processing, Vol. 8, No. 2, February 1999.

[14] Sun documentation about JDBC.

[15] J. Smith, C.-S. Li, R. Mohan, A. Puri, C. Christopoulos, A. B. Benitez, P. Bocheck,
S.-F. Chang, T. Ebrahimi and V. V. Vinod. MPEG-7 Content Description for
Universal Multimedia Access, ISO/IEC JTC1/SC29/WG11 MPEG99/M4949,
MPEG-7 Proposal draft.

[16] Malcolm McIlhagga, Ann Light and Ian Wakeman. Giving Users the Choice
Between a Picture and a Thousand Words. School of Cognitive and Computing
Sciences, University of Sussex, Brighton, May 18, 1998.

[17] MPEG Software Simulation Group (MSSG).

[18] Experimental data.


Latha Sampath was born on July 17, 1977 in Chennai, India. She received her

Bachelor of Engineering degree in Computer Science from VES Institute of Technology,

affiliated with Bombay University, Bombay, India in August 1998.

She joined the Department of Computer and Information Science and Engineering

at the University of Florida in Fall 1998. She worked at the Database Systems Research

and Development Center while earning her master's degree.

Her research interests include streaming media and multimedia applications,

wireless networks and database systems.