Wireless/mobile video delivery architecture

Material Information

Wireless/mobile video delivery architecture
Sampath, Latha, 1977- ( Dissertant )
Helal, Abdelsalam A. ( Thesis advisor )
Dankel, Dr. ( Reviewer )
Raj, Dr. ( Reviewer )
Frank, Dr. ( Reviewer )
Place of Publication:
Gainesville, Fla.
Publisher:
University of Florida
Publication Date:
2000
Physical Description:
88 p.


Subjects / Keywords:
Acoustic data ( jstor )
Databases ( jstor )
Digital video files ( jstor )
Downloading ( jstor )
Image compression ( jstor )
Indexing ( jstor )
Multimedia materials ( jstor )
Streaming ( jstor )
Subject terms ( jstor )
Video data ( jstor )
Computer and Information Science and Engineering thesis, M.S ( lcsh )
Dissertations, Academic -- Computer and Information Science and Engineering -- UF ( lcsh )
Wireless communication systems ( lcsh )
government publication (state, provincial, territorial, dependent) ( marcgt )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


A wireless network is a source of concern for any kind of data transfer, because of its limited capacity. When delivering video over a wireless network, the problem is intensified because of the nature of video data. Users of wireless devices are increasingly attempting to access the Web, with its highly visual content, using devices with limited capabilities. The video delivery system must adjust itself to these stringent conditions and must perform to the user's satisfaction. In this thesis we present an architecture aimed at delivering video over a wireless network in the most satisfactory manner possible under the existing circumstances. The architecture makes use of the fact that wireless network devices generally have limited capabilities and will not be able to exhibit the full quality of a high-quality video. We present methods of reducing the quality of the video without making it too obvious to the user. We also present ways and means to make the burden of performance fall on the video server, instead of the client, so that the user is not made aware of the problems that normally occur in such situations. This is achieved using a matrix architecture, and indexing of the video file on the server side, so that the server is able to make decisions at runtime, depending on the network condition and the client device type.
Thesis (M.S.)--University of Florida, 2000.
Includes bibliographical references (p. 76-77).
General Note:
Title from first page of PDF file.
General Note:
Document formatted into pages; contains x, 78 p.; also contains graphics.
Statement of Responsibility:
by Latha Sampath.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
50751314 ( OCLC )
002678741 ( AlephBibNum )
ANE5968 ( NOTIS )









To my family


I express my sincere gratitude to my advisor, Dr. Sumi Helal, for giving me the

opportunity to work on this challenging topic and for providing continuous guidance and

feedback during the course of this work and thesis writing. Without his encouragement

this thesis would not have been possible.

I thank Dr. Dankel and Dr. Raj for agreeing to be on my committee. I also thank

Dr. Frank for agreeing to attend my defense at my last-minute request.

I thank Sharon Grant for making the Database Center a truly great place to work,

and Matt Belcher for providing me with all the computing facilities I needed.

I give special thanks to my friend Yan Huang, who assisted me in the initial

stages of this project.

I especially thank my friend Sangeetha Shekar, for all her support, encouragement

and help during my work. I thank my friends Subha, Harini, Vidyamani, Amit and

Prashant for their support; and for making my stay at the University of Florida a

memorable one.

I thank my parents and my sister and brother-in-law for their constant emotional

support and encouragement throughout my academic career.


ACKNOWLEDGEMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

1. INTRODUCTION

2. REVIEW OF RELATED WORK
2.1. Nature of Video Data
2.2. Video Compression Standards
2.2.1. Existing Standards
2.2.2. Emerging Standards
2.3. Problems with Wireless Video Delivery
2.3.1. Problems Due to the Wireless Network
2.3.2. Problems Due to the Nature of Video Data
2.3.3. Problems Due to Client Device Capabilities
2.4. Transcoding

3. ANALYZING VIDEO STREAMING
3.1. What Is Streaming and How Will It Help?
3.2. Existing Streaming Technology
3.3. Theoretical Achievement of Streaming
3.4. Streaming for Wireless Networks
3.5. Limitations of Streaming

4. VIDEO REDUCTION
4.1. The Need for Reduction
4.2. Color Reduction
4.3. Size Reduction
4.4. Frame Dropping
4.5. Hybrid Reductions
4.6. Is Online Video Reduction Feasible?

5. WIRELESS VIDEO WITH MPEG-7
5.1. Introduction
5.2. What Is MPEG-7?
5.3. Descriptors and Description Schemes
5.4. Scenarios with MPEG-7
5.4.1. Distributed Processing
5.4.2. Content Exchange
5.4.3. Customized Views
5.5. Wireless Video Using MPEG-7

6. UNIVERSAL MULTIMEDIA ACCESS AND MPEG-7
6.1. Overview of UMA
6.1.1. UMA Variations
6.1.2. UMA Attributes
6.1.3. The Importance of UMA
6.2. Conceptual Model
6.2.1. Factors Affecting Delivery
6.2.2. Network Conditions
6.2.3. Client Device Modality
6.2.4. Multimedia Content
6.2.5. User Preferences
6.3. The Reduction Set and the Restrictions on It
6.4. Extensions on the Matrix Theme

7. METHODOLOGY OF IMPLEMENTATION
7.1. The Matrix
7.2. The Secondary Indexing Scheme
7.3. The Introduction of the Packager
7.4. Practical Issues

8.1. The MPEG Player
8.2. Client-Server Setup
8.3. Streaming
8.4. Reduction
8.5. Indexing and the Matrix
8.6. Illustration of the Execution Process
8.6.1. The Client
8.6.2. The Server

9. EXPERIMENTAL EVALUATION
9.1. Description of the Experimental Setup
9.2. Results

10. CONCLUSIONS AND FUTURE WORK
10.1. Achievements of this Thesis
10.2. Proposed Extensions and Future Work

LIST OF REFERENCES

BIOGRAPHICAL SKETCH



Table

6-1: Classification of reductions performed on multimedia data.
6-2: UMA Attributes.
6-3: Classification of networks based on connection quality.
6-4: Classification of devices based on device modality.
6-5: Mapping of (Connections, Modality) to Variations.
9-1: Performance of streaming without indexing, and with no matrix.
9-2: Performance of streaming with index, and with no matrix.
9-3: Performance of server without indexing, with 2 possible levels each of network conditions and device modality.
9-4: Performance of server with indexing, with 2 possible levels each of network conditions and device modality.
9-5: Performance of server without indexing, with 3 possible levels each of network conditions and device modality.
9-6: Performance of server with indexing, with 3 possible levels each of network conditions and device modality.
9-7: Performance of the scheme for a fixed matrix element.


Figure

5-1: A UML-based representation of possible relations between D and DS.
6-1: Three-dimensional view of MPEG-7 and the description schema for UMA.
7-1: Overview of the process.
8-1: The client.
8-2: The client while playing the video.
8-3: The server.
9-1: Performance of the scheme with client device Class 1.
9-2: Performance of the scheme with client device Class 2.
9-3: Performance of the scheme with client device Class 3.
9-4: Performance of the scheme for a fixed matrix element, with variation in buffering time.

Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science



WIRELESS/MOBILE VIDEO DELIVERY ARCHITECTURE

By
Latha Sampath

December, 2000
Chairman: Dr. Sumi Helal
Major Department: Computer and Information Science and Engineering

A wireless network is a source of concern for any kind of data transfer,

because of its limited capacity. When delivering video over a wireless network, the

problem is intensified because of the nature of video data. Users of wireless devices are

increasingly attempting to access the Web, with its highly visual content, using devices

with limited capabilities. The video delivery system must adjust itself to these stringent

conditions and must perform to the user's satisfaction.

In this thesis we present an architecture aimed at delivering video over a wireless

network in the most satisfactory manner possible under the existing circumstances. The

architecture makes use of the fact that wireless network devices generally have limited

capabilities and will not be able to exhibit the full quality of a high-quality video. We

present methods of reducing the quality of the video without making it too obvious to the user.


We also present ways and means to make the burden of performance fall on the

video server, instead of the client, so that the user is not made aware of the problems that

normally occur in such situations. This is achieved using a matrix architecture, and

indexing of the video file on the server side, so that the server is able to make decisions at

runtime, depending on the network condition and the client device type.


1. INTRODUCTION
As the trend toward mobile computing increases, so does the need for multimedia

support for the wireless network. The support of multimedia on wireless systems will

allow many interesting new applications. Because video data is generally bulky, attempts

have been made to compress this data stream into a manageable size that can be

transmitted over a network, without wasting valuable bandwidth. One of the more

important standards of compression that has emerged is MPEG (Moving Picture Experts

Group). This document deals with the various aspects of delivery of video data over a

wireless network.

One of the main concerns of the process of delivering video data over a wireless

network is the poor quality of wireless networks. The wireless network has problems of

bandwidth, latency and disconnection. The various problems of wireless networks are

discussed in detail in Chapter 2. These affect the quality of video delivery to wireless

clients. The problem is aggravated by the nature of the data being transferred over such a

network. Video data is bulky by nature and cannot easily absorb data loss without the user becoming aware of it. The capabilities of the client device are

also less than those of normal fixed network devices, so sending the video to the client in

its entirety does not even make sense. All these concerns and possible solutions are

discussed in Chapter 2.

One alternative to downloading the entire video and then playing it is streaming. Streaming allows the client to buffer a small part of the video initially and

then to begin the playing, while downloading continues in the background. The amount

buffered is just sufficient so that the client application does not starve for data in the

middle of playing the video. Streaming is discussed in Chapter 3.
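The buffering requirement just described can be expressed as a simple calculation. The following sketch is a hypothetical illustration; the function name and parameters are ours, not part of any streaming protocol. It computes the minimum startup buffering time so that playback, once started, never outruns the ongoing download:

```python
def startup_buffer_seconds(video_bytes, duration_s, download_bps):
    """Minimum initial buffering time (seconds) so playback never starves."""
    playback_bps = video_bytes / duration_s      # average consumption rate
    if download_bps >= playback_bps:
        return 0.0                               # network keeps up; start at once
    # We need buffer_bytes + download_bps * duration_s >= video_bytes,
    # so prefetch the deficit before starting playback.
    deficit = video_bytes - download_bps * duration_s
    return deficit / download_bps                # time needed to prefetch the deficit
```

For example, a 1000-byte clip playing for 10 seconds over a 50-byte/s link needs 10 seconds of prebuffering, while a 200-byte/s link needs none.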

However, streaming is not an ultimate solution. Streaming alone is insufficient for

good-quality video delivery over a wireless network as network quality drops. Hence we come to the next step in reducing the bandwidth requirements of

delivering the video: reducing the number of bytes to be transmitted. The way we do this

is by reducing the information stored in the video, by a variety of choices. For example,

we could reduce the color quality from 256-color to 16-color, or further to grayscale. Or

we could reduce the resolution of the video. This is actually helped by the fact that the

client devices would probably not have the capability to exhibit the quality of the original

video anyway, since wireless devices in general compromise on hardware capabilities.

Chapter 4 discusses the various possibilities in terms of reductions of the original video.
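As a rough illustration of the reductions just mentioned, consider the pure-Python sketch below. It operates on uncompressed in-memory pixel data purely for clarity; a real system would apply such reductions to the compressed stream:

```python
def to_grayscale(pixels):
    """Map (r, g, b) tuples to single luma values (ITU-R BT.601 weights)."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in pixels]

def quantize_channel(value, levels):
    """Reduce one 0-255 channel to `levels` evenly spaced values
    (e.g. levels=16 approximates a 16-color reduction per channel)."""
    step = 255 / (levels - 1)
    return round(round(value / step) * step)

def halve_resolution(frame):
    """Drop every other row and column of a 2D frame (nearest-neighbour)."""
    return [row[::2] for row in frame[::2]]
```

Each function discards information the constrained client could not display well anyway, which is exactly the trade-off Chapter 4 explores.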

With all these changes to the normal streaming process, algorithmic complexities

are inevitable, and we shall have to study the MPEG decoding algorithm closely to be

able to decide which option to follow. We can also have a totally different method to

actually access the video, or make changes to the basic MPEG format to accommodate

easy online reductions. These options are discussed in Chapters 6 and 7.

The delivery of video over a wireless network is not an easy task, as we can see. It

involves many issues that may be unseen at first glance, but which occur during

implementation. This document deals with many of these issues. Some issues, discussed

in the concluding chapter, are not resolved. Ultimately, the goal of delivering good

quality video to the user has to be achieved, with the user's approval of the perceived

video quality. We may use arbitrarily complicated data structures to access and process the video data, but if the user is not happy with the reduction chosen by the server, the service is of no use. This document attempts to achieve this goal in part, with some measure of success.


2. REVIEW OF RELATED WORK
Video data modeling is the process of designing the representation for the video

data, based on its characteristics, content and application. Before we analyze the process

of transmitting video data over a wireless network, let us understand the nature of video

data, and the complex issues involved in dealing with this type of data.

2.1. Nature of Video Data

Video data is bulky and very rich in content [1]. Audiovisual content comprises the physical characteristics of the video data, such as color, texture, objects, interrelationships among objects, and camera operation. Textual content may contain embedded information

about the clip, such as a caption, title, names of characters etc. Semantic content is the

data presented to the user. Video data is traditionally represented in the form of a stream

of images, called frames. These frames are displayed to the user at a constant rate, called

frame rate (frames/second). These frames may or may not be accessed independently,

depending on the complexity of the video browser. For most applications, the entire

video stream is too coarse (too high) a level of abstraction. However, a single frame is too

fine a level of abstraction, because a single frame lasts for a very short time, and each

stream contains an unmanageable number of frames. Other, intermediate levels of abstraction are often used for data extraction, and a hierarchy of video stream abstractions can be formed.

The basic element for characterizing video data is the shot. A shot consists of a series of frames generated and recorded contiguously, representing a continuous

action in time and space. A set of shots that are related in time and space is defined as a

scene. Related scenes are grouped together into a sequence. Related sequences are

grouped together into a compound unit, which can recursively form part of a larger compound unit.

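The shot-scene-sequence-compound hierarchy above can be captured directly as a recursive data model. The sketch below is illustrative only; the class names simply follow the terminology of this section:

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Shot:
    """Contiguously recorded frames: the basic unit of video structure."""
    first_frame: int
    last_frame: int

@dataclass
class Scene:
    """A set of shots related in time and space."""
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Sequence:
    """Related scenes grouped together."""
    scenes: List[Scene] = field(default_factory=list)

@dataclass
class CompoundUnit:
    """Related sequences; may recursively contain other compound units."""
    parts: List[Union[Sequence, "CompoundUnit"]] = field(default_factory=list)
```

The recursion in CompoundUnit mirrors the text: a compound unit may itself be part of a larger compound unit.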

A necessary characteristic of a video data representation is its annotation

capability [2]. It should permit the association of content descriptions with the data, to

facilitate easy content-based extraction of data. Annotations of data should be alterable

based on user interpretation and context.

2.2. Video Compression Standards

Video data is generally bulky, and must be compressed to bring it to a

manageable size that can be stored, manipulated and transmitted over a network, without

wasting valuable bandwidth. Video can be compressed both temporally and spatially.

Spatial compression operates within individual frames; temporal compression exploits similarity between adjacent frames. Compression techniques can be either lossless or lossy. Lossless

compression techniques attempt to compress the storage of data using mathematical

functions alone. Lossy compression techniques leave out redundant parts of the

information in the frames (which can be reconstructed later) and perform better

compression at the cost of data loss. Most current compression standards use lossy

techniques. Several compression standards have been devised in the past, and are now in

common use and we discuss some of them here.

2.2.1. Existing Standards

In the past, the focus was mainly on images, and representation of images was

generally in the form of bitmaps and compressed versions of bitmaps. Several

compression schemes emerged for storing image data. The JPEG (Joint Photographic

Experts Group) compression format was one of the most popular formats used for storing

images, as it took up the least amount of space. It performs lossy compression of still images, that is, photographs. It works well on images of real-world scenes, but not on

lettering, cartoons and line drawings. This is because JPEG takes advantage of limitations

of the human eye in perceiving minute differences in color and brightness, for

compression of pictures. With line drawings, there is not much detail that can be left out

in the process of lossy compression, and JPEG cannot make use of the same limitations

for such simple pictures.

The JPEG format is a symmetric codec (COmpressor/DECompressor): encoding and decoding take comparable effort. The coding process involves an RGB-to-YIQ transformation, DCT, quantization, DPCM (Differential Pulse Code Modulation), RLE (Run-Length Encoding) and finally entropy

encoding [3]. The advantage of JPEG is that the encoder can control the trade-off

between compression ratio and image quality, or between compression ratio and

compression speed.
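Two of these stages, the DCT and quantization, can be sketched in a few lines. This is a simplified, unoptimized illustration of the standard 8x8 transform; the color transform, DPCM, RLE and entropy-coding stages are omitted:

```python
import math

def dct_8x8(block):
    """2-D DCT-II of an 8x8 block of level-shifted samples (JPEG convention)."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = sum(block[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / 16)
                    * math.cos((2 * y + 1) * v * math.pi / 16)
                    for x in range(8) for y in range(8))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

def quantize(coeffs, q):
    """Divide each DCT coefficient by its quantization-table entry and round.
    This is the lossy step that discards detail the eye barely notices."""
    return [[round(coeffs[u][v] / q[u][v]) for v in range(8)] for u in range(8)]
```

The encoder's quality/size trade-off mentioned above is controlled precisely by the quantization table `q`: larger entries discard more detail and compress harder.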

The GIF (Graphics Interchange Format) format uses an LZW compression scheme (the same as StuffIt and PKZip), which does not change the quality of the image [4].

Instead, it uses an index color palette, which may degrade the color quality of images, but

effectively cuts the size of an RGB image by two-thirds. It uses a smaller color depth as

compared to JPEG and supports line drawings more than full-color illustrations. This is

because GIF uses a lossless compression scheme and, hence, compromises on the color

depth. If GIF files were to support the same range of color as JPEG images, the file size

would become too large to download.
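The two-thirds figure follows directly from replacing three bytes of RGB per pixel with a one-byte palette index. A quick back-of-the-envelope check (illustrative only):

```python
def rgb_bytes(width, height):
    """Raw 24-bit colour: 3 bytes per pixel."""
    return width * height * 3

def indexed_bytes(width, height, palette_colors=256):
    """One index byte per pixel, plus the palette itself (3 bytes per entry)."""
    return width * height * 1 + palette_colors * 3
```

For a 640x480 image this gives 921,600 bytes raw versus 307,968 bytes indexed: roughly one-third the size even after including the 768-byte palette.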

There are several other compression formats besides these two: the TIFF, TARGA and BMP formats, for example. These do not achieve any significant compression and store the image in nearly its original bitmap form.

Then came the video formats: AVI [5] and QuickTime's MOV. The AVI (Audio-

Video Interleave) format saves audio and video data interleaved. This format uses the keyframe concept: only every 12th to 17th picture is saved in full, depending on the picture contents. The intermediate pictures are saved as their differences from the

preceding frame. The AVI videos cannot be streamed. The entire file needs to be present

in local memory before the video can be played.

We also had the M-JPEG (Motion JPEG) format that extended the JPEG format

[1]. This format compresses frames on a one-by-one basis, without any reference to

neighboring frames. Because every frame is independently decodable, the M-JPEG scheme minimizes the time and computational resources required in a digital editing environment, making it a good fit for video editing.

Finally, we had the emergence of the MPEG (Moving Picture Experts Group) compression format. MPEG provides a very good compression ratio, but the number of calculations required makes software MPEG codecs somewhat slow. All MPEG compression formats use the principle of classifying frames into I (Intra-coded), P (Predictive-coded), B (Bidirectionally predictive-coded) and D (DC-coded) frames. Each

of these different types of frames has a different typical size, quantity of information,

encoding method and importance in the frame sequence. The MPEG group developed a

series of formats for video compression. The MPEG-1 format was the first, with only

progressive video encoding. The MPEG-2 format came next, with interlaced as well as

progressive encoding. Then came MPEG-4,¹ with object-oriented concepts of representing objects in a picture.
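A typical arrangement of these frame types, the group of pictures (GOP), can be generated with a short sketch. The 12-frame IBBPBB pattern shown is a common choice, not something the standard mandates:

```python
def gop_pattern(gop_size=12, p_distance=3):
    """Build a typical MPEG group-of-pictures pattern, e.g. 'IBBPBBPBBPBB':
    one I-frame anchor, a P-frame every `p_distance` slots, B-frames between."""
    frames = []
    for i in range(gop_size):
        if i == 0:
            frames.append("I")          # self-contained anchor frame
        elif i % p_distance == 0:
            frames.append("P")          # predicted from the previous I/P frame
        else:
            frames.append("B")          # predicted from frames on both sides
    return "".join(frames)
```

Because I-frames are the only self-contained frames, their spacing governs both compression ratio and how quickly a decoder can resynchronize after data loss.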

2.2.2. Emerging Standards

The MPEG-4 format had totally different ways to encode a picture. Instead of the

standard bitmap-by-bitmap encoding, MPEG-4 used an object-oriented encoding scheme,

to represent the different objects in a picture [6]. These objects, called Audio-Visual

Objects, were used to put the picture together. There are simple AVOs, such as graphics,

background, text, animation and 3D objects. These can be combined to form other AVOs,

called compound AVOs. Such simple and compound AVOs together compose an

audiovisual scene. More generally, MPEG-4 provides a standardized way to describe a

scene (for example allowing media objects to be placed anywhere within a coordinate

system, allowing the application of transforms to change the physical properties of an

object or changing the user's viewpoint of a scene).

The MPEG-4 format is different from its predecessors in the extent of user

interaction allowed with the scene and the objects contained in it. The MPEG-4 format's

language for describing and manipulating the scene is called BIFS (BInary Format for

Scenes). The BIFS commands are available for addition/deletion of objects, and

alteration of individual properties of individual objects. The BIFS code is binary, so it

occupies minimum space. The MPEG-4 format uses BIFS for real-time streaming (that is,

a scene does not need to be downloaded fully for playing, but can be played on the fly).

¹ MPEG-3 existed once upon a time, but it basically did what MPEG-2 could already do, so it was abandoned.

The MPEG-4 format's representation of data for transport is different too. Its

objects are placed in what are called Elementary Streams (ES). Some objects, such as a

sound track or video, will have a single such stream. Others may have two or more.

Higher level data describing the scene will have its own ES, making it easier to reuse

objects in the production of new multimedia content. If parts of a scene are to be

delivered only under certain conditions, multiple scene description ESs may be used to

describe the same scene under different circumstances. Object descriptors (ODs) are

used to tell the system which ESs belong to which object, and which decoders are needed

to decode a stream.
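Conceptually, an object descriptor is a mapping from an object to its elementary streams and the decoder they require. The toy sketch below illustrates the idea only: all names, stream ids and decoder labels are invented and are not MPEG-4 syntax:

```python
# Hypothetical object-descriptor table; entries are illustrative, not
# drawn from the MPEG-4 Systems specification.
object_descriptors = {
    "scene_graph":    {"elementary_streams": [0],    "decoder": "bifs"},
    "main_video":     {"elementary_streams": [1, 2], "decoder": "mpeg4-video"},
    "narrator_audio": {"elementary_streams": [3],    "decoder": "aac"},
}

def streams_for(object_name):
    """Look up which elementary streams carry a given object."""
    return object_descriptors[object_name]["elementary_streams"]
```

A receiver consults such a table to route each incoming stream to the right decoder, which is exactly the role the text assigns to ODs.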

All this conversion is done in a layer devoted solely to synchronization. Here

elementary streams are split into packets, and timing information is added, before passing

the packets on to the transport layer. Timing information is important for many video

compression schemes, which encode a frame in terms of its adjacent frames. The existing

MPEG-2 transport scheme suffices, along with others such as the Asynchronous Transfer

Mode (ATM) and Real-time Transport Protocol (RTP).

After MPEG-4 came MPEG-7.² The MPEG-7 format is targeted at the problem of

browsing a video file to detect frames of interest to the user. MPEG-7 defines a standard

set of descriptors that can be used to describe various types of multimedia information, in

the form of video, audio, graphics, images, 3D models etc. The MPEG-7 format also

defines other descriptors and structures that will define the relationships among these

descriptors. The MPEG-7 data may be located anywhere within the same stream, the

² After MPEG-1, MPEG-2 and MPEG-4, there was a lot of speculation about which number to use next: 5 (the next number) or 8 (following a binary pattern). MPEG decided not to follow either pattern, and chose MPEG-7 as the name of the new standard.

same storage system or anywhere on the network. Links are used to connect the

descriptors with the associated material. The MPEG-7 format is discussed in greater

detail in Chapter 5, since it is at the center of the work involved in this document.
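The descriptor/description-scheme relationship can be pictured as a small data model. This is an illustrative sketch only; MPEG-7 defines its schemes in a dedicated Description Definition Language, not in Python, and the field names here are ours:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Descriptor:
    """A single MPEG-7-style feature description (names are illustrative)."""
    name: str            # e.g. "dominant_color"
    value: object

@dataclass
class DescriptionScheme:
    """Relates descriptors (and nested schemes) and links them to the media,
    which may live anywhere: same stream, same store, or across the network."""
    descriptors: List[Descriptor] = field(default_factory=list)
    children: List["DescriptionScheme"] = field(default_factory=list)
    media_link: str = ""   # URL or stream offset of the described material
```

The `media_link` field captures the point made above: the description need not travel with the content it describes.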

The latest standard is MPEG-21 [6]. This standard is mainly concerned with

satisfying the user. Consumers are becoming better equipped and more knowledgeable about the technology available to make life easier. At the same time, system administrators must become aware of the various factors

that affect the quality of video delivery and make sure that consumers get exactly what

they ask for, and no less. This holds true for network delivery, quality and flexibility of

service, quality of content delivered and ease of use. This also holds for physical factors

such as format compatibility, platform compatibility and other user-oriented factors such

as ease of subscription. The system is also expected to be capable of search and retrieve

operations, besides the normal operations, and customer rights and privacy have to be

respected. It is expected that the recognition of such factors as important will go a long

way toward standardizing service and making it acceptable to users.

The Dublin Core [7] is a lightweight standard that has attracted much attention. It

is a standard that deals with metadata (data about data), aimed at easy discovery of electronic resources: a resource can often be located and assessed through its description alone, without retrieving the resource itself. The Web has such diverse information that a vast infrastructure is

needed to support all the metadata. Efforts are being made to develop such a standard for

this purpose. The procedures involved in forming the Dublin Core are mainly borrowed

from other standards' committees.

2.3. Problems with Wireless Video Delivery

The delivery of video over a wireless network raises many issues that need
careful attention. These problems arise not just because the network is not

the high-bandwidth fixed network, but also because the data we have chosen to transmit

over such a fickle network has unique requirements [8].

2.3.1. Problems Due to the Wireless Network

First of all, the existing internetworking protocols do not support wireless

communication to any great extent. Wireless communication is closely linked with

mobility and with mobility come the demands of dynamic routing and frequent

disconnection. These were not really taken into consideration before, and different

options to improve this are now being studied. Secondly, wireless communication has

additional shortcomings, such as low bandwidth, high bit error rate and packet loss.

Existing transport protocols cannot distinguish packet loss caused by mobility from
loss caused by congestion; they simply lack mobility awareness. Wireless
networks also suffer from increased latency, which causes additional data loss and a
considerable decrease in perceived video quality.

2.3.2. Problems Due to the Nature of Video Data

Additional technical difficulties arise when multimedia needs to be supported

over this wireless network. The real-time nature of multimedia applications makes time

considerations an important issue, since for playback quality to be good, the delivery

times of audio and video streams have to match exactly. Video compression introduces
additional, variable delay, which shows up as packet jitter. Existing protocols also fail to make intelligent

decisions based on the nature of the application as to the needed quality of data delivery.

Different applications may have different requirements, for example, regarding the

relative importance of audio over video. At the same time, arbitrary data loss can
noticeably degrade the perceived quality.

Additionally, the different streams of data (video and audio) have to be

synchronized at all times or else the perceived video quality is reduced.

2.3.3 Problems Due to Client Device Capabilities

The devices to which the multimedia data has to be downloaded over a wireless

network are generally devices with low capabilities in their display unit, CPU processing

power, audio quality and other factors of importance where multimedia is concerned.

Such devices generally have hardly any processing capability, which limits the ability of

the device to decode the bitstream of multimedia data and display it. The display unit of

such devices is also generally barely functional, with maybe a black and white display,

minimal refresh rate and low picture resolution. It actually becomes pointless to

download the entire content of the multimedia data to such a device. The device would

also have limited local memory to store large multimedia files.

2.4 Transcoding

Many problems arise when transmitting multimedia data over the wireless

network, and it is unnecessary to send all the components of the data. One approach

would be to leave out some component (such as color) or to reduce the quality of the data

(for example, by reducing the resolution) when transmitting the data to a device with

limited capabilities. This practice, called transcoding [8], has been tested for wireless

networks and multimedia data, and also for normal downloading applications.

Different schemes were tried for transcoding. One of the first suggested schemes

was a layered encoding scheme. Each layer would add more quality to the encoded

signal. The individual layers are then transmitted on separate multicast addresses.

Receivers should be able to adapt to changing network conditions by adjusting the

number of levels they subscribe to.

Layered encoding, however, does not provide any significant advantage to speech

codecs, because available speech codecs do not lend themselves easily to layering, and an

exponential relationship between adjacent layers is required for maximum throughput.

For video, the target frame rate cannot be deviated from.

Another scheme that was tried was simulcasting. A group of receivers can adjust

to network conditions by customizing parallel streams transmitted by the sender, to match

their needs. However, this causes congestion near the sender, since all parallel streams

have to emerge from the sender.

Self-organized transcoding [9] is an interesting scheme that uses automatically

configurable transcoders to improve bad reception. When a group of receivers detect a

congested link, an upstream receiver with better reception acts as a transcoder and

provides a customized version of the stream. This new stream is multicast to a different

address, so that affected receivers can switch to that address. The suffering receivers must

elect a representative who attempts to locate a suitable upstream receiver to serve them,

and to coordinate the transcoding process.

The actual method we select to implement the transcoding process without

pressure on the server is discussed in detail in Chapters 6 and 7.


This chapter discusses a common alternative to downloading the entire

multimedia data file before playing it: streaming. With streaming, the playing starts even

though the entire file may not have been downloaded, and downloading continues as a

background process.

3.1. What Is Streaming And How Will It Help?

Streaming is basically downloading and playing at the same time. Instead of

waiting for the entire downloading process to be finished before the client starts working

with (decoding) the file, the client starts playing once it gets enough data to prevent itself

from starving for data because of the low downloading speed. This obviously improves

the whole performance of the client. The client buffers enough data (frames) to be able to

start playing and then performs the downloading as a background process, while playing

the file at the same time.

This reduces the wait time of the user. The user does not have to wait for the

entire file to be downloaded, but can start watching the video/audio playing, while the

rest of the file is downloaded in the background. The client has to buffer enough data to

be able to play the file continuously, without running out of data and pausing in between

to buffer more.
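The interplay of buffering, background downloading, and playback described above can be sketched with two threads sharing a queue. This is a simulation only, under our own assumptions: integer frame numbers stand in for real frames, and decoding and display are elided.

```python
import queue
import threading
import time

FRAME_RATE = 100         # frames per second the simulated player consumes
BUFFER_FRAMES = 30       # frames that must be buffered before playback starts

played = []              # record of the frames the player has displayed

def downloader(buf, total_frames):
    """Background thread: fetch frames and append them to the buffer."""
    for frame_no in range(total_frames):
        buf.put(frame_no)            # blocks if the buffer is full
        time.sleep(0.002)            # simulated per-frame network delay

def player(buf, total_frames):
    """Wait for the initial buffer to fill, then consume frames steadily."""
    while buf.qsize() < BUFFER_FRAMES:
        time.sleep(0.01)             # initial buffering phase
    for _ in range(total_frames):
        played.append(buf.get())     # blocks (re-buffers) if data runs out
        time.sleep(1 / FRAME_RATE)

total = 90
buf = queue.Queue()
t_down = threading.Thread(target=downloader, args=(buf, total))
t_play = threading.Thread(target=player, args=(buf, total))
t_down.start(); t_play.start()
t_down.join(); t_play.join()
```

The queue gives the synchronization between the two threads for free: the player blocks when the buffer runs dry, which is exactly the "pause for buffering" behavior discussed below.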

3.2. Existing Streaming Technology

Many existing multimedia applications deliver video using streaming
technology. In this section we discuss the existing streaming technology.

QuickTime is the Macintosh multimedia application [10]. QuickTime has a long

list of compatible formats, which makes it one of the most flexible applications. The

server converts live media data into packets, which it sends to the client via multicast or

unicast, as the case may be. The streaming, based on RTP (Real-time Transport Protocol)

and RTSP (Real-Time Streaming Protocol), is industry-standard streaming. However,
some network equipment has not yet caught up with the standards QuickTime uses. The application

has a number of attractive features that account for its popularity.

Windows Media Player is Microsoft's media application [11], for the Advanced

Streaming Format. The Advanced Streaming Format supports streaming from a

streaming server or a typical Web server. This application uses a template approach, with

one template for each type of application. It does not work on Macintosh machines, but it

plays AVI, WAV and MP3 files, the most popular formats in use today.

Without doubt, the RealNetworks application RealPlayer is one of the best media

streaming applications in use today [11]. The application supports almost all media types,

and includes a browser plug-in. The RealProducer software generates the multimedia file

in the rm format compatible with the RealPlayer software. The RealPlayer application

sometimes has buffering problems, but otherwise in terms of versatility and ease of use, it

is one of the best in use today.

3.3. Theoretical Achievement of Streaming

To be able to achieve streaming, several issues must be addressed. The first one is

how to let the playing process and the downloading process access the same file at the

same time. A possible solution is to split an MPEG-2 video file into small portions, in

such a way that the two processes are actually accessing different files. However, this

might require big changes to the decoder in the client, which currently might be using

only one whole file, not a stream. As an alternative solution, multiple threads might be

used to act as the play and download processes, which needs thread handling. Another

issue of streaming is to synchronize the playing thread and downloading thread to prevent

the player from starving for data. The problem we need to avoid is when the playing

thread plays much faster than the downloading thread downloads data. As an example,

suppose the player is playing 30 frames per second, while the downloading thread can

only download 10 frames per second. This will cause the player to starve for data and the

playing will get stuck for buffering, when insufficient data has been downloaded. To

solve this problem, we need to do enough buffering before playing starts. As a very
simple example, suppose the player frame rate rp is 30 fps, the downloading frame rate rd is 10

fps, and the video file size S is 900 frames. Downloading the file needs td = 900/10 = 90

seconds, while playing the video needs tp = 900/30 = 30 seconds. We need to find a

buffering time bf to satisfy

(bf + S/rp) x rd = S

In this case, bf should be 60 seconds.
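The relation can be rearranged to compute bf directly: playback lasts S/rp seconds, so the download must supply all S frames within bf + S/rp seconds, giving bf = S/rd - S/rp. A minimal sketch:

```python
def buffering_time(S, rd, rp):
    """Minimum buffering time bf in seconds. Playback lasts S/rp seconds,
    so the download must deliver all S frames within bf + S/rp seconds:
    (bf + S/rp) * rd = S, which gives bf = S/rd - S/rp."""
    return S / rd - S / rp

# The example from the text: S = 900 frames, rd = 10 fps, rp = 30 fps
bf = buffering_time(900, 10, 30)   # 60.0 seconds
```

Note that bf drops to zero once the download rate matches the playback rate, and grows without bound as rd shrinks, which is exactly the wireless-network problem discussed in Section 3.4.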

Buffering the file will also cause the problem of handling the decoder, a very

necessary but time-consuming task.

3.4. Streaming for Wireless Networks

When wireless networks come into the picture, additional issues emerge. The

formula mentioned in Section 3.3 must be followed. But since rd is going to be very

small, bf increases in size, resulting in a longer wait time for wireless device users. This

is not a satisfactory streaming result. Additional problems result from frequent

disconnection: what should be done with what has already been buffered? How should
the server keep track of where to restart? What happens in the case of lost packets, a

probable occurrence? The streaming application must address all these issues and,

accordingly, fix the buffering limits. The rate of download for a wireless network is not

constant. The network quality that the download started with may not be maintained

throughout. Midway, the application may be forced to pause for buffering. To avoid this,

the server must adapt by periodically measuring the network quality and readjusting the buffering limits accordingly.


Packet loss is known to be a big problem where wireless networks are involved,

so this must be taken care of. Standard error-checking procedures work very poorly in

this case, because it is difficult to keep track of the exact chunk in the source file that has

not been transmitted correctly to the client. The same is true for keeping track of how

much has been downloaded and where to restart downloading, in the case of a disconnection.


3.5. Limitations of Streaming

The previous section illustrates the various issues involved in streaming over a

wireless network. Although an application that addresses all of these issues and
performs well might be developed, its quality is difficult to ensure when

multimedia data is involved. Multimedia data is typically very bulky in nature, and the

buffer size becomes too large, thus making the wait time impractical. The user will not

have the patience to wait too long for the client to buffer sufficient data to start playing,

while downloading the rest in the background.

On the other hand, even RealPlayer shows problems while operating over a

wireless network. Thus, streaming multimedia over a wireless network, with all its

accompanying problems, is not really feasible in its raw state, because the buffering
equation of Section 3.3 must still be satisfied.

Of the different variable factors involved in the equation, bf, rd, and rp are the

only ones considered so far. One more factor not yet considered is S, the size of the video

file. But to make S variable, we would have to bring transcoding into the picture, and

reduce the quality of the video in the process. Whether or not the resultant video is

acceptable to the user is a subjective issue. The process and its effect on the video are

discussed in detail in later chapters.


In Section 2.4 we discussed the principle behind transcoding, a means to reduce

the strain on the network bandwidth by reducing the required bit-rate. One way of doing

that is reducing the speed of playing the video, so that rp is reduced, effectively reducing

bf. But this is not a very good solution practically, because the effect is very obvious to

the user, and may not be acceptable. An alternative solution is discussed in this chapter -

reducing the contents of the video so that the number of bytes delivered to the client is

reduced. This obviously reduces the quality of the video. However, because we are

concerned with wireless networks, we assume that the better quality is not absolutely

necessary. In fact, it may not even be perceptible.

4.1 The Need for Reduction

Chapter 2 showed the various problems associated with multimedia data over a

wireless network. Multimedia data is bulky, and the wireless network does not help at all.

When we discussed streaming in Chapter 3, we realized that streaming is an option to be

considered. However, streaming is not helpful in this particular situation, with limited

network capabilities and bulky data. Now we begin to consider the option of reducing

video quality to accommodate lower bit-rates.

The devices used with wireless communication generally have limited

capabilities, as seen in Subsection 2.3.3. We can turn this to our advantage, by not

transmitting the part of the data that is useless to that particular client. For example, we

can have a client device with a monochrome display. Then there is no need to transmit

color information over the network to the client, because that would take up more

bandwidth. An alternative is to remove color information from the video before

transmitting to the client. The user sees no difference in the quality of the video, because

the user's device is not able to display color. So the results are satisfactory.

Other types of reduction are also possible, depending on the different

classifications of client devices and their capabilities. If the client device display has a

small resolution, there is no point in sending the client a 1024 x 768-resolution image,

when the client device probably has only a 2" x 3" display, with a minimal resolution

designed for mostly text display. The solution is to reduce the resolution (or size) of the

image to a size suitable for the client device [12, 13]. Theoretically at least, this reduces
the number of bytes to be transmitted in the same proportion as the reduction in the size of the
video.

We can also make use of the fact that multimedia data, especially MPEG data, is

organized into frames, and each frame is stored in sequential order in the file, so we can

theoretically separate the transmission of each frame. The practical implications of such a

process are discussed in detail in later sections. Now if the video does not contain much

object movement, one frame may not be too different from the next, and we can afford to

skip the transmission of such frames in between. Such frame dropping will not depend

directly on the client device capabilities. We study the effect on perceived video quality

in Section 4.4.

4.2. Color Reduction

Most of the smaller wireless devices have limited display capabilities, with small

color ranges, for which a 256-color display is meaningless. Some devices even have

grayscale or monochrome display units. For such client devices, we can easily reduce the

required bandwidth, by removing the color information from the video. This does not

result in noticeable difference in the perceived video quality, since the client device

capability itself is less.

Of course, such a reduction involves several issues to be considered on the

implementation side. The question of how to achieve this color reduction comes up

immediately. The encoding process for an MPEG video is a complicated process and

does not store the color information in a bitmap format. Instead, to save on storage space,

the RGB information is converted to YUV and stored not directly but as
quantized, motion-compensated differences between frames. We must be able to extract the color

information from the frames, reduce it to whatever level of color is required for the client

device, and transmit it to the client. For this we need to decode the video at the server,

take off the color information, reencode it and finally transmit it to the client.
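Assuming a frame has already been decoded to RGB, the color-dropping step itself is simple. The sketch below converts each pixel to a luminance value using the BT.601 weights that underlie MPEG's RGB-to-YUV conversion; the frame representation (a list of rows of tuples) is a simplification of ours, not an actual MPEG structure.

```python
def to_grayscale(frame):
    """Reduce a decoded RGB frame (a list of rows of (r, g, b) tuples)
    to one luminance value per pixel, using the BT.601 weights that
    underlie MPEG's RGB-to-YUV conversion."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in frame
    ]

frame = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]
gray = to_grayscale(frame)   # [[76, 150], [29, 255]]
```

Since MPEG already stores luminance (Y) separately from chrominance (U, V), a smarter server-side implementation could simply omit the chrominance data rather than recompute luminance from RGB; the point here is only what information survives the reduction.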

The issue then arises of how much time this process would involve. For a long

video, the user can probably go have lunch and come back, and the video still would not

have been downloaded. That is definitely not a good algorithm from the user's point of

view. To enable this option of reducing color, we would hence need to introduce some

simplification, in the form of offline reduction. If the color reduction were done offline

and stored, when the user's request for the video comes in, the reduced version is already

available and ready to be streamed.

However, with a large video database, we must next consider the space involved

in this. The server would need to store different versions of each video, involving a space

overhead of at least 200%. Hence, we need to think of some cheaper method to reduce

color. An easy solution is a compromise between the two extremes of doing everything

offline, and doing everything online. Why not do a part of the reduction offline, the part

that does not require too much storage, and then do the rest on the fly? Thinking of such

alternatives involves a careful study of the actual MPEG encoding algorithms. We might

be able to reduce a part of the work by storing the references to each frame at the server,

and on the fly, picking up each frame, applying the conversion algorithm to it and

sending it to the client. Such methods are discussed in detail in Chapter 7.

4.3. Size Reduction

When smaller display devices enter the picture, the next obvious step to reduction

is decreasing the display size of the video. We can either reduce the resolution of the

image, or reduce the size of the effective display and leave the resolution (pixels/inch) the


Consider a reduction of the video by a factor of n. To do this, the algorithm would

have to pick every nth pixel from the image and send it as the new image. We need to be

able to get at the bitmap version of each frame, pick every nth pixel and reencode the

frame into the video. This will theoretically reduce the number of bits to be transmitted
by a factor of n, since the number of pixels retained, and hence the storage required,
shrinks proportionally. Because of the complexity of the encoding algorithm, however,
the process is a long one and computationally expensive.
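On a decoded frame, the subsampling step itself can be sketched as follows. The frame representation is again our simplified list of rows; note that subsampling both dimensions by n cuts the pixel count by roughly n squared, while subsampling a single dimension gives the factor-of-n reduction discussed above.

```python
def downsample(frame, n):
    """Keep every nth pixel in each dimension of a decoded frame
    (a list of rows of pixels). Subsampling both dimensions cuts
    the pixel count by roughly n * n."""
    return [row[::n] for row in frame[::n]]

# A 6x8 frame of (row, col) markers, reduced by a factor of 2
frame = [[(r, c) for c in range(8)] for r in range(6)]
small = downsample(frame, 2)   # 3 rows x 4 columns remain
```

The real cost, as the text notes, lies not in this selection step but in decoding the MPEG stream to pixels and re-encoding the smaller frames afterwards.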

Again, the issue of online versus offline reduction comes up, with the same

advantages and disadvantages. Online reduction takes a lot of time, and streaming will

not be feasible. Offline reduction takes up a lot of space on the server. The advantage of

offline reduction is that the overhead shifts to the server side, and the client is not aware

of the overhead. Partial reduction can be an option. The advantages and disadvantages of

this will be seen in later chapters, where we shall also discuss whether the tradeoff is

sensible, and whether it is worth the trouble.

4.4. Frame Dropping

Frame dropping is a pretty good and easy way to reduce the bandwidth

requirements of multimedia data, especially those with frame-based formats. Some

multimedia data lends itself easily to frame dropping, because there may not be too much

difference between one frame and the next. For example, with a news report, the image

may mainly show a correspondent reviewing a situation. A generic example is any
video that focuses on a single person or object and does not involve much motion.

Apart from the video type, the client device type would also determine if frame

dropping were suitable for the situation. If the client device does not have a fast processor

that can process and display the frames as fast as they are being sent, there is really no

point in sending all the frames at the same rate as in the original video. It is better to skip

some frames in between, allowing the client device to catch up. Though we are speaking

of the situation as if the main problem is on the client side and not the network, we can

use the situation to our advantage when we have network problems. We are able to make

use of the client device's inferior capabilities by reducing the quality of the video in such

a way that the user is not made aware of it.

It is not difficult to achieve frame dropping, as long as the storage format is a

frame-based format like MPEG. The format would have a code to mark the start and end

of each frame, and all the algorithm has to do is parse the bitstream being sent to the
client, looking for these codes, and appropriately suppressing

the transmission of the frame when necessary. This process can also be simplified using

some data structures, as shown in later chapters.
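In MPEG-1/2 video, each picture begins with the four-byte start code 0x00000100, so a dropper can scan for that code and forward only every kth picture. The sketch below is deliberately naive: real MPEG streams contain sequence and GOP headers, P/B pictures depend on neighboring frames (dropping a reference frame corrupts the pictures predicted from it), and start-code emulation in the payload must be handled.

```python
PICTURE_START = b"\x00\x00\x01\x00"   # MPEG-1/2 picture start code

def drop_pictures(bitstream: bytes, keep_every: int) -> bytes:
    """Naive frame dropper: split the stream at picture start codes and
    keep every keep_every-th picture. Real streams also need headers,
    inter-frame dependencies and start-code emulation handled."""
    parts = bitstream.split(PICTURE_START)
    header, pictures = parts[0], parts[1:]
    kept = [p for i, p in enumerate(pictures) if i % keep_every == 0]
    return header + b"".join(PICTURE_START + p for p in kept)

# Toy stream: a header followed by six 3-byte "pictures" A..F
stream = b"HDR" + b"".join(PICTURE_START + bytes([0x41 + i]) * 3
                           for i in range(6))
thinned = drop_pictures(stream, 2)    # keeps pictures A, C, E
```

The data structures mentioned above (discussed in later chapters) would replace the linear scan here with a precomputed index of frame offsets, so the server never has to search the bitstream at transmission time.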

4.5. Hybrid Reductions

Now that we have studied the reduction of size, color information or the number

of frames of the multimedia data, let us consider what we would do if the client device

were really bad in terms of capabilities, and the network conditions were also bad. The

natural alternative would be to apply more than one reduction algorithm to the

multimedia data. Of course, this would mean further degradation in resulting quality, but

we do not have a choice. Additionally, the client device, rather conveniently, is less

capable of exhibiting the full quality of the delivered video.

Let us consider the various possibilities we have, given the three reduction

strategies we have studied so far. Whether these reductions are done offline or on the fly

plays an important role in deciding which permutations are feasible and which are not

worth consideration [14]. If size reduction were done offline, by storing the
different possible versions of the video, we could not achieve frame dropping before size
reduction; it would have to be done after. This order has to be maintained, though it

would make more sense to get rid of the useless frames before processing each frame for

the size reduction. It would also make sense to drop frames before performing color

reduction, following the same logic. Since frame dropping is the only totally online
(on-the-fly) reduction we have devised, this seems to be a tough thing to achieve.

Between size reduction and color reduction, it would again make more sense to

reduce the size before dealing with color, since then we would have fewer pixels to

process. This works well for us if size reduction is done offline.
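The ordering argument can be sketched as a pipeline over decoded frames. This is a toy model with invented function names; each stage mirrors one of the reductions above, and the composition order reflects the reasoning about cost.

```python
def stage_drop(frames, keep_every):
    """Cheapest stage first: discard whole frames before any pixel work."""
    return frames[::keep_every]

def stage_resize(frames, n):
    """Subsample every nth pixel in each dimension of surviving frames."""
    return [[row[::n] for row in frame[::n]] for frame in frames]

def stage_gray(frames):
    """Per-pixel luminance conversion runs last, on the fewest pixels."""
    return [
        [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
         for row in frame]
        for frame in frames
    ]

def hybrid_reduce(frames, keep_every, n):
    # Frame dropping -> size reduction -> color reduction, so that each
    # successive (more expensive) stage touches as little data as possible.
    return stage_gray(stage_resize(stage_drop(frames, keep_every), n))

# Four 4x4 all-white RGB frames; keep every 2nd frame, every 2nd pixel
frames = [[[(255, 255, 255)] * 4 for _ in range(4)] for _ in range(4)]
out = hybrid_reduce(frames, 2, 2)
```

When size reduction is performed offline, the stage_resize step is replaced by selecting a pre-stored version, which is why frame dropping then has to run after it rather than before.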

4.6. Is Online Video Reduction Feasible?

This issue actually needs a careful study of the multimedia data format. If the

format involved a complicated decoding and encoding algorithm, and the reduction

algorithm involved a lot of processing at the pixel level instead of at the frame level, it

becomes difficult, maybe impossible, to achieve. If the reduction algorithm were to be

performed at the frame level, decoding and reencoding would not be needed at all,

making our task simpler. However, where the reduction can be done largely depends on the
encoding algorithm. With MPEG files, it is difficult to access pixel information at

the frame level, unless the decoding process is carried out.

Thus we have to look at the alternative to online reduction: performing the

reductions offline, and storing the versions of the video at the server, ready to be

downloaded to the client on demand. The downside of this is the space overhead on the

server. It is not a simple double space overhead, since all possible versions of the video

will have to be stored on the server; the ratio of reduction depends on the client device,

the user's specifications and the network condition at the time of download.

It helps to remember that since the complexity is effectively shifted to the server

side of the transfer, the user can be kept oblivious to the overhead. The complexity would

be of the order of O(n) still, which can be borne by the server.


From this chapter onwards, we shall shift our focus to the MPEG compression

standard. The reason why we choose MPEG as the standard to work with will be more

obvious in Chapter 6, where we discuss recent developments that help our work in

wireless video delivery. MPEG is one of the most efficient compression standards in use

today, and suits our purpose admirably.

5.1. Introduction

The Moving Pictures Expert Group (MPEG) is a family of video compression

standards [6]. The existing MPEG standards are MPEG-1, 2 and 4. The MPEG-1 format

provides a mode of storage for video data. The MPEG-2 format provides better user

interaction in deciding the quality of encoding and uses a two-layer approach for

effective encoding. The basic difference between MPEG-4 and previously existing

standards is the object-oriented approach. The MPEG-4 format defines standard ways of

representation of objects in a picture, called audio-visual objects (AVOs). The MPEG-4

format hence provides a standardized way to describe a scene, allowing one, for example,
to place media objects anywhere within a coordinate system, apply transforms to change the

physical properties of an object or change the user's viewpoint of a scene.

Recently MPEG announced a call for proposals for MPEG-7, its newest standard

of video compression. The MPEG committee is now accepting proposals for MPEG-7

and expects the standard to be finalized by the year 2000. The MPEG-7 format does not

aim to be a compression standard on its own, but is designed to work with other

compression standards to provide a way of accessing content of video data.

5.2 What Is MPEG-7?

The MPEG-7 format, formally called "Multimedia Content Description

Interface," standardizes the following:

* A set of description schemes and descriptors,

* A description definition language (DDL) to specify description schemes, and

* A scheme for coding the description.

The MPEG-7 format aims to standardize content descriptions of multimedia data.

The MPEG-7 format extends the limited capabilities of currently existing codecs in

identifying content of video streams, mainly by including more data types. In doing so,

MPEG-7 specifies a standard set of descriptors that can be used to describe various types

of multimedia information. This description allows fast search of audio-visual content for

material of a user's interest. The MPEG-7 format also standardizes a language to specify

structures for the descriptors and their relationships.

The advantage of using MPEG-7 descriptors is that they do not depend on the

format or encoding scheme of the material being described. This makes MPEG-7 a very

flexible standard. MPEG-7 builds on other existing standards to provide references to

suitable portions of them. MPEG-7 will also allow different granularity in its

descriptions, offering the possibility of different levels of discrimination. The content

descriptions can be placed in a separate stream from the actual multimedia stream. They

can be associated with each other through bi-directional links. Hence, data may exist

anywhere on the network.

5.3 Descriptors and Description Schemes

A descriptor is a representation of some definite characteristic of the data, which

signifies something to somebody. A descriptor allows an evaluation of the feature via the

descriptor value. It is possible to have several descriptors describing the same feature to

address different relevant requirements. For example, the color of an image may be
described by a color histogram or by the average of its frequency components, while
motion may be described by a motion field, and the title by its text.

A descriptor value is an instantiation of a descriptor for a given data set.

Descriptor values are combined using a description scheme to form a description.

A description scheme specifies the structure and semantics of the relationships

between its components, which may be both descriptors and description schemes. The

distinction between a descriptor and a description scheme is that a descriptor contains

only basic data types and does not refer to other descriptors or description schemes.

For example, a movie may consist of several scenes and shots, if partitioned

temporally. Such a movie may contain descriptions of the relations between these

component shots and scenes, and may include some textual descriptors at the scene level,

and color, motion and audio descriptors at the shot level.
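The movie example can be sketched as a nested description structure. This is purely illustrative: the field and scheme names below are invented, and real MPEG-7 descriptions are expressed via the DDL, not Python.

```python
# Illustrative only: these field names are invented, not MPEG-7 DDL.
movie_description = {
    "scheme": "MovieDS",                     # hypothetical description scheme
    "scenes": [
        {
            "annotation": "Opening chase",   # textual descriptor (scene level)
            "shots": [
                {                            # descriptor values (shot level)
                    "color_histogram": [0.4, 0.3, 0.2, 0.1],
                    "motion": "fast-pan",
                    "audio": "engine-noise",
                },
            ],
        },
    ],
}

def shots_with_motion(description, motion_type):
    """Search the description, not the video itself, for matching shots."""
    return [
        shot
        for scene in description["scenes"]
        for shot in scene["shots"]
        if shot["motion"] == motion_type
    ]

matches = shots_with_motion(movie_description, "fast-pan")
```

The point of the sketch is that queries like this operate entirely on descriptor values, which is what allows searching audio-visual content without decoding the underlying video.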

A description consists of a description scheme and the set of descriptor values that

describe the data. Depending on the completeness of the set of descriptor values, the

description scheme may be fully or partially instantiated. Whether or not the description

scheme is actually present in the description depends on technical solutions still to be determined.


The Description Definition Language is a language that allows the creation of

new description schemes and possibly descriptors. It also allows the extension and

modification of existing description schemes.

The advantage of MPEG-7 descriptors is their flexibility: MPEG-7 descriptors

can be used to describe data in different formats, such as an MPEG-4 stream, a video

tape, a CD containing music, sound or speech, a picture printed on paper and an

interactive multimedia installation on the web. This gives a great deal of freedom in the

use of particular data formats for the actual video content, since they can all be described

using one common format: MPEG-7. This leads to different applications in which

MPEG-7 can be used.



Figure 5-1: A UML-based representation of possible relations between D and DS

Figure 5-1 describes the relationship between descriptors, description schemes

and the audio-visual data they describe.

5.4. Scenarios with MPEG-7

The MPEG-7 format aims at the improvement of existing applications and the

introduction of completely new ones. Three of the most relevantly impacted scenarios are:


* Distributed processing,

* Content exchange, and

* Customized views.

5.4.1. Distributed Processing

MPEG-7 will permit interchange of descriptions of multimedia material

independent of platform or application. This enables distributed processing of multimedia

content. Data from a variety of sources can be plugged into different distributed

applications such as multimedia processors, editors, retrieval systems, filtering agents etc.

In the future, one may be able to access various content providers' web sites to

download content and associated indexing data, obtained by some high-level or low-level

processing. The user can then proceed to access several tool providers' web sites to

download tools (e.g., Java applets) to manipulate the heterogeneous data descriptions in

particular ways, according to the user's personal interests. An example of such a

multimedia tool will be a video editor. An MPEG-7 compliant video editor will be able to

manipulate and process video content from a variety of sources if the description

associated with each video is MPEG-7 compliant. Each video may come with varying

degrees of description detail such as camera motion, scene cuts, annotations and object segmentation.


5.4.2. Content Exchange

Another scenario that will benefit from a common description standard is the

exchange of multimedia content among heterogeneous audio-visual databases. MPEG-7

will provide the means to express, exchange, translate and reuse existing descriptions of

audio-visual material.

Currently, material used by databases is described manually, using text held in proprietary databases whose whole purpose is such description. Describing audio-visual material is a time-consuming and expensive task, so it is desirable to minimize the re-indexing of data that has already been processed.

Interchange of multimedia content descriptions would be possible if all content providers used the same scheme and system. Since this is impractical, MPEG-7 proposes to adopt a single industry-wide interoperable interchange format that is system- and vendor-independent.

5.4.3. Customized Views

Multimedia players and viewers compliant with the multimedia description

standard will provide users with innovative capabilities such as multiple views of the data

configured by the user. The user could change the display's configuration without

requiring the data to be downloaded again in a different format from the content provider.


The ability to capture and transmit semantic and structural annotations of the

audio-visual data, made possible by MPEG-7, greatly expands the possibilities for client-

side manipulation of the data for display purposes. For example, a browsing system can let users skim quickly through videos if it receives information about their semantic structure: when modeling a tennis match video, the viewer can choose to view only the third game of the second set, all overhead smashes of one player, and so on.

5.5. Wireless Video using MPEG-7

The content-based nature of MPEG-7 makes it highly appealing to designers of

databases. It is also interesting to note the advantages it could have in a wireless network.

When we consider the customization of views, the user could get a view of a video tailored to the capabilities of his wireless device. This could let the user retrieve just that portion of the video in which he is interested, reducing the amount that needs to be downloaded and thus the bandwidth required.

The display can also be adjusted dynamically according to the device constraints. Handheld devices that cannot support, or simply do not need, detailed images can download just enough for the user to get the general picture. In a weather news video, for instance, the user may only be interested in the next day's rain forecast for Florida. This can be arranged easily using content-based indexing.

Storage requirements are also reduced, since the entire video need not be downloaded. These and other issues are discussed in detail in the following chapters.


The Moving Picture Experts Group (MPEG) has started work on a new

standardization effort called the "Multimedia Content Description Interface," also known

as MPEG-7. The goal of MPEG-7 is to enable fast and efficient searching and filtering of

audio-visual material. The effort is being driven by specific requirements gleaned from a

large number of applications related to image, video and audio databases, media filtering

and interactive media services (radio, TV programs), scientific image libraries etc.

Recently, multimedia content description information for enabling Universal

Multimedia Access (UMA) has been proposed [15] as part of the MPEG-7 specifications.

The basic idea of universal multimedia access (UMA) is to enable client devices with

limited communication, processing, storage and display capabilities to access rich

multimedia content [5]. The use of MPEG-7 with UMA descriptors would be ideal for

our efforts towards fast and efficient multimedia data transmission.

6.1. Overview of UMA

The main idea of UMA (Universal Multimedia Access) is that any kind of device

with even minimal capabilities, should be able to access any multimedia data over any

network. To be able to do this, the multimedia content must be modified to suit the

conditions of transfer. UMA describes broad categories into which these reduction

schemes could be classified. These, called variations on the data, are described in

Subsection 6.1.1.

MPEG-7 suggests a specific set of attributes to be used to describe multimedia

content, to ease the task of downloading the video. This has been extended to UMA in

[15] with UMA-specific attributes. These attributes are discussed in Subsection 6.1.2.

6.1.1. UMA Variations

Several entities are defined by [15] for describing variations of multimedia data:

substitution, translation, summarization and extraction. The variation entity is defined as

an abstract entity from which the other different types of entities (substitution, translation,

summarization, extraction and visualization) are derived. The variation entity also

contains attributes that give the selection conditions for which a variation should be

selected as a replacement for a multimedia item [15]. The derived entities are listed in Table 6-1.


In UMA applications, the variations of multimedia material can be selected as

replacement, if necessary, to adapt to client terminal capabilities or network conditions. If

the network conditions are bad, a variation with the least possible effort needed for

transfer may be selected, so that the work is reduced. At the same time, if the device used

by the client has excellent display characteristics and other good resources needed for

display of multimedia, the variation selected must as far as possible have good visible

characteristics (for example, key frame extraction can be selected instead of color

reduction). On the other hand, if the device does not have some necessary quality, such as

a color display, then that quality can be dropped from the multimedia that needs to be

streamed to that client.

In general, the Variation-DS provides important information not only for UMA

applications but also for managing multimedia data archives since in many cases the

multimedia items are variations of others.

Table 6-1: Classification of reductions performed on multimedia data.

Reduction       Description                                    Examples
Substitution    One program substitutes for another when       Text passages used to substitute for
                there need not be any actual derivation        images that the browser is not
                relationship between the two                   capable of handling
Translation     Conversion from one modality to another;       Text-to-speech (TTS), speech-to-text
                the input program generates the output         (speech recognition), and
                program by means of translation                video-to-image (video mosaicing)
Summarization   Input program is summarized to generate
                the output program; may involve compaction
                and possible loss of data
Extraction      Information is extracted from the input        Key-frame extraction from video;
                program to generate the output program,        embedded-text and caption extraction
                involving analysis of the input program        from images and video
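The selection logic described above — choose a variation that fits the network conditions, and rule out variations that require a capability the device lacks — can be sketched as follows. All names, costs, and capability flags here are illustrative assumptions of this sketch, not part of the MPEG-7/UMA specification:

```python
# Hypothetical UMA variation selection. Costs and capability flags
# are illustrative assumptions, not defined by MPEG-7.
VARIATIONS = {
    "substitution":  {"cost": 1, "needs_color": False, "needs_audio": False},
    "translation":   {"cost": 2, "needs_color": False, "needs_audio": False},
    "extraction":    {"cost": 2, "needs_color": True,  "needs_audio": False},
    "summarization": {"cost": 3, "needs_color": True,  "needs_audio": True},
}

def select_variation(bandwidth_kbps, has_color, has_audio):
    """Pick the richest variation the client can still display.

    A missing device capability rules a variation out entirely;
    under a weak link we prefer the cheapest remaining variation.
    """
    candidates = [
        (name, props) for name, props in VARIATIONS.items()
        if (has_color or not props["needs_color"])
        and (has_audio or not props["needs_audio"])
    ]
    if bandwidth_kbps < 20:  # weak wireless link
        return min(candidates, key=lambda nv: nv[1]["cost"])[0]
    return max(candidates, key=lambda nv: nv[1]["cost"])[0]
```

For a monochrome, audio-less device on a slow link this sketch would fall back to substitution, while a capable device on a fast link would receive the richest variation available.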

6.1.2. UMA Attributes

The MPEG-7 specifications document proposed several descriptive attributes for

multimedia data in general, which were used to reduce the capacity requirements of large

MPEG multimedia videos. The MPEG-7 format is targeted at the problem of browsing a

video file to detect frames of interest to the user. These attributes, called descriptors, were

used to describe various types of multimedia information, in the form of video, audio,

graphics, images, 3D models etc.

When UMA was proposed as a new part of the MPEG-7 requirements, several

new attributes were proposed as additions to enable UMA [15]. With these additions, it

becomes simpler to select any of the different possible kinds of variations, suitable for the

nature of multimedia content in any particular video. For example, we may select speech

to text translation for a foreign language film. Thus we are able to take into consideration

the lendability of the multimedia data to particular reductions.

These descriptive attributes, or descriptors, are summarized in Table 6-2. Using

these descriptors, we shall see, in later sections, how multimedia content descriptions are

obtained and used to select the ideal variation for the situation for the multimedia data, to

be sent to the client.

Table 6-2: UMA attributes.

Descriptor Scheme            Purpose
Published Multimedia Data    Describes the need for and the relative importance of a
Hint-DS                      multimedia data item in a presentation, and may contain
                             hints as to the type of reduction the presentation lends
                             itself to most easily
Media-D                      Standardized descriptive information about the image,
                             video and audio material that helps in UMA, such as
                             resource requirements, functions from network conditions
                             to preferred scaling operations, etc.
Meta-D                       Data about the data: rights, ownership, etc.
Spatio-temporal Domain-D     Information about the source, acquisition and use of
                             visual data
Image Type-D                 Describes display characteristics of images
Region Hint-D                Information about the importance of particular regions
                             within an image, relative to each other
Audio Domain-D               Information about the source and usage of audio data
Segment Hint-D               Analogous to the region hint, but applicable to the
                             temporal dimension

6.1.3. The Importance of UMA

UMA attacks the problem of wireless video delivery at the root, by actually

prescribing attributes for the ideal reductions suited for a video. Thus we get a way of

combining our methodology with the MPEG-7 standard, and we can deliver video over

the wireless network to any type of device. With UMA attributes we get a method to

describe the content of any video, to be able to decide the ideal set of reductions for that

video over any network conditions, to any type of device.

Now the application has to be designed in such a way as to provide content

description for a video at the time of its insertion into the database, and this content

description could be used at the time of streaming to select the ideal reduction for the

video. This reduction algorithm could then be applied before or during streaming, as the

case may be.

We need to think of a suitable data structure to store all the information that we

have about the video in the database. This information will include content description in

the form of an MPEG-7 packaging for the video and ideal reduction set for different

conditions of device modality and network conditions. We also need to consider the

actual process that takes place when a client requests a particular video from the server.

This process will need to handle methods for reduction access.
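The data structure called for above — one record per video, holding its MPEG-7 packaging and the ideal reduction sets for different conditions — might be sketched as follows. The field names and container shapes are assumptions of this sketch; the thesis only requires that content description and per-condition reduction sets be stored alongside the bitstream:

```python
from dataclasses import dataclass, field

@dataclass
class VideoRecord:
    """Illustrative server-side record for one registered video.

    Field names are hypothetical; the requirement is only that the
    MPEG-7/UMA descriptors and the reduction sets for each
    (network class, device class) pair live next to the bitstream.
    """
    video_id: str
    bitstream_path: str
    descriptors: dict = field(default_factory=dict)     # MPEG-7/UMA D and DS
    # (network class, device class) -> ordered list of reduction names
    reduction_sets: dict = field(default_factory=dict)

# Example registration of a video with one descriptor and one
# pre-computed reduction set (all values illustrative).
rec = VideoRecord("news-042", "/videos/news-042.mpg")
rec.descriptors["MediaD"] = {"resource": "video", "bitrate_kbps": 1500}
rec.reduction_sets[("A1", "I A")] = ["substitution", "translation"]
```

At streaming time the server would look up `rec.reduction_sets` with the measured network class and reported device class, avoiding any per-request analysis of the video itself.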

6.2. Conceptual Model

In Chapter 2 we saw that the main factors affecting the quality of video delivery

are the network conditions and the client device modality. We also have the content of

the multimedia data and user preferences as additional factors, and we shall take these

into consideration while designing the architecture of the server. Network conditions and

device modality are the main factors that effectively decide the reduction set to be applied

on the video before transmitting it to the client. We shall consider the inter-relationship

between these factors and the reduction set chosen for the video delivery.

6.2.1. Factors Affecting Delivery

To match the terminal device capabilities and connection quality to the different

variations, we consider a three dimensional structure (the three dimensions being device

modality, connection quality and variation type).

Depending on the quality of the specific connection, and the capabilities of the

specific device, the mapping into the three-dimensional space is decided. A number of

specific descriptors describe the possible combinations of media resources available for

different devices and at different connection qualities. The different variations to the

multimedia data can be specified in the form of indices into the multimedia material. We

can also have meta-data that influences the content adaptation process, by describing the

mapping from connection quality and device modality into required variations.

[Figure: the three axes are device modality (display, memory, MFR, CPU), connection quality, and variation type {substitution, translation, summarization, extraction}; a given modality range and connection-quality range map to a set of possible variations.]

Figure 6-1: Three-dimensional view of MPEG-7 and description schema for UMA

The relationship between the different entities guiding the three-dimensional

mapping is illustrated in Figure 6-1. The following subsections examine these factors in detail.


Additionally, we might have a number of descriptors that can be considered to be

transcoding hints; that is, they can be used to guide the adaptation process. We discuss

some issues that arise in the implementation of such a scheme.

6.2.2 Network Conditions

Different variations in connection quality that may be possible in the network can

be described in terms of connection attributes. The quality of connection ranges from the

strong connection quality of a fixed network, to the weak connection quality of a wireless

network, to loss of connection, which is quite possible in mobile networks. There are three connection attributes:

* Bandwidth: the existing bandwidth of the network, measured in bps (bits per second). This can range from 0.0 bps to 50 Mbps.

* Latency: the connectivity of the network in terms of initial connection time and connection-maintenance time, recorded simply as with or without latency. The importance of latency shows when a large amount of video data needs to be streamed over a network and a latency problem that cannot be ignored makes the quality of the streaming poor. Note that when chunking is allowed in a packet-switched network (as in iDEN), the problem can be ignored.

* BER (Bit Error Rate): the accuracy of the network. Accuracy is an important factor in wireless networks, especially in multimedia applications, since errors in the data are highly visible to the user.

Depending on the values of these factors, the various network connections can be

classified into certain classes of connections, as illustrated in Table 6-3. The class of network connection is one of the major inputs to the function that maps the existing conditions to the appropriate variation for the multimedia data that needs to be transferred over that network.

Table 6-3: Classification of networks based on connection quality.

Network class   Bandwidth and latency specs.              Example network
Class A1        2.0 Kbps - 9.6 Kbps, with latency         Mobitex (RAM), early CDMA PCS systems
Class A2        2.0 Kbps - 9.6 Kbps, without latency      GSM phase 1, CDPD
Class B1        9.6 Kbps - 20 Kbps, with latency
...
Class F2        > 10 Mbps, <= 50 Mbps, without latency    W-LAN, HyperLAN
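The classification in Table 6-3 might be implemented as a simple threshold function. Only the boundary classes shown in the table are reproduced here; the intermediate classes (including the "B2" branch) and the generic "mid" bucket are assumptions of this sketch:

```python
def classify_network(bandwidth_kbps, has_latency):
    """Map measured connection quality to a network class.

    Thresholds follow the rows of Table 6-3 that survive in the
    text; intermediate classes are collapsed into 'mid', which is
    an assumption of this sketch.
    """
    if bandwidth_kbps <= 9.6:
        return "A1" if has_latency else "A2"
    if bandwidth_kbps <= 20:
        return "B1" if has_latency else "B2"
    if bandwidth_kbps > 10_000:          # > 10 Mbps, e.g. W-LAN
        return "F1" if has_latency else "F2"
    return "mid"
```

The BER attribute could be folded in the same way, e.g. by demoting a class when the measured error rate exceeds some threshold.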

6.2.3. Client Device Modality

The capabilities of the device that is currently attempting to view the multimedia

data can be described in terms of the variants Display, Audio, Memory, Maximum Frame

Rate (MFR), CPU and Color.

The capabilities of a device range over all possible combinations of all possible

values of these attributes:

* Display: the display type (resolution, size, refresh rate and color capabilities) of the device. This can range from a handheld device's display to that of a desktop, with a domain ranging from B/W to 64 K color or more.

* Audio: the audio capabilities of the device. The benchmark could be a sound present-or-absent flag, but we could also distinguish different audio qualities, such as CD quality, stereo quality, etc. This is useful in determining whether audio information needs to be downloaded to the device at all. The domain of this descriptor would be {absent, CD quality, stereo quality}.

Table 6-4: Classification of devices based on device modality.

Class of     Display             Audio        Memory         CPU           Color       Eg. of device
device
Class I A    Low resolution      Not present  Low (up to     Low           B/W,        Web phones
             (64x64 - 256x128)                32 MB)                       grayscale   (e.g. Nokia)
Class I b    Low resolution      CD           Low (up to     Low           B/W,
             (64x64 - ...)                    32 MB)                       grayscale
...
Class IV B   High (1024x640      Stereo       High (1 GB     Good (above   64 K
             and above)                       and above)     500 MHz)

* Memory: the amount of storage space available to store buffered multimedia data and run the application. The memory used by a mobile device can generally be divided into fast-access memory, such as RAM and flash memory, and slow-access memory, such as microdrives and hard drives. With more flash memory, we can buffer more information for streaming, and with the presence of slow-access memory (which not all mobile devices have), we can store even more. This descriptor has a domain of less than 32 MB to more than 1 GB.

* MFR: the maximum permissible rate of display of frames on the device. The device's capabilities can limit the rate at which the screen can be regenerated each time a new frame needs to be displayed. This may be related to the refresh rate of the device screen, which depends on the type of screen material.

* CPU: the CPU capabilities of the device. This can be measured as the speed of the CPU or, for computational capability, in a fixed unit such as FLOPS. It mainly affects the complexity of computation that is permitted on the device, and hence the extent of decoding possible on the device.

Depending on the values of these attributes, which describe the capabilities of

wireless devices, the various devices can be classified into certain classes, as described in

Table 6-4.
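The device classification of Table 6-4 might be sketched as a function of the attributes above. Only the extreme classes that survive in the table are reproduced; the thresholds and the "intermediate" bucket are assumptions of this sketch:

```python
def classify_device(resolution, audio, memory_mb, color):
    """Coarse device classification after Table 6-4 (sketch).

    Only the two extreme rows of the table are reproduced; the
    thresholds for intermediate classes are assumptions.
    """
    width, height = resolution
    if (width >= 1024 and audio == "stereo"
            and memory_mb >= 1024 and color == "64K"):
        return "IV B"                     # high-end desktop-class device
    if width <= 256 and memory_mb <= 32:
        # Class I A has no sound; Class I b adds CD-quality audio.
        return "I A" if audio == "absent" else "I b"
    return "intermediate"
```

In practice the client would report these attributes (or pick a class from a menu, as discussed in Chapter 7), and the server would use the resulting class as the device-modality key into the matrix.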

6.2.4. Multimedia Content

The next matter to be considered is the importance that has to be given to the

content of multimedia. This is what can be called the "lendability" of the multimedia data

to certain types of reductions. Multimedia data is rich in content, and different types of

data have parts with different importance, depending on the context in which it is used.

For example, the weather channel may give prime importance to the audio, with the video being less important, while the sports channel may prefer video. A foreign-language film may drop the audio and keep the captions. So all that seems to be

required is some means of describing the content of the multimedia and associating this

description with the video at the server.

This is where MPEG-7 and the additional descriptive data detailed in Subsection 6.1.2 come in. Originally MPEG-7 was designed to be a content-descriptive language. This

concept of content description is extended to include content description specifically

useful for UMA [14], so that it gives us an idea about the variations suggested for any

particular multimedia data item, the relative importance of different multimedia

components within the data, its resource requirements and any other information needed

to describe the content absolutely to the server.

When we have this content description with us, it becomes easier to use this to

select the set of variations (reductions) ideal for that particular content. These suggestions

are also included as part of the content description as transcoding hints [14]. These

descriptors can be used to affect the nature of the variations suggested for the multimedia data.


We can express this as a restriction on the usage of reductions by the multimedia

server. So on the implementation side, when the server gets a request for a particular

video from a client, the server uses the network conditions and device modality of the

current situation to decide the full set of variations suitable for that situation. Then the

Multimedia Content (MC) comes into the picture to restrict the full set of available

variations to a smaller set of reductions that are permitted on that content.

6.2.5. User Preferences

The final issue that has to be considered is the user's opinion of all these changes

being made to what had originally been requested. The "ideal" variation may be totally in

contrast with what the user considers essential to make sense out of the data. The user

may, for example, find audio data more important as compared to video data and might

not like the idea of sound being cut out, to fit the result within the network conditions. If

the ideal variation derived from the existing network conditions is speech-to-text

conversion, the user will find the result dissatisfactory.

To take the preferences of the user into consideration, we introduce the concept of

loss profiles. The user gives a set of limits within which the variation of the stream from

the desired quality should stay as much as possible. This would help in making a choice

between different types of reduction possibilities available.

We can extend this concept a little more to include the data relevant to that user,

in the form of a Loss Profile (LP). This would include data in the form of client

preferences, the network conditions that the client normally operates within, and the

modality of the device that the client normally works with.

We can move further on the lines we have followed above for multimedia content,

and impose another restriction on the final result obtained, depending on the user Loss

Profile, to get the final set of restrictions to apply on the multimedia data.

6.3. The Reduction Set and the Restrictions on It

The Transcoding Function describes the mapping between the different static and

dynamic variations available with MPEG-7, and their applicability, i.e., the (Connection,

Modality) combination situation in which each is relevant. The mapping may be

influenced by loss profiles that prescribe user preferences in variation options, and also

by the content of multimedia data. These options are discussed in later sections.

Basically, the mapping of (Connections, Modality) to Variations can be

summarized in Table 6-5. Now we add in the effect of the nature of multimedia content,

and the user loss profiles, in the form of the following equations.

Let C be the range of connection quality and D the range of device modality into which the current conditions fall. The resultant region of permitted variations, V, into which the system maps can be represented as a function of C and D:

    V = f(C, D)

Let MC be a Multimedia Content and VMC be the possible set of applicable variations for this particular multimedia data, given by

    VMC = ∩MC(V)

where V is the set of variations obtained so far and ∩MC is the variation restriction on V imposed by MC.

Table 6-5: Mapping of (connections, modality) to variations.

            Class I A                              ...    Class IV B
Class A1    {substitution} = {text for images},
            {translation} = {video-to-image}
Class A2    {substitution} = {text for images},
            {translation} = {video-to-image}
Class B1    {substitution, translation} = {text for
            images, video-to-image, voice-to-text,
            color thumbnail generation}
...
Class E2    {summarization} = {voice scaling}             No reduction

Let VF represent the region of finally permitted variations when the user loss profile, LP, is taken into consideration. Hence VF is a restricted version of VMC, as specified by LP. This can be expressed using the following equation, where ∩LP is the loss profile restriction on VMC:

    VF = ∩LP(VMC)

Effectively, when a client tries to get some multimedia data from the server, the

server will gauge the connection quality of the network. The server will also try to judge

the modality of the client device. This can be achieved by having the user select some

class of device from a menu.

Then the server will use this range of C and D to map into the C x D table and the

result will be V, a set of variations, which may be applied to the multimedia data to fit it

within the physical conditions.

Then the server will use MC to further restrict the set of possible variations suited

for that particular multimedia data. Finally the server will use LP to choose, from among

the suggested variations, those that are acceptable to the user. This will yield another set

of variations, which may be applied to the multimedia content.
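The three restriction steps just described — map (C, D) into V, restrict by the multimedia content MC, then restrict by the loss profile LP — can be sketched as plain set operations. The container shapes and argument names are assumptions of this sketch:

```python
def permitted_variations(cxd_table, network_class, device_class,
                         mc_allowed, lp_acceptable):
    """Compute VF = ∩LP(∩MC(f(C, D))) as set intersections.

    cxd_table maps (network class, device class) pairs to the set
    of variations V suitable for those physical conditions.
    """
    v = cxd_table[(network_class, device_class)]   # V = f(C, D)
    v_mc = v & mc_allowed                          # restriction by MC
    return v_mc & lp_acceptable                    # VF, restriction by LP

# Illustrative values: on a weak A1 link with a low-end device,
# only a translation survives all three restrictions.
table = {("A1", "I A"): {"substitution", "translation"}}
vf = permitted_variations(table, "A1", "I A",
                          mc_allowed={"translation", "summarization"},
                          lp_acceptable={"translation"})
```

Modelling both restrictions as intersections keeps the order of application irrelevant at this level; ordering only matters later, when the chosen reductions are actually performed on the bitstream.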

On the implementation side, the user LP can be obtained from the user, by having

the user move a set of sliders that indicate the acceptable quality of the multimedia

needed [16]. These sliders can represent the values for different relevant multimedia

information, which can be selected from the descriptors specified for MPEG-7. For

example, we can have sliders for color quality varying from B/W to 64 K color, audio

quality varying from text captions to stereo quality audio etc.

When the user selects the values of these attributes, they form an attribute for user

loss profile (LP). This is used to restrict VMC to VF. The user's loss profile can be

obtained dynamically at the time of connection, and also stored as part of the user's record at the server.


We can actually do most of this process statically at the time of registration of the

video into the database. As the video is being registered into the database, the set of

MPEG-7 descriptors could be created for that video and stored along with the video into

the database. It should also be possible to insert a video with pre-defined descriptors into

the database.

The various variations of the video are also statically created at the same time and

stored along with the video. This raises the issue of the amount of storage space required

for each video, along with its different variations. The solution to this would involve

some sort of complex indexing scheme that does not store the variations separately, but

instead stores indices into the existing video bitstream, to dynamically assemble the particular

bitstream required for any particular variation of the video. These indices can also be

created and stored with the video at the time of registration.

The mapping of all possible combinations of (C x D) to the respective set of V

can also be created and stored statically at registration. Finally, since the content of

multimedia does not change with the user, the restrictions imposed by MC on V can also

be stored statically at registration. So finally, after registration, what we have is a new

video with its accompanying descriptors (MC), table of C x D, and the indices for all

possible variations with that video.

Online, when the client establishes connection with the server, the client's Loss

Profile (LP) is downloaded, along with information about the network conditions (C) and

the device modality (D) of the client. When the client asks for a particular video, the set

of VMC is determined, and then the user's LP comes into the picture, to determine VF for

that video. The ideal order of performing the final set of variations is determined

dynamically, and then the indices are used to select the portions of the bitstream which,

when assembled together and streamed to the client, would construct the perfect video as

per all the restrictions and user requirements.

This scheme raises a number of issues. Firstly, we need to remember that the ultimate goal is not to impose so many restrictions that the final result the user gets falls far short of the original request. There has to be a

way of ensuring that within the given limitations, the maximal quality video is delivered

to the users.

Secondly, imposing all these stages on the encoding and delivery of the data makes the scheme increasingly complex. We shall discuss these issues and possible solutions in the following sections.

6.4. Extensions on the Matrix Theme

We can extend the matrix idea and the concept of having a reduction in each

element of the matrix, to having a set of reductions for each element of the matrix. As we

saw in Chapter 4, Section 4.5, the order of applying the reductions does matter. So how

do we decide the ideal order in which we apply the reductions? Some orders are, of

course, ruled out, based on which reductions are done offline and which are done online.

But we can experiment with the different options available to us, and obtain an ideal

order of performing reductions.

The next issue is how, with a limited number of reductions, we can obtain as many different combinations as Section 6.3 illustrates we will need. The solution is fine-tuning. We can use different degrees of reduction; after all, color reduction does not necessarily mean going from 16 K color directly to grayscale. We can have

different levels of reduction, and combine all these in all possible ways that make sense,

to get different semantics of the reduction set. The first level semantic can have the basic

reductions, with no complexity: a total order semantic. The second level semantic can

have variations on the first level semantic, by combining different possibilities and

different ranges of reductions: a partial order semantic. We can thus tighten the scheme further and further until we have the best possible reduction set for any situation.
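The idea of graded reductions can be sketched as follows: each reduction is offered at several levels, a semantic is an ordered list of (reduction, level) pairs, and tightening steps one reduction down a level. The reduction names and level scales are illustrative assumptions:

```python
# Each reduction offers a scale of levels, from richest to poorest.
# Names and levels are illustrative assumptions of this sketch.
LEVELS = {
    "color": ["64K", "256", "16", "grayscale", "B/W"],
    "frame": ["30fps", "15fps", "5fps"],
}

def tighten(semantic):
    """Produce the next, slightly stronger semantic by stepping the
    first reduction that can still be tightened one level further."""
    out = list(semantic)
    for i, (name, level) in enumerate(out):
        scale = LEVELS[name]
        pos = scale.index(level)
        if pos + 1 < len(scale):
            out[i] = (name, scale[pos + 1])
            return out
    return out                      # already at the tightest setting

s = [("color", "64K"), ("frame", "30fps")]
s = tighten(s)                      # first step: reduce color one level
```

Repeatedly applying `tighten` walks through progressively stronger semantics, so the server can stop at the first one that fits the situation, leaving minimal gap between the delivered and the requested quality.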

This approach will give us something like a Figure of Merit for the application.

After all, our primary goal is not to perform reduction in any way to achieve streaming,

but to satisfy the user. We must not only fit the video into the situation as quickly as

possible, but also as perfectly as possible, to leave minimal gap between the achieved

video and the one required by the user.

We can also attempt to fit these various semantics into the matrix, so that the

decision as to which semantic to use can be taken at runtime. Of course, this complicates

things somewhat and raises various implementation issues. These are discussed in

Chapter 8.


So far we have seen how delivery is to be achieved theoretically, and the needs of

the process. Next it is time to think about the architecture of the MPEG-7 wireless/mobile

video server. We need a way to choose from among several possible reduction-sets,

depending on the conditions of connection quality and client device modality. Next we

need a method to limit the reductions in these reductions sets, based on multimedia

content and user preferences. When we think about this, we also need to think about a

way to package the MPEG-7 descriptors along with each video. Finally, we need to think

about a way to perform the reductions and to store that information also at the server,

ready for streaming to the client.

7.1. The Matrix

We will have an MPEG-7 server that contains the database of multimedia data.

The server will also contain Table 6-3 and Table 6-4. We can have a matrix structure to

determine the ideal reduction set for the situation. This matrix will map (Connection

quality, Device modality) to (Variations). To determine the type of client device, we can

have a menu option on the client that allows the user to pick one of several types, and this

information is sent to the server at connection. The server should also be able to measure

the network connection quality at the time. After ascertaining the class of network and

client device, the server will accordingly use the matrix in Table 6-5 to decide the

appropriate reductions needed to be performed. Each element of the matrix will contain a

pointer to the most appropriate reduction set for the situation specified.

This matrix can contain pointers to the actual data, or references to the functions

that must be applied on the data before sending it to the client.
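The second option — matrix cells holding references to reduction functions — can be sketched directly, since functions are first-class values. The reduction functions and the byte-level stand-in transformations below are hypothetical placeholders:

```python
# Matrix cells hold the reduction pipeline to run on the bitstream
# before streaming. The reductions below are trivial stand-ins.
def drop_color(data):  return data.replace(b"RGB", b"Y")   # placeholder
def drop_audio(data):  return data.replace(b"AUD", b"")    # placeholder

MATRIX = {
    ("A1", "I A"):  [drop_color, drop_audio],   # weak link, low-end device
    ("F2", "IV B"): [],                         # no reduction needed
}

def prepare(data, network_class, device_class):
    """Apply, in order, every reduction the matrix prescribes for
    the current (connection quality, device modality) cell."""
    for reduction in MATRIX[(network_class, device_class)]:
        data = reduction(data)
    return data
```

Storing callables rather than pre-reduced data trades storage for runtime work; the indexing scheme of Section 7.2 is the complementary approach that avoids runtime processing altogether.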

7.2. The Secondary Indexing Scheme

Once we have the matrix, the question then arises of how to store the different

reductions in the matrix elements. Once the server gets the request from the client for a

particular video, the server has no time to wait and search for the appropriate parts of the

video file that need to be streamed with or without preprocessing.

To eliminate the search time, we can have a second level of reference into the

video file represented by an index into the video file. This index will store pointers to

the different chunks or parts of the video file that are relevant for each reduction applied

on that video. So when the request for the video comes in, all the server has to do is

follow the index, using the already determined reduction set to index into it, and obtain

the exact data with which the server has to work. Thus the server does not have to waste

time parsing the file at runtime for start and stop bits.

This index will be secondary to the matrix. The server will primarily use the

matrix to determine the reduction set ideal for the situation at hand. Then the server

merely uses the index to make the job of performing reductions on the file easier. The

main job of the server now reduces to checking the conditions of connection quality and

client device modality, and accordingly deciding the ideal reduction set, and then using

this reduction set as the key for the index into the video file. Then the server is able to

obtain the exact data with which to work.

The question now arises as to when this index will be created, and who will create

it. The same question also crops up in connection to the matrix: who will fill it and when?

Obviously, since the server is the entity that has the connection with the video database,

and the creation of the matrix and the index is a one-time step to be performed before any

client has access to the video, we can see that the onus of the creation of the matrix and

index would lie with the server.

7.3. The Introduction of the Packager

We can shift the responsibility of creating the index and the matrix from the main

server to a new entity, the packager. The server imposes the restriction that before the

clients can see a video on the video list, the video needs to be registered with the

packager. The packager will create a matrix that will be accessed for all video

registrations. When a video is registered with the database, the packager will create

MPEG-7 descriptors for the video and create the index for the video. This index will refer

to the parts of the video file that need to be accessed by the server for each particular

reduction set.

Thus when a client requests a particular video, the server uses the matrix to decide

the ideal reduction set for the video. Then the server uses the index to pick the relevant

parts and optionally process them before sending them to the client. The packager does

the following main tasks offline, at registration:

* Creation of MPEG-7 (and UMA) descriptor values for the video,

* Creation of the ideal matrix for that video in particular, with ideal reductions for each

level of device modality and network conditions,

* Creation of indices for each element of the matrix. Each index is basically a linked

list, with links to each piece of the video that needs to be streamed after optional

processing stages, and

* Creation of the mapping table of (C x D) to V, including the restrictions imposed by

MC, so that we get VMC.
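The packager's offline tasks listed above can be sketched roughly as follows. All helper names, the toy descriptor values, and the toy index format of (offset, size) chunks are hypothetical, not the thesis's actual data structures.

```python
# Offline registration sketch. All names and values are illustrative.

def create_mpeg7_descriptors(video_path):
    # The real packager would prompt for / extract MPEG-7 and UMA
    # descriptor values; here we return a fixed toy description.
    return {"title": video_path, "motion": "high", "color_importance": "low"}

def build_reduction_matrix(descriptors):
    # Ideal reductions per (connection class, device class), limited
    # by the multimedia content (MC) described by the descriptors.
    reductions = ["frame_dropping"]
    if descriptors["color_importance"] == "low":
        reductions.append("color_reduction")
    return {("low", "handheld"): reductions, ("high", "laptop"): []}

def build_index(video_path, reductions):
    # One index entry per reduction set: the chunks of the file to be
    # streamed for that variation, as (offset, size) pairs.
    return [(0, 1024), (4096, 2048)] if reductions else [(0, 8192)]

def register_video(video_path):
    """Run all offline packager steps for one video at registration."""
    descriptors = create_mpeg7_descriptors(video_path)
    matrix = build_reduction_matrix(descriptors)
    indices = {cell: build_index(video_path, r) for cell, r in matrix.items()}
    return descriptors, matrix, indices
```

The key design point is that all of this happens once, at registration, so none of it costs the client anything at request time.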

Figure 7-1: Overview of the process

When the server gets a request from a client for a particular video, the server

checks for the ranges of C (Connection) and D (Device modality), as discussed in Section

3. This gives the server V, the set of variations according to the physical conditions, from

the table, and following from that, VMC, depending on MC (Multimedia Content). Then

the server uses LP (the user Loss Profile) to restrict the variations provided by V, forming



VF. Finally the server gets an acceptable set of variations. The server then uses algorithms

to decide the order in which these variations must be applied to the data, which is used by

the server to call the appropriate function that will stream only the required data, after

applying the variations, to the client. The entire process is summarized in Figure 7-1. The

indexing scheme is used along with this process to assemble portions of the information

needed for any particular variation of the video. When the server chooses a particular

variation to be applied, it will follow the index and stream the appropriate bits to the

client, where these will be reassembled to form the intended reduced video according to

the network quality, device modality, content of the multimedia requested, and the user's loss profile.
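The narrowing of the variation set can be sketched as follows. The set names V, VMC and VF follow the text; the mapping and the example restriction values are hypothetical.

```python
def acceptable_variations(C, D, mapping, MC_restrictions, loss_profile):
    """Narrow the variation set: physical conditions give V, the
    multimedia content restricts it to VMC, and the user's loss
    profile (LP) restricts it further to VF."""
    V = set(mapping[(C, D)])            # variations for (connection, device)
    VMC = V - set(MC_restrictions)      # drop variations the content forbids
    VF = VMC - set(loss_profile)        # drop variations the user rejects
    return VF

# Hypothetical example: a poor connection to a PDA-class device.
mapping = {("low", "pda"): {"frame_dropping", "color_reduction", "size_reduction"}}
VF = acceptable_variations("low", "pda", mapping,
                           MC_restrictions={"color_reduction"},  # color matters here
                           loss_profile={"size_reduction"})      # user refuses tiny video
```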


7.4. Practical Issues

One of the main issues that have to be dealt with in the implementation of this

architecture is the efficiency of storage. The prominent downside of using several

variations, of the same multimedia data, would be the redundancy in storage of the same

data, in different ways, to accommodate all these variations.

One way of reducing, or possibly eliminating redundancy, that was suggested in

the previous section, would be to not store different versions of the same data separately,

but have a sophisticated indexing structure and scheme. This indexing scheme would

have several indices into the base multimedia data file. Depending on which variation

was selected as the best one for the situation, the corresponding index into the data would

be selected. This would then access, in the appropriate order, only that data relevant to

the selected variation.

Another important issue to be considered, when we play around with dynamic

variation of data, would be the quality of the data delivered. It may not be acceptable to

the users to have their perfect video, with CD quality sound, suddenly change into a text

only stream, because of fluctuations in the network.

The solution to this problem would be simply to ask the users what is acceptable

to them, and stay within those limits as much as possible. This may be represented in the

form of a loss-profile. This would include data from the users as to what is acceptable to

them in case of changes in network conditions.

When we mention jumping from one part of the data to another using indices, we

would also have to consider the different streams of data going over the network

simultaneously (audio, video and possibly text captions/translations). These have to be

synchronized throughout the stream. The design may dictate the dropping of audio data if

the network deteriorates. If the network comes back to its earlier high quality, we have to

restart audio data streaming. However the audio data cannot pick up where it left off. It

has to match with the part of video being streamed currently.

We may devise any sophisticated scheme to determine the perfect reduction

strategy for a video under given conditions of network and device modality. However all

this would be of no use if the user is not happy with the results. The acceptability of a

video quality is highly subjective and personal. The measurement of acceptability of

perceived video quality is a difficult task that involves user participation, and was not

performed as part of the tests mentioned in Chapter 9.


Now that we have set the architecture of the application, we need to decide on

implementation issues, such as how we should implement the matrix, the indexing

scheme, the basic client-server structure, the packager, and other parts of the application.


8.1. The MPEG Player

The MPEG player is basically an MPEG-2 decoder freely available off the web

[17]. We would have to add some coding to make it visually attractive and user-friendly,

since it would be on the user's side of the network, but otherwise it would do the main

job of picking up the bytes from the video file and displaying the video.

Again, as mentioned in Chapter 3, the addition of streaming into the equation

would mean modifications to the client. The client would have to be made aware of the

fact that the video file is not entirely present on the client, but is still being brought in.

This means that if at any point the client ran out of data, the client would have to wait and

request the server for more data, and could not proceed until a required number of bytes

were buffered. This might mean poor streaming, but the client really does not have a choice.


Further, we might have to consider the reductions introduced into the video. If

these make the video differ in any way from the original MPEG format, we would have

to make the client aware of this and the client should accordingly adjust the decoding

process to accommodate the changes. For example, with color reduction, we would be

removing a part of the original video before transmitting it from the server. The client

would have to accommodate this by padding the missing bytes with zeroes.

8.2. Client-Server Setup

With the client ready, the focus now shifts to the server. The server will need to

be able to receive requests from the client and process them. This processing will involve

testing the connection quality, and getting information from the client about the client

device modality. Then the server will have to select the appropriate reduction set from the

matrix. A more sophisticated server would also select the appropriate order of

performing the reductions, and the degree of reduction for each type. The next

step is to use the reduction set to index into the original video and process it. If necessary,

the appropriate functions would have to be called to process the video bitstream before

sending it.

The server would work with packets of data. The client would have to calculate

how many bytes to buffer before beginning to play the video. For doing this, the client

would need to know the size of the video file beforehand. Hence the server will have to

estimate the target size of the video file after processing and send that information to the

client before anything else, so that while the server is preparing the data to be streamed,

some of the work can be passed on to the client.

8.3. Streaming

At first glance, streaming does not seem difficult. We use the formula presented in Chapter 3,

to calculate the number of bytes we would need to buffer before we start playing the file.

Problems could occur if the network connection quality does not remain the same

throughout the transfer of data. The client could still starve for data, which is to be

avoided. The best we can do in such a situation is to constantly monitor the network

connection quality, and update the formula accordingly. Since on the client side we are

using up one half of the time for downloading and the rest for playing, we can offset this

ratio in favor of the downloading for a period of time, until the formula gets balanced

again. This would avoid stalling the client in the middle of play for buffering extra data.
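One common form of such a startup-buffer calculation is sketched below; the actual formula from Chapter 3 may differ, and the figures used are illustrative. The idea is to buffer just enough that the bytes still arriving during playback cover the rest of the file.

```python
def bytes_to_buffer(file_size, bandwidth, duration):
    """Startup-buffer sketch: with `bandwidth` in bytes/sec and a
    clip of `duration` seconds, buffer enough up front that the
    bytes downloaded during playback cover the remaining file."""
    return max(0, file_size - bandwidth * duration)

# A 20 MB, 4-minute clip over a 50 KB/s link needs an 8 MB head start.
needed = bytes_to_buffer(20_000_000, 50_000, 240)
```

When the measured bandwidth drops mid-stream, the server can recompute this with the new bandwidth and temporarily shift the download/play ratio in favor of downloading, as described above.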

Another situation that can crop up is disconnection. This is a normal situation

where mobile networks are concerned. If the disconnection were only for a fraction of a

second (long enough to break the client-server connection, but too short for the user to

realize that there was a problem), this means trouble. Would the server have to restart

sending data from the beginning? If not, how would the server know where to restart? We

could probably recognize the packet boundaries, and the number of packets sent at both

the client side and the server side. When the client tries to reconnect to the server, we

should have the client send over the number of packets successfully received, so that the

server can restart the transfer of data.
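The reconnect handshake described above might be sketched as follows; the packet size and helper names are hypothetical.

```python
# Resume-after-disconnection sketch. PACKET_SIZE and all names
# are illustrative, not the actual protocol constants.
PACKET_SIZE = 1024

def resume_offset(packets_received):
    """The client reports how many whole packets it received before
    the disconnection; the server restarts at that byte offset."""
    return packets_received * PACKET_SIZE

def serve_from(data, packets_received):
    """Return the remaining bytes to stream after a reconnect."""
    return data[resume_offset(packets_received):]
```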

8.4. Reduction

Without doubt, reduction is one of the trickiest issues to be handled in

implementation. We have selected three types of reduction to implement, and each of

these should be fine-tunable where possible, so as to achieve as many distinct quality

levels as possible.

The first reduction, and probably the easiest, is frame dropping. Frames in an

MPEG-2 stream are recognized by frame start codes. If we scan the stream for the start

codes, we can access the start of each frame. As we skip from frame to frame by scanning

the stream that we send to the client, we can skip the frames that we choose to skip. We

can also use the index efficiently to reduce the time used up in scanning through the file.
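A minimal sketch of this scan follows. MPEG-2 marks each picture with the start code 0x00000100; a real frame dropper would also respect I/P/B frame dependencies, which this toy version ignores.

```python
def picture_offsets(stream):
    """Find byte offsets of MPEG-2 picture start codes (0x00000100)."""
    code = b"\x00\x00\x01\x00"
    offsets, pos = [], stream.find(code)
    while pos != -1:
        offsets.append(pos)
        pos = stream.find(code, pos + 1)
    return offsets

def drop_every_other_frame(stream):
    """Toy frame dropper: keep pictures 0, 2, 4, ... by copying the
    bytes between alternate start codes (ignores frame dependencies)."""
    offs = picture_offsets(stream) + [len(stream)]
    kept = stream[:offs[0]]  # headers before the first picture
    for i in range(0, len(offs) - 1, 2):
        kept += stream[offs[i]:offs[i + 1]]
    return kept
```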


The next reduction is size reduction, which reduces the visible size of the image

on the screen. To achieve this, we effectively need to skip every nth pixel, or rather, since

we have two dimensions to consider, we need to skip every (√n)th pixel in each

dimension (approximately) to achieve the goal. The encoding process being too

complicated to achieve this on the fly in a reasonable amount of time, we opt to perform

this reduction offline. So during registration, the size reduced files are generated for the

registered video. The process to reduce the size involves decoding the original video at

the server end, taking off every nth pixel, before reencoding the video. The client has no

awareness that the video has been reduced to fit the conditions, and plays the video normally.


Of course, with memory restrictions at the server, we cannot do too much fine-

tuning, and can create only a limited number of versions of the video.
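The offline subsampling step can be sketched on a toy frame as follows; real size reduction operates on the decoded MPEG frames before re-encoding, and this list-of-rows frame representation is purely illustrative.

```python
def downsample(frame, n):
    """Keep every nth pixel in each dimension. `frame` is a toy
    list of rows of pixel values; the real path decodes the video,
    subsamples each frame, and re-encodes, all at registration time."""
    return [row[::n] for row in frame[::n]]
```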

The next to be considered is color reduction. The process of removing color is

reduced to removing all U and V (chrominance) information from the video, so that the

client gets a grayscale video. Of course, the client would normally try to search for the

chrominance information and process it, so the client would have to be made aware of

the change in the video that has been sent. The client would then be able to substitute

neutral values for the chrominance information, or ignore it in calculations.

For the purpose of fine-tuning, we have another option open to us: remove only

part of the chrominance information. This will reduce the information to be sent over, and

also affect the quality of the resultant video accordingly. Again, the client will have to be

made aware of this, so that substitution can be made accordingly.
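A toy sketch of this color-reduction agreement between server and client follows. Real 4:2:0 chrominance planes are quarter-sized; here all planes are the same size for simplicity, and the neutral chrominance value 128 is an assumption of this sketch.

```python
def to_grayscale(y_plane, u_plane, v_plane):
    """Server side: drop the chrominance planes entirely and stream
    only luminance, yielding a grayscale video at the client."""
    return y_plane

def client_reassemble(y_plane):
    """Client side: having been told the chrominance was removed,
    substitute the neutral value 128 so the decoder still has all
    three planes to work with (toy same-size planes)."""
    neutral = [[128] * len(row) for row in y_plane]
    return y_plane, neutral, neutral
```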

Of course, we cannot do all the processing for this online; a partial route will be

chosen to make this happen.

8.5. Indexing and the Matrix

Achieving the matrix does not seem to be a very difficult task. All we have to do

is have a lookup table of some sort (we can use an index to do this) that takes a pair of

conditions of (connection quality, device modality) and returns a set of reductions. This

set of reductions is used to peek into the index and retrieve the data that is to be processed

and streamed to the client.

We can simplify this further once we decide what kind of elements the index will

have. The index is basically implemented as a linked list of pointers into the original

video file (or the reduced video file, if size reduction is performed as the first reduction).

Each element of the linked list contains the position of the next chunk of data in the file

and the size of the chunk. All the server has to do is locate the appropriate index entry,

start picking up the chunks from the file as dictated by the linked list elements in

succession and stream them to the client.
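Following one index entry can be sketched as follows. This is a toy in-memory version; each list element is an (offset, size) pair as described above, standing in for the linked-list nodes.

```python
def stream_chunks(video_file_bytes, chunk_list):
    """Follow one index entry: each element gives the (offset, size)
    of the next chunk to stream for the chosen reduction set."""
    for offset, size in chunk_list:
        yield video_file_bytes[offset:offset + size]
```

With this in place, the server's runtime work really is just matrix lookup, index lookup, and sequential chunk delivery.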

So effectively, the matrix elements will be pointers to the appropriate linked list.

The job of locating the index entry appropriate for the situation is shifted from the main

server to the packager, so that the server does not waste too much time at runtime.

8.6. Illustration of the Execution Process

We shall cover the execution process in two subsections: one for the client and

one for the server.

8.6.1. The Client

The client is not the main part of the implementation, but it requires a good

interface, because that will be visible to the user. Figure 8-1 shows an image of the client.

When the client is started, it shows a list of servers previously connected to on the left of

the display screen. The client can connect to any of these by simply double-clicking on

the server, or can connect to a new server by selecting the corresponding menu option.


Figure 8-1: The client (screenshot). Callouts mark the preferred-reductions options, the
play options toolbar, the playing screen, the server connection options, the server list,
the video list, and the download progress bar.

Then on the right of the display screen we can see the list of videos available at

the server to which we are connected. We can play any of these by double-clicking on

it. The download status can be seen on the progress bar at the

bottom of the client window. When the client buffers up enough data to start playing the

video, the display window is resized to the size of the video and the video starts playing.

The play progress can be seen in the slider bar right below the menu. At this stage the

client looks as in Figure 8-2.

The client menu has options so that the user can select the type of device the

video will be played on and the type of network the device is trying to connect to the

server over. There are other options to control the playing and connect to and disconnect

from the server. There are shortcut buttons available for some of these on the toolbar.

Figure 8-2: The client while playing the video


8.6.2. The Server

The server does not need a fancy interface, but it must at least have the

set of necessary menu options that enable the system administrator to deliver video of

desired quality to the client. The server interface is shown in Figure 8-3.

The server has options on the menu to add new videos to the current video list,

and when this is done, the packager registers the video by creating all indices for the

video. Before registering the video, the video has to be put into the correct location, so

that the packager can find it. This is what is shown in Figure 8-3: the menu option for

adding a new video has been selected, and the dialog box prompting for the name of the

video appears.

Figure 8-3: The server (screenshot). Callouts mark the port number setting, the start
and shut down options, and the dialog box that pops up when the option to add a new
video to the database is selected.


When the client connects to the server, the server sends the video list to the client

and awaits the choice of video. When this is received, the server picks up the appropriate

index and starts delivering the video.

The server also has several menu options for initializing the values of IP address

and port, displaying the current list of videos registered with the server, starting the server

and shutting down the server.


Now that we have completed the implementation details, we shall perform some

experiments to illustrate the performance of the server. The experiments shall serve to

demonstrate not only the performance of the reduction schemes, but also that of the

matrix and the indexing scheme.

9.1. Description of the Experimental Setup

The experimental data was collected in the following manner. A particular video

was selected as a case study; it can be found at [18]. This video was repeatedly

downloaded and played under different circumstances, to measure the performance of the

scheme. The experiments were divided into two sections, each illustrating one aspect of

the scheme.

The first set of experiments illustrated the use of the matrix and the index. These

experiments were divided into two main sections, those with the index and those without.

Within each division, the performance of the server was measured under different

schemes of reduction applied to the video.

The reduction schemes were categorized into frame dropping, size reduction and

color reduction. In the experiments without indexing, the scheme for application of these

reductions was a carefully selected algorithm that yielded the best results for all

possibilities of network conditions and device modality. In the experiments with

indexing, the same algorithm was used to construct the index mentioned in Section 5, and

this index was used later to access and stream the video to the client. The overhead of

time and space in the construction of the index was also measured.

To further illustrate the importance of a good indexing scheme, three separate

levels of indexing were considered:

* A 1 x 1 index, which just performs the basic streaming, without taking into

consideration the network conditions and device modality,

* A 2 x 2 index, which has different combinations of two possible network condition

levels and two possible device modality levels. This index combines different types

of reduction for better streaming at the lower levels of these conditions.

* A 3 x 3 index, which is similar to the 2 x 2 index, with one more level of device

modality as well as network conditions.

The second set of experiments was performed to illustrate the flexibility of the

architecture in terms of planning the performance according to the user's needs. These

experiments zoomed into one cell in the matrix, and varied the buffering time that the

user was forced to wait for before the video started playing. The results compared the

corresponding degradation in the quality of the video.

This set of experiments shows that the actual process of reduction is highly

mathematical and the results are very predictable. Hence the system administrator, in

charge of the server, can accordingly tune the video size to the buffering constraints of

the client, and in the process, the quality of the video changes. If better quality were

desired, the user would have to tolerate the longer buffering time.

9.2. Results

The video selected for the case study was a 4 minute long video clip of 20.361

Mbytes, that contained variations in factors such as extent of movement in the video and

importance given to color and clarity of the video.

The results for the lowest level of sophistication in the reduction scheme are

shown in Table 9-1 and Table 9-2, one without indexing and one with indexing. For both

schemes we used a reduction policy most appropriate for most kinds of videos: 25% size

reduction, followed by frame dropping. The results of Table 9-1 were obtained after a

straightforward application of the available reductions for the current network and device

modality conditions, all being performed at execution time (without the use of the

indexing scheme to get pointers to the relevant frames). Table 9-2 also shows the time

taken to create the index entries for the video, as well as the space overhead required for

storing the index.

Table 9-1: Performance of streaming without indexing, and with no matrix
Total time taken to download video

Table 9-2: Performance of streaming with indexing, and with no matrix
Total download time (sec): 1620
Overhead in index creation (space/time): 45124 bytes / 1360 sec

The results for the next level of sophistication in the reduction scheme are shown

in Table 9-3 and Table 9-4, the former without indexing and the latter with indexing. The

results show that though the indexing adds overhead in terms of time and space, the

improvement in the streaming quality cannot be ignored. The point to remember is that

the index generation is done offline and the index is stored on the server, and hence the

process is not at the client's cost.

Table 9-3: Performance of server without indexing, with 2 possible levels each of
network conditions and device modality
                           Device type 1              Device type 2
Network Class 1
  Download time (sec)      1023 (size reduction)      3685 (frame dropping)
Network Class 2
  Download time (sec)      363 (frame dropping,       3906
                            size reduction)

Table 9-4: Performance of server with indexing, with 2 possible levels each of network
conditions and device modality
                           Device type 1    Device type 2
Network Class 1
  Download time (sec)           934             2986
Network Class 2
  Download time (sec)          1567             3910
Space/time overhead: 49345 bytes / 2106 sec

The results for the highest level of sophistication in the reduction scheme are

shown in Table 9-5 and Table 9-6. These clearly distinguish the scheme with a good

index from the scheme without one. The scheme without the index suffers from the

complexity of the algorithm that has to perform reduction at runtime, and does not have

the helping hand of reduction hints stored previously in the indexing scheme.

These experiments clearly illustrate the importance of not just performing

reductions on the streamed video, but getting aid in performing these reductions with a

sophisticated indexing scheme that will allow reduction of the server's access time to the

video and processing time of the video.

Table 9-5: Performance of server without indexing, with 3 possible levels each of
network conditions and device modality
                        Device type 1          Device type 2          Device type 3
Network Class 1
  Download time (sec)   876 (frame, size,      1253 (frame, size,     3218 (frame, size)
                         color)                 color)
Network Class 2
  Download time (sec)   1106 (frame, size)     1378 (frame, size)     3689 (frame, color)
Network Class 3
  Download time (sec)   1067 (frame, size)     3889 (size)            3938

Table 9-6: Performance of server with indexing, with 3 possible levels each of network
conditions and device modality
                        Device type 1   Device type 2   Device type 3
Network Class 1
  Download time (sec)        556             896            2989
Network Class 2
  Download time (sec)        678             945            3105
Network Class 3
  Download time (sec)        697             801            3916
Space/time overhead: 57089 bytes / 4576 sec

Figures 9-1 through 9-3 show the performance of the scheme with the 3 x 3

matrix. It is evident that the scheme with the matrix performs better than without the

matrix. Also, the indexed scheme in general performs better than the scheme without the

index. Of course, there are a few anomalies, caused by the fact that some reduction types

are helped more by the indexing than others.

Figure 9-1: Performance of the scheme with client device Class 1 (download time in
seconds, with and without the index, for each network class)

Figure 9-2: Performance of the scheme with client device Class 2

Figure 9-3: Performance of the scheme with client device Class 3

The next set of experiments yielded the results shown in Table 9-7 below. The

results are shown graphically in Figure 9-4.

Table 9-7: Performance of the scheme for a fixed matrix element
Percentage of video not delivered (%) Buffering time (seconds)
0 1081.81
10 544.65
20 281.589
30 207.623
40 128.445
50 53.938




Figure 9-4: Performance of the scheme for a fixed matrix element, with variation in
buffering time

In this chapter we illustrated the performance of the scheme. Let us summarize

our achievements, and see in what ways the project can be improved.


Chapter 8 illustrated the implementation of our scheme, and Chapter 9

demonstrated the performance of the implementation. We shall conclude with a summary

of the entire project, and suggestions for future improvements.

10.1. Achievements of this Thesis

We saw how streaming was necessary to obtain a good performance out of video

delivery, without an infinite waiting period. We saw that even streaming did not always

help, especially in the case of huge video files. Next, we started thinking of more ways

to reduce the client's wait time. Here, the fact that wireless client devices do not need

very high quality video anyway helped. We could actually reduce the quality of the

video, in the process reducing the required bandwidth and improving the streaming

performance.


This scheme helped, but there was no attention paid to the user's requirements.

The user simply had to be satisfied with whatever fixed scheme was settled on for the

purpose of the common good. Though the client might actually be able to handle a larger

video, the user might still have to settle for lower quality. To avoid this one-size-fits-all

scheme, we expanded it to a matrix of different schemes, based on the two principal

factors on which the video delivery depended: the network connection quality and the

client device modality. Each cell in the matrix got a different scheme of reductions most

appropriate for those conditions.

We then further improved the scheme, by shifting most of the onus of performing

reductions from the server to the packager, and introducing an index to store information

about the reductions. The server's work was reduced to using the index to access the

video file at runtime.

10.2. Proposed Extensions and Future Work

This project is only a first step toward implementing wireless video delivery. There were

some aspects that could easily be expanded on. For instance, the matrix was reduced to a

3 x 3 matrix. The main reason for this was the small number of reductions we could

implement. Rate reduction, discussed in Chapter 2, could also be added to the general

scheme, allowing room for more divisions in the matrix. The scheme of frame dropping

could have been improved, making it more sensitive to the video data, since the frame

composition changes with the video.

This brings into the picture MPEG-7, which would have been perfect for

specifying the contents of the video. There could have been another descriptor describing

the frame composition of the video, and the packager could have accessed this descriptor

to get information about the video. There could be more UMA descriptors suggesting the

ideal reductions to be performed based on the video content. In fact, the packager could

be assigned the job of initializing MPEG-7 descriptor values for each video as it is being

registered, aided by a set of dialog boxes that would prompt for choices of values for

each descriptor.

While preparing the tables of device classes and connection quality, we were

hampered by the lack of publicly available data. This could be improved on, and a more

sophisticated classification could be achieved, so that minimum work is given to the user.

The indexing scheme selected was also very basic, and deserves improvement.

The basic aim of this project was to illustrate the advantages of the scheme, however

primitively, and extensions on the scheme are not difficult to implement.

Hence a (not necessarily comprehensive) list of future extensions on the project

would be:

* More reduction schemes,

* More sophisticated reduction schemes for the existing reductions,

* Better classification in the tables for the matrix, and

* A more sophisticated indexing scheme.

To conclude, the scheme illustrated in this project attempted to better the

performance of wireless video delivery systems for the users of wireless networks. With

all the suggested extensions, this project could very well become a very interesting and

popular application.


[1] A. Elmagarmid and H. Jiang, Digital Video. In the Encyclopedia of Electrical and
Electronics Engineering, John Wiley & Sons, 1998.

[2] Ahmed K. Elmagarmid, Haitao Jiang, Abdelsalam A. Helal, Anupam Joshi and
Magdy Ahmed. Video Database Systems: Issues, Products and Applications. Kluwer
Academic, 1997.

[3] Compression algorithms.

[4] Still Image Compression Formats.

[5] AVI Audio-Video Interleave.

[6] MPEG home page.

[7] Dublin Core standard.

[8] R. Alonso, Y. Chang and V. Mani. Managing video data in a mobile environment.

[9] I. Kouvelas, V. Hardman and J. Crowcroft. Network-Adaptive Continuous Media
Applications Through Self-Organized Transcoding. Proc. Network and Operating
Systems Support for Digital Audio and Video (NOSSDAV 98, Cambridge, UK), 8-
10 July 1998.

[10] APPLE QuickTime streaming technology.

[11] Streaming Media World Tutorials.

[12] N. Merhav and V. Bhaskaran. Fast Algorithms for DCT-Domain Image Down-
Sampling and for Inverse Motion Compensation. IEEE Transactions on Circuits
and Systems for Video Technology, Vol. 7, No. 3, June 1997.

[13] Oliver Werner. Requantization for Transcoding of MPEG-2 Intraframes. IEEE
Transactions on Image Processing, Vol. 8, No. 2, February 1999.

[14] Sun documentation about JDBC.

[15] J. Smith, C.-S. Li, R. Mohan, A. Puri, C. Christopoulos, A. B. Benitez, P. Bocheck,
S.-F. Chang, T. Ebrahimi and V. V. Vinod. MPEG-7 Content Description for
Universal Multimedia Access, ISO/IEC JTC1/SC29/WG11 MPEG99/M4949,
MPEG-7 Proposal draft.

[16] Malcolm McIlhagga, Ann Light and Ian Wakeman. Giving Users the Choice
Between a Picture and a Thousand Words. School of Cognitive and Computing
Sciences, University of Sussex, Brighton, May 18, 1998.

[17] MPEG Software Simulation Group (MSSG).

[18] Experimental data.


Latha Sampath was born on July 17, 1977 in Chennai, India. She received her

Bachelor of Engineering degree in Computer Science from VES Institute of Technology,

affiliated with Bombay University, Bombay, India in August 1998.

She joined the Department of Computer and Information Science and Engineering

at the University of Florida in Fall 1998. She worked at the Database Systems Research

and Development Center while earning her master's degree.

Her research interests include streaming media and multimedia applications,

wireless networks and database systems.