## Citation

- Permanent Link:
- https://ufdc.ufl.edu/UFE0000662/00001
## Material Information

- Title:
- Compression library for Palm OS platform
- Creator:
- Ćirić, Nebojša
- Place of Publication:
- [Gainesville, Fla.]
- Publisher:
- University of Florida
- Publication Date:
- 2003
- Language:
- English
## Subjects

- Subjects / Keywords:
- Algorithms ( jstor )
- Alphabets ( jstor )
- Arithmetic ( jstor )
- Compression ratio ( jstor )
- Image compression ( jstor )
- Integers ( jstor )
- Libraries ( jstor )
- PDAs ( jstor )
- Pressure reduction ( jstor )
- Statistical models ( jstor )
- C++, COMPRESSION, PALMOS, UML
- Computer and Information Science and Engineering thesis, M.S ( lcsh )
- Data compression (Telecommunication) ( lcsh )
- Dissertations, Academic -- Computer and Information Science and Engineering -- UF ( lcsh )
- PalmPilot (Computer) -- Programming ( lcsh )
- Palm OS ( lcsh )
- Genre:
- government publication (state, provincial, territorial, dependent) ( marcgt )
- bibliography ( marcgt )
- theses ( marcgt )
- non-fiction ( marcgt )
## Notes

- Summary:
- ABSTRACT: Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science COMPRESSION LIBRARY FOR PALM OS PLATFORM By Nebojša Ćirić May 2003 Chair: Dr. Douglas Dankel II Major Department: Computer and Information Science and Engineering This thesis presents the design and implementation of a compression library (CLP) for the Palm Operating System (OS) platform. Data compression is a well-researched field in information theory and there are numerous programs and libraries available for almost every platform and OS in the market. Porting an existing library from the UNIX or MS Windows OS to the Palm OS is a nontrivial task because of constraints imposed by the handheld platform's memory size and organization. That is why CLP implemented simple, yet effective algorithms with small memory requirements. CLP contains two algorithms: the static Huffman compression and the static arithmetic compression. It would be easy to expand the library with new algorithms, if needed, because it was designed and implemented using an object oriented approach. CLP uses a simple interface, exposing only a few methods to the user, thus decreasing deployment time and increasing productivity. The library was specially designed with the Palm Database (PDB) format in mind. It allows a user to easily manipulate data in separate records, which is the preferred mode of operation on Palm OS.
- Thesis:
- Thesis (M.S.)--University of Florida, 2003.
- Bibliography:
- Includes bibliographical references.
- System Details:
- System requirements: World Wide Web browser and PDF reader.
- System Details:
- Mode of access: World Wide Web.
- General Note:
- Title from title page of source document.
- General Note:
- Includes vita.
- Statement of Responsibility:
- by Nebojša Ćirić.
## Record Information

- Source Institution:
- University of Florida
- Holding Location:
- University of Florida
- Rights Management:
- Copyright Ćirić, Nebojša. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
- Embargo Date:
- 9/9/1999
- Resource Identifier:
- 029897946 ( ALEPH )
- 1110020394 ( OCLC )
## Downloads

## This item has the following downloads:
ciric_n.pdf

## Full Text

BIOGRAPHICAL SKETCH

Nebojša Ćirić was born in Belgrade, Republic of Serbia, Yugoslavia. He received his Bachelor of Science degree in computer science and engineering from the School of Electrical Engineering, Belgrade, Yugoslavia, in 1998. He worked at the Institute "Mihajlo Pupin" for seven months on his BS project as a programmer. In August 1999, he moved to Ljubljana, Slovenia, to work at Hermes SoftLab, one of the largest software development companies in the country. He quit his job in Slovenia a year later to be with his wife in Gainesville, FL. He was accepted into the Computer and Information Science and Engineering Department at the University of Florida in August 2001. His research interests include object-oriented software development, artificial intelligence, and pattern recognition.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 INTRODUCTION
  Description of the Problem
  Overview of CLP
  Organization of Thesis
2 COMPRESSION THEORY AND ALGORITHMS
  Compression Theory
  Static and Adaptive Modeling
    Static Modeling
    Adaptive Modeling
  Lossy and Lossless Algorithms
  Compression Algorithms
    Specialized Algorithms
      Run length encoding (RLE)
      JPEG and MPEG algorithms
      MP3 and WMA algorithms
    Generic Algorithms
      Dictionary based algorithms
      Statistical model based algorithms
3 DESCRIPTION OF THE IMPLEMENTED ALGORITHMS
  The Huffman's Coding Algorithm
    Building a Huffman's Tree
    The Compression Procedure
    The Decompression Procedure
    Summary
  The Arithmetic Coding Algorithm
    The Arithmetic Coding Procedure
    Implementation with integer arithmetic
    Summary

Dedicated to my family.

character "u" after the character "q" in the English language, but a very low probability for "w" after "b."
This approach yields better compression, but the overhead of keeping a larger table is often too expensive. The requirement of keeping the model table with the data in order to decode that data is the reason why static modeling has been practically abandoned in modern compression theory.

Adaptive Modeling

The adaptive algorithm does not have to scan the data to generate statistics. Instead, the model is constantly updated as new symbols are read and coded. This is true for both the encoding and decoding processes and means that the table does not have to be saved with the data to be decoded. The compression and decompression models are shown in Figure 2-3 and Figure 2-4.

Figure 2-3. Adaptive compression diagram.

/*
 * pdbbuild.php
 *
 * Make Test DB for compression
 *
 * It uses php-pdb module [18] to manipulate PDB files
 */
// Turn on all error reporting
ini_set('error_reporting', E_ALL);
include "./php-pdb.inc";

// Get data first
echo "Get data first and put it into SourceDataDB\n";
$fp = fopen("compressionTest.txt", "rt");
if (!$fp) {
    echo "Can't open compressionTest.txt\n";
    exit;
}
$PDBFile = new PalmDB('DATA', 'STRT', 'SourceDataDB');
$counter = 1;
$test = 0;
while (!feof($fp)) {
    $string = fgets($fp);
    $PDBFile->AppendString($string);
    $counter = $counter + 1;
    $test = $PDBFile->GoToRecord($counter);
    echo "Test =", $test, "\n";
}
$PCPDBFile = fopen("SourceDataDB.pdb", "wb");
if (!$PCPDBFile) {
    echo "Can't create SourceDataDB.pdb\n";
    exit;
}
$PDBFile->WriteToFile($PCPDBFile);

implementations are different, reflecting the differences in their corresponding algorithms. A modular OO, C++ [14] approach makes changes and additions to the library easy. If a new algorithm is needed, its class can be inherited from StatisticCompress or AdaptiveCompress without changing the existing code. Each class is declared in a separate header file, and its implementation is placed in a corresponding .cpp file, thus reducing the size of each module.
The complete UML [15] compression class diagram is given in Figure 4-1.

BitIO Class Hierarchy

The BitIO class is a helper class. It handles single- or multiple-bit stream operations. The OS file or memory functions are designed for byte access, but for compression purposes the ability to store an arbitrarily sized bit sequence in memory is needed. The UML representation of the BitIO class is given in Figure 4-2.

CLP Library Interface

The compression library should provide a simple interface, enabling the user to easily send data to the library and receive the corresponding results from it. Each class in the hierarchy either expands the interface or implements its methods.

number of bits needed to encode the whole message M is simply the sum of the code lengths for each symbol contained in the message, or I(M) = -Σ C_i log2 P(s_i), where C_i is the count of the symbol s_i in the message M [4]. The probability of a symbol depends on the model we choose to describe the data set. This means that the entropy of a message or symbol is not an absolute, unique value. As a result, models that predict symbols with high probability are good for a data compression system. After data is modeled it is encoded with a proper number of bits. If the entropy of a symbol is 3.5 bits, then that symbol should be encoded with 3.5 bits. Some encoding schemes (e.g., the Huffman scheme) decrease compression by rounding the number of bits up to an integer to boost the processing speed.

Static and Adaptive Modeling

There are two approaches to modeling data: static and adaptive. The static method was developed first but is now abandoned in favor of the adaptive method. A description of each of the methods follows [5].

Static Modeling

The static model first gathers statistical information about each symbol in a table, by scanning the data once and counting the symbol frequencies in the data set. The resulting model is used for data encoding and decoding.
For the encoder and decoder to be compatible they must share the same model, which means that the table has to be transmitted to both the encoder and the decoder. Data compression and decompression with the static model are shown in Figure 2-1 and Figure 2-2.

CHAPTER 5
COMPRESSION RESULTS

This chapter presents a summary of the CLP library's performance on a Palm Pilot device, based on a sample program using the library to compress and decompress drug reference data [5]. The compression ratios and speeds are measured for both the Huffman and arithmetic coding algorithms. A comparison is made to determine which algorithm is more suitable for PDA use.

Test Environment

The following is a description of the sample application, scripts, utilities, source data, and testing procedure used.

Source Data

The initial data set was a text file containing ASCII strings delimited with newline characters. The size of the text file is approximately 2300 strings, or 840KB. Using the first PHP script given in the Appendix, each string from the text file was inserted into a separate record in the PDB database. The CLP implementation on the PC is used to generate a header with statistical data for the compression and decompression procedures. The second PHP script, also given in the Appendix, is used to transfer binary information from the header file to the PDB database record on the PC. All generated PDB files and the sample application are loaded into the Palm OS device emulator program (POSE).

Figure 4-5. The UML sequence diagram for the compression process using the StatAritmetic class.

[Figure: the UML sequence diagram for the decompression process using the StatAritmetic class.]

LIST OF TABLES

3-1. Input alphabet for Huffman's coder
3-2. Arithmetic encoding process for alphabet S and sequence a b a EOS
3-3. Arithmetic decoding process for alphabet S and bit sequence 01101₂
5-1. Compression results
5-2. Decompression results

CHAPTER 4
CLP LIBRARY IMPLEMENTATION

This chapter discusses the implementation details from a programmer's perspective. These include (1) the class hierarchy, (2) the interface exposed by the compression library, (3) the proper call sequence to the class methods, (4) error handling, and (5) instructions on how to include the CLP library in a Palm OS application.

Class Hierarchy

The CLP library consists of two separate sets of classes. One consists of the main compression hierarchy, while the other holds the simple bit-stream class.

Compression Class Hierarchy

Different compression algorithms use different techniques for compressing data, but they all have a similar interface [13] to the application that uses them. It is common practice to enforce such behavior by using a common virtual base class; in the case of the CLP library, that base class is the BaseCompress class. Static and adaptive compression algorithms update the coding model in different ways. The static compressor gathers statistical information before compression, while the adaptive one does so during the compression phase. As a result, there are two new classes inherited from the BaseCompress class: StatisticCompress and AdaptiveCompress. Both are virtual classes, with the AdaptiveCompress class not being implemented.
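The hierarchy described above can be sketched as follows. This is a minimal illustration, not CLP's actual interface: the method names, signatures, and byte-vector types are assumptions, and the StatHuffman bodies are stubs.

```cpp
#include <cstdint>
#include <vector>

// Common virtual base: every compressor exposes the same interface.
class BaseCompress {
public:
    virtual ~BaseCompress() = default;
    virtual std::vector<uint8_t> Compress(const std::vector<uint8_t>& in) = 0;
    virtual std::vector<uint8_t> Decompress(const std::vector<uint8_t>& in) = 0;
};

// Static compressors gather statistics in a separate pass before coding.
class StatisticCompress : public BaseCompress {
public:
    virtual void GatherStatistics(const std::vector<uint8_t>& in) = 0;
};

// Adaptive compressors would update the model while coding
// (declared but not implemented in CLP).
class AdaptiveCompress : public BaseCompress {};

// Concrete algorithms inherit from StatisticCompress.
class StatHuffman : public StatisticCompress {
public:
    void GatherStatistics(const std::vector<uint8_t>&) override {}
    std::vector<uint8_t> Compress(const std::vector<uint8_t>& in) override { return in; }   // stub
    std::vector<uint8_t> Decompress(const std::vector<uint8_t>& in) override { return in; } // stub
};
```

The payoff of this design is that application code holds a `BaseCompress*` and never needs to know which algorithm is behind it; adding an algorithm means adding one derived class.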
The Huffman and arithmetic compression classes are inherited from the StatisticCompress class. While they share the same interface, their internal implementations differ.

Figure 2-5. Lossy compression example. A) High quality JPG image. B) Low quality JPG image.

Compression Algorithms

Compression algorithms can be divided into two groups: generic and specialized. Generic algorithms are able to compress any type of information with good but not perfect results. Specialized algorithms are very good at compressing some types of information, like images or sound, but have poor results for other types of data.

Specialized Algorithms

Some well-known specialized algorithms are reviewed in this section.

Run length encoding (RLE)

The run length encoding algorithm [5] is mostly used with bitmap images (e.g., black and white images), where symbols (pixels) with the same value are often found in contiguous streams. Each such stream can then be replaced with a shorter representation, decreasing the image size. This method must be cleverly implemented to avoid data expansion, which can happen if the streams are short or the symbols alternate.

fclose($PCPDBFile);
fclose($fp);

// Get first header (for Static Huffman Compression)
echo "Get first header (for Static Huffman Compression)\n";
$fp = fopen("SHuffman.bin", "rb");
if (!$fp) {
    echo "Can't open SHuffman.bin\n";
    exit;
}
$PDBFile = new PalmDB('DATA', 'STRT', 'SHuffmanDB');
$Header = fread($fp, 256);
$PDBFile->AppendString($Header);
$PCPDBFile = fopen("SHuffmanDB.pdb", "wb");
if (!$PCPDBFile) {
    echo "Can't create SHuffmanDB.pdb\n";
    exit;
}
$PDBFile->WriteToFile($PCPDBFile);
fclose($PCPDBFile);
fclose($fp);

// Get second header (for Static Arithmetic Compression)
echo "Get second header (for Static Arithmetic Compression)\n";
$fp = fopen("SArithmetic.bin", "rb");
if (!$fp) {
    echo "Can't open SArithmetic.bin\n";
    exit;
}
$PDBFile = new PalmDB('DATA', 'STRT', 'SArithmeticDB');
$Header = fread($fp, 256);
$PDBFile->AppendString($Header);
$PCPDBFile = fopen("SArithmeticDB.pdb", "wb");

APPENDIX
SOURCE CODE LISTINGS

The source code for the CLP and the test application is not included in the thesis. Only the PHP scripts mentioned in the thesis are included.

/*
 * header.php
 *
 * Makes test header PDB for compression
 *
 * It uses php-pdb module [18] to manipulate PDB files
 */
// Turn on all error reporting
ini_set('error_reporting', E_ALL);
$fp = fopen("SHuffman.bin", "rb");
for ($i = 0; $i < 256; $i++) {
    $symbol = fread($fp, 1);
    echo ord($symbol);
    echo ",";
}
?>

Sample Application

A sample application was written in the C++ language and linked with the CLP library for Palm OS. It first compresses the input PDB database using the Huffman and arithmetic coding algorithms and logs compression times with an external profiler application. The compression results are written into new database files, one for each type of compression. The resulting databases are then decompressed using both algorithms, and decompression times are logged with the profiler.

Scripts

Scripts for converting data to the Palm PDB format [1] were written using the php-pdb [17] module. PHP script language interpreters are available for almost all OS platforms, so these scripts are highly portable [17].

Utilities

Two utility programs are used for testing: POSE and Palm Reporter. Both are part of the Palm OS Software Development Kit (SDK) [1]. POSE is able to emulate any Palm OS device if a ROM image for that device is available. It emulates the real speed of the device, so measurements taken on it are roughly equal to measurements taken on the real device. Palm Reporter is a stand-alone program able to connect to POSE and receive log messages. It is used to obtain compression and decompression speed information from the sample program running on POSE.
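The run-length encoding scheme described in the algorithms overview can be sketched as a naive (count, symbol) pair encoding. This is an illustration, not CLP code; note how a run of length 1 costs two bytes instead of one, which is exactly the data-expansion pitfall mentioned in the text.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Naive RLE: encode each run of identical bytes as a (count, symbol) pair.
std::vector<std::pair<uint8_t, uint8_t>> rleEncode(const std::vector<uint8_t>& in) {
    std::vector<std::pair<uint8_t, uint8_t>> out;
    for (uint8_t b : in) {
        if (!out.empty() && out.back().second == b && out.back().first < 255)
            out.back().first++;      // extend the current run
        else
            out.push_back({1, b});   // start a new run
    }
    return out;
}

// Decoding simply expands each pair back into a run.
std::vector<uint8_t> rleDecode(const std::vector<std::pair<uint8_t, uint8_t>>& in) {
    std::vector<uint8_t> out;
    for (auto [count, sym] : in)
        out.insert(out.end(), count, sym);
    return out;
}
```

For bitmap data with long same-valued pixel streams this wins easily; for alternating symbols it doubles the size, so real implementations add an escape mechanism.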
algorithm are less than optimal, due to integer arithmetic and other compromises. Since the Huffman coding algorithm is nearly optimal in many cases, the choice between the methods is not as simple as the theory suggests.

Summary

The two algorithms used for the development of CLP are described in this chapter. An in-depth theoretical analysis [4] and detailed implementation [5] can be found in the literature. The Huffman's coding algorithm is lightweight and fast, but it is non-optimal. The arithmetic coding algorithm is CPU intensive, but in its "pure" form it is optimal. Chapter 5 compares the results of both algorithms and identifies which one is better for PDA applications.

Welch (LZW) algorithm [3, 4, 8, 9], which is used in almost every commercial compression utility or library. It parses the input stream and, for each new phrase it encounters, adds an entry to a dictionary. The idea is simple, but its implementation is usually complex because of the dictionary maintenance tradeoffs. If the dictionary becomes full, one of three actions occurs: (1) the dictionary could be flushed, which affects compression, (2) the dictionary could be frozen, which affects compression if the data changes over time, or (3) the dictionary could be expanded, which could lower the compression ratio because the token size increases.

Statistical model based algorithms

This family of algorithms encodes symbols into bit-strings of variable length using a statistical model. These algorithms are based on a greedy approach where symbols with the highest probabilities are given the shortest codes (bit-strings). The model has to accurately predict the probabilities of symbols to increase the compression. The Huffman coding algorithm [4, 5, 10] achieves the minimum amount of redundancy possible if the bit-strings are limited to an integer size. As a result, it is not an optimal method, just a good approximation. It is a very simple and fast algorithm, suitable for devices with slow CPUs.
Memory consumption depends on the order of the model used. The arithmetic coding algorithm [4, 5, 11, 12] does not produce a single code for each symbol. It produces a code for the whole message. Each symbol added to the message incrementally modifies the output code. This approach allows a symbol to be encoded with a fractional number of bits instead of an integer number, thus exactly representing the entropy of the symbol. In theory, arithmetic coding is optimal for a given model, but a real-world implementation has to make some tradeoffs related to floating

[14] B. Eckel, Thinking in C++, 2nd Edition, Volume 1, Prentice Hall, Upper Saddle River, NJ, 2000.
[15] C. Larman, Applying UML and Patterns, Prentice Hall PTR, Upper Saddle River, NJ, 2002.
[16] H. Schildt, C/C++ Programmer's Reference, 2nd Edition, Osborne/McGraw-Hill, Berkeley, CA, 2000.
[17] SourceForge Home Page, http://php-pdb.sourceforge.net, Accessed: 01/10/2003.

determined by the probabilities. If s_i is the next source letter, assign L ← L_i and H ← H_i. 5. Repeat steps 2-4 until the end of the stream, when none of the conditions in steps 2 or 3 hold. At this stage, the final interval satisfies L < 1/4 < 1/2 ≤ H or L < 1/2 < 3/4 ≤ H, and the sequence "01" is output if the first condition holds and "10" is output otherwise. Any leftover underflow bits are output after these bits.

Implementation with integer arithmetic

The previously described algorithm cannot be directly applied in practice for two reasons. First, floating point arithmetic is much slower than integer arithmetic on any computer platform. Second, the result can become very long, hence demanding a precision that is not available with current technology. To solve these problems, the interval [0, 1) can be replaced with [0, M), where M is a positive integer with a value of at least 4(|S|-2). The value of M is usually chosen as some power of 2, which can improve integer arithmetic performance.
Each subinterval in [0, M) is expanded correspondingly, with a value of M and a subinterval length. Also, L and H can be represented as 0x0000... and 0xFFFF... in the initial case, where the ellipsis means that 0s or 1s can be shifted in when needed, thus enabling 16- or 32-bit register arithmetic. The interval [0, M) is just an approximation of the interval [0, 1), with imprecise symbol frequencies, which leads to lower compression ratios. But using integer arithmetic drastically increases the processing speed. To illustrate the encoding and decoding process using the integer approximation algorithm, consider the following example. The source alphabet is S = {a, b, EOS}, the frequencies for each symbol are f_a = 6/10, f_b = 3/10, and f_EOS = 1/10, and the cumulative counts are C_a = 6, C_b = 9, and C_EOS = 10. The input sequence is: a b a EOS.

The current size of the CLP library is 145KB. If it were developed in the C language its size would decrease to approximately 30-40KB, and its dynamic memory consumption would be much smaller. This decrease in size happens because the C language does not use the STL library, which is quite large, and there is no object initialization overhead for virtual classes and methods. With some decrease in flexibility and expandability, the CLP library could be ported to the C language, reducing its size and memory footprint.

1st-order and Adaptive Algorithms

The current 0th-order algorithms can be replaced with 1st-order algorithms to increase compression ratios, though the problem of holding large statistical tables in memory must be solved first. If record-based compression/decompression is not needed, then adaptive algorithms can be used. They would eliminate the need for header storage space and data preprocessing.

point or integer arithmetic, thus decreasing the compression ratio.
The arithmetic coding algorithm needs a powerful CPU for both the encoding and decoding process, which makes it less desirable on PDA devices than the Huffman coding algorithm. Both of these algorithms are covered in depth in the next chapter.

Discussion

Users of PDA devices expect PDA applications to have a quick response time and a small size. The test results clearly show that the Huffman coder algorithm can provide both, while the arithmetic coder algorithm falls short when speed is taken into account.

LIST OF FIGURES

2-1. Static data compression diagram
2-2. Static data decompression diagram
2-3. Adaptive compression diagram
2-4. Adaptive decompression diagram
2-5. Lossy compression example. A) High quality JPG image. B) Low quality JPG image.
3-1. Huffman's tree and code words generated from input alphabet S
4-1. The UML compression CLP library diagram
4-2. The UML bit input/output BitIO class diagram
4-3. The UML sequence diagram for the compression process using the StatHuffman class
4-4. The UML sequence diagram for the decompression process using the StatHuffman class
4-5. The UML sequence diagram for the compression process using the StatAritmetic class
4-5. The UML sequence diagram for the decompression process using the StatAritmetic class
5-1. Symbol frequency distribution

COMPRESSION LIBRARY FOR PALM OS PLATFORM

By

NEBOJŠA ĆIRIĆ

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2003

Figure 2-4. Adaptive decompression diagram.

The problem with the adaptive model is that it starts with an empty table, so the compression is very low in the beginning. Most adaptive algorithms adjust to the input stream after a few thousand symbols, resulting in a good compression ratio. Also, they are able to adapt to changes in the nature of the input stream, such as when the data changes from text to an image.

Lossy and Lossless Algorithms

Data can be compressed with or without loss of information. Lossy compression is mostly used for drastically reducing the size of images and sound files, because the human senses are not very sensitive to small changes in quality. The JPG format is able to reduce an image's size by examining adjacent pixels and making ones that appear similar the same, thus reducing the entropy and increasing the compression. Images A and B in Figure 2-5 are saved using the highest and lowest JPG quality. Note that while image A is twice as large as B, the quality of picture B is not drastically different. Lossless compression is used for all other types of data where accuracy is mandatory. Some of the most widely used lossless algorithms are described below, in the generic algorithms section.
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

COMPRESSION LIBRARY FOR PALM OS PLATFORM

By Nebojša Ćirić

May 2003

Chair: Dr. Douglas Dankel II
Major Department: Computer and Information Science and Engineering

This thesis presents the design and implementation of a compression library (CLP) for the Palm Operating System (OS) platform. Data compression is a well-researched field in information theory and there are numerous programs and libraries available for almost every platform and OS in the market. Porting an existing library from the UNIX or MS Windows OS to the Palm OS is a nontrivial task because of constraints imposed by the handheld platform's memory size and organization. That is why CLP implemented simple, yet effective algorithms with small memory requirements. CLP contains two algorithms: the static Huffman compression and the static arithmetic compression. It would be easy to expand the library with new algorithms, if needed, because it was designed and implemented using an object oriented approach. CLP uses a simple interface, exposing only a few methods to the user, thus decreasing deployment time and increasing productivity.

Table 3-1. Input alphabet for Huffman's coder

Symbol  Frequency
s1      1/3
s2      1/5
s3      1/6
s4      1/10
s5      1/10
s6      1/10

To illustrate this process, consider the following example. In Figure 3-1, the input alphabet S, defined in Table 3-1, is transformed into a Huffman's tree. The average code word length is defined as l̄ = Σ_{i=1}^{m} f_i w_i, where f_i is the frequency and w_i is the code length of symbol s_i. For the input alphabet S, l̄ has decreased from 3 to 2.47. In the average case, for this simple alphabet, the compression ratio is 17.7%.

Figure 3-1. Huffman's tree and code words generated from input alphabet S.
Shared or dynamically linked libraries (DLL) offer a fix to this problem by keeping the library in a separate file, which can be loaded on demand by the application needing it. Then only one copy of the library is kept in the system, thus reducing memory requirements. DLLs are not a perfect solution because:

1. They are highly system dependent.
2. They must provide backward compatibility with previous versions.

If condition 1 is acceptable and the library is well implemented, then a DLL approach should be chosen. Compression libraries are usually in high demand in the system, so they are shared by many applications at the same time. They are also system specific because they use low-level system properties to increase processing speed; hence the CLP library should be implemented as a DLL in its next version.

C++ vs. C Language

The C++ language offers a rich set of libraries and language elements (i.e., strict type checking, exception handling, strings, dynamic arrays, and classes) to the developer. These properties increase productivity by enabling the programmer to focus more on the problem than on its implementation. Also, programs developed in an OO language can be easily expanded by adding new classes and reusing old ones. The problem with the C++ language is that this flexibility and power increase memory consumption and decrease speed, which can hurt performance on PDA devices. The C language is the language of choice for embedded systems programmers because of its small memory footprint and tight connection with the underlying system. It does not offer a rich set of language elements like C++, thus forcing the programmer to spend more time in the implementation and testing phases.

Figure 2-1. Static data compression diagram.

Figure 2-2. Static data decompression diagram.
Depending on the nature of the data set, the probability of the adjacency of two or more specific symbols in the message can be independent, for binary files, or dependent with high or low values, for text files. The encoding model can try to predict such probabilities, thus increasing the compression ratio. The number of adjacent symbols defines the order of the model. A 0th-order model assumes that the symbol position in the message is independent of the position of other symbols. A 1st-order model assumes that two adjacent symbols are dependent, etc. For a 0th-order model there is only one table with symbol counts. For the standard ASCII character set, the table would have 256 entries. With the 0th-order model each symbol is assumed independent of the other symbols. As the order of the model increases, the number of tables increases. In the case of the 1st-order model there are 256 tables with 256 entries each. With the 1st-order model some relation between symbols is assumed; for example, there is a high probability of the

REFERENCES

[1] Palm Corp. Home Page, http://www.palmos.com, Accessed: 12/10/2002.
[2] L. R. Foster, Palm OS Programming Bible, IDG Books Worldwide, Inc., Foster City, CA, 2000.
[3] C. E. Shannon, A mathematical theory of communication, Bell System Technical Journal, 27:379-423 and 623-656, 1948.
[4] D. Hankerson, G. A. Harris, P. D. Johnson, Introduction to Information Theory and Data Compression, CRC Press, Boca Raton, FL, 1997.
[5] M. Nelson, J. L. Gailly, The Data Compression Book, M&T Books, New York, 1996.
[6] G. K. Wallace, The JPEG still picture compression standard, Communications of the ACM, 34(4):32-44, April 1991.
[7] Microsoft Corp., Support Page for Windows Media Formats, http://support.microsoft.com/default.aspx?scid=/support/mediaplayer/wmptest/wmptest.asp#Windows%20Media, Accessed: 02/10/2003.
[8] T. Welch, A technique for high-performance data compression, IEEE Computer, 17:8-19, June 1984.
[9] J. Ziv, A.
Lempel, A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, 23(3):337-343, May 1977.
[10] D. A. Huffman, A method for the construction of minimum redundancy codes, Proceedings of the IRE, 40(9):1098-1101, September 1952.
[11] A. Moffat, R. Neal, I. H. Witten, Arithmetic coding revisited, ACM Transactions on Information Systems, 16(3):256-294, July 1998.
[12] I. H. Witten, R. Neal, J. G. Cleary, Arithmetic coding for data compression, Communications of the ACM, 30(6):520-540, June 1987.
[13] E. Gamma, R. Helm, R. Johnson, J. Vlissides, Design Patterns, Addison-Wesley, Reading, MA, 1994.

A Palm OS device keeps its data in a specific file format known as a Palm Database File (PDB) [1, 2]. This is a record-based, random-access file residing in main memory or on a memory card. The PDB format has a limit of 65535 records per file. Each record can be up to 64KB long and of variable size [1, 2]. The newer Palm devices are introducing regular files without these limitations, but the huge number of legacy applications slows the transition. Two general categories of compression techniques are used: adaptive and statistical. In this environment adaptive compression techniques do not yield good results because the size of each record is too small for these methods to effectively gather statistics about the data before compressing. As a result, CLP uses only a statistical compression method. Palm OS devices, especially older ones, have a small amount of dynamic memory (on the order of a few kilobytes) that is shared with the TCP/IP stack, global variables, etc. [1, 2]. This limits the number and the size of the statistical tables that a compression algorithm can hold in memory at one time. As a result, CLP implements only 0th-order statistical compression methods. But even with all these limitations, good compression ratios and compression speeds are obtainable using the current implementation of CLP.
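The 0th-order statistical table that fits these memory limits is just a single 256-entry array of symbol counts, one entry per byte value and independent of context. A minimal sketch (illustrative, not CLP's actual code):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// A 0th-order model keeps exactly one table of symbol counts: one slot per
// possible byte value, regardless of what symbols precede or follow.
std::array<std::uint32_t, 256> countFrequencies(const std::uint8_t* data,
                                                std::size_t len) {
    std::array<std::uint32_t, 256> table{};   // zero-initialized counts
    for (std::size_t i = 0; i < len; ++i)
        ++table[data[i]];
    return table;
}
```

A 1st-order model would instead need 256 such tables (one per preceding symbol), which is exactly what the small dynamic memory of older Palm devices rules out.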
The CLP library can be used on any device, including PCs, provided that a C++ compiler and the Standard Template Library are available. This enables the user to preprocess data on a more powerful platform and just transfer them to the PDA device for later use. The library was specially designed with the Palm Database (PDB) format in mind. It allows a user to easily manipulate data in separate records, which is the preferred mode of operation on Palm OS.

Figure 4-2. The UML bit input/output BitIO class diagram.

BaseCompress Class Interface

The BaseCompress class provides the behaviors detailed below by exposing the following methods in its interface:

Initialize: Initializes the compression/decompression engine for a given class. This is a pure virtual method [14].
Compress/Decompress: Sends data to the CLP library for compression/decompression. These are pure virtual methods.
GetResult: Extracts the result from the CLP library. This method is implemented in the BaseCompress class.
FlushBuffer: Flushes the internal buffers and prepares the CLP library for the next compression sequence. This method is implemented in the BaseCompress class.

StatisticCompress Class Interface

The StatisticCompress class expands the base interface with methods operating on the statistical data and the resulting header:

GetHeader: Returns the header with statistical information that was generated by the Initialize call. This is a pure virtual method.
GetHeaderSize: Returns the size of the header, so that the application can allocate enough memory to save the header. This is a pure virtual method.
SetHeader: Data can be preprocessed or even compressed on the PC. The header information generated through preprocessing can be read from the file and used for compression/decompression on the Palm platform, thus decreasing overhead and increasing speed. This is a pure virtual method.
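The interface just described can be sketched as a C++ class hierarchy. The method names follow the text; the parameter types and the trivial NullCompress subclass are assumptions for illustration, since the thesis does not give exact signatures:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the BaseCompress interface described above (signatures assumed).
class BaseCompress {
public:
    virtual ~BaseCompress() {}
    virtual void Initialize() = 0;                                       // pure virtual
    virtual std::size_t Compress(const std::uint8_t* in, std::size_t len) = 0;
    virtual std::size_t Decompress(const std::uint8_t* in, std::size_t len) = 0;
    // Implemented in the base class: copy the result out to the caller.
    std::size_t GetResult(std::uint8_t* out, std::size_t max) {
        std::size_t n = result_.size() < max ? result_.size() : max;
        for (std::size_t i = 0; i < n; ++i) out[i] = result_[i];
        return n;
    }
    // Implemented in the base class: ready the engine for the next record.
    void FlushBuffer() { result_.clear(); }
protected:
    std::vector<std::uint8_t> result_;
};

// Minimal concrete subclass (identity transform), only to exercise the interface.
class NullCompress : public BaseCompress {
public:
    void Initialize() override { result_.clear(); }
    std::size_t Compress(const std::uint8_t* in, std::size_t len) override {
        result_.assign(in, in + len);
        return result_.size();
    }
    std::size_t Decompress(const std::uint8_t* in, std::size_t len) override {
        return Compress(in, len);      // identity: decompress == copy through
    }
};
```

A real subclass such as StatHuffman would override the same three pure virtual methods while inheriting GetResult and FlushBuffer unchanged, which is what lets new algorithms be added without touching application code.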
StatHuffman and StatArithmetic Classes

The StatHuffman and StatArithmetic classes implement the interface declared by the StatisticCompress parent class. They implement the corresponding Huffman and arithmetic algorithms.

BitIO Class Interface

The BitIO class implements a simple interface that allows the user to read/write a sequence of bits from/to a stream. The proper call sequence is outlined below:
1. OpenIO: Opens a new bit I/O stream connection.
2. Input/Output Bit/Bits: Inputs/outputs the bit sequence from/to the stream.
3. FlushIO: Flushes the bits remaining in the buffer to the stream.
4. GetBuffer: Returns the resulting stream.
5. CloseIO: Closes the I/O stream.
Step 2 is repeated as long as there are bits to input or output.

Methods Interaction

To properly use the CLP library, the application must follow this calling sequence for both the compression and the decompression process:
1. Initialize
2. Set/GetHeader: Set the header if it is needed later, or get it if it was generated before.

CHAPTER 1
INTRODUCTION

This chapter begins with a description of the problem addressed within this thesis, followed by an overview of the compression library for the Palm OS platform (CLP) that was developed to solve this problem. The chapter concludes with a summary of the remaining chapters in this thesis.

Description of the Problem

The number of handheld devices is increasing rapidly every year because of constant price drops, increased functionality, and increasing demand. They are used everywhere, from hospitals to supermarkets and IT companies. The largest share, 75%, of the personal data assistant (PDA) market is held by Palm OS based PDAs. The remaining 25% of the market is held by PocketPC PDAs based on the MS Windows CE or Linux operating systems. Companies producing Palm PDAs are Palm, Visor, Handera, Sony, and Handspring. Companies producing MS Windows based PocketPC PDAs include Hewlett-Packard, Compaq, and Toshiba, while Sharp produces a Linux based system.
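The BitIO call sequence above can be sketched with a minimal bit-output class. This is illustrative only: CLP's BitIO also handles input, and the real signatures may differ:

```cpp
#include <cstdint>
#include <vector>

// Minimal bit-output sketch following the OpenIO / OutputBits / FlushIO /
// GetBuffer / CloseIO sequence described in the text.
class BitWriter {
public:
    void OpenIO() { buf_.clear(); cur_ = 0; nbits_ = 0; }
    // Emit the low `count` bits of `value`, most-significant bit first.
    void OutputBits(std::uint32_t value, int count) {
        for (int i = count - 1; i >= 0; --i) {
            cur_ = static_cast<std::uint8_t>((cur_ << 1) | ((value >> i) & 1u));
            if (++nbits_ == 8) { buf_.push_back(cur_); cur_ = 0; nbits_ = 0; }
        }
    }
    void FlushIO() {                 // pad the last partial byte with zero bits
        if (nbits_ > 0) {
            buf_.push_back(static_cast<std::uint8_t>(cur_ << (8 - nbits_)));
            cur_ = 0; nbits_ = 0;
        }
    }
    const std::vector<std::uint8_t>& GetBuffer() const { return buf_; }
    void CloseIO() { buf_.clear(); }
private:
    std::vector<std::uint8_t> buf_;
    std::uint8_t cur_ = 0;
    int nbits_ = 0;
};
```

Writing the three bits 101 followed by the five bits 01101 packs into the single byte 10101101.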
The Palm platform covers a wide range of devices. The oldest or cheapest ones have only 2MB of main memory, a black and white screen, an RS232 serial link to a PC, and a 16MHz Motorola DragonBall processor. The newest and most expensive ones have up to 16MB of main memory, CompactFlash (CF) or SmartMedia (SD) memory expansion slots, a color screen, a USB link to a PC, and a 133MHz Intel StrongARM CPU [1, 2]. They all share a "beaming" capability. Beaming is the process of transferring data between PDA devices, laptops, or cell phones using the Infrared Data Association Protocol (IrDA) over short distances (usually less than 1 m). Beaming speed is comparable with an RS232 serial connection speed (i.e., up to 115200 bits/second) [1, 2]. PDAs are mostly used as personal organizers and for reading e-documents, e-mails, and e-books downloaded from a PC. These documents can be several megabytes in length, requiring significant time to download through an RS232 or IrDA connection. Large documents can also decrease the amount of free main memory by 20-30% per document. Text documents, as opposed to binary files, have good compression properties because of non-random word and character repetition. Using compression can reduce memory consumption and download time, thus reducing battery consumption and improving the overall response time of the system.
The top compression algorithms can reduce the size of text documents by up to 90%, but with the tradeoffs of increased algorithm complexity/processing time and memory consumption. This thesis proposes an algorithm with similar performance for a 16MHz Palm Vx with 4MB of memory or a 133MHz Tungsten device with 16MB of main memory.

Overview of CLP

CLP, the compression library for Palm, is a general compression library developed to solve the problem described above. It provides a simple interface for application developers to easily compress and decompress data on PDA devices or a PC. CLP can be effortlessly extended with new compression algorithms because of its modular, object-oriented design. Using interfaces hides implementation complexity from the application and allows library changes and improvements without a need to change the application itself.

Figure 4-4. The UML sequence diagram for the decompression process using the StatHuffman class.

Figure 4-1. The UML compression CLP library diagram.

3. Compress/Decompress: This call passes data to the library and returns the size of the result.
4. GetResult
5. FlushBuffer
Steps 3-5 are repeated for each record that the application intends to compress/decompress. A detailed calling sequence for the public and private functions, for both types of compression/decompression, is given as UML sequence diagrams in Figures 4-3, 4-4, 4-5, and 4-6.

Error Handling

The CLP library uses the C++ exception handling mechanism [14] to handle errors. There are only three types of exceptions that CLP throws:

out_of_range is thrown if the array index is out of range.
length_error is thrown if the array resize operation fails (i.e., there is insufficient memory).
An unspecified exception is thrown if an unspecified error happens (i.e., some system exception).
All three types of exceptions are inherited from the C++ Standard Template Library (STL). The CLP library uses the STL's vector container type [14, 16] to implement the dynamic arrays used for holding the results generated during compression and decompression. A simple tester application shows the correct way of handling exceptions thrown by the CLP library in the application that uses it.

JPEG and MPEG algorithms

The Joint Photographic Experts Group (JPEG) [6] and the Moving Pictures Expert Group (MPEG) committees proposed two lossy algorithms for image and moving picture compression, respectively. These algorithms make two passes over the data. The first pass converts the data into the frequency domain, using FFT-like algorithms. Once transformed, the data are "smoothed" by rounding off the peaks, resulting in a loss of information. In the second pass, the data are compressed using one of the lossless algorithms. Compression ratios can be very high with acceptable quality degradation, as shown in Figure 2-5.

MP3 and WMA algorithms

MP3 and Windows Media Audio (WMA) [7] are lossy algorithms used for audio signal compression. Compression ratios are high, with a reduction in size by a factor of 10 or more with low distortion of the audio signal. These algorithms use the fact that the human ear cannot hear some frequencies if they are masked by other frequencies. They eliminate those hidden frequencies, hence reducing the size of the signal. The encoding procedure is highly CPU intensive but decoding is not. This property is one of the reasons why the MP3 format is widely accepted for storing audio files.

Generic Algorithms

There are a few generic algorithms currently in use, like the (1) dictionary, (2) sliding window [4, 5], (3) Huffman, and (4) arithmetic coding algorithms.
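The exception-handling pattern recommended for applications using CLP can be sketched as follows. The clpCompressRecord function here is a stand-in for an application-side wrapper, not an actual CLP API; the three catch clauses mirror the three exception kinds listed above:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative error-handling wrapper around vector operations of the kind
// CLP performs internally. The function name is hypothetical.
std::string clpCompressRecord(std::vector<unsigned char>& v, std::size_t index) {
    try {
        v.at(index) = 0;              // may throw std::out_of_range
        v.resize(v.size() + 1);       // may throw std::length_error
        return "ok";
    } catch (const std::out_of_range&) {
        return "index out of range";
    } catch (const std::length_error&) {
        return "resize failed";
    } catch (...) {                   // unspecified (e.g., system) exception
        return "unknown error";
    }
}
```

Ordering the catch clauses from most specific to the catch-all ensures each CLP exception is reported precisely while still guarding against anything unexpected.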
They are all implemented as lossless and adaptive.

Dictionary based algorithms

This family of algorithms encodes variable-length strings into fixed-length codes, which are called tokens. The most popular algorithm in this group is the Lempel-Ziv-

For an interesting, non-trivial example, M should be $2^4$. The lowest possible value for M can be obtained from the inequality $M \ge 4(S + 2)$. The EOS symbol is used to terminate the input message. Adding an artificial symbol to the alphabet changes the real symbol frequencies, thus lowering the compression ratio, but by using it, the message size does not have to be appended to the message and incremental transmission becomes possible. The encoding and decoding processes are given in Tables 3-2 and 3-3.

Table 3-2. Arithmetic encoding process for alphabet S and sequence a b a EOS.
Symbol  Current interval  Action                  Output     Subintervals: a / b / EOS
Start   [0, 16)                                              [0, 9) / [9, 14) / [14, 16)
a       [0, 9)                                               [0, 5) / [5, 8) / [8, 9)
b       [5, 8)            Expand x to 2x          0
        [10, 16)          Expand x to 2(x - M/2)  1
        [4, 16)                                              [4, 11) / [11, 14) / [14, 16)
a       [4, 11)           Expand x to 2(x - M/4)  Underflow
        [0, 14)                                              [0, 8) / [8, 12) / [12, 14)
EOS     [12, 14)          Expand x to 2(x - M/2)  10
        [8, 12)           Expand x to 2(x - M/2)  1
        [0, 8)            Expand x to 2x          0
        [0, 16)

Table 3-3. Arithmetic decoding process for alphabet S and bit sequence $011010_2$.
Value           Current interval  Action                  Subintervals: a / b / EOS      Output symbol
$0110_2 = 6$    [0, 16)                                   [0, 9) / [9, 14) / [14, 16)    a
                [0, 9)                                    [0, 5) / [5, 8) / [8, 9)       b
                [5, 8)            Expand x to 2x
$1101_2 = 13$   [10, 16)          Expand x to 2(x - M/2)
$1010_2 = 10$   [4, 16)                                   [4, 11) / [11, 14) / [14, 16)  a
                [4, 11)           Expand x to 2(x - M/4)
$1100_2 = 12$   [0, 14)                                   [0, 8) / [8, 12) / [12, 14)    EOS

For a given model, the arithmetic coding algorithm is optimal and outperforms the Huffman method. It is important to note that implementations of the arithmetic coding

Table 5-1. Compression results
Method      Source data (bytes)  Comp. data (bytes)  Comp. ratio  Comp. time (write)  Comp. time (w/o write)
None        878,090              878,090             0%           45s                 2s
Huffman     878,090              544,136             38.03%       75s                 62s
Arithmetic  878,090              546,895             37.72%       99s                 83s

The compression time for both algorithms is comparable with a simple memory access and is around 0.02s/record. Both algorithms have similar compression ratios. The expected result was that the arithmetic coder algorithm would have a better compression ratio than the Huffman coder algorithm, but in this case that is not true. There are many factors that can influence the compression results, with the main one being the integer arithmetic approximation.

Decompression Results

Decompression results are shown in Table 5-2.

Table 5-2. Decompression results
Method      Comp. data (bytes)  Decomp. data (bytes)  Decomp. time (write)  Decomp. time (w/o write)
None        878,090             878,090               45s                   2s
Huffman     544,136             878,090               128s                  110s
Arithmetic  546,895             878,090               812s                  799s

Decompression is slightly slower than compression in the case of the Huffman coder algorithm. This result is expected, because the decoder must traverse the Huffman tree to look up a symbol, which is a slower operation than finding a symbol-code pair in the compression procedure. In the case of the arithmetic coder algorithm, the decompression is much slower than the compression, by a factor of 10. This is due to the more complex decompression algorithm. Also, some optimization could help in narrowing the gap.

Figure 4-3. The UML sequence diagram for the compression process using the StatHuffman class.
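The "Comp. ratio" column of Table 5-1 can be reproduced directly from the byte counts as 1 minus compressed size over source size. A hypothetical helper for checking, not part of CLP:

```cpp
#include <cassert>

// Compression ratio as reported in Table 5-1: the fraction of the source
// that was eliminated, i.e. 1 - compressed / original.
double compressionRatio(double compressedBytes, double sourceBytes) {
    return 1.0 - compressedBytes / sourceBytes;
}
```

Plugging in the Huffman and arithmetic figures (544,136 and 546,895 bytes against 878,090) recovers the reported 38.03% and 37.72%.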
Test Results

Symbol Distribution

The header files were generated from the source data by the CLP library for a PC. Both the Huffman and arithmetic coder headers are identical, with the symbol frequencies shown in Figure 5-1.

Figure 5-1. Symbol frequency distribution.

The data distribution in the graph meets expectations: because the source text consists of ASCII strings in the English language, the SPACE symbol, digits, and the characters a-z and A-Z have the highest frequencies.

Compression Results

Palm devices use FLASH memory to store all databases and applications. FLASH memory is faster for reading than it is for writing (by a factor of 10). That is why there are two sets of results for the compression test. One includes writing results to storage memory and the other does not. They are both shown in Table 5-1.

Organization of Thesis

This thesis consists of six chapters. Chapter 2 introduces compression theory and algorithms. Chapter 3 gives an in-depth description of the implemented algorithms. Chapter 4 discusses the implementation details from a programmer's perspective. Chapter 5 addresses tests and results of the compression/decompression capabilities of the CLP library on the basis of test data and a sample application. Finally, Chapter 6 presents future work and conclusions.

CHAPTER 3
DESCRIPTION OF THE IMPLEMENTED ALGORITHMS

This chapter describes the two algorithms used to implement CLP. These are the static Huffman and arithmetic algorithms. As mentioned in the previous chapter, both belong to the class of greedy algorithms. Each tries to assign the shortest bit-sequence to the symbol with the highest frequency or the lowest entropy. We start by examining the Huffman algorithm.

The Huffman's Coding Algorithm

Assume an input alphabet, $s_1, s_2, \ldots, s_m$, with corresponding frequencies, $f_1 \ge f_2 \ge \ldots \ge f_m$, for each symbol. To compress or decompress data using Huffman coding, symbols and frequencies must be transformed into code words. This can be accomplished by building a Huffman's tree.

Building a Huffman's Tree

Each leaf node in the Huffman's tree is assigned one of the symbols from the input alphabet. Symbols $s_m$ and $s_{m-1}$ become siblings in the tree, with a common parent node, $s_p$, and frequency $f_m + f_{m-1}$. In each iteration of the algorithm, the two nodes from the set of leaf and intermediate nodes with the least weights that have not yet been used as siblings are paired as siblings and assigned to a parent node whose weight becomes the sum of the siblings' weights. The algorithm stops when the last two siblings are paired. The final parent, with a weight of 1, is the root of the tree.

Figure 4-5. The UML sequence diagram for the decompression process using the StatArithmetic class.

How to Deploy the CLP Library

To use the CLP library, the application programmer must (1) add the library file to the path accessible by the linker and (2) include the StatHuffman.h or/and StatArithmetic.h files into his project. Both the StatHuffman and StatArithmetic header files will automatically include the necessary underlying header files and libraries into the project.

CHAPTER 2
COMPRESSION THEORY AND ALGORITHMS

This chapter introduces compression theory and explains the difference between static and adaptive modeling. Lossy and lossless data compression approaches are explained. Algorithms that are currently used in commercial compression applications or libraries are reviewed.
Compression Theory

Compression theory is tightly related to Information Theory. Information Theory is a branch of mathematics started in 1948 from the work of Claude Shannon at Bell Labs [3]. It deals with various questions about information. Data compression is interested in information redundancy. Data containing redundant information take extra bits to store. Eliminating the extra bits reduces data size, hence freeing more memory or communication channel bandwidth. To find the amount of redundancy in the data, Information Theory introduces the concept of entropy as a measure of how much information is encoded in the data. High entropy identifies that the data set has low redundancy, while low entropy points to a highly redundant data set, which is consistent with the thermodynamic definition of entropy. Let (S, P) be some unspecified finite probability space. If symbol/event $Y \subset S$, the self-information, or information contained in Y, is $I(Y) = -\log_2 P(Y) = \log_2 \frac{1}{P(Y)}$. This equation also defines the number of bits needed to encode symbol Y and its entropy. If the probability of Y is high, then the number of bits needed to encode Y is low.

The Compression Procedure

The compression algorithm is fairly simple. It consists of four steps:
1. Scan the input stream and obtain the frequencies for each symbol. Save that information.
2. Build the Huffman's tree and extract the code words.
3. Input a symbol from the stream and output its code.
4. Repeat step 3 until the end of the stream.

The Decompression Procedure

The decompression algorithm is also simple, consisting of five steps:
1. Build the Huffman's tree from the saved information.
2. Move the pointer to the root node.
3. For each input bit follow the corresponding label down the tree, until a leaf node is reached.
4. Output the symbol at that leaf node.
5. Repeat the algorithm from step 2, until the end of the stream.
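The tree-building loop behind step 2 of the compression procedure (repeatedly pairing the two least-weight nodes under a new parent) can be sketched in C++. This is an illustrative sketch, not CLP's code; it uses integer counts instead of fractional frequencies and returns each symbol's code length, i.e., its leaf depth in the tree:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Build a Huffman tree over integer symbol weights and return each symbol's
// code-word length (its depth in the tree).
std::vector<int> huffmanCodeLengths(const std::vector<long>& weights) {
    struct Node { long w; int left, right; };
    std::vector<Node> nodes;
    using Item = std::pair<long, int>;                       // (weight, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        nodes.push_back({weights[i], -1, -1});               // leaves first
        pq.push({weights[i], static_cast<int>(i)});
    }
    while (pq.size() > 1) {                                  // pair two least weights
        Item a = pq.top(); pq.pop();
        Item b = pq.top(); pq.pop();
        nodes.push_back({a.first + b.first, a.second, b.second});
        pq.push({a.first + b.first, static_cast<int>(nodes.size()) - 1});
    }
    // Walk down from the root; a leaf's depth is its code length.
    std::vector<int> len(weights.size(), 0);
    std::vector<std::pair<int, int>> stack{{static_cast<int>(nodes.size()) - 1, 0}};
    while (!stack.empty()) {
        auto [idx, depth] = stack.back();
        stack.pop_back();
        if (nodes[idx].left < 0)
            len[idx] = depth;                                // leaf: original symbol
        else {
            stack.push_back({nodes[idx].left, depth + 1});
            stack.push_back({nodes[idx].right, depth + 1});
        }
    }
    return len;
}
```

For the alphabet of Table 3-1, with counts 10, 6, 5, 3, 3, 3 out of 30, this yields code lengths 2, 2, 3, 3, 3, 3, matching the average of 2.47 bits per symbol quoted earlier.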
All code words produced from the Huffman's tree have the property that no code word is the prefix of another code word. The decompression procedure uses this property to traverse the tree from the root node to a leaf and to find the corresponding symbol without having to resolve any ambiguities.

Summary

For a given model, the Huffman coding is optimal among the probabilistic methods that replace source symbols with an integer number of bits. However, it is not optimal in the sense of entropy. As an example, consider an alphabet consisting of two symbols. Regardless of the probabilities, the Huffman coder will assign a single bit to each of the symbols, giving no compression.

CHAPTER 6
CONCLUSIONS

This chapter reviews the key concepts addressed in this thesis. The important issues of the CLP library are discussed, and future improvements over the current system are addressed as well.

Overview of the CLP Library

CLP is a simple, easily expandable compression library for PDA devices running Palm OS. It implements well-known algorithms that are used in most commercial compression applications because of their good performance. The compression algorithms' complex implementation is well hidden behind a simple interface exposed by the CLP library. This interface enables the application programmer to effortlessly deploy the CLP library. New compression algorithms can be added to the CLP library without changing the existing library code. The C++ class hierarchy enforces the existing interface onto the new implementations, hence demanding only minor changes in the application code.

Future CLP Library Improvements

The current implementation of the CLP library leaves much room for improvement.

Static vs. Shared Libraries

The CLP library is implemented as a statically linked library (SLL). A copy of the SLL is added to each application that uses it. Every application on the same PDA device must have its own separate copy of the library.
This clearly wastes memory, which is not a good practice on a PDA device.

The Arithmetic Coding Algorithm

Assume an input alphabet, $s_1, s_2, \ldots, s_m$, with corresponding frequencies, $f_1 \ge f_2 \ge \ldots \ge f_m$, for each symbol. Using the arithmetic coding algorithm, the entire source text, composed of symbols from the alphabet S, is assigned a code word determined by the process described below. Each source symbol, $s_i$, is assigned a subinterval, A(i), in the interval [0, 1). The subintervals, A(1), A(2), ..., A(m), are disjoint subintervals of [0, 1). The length of A(i) is proportional to $f_i$. Having determined the interval A, the arithmetic coder chooses a number $r \in A$ and represents the source text with some finite segment of the binary expansion of r. The smaller the interval A(i) is, the farther out the decoder has to go in the binary expansion to decode the source symbol. For this reason, symbols with a higher frequency, and hence a longer interval, have shorter binary expansions, resulting in better compression.

The Arithmetic Coding Procedure

The arithmetic coding algorithm is more complex than the Huffman coding algorithm, both in the number of steps and in the calculation complexity. The steps are:
1. A current interval [L, H) is initialized as [0, 1) and is maintained at each step. An underflow count is initialized to 0 and is maintained to the end of the file.
2. (Underflow condition) If the current interval satisfies $\frac{1}{4} \le L < \frac{1}{2} \le H \le \frac{3}{4}$, then expand it to $[2(L - \frac{1}{4}), 2(H - \frac{1}{4}))$ and add 1 to the underflow count.
3. (Shift condition) If $[L, H) \subset [0, \frac{1}{2})$, then output 0 and any pending underflow bits (which are all 1), and expand the current interval to $[2L, 2H)$. If $[L, H) \subset [\frac{1}{2}, 1)$, then output 1 and any pending underflow bits (which are all 0), and expand the current interval to $[2L - 1, 2H - 1)$. In either case, reset the underflow count to 0.
4. If none of the conditions in steps 2 or 3 hold, then divide the current interval into disjoint subintervals $[L_i, L_{i+1})$ corresponding to the symbols $s_i \in S$, with lengths proportional to the symbol frequencies.
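Under the integer implementation with M = 16 used in Tables 3-2 and 3-3, the shift and underflow steps above amount to a renormalization loop over the scaled interval [L, H). The sketch below is illustrative; names and signatures are not CLP's:

```cpp
#include <cstdint>
#include <vector>

// Renormalize an integer interval [L, H) scaled to [0, M), emitting the bits
// produced by the shift condition and counting deferred underflow bits.
std::vector<int> renormalize(std::uint32_t& L, std::uint32_t& H,
                             int& underflow, std::uint32_t M = 16) {
    std::vector<int> out;
    const std::uint32_t half = M / 2, quarter = M / 4;
    for (;;) {
        if (H <= half) {                            // interval in [0, 1/2): output 0
            out.push_back(0);
            while (underflow > 0) { out.push_back(1); --underflow; }
            L *= 2; H *= 2;
        } else if (L >= half) {                     // interval in [1/2, 1): output 1
            out.push_back(1);
            while (underflow > 0) { out.push_back(0); --underflow; }
            L = 2 * (L - half); H = 2 * (H - half);
        } else if (L >= quarter && H <= 3 * quarter) {
            ++underflow;                            // straddles 1/2: defer the bit
            L = 2 * (L - quarter); H = 2 * (H - quarter);
        } else {
            break;                                  // neither condition holds
        }
    }
    return out;
}
```

Starting from the interval [5, 8) after encoding 'b', the loop emits 0 then 1 and leaves [4, 16), reproducing the corresponding rows of Table 3-2.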


PAGE 1 COMPRESSION LIBRARY FOR PALM OS PLATFORM By NEBOJA IRI A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE UNIVERSITY OF FLORIDA 2003 PAGE 2 Copyright 2003 by Neboja iri" PAGE 3 Dedicated to my family. PAGE 4 ACKNOWLEDGMENTS I would like to thank my committee chair, Dr. Douglas Dankel II. It was a great pleasure to work with him and I gained valuable experience under his direction. I would also like to thank Dr. Sanjay Ranka and Dr. Joachim Hammer for being on my committee. Many thanks go to Mr. John Bowers for being a very helpful graduate secretary. He was always there to answer my questions and offer suggestions. Finally I would like to thank my wife Ivana for giving me support when I needed it the most. iv PAGE 5 TABLE OF CONTENTS Page LIST OF TABLES............................................................................................................vii LIST OF FIGURES.........................................................................................................viii ABSTRACT.......................................................................................................................ix 1 INTRODUCTION........................................................................................................1 Description of the Problem...........................................................................................1 Overview of CLP..........................................................................................................2 Organization of Thesis..................................................................................................4 2 COMPRESSION THEORY AND ALGORITHMS....................................................5 Compression Theory.....................................................................................................5 Static and Adaptive 
Modeling......................................................................6 Static Modeling.....................................................................................................6 Adaptive Modeling................................................................................................8 Lossy and Lossless algorithms.....................................................................................9 Compression Algorithms............................................................................................10 Specialized Algorithms.......................................................................................10 Run length encoding (RLE).........................................................................10 JPEG and MPEG algorithms........................................................................11 MP3 and WMA algorithms..........................................................................11 Generic Algorithms.............................................................................................11 Dictionary based algorithms.........................................................................11 Statistical model based algorithms...............................................................12 3 DESCRIPTION OF THE IMPLEMENTED ALGORITHMS...................................14 The Huffman's Coding Algorithm.............................................................................14 Building a Huffman's Tree..................................................................................14 The Compression Procedure................................................................................16 The Decompression Procedure............................................................................16 Summary..............................................................................................................16 The Arithmetic Coding
Algorithm.............................................................................17 The Arithmetic Coding Procedure.......................................................................17 Implementation with integer arithmetic..............................................................18 Summary.....................................................................................................................20 v PAGE 6 4 CLP LIBRARY IMPLEMENTATION......................................................................21 Class Hierarchy...........................................................................................................21 Compression Class Hierarchy.............................................................................21 BitIO Class Hierarchy.........................................................................................22 CLP Library Interface.................................................................................................22 BaseCompress Class Interface............................................................................24 StatisticCompress Class Interface.......................................................................24 StatHuffman and StatArithmetic Classes............................................................25 BitIO Class Interface...........................................................................................25 Methods Interaction....................................................................................................25 Error Handling............................................................................................................26 How to Deploy the CLP Library................................................................................30 5 COMPRESSION RESULTS......................................................................................31 Test 
Environment........................................................................................................31 Source Data.........................................................................................................31 Sample Application.............................................................................................32 Scripts..................................................................................................................32 Utilities................................................................................................................32 Test Results.................................................................................................................33 Symbol Distribution............................................................................................33 Compression Results...........................................................................................33 Decompression Results.......................................................................................34 Discussion...................................................................................................................35 6 CONCLUSIONS.........................................................................................................36 Overview of the CLP Library.....................................................................................36 Future CLP Library Improvements............................................................................36 Static vs. Shared Libraries...................................................................................36 C++ vs. 
C Language............................................................................37 1st-order and Adaptive Algorithms......................................................................38 APPENDIX: Source code listings.....................................................................................39 REFERENCES..................................................................................................................43 BIOGRAPHICAL SKETCH.............................................................................................45 vi PAGE 7 LIST OF TABLES Table page 3-1. Input alphabet for Huffman's coder...........................................................................15 3-2. Arithmetic encoding process for alphabet S and sequence a b a EOS.......................19 3-3. Arithmetic decoding process for alphabet S and bit sequence 01101₂......................19 5-1. Compression results....................................................................................................34 5-2. Decompression results................................................................................................34 vii PAGE 8 LIST OF FIGURES Figure page 2-1. Static data compression diagram..................................................................................7 2-2. Static data decompression diagram..............................................................................7 2-3. Adaptive compression diagram....................................................................................8 2-4. Adaptive decompression diagram................................................................................9 2-5. Lossy compression example. A) High quality JPG image. B) Low quality JPG image....................................................................................................10 3-1. Huffman's tree and code words generated from input alphabet S.............................15 4-1.
The UML compression CLP library diagram.............................................................23 4-2. The UML bit input/output BitIO class diagram.........................................................24 4-3. The UML sequence diagram for the compression process using the StatHuffman class...............................................................................................27 4-4. The UML sequence diagram for the decompression process using the StatHuffman class...............................................................................................28 4-5. The UML sequence diagram for the compression process using the StatArithmetic class..............................................................................................29 4-6. The UML sequence diagram for the decompression process using the StatArithmetic class..............................................................................................30 5-1. Symbol frequency distribution...................................................................................33 viii PAGE 9 Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science COMPRESSION LIBRARY FOR PALM OS PLATFORM By Nebojša Ćirić May 2003 Chair: Dr. Douglas Dankel II Major Department: Computer and Information Science and Engineering This thesis presents the design and implementation of a compression library (CLP) for the Palm Operating System (OS) platform. Data compression is a well-researched field in information theory and there are numerous programs and libraries available for almost every platform and OS in the market. Porting an existing library from the UNIX or MS Windows OS to the Palm OS is a nontrivial task because of constraints imposed by the handheld platform's memory size and organization. That is why CLP implemented simple, yet effective algorithms with small memory requirements.
CLP contains two algorithms: the static Huffman compression and the static arithmetic compression. It would be easy to expand the library with new algorithms, if needed, because it was designed and implemented using an object-oriented approach. CLP uses a simple interface, exposing only a few methods to the user, thus decreasing deployment time and increasing productivity. ix PAGE 10 The library was specially designed with the Palm Database (PDB) format in mind. It allows a user to easily manipulate data in separate records, which is the preferred mode of operation on Palm OS. x PAGE 11 CHAPTER 1 INTRODUCTION This chapter begins with a description of the problem addressed within this thesis, followed by an overview of the compression library for the Palm OS platform (CLP) that was developed to solve this problem. The chapter concludes with a summary of the remaining chapters in this thesis. Description of the Problem The number of handheld devices is increasing rapidly every year because of constant price drops, increased functionality, and increasing demand. They are used everywhere, from hospitals to supermarkets and IT companies. The largest share, 75%, of the personal digital assistant (PDA) market is held by Palm OS based PDAs. Second, with a 25% market share, are PocketPC PDAs based on the MS Windows CE or Linux operating systems. Companies producing Palm PDAs are Palm, Visor, Handera, Sony, and Handspring. Companies producing MS Windows based PocketPC PDAs include Hewlett-Packard, Compaq, and Toshiba, while Sharp produces a Linux based system. The Palm platform covers a wide range of devices. The oldest or cheapest ones have only 2MB of main memory, a black and white screen, an RS232 serial link to a PC, and a 16MHz Motorola DragonBall processor. The newest and the most expensive ones have up to 16MB of main memory, CompactFlash (CF) or SmartMedia (SD) memory expansion slots, a color screen, a USB link to a PC, and a 133MHz Intel StrongArm CPU [1, 2].
They all share beaming capability. Beaming is the process of transferring data between PDA devices, laptops, or cell phones using the Infrared Data Association 1 PAGE 12 2 Protocol (IrDA) for short distances (usually less than 1m). Beaming speed is comparable to an RS232 serial connection speed (i.e., up to 115200 bits/second) [1, 2]. PDAs are mostly used as personal organizers and for reading e-documents, e-mails, and e-books downloaded from a PC. These documents can be several megabytes in length, requiring significant time to download through an RS232 or IrDA connection. Large documents can also decrease the amount of free main memory by 20-30% per document. Text documents, as opposed to binary files, have good compression properties because of non-random word and character repetition. Using compression can reduce memory consumption and download time, thus reducing battery consumption and improving the overall response time of the system. The top compression algorithms can reduce the size of text documents by up to 90%, but with the tradeoffs of increased algorithm complexity/processing time and memory consumption. This thesis proposes an algorithm with similar performance for a 16MHz Palm Vx with 4MB of memory or a 133MHz Tungsten device with 16MB of main memory. Overview of CLP CLP, the compression library for Palm, is a general compression library developed to solve the problem described above. It provides a simple interface for application developers to easily compress and decompress data on PDA devices or a PC. CLP can be effortlessly extended with new compression algorithms because of its modular, object-oriented design. Using interfaces hides implementation complexity from the application and allows library changes and improvements without a need to change the application itself. PAGE 13 3 A Palm OS device keeps its data in a specific file format known as a Palm Database File (PDB) [1, 2].
This is a record-based, random access file residing in main memory or on a memory card. The PDB format has a limit of 65535 records per file. Each record can be up to 64KB long and of variable size [1, 2]. The new Palm devices are introducing regular files without these limitations, but the huge number of legacy applications slows the transition. Two general categories of compression techniques are used: adaptive and statistical. In this environment adaptive compression techniques do not yield good results because the size of each record is too small for these methods to effectively gather statistics about data before compressing. As a result, CLP uses only a statistical compression method. Palm OS devices, especially older ones, have a small amount of dynamic memory (on the order of a few kilobytes) that is shared with the TCP/IP stack, global variables, etc. [1, 2] This limits the number and the size of the statistical tables that a compression algorithm can hold in memory at one time. As a result, CLP implements only the 0th order statistical compression methods. But even with all these limitations, good compression ratios and compression speeds are obtainable using the current implementation of CLP. The CLP library can be used on any device, including PCs, provided that the C++ compiler and Standard Template Library are available. This enables the user to preprocess data on a more powerful platform and just transfer them to the PDA device for later use. PAGE 14 4 Organization of Thesis This thesis consists of six chapters. Chapter 2 introduces compression theory and algorithms. Chapter 3 gives an in-depth description of the implemented algorithms. Chapter 4 discusses the implementation details from a programmer's perspective. Chapter 5 addresses tests and results of the compression/decompression capabilities of the CLP library on the basis of test data and a sample application. Finally, Chapter 6 presents future work and conclusions.
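The record-by-record mode of operation described in this chapter can be sketched in code. The helper below is our own illustration (the function name and signature are not part of the CLP interface); it only shows how a large document would be split to respect the 64KB-per-record PDB limit cited above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// PDB files hold up to 65535 records of at most 64KB each, so a large
// document must be split and compressed record by record. This helper
// performs only the split; compression of each record would follow.
std::vector<std::string> split_into_records(const std::string& data,
                                            std::size_t max_record = 64 * 1024) {
    std::vector<std::string> records;
    for (std::size_t off = 0; off < data.size(); off += max_record)
        records.push_back(data.substr(off, max_record));
    return records;
}
```

A 100000-byte document, for example, would be split into one full 65536-byte record and one 34464-byte remainder, each compressed independently.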
PAGE 15 CHAPTER 2 COMPRESSION THEORY AND ALGORITHMS This chapter introduces compression theory and explains the difference between static and adaptive modeling. Lossy and lossless data compression approaches are explained. Algorithms that are currently used in commercial compression applications or libraries are reviewed. Compression Theory Compression Theory is tightly related to Information Theory. Information Theory is a branch of mathematics started in 1948 from the work of Claude Shannon at Bell Labs [3]. It deals with various questions about information. Data compression is interested in information redundancy. Data containing redundant information takes extra bits to store. Eliminating the extra bits reduces data size, hence freeing more memory or communication channel bandwidth. To find the amount of redundancy in the data, Information Theory introduces the concept of entropy as a measure of how much information is encoded in the data. High entropy indicates that the data set has low redundancy, while low entropy points to a highly redundant data set, which is consistent with the thermodynamics definition of entropy. Let (S, P) be some unspecified finite probability space. For a symbol/event Y ∈ S, the self-information, or information contained in Y, is I(Y) = log(1/P(Y)) = -log P(Y). This equation also defines the number of bits needed to encode symbol Y and its entropy. If the probability of Y is high, then the number of bits needed to encode Y is low. 5 PAGE 16 6 The number of bits needed to encode the whole message M is simply a sum of the code lengths for each symbol contained in the message: I(M) = Σi Ci·I(Yi) = -Σi Ci·log P(Yi), where Ci is the count of the symbol Yi in the message M [4]. The probability of the symbol depends on the model we choose to describe the data set. This means that the entropy of a message or symbol is not an absolute, unique value. As a result, models that predict symbols with high probability are good for a data compression system.
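The self-information and message-information formulas above can be illustrated with a short standalone sketch (the function names are ours, not CLP code); probabilities are estimated from the message itself, i.e., a 0th-order model:

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Self-information of a symbol with probability p: I = -log2(p) bits.
double self_information(double p) {
    return -std::log2(p);
}

// Total information content of a message M: the sum over distinct
// symbols Y of count(Y) * I(Y), with P(Y) estimated from the message.
double message_information(const std::string& msg) {
    std::map<char, std::size_t> counts;
    for (char c : msg) ++counts[c];
    double bits = 0.0;
    for (const auto& entry : counts) {
        double p = static_cast<double>(entry.second) / msg.size();
        bits += entry.second * self_information(p);
    }
    return bits;
}
```

For the message "aaab", P(a) = 3/4 and P(b) = 1/4, so I(M) = 3·log2(4/3) + 1·2 ≈ 3.25 bits, versus the 32 bits of a plain 8-bit encoding.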
After data is modeled it is encoded with a proper number of bits. If the entropy of the symbol was 3.5 bits, then that symbol should be encoded with 3.5 bits. Some encoding schemes (e.g., the Huffman scheme) decrease compression by rounding up the number of bits to boost the processing speed. Static and Adaptive Modeling There are two approaches to model data: static and adaptive. The static method was developed first but is now abandoned in favor of the adaptive method. A description of each of the methods follows [5]. Static Modeling The static model first gathers statistical information about each symbol in a table, by scanning the data once and counting the symbol frequencies in the data set. The resulting model is used for data encoding and decoding. For the encoder and decoder to be compatible, they must share the same model, which means that the table has to be transmitted to both the encoder and the decoder. Data compression and decompression with the static model are shown in Figure 2-1 and Figure 2-2. PAGE 17 7 Figure 2-1. Static data compression diagram. Figure 2-2. Static data decompression diagram. Depending on the nature of the data set, the probability of the adjacency of two or more specific symbols in the message can be independent, for binary files, or dependent with high or low values, for text files. The encoding model can try to predict such probabilities, thus increasing the compression ratio. The number of adjacent symbols defines the order of the model. A 0th-order model assumes that the symbol position in the message is independent of the position of other symbols. A 1st-order model assumes that two adjacent symbols are dependent, etc. For a 0th-order model there is only one table with symbol counts. For the standard ASCII character set, the table would have 256 entries. With the 0th-order model each symbol is assumed independent from the other symbols. As the order of the model increases, the number of tables increases.
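The single 0th-order table just described can be gathered in one pass over the data. The sketch below uses our own naming (it is not the CLP table code) and shows the one-table case before higher orders are considered:

```cpp
#include <array>
#include <cstddef>
#include <string>

// Build a 0th-order model: one table with 256 entries, one count per
// possible byte value. A 1st-order model would instead keep 256 such
// tables, one per preceding symbol.
std::array<std::size_t, 256> build_order0_table(const std::string& data) {
    std::array<std::size_t, 256> counts{};  // zero-initialized
    for (unsigned char byte : data) ++counts[byte];
    return counts;
}
```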
In the case of the 1st-order model there are 256 tables with 256 entries each. With the 1st-order model some relation between symbols is assumed; for example, there is a high probability of the PAGE 18 8 character u after the character q in the English language, but a very low probability of w after b. This approach yields better compression, but the overhead of keeping larger tables is often too expensive. The requirement of keeping the model table with the data to decode that data is the reason why static modeling has been practically abandoned in modern compression theory. Adaptive Modeling The adaptive algorithm does not have to scan the data to generate statistics. Instead, the model is constantly updated as new symbols are read and coded. This is true for both the encoding and decoding processes and means that the table does not have to be saved with the data to be decoded. The compression and decompression models are shown in Figure 2-3 and Figure 2-4. Figure 2-3. Adaptive compression diagram. PAGE 19 9 Figure 2-4. Adaptive decompression diagram. The problem with the adaptive model is that it starts with an empty table, so the compression is very low in the beginning. Most adaptive algorithms adjust to the input stream after a few thousand symbols, resulting in a good compression ratio. Also, they are able to adapt to changes in the nature of the input stream, like when the data changes from text to an image. Lossy and Lossless algorithms Data can be compressed with or without loss of information. Lossy compression is mostly used for drastically reducing the size of images and sound files, because the human senses are not very sensitive to small changes in quality. The JPG format is able to reduce an image size by examining adjacent pixels and making ones that appear similar the same, thus reducing the entropy and increasing the compression. Images A and B in Figure 2-5 are saved using the highest and lowest JPG quality.
Note that while image A is twice as large as B, the quality of picture B is not drastically different. Lossless compression is used for all other types of data where accuracy is mandatory. Some of the most widely used lossless algorithms are described below, in the generic algorithms section. PAGE 20 10 Figure 2-5. Lossy compression example. A) High quality JPG image. B) Low quality JPG image. Compression Algorithms Compression algorithms can be divided into two groups: generic and specialized. Generic algorithms are able to compress any type of information with good but not perfect results. Specialized algorithms are very good at compressing some types of information, like images or sound, but have poor results for other types of data. Specialized Algorithms Some well-known specialized algorithms are reviewed in this section. Run length encoding (RLE) The run length encoding algorithm [5] is mostly used with bitmap images (e.g., black and white images) where symbols (pixels) with the same value are often found in contiguous streams. The stream can then be replaced with a count and a single copy of the repeated symbol. PAGE 21 11 JPEG and MPEG algorithms The Joint Photographic Experts Group (JPEG) [6] and the Moving Pictures Expert Group (MPEG) committees proposed two lossy algorithms for image and moving picture compression, respectively. These algorithms make two passes over the data. The first pass converts the data into the frequency domain, using FFT-like algorithms. Once transformed, the data are smoothed by rounding off the peaks, resulting in a loss of information. In the second pass, the data are compressed using one of the lossless algorithms. Compression ratios can be very high with acceptable quality degradation, as shown in Figure 2-5. MP3 and WMA algorithms MP3 and Windows Media Audio (WMA) [7] are lossy algorithms used for audio signal compression. Compression ratios are high, with a reduction in size by a factor of 10 or more with low distortion of the audio signal.
These algorithms use the fact that the human ear cannot hear some frequencies if they are masked by other frequencies. They eliminate those hidden frequencies, hence reducing the size of the signal. The encoding procedure is highly CPU intensive but decoding is not. This property is one of the reasons why the MP3 format is widely accepted for storing audio files. Generic Algorithms There are a few generic algorithms currently in use, like the (1) Dictionary, (2) Sliding Window [4, 5], (3) Huffman, and (4) arithmetic coding algorithms. They are all implemented as lossless and adaptive. Dictionary based algorithms This family of algorithms encodes variable-length strings into fixed length codes, which are called tokens. The most popular algorithm in this group is the Lempel-Ziv PAGE 22 12 Welch (LZW) algorithm [3, 4, 8, 9], which is used in almost every commercial compression utility or library. It parses the input stream and for each new phrase that it encounters, it adds a PAGE 23 13 point or integer arithmetic, thus decreasing the compression ratio. The arithmetic coding algorithm needs a powerful CPU for both the encoding and decoding process, which makes it less desirable on PDA devices than the Huffman coding algorithm. Both of these algorithms are covered in depth in the next chapter. PAGE 24 CHAPTER 3 DESCRIPTION OF THE IMPLEMENTED ALGORITHMS This chapter describes the two algorithms used to implement CLP. These are the static Huffman and Arithmetic algorithms. As mentioned in the previous chapter, both belong to a class of Greedy algorithms. Each tries to assign the shortest bit-sequence to the symbol with the highest frequency or the lowest entropy. We start by examining the Huffman Algorithm. The Huffman's Coding Algorithm Assume an input alphabet, s1, s2, ..., sm, with corresponding frequencies, f1 ≥ f2 ≥ ... ≥ fm, for each symbol. To compress or decompress data using Huffman coding, symbols and frequencies must be transformed into code words.
This can be accomplished by building a Huffman's tree. Building a Huffman's Tree Each leaf node in the Huffman's tree is assigned one of the symbols from the input alphabet. Symbols sm and sm-1 become siblings in the tree, with a common parent node, Sp, and frequency fm + fm-1. In each iteration of the algorithm, the two nodes from the set of leaf and intermediate nodes with the least weights that have not yet been used as siblings are paired as siblings and assigned to a parent node whose weight becomes the sum of the siblings' weights. The algorithm stops when the last two siblings are paired. The final parent, with a weight 1, is the root of the tree. 14 PAGE 25 15 Table 3-1. Input alphabet for Huffman's coder
Symbol  Frequency
S1      1/3
S2      1/5
S3      1/6
S4      1/10
S5      1/10
S6      1/10
To illustrate this process consider the following example. In Figure 3-1, the input alphabet S, defined in Table 3-1, is transformed into a Huffman's tree. The average code word length is defined as l = Σ(i=1..m) fi·wi, where fi is the frequency and wi is the code length of symbol si. For the input alphabet S, l has decreased from 3 to 2.47. In an average case, for this simple alphabet, the compression ratio is 17.7%. Figure 3-1. Huffman's tree and code words generated from input alphabet S. PAGE 26 16 The Compression Procedure The compression algorithm is fairly simple. It consists of four steps: 1. Scan the input stream and obtain the frequencies for each symbol. Save that information. 2. Build the Huffman's tree and extract the code words. 3. Input a symbol from the stream and output its code. 4. Repeat step 3 until the end of the stream. The Decompression Procedure The decompression algorithm is also simple, consisting of five steps: 1. Build the Huffman's tree from the saved information. 2. Move the pointer to the root node. 3. For each input bit follow the corresponding label down the tree, until a leaf node is reached. 4. Output the symbol at that leaf node. 5.
Repeat the algorithm from step 2, until the end of the stream. All code words produced from the Huffman's tree have the property that no code word is the prefix of another code word. The decompression procedure uses this property to traverse the tree from the root node to the leaf and to find the corresponding symbol without having to resolve any ambiguities. Summary For a given model, the Huffman coding is optimal among the probabilistic methods that replace source symbols with an integer number of bits. However, it is not optimal in the sense of entropy. As an example, consider an alphabet consisting of two symbols. Regardless of the probabilities, the Huffman coder will assign a single bit to each of the symbols, giving no compression. PAGE 27 17 The Arithmetic Coding Algorithm Assume an input alphabet, s1, s2, ..., sm, with corresponding frequencies, f1 ≥ f2 ≥ ... ≥ fm, for each symbol. Using the arithmetic coding algorithm, the entire source text, composed of symbols from the alphabet S, is assigned a code word determined by the process described below. Each source symbol, si, is assigned a subinterval, A(i), in the interval [0, 1). The subintervals, A(1), A(2), ..., A(m), are disjoint subintervals of [0, 1). The length of A(i) is proportional to fi. Having determined the interval A, the arithmetic coder chooses a number r ∈ A and represents the source text with some finite segment of the binary expansion of r. The smaller the interval A(i) is, the farther out the decoder has to go with the binary expansion to decode the source symbol. For this reason, symbols with a higher frequency, or a longer interval, have shorter binary expansions, resulting in better compression. The Arithmetic Coding Procedure The arithmetic coding algorithm is more complex than the Huffman coding algorithm, both in the number of steps and in the calculation complexity. The steps are: 1. A current interval [L, H) is initialized as [0, 1) and is maintained at each step.
An underflow count is initialized at 0 and is maintained to the end of the file. 2. (Underflow condition) If the current interval satisfies 1/4 ≤ L < 1/2 ≤ H < 3/4, then expand it to [2(L - 1/4), 2(H - 1/4)) and add 1 to the underflow count. 3. (Shift condition) If [L, H) ⊆ [0, 1/2), then output 0 and any pending underflow bits (which are all 1), and expand the current interval to [2L, 2H). If [L, H) ⊆ [1/2, 1), then output 1 and any pending underflow bits (which are all 0), and expand the current interval to [2L - 1, 2H - 1). In either case, reset the underflow count to 0. 4. If none of the conditions in steps 2 or 3 hold, then divide the current interval into disjoint subintervals [Li, Li+1) corresponding to the symbols si ∈ S, with lengths PAGE 28 18
Also, L and H values can be represented as 0x0000 and 0xFFFF in the initial case, where the ellipsis means that 0s or 1s can be shifted in when needed, thus enabling 16 or 32-bit register arithmetic. The interval [0, M) is just an approximation of the interval [0, 1), with imprecise symbol frequencies, which leads to lower compression ratios. But using integer arithmetic drastically decreases the processing speed. To illustrate encoding and decoding process using the integer approximation algorithm, consider the following example. The source alphabet is S = {a, b, EOS}, the frequencies for each symbol are fa=6/10, fb=3/10, and fEOS=1/10, and the cumulative counts are Ca=6, Cb=9 and CEOS=10. The input sequence is: a b a EOS. To obtain an PAGE 29 19 interesting, non-trivial, example, M should be 24. The lowest possible value for M can be obtained from following inequality ).2(4#%SM The EOS symbol is used to terminate the input message. Adding an artificial symbol to the alphabet changes the real symbol frequencies, thus lowering the compression ratio, but by using it, the message size does not have to be appended to the message and incremental transmission becomes possible. The encoding and decoding processes are given in Tables 3-2 and 3-3. Table 3-2. Arithmetic encoding process for alphabet S and sequence a b a EOS. Subintervals Symbol Current interval a b EOS Output Start [0, 16) [0, 9) [9, 14) [14, 16) a [0, 9) [0, 5) [5, 8) [8, 9) b [5, 8) Expand x#2x, where x is current interval 0 [10, 16) Expand x to 2(x-M/2) 1 [4, 16) [4, 11) [11, 14) [14, 16) a [4, 11) Expand x to 2(x-M/4) Underflow [0, 14) [0, 8) [8, 12) [12, 14) EOS [12, 14) Expand x to 2(x-M/2) 10 [8, 12) Expand x to 2(x-M/2) 1 [0, 8) Expand x to 2x 0 [0, 16) Table 3-3. Arithmetic decoding process for alphabet S and bit sequence 011012. 
| Value | Current interval | Subinterval a | Subinterval b | Subinterval EOS | Output symbol |
|---|---|---|---|---|---|
| 0110₂ = 6 | [0, 16) | [0, 9) | [9, 14) | [14, 16) | a |
| | [0, 9) | [0, 5) | [5, 8) | [8, 9) | b |
| | [5, 8); expand x to 2x | | | | |
| 1101₂ = 13 | [10, 16); expand x to 2(x - M/2) | | | | |
| 1010₂ = 10 | [4, 16) | [4, 11) | [11, 14) | [14, 16) | a |
| | [4, 11); expand x to 2(x - M/4) | | | | |
| 1100₂ = 12 | [0, 14) | [0, 8) | [8, 12) | [12, 14) | EOS |

For a given model, the arithmetic coding algorithm is optimal and outperforms the Huffman method. It is important to note, however, that practical implementations of the arithmetic coding algorithm are less than optimal, due to integer arithmetic and other compromises. Since the Huffman coding algorithm is nearly optimal in many cases, the choice between the methods is not as simple as the theory suggests.

Summary

The two algorithms used for the development of CLP were described in this chapter. An in-depth theoretical analysis [4] and detailed implementations [5] can be found in the literature. The Huffman coding algorithm is lightweight and fast, but it is non-optimal. The arithmetic coding algorithm is CPU intensive, but in its pure form, it is optimal. Chapter 5 compares the results of both algorithms and identifies which one is better for PDA applications.

CHAPTER 4
CLP LIBRARY IMPLEMENTATION

This chapter discusses the implementation details from a programmer's perspective. These include (1) the class hierarchy, (2) the interface exposed by the compression library, (3) the proper call sequence to the class methods, (4) error handling, and (5) instructions on how to include the CLP library in a Palm OS application.

Class Hierarchy

The CLP library consists of two separate sets of classes. One consists of the main compression hierarchy, while the other holds the simple bit-stream class.

Compression Class Hierarchy

Different compression algorithms use different techniques for compressing data, but they all have a similar interface [13] to the application that uses them.
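In C++, such a shared interface is naturally expressed as an abstract base class. The sketch below uses the method names documented later in this chapter (Initialize, Compress, Decompress, GetResult, FlushBuffer), but the signatures and the trivial demonstration subclass are assumptions for illustration, not the actual CLP declarations.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Abstract compressor interface (names follow the thesis; signatures are
// assumed). Concrete coders override the pure virtual methods.
class BaseCompressSketch {
public:
    virtual ~BaseCompressSketch() {}
    virtual void Initialize() = 0;                 // set up the coding model
    virtual std::size_t Compress(const Bytes& in) = 0;
    virtual std::size_t Decompress(const Bytes& in) = 0;
    const Bytes& GetResult() const { return result_; }
    void FlushBuffer() { result_.clear(); }        // ready for the next record
protected:
    Bytes result_;
};

// Trivial stand-in "algorithm" showing how a concrete coder plugs in; a
// real subclass (e.g., a Huffman coder) would actually transform the data.
class IdentityCompress : public BaseCompressSketch {
public:
    void Initialize() override {}
    std::size_t Compress(const Bytes& in) override {
        result_ = in;
        return result_.size();
    }
    std::size_t Decompress(const Bytes& in) override {
        result_ = in;
        return result_.size();
    }
};
```

An application can then hold a pointer to the base class and swap algorithms without touching the calling code, which is the extensibility property described below.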
It is common practice to enforce such behavior by using a common virtual base class; in the case of the CLP library, that base class is BaseCompress. Static and adaptive compression algorithms update the coding model in different ways: the static compressor gathers statistical information before compression, while the adaptive one does so during the compression phase. As a result, two new classes are inherited from the BaseCompress class: StatisticCompress and AdaptiveCompress. Both are virtual classes, with the AdaptiveCompress class not being implemented. The Huffman and arithmetic compression classes are inherited from the StatisticCompress class. While they share the same interface, their internal implementations are different, reflecting the differences in their corresponding algorithms. A modular, object-oriented C++ [14] approach makes changes and additions to the library easy. If a new algorithm is needed, its class can be inherited from StatisticCompress or AdaptiveCompress without changing the existing code. Each class is declared in a separate header file, and its implementation is placed in a corresponding .cpp file, thus reducing the size of each module. The complete UML [15] compression class diagram is given in Figure 4-1.

BitIO Class Hierarchy

The BitIO class is a helper class that handles single- or multiple-bit stream operations. The OS file and memory functions are designed for byte access, but for compression purposes, the ability to store an arbitrarily sized bit sequence in memory is needed. The UML representation of the BitIO class is given in Figure 4-2.

CLP Library Interface

The compression library should provide a simple interface, enabling the user to easily send data to the library and receive the corresponding results from it. Each class in the hierarchy either expands the interface or implements its methods.

Figure 4-1. The UML compression CLP library diagram.

Figure 4-2.
The UML bit input/output BitIO class diagram.

BaseCompress Class Interface

The BaseCompress class provides the behaviors detailed below by exposing the following methods in its interface:

- Initialize: Initializes the compression/decompression engine for a given class. This is a pure virtual method [14].
- Compress/Decompress: Sends data to the CLP library for compression/decompression. These are pure virtual methods.
- GetResult: Extracts the result from the CLP library. This method is implemented in the BaseCompress class.
- FlushBuffer: Flushes the internal buffers and prepares the CLP library for the next compression sequence. This method is implemented in the BaseCompress class.

StatisticCompress Class Interface

The StatisticCompress class expands the base interface with methods operating on the statistical data and the resulting header:

- GetHeader: Returns the header with the statistical information that was generated by the Initialize call. This is a pure virtual method.
- GetHeaderSize: Returns the size of the header, so that the application can allocate enough memory to save it. This is a pure virtual method.
- SetHeader: Data can be preprocessed or even compressed on the PC. The header information generated through preprocessing can be read from a file and used for compression/decompression on the Palm platform, thus decreasing the overhead and increasing speed. This is a pure virtual method.

StatHuffman and StatArithmetic Classes

The StatHuffman and StatArithmetic classes implement the interface declared by the StatisticCompress parent class. They implement the corresponding Huffman and arithmetic algorithms.

BitIO Class Interface

The BitIO class implements a simple interface that allows the user to read/write a sequence of bits from/to a stream. The proper call sequence is outlined below:

1. OpenIO: Opens a new bit I/O stream connection.
2. Input/Output Bit/Bits: Inputs/outputs the bit sequence from/to the stream.
3.
FlushIO: Flushes the bits remaining in the buffer to the stream.
4. GetBuffer: Returns the resulting stream.
5. CloseIO: Closes the I/O stream.

Step 2 is repeated as long as there are bits to input or output.

Methods Interaction

To properly use the CLP library, the application must follow this method calling sequence for both the compression and decompression processes:

1. Initialize
2. SetHeader/GetHeader: Set the header if it is needed later, or get it if it was generated before.
3. Compress/Decompress: This call passes data to the library and returns the size of the result.
4. GetResult
5. FlushBuffer

Steps 3-5 are repeated for each record that the application intends to compress/decompress. A detailed calling sequence for the public and private functions, for both types of compression/decompression, is given as UML sequence diagrams in Figures 4-3, 4-4, 4-5, and 4-6.

Error Handling

The CLP library uses the C++ exception handling mechanism [14] to handle errors. There are only three types of exceptions that CLP throws:

- out_of_range is thrown if an array index is out of range.
- length_error is thrown if an array resize operation fails (i.e., there is insufficient memory).
- An unspecified exception is thrown if an unexpected error happens (i.e., some system exception).

All three types of exceptions are inherited from the C++ Standard Template Library (STL). The CLP library uses the STL's vector container type [14, 16] to implement the dynamic arrays used for holding the results generated during compression and decompression. A simple tester application shows the correct way of handling exceptions thrown by the CLP library in the application that uses it.

Figure 4-3. The UML sequence diagram for the compression process using the StatHuffman class.

Figure 4-4. The UML sequence diagram for the decompression process using the StatHuffman class.

Figure 4-5.
The UML sequence diagram for the compression process using the StatArithmetic class.

Figure 4-6. The UML sequence diagram for the decompression process using the StatArithmetic class.

How to Deploy the CLP Library

To use the CLP library, the application programmer must (1) add the library file to the path accessible by the linker and (2) include the StatHuffman.h and/or StatArithmetic.h files in the project. Both the StatHuffman and StatArithmetic header files automatically include the necessary underlying header files and libraries into the project.

CHAPTER 5
COMPRESSION RESULTS

This chapter presents a summary of the CLP library's performance on a Palm Pilot device, based on a sample program using the library to compress and decompress drug reference data [5]. The compression ratios and speeds are measured for both the Huffman and arithmetic coding algorithms. A comparison is made to determine which algorithm is more suitable for PDA use.

Test Environment

The following is a description of the sample application, scripts, utilities, source data, and testing procedure used.

Source Data

The initial data set was a text file containing ASCII strings delimited with new-line characters. The size of the text file is approximately 2300 strings, or 840KB. Using the first PHP script given in the Appendix, each string from the text file was inserted into a separate record in the PDB database. The CLP implementation on the PC is used to generate a header with statistical data for the compression and decompression procedures. The second PHP script, also given in the Appendix, is used to transfer binary information from the header file to the PDB database record on the PC. All generated PDB files and the sample application are loaded into the Palm OS device emulator program (POSE).

Sample Application

A sample application was written in the C++ language and was linked with the CLP library for Palm OS.
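The per-record loop of such an application, following the documented calling sequence (Initialize once, then Compress, GetResult, and FlushBuffer for each record), might be sketched as below. The codec type here is a hypothetical placeholder standing in for StatHuffman or StatArithmetic; only the method names come from the thesis, everything else is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical placeholder for a CLP codec; the real StatHuffman class
// exposes the same method names but its own implementation.
struct PlaceholderCodec {
    Bytes result;
    void Initialize() {}
    std::size_t Compress(const Bytes& in) { result = in; return in.size(); }
    const Bytes& GetResult() const { return result; }
    void FlushBuffer() { result.clear(); }
};

// Per-record loop following the documented calling sequence:
// Initialize once, then Compress -> GetResult -> FlushBuffer per record.
std::vector<Bytes> compressRecords(PlaceholderCodec& codec,
                                   const std::vector<Bytes>& records) {
    std::vector<Bytes> out;
    codec.Initialize();
    for (const Bytes& rec : records) {
        std::size_t size = codec.Compress(rec);  // returns the result size
        (void)size;
        out.push_back(codec.GetResult());
        codec.FlushBuffer();                     // prepare for the next record
    }
    return out;
}
```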
The application first compresses the input PDB database using the Huffman and arithmetic coding algorithms and logs compression times with an external profiler application. The compression results are written into new database files, one for each type of compression. The resulting databases are then decompressed using both algorithms, and decompression times are logged with the profiler.

Scripts

Scripts for converting data to the Palm PDB format [1] were written using the php-pdb [17] module. PHP script language interpreters are available for almost all OS platforms, so these scripts are highly portable [17].

Utilities

Two utility programs are used for testing: POSE and Palm Reporter. They are both part of the Palm OS Software Development Kit (SDK) [1]. POSE is able to emulate any Palm OS device if a ROM image for that device is available. It emulates the real speed of the device, so measurements taken on it are roughly equal to measurements taken on the real device. Palm Reporter is a stand-alone program able to connect to POSE and receive log messages. It is used to obtain compression and decompression speed information from the sample program running on POSE.

Test Results

Symbol Distribution

The header files were generated from the source data by the CLP library for a PC. Both the Huffman and arithmetic coder headers are identical, with the symbol frequencies shown in Figure 5-1.

Figure 5-1. Symbol frequency distribution.

The data distribution in the graph meets expectations: because the source text consists of ASCII strings in the English language, the SPACE symbol, digits, and the characters a-z and A-Z have the highest frequencies.

Compression Results

Palm devices use FLASH memory to store all databases and applications. FLASH memory is faster for reading than it is for writing (by a factor of 10). That is why there are two sets of results for the compression test: one includes writing the results to storage memory and the other does not.
Both sets are shown in Table 5-1.

Table 5-1. Compression results

| Compression method | Source data (bytes) | Comp. data (bytes) | Comp. ratio | Comp. time (write) | Comp. time (w/o write) |
|---|---|---|---|---|---|
| None | 878,090 | 878,090 | 0% | 45s | 2s |
| Huffman | 878,090 | 544,136 | 38.03% | 75s | 62s |
| Arithmetic | 878,090 | 546,895 | 37.72% | 99s | 83s |

The compression time for both algorithms is comparable to that of simple memory access and is around 0.02s/record. Both algorithms have similar compression ratios. The expected result was that the arithmetic coder would achieve a better compression ratio than the Huffman coder, but in this case that is not true. Many factors can influence the compression results, the main one being the integer arithmetic approximation.

Decompression Results

Decompression results are shown in Table 5-2.

Table 5-2. Decompression results

| Decompression method | Comp. data (bytes) | Decomp. data (bytes) | Decomp. time (write) | Decomp. time (w/o write) |
|---|---|---|---|---|
| None | 878,090 | 878,090 | 45s | 2s |
| Huffman | 544,136 | 878,090 | 128s | 110s |
| Arithmetic | 546,895 | 878,090 | 812s | 799s |

Decompression is slightly slower than compression in the case of the Huffman coder. This result is expected, because the decoder must traverse the Huffman tree to look up a symbol, which is slower than finding a symbol-code pair during compression. In the case of the arithmetic coder, decompression is much slower than compression, by a factor of 10. This is due to the more complex decompression algorithm; some optimization could help narrow the gap.

Discussion

Users of PDA devices expect PDA applications to have a quick response time and a small size. The test results clearly show that the Huffman coder algorithm can provide both, while the arithmetic coder algorithm falls short when speed is taken into account.

CHAPTER 6
CONCLUSIONS

This chapter reviews the key concepts addressed in this thesis.
The important issues of the CLP library are discussed, and future improvements over the current system are addressed as well.

Overview of the CLP Library

CLP is a simple, easily expandable compression library for PDA devices running Palm OS. It implements well-known algorithms that are used in most commercial compression applications because of their good performance. The compression algorithms' complex implementation is well hidden behind the simple interface exposed by the CLP library. This interface enables the application programmer to effortlessly deploy the CLP library. New compression algorithms can be added to the CLP library without changing the existing library code. The C++ class hierarchy enforces the existing interface onto the new implementations, hence demanding only minor changes in the application code.

Future CLP Library Improvements

The current implementation of the CLP library leaves much room for improvement.

Static vs. Shared Libraries

The CLP library is implemented as a statically linked library (SLL). A copy of the SLL is added to each application that uses it, so every application on the same PDA device must have its own separate copy of the library. This clearly wastes memory, which is not good practice on a PDA device. Shared or dynamically linked libraries (DLLs) offer a fix to this problem by keeping the library in a separate file, which can be loaded on demand by the application needing it. Then, only one copy of the library is kept in the system, thus reducing memory requirements. DLLs are not a perfect solution because:

1. They are highly system dependent.
2. They must provide backward compatibility with previous versions.

If condition 1 is acceptable and the library is well implemented, then a DLL approach should be chosen. Compression libraries are usually in high demand in the system, thus they are shared by many applications at the same time.
They are also system specific because they use low-level system properties to increase processing speed; hence, the CLP library should be implemented as a DLL in its next version.

C++ vs. C Language

The C++ language offers a rich set of libraries and language elements (i.e., strict type checking, exception handling, strings, dynamic arrays, and classes) to the developer. These properties increase productivity by enabling the programmer to focus more on the problem than on its implementation. Also, programs developed in an OO language can be easily expanded by adding new classes and reusing old ones. The problem with the C++ language is that this flexibility and power increase memory consumption and decrease speed, which can hurt performance on PDA devices. The C language is the language of choice for embedded systems programmers because of its small memory footprint and tight connection with the underlying system. It does not offer a rich set of language elements like C++, thus forcing the programmer to spend more time in the implementation and testing phases. The current size of the CLP library is 145KB. If it were developed in the C language, its size would decrease to approximately 30-40KB, and its dynamic memory consumption would be much smaller. This decrease in size happens because the C language does not use the STL library, which is quite large, and there is no object initialization overhead for virtual classes and methods. With some decrease in flexibility and expandability, the CLP library could be ported to the C language, reducing its size and memory footprint.

1st-order and Adaptive Algorithms

The current 0th-order algorithms can be replaced with 1st-order algorithms to increase compression ratios, though the problem of holding large statistical tables in memory must be solved first. If record-based compression/decompression is not needed, then adaptive algorithms can be used.
They would eliminate the need for header storage space and data preprocessing.

APPENDIX
SOURCE CODE LISTINGS

The source code for the CLP library and the test application is not included in the thesis. Only the PHP scripts mentioned in the thesis are included.

AppendString($string);
    $counter = $counter + 1;
    $test = $PDBFile->GoToRecord($counter);
}
echo "Test = ", $test, "\n";
$PCPDBFile = fopen("SourceDataDB.pdb", "wb");
if (!$PCPDBFile) {
    echo "Can't create SourceDataDB.pdb\n";
    exit;
}
$PDBFile->WriteToFile($PCPDBFile);
fclose($PCPDBFile);
fclose($fp);

// Get first header (for Static Huffman Compression)
echo "Get first header (for Static Huffman Compression)\n";
$fp = fopen("SHuffman.bin", "rb");
if (!$fp) {
    echo "Can't open SHuffman.bin\n";
    exit;
}
$PDBFile = new PalmDB('DATA', 'STRT', 'SHuffmanDB');
$Header = fread($fp, 256);
$PDBFile->AppendString($Header);
$PCPDBFile = fopen("SHuffmanDB.pdb", "wb");
if (!$PCPDBFile) {
    echo "Can't create SHuffmanDB.pdb\n";
    exit;
}
$PDBFile->WriteToFile($PCPDBFile);
fclose($PCPDBFile);
fclose($fp);

// Get second header (for Static Arithmetic Compression)
echo "Get second header (for Static Arithmetic Compression)\n";
$fp = fopen("SArithmetic.bin", "rb");
if (!$fp) {
    echo "Can't open SArithmetic.bin\n";
    exit;
}
$PDBFile = new PalmDB('DATA', 'STRT', 'SArithmeticDB');
$Header = fread($fp, 256);
$PDBFile->AppendString($Header);
$PCPDBFile = fopen("SArithmeticDB.pdb", "wb");
if (!$PCPDBFile) {
    echo "Can't create SArithmeticDB.pdb\n";
    exit;
}
$PDBFile->WriteToFile($PCPDBFile);
fclose($PCPDBFile);
fclose($fp);
?>

REFERENCES

[1] Palm Corp. Home Page, http://www.palmos.com, accessed 12/10/2002.
[2] L. R. Foster, Palm OS Programming Bible, IDG Books Worldwide, Inc., Foster City, CA, 2000.
[3] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, 27:379-423 and 623-656, 1948.
[4] D. Hankerson, G. A. Harris, P. D.
Johnson, Introduction to Information Theory and Data Compression, CRC Press, Boca Raton, FL, 1997.
[5] M. Nelson, J. L. Gailly, The Data Compression Book, M&T Books, New York, 1996.
[6] G. K. Wallace, "The JPEG still picture compression standard," Communications of the ACM, 34(4):32-44, April 1991.
[7] Microsoft Corp., Support Page for Windows Media Formats, http://support.microsoft.com/default.aspx?scid=/support/mediaplayer/wmptest/wmptest.asp#Windows%20Media, accessed 02/10/2003.
[8] T. Welch, "A technique for high-performance data compression," IEEE Computer, 17(6):8-19, June 1984.
[9] J. Ziv, A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, 23(3):337-343, May 1977.
[10] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the IRE, 40(9):1098-1101, September 1952.
[11] A. Moffat, R. Neal, I. H. Witten, "Arithmetic coding revisited," ACM Transactions on Information Systems, 16(3):256-294, July 1998.
[12] I. H. Witten, R. Neal, J. G. Cleary, "Arithmetic coding for data compression," Communications of the ACM, 30(6):520-540, June 1987.
[13] E. Gamma, R. Helm, R. Johnson, J. Vlissides, Design Patterns, Addison-Wesley, Reading, MA, 1994.
[14] B. Eckel, Thinking in C++, 2nd Edition, Volume 1, Prentice Hall, Upper Saddle River, NJ, 2000.
[15] C. Larman, Applying UML and Patterns, Prentice Hall PTR, Upper Saddle River, NJ, 2002.
[16] H. Schildt, C/C++ Programmer's Reference, 2nd Edition, Osborne/McGraw-Hill, Berkeley, CA, 2000.
[17] php-pdb Project Home Page, http://php-pdb.sourceforge.net, accessed 01/10/2003.

BIOGRAPHICAL SKETCH

Nebojša Ćirić was born in Belgrade, Republic of Serbia, Yugoslavia. He received his Bachelor of Science degree in computer science and engineering from the School of Electrical Engineering, Belgrade, Yugoslavia, in 1998. He worked at the Institute Mihajlo Pupin for seven months on his BS project as a programmer.
In August 1999, he moved to Ljubljana, Slovenia, to work at Hermes SoftLab, one of the largest software development companies in the country. He quit his job in Slovenia a year later to be with his wife in Gainesville, FL. He was accepted into the Computer and Information Science and Engineering Department at the University of Florida in August 2001. His research interests include object-oriented software development, artificial intelligence, and pattern recognition.