COMPRESSION LIBRARY FOR PALM OS PLATFORM

By

NEBOJSA CIRIC

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2003

Copyright 2003 by Nebojsa Ciric

Dedicated to my family.

ACKNOWLEDGMENTS

I would like to thank my committee chair, Dr. Douglas Dankel II. It was a great pleasure to work with him and I gained valuable experience under his direction. I would also like to thank Dr. Sanjay Ranka and Dr. Joachim Hammer for being on my committee. Many thanks go to Mr. John Bowers for being a very helpful graduate secretary. He was always there to answer my questions and offer suggestions. Finally, I would like to thank my wife Ivana for giving me support when I needed it the most.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

1 INTRODUCTION
    Description of the Problem
    Overview of CLP
    Organization of Thesis

2 COMPRESSION THEORY AND ALGORITHMS
    Compression Theory
    Static and Adaptive Modeling
        Static Modeling
        Adaptive Modeling
    Lossy and Lossless Algorithms
    Compression Algorithms
        Specialized Algorithms
            Run length encoding (RLE)
            JPEG and MPEG algorithms
            MP3 and WMA algorithms
        Generic Algorithms
            Dictionary based algorithms
            Statistical model based algorithms

3 DESCRIPTION OF THE IMPLEMENTED ALGORITHMS
    The Huffman's Coding Algorithm
        Building a Huffman's Tree
        The Compression Procedure
        The Decompression Procedure
        Summary
    The Arithmetic Coding Algorithm
        The Arithmetic Coding Procedure
        Implementation with integer arithmetic
        Summary

4 CLP LIBRARY IMPLEMENTATION
    Class Hierarchy
        Compression Class Hierarchy
        BitIO Class Hierarchy
    CLP Library Interface
        BaseCompress Class Interface
        StatisticCompress Class Interface
        StatHuffman and StatArithmetic Classes
        BitIO Class Interface
    Methods Interaction
    Error Handling
    How to Deploy the CLP Library

5 COMPRESSION RESULTS
    Test Environment
        Source Data
        Sample Application
        Scripts
        Utilities
    Test Results
        Symbol Distribution
        Compression Results
        Decompression Results
    Discussion

6 CONCLUSIONS
    Overview of the CLP Library
    Future CLP Library Improvements
        Static vs. Shared Libraries
        C++ vs. C Language
        1st-order and Adaptive Algorithms

APPENDIX: SOURCE CODE LISTINGS
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1. Input alphabet for Huffman's coder
3-2. Arithmetic encoding process for alphabet S and sequence a b a EOS
3-3. Arithmetic decoding process for alphabet S and bit sequence 01101 (binary)
5-1. Compression results
5-2. Decompression results

LIST OF FIGURES

2-1. Static data compression diagram
2-2. Static data decompression diagram
2-3. Adaptive compression diagram
2-4. Adaptive decompression diagram
2-5. Lossy compression example. A) High quality JPG image. B) Low quality JPG image
3-1. Huffman's tree and code words generated from input alphabet S
4-1. The UML compression CLP library diagram
4-2. The UML bit input/output BitIO class diagram
4-3. The UML sequence diagram for the compression process using the StatHuffman class
4-4. The UML sequence diagram for the decompression process using the StatHuffman class
4-5. The UML sequence diagram for the compression process using the StatArithmetic class
4-6. The UML sequence diagram for the decompression process using the StatArithmetic class
5-1. Symbol frequency distribution

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

COMPRESSION LIBRARY FOR PALM OS PLATFORM

By

Nebojsa Ciric

May 2003

Chair: Dr.
Douglas Dankel II
Major Department: Computer and Information Science and Engineering

This thesis presents the design and implementation of a compression library (CLP) for the Palm Operating System (OS) platform. Data compression is a well-researched field in information theory, and numerous programs and libraries are available for almost every platform and OS on the market. Porting an existing library from the UNIX or MS Windows OS to the Palm OS is a nontrivial task because of constraints imposed by the handheld platform's memory size and organization. For this reason, CLP implements simple, yet effective algorithms with small memory requirements. CLP contains two algorithms: static Huffman compression and static arithmetic compression. Because it was designed and implemented using an object-oriented approach, the library can easily be expanded with new algorithms if needed. CLP uses a simple interface, exposing only a few methods to the user, thus decreasing deployment time and increasing productivity. The library was specially designed with the Palm Database (PDB) format in mind. It allows a user to easily manipulate data in separate records, which is the preferred mode of operation on Palm OS.

CHAPTER 1
INTRODUCTION

This chapter begins with a description of the problem addressed within this thesis, followed by an overview of the compression library for the Palm OS platform (CLP) that was developed to solve this problem. The chapter concludes with a summary of the remaining chapters of this thesis.

Description of the Problem

The number of handheld devices is increasing rapidly every year because of constant price drops, increased functionality, and growing demand. They are used everywhere, from hospitals to supermarkets and IT companies. The largest share of the personal data assistant (PDA) market, 75%, is held by Palm OS based PDAs. The remaining 25% is held by PocketPC PDAs based on the MS Windows CE or Linux operating systems.
Companies producing Palm PDAs are Palm, Visor, Handera, Sony, and Handspring. Companies producing MS Windows based PocketPC PDAs include Hewlett-Packard, Compaq, and Toshiba, while Sharp produces a Linux based system.

The Palm platform covers a wide range of devices. The oldest or cheapest ones have only 2MB of main memory, a black and white screen, an RS-232 serial link to a PC, and a 16MHz Motorola DragonBall processor. The newest and most expensive ones have up to 16MB of main memory, CompactFlash (CF) or Secure Digital (SD) memory expansion slots, a color screen, a USB link to a PC, and a 133MHz Intel StrongARM CPU [1, 2]. They all share "beaming" capability. Beaming is the process of transferring data between PDA devices, laptops, or cell phones using the Infrared Data Association (IrDA) protocol over short distances (usually less than 1m). Beaming speed is comparable to an RS-232 serial connection speed (i.e., up to 115200 bits/second) [1, 2].

PDAs are mostly used as personal organizers and for reading e-documents, e-mails, and e-books downloaded from a PC. These documents can be several megabytes in length, requiring significant time to download through an RS-232 or IrDA connection. Large documents can also decrease the amount of free main memory by 20-30% per document. Text documents, as opposed to binary files, have good compression properties because of nonrandom word and character repetition. Using compression can reduce memory consumption and download time, thus reducing battery consumption and improving the overall response time of the system. The top compression algorithms can reduce the size of text documents by up to 90%, but with the tradeoffs of increased algorithm complexity/processing time and memory consumption. This thesis proposes an algorithm with similar performance for a 16MHz Palm Vx with 4MB of memory or a 133MHz Tungsten device with 16MB of main memory.
Overview of CLP

CLP, the compression library for Palm, is a general compression library developed to solve the problem described above. It provides a simple interface for application developers to easily compress and decompress data on PDA devices or a PC. CLP can be effortlessly extended with new compression algorithms because of its modular, object-oriented design. Using interfaces hides implementation complexity from the application and allows library changes and improvements without a need to change the application itself.

A Palm OS device keeps its data in a specific file format known as a Palm Database File (PDB) [1, 2]. This is a record based, random access file residing in main memory or on a memory card. The PDB format has a limit of 65535 records per file. Each record can be up to 64KB long and of variable size [1, 2]. The new Palm devices are introducing regular files without these limitations, but the huge number of legacy applications slows the transition.

Two general categories of compression techniques are used: adaptive and statistical. In this environment adaptive compression techniques do not yield good results because the size of each record is too small for these methods to effectively gather statistics about the data before compressing it. As a result, CLP uses only a statistical compression method. Palm OS devices, especially older ones, have a small amount of dynamic memory (on the order of a few kilobytes) that is shared with the TCP/IP stack, global variables, etc. [1, 2] This limits the number and the size of the statistical tables that a compression algorithm can hold in memory at one time. As a result, CLP implements only 0th-order statistical compression methods. Even with all these limitations, good compression ratios and compression speeds are obtainable using the current implementation of CLP. The CLP library can be used on any device, including PCs, provided that a C++ compiler and the Standard Template Library are available.
This enables the user to preprocess data on a more powerful platform and simply transfer it to the PDA device for later use.

Organization of Thesis

This thesis consists of six chapters. Chapter 2 introduces compression theory and algorithms. Chapter 3 gives an in-depth description of the implemented algorithms. Chapter 4 discusses the implementation details from a programmer's perspective. Chapter 5 addresses tests and results of the compression/decompression capabilities of the CLP library on the basis of test data and a sample application. Finally, Chapter 6 presents future work and conclusions.

CHAPTER 2
COMPRESSION THEORY AND ALGORITHMS

This chapter introduces compression theory and explains the difference between static and adaptive modeling. Lossy and lossless data compression approaches are explained. Algorithms that are currently used in commercial compression applications or libraries are reviewed.

Compression Theory

Compression theory is tightly related to information theory. Information theory is a branch of mathematics that started in 1948 with the work of Claude Shannon at Bell Labs [3]. It deals with various questions about information. Data compression is interested in information redundancy. Data containing redundant information takes extra bits to store. Eliminating the extra bits reduces the data size, hence freeing more memory or communication channel bandwidth. To find the amount of redundancy in the data, information theory introduces the concept of entropy as a measure of how much information is encoded in the data. High entropy identifies that the data set has low redundancy, while low entropy points to a highly redundant data set, which is consistent with the thermodynamic definition of entropy.

Let (S, P) be some unspecified finite probability space. If Y ∈ S is a symbol/event, the self-information, or information contained in Y, is

    I(Y) = -log2 P(Y) = log2 (1 / P(Y))

This equation also defines the number of bits needed to encode symbol Y and its entropy.
If the probability of Y is high, then the number of bits needed to encode Y is low. The number of bits needed to encode the whole message M is simply the sum of the code lengths of each symbol contained in the message, or

    I(M) = Σi Ci · I(Yi) = -Σi Ci · log2 P(Yi),

where Ci is the count of the symbol Yi in the message M [4]. The probability of a symbol depends on the model we choose to describe the data set. This means that the entropy of a message or symbol is not an absolute, unique value. As a result, models that predict symbols with high probability are good for a data compression system. After the data is modeled, it is encoded with the proper number of bits. If the entropy of a symbol is 3.5 bits, then that symbol should be encoded with 3.5 bits. Some encoding schemes (e.g., the Huffman scheme) decrease compression by rounding the number of bits up in order to boost processing speed.

Static and Adaptive Modeling

There are two approaches to modeling data: static and adaptive. The static method was developed first but has now been largely abandoned in favor of the adaptive method. A description of each method follows [5].

Static Modeling

The static model first gathers statistical information about each symbol in a table by scanning the data once and counting the symbol frequencies in the data set. The resulting model is used for data encoding and decoding. For the encoder and decoder to be compatible they must share the same model, which means that the table has to be transmitted to both the encoder and the decoder. Data compression and decompression with the static model are shown in Figure 2-1 and Figure 2-2.

Figure 2-1. Static data compression diagram.

Figure 2-2. Static data decompression diagram.

Depending on the nature of the data set, the probability of the adjacency of two or more specific symbols in the message can be independent, as in binary files, or dependent with high or low values, as in text files. The encoding model can try to predict such probabilities, thus increasing the compression ratio.
The number of adjacent symbols defines the order of the model. A 0th-order model assumes that a symbol's position in the message is independent of the positions of other symbols. A 1st-order model assumes that two adjacent symbols are dependent, etc. For a 0th-order model there is only one table with symbol counts. For the standard ASCII character set, the table would have 256 entries. With the 0th-order model each symbol is assumed independent of the other symbols. As the order of the model increases, the number of tables increases. In the case of the 1st-order model there are 256 tables with 256 entries each. With the 1st-order model some relation between symbols is assumed; for example, there is a high probability of the character "u" after the character "q" in the English language, but a very low probability of "w" after "b." This approach yields better compression, but the overhead of keeping a larger table is often too expensive. The requirement of keeping the model table with the data in order to decode that data is the reason why static modeling has been practically abandoned in modern compression theory.

Adaptive Modeling

The adaptive algorithm does not have to scan the data to generate statistics. Instead, the model is constantly updated as new symbols are read and coded. This is true for both the encoding and decoding processes and means that the table does not have to be saved with the data to be decoded. The compression and decompression models are shown in Figure 2-3 and Figure 2-4.

Figure 2-3. Adaptive compression diagram.

Figure 2-4. Adaptive decompression diagram.

The problem with the adaptive model is that it starts with an empty table, so the compression is very low in the beginning. Most adaptive algorithms adjust to the input stream after a few thousand symbols, resulting in a good compression ratio.
Also, they are able to adapt to changes in the nature of the input stream, such as when the data changes from text to an image.

Lossy and Lossless Algorithms

Data can be compressed with or without loss of information. Lossy compression is mostly used for drastically reducing the size of images and sound files, because the human senses are not very sensitive to small changes in quality. The JPG format is able to reduce an image's size by examining adjacent pixels and making ones that appear similar the same, thus reducing the entropy and increasing the compression. Images A and B in Figure 2-5 are saved using the highest and lowest JPG quality. Note that while image A is twice as large as B, the quality of picture B is not drastically different. Lossless compression is used for all other types of data where accuracy is mandatory. Some of the most widely used lossless algorithms are described below, in the generic algorithms section.

Figure 2-5. Lossy compression example. A) High quality JPG image. B) Low quality JPG image.

Compression Algorithms

Compression algorithms can be divided into two groups: generic and specialized. Generic algorithms are able to compress any type of information with good but not perfect results. Specialized algorithms are very good at compressing some types of information, like images or sound, but have poor results for other types of data.

Specialized Algorithms

Some well-known specialized algorithms are reviewed in this section.

Run length encoding (RLE)

The run length encoding algorithm [5] is mostly used with bitmap images (e.g., black and white images) where symbols (pixels) with the same value are often found in contiguous streams. Each stream can then be replaced with a count and a single symbol value, decreasing the image size. This method must be cleverly implemented to avoid data expansion, which can happen if the streams are short or the symbols alternate.
JPEG and MPEG algorithms

The Joint Photographic Experts Group (JPEG) [6] and the Moving Pictures Expert Group (MPEG) committees proposed two lossy algorithms for image and moving picture compression, respectively. These algorithms make two passes over the data. The first pass converts the data into the frequency domain using FFT-like algorithms. Once transformed, the data are "smoothed" by rounding off the peaks, resulting in a loss of information. In the second pass, the data are compressed using one of the lossless algorithms. Compression ratios can be very high with acceptable quality degradation, as shown in Figure 2-5.

MP3 and WMA algorithms

MP3 and Windows Media Audio (WMA) [7] are lossy algorithms used for audio signal compression. Compression ratios are high, with a reduction in size by a factor of 10 or more with low distortion of the audio signal. These algorithms use the fact that the human ear cannot hear some frequencies if they are masked by other frequencies. They eliminate those hidden frequencies, hence reducing the size of the signal. The encoding procedure is highly CPU intensive but decoding is not. This property is one of the reasons why the MP3 format is widely accepted for storing audio files.

Generic Algorithms

There are a few generic algorithms currently in use, such as the (1) dictionary, (2) sliding window [4, 5], (3) Huffman, and (4) arithmetic coding algorithms. They are all implemented as lossless and adaptive.

Dictionary based algorithms

This family of algorithms encodes variable-length strings into fixed-length codes, which are called tokens. The most popular algorithm in this group is the Lempel-Ziv-Welch (LZW) algorithm [3, 4, 8, 9], which is used in almost every commercial compression utility or library. It parses the input stream and, for each new phrase that it encounters, adds an entry to the dictionary. The algorithm itself is simple, but its implementation is usually complex because of dictionary maintenance tradeoffs.
If the dictionary becomes full, one of three actions occurs: (1) the dictionary could be flushed, which affects compression; (2) the dictionary could be frozen, which affects compression if the data changes over time; or (3) the dictionary could be expanded, which could lower the compression ratio because the token size increases.

Statistical model based algorithms

This family of algorithms encodes symbols into bit-strings of variable length using a statistical model. These algorithms are based on a greedy approach where symbols with high probabilities are given the shortest codes (bit-strings). The model has to accurately predict the probabilities of symbols to increase the compression. The Huffman coding algorithm [4, 5, 10] achieves the minimum amount of redundancy possible if the bit-strings are limited to an integer size. As a result, it is not an optimal method, just a good approximation. It is a very simple and fast algorithm, suitable for devices with slow CPUs. Memory consumption depends on the order of the model used.

The arithmetic coding algorithm [4, 5, 11, 12] does not produce a single code for each symbol. It produces a code for the whole message. Each symbol added to the message incrementally modifies the output code. This approach allows a symbol to be encoded with a fractional number of bits instead of an integer number, thus exactly representing the entropy of the symbol. In theory, arithmetic coding is optimal for a given model, but a real-world implementation has to make some tradeoffs related to floating point or integer arithmetic, thus decreasing the compression ratio. The arithmetic coding algorithm needs a powerful CPU for both the encoding and decoding processes, which makes it less desirable on PDA devices than the Huffman coding algorithm. Both of these algorithms are covered in depth in the next chapter.

CHAPTER 3
DESCRIPTION OF THE IMPLEMENTED ALGORITHMS

This chapter describes the two algorithms used to implement CLP.
These are the static Huffman and arithmetic algorithms. As mentioned in the previous chapter, both belong to the class of greedy algorithms. Each tries to assign the shortest bit-sequence to the symbol with the highest frequency, or the lowest entropy. We start by examining the Huffman algorithm.

The Huffman's Coding Algorithm

Assume an input alphabet, s1, s2 ... sm, with corresponding frequencies, f1 ≥ f2 ≥ ... ≥ fm, for each symbol. To compress or decompress data using Huffman coding, symbols and frequencies must be transformed into code words. This can be accomplished by building a Huffman's tree.

Building a Huffman's Tree

Each leaf node in the Huffman's tree is assigned one of the symbols from the input alphabet. Symbols sm and sm-1 become siblings in the tree, with a common parent node, sp, and frequency fm + fm-1. In each iteration of the algorithm, the two nodes from the set of leaf and intermediate nodes with the least weights that have not yet been used as siblings are paired as siblings and assigned to a parent node whose weight becomes the sum of the siblings' weights. The algorithm stops when the last two siblings are paired. The final parent, with a weight of 1, is the root of the tree.

Table 3-1. Input alphabet for Huffman's coder

    Symbol    Frequency
    s1        1/3
    s2        1/5
    s3        1/6
    s4        1/10
    s5        1/10
    s6        1/10

To illustrate this process, consider the following example. In Figure 3-1, the input alphabet S, defined in Table 3-1, is transformed into a Huffman's tree. The average code word length is defined as

    l = Σ(i=1..m) fi · wi,

where fi is the frequency and wi is the code word length of symbol si. For the input alphabet S, l has decreased from 3 to 2.47. In an average case, for this simple alphabet, the compression ratio is 17.7%.

Figure 3-1. Huffman's tree and code words generated from input alphabet S.

The Compression Procedure

The compression algorithm is fairly simple.
It consists of four steps:

1. Scan the input stream and obtain the frequencies of each symbol. Save that information.
2. Build the Huffman's tree and extract the code words.
3. Input a symbol from the stream and output its code.
4. Repeat step 3 until the end of the stream.

The Decompression Procedure

The decompression algorithm is also simple, consisting of five steps:

1. Build the Huffman's tree from the saved information.
2. Move the pointer to the root node.
3. For each input bit, follow the corresponding label down the tree until a leaf node is reached.
4. Output the symbol at that leaf node.
5. Repeat the algorithm from step 2 until the end of the stream.

All code words produced from the Huffman's tree have the property that no code word is the prefix of another code word. The decompression procedure uses this property to traverse the tree from the root node to a leaf and to find the corresponding symbol without having to resolve any ambiguities.

Summary

For a given model, Huffman coding is optimal among the probabilistic methods that replace source symbols with an integer number of bits. However, it is not optimal in the sense of entropy. As an example, consider an alphabet consisting of two symbols. Regardless of the probabilities, the Huffman coder will assign a single bit to each of the symbols, giving no compression.

The Arithmetic Coding Algorithm

Assume an input alphabet, s1, s2 ... sm, with corresponding frequencies, f1 ≥ f2 ≥ ... ≥ fm, for each symbol. Using the arithmetic coding algorithm, the entire source text, composed of symbols from the alphabet S, is assigned a code word determined by the process described below. Each source symbol, si, is assigned a subinterval, A(i), in the interval [0, 1). The subintervals, A(1), A(2) ... A(m), are disjoint subintervals of [0, 1). The length of A(i) is proportional to fi.
Having determined the interval A, the arithmetic coder chooses a number r ∈ A and represents the source text with some finite segment of the binary expansion of r. The smaller the interval A(i) is, the farther out the decoder has to go in the binary expansion to decode the source symbol. For this reason, symbols with a higher frequency, and hence a longer interval, have shorter binary expansions, resulting in better compression.

The Arithmetic Coding Procedure

The arithmetic coding algorithm is more complex than the Huffman coding algorithm, both in the number of steps and in the calculation complexity. The steps are:

1. A current interval [L, H) is initialized as [0, 1) and is maintained at each step. An underflow count is initialized to 0 and is maintained to the end of the file.

2. (Underflow condition) If the current interval satisfies 1/4 ≤ L < 1/2 ≤ H < 3/4, then expand it to [2(L - 1/4), 2(H - 1/4)) and add 1 to the underflow count.

3. (Shift condition) If [L, H) ⊂ [0, 1/2), then output 0 and any pending underflow bits (which are all 1), and expand the current interval to [2L, 2H). If [L, H) ⊂ [1/2, 1), then output 1 and any pending underflow bits (which are all 0), and expand the current interval to [2L - 1, 2H - 1). In either case, reset the underflow count to 0.

4. If none of the conditions in steps 2 or 3 hold, then divide the current interval into disjoint subintervals [L(i-1), L(i)) corresponding to the symbols si ∈ S, with lengths determined by the probabilities. If si is the next source letter, assign L ← L(i-1) and H ← L(i).

5. Repeat steps 2-4 until the end of the stream is reached and none of the conditions in steps 2 or 3 hold. At this stage, the final interval satisfies L < 1/4 < 1/2 ≤ H or L < 1/2 < 3/4 ≤ H; the sequence "01" is output if the first condition holds and "10" is output otherwise. Any leftover underflow bits are output after these bits.

Implementation with integer arithmetic

The previously described algorithm cannot be directly applied in practice for two reasons.
First, floating-point arithmetic is much slower than integer arithmetic on any computer platform. Second, the result r can become long, hence demanding a very high precision that is not available with current technology. To solve these problems, the interval [0, 1) can be replaced with [0, M), where M is a positive integer with a value of at least 4(S 2). The value of M is usually chosen as some power of 2, which can improve integer arithmetic performance. Each subinterval A(i) is scaled up correspondingly, so that its length is proportional to M instead of 1. Also, the L and H values can initially be represented as 0x0000... and 0xFFFF..., where the ellipsis means that 0s or 1s can be shifted in when needed, thus enabling 16- or 32-bit register arithmetic. The interval [0, M) is just an approximation of the interval [0, 1), with imprecise symbol frequencies, which leads to lower compression ratios; but using integer arithmetic drastically decreases the processing time. To illustrate the encoding and decoding process using the integer approximation algorithm, consider the following example. The source alphabet is S = {a, b, EOS}, the frequencies for each symbol are fa = 6/10, fb = 3/10, and fEOS = 1/10, and the cumulative counts are Ca = 6, Cb = 9, and CEOS = 10. The input sequence is: a b a EOS. To obtain an interesting, nontrivial example, M should be 2^4 = 16. The lowest possible value for M can be obtained from the inequality M > 4(S 2). The EOS symbol is used to terminate the input message. Adding an artificial symbol to the alphabet changes the real symbol frequencies, thus lowering the compression ratio, but with it the message size does not have to be appended to the message, and incremental transmission becomes possible. The encoding and decoding processes are given in Tables 3-2 and 3-3. Table 3-2. Arithmetic encoding process for alphabet S and sequence a b a EOS.
Symbol  Current interval   Subintervals (a / b / EOS)         Output
Start   [0, 16)            [0, 9) / [9, 14) / [14, 16)
a       [0, 9)             [0, 5) / [5, 8) / [8, 9)
b       [5, 8)             expand x -> 2x                     0
        [10, 16)           expand x -> 2(x - M/2)             1
        [4, 16)            [4, 11) / [11, 14) / [14, 16)
a       [4, 11)            expand x -> 2(x - M/4)             underflow
        [0, 14)            [0, 8) / [8, 12) / [12, 14)
EOS     [12, 14)           expand x -> 2(x - M/2)             10
        [8, 12)            expand x -> 2(x - M/2)             1
        [0, 8)             expand x -> 2x                     0
        [0, 16)

Table 3-3. Arithmetic decoding process for alphabet S and bit sequence 011010 (binary).

Value          Current interval   Subintervals (a / b / EOS)     Output symbol
0110₂ = 6      [0, 16)            [0, 9) / [9, 14) / [14, 16)    a
               [0, 9)             [0, 5) / [5, 8) / [8, 9)       b
               [5, 8)             expand x -> 2x
1101₂ = 13     [10, 16)           expand x -> 2(x - M/2)
1010₂ = 10     [4, 16)            [4, 11) / [11, 14) / [14, 16)  a
               [4, 11)            expand x -> 2(x - M/4)
1100₂ = 12     [0, 14)            [0, 8) / [8, 12) / [12, 14)    EOS

For a given model, the arithmetic coding algorithm is optimal and outperforms the Huffman method. It is important to note that implementations of the arithmetic coding algorithm are less than optimal, due to integer arithmetic and other compromises. Since the Huffman coding algorithm is nearly optimal in many cases, the choice between the methods is not as simple as the theory suggests. Summary The two algorithms used for the development of CLP are described in this chapter. An in-depth theoretical analysis [4] and detailed implementations [5] can be found in the literature. The Huffman coding algorithm is lightweight and fast, but it is nonoptimal. The arithmetic coding algorithm is CPU intensive, but in its "pure" form it is optimal. Chapter 5 compares the results of both algorithms and identifies which one is better for PDA applications. CHAPTER 4 CLP LIBRARY IMPLEMENTATION This chapter discusses the implementation details from a programmer's perspective.
These include (1) the class hierarchy, (2) the interface exposed by the compression library, (3) the proper call sequence to the class methods, (4) error handling, and (5) instructions on how to include the CLP library in a Palm OS application. Class Hierarchy The CLP library consists of two separate sets of classes. One contains the main compression hierarchy, while the other holds the simple bit-stream class. Compression Class Hierarchy Different compression algorithms use different techniques for compressing data, but they all present a similar interface [13] to the application that uses them. It is common practice to enforce such behavior through a common virtual base class; in the case of the CLP library, that base class is the BaseCompress class. Static and adaptive compression algorithms update the coding model in different ways. The static compressor gathers statistical information before compression, while the adaptive one does so during the compression phase. As a result, two classes are inherited from the BaseCompress class: StatisticCompress and AdaptiveCompress. Both are virtual classes, with the AdaptiveCompress class not being implemented. The Huffman and arithmetic compression classes are inherited from the StatisticCompress class. While they share the same interface, their internal implementations are different, reflecting the differences in their corresponding algorithms. A modular, object-oriented C++ [14] approach makes changes and additions to the library easy. If a new algorithm is needed, its class can be inherited from StatisticCompress or AdaptiveCompress without changing the existing code. Each class is declared in a separate header file, and its implementation is placed in a corresponding .cpp file, thus reducing the size of each module. The complete UML [15] compression class diagram is given in Figure 4-1. BitIO Class Hierarchy The BitIO class is a helper class. It handles single- or multiple-bit stream operations.
The OS file and memory functions are designed for byte access, but for compression purposes the ability to store an arbitrarily sized bit sequence in memory is needed. The UML representation of the BitIO class is given in Figure 4-2. CLP Library Interface The compression library should provide a simple interface, enabling the user to easily send data to the library and receive the corresponding results from it. Each class in the hierarchy either expands the interface or implements its methods.

Figure 4-1. The UML compression CLP library class diagram.

Figure 4-2. The UML bit input/output BitIO class diagram.

BaseCompress Class Interface The BaseCompress class provides the behaviors detailed below by exposing the following methods in its interface:

Initialize: Initializes the compression/decompression engine for a given class. This is a pure virtual method [14].

Compress/Decompress: Send data to the CLP library for compression/decompression. These are pure virtual methods.

GetResult: Extracts the result from the CLP library. This method is implemented in the BaseCompress class.

FlushBuffer: Flushes the internal buffers and prepares the CLP library for the next compression sequence. This method is implemented in the BaseCompress class.
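As an outline only, the interface above might be declared as follows. The method names come from the text; every signature, and the NullCompress demonstration subclass, are assumptions rather than the CLP implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of a BaseCompress-style abstract base class.
class BaseCompress {
public:
    virtual ~BaseCompress() {}
    // Pure virtual: each engine prepares its own coding model.
    virtual void Initialize() = 0;
    // Pure virtual: pass a record in; the return value is the result size.
    virtual std::uint32_t Compress(const std::uint8_t* data, std::uint32_t len) = 0;
    virtual std::uint32_t Decompress(const std::uint8_t* data, std::uint32_t len) = 0;
    // Implemented in the base class: copy the accumulated result out.
    std::uint32_t GetResult(std::uint8_t* out, std::uint32_t max) {
        std::uint32_t n = 0;
        for (; n < result_.size() && n < max; ++n) out[n] = result_[n];
        return n;
    }
    // Implemented in the base class: clear buffers for the next record.
    void FlushBuffer() { result_.clear(); }
protected:
    std::vector<std::uint8_t> result_;  // dynamic array, as with CLP's STL vector
};

// Hypothetical do-nothing engine, only to demonstrate the dispatch;
// it "compresses" by copying the input bytes into the result buffer.
class NullCompress : public BaseCompress {
public:
    void Initialize() override {}
    std::uint32_t Compress(const std::uint8_t* d, std::uint32_t n) override {
        result_.assign(d, d + n);
        return n;
    }
    std::uint32_t Decompress(const std::uint8_t* d, std::uint32_t n) override {
        return Compress(d, n);
    }
};
```

A real engine such as a Huffman coder would differ only in the bodies of Initialize, Compress, and Decompress, which is the point of the shared base class.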
StatisticCompress Class Interface The StatisticCompress class expands the base interface with methods operating on the statistical data and the resulting header:

GetHeader: Returns the header with the statistical information that was generated by the Initialize call. This is a pure virtual method.

GetHeaderSize: Returns the size of the header, so that the application can allocate enough memory to save it. This is a pure virtual method.

SetHeader: Data can be preprocessed, or even compressed, on a PC. The header information generated through preprocessing can be read from a file and used for compression/decompression on the Palm platform, thus decreasing the overhead and increasing speed. This is a pure virtual method.

StatHuffman and StatArithmetic Classes The StatHuffman and StatArithmetic classes implement the interface declared by the StatisticCompress parent class, realizing the corresponding Huffman and arithmetic algorithms.

BitIO Class Interface The BitIO class implements a simple interface that allows the user to read/write a sequence of bits from/to a stream. The proper call sequence is outlined below: 1. OpenIO: Opens a new bit I/O stream connection. 2. InputBit(s)/OutputBit(s): Inputs/outputs a bit sequence from/to the stream. 3. FlushIO: Flushes the bits remaining in the buffer to the stream. 4. GetBuffer: Returns the resulting stream. 5. CloseIO: Closes the I/O stream. Step 2 is repeated as long as there are bits to input or output.

Methods Interaction To use the CLP library properly, the application must follow this method-calling sequence for both the compression and decompression processes: 1. Initialize 2. SetHeader/GetHeader: Set the header if it was generated before, or get it if it is needed later. 3. Compress/Decompress: This call passes data to the library and returns the size of the result. 4. GetResult 5. FlushBuffer Steps 3-5 are repeated for each record that the application intends to compress/decompress.
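The BitIO call sequence above can be sketched with a minimal bit-packing class. The method names follow the text; the MSB-first packing order and all signatures are assumptions, not the CLP implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal BitIO-style stream: bits are packed into bytes, most significant
// bit first.  OutputBits/InputBits/FlushIO/GetBuffer mirror the call
// sequence described in the text.
class BitStream {
public:
    // Write the low `count` bits of `value`, most significant first.
    void OutputBits(std::uint32_t value, unsigned count) {
        for (int i = int(count) - 1; i >= 0; --i)
            put_bit((value >> i) & 1u);
    }
    // Read `count` bits back into an integer.
    std::uint32_t InputBits(unsigned count) {
        std::uint32_t v = 0;
        for (unsigned i = 0; i < count; ++i)
            v = (v << 1) | get_bit();
        return v;
    }
    // Pad the last partial byte with zeros, as a FlushIO would.
    void FlushIO() { while (nbits_ % 8 != 0) put_bit(0); }
    const std::vector<std::uint8_t>& GetBuffer() const { return bytes_; }
private:
    void put_bit(unsigned b) {
        if (nbits_ % 8 == 0) bytes_.push_back(0);
        if (b) bytes_.back() |= std::uint8_t(0x80u >> (nbits_ % 8));
        ++nbits_;
    }
    unsigned get_bit() {
        unsigned b = (bytes_[rpos_ / 8] >> (7 - rpos_ % 8)) & 1u;
        ++rpos_;
        return b;
    }
    std::vector<std::uint8_t> bytes_;
    std::size_t nbits_ = 0, rpos_ = 0;
};
```

Writing the 4-bit value 6 followed by the 2-bit value 2, flushing, and reading them back reproduces the original values from a single packed byte.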
A detailed calling sequence of the public and private functions, for both types of compression/decompression, is given as UML sequence diagrams in Figures 4-3, 4-4, 4-5, and 4-6. Error Handling The CLP library uses the C++ exception-handling mechanism [14] to handle errors. There are only three types of exceptions that CLP throws: out_of_range is thrown if an array index is out of range; length_error is thrown if an array resize operation fails (i.e., there is insufficient memory); an unspecified exception is thrown if an unspecified error happens (i.e., some system exception). All three types of exceptions are inherited from the C++ Standard Template Library (STL). The CLP library uses the STL's vector container type [14, 16] to implement the dynamic arrays used for holding the results generated during compression and decompression. A simple tester application shows the correct way of handling exceptions thrown by the CLP library in the application that uses it.

Figure 4-3. The UML sequence diagram for the compression process using the StatHuffman class.

Figure 4-4. The UML sequence diagram for the decompression process using the StatHuffman class.

Figure 4-5. The UML sequence diagram for the compression process using the StatArithmetic class.
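An application guarding CLP calls would catch these three categories in order of specificity. The sketch below is illustrative only: it uses std::vector::at to provoke a genuine std::out_of_range rather than reproducing any CLP call:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Demonstrates the three-way exception handling described in the text.
// `vec.at(i)` stands in for a CLP call that may throw.
std::string guarded_access(const std::vector<int>& vec, std::size_t i) {
    try {
        return std::to_string(vec.at(i));     // may throw std::out_of_range
    } catch (const std::out_of_range&) {
        return "out_of_range";
    } catch (const std::length_error&) {      // e.g., a failed vector resize
        return "length_error";
    } catch (...) {                           // any unspecified exception
        return "unknown";
    }
}
```

Because out_of_range and length_error are sibling subclasses of std::logic_error, the catch order between them does not matter, but the catch-all must come last.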
Figure 4-6. The UML sequence diagram for the decompression process using the StatArithmetic class.

How to Deploy the CLP Library To use the CLP library, the application programmer must (1) add the library file to a path accessible by the linker and (2) include the StatHuffman.h and/or StatArithmetic.h files in the project. Both the StatHuffman and StatArithmetic header files automatically include the necessary underlying header files and libraries into the project. CHAPTER 5 COMPRESSION RESULTS This chapter presents a summary of the CLP library's performance on a Palm Pilot device, based on a sample program that uses the library to compress and decompress drug reference data [5]. The compression ratios and speeds are measured for both the Huffman and arithmetic coding algorithms, and a comparison is made to determine which algorithm is more suitable for PDA use. Test Environment The following is a description of the sample application, scripts, utilities, source data, and testing procedure used. Source Data The initial data set was a text file containing ASCII strings delimited with newline characters. The size of the text file is approximately 2,300 strings, or 840 KB. Using the first PHP script given in the Appendix, each string from the text file was inserted into a separate record of the PDB database.
The CLP implementation on the PC is used to generate a header with statistical data for the compression and decompression procedures. The second PHP script, also given in the Appendix, is used to transfer the binary information from the header file to a PDB database record on the PC. All generated PDB files and the sample application are loaded into the Palm OS device emulator program (POSE). Sample Application A sample application was written in C++ and linked with the CLP library for Palm OS. It first compresses the input PDB database using the Huffman and arithmetic coding algorithms and logs the compression times with an external profiler application. The compression results are written into new database files, one for each type of compression. The resulting databases are then decompressed using both algorithms, and the decompression times are logged with the profiler. Scripts Scripts for converting data to the Palm PDB format [1] were written using the php-pdb [17] module. PHP script-language interpreters are available for almost all OS platforms, so these scripts are highly portable [17]. Utilities Two utility programs are used for testing: POSE and Palm Reporter. They are both part of the Palm OS Software Development Kit (SDK) [1]. POSE is able to emulate any Palm OS device if a ROM image for that device is available. It emulates the real speed of the device, so measurements taken on it are roughly equal to measurements taken on the real device. Palm Reporter is a standalone program able to connect to POSE and receive log messages. It is used to obtain compression and decompression speed information from the sample program running on POSE. Test Results Symbol Distribution The header files were generated from the source data by the CLP library for the PC. Both the Huffman and arithmetic coder headers are identical, with the symbol frequencies shown in Figure 5-1.
Figure 5-1. Symbol frequency distribution (character frequency vs. ASCII value).

The data distribution in the graph meets expectations: because the source text consists of ASCII strings in the English language, the SPACE symbol, the digits, and the characters a-z and A-Z have the highest frequencies. Compression Results Palm devices use FLASH memory to store all databases and applications. FLASH memory is faster for reading than for writing (by a factor of 10), which is why there are two sets of results for the compression test: one includes writing the results to storage memory and the other does not. Both are shown in Table 5-1.

Table 5-1. Compression results
Compression  Source data  Comp. data  Comp. ratio  Comp. time  Comp. time
method       (in bytes)   (in bytes)               (write)     (w/o write)
None         878,090      878,090      0%           45s          2s
Huffman      878,090      544,136     38.03%        75s         62s
Arithmetic   878,090      546,895     37.72%        99s         83s

The compression time for both algorithms is comparable with simple memory access and is around 0.02 s/record. Both algorithms have similar compression ratios. The expected result was that the arithmetic coder would achieve a better compression ratio than the Huffman coder, but in this case that is not true. Many factors can influence the compression results, the main one being the integer arithmetic approximation. Decompression Results The decompression results are shown in Table 5-2.

Table 5-2. Decompression results
Decompression  Comp. data  Decomp. data  Decomp. time  Decomp. time
method         (in bytes)  (in bytes)    (write)       (w/o write)
None           878,090     878,090        45s            2s
Huffman        544,136     878,090       128s          110s
Arithmetic     546,895     878,090       812s          799s

Decompression is slightly slower than compression in the case of the Huffman coder. This result is expected, because the decoder must traverse the Huffman tree to look up a symbol, which is a slower operation than finding a symbol-code pair in the compression procedure.
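The encode/decode asymmetry can be illustrated with a tiny hypothetical prefix code (not the one CLP derives from this data): encoding is a single table lookup per symbol, while decoding must consume the stream bit by bit until a full code word is matched, the moral equivalent of descending the Huffman tree to a leaf:

```cpp
#include <cassert>
#include <map>
#include <string>

// A hypothetical prefix code over three symbols; 'e' stands in for EOS.
// No code word is a prefix of another, so decoding is unambiguous.
static const std::map<char, std::string> kCode = {
    {'a', "0"}, {'b', "10"}, {'e', "11"}};

// Encoding: one direct symbol -> code lookup per input symbol.
std::string encode(const std::string& text) {
    std::string bits;
    for (char c : text) bits += kCode.at(c);
    return bits;
}

// Decoding: extend the current bit pattern one bit at a time until it
// matches a full code word, i.e., one step down the tree per input bit.
std::string decode(const std::string& bits) {
    std::string out, cur;
    for (char b : bits) {
        cur += b;
        for (const auto& kv : kCode)
            if (kv.second == cur) { out += kv.first; cur.clear(); break; }
    }
    return out;
}
```

The per-bit work in decode is what makes Huffman decompression measurably slower than compression on the device.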
In the case of the arithmetic coder, decompression is much slower than compression, by a factor of 10. This is due to the more complex decompression algorithm; some optimization could help narrow the gap. Discussion Users of PDA devices expect PDA applications to have a quick response time and a small size. The test results clearly show that the Huffman coder algorithm can provide both, while the arithmetic coder algorithm falls short when speed is taken into account. CHAPTER 6 CONCLUSIONS This chapter reviews the key concepts addressed in this thesis. The important issues of the CLP library are discussed, and future improvements over the current system are addressed as well. Overview of the CLP Library CLP is a simple, easily expandable compression library for PDA devices running Palm OS. It implements well-known algorithms that are used in most commercial compression applications because of their good performance. The compression algorithms' complex implementations are well hidden behind the simple interface exposed by the CLP library. This interface enables the application programmer to deploy the CLP library effortlessly. New compression algorithms can be added to the CLP library without changing the existing library code. The C++ class hierarchy enforces the existing interface onto new implementations, demanding only minor changes in the application code. Future CLP Library Improvements The current implementation of the CLP library leaves much room for improvement. Static vs. Shared Libraries The CLP library is implemented as a statically linked library (SLL). A copy of the SLL is added to each application that uses it, so every application on the same PDA device must have its own separate copy of the library. This clearly wastes memory, which is not good practice on a PDA device.
Shared or dynamically linked libraries (DLLs) offer a fix to this problem by keeping the library in a separate file, which can be loaded on demand by the application needing it. Then only one copy of the library is kept in the system, thus reducing memory requirements. DLLs are not a perfect solution because: 1. They are highly system dependent. 2. They must provide backward compatibility with previous versions. If condition 1 is acceptable and the library is well implemented, then a DLL approach should be chosen. Compression libraries are usually in high demand in the system, being shared by many applications at the same time. They are also system specific, because they use low-level system properties to increase processing speed; hence the CLP library should be implemented as a DLL in its next version. C++ vs. C Language The C++ language offers a rich set of libraries and language elements (i.e., strict type checking, exception handling, strings, dynamic arrays, and classes) to the developer. These properties increase productivity by enabling the programmer to focus more on the problem than on its implementation. Also, programs developed in an OO language can easily be expanded by adding new classes and reusing old ones. The problem with the C++ language is that this flexibility and power increase memory consumption and decrease speed, which can hurt performance on PDA devices. The C language is the language of choice for embedded-systems programmers because of its small memory footprint and tight connection with the underlying system. It does not offer a rich set of language elements like C++, thus forcing the programmer to spend more time in the implementation and testing phases. The current size of the CLP library is 145 KB. If it were developed in the C language, its size would decrease to approximately 30-40 KB, and its dynamic memory consumption would be much smaller.
This decrease in size happens because the C language does not use the STL library, which is quite large, and there is no object-initialization overhead for virtual classes and methods. With some decrease in flexibility and expandability, the CLP library can be ported to the C language, reducing its size and memory footprint. 1st-Order and Adaptive Algorithms The current 0th-order algorithms can be replaced with 1st-order algorithms to increase compression ratios, though the problem of holding large statistical tables in memory must be solved first. If record-based compression/decompression is not needed, then adaptive algorithms can be used. They would eliminate the need for header storage space and data preprocessing. APPENDIX SOURCE CODE LISTINGS The source code for CLP and the test application is not included in the thesis. Only the PHP scripts mentioned in the thesis are included.

<?php
/*
 * header.php
 *
 * Makes the test header PDB for compression.
 *
 * It uses the php-pdb module [17] to manipulate PDB files.
 */

// Turn on all error reporting
ini_set('error_reporting', E_ALL);

$fp = fopen("SHuffman.bin", "rb");
for ($i = 0; $i < 256; $i++) {
    $symbol = fread($fp, 1);
    echo ord($symbol);
    echo ",";
}
?>

<?php
/*
 * pdbbuild.php
 *
 * Makes the test DBs for compression.
 *
 * It uses the php-pdb module [17] to manipulate PDB files.
 */

// Turn on all error reporting
ini_set('error_reporting', E_ALL);

include "./php-pdb.inc";

// Get the data first
echo "Get data first and put it into SourceDataDB\n";
$fp = fopen("compressionTest.txt", "rt");
if (!$fp) { echo "Can't open compressionTest.txt\n"; exit; }
$PDBFile = new PalmDB('DATA', 'STRT', 'SourceDataDB');
$counter = 1;
$test = 0;
while (!feof($fp)) {
    $string = fgets($fp);
    $PDBFile->AppendString($string);
    $counter = $counter + 1;
    $test = $PDBFile->GoToRecord($counter);
    echo "Test =", $test, "\n";
}
$PCPDBFile = fopen("SourceDataDB.pdb", "wb");
if (!$PCPDBFile) { echo "Can't create SourceDataDB.pdb\n"; exit; }
$PDBFile->WriteToFile($PCPDBFile);
fclose($PCPDBFile);
fclose($fp);

// Get the first header (for static Huffman compression)
echo "Get first header (for Static Huffman Compression)\n";
$fp = fopen("SHuffman.bin", "rb");
if (!$fp) { echo "Can't open SHuffman.bin\n"; exit; }
$PDBFile = new PalmDB('DATA', 'STRT', 'SHuffmanDB');
$Header = fread($fp, 256);
$PDBFile->AppendString($Header);
$PCPDBFile = fopen("SHuffmanDB.pdb", "wb");
if (!$PCPDBFile) { echo "Can't create SHuffmanDB.pdb\n"; exit; }
$PDBFile->WriteToFile($PCPDBFile);
fclose($PCPDBFile);
fclose($fp);

// Get the second header (for static arithmetic compression)
echo "Get second header (for Static Arithmetic Compression)\n";
$fp = fopen("SArithmetic.bin", "rb");
if (!$fp) { echo "Can't open SArithmetic.bin\n"; exit; }
$PDBFile = new PalmDB('DATA', 'STRT', 'SArithmeticDB');
$Header = fread($fp, 256);
$PDBFile->AppendString($Header);
$PCPDBFile = fopen("SArithmeticDB.pdb", "wb");
if (!$PCPDBFile) { echo "Can't create SArithmeticDB.pdb\n"; exit; }
$PDBFile->WriteToFile($PCPDBFile);
fclose($PCPDBFile);
fclose($fp);
?>

REFERENCES [1] Palm Corp. home page, http://www.palmos.com, accessed 12/10/2002. [2] L. R. Foster, Palm OS Programming Bible, IDG Books Worldwide, Inc., Foster City, CA, 2000. [3] C. E. Shannon, A mathematical theory of communication, Bell System Technical Journal, 27:379-423 and 623-656, 1948. [4] D. Hankerson, G. A. Harris, P. D. Johnson, Introduction to Information Theory and Data Compression, CRC Press, Boca Raton, FL, 1997. [5] M. Nelson, J.-L. Gailly, The Data Compression Book, M&T Books, New York, 1996. [6] G. K. Wallace, The JPEG still picture compression standard, Communications of the ACM, 34(4):32-44, April 1991. [7] Microsoft Corp., support page for Windows Media formats, http://support.microsoft.com/default.aspx?scid=/support/mediaplayer/wmptest/wmptest.asp#Windows%20Media, accessed 02/10/2003. [8] T. Welch, A technique for high-performance data compression, IEEE Computer, 17:8-19, June 1984. [9] J. Ziv, A. Lempel, A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, 23(3):337-343, May 1977. [10] D. A. Huffman, A method for the construction of minimum-redundancy codes, Proceedings of the IRE, 40(9):1098-1101, September 1952. [11] A. Moffat, R. Neal, I. H. Witten, Arithmetic coding revisited, ACM Transactions on Information Systems, 16(3):256-294, July 1998. [12] I. H. Witten, R. Neal, J. G. Cleary, Arithmetic coding for data compression, Communications of the ACM, 30(3):520-540, June 1987. [13] E. Gamma, R. Helm, R. Johnson, J. Vlissides, Design Patterns, Addison-Wesley, Reading, MA, 1994. [14] B. Eckel, Thinking in C++, 2nd edition, Volume 1, Prentice Hall, Upper Saddle River, NJ, 2000. [15] C. Larman, Applying UML and Patterns, Prentice Hall PTR, Upper Saddle River, NJ, 2002. [16] H. Schildt, C/C++ Programmer's Reference, 2nd edition, Osborne/McGraw-Hill, Berkeley, CA, 2000. [17] SourceForge php-pdb project page, http://phppdb.sourceforge.net, accessed 01/10/2003. BIOGRAPHICAL SKETCH Nebojša Ciric was born in Belgrade, Republic of Serbia, Yugoslavia. He received his Bachelor of Science degree in computer science and engineering from the School of Electrical Engineering, Belgrade, Yugoslavia, in 1998. He worked at the Institute "Mihajlo Pupin" for seven months on his BS project as a programmer. In August 1999, he moved to Ljubljana, Slovenia, to work at Hermes SoftLab, one of the largest software development companies in the country. He quit his job in Slovenia a year later to be with his wife in Gainesville, FL. He was accepted into the Computer and Information Science and Engineering Department at the University of Florida in August 2001. His research interests include object-oriented software development, artificial intelligence, and pattern recognition.
Welch, A technique for highperformance data compression, IEEE Computer, 17:819, June 1984. [9] J. Ziv, A. Lempel, A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, 23(3):337343, May 1977. [10] D. A. Huffman, A method for the construction of minimum redundancy codes, Proceedings of the IRE, 40(9):10981101, September 1952. [11] A. Moffat, R. Neal, I. H. Witten, Arithmetic coding revisited, ACM Transactions on Information Systems, 16(3):256294, July 1998. [12] I. H. Witten, R. Neal, J. G. Cleary, Arithmetic coding for data compression, Communications of the ACM, 30(3):520540, June 1987. [13] E. Gamma, R. Help, R. Johnson, J. Vlissides, Design Patterns, AddisonWesley, Reading, MA, 1994. 44 [14] B. Eckel, Thinking in C++, 2nd Edition, Volume 1, Prentice Hall, Upper Saddle River, NJ, 2000. [15] C. Larman, Applying UML and Patterns, Prentice Hall PTR, Upper Saddle River, NJ, 2002. [16] H. Schildt, C/C++ Programmer's Reference, 2nd Edition, Osbore/McGrawHill, Berkeley, CA, 2000. [17] SourceForge Home Page, http://phppdb.sourceforge.net, Accessed: 01/10/2003. BIOGRAPHICAL SKETCH NebojSa Ciric was born in Belgrade, Republic of Serbia, Yugoslavia. He received his Bachelor of Science degree in computer science and engineering from the School of Electrical Engineering, Belgrade, Yugoslavia, in 1998. He worked at the Institute "Mihajlo Pupin" for seven months on his BS project as a programmer. In August 1999, he moved to Ljubljana, Slovenia, to work at the Hermes SoftLab, one of the largest software development companies in the country. He quit his job in Slovenia a year later to be with his wife in Gainesville, FL. He was accepted into the Computer and Information Science and Engineering Department at the University of Florida in August 2001. His research interests include objectoriented software development, artificial intelligence, and pattern recognition. 