Citation
Automated segmentation of printed documents for computer understanding of structures and contents

Material Information

Title:
Automated segmentation of printed documents for computer understanding of structures and contents
Creator:
Kim, Hokyung, 1953-
Publication Date:
Language:
English
Physical Description:
vii, 145 leaves : ill. ; 28 cm.

Subjects

Subjects / Keywords:
Abstract objects ( jstor )
Drawing ( jstor )
Geometric lines ( jstor )
Graphics ( jstor )
Image classification ( jstor )
Image processing ( jstor )
Line segments ( jstor )
Pixels ( jstor )
Symbolism ( jstor )
Trademarks ( jstor )
Dissertations, Academic -- Electrical Engineering -- UF
Electrical Engineering thesis Ph. D
Genre:
bibliography ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1992.
Bibliography:
Includes bibliographical references (leaves 141-144).
General Note:
Typescript.
General Note:
Vita.
Statement of Responsibility:
by Hokyung Kim.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
The University of Florida George A. Smathers Libraries respect the intellectual property rights of others and do not claim any copyright interest in this item. This item may be protected by copyright but is made available here under a claim of fair use (17 U.S.C. §107) for non-profit research and educational purposes. Users of this work have responsibility for determining copyright status prior to reusing, publishing or reproducing this item for purposes other than what is allowed by fair use or other copyright exemptions. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder. The Smathers Libraries would like to learn more about this item and invite individuals or organizations to contact the RDS coordinator (ufdissertations@uflib.ufl.edu) with any additional information they can provide.
Resource Identifier:
028580128 ( ALEPH )
27483479 ( OCLC )











AUTOMATED SEGMENTATION OF PRINTED DOCUMENTS FOR COMPUTER UNDERSTANDING OF STRUCTURES AND CONTENTS






















By

HOKYUNG KIM


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


1992













ACKNOWLEDGEMENTS


The author is deeply grateful to his advisor and supervisory committee chairman, Dr. Julius T. Tou, for his invaluable inspiration, encouragement, and advice. Without his guidance and support, this work could never have been done. He is also greatly indebted to all the other members of the supervisory committee, Professor John Staudhammer, Professor Leon W. Couch, Professor Herman Lam, and Professor Rick Smith, for their suggestions and advice regarding this dissertation.

He would like to thank sincerely all the members and good friends of the Center for Information Research for their helpful discussions and patience.

The financial support of the Center for Information Research is gratefully acknowledged.















TABLE OF CONTENTS



ACKNOWLEDGEMENTS ...................................... ii

ABSTRACT................................................ vi

CHAPTERS

1 INTRODUCTION.......................................1

1.1 Statement of the Problem .............................. 2
1.2 Objectives .......................................... 4
1.3 Approaches ........................................5
1.4 Preview of Remaining Chapters .......................... 9

2 BACKGROUND AND REVIEW OF PREVIOUS WORK .........11

2.1 Preprocessing......................................11
2.1.1 Binary Representation by Image Thresholding .........12
2.1.2 Thinning and Boundarization for Object Description ... 16
2.2 Review of Previous Work.............................. 18
2.2.1 Block Segmentation Rule for Page Images............19
2.2.2 Classification Rule for the Segmented Block ..........22
2.2.3 Line-to-Point Transformation .....................32
2.2.4 Recognizing Textual Blocks Using the Hough Transform 32
2.2.5 Conclusions from Previous Work.................. 36

3 MACHINE UNDERSTANDING OF STRUCTURES IN
DOCUMENTS ........................................39

3.1 Introduction.......................................39
3.2 The Block Segmentation of Digitized Documents .............40
3.3 Labelling of Segmented Blocks.......................... 42
3.4 The Unconstrained Block Classification Rule ...........43
3.4.1 The Ratio of the Number of Black Pixels and Black-White
Transitions .................................. 47
3.4.2 Removal of Lines ............................. 50










3.4.3 The Procedures of Block Classification ..............51
3.5 Experiments.......................................52
3.5.1 Experimental Images and Facilities .................52
3.5.2 Experimental Results ...........................54
3.5.3 Analyses for the Block Segmentation Approaches .......62
3.6 Page-Structure Analysis ............................... 64
3.6.1 Office Document Architecture (ODA) and its Structure 65
3.6.2 The Structure for Office Document Architecture .......67
3.7 Summary .........................................70

4 RECOGNITION OF THE CLASSIFIED BLOCK ................73

4.1 Introduction.......................................73
4.2 Recognition of Text .................................. 74
4.2.1 The Selection of Feature in Character Recognition .... 77
4.2.2 Geometrical and Topological Properties in Character
Recognition ................................. 81
4.2.3 Word Recognition of Contextual Information ..........85
4.3 Recognition of Non-Text .............................. 86
4.3.1 Determination of Non-Text Type ..................87
4.3.2 Classification of Line Drawing.................... 89
4.3.3 Recognition of Line Drawings .................... 94
4.3.4 The Symbol Matching Process by Transformation to the
Graph Model ................................100
4.4 Summary ......................................... 108

5 DOCUMENT FILING AND RETRIEVAL ....................110

5.1 Introduction....................................... 110
5.2 System Resources...................................111
5.2.1 Document Contents ...........................112
5.2.2 Structure of Computerized Documents ..............113
5.2.3 Internal and External Representation of Documents ..116
5.2.4 Information Extraction in Documents...............120
5.3 System Facilities ....................................123
5.3.1 Editing .................................... 124
5.3.2 Formatting ..................................125
5.3.3 Retrieving..................................127
5.4 Summary ......................................... 129

6 CONCLUSION ........................................ 131

6.1 Discussion ........................................ 131
6.2 Contributions...................................... 133










APPENDICES

A ROBUSTNESS OF NEW SEGMENTATION ALGORITHM ......135
B SOME METHODOLOGIES OF CONTEXTUAL WORD
RECOGNITION....................................... 137

REFERENCES............................................. 141

BIOGRAPHICAL SKETCH....................................145
















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

AUTOMATED SEGMENTATION OF PRINTED DOCUMENTS FOR
COMPUTER UNDERSTANDING OF STRUCTURES AND CONTENTS

By

Hokyung Kim

August 1992

Chairperson: Dr. Julius T. Tou
Major Department: Electrical Engineering

This dissertation presents a conversion of paper-based documentation to computerized form. The main process in such a conversion is computer understanding of the document image, which is a visual representation of a two-dimensional field consisting of blocks of text, graphics, and picture images. In designing this document analysis system, automatic block segmentation and classification of a digitized document image are necessary stages. For the automatic block segmentation, we have developed a robust approach which connects black pixels within a predetermined distance to separate the blocks. The segmentation procedure is performed as a top-down approach to reduce the processing time.

For the development of a block classification algorithm insensitive to skew, a block classification rule based on black pixels is considered. This method uses a step-by-step classification approach to avoid an exhaustive classification procedure. The ratio between the black-white transition count and the count of black pixels of each block is used as one of the measurements. This ratio is almost invariant to skew and is consistently high for text blocks. A further pixel-based operator classifies each block in detail.

For a system that not only recognizes the text block but also understands a nontext region, we have utilized and integrated advanced technologies developed in various methodologies. In the stage for understanding nontext, we have distinguished some of the symbols from the other picture images. We have divided the symbols into two different categories. Two different image processing techniques, such as thinning and finding boundary, are applied to line and blob type symbols respectively in order to extract valuable features. We have used a geometrical feature for line type symbols and applied a weighted graph matching method for identification.
















CHAPTER 1
INTRODUCTION


A document analysis system which converts a paper-based documentation to computerized form involves a two-dimensional information processing task. Such a system must recognize characters of a text block and identify nontext regions such as line drawings and pictures. A human understands a page image by applying sufficient knowledge to it. Documents with nontext are becoming more common. The trend is to have many more documents composed of text, graphics, and pictures. In most documents some pages have pictures because editors have learned that all-type pages generally have little eye appeal and pictures help the reader to understand. When no photos or other illustrations pertinent to the article's subject are available, many editors will try to use a picture that is germane to the subject and helps attract attention to the article even if it adds no information of value. It seems impossible to give the machine the knowledge which humans possess about a page because it is not known how the human eye recognizes the object.

Fortunately, a page image has a distinctive geometric layout in printed form. In the composition of a page the first consideration is a well-organized layout with proper columns and margin selections. Graphics and pictures are well separated from the text block and other nontext regions. Thus, such an arrangement of content on the page provides a stepping stone to designing a document analysis system to cope with the knowledge explosion that threatens to bury us in information debris. This system not only copes with the explosion of knowledge but also can possibly help people with visual handicaps. Thus far, the equipment available for blind and physically handicapped individuals has consisted of tape-recorded books, magnifiers, and braille. Tape-recorded books are produced by human readers working with a reading service. These reading services mainly rely upon volunteers. Developing a system for document content analysis is particularly important for the increasing number of elderly people, many of whom are visually handicapped.



1.1 Statement of the Problem

A computerized form of paper-based documentation provides advantages such as efficient document update and revision; such conversion of paper-based documentation to computerized form, however, requires several steps, including preprocessing steps. The main process in the conversion is computer understanding of document images. The document image is a visual representation of a two-dimensional field consisting of blocks with text only and blocks including nontext. The understanding of text has been developed and is an extension of character recognition. Complete understanding of paper-based documentation confronts the difficulty of understanding nontext regions. Many efforts to interpret engineering drawings and diagrams by machine can be found in the literature [Hua86b, Tou87b, Kas90].









The preprocessing stage for the document analysis system requires the conversion of a paper-based document to a digital bit-map representation after optical scanning, followed by the automatic segmentation of the page image, which separates the blocks by the spaces between them. Two diametrically opposite philosophies, called top-down and bottom-up, have been proposed for the automatic segmentation task. In the top-down approach, certain global operations are performed on an entire page image, while in the bottom-up approach character components are individually detected and then merged together into progressively larger blocks using component properties and interrelationships. Several approaches such as projection profiles or run-length smoothing have been proposed; these enable segmenting the image into blocks, each of which can be subsequently classified using pattern classification techniques. Both projection profile and run-length smoothing based techniques are fast; however, they are too rigid and fail if the documents or their constituent textual blocks are skewed. In other words, if any pixel of the next text line is higher than or at the same height as the lowest pixel of the text line being scanned, the document image will not be segmented as desired for subsequent block classification.
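The skew sensitivity of projection profiles can be illustrated with a minimal Python sketch. It is not the implementation of any system cited here: the function name `split_by_profile`, the list-of-rows image layout, and the convention that black pixels are 1 are all assumptions made for illustration. The sketch cuts a page into horizontal bands wherever a pixel row contains no black pixels.

```python
def split_by_profile(image):
    """Split a binary image into horizontal bands using a projection profile.

    Assumption: `image` is a list of rows and black pixels are 1.
    Returns a list of (first_row, last_row) pairs, one per band.
    """
    profile = [sum(row) for row in image]  # black pixels per row
    bands, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                       # a band begins
        elif count == 0 and start is not None:
            bands.append((start, y - 1))    # an empty row ends the band
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands


# Two text lines separated by an empty row are found as two bands.
upright = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
# A toy "skewed" page: the right half of each line is shifted down one row,
# so no pixel row is empty and the two lines merge into a single band.
skewed = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
upright_bands = split_by_profile(upright)
skewed_bands = split_by_profile(skewed)
```

In the skewed toy page every row contains some black pixel, so the profile never drops to zero and the two lines cannot be separated, which is the failure condition described above.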

Some algorithms have been reported in the literature for text string separation, an early stage in any document analysis system. However, many of these algorithms are very restrictive in the type of documents they can process; others are robust but they are too computationally intensive to be used without special purpose hardware [Fle88].









1.2 Objectives

The objectives of this research are first, to develop a robust and fast algorithm for segmentation of the page and a precise classification rule for the block, and second, to attempt an understanding of some nontext regions such as pictures or line drawings. In order to achieve these goals, we divided the tasks into the following subtasks:

(1) Development of a fast and robust algorithm to segment blocks.

To process the recognition and analysis of a page, the page should be segmented into blocks which are separated by spaces or lines. The algorithm should be fast and robust to skew.

(2) Development of an approach classifying each segmented block.

In order to recognize a page with text and nontext, each block should be classified according to what it contains. Nontext should be classified in detail to handle the document management.

(3) Development of a systematic and efficient matching process.

The processing of visual information is typically a multi-layered task. The human brain mechanisms of vision clearly indicate that there is a layered approach to the processing of visual information both physiologically and psychologically.

(4) Design of a system for understanding some graphics and symbols.

A page contains some graphics images, especially line drawings, to help readers understand. Since a typical drawing contains both text strings and graphics, recognition of text can be separated from the understanding of graphics image.









1.3 Approaches

The work in this dissertation uses the document analysis system shown in Figure 1-1. The primary processing step, which is the bulk of this dissertation, is illustrated in Figure 1-2. The design of a machine vision system concerned with document analysis involves several major problem areas. These are (1) the preprocessing problem known as low-level processing; (2) the image segmentation and classification problem categorized as a pre-step for intermediate-level processing; (3) the feature extraction and scene analysis problem known as intermediate-level processing. This dissertation presents several unique methodologies for some of the problem areas: (1) a block segmentation algorithm which is robust to page skewing, (2) a block classification algorithm invariant to the change of block shape, (3) some graph theoretical approach to the recognition of nontext areas such as symbols and/or line drawings.

An elegant algorithm is described for block segmentation. This process rapidly connects each component by dividing the document into blocks separated by spaces. In connecting components, the proposed algorithm applies an operation for each black pixel so as to generate a linearly expanded contour of each component. The performance of the algorithm is measured by the time required to execute it on problems of size N. This algorithm has a time complexity of N^2 like most previous top-down approaches for block segmentation; however, it is invariant to skew and it is faster than most previous approaches for most input cases even if they have the same time complexity of N^2. As the very next stage to the block









Figure 1-1. System Overview for Document Analysis. (Diagram blocks: Page Reader, Scanned Documents, Media Processing Station, Work Station, Coded Documents.)

Figure 1-2. Primary Components of a Document Analysis System. (Diagram blocks: Scanned Document, Preprocessing, Block Segmentation, Labelling, Classification, Text, Text Recognition, Line Drawing Recognition, Page Structure Analysis, Coded Document, Document Filing & Retrieval.)

segmentation, each block should be classified according to what it contains and block classification rules are required not to be restricted to error-free document image data. In order to develop an algorithm insensitive to skew, block classification rules based on pixel level will be considered as a way to solve this problem. Unlike previous classification rules, this algorithm tries to extract the features based on pixel level data.

Several measurements are considered for developing a block classification rule. In order to apply measurements for classifying the blocks, each block should be analyzed and considered based on structural hierarchy. For example, a text block is composed of text lines, and each text line consists of a variety of characters. The basic components of characters are line segments of uniform width. The proposed pixel operations can distinguish a line segment from a blob segment whose skeleton does not represent its own properties. Separation of text from complex fine line drawing is made by the removal of the pure line segment which is the primary component of the line drawing. Features such as probability of occurrence of black pixels in a block, the number of black pixels after applying consecutive operations, the black-white transition, the total number of black neighbors for each black pixel in the original image are used for a new classification rule.
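One of the features listed above, the relation between black-white transitions and black pixels, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the classification rule developed in Chapter 3: the name `transition_ratio` is hypothetical, black pixels are taken to be 1, and transitions are counted only along rows, from a black pixel to a white pixel on its right.

```python
def transition_ratio(block):
    """Ratio of black-to-white transitions to black pixels in a binary block.

    Assumptions: `block` is a list of rows, black pixels are 1, and a
    transition is a black pixel immediately followed by a white pixel
    in the same row.
    """
    black = sum(sum(row) for row in block)
    transitions = sum(
        1
        for row in block
        for left, right in zip(row, row[1:])
        if left == 1 and right == 0
    )
    return transitions / black if black else 0.0


# Thin, broken strokes (text-like) give a higher ratio than a solid region.
stroke_like = [
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
]
solid = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0],
]
stroke_ratio = transition_ratio(stroke_like)
solid_ratio = transition_ratio(solid)
```

Because both counts are taken over individual pixels rather than over text-line geometry, a ratio of this kind changes little when the block is rotated slightly.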

In the understanding stage, some symbols stored in the database will be distinguished from the picture image. These symbols are classified into two types depending upon whether the segment is of line or blob shape. Two different image processing techniques, thinning and boundary finding, will be applied to line and blob type symbols respectively in order to extract their features. Geometrical and topological features are utilized for the line type of symbol. On the other hand, the Fourier descriptor for the boundary, which is considered as a closed curve, can be used to recognize the blob type of symbol.



1.4 Preview of Remaining Chapters

Chapter 2 presents a brief survey of previous work on the segmentation of images and the classification of block segments. We discuss both (1) the direct method of segmenting the page image by both top-down and bottom-up approaches, and (2) the transformational method which converts a coordinate space to another coordinate space so as to eliminate the skew problem.

In chapter 3, we describe a robust algorithm for block segmentation and an unconstrained block classification rule. Since some documents consist of complex data such as text, graphics, and pictures, block segmentation and block classification are required for designing an automatic document analysis system. We also describe page-structure analysis based on standard document architecture and present experimental results.

In chapter 4, we describe the page recognition system which not only recognizes the text block but also understands some of the nontext regions. The text recognition only deals with the off-line case in which the characters are written completely on a sheet or on any other material, because only page images are treated in this thesis. The understanding of nontext has mainly been done on line drawings such as logic circuit diagrams, mechanical engineering drawings, and trademarks. Prior to the understanding stage, each line drawing is classified into a more detailed line-drawing category based on evidence extracted from it.

We close with chapter 5. After documents have been created, some sort of file-management facility is required for a user to collect the documents into piles and to put these piles away. When a document is filed away in a document-management system, it is necessary to be able to retrieve it at some later time. The design of a complete document-filing and retrieval system is very complex, and it is beyond the scope of this dissertation. Chapter 5 describes some aspects for a document-filing and retrieval system that would be appropriate for an extension of the work in this dissertation.














CHAPTER 2
BACKGROUND AND REVIEW OF PREVIOUS WORK


In this chapter we review previous work on the segmentation of page images and the classification of the segmented blocks. The automatic segmentation of the page image is a significant process for a document analysis system, first demonstrated by Wahl, Wong and Casey [Wah82]. They have presented a heuristic approach to the problem, by operating on a binary image for the page. As a preprocessing stage, they convert the grey-scale image into a binary-valued image format by comparing the gray-level values with a threshold value.

This chapter describes a few preprocessing techniques and the segmentation and classification processes which are at the heart of a document analysis system. Prior to reviewing previous work on segmentation of the page image and classification of the segmented blocks, some preprocessing techniques will be described briefly. Before the document image can be processed, a format conversion is required because the image obtained from a scanner is in raster format. This raster format of a document image is converted to a raw format in order to process the image data.



2.1 Preprocessing

Since the early 1960s, much work has been done in image processing, especially for the reduction of the amount of computation and the efficient reduction of errors. Even normal text documents present several difficult problems for further processing because of variations of type production. Some characters in the document image are often smeared or smudged, or sometimes printed with either very light strokes which are difficult to detect or very heavy strokes that tend to broaden and run together when imaged by a grey-level scanner. Furthermore, the amount of data in a scanned document is enormous. To solve these problems, we will discuss some techniques for (1) binary representation by image thresholding and (2) thinning and boundarization for object description.



2.1.1 Binary Representation by Image Thresholding

The creation of a binary representation from an analog image requires that we determine whether a point is converted into a binary one or a binary zero depending on the grey level measured by a scanner. Thresholding is an obvious tool for creating binary representations from grey level images. By judiciously choosing a grey level threshold between the dominant values of the object and the background, the original grey level image can be transformed into a binary form. Although the method appears to be simplistic, it is not easy to find the threshold value from the poor grey level image data normally returned by a scanner. The scanning hardware, due to technology and cost limitations, has nonuniform illumination over the scan field, sensitivity and dark current variations from element to element in the scanning array, and distorted resolution from the lens.
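As a minimal sketch of fixed-level thresholding, the Python fragment below maps each grey level to a binary value. The choices here are illustrative assumptions rather than details of the scanner setup described in the text: the image is a list of rows of integer grey levels, darker pixels have lower values, dark pixels are encoded as binary one, and `binarize` and `threshold` are hypothetical names.

```python
def binarize(image, threshold):
    """Map a grey-level image to a binary image.

    Assumption: grey levels below `threshold` are dark (object) pixels
    and are encoded as 1; all other pixels become 0.
    """
    return [[1 if g < threshold else 0 for g in row] for row in image]


grey = [
    [250, 240, 30, 245],
    [248, 20, 25, 251],
]
binary = binarize(grey, threshold=128)
```

With clean data any threshold between the two groups of values gives the same result; the difficulty noted above arises when illumination and sensor variations make those groups overlap.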









Kittler and Illingworth [Kit86] proposed minimum error thresholding, which is also applicable to multithreshold selection. The method uses a histogram $h(g)$ which summarizes the distribution of the grey levels in the image by giving the frequency of occurrence of each grey level. The histogram can be viewed as an estimate of the probability density function $p(g)$ of the mixture population comprising the grey levels of object and background pixels. In the following, each of the two components $p(g \mid i)$ of the mixture is normally distributed with mean $\mu_i$, standard deviation $\sigma_i$, and a priori probability $P_i$, i.e.

$$p(g) = \sum_{i=1}^{2} P_i \, p(g \mid i) \tag{2.1}$$

where

$$p(g \mid i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(g-\mu_i)^2}{2\sigma_i^2}\right) \tag{2.2}$$

For given $p(g \mid i)$ and $P_i$ there exists a gray level $\tau$ for which gray levels $g$ satisfy

$$P_1 \, p(g \mid 1) > P_2 \, p(g \mid 2), \quad g \le \tau \tag{2.3a}$$

$$P_1 \, p(g \mid 1) < P_2 \, p(g \mid 2), \quad g > \tau \tag{2.3b}$$

where $\tau$ is the Bayes minimum error threshold at which the image should be binarized. The problem of minimum error threshold selection is to determine the optimum threshold value $\tau$ from an estimate of the probability density function, namely the histogram. The minimum error threshold can be obtained by solving the quadratic equation which expresses the condition for the existence of the thresholding gray level. However, the parameters of the mixture density function associated with an image to be thresholded will not usually be known. Fitting techniques estimate these parameters from the gray level histogram. In the fitting technique, the average performance figure for the whole image is characterized by a criterion function. One of the techniques for finding the optimum threshold $\tau$ can be summarized as follows. Suppose that the gray level data is thresholded at some arbitrary level $T$ and each of the two resulting pixel populations is modeled by a normal density $h(g \mid i, T)$ with parameters $\mu_i(T)$, $\sigma_i(T)$ and a priori probability $P_i(T)$ given by

$$P_i(T) = \sum_{g=a}^{b} h(g) \tag{2.4}$$

$$\mu_i(T) = \left[ \sum_{g=a}^{b} h(g)\, g \right] \Big/ P_i(T) \tag{2.5}$$

and

$$\sigma_i^2(T) = \left[ \sum_{g=a}^{b} \bigl(g - \mu_i(T)\bigr)^2 h(g) \right] \Big/ P_i(T) \tag{2.6}$$

where

$$a = \begin{cases} 0, & i = 1 \\ T + 1, & i = 2 \end{cases} \tag{2.7}$$

and

$$b = \begin{cases} T, & i = 1 \\ n, & i = 2 \end{cases} \tag{2.8}$$

where $n$ is the gray level value of a white pixel. Now using the model $h(g \mid i, T)$, $i = 1, 2$, the conditional probability $e(g, T)$ of grey level $g$ being replaced in the image by a correct binary value is given by

$$e(g, T) = \frac{h(g \mid i, T)\, P_i(T)}{h(g)}, \qquad i = \begin{cases} 1, & g \le T \\ 2, & g > T \end{cases} \tag{2.9}$$

An index of correct classification performance, $\varepsilon(g, T)$, is obtained by taking the logarithm of the numerator in (2.9) and multiplying the result by -2.

$$\varepsilon(g, T) = \left[ \frac{g - \mu_i(T)}{\sigma_i(T)} \right]^2 + 2 \log \sigma_i(T) - 2 \log P_i(T), \qquad i = \begin{cases} 1, & g \le T \\ 2, & g > T \end{cases} \tag{2.10}$$

The average performance figure for the whole image can then be characterized by the criterion function

$$J(T) = \sum_{g} h(g)\, \varepsilon(g, T) \tag{2.11}$$

This criterion function reflects indirectly the amount of overlap between the Gaussian models of the object and background populations. The better the fit between the data and the models, the smaller the overlap between the density functions and therefore the smaller the classification error. The problem of minimum error threshold selection can then be formulated as one of minimizing the criterion $J(T)$, i.e.

$$J(\tau) = \min_{T} J(T) \tag{2.12}$$
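The search for the minimizing level can be sketched in Python as an exhaustive scan over candidate thresholds. This is a hedged illustration, not the procedure of [Kit86] as implemented anywhere: the function name is hypothetical, the histogram is assumed to hold integer pixel counts, and candidate levels that leave a population empty or with zero variance are simply skipped. The sketch also uses the fact that the squared term of the classification index sums to $P_i(T)$ over population $i$, so the criterion reduces to $J(T) = 1 + \sum_i P_i(T)\,[\log \sigma_i^2(T) - 2 \log P_i(T)]$.

```python
import math


def minimum_error_threshold(hist):
    """Return the level T that minimizes the criterion J(T).

    Assumptions: hist[g] is an integer pixel count for gray level g,
    population 1 is g <= T and population 2 is g > T, and levels that
    leave a population empty or with zero variance are skipped.
    """
    total = sum(hist)
    n = len(hist) - 1
    best_t, best_j = None, math.inf
    for t in range(n):
        j = 1.0
        usable = True
        for a, b in ((0, t), (t + 1, n)):        # limits a and b per population
            levels = range(a, b + 1)
            count = sum(hist[g] for g in levels)
            s1 = sum(g * hist[g] for g in levels)
            s2 = sum(g * g * hist[g] for g in levels)
            spread = count * s2 - s1 * s1        # exact: count^2 * variance
            if count == 0 or spread <= 0:
                usable = False
                break
            p = count / total                    # a priori probability P_i(T)
            var = spread / (count * count)       # variance sigma_i^2(T)
            j += p * (math.log(var) - 2.0 * math.log(p))
        if usable and j < best_j:
            best_t, best_j = t, j
    return best_t


# Bimodal toy histogram: one mode near level 3, another near level 12.
histogram = [0, 1, 4, 8, 4, 1, 0, 0, 0, 0, 1, 4, 8, 4, 1, 0]
threshold = minimum_error_threshold(histogram)
```

For the toy histogram the selected level falls in the empty valley between the two modes. Keeping the variance test in integer arithmetic avoids treating a single-level population as having a tiny nonzero variance through rounding.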









2.1.2 Thinning and Boundarization for Object Description

Two common ways to describe an object are by the use of boundaries and of skeletons. An object in two-dimensional space is completely determined if we know its borders, provided that we also know which side of each border is inside the object and which is outside. If it has holes, an object may have more than one border. A different way of describing an object makes use of representation by its "skeleton". Finding the skeleton of an object is usually called thinning of image data. The thinning algorithm usually is an iterative edge point erosion technique. The purpose of thinning is to simplify the boundary image by reducing it to its skeleton without destroying its geometrical shape and connectivity. In other words, the thinned pattern, the so-called skeleton, must preserve the connectedness and shape of the original pattern, except for extremely thick objects. It should be noted that the skeleton of a pattern may not be unique. Over the years, many algorithms have been proposed for thinning [Nac84, Pav80]. If the region is composed of thin components, it can be described well by its skeleton. Skeletons derived by the thinning algorithm keep the connectivity of regions.

Thinning usually consists of iteratively deleting border points, such that deletion of these points does not remove end-points, break the connectivity of the pattern, or cause excessive erosion. To prevent excessive erosion, an end-point cannot be removed. Here, an end-point is defined as a dark point with at most one dark eight-neighbor. The eight-neighbors of point p are defined to be the eight points adjacent to p. Points P0, P2, P4, and P6 are referred to as the four-neighbors of p in Figure 2-1. (In this representation it is assumed that the object is represented by a regular grid of measurements, in equal steps in both the row and column directions.)




P3 P2 P1
P4 P PO
P5 P6 P7


Figure 2-1. A point p and its neighbors.

Most thinning algorithms operate similarly: they delete a dark point from the pattern if it satisfies certain conditions, such as being an edge-point, not being an end-point, or matching predefined points. These algorithms are applied mostly to smooth synthesized patterns which have no irregularities and no noise. However, the thinning process is very sensitive to noise on the binary boundary. A small disturbance of the boundary causes the creation of small strokes as well as a disturbance in the main skeletons. In the boundary image, there is positive noise which is either isolated salt-and-pepper noise or boundary annexed noise. Therefore, we have to remove the small strokes from the main skeletons through a refining process after the main stream of the thinning process. The thinning algorithm is illustrated as follows:


1. Set the flag remain to true.
2. While remain is true do steps 3-14.
   Begin
3.   If p is 1 and not a pixel of a double line then do
4.     For j from 0 to 7 do steps 5-6.
5.       count the number of changes for 8-neighbors (c(p))
6.       count the number of dark pixels for 8-neighbors (N8(p))
7.     For j = 0, 2, 4, 6 do step 8.
8.       count the number of dark pixels for 4-neighbors (N4(p))
9.     If c(p) is 1 and 3 < N8(p) < 7 and N4(p) < 4 then do
10.      set p equal to 0.
11.  If p is last pixel then do steps 12-14.
12.    For all pixels p of image data do
       Begin
13.      If A ≠ B then set remain to true.
14.      else set remain to false.
       End.
15.  Set A to B.
   End.
16. End of Algorithm.
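The neighborhood tests in the steps above can be sketched in Python as follows. Several points are assumptions made for this sketch and are not confirmed by the listing: the "number of changes" c(p) is taken to be the number of white-to-black transitions met while walking once around P0..P7, pixels outside the image count as white, deletion is done in place during a raster scan, the "double line" test of step 3 is omitted, and the loop repeats until a full pass deletes nothing. The names `neighbor_features` and `thin` are hypothetical.

```python
# Offsets (row, column) of P0..P7 as laid out in Figure 2-1:
#   P3 P2 P1
#   P4 P  P0
#   P5 P6 P7
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]


def neighbor_features(img, r, c):
    """Return (c(p), N8(p), N4(p)) for the pixel at row r, column c."""
    rows, cols = len(img), len(img[0])
    nb = []
    for dr, dc in OFFSETS:
        rr, cc = r + dr, c + dc
        inside = 0 <= rr < rows and 0 <= cc < cols
        nb.append(img[rr][cc] if inside else 0)
    # Assumption: a "change" is a 0 -> 1 transition around the ring P0..P7.
    changes = sum(1 for k in range(8) if nb[k] == 0 and nb[(k + 1) % 8] == 1)
    n8 = sum(nb)                          # dark 8-neighbors
    n4 = nb[0] + nb[2] + nb[4] + nb[6]    # dark 4-neighbors
    return changes, n8, n4


def thin(img):
    """Repeat the deletion test of step 9 until a pass changes nothing."""
    out = [row[:] for row in img]         # work on a copy
    remain = True
    while remain:
        remain = False
        for r in range(len(out)):
            for c in range(len(out[0])):
                if out[r][c] != 1:
                    continue
                changes, n8, n4 = neighbor_features(out, r, c)
                if changes == 1 and 3 < n8 < 7 and n4 < 4:
                    out[r][c] = 0
                    remain = True
    return out


# A 3-pixel-thick horizontal bar inside a 5 x 9 image.
bar = [[1 if 1 <= r <= 3 and 1 <= c <= 7 else 0 for c in range(9)] for r in range(5)]
thinned = thin(bar)
```

Because a pixel is deleted only when more than three of its neighbors are dark, end-points are never removed, and the requirement c(p) = 1 keeps a deletion from splitting the neighborhood into separate pieces.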



2.2 Review of Previous Work

Since the late 1970s, several approaches for text block separation from mixed text/graphics images have been proposed as an intermediate process for document analysis systems. These approaches are categorized into two methods, direct and transformational. The page is usually split into blocks so that the reader can read with ease. The white space between the blocks is usually wider than the space between text lines. Direct methods separate the document image into several blocks by applying certain rules to the image data directly. The page image is segmented into blocks by layout structure and each block is classified. The transformational method converts the image data from one coordinate space to another coordinate space, then applies a rule to it. Such a conversion was performed for designing a robust approach to separating text blocks from mixed text and graphics images. A transform known as the Hough transform is applied for separating text images from documents with text and graphics images. Here, we will discuss the most representative frameworks for (1) the block segmentation rule for page images, (2) the block classification rule for the segmented block, and (3) recognizing textual blocks using the Hough transform.



2.2.1 Block Segmentation Rule for Page Images

Wong et al. [Won82] and Wahl et al. [Wah82] proposed the Document Analysis System, which assists a user in encoding a printed document for computer processing. To analyze the document image, which is the visual representation of a page, the system consists mainly of a block segmentation stage and a block classification stage. First, a segmentation procedure subdivides the area of a document into blocks, each of which should contain only one type of data. Second, some basic features of these blocks are calculated in order to classify each block as a specific type. At an early stage of the system, a run length smoothing algorithm (RLSA), which operates on every dark point in the document image, serves as the block segmentation rule. The RLSA had been used earlier to detect long vertical and horizontal white lines, and it was extended to obtain a bitmap of white and black areas representing blocks that contain the various types of data. The RLSA fills the gap between two black pixels on the same row or the same column when at most a predetermined number of contiguous white pixels lie between them. It is first applied row-by-row and then column-by-column, yielding two distinct bitmaps. The two results are then combined by applying a logical AND at each pixel location. The RLSA can detect small blocks such that each block includes just one text line in the text region. The RLSA is fast; however, it fails if the text lines are skewed: it does not extract any text lines even though text lines exist in the page. Figure 2-2 shows the segmented blocks of a document skewed by 1.2°. Although the document is skewed by less than two degrees, the RLSA does not generate blocks of text lines.

Figure 2-2. The failure of RLSA when the document is skewed.
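The row-wise and column-wise smoothing followed by a logical AND can be sketched as below. This is a minimal illustration, not the implementation of [Won82]: the function names `smooth_runs` and `rlsa` and the two threshold parameters are hypothetical, and the image is assumed to be a list of rows holding 0 (white) and 1 (black).

```python
def smooth_runs(line, threshold):
    """Fill each white run of length <= threshold that lies between two black pixels."""
    out = list(line)
    last_black = None
    for i, value in enumerate(line):
        if value == 1:
            if last_black is not None and 0 < i - last_black - 1 <= threshold:
                for k in range(last_black + 1, i):
                    out[k] = 1
            last_black = i
    return out


def rlsa(image, h_threshold, v_threshold):
    """Smooth rows and columns independently, then AND the two bitmaps."""
    by_rows = [smooth_runs(row, h_threshold) for row in image]
    # zip(*image) walks the columns; smooth each one, then transpose back.
    smoothed_cols = [smooth_runs(list(col), v_threshold) for col in zip(*image)]
    by_cols = [list(row) for row in zip(*smoothed_cols)]
    return [[a & b for a, b in zip(r1, r2)] for r1, r2 in zip(by_rows, by_cols)]
```

Because the two passes are combined with AND, a white pixel is filled only when it is bridged both horizontally and vertically, which is why a small skew that breaks the horizontal alignment of gaps changes the resulting blocks.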

Nagy et al. [Nag86] described a top-down segmentation strategy called RXYC (recursive X-Y cuts), also known as projection profile cuts. Printed pages are conventionally made up of rectangular blocks, so a page can be recursively cut into rectangular blocks, and the document is represented as a tree of nested rectangular blocks. At each step of the recursive process, the projection profile is computed along both the horizontal and the vertical direction; each value of a profile is simply the sum of all the pixel values along one line in that direction. Division along the two directions is then accomplished by making cuts at deep valleys in the projection profile whose width is larger than a predetermined threshold. The RXYC identifies large blocks, as illustrated in Figure 2-3; however, like the RLSA, it fails if the text lines are skewed.

Figure 2-3. An example of the subdivided page by RXYC.
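A recursive cut on projection profiles might look like the sketch below. It makes several simplifying assumptions that are not taken from [Nag86]: a valley is treated as a run of completely empty profile entries, the widest valley in either direction is cut first, and the leaf blocks are returned as a flat list rather than a tree. All function names are hypothetical.

```python
def widest_gap(profile, min_width):
    """Return (start, end) of the widest interior run of zeros of at least min_width, or None."""
    best = None
    i, n = 0, len(profile)
    while i < n:
        if profile[i] == 0:
            j = i
            while j < n and profile[j] == 0:
                j += 1
            interior = i > 0 and j < n          # ignore margins at the block border
            if interior and j - i >= min_width:
                if best is None or j - i > best[1] - best[0]:
                    best = (i, j)
            i = j
        else:
            i += 1
    return best


def xy_cut(image, top, bottom, left, right, min_width, blocks):
    """Recursively split rows top..bottom-1, columns left..right-1; collect leaf blocks."""
    row_profile = [sum(image[r][left:right]) for r in range(top, bottom)]
    col_profile = [sum(image[r][c] for r in range(top, bottom))
                   for c in range(left, right)]
    row_gap = widest_gap(row_profile, min_width)
    col_gap = widest_gap(col_profile, min_width)
    if row_gap is None and col_gap is None:
        blocks.append((top, bottom, left, right))
        return
    row_len = row_gap[1] - row_gap[0] if row_gap else 0
    col_len = col_gap[1] - col_gap[0] if col_gap else 0
    if row_len >= col_len:
        start, end = row_gap
        xy_cut(image, top, top + start, left, right, min_width, blocks)
        xy_cut(image, top + end, bottom, left, right, min_width, blocks)
    else:
        start, end = col_gap
        xy_cut(image, top, bottom, left, left + start, min_width, blocks)
        xy_cut(image, top, bottom, left + end, right, min_width, blocks)


def segment(image, min_width):
    """Return leaf blocks as (top, bottom, left, right) with exclusive upper bounds."""
    blocks = []
    xy_cut(image, 0, len(image), 0, len(image[0]), min_width, blocks)
    return blocks
```

With a skewed page the white gaps no longer line up with the rows and columns, so the profiles contain no empty entries and no cut is made, which is the failure mode noted above.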

The approaches described thus far perform certain global operations on the entire image. In contrast, Doster [Dos84] proposed a bottom-up approach, which first determines the individual connected components. In the case of text blocks, the characters are merged into words, words into lines, lines into paragraphs, and paragraphs into even larger blocks, if such merging is possible. However, this approach requires extensive memory resources and is very slow [Sri86].





2.2.2 Classification Rule for the Segmented Blocks

Scherl et al. [Sch80] described a simple method for obtaining characteristic features of text, graphics, and picture segments. It subdivides the document into small, overlapping windows, and within each window a grey level histogram is evaluated. The features can then be extracted statistically from the histogram. Text consists of a white background with black characters on it. The background is almost entirely of one intensity level and is the largest and brightest part within a window. The characters do not consist of solid black lines but of many transitions between black and white. Because of this, a small sharp peak at a bright greylevel and many darker greylevels are typical of text. Meanwhile, the histogram of a picture has no similarly sharp characteristics; its shape depends strongly on the content of the picture. In some cases, the histogram of a picture might look like the histogram of text. Usually, however, the percentage of darker levels within a picture is higher, and often the brightest points within pictures are darker than the background of text. Furthermore, if a graphic consists only of lines, its greylevel histogram will not differ much from that of text. Therefore, statistical features taken from a greylevel histogram do not seem suitable for discriminating text from graphics. The shape of the histogram also depends largely on the size of the window. A larger window makes the shape of the histogram depend less on the position of the window within the text. However, larger windows decrease the accuracy.

A method for block classification was proposed by Wong et al. [Won82]. They use block height and block mean black pixel run length as the basic features. Several measurements are taken to distinguish text line blocks from graphics or halftone picture blocks: the total number of black pixels in the segmented image block (BC), the minimum x-y coordinates of a block and its x-y lengths (x_min, y_min, Δx, Δy), the total number of black pixels in the original image for the block (DC), and the number of horizontal white-black transitions in the original image block (TC). The following features are measured during component labeling: (1) the height of each block segment, H = Δy; (2) the eccentricity of the rectangle surrounding the block, E = Δx/Δy; (3) the ratio of black area to enclosing box area, S = BC/(Δx Δy); and (4) the mean horizontal length of the black runs of the original data in each block, Rm = DC/TC. Some of the features are illustrated in Figure 2-4.
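The four features can be computed from the measurements BC, DC, and TC as in the sketch below. The function name, argument names, and the dictionary layout are hypothetical; `original` and `smoothed` are assumed to be lists of rows of 0/1 pixels, and the pixel to the left of the block is assumed to be white when transitions are counted.

```python
def block_features(original, smoothed, x_min, y_min, dx, dy):
    """Return the features H, E, S, and Rm for one rectangular block."""
    bc = dc = tc = 0
    for y in range(y_min, y_min + dy):
        previous = 0  # assumption: the pixel left of the block counts as white
        for x in range(x_min, x_min + dx):
            bc += smoothed[y][x]                 # black pixels in the segmented block
            dc += original[y][x]                 # black pixels in the original data
            if previous == 0 and original[y][x] == 1:
                tc += 1                          # one white-to-black transition
            previous = original[y][x]
    return {
        "H": dy,                                 # block height
        "E": dx / dy,                            # eccentricity of the bounding box
        "S": bc / (dx * dy),                     # black area over box area
        "Rm": dc / tc if tc else 0.0,            # mean black run length
    }
```

Since every horizontal black run starts with exactly one white-to-black transition, DC/TC is the mean length of the black runs in the block.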

A block is considered to be text if its R and H values are less than some constant multiples of the mean length of black run and the mean height. In other words, a pattern classification scheme that assumes linear separability is used to determine the text region in the plane with the mean height (Hm) as one coordinate and the mean length of black run (Rm) as the other. The distribution of values in the R-H plane obtained from sample documents is observed in order to determine the discriminant function. For example, text is the predominant data type in a typical office document, and text lines are basically textured stripes of approximately constant height H and mean length of black run Rm. Text blocks tend to cluster with respect to these features. Figure 2-5 illustrates the distribution of values in the Rm-H plane. The text lines of a document form a clustered population within the range 20 < H < 35 and 2 < Rm < 8. Low Rm and H values represent the regions that contain text. Graphic and halftone images have high values of H, whereas solid black lines have high R and low H values in the R-H plane.

Figure 2-4. The typical block shape of text line in the RLSA. (a) The shape of a text block. (b) An example of the shape of a non-text block.


Figure 2-5. The distribution of Rm and H values for each block type.

The Hm and Rm values for the text cluster may vary for different types of documents, depending on character size and font. Furthermore, the text cluster's standard deviations σ(Hm) and σ(Rm) may also vary depending on whether a document is set in a single font or in multiple fonts and character sizes. These authors applied heuristic rules to classify the blocks regardless of character size and font. A variable, linearly separable classification scheme is used to assign the blocks to the following four classes.

Text: if R < C21 × Rm and H < C22 × Hm

Horizontal solid black lines: if R > C21 × Rm and H < C22 × Hm

Graphics and halftone images: if E > 1/C23 and H > C22 × Hm

Vertical solid black lines: if E < 1/C23 and H > C22 × Hm

The constants C21, C22, and C23 are determined heuristically by examining the R-H plane plot of typical documents and the values of Rm and Hm. The authors assigned values to these parameters based on several training documents. Although prior knowledge about the structural characteristics of a newspaper can be used for classifying blocks, in some cases these features will lead to classification errors. For example, the geometric characteristic that a text line has an approximately constant height could be used to decide that a block is a text line. But if the image was skewed while being digitized, some text lines would be linked together by the block segmentation procedure, and the linked text lines would then be classified into the graphic and halftone categories.
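The four-class rule can be written as a small decision function, sketched here under explicit assumptions. The default values of the constants are placeholders chosen only for illustration, not values taken from this text. The vertical-line class is assumed to be the tall, narrow case E < 1/C23, which is what distinguishes it from the graphics class. The rule is stated with strict inequalities, so the handling of exact equality below is also an arbitrary choice. The function name and class labels are hypothetical.

```python
def classify_block(h, r, e, h_mean, r_mean, c21=3.0, c22=3.0, c23=5.0):
    """Assign a block to one of four classes from its H, R, and E features.

    h, r, e        -- height, mean black run length, eccentricity of the block
    h_mean, r_mean -- Hm and Rm of the text cluster
    c21, c22, c23  -- heuristic constants (illustrative defaults, an assumption)
    """
    if h < c22 * h_mean:
        # Low blocks: short runs indicate text, long runs a horizontal rule.
        return "text" if r < c21 * r_mean else "horizontal line"
    # Tall blocks: wide ones are graphics or halftone, narrow ones vertical rules.
    return "graphics or halftone" if e > 1.0 / c23 else "vertical line"
```

A skewed page that merges several text lines into one tall block pushes H above the C22 × Hm boundary, so the merged block falls into the second branch and is reported as graphics or halftone, which is the misclassification described above.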

In Wahl's work [Wah83], a distance-mapping function based on the border-to-border distance of the block to be classified is used for shape discrimination. The new distance is a function of the two Cartesian coordinates x, y, with (x,y) ∈ S, and an angle φ measured from the x axis (0° ≤ φ < 180°). It is defined as the length of the line segment B1B2 with angle φ that passes through an inner point p(x,y) of S and connects two opposite border points B1, B2 of S, such that the line segment B1B2 lies entirely inside S (Figure 2-6). The minimum and the maximum line segment lengths at any point (x,y) over all possible angles φ are defined as D_min(x,y) and D_max(x,y), respectively. In addition, an eccentricity mapping D_ecc(x,y) is defined as D_ecc(x,y) = D_max(x,y)/D_min(x,y) = max_φ[d(x,y,φ)] / min_φ[d(x,y,φ)]. Similarly, d_min, d_max, and d_ecc are defined as the average values of D_min(x,y), D_max(x,y), and D_ecc(x,y) over all the pixels in the connected component.

Simple shape factors f1 and f2, derived from d_min and d_max as f1 = c1 A / d_min^2 and f2 = c2 A / d_max^2, respectively, can be used as features to discriminate text, graphics, and thresholded gray-level pictures. In both shape factors, A is the number of pixels in discrete space, and c1 and c2 are constants determined by experimentation. Text has a large f1 value, and graphics have a relatively large f1 value too. However, text has a large f2 value compared to graphics and thresholded gray-level pictures, while thresholded gray-level pictures have a small f1 compared to the other two types of images.

Figure 2-6. Illustration of the distance mapping function d(x,y,φ).

Wang et al. [Wan89] adopt statistical texture analysis for discriminating document image categories. Their statistical approach to texture analysis has two basic stages: (1) a series of intermediate matrices is computed from the image region, and (2) a set of features is computed from these intermediate matrices. For the intermediate matrices, they use the BW matrix and the BWB matrix. The BW matrix is built from black-white pair runs, each a set of consecutive black pixels followed by a set of consecutive white pixels; the BWB matrix is built from runs in which the white pixels are in turn followed by another set of consecutive black pixels. These matrices are illustrated in Figure 2-7. The length of a run is the number of pixels in the run. A black-white pair run is assigned to one of nine categories (1-9) depending on its proportion of white pixels; the category number represents the percentage of white pixels in the run. The matrix element p(i,j) specifies the number of times that the image contains a black-white pair run of length j in which 10 × i percent of the pixels are white. For example, the matrix element p(3,10) counts the black-white pair runs of length 10 in which 30 percent of the pixels are white.

Meanwhile, a black-white-black combination run is defined as a pixel sequence in which two black pixel runs are separated by a white pixel run, as shown in Figure 2-7 (b). The length of a black pixel run is assigned to one of three fixed categories according to a predefined arrangement. The matrix element p(i,j) is the number of times that the image contains a black-white-black combination run, in the horizontal direction, with white pixel run length j and black pixel runs whose length lies in category i.

Figure 2-7. Definition of BW and BWB matrices.

In order to create a three-dimensional feature space that can distinguish all the blocks in a document image, these authors also derive two features, F1 and F2, from the BW matrix and a feature F3 from the BWB matrix. These features are defined as follows.

(1) Short Run Emphasis

$$F_1 = \sum_{i=1}^{N_c} \sum_{j=1}^{N_r} \frac{p(i,j)}{j^2} \Bigg/ \sum_{i=1}^{N_c} \sum_{j=1}^{N_r} p(i,j) \qquad (2.13)$$

In short run emphasis, the matrix element p(i,j) is the (i,j)-th entry in the given run length matrix, N_c is the number of different kinds of pixel runs, and N_r is the number of different run lengths that occur. The value of F1 for small letters is larger than the value of F1 for large letters, because the white spaces between strokes in small letters are smaller than those in large letters. Meanwhile, the second feature is defined as follows:

(2) Long Run Emphasis

$$F_2 = \sum_{i=1}^{N_c} \sum_{j=1}^{N_r} j^2 \, p(i,j) \Bigg/ \sum_{i=1}^{N_c} \sum_{j=1}^{N_r} p(i,j) \qquad (2.14)$$

The long run emphasis, on the contrary, has a larger value for large letter blocks than for small letter blocks.

A third feature, derived from the BWB matrix, is extra long run emphasis.

(3) Extra Long Run Emphasis

$$F_3 = \sum_{j=T_1}^{N_r} j^2 \sum_{i=1}^{N_c} p'(i,j) \Bigg/ \sum_{j=T_1}^{N_r} \sum_{i=1}^{N_c} p'(i,j) \qquad (2.15)$$

where

$$p'(i,j) = \begin{cases} p(i,j) & \text{if } p(i,j) > T_2 \\ 0 & \text{if } p(i,j) \le T_2 \end{cases} \qquad (2.16)$$



In extra long run emphasis, the threshold T1 is set to discard short run lengths, because only very long run lengths are needed to express the characteristics of graphics blocks. The threshold T2 removes the effect of small values of p(i,j), because a long run appears occasionally in letter blocks and photograph blocks. Thresholds T1 and T2 are determined by experimentation. The three features were measured for several different types of sample blocks. In the F1-F2 feature space, the blocks of each type of document image cluster together within their class and are well separated between classes, except for graphics blocks. The feature F3, however, separates graphics blocks from the other types of blocks.
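The three run emphasis features can be sketched as follows for a run length matrix stored as a list of rows, where row i holds the counts for run category i and column position j - 1 holds the count for run length j. Two points are assumptions rather than statements from the source: short run emphasis is taken in its usual form, with each count weighted by 1/j² and normalized by the total count, and the extra long run emphasis is normalized by the thresholded counts that remain after applying T1 and T2. The function names are hypothetical.

```python
def run_emphasis(p):
    """Return (F1, F2): short and long run emphasis of run length matrix p."""
    total = sum(sum(row) for row in p)
    short = sum(v / (j * j) for row in p for j, v in enumerate(row, start=1))
    long_ = sum(v * j * j for row in p for j, v in enumerate(row, start=1))
    return short / total, long_ / total


def extra_long_run_emphasis(p, t1, t2):
    """Return F3: long run emphasis restricted to run lengths j >= t1 and counts > t2."""
    numerator = 0
    denominator = 0
    for row in p:
        for j, v in enumerate(row, start=1):
            if j >= t1 and v > t2:      # thresholded entry p'(i, j)
                numerator += j * j * v
                denominator += v
    # Assumed normalization: by the thresholded counts; 0.0 when nothing survives.
    return numerator / denominator if denominator else 0.0
```

A matrix dominated by short runs gives a large F1 and a small F2, which matches the behavior described for small letters, while F3 responds only to the long runs that survive both thresholds.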










2.2.3 Line-to-Point Transformation

The transformation of a line in Cartesian coordinate space to a point in polar coordinate space was developed by Hough. A straight line is described in Figure 2-8 (a) as ρ = x cos θ + y sin θ, where ρ is the normal distance of the line from the origin and θ is the angle of the normal with respect to the x axis. The Hough transform of the line is simply a point with coordinates (ρ, θ) in the polar domain (Figure 2-8 (b)). A family of lines passing through a common point (Figure 2-8 (c)) maps into a connected set in the polar domain (Figure 2-8 (d)).

The connected set for the family of lines passing through point A in Figure 2-8 (e) is drawn at the top of Figure 2-8 (f), the connected set for point B in the middle, and the one for point C at the bottom. These connected sets meet at (ρ0, θ0) in Figure 2-8 (f). This occurs because the three points are collinear in Cartesian coordinate space.
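A small numerical check makes the meeting point concrete. The three points below are illustrative values, not the points A, B, and C of Figure 2-8: they are chosen on the line x + y = 1, whose normal makes an angle of 45 degrees with the x axis. The helper name `rho` is hypothetical.

```python
import math


def rho(x, y, theta_deg):
    """Normal distance of the line through (x, y) whose normal has angle theta_deg."""
    t = math.radians(theta_deg)
    return x * math.cos(t) + y * math.sin(t)


# Three collinear points on the line x + y = 1 (illustrative values).
points = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]

# At theta = 45 degrees all three curves take the same value, 1 / sqrt(2).
at_45 = [rho(x, y, 45) for x, y in points]

# At another angle, for example theta = 0, the three curves are apart.
at_0 = [rho(x, y, 0) for x, y in points]
```

Each point traces the curve ρ(θ) = x cos θ + y sin θ in the polar domain, and the single angle at which the three curves agree is the (ρ0, θ0) of the common line.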



2.2.4 Recognizing Textual Blocks Using the Hough Transform

The Hough transform has found numerous applications, such as detecting lines and curves in pictures and handling multi-valued images. Rastogi et al. [Ras86] applied the Hough transform to document analysis. The Hough transform technique detects the presence of a parametrically representable group of points in an image, such as a straight line or a circle, through a mapping to a parameter space. They utilized the fact that pages consist of straight components; for example, text lines are straight. If we use the Hough transform for document analysis, we can extract text lines because the characters on a text line are usually collinear.

Figure 2-8. The Hough transform.

The Hough transform is applied to the centroid of each connected component in Cartesian coordinate space, and the result can be shown as a three-dimensional view over the polar domain. The accumulator at each point in the polar domain counts the connected sets passing through that point, where each connected set is the transform of the centroid of one connected component in the document image. An array of accumulators is set up by quantizing the values of ρ and θ. For a 512 by 512 image, the possible range of ρ is -362 to 362. This value is arrived at by assuming the origin of the (ρ, θ) space to be at the center of the 512 by 512 image; hence the maximum value of ρ is 256√2 ≈ 362, and the range of θ is 0° to 180°. Code for the algorithm is as follows:



For all centroids (x, y) of components
    For θ = 0 to 180 degrees
    {
        ρ = x cos θ + y sin θ
        accumulator[ρ, θ] = accumulator[ρ, θ] + 1
    }




The above states that the Hough transform is applied to every significant point at row x and column y. For a 512 by 512 image, the maximum value of the normal distance ρ from the center of the image is 362, reached where x is 256 and y is 256, i.e. ρ = 256 cos 45° + 256 sin 45°.
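The accumulation loop can be sketched in runnable form as below. The choices here are assumptions made for illustration: θ is quantized in steps of one degree over 0 to 179, ρ is rounded to the nearest integer, the origin is moved to the image center as described above, and the accumulator is kept as a `collections.Counter` keyed by (ρ, θ) rather than as a fixed array. The function name is hypothetical.

```python
import math
from collections import Counter


def hough_accumulate(centroids, size=512):
    """Vote in (rho, theta) space for every centroid (x, y) of a connected component."""
    half = size / 2.0
    accumulator = Counter()
    for x, y in centroids:
        xc, yc = x - half, y - half          # coordinates relative to the image center
        for theta in range(180):             # one-degree steps (assumed quantization)
            t = math.radians(theta)
            rho = int(round(xc * math.cos(t) + yc * math.sin(t)))
            accumulator[(rho, theta)] += 1
    return accumulator


# Three centroids on the same horizontal text line y = 300 (illustrative values).
votes = hough_accumulate([(100, 300), (200, 300), (400, 300)])
peak = max(votes, key=votes.get)
```

For centroids lying on one horizontal text line the votes pile up in the cell whose θ is 90 degrees and whose ρ is the line's offset from the image center, so a peak in the accumulator marks a text line.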

The Hough transform for document analysis converts 2-D page images to 3-D images. The valleys of the 3-D images transformed from 2-D page images separate the blocks. This mapping is one-to-many in either direction, and among the various properties that hold for this transformation are the following: (1) a point in the document image corresponds to a sinusoidal curve in the parameter plane, (2) a point in the parameter plane corresponds to a straight line in the document image, (3) points lying on the same straight line in the document image correspond to curves through a common point in the parameter plane, and (4) points lying on the same curve in the parameter plane correspond to lines through the same point in the document image.

Fletcher et al. [Fle88] described an algorithm that is robust to changes in text font style and size within an image. The algorithm uses simple heuristics based on the characteristics of text strings. This segmentation algorithm is based on grouping collinear connected components of similar size and does not recognize individual characters. The algorithm consists of five steps: (1) connected component generation, (2) area/ratio filter, (3) collinear component grouping, (4) logical grouping of strings into words and phrases, and (5) text string separation. In the analysis of block structure for the 3-D document image, blocks are easily accessed by looking at rectangular chunks cut across by the true orientation lines and their perpendiculars. The authors analyzed the number of rows and the width of the transitions to classify the blocks. They also use several properties such as the size of the rectangles and their eccentricity, orientation, texture, and complexity.










2.2.5 Conclusions from Previous Work

As discussed earlier, a document analysis system consists of two activities, block segmentation and block classification. Most of the early work tried to segment the page image directly from the bit-map of the page [Won82, Nag86]. The RLSA generates every text line as a block, in other words, many blocks with almost the same height as the text lines. Thus, the RLSA generates too many blocks; furthermore, it has another shortcoming called the skew problem [Bai87]. If text lines are scanned with a skew of even less than a few degrees, so that some part of the highest letter in the next text line is vertically at the same height as or above the lowest pixel of the text line being scanned, then the RLSA cannot generate a block with the height of the text line.

An approach, based on the observation that documents generally have rectangular block structures, used a projection profile to segment the block [Nag86]. The projection of the black pixel counts along the horizontal and vertical directions was used for block segmentation [Zen85, Mas85]. The projection method is very sensitive to document skew with respect to the raster-scanning direction of the scanner. It produces satisfactory results only for documents with rectangular block structure. To overcome this problem, skew detection should be done by iteratively examining small angle deviations from the normal direction to determine which angle gives the steepest variation of the projection profile [Mas85].

Rastogi and Srihari [Ras86] applied the Hough transform to document analysis in order to solve the skew problem. This method is invariant to skew; however, it requires much computation time for the preliminary steps and extensive memory resources. It is also highly CPU intensive and consequently too slow to be applied to document analysis without the support of special hardware.

Most of the previous work for block classification was developed under the assumption that documents were digitized without any skew. The classification scheme in [Won82] uses block height and block mean black pixel run length, and it requires that block height should not be taller than the height of the highest letter in that text line. However, if the page is scanned with even a few degrees of skew, some text lines will be stuck together. Thus, the height of stuck text lines leads to the failure of the block classification rule.

Wang and Srihari [Wan89] considered an image region to possess a certain texture if it has some basic subpatterns which occur repeatedly according to specific rules of arrangement. Two matrices, the BW matrix and the BWB matrix, are used to represent the textual characteristics of a newspaper image block. In addition, three features, (1) short run emphasis (SRE), (2) long run emphasis (LRE), and (3) extra long run emphasis (ELRE), were measured for several sample blocks segmented from the newspaper image. From these three features, a two-dimensional SRE-LRE space and a three-dimensional feature space formed by SRE, LRE, and ELRE are used to distinguish between the different types of blocks. Although prior knowledge about the structural characteristics of a newspaper can be used for classifying blocks, in some cases these features will lead to classification errors. For example, the geometric characteristic that a text line has an approximately constant height could be used to decide that a block should be a text line. But if the image was skewed while being digitized, some text lines would be linked together by the block segmentation procedure, and the linked text lines might then be classified into the graphic and halftone categories.














CHAPTER 3
MACHINE UNDERSTANDING OF STRUCTURES IN DOCUMENTS


3.1 Introduction

Automatic block segmentation and classification of a digitized document image are necessary elements of a document analysis system capable of understanding a document consisting of a mixture of text and graphics images. Such block segmentation can be done by element, by text line, or by relatively big paragraphs separated by wide white space. Block segmentation by text line has been described using the run length smoothing algorithm [Wah82]. The method of recursive X-Y cuts [Nag86] generates bigger blocks, obtained from the projection profile. However, both algorithms require that documents be placed without skew. The Hough transform has been used to design a system which is very insensitive to document skew and which separates text strings from mixed text/graphics images [Fle88]. However, this robust algorithm is so CPU intensive that it may require special purpose hardware for acceptable response times.

The failure of block segmentation due to skewed document image data not only separates the document inappropriately but also induces misclassification of the blocks. In this chapter, we will address the development and implementation of a robust algorithm for automatic separation and analysis of text, graphics, and halftone images. This new algorithm for block segmentation divides the page along the white spaces which separate the blocks, despite the skew of the document. Then, the segmented blocks will be classified by a classification scheme which is not affected by rotation of the block.

In the understanding stage, the document analysis system not only recognizes the text blocks but also understands the nontext areas. Such a system requires an advanced capability to analyze an object and to synthesize the technologies developed by various methodologies. A complete understanding system for a two-dimensional page image with printed characters and graphics is the integration of work done in several problem areas. This system not only encodes each block using a different type of data format to reduce the memory size, but understands the documents as well.



3.2 The Block Segmentation of Digitized Documents

In this section, we will discuss block segmentation, a procedure that subdivides the area of a digitized document into blocks in order to process the document images systematically. Each block should ideally contain only one type of image data. Such block segmentation of document image data should be done by certain rules. As we described earlier, it can be done by element, by text line, or by relatively big blocks. The previous block segmentation approaches, such as the run length smoothing algorithm and recursive X-Y cuts (also called projection profile cuts), require the document to be placed without skew. A block segmentation algorithm will be evaluated on a few aspects: (1) time complexity, (2) robustness, and (3) block size.









The document image can be segmented into blocks by two different methods, top-down and bottom-up. In the top-down approach, certain global operations are performed on the entire image. In the bottom-up approach, on the other hand, all the components in the document image are individually detected and then merged together into larger blocks. We will address a new approach for block segmentation that belongs to the top-down category. This new approach connects components to generate bigger connected components of appropriate size. To generate the connected components, we will apply the operator defined as follows:

Definition 1. Let P(x,y) be a picture element at location x and y in the document D. Define P(x,y) = 1 as a black picture element and P(x,y) = 0 as a white picture element.

Definition 2. The operator OP_d executes the following operation. If P(x,y) = 1, then OP_d(P(x,y)) generates picture elements such that P(x + α, y + β) = 1 for every α, β with |α| ≤ d and |β| ≤ d, where α and β represent the row and column offsets and d is a predetermined distance.
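Definition 2 can be sketched as the following expansion of every black pixel into a square of side 2d + 1. The function name `expand` is hypothetical, the image is assumed to be a list of rows of 0/1 values, and offsets that fall outside the image are simply ignored, which the definition does not specify.

```python
def expand(image, d):
    """Set to 1 every pixel within row and column distance d of a black pixel."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 1:
                # Clip the (2d + 1) x (2d + 1) square to the image borders.
                for rr in range(max(0, r - d), min(rows, r + d + 1)):
                    for cc in range(max(0, c - d), min(cols, c + d + 1)):
                        out[rr][cc] = 1
    return out
```

Two black pixels separated by a white gap of g pixels become connected once 2d is at least g, which is why the distance is compared with half the spacing between lines and between blocks.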

The operator OP_d with the predetermined distance is applied to every black pixel of the document image data. It expands each black pixel so that all the black pixels in an area that can intuitively be separated from other areas become connected. The predetermined distance is set to a value larger than half the distance between text lines and less than half the distance between blocks. Normally, the space between blocks is wider than the space between text lines. Of course, this is not a rule that all documents have to abide by. However, we will exclude documents with a poor layout structure that violates the above condition. Table 3-1 shows the space distance between blocks for several documents.





Table 3-1. Illustration for the space distance between blocks.

Block Pairs                     Distance
Text : Text                     4 - 9
Text (Bold face) : Text         4 - 10
Text : Text (Bold face)         5 - 10
Text : Nontext                  2 - 17
Nontext : Text                  3 - 28
Caption (Nontext) : Nontext     4 - 7
Nontext : Caption (Nontext)     3 - 14

(Unit = distance between text lines)


3.3 Labelling of Segmented Blocks


To identify each block separated by the block segmentation process, labels have to be assigned to the different blocks for subsequent procedures such as block classification and feature extraction. This labelling process treats the individual connected components of a set S as separate objects. S is a bit map representation of the scanned document page. Each element of S has a value of 1 for black pixels and a value of 0 for white pixels.









We can label the components of S by performing a row-by-row scan. As the first line is scanned from left to right, the label 1 is assigned to the first black pixel. This label is propagated; in other words, subsequent adjacent black pixels are labeled with 1 until the first white pixel is encountered. The next black pixel along the line is labeled with 2, and so are its adjacent black neighbors. This continues until the end of the first line is reached. For each black pixel in the second and later lines, the neighborhood in the previously labeled line is examined along with the left neighbor of the pixel; the two upper diagonal neighbors, already visited by the scan, are included in this examination. If all of these previously visited neighbors are 0, the current pixel P gets a new label; that is, a black pixel with no labeled neighbor receives the next label not yet assigned. If one of them is 1, pixel P gets the same label; if two or more of them are 1, pixel P gets one of their labels and the equivalences are noted. This labelling process is illustrated in Figure 3-1 (a). The procedure is continued until the bottom line of the binary image is reached. Because of such equivalences, some adjacent black regions may carry different labels. The equivalent pairs are sorted into equivalence classes, and one label is picked to represent each class (see Figure 3-1 (b)).
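The two passes described above can be sketched as follows. The sketch resolves the noted equivalences with a small union-find table, which is one common way to sort equivalent pairs into classes and is an assumption here rather than the method stated in the text. The smallest label of each class is kept as its representative, so the final labels need not be consecutive. The function name is hypothetical.

```python
def label_components(image):
    """Label the 8-connected black components of a 0/1 image with a row-by-row scan."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[k] points toward the representative of label k; index 0 unused

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]   # path halving
            k = parent[k]
        return k

    # First pass: assign provisional labels and note equivalences.
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 0:
                continue
            seen = []
            # Already visited neighbors: left, upper-left, upper, upper-right.
            for dr, dc in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr and 0 <= cc < cols and labels[rr][cc]:
                    seen.append(labels[rr][cc])
            if not seen:
                parent.append(len(parent))          # next label not yet assigned
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = min(seen)
                for k in seen:
                    a, b = find(k), find(labels[r][c])
                    if a != b:
                        parent[max(a, b)] = min(a, b)   # note the equivalence

    # Second pass: replace every label by the representative of its class.
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                labels[r][c] = find(labels[r][c])
    return labels
```

In a U-shaped component the two arms first receive different labels and are merged only when the scan reaches the row that joins them, which is the situation the equivalence classes are meant to handle.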



3.4 The Unconstrained Block Classification Rule

Some of the simplest patterns that people can recognize without difficulty are very hard for a computer to detect. A human can classify the blocks in a document instantaneously even when a text block consists of letters that are unknown to the reader. Without recognizing each character, a human being is able to classify the blocks of text in documents if he or she is educated enough to figure out what most characters look like. A few research works describe the block classification of mixed text/graphics images under the assumption that the documents are scanned without skew. Since in most cases the document image is segmented with skew, the classification rule should be able to classify the blocks regardless of their shape, because the shape of the blocks changes when the document is scanned with skew.

The features of each type of block need to be scrutinized in order to generate the classification rule. In the human visual mechanism, seeing is known to involve processing an enormous amount of data. Part of the shock of making a deep analysis of the vision process comes from the realization of how much information the human brain processes in the act of seeing. The brain keeps a temporary record of the sensory input during perception [Sow84]. A visual icon is stored in the brain for just a fraction of a second. When the brain receives a new sensory icon, it must search its stock of percepts to find ones that match parts of the icon. The cerebral cortex stores the percepts, but other parts of the brain may control the actual searching and comparing. The brain also has an associative mechanism which retrieves the pattern that matches best, whereas an ordinary computer retrieves data by an address in storage. Perception finds percepts that match the overall pattern of an icon before it fills in percepts for the detail. It is therefore impossible to simulate the human visual mechanism completely using current computer technology. Some features for the block classification will be extracted in a simple way in order to reduce the enormous computation time of visual processing.

Figure 3-1. The algorithm for component labelling. (a) Labels assigned during the row-by-row scan. (b) Labels after the equivalence classes are resolved.



3.4.1 The Ratio of the Number of Black Pixels and Black-White Transitions

Several measurements, such as the total number of black pixels in the original image of the block and the number of horizontal black-white transitions in the original image block, are considered in order to distinguish text blocks from graphics image blocks. The ratio R of the number of black-white transitions to the total number of black pixels in a block, expressed as a percentage, can serve as a feature for block classification. Table 3-2 shows the processing results of documents containing text and graphics images when the documents are scanned at 240 dots per inch (dpi). The processing results of a document scanned at 100 dpi are shown in Table 3-3.

From Tables 3-2 and 3-3, the ratio between the total number of black pixels and black-white transitions shows different ranges of values for different types of blocks. The ordinary text blocks, and half of the graphics and halftone images in Table 3-2, have a ratio of around 30% when the documents are scanned at 240 dpi, while a text block with bold-faced letters and bigger type has a smaller ratio than an ordinary text block. As expected, the ratios are higher when the documents are scanned at a lower resolution: most text blocks in Table 3-3 (100 dpi) fall between about 46% and 63%.
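As a rough illustration of how the ratio in Tables 3-2 and 3-3 can be obtained from a binary block, consider the sketch below. The function name `transition_ratio` and the list-of-lists block format are assumptions for the example, and so is the counting convention: a transition is counted wherever two horizontally adjacent pixels differ, which may not match the exact convention used to produce the tables.

```python
def transition_ratio(block):
    """Return (black_pixels, transitions, ratio_percent) for a binary block.

    block is a list of rows of 0/1 values, with 1 meaning a black pixel.
    """
    black = sum(sum(row) for row in block)
    # Assumed convention: one transition per change of value along a row.
    transitions = sum(
        1
        for row in block
        for left, right in zip(row, row[1:])
        if left != right
    )
    ratio = 100.0 * transitions / black if black else 0.0
    return black, transitions, ratio
```

With the counts of the first row of Table 3-3 (517 black pixels, 182 transitions), the same formula gives 100 x 182 / 517, or about 35.20%.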












Table 3-2. Processing results of the ratio between the number
of black pixels and black-white transitions.


# of black pixels B/W Transitions Ratio (%) Class

2139 648 30.29 text
3417 1012 29.61 text
3580 1040 29.05 text
4112 1183 28.77 text
1780 528 29.66 text
3544 1038 29.29 text
2911 666 22.88 text
4046 1205 29.78 text
3427 968 28.25 text
4751 735 15.47 text
3406 996 29.24 text
434 123 28.34 text
770 226 29.35 text
2469 580 23.49 text
3310 989 29.88 text
4436 1155 26.04 text
2251 758 33.67 text
4167 1256 30.14 text
3800 1183 31.13 text
1142 396 34.68 text
5394 72 1.33 H. black lines
5608 170 3.03 H. black lines
47742 8738 18.30 graphics, halftone
175114 51527 29.42 graphics, halftone
195328 59906 30.67 graphics, halftone
63103 12598 19.96 graphics, halftone

(scanned at 240 dpi)












Table 3-3. Processing results of the ratio between the number
of black pixels and black-white transitions.


# of black pixels B/W Transitions Ratio (%) Class

517 182 35.20 text
366 209 57.10 text
412 207 50.24 text
375 227 60.53 text
408 232 56.86 text
352 209 59.37 text
383 230 60.05 text
416 206 49.52 text
390 217 55.64 text
449 232 51.67 text
482 231 47.93 text
469 217 46.27 text
471 217 46.07 text
424 215 50.71 text
400 220 55.00 text
454 231 50.88 text
320 190 59.38 text
5818 304 5.23 trademark
387 239 61.76 text
372 234 62.90 text
466 286 61.37 text
2136 1036 48.50 text
482 1219 39.54 line drawing
66 33 50.00 text
428 225 52.57 text
454 221 48.46 text

(scanned at 100 dpi)










3.4.2 Removal of Lines

The removal of line segments relies on being able to detect them. Various methods for detecting line segments, such as tracking the medial line of thinned images and finding the longest vector, have been reported in the literature [Hil69]. However, these bottom-up approaches are considered too time consuming. Removing line segments as a top-down process can reduce the processing time. Global operators are applied to each logical 1, which represents a black pixel, and may result in the removal of the line segments called pure line segments. A pure line segment is a line segment which does not represent any other element but only itself. In order to remove the line segments, we apply the operator which is defined as follows:

Definition 6. Let the boundary black pixel (BBP[x,y]) be the black pixel at location (x, y) with up to 7 black pixel neighbors, that is, with at least one white pixel among its eight neighbors. The operator $B_1(p)$ eliminates the BBP from the image data.

The removal of line segments is described as two different schemes according to the length of the line segment. All the line segments, including characters, can be removed by applying the operator $B_1$, while pure line segments can be removed as follows. The linear expansion operator OP1 is applied to each black pixel until nearby line segments conglomerate. The operator $B_1$, which removes the boundary pixels, is then applied to each black pixel until the line segments that have not conglomerated disappear.
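A minimal sketch of one pass of a boundary-pixel removal operator in the spirit of Definition 6 is given below. The function name `remove_boundary_pixels` and the list-of-lists image format are assumptions for illustration, as is the treatment of pixels outside the image as white.

```python
def remove_boundary_pixels(image):
    """One pass of a B1-style operator on a binary image (rows of 0/1 values).

    A black pixel survives only if all eight of its neighbours are black;
    every black pixel with up to 7 black neighbours is eliminated.
    """
    rows, cols = len(image), len(image[0])

    def is_black(y, x):
        # Assumption: pixels outside the image count as white.
        return 0 <= y < rows and 0 <= x < cols and image[y][x] == 1

    result = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            if image[y][x] != 1:
                continue
            black_neighbours = sum(
                is_black(y + dy, x + dx)
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)
            )
            if black_neighbours == 8:
                result[y][x] = 1
    return result
```

A line that is one pixel wide vanishes after a single pass, whereas a solid 5 x 5 blob shrinks to 3 x 3 and needs three passes to disappear, which is the behaviour the classification procedure relies on to tell line segments from blobs.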









3.4.3 The Procedures of Block Classification

Since the document image is digitized with skew in most cases, a classification rule which is invariant to skew is desirable for the classification of the blocks. The procedure for block classification of documents can be informally described as follows.

1. Estimate the total number of black pixels.

2. Count the black-white transitions (B/W TC) of the block.

3. Estimate the ratio (R) between the total number of black pixels and the black-white transitions.

4. Remove all the line segments and measure the total number of black pixels again. If the value is almost the same as before, the block is a symbol; otherwise it is a picture with a blob image.

5. When the documents are scanned at 100 dpi (S = 100), both text and some of the complex line drawings satisfy the following:

$\sum_i \sum_j B_1[B_1[D[i,j]]] = 0$ and $20 \times (240/S)^{1/2} < R < 40 \times (240/S)^{1/2}$,

where S is the resolution of the scanned document in dots per inch.

6. If $\sum_i \sum_j B_1[B_1[D[i,j]]] \neq 0$ and the B/W TC is greater than half the number from step 2, then estimate the number of black pixels after $\sum_i \sum_j B_1[B_1[B_1[D[i,j]]]]$.

7. If the number of black pixels equals 0, then the block is text with large characters.

8. Remove the pure line segments from the original image of the block and then measure the B/W transitions. If the value R is the same as the original value, the block is text; otherwise it is a line drawing.

9. If $R < 20 \times (240/S)^{1/2}$ or $R > 40 \times (240/S)^{1/2}$, and $\sum_i \sum_j B_1[B_1[D[i,j]]] = 0$, then the block is a line drawing.
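The resolution-dependent bounds on R used in steps 5 and 9 can be evaluated with a short helper. The sketch below covers only that threshold test, not the whole procedure; the function names `ratio_bounds` and `in_text_ratio_range` are hypothetical.

```python
import math


def ratio_bounds(dpi):
    """Return the lower and upper bounds on R (in percent) for resolution dpi."""
    scale = math.sqrt(240.0 / dpi)
    return 20.0 * scale, 40.0 * scale


def in_text_ratio_range(ratio_percent, dpi):
    """True if R lies strictly between the bounds used in steps 5 and 9."""
    low, high = ratio_bounds(dpi)
    return low < ratio_percent < high
```

At 240 dpi the bounds are 20 and 40; at 100 dpi they widen to roughly 30.98 and 61.97, so a ratio of 50.24 from Table 3-3 lies inside the range while the trademark block with a ratio of 5.23 does not.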



3.5 Experiments

3.5.1 Experimental Images and Facilities

In this dissertation, we have selected page images with text and a few nontext items such as a designed trademark and a circuit line drawing. The designed trademark and the circuit line drawing are clearly separated from the text blocks by wide enough white space. The methods for block segmentation and classification should be robust against poor placement of documents on the scanner. The page images are therefore scanned with various skew angles.

Figure 3-2 illustrates the hardware structure of the SUN Workstation based intelligent text processing system in which the algorithms described in the previous sections are implemented. The EasyScan image scanning system is used as a page reader for the document analysis system. This system (Microtex scanner version) is hooked up to a SUN SPARCstation 1+ workstation, and it contains a Microtex grayscale or color scanner, a scanner driver for Sun SPARCstations, and the EasyScan scanning utility software. DigitalPhoto, used as the scanning utility, combines advanced tools to create, capture, and process grayscale and color images to achieve a variety of visual effects. The EasyScan image scanning system can scan 8-bit grayscale images, 24-bit true color images, and 1-bit line art at up to 400 or 600 dpi. The EasyScan software provides intelligent scanning functions to prescan, adjust brightness and contrast, sharpen, and color correct. The SUN Workstation monitor can display the image on the screen through image display software, and the images displayed on the screen can be printed out through a PostScript laser printer (Apple LaserWriter II).

Figure 3-2. Page image acquisition and processing system (Microtex color/gray scanner, SUN SPARCstation 1+, Apple LaserWriter II PostScript printer).



3.5.2 Experimental Results

Two sample pages for automatic block segmentation and block classification have been selected. The first sample page is composed of text, a trademark, and a circuit line drawing; the second sample page has several text blocks within a relatively small area. These sample pages are shown in Figure 3-3. The segmented blocks of the first page are illustrated in Figure 3-4. The second page, which has several blocks, was selected to show that the algorithm can segment the blocks despite the skew, and it was scanned at several different angles. Figure 3-5 (a) - (d) shows the results of the robust block segmentation approach. The new approach gives satisfactory results despite the skewed document images. It also generates bigger blocks than the run length smoothing algorithm, whose block size is a single text line of the digitized document image. The ratio of the total number of black pixels and black-white transitions has been measured for the second page at each skew angle. As expected, the ratios of each block were affected little by the skew. Table 3-4 shows the experimental results.













Figure 3-3. The sample pages for the experiment: sample page #1 and sample page #2.








Figure 3-4. The segmented block of the first sample page.

Figure 3-5. The processing result of the second page. (a) without skew. (b) skewed by 1.2°. (c) skewed by 6.5°. (d) skewed by 23.0°.


Table 3-4. The ratio between the number of black pixels and
black-white transitions for the document skewed at various angles.


Skew angle Block # of B Pixels B/W Trans. Ratio (%)

0° #1 5856 2161 53.98
0° #2 4710 2656 56.39
0° #3 6338 3361 53.03
0° #4 4392 2463 56.08
1.2° #1 5667 3132 55.27
1.2° #2 4497 2637 58.64
1.2° #3 6050 3367 55.65
1.2° #4 4222 2475 58.62
6.5° #1 5443 3000 55.11
6.5° #2 4520 2596 57.43
6.5° #3 6355 3549 55.85
6.5° #4 4652 2581 55.48
23.0° #1 6065 3179 52.42
23.0° #2 4751 2714 57.12
23.0° #3 6488 3549 54.70
23.0° #4 4448 2537 57.04

(scanned at 100 dpi)









3.5.3 Analysis for the Block Segmentation Approaches

The complexity of an algorithm can be measured by either the time required to execute it on a problem of size n (time complexity) or the memory space required for its execution (space complexity). The main concern in the analysis of most algorithms is the time complexity for relatively large values of n. Like most image processing work, block segmentation of image data in documents consists of operations on pixels, and the number of pixels in a document image is large. The complexity of most previously published segmentation algorithms is $O(n^2)$. Unfortunately, these bounds are not enough to compare those algorithms with the approach presented here. In order to evaluate these algorithms, a detailed time complexity function is required. However, finding the exact complexity function for a general algorithm is almost impossible.

The time complexity function for the RLSA is $O(n^2)$, and it can be estimated as follows. Let $p_1$ be the probability of a black pixel in the document, $p_2$ the probability that black pixels are to be merged, $t_r$ the time required for reading a pixel, $t_m$ the time required for merging two black pixels into a continuous stream of black pixels, and $t_s$ the time required for setting a numeric value to a variable. Then the total time required for the RLSA is estimated as $(4t_r + 2t_{ck1} + 2p_1(t_{ck2} + p_2 t_m + (1 - p_2)t_s) + t_{AND})\,n^2$, where $t_{ck1}$ and $t_{ck2}$ are the times required for checking if-conditions and $t_{AND}$ is the time required for applying a logical AND operation to each pixel in the algorithm.









The time complexity function for the RXYC is also $O(n^2)$. However, it is more dependent upon the image data than the RLSA. Let $p_3$ be the probability of the necessity to read the next pixel, and r be the number of recursions required, and $p_4$ be the probability of the possibility for block segmentation, and t"9 be the time required for finding a segment of a block, and tP. be the time required for checking whether segmentation is possible or not. Then the total time required for the RXYC can be roughly estimated as 2(t~k3 + P3tr + P3t&k4 + Iti tO + (1 - q~)n 2 + 2P4t,n + 2(t~k3 + P3tck4 + rjt,,. + (1 - tl)ts)(n - y)(n - qj) + ... . This can be rewritten as 2(1 + )tk + P3tr + PNtc4 + TtP0S + (1 - I)tQn 2 + ..., where t~k and tck4 are the times required for checking if-conditions in the algorithm. The algorithm using the Hough transform is invariant to skew; however, its time complexity function is $O(n^3)$. This algorithm needs to generate connected components, find the centroid of each connected component, and then apply the Hough transform to each centroid. Let a be the number of lines passing through a common point. Then the roughly estimated time complexity function is $((t_g + t_{cent} + t_{Hough}\,a)\,n)\,n = t_{Hough}\,a'\,n^3 + (t_g + t_{cent})\,n^2$, where $a = a'n$, $t_g$ is the time required for generating connected components, $t_{cent}$ is the time required to find the centroid, and $t_{Hough}$ is the time required for transforming lines passing through a point in Cartesian coordinate space to a point in polar coordinates.

The time complexity function for the proposed approach is also $O(n^2)$ and is dependent upon the operation that connects each black pixel. A relatively exact time complexity can be estimated as $(t_r + t_{ck1} + p_1 t_{OP1})\,n^2$, where $t_{OP1}$ is the time required for execution of the operation which connects pixels within a certain distance. The method using the Hough transform is $O(n^3)$, unlike the other previous approaches. This means that the method using the Hough transform is slower than any of the other algorithms as long as n is large. The other three algorithms, however, cannot be ranked directly from the estimated time complexity because they all have the same $O(n^2)$ complexity. In order to evaluate algorithms with the same big O, we have to estimate the exact coefficient of the highest degree of n. The RLSA is known as a fast algorithm, so we compare the RLSA and the proposed approach. Roughly estimated, $p_1$ is less than 0.1 in the text area, although it is heavily dependent upon the type of image data in the document. Assuming that $p_2$ is around 0.5 for most document image data, the proposed approach can be considered faster than the RLSA as long as $t_{OP1}$ is almost the same as $t_m$. The average-case input may be a good choice for such a comparison, but it is sometimes very hard to measure effectively, and generally it is not clear what an average input is. The worst-case input is very useful in some cases. However, it is also not easy to find the worst input for all the approaches.
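The comparison of leading coefficients can be made concrete with a small calculation. In the sketch below, the function names are hypothetical and every elementary operation is assumed to cost one time unit, which is an illustrative assumption only and not a measurement; the probabilities 0.1 and 0.5 are the rough values quoted in the text.

```python
def rlsa_coefficient(p1, p2, t_r, t_ck1, t_ck2, t_m, t_s, t_and):
    """Leading coefficient of the RLSA time estimate."""
    return (4 * t_r + 2 * t_ck1
            + 2 * p1 * (t_ck2 + p2 * t_m + (1 - p2) * t_s)
            + t_and)


def proposed_coefficient(p1, t_r, t_ck1, t_op1):
    """Leading coefficient of the time estimate for the proposed approach."""
    return t_r + t_ck1 + p1 * t_op1


# Assumed unit costs for all elementary operations (illustrative only).
rlsa = rlsa_coefficient(0.1, 0.5, 1, 1, 1, 1, 1, 1)
proposed = proposed_coefficient(0.1, 1, 1, 1)
```

Under these assumed unit costs the RLSA coefficient is about 7.4 and the coefficient of the proposed approach is about 2.1; the actual values depend on the real operation times and on the image data.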



3.6 Page-Structure Analysis

Page-structure analysis is the process of converting a page representation to an abstract representation. This process, considered a higher level of document understanding, attempts to determine the overall block structure of a page. The abstract representation is the specification of the words and diagrams that make up the printed document and of how the pieces of content are to be fitted together into a whole. The process of converting the abstract representation into a physical representation of the document is called formatting. The physical representation may be oriented to a specific output device. The physical representation of the document is then converted into a page representation, that is, a representation in the format expected by a specific device. Page-structure analysis is, in some sense, the inverse problem of interpreting format-control commands: it attempts to find or understand the control commands that could have been used for laying out the document images.

Document structure varies from one type of document to another. It is not practical, nor easy, to develop a general system to analyze all types of documents automatically. Each document-analysis system has to focus on certain chosen types of document that are most often needed by its application. There is no generally agreed upon ideal model for representing document structure. It should be noted that a human can recognize the structure of a page immediately after it is displayed. For a document-analysis system to cope with general classes of documents, a provision for interactive page structure identification by the user will be extremely useful and powerful. Some document structures will be discussed in the following section.



3.6.1 Office Document Architecture (ODA) and its Structure

The Office Document Architecture provides a hierarchical and object-oriented document model. A document is best thought of as a tree, where the structure is defined by the shape of the tree and the content is stored entirely in the leaves of the tree. An ODA document is described by a logical structure and a layout structure. The logical structure divides and subdivides the document into items that mean something to the human author or reader, while the layout structure divides and subdivides a visible representation of the document into rectangular areas. Logical objects represent general items like titles and paragraphs, and layout objects represent sets of rectangular areas within pages.

The common item to both structures is clearly the content which provides the link between them as shown in Figure 3-6. To illustrate the structures we shall use a simple technical document divided into Parts and Sections. Initially we shall assume that each Part has a title followed by one or more Sections, and that each Section in turn has a subtitle followed by a series of one or more paragraphs.













Figure 3-6. Logical and layout structure.

The logical structure shows that the fragment consists of the Part title and the beginning of the first Section, including the subtitle and paragraphs; the actual layout on pages is shown in Figure 3-7. The layout structure shows that there are four blocks for the first (left) page and two blocks for the second page. Only the leaves of the tree structure, represented as blocks, have contents associated with both structures. The content of a leaf of the logical structure frequently corresponds to the content of a block. This gives the neat one-to-one correspondence between the leaves of the logical and layout structures shown in Figure 3-6. However, when a paragraph is split into two portions that are associated with two separate blocks belonging to two different pages, the one-to-one correspondence between the two structures does not exist. Alternatively, the content portions belonging to several logical objects may be run together into a single layout block.

Figure 3-7. An example of actual layout on pages.
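The way the two structures share content can be sketched with two small trees whose leaves point to the same content portions. The class names `ContentPortion` and `Node`, the helper `leaves`, and the sample document are all hypothetical and serve only to illustrate a paragraph that is split across two pages; they are not part of the ODA standard.

```python
from dataclasses import dataclass, field


@dataclass
class ContentPortion:
    text: str


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)  # subordinate objects
    content: list = field(default_factory=list)   # content portions (leaves only)


def leaves(node):
    """Return the leaf objects of a logical or layout tree, left to right."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


# Content portions; the second paragraph is split over two pages.
title, sub, p1 = ContentPortion("Part title"), ContentPortion("Subtitle"), ContentPortion("First")
p2a, p2b, p3 = ContentPortion("Second (start)"), ContentPortion("Second (rest)"), ContentPortion("Third")

logical = Node("Part", [
    Node("Title", content=[title]),
    Node("Section", [
        Node("Subtitle", content=[sub]),
        Node("Paragraph 1", content=[p1]),
        Node("Paragraph 2", content=[p2a, p2b]),
        Node("Paragraph 3", content=[p3]),
    ]),
])

layout = Node("Page set", [
    Node("Page 1", [Node("Block", content=[c]) for c in (title, sub, p1, p2a)]),
    Node("Page 2", [Node("Block", content=[c]) for c in (p2b, p3)]),
])
```

Here the logical tree has five leaves while the layout tree has six blocks, so the one-to-one correspondence between leaves fails exactly where the second paragraph is split.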



3.6.2 The Structures for Office Document Architecture

The Office Document Architecture consists of two sets of object class descriptions: one for logical objects and one for layout objects. These sets of descriptions define the types and combinations of objects. The qualifiers concerning the occurrence of a subordinate object are optional, required, repetitive, or optional and repetitive. Groups of subordinate objects are combined as a sequence, an aggregate, or a choice. One generic logical structure for a Part in the document is defined as shown in Figure 3-8. Each object is assumed to be required unless shown otherwise, so this indicates that a Part begins with a required title, followed optionally by an author's name, followed by one or more Sections.


Figure 3-8. Generic logical structure.

Each Section begins with a required subtitle and then consists of a mixture of paragraphs, diagrams, and lists. Repetitive and Choice together represent a series of one or more items occurring in any order. Lists consist of one or more list items, while diagrams consist of a picture above the caption or a caption above the picture. A simple specific instance of the page set with a single continuation page is shown in Figure 3-9.
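The occurrence rules of this generic logical structure can be expressed as a small validity check. The dictionary layout and the function names below are hypothetical choices made for the example; they are not part of the ODA standard.

```python
def valid_item(kind, body):
    """Check one item of a Section: a paragraph, a diagram, or a list."""
    if kind == "paragraph":
        return isinstance(body, str)
    if kind == "diagram":
        # Aggregate: a picture and a caption, in either order.
        return sorted(body) == ["caption", "picture"]
    if kind == "list":
        # Repetitive: one or more list items.
        return len(body) >= 1
    return False


def valid_section(section):
    """A Section needs a subtitle and one or more items in any order."""
    items = section.get("items", [])
    return (bool(section.get("subtitle"))
            and len(items) >= 1
            and all(valid_item(kind, body) for kind, body in items))


def valid_part(part):
    """A Part needs a title and one or more Sections; the author is optional."""
    sections = part.get("sections", [])
    return (bool(part.get("title"))
            and len(sections) >= 1
            and all(valid_section(s) for s in sections))
```

A Part with a title and one Section holding a paragraph and a diagram passes the check without an author, whereas a Part with no Sections, or a diagram lacking its caption, is rejected.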




Figure 3-9. Specific instance of "part page set" (Title Page with Title Frame (Header) and Body Frame (Body); Continuation Page with Continuation Body Frame (Body)).



Several different views of a logical ODA document can be obtained by altering the generic layout structure and/or the sets of presentation and layout styles. As a simple instance, deleting the "Body frame" from the "Title page" in Figure 3-9 would cause each Part of the document to be laid out with only the Part title and author's name on the first page. Because there would be no frame on the first page with "Body" as its permitted category, the first Section would have to start in a "Continuation body frame" on a subsequent page.

Altering the attributes that make up the presentation and layout styles can produce more radical changes. Although these attributes refer to logical objects, they are held separately from the main logical structure, which leads to a more concise representation of the document. The layout styles include the important layout object class and layout category attributes. Significant changes to the positioning and ordering of items can be made by changing these attributes. The presentation styles are used to guide the lower-level content layout process and thus affect the appearance of the content within the blocks. They contain different attributes for different content architectures. For character content, for example, they include attributes affecting the font and size of characters, the distance between lines, and the indentation of the first line. Changing both the generic layout structure and the styles can lead to significantly different views of the same logical document: page and margin sizes can vary, single or double column output can be used, and paragraph and font details can be changed.



3.7 Summary

In this dissertation, we mainly discussed the block segmentation and classification stages of the document analysis system. The document image data is segmented into relatively large blocks separated by white space. In most of the previously published works, skew in the document image data not only causes the document to be separated inappropriately but also induces misclassification of the blocks. This chapter described the development and implementation of a new algorithm for automated separation and analysis of text, graphics, and halftone images. The new algorithm for block segmentation connects the components to divide the document into blocks separated by the white space used for page layout. In connecting components, the proposed algorithm applies to each black pixel an operation which generates a linearly expanded contour of each component. This algorithm has a time complexity of $O(n^2)$, like most previous top-down approaches for block segmentation; however, it is invariant to skew and is faster than most previous approaches for most inputs even though they have the same $O(n^2)$ time complexity.

As the very next stage after block segmentation, each block should be classified according to its contents, and the block classification rules must not be restricted to error-free document image data. The proposed method is insensitive to skew, whereas published methods are seriously impaired by a skew of only a few degrees. Unlike previous classification rules, this algorithm uses characteristics which are insensitive to document skew. Several measurements are considered in developing a block classification rule; for instance, a text block is composed of text lines, and each text line consists of a variety of characters. The basic components of characters are line segments which have uniform width. On applying the proposed pixel operation, the variation in the number of boundary pixels relative to the total number of black image pixels can, in every instance, distinguish the line segments from the blob segments. For the separation of text and complex fine line drawings, the total number of neighbors of each black pixel is closely related to compactness. Features such as the probability of occurrence of black pixels in a block, the number of black pixels after applying consecutive operations, the black-white transitions, and the total number of black neighbors of each black pixel in the original image are used for the new classification.














CHAPTER 4
RECOGNITION OF THE CLASSIFIED BLOCK


4.1 Introduction

Designing a system that not only recognizes the text blocks but also understands the nontext blocks requires advanced analysis and synthesis technologies. These technologies are of two types: technology for text recognition and technology for nontext understanding. Much work has been done on text recognition, known as character recognition, since the late 1950s, and the results fall into several sub-areas. Character recognition is broadly classified as on-line or off-line. On-line character recognition makes use of the order of strokes made by the writer, whereas the off-line case treats the completed characters written on a sheet or on some other material. On-line character recognition deals with a one-dimensional representation of the input, whereas the off-line case involves analysis of a two-dimensional image. In on-line character recognition, the order information, obtained by writing on an electronic bit pad which stores the two-dimensional coordinates of successive points in order, eases the recognition problem compared to off-line character recognition. Only off-line character recognition will be described in this chapter, since our document analysis system deals with paper-based documents.











Designing such a complete system for understanding two-dimensional page images with printed characters and graphics requires integrating work from several problem areas. Selecting an appropriate type of feature is the most important step in designing the system. Two types of features, global [Bal82, Per77] and structural or local [Hua86a, Cox82], are considered as prospective features for developing character recognition systems.



4.2 Recognition of Text

Text recognition, considered as the union of individual character recognitions, requires some preliminary processing, such as word segmentation and character segmentation, before features can be extracted. The block diagram of a typical character reader is shown in Figure 4-1. Each character is read and digitized by an optical scanner, then located and segmented under software control. The resulting matrix is fed into a preprocessor for further processing. As an early stage of text recognition, word segmentation is performed to separate the words. A word is usually a combination of characters lying on a straight text line, and the distance between words on the line is longer than the distance between characters within a word; this textual knowledge provides information for separating the words. Black pixels that are eight-connected to one another are grouped together into eight-connected components. The eight-connected pixels belonging to an individual character or graphic are enclosed in a circumscribing rectangle, so that each rectangle corresponds to a single connected component. Applying eight-connectedness automatically segments the characters in a word, with the exception of "i", "j", and some marks such as ";", which consist of more than one component. However, this method works only for alphabetical characters. A global operator which connects the characters within a certain distance can be used to separate the words.

Figure 4-1. The block diagram of a typical character reader. (Blocks: document, digitizer, location and segmentation, preprocessor with smoothing and noise elimination, feature extraction, matching, identity of character.)
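The grouping of eight-connected black pixels into circumscribing rectangles can be sketched as follows. This is a minimal illustration, not the implementation used in this work; the function name `connected_components`, the list-of-lists bitmap, and the toy page are assumptions made for the example.

```python
from collections import deque

def connected_components(bitmap):
    """Group 8-connected black pixels (1s) and return their bounding boxes.

    Each box is (top, left, bottom, right), inclusive, in row/column coordinates.
    """
    rows, cols = len(bitmap), len(bitmap[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over the eight neighbours.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# Toy page (hypothetical): a diagonal stroke on the left, a dotted stem on the right.
page = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1],
]
boxes = connected_components(page)
```

In this toy page the diagonal stroke forms one component because diagonal neighbours count as connected, while the dot and the stem on the right come out as two separate rectangles, which is the behaviour noted above for "i" and "j".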

Closely spaced printed characters can be segmented by combining segmentation with classification by means of an adaptive decision tree [Cas82]. The classifier views the pattern array to be resolved through a window, and a supervisory routine controls the window's width and location. The window is initially set to the full width of the pattern so that, if the array contains a single character, the classifier can recognize it in one step. When the classifier rejects the pattern, which indicates that the full array does not belong to the alphabet, the window is narrowed from the right-hand side and the classifier is applied to the truncated array. This process is repeated until either the truncated array is successfully recognized or the window becomes so narrow that the search is given up. When the search fails, window narrowing is attempted in both directions. The segmentation terminates successfully if the residual array after a positive classification is either null or narrower than any character in the alphabet.
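The control flow of the window-narrowing search can be sketched as below. This is a simplified illustration under stated assumptions: `classify` is a hypothetical stand-in for the decision-tree classifier of [Cas82], the pattern array is modelled as a list of columns, and only narrowing from the right is shown (the two-directional retry after a failed search is omitted).

```python
def segment_by_narrowing(columns, classify, min_width):
    """Split a run of touching characters by shrinking a window from the right.

    columns   -- the pattern array as a list of pixel columns
    classify  -- hypothetical classifier: returns a label, or None to reject
    min_width -- width of the narrowest character in the alphabet
    Returns the list of labels, or None when the search is given up.
    """
    labels = []
    start = 0
    # Stop once the residual array is null or narrower than any character.
    while len(columns) - start >= min_width:
        # Begin with the full remaining width, then narrow from the right.
        for end in range(len(columns), start + min_width - 1, -1):
            label = classify(columns[start:end])
            if label is not None:
                labels.append(label)
                start = end
                break
        else:
            return None  # the window became too narrow without a match
    return labels

# Toy alphabet (hypothetical): each character is a fixed tuple of column codes.
templates = {("x", "y"): "A", ("z", "z", "z"): "B"}
toy_classify = lambda cols: templates.get(tuple(cols))
result = segment_by_narrowing(["x", "y", "z", "z", "z"], toy_classify, min_width=2)
```

With the toy alphabet, the full five-column array is rejected, the window narrows until the first two columns are accepted as "A", and the residual three columns are then accepted as "B" in one step.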

In the following section, only the recognition of machine-printed characters in Latin fonts is described. The current objective of the document analysis system is mainly to read machine-printed characters, although the techniques to be described are also capable of recognizing handwritten and script characters with accuracy and speed. Recognition of East Asian characters such as Chinese, Korean, and Japanese is not considered because these scripts involve large alphabets.



4.2.1 The Selection of Feature in Character Recognition

The selection of features is the most important step in making a machine read as a human does. Two types of features, global and structural (or local), are customarily used for automatic character recognition. Techniques such as (1) template matching and (2) various mathematical transformations treat the character matrix as a whole and select global features from it. Structural or local features, on the other hand, are based on geometrical and topological properties of the characters; they include interesting points and subpieces. Of the approaches with global features, template matching is a well-known pattern matching process [Bal82]. This technique simply measures the similarity between the input character and the stored references by matching points in the frame. A conventional template matcher calculates the similarity between a pair of patterns by using Exclusive OR to count the picture elements (pixels) at which the two patterns differ. The Exclusive OR error is defined as $E = \sum_{x} \sum_{y} A(x,y) \oplus B(x,y)$, where $A(x,y)$ and $B(x,y)$ represent the picture elements at location $(x,y)$ and $\oplus$ denotes logical Exclusive OR.

A major shortcoming of the conventional template matcher above is that it treats all errors alike regardless of where they occur spatially. In Figure 4-2, pattern A and pattern B are different characters, while pattern A and pattern B in Figure 4-3 are the same character. For recognition to succeed, the Exclusive OR error in Figure 4-3 should be less than that in Figure 4-2. However, the Exclusive OR count for the same character (38) is greater than for the different characters (26). To overcome this drawback, a weighted Exclusive OR error can be utilized. In the example, we used a 3 x 3 window to obtain the weighted Exclusive OR count; the weighted count for the same character (106) is less than for the different characters (144). The remaining drawbacks of template matching are its high dimensionality and its sensitivity to translation, rotation, and scaling. The high dimensionality of the character feature vectors requires large storage and long computation time.
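The difference between the plain and the weighted Exclusive OR count can be illustrated with a short sketch. The text states only that a 3 x 3 window is used; the particular weighting below (each mismatched pixel is weighted by the number of mismatched pixels inside its 3 x 3 window, itself included) is an assumption made for illustration, as are the function names and toy patterns.

```python
def xor_error(a, b):
    """Count the pixels at which two equally sized binary patterns differ."""
    return sum(pa ^ pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))

def weighted_xor_error(a, b):
    """Weight each mismatched pixel by the mismatches inside its 3 x 3 window.

    Assumption: the weight is the number of mismatched pixels in the window,
    the centre included, so clustered errors cost more than scattered ones.
    """
    rows, cols = len(a), len(a[0])
    diff = [[a[r][c] ^ b[r][c] for c in range(cols)] for r in range(rows)]
    total = 0
    for r in range(rows):
        for c in range(cols):
            if diff[r][c]:
                total += sum(
                    diff[y][x]
                    for y in range(max(0, r - 1), min(rows, r + 2))
                    for x in range(max(0, c - 1), min(cols, c + 2))
                )
    return total

# Toy patterns (hypothetical): both inputs differ from the reference in 4 pixels.
reference = [[0] * 5 for _ in range(5)]
scattered = [[0] * 5 for _ in range(5)]
clustered = [[0] * 5 for _ in range(5)]
for r, c in [(0, 0), (0, 4), (4, 0), (4, 4)]:
    scattered[r][c] = 1          # isolated noise pixels
for r, c in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    clustered[r][c] = 1          # one solid 2 x 2 block of errors

plain = (xor_error(reference, scattered), xor_error(reference, clustered))
weighted = (weighted_xor_error(reference, scattered),
            weighted_xor_error(reference, clustered))
```

Both inputs have the same plain count, so the conventional matcher cannot tell them apart, while under the assumed weighting the clustered error scores four times higher than the scattered one. This is the kind of spatial sensitivity that the weighted count is meant to provide.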

Several orthogonal transformations have been explored as possible feature extractors in order to reduce the high dimensionality of template matching. The Fourier descriptor, one of the transformational approaches, has been proposed both to reduce this dimensionality and to extract features invariant to global deformation [Per77]. Other transformations, such as Walsh [And71] and Haar and Hadamard [Wen78], have been explored as well. Zahn and Roskies applied the Fourier series to describe closed plane curves [Zah72]. The basic idea of Fourier descriptors is that a closed curve can be represented by a periodic function of a continuous parameter or, alternatively, by the set of Fourier coefficients of this function. The coefficients in this set are referred to as Fourier descriptors.


















Figure 4-2. The template matcher for different characters. (Pattern A and pattern B: #(A XOR B) = 26; weighted #(A XOR B) = 144.)

Figure 4-3. The template matcher for the same character. (Pattern A and pattern B: #(A XOR B) = 38; weighted #(A XOR B) = 106.)

In order to use these descriptors for pattern classification applications, the curve representation must be normalized with respect to a desired transformation class.

The Fourier descriptors are defined as follows. The function $\theta(l)$ is the angular direction of the closed curve $\gamma$ at arc length $l$, and $L$ is the total length of the curve. The function $\phi(l)$ is the net amount of angular bend between the starting point $l = 0$ and the point $l$, so that $\phi(l) = \theta(l) - \theta(0)$ except for possible multiples of $2\pi$, and $\phi(L) = -2\pi$. Because $\phi(l)$ is defined on $[0, L]$, it contains absolute size information; we therefore normalize it to a periodic function of $t \in [0, 2\pi]$. The function $\phi$ for the character H is shown in Figure 4-4(a). Figure 4-4(b) shows the normalized function $\phi^*(t)$, defined as $\phi^*(t) = \phi(Lt/2\pi) + t$.

The normalized function $\phi^*(t)$ is periodic, so we can expand it in a Fourier series as follows.

$$\phi^*(t) = \mu_0 + \sum_{k=1}^{\infty} A_k \cos(kt - \alpha_k) \qquad (4.1)$$

Then the set $\{A_k, \alpha_k;\ k = 1, 2, \ldots, \infty\}$ are the Fourier descriptors for curve $\gamma$. The main drawbacks of global techniques are their dependence on position alignment and their high sensitivity to distortion and style variation.
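As a hedged illustration of these definitions, the sketch below computes the first few descriptors for a closed polygon. It assumes a polygonal approximation of the curve traversed clockwise, so that $\phi$ is a step function that jumps by the bend angle at each vertex and the bends sum to $-2\pi$; substituting that step function into the Fourier integrals gives the sums used in the code. The function name and the test polygon are assumptions made for the example.

```python
import math

def fourier_descriptors(vertices, count):
    """Return [(A_k, alpha_k)] for k = 1..count for a closed polygon.

    vertices -- (x, y) points in traversal order; clockwise gives phi(L) = -2*pi
    """
    m = len(vertices)
    # Displacement of every edge V_i -> V_(i+1), wrapping around at the end.
    edges = [(vertices[(i + 1) % m][0] - vertices[i][0],
              vertices[(i + 1) % m][1] - vertices[i][1]) for i in range(m)]
    lengths = [math.hypot(dx, dy) for dx, dy in edges]
    angles = [math.atan2(dy, dx) for dx, dy in edges]
    total = sum(lengths)

    # phi(l) is a step function: it jumps by the bend angle at each vertex.
    bends, positions, travelled = [], [], 0.0
    for i in range(m):
        travelled += lengths[i]
        turn = angles[(i + 1) % m] - angles[i]
        bends.append((turn + math.pi) % (2 * math.pi) - math.pi)  # wrap to [-pi, pi)
        positions.append(2 * math.pi * travelled / total)         # t = 2*pi*l/L

    result = []
    for k in range(1, count + 1):
        # Cosine and sine coefficients of phi*(t) for a clockwise step function.
        a = -sum(d * math.sin(k * t) for d, t in zip(bends, positions)) / (k * math.pi)
        b = sum(d * math.cos(k * t) for d, t in zip(bends, positions)) / (k * math.pi)
        result.append((math.hypot(a, b), math.atan2(b, a)))
    return result

# A unit square traversed clockwise (hypothetical test shape).
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
fd = fourier_descriptors(square, 4)
```

For the square, the only nonzero amplitude among the first four is $A_4 = 0.5$, which reflects its fourfold symmetry. Because only bend angles and relative arc lengths enter the sums, the amplitudes are unchanged when the polygon is translated or scaled.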



4.2.2 Geometrical and Topological Properties in Character Recognition

Figure 4-4. The Fourier descriptor and normalized function. (a) The angular function for the character H; (b) the normalized function.

Geometrical and topological features for character recognition are based on the extraction of features which describe the interesting geometry or topology of the character as a drawing. These features may represent global and local properties of the character. This is by far the most popular technique from the viewpoint of sensitivity and proficiency of implementation. The sensitivity to the deformation of a character image is caused by several of the following factors:

a) font variation -- the use of a different font to represent the same character.
b) rotation -- change in orientation.
c) translation -- movement of the whole character.
d) noise -- which causes disconnected line segments, filled loops, etc.

The proficiency of implementation is evaluated with respect to speed, complexity, independence, and automatic mask making.

The main advantages of geometrical and topological features are their high tolerance to font variation and noise compared to other techniques. As structural or local features, they are applied mainly to the recognition of handwritten characters, which have much style variation. For local features, letters are considered as abstract letters which represent only the main aspects; they do not include the embellishments which physical characters have. The partial abstract representation of a letter, termed a skeleton, contains as its basic elements vertices, spatial orderings between vertices called edges, and relationships between vertices and edges. The vertex, denoted by the symbol shown in Figure 4-5, is taken as primitive, whereas edges specify the spatial ordering which exists between vertices. The relationships between vertices and edges represent the interesting aspects, which are shown in Figure 4-6.

These interesting points are divided into several categories: endpoint, forkpoint, crosspoint, and breakpoint. Each interesting point has its own number of branches; for example, a crosspoint has four branches and a forkpoint has three. The interesting points partition the character into subpieces. The start and end points of each subpiece are found and their categories are recorded. The orientation and the length of each subpiece are also recorded, and the length is normalized by the character length after the whole character has been traced.
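One common way to find such points on a one-pixel-wide skeleton is to count the branches leaving each pixel, that is, the white-to-black transitions met when circling its eight neighbours. The sketch below uses that counting rule as an assumption; it is not necessarily the procedure used in this work, the names are hypothetical, and breakpoints are not handled because their detection is not specified here.

```python
# Eight neighbours listed in circular order: N, NE, E, SE, S, SW, W, NW.
RING = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def branch_count(skeleton, r, c):
    """Number of white-to-black transitions met when circling pixel (r, c)."""
    rows, cols = len(skeleton), len(skeleton[0])
    ring = [
        skeleton[r + dr][c + dc] if 0 <= r + dr < rows and 0 <= c + dc < cols else 0
        for dr, dc in RING
    ]
    # ring[i - 1] wraps around, so the NW -> N step is included.
    return sum(1 for i in range(8) if ring[i - 1] == 0 and ring[i] == 1)

def interesting_points(skeleton):
    """Map each endpoint, forkpoint and crosspoint to its category name."""
    names = {1: "endpoint", 3: "forkpoint", 4: "crosspoint"}
    found = {}
    for r, row in enumerate(skeleton):
        for c, pixel in enumerate(row):
            if pixel == 1:
                branches = branch_count(skeleton, r, c)
                if branches in names:
                    found[(r, c)] = names[branches]
    return found

# A plus-shaped skeleton (hypothetical): four strokes meeting at the centre.
plus = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
points = interesting_points(plus)
```

For the plus shape, the centre is reported as a crosspoint with four branches and the four stroke tips as endpoints; pixels along a stroke have two branches and are not reported. The subpieces described above are then the runs of ordinary pixels between two reported points.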






Figure 4-5. Vertex and edge.

Figure 4-6. The relationship between vertices and edge as an interesting aspect.

4.2.3 Word Recognition by Contextual Information

The performance of a character recognition system cannot be based on single-character recognition alone. Contextual information can increase recognition accuracy, just as humans rely on contextual information in reading text [Ehr75, Dos77]. The input of a contextual word recognition system consists of a string of characters which have been recognized by the character recognizer. The application of context makes it possible to detect or even to correct recognition errors. In a contextual word recognition system, the character recognizer module assigns to each input character 26 numbers showing the confidences that the character has each of the labels from a to z. The confidences are then transformed to probabilities. If a character has two or more labels with nonzero probabilities, it is difficult to determine which label is the correct one. A string of characters delimited by spaces constitutes a word. Since each character in a word may have a set of alternative labels, the output of the character recognizer is actually a sequence of sets called substitution sets, each of which contains the alternatives with nonzero probability for a particular character. All possible words are obtained by selecting one character from each of the substitution sets, and only one of these words is the correct word. The problem in contextual word recognition is to determine the combination of labels $\theta_1, \theta_2, \ldots, \theta_n$ that maximizes the a posteriori probability $p(\theta_1 \theta_2 \cdots \theta_n \mid X_1 X_2 \cdots X_n)$ for a word of length $n$, where the word $X_1 X_2 \cdots X_n$ is the output of the character recognizer. The probability that the true label of character $X_i$ is $\theta_i$ is expressed as $p(\theta_i \mid X_i)$.
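A minimal sketch of selecting a word from the substitution sets is given below. It rests on two simplifying assumptions that are not taken from the text: the characters are treated as independent, so the score of a candidate word is the product of its per-character probabilities, and the context is a plain lexicon whose words are equally likely a priori. The function name, the probabilities, and the lexicon are hypothetical.

```python
from itertools import product

def best_word(substitution_sets, lexicon):
    """Pick the lexicon word with the largest product of label probabilities.

    substitution_sets -- one dict per character position: {label: probability}
    lexicon           -- set of admissible words (the contextual knowledge)
    """
    best, best_score = None, 0.0
    # Enumerate every word formed by taking one label from each set.
    for labels in product(*substitution_sets):
        word = "".join(labels)
        if word not in lexicon:
            continue
        score = 1.0
        for position, label in enumerate(labels):
            score *= substitution_sets[position][label]
        if score > best_score:
            best, best_score = word, score
    return best, best_score

# Hypothetical recognizer output for a three-letter word.
sets = [{"c": 0.4, "e": 0.6}, {"a": 0.9, "o": 0.1}, {"t": 0.5, "l": 0.5}]
word, score = best_word(sets, lexicon={"cat", "cot"})
```

Taken character by character, the recognizer prefers "e" in the first position, but no word beginning with "e" is in the lexicon, so the context selects "cat". Exhaustive enumeration grows with the product of the set sizes and is only practical for short words and small substitution sets.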









4.3 Recognition of Nontext

Unlike text recognition, the understanding of nontext is performed for a limited number of nontext items such as prestored labels and symbols. The difficulty of perfectly understanding graphics images comes from the limitless variety of graphics images that can be created, in addition to their complexity. Even though designers try to draw unique labels, many labels are similar to one another. Thus, perfect understanding of graphics images has appeared to be impossible; nevertheless, some efforts have been made toward it. An effort to understand geometrical configurations by computer is reported in [Tou80]. That work introduces a novel algorithmic approach to automatic understanding of geometrical configurations, and it describes a fundamental problem in automatic picture understanding as designing a computer system to analyze, interpret, and describe such configurations.

In designing a system for understanding graphics images, a conventional vision system simulates human vision under the assumption that it can identify objects seen previously. Human vision, however, can identify pictures without learned convention; this was demonstrated in a study by Hochberg and Brooks [Hoc62]. They raised their son without allowing him to see any pictures, including advertisements, food containers, and billboards, for his first two years. Nevertheless, the boy, then two years old, had no difficulty identifying pictures such as simple line drawings of shoes and other familiar objects, even complicated ones. Thus, the human capability for perception and recognition is not a matter of learned convention. We are going to design the recognition system based on the conventional design methodology, which supposes that humans begin with careful observation and then interpret what they have memorized in order to understand. The conventional recognition system requires a learning stage that includes tasks such as image processing and picture understanding. The image processing task is concerned with the determination of nontext type in addition to the transmission of graphics images. From the viewpoint of preprocessing for understanding graphics images, nontext is classified into two types. One is the line type, which is composed of line components and whose skeleton preserves most of its properties. The other is the blob type, whose properties mostly reside on the boundary.



4.3.1 Determination of Nontext Type

Of the image processing tasks, thinning is one of the most important; it is used frequently to simplify image data so that features can be extracted more easily. Thinning should be applied only to graphics images whose features will not be lost by it. In other words, thinning is applied to simplify an image by reducing it to its skeleton without destroying its geometrical shape and connectivity. Thus it is necessary to discriminate the type of the graphics image before the thinning process is applied to it, dividing graphics images into those to which thinning is applicable and those to which it is not. The procedure for deciding the graphics image type can be informally described as follows.









(1) Apply operator B1 to the image block at every location until the number of black pixels reaches zero; then keep the number of iterations (Countiteration = number of iterations).
(2) Count the number of black pixels after applying B1 Countiteration - 1 times, set n = 1, and find the weights for the pixels at each location by using 3 x 3 windows.
(3) If the number of black pixels resulting from (2) is greater than a predefined value and there exist half the predefined value of pixels with weight less than four, then the graphics image block is line type.
(4) Otherwise, use 3 x 3 windows to estimate the weighted black pixel value after applying operator B1 Countiteration - (n + 1) times. Find the line component by checking the number of pixels with weight three. If the count exceeds a certain value which makes the graphics image look like line type, then classify it as line type.
(5) If there is no line component by 3 x 3 windows in step (4), then count the total number of black pixels. If the total number of black pixels is bigger than 2 x 8 times the number of black pixels after applying B1 Countiteration - n times, then it is line type.
(6) Set n = n + 2, then repeat steps (4) and (5) until n reaches Countiteration - 1.
(7) If n reaches Countiteration - 1 without satisfying the condition of either (4) or (5), then it is blob type.
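Steps (1) and (2) of the procedure can be sketched as follows. The operator B1 is not defined in this section, so the sketch assumes that one application of B1 deletes every black pixel that has at least one white pixel (or the block border) among its eight neighbours; the function names and test shapes are likewise hypothetical. The threshold tests of steps (3) to (7) are not implemented because their predefined values are not given.

```python
def peel(bitmap):
    """One pass of the assumed B1: keep a black pixel only if all 8 neighbours are black."""
    rows, cols = len(bitmap), len(bitmap[0])

    def black(r, c):
        return 0 <= r < rows and 0 <= c < cols and bitmap[r][c] == 1

    return [
        [1 if all(black(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)) else 0
         for c in range(cols)]
        for r in range(rows)
    ]

def peel_profile(bitmap):
    """Return the black-pixel count before peeling and after each pass, down to zero."""
    counts = [sum(map(sum, bitmap))]
    while counts[-1] > 0:
        bitmap = peel(bitmap)
        counts.append(sum(map(sum, bitmap)))
    return counts

# Hypothetical shapes: a stroke three pixels thick and a solid square.
bar = [[1] * 7 for _ in range(3)]
blob = [[1] * 7 for _ in range(7)]
bar_counts = peel_profile(bar)
blob_counts = peel_profile(blob)
```

The number of passes, Countiteration in the procedure, is the length of the returned list minus one, and the count needed in step (2) is the second-to-last entry. Here the bar empties after two passes and the square after four, which mirrors the contrast between line type and blob type blocks.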

The examples in Figures 4-7 and 4-8 show characters represented as bit-maps. Panels (c) and (d) in Figures 4-7 and 4-8 show the number of black pixels after applying operator B1 once. Applying B1 twice reduces the number of black pixels to zero. The numbers of black pixels in (c) and (d) of Figure 4-7 are relatively large; thus, these characters are classified as line type by step (3) of the procedure. Similarly, the characters in Figure 4-8 are classified as line type by step (5) of the procedure.



4.3.2 Classification of Line Drawing

The computer understanding of line drawings has been explored for industrial applications such as CAD/design automation techniques for electronic circuit diagrams [Tou87] and the interpretation of mechanical designs. The primary components of most line drawings are lines. These lines have a special meaning in certain line drawings but not in others. The former are referred to as line context sensitive line drawings, and the latter as line context free line drawings. Thus, before any understanding stage is applied to a line drawing, it must be classified as either line context sensitive or line context free.



4.3.2.1 The classification algorithm for line context free line drawings

Figure 4-7. The bit-maps of characters composed of line components. (a) Pattern A, number of black pixels = 140; (b) Pattern B, number of black pixels = 150; (c) B1(A), number of black pixels = 40; (d) B1(B), number of black pixels = 38.

Figure 4-8. The bit-maps of different characters composed of line segments. (a) Pattern A, number of black pixels = 74; (b) Pattern B; (c) B1(A), number of black pixels = 3; (d) B1(B).

The primary feature of line drawings, whether drawn by hand or by CAD and often seen in pages along with various picture images, is the line. Although line drawings are composed of line components, the line itself is not the feature used to classify line context free line drawings, which usually do not include three-dimensional information. The most prominent feature for differentiating among line context free line drawings is the components they contain. The components in a line drawing are usually connected by lines, for example, the lines in an electronic circuit diagram. Chemical line drawings, especially those of organic chemistry, include English characters such as C, O, and H. Meanwhile, the most prominent features in mechanical and architectural line drawings are lines with three-dimensional information. In a line context free line drawing, the components which exist in the drawing provide the evidence for classifying it. The procedure for extracting this evidence from the line drawing is informally described as follows.

(1) Set the number of iterations to 1 (Count = 1), and estimate the width of the line (Width), which is represented as a number of pixels in the bit-map.
(2) Apply the operator OP1 to every black pixel of the line drawing to be classified.
(3) If there exists a white pixel with black neighbors in the directions either (0,4) or (2,6), then set Count = Count + 1 and repeat step (2).
(4) Otherwise, apply the operator B1 to every black pixel Count + (Width - 2) times.
(5) Apply the operator OP1 (Width - 2) times to every black pixel.
(6) Extract the components by combining the result of (5) with the original line drawing image by a logical AND at each pixel location.



4.3.2.2 The attempt for classification of line drawings with line context sensitive lines

The classification of line context sensitive line drawings is directly connected to 3-D object recognition. The understanding of line drawings has been pursued from various angles in order to recognize 3-D objects. Huffman and Clowes demonstrated labelling techniques to interpret line drawings [Huf71, Clo71]. In an attempt to capture quantitative aspects of shape and to handle arbitrary polyhedra, Mackworth [Mac85] utilized the concept of gradient space. However, this approach could not guarantee realizability; that is, there may not exist a polyhedral scene corresponding to a labelling. Other researchers have also studied the interpretation of line drawings. Nevertheless, there does not exist a rigorous approach to understanding line drawings, or even to discriminating line drawings of 3-D objects from simple 2-D line drawings. One difficulty in understanding line drawings is that the image data do not contain 3-D information.

Despite this fact, humans can interpret line drawings with little difficulty. As a very first step toward interpreting line drawings, volumetric primitives will be used to discriminate line drawings of 3-D objects from simple 2-D line drawings. Lines in machine drawings with line context sensitive lines are classified into object lines and interpretation lines. The majority of the object lines describe the object's visible contour using a solid thick font, hidden lines, axis of symmetry lines, or cross-sectioned planes. Meanwhile, the interpretation lines provide the object's precise geometric description along with other information necessary for producing the object, and are classified into dimension lines and auxiliary lines. These are further classified into manufacturing lines and logistic lines, such as the title and frame of the drawing, part list, and part numbers. These lines represent either 2-D or 3-D objects. Thus, dimensioning plays




Full Text

AUTOMATED SEGMENTATION OF PRINTED DOCUMENTS FOR COMPUTER UNDERSTANDING OF STRUCTURES AND CONTENTS

By

HOKYUNG KIM

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

1992

ACKNOWLEDGEMENTS

The author is deeply grateful to his advisor and supervisory committee chairman, Dr. Julius T. Tou, for his invaluable inspiration, encouragement, and advice. Without his guidance and support, this work could never have been done. He is also greatly indebted to all the other members of the supervisory committee, Professor John Staudhammer, Professor Leon W. Couch, Professor Herman Lam, and Professor Rick Smith, for their suggestions and advice regarding this dissertation. He would like to thank sincerely all the members and good friends of the Center for Information Research for their helpful discussions and patience. The financial support of the Center for Information Research is gratefully acknowledged.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT

CHAPTERS

1 INTRODUCTION
  1.1 Statement of the Problem
  1.2 Objectives
  1.3 Approaches
  1.4 Preview of Remaining Chapters

2 BACKGROUND AND REVIEW OF PREVIOUS WORK
  2.1 Preprocessing
    2.1.1 Binary Representation by Image Thresholding
    2.1.2 Thinning and Boundarization for Object Description
  2.2 Review of Previous Work
    2.2.1 Block Segmentation Rule for Page Images
    2.2.2 Classification Rule for the Segmented Block
    2.2.3 Line-to-Point Transformation
    2.2.4 Recognizing Textual Blocks Using the Hough Transform
    2.2.5 Conclusions from Previous Work

3 MACHINE UNDERSTANDING OF STRUCTURES IN DOCUMENTS
  3.1 Introduction
  3.2 The Block Segmentation of Digitized Documents
  3.3 Labelling of Segmented Blocks
  3.4 The Unconstrained Block Classification Rule
    3.4.1 The Ratio of the Number of Black Pixels and Black-White Transitions
    3.4.2 Removal of Lines
    3.4.3 The Procedures of Block Classification
  3.5 Experiments
    3.5.1 Experimental Images and Facilities
    3.5.2 Experimental Results
    3.5.3 Analyses for the Block Segmentation Approaches
  3.6 Page-Structure Analysis
    3.6.1 Office Document Architecture (ODA) and its Structure
    3.6.2 The Structure for Office Document Architecture
  3.7 Summary

4 RECOGNITION OF THE CLASSIFIED BLOCK
  4.1 Introduction
  4.2 Recognition of Text
    4.2.1 The Selection of Feature in Character Recognition
    4.2.2 Geometrical and Topological Properties in Character Recognition
    4.2.3 Word Recognition of Contextual Information
  4.3 Recognition of Non-Text
    4.3.1 Determination of Non-Text Type
    4.3.2 Classification of Line Drawing
    4.3.3 Recognition of Line Drawings
    4.3.4 The Symbol Matching Process by Transformation to the Graph Model
  4.4 Summary

5 DOCUMENT FILING AND RETRIEVAL
  5.1 Introduction
  5.2 System Resources
    5.2.1 Document Contents
    5.2.2 Structure of Computerized Documents
    5.2.3 Internal and External Representation of Documents
    5.2.4 Information Extraction in Documents
  5.3 System Facilities
    5.3.1 Editing
    5.3.2 Formatting
    5.3.3 Retrieving
  5.4 Summary

6 CONCLUSION
  6.1 Discussion
  6.2 Contributions

APPENDICES

A ROBUSTNESS OF NEW SEGMENTATION ALGORITHM
B SOME METHODOLOGIES OF CONTEXTUAL WORD RECOGNITION

REFERENCES
BIOGRAPHICAL SKETCH

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

AUTOMATED SEGMENTATION OF PRINTED DOCUMENTS FOR COMPUTER UNDERSTANDING OF STRUCTURES AND CONTENTS

By

Hokyung Kim

August 1992

Chairperson: Dr. Julius T. Tou
Major Department: Electrical Engineering

This dissertation presents a conversion of paper-based documentation to computerized form. The main process for such conversion is considered to be computer understanding of the document image, which is a visual representation of a two-dimensional field consisting of blocks of text, graphics, and picture images. In designing this document analysis system, automatic block segmentation and classification of a digitized document image are necessary stages. For the automatic block segmentation, we have developed a robust approach which connects black pixels within a predetermined distance to separate the blocks. The segmentation procedure is performed as a top-down approach to reduce the processing time. To obtain a block classification algorithm that is insensitive to skew, a block classification rule based on black pixels is considered. This method uses a step-by-step classification approach to avoid an exhaustive classification procedure. The ratio between the black-white transition count and the count of black pixels of each block is used as one of the measurements. This ratio is almost invariant to skew and is consistently high for text blocks. A further pixel-based operator classifies each block in detail. For a system that not only recognizes the text block but also understands a nontext region, we have utilized and integrated advanced technologies developed in various methodologies. In the stage for understanding nontext, we have distinguished some of the symbols from the other picture images. We have divided the symbols into two different categories. Two different image processing techniques, thinning and boundary finding, are applied to line type and blob type symbols respectively in order to extract valuable features. We have used a geometrical feature for line type symbols and applied a weighted graph matching method for identification.

CHAPTER 1
INTRODUCTION

A document analysis system which converts paper-based documentation to computerized form involves a two-dimensional information processing task. Such a system must recognize the characters of a text block and identify nontext regions such as line drawings and pictures. A human understands a page image by applying sufficient knowledge to it. Documents with nontext are becoming more common, and the trend is toward many more documents composed of text, graphics, and pictures. In most documents some pages have pictures, because editors have learned that all-type pages generally have little eye appeal and that pictures help the reader to understand. When no photos or other illustrations pertinent to the article's subject are available, many editors will try to use a picture that is germane to the subject and helps attract attention to the article even if it adds no information of value.

It seems impossible to give the machine the knowledge which humans possess about a page because it is not known how the human eye recognizes an object. Fortunately, a page image has a distinctive geometric layout in printed form. In the composition of a page the first consideration is a well-organized layout with proper columns and margin selections. Graphics and pictures are well separated from the text block and other nontext regions. Thus, such an arrangement of content on the page provides a stepping stone to designing a document analysis system to cope with the knowledge explosion that threatens to bury us in information debris. This system not only copes with the explosion of knowledge but can also help people with visual handicaps. Thus far, the aids available for blind and physically handicapped individuals are the tape-recorded book, the magnifier, and braille. Tape-recorded books are produced by human readers working with a reading service, and these services rely mainly upon volunteers. Developing a system for document content analysis is particularly important for the increasing number of elderly people, many of whom are visually handicapped.

1.1 Statement of the Problem

A computerized form of paper-based documentation provides advantages such as efficient document update and revision; converting paper-based documentation to computerized form, however, requires several steps, including preprocessing steps. The main process of conversion is considered to be computer understanding of document images. The document image is a visual representation of a two-dimensional field consisting of blocks with text only and blocks including nontext. The understanding of text has been developed as an extension of character recognition. Complete understanding of paper-based documentation confronts the difficulty of understanding nontext regions. Many efforts to interpret engineering drawings and diagrams by machine can be found in the literature [Hua86b, Tou87b, Kas90].

The preprocessing stage of the document analysis system requires the conversion of a paper-based document to a digital bit-map representation by optical scanning, followed by automatic segmentation of the page image, which separates the blocks by the spaces between them. Two diametrically opposite philosophies, called top-down and bottom-up, have been proposed for the automatic segmentation task. In the top-down approach, certain global operations are performed on the entire page image, while in the bottom-up approach character components are individually detected and then merged into progressively larger blocks using component properties and interrelationships. Several approaches such as projection profiles or run-length smoothing have been proposed; these segment the image into blocks, each of which can subsequently be classified using pattern classification techniques. Both projection profile and run-length smoothing-based techniques are fast; however, they are too rigid and fail if the documents or their constituent textual blocks are skewed. In other words, if any pixel of the next text line is higher than or at the same height as the lowest pixel of the text line being scanned, the document image will not be segmented as desired for subsequent block classification. Some algorithms have been reported in the literature for text string separation, an early stage in any document analysis system. However, many of these algorithms are very restrictive in the type of documents they can process; others are robust but too computationally intensive to be used without special-purpose hardware [Fle88].

PAGE 11

4

1.2 Objectives

The objectives of this research are, first, to develop a robust and fast algorithm for segmentation of the page and a precise classification rule for the blocks, and second, to attempt an understanding of some nontext regions such as pictures or line drawings. In order to achieve these goals, we divided the work into the following subtasks:

(1) Development of a fast and robust algorithm to segment blocks. To process the recognition and analysis of a page, the page should be segmented into blocks which are separated by spaces or lines. The algorithm should be fast and robust to skew.

(2) Development of an approach for classifying each segmented block. In order to recognize a page with text and nontext, each block should be classified according to what it contains. Nontext should be classified in detail to support document management.

(3) Development of a systematic and efficient matching process. The processing of visual information is typically a multi-layered task. The human brain mechanisms of vision clearly indicate that there is a layered approach to the processing of visual information both physiologically and psychologically.

(4) Design of a system for understanding some graphics and symbols. A page contains some graphics images, especially line drawings, to help readers understand. Since a typical drawing contains both text strings and graphics, recognition of text can be separated from the understanding of the graphics image.

PAGE 12

5

1.3 Approaches

The work in this dissertation uses the document analysis system shown in Figure 1-1. The primary processing step, which is the bulk of this dissertation, is illustrated in Figure 1-2. The design of a machine vision system concerned with document analysis involves several major problem areas. These are (1) the preprocessing problem, known as low-level processing; (2) the image segmentation and classification problem, categorized as a pre-step for intermediate-level processing; and (3) the feature extraction and scene analysis problem, known as intermediate-level processing. This dissertation presents several unique methodologies for some of the problem areas: (1) a block segmentation algorithm which is robust to page skewing, (2) a block classification algorithm invariant to changes of block shape, and (3) a graph-theoretical approach to the recognition of nontext areas such as symbols and/or line drawings. An elegant algorithm is described for block segmentation. This process rapidly connects each component by dividing the document into blocks separated by spaces. In connecting components, the proposed algorithm applies an operation to each black pixel so as to generate a linearly expanded contour of each component. The performance of the algorithm is measured by the time required to execute it on problems of size N. This algorithm has a time complexity of $N^2$, like most previous top-down approaches for block segmentation; however, it is invariant to skew and it is faster than most previous approaches for most input cases even if they have the same time complexity of $N^2$. As the very next stage to the block

PAGE 13

6 [Figure 1-1. System Overview for Document Analysis. Block labels: media processing station, scanned documents, coded documents, work station, page reader.]

PAGE 14

7 [Figure 1-2. Primary Components of a Document Analysis System. Block labels: scanned document, preprocessing, segmentation, labelling, classification (text, symbol, line drawing, picture), text recognition, symbol recognition, line drawing recognition, picture recognition, page structure analysis, coded document, document filing and retrieval.]

PAGE 15

8 segmentation, each block should be classified according to what it contains and block classification rules are required not to be restricted to error-free document image data. In order to develop an algorithm insensitive to skew, block classification rules based on pixel level will be considered as a way to solve this problem. Unlike previous classification rules, this algorithm tries to extract the features based on pixel level data. Several measurements are considered for developing a block classification rule. In order to apply measurements for classifying the blocks, each block should be analyzed and considered based on structural hierarchy. For example, a text block is composed of text lines, and each text line consists of a variety of characters. The basic components of characters are line segments of uniform width. The proposed pixel operations can distinguish a line segment from a blob segment whose skeleton does not represent its own properties. Separation of text from complex fine line drawing is made by the removal of the pure line segment which is the primary component of the line drawing. Features such as probability of occurrence of black pixels in a block, the number of black pixels after applying consecutive operations, the black-white transition, the total number of black neighbors for each black pixel in the original image are used for a new classification rule. In the understanding stage, some symbols stored in the database will be distinguished from the picture image. These symbols are classified into two types depending upon whether the segment is of line or blob shape. Two different image processing techniques such as thinning and finding a boundary will be applied to line

PAGE 16

9 and blob type symbols respectively in order to extract their features. Geometrical and topological features are utilized for the line type of symbol. On the other hand, the Fourier descriptor for the boundary, which is considered as a closed curve, can be used to recognize the blob type of symbol.

1.4 Preview of Remaining Chapters

Chapter 2 presents a brief survey of previous work on the segmentation of images and the classification of block segments. We discuss both (1) the direct method of segmenting the page image by both top-down and bottom-up approaches, and (2) the transformational method, which converts one coordinate space to another so as to eliminate the skew problem. In chapter 3, we describe a robust algorithm for block segmentation and an unconstrained block classification rule. Since some documents consist of complex data such as text, graphics, and pictures, block segmentation and block classification are required for designing an automatic document analysis system. We also describe page-structure analysis based on standard document architecture and present experimental results. In chapter 4, we describe the page recognition system, which not only recognizes the text blocks but also understands some of the nontext regions. The text recognition deals only with the off-line case, in which the characters are written completely on a sheet or on other material, because only page images are treated in this thesis. The understanding of nontext has mainly been done on line

PAGE 17

10 drawings such as logic circuit diagrams and mechanical engineering drawings and trademarks. Prior to the understanding stage, each line drawing is classified into detailed line drawings based on evidence extracted from each line drawing. We close with chapter 5. After documents have been created, some sort of file-management facility is required for a user to collect the documents into piles and to put these piles away. When a document is filed away in a document-management system, it is necessary to be able to retrieve it at some later time. The design of a complete document-filing and retrieval system is very complex, and it is beyond the scope of this dissertation. Chapter 5 describes some aspects for a document-filing and retrieval system that would be appropriate for an extension of the work in this dissertation.

PAGE 18

CHAPTER 2
BACKGROUND AND REVIEW OF PREVIOUS WORK

In this chapter we review previous work on the segmentation of page images and the classification of the segmented blocks. The automatic segmentation of the page image is a significant process for a document analysis system, first demonstrated by Wahl, Wong and Casey [Wah82]. They presented a heuristic approach to the problem that operates on a binary image of the page. As a preprocessing stage, they convert the grey-scale image into a binary-valued image by comparing the grey-level values with a threshold value. This chapter describes a few preprocessing techniques and the segmentation and classification processes which are at the heart of a document analysis system. Prior to reviewing previous work on segmentation of the page image and classification of the segmented blocks, some preprocessing techniques will be described briefly. Before the document image can be processed, a format conversion is required because the image obtained from a scanner is in raster format. This raster format of a document image is converted to a raw format in order to process the image data.

2.1 Preprocessing

Since the early 1960s, much work has been done in image processing, 11

PAGE 19

12 especially for the reduction of the amount of computation and the efficient reduction of errors. Even normal text documents present several difficult problems for further processing because of variations in type production. Some characters in the document image are often smeared or smudged, or are printed either with very light strokes, which are difficult to detect, or with very heavy strokes, which tend to broaden and run together when imaged by a grey-level scanner. Furthermore, the amount of data in a scanned document is enormous. To address these problems, we will discuss some techniques for (1) binary representation by image thresholding and (2) thinning and boundarization for object description.

2.1.1 Binary Representation by Image Thresholding

The creation of a binary representation from an analog image requires that we determine whether each point is converted into a binary one or a binary zero depending on the grey level measured by a scanner. Thresholding is an obvious tool for creating binary representations from grey-level images. By judiciously choosing a grey-level threshold between the dominant values of the object and the background, the original grey-level image can be transformed into a binary form. Although the method appears simple, it is not easy to find the threshold value from the poor grey-level image data normally returned by a scanner. The scanning hardware, due to technology and cost limitations, has nonuniform illumination over the scan field, sensitivity and dark-current variations from element to element in the scanning array, and distorted resolution from the lens.
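As a minimal sketch of the thresholding idea above, the following code binarizes a tiny grey-level image with a single global threshold. The threshold choice used here (the midpoint between the most frequent dark level and the most frequent bright level) is a hypothetical heuristic chosen only for illustration, as are the function names and the convention that dark pixels map to 1; it assumes a clean bimodal image, which, as noted above, real scanner data often is not.

```python
def grey_histogram(image, levels=256):
    """Frequency of each grey level in a 2-D list of integer grey values."""
    hist = [0] * levels
    for row in image:
        for g in row:
            hist[g] += 1
    return hist


def midpoint_threshold(hist):
    """Hypothetical heuristic: midpoint between the dominant dark and bright levels."""
    half = len(hist) // 2
    dark_peak = max(range(half), key=hist.__getitem__)
    bright_peak = max(range(half, len(hist)), key=hist.__getitem__)
    return (dark_peak + bright_peak) // 2


def binarize(image, threshold):
    """Map grey levels at or below the threshold to 1 (object), all others to 0."""
    return [[1 if g <= threshold else 0 for g in row] for row in image]


# A 3 x 4 patch: bright paper (about 250) crossed by a dark stroke (about 30).
page = [
    [250, 250, 30, 250],
    [250, 35, 30, 245],
    [250, 250, 30, 250],
]
t = midpoint_threshold(grey_histogram(page))
binary = binarize(page, t)
print(t, binary)
```

With nonuniform illumination a single global value like this can misclassify whole regions, which is one motivation for the histogram-model-based selection discussed next.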

PAGE 20

13 Kittler and Illingworth [Kit86] proposed minimum error thresholding, which is also applicable to multithreshold selection. Minimum error thresholding uses a histogram which summarizes the distribution of the grey levels in the image and gives the frequency of occurrence of each grey level in the image. The histogram can be viewed as an estimate of the probability density function $p(g)$ of the mixture population comprising grey levels of object and background pixels. In the following, each of the two components $p(g \mid i)$ of the mixture is normally distributed with mean $\mu_i$, standard deviation $\sigma_i$ and a priori probability $P_i$, i.e.

$$p(g) = \sum_{i=1}^{2} P_i \, p(g \mid i) \tag{2.1}$$

where

$$p(g \mid i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\frac{(g - \mu_i)^2}{2\sigma_i^2} \right) \tag{2.2}$$

For given $p(g \mid i)$ and $P_i$ there exists a gray level $\tau$ for which gray levels $g$ satisfy

$$P_1 \, p(g \mid 1) > P_2 \, p(g \mid 2), \quad g \le \tau \tag{2.3a}$$

$$P_1 \, p(g \mid 1) < P_2 \, p(g \mid 2), \quad g > \tau \tag{2.3b}$$

where $\tau$ is the Bayes minimum error threshold at which the image should be binarized. The problem of minimum error threshold selection is to determine the optimum threshold value $\tau$ from an estimate of the probability density function, which is viewed as a histogram. The minimum error threshold can be obtained by solving the quadratic equation which represents the condition of a gray level for the

PAGE 21

14 existence of the thresholding gray level. However, the parameters of the mixture density function associated with an image to be thresholded will not usually be known. Fitting techniques estimate these parameters from the gray-level histogram. In the fitting technique, the average performance figure for the whole image can be characterized by a criterion function. One of the techniques for finding the optimum threshold $\tau$ can be summarized as follows: Suppose that the gray-level data is thresholded at some arbitrary level $T$ and each of the two resulting pixel populations is characterized by a normal density $h(g \mid i, T)$ with parameters $\mu_i(T)$, $\sigma_i(T)$ and a priori probability $P_i(T)$, modeled as

$$P_i(T) = \sum_{g=a}^{b} h(g) \tag{2.4}$$

$$\mu_i(T) = \left[ \sum_{g=a}^{b} h(g)\, g \right] / P_i(T) \tag{2.5}$$

and

$$\sigma_i^2(T) = \left[ \sum_{g=a}^{b} \left( g - \mu_i(T) \right)^2 h(g) \right] / P_i(T) \tag{2.6}$$

where

$$a = \begin{cases} 0 & i = 1 \\ T + 1 & i = 2 \end{cases} \tag{2.7}$$

and

PAGE 22

15

$$b = \begin{cases} T & i = 1 \\ n & i = 2 \end{cases} \tag{2.8}$$

where $n$ is the gray-level value for a white pixel. Now, using the model $h(g \mid i, T)$, $i = 1, 2$, the conditional probability $e(g,T)$ of grey level $g$ being replaced in the image by a correct binary value is given by

$$e(g,T) = \frac{h(g \mid i, T)\, P_i(T)}{h(g)}, \quad i = \begin{cases} 1 & g \le T \\ 2 & g > T \end{cases} \tag{2.9}$$

An index of correct classification performance, $\varepsilon(g,T)$, is obtained by taking the logarithm of the numerator in (2.9) and multiplying the result by -2:

$$\varepsilon(g,T) = \left[ \frac{g - \mu_i(T)}{\sigma_i(T)} \right]^2 + 2 \log \sigma_i(T) - 2 \log P_i(T), \quad i = \begin{cases} 1 & g \le T \\ 2 & g > T \end{cases} \tag{2.10}$$

The average performance figure for the whole image can then be characterized by the criterion function

$$J(T) = \sum_{g} h(g) \cdot \varepsilon(g,T) \tag{2.11}$$

This criterion function reflects indirectly the amount of overlap between the Gaussian models of the object and background populations. The better the fit between the data and the models, the smaller the overlap between the density functions and therefore the smaller the classification error. The problem of minimum error threshold selection can then be formulated as one of minimizing the criterion $J(T)$, i.e.

$$J(\tau) = \min_{T} J(T) \tag{2.12}$$
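The criterion minimization in (2.4) through (2.12) can be sketched as a brute-force search over all candidate thresholds. The code below is an illustrative reading of those equations, not a published implementation: the function names are assumptions, the histogram is normalized so that it sums to one, and candidate thresholds for which either population is empty or has zero variance are simply skipped (an assumption made here to keep the logarithms defined).

```python
import math


def population_stats(h, a, b):
    """P_i(T), mu_i(T) and sigma_i^2(T) over grey levels a..b, as in (2.4)-(2.6)."""
    p = sum(h[a:b + 1])
    if p <= 0.0:
        return None
    mu = sum(g * h[g] for g in range(a, b + 1)) / p
    var = sum((g - mu) ** 2 * h[g] for g in range(a, b + 1)) / p
    if var <= 0.0:
        return None
    return p, mu, var


def criterion(h, t):
    """J(T) of (2.11): histogram-weighted sum of the index in (2.10)."""
    n = len(h) - 1
    j = 0.0
    for a, b in ((0, t), (t + 1, n)):      # limits a and b from (2.7) and (2.8)
        stats = population_stats(h, a, b)
        if stats is None:
            return None                    # degenerate split, skipped by assumption
        p, mu, var = stats
        for g in range(a, b + 1):
            # 2 log sigma is written as log(sigma^2)
            eps = (g - mu) ** 2 / var + math.log(var) - 2.0 * math.log(p)
            j += h[g] * eps
    return j


def minimum_error_threshold(hist):
    """Return the grey level T that minimizes J(T), as in (2.12)."""
    total = float(sum(hist))
    h = [count / total for count in hist]
    best_t, best_j = None, math.inf
    for t in range(len(h) - 1):
        j = criterion(h, t)
        if j is not None and j < best_j:
            best_t, best_j = t, j
    return best_t


# Synthetic bimodal histogram: two bell-shaped modes centred at 50 and 200.
hist = [
    round(1000 * math.exp(-((g - 50) ** 2) / 200.0))
    + round(1000 * math.exp(-((g - 200) ** 2) / 200.0))
    for g in range(256)
]
tau = minimum_error_threshold(hist)
print(tau)
```

For this synthetic histogram the selected threshold falls in the empty valley between the two modes. Because every threshold inside that valley produces the same two populations, several values tie, and this sketch returns the first of them.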

PAGE 23

16

2.1.2 Thinning and Boundarization for Object Description

Two common ways to describe an object are by the use of boundaries and of skeletons. An object in two-dimensional space is completely determined if we know its borders, provided that we also know which side of each border is inside the object and which is outside. If it has holes, an object may have more than one border. A different way of describing an object makes use of its "skeleton". Finding the skeleton of an object is usually called thinning of image data. The thinning algorithm is usually an iterative edge-point erosion technique. The purpose of thinning is to simplify the boundary image by reducing it to its skeleton without destroying its geometrical shape and connectivity. In other words, the thinned pattern, the so-called skeleton, must preserve the connectedness and shape of the original pattern, except for extremely thick objects. It should be noted that the skeleton of a pattern may not be unique. During the past years, many algorithms have been proposed for thinning [Nac84, Pav80]. If the region is composed of thin components, it can be described well by its skeleton. Skeletons derived by the thinning algorithm keep the connectivity of regions. Thinning usually consists of iteratively deleting border points such that the deletion does not remove end-points, does not break the connectivity of the pattern, and does not cause excessive erosion. To prevent excessive erosion, end-points cannot be removed. Here, an end-point is defined as a dark point with at most one dark eight-neighbor. The eight-neighbors of point $p$ are defined to be the eight points adjacent to $p$. Points $p_0$, $p_2$, $p_4$, and $p_6$ are referred to

PAGE 24

17 as the four-neighbors of $p$ in Figure 2-1. (In this representation it is assumed that the object is represented by a regular grid of measurements, in equal steps in both the row and column directions.)

    p3  p2  p1
    p4  p   p0
    p5  p6  p7

Figure 2-1. A point p and its neighbors.

Most thinning algorithms operate similarly, deleting a dark point from the pattern if it satisfies certain conditions, such as being an edge-point, not being an end-point, and predefined points etc. These algorithms are applied mostly to smooth synthesized patterns which have no irregularities and no noise. However, the thinning process is very sensitive to noise on the binary boundary. A small disturbance of the boundary causes the creation of small strokes as well as a disturbance in the main skeletons. In the boundary image, there is positive noise which is either isolated salt-and-pepper noise or boundary annexed noise. Therefore, we have to remove the small strokes from the main skeletons through a refining process after the main stream of the thinning process. The thinning algorithm is illustrated as follows:

1. Set the flag remain to true.
2. While remain is true do steps 3-14. Begin.
3. If p is 1 and not a pixel of a double line then do
4. For j = 0 to 7 do steps 5-6.
5. Count the number of changes for 8-neighbors ( $c(p)$ ).
6. Count the number of dark pixels for 8-neighbors ( $N_8(p)$ ).

PAGE 25

18
7. For j = 0, 2, 4, 6 do step 8.
8. Count the number of dark pixels for 4-neighbors ( $N_4(p)$ ).
9. If $c(p)$ is 1 and 3 < $N_8(p)$ < 7 and $N_4(p)$ < 4 then do
10. Set p equal to 0.
11. If p is last pixel then do steps 12-14.
12. For all pixels p of image data do Begin
13. If A = B then set remain to true, else set remain to false
14. End. Set A to B.
15. End.
16. End of Algorithm.

2.2 Review of Previous Work

Since the late 1970s, several approaches for text block separation from mixed text/graphics images have been proposed as an intermediate process for document analysis systems. These approaches are categorized into two methods, direct and transformational. The page is usually split into blocks so that the reader can read it with ease. The white space between the blocks is usually wider than the space between text lines. Direct methods separate the document image into several blocks by applying certain rules to the image data directly. The page image is segmented into blocks by layout structure and each block is classified. The transformational method converts the image data from one coordinate space to another coordinate space, then applies a rule to it. Such a conversion was introduced in order to design a robust approach for separating text blocks from mixed text and graphics images. A transform, known as the Hough transform, is applied for separating text images from documents with text and

PAGE 26

19 graphics images. Here, we will discuss the most representative frameworks for (1) the block segmentation rule for page images, (2) the block classification rule for the segmented blocks, and (3) recognizing textual blocks using the Hough transform.

2.2.1 Block Segmentation Rule for Page Images

Wong et al. [Won82] and Wahl et al. [Wah82] proposed the Document Analysis System, which assists a user in encoding a printed document for computer processing. The proposed system consists mainly of a block segmentation stage and a block classification procedure to analyze the document image, which is the visual representation of a page. First, a segmentation procedure subdivides the area of a document into blocks, each of which should contain only one type of data. Second, some basic features of these blocks are calculated in order to classify them into a specific type. At an early stage of the proposed system, they used a run length smoothing algorithm (RLSA), which operates on every dark point in the document image, as the block segmentation rule. An RLSA had been used earlier to detect long vertical and horizontal white lines. This algorithm was extended to obtain a bitmap of white and black areas representing blocks containing the various types of data. The RLSA operates on pairs of black pixels which have at most a predetermined number of contiguous white pixels between them on the same column or the same row: such short white runs are changed to black. The RLSA is first applied row-by-row and then column-by-column, yielding two distinct bit maps. The two results are then combined by applying a logical AND to each pixel location. The RLSA can detect small blocks such that each block includes

PAGE 27

20 Figure 2-2. The failure of RLSA when the document is skewed.

PAGE 28

21 just a text line in the text region. The RLSA is fast; however, it fails if the text lines are skewed: it does not extract any text lines even though text lines exist in the page. Figure 2-2 shows the segmented blocks of a document skewed by 1.2°. Although the document is skewed by less than two degrees, the RLSA does not generate blocks of text lines. Nagy et al. [Nag86] described one of the top-down segmentation strategies, called RXYC (recursive X-Y cuts). This approach is also known as projection profile cuts. Printed pages are conventionally made up of rectangular blocks, and a page can be recursively cut into rectangular blocks. Thus the document is represented in the form of a tree of nested rectangular blocks. At each step of the recursive process, projection profiles are computed along both the horizontal and vertical directions; each profile value is simply the sum of all the pixel values along one line in that direction.

[Figure 2-3. An example of the subdivided page by RXYC.]

Then division along the two directions is

PAGE 29

22 accomplished by making cuts corresponding to deep valleys in the projection profile whose width is larger than a predetermined threshold. The RXYC identifies large blocks, as illustrated in Figure 2-3; however, like RLSA, it fails if the text lines are skewed. The approaches described thus far perform certain global operations on the entire image. In contrast, Doster [Dos84] proposed a bottom-up approach which first determines the individual connected components. In the case of text blocks, the characters are merged into words, words are merged into lines, lines into paragraphs, and paragraphs into even larger blocks, if such a merging is possible. However, this approach requires extensive memory resources and is very slow [Sri86].

2.2.2 Classification Rule for the Segmented Blocks

Scherl et al. [Sch80] described a simple method for obtaining characteristic features for text, graphics and picture segments. It subdivides the document into small, overlapping windows. Within each window, a grey-level histogram is evaluated. Then the features can be extracted statistically from the histogram. Text consists of a white background with black characters on it. The background is almost entirely of one intensity level and is the largest and brightest part within a window. The characters do not consist of black lines but of many transitions between black and white. Because of this, a small sharp peak at a bright

PAGE 30

23 greylevel and a lot of darker greylevels are typical of text. Meanwhile, the histogram of a picture has no similar sharp characteristics. The shape of such a histogram strongly depends on the content of the picture. In some cases, the histogram of a picture may look like the histogram of text. But usually the percentage of darker levels within a picture is higher, and often the brightest points within pictures are darker than the background of text. Furthermore, if a graphic consists only of lines, its greylevel histogram will not differ much from that of text. Therefore, statistical features taken from a greylevel histogram do not seem suitable for discriminating text from graphics. The shape of the histogram is largely dependent upon the size of the window. A larger window results in a weaker dependence of the shape of the histogram on the position of the window within the text. However, larger windows decrease the accuracy. A method for block classification was proposed by Wong et al. [Won82]. They use block height and block mean black pixel run length as the basic features. Several measurements are taken to classify text line blocks and graphics or halftone picture blocks: the total number of black pixels in the segmented image block (BC), the minimum x-y coordinates of a block and its x-y lengths ($x_{min}$, $\Delta x$, $y_{min}$, $\Delta y$), the total number of black pixels in the original image for the block (DC), and the number of horizontal white-black transitions in the original image block (TC). The following features are measured during component labeling: (1) the height of each block segment, $H = \Delta y$, (2) the eccentricity of the rectangle surrounding the block, $E = \Delta x / \Delta y$, (3) the ratio of black area to enclosing box area, $S = BC / (\Delta x \, \Delta y)$, and (4) the mean

PAGE 31

24 horizontal length of the black runs of the original data from each block, $R_m = DC/TC$. Some of the features are illustrated in Figure 2-4. A block is considered to be text if its R and H values are less than some constant multiples of the mean length of black run and the mean height. In other words, a pattern classification scheme that assumes linear separability is used to determine the region in the plane with mean height ($H_m$) as one coordinate and mean length of black run ($R_m$) as the other coordinate.

[Figure 2-4. The typical block shape of text line in the RLSA. (a) The shape of text block. (b) An example of the shape for the nontext block.]

The distribution of values in the R-H plane obtained from sample documents is observed to determine the discriminant function. For example, text is the predominant data type in a typical office document and text lines are basically textured stripes of approximately constant height H and

PAGE 32

25 mean length of black run $R_m$. Text blocks tend to cluster with respect to these features. Figure 2-5 illustrates the distribution of values in the $R_m$-H plane. The text lines of a document form a clustered population within the range 20 < H < 35 and 2 < $R_m$ < 8. Low $R_m$ and H values represent the regions that contain text. The graphic and halftone images have high values of H, whereas solid black lines have high R and low H values in the $R_m$-H plane.

[Figure 2-5. The distribution of $R_m$ and H values for each block type.]

The $H_m$ and $R_m$ for the text cluster may vary for different types of documents, depending on character size and font. Furthermore, the text cluster's standard deviations $\sigma(H_m)$ and $\sigma(R_m)$ may also vary depending on whether a document is in

PAGE 33

26 a single font or multiple fonts and character sizes. These authors applied heuristic rules to classify the blocks regardless of character size and font. A variable, linear, separable classification scheme is used to assign the blocks to the following four classes.

Text: if $R < C_{21} \times R_m$ and $H < C_{22} \times H_m$
Horizontal solid black lines: if $R > C_{21} \times R_m$ and $H < C_{22} \times H_m$
Graphics and halftone images: if $E > 1/C_{23}$ and $H > C_{22} \times H_m$
Vertical solid black lines: if $E < 1/C_{23}$ and $H > C_{22} \times H_m$

The constants $C_{ij}$ are determined heuristically by examining the R-H plane plot of typical documents and the values of $R_m$ and $H_m$. They have assigned some values to the parameters based on several training documents. Although prior knowledge about the structural characteristics of a newspaper can be used for classifying blocks, in some cases these features will lead to classification errors. For example, the geometric characteristic that a text line has approximately a given constant height could be used for deciding that a block is a text line. But if the image was skewed while being digitized, some text lines would be linked together by the block segmentation procedure, and the linked text lines would then be classified into the graphic and halftone categories. In Wahl's work [Wah83], a distance-mapping function based on a border-to-border distance of the block to be classified is used for shape discrimination. The new distance is a function of the two Cartesian coordinates x, y { (x,y) ∈ S } and an angle $\phi$ measured from the x axis { 0°^
PAGE 34

27 segment $B_1B_2$ with angle $\phi$, which connects an inner point $p(x,y)$ of S with two opposite border points $B_1$, $B_2$ of S, such that the line segment $B_1B_2$ is entirely inside S (Figure 2-6). The minimum and the maximum line segment lengths at any point (x,y), taken over all possible angles $\phi$, are defined as $D_{min}(x,y)$ and $D_{max}(x,y)$ respectively. In addition, an eccentricity mapping $D_{ecc}(x,y)$ is defined to be

$$D_{ecc}(x,y) = \frac{D_{max}(x,y)}{D_{min}(x,y)} = \frac{\max_{\phi} [d(x,y,\phi)]}{\min_{\phi} [d(x,y,\phi)]}$$

Similarly, $d_{min}$, $d_{max}$, and $d_{ecc}$ are defined as the average values of $D_{min}(x,y)$, $D_{max}(x,y)$, and $D_{ecc}(x,y)$ over all the pixels in the connected component. Simple shape factors $f_1$ and $f_2$, which are derived from $d_{min}$ and $d_{max}$ as $f_1 = c_1 \cdot A / d_{min}^2$ and $f_2 = c_2 \cdot A / d_{max}^2$ respectively, can be used as features to discriminate text, graphics, and thresholded gray-level pictures. In the two shape factors $f_1$ and $f_2$, A is the number of pixels in discrete space and $c_1$, $c_2$ are constants determined by experimentation. Text has a large $f_1$ value and graphics have a relatively large $f_1$ value too. However, text has a large $f_2$ value compared to graphics and thresholded gray-level pictures, while thresholded gray-level pictures have a small $f_1$ compared to the other two types of images. Wang et al. [Wan89] adopt statistical texture analysis for discriminating document image categories. Their statistical approach to texture analysis has two basic stages: (1) a series of intermediate matrices which are computed from the image region and (2) a set of features which are computed from these intermediate matrices. For the intermediate matrices, they use the BW matrix and BWB matrix, which are a set of consecutive black pixels followed by a set of consecutive white

PAGE 35

28 [Figure 2-6. Illustration of the distance mapping function $d(x,y,\phi)$.]

PAGE 36

29 pixels and a set of consecutive black pixels respectively. These matrices are illustrated in Figure 2-7. The length of the run is the number of pixels in the run. A black-white pair run is categorized into nine classes (i.e., 1 to 9) depending upon the proportion of white pixels. The category number represents the percentage of the white part in a black-white pair run. In the matrix element p(i,j), which specifies the number of times that the image contains a black-white pair run, i means that white pixels make up 10 × i percent of a black-white pair run of length j. For example, the matrix element p(3,10) counts the black-white pair runs of length 10 in which the percentage of white pixels is 30. Meanwhile, a black-white-black combination run is defined as a pixel sequence in which two black pixel runs are separated by a white pixel run, as shown in Figure 2-7 (b).

[Figure 2-7. Definition of BW and BWB matrices.]

The length of a black pixel run is fixed and assigned into three

PAGE 37

30 categories depending upon the predefined arrangement. The matrix element p(i,j) is the number of times that the image contains a black-white-black combination run, in the horizontal direction, with white pixel run length j and black pixel runs with length lying in category i. In order to create a three-dimensional feature space that can distinguish all the blocks in a document image, these authors derive two features $F_1$ and $F_2$ from the BW matrix, and a feature $F_3$ from the BWB matrix. These features are defined as follows.

(1) Short Run Emphasis

$$F_1 = \sum_{i=1}^{N_c} \sum_{j=1}^{N_r} \frac{p(i,j)}{j^2} \Bigg/ \sum_{i=1}^{N_c} \sum_{j=1}^{N_r} p(i,j) \tag{2.13}$$

In short run emphasis, the matrix element p(i,j) is the (i,j)-th entry in the given run length matrix, $N_c$ is the number of different kinds of pixel runs, and $N_r$ is the number of different run lengths that occur. The value of $F_1$ for small letters is larger than the value of $F_1$ for large letters, because white spaces between strokes in small letters are smaller than those in large letters. Meanwhile, the second feature is defined as follows:

(2) Long Run Emphasis

$$F_2 = \sum_{i=1}^{N_c} \sum_{j=1}^{N_r} j^2 \, p(i,j) \Bigg/ \sum_{i=1}^{N_c} \sum_{j=1}^{N_r} p(i,j) \tag{2.14}$$


The long run emphasis, on the contrary, has a larger value for large-letter blocks than for small-letter blocks. A third feature, derived from the BWB matrix, is the extra long run emphasis.

(3) Extra Long Run Emphasis

$$F_3 = \frac{\sum_{j=T_1}^{N_r} j^2 \sum_{i=1}^{N_c} p'(i,j)}{\sum_{j=T_1}^{N_r} \sum_{i=1}^{N_c} p'(i,j)} \tag{2.15}$$

where

$$p'(i,j) = \begin{cases} p(i,j) & \text{if } p(i,j) > T_2 \\ 0 & \text{if } p(i,j) \le T_2 \end{cases} \tag{2.16}$$

In extra long run emphasis, the threshold $T_1$ is set to delete short run lengths, because only very long run lengths are needed to express the characteristics of graphics blocks. The threshold $T_2$ deletes the effect of small values of p(i,j), because a long run appears occasionally in letter blocks and photograph blocks. The thresholds $T_1$ and $T_2$ are determined by experimentation. The three features were measured for several different types of sample blocks. In the $F_1$-$F_2$ feature space, blocks of the same type cluster together within each class and the classes are well separated from one another, except for graphics blocks. The feature $F_3$, however, separates graphics blocks from the other types of blocks.
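As an illustration of the three run emphasis features, the sketch below evaluates them on a small run length matrix. It assumes the forms of Equations (2.13) to (2.16) as read above, stores the matrix as a list of rows indexed by category i with columns indexed by run length j starting at 1, and returns 0 for F3 when no entry survives the thresholds. The function names and the zero fallback are illustrative choices, not taken from [Wan89].

```python
def short_run_emphasis(p):
    """F1: entries weighted by 1 / j^2, divided by the total number of runs."""
    total = sum(sum(row) for row in p)
    weighted = sum(v / (j * j) for row in p for j, v in enumerate(row, start=1))
    return weighted / total


def long_run_emphasis(p):
    """F2: entries weighted by j^2, divided by the total number of runs."""
    total = sum(sum(row) for row in p)
    weighted = sum(v * j * j for row in p for j, v in enumerate(row, start=1))
    return weighted / total


def extra_long_run_emphasis(p, t1, t2):
    """F3: like F2, but only run lengths j >= t1 and entries above t2 are used."""
    numerator = 0
    denominator = 0
    for row in p:
        for j, v in enumerate(row, start=1):
            if j >= t1 and v > t2:
                numerator += j * j * v
                denominator += v
    # Assumption: report 0.0 when every entry has been thresholded away.
    return numerator / denominator if denominator else 0.0


# Two categories, run lengths 1 and 2: four short runs and one longer run.
example = [[4, 0],
           [0, 1]]
f1 = short_run_emphasis(example)
f2 = long_run_emphasis(example)
f3 = extra_long_run_emphasis(example, t1=2, t2=0)
```

A matrix dominated by short runs pushes F1 up and F2 down, which is the behaviour the text attributes to small-letter blocks.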


2.2.3 Line-to-Point Transformation

The transformation of a line in Cartesian coordinate space to a point in polar coordinate space was developed by Hough. A straight line is described in Figure 2-8 (a) as ρ = x cos θ + y sin θ, where ρ is the normal distance of the line from the origin and θ is the angle of that normal with respect to the x axis. The Hough transform of the line is simply a point with coordinates (ρ, θ) in the polar domain (Figure 2-8 (b)). A family of lines passing through a common point (Figure 2-8 (c)) maps into a connected set in the polar domain (Figure 2-8 (d)). The connected set for the family of lines passing through point A in Figure 2-8 (e) is drawn at the top of Figure 2-8 (f), the connected set for point B in the middle, and the one for point C at the bottom. These connected sets meet at (ρ₀, θ₀) in Figure 2-8 (f). This occurs because the three points are collinear in Cartesian coordinate space.

2.2.4 Recognizing Textual Blocks Using the Hough Transform

The Hough transform has found numerous applications, such as detecting lines and curves in pictures and handling multi-valued images. Rastogi et al. [Ras86] applied the Hough transform to document analysis. The Hough transform technique detects the presence of a parametrically representable group of points in an image, such as a straight line or a circle, through a mapping to a parameter space. They utilized the fact that pages consist of straight components; e.g., text lines are straight. If we use


33 Figure 2-8. The Hough transform.


the Hough transform for document analysis, we can extract text lines because characters on a text line are usually collinear. The Hough transform is applied to the centroid of each connected component in the Cartesian coordinate space in order to obtain a three-dimensional view in the polar domain. The accumulator for each point in the polar domain counts the connected sets passing through that point, one connected set being generated by the centroid of each connected component in the document image. An array of accumulators is set up by quantizing the values of ρ and θ. For a 512 by 512 image, the possible range of ρ is -362 to 362. This value is arrived at by assuming the origin of the (ρ, θ) space to be at the center of the 512 by 512 image; hence the maximum value of ρ is 256√2 ≈ 362, and the range of θ is 0° to 180°. Code for the algorithm is as follows:

    For all centroids of components
        For θ = 0 to 180 degrees {
            ρ = x cos θ + y sin θ
            accumulator[ρ, θ] = accumulator[ρ, θ] + 1
        }

The above states that the Hough transform is applied to every significant point in row x and column y. For a 512 by 512 image, the maximum value of the normal distance ρ from the center of the image is approximately 362, reached where x is 256 and y is 256, i.e. ρ = 256 cos 45° + 256 sin 45°. The Hough transform for document analysis converts 2-D page images to 3-D images. The valleys of 3-D images transformed from 2-D page images separate the


35 blocks. This mapping is one-to-many in either direction, and among the various properties which hold true for this transformation are (1) a point in the document image corresponds to a sinusoidal curve in the parameter plane, (2) a point in the parameter plane corresponds to a straight line in the document image, (3) points lying in the same straight line in the document image correspond to a curve through a common point in the parameter plane, and (4) points lying on the same curve in the parameter plane correspond to lines through the same point in the document image. Fletcher et al. [Fle88] described an algorithm robust to changes in text font style and size within an image. The algorithm uses simple heuristics based on the characteristics of text strings. This segmentation algorithm is based on grouping collinear connected components of similar size and does not recognize individual characters. In order to accomplish these tasks, the algorithm consists of five steps: (1) connected component generation, (2) area/ratio filter, (3) collinear component grouping, (4) logical grouping of strings into words and phrases, and (5) text string separation. In the analysis of block structure for the 3-D document image, blocks are easily accessed by looking at rectangular chunks cut across by the true orientation lines and their perpendiculars. The authors analyzed number of rows and the width of the transitions to classify the blocks. They also use several properties such as size of the rectangles, their eccentricity, orientation, texture and complexity.
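The accumulator loop described in Section 2.2.4 can be sketched as follows. This is a minimal illustration rather than the implementation of [Ras86]: it assumes centroids are given as (x, y) pixel coordinates, moves the origin to the image center as the text describes, quantizes θ in 1° steps over 0° to 179° (θ = 180° repeats θ = 0° with the sign of ρ flipped), and rounds ρ to the nearest integer. The function name and the dictionary-based accumulator are hypothetical choices.

```python
import math


def hough_accumulate(centroids, size=512, theta_step=1):
    """Vote in a (rho, theta) accumulator once per centroid and per angle."""
    half = size / 2  # the origin of (rho, theta) space is the image center
    accumulator = {}
    for x, y in centroids:
        xc, yc = x - half, y - half
        for theta in range(0, 180, theta_step):
            angle = math.radians(theta)
            rho = int(round(xc * math.cos(angle) + yc * math.sin(angle)))
            key = (rho, theta)
            accumulator[key] = accumulator.get(key, 0) + 1
    return accumulator


# Three component centroids lying on the horizontal line y = 300.
votes = hough_accumulate([(100, 300), (200, 300), (400, 300)])
peak = max(votes, key=votes.get)
```

Collinear centroids all vote for the same (ρ, θ) cell, so the peak cell identifies the text line: here the line lies 44 pixels from the image center and its normal points along θ = 90°.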


36 2.2.5 Conclusions from Previous Work As discussed earlier, the document analysis system consists of two activities such as block segmentation and block classification. Most of early work tried to separate the page image from the bit-map of the page by a direct method [Won82, Nag86]. The RLSA has generated every text line as a block, in other words, many blocks with almost the same height of text lines. Thus, RLSA generates too many blocks; it has furthermore another shortcoming called the skew problem [Bai87]. If text lines are scanned with skew even less than a few degrees, that is, vertically any part of the highest letter in the next text line is at the same height or above the lowest pixel of the text line being scanned, then the RLSA cannot generate the block with the height of the text line. An approach, based on the observation that documents generally have rectangular block structures, used a projection profile to segment the block [Nag86]. The projection of the black pixel counts along the horizontal and vertical directions was used for block segmentation [Zen85, Mas85]. The projection method is very sensitive to document skew with respect to the raster-scanning direction of the scanner. It produces satisfactory results only for documents with rectangular block structure. To overcome this problem, skew detection should be done by iteratively examining small angle deviations from the normal direction to determine which angle gives the steepest variation of the projection profile [Mas85]. Rastogi and Srihari [Ras86] applied the Hough Transform for document analysis in order to solve the skew problem. This method is invariant to skew;


37 however, it requires much computation time for the preliminary steps and extensive usage of memory resources. It also is highly CPU intensive and consequently is too slow to be applied for document analysis without support of special hardware. Most of the previous work for block classification was developed under the assumption that documents were digitized without any skew. The classification scheme in [Won82] uses block height and block mean black pixel run length, and it requires that block height should not be taller than the height of the highest letter in that text line. However, if the page is scanned with even a few degrees of skew, some text lines will be stuck together. Thus, the height of stuck text lines leads to the failure of the block classification rule. Wang and Srihari [Wan89] considered that an image region possessed a certain texture if it had some basic subpatterns which occur repeatedly according to some specific rules of arrangement. Two matrices, BW matrix and BWB matrix, are used to represent the textual characteristics of a newspaper image block. In addition to those, three feature definitions such as (1) short run emphasis (SRE), (2) long run emphasis (LRE), and (3) extra long run emphasis (ELRE) were measured for several sample blocks segmented from the newspaper image. From these three feature definitions, two-dimensional SRE-LRE space and three dimensional feature space created by SRE, LRE, and ELRE are used to distinguish between the different types of blocks. Although prior knowledge about the structural characteristics of a newspaper can be used for classifying blocks, in some cases these features will lead to classification errors. For example, the geometric characteristic that a text line has


approximately a given constant height could be used for deciding that a block should be a text line. But if the image was skewed while being digitized, some text lines would be linked together by the block segmentation procedure, and the linked text lines might then be classified into the graphics or halftone categories.


CHAPTER 3
MACHINE UNDERSTANDING OF STRUCTURES IN DOCUMENTS

3.1 Introduction

Automatic block segmentation and classification of a digitized document image are necessary elements of a document analysis system capable of understanding a document consisting of a mixture of text and graphics images. Such block segmentation can be done by element, by text line, or by relatively big paragraphs separated by wide white space. Block segmentation by text line has been described by the run length smoothing algorithm [Wah82]. The method of recursive X-Y cuts [Nag86] generates blocks of bigger sizes, obtained from the projection profile. However, both algorithms require that documents be placed without skew. The Hough transform has been used to design a system which is very insensitive to document skew and separates text strings from mixed text/graphics images [Fle88]. However, this robust algorithm is so CPU intensive that it may require special purpose hardware for acceptable response times. The failure of block segmentation due to skewed document image data not only separates the document inappropriately but also induces misclassification of the blocks. In this chapter, we will address the development and implementation of a robust algorithm for automatic separation and analysis of text, graphics, and halftone images. This new algorithm for block segmentation divides the page by white spaces


which separate the blocks despite the skew of the document. Then, the segmented blocks will be classified by a classification scheme which is not affected by rotation of the block. In the understanding stage, the document analysis system not only recognizes the text blocks but also understands a nontext area. Such a system requires an advanced capability which can analyze an object and synthesize the technologies developed by various methodologies. The complete understanding system for a two-dimensional page image with printed characters and graphics is the integration of work done on several problem areas. This system not only encodes each block using different types of data format to reduce the memory size, but understands the documents as well.

3.2 The Block Segmentation of Digitized Documents

In this section, we will discuss block segmentation, a procedure that subdivides the area of a digitized document into blocks in order to process the document images systematically. Each of the blocks ideally is required to contain only one type of image data. Such block segmentation of document image data should be done by certain rules. As we described earlier, it can be done by an element, a text line, or relatively big blocks. The previous block segmentation approaches, such as the run length smoothing algorithm and recursive X-Y cuts, also called projection profile cuts, require the document to be placed without skew. The block segmentation algorithm will be evaluated on three aspects: (1) time complexity, (2) robustness, and (3) block size.


The document image can be segmented into blocks by two different methods, top-down and bottom-up. In the top-down approach, certain global operations are performed on the entire image. In the bottom-up approach, on the other hand, all the components in the document images are individually detected and then merged together into larger blocks. We will address a new approach for block segmentation belonging to the top-down approach. This new approach connects each component to generate bigger connected components of appropriate size. To generate the connected components, we will apply the operator defined as follows:

Definition 1. Let P(x,y) be a picture element at location x and y in the document D. Define P(x,y) = 1 as a black picture element and P(x,y) = 0 as a white picture element.

Definition 2. The operator $OP_1$ executes the following operation. If P(x,y) = 1, then $OP_1$(P(x,y)) generates picture elements such that P(x+α, y+β) = 1 for every |α|, |β| ≤ d, where α and β represent the row and column displacements and d is a predetermined distance.

The operator $OP_1$ with the predetermined distance is applied to every black pixel of the document image data; it expands each black pixel so as to connect every black pixel in a certain area which can be separated from other areas intuitively. The predetermined distance is set to a value which is larger than half the distance between text lines and less than half the distance between blocks. Normally, the space between the blocks is wider than the space between the text lines. Of course, this is not a rule all documents have to abide by. However, we will exclude the


documents with poor layout structure that violate the above condition. Table 3-1 shows an illustration of the space distance between the blocks for several documents.

Table 3-1. Illustration for the space distance between blocks.

Block Pairs                      Distance
Text : Text                      4-9
Text (Bold face) : Text          4-10
Text : Text (Bold face)          5-10
Text : Nontext                   2-17
Nontext : Text                   3-28
Caption (Nontext) : Nontext      4-7
Nontext : Caption (Nontext)      3-14

< Unit = distance between text lines >

3.3 Labelling of Segmented Blocks

To identify each block separated by the block segmentation process, labels have to be assigned to different blocks for subsequent procedures such as block classification and feature extraction. This labelling process treats the individual connected components of a set S as separate objects. S is a bit map representation of the scanned document page. Each component of S has a value of 1 for black pixels and a value of 0 for white pixels.
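Before the labelling step, the expansion operator OP1 of Definition 2 in Section 3.2 can be sketched as below. This is only an illustration under stated assumptions: the image is a list of rows of 0/1 values, the expansion is clipped at the image border, and the function name `expand` is hypothetical. Each black pixel blackens the (2d+1) by (2d+1) square centered on it, so white gaps of at most 2d pixels are closed while wider gaps survive.

```python
def expand(image, d):
    """Apply OP1 once: blacken every pixel within distance d of a black pixel."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            if image[x][y] == 1:
                # Offsets |a| <= d and |b| <= d, clipped to the image area.
                for a in range(max(0, x - d), min(rows, x + d + 1)):
                    for b in range(max(0, y - d), min(cols, y + d + 1)):
                        out[a][b] = 1
    return out


# One row with a narrow gap (2 white pixels) and a wide gap (5 white pixels).
row = [[1, 0, 0, 1, 0, 0, 0, 0, 0, 1]]
merged = expand(row, 1)
```

With d = 1 the narrow gap is bridged and the wide gap still separates two regions, which mirrors the choice of d between half the text line spacing and half the block spacing.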


We can label the components of S by performing a row-by-row scan. As the first line is scanned from left to right, the label 1 is assigned to the first black pixel. This label is propagated repeatedly; in other words, subsequent adjacent black pixels are labeled with 1's until the first white pixel is encountered. The next black pixel along the line is labeled with 2, and so are its adjacent neighbor black pixels. This is continued until the end of the first line is reached. For each black pixel in the second line, the neighborhood in the previously labeled line is examined along with the left neighbor of the pixel. The two upper diagonal neighbors of each black pixel, already visited by the scan, are examined in order to give the pixel the same label if they are black pixels. If all eight neighbors are 0, the current pixel P gets a new label; that is, if a black pixel has no labeled neighbor, the next label not yet assigned is assigned to this pixel. This labelling process is illustrated in Figure 3-1 (a). If one of the neighbors is 1, pixel P gets the same label; if two or more of them are 1, pixel P gets one of their labels and the equivalences are noted. This procedure is continued until the bottom line of the binary image is reached. Equivalences arise because parts of the same connected black region may have been labeled differently. The equivalent label pairs are sorted into equivalence classes, and one label is picked to represent each class (see Figure 3-1 (b)).

3.4 The Unconstrained Block Classification Rule

Some of the simplest patterns that people can recognize without difficulty are very hard for a computer to detect. A human can classify the blocks in a


44 document instantaneously even though a text block consists of letters which are not known to us. Without recognizing each character, the human being has an ability to classify the block of text in documents if he(she) is educated enough to figure out what most characters look like. A few research works describe the block classification of mixed text/graphics images assuming that the documents are scanned without skew. Since in most cases the document image is segmented with skew, the classification rule should be able to classify the blocks regardless of the shape of the blocks because the shape of blocks will change when the document is scanned with skew. The features of each type of block need to be scrutinized in order to generate the classification rule. In the human visual mechanism, seeing is known to involve processing an enormous amount of data. Part of the shock of making a deep analysis of the vision process comes from the realization of how much information the human brain processes in the act of seeing. The brain keeps a temporary record of the sensory input during perception [Sow84], A visual icon is stored in the brain for just a fraction of a second. When the brain receives a new sensory icon, it must search its stock of percepts to find ones that match parts of the icon. The cerebral cortex stores the percepts, but other parts of the brain may control the actual searching and comparing. The brain has also an associative mechanism which retrieves the pattern that matches best, while an ordinary computer retrieves data by an address in storage. Perception finds percepts that match the overall pattern of an icon before it fills in percepts for the detail. It is therefore impossible to simulate the human


45 (a) Figure 3-1. The algorithm for component labelling.


(b) Figure 3-1 ( Continued ). [Grid of pixel labels 1, 2 and 3 for the labelled components.]
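The row-by-row labelling of Section 3.3, illustrated in Figure 3-1, can be sketched as a two-pass procedure. The sketch below is an assumption-laden illustration, not the dissertation's code: it examines the four neighbors already visited by the scan (left, upper-left, upper, upper-right), records equivalences in a small union-find table named `parent`, and in a second pass replaces every label by the smallest label of its equivalence class. All names are hypothetical.

```python
def label_components(image):
    """Label 8-connected black regions of a 0/1 image with a two-pass scan."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[k] leads to the representative of label k; 0 unused

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # shorten the chain while walking
            k = parent[k]
        return k

    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1:
                continue
            # Collect labels of the neighbors the scan has already visited.
            seen = []
            for dr, dc in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and labels[rr][cc]:
                    seen.append(labels[rr][cc])
            if not seen:
                parent.append(len(parent))  # next label not yet assigned
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = seen[0]
                for other in seen[1:]:  # note the equivalences
                    a, b = find(seen[0]), find(other)
                    if a != b:
                        parent[max(a, b)] = min(a, b)

    # Second pass: one representative label per equivalence class.
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                labels[r][c] = find(labels[r][c])
    return labels


# A U-shaped region: its two arms get labels 1 and 2 until the bottom row
# shows that they belong to the same component.
u_shape = [[1, 0, 1],
           [1, 0, 1],
           [1, 1, 1]]
result = label_components(u_shape)
```

The U-shaped example exercises the equivalence step: the arms are first labelled differently and are merged only when the scan reaches the row that joins them.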


visual mechanism completely using current computer technology. Some features for the block classification will be extracted in a simple way in order to reduce the enormous computation time of visual processing.

3.4.1 The Ratio of the Number of Black Pixels and Black-White Transitions

Several measurements, such as the total number of black pixels in the original image of the block and the number of horizontal black-white transitions in the original image block, are considered in order to distinguish text blocks from graphics image blocks. The ratio between the total number of black pixels and the number of black-white transitions of a block can serve as a feature for block classification; in the tables below it is reported as the number of black-white transitions expressed as a percentage of the number of black pixels. Table 3-2 shows the processing results for documents containing text and graphics images when the documents are scanned at 240 dots per inch (dpi). The processing results for a document scanned at 100 dpi are shown in Table 3-3. In Tables 3-2 and 3-3, the ratio shows different ranges of values for different types of blocks. The ordinary text blocks, and half of the graphics and halftone images in Table 3-2, have a ratio of around 30% when the documents are scanned at 240 dpi, while text blocks with bold faced letters and bigger type have a smaller ratio than that of the ordinary text block. As expected, ratios are higher when documents are scanned at a lower dot resolution.
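The ratio feature can be sketched in a few lines. Two points are assumptions rather than statements from the text: a black-white transition is counted wherever a black pixel is followed by a white pixel in the same row, with the area beyond the right edge treated as white, and the ratio is taken as transitions divided by black pixels times 100, which is the direction that reproduces the values printed in Tables 3-2 and 3-3 (for example 648 transitions over 2139 black pixels gives 30.29). The function names are hypothetical.

```python
def black_pixels(block):
    """Total number of black (1) pixels in a block given as rows of 0/1."""
    return sum(sum(row) for row in block)


def bw_transitions(block):
    """Horizontal black-to-white transitions; the right edge counts as white."""
    count = 0
    for row in block:
        padded = list(row) + [0]
        count += sum(1 for a, b in zip(padded, padded[1:]) if a == 1 and b == 0)
    return count


def transition_ratio(block):
    """Black-white transitions as a percentage of the black pixel count."""
    return 100.0 * bw_transitions(block) / black_pixels(block)


sample = [[1, 1, 0, 1],
          [0, 1, 1, 1]]
ratio = transition_ratio(sample)

# Cross-check against the first row of Table 3-2 (2139 black pixels, 648 transitions).
table_row_ratio = round(100.0 * 648 / 2139, 2)
```

Thin strokes produce many transitions per black pixel and therefore a high ratio, while solid regions such as the horizontal black lines in Table 3-2 produce a very low one.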


Table 3-2. Processing results of ratio between the number of black pixels and black-white transitions.

# of black pixel   B/W Transition   Ratio (%)   Class
2139               648              30.29       text
3417               1012             29.61       text
3580               1040             29.05       text
4112               1183             28.77       text
1780               528              29.66       text
3544               1038             29.29       text
2911               666              22.88       text
4046               1205             29.78       text
3427               968              28.25       text
4751               735              15.47       text
3406               996              29.24       text
434                123              28.34       text
770                226              29.35       text
2469               580              23.49       text
3310               989              29.88       text
4436               1155             26.04       text
2251               758              33.67       text
4167               1256             30.14       text
3800               1183             31.13       text
1142               396              34.68       text
5394               72               1.33        H. black lines
5608               170              3.03        H. black lines
47742              8738             18.30       graphics, halftone
175114             51527            29.42       graphics, halftone
195328             59906            30.67       graphics, halftone
63103              12598            19.96       graphics, halftone

(scanned at 240 dpi)


Table 3-3. Processing results of ratio between the number of black pixels and black-white transitions.

# of black pixel   B/W Transition   Ratio (%)   Class
517                182              35.20       text
366                209              57.10       text
412                207              50.24       text
375                227              60.53       text
408                232              56.86       text
352                209              59.37       text
383                230              60.05       text
416                206              49.52       text
390                217              55.64       text
449                232              51.67       text
482                231              47.93       text
469                217              46.27       text
471                217              46.07       text
424                215              50.71       text
400                220              55.00       text
454                231              50.88       text
320                190              59.38       text
5818               304              5.23        trademark
387                239              61.76       text
372                234              62.90       text
466                286              61.37       text
2136               1036             48.50       text
482                1219             39.54       line drawing
66                 33               50.00       text
428                225              52.57       text
454                221              48.46       text

(scanned at 100 dpi)


3.4.2 Removal of Lines

The removal of line segments relies on being able to detect the line segments. Various methods for detecting line segments, such as tracking the medial line of thinned images and finding the longest vector, have been reported in the literature [Hil69]. However, these bottom-up approaches are considered too time consuming. An approach that removes line segments as a top-down process can reduce the processing time. Global operators are applied to each logical 1, which represents a black pixel, and may result in the removal of the line segments called pure line segments. A pure line segment is a line segment which does not represent any other element but only itself. In order to remove the line segments, we apply the operator which is defined as follows:

Definition 6. Let the boundary black pixel (BBP[x,y]) be the black pixel at location x and y with up to 7 black pixel neighbors. The operator $B_1$(p) eliminates the BBP from the image data.

The removal of line segments is described as two different schemes according to the length of the line segment. The removal of all the line segments, including characters, can be accomplished by applying the operator $B_1$, while the removal of pure line segments can be done as follows. The linear expansion operator $OP_1$ is applied to each black pixel until nearby line segments conglomerate. Then the operator $B_1$, which removes the boundary pixels, is applied to each black pixel until the line segments that have not conglomerated disappear.
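The boundary removal operator of Definition 6 can be sketched as follows. The sketch assumes the image is a list of rows of 0/1 values, treats everything outside the image as white (so pixels on the image border always count as boundary pixels), and removes all boundary black pixels of the input in one simultaneous pass. The name `remove_boundary` is hypothetical.

```python
def remove_boundary(image):
    """Apply B1 once: keep only black pixels whose 8 neighbors are all black."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            if image[x][y] != 1:
                continue
            black_neighbors = sum(
                image[x + dx][y + dy]
                for dx in (-1, 0, 1)
                for dy in (-1, 0, 1)
                if (dx or dy) and 0 <= x + dx < rows and 0 <= y + dy < cols
            )
            # A boundary black pixel has at most 7 black neighbors and is dropped.
            if black_neighbors == 8:
                out[x][y] = 1
    return out


def count_black(image):
    return sum(sum(row) for row in image)


# A one-pixel-thick line vanishes after a single pass.
thin_line = [[0, 0, 0, 0, 0],
             [1, 1, 1, 1, 1],
             [0, 0, 0, 0, 0]]
after_line = remove_boundary(thin_line)

# A solid 5 x 5 blob shrinks by one pixel on every side per pass.
blob = [[1] * 5 for _ in range(5)]
once = remove_boundary(blob)
twice = remove_boundary(once)
```

Thin strokes disappear after one or two passes while blob-like regions survive longer, which is the property the classification procedure relies on when it tests whether repeated applications of the operator leave any black pixels.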


3.4.3 The Procedures of Block Classification

Since the document image is digitized with skew in most cases, a classification rule which is invariant to skew is desirable for the classification of the blocks. The procedures for block classification of documents can be informally described as follows.

1. Estimate the total number of black pixels.
2. Count the black-white transitions (B/W TC) of the block.
3. Estimate the ratio between the total number of black pixels and the black-white transitions (R).
4. Remove all the line segments, then measure the total number of black pixels. If the value is almost the same as before, the block is a symbol; otherwise it is a picture with blob image.
5. When the documents are scanned at 100 dpi (S = 100), both text and some of the complex line drawings satisfy the following: $\sum_i \sum_j B_1[B_1[D[i,j]]] = 0$ and $20 \times (240/S)^{1/2} < R < 40 \times (240/S)^{1/2}$, where S is the resolution of the scanned document in dots per inch.
6. If $\sum_i \sum_j B_1[B_1[D[i,j]]] \neq 0$ and B/W TC is greater than half the number of step 2, then estimate the number of black pixels after $\sum_i \sum_j B_1[B_1[B_1[D[i,j]]]]$.
7. If the number of black pixels equals 0, then the block is text with large characters.
8. Remove the pure line segments from the original image of the block and then measure the B/W transitions. If the value R is the same as the original value, the block is text; otherwise it is a line drawing.


9. If $R < 20 \times (240/S)^{1/2}$ or $R > 40 \times (240/S)^{1/2}$, and $\sum_i \sum_j B_1[B_1[D[i,j]]] = 0$, then the block is a line drawing.

3.5 Experiments

3.5.1 Experimental Images and Facilities

In this dissertation, we have selected page images with texts and a few nontexts such as a designed trademark and a circuit line drawing. The designed trademark and the circuit line drawing are clearly separated from text blocks by wide enough white space. The methods for block segmentation and classification are intended to be robust against poor placement of documents on the scanner. The page images, therefore, are scanned with various skew angles. Figure 3-2 illustrates the hardware structure of the SUN Workstation based intelligent text processing system in which the algorithms described in the previous sections are implemented. The EasyScan image scanning system is used as a page reader for the document analysis system. This system (Microtex scanner version) is hooked up to a SPARCstation 1+ SUN Workstation, and it contains a Microtex grayscale or color scanner, a scanner driver for Sun SPARCstations, and the EasyScan scanning utility software. DigitalPhoto, used as the scanning utility, combines advanced tools for scanning with advanced tools to create, capture, and process grayscale and color images to achieve a variety of visual effects. The EasyScan image scanning system can scan 8-bit grayscale images, 24-bit true color images, and 1-bit line art at up to 400 or 600 dpi. The EasyScan software provides intelligent scanning functions


53 Figure 3-2. Page image acquisition and processing system.


54 to prescan, adjust brightness and contrast, sharpen, and color correct. The SUN Workstation monitor can display the image on the screen through image display software and the images displayed on the screen can be printed out through a Postscript laser printer (Apple LaserWriter II). 3.5.2 Experimental Results Two sample pages for automatic block segmentation and block classification have been selected. The first sample page is composed of texts, a trademark, and a circuit line drawing; the second sample page has several text blocks within a relatively small area. These sample pages are shown in Figure 3-3. The segmented block from the experiment is illustrated in Figure 3-4. The second page has been selected with several blocks to show that the algorithm can segment the blocks despite the skew. It was scanned from several different angles. Figure 3-5 (a) ~ (d) shows the results of the robust block segmentation approach. The new approach for block segmentation shows satisfying results despite the skewed document images. This approach also generated bigger blocks than the run length smoothing algorithm. The block size of the run length smoothing algorithm is a text line of digitized document image. The ratio of the total number of black pixels and black-white transitions for the second page have been measured for the skewed pages images. As expected, the ratios of each block were affected little by the skew. Table 3-4 shows the experimental result for the ratio difference.


[Page image: Sample page # 1]

56 Figure 3-4. The segmented block of the first sample page.


57 (a) Figure 3-5. The processing result of the second page, (a) without skew, (b) skewed by 1.2°. (c) skewed by 6.5°. (d) skewed by 23.0°.


58 (b) Figure 3-5 ( Continued )


59 (c) Figure 3-5 ( Continued )


60 Figure 3-5 ( Continued )


Table 3-4. The ratio between the number of black pixels and black-white transitions for the document skewed at various angles.

Skew angle   Block   # of B Pixels   B/W Trans.   Ratio (%)
0°           # 1     5856            3161         53.98
             # 2     4710            2656         56.39
             # 3     6338            3361         53.03
             # 4     4392            2463         56.08
1.2°         # 1     5667            3132         55.27
             # 2     4497            2637         58.64
             # 3     6050            3367         55.65
             # 4     4222            2475         58.62
6.5°         # 1     5443            3000         55.11
             # 2     4520            2596         57.43
             # 3     6355            3549         55.85
             # 4     4652            2581         55.48
23.0°        # 1     6065            3179         52.42
             # 2     4751            2714         57.12
             # 3     6488            3549         54.70
             # 4     4448            2537         57.04

< scanned at 100 dpi >
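As a worked check of the ratios in Table 3-4, the sketch below computes the text band of step 5 in Section 3.4.3, read here as 20 × (240/S)^(1/2) < R < 40 × (240/S)^(1/2), and tests whether every measured ratio falls inside it at S = 100 dpi. The function name and its default arguments are illustrative assumptions; the ratios are copied from the table.

```python
import math


def text_ratio_band(dpi, low=20.0, high=40.0, reference_dpi=240):
    """Scale the 240 dpi band [low, high] by sqrt(reference_dpi / dpi)."""
    scale = math.sqrt(reference_dpi / dpi)
    return low * scale, high * scale


# Ratio (%) column of Table 3-4, skew angles 0, 1.2, 6.5 and 23.0 degrees.
ratios_100dpi = [53.98, 56.39, 53.03, 56.08,
                 55.27, 58.64, 55.65, 58.62,
                 55.11, 57.43, 55.85, 55.48,
                 52.42, 57.12, 54.70, 57.04]

band_low, band_high = text_ratio_band(100)
all_in_band = all(band_low < r < band_high for r in ratios_100dpi)
spread = max(ratios_100dpi) - min(ratios_100dpi)
```

At 100 dpi the band works out to roughly 31 to 62, and all sixteen ratios lie inside it with a spread of about six percentage points across skew angles from 0° to 23°.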


3.5.3 Analysis for the Block Segmentation Approaches

The complexity of an algorithm can be measured by either the time required to execute it on a problem of size n (time complexity) or the memory space required for its execution (space complexity). The main concern in the analysis of most algorithms is the time complexity for relatively large values of n. Like most image processing work, block segmentation of image data in documents consists of operations on pixels, and the number of pixels in a document image is large. The complexity of previously published segmentation algorithms is mostly O(n²). Unfortunately, these bounds are not enough to evaluate those algorithms, including the approach presented here. In order to evaluate these algorithms, a detailed time complexity function is required. However, finding the exact complexity function for a general algorithm is almost impossible. The time complexity function for the RLSA is O(n²), and it can be estimated as follows. Let $p_1$ be the probability of a black pixel in the document, $p_2$ the probability that black pixels are to be merged, $t_r$ the time required for reading a pixel, $t_m$ the time required for merging two black pixels into a continuous stream of black pixels, and $t_s$ the time required for setting a numeric value to a variable. Then the total time required for the RLSA is estimated as

$$\left(4 t_r + 2 t_{ck1} + 2 p_1 \left(t_{ck2} + p_2 t_m + (1 - p_2) t_s\right) + t_{AND}\right) n^2,$$

where $t_{ck1}$ and $t_{ck2}$ are the times required for checking if-conditions and $t_{AND}$ is the time required for applying a logical AND operation to each pixel in the algorithm.

PAGE 70

63 The time complexity function for the RXYC is also 0(n 2 ). However, it is more dependent upon the image data than RLSA. Let p 3 be the probability of the necessity to read the next pixel, and r be the number or recursion required, and p 4 be the probability of the possibility for block segmentation, and t seg be the time required for finding out segment of block, and t^ be the time required for checking whether segmentation is possible or not. Then total time required for the RXYC can be roughly estimated as 2(t ck3 + p 3 t r + p 3 t ck4 + nt^ + (1 q)n 2 + 2p 4 t seg n + 2(t ck3 + P 3 t ck 4 + nt pos + (1 ri)t s )(n y)(n n) + ... • This can be rewritten as 2(1 + r)(t ck3 + P 3 t r + PsW + ntpos + (1 h)t s )n 2 + ... , where t ck3 and t ck4 are the times required for checking if-conditions in the algorithm. The algorithm using the Hough transform is invariant to skew; however, the time complexity function is 0(n 3 ). This algorithm needs to generate connected components and find out centroids for each connected component and then must apply the Hough transform to each centroid of each connected component. Let a be the number of lines passing through a common point. Then, the roughly estimated time complexity function is ((t g + t cent + b lough a )n)n = t Hough ^ * tt 3 + ( l g + bent) 112 ' where a = a'n and t g is the time required for generating connected components and t cent is the time required to find the centroid and t Hough is the time required for transforming lines passing through a point in Cartesian coordinate space to a point in the polar coordinate. The time complexity function for the proposed approach is also 0(n 2 ) and is dependent upon the operation to connect each black pixel. Relatively exact time complexity can be estimated as (t r + t ckl + p,t opl )n 2 , where t opl is the time required

PAGE 71

for execution of the operation which connects pixels within a certain distance.

The method using the Hough transform is $O(n^3)$, unlike the other approaches. This means that it is slower than any of the other algorithms whenever n is large. The remaining three algorithms cannot be ranked directly from the estimates above, because they all have the same $O(n^2)$ growth. In order to compare algorithms with the same big O, we have to estimate the coefficient of the highest-degree term in n. The RLSA is known as a fast algorithm, so we compare the RLSA with the proposed approach. Roughly estimated, $p_1$ is less than 0.1 in a text area, although it depends heavily on the type of image data in the document. Assuming that $p_2$ is around 0.5 for most document image data, the proposed approach can be considered faster than the RLSA as long as $t_{op1}$ is almost the same as $t_m$: with $p_2 = 0.5$ and $t_{op1} = t_m$, the RLSA coefficient exceeds that of the proposed approach by $3t_r + t_{ck1} + 2p_1 t_{ck2} + p_1 t_s + t_{AND}$, which is positive. The average-case input may be a good basis for comparison, but it is sometimes very hard to measure effectively, and in general it is not clear what an average input is. The worst-case input is very useful in some cases. However, it is also not easy to find the worst-case input for all the approaches.

3.6 Page-Structure Analysis

Page-structure analysis is the process of converting a page representation to an abstract representation. This process, considered as a higher level of document understanding, attempts to determine the overall block structure of a page. The abstract representation is the specification of the words and diagrams that make up

PAGE 72

the printed document and of how the pieces of content are to be fitted together into a whole. The process of converting the abstract representation into a physical representation of the document is called formatting. The physical representation may be oriented to a specific output device. The physical representation of the document is then converted into a page representation, that is, a representation in the format expected by a specific device. Page-structure analysis is, in some sense, the inverse problem of interpreting format-control commands. It attempts to find or understand the control commands that could have been used for laying out the document image.

Document structure varies from one type of document to another. It is neither practical nor easy to develop a general system that analyzes all types of documents automatically. Each document-analysis system has to focus on the chosen types of document that are most often needed by its application. There is no generally agreed upon ideal model for representing document structure. It should be noted that a human can recognize the structure of a page immediately after it is displayed. For a document-analysis system to cope with general classes of documents, a provision for interactive page structure identification by the user will be extremely useful and powerful. Some document structures will be discussed in the following section.

3.6.1 Office Document Architecture (ODA) and its Structure

The Office Document Architecture provides a hierarchical and object-oriented document model. A document is best thought of as a tree, where the structure is defined by the shape of the tree and the content is stored entirely in the leaves of the

PAGE 73

tree. The ODA document is described by a logical structure and a layout structure. The logical structure divides and subdivides the document into items that mean something to the human author or reader, while the layout structure divides and subdivides a visible representation of the document into rectangular areas. Logical objects represent general items like titles and paragraphs, and layout objects represent sets of rectangular areas within pages. The item common to both structures is the content, which provides the link between them as shown in Figure 3-6. To illustrate the structures we shall use a simple technical document divided into Parts and Sections. Initially we shall assume that each Part has a title followed by one or more Sections, and that each Section in turn has a subtitle followed by a series of one or more paragraphs.

Figure 3-6. Logical and layout structure.
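The two-tree view can be illustrated with a small sketch in which a logical tree and a layout tree both point at the same content portions through their leaves. The class name `Node`, the helper `leaf_content`, and the content identifiers `c1` to `c4` are hypothetical and serve only this illustration; they are not part of the ODA definition.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An object in either the logical or the layout structure."""
    name: str
    children: list = field(default_factory=list)
    content: list = field(default_factory=list)  # content-portion ids, leaves only


def leaf_content(node):
    """Collect content-portion ids from the leaves, left to right."""
    if not node.children:
        return list(node.content)
    portions = []
    for child in node.children:
        portions.extend(leaf_content(child))
    return portions


# Logical view: a Part with a title and one Section (subtitle + paragraphs).
logical = Node("Part", [
    Node("Title", content=["c1"]),
    Node("Section", [
        Node("Subtitle", content=["c2"]),
        Node("Paragraph", content=["c3"]),
        Node("Paragraph", content=["c4"]),
    ]),
])

# Layout view: the same content portions placed in rectangular blocks.
layout = Node("Page set", [
    Node("Page", [
        Node("Block", content=["c1"]),
        Node("Block", content=["c2"]),
        Node("Block", content=["c3"]),
        Node("Block", content=["c4"]),
    ]),
])

# The content is the link between the two structures.
shared = leaf_content(logical) == leaf_content(layout)
print(shared)
```

In this simple case every logical leaf maps to exactly one block; a paragraph split across two blocks would be modelled by giving it two content portions.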

PAGE 74

The logical structure shows that the fragment consists of the Part title and the beginning of the first Section, including the subtitle and paragraphs, in an actual layout on pages shown in Figure 3-7. The layout structure shows that there are four blocks for the first (left) page and two blocks for the second page.

Figure 3-7. An example of actual layout on pages.

Only the leaves of the tree structure, represented as blocks, have contents associated with both structures. The content of a leaf of the logical structure frequently corresponds to the content of a block. This gives the neat one-to-one correspondence between the leaves of the logical and layout structures shown in Figure 3-6. However, when a paragraph is split into two portions that are associated with two separate blocks belonging to two different pages, the one-to-one correspondence between the two structures does not exist. Alternatively, the content portions belonging to several logical objects may be run together into a single layout block.

3.6.2 The Structures for Office Document Architecture

The Office Document Architecture consists of two sets of object class

PAGE 75

descriptions, one for logical objects and one for layout objects. These sets of descriptions define the types and combinations of objects. The qualifiers concerning the occurrence of a subordinate object are optional, required, repetitive, or optional and repetitive. A group of subordinate objects forms a sequence, an aggregate, or a choice. One generic logical structure for a Part in the document is defined as shown in Figure 3-8. Each object is assumed to be required unless shown otherwise, so this indicates that a Part begins with a required title, followed optionally by an author's name, followed by one or more Sections.

Figure 3-8. Generic logical structure.
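The occurrence qualifiers for a Part (required title, optional author's name, one or more Sections, in sequence) behave like a small grammar. The sketch below checks a list of subordinate object types against that rule; the function name `is_valid_part` and the type labels `"title"`, `"author"` and `"section"` are assumptions made for this example.

```python
def is_valid_part(children):
    """Check a Part: required title, optional author, one or more sections."""
    i = 0
    # Required: the sequence must start with a title.
    if i >= len(children) or children[i] != "title":
        return False
    i += 1
    # Optional: an author's name may follow the title.
    if i < len(children) and children[i] == "author":
        i += 1
    # Repetitive and required: at least one section, and nothing else.
    rest = children[i:]
    return len(rest) >= 1 and all(kind == "section" for kind in rest)


print(is_valid_part(["title", "author", "section", "section"]))
print(is_valid_part(["title"]))
```

The first call satisfies the rule, while the second fails because a Part needs at least one Section.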

PAGE 76

Each Section begins with a required subtitle and then consists of a mixture of paragraphs, diagrams and lists. Repetitive and Choice represents a series of one or more items occurring in any order. Lists consist of one or more list items, while diagrams consist of a picture above the caption or a caption above the picture. A simple specific instance of the page set with a single continuation page is shown in Figure 3-9.

Figure 3-9. Specific instance of "part page set". (Labels in the figure: Title Page, Continuation Page, Body Frame (Body), Continuation Body Frame.)

Several different views of a logical ODA document can be obtained by altering the generic layout structure and/or the sets of the presentation and layout styles. As a simple instance, deleting the "Body frame" from the "Title page" in Figure 3-9 would cause each Part of the document to be laid out with only the Part title and

PAGE 77

author's name on the first page. Because there would be no frame on the first page with "Body" as its permitted category, the first Section would have to start in a "Continuation body frame" on a subsequent page. Altering the attributes that make up the presentation and layout styles can produce more radical changes. Although these attributes refer to logical objects, they are held separately from the main logical structure. This leads to a more concise representation of the document. The layout styles include the important layout object class and layout category attributes. Significant changes to the positioning and ordering of items can be made by changing these attributes. The presentation styles are used to guide the lower-level content layout process and thus affect the appearance of the content within the blocks. They contain different attributes for different content architectures. For character content, for example, they include attributes affecting the font and size of characters, the distance between lines and the indentation of the first line. Changing both the generic layout structure and the styles can lead to significantly different views of the same logical document. Page and margin sizes can vary, single or double column output can be used, and paragraph and font details can be changed.

3.7 Summary

In this dissertation, we mainly discussed the block segmentation and classification stages of the document analysis system. The document image data is segmented into relatively large blocks separated by white space. The skewed

PAGE 78

document image data can not only cause the document to be separated inappropriately but also induce misclassification of the blocks in most of the previously published works. This chapter described the development and implementation of the new algorithm for automated separation and analysis of text, graphics, and halftone images. The new algorithm for block segmentation connects components in order to divide the document into blocks separated by the white space used for page layout. In connecting components, the proposed algorithm applies to each black pixel an operation which generates a linearly expanded contour of each component. This algorithm has a time complexity of $O(N^2)$ like most previous top-down approaches for block segmentation; however, it is invariant to skew and is faster than most previous approaches for most inputs even though they have the same $O(N^2)$ time complexity.

As the next stage after block segmentation, each block should be classified according to its contents, and the block classification rules should not be restricted to error-free document image data. The proposed method is insensitive to skew and is far superior to published methods, which are seriously impaired by a skew of even a few degrees. Unlike previous classification rules, this algorithm uses characteristics which are insensitive to document skew. Several measurements are considered in developing a block classification rule; for instance, a text block is composed of text lines and each text line consists of a variety of characters. The basic components of characters are line segments which have uniform width. On applying the proposed pixel operation, the variation of boundary pixels of total black image pixels can, in every instance, distinguish the line segments from the

PAGE 79

blob segments. For the separation of text and complex fine line drawings, the total number of neighbors of each black pixel is closely related to compactness. Features such as the probability of occurrence of black pixels in a block, the number of black pixels after applying consecutive operations, the black-white transitions, and the total number of black neighbors of each black pixel in the original image are used for the new classification.
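Three of the features listed above can be computed directly from a binary block, as in the following sketch. It rests on assumptions that this chapter summary does not fix: a block is a list of rows of 0/1 values, a black-white transition is counted whenever a black pixel is immediately followed by a white pixel along a row, and neighbors are the eight surrounding pixels. The function name `block_features` and the dictionary keys are illustrative.

```python
def block_features(block):
    """Compute simple skew-tolerant statistics for a binary block (1 = black)."""
    rows = len(block)
    cols = len(block[0]) if rows else 0
    black = sum(sum(row) for row in block)

    # Assumed definition: a black pixel followed by a white pixel in the same row.
    transitions = sum(1 for row in block
                      for a, b in zip(row, row[1:]) if a == 1 and b == 0)

    # Total number of black 8-neighbours summed over all black pixels.
    neighbours = 0
    for y in range(rows):
        for x in range(cols):
            if block[y][x] != 1:
                continue
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    inside = 0 <= y + dy < rows and 0 <= x + dx < cols
                    if (dy or dx) and inside:
                        neighbours += block[y + dy][x + dx]

    area = rows * cols
    return {
        "black_probability": black / area if area else 0.0,
        "transition_ratio": transitions / black if black else 0.0,
        "black_neighbours": neighbours,
    }


square = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
print(block_features(square))
```

A compact blob yields many black neighbours per black pixel and few transitions, whereas thin strokes yield the opposite, which is the intuition behind using these counts to separate text and line drawings from halftone regions.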

PAGE 80

CHAPTER 4
RECOGNITION OF THE CLASSIFIED BLOCK

4.1 Introduction

Designing a system that not only recognizes the text blocks but also understands the nontext blocks requires advanced analysis and synthesis technologies. These technologies are of two types: technology for text recognition and technology for nontext understanding. Much work has been done on text recognition, known as character recognition, since the late 1950s. The results are categorized in several sub-areas. Character recognition is broadly classified as on-line or off-line. On-line character recognition makes use of the order of strokes made by the writer, whereas the off-line case treats the completed characters written on a sheet or on some other material. On-line character recognition deals with a one-dimensional representation of the input, whereas the off-line case involves analysis of a two-dimensional image. In the on-line case, order information is obtained by writing on an electronic bit pad, which stores the two-dimensional coordinates of successive points in order; this information eases the recognition problem compared to off-line character recognition. Only off-line character recognition is described in this chapter, since our document analysis system deals with paper-based documents.

PAGE 81

Designing such a complete system for understanding two-dimensional page images with printed characters and graphics consists of integrating work from several problem areas. Selection of an appropriate type of feature is the most important step in designing the system. Two types of features, global [Bal82, Per77] and structural or local [Hua86a, Cox82], are considered as prospective features for developing character recognition systems.

4.2 Recognition of Text

Text recognition, viewed as the recognition of a set of characters, requires some preliminary processing, such as word segmentation and character segmentation, before features can be extracted. The block diagram of a typical character reader is shown in Figure 4-1. Each character is read and digitized by an optical scanner. Each character is located and segmented under software control of the computer. The resulting matrix is then fed into a preprocessor for further processing steps.

As an early stage in the recognition of text, word segmentation is performed to separate each word. A word is usually a combination of characters lying on a straight text line, and the distance between words in the text line is longer than the distance between characters within a word; this textual knowledge provides some information for separating the words. The generation of eight-connected components is used to group together black pixels which are eight-connected to one another. The eight-connected pixels belonging to individual characters or graphics are enclosed in circumscribing rectangles. Each rectangle corresponds to a single connected

PAGE 82

Figure 4-1. The block diagram of a typical character reader. (Blocks in the figure: Digitized Documents; Character Matrix Location and Segmentation; Preprocessor; Smoothing & Noise Elimination; Matching; Identity of Character.)

PAGE 83

component. Applying eight-connectedness automatically segments the characters in a word, with the exception of "i" and "j" and some marks such as ";". However, this method works only for alphabetical characters. A global operator which connects the characters within a certain distance can be used to separate the words.

For closely spaced printed characters, segmentation can be combined with classification by means of an adaptive decision tree [Cas82]. The pattern array to be resolved is viewed by the classifier through a window. A supervisory routine controls the window's width and location. The window is initially set at the full width of the pattern so that if the array contains a single character, the classifier can recognize it in one step. When the classifier rejects the pattern, the viewing window is narrowed from the right-hand side and the classifier is applied to the truncated array. A rejection by the classifier indicates that the full array does not belong to the alphabet. This process is repeated until either the truncated array is successfully recognized or the window becomes so narrow that the search is given up. A window narrowing operation is attempted in both directions when the search fails. The segmentation terminates successfully if the residual array after a positive classification is either null or narrower than any character in the alphabet.

In the following section, only the recognition of machine-printed characters in Latin fonts will be described. Current objectives in the document-analysis system are mainly to read machine-printed characters, although the techniques to be described

PAGE 84

are capable of recognizing handwritten and script characters with accuracy and speed. Recognition of oriental characters such as Chinese, Korean, and Japanese will not be considered because these characters involve large alphabets.

4.2.1 The Selection of Feature in Character Recognition

The selection of features is the most important step in making machine reading resemble human reading. Two types of features, global and structural or local, are customarily used for automatic character recognition. Techniques such as (1) template matching or (2) various mathematical transformations treat the character matrix as a whole and select global features from it. On the other hand, structural or local features are based on geometrical and topological properties of the characters. These features include interesting points and subpieces.

Of the approaches with global features, template matching is a well known pattern matching process [Bal82]. This technique simply measures the similarity between the input character and the stored references by matching points in the frame. A conventional template matcher calculates the dissimilarity between a pair of patterns by counting, using Exclusive OR, the number of picture elements (pixels) at which the two patterns differ. The Exclusive OR error is defined as

$$E = \sum_{x}\sum_{y} A(x,y) \oplus B(x,y),$$

where $A(x,y)$ and $B(x,y)$ represent the picture elements at location $(x,y)$ and $\oplus$ denotes logical Exclusive OR.

A major shortcoming of the conventional template matcher above is that it treats all errors alike regardless of where they occur spatially. In Figure 4-2, pattern

PAGE 85

A and pattern B are different characters, while pattern A and pattern B in Figure 4-3 are the same character. The Exclusive OR error in Figure 4-3 should be less than the Exclusive OR error in Figure 4-2 in order for recognition to succeed. However, the Exclusive OR count for the different characters (26) is less than for the same character (38). To remedy this drawback, a weighted Exclusive OR error can be utilized. In the example, we used a 3 x 3 window to get the weighted Exclusive OR count. The weighted Exclusive OR count for the same character (106) is less than for the different characters (144).

The drawback of template matching is its high dimensionality and its sensitivity to translation, rotation, and scaling. The high dimensionality of the character feature vectors in template matching requires large storage and long computation time. Several orthogonal transformations have been explored as possible feature extractors in order to reduce the high dimensionality of template matching.

The Fourier descriptor, one of the transformational approaches, was proposed to reduce the high dimensionality of template matching and to extract features invariant to global deformation [Per77]. Other transformations explored for this purpose include Walsh [And71] and Haar and Hadamard [Wen78]. Zahn and Roskies applied the Fourier series to describe plane closed curves [Zah72]. The basic idea of Fourier descriptors is that a closed curve can be represented by a periodic function of a continuous parameter, or alternatively, by a set of Fourier coefficients of this function. The coefficients in this collection are referred to as Fourier descriptors.
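The conventional and weighted Exclusive OR errors can be sketched as follows. The weighting rule is an assumption, since the text only states that a 3 x 3 window is used: here each mismatching pixel is weighted by the number of mismatching pixels inside its 3 x 3 window (itself included), so clustered errors cost more than isolated ones. The function names are illustrative.

```python
def xor_error(a, b):
    """Conventional template-matching error: number of differing pixels."""
    return sum(pa ^ pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def weighted_xor_error(a, b):
    """Weighted error: each mismatch counts the mismatches in its 3 x 3 window.

    Assumption: the weight is the number of mismatching pixels in the
    3 x 3 neighbourhood centred on the mismatch, including the centre.
    """
    diff = [[pa ^ pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]
    rows = len(diff)
    cols = len(diff[0]) if rows else 0
    total = 0
    for y in range(rows):
        for x in range(cols):
            if diff[y][x] == 0:
                continue
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if 0 <= y + dy < rows and 0 <= x + dx < cols:
                        total += diff[y + dy][x + dx]
    return total


blank = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
scattered = [[1, 0, 0], [0, 0, 0], [0, 0, 1]]
clustered = [[1, 1, 0], [0, 0, 0], [0, 0, 0]]
print(xor_error(scattered, blank), weighted_xor_error(scattered, blank))
print(xor_error(clustered, blank), weighted_xor_error(clustered, blank))
```

Both test patterns differ from the blank template in two pixels, so the plain count cannot tell them apart, but the weighted count is larger for the clustered mismatch. This is the property that lets the weighted error rank thin, scattered stroke differences below a concentrated structural difference.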

PAGE 86

Figure 4-2. The template matcher for different character. (Bit-maps of PATTERN A and PATTERN B with the 3 x 3 window weights; #(A ⊕ B) = 26, weighted #(A ⊕ B) = 144.)

PAGE 87

Figure 4-3. The template matcher for same character. (Bit-maps of PATTERN A and PATTERN B with the 3 x 3 window weights; #(A ⊕ B) = 38, weighted #(A ⊕ B) = 106.)

PAGE 88

In order to use these descriptors for pattern classification applications, the curve representation must be normalized with respect to a desired transformation class. The Fourier descriptors are defined as follows. The function $\theta(l)$ is the angular direction of the closed curve $\gamma$ at arc length $l$, and $L$ is the total length of the curve. The function $\phi(l)$ is the net amount of angular bend between the starting point $l = 0$ and the point $l$. So $\phi(l)$ is represented as $\phi(l) = \theta(l) - \theta(0)$, except for possible multiples of $2\pi$, and $\phi(L) = -2\pi$. Since the function $\phi(l)$ contains absolute size information, we need to normalize it to make it a periodic function. The function $\phi(l)$ for character H is shown in Figure 4-4(a). Figure 4-4(b) shows the normalized function, defined as $\phi^*(t)$. We define $\phi^*(t)$ as

$$\phi^*(t) = \phi\!\left(\frac{Lt}{2\pi}\right) + t.$$

The normalized function $\phi^*(t)$ is a periodic function, so we can expand it in its Fourier series as follows:

$$\phi^*(t) = \mu_0 + \sum_{k=1}^{\infty} A_k \cos(kt - \alpha_k). \qquad (4\text{-}1)$$

Then the set $\{A_k, \alpha_k;\ k = 1, 2, \ldots, \infty\}$ are the Fourier descriptors for curve $\gamma$. The main drawbacks of global techniques are their dependence on position alignment and their high sensitivity to distortion and style variation.

4.2.2 Geometrical and Topological Properties in Character Recognition

Geometrical and topological features for character recognition are based on the extraction of features which describe the interesting geometry or topology of the character as a drawing. These features may represent global and local properties of the character. This is by far the most popular technique from the viewpoint of

PAGE 89

Figure 4-4. The Fourier descriptor and normalized function.

PAGE 90

sensitivity and proficiency of implementation. The sensitivity to the deformation of a character image is caused by several of the following factors:

a) font variation: the use of a different font to represent the same character.
b) rotation: change in orientation.
c) translation: movement of the whole character.
d) noise, which causes disconnected line segments, filled loops, etc.

The proficiency of implementation is evaluated with respect to speed, complexity, independence, and automatic mask making. The main advantages of geometrical and topological features are their high tolerance to font variation and noise compared to other techniques. Geometrical and topological features, as structural or local features, are applied mainly to the recognition of handwritten characters, which have much style variation.

The letters for local features are considered as abstract letters which represent only the main aspects. The abstract letters do not include the character embellishments which physical characters have. The partial abstract representation of letters, termed a skeleton, contains as its basic elements vertices, the spatial ordering between vertices (called edges), and relationships between vertices and edges. The vertex, denoted by the symbol shown in Figure 4-5, is taken as primitive, whereas edges specify the spatial ordering which exists between vertices. The relationships between vertices and edges represent the interesting aspects which are shown in Figure 4-6.

These interesting points are divided into several categories: endpoint, forkpoint, crosspoint, and breakpoint. Each interesting point has its own number of

PAGE 91

branches; for example, a crosspoint has four branches and a forkpoint has three branches. The interesting points above can partition the characters into subpieces. The start and the end points of each subpiece are found and their corresponding categories are recorded. The orientation and the length of each subpiece are also recorded, and the length is normalized by the character length after tracing the whole character.

Figure 4-5. Vertex and edge.

Figure 4-6. The relationship between vertices and edge as an interesting aspect.
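The branch counts mentioned above suggest a simple way to label interesting points on a one-pixel-wide skeleton. The sketch below is one possible realization under stated assumptions, not the procedure used in this work: the number of branches at a skeleton pixel is estimated by the crossing number, that is, the number of white-to-black changes met while walking once around its eight neighbours. Breakpoints are not handled because the text gives no branch count for them, and the function names are hypothetical.

```python
# Eight neighbours in clockwise order starting from north: (row, column) offsets.
RING = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]


def branch_count(skeleton, y, x):
    """Crossing number: white-to-black changes around the 8-neighbour ring."""
    rows, cols = len(skeleton), len(skeleton[0])
    ring = []
    for dy, dx in RING:
        ny, nx = y + dy, x + dx
        inside = 0 <= ny < rows and 0 <= nx < cols
        ring.append(skeleton[ny][nx] if inside else 0)
    return sum(1 for i in range(8) if ring[i] == 0 and ring[(i + 1) % 8] == 1)


def interesting_points(skeleton):
    """Map (row, column) to 'endpoint', 'forkpoint' or 'crosspoint'."""
    names = {1: "endpoint", 3: "forkpoint", 4: "crosspoint"}
    found = {}
    for y, row in enumerate(skeleton):
        for x, pixel in enumerate(row):
            if pixel == 1:
                label = names.get(branch_count(skeleton, y, x))
                if label:
                    found[(y, x)] = label
    return found


plus = [[0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0]]
print(interesting_points(plus))
```

For the plus-shaped skeleton the centre is reported as a crosspoint and the four tips as endpoints; pixels along the strokes have two branches and are left unlabelled, so they belong to the subpieces between interesting points.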

PAGE 92

4.2.3 Word Recognition by Contextual Information

The performance of a character recognition system cannot be based on single-character recognition alone. Contextual information can increase recognition accuracy, just as humans rely on contextual information in reading text [Ehr75, Dos77]. The input of a contextual word recognition system consists of a string of characters which have been recognized by the character recognizer. The application of context makes it possible to detect or even to correct recognition errors. In a contextual word recognition system, the character recognizer module assigns to each input character 26 numbers showing the confidences that the input character has the labels a to z. The confidences are then transformed to probabilities. If a character has two or more labels with nonzero probabilities, it is difficult to determine which label is the correct one. A string of characters delimited by spaces constitutes a word. Since each character in a word may have a set of alternative labels, the output of the character recognizer is actually a sequence of sets called substitution sets, each of which contains the alternatives with nonzero probability for a particular character. All possible words are obtained by selecting one character from each of the substitution sets. It is obvious that only one of the words that can be formed from the substitution sets is the correct word. The problem in contextual word recognition is to determine the combination of labels $\theta_1, \theta_2, \ldots, \theta_n$ that maximizes the a posteriori probability $p(\theta_1, \theta_2, \ldots, \theta_n \mid X_1 \ldots X_n)$ for a word of length n, where the word $X_1 X_2 \ldots X_n$ is the output of the character recognizer. The probability that the true label of character $X_i$ is $\theta_i$ is expressed as $p(\theta_i)$.
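The search over substitution sets can be illustrated with a short sketch. It makes two simplifying assumptions that go beyond the text: context is supplied as a plain set of admissible words (`lexicon`), and the a posteriori probability of a candidate word is approximated by the product of the per-character label probabilities. The function name `best_word` and the example data are hypothetical.

```python
from itertools import product
from math import prod


def best_word(substitution_sets, lexicon):
    """Pick the admissible word with the largest product of label probabilities.

    substitution_sets: one dict per character position, label -> probability.
    lexicon: set of admissible words (an assumed, simple form of context).
    """
    best, best_score = None, 0.0
    # Form every word obtainable by choosing one label from each set.
    for labels in product(*(s.keys() for s in substitution_sets)):
        word = "".join(labels)
        if word not in lexicon:
            continue
        score = prod(s[label] for s, label in zip(substitution_sets, labels))
        if score > best_score:
            best, best_score = word, score
    return best, best_score


sets = [{"c": 0.6, "e": 0.4}, {"a": 0.7, "o": 0.3}, {"t": 0.5, "r": 0.5}]
word, score = best_word(sets, {"cat", "ear", "cot"})
print(word, round(score, 2))
```

Exhaustive enumeration grows with the product of the set sizes, so it is only practical for short words and small substitution sets; it is meant to show the selection criterion, not an efficient search.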

PAGE 93

4.3 Recognition of Nontext

Unlike text recognition, the understanding of nontext is performed for a limited number of nontext items such as prestored labels and symbols. The difficulty of perfectly understanding graphics images comes from the limitless variety of graphics images in addition to their complexity. Even though designers try to draw unique labels, there are many similarities between some of them. Thus, perfect understanding of graphics images has appeared to be impossible; nevertheless, some efforts have been made toward this understanding.

An effort to understand geometrical configurations by computer has been reported in the literature [Tou80]. That work introduces a novel algorithmic approach to the automatic understanding of geometrical configurations by computer. It describes a fundamental problem in automatic picture understanding as designing a computer system to analyze, interpret, and describe such geometrical configurations.

In designing a system for understanding graphics images, a conventional vision system simulates human vision on the assumption that it is able to identify objects seen previously. Human vision, however, can identify pictures without learned convention; this has been demonstrated in a study by Hochberg and Brooks [Hoc62]. They raised their son without allowing him to see any pictures, including advertisements, food containers and billboards, for the first two years. Nevertheless, at two years old the boy had no difficulty identifying pictures such as simple line drawings of shoes and other familiar objects, and even complicated ones. Thus, the human capability for perception and recognition is not a matter of learned convention. We are going

PAGE 94

to design the recognition system based on a conventional design methodology which supposes that humans begin with careful observation and then interpret what they memorize in order to understand. The conventional recognition system requires a learning stage including tasks such as image processing and picture understanding. The image processing task is concerned with the determination of the nontext type in addition to the transmission of graphics images. From the viewpoint of preprocessing for understanding graphics images, nontext is classified into two types. One is the line type, which is composed of line components and whose skeleton preserves most of its properties. The other is the blob type, whose properties mostly exist on the boundary.

4.3.1 Determination of Nontext Type

Of the image processing tasks, thinning is one of the most important and is used frequently to simplify image data in order to ease feature extraction. Thinning should be applied only to graphics images whose features will not be lost by it. In other words, thinning is applied to simplify the image by reducing it to its skeleton without destroying its geometrical shape and connectivity. Thus it is necessary to discriminate the graphics image type before the thinning process is applied to it. Graphics images should therefore be divided into those to which thinning is applicable and those to which it is not. The procedure for deciding the graphics image type can be informally described as follows.

PAGE 95

(1) Apply operator $B_1$ to the image block at every location until the number of black pixels reaches zero, and keep the number of iterations (Countiteration = number of iterations).
(2) Count the number of black pixels after applying $B_1$ (Countiteration − 1) times, set n = 1, and find the weights of the pixels at each location by using 3 x 3 windows.
(3) If the number of black pixels resulting from (2) is greater than a predefined value and there exist at least half the predefined value of pixels with weight less than four, then the graphics image block is line type.
(4) Otherwise, use 3 x 3 windows to estimate the weighted black pixel values after applying operator $B_1$ (Countiteration − (n + 1)) times. Find line components by checking the number of pixels with weight three. If the count exceeds a certain value which makes the graphics image look like line type, then classify it as line type.
(5) If no line component is found by the 3 x 3 windows in step (4), then count the total number of black pixels. If the total number of black pixels is larger than 2 x 8 times the number of black pixels after applying $B_1$ (Countiteration − n) times, then it is line type.
(6) Set n = n + 2, then repeat steps (4) and (5) until n reaches Countiteration − 1.
(7) If n reaches Countiteration − 1 without satisfying the condition of either (4) or (5), then it is blob type.

The examples in Figures 4-7 and 4-8 show characters which are represented by bit-maps. Both (c) and (d) in Figures 4-7 and 4-8 show the number of black pixels after applying operation $B_1$ once. If we apply the operation $B_1$ twice, then it


makes the number of black pixels equal to zero. The numbers of black pixels in (c) and (d) of Figure 4-7 are relatively large. Thus, the characters can be classified as line type by step (3) of the procedures. Similarly, the characters in Figure 4-8 can be classified as line type by step (5) of the procedures.

4.3.2 Classification of Line Drawing

The computer understanding of line drawings has been explored for industrial applications such as CAD/design automation techniques for electronic circuit diagrams [Tou87] and the interpretation of mechanical designs. The primary components of most line drawings are lines. These lines have a special meaning in certain line drawings and not in others. The former can be referred to as line context sensitive line drawings, while the latter are referred to as line context free line drawings. Thus, prior to applying each understanding stage to a line drawing, its classification as either line context sensitive or line context free is required.

4.3.2.1 The classification algorithm for line context free line drawings

The primary feature of line drawings, whether drawn by hand or by CAD and often seen in pages along with various picture images, is the line. Although line drawings are composed of line components, the line itself is not the feature by which to classify line context free line drawings, which usually do not include three-dimensional information. The most prominent feature, which can differentiate


Figure 4-7. The bit-maps of characters composed of line components. (a) Pattern A, no. of black pixels = 140; (b) Pattern B, no. of black pixels = 150; (c) B1(A), no. of black pixels = 40; (d) B1(B), no. of black pixels = 38.


Figure 4-8. The bit-maps of different characters composed of line segments. (a) Pattern A, no. of black pixels = 74; (b) Pattern B, no. of black pixels = 96; (c) B1(A), no. of black pixels = 3; (d) B1(B), no. of black pixels = 5.


line drawings from context free line drawings, are the components. The components in a line drawing are usually joined by lines which connect them, for example, the lines in an electronic circuit diagram. Chemical line drawings, especially those of organic chemistry, include English characters such as C, O, H or others. Meanwhile, the most prominent feature in mechanical and architectural line drawings is lines with 3-dimensional information. In a line context free line drawing, the components which exist in the drawing provide the evidence for classifying it. The procedures for extracting evidence from the line drawing are informally described as follows.

(1) Set the number of iterations to 1 (Count = 1), and estimate the width of a line (Width), which is represented as a number of pixels in the bit-map.
(2) Apply the operator OP1 to every black pixel of the line drawing to be classified.
(3) If there exists a white pixel with black neighbors in either the directions (0,4) or (2,6), then repeat step (2) and set Count = Count + 1.
(4) Otherwise, apply the operator B1 to every black pixel Count + (Width - 2) times.
(5) Apply the operator OP1 (Width - 2) times to every black pixel.
(6) Extract components by combining the result of (5) with the original line drawing image by a logical AND at each pixel location.

4.3.2.2 The attempt for classification of line drawings with line context sensitive lines

The classification of context sensitive line drawings is directly connected to 3-D object recognition. The understanding of line drawing has been performed in


various aspects to recognize 3-D objects. Huffman and Clowes demonstrated labelling techniques to interpret line drawings [Huf71, Clo71]. In an attempt to capture quantitative aspects of shape and to handle arbitrary polyhedra, Mackworth [Mac85] utilized the concept of gradient space. However, this approach could not guarantee realizability; that is, there may not exist a polyhedral scene corresponding to a labelling. Other researchers have also studied the interpretation of line drawings. Nevertheless, there does not exist a rigorous approach to understanding line drawings, or even to discriminating the line drawing of 3-D objects from a simple 2-D line drawing. One difficulty in the understanding of line drawings is that the image data do not contain 3-D information. Despite this fact, humans can interpret line drawings with little difficulty. As an attempt to interpret line drawings, volumetric primitives will be used to discriminate the line drawing of 3-D objects from simple 2-D line drawings as the very first step of interpretation. Lines in machine drawings with line context sensitive lines are classified into object lines and interpretation lines. The majority of the object lines describe the object's visible contour using a solid thick font, hidden lines, axis of symmetry lines, or cross-sectioned planes. Meanwhile, the interpretation lines provide the object's precise geometric description along with other information necessary for producing the object, and are classified into dimension lines and auxiliary lines. These are further classified into manufacturing lines and logistic lines, such as the title and frame of the drawing, part list, and part numbers. These lines represent either 2-D or 3-D objects. Thus, dimensioning plays


a vital role in classifying line drawings as the preliminary step in understanding machine drawings. It is aimed at providing an exact definition of the 2-D geometry of the projection, whose approximate shape is described graphically by object lines.

4.3.3 Recognition of Line Drawings

In the understanding of graphic images, we will be concerned only with line drawings. Images such as pictures are not ordinarily subjects for recognition in a document analysis system. There are many basic procedures for recognizing line drawings. Of these procedures, the detection of lines will be described because the most dominant element in a line drawing is the line segment. Since images of line drawings usually involve an enormous amount of pixel data, the processing procedures should be efficient and fast.

4.3.3.1 Detection of Lines

Of the various methods of detecting line segments, the most common is to apply a thinning algorithm to obtain medial line images and then to track the medial lines by using eight connected neighbors within a 3x3 window. This is illustrated schematically in Figure 4-9. Another algorithm, referred to as the longest vector-detection method, is to search the logical 1 pixels and find the longest vector drawable on the line segment. When the longest vector is found, line tracking is halted to store the coordinates of both ends of the vector. Then, the pixels that make up the line are flagged to show that the search has been completed. New tracking


Figure 4-9. Medial line-tracking method.
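The medial line-tracking idea of Figure 4-9 can be sketched in a few lines. This is a minimal illustration, not the dissertation's implementation: it assumes the thinned image is already available as a set of (row, column) pixels forming a single open curve, and the names `track` and `to_vectors` are hypothetical. The chain is followed through eight connected neighbors, then consecutive steps in the same direction are merged so that each straight run becomes one vector given by a pair of end points.

```python
# Offsets of the eight neighbors in a 3x3 window; 4-connected ones are tried first.
NEIGHBOURS = [(0, 1), (1, 0), (0, -1), (-1, 0), (1, 1), (1, -1), (-1, 1), (-1, -1)]

def neighbours(p, pixels):
    """Black pixels among the eight neighbors of p."""
    y, x = p
    return [(y + dy, x + dx) for dy, dx in NEIGHBOURS if (y + dy, x + dx) in pixels]

def track(pixels):
    """Follow a one-pixel-wide open curve from one end point to the other."""
    pixels = set(pixels)
    ends = [p for p in pixels if len(neighbours(p, pixels)) == 1]
    start = min(ends) if ends else min(pixels)
    chain, visited = [start], {start}
    while True:
        unvisited = [q for q in neighbours(chain[-1], pixels) if q not in visited]
        if not unvisited:
            return chain
        chain.append(unvisited[0])
        visited.add(unvisited[0])

def to_vectors(chain):
    """Merge consecutive steps with the same direction into (start, end) vectors."""
    vectors, start, previous = [], chain[0], None
    for a, b in zip(chain, chain[1:]):
        direction = (b[0] - a[0], b[1] - a[1])
        if previous is not None and direction != previous:
            vectors.append((start, a))
            start = a
        previous = direction
    if len(chain) > 1:
        vectors.append((start, chain[-1]))
    return vectors

# An L-shaped medial line: a horizontal run followed by a vertical run.
skeleton = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
chain = track(skeleton)
vectors = to_vectors(chain)
```

A curved line would come out of `to_vectors` as many short vectors with gradually changing directions, while a straight line yields a single long vector. Branch points and closed loops are outside the scope of this sketch.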


resumes from the last point just stored. Thus, line images can be represented as a set of vectors, with each vectored line consisting of a pair of point coordinates. Using these line-detection methods, a straight line is usually represented by a long vector, while a curved line can be approximated by a set of short vectors, each having gradually changing directions. Line detection by a thinning algorithm is a time-consuming process, whereas the longest vector-detection method is reasonably fast when a drawing consists of long straight lines.

4.3.3.2 Recognition of Line Types

The main objective in line-type recognition is to find the route for each line type by analyzing the set of short line segments. Extracting subsets of short line segments can lead to finding broken lines. For this purpose, a two-level recognition algorithm is provided: one level is for local recognition, the other is for global recognition. In local recognition, one line segment is chosen as the starting segment. A small search area is generated at the end of the segment, and the starting point of a new segment is sought within this area; the new segment is checked to see if it lies in the same direction as the preceding line segment. For a line segment in the same direction, the length of the segment is checked and accumulated for statistical analysis to determine the line type. This process continues while the line types match. When a mismatch occurs, a probable line type for the preceding line segments is stored. The process is begun again at a point after the mismatch. From that point, new tracking begins until the mismatched line segment is flagged. This procedure can


find all broken lines sequentially, other than obscure points that may have been flagged during the local recognition process and may still remain unclassified. The global recognition process is initiated to check for conflicts in the results at the starting point of the mismatched line segment. If the global processing finds a conflict, and the connection or disconnection of points is required, the local recognition procedure is again applied to these points until all conflicts are resolved.

4.3.3.3 Logic-Circuit Diagrams

In electrical engineering work, logic-circuit diagrams are often encountered. These logic-circuit diagrams usually consist of three main elements: lines representing connecting wires, symbols representing logic-circuit components, and characters and numbers for component names and attributes. The understanding of such logic-circuit diagrams can be described by the following sequence. In order to execute the recognition process, first of all the characters and numbers, regarded as small in size compared with the other elements in the logic-circuit diagram, are separated from the component symbols and wire lines. Next, contour tracking is executed for the extracted character and number image. A structure analysis of the resulting set of point coordinates that make up the contour is then executed to recognize the characters and numbers. The remaining image, excluding characters and numbers, is thinned in order to extract feature points such as end points, branch points, cross points, and corner points. These feature points are analyzed and result in connection information for the


wire lines. Loops also can be extracted from the medial line data. By analyzing the shape of the loops, component types are recognized in one of the four orientations in the horizontal and vertical directions. The components and characters are replaced with prescribed symbols and fonts with standard shapes, and the recognized wire lines are straightened to be horizontal and vertical.

4.3.3.4 Mechanical Engineering Drawings

Recognition of mechanical-engineering drawings poses somewhat different problems from recognition of schematic drawings. The object in a mechanical drawing is usually complex and intrinsically three-dimensional. The industrial applications of mechanical line drawings require three-dimensional model data as the output of the drawing-recognition task. The recognition of mechanical-engineering drawings therefore requires a highly involved recognition process. In the recognition of mechanical-engineering drawings, the binarization process necessitates floating-type binarization using a locally adapted threshold in order to obtain good and reliable images. Symbols are discriminated from one another after vectorizing the line drawings. Some of the symbols in mechanical drawings have no specific meaning by themselves, but take on meaning when combined with other lines and symbols in the line drawing. The symbols with meaning include sectioning symbols, production symbols, symmetry symbols, and dimensioning symbols. All of these symbols provide very important information, so that these symbols must be analyzed, integrated, and


interpreted in connection with the lines representing the object. It is then necessary to reconstruct the two-dimensional geometry by rectifying the vector data according to the dimension information derived in the above process. Then, by combining the views, sections, and details represented in the line drawing, a three-dimensional model can be reconstructed.

4.3.3.5 Trademarks and Symbols

The recognition of trademarks by computer has been attempted by geometrical feature selection based on the chain code [Tou87a]. Most trademarks which symbolize companies or associations are designed in a pattern which should be simple, yet distinct. Even though designers try to draw a unique trademark, there are many similarities between some of them. Trademarks are categorized in two forms: one is alphanumeric and the other is symbolic. Each trademark can further be classified into two types, the blob type and the line type. The line types are grouped as no loop, single loop, and multiple loops, depending upon the number of loops in the trademark symbol. The recognition of trademarks is performed separately for the line type and the blob type. In the understanding of a line type trademark, the image data are reduced through reduction steps for convenience in dealing with them. The main function of the thinning process is to transform a raster image into an image composed of linearly connected points. The purpose of applying this process prior to recognition is twofold: to reduce the number of black points by repeatedly peeling


off boundary points until only the centerline remains, and to define nodes. The important clues for recognition of the line type are geometrical and topological features. The theoretical ideas for obtaining general features such as common ridges, touching, fin, and bridge were discussed intensively in an approach using the chain code [Tou80]. A blob type trademark requires a different feature selection from that used in the recognition of the line type. The contour representing function of each piece of a trademark is used to extract the shape information of a blob-type trademark. The Fourier transformation technique is applied to extract the shape information. The ratio of the corresponding Fourier coefficients for the unknown trademark and the model object is formed, and the deviation of this ratio is chosen as the difference measurement.

4.3.4 The Symbol Matching Process by Transformation to the Graph Model

Symbol matching establishes a one-to-one correspondence between occurrences of the same symbol seen at different times t1, t2. Template matching as a recognition scheme for symbols is inefficient, due to the properties of image data, as the number of symbols grows. Storing a graphics image in a computer as raw data requires an extremely large memory space; it is thus necessary to transform the raw data into a compressed computerized form. A simple one-to-one correspondence between the feature sets for geometrical and topological features is usually not enough to recognize the symbols. The relationships between the features can increase the rate


of recognition. However, excessive extraction of relationships for all the objects would be prohibitively expensive in documents with many graphics images. A minimum search tree can avoid excessive effort in extracting additional features. In other words, a tree structured description greatly reduces the processing time. The planar graph is considered the appropriate way to computerize the graphics image using a matrix representation. The skeleton of a line type symbol can be represented by a planar graph model whose nodes represent interesting points in the skeleton. An edge of the graph model simply represents a line component of the skeleton in the graphics image. This graph model loses some information of the original graphics image: it simply represents the relationship of the line components of the skeletonized symbol. While the graph model is convenient for showing the relationship, a matrix representation is a convenient and useful way of representing a graph in a computer. The recognition of a symbol represented by the graph model is then converted to the problem of isomorphism of the matrices for the symbols. Two graphs are said to be isomorphic if there is a one-to-one correspondence between their vertex sets which preserves the adjacency of vertices. The representation of skeletonized symbols as graph models can be used to find the most similar symbols. However, the brute force matching process for graph models, which finds isomorphic graphs of the objective symbols, leads to examining the permutations of the existing graph models for symbols. The weighted graph matching problem includes the graph isomorphism problem [Ume88]. A weighted graph G is an ordered pair (V,w) where V is a set of vertices


of the graph and w is a weighting function which gives a real nonnegative value $w(v_i, v_j)$ to each pair of vertices $(v_i, v_j)$, $v_i \in V$, $v_j \in V$, and $v_i \neq v_j$. The adjacency matrix of a weighted graph G(V,w) is an n x n matrix defined as follows:

$$A_G = [a_{ij}], \qquad a_{ij} = w(v_i, v_j) \ (i \neq j), \qquad a_{ii} = 0 \qquad (4.2)$$

The n x n matrix $A_G$ is symmetric when G is an undirected graph. Graph matching is the problem of finding a one-to-one correspondence $\Phi$ between $V_1 = \{v_1, v_2, \ldots, v_n\}$ and $V_2 = \{v'_1, v'_2, \ldots, v'_n\}$ which minimizes a difference between G and H, where $G = (V_1, w_1)$ and $H = (V_2, w_2)$ each have n vertices. The criterion for the measure of difference is defined as follows:

$$J(\Phi) = \sum_{i=1}^{n} \sum_{j=1}^{n} \bigl( w_1(v_i, v_j) - w_2(\Phi(v_i), \Phi(v_j)) \bigr)^2 \qquad (4.3)$$

$J(\Phi)$ can be reformulated as follows by using a permutation matrix P, where G and H are weighted graphs and $A_G$ and $A_H$ are their adjacency matrices, respectively:

$$J(P) = \| P A_G P^T - A_H \|^2 \qquad (4.4)$$

where the permutation matrix P represents the vertex correspondence $\Phi$ and $\| \cdot \|$ is the Euclidean norm. The weighted graphs G and H are called isomorphic if there exists a one-to-one correspondence $\Phi$ which makes $J(\Phi)$ equal to zero. That is, from (4.4),

$$P A_G P^T = A_H \qquad (4.5)$$

Now, the graph matching problem is to find the permutation P which satisfies (4.5).
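The criterion (4.3) is easy to evaluate directly for small graphs. The following sketch, under the assumption that a correspondence is stored as a tuple `perm` with vertex i of G mapped to vertex `perm[i]` of H, computes J and finds the best correspondence by trying every permutation. The function names and the toy matrices are illustrative only, and exhaustive search is practical only for very small n.

```python
from itertools import permutations

def difference(a_g, a_h, perm):
    """J of (4.3): sum over i, j of (w1(v_i, v_j) - w2(phi(v_i), phi(v_j)))^2."""
    n = len(a_g)
    return sum((a_g[i][j] - a_h[perm[i]][perm[j]]) ** 2
               for i in range(n) for j in range(n))

def best_matching(a_g, a_h):
    """Correspondence that minimizes J, found by exhaustive search."""
    return min(permutations(range(len(a_g))),
               key=lambda perm: difference(a_g, a_h, perm))

# Toy weighted triangle G, and H obtained from G by relabelling its vertices.
G = [[0, 1, 2],
     [1, 0, 3],
     [2, 3, 0]]
H = [[0, 3, 1],
     [3, 0, 2],
     [1, 2, 0]]

phi = best_matching(G, H)
j_min = difference(G, H, phi)   # zero exactly when G and H are isomorphic
```

Because H is only a relabelled copy of G, the minimum of J is zero here, which is the isomorphism condition (4.5) stated in terms of the correspondence rather than the permutation matrix.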


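However, such a permutation matrix P does not usually exist in real problems. The optimum matching between G and H is then a permutation matrix P which minimizes J(P) in (4.4). It is difficult to find the permutation matrix P directly. If we extend the domain of J to the set of orthogonal matrices, the optimum matching between G and H is to find the orthogonal matrix Q which minimizes J(Q). This extension of the domain is natural because a permutation matrix is a kind of orthogonal matrix. These orthogonal matrices are given by

$$Q = U_H S U_G^T, \qquad S \in \mathcal{S}_1 \qquad (4.6)$$

assuming eigendecompositions of $A_G$ and $A_H$ as

$$A_G = U_G \Lambda_G U_G^T \qquad (4.7)$$

$$A_H = U_H \Lambda_H U_H^T \qquad (4.8)$$

In (4.6), $\mathcal{S}_1$ represents $\{ \mathrm{diag}(s_1, s_2, \ldots, s_n) \mid s_i = 1 \text{ or } -1 \}$. Now assuming that G and H are isomorphic, the following formula (4.9) is obtained from (4.5), (4.7), and (4.8).

$$P U_G \Lambda_G U_G^T P^T = U_H \Lambda_H U_H^T \qquad (4.9)$$

Both sides of (4.9) are eigendecompositions of the same matrix, and the eigenvectors are determined only up to their positive and negative directions when all eigenvalues are distinct. Then,

$$P = U_H S U_G^T \qquad (4.11)$$

This means that there exists some $S \in \mathcal{S}_1$ which makes Q exactly a permutation matrix when G and H are isomorphic. However, it is not easy to find this diagonal matrix S. Write this diagonal matrix as $S'$, so that $P = U_H S' U_G^T$, and let $P'$ be any permutation matrix, with $\pi$ the corresponding permutation. Let $U_H = [h_{ij}]$, $U_G = [g_{ij}]$, $S' = \mathrm{diag}(s_i)$. Then we have

$$\mathrm{tr}(P'^T U_H S' U_G^T) = \sum_{i=1}^{n} \sum_{k=1}^{n} s_k h_{ik} g_{\pi(i)k} \qquad (4.12)$$

Obviously, the following holds:

$$\sum_{k=1}^{n} s_k h_{ik} g_{\pi(i)k} \le \Bigl| \sum_{k=1}^{n} s_k h_{ik} g_{\pi(i)k} \Bigr| \le \sum_{k=1}^{n} | s_k h_{ik} g_{\pi(i)k} | = \sum_{k=1}^{n} |h_{ik}| \, |g_{\pi(i)k}| \qquad (4.13)$$

Thus,

$$\mathrm{tr}(P'^T U_H S' U_G^T) \le \sum_{i=1}^{n} \sum_{k=1}^{n} |h_{ik}| \, |g_{\pi(i)k}| = \mathrm{tr}(P'^T \bar{U}_H \bar{U}_G^T) \qquad (4.14)$$

where $\bar{U}_H$ and $\bar{U}_G$ denote the matrices whose elements are the absolute values of the elements of $U_H$ and $U_G$. Since the length of each row vector of $\bar{U}_H$ and $\bar{U}_G$ is equal to 1 and the values of its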


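elements are nonnegative, each element of $\bar{U}_H \bar{U}_G^T$ (the product formed from the element-wise absolute values of $U_H$ and $U_G$) is greater than or equal to 0 and less than or equal to 1. Thus, we have

$$\mathrm{tr}(P'^T \bar{U}_H \bar{U}_G^T) \le n \qquad (4.15)$$

On the other hand,

$$\mathrm{tr}(P^T U_H S' U_G^T) = \mathrm{tr}(P^T P) = n \qquad (4.16)$$

This means that P maximizes $\mathrm{tr}(P^T \bar{U}_H \bar{U}_G^T)$, since $\mathrm{tr}(P'^T \bar{U}_H \bar{U}_G^T) \le n$ for any permutation matrix $P'$. Therefore, when G and H are isomorphic, the optimum permutation matrix can be obtained as the permutation matrix P which maximizes $\mathrm{tr}(P^T \bar{U}_H \bar{U}_G^T)$. For example, the adjacency matrices $A_G$ and $A_H$ of the planar graphs G and H shown in Figure 4-10 are given in (4.17) and (4.18).

$$A_G = \begin{pmatrix} 0 & 5 & 8 & 6 \\ 5 & 0 & 5 & 1 \\ 8 & 5 & 0 & 2 \\ 6 & 1 & 2 & 0 \end{pmatrix} \qquad (4.17)$$

$$A_H = \begin{pmatrix} 0 & 1 & 8 & 4 \\ 1 & 0 & 5 & 2 \\ 8 & 5 & 0 & 5 \\ 4 & 2 & 5 & 0 \end{pmatrix} \qquad (4.18)$$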


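Figure 4-10. An example of the weighted planar graph matching.

The characteristic polynomial of $A_G$ is $\det(A_G - \lambda I) = (\lambda - 14.2488)(\lambda + 0.28)(\lambda + 4.8247)(\lambda + 9.1441)$. Eigendecompositions of $A_G$ and $A_H$ are given as follows:

$$A_G = U_G \Lambda_G U_G^T \qquad (4.19)$$

$$\Lambda_G = \begin{pmatrix} 14.2488 & 0 & 0 & 0 \\ 0 & -0.28 & 0 & 0 \\ 0 & 0 & -4.8247 & 0 \\ 0 & 0 & 0 & -9.1441 \end{pmatrix} \qquad (4.20)$$

$$U_G = \begin{pmatrix} 0.6142 & 0.1409 & -0.1822 & -0.7548 \\ 0.4336 & -0.5276 & 0.7262 & 0.0791 \\ 0.5484 & -0.2700 & -0.5820 & 0.5363 \\ 0.3660 & 0.7930 & 0.3173 & 0.3693 \end{pmatrix} \qquad (4.21)$$

$$A_H = U_H \Lambda_H U_H^T \qquad (4.22)$$

$$\Lambda_H = \begin{pmatrix} 13.2567 & 0 & 0 & 0 \\ 0 & -0.7744 & 0 & 0 \\ 0 & 0 & -3.4341 & 0 \\ 0 & 0 & 0 & -9.0481 \end{pmatrix} \qquad (4.23)$$

$$U_H = \begin{pmatrix} 0.5383 & -0.4358 & -0.4247 & -0.5383 \\ 0.3439 & 0.8801 & -0.0185 & -0.3269 \\ 0.6242 & 0.0255 & -0.2503 & 0.7396 \\ 0.4498 & -0.1867 & 0.8699 & -0.0787 \end{pmatrix} \qquad (4.24)$$

From these, there exists a permutation matrix P which satisfies (4.11).

$$P = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \qquad (4.25)$$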
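The example can be reproduced numerically with a short sketch. It rests on several assumptions that are not the dissertation's: the eigenvectors are computed with a plain cyclic Jacobi iteration written out here (`jacobi_eigh` is a hypothetical helper, used only to avoid a library dependency), eigenvalues are sorted in descending order for both graphs, and the permutation maximizing the trace of $P^T \bar{U}_H \bar{U}_G^T$ is found by exhaustive search, which is feasible only because n = 4. Since the printed eigenvalues of the two graphs differ, the match is the best approximate one rather than an exact isomorphism.

```python
import math
from itertools import permutations

A_G = [[0, 5, 8, 6], [5, 0, 5, 1], [8, 5, 0, 2], [6, 1, 2, 0]]   # (4.17)
A_H = [[0, 1, 8, 4], [1, 0, 5, 2], [8, 5, 0, 5], [4, 2, 5, 0]]   # (4.18)

def jacobi_eigh(matrix, sweeps=100):
    """Eigenvalues (descending) and eigenvector columns of a symmetric matrix."""
    n = len(matrix)
    a = [[float(x) for x in row] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-20:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes a[p][q].
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):   # rotate columns p and q of a and of v
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):   # rotate rows p and q of a
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    order = sorted(range(n), key=lambda i: -a[i][i])
    values = [a[i][i] for i in order]
    vectors = [[v[r][i] for i in order] for r in range(n)]
    return values, vectors

def match(a_g, a_h):
    """Permutation matrix maximizing the trace of P^T |U_H| |U_G|^T."""
    n = len(a_g)
    _, u_g = jacobi_eigh(a_g)
    _, u_h = jacobi_eigh(a_h)
    # score[i][j] = sum_k |h_ik| |g_jk|
    score = [[sum(abs(u_h[i][k]) * abs(u_g[j][k]) for k in range(n))
              for j in range(n)] for i in range(n)]
    best = max(permutations(range(n)),
               key=lambda perm: sum(score[i][perm[i]] for i in range(n)))
    return [[int(best[i] == j) for j in range(n)] for i in range(n)]

P = match(A_G, A_H)
```

Taking absolute values removes the dependence on the unknown sign matrix S, which is the point of (4.14) to (4.16): the signs of the computed eigenvectors do not affect the result.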


4.4 Summary

This chapter has covered the recognition system for document text, including nontext such as graphics and line drawings. The text recognition system consists of three stages: preprocessing, feature selection followed by matching, and postprocessing. Each character read by an optical scanner is separated by a word segmentation algorithm using eight connected components. The eight connected components generated the circumscribing rectangles for each individual character. The segmentation of closely printed characters was resolved by combining segmentation with classification by means of an adaptive decision tree. In this method, the supervisory routine took control of the window's width and location recursively. The viewing window was narrowed until either the truncated array was successfully recognized or the window became so narrow that the search was given up. As feature selection for character recognition, geometrical properties were mainly considered due to their insensitivity to variations and ease of implementation. To increase recognition accuracy, contextual information was utilized. Several methods, such as one based on compound decision theory and one based on a Markov process, were introduced for a contextual word recognition system. In the second part of this chapter, we discussed the recognition system for nontext regions. First of all, nontext regions were classified into two different types. Two different feature selection methods were applied, one to each type of image data. Prior to applying a feature extraction algorithm, the image data type was classified by the removal of line segments. In the understanding of line type nontext, the geometrical


feature was extracted for a graphics-recognition and interpretation system. The graphics-recognition and interpretation system mainly consisted of image processing and a pattern recognition algorithm. For a text string embedded in nontext, the text string image was separated from the pure graphics image through the high resolution binary input. Line segments were analyzed to find closed loops that formed graphical primitives of known shapes. Meanwhile, the boundary information was mainly used for recognition of blob-type images. The matrix representation of graphics images through graph models reduced the matching problem to the graph isomorphism problem. The adjacency matrix of a weighted graph was utilized to find a one-to-one correspondence. The optimum matching between matrices was considered as a solution to the weighted undirected graph matching problem.


CHAPTER 5
DOCUMENT FILING AND RETRIEVAL

5.1 Introduction

The recognized and interpreted documents should be kept in electronic format to ease their further handling. This section will describe only the framework for doing this. The electronic document is considered as a source of information, as a learning device, and as a mechanism for communication between people who are distant in time or place. As workstations grow cheaper, more powerful, and more available through the advance of computer technology, electronic documents will be the normal means of communication. Some sort of file management is required to enable users to manage such documents. A computerized office requires an office system with powerful and integrated facilities. Proper integration of these facilities is an important task requiring unifying concepts that can be used to tie together diverse physical capabilities. This chapter will focus on the document preparation, communication, and management aspects of office systems, which are referred to as a document management system and which allow the user to group documents and to file these groups. The objects in a document management system are the resources that people require to prepare, communicate, and manage documents. These include the documents themselves, document repositories, printing facilities, etc. These system


resources are manipulated by office workers playing various office roles and using various system facilities. There is also a growing interest in office information systems that handle complex data such as text, attributes, graphics images, and picture images. They are composed of text, attributes, and image information. Some of the functions that these systems may provide are creation and filing of such information, content addressability of computerized documents, automatic insertion of documents in a paper form, and computerized document transmission.

5.2 System Resources

Various resources are required for a document management system. A set of generic objects which represent the available system resources includes documents, file folders, a terminal, file cabinets, a printer, etc. The objects are grouped by functionality and category. Functionality describes the functions of objects: either they bear data, provide a service such as acting as repositories for data objects, or perform specific functions on data objects. One of the data bearing objects is the document, which will be further described in the following sections. The basic information carrying entity in an office automation system is the document. The other objects in the system provide various facilities for communicating and managing documents. The most important aspects of the document object in a document management system are document structure and contents. First, the types of data that can constitute the contents of a document and the structuring of documents will be discussed. Second, the types of constraints that


can be specified both on the document contents and on the document structure will be considered.

5.2.1 Document Contents

A document is defined as anything that can be used to communicate information. In a paper environment, anything that is written on paper can be a document. Thus, a document management system must support at least text and attribute data types, where attribute data types are the traditional data types supported in programming languages and data base management systems. Meanwhile, there are other ways to communicate information, and these can also be regarded as potential constituents of documents. However, these other ways will not be discussed in this section because they are beyond the scope of this dissertation. Computerized documents are very important for office automation. To support computerized documents, hardware facilities must be available for handling the different types of data. In addition, user level facilities such as editing must be provided for the different data types. In a computerized document system, capabilities should be provided for the presentation of computerized documents. A document formatter should combine attributes, text, pictures, and graphics images in a way that is easy to use. The formatter may use existing information in the system. Thus, information extraction from documents stored in the computerized document system is needed.


5.2.2 Structure of Computerized Documents

The logical components of a computerized document are illustrated in Figure 5-1. Computerized documents are composed of one or more of the following: a set of attributes, a text part, and a set of images. They may also have an annotation part. The document type contains the minimal information common to a large number of documents. The text part is composed of text sections. Each text section is composed of paragraphs, each paragraph is made up of words, and so on. This structuring of the text document allows queries to restrict retrieval on the basis of the proximity of words within the text document, as well as to associate an annotation with each of the text components. Attributes have an attribute name and a value. The value may be a repeating group of values. An image is composed of an image type, a vector form, a raster form, and a text part. The vector form represents the image as a set of image objects, which are represented as a set of ordered points and a set of parameter values. Points are pairs of values indicating the position of a point within an image. Points may be connected to form lines, polygons, polylines, etc. Image objects may be hierarchically structured. In other words, regions may contain other regions, polylines, or text. The raster form represents the image as an ordered set of pixels in two dimensions. The raster form of an image may contain overlapping raster objects, which are sets of adjacent pixels. Each raster object corresponds to a distinct vector object of the same picture, which is a closed polygon. The object caption is composed


Figure 5-1. Computerized document structure.


Figure 5-1 (Continued)
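The logical structure of Figure 5-1 can be pictured as a set of nested record types. The following is a minimal sketch under stated assumptions, not the data structure used in the dissertation: every class name, field name, and the helper `paragraphs_with_nearby_words` is hypothetical, chosen only to mirror the components named in the text (attributes, text sections, paragraphs, words, images with vector and raster forms, annotations). The helper shows how the word-level structure supports a proximity query.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Attribute:
    name: str
    values: List[str]                 # a value may be a repeating group

@dataclass
class Paragraph:
    words: List[str]
    annotation: Optional[str] = None

@dataclass
class TextSection:
    paragraphs: List[Paragraph] = field(default_factory=list)
    annotation: Optional[str] = None

@dataclass
class ImageObject:
    kind: str                         # e.g. "polyline", "polygon", "region"
    points: List[Tuple[int, int]]     # ordered points within the image
    caption_words: List[str] = field(default_factory=list)
    children: List["ImageObject"] = field(default_factory=list)  # hierarchy

@dataclass
class Image:
    image_type: str
    vector_form: List[ImageObject] = field(default_factory=list)
    raster_form: Optional[List[List[int]]] = None   # pixels in two dimensions
    text_words: List[str] = field(default_factory=list)

@dataclass
class Document:
    doc_type: str
    attributes: List[Attribute] = field(default_factory=list)
    sections: List[TextSection] = field(default_factory=list)
    images: List[Image] = field(default_factory=list)
    annotation: Optional[str] = None

def paragraphs_with_nearby_words(doc, first, second, max_gap):
    """(section, paragraph) indices where the two words lie within max_gap words."""
    hits = []
    for s_idx, section in enumerate(doc.sections):
        for p_idx, paragraph in enumerate(section.paragraphs):
            pos_a = [i for i, w in enumerate(paragraph.words) if w == first]
            pos_b = [i for i, w in enumerate(paragraph.words) if w == second]
            if any(abs(a - b) <= max_gap for a in pos_a for b in pos_b):
                hits.append((s_idx, p_idx))
    return hits

doc = Document(
    doc_type="memo",
    attributes=[Attribute("author", ["unknown"])],
    sections=[TextSection(paragraphs=[
        Paragraph("thinning is applied to line drawings".split()),
        Paragraph("the blob type is handled separately".split()),
    ])],
)
hits = paragraphs_with_nearby_words(doc, "thinning", "line", 4)
```

Keeping words as the leaves of the text part is what makes such a proximity restriction a simple scan; an index over word positions would be the natural next step for a large repository.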


of object caption words. Object caption words are of the type text, and are composed of words or parts of words. The image text part is composed of image text words. Image text words are composed of parts of words. The image text part is the text related to a given image. The text part is formed by (1) the image caption of a given image, (2) text paragraphs related to the image, (3) object caption words of objects within the image, and (4) text annotation. Annotation is a further informal explanation of the contents of a document, paragraph, word, or image. It may be associated with a text document, text section, text paragraph, text word, or image.

5.2.3 Internal and External Representation of Documents

A document management system should support the categorization of documents according to their type in order to manage documents efficiently. This implies that all documents with the same content structure belong to the same document type. Not only does this make management of documents easier, but it also facilitates the incorporation of more advanced office automation functions. The representation of document types and instances can be divided into two levels: the external representation and the internal representation. The external representation may be different from the internal representation of the document in order to allow for better secondary storage and communication bandwidth utilization. In other words, the external representation is concerned with what users see, how they see, and how they use what they see. Meanwhile, the


internal representation captures all the information of the external representation in an internal data structure. This data structure is transparent to the user and stores the documents for future use. A sophisticated internal data structure for a document is required to facilitate retrieval. The internal representation of an image does not need both an object form and a raster form, but may have only one of the two. A photograph in which objects have been identified and stored in the object form is an instance of an image for which both forms exist in the internal representation. An example of an image having only a raster internal representation is an uninterpreted photograph. An image with only an object form as its internal representation can be an engineering design. However, the object form at the external representation level may be used to display the design on a raster display. The internal representation of the object form of an image is a collection of objects. With each object is stored information related to its type, such as a polygon or a circle, its name, shading information, the coordinates of a set of points, and name display specifications such as font, size, and position of display. The internal representation of statistical type images such as graphs, histograms, and tables is a collection of tables. The information about the objects composing the presentation of these images on a specific device is also maintained. The duplication of statistical type images is not very large, and the approach facilitates both answering queries on the image contents and presenting the image in a different form, or in the same form but with different parameters. In addition,

PAGE 125

118 it can be used to display the contents of the image in devices which do not have graphics or bit-map display capability. The external representation of a computerized document in an output device will be called a physical document. Some default information is used for displaying the document in an output device. Here, default information refers to font, size, line spacing, etc. Figure 5-2 shows the structure of a physical document. The physical document is divided into physical pages. Each physical page is composed of rectangles. A rectangle can be a text rectangle or an image rectangle. Rectangles are identified by their location within a physical page and their size. Image rectangles are isomorphic to images of a computerized document, and text rectangles contain information that is used for displaying documents in an output device. A descriptor is associated with each created computerized document. The descriptor indicates the part of the document, the internal form for each part, and its mapping to a physical document. Compressed information may also be encoded in the document descriptor. The compression method to be used in such an environment depends also on the system workload and the devices used. In addition, since there may be a variety of techniques that can be used, the particular method used and its parameters may be encoded within the descriptor. This may be more important for the image part than for the text or attribute parts of computerized documents, due to the large number of bits in images. The external representation of a document type is defined by one or more document templated. A document template specifies (1) the background information for the document template, (2) the

PAGE 126

119 Figure 5-2. Physical document structure.

PAGE 127

120 layout of the document fields on the document, (3) the content of the document fields. The document templates and document contents in any appropriate data structure can be represented as a relation in a relational data base management system. 5.2.4 Information Extraction in Documents To achieve better storage utilization and to enhance content addressability, information extraction from the document is required. In executing the information extraction, recognition of text is the primary concern in document management. Automatic recognition of text has been successful for a variety of fonts. This success can be applied to documents to extract the document parts and the information that is necessary for reconstructing them. Such information will be stored in the document descriptor. For a large repository of information, pattern recognition takes place once per document and not for every query. In other words, information is extracted from the bit-maps at document insertion time, using an information extraction subsystem, and is stored with the document. Region expansion techniques are applied to the bitmap in order to extract information about the dominant regions of the bit-map. The region expansion technique picks a threshold that divides the image pixels into either objects or background. Some well known ways to pick the threshold were discussed earlier in the first part of Chapter 2. Assuming the image is bimodal, the threshold will be the minimum value to separate the two peaks. However, when the histogram is not a smooth function, it can be difficult to find the right valley between the peaks

PAGE 128

121 of a histogram. An elegant method for treating bimodal images assumes that the histogram is the sum of two composite normal functions and determines the valley from the normal parameters. The single-threshold method is useful in simple situations, but is a problem when collections vary smoothly in gray levels by more than the threshold. To improve the difficulty above, Chow and Kanako [Cho73] modified the threshold approach by deemphasizing the low-frequency background variation in the original technique or using a spatially varying threshold method. Their technique divides the image into rectangular subimages and computes a threshold for each subimage. A subimage can fail to have a threshold if its gray-level histogram is not bimodal. Such subimages receive interpolated thresholds from neighboring subimages that are bimodal, and finally the entire picture is thresholded by using separate thresholds for each subimage. Split and merge techniques can then be applied to decide the final set of regions. The technique is most successful in defining dominant regions [Hor74]. Further segmentation of the picture will probably require knowledge of the content of the picture and cannot be done easily with general purpose techniques. Special types of regions which are often encountered in the office environment are needed to be recognized in a picture. Such regions are surrounded by a boundary and are categorized into square regions, parallelogram regions, circle regions, and ellipsoidal regions, depending upon the shape of the surrounding boundary. There are two reasons for recognizing these special types of regions. First, it can reduce data size. For instance, a circle requires storage of its center and radius. Second, it can increase

PAGE 129

122 the speed of content addressability since not all regions have to be examined to see if they satisfy special properties. User-defined regions are stored in an image dictionary with a special code name and anchor points in the dictionary. These userdefined regions from the graphics editor can be used to place a copy of the region in question within an image using the anchor points. The user can insert new user defined regions into the dictionary at any point in time. The search of the dictionary can be done by text words attached to the definition of the region. The search for the images that contain a user-defined region can be done by the code name of the region for images created within the system. Information describing the region is also extracted and stored with the definition of the region. Region parameters are associated with each region, and their parameter values are extracted after the segmentation of the bit-map into regions. The user can specify certain images, using the defined-image dictionary, or extract a region from an image that he has seen while browsing through images of the system, or draw the image that he wants in his screen. The system will extract parameters describing the specified image. The system will try to match the parameters of the defined region with the parameters of the images in the system. Polylines, which are collections of connected line segments, and image text must be handled. User-defined polylines are named polylines and are stored in the defined-image dictionary for reasons of compression and content addressability, as was the case with user-defined regions. Such polylines may represent, for instance, registers, capacitors, or more complicated circuits. A polyline descriptor abstracts the

PAGE 130

123 global characteristics of the polyline and allows retrieval based on the similarity of two polylines. Several regions, polylines, and text may be hierarchically structured. In other words, a region may contain other regions or polylines or text. 5.3 System Facilities The document management system requires facilities for manipulating the system resources. Two different notions of environment are needed in a document management system: a general environment which provides common or frequently used facilities that are accessible to all users, and application-specific environments which provide facilities that are restricted to more knowledgeable users or are of less global interest. Of the two different environments, the facilities of the general environment will be described. The document management system should include some basic environments to facilitate the tasks. These include editing, formatting, filing, and retrieving. This environment should provide a way to access the application-specific environments. Like all the environments, the general environment must be concerned with providing a uniform and consistent interface to all the facilities available within the environment. Providing a generic set of operations across all the objects is considered to be one of the ways to achieve integration of facilities within an environment. Integration of facilities is achieved by commonality of effect, as far as the user is concerned. A document is an object composed of more primitive objects. Each object is an instance of a class that defines the possible constituents and representations of the

PAGE 131

124 instances. Some document classes are business letters, papers for a particular journal or conferences, theses, and programs in a given language. Objects are further classified as either abstract or concrete. An abstract object is denoted by an identifier and the class to which the object belongs. One or more concrete objects corresponds to each abstract object. Concrete objects are defined on two-dimensional pages and represent the possible formatted images of abstract objects. For example, a particular paragraph of a document or an abstract paragraph object may be represented concretely in many different ways depending on font, hyphenation conventions, line length, and other concrete variables. Document processing consists of executing various operations to define and manipulate abstract and concrete objects. There are two distinguishable concepts for objects; ordered and unordered. Many textual objects, such as paragraphs and words, are normally ordered, implying that we can speak of the first one, the last one, the next one, the preceding one, and so on. On the other hand, there are many objects that are more naturally treated as unordered for particular applications such as figures, tables, parts of mathematical equations, and pieces of unrelated text. 5.3.1 Editing Editing is a requisite facility for the document management system. Editing operations are defined as mapping from either abstract to abstract objects or concrete to concrete objects. Conventional text editing operations map logical text objects to logical text objects. For example, a text insertion or deletion may be a

PAGE 132

125 mapping from strings to strings or from paragraphs to paragraphs. An editing facility usually requires tools to manage the document with different data types because it may contain several data types. Not only does the editor need word processing for text editing, but also a geometric editor for structured graphics design and a paint/bit-map editor for free-hand drawing of digitized images. A single set of operations usable for the editing of all data types has been studied to reduce the amount of detail characteristic of existing multi-packaged document preparation systems [Fur82]. A fully integrated editing facility is difficult to achieve without a uniform framework for handling different data types. To improve this shortcoming, the boxesand-glue approach is proposed. This approach uses two-dimensional objects, called boxes, that encase concrete entities such as characters, words, lines, paragraphs and pages. Reference points of boxes which are variable in size are used to align them. The content of a document can be constructed from a collection of boxes whose contents may contain only one type of data. To insert information into a box, the appropriate box is selected, positioned and sized. The type of box defines the appropriate editor to be invoked. 5.3.2 Formatting Since we are able to categorize documents as to type, it seems natural to associate some formatting information with each document type. Mappings from abstract objects to concrete objects are defined as formatting operations.
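As a minimal sketch of a formatting operation in this sense, the following Python fragment maps one abstract paragraph to two different concrete representations that differ only in line length. The class and function names (`AbstractParagraph`, `ConcreteParagraph`, `format_paragraph`) are hypothetical and introduced only for illustration; font and hyphenation, which the text also lists as concrete variables, are ignored here.

```python
import textwrap
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractParagraph:
    """Hypothetical abstract object: an identifier plus its logical text."""
    ident: str
    text: str


@dataclass(frozen=True)
class ConcreteParagraph:
    """Hypothetical concrete object: the lines of one formatted image."""
    ident: str
    lines: tuple


def format_paragraph(par, line_length):
    """Formatting operation: abstract paragraph -> sequence of lines."""
    lines = textwrap.wrap(par.text, width=line_length)
    return ConcreteParagraph(par.ident, tuple(lines))


par = AbstractParagraph(
    "p1",
    "Mappings from abstract objects to concrete objects "
    "are formatting operations.",
)
# The same abstract object yields different concrete objects.
narrow = format_paragraph(par, 24)
wide = format_paragraph(par, 60)
```

Both results carry the same identifier and the same words; only the concrete line structure differs, which is the distinction between abstract and concrete objects drawn above.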


Standard examples are transforming a logical character to its representation in a particular font, producing a two-dimensional word with possible hyphenation from a logical word, mapping a paragraph into a sequence of lines, and breaking an abstract document into pages. Mappings such as those that transform an abstract directed graph to a line drawing, and functions that construct or lay out a table from a list of its entries, are in the nontextual domain.

The formatting information associated with a document type is called the document profile, and it specifies the default appearance for the document fields and background information. In addition, we may want to change the appearance of specific fields of a document type. For this case, we need to be able to override the default format, and to associate a different format with part or all of a document field.

Most interactive formatters have a hierarchical structure and inheritance scheme for the format environment [Fur82]. For example, an extended abstract which can be seen as a paper has the logical objects defined and structured as follows:

< Extended Abstract > = ( < Header >, < Body >, < References > )
< Header > = ( < Title >, < Authors > < Affiliation > )
< Body > = < Introduction > < Section 1 > < Section 2 > < Section 3 >
< Reference > = ...
< Title > = " Knowledge based ... "

< Extended Abstract > is an instance of the class of extended abstracts specified for a particular conference. The notation (A, B, ..., H) denotes the unordered set of objects A, B, ..., H; and A B ... H denotes the object sequence A followed by B followed by ... followed by H. Thus the < Header > consists of the object < Title > and the object sequence < Authors > < Affiliation >.

The format environment at any point in a document instance is the complete set of values that are in force at that point. The root format environment of this hierarchy is the document profile. In a particular format environment, the value for a format attribute may be undefined. In this case, the format attribute inherits its value from a higher format environment. In other words, the search for a value may extend all the way back to the document's document profile.

5.3.3 Retrieving

Retrieval of a document at some later point in time is an indispensable facility in a document management system. Typically, two types of retrieval patterns are observed. One is the case in which the user is not quite sure of what he or she is looking for. The other is the one where he may have an idea of what he is looking for, but tends to be vague when he formulates his request.

Retrieval by specifying document content information instead of document identifier is useful for content addressability. The user will have some idea of the content of documents that he wants to see, and will specify this information in his query. The system will try to return all relevant documents to him. Conditions on the text part of computerized documents involve Boolean conditions of text words or parts of words.

In some cases converting image recognition problems to attribute and text recognition problems provides a powerful alternative. Image content addressability can be achieved by specifying conditions on the image text part and the image statistical part, as well as by similarity conditions on image objects. Similarity conditions are matched with the parameters of the image objects. These parameters have been extracted and stored at document insertion time. Thus, pattern recognition does not take place at query time, with the possible exception of the extraction of information from a picture drawn by the user.

Retrieving documents based on conditions on an image's text part is different from specifying conditions on the text of the document. The former specifies a document that has an image related to the condition specified. The latter specifies a document related to the condition specified.

For an image with a number of statistical objects where each object has an internal representation in the form of a table, the user can focus his attention on only one of the statistical objects at a time. Relationships among tables are not permitted, and conditions on tables may be very selective. As a consequence, the size of the response is limited. The external representation of a document allows more than one statistical object to appear in the same image.

The user can query directly on the image objects which do not contain image text and are not of a statistical type. The user specifies his queries on images with the help of the graphics editor, the special type images dictionary, the defined-images dictionary, and the texture dictionary. The specification of the query can be done interactively by using the image editor to draw objects and their structure relationships. The user can also specify a texture directly for a textual image. The user may also want to allow flexibility about objects that he draws. He may indicate that rotation, translation, and scaling are allowed. If the user is not very confident about the shape of the object, only general measures are examined for matching. The system tries to match the user description of the object with the descriptions of the stored objects. A similarity measure is computed, and images with similar objects are returned to the user. The system also indicates to the user which object was qualified from a given image, so that the user is able to see a possible error or omission in the specification of his query. If one or the other occurs, he may want to further edit the image of his query or he may want to redefine the image.

Region expansion techniques can be used to find the dominant objects of an image. Structural relationships of objects are hierarchical, so detection of relationships is easy. In the case of a more specialized environment for a particular application, application-related techniques are desirable.

5.4 Summary

In this chapter, we have discussed what type of document structure would be desirable for computerized documents and what facilities it should provide. For these, we have presented the functionality of the document, one of the system resources of document management systems. As the most important aspect of the document object in a document management system, document structure and its content were discussed. The representation of document types and instances was divided into two levels, internal and external. To design a computerized document is to transform the internal representation to the external representation. The external representation does not always correspond to the internal representation. The internal representation captures all the information of the external representation in an internal data structure.

The document management system requires two different environments for system facilities. One of these environments, called the general environment, was the main concern and provides common or frequently used facilities. The basic facilities, such as editing, formatting, and retrieving, were discussed conceptually. Among these, retrieval of a document at some later point in time is indispensable and is considered a main concern in computerizing documents. Retrieval by specifying document content information is more useful than retrieval by document identifier. Retrieval based on text parts, image content, and images with a number of statistical objects was discussed. The system matches the user description of the objects with the descriptions of the stored objects.
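The spatially varying threshold described in Section 5.2.4 can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of [Cho73]: the bimodality test is replaced by a plain contrast check (`min_contrast`), the per-subimage threshold is taken as the midpoint of the gray-level range, and interpolation is a simple average over neighboring subimages. All function and parameter names are hypothetical.

```python
def tile_thresholds(img, tile, min_contrast=50):
    """One threshold per subimage, or None when the subimage has too
    little contrast (a stand-in for 'histogram is not bimodal')."""
    rows, cols = len(img), len(img[0])
    grid = []
    for r0 in range(0, rows, tile):
        grid_row = []
        for c0 in range(0, cols, tile):
            vals = [img[r][c]
                    for r in range(r0, min(r0 + tile, rows))
                    for c in range(c0, min(c0 + tile, cols))]
            lo, hi = min(vals), max(vals)
            grid_row.append((lo + hi) / 2 if hi - lo >= min_contrast else None)
        grid.append(grid_row)
    return grid


def fill_missing(grid):
    """Give threshold-less subimages the mean of their neighbors' thresholds."""
    defined = [t for row in grid for t in row if t is not None]
    if not defined:
        raise ValueError("no subimage yielded a threshold")
    fallback = sum(defined) / len(defined)
    out = [row[:] for row in grid]
    for i, row in enumerate(grid):
        for j, t in enumerate(row):
            if t is None:
                near = [grid[a][b]
                        for a in range(max(0, i - 1), min(len(grid), i + 2))
                        for b in range(max(0, j - 1), min(len(row), j + 2))
                        if grid[a][b] is not None]
                out[i][j] = sum(near) / len(near) if near else fallback
    return out


def threshold_image(img, tile, min_contrast=50):
    """Label a pixel 1 (object) when it is darker than its subimage threshold."""
    grid = fill_missing(tile_thresholds(img, tile, min_contrast))
    return [[1 if img[r][c] < grid[r // tile][c // tile] else 0
             for c in range(len(img[0]))]
            for r in range(len(img))]


# Bright left subimage, uniform middle subimage, dark right subimage.
img = [[200] * 4 + [150] * 4 + [90] * 4 for _ in range(4)]
img[1][1] = img[2][2] = 100   # ink on the bright background
img[1][9] = img[2][10] = 10   # ink on the dark background
result = threshold_image(img, tile=4)
```

In this example no single global threshold separates ink from background, because the dark background (90) is darker than the ink on the bright side (100); per-subimage thresholds recover the four ink pixels.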

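The inheritance of format attributes described in Section 5.3.2 can be illustrated with a short Python sketch. It assumes that each format environment is a plain mapping from attribute names to values; the attribute names and values are invented for the example, and `format_environment` is a hypothetical helper.

```python
from collections import ChainMap


def format_environment(profile, *overrides):
    """Build the format environment in force at some point of a document.

    `profile` is the root (the document profile); `overrides` are the
    nested environments from outermost to innermost. Lookup starts at the
    innermost environment and falls back toward the profile.
    """
    return ChainMap(*reversed(overrides), profile)


document_profile = {"font": "Times", "size": 12, "line_spacing": 2.0}
header_env = {"size": 14}           # the header overrides the size only
title_env = {"font": "Helvetica"}   # the title overrides the font only

title_format = format_environment(document_profile, header_env, title_env)
```

Here `title_format["font"]` is taken from the title environment, `title_format["size"]` is inherited from the header, and `title_format["line_spacing"]` is inherited all the way from the document profile.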

CHAPTER 6
CONCLUSION

6.1 Discussion

This dissertation addresses the problems concerning block segmentation and block classification of a digitized document. Systems that can simply capture, store, print, and distribute images have widespread utility. A commercially available character recognition machine, which is able to recognize textual information only, can be used as a subsystem within a general document analysis system. The more advanced aspects of document analysis, such as automatic block segmentation and block classification, can encode complex documents containing a mixture of text, graphics images, and pictures.

For the block segmentation problem, we developed an algorithm which uses a global operator to separate the blocks by the wide space between them. Double-spaced pages are rarely seen in professional documents such as journals, magazines, and books. The ratio of the distance between text lines to the width of the text block is usually less than 0.02. For instance, if the distance between the text lines is 1 mm (0.1 cm) and the width of the text block is 6 cm, the ratio is approximately 0.01667. With the RLSA method, the allowable skew angle for this ratio is only 0.95°. In practice, a document can rarely be scanned with so little skew. The advantages of the method presented here over the previously published ones are summarized as follows:

(1) The procedure does not restrict the scanning direction of the document. That is, it is insensitive to the skew of the document placement on the scanner. This leads to much simplified processing.

(2) Its time complexity function is $O(n^2)$, and the complexity is dependent upon the global operations to connect each black pixel. This complexity leads to fast processing speed.

For the block classification problem of the document image, a classification rule which is invariant to skew was utilized. The ratio of the black-white transition count to the count of black pixels separates graphics images from the text block or complex line drawing. Line segments were removed to distinguish text blocks from other blocks. The advantages of this method are summarized as follows:

(1) It is based on pixel level. Therefore, for text blocks the ratio defined above is insensitive to the rotation of the document.

(2) It provides a straightforward algorithm to separate text blocks from other blocks with similar ratios by eliminating pure line segments.

For the recognition problem of each block, strong use was made of experts for nontext region understanding for such applications as logic circuit diagrams and mechanical engineering drawings. In identifying trademarks and symbols, the type decision was made in order to apply two different recognition methods depending upon the type of trademarks and symbols. The conversion from thinned image to planar graph model was done to simplify the identification problem. Then the planar graph model was converted to a matrix representation, so that the nontext understanding problem corresponded to the isomorphism problem in matrix matching.

A higher level document understanding, page-structure analysis, was discussed. Document structure varies from one type of document to another. It is not easy to develop a general system which is able to analyze all types of documents automatically. We have introduced one standard page structure. Page-structure analysis would be an interesting topic for further study.

6.2 Contributions

This research provides a satisfactory solution for automatic block segmentation and classification of digitized paper-based documents. The key contributions of this research are summarized as follows:

(1) Development of a fast and robust algorithm for block segmentation. This method utilizes a global operator to connect black picture elements within the same block and is insensitive to skew.

(2) Development of an approach for classifying each segmented block even when the blocks are skewed. In the actual scanning process the scanning line is unlikely to correspond to the text line. This means that the skewed block is the likely starting point for block classification. This method makes use of measurements which are not affected by the block shape.

(3) Development of a systematic and efficient matching process. The thinned image of a line type graphics image is converted to a planar graph model. This method identifies the line type graphics image by the weighted graph matching algorithm.

(4) Design of a system for understanding some graphics and symbols. Professional documents increasingly contain graphics images to ease reader understanding. Our process divides the graphics and symbols into either line type or blob type; each type is then identified by geometrical features or boundary features, respectively.
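The classification feature discussed in Section 6.1, the ratio of the black-white transition count to the black pixel count, can be sketched as follows. The details are assumptions made for illustration: the block is a binary array with 1 for black, transitions are counted along rows only, and a black run that touches the right edge is counted as ending there. The function name `transition_ratio` is hypothetical, and no decision thresholds are implied.

```python
def transition_ratio(block):
    """Black-to-white transitions along rows divided by black pixel count."""
    black = sum(sum(row) for row in block)
    if black == 0:
        return 0.0
    transitions = 0
    for row in block:
        for left, right in zip(row, row[1:]):
            if left == 1 and right == 0:
                transitions += 1
        if row and row[-1] == 1:
            transitions += 1   # a run touching the right edge ends there
    return transitions / black


# Thin alternating strokes versus a solid filled area of the same size.
strokes = [[1, 0, 1, 0, 1, 0] for _ in range(4)]
solid = [[1, 1, 1, 1, 1, 1] for _ in range(4)]
stroke_ratio = transition_ratio(strokes)
solid_ratio = transition_ratio(solid)
```

Thin strokes give a ratio near one transition per black pixel, while a filled area gives a much smaller ratio, because the count depends only on pixel-level runs and not on the shape or orientation of the block.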

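The skew figure quoted in Section 6.1 can be checked with a short calculation. The relation used here is an assumption, not stated in the text: the allowable skew is taken as the angle at which a text line drifts by one line distance $d$ across the block width $w$. A ratio of 0.01667 corresponds to $d/w = 1/60$, for instance 1 mm against 60 mm.

```latex
\theta_{\max} = \arctan\frac{d}{w}
             = \arctan\frac{1\ \text{mm}}{60\ \text{mm}}
             \approx \arctan(0.01667)
             \approx 0.955^{\circ}
```

Under this assumption the result agrees with the value of about 0.95° given for the RLSA method.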

APPENDIX A
ROBUSTNESS OF THE NEW SEGMENTATION ALGORITHM

Several algorithms for block segmentation have been reported. However, many of these algorithms are very restrictive concerning the accurate placement of documents on the scanner. In one approach intended to be robust, the image is first processed to determine the individual connected components. Individual characters and other large figures are connected at the lowest level of analysis. The characters which are near enough to be within a certain distance are merged into words, words are merged into lines, lines are merged into paragraphs, and paragraphs into even larger blocks, if such a merging is possible. However, this bottom-up approach requires long processing time and may not be robust. The new segmentation rule is considered a robust one; in other words, it can separate the blocks even if the documents are scanned with skew.

Definition 3. Let $D_1$ and $D_2$ be document images, and let $D_2$ be the document image obtained by rotating $D_1$ by $\alpha$ degrees. Denote this operation as $\mathrm{rot}_{\alpha}(D_1) = D_2$.

Definition 4. Let $\biguplus$ be a logical document image OR summation at each location such that $\biguplus_{1}^{N} P(k) = P(k) + P(k) + \cdots + P(k)$, where $+$ denotes the logical OR operation. The operator $\biguplus_{1}^{N}$ implies that the logical document image OR summation at each location is applied on the page image $N$ times.

Definition 5. Denote $D_2 \cong D_1$ if there exists $0 \le \beta \le \alpha$ such that $D_2 = \mathrm{rot}_{\beta}(D_1)$.

Theorem: $\mathrm{rot}_{\beta}\bigl(\biguplus_{1}^{N} OP_1(P(x,y))\bigr) \cong \biguplus_{1}^{N} OP_1\bigl(\mathrm{rot}_{\beta}(P(x,y))\bigr)$ for all $0 \le \beta \le \alpha$.

Proof: From Definition 5, there exists $0 \le \beta \le \alpha$ such that $\mathrm{rot}_{\beta}\bigl(\biguplus_{1}^{N} OP_1(P(x,y))\bigr) \cong \biguplus_{1}^{N} OP_1\bigl(\mathrm{rot}_{\beta}(P(x,y))\bigr)$. From Definition 2, $\biguplus_{1}^{N} OP_1(P(x,y)) \supseteq \mathrm{rot}_{\beta}(P(x,y))$. There exists $0 \le \beta \le \alpha$ such that $\mathrm{rot}_{\beta}\bigl(\biguplus_{1}^{N} OP_1(P(x,y))\bigr) = \biguplus_{1}^{N} OP_1\bigl(\mathrm{rot}_{\beta}(P(x,y))\bigr)$. Therefore, from Definition 5, $\mathrm{rot}_{\beta}\bigl(\biguplus_{1}^{N} OP_1(P(x,y))\bigr) \cong \biguplus_{1}^{N} OP_1\bigl(\mathrm{rot}_{\beta}(P(x,y))\bigr)$.

By the theorem above, the proposed approach separates each block regardless of skewing of the document. Assume that $\biguplus_{1}^{N} OP_1(P(x,y))$ is well segmented. Here, "well segmented" means that the document is segmented by the wide white space which separates the page into blocks. Then $\mathrm{rot}_{\beta}\bigl(\biguplus_{1}^{N} OP_1(P(x,y))\bigr)$ is also well segmented without loss of generality. By the theorem, $\mathrm{rot}_{\beta}\bigl(\biguplus_{1}^{N} OP_1(P(x,y))\bigr) \cong \biguplus_{1}^{N} OP_1\bigl(\mathrm{rot}_{\beta}(P(x,y))\bigr)$. There exists $0 \le \beta \le \alpha$ such that $\mathrm{rot}_{\beta}\bigl(\biguplus_{1}^{N} OP_1(P(x,y))\bigr) = \biguplus_{1}^{N} OP_1\bigl(\mathrm{rot}_{\beta}(P(x,y))\bigr)$. Therefore, $\biguplus_{1}^{N} OP_1\bigl(\mathrm{rot}_{\beta}(P(x,y))\bigr)$ is also well segmented.


APPENDIX B
SOME METHODOLOGIES OF CONTEXTUAL WORD RECOGNITION

The organization of a contextual word recognition system is shown in Figure B-1. The input consists of a string of machine-printed characters. Suppose the output of a character recognizer is a word $X_1, X_2, \ldots, X_n$ of length $n$. Suppose also that the character recognizer has assigned to each character a set of labels with given probabilities. For example, the probability that the true label of character $X_i$ is $\theta_j$ may be given by $P(\theta_j \mid X_i)$. The problem in contextual word recognition is to determine the combination of labels $\theta_1, \ldots, \theta_n$ that maximizes the a posteriori probability $P(\theta_1, \ldots, \theta_n \mid X_1 \ldots X_n)$. Three methods for maximizing the a posteriori probability are described, based on (1) compound decision theory, (2) a Markov process, and (3) dictionary look-up.

Figure B-1. Organization of a contextual word recognition system


Method Based on Optimum Compound Decision Theory

The a posteriori probability can also be expressed as follows by Bayes' rule,

$$P(\theta_1, \ldots, \theta_n \mid X_1 \ldots X_n) = \frac{P(X_1 \ldots X_n \mid \theta_1, \ldots, \theta_n)\, P(\theta_1, \ldots, \theta_n)}{P(X_1 \ldots X_n)} \tag{B.1}$$

Independence of the shape of printed characters in a word means that every character of a sequence is classified on the basis of the information from the character itself. Under the realistic assumption that the recognizer behavior is independent of previous decisions,

$$P(X_1 \ldots X_n \mid \theta_1, \ldots, \theta_n) = \prod_{i=1}^{n} P(X_i \mid \theta_i) = \prod_{i=1}^{n} \frac{P(X_i)\, P(\theta_i \mid X_i)}{P(\theta_i)} \tag{B.2}$$

From the two expressions above, the a posteriori probability is rewritten as

$$P(\theta_1, \ldots, \theta_n \mid X_1 \ldots X_n) = \frac{P(\theta_1, \ldots, \theta_n)}{P(X_1 \ldots X_n)} \prod_{i=1}^{n} \frac{P(X_i)\, P(\theta_i \mid X_i)}{P(\theta_i)} \tag{B.3}$$

Since the word is given and the a priori probability $p(\theta_i)$ is known because the a priori probability of the word $p(\theta_1, \ldots, \theta_n)$ is determined from the frequency of the word, maximizing the a posteriori probability corresponds to maximizing $\prod p \ldots$

$$P(\theta_1, \ldots, \theta_n) = p(\theta_0) \prod_{i=1}^{n} p(\theta_i \mid \theta_{i-1}) \tag{B.5}$$

where $\theta_0$ is the label of the character to the left of a word, known to be a space. Therefore, $p(\theta_0) = 1$. Assuming the process is memoryless, we get

$$P(X_1 \ldots X_n \mid \theta_1, \ldots, \theta_n) = \prod_{i=1}^{n} \frac{P(\theta_i \mid X_i)\, P(X_i)}{P(\theta_i)} \tag{B.6}$$

If we substitute the two expressions above into the a posteriori probability, the a posteriori probability is formed as

$$P(\theta_1, \ldots, \theta_n \mid X_1 \ldots X_n) = \frac{1}{P(X_1 \ldots X_n)} \prod_{i=1}^{n} \frac{P(\theta_i \mid X_i)\, P(\theta_i \mid \theta_{i-1})\, P(X_i)}{P(\theta_i)} \tag{B.7}$$

As described in the previous method, $X_1 \ldots X_n$ and the a priori probability $p(\theta_i)$ are known. To maximize the a posteriori probability, we need to maximize

$$\prod_{i=1}^{n} P(\theta_i \mid X_i)\, P(\theta_i \mid \theta_{i-1}) \tag{B.8}$$

or equivalently maximize

$$\sum_{i=1}^{n} \bigl[ \log P(\theta_i \mid X_i) + \log P(\theta_i \mid \theta_{i-1}) \bigr] \tag{B.9}$$

where $p(\theta_i \mid \theta_{i-1})$ represents the probability that if the $(i-1)$th character has label $\theta_{i-1}$, the $i$th character has the label $\theta_i$. This is known as the transition probability and is determined from the underlying language. The maximum of the expression described above is found efficiently from an algorithm given by Viterbi. The Viterbi algorithm efficiently searches for the maximum of a sum.
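A minimal sketch of the Viterbi search over the sum in (B.9) is given below. It assumes the recognizer output is available as one dictionary per character position, mapping candidate labels to $P(\theta_i \mid X_i)$, and the language model as a dictionary mapping (previous label, label) pairs to transition probabilities; these data structures, the function name `viterbi_word`, and the probability values in the example are all invented for illustration.

```python
import math


def viterbi_word(char_posteriors, transition, start=" "):
    """Return (best log score, best label string) maximizing the sum of
    log P(label_i | X_i) + log P(label_i | label_{i-1}) over all positions.

    char_posteriors: list of dicts, label -> P(label | X_i)
    transition:      dict, (previous label, label) -> P(label | previous)
    start:           label to the left of the word (a space)
    """
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[label] = (score of the best path ending in label, that path)
    best = {start: (0.0, "")}
    for posts in char_posteriors:
        nxt = {}
        for label, p_obs in posts.items():
            nxt[label] = max(
                (score + logp(transition.get((prev, label), 0.0)) + logp(p_obs),
                 path + label)
                for prev, (score, path) in best.items()
            )
        best = nxt
    return max(best.values())


# Hypothetical recognizer output for a three-character word.
posteriors = [
    {"t": 0.6, "l": 0.4},
    {"h": 0.5, "b": 0.5},
    {"e": 0.4, "c": 0.6},
]
# Hypothetical transition probabilities; missing pairs count as zero.
transitions = {
    (" ", "t"): 0.5, (" ", "l"): 0.5,
    ("t", "h"): 0.8, ("t", "b"): 0.1,
    ("l", "h"): 0.1, ("l", "b"): 0.2,
    ("h", "e"): 0.7, ("h", "c"): 0.05,
    ("b", "e"): 0.3, ("b", "c"): 0.05,
}
score, word = viterbi_word(posteriors, transitions)
```

In this example the character recognizer alone prefers "c" in the last position, but the transition probabilities make "the" the best overall label sequence. Only one best path per label is kept at each position, so the search cost grows linearly with the word length rather than with the number of label combinations.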


Method Based on Dictionary Look-Up

The dictionary look-up method makes use of identity comparison, and three distance functions optimize a contextual postprocessing system [Dos77]. This method requires enough words in the dictionary to find the right entry for a given word. If the word does not exist in the dictionary, we have to find the word from the substitution sets that has the next highest a posteriori probability and check that word's validity against the dictionary. This method excludes the failure of substitution sets with the largest a posteriori probability of representing the string. In other words, it excludes the case in which the word with the highest a posteriori probability is not the correct one. If the word with the highest a posteriori probability is an illegal word which does not exist in the language, we can easily detect it by checking the word against a dictionary.
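The look-up procedure can be sketched as follows. The sketch assumes that each character position comes with a substitution set of candidate labels and their probabilities, and that the score of a candidate word is the product of its per-character probabilities; this scoring, the function name `best_dictionary_word`, and the example values are assumptions for illustration and do not reproduce the distance functions of [Dos77].

```python
from itertools import product


def best_dictionary_word(char_posteriors, dictionary):
    """Try candidate words in decreasing score order and return the first
    one found in the dictionary as (word, score), or None if none is found.

    char_posteriors: list of dicts, label -> P(label | X_i)
    dictionary:      set of legal words
    """
    candidates = []
    for labels in product(*(posts.items() for posts in char_posteriors)):
        word = "".join(label for label, _ in labels)
        score = 1.0
        for _, p in labels:
            score *= p
        candidates.append((score, word))
    for score, word in sorted(candidates, reverse=True):
        if word in dictionary:
            return word, score
    return None


# Hypothetical substitution sets for a three-character word.
posteriors = [
    {"b": 0.6, "h": 0.4},
    {"a": 0.9, "o": 0.1},
    {"l": 0.7, "t": 0.3},
]
dictionary = {"bat", "hat", "hot"}
found = best_dictionary_word(posteriors, dictionary)
```

Here the two highest-scoring strings, "bal" and "hal", are rejected as illegal words, and the next candidate, "bat", is accepted. Enumerating every combination is exponential in the word length, so this form is only practical for small substitution sets.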


REFERENCES And71 H. C. Anderson, "Multi-dimensional Rotations in Feature Selection," IEEE Transactions on Computers, Vol. 20, pp. 1045-1051, Sep. 1971. Bai87 H. S. Baird, "The Skew Angle of Printed Documents," in Proceedings, SPSE 40th Conference and Symposium on Hybrid Imaging System, Rochester, New York, pp. 21-24, May 1987. Bal82 D. H. Ballard and C. M. Brown, "Computer Vision," Chapter III, Prentice Hall, Englewood Cliffs, NJ, pp. 65-70, 1982. Cas82 R. G. Casey and G. Nagy, "Recursive Segmentation and Classification of Composite Character Patterns," Proceedings, 6th Int. J. Conf. on Pattern Recogntion, Munich, pp. 1023-1026, 1982. Cho73 C. K. Chow and T. Kaneko, "Boundary Detection and Volume Determination of the Left Ventricle from Cineangiograms," Computers in Biomedical Research, Vol. 3, pp. 13-26, 1973. Clo71 M. B. Clowes, "On Seeing Thing," Artificial Intelligence, Vol. 2, pp. 79-116, 1971. Cox82 C. H. Cox, P. Coueignoux, B. Blesser, and M. Eden, "Skeletons: A Link between Theoretical and Physical Letter Description," Pattern Recognition, Vol. 15, No. 1, pp. 11-22, 1982. Dos77 W. Doster, "Contextual Postprocessing System for Cooperation with a Multiple-Choice Character-Recognition System," IEEE Transactions on Computers, Vol. C-26, No. 11, pp. 1090-1101, Nov. 1977. Dos84 W. Doster, "Different States of a Documents Content on its Way from the Gutenbergian World to the Electronic World," Proceedings, ICPR, Montreal, Vol. 2, pp. 872-874, 1984. Ehr75 R. W. Ehrich and K. J. Koehler, "Experiments in the Contextual Recognition of Cursive Script," IEEE Transactions on Computers, Vol. C24, No. 2, pp. 182-194, Feb. 1975. 141

Fle88 L. A. Fletcher and R. Kasturi, "A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10, No. 6, pp. 910-918, Nov. 1988.

Fur82 R. Furuta, J. Scofield, and A. Shaw, "Document Formatting System: Survey, Concepts, and Issues," ACM Computing Surveys, Vol. 14, No. 3, pp. 417-472, Sep. 1982.

Hil69 C. L. Hilditch, "Linear Skeletons from Square Cupboards," Machine Intelligence, Meltzer and Michie edition, Edinburgh University Press, 1969.

Hoc62 J. E. Hochberg and V. Brooks, "Pictorial Recognition as an Unlearned Ability: A Study of One Child's Performance," American Journal of Psychology, Vol. 75, pp. 624-628, 1962.

Hor74 S. L. Horowitz and T. Pavlidis, "Picture Segmentation by a Directed Split-and-Merge Procedure," Proceedings, 2nd IJCPR, Washington, D.C., pp. 424-433, August 1974.

Hua86a J. S. Huang and K. Chuang, "Heuristic Approach to Handwritten Numeral Recognition," Pattern Recognition, Vol. 19, No. 1, pp. 15-19, 1986.

Hua86b C. L. Huang and J. T. Tou, "Knowledge Based Functional Symbol Understanding in Electronic Circuit Diagram Interpretation," Applications of Artificial Intelligence, SPIE, Vol. 635, pp. 288-299, 1986.

Huf71 D. A. Huffman, "Impossible Objects as Nonsense Sentences," Machine Intelligence, Vol. 6, pp. 295-323, 1971.

Kas90 R. Kasturi, S. T. Bow, W. El-Masri, J. Shah, J. R. Gattiker, and U. B. Mokate, "A System for Interpretation of Line Drawings," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 10, pp. 978-992, Oct. 1990.

Kit86 J. Kittler and J. Illingworth, "Minimum Error Thresholding," Pattern Recognition, Vol. 19, No. 1, pp. 41-47, 1986.

Mac85 A. K. Mackworth and E. C. Freuder, "The Complexity of Some Polynomial Network Consistency Algorithms for Constraint Satisfaction Problems," Artificial Intelligence, Vol. 25, pp. 65-74, 1985.

Mas85 I. Masuda, N. Hagita, and T. Akiyama, "Approach to Smart Document Reader System," Proceedings of the CVPR Conference, San Francisco, pp. 550-557, 1985.

Mel82 R. L. Melamud, J. D. Nihart, and J. M. White, "Dynamic Threshold Device," U.S. patent 4,345,314, August 17, 1982.

Nac84 N. J. Naccache and R. Shinghal, "SPTA: A Proposed Algorithm for Thinning Binary Patterns," IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-14, No. 3, pp. 409-418, May/June 1984.

Nag86 G. Nagy, S. C. Seth, and S. D. Stoddard, "Document Analysis with an Expert System," Pattern Recognition in Practice, Vol. 2, pp. 149-159, 1986.

Pav80 T. Pavlidis, "A Thinning Algorithm for Discrete Binary Images," Computer Graphics and Image Processing, Vol. 13, pp. 142-157, 1980.

Per77 E. Persoon and K. Fu, "Shape Discrimination Using Fourier Descriptors," IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-7, No. 3, pp. 170-179, March 1977.

Ras86 A. Rastogi and S. N. Srihari, "Recognizing Textual Blocks in Document Images using the Hough Transform," Dept. of CS, SUNY/Buffalo, 1986.

Ros76 A. Rosenfeld, R. A. Hummel, and S. W. Zucker, "Scene Labelling by Relaxation Operations," IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-6, pp. 420-433, 1976.

Sch80 W. Scherl, F. Wahl, and H. Fuchsberger, "Automatic Separation of Text, Graphic and Picture Segments in Printed Material," Pattern Recognition in Practice, Vol. 1, pp. 213-221, 1980.

Sow84 J. F. Sowa, Conceptual Structures: Information Processing in Mind and Machine, Addison-Wesley, Reading, MA, 1984.

Sri86 S. N. Srihari and G. W. Zack, "Document Image Analysis," Proceedings, 8th International Conference on Pattern Recognition, Paris, France, pp. 434-436, 1986.

Tou80 J. T. Tou, "An Approach to Understanding Geometrical Configurations by Computer," Int. Journal Computer and Information Science, Vol. 9, No. 1, pp. 1-13, 1980.

Tou87a J. T. Tou, "Understanding of Trademarks by Computer," Proceedings of the IEEE Conf. on Pattern Recognition and Computer Vision, Paris, France, pp. 27-31, 1987.

Tou87b J. T. Tou, W. H. Li, K. C. Fan, and C. L. Huang, "Knowledge-Based Approach for the Verification of CAD Database Generation by an Automatic Schematic Capture System," 24th ACM/IEEE Design Automation Conference, Miami Beach, Florida, pp. 713-720, 1987.

Ume88 S. Umeyama, "An Eigendecomposition Approach to Weighted Graph Matching Problems," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10, No. 5, pp. 695-703, Sep. 1988.

Wah82 F. M. Wahl, K. Y. Wong, and R. G. Casey, "Block Segmentation and Text Extraction in Mixed Text/Image Documents," Computer Graphics and Image Processing, Vol. 20, pp. 375-390, 1982.

Wah83 F. M. Wahl, "A New Distance Mapping and Its Use for Shape Measurement on Binary Patterns," Computer Vision, Graphics, and Image Processing, Vol. 23, pp. 218-226, 1983.

Wan89 D. Wang and S. N. Srihari, "Classification of Newspaper Image Blocks Using Texture Analysis," Computer Vision, Graphics, and Image Processing, Vol. 47, pp. 353-360, 1989.

Wen78 S. Wendling, G. Gagneux, and G. Stamon, "A Set of Invariants Within the Power Spectrum of Unitary Transformations," IEEE Transactions on Computers, Vol. 27, pp. 1213-1216, Dec. 1978.

Won82 K. Y. Wong, R. G. Casey, and F. M. Wahl, "Document Analysis System," IBM J. Res. Develop., Vol. 26, No. 6, pp. 647-656, Nov. 1982.

Zah72 C. T. Zahn and R. Z. Roskies, "Fourier Descriptors for Plane Closed Curves," IEEE Transactions on Computers, Vol. C-21, pp. 269-281, 1972.

Zen85 H. Zen and S. Ozawa, "Extraction of the Fair Document from Mixed Mode Manuscript," Proceedings of the CVPR Conference, San Francisco, pp. 544-549, 1985.

BIOGRAPHICAL SKETCH

Hokyung Kim was born in Seoul, Korea, on July 19, 1953. He received the B.S. degree in 1977, the M.S. degree in electronics engineering from Hanyang University, Seoul, Korea, in 1981, and the M.Sc. degree in electrical and computer engineering from the University of New Mexico, Albuquerque, in 1985. Since January 1986, he has been working toward the Ph.D. degree in electrical engineering at the University of Florida, Gainesville. He served as a research assistant at the Center for Information Research from August 1987 to May 1990. From 1977 to 1978, he was with the Tongyang Electric Co., Seoul, Korea, where he was engaged in the design of control circuits. From 1981 to 1983, he was an instructor at the Hanyang Women's College. His current research interests are intelligent document analysis, image processing, and machine vision.

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Julius T. Tou, Chairperson
Graduate Research Professor of Electrical Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

John Staudhammer
Professor of Electrical Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Leon W. Couch
Professor of Electrical Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Herman Lam
Associate Professor of Electrical Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Rick Smith
Associate Professor of Mathematics

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

August 1992

Winfred M. Phillips
Dean, College of Engineering

Madelyn M. Lockhart
Dean, Graduate School