
Computer Aided Invariant Feature Selection

Permanent Link: http://ufdc.ufl.edu/UFE0022870/00001

Material Information

Title: Computer Aided Invariant Feature Selection
Physical Description: 1 online resource (134 p.)
Language: english
Creator: Baker, Antoin
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2008

Subjects

Subjects / Keywords: automated, classification, computer, image, pattern, rapid, real, recognition, time, vision
Mechanical and Aerospace Engineering -- Dissertations, Academic -- UF
Genre: Mechanical Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Computer vision is a field encompassing a wealth of research; its subtopics include image restoration, scene reconstruction, and motion estimation. Another classical problem in computer vision is determining whether or not an image contains a specific object; this branch of computer vision is called object recognition. Object recognition (classification) is a broad topic that combines expertise from several disciplines such as Machine Learning, Cognitive Psychology, Signal Processing, Physics, Mathematics, Information Theory, and Image Processing. However, it is well known that even the best object classification algorithms will produce poor results when given poor features to track. 'Garbage In-Garbage Out' is the phrase coined by George Fuechsel of International Business Machines (IBM) to describe this phenomenon. This proposal presents a method for reducing the load on the user in the feature selection process. By placing a computer 'in the loop' of the feature selection process, the amount of time spent selecting appropriate features for object classification can be significantly reduced. The goal is simple: given at least two images (frames) containing the object of interest, the user simply selects the desired object in both images. Using a preprogrammed feature set, the computer will inform the user which features are best for recognizing the object of interest in future images. To accomplish this objective, an object-oriented collection of possible features to track will be created (e.g., color, texture, intensity, centroid, moment of inertia). These features will be standalone objects (classes) with the same interface to allow for easy future expansion of the feature set. Once the user selects the desired object in at least two frames, each feature class calculates the 'distance' between the two selected objects. Features with a relatively small 'distance' between the two selected objects will be considered good features to track and will be presented to the user. Based on expert knowledge, the user can simply accept or reject the proposed feature set.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Antoin Baker.
Thesis: Thesis (Ph.D.)--University of Florida, 2008.
Local: Adviser: Crane, Carl D.
Local: Co-adviser: Dixon, Warren E.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2009-06-30

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2008
System ID: UFE0022870:00001

Full Text


COMPUTER AIDED INVARIANT FEATURE SELECTION

By

ANTOIN LENARD BAKER

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008


2008 Antoin Lenard Baker


To Lorraine, Yolanda, Clinton Jr., and Clinton Sr.


ACKNOWLEDGMENTS

I would like to thank my parents for providing emotional and financial support throughout my extended stay in college. I would also like to thank my fiance for providing support, especially during my final semester. Last but not least, I would like to thank my brother for keeping me grounded.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
2 BACKGROUND
3 HUMAN VISION SYSTEM
4 COMPUTER VISION SYSTEM
    Motion
    Scene Reconstruction
    Image Restoration
    Recognition
        Sensing
        Segmentation
        Feature Extraction
        Classification
        Post Processing
        More on Feature Selection
5 METHODOLOGY
6 FEATURE DEVELOPMENT
    The RGB Class
    Grayscale Class
    Hsv Class
    Texture Class
    Image Weight
    Image Moment of Inertia
    Histogram Modal Analysis
    Number of Circles
    Number of Rectangles
7 APPLICATION ASSEMBLY


8 TESTING AND COMPARISON
9 CONCLUSION / FUTURE WORK
LIST OF REFERENCES
BIOGRAPHICAL SKETCH


LIST OF TABLES

3-1 Confusion matrix
4-1 Mock 5x5 grayscale image
4-2 Pixel intensity count
4-3 Cumulative distribution function
5-1 Preliminary feature list
6-1 Sample variations in RGB values
6-2 Generic accumulator
6-3 Summary of Distance Measures
7-1 Selected features
8-1 Textbook classifier training time
8-2 Frontal automobile classifier training time
8-3 Neural network QT4 book confusion matrix
8-4 Neural network automobile confusion matrix
8-5 Invariant feature selection QT4 book confusion matrix
8-6 Invariant feature selection automobile confusion matrix


LIST OF FIGURES

2-1 Moore's law
3-1 The human eye
3-2 Inversion of image on retina
3-3 The horizontal lines are actually parallel
3-4 The impossible object
3-5 Gestalt simplicity principle
3-6 Proximity Gestalt principle
3-7 Similarity Gestalt principle
3-8 Closure Gestalt principle
3-9 Continuity Gestalt principle
3-10 Connectedness Gestalt principle
3-11 Common region Gestalt principle
3-12 Emergence Gestalt principle
3-13 Reification Gestalt principle
3-14 Receiver operating characteristic (ROC) curve
4-1 Raster orientation
4-2 Point correspondence for optical flow
4-3 Finding motion of multiple objects
4-4 Image before histogram equalization
4-5 Image after histogram equalization
4-6 Image containing salt and pepper noise
4-7 Application of box filter
4-8 Application of median filter


4-9 Application of mask to a pixel
4-10 Computing derivative of a discrete 1-dimensional signal
4-11 Edge at an angle
4-12 Prewitt and Sobel masks
4-13 Mask used for corner detection
4-14 Simple neural network
4-15 General pattern recognition process
4-16 Extremely fast motorcycle with no mirrors
4-17 The RGB cube
4-18 Histogram of red, green, and blue channels of an arbitrary image
4-19 Measuring edgeness of an image
4-20 Normalized histogram (probability density function) of two classes
4-21 Nearest neighbor analysis
5-1 Custom Mustang
5-2 Red, green, and blue channels of Mustang image
5-3 Histogram of red channel of Mustang
5-4 Histogram of green channel of Mustang
5-5 Histogram of blue channel of Mustang
5-6 Second image of Mustang
5-7 Histogram of red channel for second Mustang
5-8 Histogram of green channel for second Mustang
5-9 Histogram of blue channel for second Mustang
5-10 Small sliding search window
5-11 Example of multiple hits on the same object
5-12 Elimination of multiple hits


6-2 Second image of object to track
6-3 Plot of distance measures versus bin size
6-4 Program used to calculate average RGB values
6-5 Variations in L1 distance while tracking the same object
6-6 Conversion of RGB image to grayscale
6-7 Visual representation of the HSV color space
6-8 Program used to calculate average HSV values
6-9 Results of Sobel operator
6-10 Results of Canny operator
6-11 Results of Laplacian operator
6-12 Image origin for calculating moment of inertia
6-13 Calculating the image moment of inertia from edges
6-14 Beverage container
6-15 Histogram of red channel for beverage container
6-16 Parameters of a generic circle
6-17 Line detection for Hough transform
6-18 Gradient at a single pixel
6-19 Gradient of a circle
6-20 Results of circle detection
6-21 Binary image representation
6-22 Neighbors of pixel to be labeled
6-23 Results of connected component labeling
6-24 Rectangle finding in an image
7-1 Design of animals without inheritance
7-2 Design of animals with inheritance


7-3 Programming with interfaces
7-4 First selection of object of interest (QT4 book)
7-5 Second selection of object of interest (QT4 book)
7-6 Finding object of interest (QT4 book) in future images
8-1 Entire image to be searched
8-2 Example of template
8-3 Image used to find template
8-4 Template used for future image searches
8-5 Result of template matching
8-6 Haar-like features
8-7 Neural network classifier tracking automobiles
9-1 Current multi-threaded search window
9-2 Future multi-threaded search window
9-3 Future implementation of threading using AMD's upcoming 16-core processor


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

COMPUTER AIDED INVARIANT FEATURE SELECTION

By

Antoin Lenard Baker

December 2008

Chair: Carl Crane
Major: Mechanical Engineering

Computer vision is a field encompassing a wealth of research; its subtopics include image restoration, scene reconstruction, and motion estimation. Another classical problem in computer vision is determining whether or not an image contains a specific object; this branch of computer vision is called object recognition.

Object recognition (classification) is a broad topic that combines expertise from several disciplines such as Machine Learning, Cognitive Psychology, Signal Processing, Physics, Mathematics, Information Theory, and Image Processing. However, it is well known that even the best object classification algorithms will produce poor results when given poor features to track. "Garbage In-Garbage Out" is the phrase coined by George Fuechsel of International Business Machines (IBM) to describe this phenomenon.

This proposal presents a method for reducing the load on the user in the feature selection process. By placing a computer "in the loop" of the feature selection process, the amount of time spent selecting appropriate features for object classification can be significantly reduced. The goal is simple: given at least two images (frames) containing the object of interest, the user simply selects the desired object in both images.


Using a preprogrammed feature set, the computer will inform the user which features are best for recognizing the object of interest in future images.

To accomplish this objective, an object-oriented collection of possible features to track will be created (e.g., color, texture, intensity, centroid, moment of inertia). These features will be standalone objects (classes) with the same interface to allow for easy future expansion of the feature set. Once the user selects the desired object in at least two frames, each feature class calculates the "distance" between the two selected objects. Features with a relatively small "distance" between the two selected objects will be considered good features to track and will be presented to the user. Based on expert knowledge, the user can simply accept or reject the proposed feature set.


CHAPTER 1
INTRODUCTION

The field of object detection and classification requires expertise from a broad audience. Many disciplines have long been interested in human vision (cognitive psychology, physics, and medicine) or in teaching machines to see like humans (electrical engineering, computer science). Although these practitioners make up the bulk of the vision/computer vision community, anyone with an application where extracting information from images is important can become a computer vision developer.

Cognitive psychology is the field of psychology that examines internal mental processes. This branch of psychology relies heavily on the scientific method, as opposed to other areas of psychology that deal with introspection (self-observation). The vast majority of cognitive psychology research deals with sensation and perception. Cognitive psychologists are interested in how humans acquire information (psychophysics) and organize information (Gestalt psychology). Before a machine can be taught to see, it is first necessary to look at how humans handle perception. Cognitive psychology is a relatively new field and has only officially existed since the 1960s. Some 50 years later, psychologists and medical practitioners still do not fully understand how humans handle vision.

The field of physics is also concerned with the computer vision problem. Whereas cognitive psychologists are interested in human perception and computer scientists are concerned with teaching machines to make inferences (decisions) from images, physicists are usually concerned with the image formation process. For example, physicists study how light refracts when passed through a lens. By understanding the physics of image formation, problems in image formation can be corrected through pre-processing. Regarding vision, psychophysicists are concerned with image formation in human vision.


Psychophysicists are faced with a daunting task because the human vision system is essentially a massively parallel black box. Moreover, standardized test results can vary by person based on that individual's prior experience, medical condition, blood sugar level, amount of sleep the night before, or simply how the question was worded. Psychophysicists have methods for eliminating human bias to an extent, but the research is still in its early stages.

Computer science and electrical engineering are the two professional fields of study that have taken on the task of machine perception. However, anyone with a personal computer can do research on their own. Only in relatively recent times have computers become powerful enough to run algorithms on decent-sized images, and research that has long been restricted to advanced labs is now available to anyone with access to a computer. Computer vision formally became a topic of study in the 1970s when computers became able to handle the large amounts of data inherent in image processing.

Moore's law has been a blessing and a curse for computer vision. The massive increase in computing power has facilitated the increase in computer vision research, but because computer vision is a very immature field, massive amounts of research are being conducted without any unifying source. As of today, a formal computer vision problem has yet to be defined.

Because computer vision is such a vast field, the research has been split into several subfields, including object recognition and classification. Even though object detection and classification is a subtopic of computer vision, the amount of research being conducted is still overwhelming.

Pertaining to computer vision, object detection can be defined as determining whether an image contains a specific object. Object classification is slightly different.


Classification assumes that we already have an object and are determining its state of nature (class). Despite this minor difference, object detection, recognition, and classification will be used interchangeably throughout this dissertation.

Because of the wealth of literature concerning object classification, an engineer or scientist tasked with determining the correct features and algorithms for object detection faces a very time-consuming process. There are thousands of available features and hundreds of algorithms for object classification. Experience helps when determining which features and algorithms to use, but because of the lack of unification in the computer vision community, this experience rarely gets passed from journeyman to apprentice. With so many different people working on the same project, information is bound to get lost if no steps are taken.

However, the computer science community already solved this problem starting in the 1990s with the widespread adoption of object-oriented programming. Object-oriented programming simply means thinking of a program as a set of objects that can interact with each other. This programming philosophy can be extended to the field of object detection. There are simply too many available features and algorithms for one person to be proficient at all of them; however, by using an object-oriented programming philosophy, a person doesn't have to be an expert at everything. Each feature that can be used to identify or classify an object will be made into its own class. All of these classes will have the exact same interface to the rest of the program. As more features become available, the user can simply insert the new features into the program with very little effort. This database of features can continue to grow over time without adding significant complexity to the original program.
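To make this design concrete, the sketch below shows one possible shape for such a feature interface and the distance-based ranking it enables. This is an illustrative C++ sketch, not the dissertation's actual implementation; the names (Feature, Region, MeanIntensityFeature, rankFeatures) and the plain grayscale pixel list are assumptions made only for the example.

    #include <cmath>
    #include <iostream>
    #include <memory>
    #include <string>
    #include <vector>

    // Hypothetical stand-in for the pixels of the user's selected object.
    struct Region {
        std::vector<unsigned char> pixels;  // grayscale values for simplicity
    };

    // The common interface every feature class exposes to the rest of the program.
    class Feature {
    public:
        virtual ~Feature() = default;
        virtual std::string name() const = 0;
        // "Distance" between the object selected in two different frames.
        virtual double distance(const Region& a, const Region& b) const = 0;
    };

    // One example feature: mean intensity of the selected region.
    class MeanIntensityFeature : public Feature {
    public:
        std::string name() const override { return "mean intensity"; }
        double distance(const Region& a, const Region& b) const override {
            return std::fabs(mean(a) - mean(b));
        }
    private:
        static double mean(const Region& r) {
            if (r.pixels.empty()) return 0.0;
            double sum = 0.0;
            for (unsigned char p : r.pixels) sum += p;
            return sum / static_cast<double>(r.pixels.size());
        }
    };

    // Report every feature's distance between the two user selections;
    // features with relatively small distances are candidates to keep.
    void rankFeatures(const std::vector<std::unique_ptr<Feature>>& features,
                      const Region& first, const Region& second) {
        for (const auto& f : features) {
            std::cout << f->name() << ": " << f->distance(first, second) << "\n";
        }
    }

Under this design, adding a new feature amounts to writing one more subclass and appending it to the list passed to rankFeatures, which is exactly the kind of low-friction growth described above.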


Once a database of features has been created, the task of finding the best features to detect an object becomes simple. Instead of scouring massive amounts of literature for the best features to find the object of interest, the user simply selects the object in a minimum of two separate images and the computer tells the user which features are best for finding the object in future images.

By organizing all the features in a single place, valuable experience is preserved. When a person graduates or leaves a company, he or she doesn't take all of that knowledge along. Moreover, with this approach, everyone will be able to contribute relatively seamlessly to the overall goal of accurate object classification without needlessly adding complexity to the system.

This chapter explained why an object-oriented approach would be beneficial in the field of object detection. The following chapters give some background on human and computer vision systems. Finally, the approach used for finding the best features for object detection is discussed.


CHAPTER 2
BACKGROUND

Ever since the early twentieth century, humans have been fascinated with the idea of teaching robots to imitate human behavior. In the 1920s, Karel Capek's play R.U.R. was the first piece of entertainment to put the word "robot" into mainstream circulation [1]. Since then, a myriad of science fiction novels, plays, and movies have appeared showing robots that completely mimic and interact with humans. Robots have been portrayed as very friendly, such as Johnny Five from the movie Short Circuit, or as set on a one-way track toward the destruction of humanity, such as in the Terminator series [2].

One of the great pioneers of robotic science literature is the famous Isaac Asimov, a Russian-born biochemist who wrote hundreds of science-fiction novels. Asimov's three laws of robotics paved the way for the majority of future robotic fiction. Using Asimov's laws as a basis, science fiction writers have created robots that range from wife replacements [3] to robots that fight crime so that humans can stay out of harm's way [4]. One of the major assumptions in most of these movies is the robot's ability to have vision that is equal, if not superior, to that of a human's vision system.

Since the 1920s, there have been vast improvements in technology. With the invention of the modern computer in the 1950s, very sophisticated labs were the only places one could find enough computational power to perform the most basic computer vision research. In 1965, Gordon Moore stated that computing power would double every two years [5] (Figure 2-1). Computing power has followed Moore's law very closely: in 2008, PDAs, gaming consoles, and personal computers have more processing power than the mainframes of the 1970s [6]. With this exponential increase in processing power, computer vision research has become a possibility for anyone with access to a relatively modern personal computer.


Because of this widespread availability of computing power, enormous research effort has been dedicated to the broad field of computer vision. However, despite the vast amount of research being conducted, humans are not even close to teaching machines to mimic human vision. Current computer vision research is focused on basic face detection and image segmentation, tasks most humans can do from birth. For example, newborn babies are able to recognize a person's mood from the person's facial expressions [7]. This fact is not a putdown of human ingenuity, but rather a testament to the creativity and complexity of a vision system so remarkable that it must have been created by a higher being!

Because the overall goal of computer vision is teaching machines to imitate human vision and perception, additional attention will be given to the human vision system. The human vision system is extremely complex and many of its components have been black-boxed, but it only seems correct to investigate the subject that is to be mimicked. The following chapter gives an introduction to the human vision system. The topics covered range from low-level topics such as image formation to high-level topics such as optical illusions. A full understanding of the next chapter is not needed to understand the goal of this research; however, it gives the reader background on the system researchers are trying to reproduce.


Figure 2-1: Moore's law


CHAPTER 3
HUMAN VISION SYSTEM

This chapter discusses how object classification occurs in humans. First, it gives a brief overview of the low-level topics of human vision, including the structure of the human eye and image formation. Second, it discusses the higher-level visual properties of perception. Finally, the chapter discusses some of the research being conducted in the field of human visual perception.

The human eye (Figure 3-1) can be thought of as a 22-millimeter camera with an extremely high resolution. Several studies suggest that a camera must have a resolution of 550-600 megapixels in order to match the resolution of the human eye [8]. Other studies have shown that the human eye behaves more like a contrast detector than an absolute detector such as a charge-coupled device (CCD) camera; these studies indicate the human eye has a contrast ratio of 1,000,000:1 or higher [9].

The human eye is an electromagnetic sensor that is sensitive to electromagnetic radiation between wavelengths of approximately 370-730 nanometers. Light enters the eye through the cornea, which provides fixed focusing power, very much like a camera. Further focusing is accomplished by the lens, which changes its shape based on the distance between the fixated object and the observer. Most modern digital cameras perform autofocus by maximizing the contrast (the value of derivatives) in the image. However, the image must contain objects that contrast with each other or the maximization algorithm will not work; for example, a digital camera will not be able to focus on a blank wall [10].

The pupil controls the intensity of the image and the lens controls the focus of the image. After the image passes through these filters, the resulting image is inverted and flipped on the retina (Figure 3-2). However, the human brain has evolved to take care of this problem.
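As a small aside on the contrast-maximization autofocus mentioned above, a focus measure can be as simple as summing squared differences between horizontally adjacent pixels; the sharper the image, the larger the sum. The sketch below is a hypothetical illustration (the image layout and function name are assumptions), not taken from any camera firmware.

    #include <vector>

    // Sum of squared horizontal intensity differences over a grayscale image
    // stored row-major; it peaks when edges are sharpest (i.e., in focus).
    double focusMeasure(const std::vector<unsigned char>& pixels, int width, int height) {
        double total = 0.0;
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x + 1 < width; ++x) {
                int d = static_cast<int>(pixels[y * width + x + 1]) -
                        static_cast<int>(pixels[y * width + x]);
                total += static_cast<double>(d) * d;
            }
        }
        return total;  // a blank wall yields roughly zero, so there is nothing to maximize
    }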


When the image lands on the retina, two types of receptors detect it. These two photoreceptors are called rods and cones. Rods are responsible for vision at low light levels and have very poor spatial acuity. Cones are the exact opposite: they respond at high light levels and have very accurate spatial acuity. There are three different types of cones in the eye, which respond to different wavelengths of light (red, green, and blue) [11].

Once the image is formed on the retina, it is carried from the eye to the brain by the optic nerve. The pathway from the eye to the brain is a complicated one, but knowledge of these details is outside the scope of this chapter. As the image passes from the eye to the brain, the visual process changes from sensation (image detection) to perception (image understanding). While the details of image formation in the human eye may be of great concern to biologists or other medical professionals, image understanding is where the fun begins for the computer vision engineer. With this in mind, the rest of this chapter is dedicated to the higher-level processes by which humans interpret and understand scenes once they have been formed by the lower-level systems.

As light passes through the cornea, lens, and pupil, it forms patches of light on the retina. However, humans rarely see patches of light (unless under the influence of some mind-altering drug); rather, we see an organized world consisting of different objects interacting with each other based on a perceptual set. The process is so natural that it goes unnoticed by most humans until an event occurs that violates our previous knowledge of the world. For example, Figures 3-3 and 3-4 illustrate this phenomenon: these two images violate most people's perceptual organization experience and therefore cause great confusion when encountered.


The study of optical illusions is a branch of cognitive psychology that focuses on how the brain extracts meaningful information from raw data. The principal driving force behind perceptual organization is Gestalt psychology. The overall theme of Gestalt psychology is that "the whole is more than the sum of the parts" [12]. According to Gestalt psychologists, the main idea behind their research is that the brain will produce the simplest possible organization allowed by the conditions [13]. For example, most humans will see the image in Figure 3-5 as three overlapping disks. The image could have been interpreted as three balls placed in front of each other, but the simplest interpretation states otherwise [14].

Gestalt psychologists have long been studying how humans perform the act of object recognition and classification. The first step in recognizing an object is separating that object from its background. Gestalt psychologists believe humans accomplish this based on six main types of grouping principles: (1) proximity, (2) similarity, (3) continuity, (4) closure, (5) common region, and (6) connectedness (Figures 3-6 through 3-11). Proximity implies that elements that are close together will tend to be grouped together. Similarity says that elements similar in appearance will tend to be grouped together. Continuity says humans will organize elements along continuous contours. Closure shows that display elements that make up a closed figure will tend to be grouped together. Connectedness shows that objects that are somehow visually connected will be grouped together. Common region, or common fate, implies that elements moving in the same direction or moving similarly will tend to be grouped together. In computer vision, the research of grouping related elements falls under the topic of image segmentation.

Once the image has been grouped into similar elements according to the above grouping principles, the human process of object recognition and classification begins.


Although studying the human vision system can greatly aid in teaching computers to segment images, the same cannot be said about teaching computers to recognize objects. This problem occurs because humans and computers perform pattern recognition in fundamentally different ways. Whereas humans use a top-down method (recognizing the object as a whole), computers use a bottom-up approach, meaning that they search for a collection of features to recognize an object.

This top-down approach can be observed in the Gestalt properties of emergence, reification, and invariance. Emergence means that the detected object is identified all at once, rather than as a set of certain features. For example, most humans can recognize the dog in Figure 3-12 even though it is completely disconnected and blends with the background. Reification implies that humans add more information than is present in the original scene: looking at Figure 3-13, one can observe the triangle and the three-dimensional object even though no such objects exist. Invariance implies that humans are able to recognize objects even if the features are completely distorted.

Gestalt psychology gives a description of what occurs in human vision, but it doesn't necessarily explain how these processes occur. Because of the holistic nature of the human vision system, traditional pattern recognition techniques are virtually impossible to apply to it. The initial step in pattern recognition is feature analysis, but the human vision system is essentially feature invariant, making most pattern recognition techniques useless for modeling it. Most humans are able to recognize a particular car whether it is night or day, whether the car is dirty or clean, and whether the car is missing tires, is upside-down, or has even been in an accident. It will be shown in the Methodology section of this dissertation how the holistic nature of the human vision system is taken into account.


One area of pattern recognition that is used to make measurements of the human vision system is signal detection theory. Signal detection theory is a method for quantifying the ability to discern between a signal and noise. Initially, signal detection theory was used extensively by radar researchers [15]; Green and Swets later introduced it into the field of psychophysics in 1966 [16]. Signal detection theory is used when psychologists want to measure how humans make decisions under conditions of uncertainty. It assumes that the noise and the signal are unchangeable, and that the variations in measurements are due to the active decision making of the human.

For example, if researchers want to determine a human's ability to recognize an object against a background, they present him with a large number of mixed images. The human classifies these images into "object present" or "object absent." The researchers then measure the number of hits and false alarms. Afterwards, the researchers are able to use a lookup table to measure the human's sensitivity and his response bias.

Experimenters use two main visualization techniques in signal detection theory. The first is the confusion matrix (Table 3-1), which is more like a matching matrix that simply tabulates the number of hits, misses, false alarms, and correct rejections. The second method for visualizing signal detection information is the receiver operating characteristic (ROC). The ROC is a plot of the hit rate versus the false alarm rate for a sensor as its discrimination threshold is varied (Figure 3-14).
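To make the hit rate and false alarm rate behind Table 3-1 and the ROC curve concrete, the sketch below sweeps a discrimination threshold over a set of scored trials and prints one ROC point per threshold. The scores and threshold rule are hypothetical; real psychophysics data would come from experiments like the one described next.

    #include <iostream>
    #include <vector>

    // One trial: the observer's (or detector's) score and the ground truth.
    struct Trial { double score; bool objectPresent; };

    int main() {
        // Hypothetical scores; in an experiment these come from human responses.
        std::vector<Trial> trials = {
            {0.95, true}, {0.80, true}, {0.70, false}, {0.60, true},
            {0.45, false}, {0.40, true}, {0.20, false}, {0.10, false}};

        for (double threshold = 0.0; threshold <= 1.0; threshold += 0.25) {
            int hits = 0, misses = 0, falseAlarms = 0, correctRejections = 0;
            for (const Trial& t : trials) {
                bool detected = t.score >= threshold;
                if (t.objectPresent) { if (detected) ++hits; else ++misses; }
                else { if (detected) ++falseAlarms; else ++correctRejections; }
            }
            // One point of the ROC curve: hit rate versus false alarm rate.
            double hitRate = static_cast<double>(hits) / (hits + misses);
            double faRate  = static_cast<double>(falseAlarms) / (falseAlarms + correctRejections);
            std::cout << "threshold " << threshold << ": hit rate " << hitRate
                      << ", false alarm rate " << faRate << "\n";
        }
        return 0;
    }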


For example, assume there is a pile of toothpicks on a table, varying in length from 1 cm to 10 cm. The human is instructed to pick out all the toothpicks in the range of 1.5 cm to 9 cm. The number of correct identifications (hits) and false identifications (false alarms) is then plotted on the ROC curve as a point. The human is then instructed to pick out all the toothpicks in the range of 3 cm to 7 cm, then 5 cm to 6 cm, and so on. As the selection criterion gets tougher, the number of false alarms goes up and the number of hits goes down. The resulting curve from plotting these points will form a figure such as Figure 3-14. A ROC curve built from real data will generally not be as smooth as the curve shown in the figure, but it will follow the same general pattern.

While signal detection theory may give the researcher a method of determining how well a human's vision system can perform a particular task, it still doesn't explain how the actual process occurs. Much more research is needed in the field of human object detection before computers can begin to mimic the actual process. Humans have only begun to scratch the surface of the human vision system, and in due time will gain a more complete understanding of how the process works.

Nevertheless, several exciting computer vision projects are currently being conducted without a full knowledge of the human vision system. The next chapter gives an introduction to some computer vision fundamentals and discusses current computer vision research topics with a focus on object detection.


Table 3-1: Confusion matrix

                            Actual Positive    Actual Negative
    Predicted Positive      "Hit"              "False Alarm"
    Predicted Negative      "Miss"             "Correct Rejection"

Figure 3-1: The human eye (Biological Sciences Lab, Vanderbilt University, 2008, www.cas.vanderbilt.edu/bsci111b/eye/human-eye.jpg)

Figure 3-2: Inversion of image on retina


Figure 3-3: The horizontal lines are actually parallel (Optical Illusion, National Institute of Environmental Health Sciences, 2008, http://kids.niehs.nih.gov/illusion/illusions3.htm)

Figure 3-4: The impossible object (Optical Illusion, National Institute of Environmental Health Sciences, 2008, http://kids.niehs.nih.gov/illusion/illusions3.htm)


Figure 3-5: Gestalt simplicity principle

Figure 3-6: Proximity Gestalt principle

Figure 3-7: Similarity Gestalt principle


Figure 3-8: Closure Gestalt principle

Figure 3-9: Continuity Gestalt principle

Figure 3-10: Connectedness Gestalt principle


Figure 3-11: Common region Gestalt principle

Figure 3-12: Emergence Gestalt principle


Figure 3-13: Reification Gestalt principle

Figure 3-14: Receiver operating characteristic (ROC) curve. (A perfect sensor would be a vertical line along the y-axis, indicating a 100% hit rate and a 0% false alarm rate for all discrimination thresholds.)


CHAPTER 4
COMPUTER VISION SYSTEM

Chapter 3 focused on the biological vision system and how object detection is conducted in humans. Unfortunately, most biological vision research is still in a rudimentary phase. Most computer vision engineers understand very little about the human biological vision system and have been performing experiments haphazardly. Therefore, many of the accomplishments in computer vision have come through painstaking trial and error, and finding a unified source of information remains a problem even with the widespread use of the internet. Nevertheless, there are some computer vision fundamentals that are common to every practitioner. The beginning of this chapter covers some of these fundamentals in order to give the reader a general knowledge of modern computer vision systems.

Computer vision is the science of machines that see. The goal of computer vision is for a machine to make decisions based on information from images. Several references use the terms machine vision and computer vision interchangeably; however, there is a subtle difference. Machine vision usually refers to the context of industrial applications, whereas computer vision refers to the field in general.

The field of computer vision is a very broad one, combining knowledge from various disciplines such as neurobiology, mathematics, physics, and artificial intelligence. Thankfully, one doesn't have to be an expert in all these topics to perform meaningful computer vision research. Because the computer vision field is so vast, most people simply specialize in one of its many subtopics. Some prominent areas of computer vision include the following: (1) motion, (2) scene reconstruction, (3) image restoration, and (4) recognition. Several of these fields overlap each other.


Because of this fact, most computer vision engineers become experts at one subtopic while having only a mediocre knowledge of the other fields. The rest of this chapter explains these four fields of computer vision with an emphasis on object recognition.

Motion

Image formation for a camera is very similar to that of a human. The main difference is that most cameras use a charge-coupled device (CCD) instead of the rods and cones used by the human eye. The principle is essentially the same: tiny solid-state cells sensitive to certain wavelengths of light convert them to electrical energy. The resulting pixel coordinate system is usually organized in a raster orientation instead of a Cartesian system (Figure 4-1). Almost every other principle remains the same as in the human eye.

Motion describes the movement between an object of interest and the camera. The four basic cases of motion are listed below from least to most complex.

1. Still camera, single moving object, constant background.
2. Still camera, several moving objects, constant background.
3. Moving camera, relatively constant scene.
4. Moving camera, several moving objects.

For the first case with a still camera, the movement of objects can be computed by simply subtracting two frames taken a small time apart. If the difference between pixels in the two frames is above a threshold, we say a change occurred for those pixels. If only one object is moving in the field, that object can be identified using various simple algorithms, such as finding the center of mass of the moving object or by using connected component analysis [17].
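A minimal sketch of this first case (frame differencing with a threshold, followed by a centroid of the changed pixels) is shown below. It assumes 8-bit grayscale frames stored row-major; the function names are illustrative, not the dissertation's code.

    #include <cstddef>
    #include <cstdlib>
    #include <utility>
    #include <vector>

    // Mark pixels whose intensity changed by more than `threshold` between two
    // grayscale frames of the same size (row-major, 8-bit values).
    std::vector<bool> frameDifference(const std::vector<unsigned char>& previous,
                                      const std::vector<unsigned char>& current,
                                      int threshold) {
        std::vector<bool> changed(current.size(), false);
        for (std::size_t i = 0; i < current.size(); ++i) {
            int diff = std::abs(static_cast<int>(current[i]) -
                                static_cast<int>(previous[i]));
            changed[i] = diff > threshold;
        }
        return changed;
    }

    // Center of mass of the changed pixels: a rough position estimate for a
    // single moving object (width is the image width in pixels).
    std::pair<double, double> centerOfChange(const std::vector<bool>& changed,
                                             std::size_t width) {
        double sumX = 0.0, sumY = 0.0;
        std::size_t count = 0;
        for (std::size_t i = 0; i < changed.size(); ++i) {
            if (!changed[i]) continue;
            sumX += static_cast<double>(i % width);
            sumY += static_cast<double>(i / width);
            ++count;
        }
        if (count == 0) return {0.0, 0.0};  // no change detected
        return {sumX / count, sumY / count};
    }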


However, if the object is stationary, then subtracting two images won't give any pertinent information about the object being tracked. In this scenario, it is necessary to use some sort of point correspondence. Interesting feature points must be identified on the object of interest; these same feature points are then found in the next image and their motion is calculated. This technique of tracking feature points is called optical flow (Figure 4-2). The most common types of feature points are corners or rough edges. The method for finding corners or edges will be covered later in this chapter under the section on Image Restoration.

Using a method such as corner or edge detection, a point of interest is found in the first image. A small region S1 surrounding the point of interest is stored in memory. Now a second image is obtained. The stored region S1 is then moved across a search rectangle in the second image. Wherever the stored region S1 has the highest autocorrelation (least difference) within the search rectangle of the second image is the new location of the point of interest.
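The region search just described can be written as a straightforward sum-of-squared-differences (SSD) block match. The sketch below is a hypothetical illustration (the Image struct and function names are assumptions); it slides the stored region over the search rectangle and keeps the position with the smallest difference.

    #include <limits>
    #include <utility>
    #include <vector>

    // Simple grayscale image: row-major pixel values plus dimensions.
    struct Image {
        int width = 0, height = 0;
        std::vector<unsigned char> pixels;  // size == width * height
        int at(int x, int y) const { return pixels[y * width + x]; }
    };

    // Sum of squared differences between the stored region (the "S1" above)
    // and a patch of `img` whose top-left corner is (x, y).
    long long ssd(const Image& img, int x, int y, const Image& region) {
        long long total = 0;
        for (int dy = 0; dy < region.height; ++dy)
            for (int dx = 0; dx < region.width; ++dx) {
                int d = img.at(x + dx, y + dy) - region.at(dx, dy);
                total += static_cast<long long>(d) * d;
            }
        return total;
    }

    // Slide the stored region over a search rectangle in the second image and
    // return the top-left corner where the difference is smallest. The search
    // rectangle is assumed to lie entirely inside the image.
    std::pair<int, int> bestMatch(const Image& second, const Image& region,
                                  int searchX, int searchY, int searchW, int searchH) {
        long long best = std::numeric_limits<long long>::max();
        std::pair<int, int> bestPos{searchX, searchY};
        for (int y = searchY; y + region.height <= searchY + searchH; ++y)
            for (int x = searchX; x + region.width <= searchX + searchW; ++x) {
                long long score = ssd(second, x, y, region);
                if (score < best) { best = score; bestPos = {x, y}; }
            }
        return bestPos;
    }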

V_{i,t} = p_{i,t} - p_{i,t-1}    (4-1)

The smoothness at point p_{i,t} can be defined in terms of the difference of the vectors reaching and leaving that point. The smoothness of direction is measured by their dot product. The smoothness of speed is measured by comparing the geometric mean of their magnitudes to their average magnitude. The geometric mean is shown in Equation 4-2. Unlike an arithmetic mean, where one adds a set of numbers and divides by the count, in a geometric mean one multiplies a set of n numbers and takes the nth root of the result.

\bar{a} = \left( \prod_{i=1}^{n} a_i \right)^{1/n} = \sqrt[n]{a_1 a_2 \cdots a_n}    (4-2)

Using the definition of the geometric mean for the two displacement vectors at a point, the smoothness can be defined as Equation 4-3:

S_{i,t} = w \, \frac{V_{i,t-1} \cdot V_{i,t}}{|V_{i,t-1}| \, |V_{i,t}|} + (1-w) \, \frac{2\sqrt{|V_{i,t-1}| \, |V_{i,t}|}}{|V_{i,t-1}| + |V_{i,t}|}    (4-3)

As shown in Equation 4-3, the weight 0 \le w \le 1 can be adjusted to emphasize smoothness of direction or of speed. The corresponding smoothness S also lies between 0 and 1, with a higher value of S being better. The total smoothness over all points is given by Equation 4-4:

T_s = \sum_{i=1}^{m} \sum_{t=2}^{n-1} S_{i,t}    (4-4)

Points of interest in the first frame are labeled i = 1, 2, ..., m. Given the next frame, the goal is to label the points in that image so as to maximize the total smoothness. There are several algorithms available that perform this procedure, such as the Greedy-Exchange algorithm [19].
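As a concreteness check on Equation 4-3, the following minimal C++ sketch computes the smoothness score for one trajectory point from its two consecutive displacement vectors. The sketch is not part of the original Sethi-Jain formulation; the names Vec2 and smoothness and the sample vectors are illustrative assumptions only.

#include <cmath>
#include <iostream>

struct Vec2 { double x, y; };

double dot(const Vec2& a, const Vec2& b) { return a.x * b.x + a.y * b.y; }
double mag(const Vec2& a) { return std::sqrt(dot(a, a)); }

// Smoothness of Equation 4-3: a weighted sum of a direction term (normalized
// dot product) and a speed term (geometric mean over arithmetic mean of the
// magnitudes). The weight w in [0,1] trades direction against speed smoothness.
double smoothness(const Vec2& vPrev, const Vec2& vCur, double w)
{
    double mPrev = mag(vPrev), mCur = mag(vCur);
    if (mPrev == 0.0 || mCur == 0.0) return 0.0;            // degenerate: no motion
    double direction = dot(vPrev, vCur) / (mPrev * mCur);   // cosine of the turn angle
    double speed = 2.0 * std::sqrt(mPrev * mCur) / (mPrev + mCur);
    return w * direction + (1.0 - w) * speed;               // higher is smoother
}

int main()
{
    Vec2 vPrev{3.0, 1.0}, vCur{2.8, 1.2};   // two consecutive displacement vectors
    std::cout << "S = " << smoothness(vPrev, vCur, 0.5) << "\n";   // near 1: smooth motion
}

A full tracker would evaluate this score for every candidate point assignment and keep the labeling that maximizes the total smoothness of Equation 4-4.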

The third case of motion occurs when the camera is moving but the scene is relatively stationary. Optical flow methods have been developed for computing the flow of an entire image, not just at interesting points. These methods have their own corresponding assumptions. First, it is assumed that the object reflectivity does not change over the interval [t1, t2]. Second, the distances of the objects from the camera or light sources do not vary significantly. Third, any small intensity neighborhood N_{x,y} (e.g., any 3x3 or 5x5 set of pixels) observed at time t1 is observed at some shifted position N_{x+\Delta x, y+\Delta y} at time t2. Unfortunately, the first two assumptions rarely hold in a real-world setting. Unlike the scenario of tracking a single object with a stationary camera, moving the camera drastically changes the image. Nevertheless, an image flow equation (Equation 4-5) has been developed to compute image flow vectors.

f(x+\Delta x, y+\Delta y, t+\Delta t) = f(x,y,t) + \frac{\partial f}{\partial x}\Delta x + \frac{\partial f}{\partial y}\Delta y + \frac{\partial f}{\partial t}\Delta t    (4-5)

The solution of this equation for \Delta x and \Delta y does not give a unique answer, but places a linear constraint on it. Moreover, this differential equation must be solved for every pixel in the image, making it computationally infeasible for real-world environments. Similarly, the fourth case of motion involves a moving camera and moving objects in the image. The analysis for this situation is very complex and error prone.

Scene Reconstruction

The second broad topic in computer vision is scene reconstruction. The goal of scene reconstruction is to construct a three-dimensional model of a scene based on two or more related images. The reconstruction process consists of three main steps:
1. Three-dimensional data acquisition
2. Registration
3. Surface construction

In the data acquisition phase, range data are acquired from a set of views. In the registration process, the range data are combined by transforming them all into a single 3D coordinate system. After the registration phase is complete, the result is a three-dimensional point cloud. A mesh is wrapped around this point cloud in the surface reconstruction phase.

Although the topic seems very novel, it is plagued with problems. First consider the data acquisition phase. It makes sense that a person should use the sensor appropriate for the job: one would not use sonar to check the temperature, nor use a thermometer to spy on a conversation. Attempting to get range (distance) information from a monocular camera is similar to those scenarios. Humans have an enormous perceptual set of objects stored in memory and prior knowledge of the size of millions of objects; computers do not. By identifying an automobile, a human obtains a rough approximation of the distance of that automobile based on its size relative to the viewer. Until computers have storage and processing power comparable to that of a human, trying to get range information from a monocular camera will be very inaccurate. If range information is needed, there are a myriad of sensors designed to give exactly that information. Second, the registration process is usually never automated. Unless the images are taken in a very controlled environment (e.g., placing infrared LEDs strategically throughout the image), a human must be present to aid the computer in transforming the information into a single three-dimensional coordinate system. Since the goal of computer vision is for the machine to make important decisions, having a human in the loop for anything but the most controlled scenarios essentially violates that definition. One would get much more accurate results by using a laser range finder and building a mesh from the resulting point cloud.

Image Restoration

Image restoration is the enhancement of images for either human consumption or further machine analysis. Image restoration is essentially image processing, since the input and output are both images. The formal definition of image enhancement is to improve the detectability of important image details, while image restoration is the process of restoring a degraded image to an ideal condition. These differences are subtle, and the two terms will be used interchangeably for the rest of this paper.

The first type of image enhancement to be discussed is the contrast stretching operator. This operator broadens the range of gray levels in an image (Figure 4-4 and Figure 4-5). The most common type of contrast stretching is histogram equalization, which spreads the intensities of the image throughout the histogram [20]. Histogram equalization is accomplished using the cumulative distribution function (cdf). Using the cdf, each original pixel value is mapped onto a new value. The cdf is shown in Equation 4-6 and the mapping function in Equation 4-7.

F(x) = \sum_{x_i \le x} p(x_i)    (4-6)

map(v) = round\!\left( \frac{cdf(v) - cdf_{min}}{M \cdot N - cdf_{min}} \cdot (L-1) \right)    (4-7)

where
M*N := the number of pixels in the image
L := the number of grayscale levels in the image

As an example, consider the mock 5x5 image with grayscale intensities shown in Table 4-1. The goal is to equalize this image so that the values, which currently lie in the range 17-27, are better distributed over the range 0-255.

The number of occurrences of each intensity value is counted, as shown in Table 4-2. The cdf is then computed; this is simply a running sum of the counts at or below each value. These values are shown in Table 4-3.

Now that the cdf of the image is known, Equation 4-8 is used to map the old pixel values onto a more spread-out range. For example, the new value of a pixel with intensity 23 is calculated in Equation 4-9.

map(23) = round\!\left( \frac{cdf(23) - cdf_{min}}{M \cdot N - cdf_{min}} \cdot (L-1) \right)    (4-8)

map(23) = round\!\left( \frac{15 - 1}{25 - 1} \cdot (256 - 1) \right) = round(148.75) = 149    (4-9)

Therefore, each pixel with the value 23 is changed to a new value of 149. This process is repeated for every pixel in the image, and the resulting image has much higher contrast than the original. However, this operation only produces significant changes if the original image has gray values that are close in intensity.
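The mapping of Equations 4-6 through 4-9 can be summarized in a short sketch. The C++ fragment below is illustrative only; the function name equalize and the flat intensity vector are assumptions of this sketch, not part of the original text.

#include <cmath>
#include <iostream>
#include <vector>

// Minimal histogram-equalization mapping (Equations 4-6 and 4-7) for an
// 8-bit grayscale image stored as a flat vector of intensities.
std::vector<int> equalize(const std::vector<int>& pixels, int levels = 256)
{
    std::vector<int> hist(levels, 0), cdf(levels, 0);
    for (int v : pixels) ++hist[v];                          // intensity counts
    int running = 0;
    for (int v = 0; v < levels; ++v) { running += hist[v]; cdf[v] = running; }

    // cdf_min is the smallest nonzero cdf value (the darkest occupied bin).
    int cdfMin = 0;
    for (int v = 0; v < levels; ++v) if (cdf[v] > 0) { cdfMin = cdf[v]; break; }

    int n = static_cast<int>(pixels.size());
    std::vector<int> out(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i)
        out[i] = static_cast<int>(std::lround(
            double(cdf[pixels[i]] - cdfMin) / double(n - cdfMin) * (levels - 1)));
    return out;
}

int main()
{
    // First row of the mock image in Table 4-1, just to exercise the mapping.
    std::vector<int> pixels = {20, 19, 25, 21, 23};
    for (int v : equalize(pixels)) std::cout << v << ' ';
    std::cout << '\n';
}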

Image smoothing is another popular operation in image enhancement. If an image has some underlying structure that is to be enhanced and noise that is to be removed, smoothing operations can be used. There are three main types of smoothing operations. The first is the box filter, the most basic of all the filters. It simply averages a 5x5 region around a pixel and assigns the average value to that pixel (Equation 4-10). The image in Figure 4-6 is full of salt-and-pepper noise.

Out[r,c] = \frac{1}{25} \sum_{i=-2}^{2} \sum_{j=-2}^{2} In[r+i, c+j]    (4-10)

After applying the box filter, the image in Figure 4-7 is obtained. The box filter did very little to remove the noise; even worse, it caused the image to become blurry. A Gaussian filter (Equations 4-11 and 4-12) can also be used to reduce noise, but it has essentially the same shortcomings as the box filter. The Gaussian filter takes the values of the adjacent pixels and multiplies them by a weighting factor g(x,y):

g(x,y) = \frac{1}{2\pi\sigma^{2}} \, e^{-d^{2}/(2\sigma^{2})}    (4-11)

where

d = \sqrt{(x - x_c)^{2} + (y - y_c)^{2}}    (4-12)

The most commonly used filter for reducing salt-and-pepper noise is the median filter. Instead of averaging a 5x5 region around a pixel, the middle value of the sorted 25 intensity values is substituted for the pixel. The result of the median filter is shown in Figure 4-8; it essentially eliminated the noise from the image.
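A minimal sketch of the median filter described above follows. It assumes an 8-bit grayscale image stored row-major in a std::vector; leaving the border pixels untouched is a simplification of this sketch, not a requirement of the method.

#include <algorithm>
#include <iostream>
#include <vector>

// Minimal 5x5 median filter for an 8-bit grayscale image stored row-major.
// Border pixels (within two rows/columns of the edge) are left unchanged.
std::vector<unsigned char> medianFilter5x5(const std::vector<unsigned char>& in,
                                           int width, int height)
{
    std::vector<unsigned char> out(in);
    for (int r = 2; r < height - 2; ++r) {
        for (int c = 2; c < width - 2; ++c) {
            unsigned char window[25];
            int k = 0;
            for (int dr = -2; dr <= 2; ++dr)
                for (int dc = -2; dc <= 2; ++dc)
                    window[k++] = in[(r + dr) * width + (c + dc)];
            // The median is the middle element of the sorted 25 values.
            std::nth_element(window, window + 12, window + 25);
            out[r * width + c] = window[12];
        }
    }
    return out;
}

int main()
{
    std::vector<unsigned char> img(49, 100);   // flat 7x7 image ...
    img[24] = 255;                             // ... with one "salt" speck at the center
    std::cout << int(medianFilter5x5(img, 7, 7)[24]) << "\n";   // speck removed: prints 100
}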

Another major topic in image enhancement is edge detection. An edge is simply an image point of high contrast (high derivative); such points form the borders between different objects or mark abrupt changes in scenery. Most edge detection is done using neighborhood templates, or masks. Masks are also sometimes referred to as kernels. Figure 4-9 shows how a mask is applied to a pixel.

For each pixel in the image, the 3x3 mask is convolved with the pixel and its surrounding area. For example, to apply the 3x3 mask to the pixel marked with intensity 16, the following calculation is performed: [(-1*10) + (1*11) + (-1*12) + (1*15) + (2*16) + ...] / sum(mask). Dividing by the sum of the mask is sometimes left out of the computation.

To see how a mask is used to compute the contrast of an image, a one-dimensional case is considered (Figure 4-10). Given that the signal S is a sequence of samples from some function f, the derivative is approximated as in Equation 4-13. If the sample spacing is 1, then the mask M = [-1 1] can also approximate the derivative.

f'(x_i) \approx \frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i}    (4-13)

Similarly, the derivatives for 2D images can be computed in the same manner as for the 1D signal. However, there is one complication: because an edge might cut across the pixel array at an angle (Figure 4-11), it helps to average at least three estimates of the contrast, as shown in Equations 4-14 and 4-15.

\frac{\partial f}{\partial x} \approx \frac{1}{3}\left( \frac{I[x+1,y] - I[x-1,y]}{2} + \frac{I[x+1,y+1] - I[x-1,y+1]}{2} + \frac{I[x+1,y-1] - I[x-1,y-1]}{2} \right)    (4-14)

\frac{\partial f}{\partial y} \approx \frac{1}{3}\left( \frac{I[x,y+1] - I[x,y-1]}{2} + \frac{I[x+1,y+1] - I[x+1,y-1]}{2} + \frac{I[x-1,y+1] - I[x-1,y-1]}{2} \right)    (4-15)

Using Equations 4-14 and 4-15, masks can be assembled to compute the derivatives of images. Two of the most popular masks are the Prewitt and Sobel masks, both shown in Figure 4-12. The masks are virtually identical except that the Sobel mask emphasizes the center estimate.
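For illustration, a minimal C++ sketch of applying the Sobel masks of Figure 4-12 follows. The row-major image layout and the function name sobelMagnitude are assumptions of the sketch rather than anything prescribed in the text.

#include <cmath>
#include <iostream>
#include <vector>

// Apply the 3x3 Sobel masks to an 8-bit grayscale image (row-major vector)
// and return the gradient magnitude at each interior pixel. Borders are zero.
std::vector<int> sobelMagnitude(const std::vector<unsigned char>& img,
                                int width, int height)
{
    static const int gx[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
    static const int gy[3][3] = { {-1,-2,-1}, { 0, 0, 0}, { 1, 2, 1} };
    std::vector<int> mag(img.size(), 0);
    for (int r = 1; r < height - 1; ++r) {
        for (int c = 1; c < width - 1; ++c) {
            int sx = 0, sy = 0;
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc) {
                    int v = img[(r + dr) * width + (c + dc)];
                    sx += gx[dr + 1][dc + 1] * v;   // horizontal contrast estimate
                    sy += gy[dr + 1][dc + 1] * v;   // vertical contrast estimate
                }
            mag[r * width + c] = static_cast<int>(std::lround(std::hypot(sx, sy)));
        }
    }
    return mag;
}

int main()
{
    // 5x5 image with a vertical step edge between columns 2 and 3.
    std::vector<unsigned char> img(25, 0);
    for (int r = 0; r < 5; ++r) for (int c = 3; c < 5; ++c) img[r * 5 + c] = 255;
    std::cout << sobelMagnitude(img, 5, 5)[2 * 5 + 2] << "\n";   // strong response on the edge
}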

There are literally dozens of masks that can be used to modify images, such as the ripple mask (Figure 4-13), which is used for detecting corners in optical flow algorithms. For a more complete reference on masks used for image processing, one should read Digital Image Processing by Gonzalez and Woods.

Recognition

Pattern recognition is a subtopic of machine learning. It is the act of taking in raw data and taking an action based on the category of the pattern. Pattern recognition is a very broad field that encompasses making decisions about numerous types of data. Some applications of pattern recognition concepts include stock market prediction, speech recognition, and signal detection from devices such as radars. Here, however, the discussion of pattern recognition concepts will be restricted to the subject of computer vision. In many problems, one needs to decide whether an image contains a certain object. Sometimes the object is already given, and it is necessary to decide to which class the object belongs. Unlike most expert (knowledge-based) systems, which rely on the analytical skills of one or more experts in the field, the pattern recognition approach is normally a statistical one [21-22].

Neural networks are another very popular approach to pattern recognition. Neural networks are systems that try to mimic the neurons in the human brain. They consist of a set of neurons (simple computational elements) that are interwoven in a complex manner to exhibit complex global behavior [23] (Figure 4-14).

The main problem with neural networks is that they are very specific: they must be trained for every situation that will be encountered in a test. Given two outcomes, a probabilistic method will pick the outcome that has the highest chance of occurring, which more often results in a correct decision. The same cannot be said of neural networks. If a neural network encounters a situation that it has not been specifically trained to handle, the outcome is unknown.

In computer vision using real images, there are simply too many variables to account for every scenario. Uncontrollable factors such as weather, camera vibration, and time of day can all affect the accuracy of a neural network system. Because of this, a neural network system used for computer vision must use vast amounts of training data; for example, training a neural network for face detection required around 10,000 images for good results [24-25].

The general steps in most pattern recognition systems are shown in Figure 4-15. Some pattern recognition systems might omit some of these steps, while others add steps of their own. Other pattern recognition systems may include feedback between one or all of the steps. The importance of these steps will be illustrated using the example of the motorcycle in Figure 4-16. For this example, assume the goal is to find the motorcycle in the image and classify it as either a sport bike or a cruiser.

Sensing

The first step in the pattern recognition process is sensing. The sensor for most computer vision systems is a CCD or CMOS (complementary metal oxide semiconductor) camera. Both of these cameras convert light energy into electrical energy. For this image, sensing simply implies measuring the red, green, and blue wavelengths of light and combining them to form a color image.

Segmentation

The second step in the pattern recognition process is segmentation. Segmentation is the partitioning of an image into a set of regions that cover it; it is one of the most difficult problems in computer vision. Most pattern recognition methods assume the image is already segmented (pre-processed) and that a decision needs to be made about the object of interest.

However, this situation is rarely the case. For the test case image, before the motorcycle can be classified as either a sport bike or a cruiser, it is necessary to find the motorcycle in the image. Image segmentation usually requires expert knowledge of the image being analyzed. For example, if it is known that the object to classify is a blue motorcycle, one can simply eliminate areas of the image that do not match this criterion, such as the green trees or the gray asphalt. There are numerous features (color, texture, etc.) and algorithms that can be used for segmenting images. However, a preferred approach for finding objects in an image is to use a sliding search window. This approach will be discussed in the next chapter.

Feature Extraction

The next step in the pattern recognition process is feature extraction, one of the most important steps in pattern recognition. If the features used for classification are poor, then any corresponding classification scheme will also give poor results. Feature selection also requires expert knowledge of the object being classified, but unlike image segmentation, knowledge of the background is not needed. A good feature is one that is particular to one object and distinguishable from everything else. A good feature must also be invariant, meaning that it is still identifiable if the object undergoes slight permutations. In the test case image, a good feature for separating a sport bike from a cruiser would be the solid blue color of the motorcycle (most cruisers do not come in solid blue).

There are several features that can be used for detecting or classifying an object in an image, including color and texture; these features are discussed below. For this research, the approach will be limited to features that can be placed into histograms, as there are very useful techniques for analyzing histograms.

Color is the form of electromagnetic radiation that falls roughly between 400 and 700 nanometers. Most color in computer vision is handled using the RGB (red, green, blue) color basis.

Moreover, each color value usually ranges from 0 to 255. Any arbitrary color can be made by combining different values of red, green, and blue; for example, red = (255,0,0), green = (0,255,0), black = (0,0,0), and white = (255,255,255). Figure 4-17 shows the entire RGB color spectrum.

To compute a histogram of the image, one can simply divide the image into its red, green, and blue channels. To compute the histogram of a single-channel image, one divides the x-axis into a series of bins; for a typical image, the range will be 0-255. Each pixel in the image is then considered: given the intensity of the pixel, one simply adds a count to its corresponding bin. The result will look similar to Figure 4-18.

Texture is another feature that can be used for object detection. Texture gives information about the spatial arrangement of the colors or intensities of an image. One measure of texture is the edgeness per unit area, as defined by Equation 4-16.

F_{edgeness} = \frac{|\{ p : Mag(p) \ge T \}|}{N}    (4-16)

This equation simply means that, given an area (say a 5x5 kernel), one counts the number of pixels in that kernel whose gradient magnitude is above a certain threshold T. The result is then divided by the number of pixels N in the kernel. This process is repeated as the kernel is moved across the image, and the results can then be plotted in a histogram with the varying degrees of edgeness as the bin parameter. In Figure 4-19, the left image has an edgeness of 16/25, while the image on the right has an edgeness of 21/25.

Another very simple but very useful measure of texture is the local binary partition. For each pixel p in the image, the eight neighbors are examined to see whether their intensity is greater than that of p. The results from the eight neighbors are used to construct an eight-digit binary number b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8, where b_i = 0 if the intensity of the ith neighbor is less than or equal to that of p, and 1 otherwise. A histogram of these numbers is used to represent the texture of the image.
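A minimal sketch of the local binary partition measure described above is shown below. The neighbor ordering and the 256-bin histogram layout are reasonable assumptions made for illustration; the original text does not fix them.

#include <cstdint>
#include <iostream>
#include <vector>

// Each interior pixel is encoded as an 8-bit number built from comparisons
// with its eight neighbors (bit = 1 when the neighbor is brighter than the
// center). A 256-bin histogram of these codes summarizes the texture.
std::vector<int> localBinaryHistogram(const std::vector<unsigned char>& img,
                                      int width, int height)
{
    // Neighbor offsets, visited clockwise starting at the upper-left pixel.
    static const int dr[8] = {-1, -1, -1,  0, 1, 1,  1,  0};
    static const int dc[8] = {-1,  0,  1,  1, 1, 0, -1, -1};
    std::vector<int> hist(256, 0);
    for (int r = 1; r < height - 1; ++r) {
        for (int c = 1; c < width - 1; ++c) {
            unsigned char center = img[r * width + c];
            std::uint8_t code = 0;
            for (int k = 0; k < 8; ++k) {
                unsigned char neighbor = img[(r + dr[k]) * width + (c + dc[k])];
                if (neighbor > center) code |= static_cast<std::uint8_t>(1u << k);
            }
            ++hist[code];   // one vote per pixel for its texture code
        }
    }
    return hist;
}

int main()
{
    std::vector<unsigned char> img = {10, 20, 10, 20, 15, 20, 10, 20, 10};  // 3x3 test patch
    std::vector<int> hist = localBinaryHistogram(img, 3, 3);
    for (int code = 0; code < 256; ++code)
        if (hist[code] > 0) std::cout << "code " << code << ": " << hist[code] << "\n";
}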

While the local binary partition is a very simple texture measure, it is sensitive to changes in the orientation of the image, such as rotation.

Classification

The fourth step in the pattern recognition process is classification. The job of the classifier is to use the features to assign the object to a category. There are literally dozens of classification methods used to determine an object's class membership; a good reference for these methods is Pattern Classification by Duda, Hart, and Stork [26].

Most traditional classification techniques involve building a classifier using training data from each class. The user takes a feature (e.g., color) that distinguishes the classes from each other and builds a normalized histogram of the training objects (Figure 4-20). Once this histogram is constructed, the user can apply any number of methods to determine the class membership of new test images.

Some classification methods are as simple as the nearest mean: the test image is assigned to the class whose feature mean is closest. Other simple methods include nearest neighbor analysis, in which the test image is assigned to the class that has a data point closest to the test image (Figure 4-21).

More general methods include Bayes' formula (Equations 4-17 and 4-18). This method allows the experimenter to scale his or her results by introducing prior knowledge. In the motorcycle example, if the experimenter knows he or she is more likely to see sport bikes than cruisers, the classification method can be scaled appropriately.

p(w_j \mid x) = \frac{p(x \mid w_j) \, P(w_j)}{\sum_{j=1}^{n} p(x \mid w_j) \, P(w_j)}    (4-17)

posterior = \frac{likelihood \times prior}{evidence}    (4-18)

The likelihood is exactly the same as a normalized histogram. The prior can be based on expert knowledge. The evidence is simply a scaling factor that ensures the posterior value lies between 0 and 1. Given a test image, one can now calculate the posterior probability that it belongs to a certain class; the class that gives the highest posterior probability is chosen.

As stated before, there are several methods used for classification. Some methods assume the likelihood is Gaussian. Some methods allow risk factors to be included. Other methods, such as maximum-likelihood algorithms, allow one to determine the parameters of the likelihood if they are unknown. Once again, the text by Duda, Hart, and Stork provides a good reference for these techniques.

Post Processing

Post processing simply means that another step can be included before a final decision is made. In the test case example with the motorcycles, if the probabilities that the object was a sport bike or a cruiser were almost identical, it could be decided that another action is necessary. Post processing is important in applications where a wrong decision could be disastrous. If one has an automated gun turret that is supposed to shoot anyone who is armed but let everyone else pass, a wrong decision would be catastrophic. Therefore, one can set a threshold in post processing so that the gun will not fire unless the probability that the person is armed is extremely high. Even better, the post processor could be a human who has the final say on whether to shoot.
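To make Equations 4-17 and 4-18 concrete, the following C++ sketch combines the posterior calculation with the post-processing threshold discussed above. The likelihoods, priors, and the 0.8 threshold are illustrative numbers only, not values from this research.

#include <iostream>
#include <vector>

// Compute class posteriors from per-class likelihoods (read from normalized
// feature histograms) and priors, deferring the decision when no class is
// sufficiently probable. Returns the winning class index, or -1 to defer.
int classify(const std::vector<double>& likelihoods,
             const std::vector<double>& priors, double minPosterior)
{
    double evidence = 0.0;                         // denominator of Equation 4-17
    for (std::size_t j = 0; j < likelihoods.size(); ++j)
        evidence += likelihoods[j] * priors[j];

    int best = -1;
    double bestPosterior = 0.0;
    for (std::size_t j = 0; j < likelihoods.size(); ++j) {
        double posterior = likelihoods[j] * priors[j] / evidence;
        if (posterior > bestPosterior) { bestPosterior = posterior; best = static_cast<int>(j); }
    }
    return bestPosterior >= minPosterior ? best : -1;   // -1 means "take another action"
}

int main()
{
    std::vector<double> likelihoods = {0.30, 0.12};  // p(x | sport bike), p(x | cruiser)
    std::vector<double> priors      = {0.70, 0.30};  // sport bikes assumed more common
    std::cout << "decision: " << classify(likelihoods, priors, 0.8) << "\n";
}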

More on Feature Selection

As stated earlier, a pattern recognition system's weakest link is its features. If a human's choice of features is poor, no pattern recognition system will be able to overcome this hurdle. Feature selection is the most important part of designing any pattern recognition system, and yet it is the least discussed in most textbooks. On the other hand, there are hundreds of research papers that address the topic of feature selection. The following paragraphs address some of the most applicable ones.

In 1996, Richeldi and Lanzi introduced a program called ADHOC, which aids in the feature selection process [27]. This tool eliminated redundant and irrelevant features that could be detrimental to system performance. The program used tools such as principal component analysis to reduce a high-dimensional data set into a smaller one. Principal component analysis reduces the dimensionality of a data set by retaining the characteristics that contribute most to its variance, keeping low-order principal components and ignoring higher-order ones. However, principal component analysis simply organizes the data in terms of highest variance: it tells the user which features are least correlated, so that redundancy in the computation can be reduced. The user must still determine which features are best for tracking the desired object. This simple transformation of features is generally called feature extraction rather than feature selection.

Also in 1996, Zongker and Jain performed a comparison of feature selection algorithms [28]. Some of these algorithms include sequential forward selection and sequential backward selection. Sequential forward selection starts with an empty feature subset. In each iteration, exactly one feature is added to the subset: the algorithm adds each feature not already in the subset and tests the accuracy of a classifier built on the tentative feature subset. Sequential backward selection is similar to forward selection, but it starts with all the features and eliminates, at each step, the feature whose removal results in the classifier with the least error.

One of the best results from this research shows that each algorithm performs poorly when the training data set is small (fewer than 20 samples).

In 1997, Lanzi introduced a way of using genetic algorithms to reduce the number of redundant features [29]. A genetic algorithm is a search technique that finds solutions to optimization problems. His research shows that this genetic algorithm is at least 10 times faster than any previous genetic algorithm or principal component analysis.

In 2005, Liu, Dougherty, and others once again showed that most feature selection algorithms perform poorly when the training data set is small and the original dimensionality of the feature set is high [30]. Most research papers tend to agree that traditional feature selection algorithms based on supervised or unsupervised learning give poor results when given a high initial feature set and a small set of training data.

There are currently new methods being developed for feature selection using information theory. Information theory is a branch of mathematics and engineering that focuses on the quantification of information. One of its key concepts is entropy, which quantifies the uncertainty of a random variable (Equations 4-19 and 4-20). Equation 4-19 shows how to calculate the entropy of a discrete probability density function, while Equation 4-20 shows the entropy for the continuous case.

H = -\sum_{i=1}^{n} P_i \log P_i    (4-19)

H = -\int p(x) \ln p(x) \, dx    (4-20)
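A minimal sketch of the discrete entropy of Equation 4-19 (computed in bits) is shown below; the example distributions are illustrative only.

#include <cmath>
#include <iostream>
#include <vector>

// Entropy (in bits) of a discrete probability distribution, e.g. a
// normalized feature histogram.
double entropyBits(const std::vector<double>& p)
{
    double h = 0.0;
    for (double pi : p)
        if (pi > 0.0) h -= pi * std::log2(pi);   // zero-probability bins contribute nothing
    return h;
}

int main()
{
    std::vector<double> uniform = {0.25, 0.25, 0.25, 0.25};   // maximal uncertainty
    std::vector<double> peaked  = {0.97, 0.01, 0.01, 0.01};   // nearly certain
    std::cout << entropyBits(uniform) << " vs " << entropyBits(peaked) << "\n";  // 2 vs ~0.24
}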

The relative entropy between two probability density functions (Equation 4-21) can also be used as a measure for reducing redundant features [31]. This measure of the entropy between two probability density functions is termed the Kullback-Leibler divergence; it will be discussed further in the next chapter.

D_{kl} = \sum_{i=1}^{n} q_i \log \frac{q_i}{p_i}    (4-21)

In summary, this chapter has shown that computer vision is a very broad topic. Several research topics have been presented, such as image restoration and object recognition. The fundamentals of pattern recognition as it applies to computer vision have also been discussed. Feature selection is the most important part of any pattern recognition system, and various research papers on this topic have been presented. These feature selection algorithms include traditional methods, which fail for small sample sizes. A more novel approach to feature selection using information theory has also been presented. In Chapter 5, a new method for feature selection is proposed which can be used for very small sample sizes. This method will work regardless of the sample size or the number of initial features available for selection. Moreover, future users will be able to expand the feature set with minimal effort, and the feature selection method will be unaffected.

Table 4-1: Mock 5x5 grayscale image
20 19 25 21 23
22 24 24 22 25
21 24 23 23 19
23 24 22 24 18
27 22 21 25 17

Table 4-2: Pixel intensity count
Value  Count    Value  Count    Value  Count
17     1        21     3        25     3
18     1        22     4        26     0
19     1        23     4        27     1
20     1        24     5

Table 4-3: Cumulative distribution function
Value  cdf      Value  cdf      Value  cdf
17     1        21     7        25     23
18     2        22     11       26     23
19     3        23     15       27     24
20     4        24     20

Figure 4-1: Raster orientation
Figure 4-2: Point correspondence for optical flow

Figure 4-3: Finding motion of multiple objects
Figure 4-4: Image before histogram equalization
Figure 4-5: Image after histogram equalization

Figure 4-6: Image containing salt-and-pepper noise
Figure 4-7: Application of box filter

Figure 4-8: Application of median filter
Figure 4-9: Application of mask to a pixel

Figure 4-10: Computing derivative of a discrete 1-dimensional signal
Figure 4-11: Edge at an angle

Figure 4-12: Prewitt and Sobel masks
Figure 4-13: Mask used for corner detection
Figure 4-14: Simple neural network

Figure 4-15: General pattern recognition process
Figure 4-16: Extremely fast motorcycle with no mirrors

Figure 4-17: The RGB cube
Figure 4-18: Histogram of red, green, and blue channels of an arbitrary image

Figure 4-19: Measuring edgeness of an image
Figure 4-20: Normalized histogram (probability density function) of two classes

Figure 4-21: Nearest neighbor analysis

CHAPTER 5
METHODOLOGY

This chapter proposes a method for feature selection that can be used on objects with relatively stationary statistics. The method can also be implemented with small sample sizes and a high-dimensional feature set. By incorporating a simple object-oriented approach, easy expansion of the feature set can be achieved. Moreover, this method of feature selection does not require any traditional training, which causes problems for so many feature selection algorithms.

The goal of this research is to vastly simplify the feature selection process. A set of predefined features is organized into a database. These features can be simple, such as color, texture, or shape, or complicated, like Haar wavelets and features derived from principal component analysis. The user simply selects the object of interest in a minimum of two consecutive images; the method then informs the user of which features will be best for recognizing that object in future images.

The following is an example of feature selection using color. Assume the user wishes to track the Mustang shown in Figure 5-1. The image of the Mustang is split into its corresponding red, green, and blue channels, as shown in Figure 5-2. A histogram of the red, green, and blue images is then constructed, as shown in Figures 5-3 through 5-5. The user then selects the object in another image (Figure 5-6). Similarly, histograms are built from the red, green, and blue color channels of the second selected image (Figures 5-7 through 5-9).

Now that histograms of both images have been assembled, a way to compare these histograms must be developed. Many methods exist to compare the similarity of two histograms. Two methods that are used extensively for comparing Gaussian histograms are the Euclidean distance and the Dunn index (Equations 5-1 and 5-2). The Euclidean distance is simply the distance between the means of the histograms.

The Dunn index takes into account the variance (spread) of the data. Unfortunately, most histograms taken from real images rarely have a Gaussian distribution; therefore, these two methods of comparison will give poor results for most real applications.

d = \sqrt{(\mu_2 - \mu_1)^2}    (5-1)

d = \frac{\mu_2 - \mu_1}{\sigma_1^2 + \sigma_2^2}    (5-2)

As stated in the previous chapter, recent approaches to feature selection use information theory. More specifically, these approaches rank the similarity between two histograms using relative entropy (Equation 4-21). At first impression, this relative entropy (Kullback-Leibler) divergence seems to be an excellent way to compare the similarity between two histograms; however, it will be shown in the Feature Development chapter why this distance fails for the scenarios considered here. Another excellent method for comparing the similarity of two histograms is the L1 norm, also known as the city block metric (Equation 5-3).

L_1 = \sum_{i=1}^{n} |O_i - E_i|    (5-3)

After the distance between histograms has been measured, the program checks whether this distance is below some threshold. If the distance between histograms is below the threshold, the program selects the feature as good. Table 5-1 shows a list of preliminary features that will be implemented for this research.

As one can probably see, the method of measuring the distance between histograms will not be completely applicable to every feature in Table 5-1. For example, storing the number of circles in a histogram would not yield any relevant information.
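A minimal sketch of the L1 comparison of Equation 5-3, together with the threshold test described above, follows. The two example histograms and the 0.25 threshold are illustrative values only, not measurements from this research.

#include <cmath>
#include <iostream>
#include <vector>

// L1 (city block) distance between two normalized histograms of equal length.
double l1Distance(const std::vector<double>& h1, const std::vector<double>& h2)
{
    double d = 0.0;
    for (std::size_t i = 0; i < h1.size(); ++i)
        d += std::fabs(h1[i] - h2[i]);        // bin-by-bin absolute difference
    return d;
}

int main()
{
    // Two normalized 4-bin histograms of the same feature from two frames.
    std::vector<double> frame1 = {0.10, 0.55, 0.30, 0.05};
    std::vector<double> frame2 = {0.12, 0.50, 0.33, 0.05};
    double d = l1Distance(frame1, frame2);                  // 0.10
    std::cout << "L1 = " << d << (d < 0.25 ? "  -> good feature\n"
                                           : "  -> rejected\n");
}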

Any variations in computing the distance measures will be described in the next chapter, titled Feature Development.

Once the good features have been selected, the next step is to actually find the object in future images. For this approach, a growing search window will be used, as shown in Figure 5-10. It was shown in the previous chapter, The Human Vision System, that human sight is holistic in nature, making traditional pattern classification techniques unusable; by using this growing search window, the holistic nature of the human vision system can be mimicked. Everything contained in the search window is considered a possible object of interest. The properties of the region inside the rectangle are then compared to the previously selected features. This window slides over the entire image. If the properties inside the search window are similar to the properties of the good features, the search window returns a hit for the object of interest. If the distance between the region inside the rectangle and the feature classes is too large, the region inside the rectangle is rejected. As the search window is moved across the image, it will most likely report several hits for the same object, as shown in Figure 5-11. To eliminate this redundancy, if a hit from a smaller search window lies inside a hit from a larger search window, the hit from the smaller search window is deleted (Figure 5-12); a short sketch of this scan-and-suppress step is given at the end of this section.

As stated at the beginning of this chapter, this method will be useful for tracking objects that possess relatively stationary statistics. This statement implies that the object being tracked has statistics that will not vary with regard to each feature class. As an example, it is assumed that the object's texture and shape will remain relatively constant throughout the duration of any testing. In regard to texture and shape, solid objects of constant properties, such as automobiles or textbooks, will be great for testing. Amorphous objects such as a handful of rocks or water will not fare very well using this method.
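The following C++ sketch outlines the scan-and-suppress step referenced above. The window sizes, step lengths, and the matchesGoodFeatures callback are assumptions of this sketch; the actual application's parameters may differ.

#include <iostream>
#include <vector>

struct Rect { int x, y, w, h; };

// True if `inner` lies entirely inside `outer`.
bool contains(const Rect& outer, const Rect& inner)
{
    return inner.x >= outer.x && inner.y >= outer.y &&
           inner.x + inner.w <= outer.x + outer.w &&
           inner.y + inner.h <= outer.y + outer.h;
}

std::vector<Rect> scanImage(int imgW, int imgH,
                            bool (*matchesGoodFeatures)(const Rect&))
{
    std::vector<Rect> hits;
    // Grow the window, then slide it over the image at each size.
    for (int size = 32; size <= imgW && size <= imgH; size *= 2)
        for (int y = 0; y + size <= imgH; y += size / 4)
            for (int x = 0; x + size <= imgW; x += size / 4)
                if (matchesGoodFeatures(Rect{x, y, size, size}))
                    hits.push_back(Rect{x, y, size, size});

    // Redundancy elimination: drop any hit nested inside a larger hit.
    std::vector<Rect> kept;
    for (std::size_t i = 0; i < hits.size(); ++i) {
        bool nested = false;
        for (std::size_t j = 0; j < hits.size(); ++j)
            if (i != j && hits[j].w > hits[i].w && contains(hits[j], hits[i]))
                nested = true;
        if (!nested) kept.push_back(hits[i]);
    }
    return kept;
}

bool alwaysMatch(const Rect&) { return true; }   // placeholder feature comparison

int main()
{
    std::cout << scanImage(256, 256, &alwaysMatch).size() << "\n";  // 1: only the largest hit survives
}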

Another assumption when using this method is that the environmental variables (mainly lighting) remain relatively constant. Although the user cannot completely control the environment, he or she should try to minimize any drastic changes that could negatively impact the distance measures for each feature class. In most practical situations, putting forth a modest effort to ensure constant lighting is more than enough to obtain reasonable results.

The accuracy of this method will be documented using a confusion matrix, as described in Table 3-1. One of the assumptions in using this method is that the objects being tracked will have relatively fixed statistics; because of this assumption, solid objects such as automobiles will be used in the outdoor testing scenarios. The other assumption is that the environmental variables remain relatively constant; therefore, testing will be conducted in conditions that provide a constant and adequate lighting source. The optimal condition for outdoor testing will be midday under clear skies.

Lighting is usually not a problem for indoor testing scenarios, as the source typically remains constant and can be controlled by the user. Once again, the object being tracked must possess relatively constant properties. Objects used for indoor tracking will be left to the discretion of the user.

The overall goal of this research is to drastically reduce the feature selection process. This method will be compared to two well-known methods for object detection: artificial neural networks and template matching. As stated before, a neural network classifier provides laudable results when given adequate training data. However, gathering adequate training data is usually a very time-consuming task, as most neural networks require thousands of training images for appreciable results.

The second well-known method that will be used for comparison is template matching. Template matching is a method often used in industrial processes where machines inspect products under extremely controlled conditions. Examples of template matching include quality control applications where machines examine circuit boards for defects or examine screws for thread defects. Template matching requires very little training. Unfortunately, the results of template matching are not useful unless the environment is completely controlled.

This new method is designed to give the user the robustness of a neural network classifier with the user-friendliness of a template matching algorithm. All three methods will be tested under the conditions described above, and the results will be documented in the previously mentioned confusion matrix.

Chapter 6 will show the development of the individual feature classes shown in Table 5-1. After the creation of all feature classes is explained, the process of assembling the feature classes into a fully working algorithm will be described. Finally, the working program will be demonstrated in its entirety.

Table 5-1: Preliminary feature list
RGB ColorSpace         Image Weight               GrayScale Modal Analysis
GrayScale Intensity    Image Moment of Inertia    Texture Modal Analysis
HSV ColorSpace         RGB Modal Analysis         Number of Circles
Texture                HSV Modal Analysis         Number of Rectangles

Figure 5-1: Custom Mustang
Figure 5-2: Red, green, and blue channels of Mustang image

Figure 5-3: Histogram of red channel of Mustang
Figure 5-4: Histogram of green channel of Mustang

Figure 5-5: Histogram of blue channel of Mustang
Figure 5-6: Second image of Mustang

Figure 5-7: Histogram of red channel for second Mustang
Figure 5-8: Histogram of green channel for second Mustang

Figure 5-9: Histogram of blue channel for second Mustang
Figure 5-10: Small sliding search window

Figure 5-11: Example of multiple hits on the same object
Figure 5-12: Elimination of multiple hits

CHAPTER 6
FEATURE DEVELOPMENT

This chapter details the steps in creating the program. The twelve feature classes to be used in the final program will each be given special attention, and any variations from the Methodology chapter will be described in detail. After the creation of each feature class is demonstrated, the next chapter will give a brief overview of how the feature classes are assembled into the final program.

The RGB Class

The first feature class to be discussed is the red, green, and blue (RGB) color space. As shown in Figure 4-18, a color image can be divided into its corresponding RGB color space. Once the image is separated, a histogram of each color channel can be created. To determine whether or not the color space is useful for tracking, the L1 and DKL distances between two images of the same object are calculated. If the distances are below a certain threshold, then this color space is selected as a candidate for finding the object in future images.

The first step is determining the threshold for classifying this color space as a usable feature. The distance between two objects is influenced by numerous factors, including the histogram bin size, variations in lighting, shadows, and object transformations and rotations. However, one of the assumptions in using this algorithm is that the lighting will be kept relatively constant; because of this assumption, changes in lighting and shadows will be kept to a minimum. Rotations of the object of interest will not play a major factor in the distance calculations, since all RGB distance measures are calculated from a histogram, which is itself a simple pixel value count.

The major factor affecting the distances between two objects is the histogram bin size. An appropriate bin size is one which minimizes the variations when tracking the same object, but still provides good separation between different objects.

Figure 6-1 and Figure 6-2 show two very similar objects that could possibly be used for tracking. Figure 6-3 shows how the L1 and DKL distances vary as the number of bins is increased for the objects in Figures 6-1 and 6-2. As expected, the distances increase as the number of bins increases. The figure shows the distance comparisons of the red channel as a function of bin size.

The plot of bin size versus either distance measure does not yield any helpful information; there is no magic number of bins that minimizes the distance between similar objects. To find a reasonable number of bins for tracking, a different approach is needed. Instead of focusing on how bin sizes affect the distance measures, the focus instead needs to be on the variations of a single object. By determining the color variations under normal circumstances, the bin size can be calculated in return. To accomplish this task, multiple images of a solid color will be taken under relatively constant lighting conditions. The only variations will be in orientation and distance. Although the lighting on the object will be indirectly affected by the object's transformations, direct manipulation of the light source will not be part of the testing.

To accomplish this testing for the RGB color space, a simple C++ GUI program was created. As shown in Figure 6-4, this program calculates the average red, green, and blue intensity values for everything contained inside the green box. This tool will be used on several objects to determine the variations of color caused by the transformations of that object in space. The objects used for testing will be homogeneous in color and have constant properties, as stated in the above testing scenarios. Also, this bin size analysis will be performed indoors to minimize any changes in lighting.

Throughout the testing, the green channel almost always had the greatest variance of the three channels (red, green, and blue). This result was expected, as the green channel has the greatest effect on an object's luminance, as shown by Equation 6-1 (the National Television System Committee (NTSC) conversion of a color image into grayscale). Because the only changes on the object were indirect changes in lighting, it makes sense that the green channel would have the biggest influence. Table 6-1 shows some of the sample variations that were encountered in the testing scenarios. The numbers in the table are the minimum and maximum intensity values for each channel (red, green, blue) while tracking an object of solid color. These values were obtained using the program in Figure 6-4.

GrayScale = 0.3 \cdot Red + 0.59 \cdot Green + 0.11 \cdot Blue    (6-1)

As shown in Table 6-1, the variations caused solely by transformations of the object were anywhere between 5 and 37 intensity values. From this observation, a bin size of 40 intensity values should encompass the changes caused by most spatial transformations of a desired object. This bin size of 40 intensity values will now be used in determining whether or not the RGB color space is a good feature to use for tracking.

Using the bin size of 40 intensity values, the next step in testing is determining the thresholds for the L1 and DKL measures in order to test whether a feature is reliable for tracking. The method used to determine the thresholds is the same as that used for determining the bin size: several random objects were selected and their orientation was shifted in space, with the lighting once again kept constant. The variations in the L1 and DKL measures were then calculated. A plot of the L1 distances is shown in Figure 6-5.

As shown in Figure 6-5, the mean L1 distance (a normalized value between 0 and 1 for calculating the distance between histograms) when tracking the same object was around 0.1 units.

The maximum L1 distance encountered when repeatedly tracking the same object was slightly above 0.15 units. Based on these results, a very generous L1 threshold of 0.25 units was chosen. This threshold should be more than enough to compensate for any variations caused by simple geometric transformations of the object.

NOTE: The selected bin size and L1 threshold are tunable parameters that are completely left to the discretion of the user. Prior knowledge of the objects destined to be tracked would help in selecting these parameters. Since this application is designed to be very general in the objects being tracked, the selected bin size and L1 threshold are slightly larger than anything encountered in testing.

Several problems were encountered when calculating a threshold for the DKL distance. The first major problem is that DKL is nonlinear; this nonlinearity makes determining a threshold for a general set of objects very impractical. Second, because of the nature of the logarithmic function, if a bin contains zero counts, the DKL distance will explode to infinity. These problems make the DKL distance unreliable and not very useful for classification. Because of them, the DKL, as defined again in Equation 6-2, will not be used in the application development.

D_{kl} = \sum_{i=1}^{n} q_i \log \frac{q_i}{p_i}    (6-2)

In summary, the results of testing for the RGB class show that a bin size of 40 accounts for most variations in solid colors. The testing also showed that an L1 threshold of 0.25 should account for most geometric transformations of solid objects. These values for the RGB class will be used in the fully working application.

Grayscale Class

Grayscale intensity is another useful feature for object tracking. Instead of having three separate channels like the RGB color space, grayscale has only one channel. Equation 6-1 is used to convert an image from the RGB color space to grayscale; the result is shown in Figure 6-6. Once the image has been converted to grayscale, the process of determining the feature parameters is exactly the same as for the RGB color space.

The grayscale feature proved to be somewhat more robust than the RGB feature class. Using the same methodology as the RGB feature class, a bin size of 30 was deemed sufficient. The L1 threshold of 0.25 was kept the same as for the RGB feature class.

HSV Class

The hue, saturation, and value (HSV) color space is also useful for computer vision applications. The HSV color space (Figure 6-7) separates the intensity (value) of the image from two parameters encoding the image's chromaticity (hue and saturation).

The HSV color space was conceived to find a color representation that is more conceptually accessible to humans [32]. In the RGB color space, humans have a difficult time visualizing which combination of red, green, and blue values must be chosen to produce a particular color; unless dealing with the three primary colors (red, green, blue), choosing the correct combination simply comes with experience. In the HSV color space, the color is determined by the hue, the richness of the color is determined by the saturation, and the brightness of the color is determined by the value. For some people, this color representation seems more natural than the RGB color space. The HSV color space sometimes provides better results for computer vision systems, because the program can compensate for lighting (value) and focus on the parameters of the object's surface [33].

The hue is defined as an angle between 0 and 360 degrees. The saturation and value are both defined as floating point numbers in the range [0-1].

Because most cameras deliver the image to the user in the RGB color space, it must first be converted from RGB to HSV. The first step is rescaling the RGB range from integer values of [0-255] into floating point values in the range [0-1]. This simple conversion is shown in Equation 6-3. For example, because the maximum value of the red, green, or blue channel is 255, a red pixel value of 127 becomes 127/255 = 0.498.

RGBFloatingPoint = \frac{RGBInteger}{MaxRGBInteger}    (6-3)

Now that the RGB values are floating point values in the range [0-1], the hue can be calculated. For each pixel, let max be the greatest of the red, green, and blue values and min the least. The calculation of the hue is shown in Equations 6-4 through 6-7.

H = \left( 60 \cdot \frac{G - B}{max - min} \right) \bmod 360, \quad \text{if } max = R    (6-4)

H = 60 \cdot \frac{B - R}{max - min} + 120, \quad \text{if } max = G    (6-5)

H = 60 \cdot \frac{R - G}{max - min} + 240, \quad \text{if } max = B    (6-6)

H = 0, \quad \text{if } max = min    (6-7)

In Equation 6-4, the mod 360 simply means adding 360 degrees if the value of the hue is less than zero. Now that the hue has been calculated, the next step is calculating the saturation of the pixel, shown in Equations 6-8 and 6-9. Finally, the calculation of the value is shown in Equation 6-10.

S = 0, \quad \text{if } max = 0    (6-8)

S = \frac{max - min}{max}, \quad \text{if } max \neq 0    (6-9)

V = max    (6-10)
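The conversion of Equations 6-3 through 6-10 can be written compactly. The following C++ sketch is illustrative only (the function name rgbToHsv and the 8-bit inputs are assumptions); it returns the hue in degrees with saturation and value on [0-1].

#include <algorithm>
#include <cmath>
#include <iostream>

// Convert 8-bit RGB channel values to hue (degrees), saturation, and value.
void rgbToHsv(int r8, int g8, int b8, double& h, double& s, double& v)
{
    double r = r8 / 255.0, g = g8 / 255.0, b = b8 / 255.0;   // Equation 6-3
    double mx = std::max({r, g, b});
    double mn = std::min({r, g, b});
    double delta = mx - mn;

    if (delta == 0.0)   h = 0.0;                                              // Equation 6-7
    else if (mx == r)   h = std::fmod(60.0 * (g - b) / delta + 360.0, 360.0); // Equation 6-4
    else if (mx == g)   h = 60.0 * (b - r) / delta + 120.0;                   // Equation 6-5
    else                h = 60.0 * (r - g) / delta + 240.0;                   // Equation 6-6

    s = (mx == 0.0) ? 0.0 : delta / mx;   // Equations 6-8 and 6-9
    v = mx;                               // Equation 6-10
}

int main()
{
    double h, s, v;
    rgbToHsv(200, 30, 30, h, s, v);       // a strongly saturated red
    std::cout << "H=" << h << " S=" << s << " V=" << v << "\n";
}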

Now that the hue, saturation, and value parameters have been calculated, their variance needs to be tested. In a manner similar to that of the RGB and grayscale classes, a C++ GUI program was developed to test the variance of the HSV color space when objects of a solid color were transformed in space. This program is shown in Figure 6-8.

Once again, the first step was determining a bin size that accounts for normal spatial transformations of the object. One of the main advantages of the HSV color space was its claimed resistance to minor changes in illumination, and in testing the results did not disappoint. Normal spatial transformations of objects caused very minor changes in the hue and saturation calculations. Regardless of the object's color, the maximum change in the hue during normal geometric transformations was 15, on a scale of [0-360]. From these observations, a bin size of 20 should be plenty to cover most normal conditions.

The saturation was also very resistant to changes in the object's orientation. On a floating point scale of [0-1], the maximum variation detected was 0.1. For ease of programming, the saturation will be rescaled onto the integer interval [0-100]; on this new interval, a bin size of 15 will be chosen to account for most scenarios.

The value in HSV had variations similar to those of the RGB and grayscale classes. On a floating point scale of [0-1], the maximum variation was 0.2. The value will also be rescaled onto an integer scale of [0-100], and a bin size of 25 will be implemented in the program. Using the same procedures as the previous classes, the L1 thresholds for the hue and saturation were selected as 0.2, and the L1 threshold for the value was selected as 0.25.

Texture Class

This fourth feature class marks a departure from traditional color spaces. Texture is a feature class that gives information about the spatial arrangement of the color space of an image, rather than the actual colors themselves. The concept of texture is closely related to that of edge detection; a brief overview follows.

The first step in determining an image's texture is performing edge detection on that image. There are several operators capable of performing edge detection; Figure 4-12 shows the Sobel and Prewitt operators. While several edge detection algorithms exist, the Sobel and Prewitt operators are the simplest. As an example, Figures 6-9 through 6-11 show the results of the Sobel operator versus the more complicated Canny and Laplacian filters.

Looking at the results for these operators, the Laplacian operator tends to produce too much clutter to perform any useful sort of texture comparison. The Sobel and Canny operators provide good separation between areas of high and low texture, while the Laplacian operator fills all entities with noise. Although the Laplacian operator certainly has its purpose, the Sobel and Canny operators are more useful for this texture class. Once edge detection has been performed on the image, the texture can be calculated as shown previously in Figure 4-19.

The texture measure to be used for this class is the edgeness per unit area. In this class, a 3x3 kernel is moved across the image and the edgeness values are placed into histograms. This 3x3 kernel was chosen because of concerns related to calculating the bin size: the objects to be tracked vary in size, and if the object being tracked becomes very small, a kernel that is too large will not give enough hits on the object to perform accurate calculations. Using a 3x3 kernel will give accurate results for general object tracking applications. Based on this kernel size, the edgeness can take on a maximum of 10 values; because of this, a histogram with 10 bins was selected for initial testing. Using the same method as the previous classes, an L1 threshold of 0.20 was selected.
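A minimal sketch of the 3x3 edgeness histogram described above follows. It assumes a binary edge map is already available (for example, from the Sobel or Canny operator) and bins each interior pixel by the number of edge pixels in its neighborhood.

#include <iostream>
#include <vector>

// Edgeness per unit area (Equation 4-16) with a 3x3 kernel: each interior
// pixel votes into one of 10 bins, indexed by the number of edge pixels
// (0 through 9) in its 3x3 neighborhood.
std::vector<int> edgenessHistogram(const std::vector<unsigned char>& edges,
                                   int width, int height)
{
    std::vector<int> hist(10, 0);
    for (int r = 1; r < height - 1; ++r) {
        for (int c = 1; c < width - 1; ++c) {
            int count = 0;
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc)
                    count += edges[(r + dr) * width + (c + dc)] ? 1 : 0;
            ++hist[count];
        }
    }
    return hist;
}

int main()
{
    std::vector<unsigned char> edges(25, 0);   // 5x5 edge map ...
    edges[12] = 1;                             // ... with a single edge pixel at the center
    std::cout << edgenessHistogram(edges, 5, 5)[1] << "\n";   // 9 interior pixels see one edge
}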

Image Weight

The overall image weight is another simple feature that can be used for object classification. Like the texture feature, the weight of the image is strongly based on edge detection. The normalized weight of an image is simply the sum of the edge pixels divided by the total number of pixels in the image.

Unlike previous classes, the distance measure for the image weight will not be based on the L1 threshold. The image weight is a single number describing an image, so histogram methods do not apply. To measure the separation between two image weights w_1 and w_2, a simple percent difference suffices (Equation 6-11).

\%\,diff = \frac{|w_1 - w_2|}{(w_1 + w_2)/2} \times 100\%    (6-11)

Using this simple distance measure, empirical variance testing showed that a percent difference below 25% would qualify this feature as good.

Image Moment of Inertia

Moment of inertia (MOI) is the polar moment of inertia of an image. Similarly to the image weight class, MOI is based on edge detection. Most digital images are stored in raster orientation, with the origin of the image being the top left corner; however, to ease the implementation process, the image's center c will be chosen as the origin (Figure 6-12). The first step in calculating an image's MOI is to perform edge detection, as with the previous two classes.

Two methods for calculating a moment of inertia are treating the entity as a rigid body and treating the entity as a summation of discrete points. Both methods will be used to calculate a normalized MOI. This normalized MOI is needed so that the results can be used on images of varying sizes.

Determining the MOI consists of two steps: calculating the raw MOI and normalizing it onto the interval [0-1]. To calculate the raw MOI, the image is treated as a cloud of points (Figure 6-13). The raw MOI is then calculated by Equation 6-12, where n is the number of edge points in the image and m is the mass. Because the mass m is a single pixel, it is set to unity; the moment of inertia then becomes a summation of the squared distances from each edge point to the origin.

I_{raw} = \sum_{i=1}^{n} m \, r_i^2 = \sum_{i=1}^{n} r_i^2    (6-12)

Once the raw MOI has been calculated by treating the image as a cloud of points, the result must be normalized. To normalize the MOI, the maximum value of the MOI must be determined. This task is accomplished by treating the image as a rigid body, which is essentially the same as assuming every pixel is an edge. This simplification allows the use of Equation 6-13 to calculate the maximum MOI of an image about its center, where W and H are the image width and height in pixels. The normalized moment of inertia is then calculated by dividing the point cloud MOI by its maximum value (Equation 6-14). Using the same percent difference method as the image weight class, testing showed that a percent difference below 25% would qualify this feature as good.

I_{max} = \frac{W \cdot H}{12} \left( W^2 + H^2 \right)    (6-13)

I_{norm} = \frac{I_{raw}}{I_{max}}    (6-14)
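For illustration, the following C++ sketch combines Equations 6-12 through 6-14. It assumes a binary edge map stored row-major and uses the rigid-body value of Equation 6-13 as the normalizer; the function name normalizedMoi is an assumption of the sketch.

#include <iostream>
#include <vector>

// Normalized polar moment of inertia of an edge map about the image center.
double normalizedMoi(const std::vector<unsigned char>& edges, int width, int height)
{
    double cx = (width - 1) / 2.0, cy = (height - 1) / 2.0;   // image center as origin
    double raw = 0.0;
    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c)
            if (edges[r * width + c]) {
                double dx = c - cx, dy = r - cy;
                raw += dx * dx + dy * dy;            // Equation 6-12 with unit mass
            }
    // Rigid-body maximum of Equation 6-13 (every pixel treated as an edge).
    double maxMoi = (double(width) * height / 12.0) *
                    (double(width) * width + double(height) * height);
    return raw / maxMoi;                              // Equation 6-14
}

int main()
{
    std::vector<unsigned char> edges(16, 1);          // 4x4 map with every pixel an edge
    std::cout << normalizedMoi(edges, 4, 4) << "\n";  // close to 1 for a fully edged image
}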

Histogram Modal Analysis

Histogram modal analysis is another simple feature class used to determine whether a test image belongs to the same category as the training images. With the previous histogram-based feature classes (RGB, HSV, texture, grayscale), the distances between histograms were calculated using the L1 measure (Equation 5-3), which compares the difference between each bin in two separate histograms.

Histogram modal analysis is a much more relaxed method of comparing two histograms: instead of comparing each bin, we simply compare the histograms' modes. For example, the red channel histogram of the beverage container in Figure 6-14 is shown in Figure 6-15. From observation, it can be seen that the red channel histogram's mode is the bin that contains the value 255.

To determine whether the feature class is acceptable for finding objects in future images, this mode is compared to the mode of any future image. If the histograms contain the same mode, the feature is classified as good; if they contain different modes, the feature class is rejected. Unlike previous feature classes, there is no L1 distance or percent difference: whether two histograms have the same mode is either true or false. As stated, this modal analysis is a relaxed method of histogram comparison that will be used for the RGB, HSV, texture, and grayscale feature classes.

Number of Circles

The eleventh feature class to be implemented is the number of circles present in an image. This circle detection algorithm is designed using Hough transform principles. The Hough transform is a feature extraction method that uses an accumulator array to find instances of objects by a voting procedure. The Hough transform is used mainly for finding lines or line segments, but it can also be applied to other well-known shapes [34].

As stated, the Hough transform requires an accumulator array. The dimensions of this accumulator array are determined by the number of parameters of the curves being sought. The equation of a line is shown in Equation 6-15. Because Equation 6-15 cannot represent vertical lines, and its coordinates are in Cartesian space, Equation 6-16 will be used to represent a line in raster coordinates.

Number of Circles

The eleventh feature class to be implemented is the number of circles present in an image. This circle detection algorithm will be designed using Hough transform principles. The Hough transform is a feature extraction method that uses an accumulator array to find instances of objects through a voting procedure. It is used mainly for finding lines or line segments, but it can also be applied to other well-known shapes [34]. As stated, the Hough transform requires an accumulator array whose dimensions are determined by the number of parameters of the curves being sought. The equation of a line in slope-intercept form is shown in Equation 6-15. Because Equation 6-15 cannot represent vertical lines and its coordinates are in Cartesian space, Equation 6-16 will be used instead to represent a line in raster coordinates. These values are shown in Figure 6-16.

(6-15)

(6-16)

The first step in the Hough line finder is edge detection, which can be performed using any of the previous edge detection methods such as the Sobel, Prewitt, or Canny operator. The next step is analyzing the edge image for the presence of lines. To determine the presence of a line, each pixel and its 8 neighbors are analyzed for edges, as shown in Figure 6-17; in this example, two possible lines have been detected. The distance d and the angle are then calculated. This procedure is repeated for each pixel in the image, and the results are stored in an accumulator (Table 6-2). If the number of similar lines reaches a predefined threshold, the Hough method classifies that line as a hit. Once a line has been classified as a hit, post-processing of the line's distance and angle measurements can be performed to determine the length of the actual line segment.
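The text above estimates a candidate line's angle locally from each pixel's 8-neighborhood before voting; the sketch below instead uses the more common approach of voting over a discretized set of angles, and it assumes that Equation 6-16 is the standard normal form d = c*cos(theta) + r*sin(theta) in raster coordinates. It is meant only to illustrate how an accumulator such as Table 6-2 is filled, not to reproduce the dissertation's exact procedure.

    #include <cmath>
    #include <vector>

    // Fills a Hough accumulator from a binary edge map (1 = edge). Each edge
    // pixel casts one vote per discretized angle; the corresponding distance bin
    // is incremented. Cells whose count exceeds a predefined threshold are later
    // declared line hits.
    std::vector<std::vector<int>> houghLineAccumulator(
        const std::vector<std::vector<int>>& edges, int angleBins, int distBins)
    {
        const double pi = std::acos(-1.0);
        const int rows = static_cast<int>(edges.size());
        const int cols = rows ? static_cast<int>(edges[0].size()) : 0;
        const double maxDist = std::sqrt(static_cast<double>(rows * rows + cols * cols));

        std::vector<std::vector<int>> acc(angleBins, std::vector<int>(distBins, 0));
        for (int r = 0; r < rows; ++r) {
            for (int c = 0; c < cols; ++c) {
                if (edges[r][c] == 0) continue;                // only edge pixels vote
                for (int a = 0; a < angleBins; ++a) {
                    const double theta = a * pi / angleBins;   // theta in [0, pi)
                    const double d = c * std::cos(theta) + r * std::sin(theta);
                    const int bin = static_cast<int>(
                        (d + maxDist) / (2.0 * maxDist) * (distBins - 1));
                    ++acc[a][bin];                             // one vote for (theta, d)
                }
            }
        }
        return acc;
    }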

A similar procedure can also be used to find circles in an image. The parametric equations of a circle are shown in Equations 6-17 and 6-18.

(6-17)

(6-18)

Equations 6-17 and 6-18 show that a circle involves three unknowns: ro, co, and d. To compensate for this extra variable, the radius d will be added to the accumulator. To find the center of a circle for a specified radius, the gradient at each pixel must first be calculated (Figure 6-18). The gradient of a pixel is the direction of the largest intensity increase and is given by Equation 6-19.

(6-19)

If a point lies on a circle, its gradient points toward the center of that circle, as shown in Figure 6-19. Once the gradient at each pixel has been calculated, the procedure follows that of the Hough line finder: the values ro, co, and d are stored in the accumulator, and if a cell passes a predefined threshold, the curve is classified as a circle. The results of this procedure are shown in Figure 6-20.
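For comparison, OpenCV's gradient-based Hough circle detector implements essentially the voting scheme just described. The sketch below uses the modern C++ API (the program described in this dissertation used the original C library, so its actual calls differ), and the blur kernel and threshold values are illustrative only.

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Counts circles in a color image with OpenCV's gradient-based Hough method.
    int countCircles(const cv::Mat& bgrImage)
    {
        cv::Mat gray;
        cv::cvtColor(bgrImage, gray, cv::COLOR_BGR2GRAY);
        cv::GaussianBlur(gray, gray, cv::Size(9, 9), 2.0);    // suppress noise before voting

        std::vector<cv::Vec3f> circles;                       // (center x, center y, radius)
        cv::HoughCircles(gray, circles, cv::HOUGH_GRADIENT,
                         1,                 // accumulator resolution (same as the image)
                         gray.rows / 8.0,   // minimum distance between detected centers
                         100,               // Canny high threshold used internally
                         50,                // accumulator vote threshold for a hit
                         0, 0);             // minimum / maximum radius (0 = unrestricted)
        return static_cast<int>(circles.size());
    }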

Number of Rectangles

To calculate the number of rectangles in an image, the image processing technique of connected component labeling will be implemented. Connected component labeling groups pixels into classes based on pixel connectivity and is usually performed on binary images. To obtain a binary image, the color image is first converted to grayscale using Equation 6-1; each pixel is then compared to an intensity threshold. If the pixel falls below this intensity threshold, its value is set to 0; otherwise, it is set to 1. The result is shown in Figure 6-21.

Once the image has been converted into its binary representation, connected component labeling is performed. An excellent description of connected component labeling can be found in Image Analysis - Connected Component Labeling [35]. The algorithm scans the image, pixel by pixel, from the top-left corner to the bottom-right corner. When a pixel with value 1 is encountered, the algorithm compares it to its left, right, and two diagonal neighbors (Figure 6-22). The pixel of interest p is then labeled according to the following rules:

1. If all four neighbors have a value of 0, assign a new label to p; else
2. if only one neighbor has V = {1}, assign its label to p; else
3. if more than one of the neighbors have V = {1}, assign one of their labels to p and make a note of the equivalences.

Afterwards, pixels with the same label are all assigned to the same class. The results of connected component labeling are shown in Figure 6-23. Now that the connected components have been identified in the image, the next step is identifying which contours represent rectangles.

To simplify the process of identifying contours as rectangles, Intel's Open Source Computer Vision Library (OpenCV) will be used. This library provides a function that approximates polygonal contours by implementing the Douglas-Peucker algorithm, which finds a simple polygon approximating the original within a specified tolerance. The details of the algorithm are beyond the scope of this research; for more information, see "Algorithms for the reduction of the number of points required to represent a digitized line or its caricature" [36]. Once all the contours have been approximated as polygons, each polygon must be classified as a rectangle or rejected. For a polygon to be classified as a rectangle, it must pass the following criteria (a code sketch of this test follows the list):

1. The polygon must have 4 vertices.
2. Each vertex must be concave (curve inward).
3. Each vertex must have approximately a 90 degree angle.
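A sketch of this rectangle test using the modern OpenCV C++ API: binarize, extract the contours, approximate each contour with cv::approxPolyDP (the Douglas-Peucker implementation mentioned above), and keep the four-vertex polygons whose corners are all close to 90 degrees. The convexity check and the numeric thresholds are assumptions, not the dissertation's exact values.

    #include <opencv2/opencv.hpp>
    #include <cmath>
    #include <vector>

    // Cosine of the angle at vertex p1 formed with points p0 and p2.
    static double angleCosine(cv::Point p0, cv::Point p1, cv::Point p2)
    {
        const double dx1 = p0.x - p1.x, dy1 = p0.y - p1.y;
        const double dx2 = p2.x - p1.x, dy2 = p2.y - p1.y;
        return (dx1 * dx2 + dy1 * dy2) /
               std::sqrt((dx1 * dx1 + dy1 * dy1) * (dx2 * dx2 + dy2 * dy2) + 1e-10);
    }

    // Counts rectangle-like contours in a color image.
    int countRectangles(const cv::Mat& bgrImage)
    {
        cv::Mat gray, binary;
        cv::cvtColor(bgrImage, gray, cv::COLOR_BGR2GRAY);
        cv::threshold(gray, binary, 128, 255, cv::THRESH_BINARY);   // binary representation

        std::vector<std::vector<cv::Point>> contours;
        cv::findContours(binary, contours, cv::RETR_LIST, cv::CHAIN_APPROX_SIMPLE);

        int rectangles = 0;
        for (const auto& contour : contours) {
            std::vector<cv::Point> poly;
            cv::approxPolyDP(contour, poly, 0.02 * cv::arcLength(contour, true), true);
            if (poly.size() != 4 || !cv::isContourConvex(poly))
                continue;                                  // must be a convex quadrilateral
            bool rightAngles = true;
            for (int i = 0; i < 4; ++i)                    // |cos| near 0 means an angle near 90 degrees
                if (std::fabs(angleCosine(poly[i], poly[(i + 1) % 4], poly[(i + 2) % 4])) > 0.3)
                    rightAngles = false;
            if (rightAngles) ++rectangles;
        }
        return rectangles;
    }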

Once these restrictions have been placed on the list of approximated polygons, only the rectangles in the image should remain. The results of this rectangle finding are shown in Figure 6-24.

With the conclusion of this chapter, the development of the feature classes has been illustrated. As stated, not all of the feature classes fit into the framework of the methodology section; however, any deviations from that framework have been described in detail. The next chapter will demonstrate how an object-oriented design approach is used to implement these feature classes in the overall program assembly. This object-oriented approach allows easy modification of the current feature classes and easy addition of any future feature classes.

Table 6-1: Sample variations in RGB values

Object             Red (Min-Max)   Green (Min-Max)   Blue (Min-Max)
Dark Green Object  40-60           70-107            44-66
Gray Object        160-179         150-180           126-136
Red Object         250-255         60-90             13-24
Yellow Object      250-255         250-255           80-107
White Object       250-255         250-255           250-255
Dark Blue Object   24-30           40-60             40-60

Table 6-2: Generic accumulator

         Distance   Angle (degrees)
Line 1   34         12
Line 2   12         3
Line 3   4          78
Line 4   34         12
Line 5   34         12
Line 6   2          9

Table 6-3. Summary of distance measures

Feature             Distance Measure
RGB                 L1 Norm
Grayscale           L1 Norm
HSV                 L1 Norm
Texture             L1 Norm
Image Weight        Percent Difference
Image M.O.I.        Percent Difference
RGB Modal           Mode Comparison
HSV Modal           Mode Comparison
Texture Modal       Mode Comparison
Grayscale Modal     Mode Comparison
Number Circles      Exact Match
Number Rectangles   Exact Match

Figure 6-1: First image of object to track

Figure 6-2: Second image of object to track

Figure 6-3: Plot of distance measures versus bin size (panels: L1 distance versus number of bins; Kullback-Leibler divergence versus number of bins)

Figure 6-4: Program used to calculate average RGB values

Figure 6-5: Variations in L1 distance while tracking the same object (count versus L1 distance)

Figure 6-6: Conversion of RGB image to grayscale

Figure 6-7: Visual representation of the HSV color space

Figure 6-8: Program used to calculate average HSV values

Figure 6-9: Results of Sobel operator

Figure 6-10: Results of Canny operator

Figure 6-11: Results of Laplacian operator

Figure 6-12: Image origin for calculating moment of inertia

Figure 6-13: Calculating the image moment of inertia from edges

Figure 6-14: Beverage container

Figure 6-15: Histogram of red channel for beverage container (bin count versus bin value, 0-255)

Figure 6-16: Parameters of a generic circle

Figure 6-17: Line detection for Hough transform

Figure 6-18: Gradient at a single pixel

Figure 6-19: Gradient of a circle

Figure 6-20: Results of circle detection

Figure 6-21: Binary image representation

Figure 6-22: Neighbors of pixel to be labeled

Figure 6-23: Results of connected component labeling

Figure 6-24: Rectangle finding in an image

CHAPTER 7
APPLICATION ASSEMBLY

As stated in the abstract of this dissertation, an object-oriented approach will be used for the overall program assembly. By incorporating an object-oriented design (OOD) philosophy, different portions of the application can be modified without altering or damaging the entire program. Moreover, an OOD philosophy allows for easy future expansion of the feature classes. This chapter gives an introduction to object-oriented programming and shows how the feature classes will be implemented using this programming philosophy.

Writing a program without a design is like building a house without blueprints [37]. Program design is the single most important step in application development, yet most individual developers, even professionals, breeze over this step as if it were just a hassle. Programmers tend to form a rough idea (usually in their heads) of how a program should work and then immediately begin coding; the typical time distribution for an individual programmer is 20% design and 80% coding. This neglect of the design phase only works for the simplest of programs. Without adequate design, the programmer will almost always run into problems that could easily have been avoided, and these problems usually reveal themselves when the program is nearly finished.

To illustrate the importance of program design, consider an example. One of the benefits of this research is that the feature classes can be easily modified or extended long after the original designer of the application has left. The graphical user interface (GUI) portion of this program was built with Trolltech's QT4 application development platform. If the functionality of each feature class were intertwined with the GUI code, every subsequent user of the application would also have to be an expert with QT4.

With this approach, anyone who wanted to modify or extend the feature classes would also be forced to modify preexisting GUI code. Not only would this be annoying to the end user, it would be dangerous: by changing the GUI code, he or she runs the risk of breaking a previously working application. Moreover, not everyone enjoys developing GUIs with QT4; several people who work with the author prefer Microsoft's popular C# programming language. By planning ahead, the feature classes can easily be extracted from the application and placed in any GUI development environment, allowing the end programmer to use whatever GUI language he or she is most comfortable with.

Once good program design has been performed, there are two main programming practices in use today: procedural and object-oriented. Procedural code (typically written in C) revolves around data and the functions that act on that data. Procedural programming asks the question, "What does this program do?" By answering this question and designing the program as a series of steps, one is designing procedurally.

While procedural programming works well when a program follows a sequential series of steps, it is plagued with problems in larger applications. The first problem is that the data and the functions that act on that data reside in different locations. If a function changes its arguments, the data on which the function operates must also change its structure. Moreover, because the data is separate, it is accessible to any part of the program; this lack of data privacy is a major source of errors and security concerns in many programs.

The second problem with procedural programming is breaking pre-existing code.

Because procedural code consists of functions that operate on data, each function must be able to handle whatever data type is passed to it. Because of this, any future modification to the program may involve changing code that was previously working.

The third problem with procedural programming is redundant code. Either a function must be able to handle every data type it is given, or a separate function must be created for each data type. If a new function is created, the program calling that function must also be changed. More code and more changes imply more possibilities for error.

The object-oriented approach (C++, C#, Java, and others) works around these problems by treating the program as a set of objects that interact with each other. Instead of keeping functions and data separate, both are contained in the same location. Moreover, this object-oriented approach allows for easy future expansion of the program. Instead of asking "What does this program do?", the question becomes "What objects are to be modeled?" Object-oriented programming has three main advantages over procedural programming:

1. Encapsulation
2. Inheritance
3. Polymorphism

Encapsulation consists of keeping the data and the functions (methods) that act on that data in one location called a class. By putting the data and methods in a single class, the class can be treated as a separate entity. By treating the program as a set of standalone classes, different parts of the program can be modified without unintentionally breaking something else.

Inheritance is the process of abstracting the functionality common to a set of related classes into a superclass from which they derive. For example, assume a program is created that mimics farm animals. These animals consist of dogs, cats, pigs, cows, and horses.

Each of these animals shares several of the same behaviors, such as the following:

1. Eat
2. Walk
3. Run
4. Sleep

Because all the animals share these same behaviors, it would not make sense to program separate methods for each animal, as shown in Figure 7-1. Instead, a superclass will be created that contains the common functionality of each animal, as shown in Figure 7-2. By abstracting this common functionality into the superclass, a tremendous amount of programming time is avoided. Furthermore, if someone wants to add a new animal to the program, he or she can simply inherit the common functionality from the superclass instead of writing the code again.

The final benefit of an object-oriented design approach is polymorphism. Polymorphism allows a subclass to override the functionality of its superclass. In this farm-animal example, all of the animals usually sleep lying down except the horses; therefore, the generic Animal sleep method can simply be superseded with a method particular to a horse. Unlike procedural programming, this modification can be performed without affecting any other part of the program (a brief sketch of this hierarchy is given at the end of this section).

A technique of polymorphism used for programs that will undergo continuous expansion is programming with interfaces. An interface is a superclass that contains method prototypes but no underlying functionality. One might ask why interfaces are ever needed; was the goal of object-oriented programming not to reduce redundant code? Interfaces are useful when several subclasses contain the same methods but implement each method differently.
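A minimal sketch of the farm-animal hierarchy described above (compare Figures 7-1 and 7-2): the common behaviors live in the superclass, and only the horse overrides sleep. Class and method names here are illustrative.

    #include <iostream>

    // Superclass holding the behavior common to every farm animal. The sleep()
    // method is virtual so that a subclass can override it (polymorphism).
    class Animal {
    public:
        virtual ~Animal() = default;
        void eat()  { std::cout << "eating\n"; }
        void walk() { std::cout << "walking\n"; }
        void run()  { std::cout << "running\n"; }
        virtual void sleep() { std::cout << "sleeping lying down\n"; }
    };

    // A dog inherits all of the common behavior unchanged.
    class Dog : public Animal {};

    // A horse overrides only the one behavior that differs.
    class Horse : public Animal {
    public:
        void sleep() override { std::cout << "sleeping standing up\n"; }
    };

    int main() {
        Horse horse;
        horse.eat();     // inherited from Animal
        horse.sleep();   // overridden behavior; nothing else in the program changed
    }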

As an example, consider the twelve feature classes presented in the previous chapter. Although each feature class calculates the distance between two or more images, each class performs this operation in a different manner, and any future feature class will also calculate the distance between images in its own way. Interfaces allow the programmer to design the program for future expandability. By designing the program around a generic feature superclass, any subsequent feature class can be implemented with ease: as long as the subclass overrides the method prototypes in the superclass, expansion can be done without modifying any other part of the program. This feature class interface is shown in Figure 7-3, and a code sketch of such an interface is given at the end of this section.

As shown, the interface has two methods that are used in the program, IsFeatureGood and ProcessImage. The IsFeatureGood method checks whether the feature will be appropriate for finding the object in future images and is called during the training session of the program. The second method, ProcessImage, is called when the program is looking for the object in future images. Therefore, as long as each subsequent feature class implements these two methods, the feature class can easily be added into the overall program.

Two programming libraries will be used in the implementation of this program. The first is Intel's Open Source Computer Vision Library (OpenCV), which consists of a huge set of low-level functions designed to make the computer vision engineer's life much easier. Although the library is written purely in C, a C++ wrapper will be placed around these functions to make them adhere to the object-oriented framework. The second library is Trolltech's QT4, a completely object-oriented, C++, cross-platform application development framework. QT4 was chosen because C++ programs tend to run much faster than programs written in many other languages.
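Returning to the interface of Figure 7-3, the sketch below shows what such a generic feature superclass can look like in C++. The method signatures are assumptions, since the dissertation names the two methods but does not list their parameters.

    #include <opencv2/core.hpp>

    // Generic feature interface: an abstract superclass that contains only
    // method prototypes and no underlying functionality.
    class Feature {
    public:
        virtual ~Feature() = default;

        // Called during training: decides from the user's two selections whether
        // this feature is appropriate for finding the object in future images.
        virtual bool IsFeatureGood(const cv::Mat& firstSelection,
                                   const cv::Mat& secondSelection) = 0;

        // Called during the search: examines a candidate window in a future image.
        virtual bool ProcessImage(const cv::Mat& candidateWindow) = 0;
    };

    // A new feature class only has to override the two prototypes; nothing else
    // in the program changes.
    class ImageWeightFeature : public Feature {
    public:
        bool IsFeatureGood(const cv::Mat& a, const cv::Mat& b) override {
            // ... compare the two image weights with the percent difference test ...
            return true;
        }
        bool ProcessImage(const cv::Mat& window) override {
            // ... test the candidate window against the stored training weight ...
            return true;
        }
    };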

QT4 was also chosen because all of the pre-existing code was already written in C++; adding another programming language would have introduced unnecessary complications. As stated, however, if one wishes to move the feature classes to another language, he or she will be able to do so relatively easily.

The fully working program will now be demonstrated. The program will first be used to determine which set of features is good for tracking the QT4 book; then, based on the features the program selects as good, it will find the object of interest in future images. The fully working program is shown in Figure 7-4, in which the first image of the QT4 book is selected by the user. Next, a rotated view of the QT4 book is selected, as shown in Figure 7-5.

Based on these two selections, the features the program deemed good are shown in Table 7-1. Because the images were nearly identical except for the rotation, the majority of the feature classes were chosen as good. The circle and rectangle feature classes failed; unfortunately, feature classes built on primitive shapes usually fail under most real-world conditions, as near-perfect primitive shapes rarely exist in nature. The user has the option to add or remove feature classes based on his or her expert knowledge of the situation. Once the user is satisfied with the feature classes, he or she presses the search button and the program uses the good feature classes to find the object in future images. Finding the QT4 book in future images is shown in Figure 7-6.

With the conclusion of this chapter, one should understand why an object-oriented programming philosophy is so beneficial to the development of this program. Using this object-oriented approach, the development of the individual feature classes and the implementation of each feature class in the overall program have been demonstrated, and the fully working program has been used to track an object.

The next chapter will compare this program against two popular methods of object detection, as described in the methodology section of this dissertation.

Table 7-1: Selected features

ALL FEATURE CLASSES    GOOD FEATURE CLASSES
RGB                    RGB
GRAYSCALE              GRAYSCALE
HSV                    HSV
TEXTURE                TEXTURE
IMAGE WEIGHT           IMAGE WEIGHT
IMAGE MOI              IMAGE MOI
GRAYSCALE MODAL        GRAYSCALE MODAL
RGB MODAL              RGB MODAL
HSV MODAL              HSV MODAL
TEXTURE MODAL          TEXTURE MODAL
# CIRCLES
# RECTANGLES

Figure 7-1: Design of animals without inheritance

Figure 7-2: Design of animals with inheritance

Figure 7-3: Programming with interfaces

Figure 7-4: First selection of object of interest (QT4 book)

Figure 7-5: Second selection of object of interest (QT4 book)

Figure 7-6: Finding object of interest (QT4 book) in future images

CHAPTER 8
TESTING AND COMPARISON

This chapter compares the new feature selection method with two very popular methods of object detection: image template matching and a simple statistically based neural network classifier that is built into Intel's OpenCV library. The chapter gives a brief introduction to these well-known methods and then shows how they compare to the new application.

The first method to be discussed is image template matching. Template matching is the process of finding small parts of a large image (Figure 8-1) that match a pre-specified template; the only requirement is that the template (Figure 8-2) must be smaller than the image being searched. To find the template in the larger image, the template is usually compared to the larger image using the sum of absolute differences (SAD) method (Equation 8-1). The template image is simply moved across the larger image and the sum of absolute differences is repeatedly computed. If the SAD falls below a certain threshold, the method declares that a match has been found in the larger image.

SAD(x, y) = sum over i = 0..Trows, j = 0..Tcolumns of diff(x + i, y + j)    (8-1)

where diff(x + i, y + j) is the absolute difference between template pixel (i, j) and the underlying image pixel.

The template matching program is fairly simple. The user selects which part of the image he or she wants to use as a template; if the user is happy with the template, he or she then clicks search and the program attempts to find the template in future images. In this example, the goal is the tracking of the QT4 book. The video stream is shown in Figure 8-3, and whatever image is inside the central square becomes the template.
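A brute-force sketch of Equation 8-1 on single-channel 8-bit images, assuming diff is the absolute difference between corresponding template and image pixels. In practice the same search is available through OpenCV's cv::matchTemplate; the code below only illustrates the sliding-window computation.

    #include <opencv2/core.hpp>
    #include <climits>
    #include <cstdlib>

    // Returns the top-left corner of the window with the smallest sum of
    // absolute differences. A real detector would also compare the best SAD
    // against a threshold before declaring a match.
    cv::Point bestSadMatch(const cv::Mat& image, const cv::Mat& templ)
    {
        long bestSad = LONG_MAX;
        cv::Point best(0, 0);
        for (int y = 0; y + templ.rows <= image.rows; ++y) {
            for (int x = 0; x + templ.cols <= image.cols; ++x) {
                long sad = 0;
                for (int i = 0; i < templ.rows; ++i)          // accumulate |image - template|
                    for (int j = 0; j < templ.cols; ++j)
                        sad += std::abs(static_cast<int>(image.at<uchar>(y + i, x + j)) -
                                        static_cast<int>(templ.at<uchar>(i, j)));
                if (sad < bestSad) { bestSad = sad; best = cv::Point(x, y); }
            }
        }
        return best;
    }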

This template is shown in Figure 8-4. When the user is satisfied with the template, he or she then tells the program to search future images; the searching of future images is shown in Figure 8-5.

The second method to be compared with the new invariant feature selection method is artificial neural networks. OpenCV has built-in functionality for training a neural network classifier. This approach was developed by Viola and Jones [38] and later extended by Lienhart [39-40]. The approach by Viola, Jones, and Lienhart is based on Haar-like features, as shown in Figure 8-6. Haar-like features consist of a set of primitives that can be cascaded together in a neural network to form a somewhat robust classifier. The specific details of how this classifier builds the cascade are outside the scope of this dissertation; for further information, one can read about face detection in the Intel Technology Journal [25]. As stated previously, one of the most annoying downfalls of a neural network approach is the massive amount of training data required: typically at least 1000 positive training images and at least 4000 negative images, where positive images contain the object of interest and negative images do not. Moreover, one has no idea how a classifier will perform until the training of the neural network has been completed.

Now that all three methods have been described, they will be compared against each other. For this test, two objects will be selected for tracking: textbooks for indoor conditions and automobiles for outdoor conditions. The QT4 textbook in Figure 8-4 will be used for indoor testing, and images of automobiles will be used for outdoor testing. Results are documented in the previously mentioned confusion matrix. The reason only two objects were used for testing in this dissertation is the same reason that originally motivated this research: the burden of training. Table 8-1 and Table 8-2 show the amount of time invested in training each classifier.

Table 8-1 shows that training the textbook classifier for both the template matching method and the invariant feature selector took less than one minute. Training the neural network classifier for the textbook, however, took 14 hours to complete. Even worse, each positive training image had to be manually entered into the neural network training program; this mind-numbing step of manually entering 1000 images took around 4 hours. Developing the neural network classifier for automobiles was even more tedious. Because of the sheer number of images that needed to be entered into the program, 4 students assisted with the training of the classifier. Once the painstaking task of entering all the images was completed, the neural network classifier took more than 3 days to train.

Now that the training of each classifier is complete, their results will be compared. Each classifier will be given 100 images (50 positive, 50 negative) and the results will be documented. As stated in the methodology section, all indoor testing will be conducted in constant lighting conditions, while all outdoor testing will be conducted under clear skies. First, the template matching algorithm will be tested against the QT4 textbook and automobiles; afterwards, the neural network and invariant feature selection methods will be compared against the same objects.

The first method to be tested is template matching. One of the best properties of template matching is its very small number of false alarms. As shown in Equation 8-1, the template matching algorithm simply slides the template image across the full-size image. This property allows for near-perfect object detection if the full-size image contains a near copy of the template image. Unfortunately, if the object in the image undergoes any form of transformation, the value of Equation 8-1 grows very rapidly. Because the matching of a template matching algorithm is very specific, this type of object detection is used very heavily in manufacturing processes where objects are kept under very controlled conditions.

Because template matching does not work for any scenario other than an exact match of the object, documenting its results in a confusion matrix would not yield any extra information. Template matching simply provides very rapid object detection under very controlled conditions. Because one of the goals of this research was to dramatically reduce the burden on the engineer in the object detection process, template matching's ease of use will instead be used for ergonomic comparison.

The neural network classifier will now be used to attempt to track the QT4 book. The results are shown in the confusion matrix of Table 8-3. The neural network classifier completely failed to find the QT4 textbook: despite being trained with 1000 positive images and 4000 negative images, it missed every single positive image of the QT4 textbook. Moreover, the textbook classifier was plagued with false alarms. Table 8-3 shows the results for 100 images, where the classifier was given 50 images containing the QT4 textbook and 50 images that did not contain it.

For each positive test image, a hit was declared if the classifier detected the correct object (the QT4 textbook). If the classifier detected the wrong object in a positive test image, it was NOT counted as a false alarm, but rather as a miss. If the classifier detected an erroneous object when given a negative test image, a false alarm was declared; when the classifier did not detect anything in a negative image, a correct rejection was declared. This approach was chosen so that the values in each confusion matrix sum to 100, and it was kept consistent throughout testing.

The reason the neural network classifier failed to track the textbook is not fully understood. Perhaps the textbook did not contain enough Haar-like features to build an accurate classifier; perhaps Haar-like features were simply a poor feature choice; or maybe the classifier was not given enough training images. Either way, more than 4 hours of manually inserting positive images and 14 hours of classifier training yielded very disappointing results.

The neural network classifier yielded much better results when trained to track the frontal view of automobiles. An example of this tracking is shown in Figure 8-7. The neural network classifier for tracking the front views of automobiles was designed during the 2007 DARPA Urban Challenge, a robotics competition in which autonomous vehicles were to navigate themselves through cities. Because the tracking of automobiles was of substantial importance to that objective, several people were able to help with the monumental task of training the classifier. The results of the automobile classifier are shown in Table 8-4. The classifier was trained with massive numbers of automobiles, consisting of cars, trucks, and vans. The confusion matrix in Table 8-4 shows that the neural network classifier was able to track a variety of automobiles 34 percent (17/50) of the time. While this percentage may seem low, one must consider that the raw data coming from a computer vision program is almost never used without some sort of filtering.

The invariant feature selection method was then put through the same tests as the previous classifiers. This method has already been shown tracking the QT4 book in Figure 7-6, where it was able to find the textbook through a series of transformations in space. The confusion matrix for this test is shown in Table 8-5.

When tracking the QT4 textbook, the invariant feature selection method substantially outperformed the neural network classifier (which was not difficult, since the neural network classifier did not detect anything). Unfortunately, the same cannot be said for the automobile scenario. When the goal was to track multiple instances of automobiles, the neural network method was the slight winner. The new invariant feature selection method had a difficult time tracking multiple automobiles with a single classifier: it was superior when the automobiles were similar in appearance, but the neural network method pulled slightly ahead when the automobiles were drastically different in appearance. The results for the invariant feature selection method tracking multiple automobiles are shown in Table 8-6.

This chapter has illustrated that the invariant feature selection method has an ease of implementation that rivals that of template matching, while being far more robust than template matching when the object to be detected undergoes spatial transformations; compared to template matching, the new method is clearly an improvement. The new method was then compared to a neural network approach. The new method was vastly superior to the neural network approach when tracking the QT4 textbook. Both methods were then given one of the most challenging object detection problems, automobile detection. In this scenario, the neural network approach only slightly outperformed the new method with regard to hits, despite taking more than 3 days to train, while the new method had a slightly lower false alarm rate. In short, this new method gives the user the ease of template matching with the robustness of a neural network.

Table 8-1: Textbook classifier training time

                         Template matching   Artificial neural networks   Invariant feature selection
Number positive images   1                   1000                         4
Number negative images   0                   4000                         0
Training time            < 1 minute          14 hours                     < 1 minute

Table 8-2: Frontal automobile classifier training time

                         Template matching   Artificial neural networks   Invariant feature selection
Number positive images   1                   5000                         2
Number negative images   0                   10000                        0
Training time            < 1 minute          Approx. 3 days               < 1 minute

Table 8-3: Neural network QT4 book confusion matrix

                     Actual positive   Actual negative
Predicted positive   0                 16
Predicted negative   50                34

Table 8-4: Neural network automobile confusion matrix

                     Actual positive   Actual negative
Predicted positive   17                7
Predicted negative   33                43

Table 8-5: Invariant feature selection QT4 book confusion matrix

                     Actual positive   Actual negative
Predicted positive   48                3
Predicted negative   2                 47

Table 8-6: Invariant feature selection automobile confusion matrix

                     Actual positive   Actual negative
Predicted positive   14                6
Predicted negative   36                44

Figure 8-1: Entire image to be searched

Figure 8-2: Example of template

Figure 8-3: Image used to find template

Figure 8-4: Template used for future image searches

Figure 8-5: Result of template matching

Figure 8-6: Haar-like features

Figure 8-7: Neural network classifier tracking automobiles

CHAPTER 9
CONCLUSION / FUTURE WORK

This dissertation has shown the development and implementation of a new method for performing object detection in images. The creation of the individual feature classes and their implementation into the overall program have been illustrated, and the new method has been compared to two well-known methods of performing object detection in images (template matching and neural networks).

While there were several smaller milestones in this research, the overall goal was to place a computer in the loop to aid in the feature selection process. The user simply inputs the images containing the object he or she wants to track in future images, and the program then informs the user which features are best for finding that particular object. With the computer informing the user as to which features are useful, substantial time is saved.

A good example of the amount of time that can be saved was the tracking of the QT4 book using the neural network. One thousand images of the book were entered and trained overnight, only to find that either Haar-like features are a poor choice or the user did not use enough training images. This type of training can be extremely frustrating, especially when the user is on a tight schedule, such as finishing a dissertation before a deadline. With this new method, the user is relieved of the burden of manually entering thousands of images only to find that his or her feature selection was a poor choice.

This approach to object detection was inspired by the human vision system. Whereas most pattern recognition techniques focus on separating an object from a background, this new method completely ignores the background. By focusing on the object to be detected instead of the background, the amount of training data is substantially reduced.

The human vision system is holistic, meaning that humans see objects as wholes instead of as combinations of features. Humans segment (group) different parts of images based on the previously described Gestalt principles and then proceed to classify the individual image segments. By utilizing the growing search window, the holistic nature of human grouping can be mimicked. This new approach gives the user the ease of implementation of a template matching algorithm with a robustness rivaling that of a neural network classifier: determining whether a feature is good for tracking can be performed in seconds, as opposed to days. This topic leads to the future work, for which there are three major areas of expansion.

The first area of expansion is the addition of more feature classes. There are currently twelve feature classes being used for object detection, and they appear to do a reasonably good job of tracking objects. However, many other features are not included in the program, such as Haar-like features. If Haar-like features were built into the program, perhaps the time spent training the neural network to find the QT4 textbook could have been avoided. Other possible features include methods such as those used in optical character recognition.

The second way to improve this program is rewriting the search window (Figure 9-1) to better utilize multi-core processors. The modern computer industry is moving away from making processors faster (higher clock speed) and is instead adding more processing cores to a single chip. In the year 2000, most people had never heard of a dual-core processor for personal computers; only a few years ago, dual-core processors were the standard; currently, most desktop computers being purchased are equipped with quad-core processors. This trend is expected to continue for the near future.

Programs utilize multi-core processors by creating execution units called threads. Threads are lightweight processes that can run simultaneously. If a computer has a multi-core processor and different parts of a program can run in parallel, those parts can be placed into threads to cut computation time. Despite their usefulness, multi-threaded programs are very difficult to debug: because the programmer must keep track of several operations occurring in parallel instead of sequentially, threads are a very rich source of unexpected program crashes. Nevertheless, the computing industry has invested substantial time and money into designing multi-core processors, thus forcing programmers to adhere to this philosophy.

The current implementation of the search window is a multi-threaded approach based on the size of the scanning rectangle (Figure 9-1). While this approach satisfies current requirements, it does not leave much room for growth. As computers inevitably gain more processing cores, adding more threads in the current manner will only give the search window more resolution; it will not necessarily make the program run any faster. A much better approach for creating the multi-threaded search window is shown in Figure 9-2: instead of basing the threads on the size of the scanning rectangle, the threads are based on the number of processor cores. For example, if a computer has 4 processor cores, the image will be split into four sections; if a computer has 16 processor cores (such as the upcoming AMD Bulldozer processor), the image will be split into 16 sections, as shown in Figure 9-3. Writing the search window in this manner maximizes the benefit of future advances in multi-core technology.
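A sketch of the proposed core-based splitting (Figure 9-2) using std::thread, a facility from a later C++ standard than the one the original program was written against. The searchSection callback stands in for the per-section scanning routine; its signature is an assumption made for illustration.

    #include <functional>
    #include <thread>
    #include <vector>

    // Splits the search window across one thread per hardware core.
    void parallelSearch(int imageRows,
                        const std::function<void(int firstRow, int lastRow)>& searchSection)
    {
        unsigned cores = std::thread::hardware_concurrency();
        if (cores == 0) cores = 1;                        // fall back to a single thread

        std::vector<std::thread> workers;
        const int rowsPerSection = imageRows / static_cast<int>(cores);
        for (unsigned k = 0; k < cores; ++k) {
            const int first = static_cast<int>(k) * rowsPerSection;
            const int last  = (k + 1 == cores) ? imageRows : first + rowsPerSection;
            workers.emplace_back(searchSection, first, last);   // one image section per core
        }
        for (auto& w : workers) w.join();                 // wait for every section to finish
    }

On a 4-core machine this yields four sections and on a 16-core machine sixteen, without any change to the scanning code itself.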

The third way this program can be expanded is by adding a real-time database of objects. The current model searches for only one object in the image, but there is no reason the program cannot perform multiple object detection. Each object contains feature classes that store some parameters about that object: some feature classes contain histogram data, others contain a single value. Each search box is simply compared to the database of objects, and if the search window is within the distance threshold of a stored object, that stored object declares a hit. Expanding the program in this manner would greatly increase its capabilities; for example, not only would the program be able to identify automobiles, it would also be able to tell the user what kind of automobile is being tracked.

Adding more feature classes, improving the search window, and building a database are three ways to allow for future expandability. As more research is performed in the area of human vision, new, more robust feature classes are destined to be created. Old and new feature classes, in conjunction with faster processors and a growing database of objects, will create an object detection scheme that has a chance of eventually rivaling the human monocular object detection process.

Figure 9-1: Current multi-threaded search window

Figure 9-2: Future multi-threaded search window

Figure 9-3: Future implementation of threading using AMD's upcoming 16-core processor

LIST OF REFERENCES

[1] Rafferty, T., Goodbye, Evil Robot; Hello, Kind Android, The New York Times, New York, New York, May 9, 2004.

[2] Moran, M., The 50 Best Robot Movies, Time Online, July 2007, New York, New York, http://entertainment.timesonline.co.uk/tol/arts_and_entertainment/film/article2133609.ece.

[3] Levin, I., The Stepford Wives, Harper Collins Publishing, New York, New York, 1972.

[4] Dir. Verhoeven, P., Robocop, Orion Pictures, Los Angeles, California, 1987.

[5] Moore, G., Cramming More Components Onto Integrated Circuits, Proceedings of the IEEE, Vol. 86, No. 1, Raleigh, North Carolina, 1965, pp. 82-85.

[6] Frei, M., Moore's Law on Chips Marks 40th, BBC News, Westminster, United Kingdom, April 18, 2005.

[7] Hare, B., Babies Recognize Face Structure Before Body Structure, Blackwell Publishing, Queensland, Australia, January 4, 2005.

[8] Clark, R., Resolution of the Human Eye, Clarkvision Photography, Boston, Massachusetts, 2007.

[9] Blackwell, J., Optical Society of America, Vol. 36, Rochester, New York, 1946, pp. 624-643.

[10] Brown, G., How Autofocus Cameras Work, HowStuffWorks, Atlanta, Georgia, 2008, http://electronics.howstuffworks.com/autofocus.htm.

[11] Bianco, C., How Vision Works, HowStuffWorks, Atlanta, Georgia, 2008, http://www.howstuffworks.com/eye.htm.

[12] Sternberg, R., Cognitive Psychology, Harcourt Brace Publishing, Orlando, Florida, 2006.

[13] Palmer, S. E., Visual Perception of Objects, Experimental Psychology, Vol. 4, Washington, DC, 2003, pp. 179-211.

[14] Matlin, M. and Foley, H., Sensation and Perception, Allyn and Bacon, Needham Heights, Massachusetts, 1997.

[15] Marcum, J., A Statistical Theory of Target Detection by Pulsed Radar, IEEE Transactions on Information Theory, Vol. 6, Issue 2, Barcelona, Spain, April 1960.

[16] Green, D. and Swets, J., Signal Detection Theory and Psychophysics, Wiley Publishing, New York, New York, 1966.

[17] Gonzalez, R. and Woods, R., Digital Image Processing, Prentice-Hall, Inc., Old Tappan, New Jersey, 1992.

[18] Sethi, I. K. and Jain, R., Finding Trajectories of Feature Points in a Monocular Image Sequence, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 9, Issue 1, San Diego, California, January 1987.

[19] Salari, V. and Sethi, I. K., Feature Point Correspondence in the Presence of Occlusion, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, Issue 1, San Diego, California, January 1990.

[20] Acharya, T. and Ray, A., Image Processing: Principles and Applications, Wiley-Interscience, New York, New York, 2005.

[21] Jackson, P., Introduction to Expert Systems, Addison Wesley Longman Limited, Harrow, England, 1999.

[22] Bishop, C., Pattern Recognition and Machine Learning, Springer Science and Business Media, Berlin, Germany, 2006.

[23] Bishop, C., Neural Networks for Pattern Recognition, Oxford University Press, Oxford, England, 1995.

[24] Chao, L., Face Detection, Intel Technology Journal, Vol. 09, Issue 01, Santa Clara, California, May 19, 2005.

[25] Seo, N., Tutorial: OpenCV haartraining (Rapid Object Detection With a Cascade of Boosted Classifiers Based on Haar-like Features), Baltimore, Maryland, July 2007, http://note.sonots.com/SciSoftware/haartraining.html.

[26] Duda, R., Hart, P., and Stork, D., Pattern Classification, Wiley-Interscience, New York, New York, 2001.

[27] Richeldi, M. and Lanzi, P., ADHOC: A Tool for Performing Effective Feature Selection, Proceedings of the Eighth IEEE International Conference, Milano, Italy, 1996, pp. 102-105.

[28] Zongker, D. and Jain, A., Algorithms for Feature Selection: An Evaluation, Pattern Recognition, Proceedings of the 13th International Conference, Vol. 2, East Lansing, Michigan, 1996, pp. 18-22.

[29] Lanzi, P., Fast Feature Selection with Genetic Algorithms: A Filter Approach, Evolutionary Computation, IEEE International Conference, Milano, Italy, 1997, pp. 537-540.

[30] Liu, H., Dougherty, E., Dy, J., Torkkola, K., Tuv, E., Peng, H., Ding, C., Long, F., Berens, M., Parsons, L., Zhao, Z., Yu, L., and Forman, G., Evolving Feature Selection, IEEE Intelligent Systems, Vol. 20, Issue 6, 2005, pp. 64-76.

[31] Peng, H., Long, F., and Ding, C., Feature Selection Based on Mutual Information Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, Issue 8, 2005, pp. 1226-1238.

[32] Smith, A., Color Gamut Transform Pairs, Computer Graphics, Vol. 12, Issue 3, 1978, pp. 12-19.

[33] Murray, J. and VanRyper, W., Encyclopedia of Graphics File Formats, O'Reilly and Associates, Santa Rosa, California, 1994.

[34] Shapiro, L. and Stockman, G., Computer Vision, Prentice-Hall, Inc., Old Tappan, New Jersey, 2001.

[35] Fisher, R., Perkins, S., Walker, A., and Wolfart, E., Image Analysis - Connected Component Labeling, Edinburgh, Scotland, 2003, http://homepages.inf.ed.ac.uk/rbf/HIPR2/label.htm.

[36] Douglas, D. and Peucker, T., Algorithms for the Reduction of the Number of Points Required to Represent a Digitized Line or its Caricature, The Canadian Cartographer, Vol. 10, 1973, pp. 112-122.

[37] Kleper, S. and Solter, N., Professional C++, Wiley Publishing, New York, New York, 2005.

[38] Viola, P. and Jones, M., Rapid Object Detection Using a Boosted Cascade of Simple Features, Conference on Computer Vision and Pattern Recognition, Vol. 1, Cambridge, Massachusetts, 2001, pp. 511-518.

[39] Lienhart, R. and Maydt, J., An Extended Set of Haar-like Features for Rapid Object Detection, Intel Labs, 2002.

[40] Kuranov, A., Lienhart, R., and Pisarevsky, V., An Empirical Analysis of Boosting Algorithms for Rapid Objects with an Extended Set of Haar-like Features, Intel Technical Report, July 2002.

BIOGRAPHICAL SKETCH

Antoin Baker was raised in Jacksonville, Florida. After high school he attended the University of Florida, obtaining his doctoral degree in mechanical engineering while studying robotics. His area of expertise is computer vision, pattern recognition, and system control.