
Computer controlled detection of people using adaptive ellipsoidal distributions



COMPUTER CONTROLLED DETECTION OF PEOPLE USING ADAPTIVE ELLIPSOIDAL DISTRIBUTIONS

By

JENNIFER L. LAINE

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2003


Copyright 2003 by Jennifer L. Laine


ACKNOWLEDGMENTS

I would like to thank my committee members for taking the time to listen to my defense and to really read and critique my thesis. I thank Dr. Arroyo for his guidance throughout the years. It was his influence which allowed me to strive towards my potential. I also wish to thank my labmates at the Machine Intelligence Laboratory, whose words of encouragement (both positive and negative) have helped to shape me into the person I am today.

The writing of this thesis was done primarily in a vacuous apartment in Raleigh, NC. I would have possibly gone insane without the daily lunch breaks with Scott Nichols and Ivan Zapata, so thanks also go to them. I thank JD Landry for making work-time at IBM less boring, and I also thank Dr. Eddie Grant at NC State for providing a structure for my unorganized editorial infancy. I thank Juan Sanchez for giving me positive reinforcement to finish my master's degree even though Vegas seemed like a million miles away. Many thanks go out to Dr. Jack Smith, who kicked me in the butt throughout the last months of writing and editing so that I could finish my thesis and do my presentation on time, and I thank James Schubert for telling me that my work was not the remains of putrefaction and questioning the practicality of 144 dimensions. Finally, thanks go out to my parents and grandparents, who have been motivating me my entire life.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Preface
  1.2 Current Classification Techniques and Problem Discussion
  1.3 Thesis Organization

2 PROBLEM REVIEW
  2.1 Introduction
  2.2 Wavelet Templates and Support Vector Machines
  2.3 Gaussian Distribution-Based Model and Neural Networks
  2.4 Stereo Vision and Neural Networks
  2.5 Neural Network Overload
  2.6 Shape-Based Pedestrian Detection

3 IMAGE SEGMENTATION AND DIMENSIONALITY REDUCTION
  3.1 Introduction
  3.2 Partitioning the Image
  3.3 Dimensionality Reduction
  3.4 Published Methods

4 PREPROCESSING OF AN IMAGE SEGMENT
  4.1 Introduction
  4.2 Brightness Equalization
  4.3 Histogram Equalization and Contrast Stretching
  4.4 Horizontal and Vertical Intensity Differencing


5 CLASSIFIER ALGORITHM
  5.1 Background
  5.2 Linear Transform for Spheroids
  5.3 Ellipsoidal Transform
  5.4 Classifier Rules

6 EXPERIMENTS
  6.1 Training the Classifier
  6.2 Preprocessing Schemes
  6.3 Testing Phase
  6.4 Sensitivity Analysis

7 RESULTS

8 CONCLUSIONS

REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

7-1 Results of detector schemes 1–3 on test data
7-2 Results of detector schemes 4–6 on test data


LIST OF FIGURES

3-1 Clarification of the placement of staggered subimages in the input image
3-2 Arrangement of half-blocks within an image block
3-3 Steps for segmenting an example image of a person
4-1 Effects of brightness equalization on intensity uniformity
4-2 Contrast stretching and histogram equalization filters applied to a sample image
4-3 Contrast stretching and histogram equalization filters applied to an image of a pedestrian
4-4 Horizontal and vertical differencing techniques applied to an image of a pedestrian
5-1 Demonstration of ellipsoidal dilation and contraction
5-2 Linear transformation of a spheroid
5-3 Mapping of an ellipsoid to a unit spheroid knowing the dilation axis
5-4 Mapping of an ellipsoid to a unit spheroid by calculating the dilation axis
5-5 Input point not within the hyperplanes of an ellipsoid
6-1 Examples of positive and negative training images respectively
6-2 Examples of positive images from the first test group
6-3 Examples of negative images from the first test group
6-4 Selections of images from the second test group
6-5 Different blurring levels of a test image
7-1 Some analyzed selections from the second group of test images


Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

COMPUTER CONTROLLED DETECTION OF PEOPLE USING ADAPTIVE ELLIPSOIDAL DISTRIBUTIONS

By

Jennifer L. Laine

May 2003

Chair: Michael C. Nechyba
Major Department: Electrical and Computer Engineering

We present a software approach towards real-time detection of human beings in camera images or video. Due to the large variations in shape, texture, and orientation present in people over time and over samples, we use a statistical procedure which contours ellipsoidal distributions around positive data examples while avoiding negative samples during the training of the detector. The data points around which the statistical approach models its distributions are feature vectors extracted from the image pixel values through a process which normalizes, filters, and dimensionally reduces them. We test the effectiveness of several popular image processing techniques to determine which ones contribute the best detection rates, and use them in the final detector. Finally, we test the model on real-world test images and discuss the results.


CHAPTER 1
INTRODUCTION

1.1 Preface

Computer controlled detection of human beings is a well established artificial vision problem. The challenges associated with it are daunting because people in every-day situations are not structurally uniform, and they exhibit many vastly different poses and orientations. Here we present an algorithm trained to detect human beings in still images or video. A technique for learning class regions in R^2 by the autonomous generation of ellipsoidal distributions to enclose positive target objects, developed by Kositsky and Ullman [1], is extended to accept a higher dimensional input space. Evidence of the possibility of such an extension is mentioned by the authors, but the theory is never tested. Also, several data preparation methods are experimentally compared so that a maximum positive detection rate and minimum false positive detection rate are achieved across the methods.

Structural cues exhibited by people in multiple poses participate in the formation of many distinct ellipsoidal distributions signifying human existence. We make the assumption that certain relative phenomena within two dimensional images of people may be represented as points within independent ellipsoidal contours, based on preprocessing techniques used in the paper by Oren et al. [2]. Each contour is defined by a high order transfer matrix coupled with a center point, and an object's exclusion from the positive class set is determined by the associated data point's exclusion from all of the ellipsoidal contours. Our main objectives in introducing this new method are acquiring detection results similar to previous techniques while maintaining a degree of simplicity and elegance by keeping the number of algorithmic steps small, using only image preparation methods that are experimentally determined to improve detection, and manipulating data clusters which are quadratic in shape rather than Gaussian.

1.2 Current Classification Techniques and Problem Discussion

Most detection schemes attempt to minimize an error metric between results generated by the classifier acting upon dimensionally reduced data and the associated true outcome. Neural networks adjust weights to minimize the error between the training network and the true result. Kernel classifiers such as Radial Basis Functions (RBF) and Support Vector Machines (SVM) minimize a bound on the actual risk or error created by applying a particular set of functions to massaged data points. Model-based and example-based detectors use positive and negative examples of the desired class to build a statistical or parametric criterion by which test material is measured.

Unfortunately, human beings are not structurally consistent over time and across samples. When a person is moving, he assumes a cadence which may be measured and used for detection. However, this technique requires the analysis of video frames and an extensive numerical history of some kind [3]. Excluding a regard to motion, sporadic display of vertical symmetry and wide color and texture variations across many samples of people make it difficult to encapsulate a single input example into a repeatable pattern of which a specific detection technique could take advantage. Hence, current methods tend to limit themselves to a test class with a discrete number of positions and tight size restrictions. Specialized hardware may be used to ease data dissection. An infrared imager creates a heat map of the target and allows the data signature of humans to be more distinctive than that produced by a digital camera [4].
Two camera systems have been developed that produce a disparity map of the environment so that the background may be eliminated from the images and only foreground objects are considered [5].


Hardware solutions are useful but are also expensive. Training data which are needed to make adjustments to the detector weights or conditions are usually manually manipulated to remove other unconstrained objects and the background, which is typically not homogeneous. The extensive preparation time required to train a detector warrants simplification of the training technique and perhaps a more interactive approach. We attempt to address all of these issues by first performing only necessary preprocessing procedures to the input data that are experimentally determined to improve human detection. We do very little preparation to the training images and automate the bootstrapping process during detector training so that false positives are efficiently reduced. Lastly, we choose a distribution type that is amenable to compact representation and has malleable attributes.

1.3 Thesis Organization

Chapter 2 outlines techniques previously employed for the detection of human beings and lists their strengths and weaknesses. Chapter 3 describes how an input image is segmented into a set of high dimensional data points acceptable to the detector. Chapter 4 defines many preprocessing schemes handled by current object detection methods and reports how they affect the preprocessing of our input data. Chapter 5 gives the theoretical basis behind the proposed detector algorithm and describes each step in the training process. We compile these preprocessing techniques and associate each approach with an independent detector scheme. Each scheme is functionally composed of a specific set of preprocessing techniques and the universal trainer or detector. The trainer creates quadratic distributions based on the clustering behavior of the preprocessed image data, and the detector tests the inclusion of a new image within these distributions.
Chapter 6 relates the experiments performed in the detection process by detailing the types of preprocessing methods contained in each detector scheme. The same database of positive and negative training images is used to train each detector scheme, and Chapter 7 displays the experimental results found when each scheme examines an image database different from the training one. Chapter 8 evaluates the presented approach and describes future work which may be done in this area.


CHAPTER 2
PROBLEM REVIEW

2.1 Introduction

Here we present a review of some of the current techniques employed to solve the problem of people detection. They are explained here to provide a basis for our approach. The technique of Oren et al. [2] reduces the problem space into a set of meaningful frequency components. With these components, the authors create a parametric pedestrian model by minimizing a risk function. The method of Poggio and Sung [6] feeds distance metrics related to the Gaussian distributions of face image training data into a neural network. The method of Zhao and Thorpe [7] elicits the help of stereo vision to separate the background from foreground objects and inputs the gradient of potential people in the foreground into a neural network. The method of Rowley et al. [8] introduces carefully chosen parts of face images to a sequence of arbitrating neural networks. The technique of Broggi et al. [5] comes up with systematic morphological rules that are applied alongside stereo vision towards the detection of people in images.

2.2 Wavelet Templates and Support Vector Machines

The technique of Oren et al. [2] focuses on the differences in intensity between pedestrians and the background, or the relative intensities and position of pedestrian boundaries coupled with homogeneous interiors. The authors concentrate on structural cues because a pedestrian's colors are not constrained, and the colors and textures of the background are not consistent. A wavelet transform approximates non-stationary signals with sharp discontinuities at varying scales. Hence, the structure of a person lends itself to the use of wavelet coefficients to differentiate people from non-people, and in this application, a wavelet transform is used as a human edge detection algorithm. A redundant set of Haar basis functions is used to completely capture the relationships of the average pixel intensities between neighboring regions of an image. They apply the Discrete Wavelet Transform (DWT) along three orientations to generate wavelet coefficients at two different scales. Coefficients are produced for vertical, horizontal, and diagonal passes of the transform, and both 32 × 32 and 16 × 16 pixel block scales are tested. In the 32 × 32 scale, one coefficient represents the energy of the signal localized both in time and frequency within the corresponding 32 × 32 block. A similar method is used for the 16 × 16 scale.

Coefficients generated from a training database are compiled into a template. The wavelet coefficients are calculated for each color channel (RGB) and for each orientation in an image. The largest absolute value over all of the color channels becomes the corresponding coefficient for that orientation in the image. The coefficients for each orientation are normalized separately over all of the coefficients in the image and averaged over all of the images in the pedestrian image database. The resulting array of coefficients is the pedestrian template for each orientation. The pedestrian training image database consists of 564 color images of pedestrians in frontal and rear positions within a 128 × 64 pixel frame. A non-pedestrian template is also created using 597 color images of natural scenes within a 128 × 64 frame. By visual inspection, the authors select "significant coefficients" from the template which pinpoint areas in the image important for pedestrian classification. Consequently, 29 coefficients are used to form a feature vector for the classification effort. During detection, the system moves a 128 × 64 window throughout the entire space of the input image.
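The block-comparison idea behind these coefficients can be sketched with simple Haar-like differences. This is an illustrative reconstruction rather than the authors' DWT implementation: each coefficient compares average intensities of neighboring halves of a block in the vertical, horizontal, and diagonal directions; the overlap stride and the absence of per-channel normalization are assumptions.

```python
import numpy as np

def haar_coefficients(img, block=32):
    """Haar-like responses at one scale: each coefficient compares the
    average intensities of neighboring halves of a block (vertical,
    horizontal, diagonal). `img` is a 2-D greyscale array; the half-block
    stride is an assumption."""
    step = block // 2
    rows = (img.shape[0] - block) // step + 1
    cols = (img.shape[1] - block) // step + 1
    vert = np.zeros((rows, cols))
    horz = np.zeros((rows, cols))
    diag = np.zeros((rows, cols))
    half = block // 2
    for i in range(rows):
        for j in range(cols):
            y, x = i * step, j * step
            b = img[y:y + block, x:x + block].astype(float)
            left, right = b[:, :half].mean(), b[:, half:].mean()
            top, bottom = b[:half, :].mean(), b[half:, :].mean()
            tl, br = b[:half, :half].mean(), b[half:, half:].mean()
            tr, bl = b[:half, half:].mean(), b[half:, :half].mean()
            vert[i, j] = left - right           # responds to vertical edges
            horz[i, j] = top - bottom           # responds to horizontal edges
            diag[i, j] = (tl + br) - (tr + bl)  # responds to diagonal structure
    return vert, horz, diag
```

A block straddling the silhouette of a person produces a large vertical response while homogeneous interiors and backgrounds produce responses near zero, which is what makes these coefficients usable as an edge-like pedestrian feature.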
Bootstrapping aids the problem of the overwhelming negative class space. False positive detections during testing are grouped into the negative training image database, and the system is retrained. The system is not adaptable since the whole training image database must be submitted to the algorithm, instead of just the new images, when retraining is necessary. Two methods of classification are used. The first is a simple technique called basic template matching, where the ratio of feature vector values in agreement is calculated for each new input image. The second method utilizes the support vector machine. After three bootstrapping sessions, the system trains from 4,597 negative images. From the 141 high quality pedestrian test images, the classifier exhibits a detection rate of 52.7% using basic template matching with 1 false positive per 5,000 windows examined. With the support vector classifier, the system has a detection rate of 69.7% and a false positive detect for every 15,000 windows examined.

2.3 Gaussian Distribution-Based Model and Neural Networks

The approach of Poggio and Sung [6] detects unoccluded vertical frontal views of faces by fitting hand-tuned Gaussian distributions upon example data. They formulate a model of faces and a model of non-faces. Each training image contains a single face which fits inside a 19 × 19 pixel mask. A feature vector for the classification distribution is defined by the absolute pixel intensities of the unmasked pixels. Hence, each input image translates into a vector in R^283 space. An elliptical k-means clustering algorithm groups 4,150 examples of positive data and 6,189 examples of negative data into a predefined number of clusters, and twelve Gaussian boundaries are placed upon the groupings. Six Gaussians, each with a centroid and covariance matrix, are placed upon the positive points, and six are positioned upon the corresponding negative data. Fitting the positive sample space with one Gaussian distribution is not sufficient because there is too much overlap between the positive Gaussian and non-face example feature vectors.
In some cases, non-face patterns lie closer to the positive Gaussian distribution centroid than a true face arrangement. The relationship between incoming feature vectors and the existing face model is encoded into a 2-value distance metric.


The first distance is called the Mahalanobis distance. It represents a separation, in units of standard deviations, of the input point from the cluster distribution. The 75 largest eigenvectors of the Gaussian are used as the discriminating vector space to reduce the chance of overfitting the metric. The second distance is a generic Euclidean distance. It measures the unbiased separation between the input point and the cluster mean within the subspace spanned by the same 75 largest eigenvectors. For each test input, a multi-layer perceptron (MLP) is trained with the 12 pairs of distances as the inputs and one binary output. They train the network using the same 4,150 positive images used to create the positive Gaussian clusters and 43,166 negative images, which include the 6,189 patterns used to create the negative Gaussian clusters. The rest of the negative training images are selected with a bootstrapping methodology. Images generating a false positive detection by the neural network are added to the negative image collection for further training.

Representing a distribution with a centroid and covariance matrix is difficult for high dimensional vector spaces because the number of free parameters in the model is directly proportional to the square of the number of dimensions, and the parameters must be recovered from training data for the detector to be robust. One may reduce the number of model parameters by focusing on the "significant" eigenvectors in the covariance matrix. The significant eigenvectors in the Gaussians correlate to the prominent pixel features in a face image. This information is encoded in the Mahalanobis distance. The less prominent pixel features are encoded in the less significant eigenvectors. The Euclidean distance is supposed to account for the less salient facial characterizations.
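The two-part distance for a single cluster can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cluster is assumed given as a mean and covariance, and the eigendecomposition details (here via `numpy.linalg.eigh`) are assumptions; the subspace dimension of 75 follows the text but is a parameter here.

```python
import numpy as np

def two_value_distance(x, mean, cov, k=75):
    """Two-part distance of input `x` from one Gaussian cluster:
    a Mahalanobis distance scaled by per-axis variance, and a plain
    Euclidean distance, both computed in the subspace spanned by the
    k largest eigenvectors of the covariance."""
    vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]       # indices of the k largest
    E, L = vecs[:, top], vals[top]
    p = E.T @ (x - mean)                   # project onto the subspace
    d_mahal = np.sqrt(np.sum(p**2 / L))    # separation in standard deviations
    d_eucl = np.linalg.norm(p)             # unscaled separation
    return d_mahal, d_eucl
```

With twelve clusters, computing this pair for each cluster yields the 12 distance pairs that the text describes as inputs to the MLP.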
The search for faces in an image is done over all image locations and at a single scale. Their system has a 96.3% detection rate on a test database of 301 CCD images of people with 3 false positives. On a more challenging database of 23 cluttered images with 149 face patterns, their system has a detection rate of 79.9% with 5 false positives.

2.4 Stereo Vision and Neural Networks

The technique of Zhao and Thorpe [7] is a real-time pedestrian detector using two moving cameras and specialized segmentation software. The real-time stereo system Small Vision System (SVS), developed by SRI, constructs a disparity map of the input image based on color and spatial cues so that objects in the foreground of the image may be distinguished from those in the background. Hence, the segmentation software is not influenced by drastic lighting changes, object occlusion, or color variation. A neural network trained with backpropagation is fed the intensity gradient of the resulting foreground partition. Since the background is removed from the training and testing images by stereo image analysis and the neural network learns from examples, their method requires no a priori model or background image.

From the disparity map of an input image, they use thresholding to remove background objects. They then smoothly group together objects of similar disparity and rule out groupings that are too small or too big to be pedestrians. Small pixel blobs that are near each other with close disparity values are integrated into one big blob. Large regions undergo a verification process where subregions are analyzed for the presence of pedestrians and then split apart if multiple positive detections exist. Pedestrians have a high degree of variability in texture and color, so absolute pixel intensities are not used as input information for the detector. Instead, they use the intensity gradient of the pixel groupings in the foreground still found to be potential people as the input vectors to the neural network.
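The gradient-based input feature described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the text does not name the gradient operator, so central differencing (`numpy.gradient`) stands in, and the linear normalization to [0, 1] follows the description of the preprocessing phase.

```python
import numpy as np

def gradient_feature(region):
    """Intensity gradient magnitude of a foreground region, linearly
    normalized to [0, 1]. `region` is a 2-D greyscale array; central
    differencing is an assumption."""
    gy, gx = np.gradient(region.astype(float))
    mag = np.hypot(gx, gy)
    lo, hi = mag.min(), mag.max()
    if hi == lo:                       # flat region: no gradient information
        return np.zeros_like(mag)
    return (mag - lo) / (hi - lo)      # linear normalization to [0, 1]
```

Because the feature depends only on intensity changes, two pedestrians wearing very different clothing map to similar inputs, which is the point of discarding absolute pixel intensities.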
These effects of the preprocessing phase are constrained to a 30 × 65 window, and the region values are linearly normalized to numbers between 0 and 1. A three layer feedforward network is trained with 1,012 positive images of pedestrians and 4,306 negative images. Bootstrapping is used to improve system performance. The network weights are initialized to small random numbers before training, and detection is finalized by thresholding the output of the trained network. The system is tested with 8,400 images of pedestrians and other objects in cluttered city scenes. They achieve a detection rate of 85.2% and a false positive rate of 3.1%. The system performs segmentation and detection on two 320 × 240 images at a framerate ranging from 3 frames/second to 12 frames/second. The system fails when objects that are structurally similar to humans are presented and when occlusion is extreme or the color of the person is similar to that of the background.

2.5 Neural Network Overload

The method of Rowley et al. [8] uses a neural network to detect upright, frontal faces in greyscale images. The training images are specially customized for the algorithm. The eyes, tip of the nose, and corner and center of the mouth for each training image face are labeled manually so that they can be normalized to the same scale, orientation, and position. Bootstrapping is used to solve the problem of finding representative images for the non-face category. False positive images are added to the training set during successive phases of training and testing. Bootstrapping negative images reduces the number of images needed in the training set. A neural network is applied to every 20 × 20 pixel block in an image, and detection of faces at different scales is achieved by applying the filter to an input image that is subsampled. A preprocessing step is performed on the input image before it is passed through the neural network. The first step in the preprocessing phase equalizes the brightness in an oval region inside the 20 × 20 pixel block, and the second step performs histogram equalization within the resulting oval. The network has retinal connections to its input layer.
Four hidden units look at 10 × 10 pixel subblocks, sixteen look at 5 × 5 pixel subblocks, and six look at 20 × 5 pixel stripes. These regions are specifically hand-chosen so that the hidden units learn features unique to faces. The stripes identify mouths or eyes, and the square regions see a nose, individual eyes, or the corner of a mouth. The network has a single output signifying the presence or absence of a face. 1,050 images of faces of varying size, position, orientation, and brightness are gathered for training and manually massaged into images uniform over all training data by creating a mapping of specific pixel locations to face features. The mapping itself scales, rotates, and translates the input image by a least squares algorithm that is run to convergence for each image. Once a uniform image is made, variants of the image are created by rotating, scaling, and translating the model. Non-face images are generated randomly, and the non-face training database is formed by a bootstrapping technique. The network is trained using standard error backpropagation with momentum and initial random weights. Resultant weights from a previous training iteration are used in the next iteration. Random images that generate false positive detections are added to the database for further training. Generation of random data forces the network to set a precise boundary between faces and non-faces.

Two heuristics are introduced to reduce the number of false positives in the initial neural network. Since the network is somewhat invariant to the position of the face up to a few pixels, multiple detections within a specified neighborhood of position and scale are thresholded. The pixel neighborhood and the number of detections found in the neighborhood are the two parameters used. A number of detections greater than the threshold implies a positive detection, and the centroid of the neighborhood is scrutinized again for the presence of a face. A number of detections fewer than the threshold implies a no-detect.
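The neighborhood-thresholding heuristic can be sketched as follows. This is an illustrative reconstruction, not the authors' code: detections are grouped by quantizing their positions so that nearby hits share a bucket, and the radius and count values are assumptions chosen for the example (the original also groups over scale, which is omitted here).

```python
from collections import defaultdict

def threshold_detections(detections, radius=2, min_count=3):
    """Group raw window detections (x, y) within a pixel radius; a group
    counts as a positive detection only if it has at least `min_count`
    members, and its centroid is returned for re-inspection."""
    buckets = defaultdict(list)
    for x, y in detections:
        # quantize positions so nearby detections fall in the same bucket
        buckets[(x // radius, y // radius)].append((x, y))
    centroids = []
    for members in buckets.values():
        if len(members) >= min_count:
            cx = sum(x for x, _ in members) / len(members)
            cy = sum(y for _, y in members) / len(members)
            centroids.append((cx, cy))
    return centroids
```

Isolated single-window hits, which are usually false positives, fail the count test and are discarded, while a true face that fires several overlapping windows survives as one centroid.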
The second heuristic involves result arbitration from multiple neural networks. Each neural network is trained with the same positive image database, but because the set of negative images is randomly chosen from the bootstrap images, the order of presentation and the negative examples themselves differ. Also, the initial weights may differ because they are generated randomly. They try ANDing and ORing the results of two similarly trained networks. Three networks voting a result is also tried. Lastly, they train a neural network to govern the decisions of the arbitrating neural networks to see if such a scheme yields better results than simple boolean functions. Sensitivity analysis is performed on all the networks to determine which features in the face more greatly influence detection. It turns out that the detectors rely heavily on the eyes, then the nose, and then the mouth.

Many different networks are tested with two large data sets containing images different from the training images. 130 of the images are collected at CMU and consist of multiple people in front of cluttered backgrounds. The second set is a subset of the FERET database; each image in the second set consists of only one face and has a uniform background and good lighting. The detection rate of all tried systems on the first data set ranges from 77.9% to 90.3%. ORing the arbitration networks yields the best detection rate but also contributes the most false positives. In the second set of data images, detection success ranges from 97.8% to 100.0% for frontal faces and faces turned less than 15 degrees from the camera. A detection rate range of 91.5% to 97.4% is achieved on faces turned 22.5 degrees from the camera. They determine that the system with two ANDed arbitrating networks produces the best tradeoff between detection rate and false positives. It has a detection rate of 86.2% with a false detect rate of 1 per 3,613,009 test windows on the first test set. On the second test set, it has an average detection rate of 98.1% on the faces at all orientations. The best system takes 383 seconds to process a 320 × 240 pixel image on a 200MHz R4400 SGI Indigo 2.
After modifying the system to allow bigger search windows in steps of 10 pixels, the processing time is reduced to 7.2 seconds, but with the side effect of having more false detects and a lower detection rate.


2.6 Shape-Based Pedestrian Detection

The procedure of Broggi et al. [5] presents a model-based method to detect pedestrians from a moving vehicle with two cameras. The core technique is a model-based approach which focuses on the vertical symmetry and the presence of texture in humans. It checks for human morphological characteristics by incorporating rules into a pixel-level analysis. However, other approaches are used to refine the results. Analysis of stereo disparities in the images provides distance information and gives an indication of the bottom boundary of the pedestrian. Also, an image history is kept to further filter the morphological results. Their system is an additional feature of the ARGO Project, an autopilot mechanism for a vehicle.

A greyscale input image is downsampled to a 256 × 288 pixel block, and a localized region of highest probable pedestrian existence is transformed by a Sobel operator to extract the magnitude and orientation of the edges in the image. Binary edge maps are created of vertical and horizontal edges, and background edges are eliminated from the maps by subtraction of the thresholded and shifted stereo images. They run the resulting binary maps through a filter that concatenates densely packed objects and removes small sparse blobs. A vertical symmetry map is created from the filtered vertical edges map by scanning the image horizontally for vertical symmetries. Humans have a high degree of vertical symmetry but much less of an instance of horizontal symmetry. Under this assumption, non-human objects are ruled out by analyzing the horizontal edges map for horizontal symmetries. A linear density map of the horizontal edge pixels is superpositioned with the vertical symmetry map, with experimentally determined coefficients, to create a probability map of human presence.
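The scoring of one candidate symmetry axis in the scanning step above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: edge pixels with a mirror partner across the axis raise the score, and the match-ratio formula and window width are assumptions.

```python
import numpy as np

def vertical_symmetry_score(edge_map, axis_col, half_width):
    """Score one candidate vertical symmetry axis in a binary edge map:
    the fraction of edge pixels within `half_width` columns of the axis
    that have a mirrored partner on the other side."""
    w = edge_map.shape[1]
    half_width = min(half_width, axis_col, w - 1 - axis_col)
    if half_width <= 0:
        return 0.0
    left = edge_map[:, axis_col - half_width:axis_col]
    right = edge_map[:, axis_col + 1:axis_col + 1 + half_width]
    mirrored = right[:, ::-1]                  # reflect across the axis
    matches = np.logical_and(left, mirrored).sum()
    total = np.logical_or(left, mirrored).sum()
    return matches / total if total else 0.0
```

Sweeping `axis_col` across the image and recording the score per column produces a symmetry map of the kind the text describes; the silhouette of an upright person yields a strong peak at its center column.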
This technique eliminates objects having both strong vertical and horizontal symmetry. The probability map is bolstered by considering a history of images and image entropy, since objects that are uniform typically are not human. The widths of the remaining objects are determined by counting the number of pixels in each vertical edge about the symmetry axis. They choose the boundary to be the column with the highest pixel count on each side. The Sobel map is scanned for a head matching one of a set of predefined binary models of different sizes. The model is constructed by hand-combining features that sample human heads have in common. The bottom boundary is determined by finding an open row of pixels in the vertical edge map between the left and right boundaries of the body. Distance information is calculated based on a combination of prior camera calibration knowledge and the position of the bottom boundary. The bottom boundary is then refined by comparing the calculated distance to the distance determined by comparing the position of the pedestrian in the two stereo images. More rules are checked as the final bounding box is fit for size constraints, aspect ratio, and combined distance and size restrictions. Bounding box construction is sometimes not very accurate concerning the head's position and the detection of lateral borders, and no detection results are presented.

CHAPTER 3
IMAGE SEGMENTATION AND DIMENSIONALITY REDUCTION

3.1 Introduction

The segmentation part of our approach effectively transforms a 220 × 220 input image into a set of overlapping 109 × 28 rectangles and then further filters each rectangle into a point in R^144 space. This low-pass filter averages the pixel brightnesses of overlapping regions in each rectangle and then uses the resulting values as coordinates in a feature vector in R^144. Reasons for using this particular method are given below.

Typically, in example-based learning schemes, the detector is trained with templates of the target class. Training such a detector involves learning relationships between features in the template. Efficient systems reduce the dimensionality of the features to a smaller number which still retains the integrity of the original pattern. In addition, increasing the number of search windows increases the effectiveness of the analysis, because example-based classifiers rely on position and rotation invariance when transferring knowledge from training examples to the test cases. More windows mean less error margin in the object's position and orientation within a detection frame [9]. In order for the system to detect people at different scales, two options exist: either the input image may be downsampled to reduce the size of the features, or larger windows may be introduced [7, 8]. In either case, scanning an image for features tends to be computationally expensive and, in many cases, is the bottleneck of any learning-based classification scheme. Current methods that do not use training examples rely on a priori models as the reference data. Their high detection speed is bought with the complexity of their inference rules, coupled with their inflexibility. We choose the simplicity of a non a priori model without having to scan every single rectangle in the input image, because we average overlapped pixel regions.

Dimensionality reduction can be achieved by paring down the effective components of the feature space. It reduces computation and prevents the system from over-fitting the decision surface of the training data. Using an algorithm that retains all of the bases for the feature space spends too much time computing details that are not unique to the desired object class. Principal Component Analysis (PCA) is typically used to reconstruct a decision space with a subset of its eigenvectors. For any subspace of the trained decision surface, a subset of the eigenvectors spans a subspace whose signal power contributes the least error to that of the original signal. However, the number of eigenvectors necessary to successfully reconstruct a desired object space depends highly on the number of training data and the number of pixels in each image [10]. We use a simplified aspect of the technique of Oren et al. [2] to dimensionally reduce the data set. The filter we employ is a 7 × 7 mean filter applied every 4 pixels. It is a low-frequency representation of an image rectangle.

3.2 Partitioning the Image

Since our experiments focus on the success of the ellipsoidal distribution algorithm instead of the viability of the system at multiple scales, we regard only one scale of pedestrian. However, the method may easily be applied to people of bigger or smaller size. The system begins the testing phase of people classification by analyzing a 220 × 220 greyscale image. The image is divided into two nonoverlapping 109 × 220 rows and two similarly nonoverlapping 220 × 109 columns. Within each row, 13 equally spaced 109 × 109 subimages are selected in a staggered arrangement such that each subimage shares 100 pixels of the previous subimage.
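As a concrete sketch of the reduction described above (the 7 × 7 mean filter applied every 4 pixels, producing the 24 × 6 grid of collective half-blocks for one vertical strip), the following Python/NumPy code is our rendering of the scheme; the exact window placement is our reading, not code from the thesis.

```python
import numpy as np

def strip_features(strip):
    """Reduce a 109 x 28 vertical strip to a point in R^144 by averaging
    7 x 7 windows placed every 4 pixels (24 x 6 collective half-blocks)."""
    assert strip.shape == (109, 28)
    feats = np.empty((24, 6))
    for r in range(24):
        for c in range(6):
            feats[r, c] = strip[4 * r:4 * r + 7, 4 * c:4 * c + 7].mean()
    return feats.ravel()  # feature vector in R^144

# A constant strip reduces to a constant feature vector.
v = strip_features(np.full((109, 28), 9.0))
assert v.shape == (144,)
assert np.allclose(v, 9.0)
```

Each coordinate is simply the average brightness of one 7 × 7 neighborhood, so the vector is a low-frequency summary of the strip.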
Previous techniques motivate the overlapping of template windows [2, 6, 8]. Further segmentation and dimensionality reduction of the space deems the analysis of every possible search window unnecessary. Each column has 13 similarly sized and positioned subimages, and duplicates arise on the four corners. There are a total of 52 subimages in the whole image. The placement of staggered subimages in a 220 × 220 input image is shown in Figure 3-1.

Figure 3-1. Clarification of the placement of staggered subimages in the input image. Each arrow contours a nonoverlapping 109 × 220 row or column, and the staggered 109 × 109 subimages lie along the arrows within each row and column.

The image is further divided into 10 × 10 pixel blocks. The blocks each overlap by one pixel in order to completely fill the subimages, and there are 144 blocks per subimage. Each block is made up of 4 overlapping 7 × 7 half-blocks, and the pixel depth of each overlapping half-block is 4. Figure 3-2 demonstrates the arrangement of half-blocks within an image block. There are 576 half-blocks per subimage. The mean of the pixel intensities within each half-block is assigned to the entire half-block. In essence, the half-blocks are discretized into pixel-like collectives by a 7-pixel-wide low-pass filter. A priori information about the aspect ratio of the human form finalizes the size of the search window. A human's height is approximately four times larger than the width in most positions. On account of this, we take each subimage (109 × 109 pixels, or 24 × 24 collective half-blocks) and split it into 4 vertical strips of 109 × 28 pixels, or 24 × 6 collective half-blocks. The resulting windows tightly encompass pedestrians having a scale of 109 × 28 pixels in most
poses. Figure 3-3 shows an example image of a pedestrian and the result of each segmentation step.

Figure 3-2. Arrangement of half-blocks within an image block. The segmentation algorithm partitions the image into 4 overlapping half-blocks within each block. Each square within the grid represents a pixel in the image, and the thick lines represent block boundaries.

3.3 Dimensionality Reduction

The overlapping half-blocks become distinct collectives, yet they share information with each other, because neighboring half-block brightnesses are encoded in each collective half-block intensity value. This has the effect of "smearing" the intensities across many pixels. Residual feature information may be transferred to areas of the image as far as 20 pixels away in both the x and the y directions. Changing the object components from pixels to collective half-blocks effectively reduces the dimensionality of the feature space from 3,080 down to 144, a 95% reduction. Pixel sharing between neighboring half-blocks within each block gives a
4-pixel maximum variance in the horizontal and vertical directions. An object may shift by as much as four pixels vertically or horizontally, or rotate by as much as 34°, and still maintain a "presence" within the same collective half-blocks.

Figure 3-3. Steps for segmenting an example image of a person. A) Undoctored 220 × 220 input image. B) Magnified view of a subimage containing the person, consisting of 24 × 24 collective half-blocks. C) Magnified view of the vertical strip containing the person, consisting of 24 × 6 collective half-blocks.

3.4 Published Methods

We again review the people detection techniques introduced in Chapter 2, but focus only on the segmentation and dimensionality reduction approaches of each method, to present points of comparison with our method. Rowley et al. [8] initially look at every 20 × 20 pixel block in a test image; they later change the detection windows to 30 × 30 pixel blocks in steps of 10 pixels to reduce the computation time for an image from 383 seconds to 7.2 seconds. Upon finding a positive result in a 30 × 30 block, they more closely scrutinize the area with their standard 20 × 20 detector. The neural networks they use discern features within the face, so their detection windows must be kept small. The numbers of hidden units are experimentally determined, but there is a fine line between the number of hidden units required to determine an underlying trend in a decision space and the number of hidden units that will fit the intricate details of the training data but
not extract the fundamental pattern. Poggio and Sung [6] look at every 19 × 19 subregion location in the primary image during testing. They retain the full dimensionality of the 19 × 19 space (283 dimensions) and fit 6 separate Gaussian distributions to the training data in 283-space using an elliptical k-means algorithm. Oren et al. [2] move a 128 × 64 window through all positions in a test image. They subjectively choose 29 "significant" wavelet coefficients which indicate regions of "intensity change" or regions of "no intensity change" in the learned wavelet template. These 29 coefficients form the feature vector responsible for the classification of people.

CHAPTER 4
PREPROCESSING OF AN IMAGE SEGMENT

4.1 Introduction

Input image preprocessing is the transformation of cryptic data into information that is amenable to a training or testing algorithm. Whichever technique is used must enhance the qualities unique to the target class while deemphasizing externally motivated transients. Most current classification schemes employ preprocessing techniques, but to different extents. We apply basic image filtering to achieve results comparable to those of other techniques. Our method finds the algorithms that increase the rate of detection of people and decrease the number of false positives. Hence, we compare test results from several preprocessing schemes instead of subjectively choosing the final method of image preparation. The techniques applied in previous classification methods prompt those used here: brightness equalization [8], histogram equalization [8], contrast stretching [8, 2, 7], horizontal intensity differencing [2], and vertical intensity differencing [2]. There are many more methods to unify images in unpredictable lighting situations than we have enumerated here, but these are among the practices that show up repeatedly in the detection methodologies we analyzed.

4.2 Brightness Equalization

There are several methods used to reduce unwanted global or partial brightness variations caused by a changing environment. One of these is brightness equalization, or level shifting. Some intensity changes exemplify a property of an object in uniform lighting conditions, while others are provoked by a localized external light source or sink. A lighting equalization operator reduces the effects of luminance shifts caused by focused light variations. Current preprocessing
techniques try to filter out localized intensity differentials, typically by applying a polynomial transform to the pixel intensities of the entire image. Figure 4-1 shows the result of this background leveling technique on the uniformity of the intensities. Level shifting is successful if the background level changes gradually and can be modeled by a polynomial.

Figure 4-1. Effects of brightness equalization on intensity uniformity. A) Image of rice with localized brightness nonuniformities. B) The same image after a linear brightness equalization filter is applied.

Rowley et al. [8] execute brightness equalization on every similarly masked 20 × 20 oval in their search space before feeding the ovals to their system of neural networks. Their luminance equalization filter is essentially a linear function fitted to the average intensity values of small pixel regions within the image. We perform luminance equalization on the 220 × 220 greyscale image before segmentation and dimensionality reduction take place [8]. We fit piecewise continuous linear functions to both the brightest and darkest pixels in the image. This has the advantage of keeping both the background and the contrast of the image consistent.
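As an illustration, a global version of this leveling can be sketched by fitting a single best-fit plane to the intensities and subtracting it. This is our simplified Python/NumPy stand-in for the piecewise-linear fit described above, not the thesis implementation:

```python
import numpy as np

def level_shift(img):
    """Fit a plane a*x + b*y + c to the image intensities by least squares
    and subtract it, restoring the original mean brightness."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    A = np.column_stack([xx.ravel(), yy.ravel(), np.ones(h * w)])
    coef, *_ = np.linalg.lstsq(A, img.ravel().astype(float), rcond=None)
    plane = (A @ coef).reshape(h, w)
    return img - plane + img.mean()

# A pure left-to-right brightness ramp is flattened to its mean value.
ramp = np.tile(np.arange(8, dtype=float), (8, 1))
assert np.allclose(level_shift(ramp), ramp.mean())
```

Fitting separate functions to the brightest and darkest pixels, as in the text, additionally keeps the contrast consistent; the single plane fit above only levels the background.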

4.3 Histogram Equalization and Contrast Stretching

Histogram equalization transforms the contrast and range of a set of pixel intensity values by providing a typically nonlinear mapping of the original values. In effect, the brightness histogram of the resulting image becomes uniform, or flat, after its application. This technique is useful for picking out details in an image that are difficult for humans to see, or for helping a classifier distinguish objects known to have densely represented intensity values. However, relative pixel information across the image is not preserved. Contrast stretching, on the other hand, linearly scales the input pixel intensities so that the range of values is stretched to desired minimum and maximum bounds. This technique preserves intensity differences throughout the entire image. Figure 4-2 shows the effects of contrast stretching and histogram equalization on an example image. In addition to brightness equalization, Rowley et al. [8] perform histogram equalization and contrast stretching on the search space. Oren et al. [2] normalize the wavelet coefficients of their training images. Zhao and Thorpe [7] normalize the values of the output of an edge detection algorithm before they are input to a neural network. We execute contrast stretching and histogram equalization after the segmentation and dimensionality reduction step. The contrast stretching step (also called normalization) changes the dynamic range of the pixel values to one bounded by 0 and 100. Figure 4-3 shows the vertical strip of the pedestrian from Chapter 3 before and after contrast stretching and histogram equalization, respectively.

4.4 Horizontal and Vertical Intensity Differencing

The next phase involves the implementation of two very simple region difference extraction filters. Good results have been generated by the technique of Oren et al. [2], which uses wavelets to encode region intensity differences for feature extraction, because relative quantities eliminate low-frequency noise and more consistently explicate human shape. In fact, the application of Haar wavelets uses
our same differencing scheme when generating the coefficients of highest frequency. It seems logical that we use a high-frequency differencing scheme after extracting low-frequency information during dimensionality reduction. We exercise the spirit of the approach but simplify the application.

Figure 4-2. Contrast stretching and histogram equalization filters applied to a sample image. A) Image of the lunar surface. B) Image after the contrast stretching filter is applied. C) Image after the histogram equalization filter is applied.

One method, called horizontal differencing, replaces absolute pixel values with differences calculated horizontally. If i represents an image row, j represents an image column in the 24 × 6 vertical strip from Chapter 3, and x_ij signifies the intensity value at pixel location (i, j), then the
horizontal differencing technique achieves the following:

    x_{ij} = \begin{cases} |x_{ij} - x_{i(j+1)}| & 1 \le j < 6 \\ 0 & j = 6 \end{cases}    (4.1)

The second technique also finds relative pixel intensities, but along the columns of the image instead of the rows:

    x_{ij} = \begin{cases} |x_{ij} - x_{(i+1)j}| & 1 \le i < 24 \\ 0 & i = 24 \end{cases}    (4.2)

Figure 4-3. Contrast stretching and histogram equalization filters applied to an image of a pedestrian. A) 24 × 6 vertical strip image of the pedestrian from Chapter 3. B) Image after the histogram equalization and normalization filters are applied.

Figure 4-4 shows the horizontal and vertical differencing effects on the 24 × 6 vertical strip pedestrian image. This preprocessing step is performed after segmentation and dimensionality reduction. The end result of the analysis of one 220 × 220 image is a set of 208 feature vectors in 144-space, and the image is ready for either classifier training or people recognition.
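Equations 4.1 and 4.2 translate directly into code. The sketch below is our NumPy version, demonstrated on a small array for clarity; in the system it would be applied to the 24 × 6 strip.

```python
import numpy as np

def horizontal_difference(strip):
    """Eq. 4.1: replace each value with |x_ij - x_i(j+1)|; last column -> 0."""
    out = np.zeros_like(strip, dtype=float)
    out[:, :-1] = np.abs(np.diff(strip, axis=1))
    return out

def vertical_difference(strip):
    """Eq. 4.2: replace each value with |x_ij - x_(i+1)j|; last row -> 0."""
    out = np.zeros_like(strip, dtype=float)
    out[:-1, :] = np.abs(np.diff(strip, axis=0))
    return out

strip = np.array([[1.0, 4.0, 2.0],
                  [3.0, 3.0, 7.0]])
assert np.array_equal(horizontal_difference(strip),
                      np.array([[3.0, 2.0, 0.0], [0.0, 4.0, 0.0]]))
assert np.array_equal(vertical_difference(strip),
                      np.array([[2.0, 1.0, 5.0], [0.0, 0.0, 0.0]]))
```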

Figure 4-4. Horizontal and vertical differencing techniques applied to an image of a pedestrian. A) 24 × 6 vertical strip image of the pedestrian. B) Image after horizontal pixel differencing is applied. C) Image after vertical pixel differencing is applied.

CHAPTER 5
CLASSIFIER ALGORITHM

5.1 Background

Based on the discussion in Chapters 3 and 4, we know that the dimensionality of our feature vectors of people is 144. The space of all possible vectors, v ∈ R^144, signifies every possible segmented and preprocessed image input into our system. We wish to delineate a subset of these vectors within several ellipsoidal boundaries. The vectors within the ellipsoids refer to representations of people, while vectors outside of the ellipsoids ideally are representations of non-people. Previous work has been done to classify high-dimensional image features with precisely placed Gaussian distributions [6]. We instead allow the system itself to place ellipsoids upon the data and adaptively stretch or contract them during training. Neither the number of ellipsoids nor the major or minor axis distances are constrained; positive examples of people are consumed by expanding ellipsoids, and negative examples are rejected via ellipsoidal contraction. Figure 5-1 demonstrates dilation and contraction for the two-dimensional case. Dilation and contraction operations are encoded into the construction of linear transforms applied to positive and negative input vectors. The testing phase consists of inclusion tests of points within the established ellipsoidal contours. A good starting point for the analysis of the boundary conditions is a look at spheroids, because they are the simplest ellipsoids, and the boundary test for a point's inclusion in an ellipsoid builds on the analogous test for a spheroid. We formulate a method of testing data inclusion within a general spheroid in 144-space in Section 5.2, based on the technique used by Kositsky and Ullman [1], and then we introduce the extensions required to adapt the boundary type to an
ellipsoidal one in Section 5.3, also based on the same authors' work. Section 5.4 discusses the combination of the two transforms and provides the finer points of the complete operator.

Figure 5-1. Demonstration of ellipsoidal dilation and contraction. A two-dimensional ellipsoid dilates to capture a positive point P and contracts to throw out a negative point N.

5.2 Linear Transform for Spheroids

A spheroid is an ellipsoid whose axes are all equal in magnitude. This equality simplifies the equation of a spheroid into a sum of n squared terms:

    \frac{x_1^2}{r^2} + \frac{x_2^2}{r^2} + \frac{x_3^2}{r^2} + \cdots + \frac{x_n^2}{r^2} = 1    (5.1)

where r is the radius of the spheroid and n is the spheroid's dimensionality. To determine whether a particular vector is within a given spheroid, a sequence of scalings is performed on each component of the vector. Such scalings contract or expand a point on the boundary of a spheroid of radius r to a point on the boundary of a spheroid of radius 1, if the spheroid is centered at the origin. Points inside or outside of the original spheroid end up in a proportionally equivalent
location in the new spheroid. A linear map can perform vector scaling along specific directions. Such a linear map is defined in this way: there exist a set of vectors D ⊂ R^144 and another set of vectors Q ⊂ R^144 such that there is a linear map f_c : D → Q, where f_c(x) = Lx for some 144 × 144 matrix L, some reference point c ∈ D, and all x ∈ D. The symbol D refers to the set of vectors, relative to the current spheroidal center c, received by the system for classification or training. The symbol Q represents the set of linearly mapped relative vectors. Each component of the input vector is scaled by the same amount to effectively compress or distend points along a direction which seeks or avoids the center of the spheroid. Hence, the direction along which the scaling is executed depends on the input point and is always perpendicular to the tangent of the spheroid at the transformed point. Given an input point x, an output vector of the following form is sought:

    f(x) = \frac{1}{r} x    (5.2)

where r is the radius of the original spheroid. The matrix L takes on the following configuration to perform this operation:

    L = \begin{bmatrix} \frac{1}{r} & 0 & \cdots & 0 \\ 0 & \frac{1}{r} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{r} \end{bmatrix}    (5.3)

The entries in L along the diagonal are just the square roots of the coefficients of Equation 5.1. We are now in a position to perform an inclusion test on the input point. It is important to note that a test point must first be rewritten relative to the center of the spheroid, because the matrix L scales the components closer to or farther from the spheroid's center. If the point x is inside or on the boundary of a
spheroid of radius r centered at the point c, then the resulting vector L(x − c), when referenced to c, is inside or on the boundary of a spheroid of radius 1 centered at c. If x lies on the boundary, the magnitude of the new relative vector, |L(x − c)|, is equal to 1. If the norm is less than 1, then the point x is within the original spheroid. If the magnitude is greater than 1, then x is outside the confines of the original spheroid of radius r. Figure 5-2 demonstrates the transformation in two dimensions.

Figure 5-2. Linear transformation of a spheroid. Transformation of a point x from the boundary of a spheroid of radius r to the boundary of a unit spheroid. A) shows contraction, while B) shows expansion.

5.3 Ellipsoidal Transform

The formula for a spheroid is derived from the general equation of an ellipsoid, which assumes a more complicated formalization. The equation of an ellipsoid takes the following general structure:

    \sum_{1 \le i \le n} \sum_{1 \le j \le n} a_{ij} x_i x_j = 1, \qquad a_{ij} \in \mathbb{Z}^{+} \cup \{0\} \;\; \forall i, j    (5.4)

where n is the dimensionality of the space. There are more terms in the ellipsoid equation than in the spheroid equation. The morphological reason for this is that an ellipsoid is a spheroid whose boundary point components are scaled in a finite number of directions and by different amounts. The directions correspond to the directions of the major and minor axes, and the scaling amounts refer to
the distance discrepancies of the corresponding major and minor axes. On the other hand, a spheroid results when original spheroidal boundary points are scaled over all directions by similar amounts. The extra terms are introduced when the minor and major axes do not coincide with the coordinate axes. Again, we want to transform a hyperdimensional ellipsoid into a unit spheroid, because the resulting inclusion test for points becomes trivial. We focus on a point x on the boundary of a given ellipsoid. We assume that only one dilation or contraction in one direction is needed to transform the ellipsoid into a spheroid. We may make this claim because, as we will see, such a transform is linear, and linear mappings compose transitively. In other words, if

    f(x) = y = Lx \quad \text{and} \quad g(y) = z = Ky    (5.5)

are both linear mappings, then

    z = g(f(x))    (5.6)

or, using the matrix equivalents,

    z = KLx    (5.7)

Since each mapping corresponds to a unidirectional ellipsoidal contraction or dilation, a sequence of expansions and constrictions translates to a sequence of matrix multiplications. We wish to retain proportionality across the transform along the direction being modified, so that the proportion of the point's projection along the modified direction to the major axis of the conic remains constant across the mapping. Given that x̃ is the boundary point on the modification axis, x is any point on the boundary of the ellipsoid, c is the center of the ellipsoid, and remembering that all points are actually vectors in R^144 space relative to c, we use the following vector formula to transform x into a point y on the boundary of a
unit spheroid centered at c. Figure 5-3 is a visual representation of the mapping.

    y = f_c(x) = x - \frac{\tilde{x} \cdot x}{|\tilde{x}|}\,\frac{\tilde{x}}{|\tilde{x}|} + \frac{\tilde{x} \cdot x}{|\tilde{x}|}\,\frac{\tilde{x}}{|\tilde{x}|^2}    (5.8)

Figure 5-3. Mapping of an ellipsoid to a unit spheroid knowing the dilation axis. Transformation of a point x on the boundary of an ellipsoid to a point y on the boundary of a unit spheroid. Both are centered at point c, and the dilation axis is x̃.

The vector projection of x along x̃ is subtracted from x so that the result has no component along the major axis and is perpendicular to this modification axis. Then, the third term in the sum adds back the original component of x along the major axis, scaled by the inverse of the magnitude of x̃. Since the magnitude of x̃ is always larger than or equal to the projection of x along the major axis, the ratio is always less than or equal to one. The culmination of the sum is a point on the boundary of a unit spheroid centered at c, maintaining the same proportional distance to other points along the modified axis. The mapping's corresponding matrix operator, L, has the following form (with indices 0 through n and \hat{x} = \tilde{x}/|\tilde{x}|):

    L = \begin{bmatrix}
        1 - \hat{x}_0^2\,\frac{|\tilde{x}|-1}{|\tilde{x}|} & -\hat{x}_0 \hat{x}_1\,\frac{|\tilde{x}|-1}{|\tilde{x}|} & \cdots & -\hat{x}_0 \hat{x}_n\,\frac{|\tilde{x}|-1}{|\tilde{x}|} \\
        -\hat{x}_1 \hat{x}_0\,\frac{|\tilde{x}|-1}{|\tilde{x}|} & 1 - \hat{x}_1^2\,\frac{|\tilde{x}|-1}{|\tilde{x}|} & \cdots & -\hat{x}_1 \hat{x}_n\,\frac{|\tilde{x}|-1}{|\tilde{x}|} \\
        \vdots & & \ddots & \vdots \\
        -\hat{x}_n \hat{x}_0\,\frac{|\tilde{x}|-1}{|\tilde{x}|} & -\hat{x}_n \hat{x}_1\,\frac{|\tilde{x}|-1}{|\tilde{x}|} & \cdots & 1 - \hat{x}_n^2\,\frac{|\tilde{x}|-1}{|\tilde{x}|}
    \end{bmatrix}    (5.9)

Since y = Lx lies on the contour of a unit spheroid, the following is true:

    |Lx| = 1    (5.10)
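Because L is the identity minus a rank-one term along the dilation axis, it is compact to build in code. The sketch below is our NumPy rendering of Equations 5.8 and 5.9 (the function name is ours), checked on a two-dimensional ellipse:

```python
import numpy as np

def dilation_matrix(x_tilde):
    """Matrix L of Eq. 5.9: I - ((|x~| - 1)/|x~|) u u^T with u = x~/|x~|.
    Applying L scales components along x_tilde so that the boundary point
    x_tilde itself lands on the unit spheroid."""
    m = np.linalg.norm(x_tilde)
    u = x_tilde / m
    return np.eye(len(x_tilde)) - ((m - 1.0) / m) * np.outer(u, u)

# 2-D check: ellipse with semi-axes 3 (along x) and 1 (along y).
L = dilation_matrix(np.array([3.0, 0.0]))
assert np.isclose(np.linalg.norm(L @ np.array([3.0, 0.0])), 1.0)  # boundary -> unit circle
assert np.isclose(np.linalg.norm(L @ np.array([0.0, 1.0])), 1.0)  # other boundary point too
assert np.linalg.norm(L @ np.array([1.0, 0.0])) < 1.0             # interior stays interior
```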

We have just shown that if we are given an ellipsoid that is one transfer function away from a spheroid, and we know the vector form of the axis of dilation or contraction, then we can determine whether a point relative to the center of the ellipsoid is inside the given ellipsoid. At the same time, we may define the ellipsoid to be the linear operator used to squash or expand it into a unit spheroid, because the boundary is defined by the transform. Since linear transforms compose and an ellipsoid may be portrayed as a spheroid whose points are scaled in a finite number of directions, an ellipsoid in general can be represented by a sequence of matrix multiplications, where each matrix involved in the product represents a single dilation or contraction.

5.4 Classifier Rules

The process of contraction and dilation occurs in the following way. Initially, if a new input point cannot be engulfed by any existing ellipsoid, then it becomes the center of a new spheroid of radius r. The radius is a user-defined constant which constrains the volume of the initial hyperdimensional conic. If this value is too large, then the training algorithm must work harder to shrink the ellipsoids; if this value is too small, enlarging the conics becomes processor intensive. We choose a value on the small side, because the set of points that represent the existence of people is a much smaller set than R^144. Our starting radius is 10% of each vector component's possible maximum. Ultimately, the value matters little, because the training algorithm automates the learning process without any a priori constraints. The continual process of reshaping the ellipsoids fits the contours of the conics to the data regardless of the initial value of the spheroid radius.
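The initialization and inclusion test can be sketched as follows. The function and field names are ours, and R_INIT = 10 assumes the 0-100 component range produced by the contrast stretching of Chapter 4:

```python
import numpy as np

R_INIT = 10.0  # 10% of the 0-100 range each component can take

def new_ellipsoid(x):
    """A fresh spheroid of radius R_INIT centered on an unclaimed point.
    Its universal operator C = (1/r) I maps the radius-r boundary onto
    the unit spheroid."""
    center = np.asarray(x, dtype=float)
    return center, np.eye(len(center)) / R_INIT

def contains(center, C, x):
    """Inclusion test: |C (x - c)| < 1 means x lies inside the ellipsoid."""
    return np.linalg.norm(C @ (np.asarray(x, dtype=float) - center)) < 1.0

c, C = new_ellipsoid(np.zeros(144))
assert contains(c, C, np.full(144, 0.5))      # |x - c| = 6 < 10: inside
assert not contains(c, C, np.full(144, 2.0))  # |x - c| = 24 > 10: outside
```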
The transform matrix identifies the shape of the ellipsoid, but two points on its boundary determine the directions of dilation and contraction. These points are called the last contraction point (LCP) and the last dilation point (LDP). The LCP is the last negative sample point previously inside the ellipsoid
but now on the boundary after the last contraction. The LDP is the last positive sample point previously outside of the ellipsoid but recently captured by it. The points are also vectors that are referenced from the ellipsoid's center. The next dilation axis is always perpendicular to the LCP vector and lies in the plane formed by the LCP, the ellipsoidal center, and the new positive sample point. Analogously, the contraction axis is always perpendicular to the LDP vector and lies in the plane containing the LDP, the center, and the new contraction point. This simple rule ultimately prevents an ellipsoid from blindly and immediately reintroducing a point that it purposely expelled just before, or from throwing out a point that it just acquired. If a dilation is being performed, we must find a dilation axis whose direction is perpendicular to the LCP vector. Analogously, if a contraction is taking place, we must find a compression axis whose direction is perpendicular to the LDP vector. Let the LDP or LCP vector be depicted by v, and let x be the new sample point. The compression/dilation axis, e, is given by:

    x + e = \frac{x \cdot v}{|v|}\,\frac{v}{|v|}    (5.11)

Since e can be scaled by any value and retain its direction, we have

    e \to K e = v - \frac{|v|^2}{x \cdot v}\,x \quad \text{for some } K \in \mathbb{R}    (5.12)

Only the direction of e is important in the dilation or contraction equations, so we do not care about the actual value of K. Based on this information, we can now determine the transfer function for dilation or contraction. We replace x̃ in Equation 5.8 with the vector e:

    y = f_c(x) = x - \frac{e \cdot x}{|e|}\,\frac{e}{|e|} + \frac{e \cdot x}{|e|}\,\frac{e}{|e|^2}\,y_e    (5.13)

where

    e \cdot x = \sqrt{|x|^2 - \left(\frac{x \cdot v}{|v|}\right)^2}    (5.14)
and

    y_e = \sqrt{1 - \left(\frac{x \cdot v}{|v|}\right)^2}    (5.15)

and e is as defined in Equation 5.12. The extra term on the end, y_e, is the projection of the output point y along e. In Equation 5.8 this term is 1. The head of the vector e does not necessarily lie on the boundary of the ellipsoid, so the extra term is needed to scale the result properly. Figure 5-4 displays the aforementioned linear mapping in two dimensions.

Figure 5-4. Mapping of an ellipsoid to a unit spheroid by calculating the dilation axis. Transformation of a point x on the boundary of an ellipsoid to a point y on the boundary of a unit spheroid. Both are centered at point c, and the dilation axis is determined based on the position of the LCP (v).

The analogous linear operator L becomes (with \hat{e} = e/|e|):

    L = \begin{bmatrix}
        1 - \hat{e}_0^2\,\frac{|e|-y_e}{|e|} & -\hat{e}_0 \hat{e}_1\,\frac{|e|-y_e}{|e|} & \cdots & -\hat{e}_0 \hat{e}_n\,\frac{|e|-y_e}{|e|} \\
        -\hat{e}_1 \hat{e}_0\,\frac{|e|-y_e}{|e|} & 1 - \hat{e}_1^2\,\frac{|e|-y_e}{|e|} & \cdots & -\hat{e}_1 \hat{e}_n\,\frac{|e|-y_e}{|e|} \\
        \vdots & & \ddots & \vdots \\
        -\hat{e}_n \hat{e}_0\,\frac{|e|-y_e}{|e|} & -\hat{e}_n \hat{e}_1\,\frac{|e|-y_e}{|e|} & \cdots & 1 - \hat{e}_n^2\,\frac{|e|-y_e}{|e|}
    \end{bmatrix}    (5.16)

The resulting matrix L specifies a linear scaling in a single direction, along only e, to add or expel an introduced positive or negative sample, respectively. The matrix C encodes all of the linear scalings done previous to the current one. A recursive methodology updates the universal operator C in this way:

    C_{new} = C_{old} L    (5.17)
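A single dilation step, in our reading of Equations 5.11 through 5.17, can be sketched as follows. The working frame is assumed to be the one in which the current operator has already been applied, so that the LCP v has unit length; the function name is ours:

```python
import numpy as np

def dilation_update(C, v, x):
    """One dilation step (Eqs. 5.11-5.17): stretch the ellipsoid along an
    axis perpendicular to the LCP v so that the new positive sample x lands
    on the boundary, then fold the stretch into the universal operator C."""
    # Eq. 5.11 rearranged: e runs from x to the projection of x onto v,
    # so e is perpendicular to v.
    e = (np.dot(x, v) / np.dot(v, v)) * v - x
    e_mag = np.linalg.norm(e)          # equals the quantity in Eq. 5.14
    e_hat = e / e_mag
    # Eq. 5.15: target projection of the output point along e.
    y_e = np.sqrt(1.0 - (np.dot(x, v) / np.linalg.norm(v)) ** 2)
    # Eq. 5.16 as a rank-one update of the identity.
    L = np.eye(len(v)) - ((e_mag - y_e) / e_mag) * np.outer(e_hat, e_hat)
    return C @ L                        # Eq. 5.17: C_new = C_old L

# 2-D check: LCP on the unit circle, new point outside but within the hyperplanes.
v = np.array([1.0, 0.0])
x = np.array([0.5, 2.0])
C = dilation_update(np.eye(2), v, x)
assert np.isclose(np.linalg.norm(C @ x), 1.0)  # x is captured onto the boundary
assert np.isclose(np.linalg.norm(C @ v), 1.0)  # the LCP stays on the boundary
```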


There are certain constraints on an ellipsoid that prevent it from capturing a new input point. The addition of a new point to an existing ellipsoid is possible only when the added point is within the hyperplanes of the prospective ellipsoid. If the projection of the additional point along the LCP vector is greater than the magnitude of the LCP, then it is clear that the ellipsoid cannot capture it, because the LCP must remain on the boundary of the dilated ellipsoid. Figure 5-5 shows the conceptual difficulty of an ellipsoid capturing a positive point when it is not within its hyperplanes.

Figure 5-5. Input point not within the hyperplanes of an ellipsoid. A prospective ellipsoid cannot dilate to capture a positive input point, x, if it is not within the hyperplanes h_1 and h_2, because v must remain on the boundary of the ellipsoid.

A test is performed prior to dilation to determine if the current ellipsoid is a potential candidate to engulf the new point. Symbolically, the test is equivalent to the following inequality:

    |v \cdot x| < |v|    (5.18)

where v is the LCP and x is the new sample point. Contractions operate on a sample point within an ellipsoid, so the point is necessarily within the hyperplanes of the conic.

Now, an overview of the formation of the ellipsoidal distribution is presented. A system consists of an ordered set of linear operators, E. Each linear operator defines an ellipsoid in R^{144} space. A positive or negative sample point, x, is introduced to the system. The matrix C_i is the universal linear operator for an existing ellipsoid i. If the point is a positive sample, then the ellipsoid inclusion test is performed with the following inequality:

    |C_i x| < 1, \quad C_i \in E    (5.19)

The inclusion test is done for each ellipsoid i until the inequality is met or E is consumed. If the point is within an ellipsoid, nothing more is done. If no ellipsoid contains the point, then the hyperplane test is performed for each existing ellipsoid i until the test succeeds or the end of E is reached. If the point is within an ellipsoid's hyperplanes, that ellipsoid is stretched to capture the point, and C_i is updated by matrix multiplication with the linear operator for the dilation, L. Otherwise, the point becomes the center of a new ellipsoid and C becomes \frac{1}{10} I.

If the sample point is a representation of a non-human and it is within any existing ellipsoid i, that ellipsoid is contracted to place the point on the boundary. The universal operator C_i is updated by multiplication with the contraction operator, L. Otherwise, nothing more is done.
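The per-sample update just described can be sketched as a loop over the ordered set E. The Ellipsoid class below is a deliberately minimal stand-in: its operator C is a scalar multiple of the identity (the initial C = I/10 of a new ellipsoid), and the dilation and contraction updates themselves are elided:

```python
class Ellipsoid:
    def __init__(self, center, scale=0.1):   # a fresh ellipsoid has C = I/10
        self.center = list(center)
        self.scale = scale
        self.lcp = None                      # LCP, relative to the center

    def rel(self, x):
        return [xi - ci for xi, ci in zip(x, self.center)]

    def contains(self, x):                   # inclusion test, Eq. 5.19
        r = self.rel(x)
        return self.scale * sum(v * v for v in r) ** 0.5 < 1.0

    def within_hyperplanes(self, x):         # hyperplane test, Eq. 5.18
        if self.lcp is None:
            return False
        v, r = self.lcp, self.rel(x)
        d = sum(vi * ri for vi, ri in zip(v, r))
        return abs(d) < sum(vi * vi for vi in v) ** 0.5

def introduce_positive(x, E):
    """One positive-sample update of the ordered set E (dilation elided)."""
    for ell in E:
        if ell.contains(x):
            return                           # already captured: nothing to do
    for ell in E:
        if ell.within_hyperplanes(x):
            return                           # would dilate here: C_i <- C_i L
    E.append(Ellipsoid(x))                   # x seeds a new ellipsoid

E = []
introduce_positive([0.0, 0.0], E)            # first point: new ellipsoid
introduce_positive([1.0, 1.0], E)            # inside the radius-10 sphere
assert len(E) == 1
introduce_positive([50.0, 0.0], E)           # outside all: second ellipsoid
assert len(E) == 2
```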


CHAPTER 6
EXPERIMENTS

6.1 Training the Classifier

The ellipsoidal classifier is trained with approximately 1000 images of people having a size of 110 x 55 pixels. They are pictures from the same database used to train the SVM classifier in the procedure of Oren et al. [2]. Negative image samples of an equivalent size are also used to train the system. It is difficult to represent the entire class of non-humans with a marginal number of images. Hence, synthetic images are created as negative sample points to train the algorithm. Approximately 150 images of indoor and outdoor scenes were downloaded from the internet, and 850 of the total negative images have randomly generated pixel intensities. Precedent for the use of random data as negative samples is seen in the approach of Rowley et al. [8]. The large size of the non-human image class dictates the use of random samples to present a more complete example of the class. Figure 6-1 displays several illustrations of positive and negative training images.

Figure 6-1. Examples of positive and negative training images, respectively.

6.2 Preprocessing Schemes

Several training schemes are produced by coupling different image preparation methods with the ellipsoidal trainer. The experimental results obtained from each trained scheme are compared to determine the best image preprocessing techniques to use for detection. The systems differ in the amount of preprocessing done and the type of feature extraction performed. All of the schemes use the same training images, and all perform segmentation, dimensionality reduction, and normalization of the input images so that the feature space is reduced and unified. Preprocessing scheme 1 additionally performs horizontal differencing to the preprocessed image to create the feature vector. Preprocessing scheme 2 executes vertical differencing instead. Preprocessing scheme 3 introduces brightness equalization to level the background of the raw image before further processing occurs. It also uses horizontal differencing for feature extraction. Preprocessing scheme 4 is similar to scheme 3 except that it uses vertical instead of horizontal differencing. Preprocessing scheme 5 implements a nonlinear histogram equalization process applied to the segmented and dimensionally reduced vertical strip image. Horizontal differencing takes place afterward. Preprocessing scheme 6 is similar to scheme 5 except it uses vertical differencing of pixel intensities instead of horizontal differencing.

6.3 Testing Phase

Positive and negative images different from the training data are analyzed by each preprocessing scheme, and the corresponding output is fed into the ellipsoidal detector, which is produced via the instruction of the associated training scheme. Two groups of test data are presented to each detector. The first group of test images is divided into subsets of positive and negative images. Subset 1 consists of 142 images of people taken from the same database used for training but different from the training images themselves. These images approach an ideal test group because the size constraint of the subjects is consistent with that of the detector, and the poses are limited to frontal and back orientations. Examples of positive images from test group 1 are shown in Figure 6-2.

Figure 6-2. Examples of positive images from the first test group.

Subset 2 is a collection of 127 negative images that were downloaded from the internet and chosen specifically because they resemble humans structurally. Hence the chance of a false detect within this subset is more probable. Some examples from subset 2 are shown in Figure 6-3.

Figure 6-3. Examples of negative images from the first test group.

A second group of images is considered to be more complex test material because the data is not staged, and few environmental variables are controlled. This test group contains 31 greyscale images of size 220 x 220 taken by a digital camera of real-world outdoor scenes. The images are processed only by the detector scheme with the best positive and negative detection rates on the group 1 test images.
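For reference, the six preprocessing schemes of Section 6.2 differ in only three switches; one way to tabulate them (the field names are our own, not the thesis's):

```python
# The six preprocessing schemes of Section 6.2 as configuration records.
# Every scheme also performs segmentation, dimensionality reduction, and
# normalization; only the three switches below vary.
SCHEMES = {
    1: {"brightness_eq": False, "hist_eq": False, "differencing": "horizontal"},
    2: {"brightness_eq": False, "hist_eq": False, "differencing": "vertical"},
    3: {"brightness_eq": True,  "hist_eq": False, "differencing": "horizontal"},
    4: {"brightness_eq": True,  "hist_eq": False, "differencing": "vertical"},
    5: {"brightness_eq": False, "hist_eq": True,  "differencing": "horizontal"},
    6: {"brightness_eq": False, "hist_eq": True,  "differencing": "vertical"},
}

# Even-numbered schemes are the vertical-differencing variants.
assert all(SCHEMES[s]["differencing"] == "vertical" for s in (2, 4, 6))
assert not (SCHEMES[1]["brightness_eq"] or SCHEMES[1]["hist_eq"])
```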


Also, before the second test group is analyzed, the chosen detector undergoes a semi-automated bootstrapping procedure. A webpage is updated every few seconds with an image processed by the detector scheme. The detector draws a box in the image around an area where a person is found, and the source of the original image is a digital camera pointed outside at an area where people frequently walk. When the webpage is monitored and the image displays a false positive, a script is executed which actively includes the offending original camera image in the training database of negative images. Another script saves an image in the database of positive training images in the instance of a false negative. The detector may then be retrained at the user's convenience. Bootstrapping the system to achieve better performance is used in many of the techniques that are examined [2, 6, 7, 8, 9]. Figure 6-4 gives a selection of images from the group 2 database.

6.4 Sensitivity Analysis

The ruggedness of the system is tested by introducing blurred versions of test images from the first testing group to the classifier. Three levels of a Gaussian blur convolution mask are produced by varying the exponential decay function of the 20 x 20 mask. Each level varies from the previous by an order of magnitude of the exponential power. Figure 6-5 shows the blurring levels.
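The semi-automated bootstrapping step above amounts to routing each misclassified frame into the opposite training database. A minimal sketch; the function and the database lists are our own stand-ins for the thesis's webpage-and-script setup:

```python
def bootstrap(image, detected_person, truly_person, pos_db, neg_db):
    """Queue a misclassified camera image for retraining: a false positive
    joins the negative database, a false negative joins the positive one;
    correct decisions leave both databases untouched."""
    if detected_person and not truly_person:
        neg_db.append(image)   # false positive
    elif truly_person and not detected_person:
        pos_db.append(image)   # false negative

pos_db, neg_db = [], []
bootstrap("frame_01.pgm", True, False, pos_db, neg_db)   # false positive
bootstrap("frame_02.pgm", False, True, pos_db, neg_db)   # false negative
bootstrap("frame_03.pgm", True, True, pos_db, neg_db)    # correct detection
assert neg_db == ["frame_01.pgm"] and pos_db == ["frame_02.pgm"]
```

The detector is then retrained from the augmented databases at the user's convenience, as in the thesis.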


Figure 6-4. Selections of images from the second test group.

Figure 6-5. Different blurring levels of a test image. A) Original test image. B) Same test image put through a 20 x 20 Gaussian blur filter with an exponential decay constant of 0.35. C) Exponential decay constant of 0.035. D) Exponential decay constant of 0.0035.
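The three blur levels of Figure 6-5 differ only in how quickly the mask weights fall off. A sketch of such a 20 x 20 mask, assuming weights of the form exp(-a r^2) with decay constant a (the thesis does not spell out the exact functional form, so this is one plausible reading):

```python
import math

def blur_mask(size=20, decay=0.35):
    """A size x size blur kernel whose weights decay exponentially with
    squared distance from the center, normalized to sum to 1."""
    c = (size - 1) / 2.0
    w = [[math.exp(-decay * ((i - c) ** 2 + (j - c) ** 2))
          for j in range(size)] for i in range(size)]
    s = sum(sum(row) for row in w)
    return [[x / s for x in row] for row in w]

for d in (0.35, 0.035, 0.0035):              # the three sensitivity levels
    m = blur_mask(decay=d)
    assert abs(sum(sum(row) for row in m) - 1.0) < 1e-9
# An order of magnitude less decay spreads the mass out (a stronger blur),
# so the weight near the center shrinks.
assert blur_mask(decay=0.0035)[10][10] < blur_mask(decay=0.35)[10][10]
```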


CHAPTER 7
RESULTS

Tables 7-1 and 7-2 display the classification results of the trained detector schemes on the first and second groups of test data. The #ell. entry specifies the number of ellipsoids produced during the training of each detector scheme. Surprisingly, scheme 1 performs the best for both positive and negative test images in the group 1 data set, with a positive detection rate of 84.5% and a negative detection rate of 7.9%. Preprocessing scheme 1 executes neither brightness leveling nor histogram equalization, but performs only the horizontal differencing algorithm along with image segmentation and dimensionality reduction. The horizontal differencing of neighboring pixels in an input image in general produces better results than vertical differencing of pixel values. This observation seems reasonable considering the higher degree of vertical than horizontal symmetry in humans. The first detector scheme is used to analyze the test images of group 2, after the bootstrapping procedure explained in Chapter 6 is executed, because it performs the best out of all of the detector schemes on group 1 test data. Figure 7-1 displays the ability of detector scheme 1 to pick out people from selections of the group 2 database.
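The percentage columns of Tables 7-1 and 7-2 are simple ratios of detection counts; for example, scheme 1's group 1 figures:

```python
def det_rate(detects, total):
    """A detection rate as tabulated in Tables 7-1 and 7-2, in percent."""
    return round(100.0 * detects / total, 1)

assert det_rate(120, 142) == 84.5   # scheme 1, positive detection rate
assert det_rate(10, 127) == 7.9     # scheme 1, negative (false-positive) rate
```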


Table 7-1. Results of detector schemes 1-3 on test data

Scheme                  #ell.  P. detects  N. detects  P. det. rate  N. det. rate
Scheme 1                4
  Group 1 images               120/142     10/127      84.5%         7.9%
  Group 2 images               7/27        13/6448     25.9%         0.20%
  Gaussian blur 0.35           74/142                  52.1%
  Gaussian blur 0.035          70/142                  49.3%
  Gaussian blur 0.0035         9/142                   6.3%
Scheme 2                13
  Group 1 images               98/142      11/127      69.0%         8.7%
  Gaussian blur 0.35           66/142                  46.5%
  Gaussian blur 0.035          38/142                  26.8%
  Gaussian blur 0.0035         8/142                   5.6%
Scheme 3                3
  Group 1 images               109/142     21/127      76.8%         16.5%
  Gaussian blur 0.35           73/142                  51.4%
  Gaussian blur 0.035          61/142                  43.0%
  Gaussian blur 0.0035         15/142                  10.6%

Table 7-2. Results of detector schemes 4-6 on test data

Scheme                  #ell.  P. detects  N. detects  P. det. rate  N. det. rate
Scheme 4                19
  Group 1 images               75/142      10/127      52.8%         7.9%
  Gaussian blur 0.35           61/142                  43.0%
  Gaussian blur 0.035          30/142                  21.1%
  Gaussian blur 0.0035         3/142                   2.1%
Scheme 5                1
  Group 1 images               107/142     32/127      75.4%         25.2%
  Gaussian blur 0.35           73/142                  51.4%
  Gaussian blur 0.035          102/142                 71.8%
  Gaussian blur 0.0035         75/142                  52.0%
Scheme 6                2
  Group 1 images               71/142      13/127      50.0%         10.2%
  Gaussian blur 0.35           53/142                  37.3%
  Gaussian blur 0.035          84/142                  59.1%
  Gaussian blur 0.0035         66/142                  46.5%
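Transcribing the group 1 counts makes it easy to re-check the ranking programmatically. The combined "positives minus false detects" score below is our own illustration, not a metric used in the thesis:

```python
# Group 1 counts from Tables 7-1 and 7-2:
# scheme -> (positive detects out of 142, false detects out of 127)
GROUP1 = {1: (120, 10), 2: (98, 11), 3: (109, 21),
          4: (75, 10), 5: (107, 32), 6: (71, 13)}

# Illustrative ranking: reward positive detects, penalize false detects.
best = max(GROUP1, key=lambda s: GROUP1[s][0] - GROUP1[s][1])
assert best == 1                     # scheme 1 leads, as the text reports
```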


Figure 7-1. Some analyzed selections from the second group of test images. Boxes are drawn around positive detections by the algorithm.


CHAPTER 8
CONCLUSIONS

In this thesis we determine if people who fit a specific size profile and who pose in everyday situations may be used as a viable input class for a simple binary detector using methods adapted and simplified from several noted techniques. We maximize the detection results over all of the preprocessing techniques used during training by selecting image preparation algorithms that give the best results during testing of the detector schemes on one group of test data. Dimensionality reduction is an important aspect of classifier formation, and we choose a procedure which has a basis in wavelet coefficient formation. It is one of the low frequency representations of the input data. Unlike most wavelet techniques, which use many more than two levels of wavelet transforms, we use one low frequency transform and one high frequency transform to reduce the dimensionality of the input vectors and extract human features which exhibit good clustering behavior when they are depicted as feature vectors, respectively. We assume that the feature representations assemble into ellipsoidal shapes with varying major and minor axes lengths, and through contraction and dilation of the ellipsoidal distributions, a large majority of feature vectors representing negative input examples remain outside of the ellipsoidal boundaries. Bootstrapping is used to improve the performance of the final detector by allowing the retraining of the classifier scheme with images that previously displayed false positives and false negatives.

The results are encouraging because the detectors are trained with a small number of positive and negative examples, and the detection rates are comparable to current techniques. However, several variables involved in training the detector schemes were not taken into consideration, so it is unclear whether the results could


be improved by increasing the number of positive and negative training images. We assume that the order of the images presented to the trainer influences the ability of the detector to detect humans, because ellipsoids are created, contracted, and dilated in the order that the input images are introduced. We do not use the order of the training data as a factor in the training of the detectors, nor do we try different orderings of the examples to achieve higher detection rates. Also, based on the dilation and contraction methodology in the ellipsoidal algorithm, it would seem that introducing more negative points could throw out more positive examples not equal to the LDP. Different training methodologies would have to be examined to try to minimize this consequence. Our current detection rates reflect a training epoch which examines all of the negative examples between each positive example. Other types of training methodologies should be studied.

More work should be done providing a theoretical basis to the ideas presented in this thesis. Much work has been done by others to create a link between high dimensional vector spaces and low dimensional kernel classifiers. We believe that there is a connection between representing image data with second order manifolds and the principles related in kernel classification techniques; however, such concepts are left as future analysis on the topic.


REFERENCES

[1] M. Kositsky and S. Ullman, "Learning class regions by the union of ellipsoids," Proceedings of the 13th International Conference on Pattern Recognition, vol. 4, pp. 750-757, 1996.

[2] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio, "Pedestrian detection using wavelet templates," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 193-199, 1997.

[3] C. Curio, J. Edelbrunner, T. Kalinke, C. Tzomakas, and W. von Seelen, "Walking pedestrian recognition," IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 3, pp. 155-163, 2000.

[4] H. Nanda and L. Davis, "Probabilistic template based pedestrian detection in infrared videos," Proceedings of IEEE Intelligent Vehicle Symposium, pp. 504-515, 2002.

[5] A. Broggi, M. Bertozzi, A. Fascioli, and M. Sechi, "Shape-based pedestrian detection," Proceedings of IEEE Intelligent Vehicle Symposium, pp. 215-220, 2000.

[6] T. Poggio and K. Sung, "Finding human faces with a Gaussian mixture distribution-based face model," Proceedings of Second Asian Conference on Computer Vision, pp. 139-155, Dec. 1995.

[7] L. Zhao and C. Thorpe, "Stereo- and neural network-based pedestrian detection," IEEE Transactions on Intelligent Transportation Systems, vol. 1, pp. 148-154, Sept. 2000.

[8] H. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Transactions on PAMI, vol. 20, no. 1, pp. 23-28, 1998.

[9] H. Schneiderman and T. Kanade, "Probabilistic modeling of local appearance and spatial relationships for object recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 324-339, June 1998.

[10] P. S. Penev and L. Sirovich, "The global dimensionality of face space," Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 264-270, 2000.


BIOGRAPHICAL SKETCH

Jennifer Lea Laine was born in Vero Beach, Florida, in 1975. She received a Bachelor of Science degree with honors in electrical engineering at the University of Florida in the summer of 1998 and, later in 2000, a Bachelor of Science degree in mathematics. Besides working towards a Master of Science degree in electrical engineering, she is a member of the Machine Intelligence Laboratory in the Electrical and Computer Engineering Department and works part-time as a design engineer at Neurotronics in Gainesville, Florida.


Permanent Link: http://ufdc.ufl.edu/UFE0000727/00001

Material Information

Title: Computer controlled detection of people using adaptive ellipsoidal distributions
Physical Description: Mixed Material
Creator: Laine, Jennifer L. ( Author, Primary )
Publication Date: 2003
Copyright Date: 2003

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0000727:00001

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS ................... ...... iii

LIST OF TABLES ............................... vi

LIST OF FIGURES ..................... .......... vii

ABSTRACT ...................... ............ viii

CHAPTER

1 INTRODUCTION ........................... 1

1.1 Preface ............... ..... ........... 1
1.2 Current Classification Techniques and Problem Discussion .... 2
1.3 Thesis Organization ......... ................ 3

2 PROBLEM REVIEW .. ... .................... 5

2.1 Introduction ........... ................ 5
2.2 Wavelet Templates and Support Vector Machines ......... 5
2.3 Gaussian Distribution-Based Model and Neural Networks ..... 7
2.4 Stereovision and Neural Networks ...... ......... 9
2.5 Neural Network Overload ...... .......... . .. 10
2.6 Shape-Based Pedestrian Detection ..... . . 13

3 IMAGE SEGMENTATION AND DIMENSIONALITY REDUCTION. 15

3.1 Introduction ............... ........... .. 15
3.2 Partitioning the Image .................. ... .. 16
3.3 Dimensionality Reduction ................... . 18
3.4 Published Methods .................. ..... .. 19

4 PREPROCESSING OF AN IMAGE SEGMENT . . 21

4.1 Introduction .................. ........... .. 21
4.2 Brightness Equalization . . ......... .. 21
4.3 Histogram Equalization and Contrast Stretching . ... 23
4.4 Horizontal and Vertical Intensity Differencing . .... 23









5 CLASSIFIER ALGORITHM ................... .... 27

5.1 Background ................... ....... 27
5.2 Linear Transform for Spheroids .................. .. 28
5.3 Ellipsoidal Transform .................. ..... .. 30
5.4 Classifier Rules .................. ......... .. 33

6 EXPERIMENTS .................. ............ .. 38

6.1 Training the Classifier .................. .... .. 38
6.2 Preprocessing Schemes .................. ... .. 38
6.3 Testing Phase ............... ........ ..39
6.4 Sensitivity Analysis ............... .... .. 41

7 RESULTS .... ........................ ..... 43

8 CONCLUSIONS ............... ........... ..46

REFERENCES ............... ................... 48

BIOGRAPHICAL SKETCH .... .......... ......... .. 49















LIST OF TABLES

Table page

7-1 Results of detector schemes 1-3 on test data . . ..... 44

7-2 Results of detector schemes 4-6 on test data . . ..... 44















LIST OF FIGURES

Figure page

3-1 Clarification of the placement of staggered subimages in the input
image ......................................... 17

3-2 Arrangement of half-blocks within an image block . . ... 18

3-3 Steps for segmenting an example image of a person . ... 19

4-1 Effects of brightness equalization on intensity uniformity ...... ..22

4-2 Contrast stretching and histogram equalization filters applied to a
sample image ............... ........... .. 24

4-3 Contrast stretching and histogram equalization filters applied to an
image of a pedestrian ........... . . .... 25

4-4 Horizontal and vertical differencing techniques applied to an image of
a pedestrian ............... ............ .. 26

5-1 Demonstration of ellipsoidal dilation and contraction . ... 28

5-2 Linear transformation of a spheroid .... . ... 30

5-3 Mapping of an ellipsoid to a unit spheroid knowing the dilation axis .32

5-4 Mapping of an ellipsoid to a unit spheroid by calculating the dilation
axis.................... ........ .... .......... 35

5-5 Input point not within the hyperplanes of an ellipsoid . ... 36

6-1 Examples of positive and negative training images respectively . 39

6-2 Examples of positive images from the first test group . ... 40

6-3 Examples of negative images from the first test group . ... 40

6-4 Selections of images from the second test group . ..... 42

6-5 Different blurring levels of a test image ................ ..42

7-1 Some analyzed selections from the second group of test images . 45















Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master Of Science

COMPUTER CONTROLLED DETECTION OF PEOPLE USING ADAPTIVE
ELLIPSOIDAL DISTRIBUTIONS

By

Jennifer L. Laine

May 2003

Chair: Michael C. Nechyba
Major Department: Electrical and Computer Engineering

We present a software approach towards real-time detection of human beings

in camera images or video. Due to the large variations in shape, texture, and

orientation present in people over time and over samples, we use a statistical

procedure which contours ellipsoidal distributions around positive data examples

while avoiding negative samples during the training of the detector. The data

points around which the statistical approach models its distributions are feature

vectors extracted from the image pixel values through a process which normalizes,

filters, and dimensionally reduces them. We test the effectiveness of several popular

image processing techniques to determine which ones contribute the best detection

rates, and use them in the final detector. Finally, we test the model on real-world

test images and discuss the results.















CHAPTER 1
INTRODUCTION

1.1 Preface

Computer controlled detection of human beings is a well established artificial

vision problem. The challenges associated with it are daunting because people in

every-day situations are not structurally uniform, and they exhibit many vastly

different poses and orientations. Here we present an algorithm trained to detect

human beings in still images or video. A technique for learning class regions in

R2 by the autonomous generation of ellipsoidal distributions to enclose positive

target objects, developed by Kositsky and Ullman [1], is extended to accept a

higher dimensional input space. Evidence of the possibility of such an extension

is mentioned by the authors, but the theory is never tested. Also, several data

preparation methods are experimentally compared so that a maximum positive

detection rate and minimum false positive detection rate are achieved across the

methods.

Structural cues exhibited by people in multiple poses participate in the forma-

tion of many distinct ellipsoidal distributions signifying human existence. We make

the assumption that certain relative phenomena within two dimensional images of

people may be represented as points within independent ellipsoidal contours based

on preprocessing techniques used in the paper by Oren et al. [2]. Each contour is

defined by a high order transfer matrix coupled with a center point, and an object's

exclusion from the positive class set is determined by the associated data point's

exclusion from all of the ellipsoidal contours. Our main objectives in introducing

this new method are acquiring detection results similar to previous techniques

while maintaining a degree of simplicity and elegance by keeping the number of









algorithmic steps small, using only image preparation methods that are experimen-

tally determined to improve detection, and manipulating data clusters which are

quadratic in shape rather than Gaussian.

1.2 Current Classification Techniques and Problem Discussion

Most detection schemes attempt to minimize an error metric between results

generated by the classifier acting upon dimensionally reduced data and the

associated true outcome. Neural networks adjust weights to minimize the error

between the training network and the true result. Kernel classifiers such as Radial

Basis Functions (RBF) and Support Vector Machines (SVM) minimize a bound

on the actual risk or error created by applying a particular set of functions to

massaged data points. Model-based and example-based detectors use positive and

negative examples of the desired class to build a statistical or parametric criterion

by which test material is measured.

Unfortunately, human beings are not structurally consistent over time and

across samples. When a person is moving, he assumes a cadence which may be

measured and used for detection. However, this technique requires the analysis

of video frames and an extensive numerical history of some kind [3]. Excluding a

regard to motion, sporadic display of vertical symmetry and wide color and texture

variations across many samples of people make it difficult to encapsulate a single

input example into a repeatable pattern of which a specific detection technique

could take advantage. Hence current methods tend to limit themselves to a test

class with a discrete number of positions and tight size restrictions. Specialized

hardware may be used to ease data dissection. An infrared imager creates a heat

map of the target and allows the data signature of humans to be more distinctive

than that produced by a digital camera [4]. Two camera systems have been

developed that produce a disparity map of the environment so that the background

may be eliminated from the images and only foreground objects are considered [5].









Hardware solutions are useful but are also expensive. Training data which are

needed to make adjustments to the detector weights or conditions are usually

manually manipulated to remove other unconstrained objects and the background

which is typically not homogeneous. The extensive preparation time required to

train a detector warrants simplification of the training technique and perhaps a

more interactive approach.

We attempt to address all of these issues by first performing only necessary

preprocessing procedures to the input data that are experimentally determined to

improve human detection. We do very little preparation to the training images and

automate the bootstrapping process during detector training so that false positives

are efficiently reduced. Lastly, we choose a distribution type that is amenable to

compact representation and has malleable attributes.

1.3 Thesis Organization

Chapter 2 outlines techniques previously employed for the detection of human

beings and lists their strengths and weaknesses. Chapter 3 describes how an input

image is segmented into a set of high dimensional data points acceptable to the

detector. Chapter 4 defines many preprocessing schemes handled by current object

detection methods and reports how they affect the preprocessing of our input

data. Chapter 5 gives the theoretical basis behind the proposed detector algorithm

and describes each step in the training process. We compile these preprocessing

techniques and associate each approach with an independent detector scheme.

Each scheme is functionally composed of a specific set of preprocessing techniques

and the universal trainer or detector. The trainer creates quadratic distributions

based on the clustering behavior of the preprocessed image data, and the detector

tests the inclusion of a new image within these distributions. Chapter 6 relates

the experiments performed in the detection process by detailing the types of

preprocessing methods contained in each detector scheme. The same database of








positive and negative training images is used to train each detector scheme, and

Chapter 7 displays the experimental results found when each scheme examines an

image database different from the training one. Chapter 8 evaluates the presented

approach and describes future work which may be done in this area.















CHAPTER 2
PROBLEM REVIEW

2.1 Introduction

Here we present a review of some of the current techniques employed to solve

the problem of people detection. They are explained here to provide a basis for our

approach. The technique of Oren et al. [2] reduces the problem space into a set of

meaningful frequency components. With these components, the authors create a

parametric pedestrian model by minimizing a risk function. The method of Poggio

and Sung [6] feeds distance metrics related to the Gaussian distributions of face

image training data into a neural network. The method of Zhao and Thorpe [7]

elicits the help of stereo vision to separate the background from foreground objects

and inputs the gradient of potential people in the foreground into a neural network.

The method of Rowley et al. [8] introduces carefully chosen parts of face images to

a sequence of arbitrating neural networks. The technique of Broggi et al. [5] comes

up with systematic morphological rules that are applied alongside stereo vision

towards the detection of people in images.

2.2 Wavelet Templates and Support Vector Machines

The technique of Oren et al. [2] focuses on the differences in intensity between

pedestrians and the background, or the relative intensities and position of pedes-

trian boundaries coupled with homogeneous interiors. The authors concentrate on

structural cues because a pedestrian's colors are not constrained, and the colors

and textures of the background are not consistent. A wavelet transform approxi-

mates non-stationary signals with sharp discontinuities at varying scales. Hence,

the structure of a person lends itself to the use of wavelet coefficients to differen-

tiate people from non-people, and in this application, a wavelet transform is used









as a human edge detection algorithm. A redundant set of Haar basis functions is

used to completely capture the relationships of the average pixel intensities between

neighboring regions of an image. They apply the Discrete Wavelet Transform

(DWT) along three orientations to generate wavelet coefficients at two different

scales. Coefficients are produced for vertical, horizontal, and diagonal passes of the

transform, and both 32x32 and 16x16 pixel block scales are tested. In the 32x32

scale, one coefficient represents the energy of the signal localized both in time and

frequency within the corresponding 32x32 block. A similar method is used for the

16x16 scale. Coefficients generated from a training database are compiled into a

template. The wavelet coefficients are calculated for each color channel (RGB) and

for each orientation in an image. The largest absolute value over all of the color

channels becomes the corresponding coefficient for that orientation in the image.

The coefficients for each orientation are normalized separately over all of the coef-

ficients in the image and averaged over all of the images in the pedestrian image

database. The resulting array of coefficients is the pedestrian template for each

orientation. The pedestrian training image database consists of 564 color images

of pedestrians in frontal and rear positions within a 128x64 pixel frame. A non-

pedestrian template is also created using 597 color images of natural scenes within

a 128x64 frame. By visual inspection, the authors select significant coefficients

from the template which pinpoint areas in the image important for pedestrian

classification. Consequently, 29 coefficients are used to form a feature vector for the

classification effort.
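The idea of differencing average intensities between neighboring regions can be sketched as follows. This is an illustration of the principle only, not the authors' exact DWT implementation; the function name, block size, and step are our assumptions.

```python
import numpy as np

def haar_coefficients(img, block=32, step=16):
    """Haar-like responses at one scale: for each (overlapping) block,
    difference the mean intensities of neighboring halves to obtain
    vertical, horizontal, and diagonal coefficients."""
    h, w = img.shape
    coeffs = {"vertical": [], "horizontal": [], "diagonal": []}
    for y in range(0, h - block + 1, step):
        for x in range(0, w - block + 1, step):
            b = img[y:y + block, x:x + block].astype(float)
            half = block // 2
            left, right = b[:, :half].mean(), b[:, half:].mean()
            top, bottom = b[:half, :].mean(), b[half:, :].mean()
            d1 = b[:half, :half].mean() + b[half:, half:].mean()
            d2 = b[:half, half:].mean() + b[half:, :half].mean()
            coeffs["vertical"].append(right - left)    # vertical edge response
            coeffs["horizontal"].append(bottom - top)  # horizontal edge response
            coeffs["diagonal"].append((d1 - d2) / 2)   # diagonal response
    return {k: np.array(v) for k, v in coeffs.items()}
```

A vertical boundary such as the side of a pedestrian produces large vertical responses and near-zero horizontal ones, which is exactly the structural cue the template exploits.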

During detection, the system moves a 128x64 window throughout the entire

space of the input image. Bootstrapping aids the problem of the overwhelming

negative class space. False positive detections during testing are grouped into

the negative training image database, and the system is retrained. The system

is not adaptable since the whole training image database must be submitted









to the algorithm instead of just the new images when retraining is necessary.

Two methods of classification are used. The first is a simple technique called

basic template matching where the ratio of feature vector values in agreement

is calculated for each new input image. The second method utilizes the support

vector machine. After three bootstrapping sessions, the system trains from 4,597

negative images. From the 141 high quality pedestrian test images, the classifier

exhibits a detection rate of 52.7% using basic template matching with 1 false

positive per 5,000 windows examined. With the support vector classifier, the

system has a detection rate of 69.7% and one false positive for every 15,000

windows examined.

2.3 Gaussian Distribution-Based Model and Neural Networks

The approach of Poggio and Sung [6] detects unoccluded vertical frontal

views of faces by fitting hand-tuned Gaussian distributions upon example data.

They formulate a model of faces and a model of non-faces. Each training image

contains a single face which fits inside a 19x19 pixel mask. A feature vector for

the classification distribution is defined by the absolute pixel intensities of the

unmasked pixels. Hence, each input image translates into a vector in R283 space.

An elliptical k-means clustering algorithm groups 4,150 examples of positive data

and 6,189 examples of negative data into a predefined number of clusters, and

twelve Gaussian boundaries are placed upon the groupings. Six Gaussians, each

with a centroid and covariance matrix, are placed upon the positive points, and

six are positioned upon the corresponding negative data. Fitting the positive

sample space with one Gaussian distribution is not sufficient because there is too

much overlap between the positive Gaussian and non-face example feature vectors.

In some cases, non-face patterns lie closer to the positive Gaussian distribution

centroid than a true face arrangement. The relationship between incoming feature

vectors and the existing face model is encoded into a 2-value distance metric.









The first distance is called the Mahalanobis distance. It represents a separation

in units of standard deviations of the input point from the cluster distribution.

The 75 largest eigenvectors of the Gaussian are used as the discriminating vector

space to reduce the chance of overfitting the metric. The second distance is a

generic Euclidean distance. It measures the unbiased separation between the input

point and the cluster mean within the subspace spanned by the same 75 largest

eigenvectors. A multi-layer perceptron (MLP) is then trained with

the 12 pairs of distances as the inputs and one binary output.
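The 2-value distance metric can be sketched as follows; the eigendecomposition, subspace size, and names are our illustration of the description above, not Poggio and Sung's code, and the covariance is assumed positive definite.

```python
import numpy as np

def two_value_distance(x, centroid, cov, k=75):
    """Compute the pair (Mahalanobis, Euclidean) distance of input x
    from one cluster, both restricted to the subspace spanned by the
    k eigenvectors of the covariance with the largest eigenvalues."""
    vals, vecs = np.linalg.eigh(cov)          # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:k]          # keep the k largest
    vals, vecs = vals[idx], vecs[:, idx]
    z = vecs.T @ (x - centroid)               # project the offset
    d_mahalanobis = np.sqrt(np.sum(z ** 2 / vals))  # in standard deviations
    d_euclidean = np.sqrt(np.sum(z ** 2))           # unbiased separation
    return d_mahalanobis, d_euclidean
```

With twelve clusters, stacking the twelve pairs yields the 24-value input vector fed to the MLP.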

They train the network using the same 4,150 positive images used to create

the positive Gaussian clusters and 43,166 negative images which include the

6,189 patterns used to create the negative Gaussian clusters. The rest of the

negative training images are selected with a bootstrapping methodology. Images

generating a false positive detection by the neural network are added to the

negative image collection for further training. Representing a distribution with

a centroid and covariance matrix is difficult for high dimensional vector spaces

because the number of free parameters in the model is directly proportional to

the square of the number of dimensions, and the parameters must be recovered

from training data for the detector to be robust. One may reduce the number of

model parameters by focusing on the "significant" eigenvectors in the covariance

matrix. The significant eigenvectors in the Gaussians correlate to the prominent

pixel features in a face image. This information is encoded in the Mahalanobis

distance. The less prominent pixel features are encoded in the less significant

eigenvectors. The Euclidean distance is supposed to account for the less salient

facial characterizations. The search for faces in an image is done over all image

locations and at a single scale. Their system has a 96.3% detection rate on a

test database of 301 CCD images of people with 3 false positives. On a more









challenging database of 23 cluttered images with 149 face patterns, their system

has a detection rate of 79.9% with 5 false positives.

2.4 Stereovision and Neural Networks

The technique of Zhao and Thorpe [7] is a real-time pedestrian detector using

two moving cameras and specialized segmentation software. The real-time stereo

system Small Vision System (SVS) developed by SRI constructs a disparity map of

the input image based on color and spatial cues so that objects in the foreground of

the image may be distinguished from those in the background. Hence, the segmen-

tation software is not influenced by drastic lighting changes, object occlusion, or

color variation. A neural network trained with back propagation is fed the intensity

gradient of the resulting foreground partition. Since the background is removed

from the training and testing images by stereo image analysis and the neural net-

work learns from examples, their method requires no a priori model or background

image. From the disparity map of an input image, they use thresholding to remove

background objects. They then smoothly group together objects of similar dispar-

ity, and rule out groupings that are too small or too big to be pedestrians. Small

pixel blobs that are near each other with close disparity values are integrated into

one big blob. Large regions undergo a verification process where subregions are

analyzed for the presence of pedestrians and then split apart if multiple positive

detections exist. Pedestrians have a high degree of variability in texture and color,

so absolute pixel intensities are not used as input information for the detector.

Instead, they use the intensity gradient of the pixel groupings in the foreground

still found to be potential people as the input vectors to the neural network. These

effects of the preprocessing phase are constrained to a 30x65 window, and the

region values are linearly normalized to numbers between 0 and 1.
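This preprocessing pipeline can be sketched as below: keep pixels whose stereo disparity marks them as foreground, take the intensity gradient, and linearly normalize it to [0, 1]. The disparity threshold and function names are our assumptions, and the resize to the 30x65 window is omitted.

```python
import numpy as np

def foreground_gradient(img, disparity, d_min):
    """Suppress background pixels via a disparity threshold, then
    return the linearly normalized gradient magnitude."""
    fg = np.where(disparity >= d_min, img.astype(float), 0.0)
    gy, gx = np.gradient(fg)                 # intensity gradient
    mag = np.hypot(gx, gy)                   # gradient magnitude
    lo, hi = mag.min(), mag.max()
    if hi > lo:                              # linear normalization to [0, 1]
        mag = (mag - lo) / (hi - lo)
    return mag
```

Because only gradients of foreground regions survive, the network input is largely insensitive to background clutter and absolute brightness.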

A three layer feed forward network is trained with 1,012 positive images of

pedestrians and 4,306 negative images. Bootstrapping is used to improve system









performance. The network weights are initialized to small random numbers before

training, and detection is finalized by thresholding the output of the trained

network. The system is tested with 8,400 images of pedestrians and other objects

in cluttered city scenes. They achieve a detection rate of 85.2% and a false positive

rate of 3.1%. The system performs segmentation and detection on two 320x240

images at a framerate ranging from 3 frames/second to 12 frames/second. The

system fails when objects that are structurally similar to humans are presented

and when occlusion is extreme or the color of the person is similar to that of the

background.

2.5 Neural Network Overload

The method of Rowley et al. [8] uses a neural network to detect upright,

frontal faces in greyscale images. The training images are specially customized

for the algorithm. The eyes, tip of the nose, and corner and center of the mouth

for each training image face are labeled manually so that they can be normalized

to the same scale, orientation, and position. Bootstrapping is used to solve the

problem of finding representative images for the non-face category. False positive

images are added to the training set during successive phases of training and

testing. Bootstrapping negative images reduces the number of images needed

in the training set. A neural network is applied to every 20x20 pixel block in

an image, and detection of faces at different scales is achieved by applying the

filter to an input image that is subsampled. A preprocessing step is performed

on the input image before it is passed through the neural network. The first step

in the preprocessing phase equalizes the brightness in an oval region inside the

20x20 pixel block, and the second step performs histogram equalization within the

resulting oval. The network has retinal connections to its input layer. Four hidden

units look at 10x10 pixel subblocks, sixteen look at 5x5 pixel subblocks, and six

look at 20x5 pixel stripes. These regions are specifically hand-chosen so that the









hidden units learn features unique to faces. The stripes identify mouths or eyes,

and the square regions see a nose, individual eyes, or the corner of a mouth. The

network has a single output signifying the presence or absence of a face. In all, 1,050

images of faces of varying size, position, orientation, and brightness are gathered

for training and manually massaged into images uniform over all training data by

creating a mapping of specific pixel locations to face features. The mapping itself

scales, rotates, and translates the input image by a least squares algorithm that is

run to convergence for each image. Once a uniform image is made, variants of the

image are created by rotating, scaling, and translating the model.

Non-face images are generated randomly and the non-face training database

is formed by a bootstrapping technique. The network is trained using standard

error backpropagation with momentum and initial random weights. Resultant

weights from a previous training iteration are used in the next iteration. Random

images that generate false positive detections are added to the database for

further training. Generation of random data forces the network to set a precise

boundary between faces and non-faces. Two heuristics are introduced to reduce

the number of false positives in the initial neural network. Since the network

is somewhat invariant to the position of the face up to a few pixels, multiple

detections within a specified neighborhood of position and scale are thresholded.

The pixel neighborhood and the number of detections found in the neighborhood

are the two parameters used. A number of detections greater than the threshold

implies a positive detection, and the centroid of the neighborhood is scrutinized

again for the presence of a face. A number of detections fewer than the threshold

implies a no-detect. The second heuristic involves result arbitration from multiple

neural networks. Each neural network is trained with the same positive image

database, but because the set of negative images is randomly chosen from the

bootstrap images, the order of presentation and the negative examples themselves









differ. Also, the initial weights may differ because they are generated randomly.

They try ANDing and ORing the results of two similarly trained networks. Three

networks voting a result is also tried. Lastly, they train a neural network to govern

the decisions of the arbitrating neural networks to see if such a scheme yields better

results than simple boolean functions. Sensitivity analysis is performed on all the

networks to determine which features in the face more greatly influence detection.

It turns out that the detectors rely heavily on the eyes, then the nose, and then the

mouth.
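The first heuristic, thresholding multiple detections within a neighborhood, can be sketched as follows; the radius, count threshold, and names are illustrative values, not those of Rowley et al.

```python
import numpy as np

def collapse_detections(dets, radius=5, min_count=3):
    """Confirm a detection only when at least min_count raw hits fall
    within `radius` pixels of one another; each confirmed group is
    replaced by its centroid for re-examination."""
    dets = np.asarray(dets, dtype=float)
    confirmed = []
    used = np.zeros(len(dets), dtype=bool)
    for i, d in enumerate(dets):
        if used[i]:
            continue
        near = np.linalg.norm(dets - d, axis=1) <= radius
        if near.sum() >= min_count:          # enough hits in the neighborhood
            confirmed.append(dets[near].mean(axis=0))
            used |= near
    return confirmed
```

An isolated raw hit never reaches the count threshold, which is how this heuristic suppresses scattered false positives.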

Many different networks are tested with two large data sets containing images

different from the training images. 130 of the images are collected at CMU and

consist of multiple people in front of cluttered backgrounds. The second set is a

subset of the FERET database, and each image in the second set consists of only

one face, has uniform background, and good lighting. The detection rate of all tried

systems on the first data set ranges from 77.9% to 90.3%. ORing the arbitration

networks yields the best detection rate but also contributes the most false positives.

In the second set of data images, detection success ranges from 97.8% to 100.0% for

frontal faces and faces turned less than 15 degrees from the camera. A detection

rate range of 91.5% to 97.4% is achieved on faces turned 22.5 degrees from the

camera. They determine that the system with two ANDed arbitrating networks

produces the best tradeoff between detection rate and false positives. It has a

detection rate of 86.2% with a false detect rate of 1 per 3,613,009 test windows on

the first test set. On the second test set, it has an average detection rate of 98.1%

on the faces at all orientations. The best system takes 383 seconds to process

a 320x240 pixel image on a 200MHz R4400 SGI Indigo 2. After modifying the

system to allow bigger search windows in steps of 10 pixels, the processing time is

reduced to 7.2 seconds, but with the side effect of having more false detects and a

lower detection rate.









2.6 Shape-Based Pedestrian Detection

The procedure of Broggi et al. [5] presents a model-based method to detect

pedestrians from a moving vehicle with two cameras. The core technique is a

model-based approach which focuses on the vertical symmetry and the presence

of texture in humans. It checks for human morphological characteristics by

incorporating rules into a pixel-level analysis. However, other approaches are used

to refine the results. Analysis of stereo disparities in the images provides distance

information and gives an indication of the bottom boundary of the pedestrian.

Also, an image history is kept to further filter the morphological results. Their

system is an additional feature of the ARGO Project, an autopilot mechanism

for a vehicle. A greyscale input image is downsampled to a 256x288 pixel block,

and a localized region of highest probable pedestrian existence is transformed

by a Sobel operator to extract the magnitude and orientation of the edges in

the image. Binary edge maps are created of vertical and horizontal edges, and

background edges are eliminated from the maps by subtraction of the thresholded

and shifted stereo images. They run the resulting binary maps through a filter

that concatenates densely packed objects and removes small sparse blobs. A

vertical symmetry map is created from the filtered vertical edges map by scanning

the image horizontally for vertical symmetries. Humans have a high degree of

vertical symmetry but much less of an instance of horizontal symmetry. Under

this assumption, non-human objects are ruled out by analyzing the horizontal

edges map for horizontal symmetries. A linear density map of the horizontal edge

pixels is superpositioned with the vertical symmetry map with experimentally

determined coefficients to create a probability map of human presence. This

technique eliminates objects having both strong vertical and horizontal symmetry.

The probability map is bolstered by considering a history of images and image

entropy since objects that are uniform typically are not human. The widths of









the remaining objects are determined by counting the number of pixels in each

vertical edge about the symmetry axis. They choose the boundary to be the

column with the highest pixel count on each side. The Sobel map is scanned for

a head matching one of a set of predefined binary models of different sizes. The

model is constructed by hand-combining features that sampled human heads have

in common. The bottom boundary is determined by finding an open row of pixels

in the vertical edge map in the left and right boundaries of the body. Distance

information is calculated based on a combination of prior camera calibration

knowledge and the position of the bottom boundary. The bottom boundary is

then refined by comparing the calculated distance to the distance determined by

comparing the position of the pedestrian in the two stereo images. More rules

are checked as the final bounding box is fit for size constraints, aspect ratio, and

combined distance and size restrictions. Bounding box construction is sometimes

not very accurate concerning the head's position and the detection of lateral

borders, and no detection results are presented.
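The vertical-symmetry scan at the heart of this method can be sketched as below: for each candidate axis column, mirror-match the binary vertical-edge map within a window. The window width and the Dice-style overlap score are our assumptions; Broggi et al. do not publish this exact formulation.

```python
import numpy as np

def vertical_symmetry(edge_map, half_width=8):
    """Score each candidate axis column by how well the binary
    vertical-edge map mirrors about it within half_width columns."""
    h, w = edge_map.shape
    score = np.zeros(w)
    for c in range(half_width, w - half_width):
        left = edge_map[:, c - half_width:c]
        right = edge_map[:, c + 1:c + 1 + half_width][:, ::-1]  # mirrored
        total = left.sum() + right.sum()
        if total:
            # Dice-style overlap of mirrored edge pixels
            score[c] = 2.0 * np.minimum(left, right).sum() / total
    return score
```

A standing person's two roughly parallel body contours produce a sharp peak at the body's center column, while asymmetric background objects score low.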















CHAPTER 3
IMAGE SEGMENTATION AND DIMENSIONALITY REDUCTION

3.1 Introduction

The segmentation part of our approach effectively transforms a 220x220

input image into a set of overlapping 109x28 rectangles and then further filters

each rectangle into a point in R144 space. This low-pass filter averages the pixel

brightnesses of overlapping regions in each rectangle and then uses the resulting

values as coordinates in a feature vector in R144. Reasons for using this partic-

ular method are given below. Typically, in example-based learning schemes, the

detector is trained with templates of the target class. Training such a detector

involves learning relationships between features in the template. Efficient systems

reduce the dimensionality of the features to a smaller number which still retains

the integrity of the original pattern. In addition, increasing the number of search

windows increases the effectiveness of the analysis because example-based classifiers

rely on position and rotation invariance when transferring knowledge from training

examples to the test cases. More windows mean less error margin in the object's

position and orientation within a detection frame [9]. In order for the system to

detect people at different scales, two options exist. Either the input image may be

downsampled to reduce the size of the features or larger windows may be intro-

duced [7, 8]. Therefore, scanning an image for features tends to be computationally

expensive, and, in many cases, is the bottleneck of any learning-based classification

scheme. Current methods that do not use training examples rely on a priori models

as the reference data. Their high detection speed is offset by the complexity

of their inference rules coupled with their inflexibility. We choose the simplicity









of a model with no a priori assumptions, without having to scan every single rectangle in the input

image because we average overlapped pixel regions.

Dimensionality reduction can be achieved by paring down the effective

components of the feature space. It reduces computation and prevents the system

from over-fitting the decision surface of the training data. Using an algorithm

that retains all of the bases for the feature space spends too much time computing

details that are not unique to the desired object class. Principal Component

Analysis (PCA) is typically used to reconstruct a decision space with a subset

of its eigenvectors. For a given subspace dimension, the subset of eigenvectors

with the largest eigenvalues spans the subspace that minimizes the reconstruction

error with respect to the original signal. However, the number of eigenvectors necessary to

successfully reconstruct a desired object space depends highly on the number of

training data and the number of pixels in each image [10]. We use a simplified

aspect of the technique of Oren et al. [2] to dimensionally reduce the data set. The

filter we employ is a 7x7 mean filter applied every 4 pixels. It is a low-frequency

representation of an image rectangle.
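This filter can be sketched as follows, written for clarity rather than speed. A uniform 4-pixel step is assumed here; the exact staggered half-block offsets described in Section 3.2 differ slightly and produce the 24x6 grid.

```python
import numpy as np

def mean_filter_reduce(strip, size=7, step=4):
    """Low-pass dimensionality reduction: a size x size mean filter
    applied every `step` pixels across a greyscale strip."""
    h, w = strip.shape
    rows = range(0, h - size + 1, step)
    cols = range(0, w - size + 1, step)
    out = np.empty((len(rows), len(cols)))
    for i, y in enumerate(rows):
        for j, x in enumerate(cols):
            # each output value is the mean brightness of one window
            out[i, j] = strip[y:y + size, x:x + size].mean()
    return out
```

Each output value plays the role of one collective half-block coordinate in the reduced feature vector.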

3.2 Partitioning the Image

Since our experiments focus on the success of the ellipsoidal distribution

algorithm instead of the viability of the system to multiple scales, we regard

only one scale of pedestrian. However, the method may easily be applied to

people of bigger or smaller size. The system begins the testing phase of people

classification by analyzing a 220x220 greyscale image. The image is divided into

two nonoverlapping 109x220 rows and two similarly nonoverlapping 220x109

columns. Within each row, 13 equally spaced 109x109 subimages are selected

in a staggered arrangement such that each subimage shares 100 pixels of the

previous subimage. Previous techniques motivate the overlapping of template

windows [2, 6, 8]. Further segmentation and dimensionality reduction of the space









makes the analysis of every possible search window unnecessary. Each column

has 13 similarly sized and positioned subimages, and duplicates arise on the four

corners. There are a total of 52 subimages in the whole image. The placement

of staggered subimages in a 220x220 input image is shown in Figure 3-1. The


Figure 3-1. Clarification of the placement of staggered subimages in the input im-
age. Each arrow contours a nonoverlapping 109x220 row or column,
and the staggered 109x109 subimages lie along the arrows within each
row and column


image is further divided into 10x10 pixel blocks. They each overlap by one pixel

in order to completely fill the subimages, and there are 144 blocks per subimage.

Each block is made up of 4 overlapping 7x7 half-blocks, and the pixel depth of

each overlapping half-block is 4. Figure 3-2 demonstrates the arrangement of

half-blocks within an image block. There are 576 half-blocks per subimage. The

mean of the pixel intensities within each half-block is assigned to the entire half-

block. In essence, the half-blocks are discretized into pixel-like collectives by a 7

pixel wide low-pass filter. A priori information about the aspect ratio of a human

form finalizes the size of the search window. A human's height is approximately

four times larger than the width in most positions. On account of this, we take

each subimage (109x109 pixels or 24x24 collective half-blocks) and split it into

4 vertical strips of 109x28 pixels or 24x6 collective half-blocks. The resulting

windows tightly encompass pedestrians having a scale of 109x28 pixels in most




Figure 3-2. Arrangement of half-blocks within an image block. The segmentation
algorithm partitions the image into 4 overlapping half-blocks within
each block. Each square within the grid represents a pixel in the im-
age, and the thick lines represent block boundaries

poses. Figure 3-3 shows an example image of a pedestrian and the result of each
segmentation step.
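The window layout above can be sketched directly from the stated geometry: 109x109 subimages staggered 9 pixels apart along each row and column, and four 109x28 vertical strips per subimage whose neighbors overlap by one pixel. The function names are ours.

```python
def subimage_origins(img_size=220, win=109, overlap=100):
    """Top-left offsets of the equally spaced subimages along one row
    or column; consecutive windows share `overlap` pixels."""
    step = win - overlap                     # 9-pixel stagger
    n = (img_size - win) // step + 1         # 13 subimages per row/column
    return [i * step for i in range(n)]

def strip_origins(win=109, width=28, n=4):
    """Column offsets of the 4 vertical 109x28 strips inside a
    109x109 subimage; neighboring strips overlap by one pixel."""
    step = (win - width) // (n - 1)          # 27-pixel stagger
    return [i * step for i in range(n)]
```

Two rows plus two columns of 13 subimages, minus the four duplicated corners, give the 52 subimages per input image.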
3.3 Dimensionality Reduction
The overlapping half-blocks become distinct collectives, yet they share infor-
mation with one another because each collective half-block intensity value encodes
neighboring half-block brightnesses. This has the effect of spreading the
intensities across many pixels. Residual feature information may be transferred
to areas of the image as far as 20 pixels away in both the x and the y directions.
Changing the object components from pixels to collective half-blocks effectively
reduces the dimensionality of the feature space from 3,080 down to 144, a 95%
reduction. Pixel sharing between neighboring half-blocks within each block gives a


























Figure 3-3. Steps for segmenting an example image of a person. A) Undoctored
220x220 input image. B) Magnified view of a subimage containing the
person consisting of 24x24 collective half-blocks. C) Magnified view of
vertical strip containing person consisting of 24x6 collective half-blocks

4 pixel maximum variance in the horizontal and vertical directions. An object may

shift by as much as four pixels vertically or horizontally, or rotate by a few

degrees, and still maintain a "presence" within the same collective half-blocks.

3.4 Published Methods

We again review the people detection techniques introduced in Chapter 2

but focus only on the segmentation and dimensionality reduction approaches of

each method to present points of comparison with our method. Rowley et al. [8]

look at every 20x20 pixel block in a test image initially, and they later change the

detect windows to 30x30 pixel blocks in steps of 10 pixels to reduce computation

time of an image from 383 seconds to 7.2 seconds. Upon finding a positive result

in a 30x30 block, they more closely scrutinize the area with their standard 20x20

detector. The neural networks they use discern features within the face, and

their detection windows must be kept small. The numbers of hidden units are

experimentally determined, but there is a fine line between the number of hidden

units required to determine an underlying trend in a decision space and the

number of hidden units that will fit the intricate details of the training data but








not extract the fundamental pattern. Poggio and Sung [6] look at every 19x19

subregion location in the primary image during testing. They retain the full

dimensionality of the 19x19 space (283 dimensions) and fit 6 separate Gaussian

distributions upon the training data in 283-space using an elliptical k-means

algorithm. Oren et al. [2] move a 128x64 window throughout all the positions in

a test image. They subjectively choose 29 "significant" wavelet coefficients which

indicate regions of "high intensity-change" or regions of "no intensity-change" in the

learned wavelet template. These 29 coefficients form the feature vector responsible

for classification of people.















CHAPTER 4
PREPROCESSING OF AN IMAGE SEGMENT

4.1 Introduction

Input image preprocessing is the transformation of cryptic data into infor-

mation that is amenable to a training or testing algorithm. Whichever technique

is used must enhance the qualities unique to the target class while deemphasiz-

ing externally motivated transients. Most current classification schemes employ

preprocessing techniques, but to different extents. We apply basic image filtering

to achieve results comparable to that of other techniques. Our method finds the

algorithms that increase the rate of detection of people and decrease the num-

ber of false positives. Hence, we compare test results from several preprocessing

schemes instead of subjectively choosing the final method of image preparation.

The techniques applied in previous classification methods prompt those used here:

brightness equalization [8], histogram equalization [8], contrast stretching [8, 2, 7],

horizontal intensity differencing [2], and vertical intensity differencing [2]. There are

many more methods to unify images in unpredictable lighting situations than we

have enumerated here, but these are among the practices that show up repeatedly

in the detection methodologies we analyzed.

4.2 Brightness Equalization

There are several methods used to reduce unwanted global or partial bright-

ness variations caused by a changing environment. One of these is brightness

equalization or level shifting. Some intensity changes exemplify a property of an

object in uniform lighting conditions, while others are provoked by a localized

external light source or sink. A lighting equalization operator reduces the effects

of luminance shifts caused by focused light variations. Current preprocessing









techniques try to filter out localized intensity differentials, typically by applying a

polynomial transform to the pixel intensities of the entire image. Figure 4-1 shows

the result of this background leveling technique on the uniformity of the intensities.

Level shifting is successful if the background level changes gradually and can be


















Figure 4-1. Effects of brightness equalization on intensity uniformity. A) Image of
rice with localized brightness nonuniformities. B) Same image after a
linear brightness equalization filter is applied


modeled by a polynomial. Rowley et al. [8] execute brightness equalization on

every similarly masked 20x20 oval in their search space before feeding them to

their system of neural networks. Their luminance equalization filter is essentially

a linear function fitted to the average intensity values of small pixel regions within

the image. We perform luminance equalization upon the 220x220 greyscale image

before segmentation and dimensionality reduction take place [8]. We fit piecewise

continuous linear functions to both the brightest and darkest pixels in the image.

This has the advantage of keeping both the background and contrast of the image

consistent.









4.3 Histogram Equalization and Contrast Stretching

Histogram equalization transforms the contrast and range of a set of pixel

intensity values by providing a typically non-linear mapping of the original values.

In effect, the brightness histogram of the resulting image becomes uniform or

flat after its application. This technique is useful for picking out details that are

difficult for humans to see in an image or for a classifier to distinguish objects

known to have densely represented intensity values. However, relative pixel

information across the image is not preserved. Contrast stretching, on the other

hand, linearly scales the input pixel intensities so that the range of values is

stretched to desired minimum and maximum bounds. This technique preserves

intensity differences throughout the entire image. Figure 4-2 shows the effects of

contrast stretching and histogram equalization on an example image. In addition

to brightness equalization, Rowley et al. [8] perform histogram equalization and

contrast stretching to the search space. Oren et al. [2] normalize the wavelet

coefficients of their training images. Zhao and Thorpe [7] normalize the values

of the output of an edge detection algorithm before they are input to a neural

network. We execute contrast stretching and histogram equalization after the

segmentation and dimensionality reduction step. The contrast stretching step (also

called normalization) changes the dynamic range of the pixel values to one bounded

by 0 and 100. Figure 4-3 shows the vertical strip of the pedestrian from Chapter 3

before and after contrast stretching and histogram equalization respectively.
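The two operations can be sketched as follows; contrast_stretch and histogram_equalize are illustrative names of our own, and the [0, 100] bounds follow the normalization described above.

```python
import numpy as np

def contrast_stretch(img, lo=0.0, hi=100.0):
    """Linearly rescale intensities to the range [lo, hi], preserving
    relative intensity differences across the image."""
    mn, mx = img.min(), img.max()
    return (img - mn) / (mx - mn) * (hi - lo) + lo

def histogram_equalize(img, levels=256):
    """Map intensities through the normalized cumulative histogram so the
    output brightness histogram becomes approximately flat."""
    flat = img.astype(int).ravel()
    hist = np.bincount(flat, minlength=levels)
    cdf = hist.cumsum() / flat.size          # normalized CDF in [0, 1]
    return cdf[flat].reshape(img.shape) * (levels - 1)
```

Note the contrast stretch is a linear map (order and spacing are preserved up to scale), while the histogram mapping is generally non-linear, which is exactly the distinction drawn in the paragraph above.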

4.4 Horizontal and Vertical Intensity Differencing

The next phase involves the implementation of two very simple region differ-

ence extraction filters. Good results have been generated from the technique of

Oren et al. [2] which uses wavelets to encode region intensity differences for feature

extraction because relative quantities eliminate low frequency noise and more

consistently explicate human shape. In fact, the application of Haar wavelets uses











A B C


Figure 4-2. Contrast stretching and histogram equalization filters applied to a sam-
ple image. A) Image of lunar surface. B) Image after applied contrast
stretching filter. C) Image after applied histogram equalization filter


our same differencing scheme when generating coefficients of highest frequency. It

seems logical that we use a high frequency differencing scheme after extracting low

frequency information from dimensionality reduction. We exercise the spirit of the

approach but simplify the application. One method, called horizontal differencing,

replaces absolute pixel values with differences calculated horizontally. If i repre-

sents an image row, and j represents an image column in the 24x6 vertical strip

from Chapter 3, and xij signifies the intensity value at pixel location (i, j), then the























A B


Figure 4-3. Contrast stretching and histogram equalization filters applied to an
image of a pedestrian. A) 24x6 vertical strip image of pedestrian from
Chapter 3. B) Image after histogram equalization and normalization
filters are applied

horizontal differencing technique achieves the following:


x'_{ij} = \begin{cases} x_{ij} - x_{i(j+1)}, & 1 \le j < 6 \\ 0, & j = 6 \end{cases} \qquad (4.1)

The second technique also finds relative pixel intensities but along the columns of

the image instead of the rows.


x'_{ij} = \begin{cases} x_{ij} - x_{(i+1)j}, & 1 \le i < 24 \\ 0, & i = 24 \end{cases} \qquad (4.2)
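Both differencing filters amount to a few lines of array code. The sketch below uses NumPy with function names of our own; the strip may be any 2-D intensity array, such as the 24x6 vertical strip described above.

```python
import numpy as np

def horizontal_difference(strip):
    """Equation 4.1: replace each pixel with the difference to its right
    neighbour; the last column becomes zero."""
    out = np.zeros_like(strip)
    out[:, :-1] = strip[:, :-1] - strip[:, 1:]
    return out

def vertical_difference(strip):
    """Equation 4.2: replace each pixel with the difference to the pixel
    below it; the last row becomes zero."""
    out = np.zeros_like(strip)
    out[:-1, :] = strip[:-1, :] - strip[1:, :]
    return out
```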

Figure 4-4 shows the horizontal and vertical differencing effects on the 24x6 verti-

cal strip pedestrian image. This preprocessing step is performed after segmentation

and dimensionality reduction. The end result from the analysis of one 220x220

image is a set of 208 feature vectors in 144-space, and the image is ready for either

classifier training or people recognition.

































A B C

Figure 4-4. Horizontal and vertical differencing techniques applied to an image of a
pedestrian. A) 24x6 vertical strip image of pedestrian. B) Image after
horizontal pixel differencing is applied. C) Image after vertical pixel
differencing is applied















CHAPTER 5
CLASSIFIER ALGORITHM

5.1 Background

Based on the discussion in Chapters 3 and 4, we know that the dimensionality

of our feature vectors of people is 144. The space of all possible vectors, v ∈ R^144, signifies every possible segmented and preprocessed image input into our system.

We wish to delineate a subset of these vectors within several ellipsoidal boundaries.

The vectors within the ellipsoids refer to representations of people, while vectors

outside of the ellipsoids ideally are representations of non-people. Previous work

has been done to classify high-dimensional image features with precisely placed

Gaussian distributions [6]. We instead allow the system itself to place ellipsoids

upon the data and adaptively stretch or contract them during training. Neither

the number of ellipsoids nor the major or minor axis distances are constrained, but

positive examples of people are consumed by expanding ellipsoids, and negative

examples are rejected via ellipsoidal contraction. Figure 5-1 demonstrates dilation

and contraction for the two dimensional case. Dilation and contraction operations

are encoded into the construction of linear transforms applied to positive and

negative input vectors. The testing phase consists of inclusion tests of points within

the established ellipsoidal contours.

A good starting point for the analysis of the boundary conditions starts with

a look at spheroids because they are the simplest ellipsoids, and the boundary test

for a point's inclusion in an ellipsoid builds on the analogous test for a spheroid.

We formulate a method of testing data inclusion within a general spheroid in

144-space in Section 5.2 based on the technique used by Kositsky and Ullman [1],

and then we introduce the extensions required to adapt the boundary type to an































Figure 5-1. Demonstration of ellipsoidal dilation and contraction. A two dimen-
sional ellipsoid dilates to capture a positive point P and contracts to
throw out a negative point N


ellipsoidal one in Section 5.3 also based on the same authors' work. Section 5.4

discusses the combination of the two transforms and provides the finer points of the

complete operator.

5.2 Linear Transform for Spheroids

A spheroid is an ellipsoid whose axes are all equal in magnitude. This equality

simplifies the equation of a spheroid into a sum of n squared terms:

\frac{x_1^2}{r^2} + \frac{x_2^2}{r^2} + \cdots + \frac{x_n^2}{r^2} = 1 \qquad (5.1)

where r is the radius of the spheroid, and n is the spheroid's dimensionality. To

determine whether a particular vector is within a given spheroid, a sequence of

scalings is performed on each component of the vector. Such scalings contract

or expand a point on the boundary of a spheroid of radius r to a point on the

boundary of a spheroid of radius 1 if the spheroid is centered at the origin. Points

inside or outside of the original spheroid end up in a proportionally equivalent









location in the new spheroid. A linear map can perform vector scaling along

specific directions. Such a linear map is defined in this way:

∃ a set of vectors D ⊂ R^144 and another set of vectors Q ⊂ R^144
such that ∃ a linear map f: D → Q where f(x) = Lx
for some 144 × 144 matrix L,
some reference point c ∈ D, and ∀ x ∈ D.

The symbol D refers to the set of vectors, relative to the current spheroidal center

c, received by the system for classification or training. The symbol Q represents

the set of linearly mapped relative vectors. Each component of the input vector

is scaled by the same amount to effectively compress or distend points along a

direction which seeks or avoids the center of the spheroid. Hence, the direction

along which the scaling is executed depends on the input point, and is always

perpendicular to the tangent of the spheroid at the transformed point. Given an

input point x, an output vector of the following form is sought:

y = f(x) = \frac{1}{r}\,x \qquad (5.2)


where r is the radius of the original spheroid. The matrix L takes on the following

configuration to perform this operation.

L = \begin{pmatrix} \frac{1}{r} & 0 & \cdots & 0 \\ 0 & \frac{1}{r} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{r} \end{pmatrix} \qquad (5.3)

The entries in L along the diagonal are just the square roots of the coefficients of

Equation 5.1. We are now in a position to perform an inclusion test on the input

point. It is important to note that a test point must first be rewritten relative to

the center of the spheroid because the matrix L scales the components closer or

farther from the spheroid's center. If the point x is inside or on the boundary of a










spheroid of radius r centered at the point c, then the resulting vector L(x − c) is

inside or on the boundary of a spheroid of radius 1 centered at the point c, when it

is referenced to c. The magnitude of the new relative vector, |L(x − c)|, is equal to

1. If the norm is less than 1, then the point x is within the original spheroid. If the

magnitude is greater than 1, then x is outside the confines of the original spheroid

of radius r. Figure 5-2 demonstrates the transformation in two dimensions.
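The spheroid inclusion test of this section reduces to a few lines of code. This sketch assumes a spheroid with equal radii r per Equation 5.3; inside_spheroid is an illustrative name of our own.

```python
import numpy as np

def inside_spheroid(x, c, r):
    """Inclusion test of Section 5.2: write the point relative to the
    centre c, scale it by L = (1/r) I (Equation 5.3), and compare the
    norm of the result against 1."""
    L = np.eye(len(x)) / r
    return np.linalg.norm(L @ (np.asarray(x) - np.asarray(c))) <= 1.0
```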







A B

Figure 5-2. Linear transformation of a spheroid. Transformation of a point x
from the boundary of a spheroid of radius r to the boundary of a unit
spheroid. A) shows contraction while B) shows expansion


5.3 Ellipsoidal Transform

The formula for a spheroid is derived from the general equation of an ellipsoid

which assumes a more complicated formalization. The equation of an ellipsoid

takes the following general structure:


\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\,x_i x_j = 1, \qquad a_{ij} = a_{ji} \ \forall\, i, j \qquad (5.4)
where n is the dimensionality of the space. There are more terms in the ellipsoid

equation than in the spheroid equation. The morphological reason for this is

that an ellipsoid is a spheroid whose boundary point components are scaled in a

finite number of directions and by different amounts. The directions correspond

to the directions of the major and minor axes, and the scaling amounts refer to









the distance discrepancies of the corresponding major and minor axes. On the

other hand, a spheroid results when original spheroidal boundary points are scaled

over all directions by similar amounts. The extra terms are introduced when

the minor and major axes do not coincide with the coordinate axes. Again, we

want to transform a hyperdimensional ellipsoid into a unit spheroid because the

resulting inclusion test for points becomes trivial. We focus on a point x on the

boundary of a given ellipsoid. We assume that only one dilation or contraction in

one direction is needed to transform the ellipsoid into a spheroid. We may make

this claim because as we will see, such a transform is linear and linear mappings are

transitive. In other words, if


f(x) = y = Lx \quad \text{and} \quad g(y) = z = Ky \qquad (5.5)

are both linear mappings, then

z = g(f(x)) \qquad (5.6)

or using the matrix equivalents,

z = KLx \qquad (5.7)

Since each mapping corresponds to a unidirectional ellipsoidal contraction or

dilation, a sequence of expansions and constrictions translates to a sequence of

matrix multiplications. We wish to retain proportionality across the transform

along the direction being modified, so that the proportion of the point's projection

along the modified direction to the major axis of the conic remains constant across

the mapping. Given that x̄ is the boundary point on the modification axis, x is

any point on the boundary of the ellipsoid, c is the center of the ellipsoid, and

remembering that all points are actually vectors in R144 space relative to c, we use

the following vector formula to transform x into a point y on the boundary of a









unit spheroid centered at c. Figure 5-3 is a visual representation of the mapping.


y = f(x) = x - \frac{\bar{x} \cdot x}{|\bar{x}|^2}\,\bar{x} + \frac{\bar{x} \cdot x}{|\bar{x}|^2}\,\frac{\bar{x}}{|\bar{x}|} \qquad (5.8)

The vector projection of x along x̄ is subtracted from x so that the result has





Figure 5-3. Mapping of an ellipsoid to a unit spheroid knowing the dilation axis.
Transformation of a point x on the boundary of an ellipsoid to a point
y on the boundary of a unit spheroid. Both are centered at point c,
and the dilation axis is x̄

no component along the major axis, and it is perpendicular to this modification

axis. Then, the third term in the sum adds the original component of x along the

major axis scaled by the inverse of the magnitude of x̄. Since the magnitude of
x̄ is always larger than or equal to the projection of x along the major axis, the

ratio is always less than or equal to one. The culmination of the sum is a point on

the boundary of a unit spheroid centered at c maintaining the same proportional

distance to other points along the modified axis. The mapping's corresponding

matrix operator, L, has the following form:

L = I - \frac{|\bar{x}| - 1}{|\bar{x}|}\,\frac{\bar{x}\,\bar{x}^{T}}{|\bar{x}|^2} \qquad (5.9)

Since y is on the contour of a unit spheroid, the following is true:


|y| = |Lx| = 1 \qquad (5.10)









We have just shown that if we are given an ellipsoid that is one transfer function

away from a spheroid, and we know the vector form of the axis of dilation or

contraction, then we can determine whether a point relative to the center of the

ellipsoid is inside the given ellipsoid. At the same time, we may define the ellipsoid

to be the linear operator used to squash or expand it into a unit spheroid because

the boundary is defined by the transform. Since linear transforms are transitive in

nature and an ellipsoid may be portrayed as a spheroid whose points are scaled in a

finite number of directions, an ellipsoid in general can be represented by a sequence

of matrix multiplications, and each matrix involved in the product represents a

single dilation or contraction.

5.4 Classifier Rules

The processes of contraction and dilation occur in the following way. Initially, if

the new input point cannot be engulfed by any existing ellipsoids, then it becomes

the center of a new spheroid of radius r. The radius is a user-defined constant

which constrains the volume of the initial hyperdimensional conic. If this value is

too large, then the training algorithm must work harder to shrink the ellipsoids,

and if this value is too small, enlarging the conics becomes processor intensive.

We choose a value on the small side because the set of points that represent the

existence of people is a much smaller set than the set of R144. Our starting radius

is 10% of each vector component's possible maximum. Ultimately, the exact value
is not critical because the training algorithm automates the learning process

without any a priori constraints. The continual process of reshaping the ellipsoids

fits the contours of the conics to the data regardless of the initial value of the

spheroid radius. The transform matrix identifies the shape of the ellipsoid, but

two points on its boundary determine the directions of dilation and contraction.

These points are called the last contraction point (LCP) and the last dilation point

(LDP). The LCP is the last negative sample point previously inside the ellipsoid









but now on the boundary after the last contraction. The LDP is the last positive

sample point previously outside of the ellipsoid but recently captured by it. The

points are also vectors that are referenced from the ellipsoid's center. The next

dilation axis is always perpendicular to the LCP vector and along the plane formed

by the LCP, ellipsoidal center, and the new positive sample point. Analogously,

the contraction axis is always perpendicular to the LDP vector and along the

plane containing the LDP, center, and new contraction point. This simple rule

prevents an ellipsoid from immediately reintroducing a point
that it recently expelled or from throwing out a point that it just acquired.

If a dilation is being performed, we must find a dilation axis whose direction is

perpendicular to the LCP vector. Analogously, if a contraction is taking place, we

must find a compression axis whose direction is perpendicular to the LDP vector.

Let the LDP or LCP vector be depicted by v, and let x be the new sample point.

The compression/dilation axis, e, is given by:

e = x - \frac{x \cdot v}{|v|}\,\frac{v}{|v|} \qquad (5.11)

Since e can be scaled by any value and retain its direction, we have

K e = |v|^2\,x - (x \cdot v)\,v \quad \text{for } K = |v|^2 \qquad (5.12)


Only the direction of e is important in the dilation or contraction equations so

we don't care about the actual value of K. Based on this information, we can

now determine the transfer function for dilation or contraction. We replace x̄ in
Equation 5.8 with the vector e.

y = f_e(x) = x - \frac{e \cdot x}{|e|^2}\,e + \frac{e \cdot x}{|e|^2}\,\frac{e}{|e|}\,y_e \qquad (5.13)

where
e \cdot x = |v|^2 |x|^2 - (x \cdot v)^2 \qquad (5.14)









and

y_e = \sqrt{1 - \left( \frac{x \cdot v}{|v|^2} \right)^2} \qquad (5.15)

and e is as defined in Equation 5.12. The extra term on the end, ye, is the projec-

tion of the output point y along e. In Equation 5.8, this term is 1. The head of the

vector e does not necessarily lie on the boundary of the ellipsoid, so the extra term

is needed to scale the result properly. Figure 5-4 displays the aforementioned linear

mapping in two dimensions. The analogous linear operator L becomes:


Figure 5-4. Mapping of an ellipsoid to a unit spheroid by calculating the dilation
axis. Transformation of a point x on the boundary of an ellipsoid to a
point y on the boundary of a unit spheroid. Both are centered at point
c, and the dilation axis is determined based on the position of the LCP
(v)


L = I - \frac{|e| - y_e}{|e|}\,\frac{e\,e^{T}}{|e|^2} \qquad (5.16)

The resulting matrix L specifies a linear scaling in a single direction along only e

to add or expel an introduced positive or negative sample respectively. The matrix

C encodes all of the linear scalings done previous to the current one. A recursive

methodology updates the universal operator C in such a way:


C_{new} = C_{old}\,L \qquad (5.17)









There are certain constraints on an ellipsoid that prevent it from capturing a new

input point. The addition of a new point to an existing ellipsoid is possible only

when the added point is within the hyperplanes of the prospective ellipsoid. If

the projection of the additional point along the LCP vector is greater than the

magnitude of the LCP, then it is clear that the ellipsoid cannot capture it because

the LCP must remain on the boundary of the dilated ellipsoid. Figure 5-5 shows

the conceptual difficulty of an ellipsoid capturing a positive point when it is not

within its hyperplanes. A test is performed prior to dilation to determine if the



Figure 5-5. Input point not within the hyperplanes of an ellipsoid. A prospec-
tive ellipsoid cannot dilate to capture a positive input point, x, if it is
not within the hyperplanes h1 and h2 because v must remain on the
boundary of the ellipsoid


current ellipsoid is a potential candidate to engulf the new point. Symbolically, the

test is equivalent to the following inequality:


\frac{v \cdot x}{|v|} < |v| \qquad (5.18)


where v is the LCP and x is the new sample point. Contractions operate on a

sample point within an ellipsoid so the point is necessarily within the hyperplanes

of the conic.

Now, an overview of the formation of the ellipsoidal distribution is presented.

A system consists of an ordered set of linear operators, E. Each linear operator

defines an ellipsoid in R144 space. A positive or negative sample point, x, is











introduced to the system. The matrix, Ci, is the universal linear operator for an

existing ellipsoid i. If the point is a positive sample, then the ellipsoid inclusion

test is performed with the following inequality


|C_i\,x| \le 1, \quad C_i \in E \qquad (5.19)


The inclusion test is done for each ellipsoid i until the inequality is met or E is

consumed. If the point is within an ellipsoid, nothing more is done. If no ellipsoid

contains the point, then the hyperplane test is performed for each existing ellipsoid,

i until the test succeeds or the end of E is reached. If the point is within an

ellipsoid's hyperplanes, that ellipsoid is stretched to capture the point, and Ci

is updated by matrix multiplication with the linear operator for the dilation, L.

Otherwise, the point becomes the center of a new ellipsoid and C becomes (1/r) I. If

the sample point is a representation of a non-human, and it is within any existing

ellipsoid, i, that ellipsoid is contracted to place the point on the boundary. The

universal operator Ci is updated by multiplication to the contraction operator, L.

Otherwise, nothing more is done.
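A skeletal version of the positive-sample rule can be sketched as follows. This is a deliberate simplification of the procedure above: ellipsoids are stored as (center, operator) pairs, and the dilation/hyperplane branch is omitted so that any non-captured point seeds a new spheroid; classify and add_positive are names of our own.

```python
import numpy as np

def classify(ellipsoids, x):
    """Inclusion test of Equation 5.19: x is labelled a person if some
    ellipsoid's universal operator maps its relative vector inside the
    unit spheroid."""
    return any(np.linalg.norm(C @ (np.asarray(x) - c)) <= 1.0
               for c, C in ellipsoids)

def add_positive(ellipsoids, x, r):
    """Simplified positive-sample rule: if no ellipsoid contains x (the
    dilation branch is skipped for brevity), start a new spheroid of
    radius r centred on x with operator (1/r) I (Equation 5.3)."""
    x = np.asarray(x, dtype=float)
    if not classify(ellipsoids, x):
        ellipsoids.append((x, np.eye(len(x)) / r))
    return ellipsoids
```

A full implementation would first attempt the hyperplane test of Equation 5.18 and a dilation via Equation 5.16 before creating a new spheroid, and would contract an ellipsoid via the analogous operator when a negative sample falls inside it.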















CHAPTER 6
EXPERIMENTS

6.1 Training the Classifier

The ellipsoidal classifier is trained with approximately 1000 images of people

having a size of 110x55 pixels. They are pictures from the same database used

to train the SVM classifier in the procedure of Oren et al. [2]. Negative image

samples of an equivalent size are also used to train the system. It is difficult to

represent the entire class of non-humans with a marginal number of images. Hence,

synthetic images are created as negative sample points to train the algorithm.

Approximately 150 images of indoor and outdoor scenes were downloaded from

the internet, and 850 of the total negative images have randomly generated pixel

intensities. Precedent for the use of random data as negative samples is seen in

the approach of Rowley et al. [8]. The large size of the non-human image class

dictates the use of random samples to present a more complete example of the

class. Figure 6-1 displays several illustrations of positive and negative training

images.

6.2 Preprocessing Schemes

Several training schemes are produced by coupling different image preparation

methods with the ellipsoidal trainer. The experimental results obtained from

each trained scheme are compared to determine the best image preprocessing

techniques to use for detection. The systems differ in the amount of preprocessing

done and the type of feature extraction performed. All of the schemes use the

same training images, and all perform segmentation, dimensionality reduction,

and normalization of the input images so that the feature space is reduced and

unified. Preprocessing scheme 1 additionally performs horizontal differencing




















Figure 6-1. Examples of positive and negative training images respectively


to the preprocessed image to create the feature vector. Preprocessing scheme 2
executes vertical differencing instead. Preprocessing scheme 3 introduces brightness
equalization to level the background of the raw image before further processing
occurs. It also uses horizontal differencing for feature extraction. Preprocessing
scheme 4 is similar to scheme 3 except that it uses vertical instead of horizontal
differencing. Preprocessing scheme 5 implements a nonlinear histogram equalization
process applied to the segmented and dimensionally reduced vertical strip image.
Horizontal differencing takes place afterward. Preprocessing scheme 6 is similar to
scheme 5 except it uses vertical differencing of pixel intensities instead of horizontal
differencing.
6.3 Testing Phase
Positive and negative images different from the training data are analyzed
by each preprocessing scheme, and the corresponding output is fed into the
ellipsoidal detector which is produced via the instruction of the associated training
scheme. Two groups of test data are presented to each detector. The first group
of test images are divided into subsets of positive and negative images. Subset 1









consists of 142 images of people taken from the same database used for training

but different from the training images themselves. These images approach an

ideal test group because the size constraint of the subjects is consistent with

that of the detector, and the poses are limited to frontal and back orientations.

Examples of positive images from test group 1 are shown in Figure 6-2. Subset 2










Figure 6-2. Examples of positive images from the first test group


is a collection of 127 negative images that were downloaded from the internet and

chosen specifically because they resemble humans structurally. Hence the chance of

a false detect within this subset is more probable. Some examples from subset 2 are

shown in Figure 6-3.











Figure 6-3. Examples of negative images from the first test group


A second group of images is considered to be more complex test material

because the data is not staged, and few environmental variables are controlled.

This test group contains 31 greyscale 220x220 images of real-world outdoor
scenes taken by a digital camera. The images are processed only by the detector scheme

with the best positive and negative detection rates on the group 1 test images.









Also, before the second test group is analyzed, the chosen detector undergoes a

semi-automated bootstrapping procedure. A webpage is updated every few seconds

with an image processed by the detector scheme. The detector draws a box in

the image around an area where a person is found, and the source of the original

image is a digital camera pointed outside at an area where people frequently walk.

When the webpage is monitored, and the image displays a false positive, a script is

executed which actively includes the offending original camera image in the training

database of negative images. Another script saves an image in the database of

positive training images in the instance of a false negative. The detector may then

be retrained at the user's convenience. Bootstrapping the system to achieve better

performance is used in many of the techniques that are examined [2, 6, 7, 8, 9].

Figure 6-4 gives a selection of images from the group 2 database.

6.4 Sensitivity Analysis

The ruggedness of the system is tested by introducing blurred versions of test

images from the first testing group to the classifier. Three levels of a Gaussian blur

convolution mask are produced by varying the exponential decay function of the

20x20 mask. Each level varies from the previous by an order of magnitude of the

exponential power. Figure 6-5 shows the blurring levels.
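The blur masks can be generated as below. The thesis does not specify the exact decay form beyond the constants 0.35, 0.035, and 0.0035, so the exp(−decay·d²) form and the name gaussian_mask are assumptions of ours.

```python
import numpy as np

def gaussian_mask(decay, size=20):
    """A size x size blur mask whose weights fall off as
    exp(-decay * d^2), where d is the distance from the mask centre.
    Decay values an order of magnitude apart (0.35, 0.035, 0.0035)
    reproduce the three blur levels of the sensitivity test."""
    c = (size - 1) / 2.0
    yy, xx = np.mgrid[0:size, 0:size]
    mask = np.exp(-decay * ((xx - c) ** 2 + (yy - c) ** 2))
    return mask / mask.sum()   # normalize so overall brightness is kept
```

Convolving a test image with each mask in turn yields progressively stronger blurring as the decay constant shrinks, since a smaller decay spreads the mask's weight over a wider neighbourhood.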







































Figure 6-4. Selections of images from the second test group




Figure 6-5. Different blurring levels of a test image. A) Original test image. B)
Same test image put through a 20x20 Gaussian blur filter with an
exponential decay constant of 0.35. C) Exponential decay constant is
0.035. D) Exponential decay constant is 0.0035















CHAPTER 7
RESULTS

Tables 7-1 and 7-2 display the classification results of the trained detector

schemes on the first and second groups of test data. The ell. entry specifies

the number of ellipsoids produced during the training of each detector scheme.

Surprisingly, scheme 1 performs the best for both positive and negative test images

in the group 1 data set with a positive detection rate of 84.5% and a negative

detection rate of 7.9%. Preprocessing scheme 1 executes neither brightness leveling

nor histogram equalization, but performs only the horizontal differencing algorithm

along with image segmentation and dimensionality reduction. The horizontal

differencing of neighboring pixels in an input image in general produces better

results than vertical differencing of pixel values. This observation seems reasonable

considering the higher degree of vertical than horizontal symmetry in humans.

The first detector scheme is used to analyze the test images of group 2 after the

bootstrapping procedure explained in Chapter 6 is executed because it performs

the best out of all of the detector schemes on group 1 test data. Figure 7-1 displays

the ability of detector scheme 1 to pick out people from selections of the group 2

database.











Table 7-1. Results of detector schemes 1-3 on test data
Scheme                 ell.  P. detects  N. detects  P. det. rate  N. det. rate
Scheme 1               4
 Group 1 images              120/142     10/127      84.5%         7.9%
 Group 2 images              7/27        13/6448     25.9%         0.20%
 Gaussian blur 0.35          74/142                  52.1%
 Gaussian blur 0.035         70/142                  49.3%
 Gaussian blur 0.0035        9/142                   6.3%
Scheme 2               13
 Group 1 images              98/142      11/127      69.0%         8.7%
 Gaussian blur 0.35          66/142                  46.5%
 Gaussian blur 0.035         38/142                  26.8%
 Gaussian blur 0.0035        8/142                   5.6%
Scheme 3               3
 Group 1 images              109/142     21/127      76.8%         16.5%
 Gaussian blur 0.35          73/142                  51.4%
 Gaussian blur 0.035         61/142                  43.0%
 Gaussian blur 0.0035        15/142                  10.6%





Table 7-2. Results of detector schemes 4-6 on test data
Scheme                 ell.  P. detects  N. detects  P. det. rate  N. det. rate
Scheme 4               19
 Group 1 images              75/142      10/127      52.8%         7.9%
 Gaussian blur 0.35          61/142                  43.0%
 Gaussian blur 0.035         30/142                  21.1%
 Gaussian blur 0.0035        3/142                   2.1%
Scheme 5               1
 Group 1 images              107/142     32/127      75.4%         25.2%
 Gaussian blur 0.35          73/142                  51.4%
 Gaussian blur 0.035         102/142                 71.8%
 Gaussian blur 0.0035        75/142                  52.0%
Scheme 6               2
 Group 1 images              71/142      13/127      50.0%         10.2%
 Gaussian blur 0.35          53/142                  37.3%
 Gaussian blur 0.035         84/142                  59.1%
 Gaussian blur 0.0035        66/142                  46.5%






















Figure 7-1. Some analyzed selections from the second group of test images. Boxes
are drawn around positive detections by the algorithm















CHAPTER 8
CONCLUSIONS

In this thesis we determine whether people who fit a specific size profile and who
pose in everyday situations can serve as a viable input class for a simple binary
detector using methods adapted and simplified from several noted techniques.

We maximize the detection results over all of the preprocessing techniques used

during training by selecting image preparation algorithms that give the best results

during testing of the detector schemes on one group of test data. Dimensionality

reduction is an important aspect of classifier formation, and we choose a procedure

which has a basis in wavelet coefficient formation. It is one of the low frequency

representations of the input data. Unlike most wavelet techniques which use many

more than two levels of wavelet transforms, we use one low frequency transform
to reduce the dimensionality of the input vectors and one high frequency transform
to extract human features, which exhibit good clustering behavior when
depicted as feature vectors. We assume that the feature representations

assemble into ellipsoidal shapes with varying major and minor axes lengths, and

through contraction and dilation of the ellipsoidal distributions, a large majority

of feature vectors representing negative input examples remain outside of the

ellipsoidal boundaries. Bootstrapping is used to improve the performance of the

final detector by allowing the retraining of the classifier scheme with images that

previously displayed false positives and false negatives.

The results are encouraging because the detectors are trained with a small

number of positive and negative examples, and the detection rates are comparable

to current techniques. However, several variables involved in training the detector

schemes were not taken into consideration, so it is unclear whether the results could









be improved by increasing the number of positive and negative training images.

We assume that the order in which images are presented to the trainer influences the detector's ability to detect humans, because ellipsoids are created, contracted, and dilated in the order that the input images are introduced. We did not use the order of the training data as a factor in training the detectors, nor did we try different orderings of the examples to achieve higher detection rates. Also, based on the dilation and contraction methodology in the ellipsoidal algorithm, it would seem that introducing more negative points could exclude additional positive examples not equal to the LDP; different training methodologies would have to be examined to minimize this consequence. Our current detection rates reflect a training epoch that examines all of the negative examples between each positive example, and other types of training methodologies should be studied.
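To make the ordering dependence and its consequence concrete, the following is a minimal sketch, not the algorithm exactly as implemented in this work: ellipsoids are seeded as unit spheres, dilation and contraction rescale all axes uniformly by an assumed factor of 1.1, and all negatives are swept after each positive, mirroring the epoch described above.

```python
import numpy as np

class Ellipsoid:
    """A point x is inside when (x - c)^T M (x - c) <= 1."""

    def __init__(self, center, radius=1.0):
        self.c = np.asarray(center, dtype=float)
        self.M = np.eye(len(self.c)) / radius ** 2   # start as a sphere

    def contains(self, x):
        d = np.asarray(x, dtype=float) - self.c
        return bool(d @ self.M @ d <= 1.0)

    def dilate(self, factor=1.1):
        self.M /= factor ** 2                        # grow every axis

    def contract(self, factor=1.1):
        self.M *= factor ** 2                        # shrink every axis

def train(positives, negatives):
    """Absorb each positive by dilating the most recent ellipsoid (or
    seeding a new one), then sweep all negatives and contract any
    ellipsoid that contains one of them."""
    ellipsoids = []
    for p in positives:
        target = next((e for e in ellipsoids if e.contains(p)), None)
        if target is None and ellipsoids:
            target = ellipsoids[-1]
            for _ in range(100):                     # bounded dilation
                if target.contains(p):
                    break
                target.dilate()
        if target is None or not target.contains(p):
            ellipsoids.append(Ellipsoid(p))
        for n in negatives:
            for e in ellipsoids:
                for _ in range(100):                 # bounded contraction
                    if not e.contains(n):
                        break
                    e.contract()
    return ellipsoids

def detect(ellipsoids, x):
    return any(e.contains(x) for e in ellipsoids)
```

Running `train([[0.0, 0.0], [3.0, 0.0]], [[0.5, 0.0]])` exhibits exactly the consequence noted above: the second positive is first absorbed by dilation, then thrown out again when the negative forces a contraction, so the order and interleaving of examples shape the final set of ellipsoids.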

More work should be done to provide a theoretical basis for the ideas presented in this thesis. Much work has been done by others to link high-dimensional vector spaces with low-dimensional kernel classifiers. We believe there is a connection between representing image data with second-order manifolds and the principles underlying kernel classification techniques; however, these concepts are left as future analysis on the topic.
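One concrete starting point for such an analysis (a sketch of the connection, not a result established in this thesis): writing c for an ellipsoid's center and M for its symmetric positive definite shape matrix, membership in the ellipsoid is a quadratic inequality in the feature vector,

\[
(x - c)^{\top} M (x - c) \le 1
\;\Longleftrightarrow\;
x^{\top} M x \;-\; 2\,c^{\top} M x \;+\; \bigl(c^{\top} M c - 1\bigr) \le 0,
\]

where every term on the left is a monomial of degree at most two in the components of x. Each ellipsoid is therefore a linear classifier in the feature space induced by a second-degree polynomial kernel such as \(k(x, y) = (1 + x^{\top} y)^{2}\), which suggests one route for relating the union-of-ellipsoids model to kernel classification techniques.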















REFERENCES


[1] M. Kositsky and S. Ullman, "Learning class regions by the union of ellipsoids,"
Proceedings of the 13th International Conference on Pattern Recognition, vol. 4,
pp. 750-757, 1996. 1, 27

[2] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio, "Pedestrian
detection using wavelet templates," Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 193-199, 1997. 1, 5, 16, 20, 21,
23, 38, 41

[3] C. Curio, J. Edelbrunner, T. Kalinke, C. Tzomakas, and W. von Seelen,
"Walking pedestrian recognition," IEEE Transactions on Intelligent Trans-
portation Systems, vol. 1, no. 3, pp. 155-163, 2000. 2

[4] H. Nanda and L. Davis, "Probabilistic template based pedestrian detection
in infrared videos," Proceedings of the IEEE Intelligent Vehicles Symposium,
pp. 504-515, 2002. 2

[5] A. Broggi, M. Bertozzi, A. Fascioli, and M. Sechi, "Shape-based pedestrian
detection," Proceedings of the IEEE Intelligent Vehicles Symposium, pp. 215-220,
2000. 2, 5, 13

[6] T. Poggio and K. Sung, "Finding human faces with a Gaussian mixture
distribution-based face model," Proceedings of the Second Asian Conference
on Computer Vision, pp. 139-155, Dec. 1995. 5, 7, 16, 20, 27, 41

[7] L. Zhao and C. Thorpe, "Stereo and neural network-based pedestrian
detection," IEEE Transactions on Intelligent Transportation Systems, vol. 1,
pp. 148-154, Sept. 2000. 5, 9, 15, 21, 23, 41

[8] H. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection,"
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1,
pp. 23-38, 1998. 5, 10, 15, 16, 19, 21, 22, 23, 38, 41

[9] H. Schneiderman and T. Kanade, "Probabilistic modeling of local appearance
and spatial relationships for object recognition," Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 324-339, June
1998. 15, 41

[10] P. S. Penev and L. Sirovich, "The global dimensionality of face space,"
Proceedings of the 4th IEEE International Conference on Automatic Face and
Gesture Recognition, pp. 264-270, 2000. 16















BIOGRAPHICAL SKETCH

Jennifer Lea Laine was born in Vero Beach, Florida, in 1975. She received a

Bachelor of Science degree with honors in electrical engineering at the University

of Florida in the summer of 1998 and, later in 2000, a Bachelor of Science degree

in mathematics. Besides working towards a Master of Science degree in electrical

engineering, she is a member of the Machine Intelligence Laboratory in the

Electrical and Computer Engineering Department and works part-time as a design

engineer at Neurotronics in Gainesville, Florida.