COMPUTER CONTROLLED DETECTION OF PEOPLE USING ADAPTIVE ELLIPSOIDAL DISTRIBUTIONS

By

JENNIFER L. LAINE

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2003

Copyright 2003 by Jennifer L. Laine

ACKNOWLEDGMENTS

I would like to thank my committee members for taking the time to listen to my defense and to really read and critique my thesis. I thank Dr. Arroyo for his guidance throughout the years. It was his influence that allowed me to strive towards my potential. I also wish to thank my labmates at the Machine Intelligence Laboratory, whose words of encouragement (both positive and negative) have helped to shape me into the person I am today. The writing of this thesis was done primarily in a vacuous apartment in Raleigh, NC. I would have possibly gone insane without the daily lunch breaks with Scott Nichols and Ivan Zapata, so thanks also go to them. I thank JD Landry for making worktime at IBM less boring, and I also thank Dr. Eddie Grant at NC State for providing a structure for my unorganized editorial infancy. I thank Juan Sanchez for giving me positive reinforcement to finish my master's degree even though Vegas seemed like a million miles away. Many thanks go out to Dr. Jack Smith, who kicked me in the butt throughout the last months of writing and editing so that I could finish my thesis and do my presentation on time, and I thank James Schubert for telling me that my work was not the remains of putrefaction and for questioning the practicality of 144 dimensions. Finally, thanks go out to my parents and grandparents, who have been motivating me my entire life.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Preface
  1.2 Current Classification Techniques and Problem Discussion
  1.3 Thesis Organization

2 PROBLEM REVIEW
  2.1 Introduction
  2.2 Wavelet Templates and Support Vector Machines
  2.3 Gaussian Distribution-Based Model and Neural Networks
  2.4 Stereovision and Neural Networks
  2.5 Neural Network Overload
  2.6 Shape-Based Pedestrian Detection

3 IMAGE SEGMENTATION AND DIMENSIONALITY REDUCTION
  3.1 Introduction
  3.2 Partitioning the Image
  3.3 Dimensionality Reduction
  3.4 Published Methods

4 PREPROCESSING OF AN IMAGE SEGMENT
  4.1 Introduction
  4.2 Brightness Equalization
  4.3 Histogram Equalization and Contrast Stretching
  4.4 Horizontal and Vertical Intensity Differencing

5 CLASSIFIER ALGORITHM
  5.1 Background
  5.2 Linear Transform for Spheroids
  5.3 Ellipsoidal Transform
  5.4 Classifier Rules

6 EXPERIMENTS
  6.1 Training the Classifier
  6.2 Preprocessing Schemes
  6.3 Testing Phase
  6.4 Sensitivity Analysis

7 RESULTS

8 CONCLUSIONS

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

7-1 Results of detector schemes 1-3 on test data
7-2 Results of detector schemes 4-6 on test data
LIST OF FIGURES

3-1 Clarification of the placement of staggered subimages in the input image
3-2 Arrangement of half-blocks within an image block
3-3 Steps for segmenting an example image of a person
4-1 Effects of brightness equalization on intensity uniformity
4-2 Contrast stretching and histogram equalization filters applied to a sample image
4-3 Contrast stretching and histogram equalization filters applied to an image of a pedestrian
4-4 Horizontal and vertical differencing techniques applied to an image of a pedestrian
5-1 Demonstration of ellipsoidal dilation and contraction
5-2 Linear transformation of a spheroid
5-3 Mapping of an ellipsoid to a unit spheroid knowing the dilation axis
5-4 Mapping of an ellipsoid to a unit spheroid by calculating the dilation axis
5-5 Input point not within the hyperplanes of an ellipsoid
6-1 Examples of positive and negative training images respectively
6-2 Examples of positive images from the first test group
6-3 Examples of negative images from the first test group
6-4 Selections of images from the second test group
6-5 Different blurring levels of a test image
7-1 Some analyzed selections from the second group of test images

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

COMPUTER CONTROLLED DETECTION OF PEOPLE USING ADAPTIVE ELLIPSOIDAL DISTRIBUTIONS

By Jennifer L. Laine

May 2003

Chair: Michael C. Nechyba
Major Department: Electrical and Computer Engineering

We present a software approach towards real-time detection of human beings in camera images or video.
Due to the large variations in shape, texture, and orientation present in people over time and over samples, we use a statistical procedure which contours ellipsoidal distributions around positive data examples while avoiding negative samples during the training of the detector. The data points around which the statistical approach models its distributions are feature vectors extracted from the image pixel values through a process which normalizes, filters, and dimensionally reduces them. We test the effectiveness of several popular image processing techniques to determine which ones contribute the best detection rates, and use them in the final detector. Finally, we test the model on real-world test images and discuss the results.

CHAPTER 1
INTRODUCTION

1.1 Preface

Computer controlled detection of human beings is a well-established artificial vision problem. The challenges associated with it are daunting because people in everyday situations are not structurally uniform, and they exhibit many vastly different poses and orientations. Here we present an algorithm trained to detect human beings in still images or video. A technique for learning class regions in R^2 by the autonomous generation of ellipsoidal distributions to enclose positive target objects, developed by Kositsky and Ullman [1], is extended to accept a higher dimensional input space. Evidence of the possibility of such an extension is mentioned by the authors, but the theory is never tested. Also, several data preparation methods are experimentally compared so that a maximum positive detection rate and a minimum false positive detection rate are achieved across the methods. Structural cues exhibited by people in multiple poses participate in the formation of many distinct ellipsoidal distributions signifying human existence.
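The membership rule behind such ellipsoidal distributions can be sketched as follows. This is an illustrative sketch only, not the thesis's implementation: the function and variable names are assumptions, and the classifier here simply tests whether a feature vector falls inside any positive-class ellipsoid.

```python
import numpy as np

def inside_ellipsoid(x, center, A):
    """True if (x - c)^T A (x - c) <= 1, i.e. x lies inside the ellipsoid.

    A is a symmetric positive-definite matrix encoding the ellipsoid's
    axes and dilation; center is its centroid. Identifiers are
    illustrative, not taken from the thesis.
    """
    d = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return float(d @ A @ d) <= 1.0

def classify(x, ellipsoids):
    """A point is labeled positive if it falls inside ANY of the
    positive-class ellipsoids; exclusion from all of them means negative."""
    return any(inside_ellipsoid(x, c, A) for c, A in ellipsoids)
```

The same test generalizes unchanged to higher-dimensional input spaces, since only the size of `A` and `center` grows.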
We make the assumption that certain relative phenomena within two-dimensional images of people may be represented as points within independent ellipsoidal contours, based on preprocessing techniques used in the paper by Oren et al. [2]. Each contour is defined by a high-order transfer matrix coupled with a center point, and an object's exclusion from the positive class set is determined by the associated data point's exclusion from all of the ellipsoidal contours. Our main objectives in introducing this new method are acquiring detection results similar to previous techniques while maintaining a degree of simplicity and elegance by keeping the number of algorithmic steps small, using only image preparation methods that are experimentally determined to improve detection, and manipulating data clusters which are quadratic in shape rather than Gaussian.

1.2 Current Classification Techniques and Problem Discussion

Most detection schemes attempt to minimize an error metric between results generated by the classifier acting upon dimensionally reduced data and the associated true outcome. Neural networks adjust weights to minimize the error between the training network and the true result. Kernel classifiers such as Radial Basis Functions (RBF) and Support Vector Machines (SVM) minimize a bound on the actual risk or error created by applying a particular set of functions to massaged data points. Model-based and example-based detectors use positive and negative examples of the desired class to build a statistical or parametric criterion by which test material is measured. Unfortunately, human beings are not structurally consistent over time and across samples. When a person is moving, he assumes a cadence which may be measured and used for detection. However, this technique requires the analysis of video frames and an extensive numerical history of some kind [3].
Setting motion aside, the sporadic display of vertical symmetry and the wide color and texture variations across many samples of people make it difficult to encapsulate a single input example into a repeatable pattern of which a specific detection technique could take advantage. Hence current methods tend to limit themselves to a test class with a discrete number of positions and tight size restrictions. Specialized hardware may be used to ease data dissection. An infrared imager creates a heat map of the target and allows the data signature of humans to be more distinctive than that produced by a digital camera [4]. Two-camera systems have been developed that produce a disparity map of the environment so that the background may be eliminated from the images and only foreground objects are considered [5]. Hardware solutions are useful but are also expensive. Training data which are needed to make adjustments to the detector weights or conditions are usually manually manipulated to remove other unconstrained objects and the background, which is typically not homogeneous. The extensive preparation time required to train a detector warrants simplification of the training technique and perhaps a more interactive approach. We attempt to address all of these issues by first performing only those preprocessing procedures on the input data that are experimentally determined to improve human detection. We do very little preparation to the training images and automate the bootstrapping process during detector training so that false positives are efficiently reduced. Lastly, we choose a distribution type that is amenable to compact representation and has malleable attributes.

1.3 Thesis Organization

Chapter 2 outlines techniques previously employed for the detection of human beings and lists their strengths and weaknesses. Chapter 3 describes how an input image is segmented into a set of high dimensional data points acceptable to the detector.
Chapter 4 defines many preprocessing schemes handled by current object detection methods and reports how they affect the preprocessing of our input data. Chapter 5 gives the theoretical basis behind the proposed detector algorithm and describes each step in the training process. We compile these preprocessing techniques and associate each approach with an independent detector scheme. Each scheme is functionally composed of a specific set of preprocessing techniques and the universal trainer or detector. The trainer creates quadratic distributions based on the clustering behavior of the preprocessed image data, and the detector tests the inclusion of a new image within these distributions. Chapter 6 relates the experiments performed in the detection process by detailing the types of preprocessing methods contained in each detector scheme. The same database of positive and negative training images is used to train each detector scheme, and Chapter 7 displays the experimental results found when each scheme examines an image database different from the training one. Chapter 8 evaluates the presented approach and describes future work which may be done in this area.

CHAPTER 2
PROBLEM REVIEW

2.1 Introduction

Here we present a review of some of the current techniques employed to solve the problem of people detection. They are explained here to provide a basis for our approach. The technique of Oren et al. [2] reduces the problem space into a set of meaningful frequency components. With these components, the authors create a parametric pedestrian model by minimizing a risk function. The method of Poggio and Sung [6] feeds distance metrics related to the Gaussian distributions of face image training data into a neural network. The method of Zhao and Thorpe [7] elicits the help of stereo vision to separate the background from foreground objects and inputs the gradient of potential people in the foreground into a neural network. The method of Rowley et al.
[8] introduces carefully chosen parts of face images to a sequence of arbitrating neural networks. The technique of Broggi et al. [5] comes up with systematic morphological rules that are applied alongside stereo vision towards the detection of people in images.

2.2 Wavelet Templates and Support Vector Machines

The technique of Oren et al. [2] focuses on the differences in intensity between pedestrians and the background, or the relative intensities and positions of pedestrian boundaries coupled with homogeneous interiors. The authors concentrate on structural cues because a pedestrian's colors are not constrained, and the colors and textures of the background are not consistent. A wavelet transform approximates nonstationary signals with sharp discontinuities at varying scales. Hence, the structure of a person lends itself to the use of wavelet coefficients to differentiate people from nonpeople, and in this application, a wavelet transform is used as a human edge detection algorithm. A redundant set of Haar basis functions is used to completely capture the relationships of the average pixel intensities between neighboring regions of an image. They apply the Discrete Wavelet Transform (DWT) along three orientations to generate wavelet coefficients at two different scales. Coefficients are produced for vertical, horizontal, and diagonal passes of the transform, and both 32x32 and 16x16 pixel block scales are tested. In the 32x32 scale, one coefficient represents the energy of the signal localized both in time and frequency within the corresponding 32x32 block. A similar method is used for the 16x16 scale. Coefficients generated from a training database are compiled into a template. The wavelet coefficients are calculated for each color channel (RGB) and for each orientation in an image. The largest absolute value over all of the color channels becomes the corresponding coefficient for that orientation in the image.
The coefficients for each orientation are normalized separately over all of the coefficients in the image and averaged over all of the images in the pedestrian image database. The resulting array of coefficients is the pedestrian template for each orientation. The pedestrian training image database consists of 564 color images of pedestrians in frontal and rear positions within a 128x64 pixel frame. A nonpedestrian template is also created using 597 color images of natural scenes within a 128x64 frame. By visual inspection, the authors select significant coefficients from the template which pinpoint areas in the image important for pedestrian classification. Consequently, 29 coefficients are used to form a feature vector for the classification effort. During detection, the system moves a 128x64 window throughout the entire space of the input image. Bootstrapping aids the problem of the overwhelming negative class space. False positive detections during testing are grouped into the negative training image database, and the system is retrained. The system is not adaptable, since the whole training image database must be submitted to the algorithm instead of just the new images when retraining is necessary. Two methods of classification are used. The first is a simple technique called basic template matching, where the ratio of feature vector values in agreement is calculated for each new input image. The second method utilizes the support vector machine. After three bootstrapping sessions, the system trains from 4,597 negative images. From the 141 high quality pedestrian test images, the classifier exhibits a detection rate of 52.7% using basic template matching with 1 false positive per 5,000 windows examined. With the support vector classifier, the system has a detection rate of 69.7% and a false positive detect for every 15,000 windows examined.
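The ratio-based decision in basic template matching can be sketched as follows. This is only an illustrative stand-in: the agreement rule here (relative difference below a tolerance) is an assumption for demonstration, and the function names and parameter values are not those of Oren et al.

```python
import numpy as np

def template_match_ratio(features, template, tol=0.25):
    """Fraction of feature coefficients that 'agree' with the template.

    Agreement is approximated as a relative difference below tol; the
    actual agreement criterion in the original work differs. All names
    and values here are illustrative.
    """
    f = np.asarray(features, dtype=float)
    t = np.asarray(template, dtype=float)
    agree = np.abs(f - t) <= tol * np.maximum(np.abs(t), 1e-9)
    return float(agree.mean())

def is_pedestrian(features, template, threshold=0.7):
    """Classify as pedestrian when the agreement ratio clears a threshold."""
    return template_match_ratio(features, template) >= threshold
```

In the reviewed system the feature vector would hold the 29 selected wavelet coefficients; here any equal-length vectors work.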
2.3 Gaussian Distribution-Based Model and Neural Networks

The approach of Poggio and Sung [6] detects unoccluded vertical frontal views of faces by fitting hand-tuned Gaussian distributions upon example data. They formulate a model of faces and a model of nonfaces. Each training image contains a single face which fits inside a 19x19 pixel mask. A feature vector for the classification distribution is defined by the absolute pixel intensities of the unmasked pixels. Hence, each input image translates into a vector in R^283 space. An elliptical k-means clustering algorithm groups 4,150 examples of positive data and 6,189 examples of negative data into a predefined number of clusters, and twelve Gaussian boundaries are placed upon the groupings. Six Gaussians, each with a centroid and covariance matrix, are placed upon the positive points, and six are positioned upon the corresponding negative data. Fitting the positive sample space with one Gaussian distribution is not sufficient because there is too much overlap between the positive Gaussian and nonface example feature vectors. In some cases, nonface patterns lie closer to the positive Gaussian distribution centroid than a true face arrangement. The relationship between incoming feature vectors and the existing face model is encoded into a two-value distance metric. The first distance is called the Mahalanobis distance. It represents a separation in units of standard deviations of the input point from the cluster distribution. The 75 largest eigenvectors of the Gaussian are used as the discriminating vector space to reduce the chance of overfitting the metric. The second distance is a generic Euclidean distance. It measures the unbiased separation between the input point and the cluster mean within the subspace spanned by the same 75 largest eigenvectors. For each test input, a multilayer perceptron (MLP) is trained with the 12 pairs of distances as the inputs and one binary output.
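The distance pair fed to the MLP can be sketched as follows. This is a hedged illustration of the idea described above, not the original implementation: the function name and the eigendecomposition route are assumptions, with `k=75` following the text.

```python
import numpy as np

def distance_pair(x, mean, cov, k=75):
    """Compute the (Mahalanobis, Euclidean) pair for one cluster.

    Both distances are taken within the subspace spanned by the k
    largest eigenvectors of the cluster covariance, as described in
    the text. Identifiers are illustrative.
    """
    vals, vecs = np.linalg.eigh(cov)            # eigenvalues ascending
    vals = vals[::-1][:k]                       # keep k largest
    vecs = vecs[:, ::-1][:, :k]
    # Project the offset from the cluster mean into the k-dim subspace.
    p = vecs.T @ (np.asarray(x, dtype=float) - np.asarray(mean, dtype=float))
    mahalanobis = float(np.sqrt(np.sum(p**2 / vals)))
    euclidean = float(np.linalg.norm(p))
    return mahalanobis, euclidean
```

With 12 clusters, concatenating the 12 pairs gives the 24-element input vector for the network.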
They train the network using the same 4,150 positive images used to create the positive Gaussian clusters and 43,166 negative images which include the 6,189 patterns used to create the negative Gaussian clusters. The rest of the negative training images are selected with a bootstrapping methodology. Images generating a false positive detection by the neural network are added to the negative image collection for further training. Representing a distribution with a centroid and covariance matrix is difficult for high dimensional vector spaces because the number of free parameters in the model is directly proportional to the square of the number of dimensions, and the parameters must be recovered from training data for the detector to be robust. One may reduce the number of model parameters by focusing on the "significant" eigenvectors in the covariance matrix. The significant eigenvectors in the Gaussians correlate to the prominent pixel features in a face image. This information is encoded in the Mahalanobis distance. The less prominent pixel features are encoded in the less significant eigenvectors. The Euclidean distance is supposed to account for the less salient facial characterizations. The search for faces in an image is done over all image locations and at a single scale. Their system has a 96.3% detection rate on a test database of 301 CCD images of people with 3 false positives. On a more challenging database of 23 cluttered images with 149 face patterns, their system has a detection rate of 79.9% with 5 false positives.

2.4 Stereovision and Neural Networks

The technique of Zhao and Thorpe [7] is a real-time pedestrian detector using two moving cameras and specialized segmentation software. The real-time stereo system Small Vision System (SVS), developed by SRI, constructs a disparity map of the input image based on color and spatial cues so that objects in the foreground of the image may be distinguished from those in the background.
Hence, the segmentation software is not influenced by drastic lighting changes, object occlusion, or color variation. A neural network trained with back-propagation is fed the intensity gradient of the resulting foreground partition. Since the background is removed from the training and testing images by stereo image analysis and the neural network learns from examples, their method requires no a priori model or background image. From the disparity map of an input image, they use thresholding to remove background objects. They then smoothly group together objects of similar disparity, and rule out groupings that are too small or too big to be pedestrians. Small pixel blobs that are near each other with close disparity values are integrated into one big blob. Large regions undergo a verification process where subregions are analyzed for the presence of pedestrians and then split apart if multiple positive detections exist. Pedestrians have a high degree of variability in texture and color, so absolute pixel intensities are not used as input information for the detector. Instead, they use the intensity gradient of the pixel groupings in the foreground still found to be potential people as the input vectors to the neural network. The effects of the preprocessing phase are constrained to a 30x65 window, and the region values are linearly normalized to numbers between 0 and 1. A three-layer feed-forward network is trained with 1,012 positive images of pedestrians and 4,306 negative images. Bootstrapping is used to improve system performance. The network weights are initialized to small random numbers before training, and detection is finalized by thresholding the output of the trained network. The system is tested with 8,400 images of pedestrians and other objects in cluttered city scenes.
They achieve a detection rate of 85.2% and a false positive rate of 3.1%. The system performs segmentation and detection on two 320x240 images at a frame rate ranging from 3 frames/second to 12 frames/second. The system fails when objects that are structurally similar to humans are presented and when occlusion is extreme or the color of the person is similar to that of the background.

2.5 Neural Network Overload

The method of Rowley et al. [8] uses a neural network to detect upright, frontal faces in greyscale images. The training images are specially customized for the algorithm. The eyes, tip of the nose, and corner and center of the mouth for each training image face are labeled manually so that they can be normalized to the same scale, orientation, and position. Bootstrapping is used to solve the problem of finding representative images for the nonface category. False positive images are added to the training set during successive phases of training and testing. Bootstrapping negative images reduces the number of images needed in the training set. A neural network is applied to every 20x20 pixel block in an image, and detection of faces at different scales is achieved by applying the filter to an input image that is subsampled. A preprocessing step is performed on the input image before it is passed through the neural network. The first step in the preprocessing phase equalizes the brightness in an oval region inside the 20x20 pixel block, and the second step performs histogram equalization within the resulting oval. The network has retinal connections to its input layer. Four hidden units look at 10x10 pixel subblocks, sixteen look at 5x5 pixel subblocks, and six look at 20x5 pixel stripes. These regions are specifically hand-chosen so that the hidden units learn features unique to faces. The stripes identify mouths or eyes, and the square regions see a nose, individual eyes, or the corner of a mouth.
The network has a single output signifying the presence or absence of a face. For training, 1,050 images of faces of varying size, position, orientation, and brightness are gathered and manually massaged into images uniform over all training data by creating a mapping of specific pixel locations to face features. The mapping itself scales, rotates, and translates the input image by a least squares algorithm that is run to convergence for each image. Once a uniform image is made, variants of the image are created by rotating, scaling, and translating the model. Nonface images are generated randomly, and the nonface training database is formed by a bootstrapping technique. The network is trained using standard error backpropagation with momentum and initial random weights. Resultant weights from a previous training iteration are used in the next iteration. Random images that generate false positive detections are added to the database for further training. Generation of random data forces the network to set a precise boundary between faces and nonfaces. Two heuristics are introduced to reduce the number of false positives in the initial neural network. Since the network is somewhat invariant to the position of the face up to a few pixels, multiple detections within a specified neighborhood of position and scale are thresholded. The pixel neighborhood and the number of detections found in the neighborhood are the two parameters used. A number of detections greater than the threshold implies a positive detection, and the centroid of the neighborhood is scrutinized again for the presence of a face. A number of detections fewer than the threshold implies a no-detect. The second heuristic involves result arbitration from multiple neural networks. Each neural network is trained with the same positive image database, but because the set of negative images is randomly chosen from the bootstrap images, the order of presentation and the negative examples themselves differ.
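The neighborhood-voting heuristic described above can be sketched as follows. This is an illustrative simplification, not Rowley et al.'s code: it ignores scale, and the `radius` and `min_votes` values are placeholders, not the parameters used in the original system.

```python
def merge_detections(detections, radius=4, min_votes=3):
    """Collapse nearby detections; keep a centroid only when enough
    detections fall in the same neighborhood (votes >= min_votes).

    detections: list of (x, y) window positions flagged by the network.
    Returns a list of surviving (x, y) centroids. Names and values
    are illustrative.
    """
    kept = []
    used = [False] * len(detections)
    for i, (x, y) in enumerate(detections):
        if used[i]:
            continue
        # Gather every unused detection within the square neighborhood.
        group = [j for j, (u, v) in enumerate(detections)
                 if not used[j] and abs(u - x) <= radius and abs(v - y) <= radius]
        if len(group) >= min_votes:
            cx = sum(detections[j][0] for j in group) / len(group)
            cy = sum(detections[j][1] for j in group) / len(group)
            kept.append((cx, cy))
        for j in group:
            used[j] = True
    return kept
```

Isolated detections fail the vote and are discarded, which is exactly how the heuristic suppresses sporadic false positives.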
Also, the initial weights may differ because they are generated randomly. They try ANDing and ORing the results of two similarly trained networks. Three networks voting on a result is also tried. Lastly, they train a neural network to govern the decisions of the arbitrating neural networks to see if such a scheme yields better results than simple boolean functions. Sensitivity analysis is performed on all the networks to determine which features in the face most greatly influence detection. It turns out that the detectors rely heavily on the eyes, then the nose, and then the mouth. Many different networks are tested with two large data sets containing images different from the training images. The first set consists of 130 images collected at CMU with multiple people in front of cluttered backgrounds. The second set is a subset of the FERET database, and each image in the second set consists of only one face, has a uniform background, and has good lighting. The detection rate of all tried systems on the first data set ranges from 77.9% to 90.3%. ORing the arbitration networks yields the best detection rate but also contributes the most false positives. In the second set of data images, detection success ranges from 97.8% to 100.0% for frontal faces and faces turned less than 15 degrees from the camera. A detection rate range of 91.5% to 97.4% is achieved on faces turned 22.5 degrees from the camera. They determine that the system with two ANDed arbitrating networks produces the best tradeoff between detection rate and false positives. It has a detection rate of 86.2% with a false detect rate of 1 per 3,613,009 test windows on the first test set. On the second test set, it has an average detection rate of 98.1% on the faces at all orientations. The best system takes 383 seconds to process a 320x240 pixel image on a 200MHz R4400 SGI Indigo 2.
After modifying the system to allow bigger search windows in steps of 10 pixels, the processing time is reduced to 7.2 seconds, but with the side effect of having more false detects and a lower detection rate.

2.6 Shape-Based Pedestrian Detection

The procedure of Broggi et al. [5] presents a model-based method to detect pedestrians from a moving vehicle with two cameras. The core technique is a model-based approach which focuses on the vertical symmetry and the presence of texture in humans. It checks for human morphological characteristics by applying rules to a pixel-level analysis. However, other approaches are used to refine the results. Analysis of stereo disparities in the images provides distance information and gives an indication of the bottom boundary of the pedestrian. Also, an image history is kept to further filter the morphological results. Their system is an additional feature of the ARGO Project, an autopilot mechanism for a vehicle. A greyscale input image is downsampled to a 256x288 pixel block, and a localized region of highest probable pedestrian existence is transformed by a Sobel operator to extract the magnitude and orientation of the edges in the image. Binary edge maps are created of vertical and horizontal edges, and background edges are eliminated from the maps by subtraction of the thresholded and shifted stereo images. They run the resulting binary maps through a filter that concatenates densely packed objects and removes small sparse blobs. A vertical symmetry map is created from the filtered vertical edges map by scanning the image horizontally for vertical symmetries. Humans have a high degree of vertical symmetry but much less horizontal symmetry. Under this assumption, nonhuman objects are ruled out by analyzing the horizontal edges map for horizontal symmetries.
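The horizontal scan for vertical symmetries can be sketched as follows. This is a simplified illustration of the idea, not Broggi et al.'s algorithm: the scoring rule (counting mirrored edge-pixel pairs about each candidate axis) and all identifiers are assumptions.

```python
import numpy as np

def vertical_symmetry_map(edges, half_width=8):
    """Score each column as a candidate vertical-symmetry axis.

    edges: 2-D binary (0/1) map of vertical edge pixels. For each
    column c, count edge pixels that have a mirrored partner across c
    within half_width columns. Higher scores mean stronger vertical
    symmetry about that axis. Parameter names are illustrative.
    """
    h, w = edges.shape
    scores = np.zeros(w)
    for c in range(w):
        for d in range(1, half_width + 1):
            if c - d >= 0 and c + d < w:
                # A mirrored column pair contributes where both have edges.
                scores[c] += np.sum(edges[:, c - d] & edges[:, c + d])
    return scores
```

The column with the maximum score marks the most likely symmetry axis of an upright figure; the same scan applied to the horizontal edge map would penalize objects that are also horizontally symmetric.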
A linear density map of the horizontal edge pixels is superimposed on the vertical symmetry map with experimentally determined coefficients to create a probability map of human presence. This technique eliminates objects having both strong vertical and strong horizontal symmetry. The probability map is bolstered by considering a history of images and image entropy, since objects that are uniform typically are not human. The widths of the remaining objects are determined by counting the number of pixels in each vertical edge about the symmetry axis. They choose the boundary to be the column with the highest pixel count on each side. The Sobel map is scanned for a head matching one of a set of predefined binary models of different sizes. The model is constructed by hand-combining features that sample human heads have in common. The bottom boundary is determined by finding an open row of pixels in the vertical edge map between the left and right boundaries of the body. Distance information is calculated based on a combination of prior camera calibration knowledge and the position of the bottom boundary. The bottom boundary is then refined by comparing the calculated distance to the distance determined by comparing the position of the pedestrian in the two stereo images. More rules are checked as the final bounding box is fit for size constraints, aspect ratio, and combined distance and size restrictions. Bounding box construction is sometimes not very accurate concerning the head's position and the detection of lateral borders, and no detection results are presented.

CHAPTER 3
IMAGE SEGMENTATION AND DIMENSIONALITY REDUCTION

3.1 Introduction

The segmentation part of our approach effectively transforms a 220x220 input image into a set of overlapping 109x28 rectangles and then further filters each rectangle into a point in R^144 space.
This low-pass filter averages the pixel brightnesses of overlapping regions in each rectangle and then uses the resulting values as coordinates of a feature vector in R144. Reasons for using this particular method are given below. Typically, in example-based learning schemes, the detector is trained with templates of the target class. Training such a detector involves learning relationships between features in the template. Efficient systems reduce the dimensionality of the features to a smaller number which still retains the integrity of the original pattern. In addition, increasing the number of search windows increases the effectiveness of the analysis because example-based classifiers rely on position and rotation invariance when transferring knowledge from training examples to the test cases. More windows mean less error margin in the object's position and orientation within a detection frame [9]. In order for the system to detect people at different scales, two options exist: either the input image may be downsampled to reduce the size of the features, or larger windows may be introduced [7, 8]. In either case, scanning an image for features tends to be computationally expensive and, in many cases, is the bottleneck of any learning-based classification scheme. Current methods that do not use training examples rely on a priori models as the reference data. Their high detection speed is offset by the complexity of their inference rules and by their inflexibility. We choose the simplicity of a model free of a priori rules, and we avoid scanning every single rectangle in the input image because we average overlapped pixel regions. Dimensionality reduction can be achieved by paring down the effective components of the feature space. It reduces computation and prevents the system from overfitting the decision surface to the training data.
Using an algorithm that retains all of the bases of the feature space spends too much time computing details that are not unique to the desired object class. Principal Component Analysis (PCA) is typically used to reconstruct a decision space with a subset of its eigenvectors. For any subspace of the trained decision surface, a subset of the eigenvectors spans a subspace whose signal power contributes the least error relative to the original signal. However, the number of eigenvectors necessary to successfully reconstruct a desired object space depends heavily on the amount of training data and the number of pixels in each image [10]. We use a simplified aspect of the technique of Oren et al. [2] to dimensionally reduce the data set. The filter we employ is a 7x7 mean filter applied every 4 pixels. It yields a low-frequency representation of an image rectangle. 3.2 Partitioning the Image Since our experiments focus on the success of the ellipsoidal distribution algorithm instead of the viability of the system at multiple scales, we regard only one scale of pedestrian. However, the method may easily be applied to people of larger or smaller size. The system begins the testing phase of people classification by analyzing a 220x220 greyscale image. The image is divided into two nonoverlapping 109x220 rows and two similarly nonoverlapping 220x109 columns. Within each row, 13 equally spaced 109x109 subimages are selected in a staggered arrangement such that each subimage shares 100 pixels with the previous subimage. Previous techniques motivate the overlapping of template windows [2, 6, 8]. Further segmentation and dimensionality reduction of the space render the analysis of every possible search window unnecessary. Each column has 13 similarly sized and positioned subimages, and duplicates arise at the four corners. There are a total of 52 subimages in the whole image. The placement of staggered subimages in a 220x220 input image is shown in Figure 31. Figure 31.
Clarification of the placement of staggered subimages in the input image. Each arrow contours a nonoverlapping 109x220 row or column, and the staggered 109x109 subimages lie along the arrows within each row and column. The image is further divided into 10x10 pixel blocks. They each overlap by one pixel in order to completely fill the subimages, and there are 144 blocks per subimage. Each block is made up of 4 overlapping 7x7 halfblocks, and the pixel depth of each overlapping halfblock is 4. Figure 32 demonstrates the arrangement of halfblocks within an image block. There are 576 halfblocks per subimage. The mean of the pixel intensities within each halfblock is assigned to the entire halfblock. In essence, the halfblocks are discretized into pixel-like collectives by a 7-pixel-wide low-pass filter. A priori information about the aspect ratio of a human form finalizes the size of the search window. A human's height is approximately four times larger than the width in most positions. On account of this, we take each subimage (109x109 pixels, or 24x24 collective halfblocks) and split it into 4 vertical strips of 109x28 pixels, or 24x6 collective halfblocks. The resulting windows tightly encompass pedestrians having a scale of 109x28 pixels in most poses. Figure 32. Arrangement of halfblocks within an image block. The segmentation algorithm partitions the image into 4 overlapping halfblocks within each block. Each square within the grid represents a pixel in the image, and the thick lines represent block boundaries. Figure 33 shows an example image of a pedestrian and the result of each segmentation step. 3.3 Dimensionality Reduction The overlapping halfblocks become distinct collectives, yet they share information with each other because neighboring halfblock brightnesses are encoded in each collective halfblock intensity value.
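A minimal sketch of this averaging step in plain Python (assuming a simple k x k mean sampled on a fixed stride; the thesis's exact block/half-block bookkeeping and overlap handling are simplified away):

```python
def halfblock_means(strip, k=7, stride=4):
    """Replace each k x k half-block with its mean intensity, sampled
    every `stride` pixels across the strip; the means become the
    coordinates of the low-frequency feature vector."""
    h, w = len(strip), len(strip[0])
    feats = []
    for i in range(0, h - k + 1, stride):
        for j in range(0, w - k + 1, stride):
            total = sum(strip[r][c]
                        for r in range(i, i + k)
                        for c in range(j, j + k))
            feats.append(total / (k * k))
    return feats
```

On a uniform strip every coordinate equals the common intensity; on a real image each coordinate blends a 7x7 neighborhood, which is what lets the features tolerate small shifts of the subject.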
This has the effect of spreading the intensities across many pixels. Residual feature information may be transferred to areas of the image as far as 20 pixels away in both the x and the y directions. Changing the object components from pixels to collective halfblocks effectively reduces the dimensionality of the feature space from 3,080 down to 144, a 95% reduction. Figure 33. Steps for segmenting an example image of a person. A) Undoctored 220x220 input image. B) Magnified view of a subimage containing the person, consisting of 24x24 collective halfblocks. C) Magnified view of the vertical strip containing the person, consisting of 24x6 collective halfblocks. Pixel sharing between neighboring halfblocks within each block gives a 4-pixel maximum variance in the horizontal and vertical directions. An object may shift by as much as four pixels vertically or horizontally, or may rotate by as much as 3-4 degrees, and still maintain a presence within the same collective halfblocks. 3.4 Published Methods We again review the people detection techniques introduced in Chapter 2 but focus only on the segmentation and dimensionality reduction approaches of each method to present points of comparison with our method. Rowley et al. [8] initially look at every 20x20 pixel block in a test image, and they later change the detection windows to 30x30 pixel blocks in steps of 10 pixels to reduce the computation time for an image from 383 seconds to 7.2 seconds. Upon finding a positive result in a 30x30 block, they more closely scrutinize the area with their standard 20x20 detector. The neural networks they use discern features within the face, and their detection windows must be kept small.
The number of hidden units is experimentally determined, but there is a fine line between the number of hidden units required to capture an underlying trend in a decision space and the number of hidden units that will fit the intricate details of the training data but not extract the fundamental pattern. Poggio and Sung [6] look at every 19x19 subregion location in the primary image during testing. They retain the full dimensionality of the 19x19 space (283 dimensions) and fit 6 separate Gaussian distributions to the training data in 283-space using an elliptical k-means algorithm. Oren et al. [2] move a 128x64 window throughout all the positions in a test image. They subjectively choose 29 "significant" wavelet coefficients which indicate regions of "strong intensity-change" or regions of "no intensity-change" in the learned wavelet template. These 29 coefficients form the feature vector responsible for classification of people. CHAPTER 4 PREPROCESSING OF AN IMAGE SEGMENT 4.1 Introduction Input image preprocessing is the transformation of cryptic data into information that is amenable to a training or testing algorithm. Whichever technique is used must enhance the qualities unique to the target class while deemphasizing externally motivated transients. Most current classification schemes employ preprocessing techniques, but to different extents. We apply basic image filtering to achieve results comparable to those of other techniques. Our method finds the algorithms that increase the rate of detection of people and decrease the number of false positives. Hence, we compare test results from several preprocessing schemes instead of subjectively choosing the final method of image preparation. The techniques applied in previous classification methods prompt those used here: brightness equalization [8], histogram equalization [8], contrast stretching [8, 2, 7], horizontal intensity differencing [2], and vertical intensity differencing [2].
There are many more methods to unify images in unpredictable lighting situations than we have enumerated here, but these are among the practices that show up repeatedly in the detection methodologies we analyzed. 4.2 Brightness Equalization There are several methods used to reduce unwanted global or partial brightness variations caused by a changing environment. One of these is brightness equalization, or level shifting. Some intensity changes exemplify a property of an object in uniform lighting conditions, while others are provoked by a localized external light source or sink. A lighting equalization operator reduces the effects of luminance shifts caused by focused light variations. Current preprocessing techniques try to filter out localized intensity differentials, typically by applying a polynomial transform to the pixel intensities of the entire image. Figure 41 shows the result of this background leveling technique on the uniformity of the intensities. Level shifting is successful if the background level changes gradually and can be modeled by a polynomial. Figure 41. Effects of brightness equalization on intensity uniformity. A) Image of rice with localized brightness nonuniformities. B) Same image after a linear brightness equalization filter is applied. Rowley et al. [8] execute brightness equalization on every similarly masked 20x20 oval in their search space before feeding them to their system of neural networks. Their luminance equalization filter is essentially a linear function fitted to the average intensity values of small pixel regions within the image. We perform luminance equalization upon the 220x220 greyscale image before segmentation and dimensionality reduction take place [8]. We fit piecewise continuous linear functions to both the brightest and darkest pixels in the image. This has the advantage of keeping both the background and the contrast of the image consistent.
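A one-dimensional sketch of the leveling idea (fitting a least-squares line to per-row mean intensities and subtracting it; the thesis fits piecewise continuous linear functions to the brightest and darkest pixels, which is more involved):

```python
def level_rows(image):
    """Fit a least-squares line to the per-row mean intensities and
    subtract it, flattening a gradual vertical brightness gradient.
    image is a list of rows of pixel intensities."""
    means = [sum(row) / len(row) for row in image]
    n = len(means)
    mx = (n - 1) / 2                      # mean of row indices 0..n-1
    my = sum(means) / n                   # mean of the row means
    cov = sum((i - mx) * (m - my) for i, m in enumerate(means))
    var = sum((i - mx) ** 2 for i in range(n))
    slope = cov / var
    # Subtract the fitted line's value from every pixel in each row.
    return [[p - (my + slope * (i - mx)) for p in row]
            for i, row in enumerate(image)]
```

An image whose brightness rises linearly from top to bottom is flattened to zero mean everywhere, which is exactly the kind of gradual background drift level shifting is meant to remove.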
4.3 Histogram Equalization and Contrast Stretching Histogram equalization transforms the contrast and range of a set of pixel intensity values by providing a typically nonlinear mapping of the original values. In effect, the brightness histogram of the resulting image becomes uniform, or flat, after its application. This technique is useful for picking out details in an image that are difficult for humans to see, or for helping a classifier distinguish objects known to have densely represented intensity values. However, relative pixel information across the image is not preserved. Contrast stretching, on the other hand, linearly scales the input pixel intensities so that the range of values is stretched to desired minimum and maximum bounds. This technique preserves intensity differences throughout the entire image. Figure 42 shows the effects of contrast stretching and histogram equalization on an example image. In addition to brightness equalization, Rowley et al. [8] apply histogram equalization and contrast stretching to the search space. Oren et al. [2] normalize the wavelet coefficients of their training images. Zhao and Thorpe [7] normalize the values of the output of an edge detection algorithm before they are input to a neural network. We execute contrast stretching and histogram equalization after the segmentation and dimensionality reduction step. The contrast stretching step (also called normalization) changes the dynamic range of the pixel values to one bounded by 0 and 100. Figure 43 shows the vertical strip of the pedestrian from Chapter 3 before and after contrast stretching and histogram equalization respectively. 4.4 Horizontal and Vertical Intensity Differencing The next phase involves the implementation of two very simple region difference extraction filters. Good results have been generated from the technique of Oren et al.
[2], which uses wavelets to encode region intensity differences for feature extraction, because relative quantities eliminate low-frequency noise and more consistently explicate human shape. In fact, the application of Haar wavelets uses our same differencing scheme when generating the coefficients of highest frequency. Figure 42. Contrast stretching and histogram equalization filters applied to a sample image. A) Image of lunar surface. B) Image after applied contrast stretching filter. C) Image after applied histogram equalization filter. It seems logical that we use a high-frequency differencing scheme after extracting low-frequency information from dimensionality reduction. We exercise the spirit of the approach but simplify the application. One method, called horizontal differencing, replaces absolute pixel values with differences calculated horizontally. If i represents an image row, j represents an image column in the 24x6 vertical strip from Chapter 3, and x_ij signifies the intensity value at pixel location (i, j), then the horizontal differencing technique achieves the following: x'_ij = x_ij - x_i(j+1) for 1 <= j < 6, and x'_ij = 0 for j = 6. (4.1) The second technique also finds relative pixel intensities, but along the columns of the image instead of the rows: x'_ij = x_ij - x_(i+1)j for 1 <= i < 24, and x'_ij = 0 for i = 24. (4.2) Figure 43. Contrast stretching and histogram equalization filters applied to an image of a pedestrian. A) 24x6 vertical strip image of pedestrian from Chapter 3. B) Image after histogram equalization and normalization filters are applied. Figure 44 shows the horizontal and vertical differencing effects on the 24x6 vertical strip pedestrian image. This preprocessing step is performed after segmentation and dimensionality reduction. The end result from the analysis of one 220x220 image is a set of 208 feature vectors in 144-space, and the image is ready for either classifier training or people recognition. Figure 44.
Horizontal and vertical differencing techniques applied to an image of a pedestrian. A) 24x6 vertical strip image of pedestrian. B) Image after horizontal pixel differencing is applied. C) Image after vertical pixel differencing is applied. CHAPTER 5 CLASSIFIER ALGORITHM 5.1 Background Based on the discussion in Chapters 3 and 4, we know that the dimensionality of our feature vectors of people is 144. The space of all possible vectors v ∈ R144 signifies every possible segmented and preprocessed image input into our system. We wish to delineate a subset of these vectors within several ellipsoidal boundaries. The vectors within the ellipsoids refer to representations of people, while vectors outside of the ellipsoids ideally are representations of nonpeople. Previous work has been done to classify high-dimensional image features with precisely placed Gaussian distributions [6]. We instead allow the system itself to place ellipsoids upon the data and adaptively stretch or contract them during training. Neither the number of ellipsoids nor the major or minor axis distances are constrained; positive examples of people are consumed by expanding ellipsoids, and negative examples are rejected via ellipsoidal contraction. Figure 51 demonstrates dilation and contraction for the two-dimensional case. Dilation and contraction operations are encoded into the construction of linear transforms applied to positive and negative input vectors. The testing phase consists of inclusion tests of points within the established ellipsoidal contours. A good starting point for the analysis of the boundary conditions is a look at spheroids, because they are the simplest ellipsoids, and the boundary test for a point's inclusion in an ellipsoid builds on the analogous test for a spheroid.
We formulate a method of testing data inclusion within a general spheroid in 144-space in Section 5.2, based on the technique used by Kositsky and Ullman [1], and then we introduce the extensions required to adapt the boundary type to an ellipsoidal one in Section 5.3, also based on the same authors' work. Figure 51. Demonstration of ellipsoidal dilation and contraction. A two-dimensional ellipsoid dilates to capture a positive point P and contracts to throw out a negative point N. Section 5.4 discusses the combination of the two transforms and provides the finer points of the complete operator. 5.2 Linear Transform for Spheroids A spheroid is an ellipsoid whose axes are all equal in magnitude. This equality simplifies the equation of a spheroid into a sum of n squared terms: x_1^2/r^2 + x_2^2/r^2 + ... + x_n^2/r^2 = 1 (5.1) where r is the radius of the spheroid, and n is the spheroid's dimensionality. To determine whether a particular vector is within a given spheroid, a sequence of scalings is performed on each component of the vector. Such scalings contract or expand a point on the boundary of a spheroid of radius r to a point on the boundary of a spheroid of radius 1 if the spheroid is centered at the origin. Points inside or outside of the original spheroid end up in a proportionally equivalent location in the new spheroid. A linear map can perform vector scaling along specific directions.
Such a linear map is defined in this way: there exists a set of vectors D ⊂ R144 and another set of vectors Q ⊂ R144 such that there is a linear map f_c: D → Q where f_c(x) = Lx for some 144x144 matrix L, some reference point c ∈ D, and every x ∈ D. The symbol D refers to the set of vectors, relative to the current spheroidal center c, received by the system for classification or training. The symbol Q represents the set of linearly mapped relative vectors. Each component of the input vector is scaled by the same amount to effectively compress or distend points along a direction which seeks or avoids the center of the spheroid. Hence, the direction along which the scaling is executed depends on the input point, and is always perpendicular to the tangent of the spheroid at the transformed point. Given an input point x, an output vector of the following form is sought: y = f(x) = (1/r) x (5.2) where r is the radius of the original spheroid. The matrix L takes on the following configuration to perform this operation: L = (1/r) I, (5.3) that is, a diagonal matrix whose diagonal entries are all 1/r. The entries in L along the diagonal are just the square roots of the coefficients of Equation 5.1. We are now in a position to perform an inclusion test on the input point. It is important to note that a test point must first be rewritten relative to the center of the spheroid because the matrix L scales the components closer to or farther from the spheroid's center. If the point x is inside or on the boundary of a spheroid of radius r centered at the point c, then the resulting vector L(x - c) is inside or on the boundary of a spheroid of radius 1. If x lies exactly on the original boundary, the magnitude of the new relative vector, |L(x - c)|, is equal to 1. If the norm is less than 1, then the point x is within the original spheroid. If the magnitude is greater than 1, then x is outside the confines of the original spheroid of radius r. Figure 52 demonstrates the transformation in two dimensions.
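In code, the spheroid inclusion test amounts to scaling the relative vector by 1/r and comparing its norm to 1; a minimal sketch in plain Python:

```python
import math

def inside_spheroid(x, c, r):
    """Test |L(x - c)| <= 1 with L = (1/r) I: rewrite the point
    relative to the center c, scale by 1/r, and compare the norm
    of the result to 1."""
    scaled = [(xi - ci) / r for xi, ci in zip(x, c)]
    return math.sqrt(sum(s * s for s in scaled)) <= 1.0
```

The same function works unchanged in 144 dimensions, since only the component-wise scaling and the norm are involved.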
Figure 52. Linear transformation of a spheroid. Transformation of a point x from the boundary of a spheroid of radius r to the boundary of a unit spheroid. A) shows contraction while B) shows expansion. 5.3 Ellipsoidal Transform The formula for a spheroid is derived from the general equation of an ellipsoid, which assumes a more complicated formalization. The equation of an ellipsoid takes the following general structure: sum over i and j from 1 to n of a_ij x_i x_j = 1, with a_ij ∈ R for all i, j, (5.4) where n is the dimensionality of the space. There are more terms in the ellipsoid equation than in the spheroid equation. The morphological reason for this is that an ellipsoid is a spheroid whose boundary point components are scaled in a finite number of directions and by different amounts. The directions correspond to the directions of the major and minor axes, and the scaling amounts refer to the distance discrepancies of the corresponding major and minor axes. On the other hand, a spheroid results when original spheroidal boundary points are scaled over all directions by similar amounts. The extra terms are introduced when the minor and major axes do not coincide with the coordinate axes. Again, we want to transform a hyperdimensional ellipsoid into a unit spheroid because the resulting inclusion test for points becomes trivial. We focus on a point x on the boundary of a given ellipsoid. We assume that only one dilation or contraction in one direction is needed to transform the ellipsoid into a spheroid. We may make this claim because, as we will see, such a transform is linear, and the composition of linear mappings is again linear. In other words, if f(x) = y = Lx and g(y) = z = Ky (5.5) are both linear mappings, then z = g(f(x)) (5.6) or, using the matrix equivalents, z = KLx. (5.7) Since each mapping corresponds to a unidirectional ellipsoidal contraction or dilation, a sequence of expansions and constrictions translates to a sequence of matrix multiplications.
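The composition claim can be checked concretely. In the sketch below, K and L are arbitrary example scalings (illustrative values, not taken from the thesis); applying them in sequence matches applying the single product matrix KL:

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matvec(A, x):
    """Apply a matrix (list of rows) to a vector."""
    return [sum(Ai[j] * x[j] for j in range(len(x))) for Ai in A]

L = [[0.5, 0.0], [0.0, 1.0]]   # contract along the x direction
K = [[1.0, 0.0], [0.0, 2.0]]   # dilate along the y direction
x = [4.0, 3.0]

# Applying f then g equals applying the single operator K L.
z_seq = matvec(K, matvec(L, x))
z_one = matvec(matmul(K, L), x)
```

This is why a whole sequence of unidirectional dilations and contractions collapses into one accumulated matrix.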
We wish to retain proportionality across the transform along the direction being modified, so that the proportion of the point's projection along the modified direction to the major axis of the conic remains constant across the mapping. Given that x̄ is the boundary point on the modification axis, x is any point on the boundary of the ellipsoid, c is the center of the ellipsoid, and remembering that all points are actually vectors in R144 space relative to c, we use the following vector formula to transform x into a point y on the boundary of a unit spheroid centered at c. Figure 53 is a visual representation of the mapping. y = f_x̄(x) = x - ((x̄ · x)/|x̄|^2) x̄ + ((x̄ · x)/|x̄|^2) (x̄/|x̄|) (5.8) Figure 53. Mapping of an ellipsoid to a unit spheroid knowing the dilation axis. Transformation of a point x on the boundary of an ellipsoid to a point y on the boundary of a unit spheroid. Both are centered at point c, and the dilation axis is x̄. The vector projection of x along x̄ is subtracted from x so that the result has no component along the major axis and is perpendicular to this modification axis. Then, the third term in the sum adds the original component of x along the major axis scaled by the inverse of the magnitude of x̄. Since the magnitude of x̄ is always larger than or equal to the projection of x along the major axis, the ratio is always less than or equal to one. The culmination of the sum is a point on the boundary of a unit spheroid centered at c maintaining the same proportional distance to other points along the modified axis. The mapping's corresponding matrix operator, L, has the following form: L = I - (1 - 1/|x̄|) (x̄ x̄^T)/|x̄|^2 (5.9) so that Lx reproduces Equation 5.8. Since y is on the contour of a unit spheroid, the following is true: |y| = |Lx| = 1. (5.10) We have just shown that if we are given an ellipsoid that is one transfer function away from a spheroid, and we know the vector form of the axis of dilation or contraction, then we can determine whether a point relative to the center of the ellipsoid is inside the given ellipsoid. At the same time, we may identify the ellipsoid with the linear operator used to squash or expand it into a unit spheroid, because the boundary is defined by the transform. Since linear transforms compose and an ellipsoid may be portrayed as a spheroid whose points are scaled in a finite number of directions, an ellipsoid in general can be represented by a sequence of matrix multiplications, and each matrix involved in the product represents a single dilation or contraction. 5.4 Classifier Rules The processes of contraction and dilation occur in the following way. Initially, if a new input point cannot be engulfed by any existing ellipsoid, then it becomes the center of a new spheroid of radius r. The radius is a user-defined constant which constrains the volume of the initial hyperdimensional conic. If this value is too large, then the training algorithm must work harder to shrink the ellipsoids, and if this value is too small, enlarging the conics becomes processor intensive. We choose a value on the small side because the set of points that represent the existence of people is a much smaller set than R144. Our starting radius is 10% of each vector component's possible maximum. Ultimately, the exact value matters little because the training algorithm automates the learning process without any a priori constraints. The continual process of reshaping the ellipsoids fits the contours of the conics to the data regardless of the initial value of the spheroid radius.
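The single-direction operator of Section 5.3 is easy to sanity-check numerically. The sketch below assumes the rank-one form L = I - (1 - 1/|x̄|) x̄ x̄^T / |x̄|^2, one compact way of writing the mapping of Equation 5.8: it leaves directions perpendicular to x̄ alone and rescales the x̄ direction so that x̄ itself lands on the unit sphere.

```python
import math

def ellipsoid_to_sphere_operator(x_bar):
    """Rank-one operator L = I - (1 - 1/|x_bar|) x_bar x_bar^T / |x_bar|^2.
    Directions perpendicular to x_bar are untouched; the x_bar direction
    is rescaled so x_bar maps onto the unit sphere."""
    n = len(x_bar)
    norm = math.sqrt(sum(v * v for v in x_bar))
    coef = (1 - 1 / norm) / (norm * norm)
    return [[(1.0 if i == j else 0.0) - coef * x_bar[i] * x_bar[j]
             for j in range(n)] for i in range(n)]

def matvec(A, x):
    return [sum(Ai[j] * x[j] for j in range(len(x))) for Ai in A]
```

For x̄ = (3, 0), for example, L maps x̄ to a vector of length 1 while leaving (0, 5) unchanged, which is the behavior the derivation requires.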
The transform matrix identifies the shape of the ellipsoid, but two points on its boundary determine the directions of dilation and contraction. These points are called the last contraction point (LCP) and the last dilation point (LDP). The LCP is the last negative sample point previously inside the ellipsoid but now on the boundary after the last contraction. The LDP is the last positive sample point previously outside of the ellipsoid but recently captured by it. The points are also vectors that are referenced from the ellipsoid's center. The next dilation axis is always perpendicular to the LCP vector and lies in the plane formed by the LCP, the ellipsoidal center, and the new positive sample point. Analogously, the contraction axis is always perpendicular to the LDP vector and lies in the plane containing the LDP, the center, and the new contraction point. This simple rule prevents an ellipsoid from blindly and immediately reintroducing a point that it purposely expelled, or from throwing out a point that it just acquired. If a dilation is being performed, we must find a dilation axis whose direction is perpendicular to the LCP vector. Analogously, if a contraction is taking place, we must find a compression axis whose direction is perpendicular to the LDP vector. Let the LDP or LCP vector be depicted by v, and let x be the new sample point. The compression/dilation axis, e, is given by: e = x - ((v · x)/|v|^2) v. (5.11) Since e can be scaled by any value and retain its direction, we have K e = |v|^2 x - (v · x) v for some K ∈ R. (5.12) Only the direction of e is important in the dilation or contraction equations, so we do not care about the actual value of K. Based on this information, we can now determine the transfer function for dilation or contraction. We replace x̄ in Equation 5.8 with the vector e: y = f_e(x) = x - ((e · x)/|e|^2) e + (y_e/x_e) ((e · x)/|e|^2) e, (5.13) where x_e = sqrt(|x|^2 - ((v · x)/|v|)^2) (5.14) and y_e = sqrt(1 - ((v · x)/|v|)^2), (5.15) and e is as defined in Equation 5.12.
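Equation 5.11's axis construction is easy to verify numerically; the sketch below builds e by removing from the new sample x its projection on v, and confirms that the result is perpendicular to v:

```python
def axis(v, x):
    """Compression/dilation axis: remove from x its projection on v,
    leaving a vector perpendicular to v in the plane spanned by v
    and x."""
    vv = sum(vi * vi for vi in v)
    coef = sum(vi * xi for vi, xi in zip(v, x)) / vv
    return [xi - coef * vi for vi, xi in zip(v, x)]
```

Because only the direction of e matters, any positive rescaling of the returned vector (such as multiplying through by |v|^2, as in Equation 5.12) is equally valid.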
The extra term on the end, y_e, is the projection of the output point y along e. In Equation 5.8, the corresponding term is 1. The head of the vector e does not necessarily lie on the boundary of the ellipsoid, so the extra term is needed to scale the result properly. Figure 54 displays the aforementioned linear mapping in two dimensions. The analogous linear operator L becomes: L = I - (1 - y_e/x_e) (e e^T)/|e|^2, (5.16) which reproduces Equation 5.13 when applied to x. Figure 54. Mapping of an ellipsoid to a unit spheroid by calculating the dilation axis. Transformation of a point x on the boundary of an ellipsoid to a point y on the boundary of a unit spheroid. Both are centered at point c, and the dilation axis is determined based on the position of the LCP (v). The resulting matrix L specifies a linear scaling in a single direction, along only e, to add or expel an introduced positive or negative sample respectively. The matrix C encodes all of the linear scalings done previous to the current one. A recursive methodology updates the universal operator C in this way: Cnew = Cold L. (5.17) There are certain constraints on an ellipsoid that prevent it from capturing a new input point. The addition of a new point to an existing ellipsoid is possible only when the added point is within the hyperplanes of the prospective ellipsoid. If the projection of the additional point along the LCP vector is greater than the magnitude of the LCP, then it is clear that the ellipsoid cannot capture it, because the LCP must remain on the boundary of the dilated ellipsoid. Figure 55 shows the conceptual difficulty of an ellipsoid capturing a positive point when it is not within its hyperplanes. A test is performed prior to dilation to determine if the Figure 55. Input point not within the hyperplanes of an ellipsoid.
A prospective ellipsoid cannot dilate to capture a positive input point, x, if it is not within the hyperplanes h1 and h2, because v must remain on the boundary of the ellipsoid current ellipsoid is a potential candidate to engulf the new point. Symbolically, the test is equivalent to the following inequality: (v · x)/|v| < |v|, (5.18) where v is the LCP and x is the new sample point. Contractions operate on a sample point within an ellipsoid, so the point is necessarily within the hyperplanes of the conic. Now, an overview of the formation of the ellipsoidal distribution is presented. A system consists of an ordered set of linear operators, E. Each linear operator defines an ellipsoid in R144 space. A positive or negative sample point, x, is introduced to the system. The matrix Ci is the universal linear operator for an existing ellipsoid i. If the point is a positive sample, then the ellipsoid inclusion test is performed with the following inequality: |Ci x| <= 1, Ci ∈ E. (5.19) The inclusion test is done for each ellipsoid i until the inequality is met or E is consumed. If the point is within an ellipsoid, nothing more is done. If no ellipsoid contains the point, then the hyperplane test is performed for each existing ellipsoid i until the test succeeds or the end of E is reached. If the point is within an ellipsoid's hyperplanes, that ellipsoid is stretched to capture the point, and Ci is updated by matrix multiplication with the linear operator for the dilation, L. Otherwise, the point becomes the center of a new ellipsoid, and C becomes (1/r) I. If the sample point is a representation of a nonhuman, and it is within any existing ellipsoid i, that ellipsoid is contracted to place the point on the boundary. The universal operator Ci is updated by multiplication with the contraction operator, L. Otherwise, nothing more is done.
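The rules above can be condensed into a control-flow sketch. This is a heavily simplified illustration, not the thesis's implementation: the dilation step and hyperplane test are stubbed out, contraction is performed along the transformed sample direction rather than the LDP-derived axis of Section 5.4, and the new scaling is composed as L·Cold (applied in the already-transformed coordinates).

```python
import math

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def matvec(A, x):
    return [sum(Ai[j] * x[j] for j in range(len(x))) for Ai in A]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

class Ellipsoid:
    def __init__(self, center, r):
        n = len(center)
        self.center = center
        # A new point becomes the center of a spheroid: C = (1/r) I.
        self.C = [[(1.0 / r if i == j else 0.0) for j in range(n)]
                  for i in range(n)]

    def contains(self, x):
        rel = [xi - ci for xi, ci in zip(x, self.center)]
        return norm(matvec(self.C, rel)) <= 1.0

    def contract_onto(self, x):
        """Shrink so x lands on the boundary (simplified: scale along
        the transformed sample direction, not the LDP-based axis)."""
        u = matvec(self.C, [xi - ci for xi, ci in zip(x, self.center)])
        m = norm(u)
        n = len(u)
        coef = (1.0 / m - 1.0) / (m * m)
        L = [[(1.0 if i == j else 0.0) + coef * u[i] * u[j]
              for j in range(n)] for i in range(n)]
        self.C = matmul(L, self.C)

def train_step(ellipsoids, x, positive, r=1.0):
    """One training update: positive points must end up inside some
    ellipsoid; negative points inside an ellipsoid trigger contraction."""
    if positive:
        if not any(e.contains(x) for e in ellipsoids):
            ellipsoids.append(Ellipsoid(x, r))  # dilation step stubbed out
    else:
        for e in ellipsoids:
            if e.contains(x):
                e.contract_onto(x)
                break
```

After contracting on a negative sample, that sample sits exactly on the updated boundary, so nearby negatives are excluded while points elsewhere in the conic remain covered.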
CHAPTER 6
EXPERIMENTS

6.1 Training the Classifier

The ellipsoidal classifier is trained with approximately 1000 images of people, each 110x55 pixels in size. They are pictures from the same database used to train the SVM classifier in the procedure of Oren et al. [2]. Negative image samples of an equivalent size are also used to train the system. It is difficult to represent the entire class of nonhumans with a marginal number of images; hence, synthetic images are created as negative sample points to train the algorithm. Approximately 150 images of indoor and outdoor scenes were downloaded from the internet, and 850 of the total negative images have randomly generated pixel intensities. Precedent for the use of random data as negative samples is seen in the approach of Rowley et al. [8]. The large size of the nonhuman image class dictates the use of random samples to present a more complete example of the class. Figure 6-1 displays several examples of positive and negative training images.

Figure 6-1. Examples of positive and negative training images.

6.2 Preprocessing Schemes

Several training schemes are produced by coupling different image preparation methods with the ellipsoidal trainer. The experimental results obtained from each trained scheme are compared to determine the best image preprocessing techniques to use for detection. The systems differ in the amount of preprocessing done and the type of feature extraction performed. All of the schemes use the same training images, and all perform segmentation, dimensionality reduction, and normalization of the input images so that the feature space is reduced and unified. Preprocessing scheme 1 additionally performs horizontal differencing on the preprocessed image to create the feature vector. Preprocessing scheme 2 executes vertical differencing instead. Preprocessing scheme 3 introduces brightness equalization to level the background of the raw image before further processing occurs.
It also uses horizontal differencing for feature extraction. Preprocessing scheme 4 is similar to scheme 3 except that it uses vertical instead of horizontal differencing. Preprocessing scheme 5 implements a nonlinear histogram equalization process applied to the segmented and dimensionally reduced vertical strip image; horizontal differencing takes place afterward. Preprocessing scheme 6 is similar to scheme 5 except that it uses vertical differencing of pixel intensities instead of horizontal differencing.

6.3 Testing Phase

Positive and negative images different from the training data are analyzed by each preprocessing scheme, and the corresponding output is fed into the ellipsoidal detector produced via the instruction of the associated training scheme. Two groups of test data are presented to each detector. The first group of test images is divided into subsets of positive and negative images. Subset 1 consists of 142 images of people taken from the same database used for training but different from the training images themselves. These images approach an ideal test group because the size constraint of the subjects is consistent with that of the detector, and the poses are limited to frontal and back orientations. Examples of positive images from test group 1 are shown in Figure 6-2.

Figure 6-2. Examples of positive images from the first test group.

Subset 2 is a collection of 127 negative images that were downloaded from the internet and chosen specifically because they resemble humans structurally; hence, a false detect within this subset is more probable. Some examples from subset 2 are shown in Figure 6-3.

Figure 6-3. Examples of negative images from the first test group.

A second group of images is considered to be more complex test material because the data is not staged and few environmental variables are controlled.
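The horizontal and vertical differencing that distinguish the preprocessing schemes of Section 6.2 can be illustrated with a minimal sketch. This is an assumption about the operation, simple differences of neighboring pixel intensities along each axis, rather than a reproduction of the thesis code.

```python
import numpy as np

def horizontal_difference(img):
    # Difference of each pixel with its horizontal neighbor (scheme 1 style).
    return img[:, 1:] - img[:, :-1]

def vertical_difference(img):
    # Difference of each pixel with its vertical neighbor (scheme 2 style).
    return img[1:, :] - img[:-1, :]

img = np.array([[1, 2, 4],
                [1, 3, 9]], dtype=float)
print(horizontal_difference(img))   # [[1. 2.] [2. 6.]]
print(vertical_difference(img))     # [[0. 1. 5.]]
```

Horizontal differencing responds to vertical edges and vice versa, which is why the choice of direction interacts with the largely vertical symmetry of human silhouettes.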
This test group contains 31 220x220 greyscale images of real-world outdoor scenes taken by a digital camera. The images are processed only by the detector scheme with the best positive and negative detection rates on the group 1 test images. Also, before the second test group is analyzed, the chosen detector undergoes a semiautomated bootstrapping procedure. A webpage is updated every few seconds with an image processed by the detector scheme. The detector draws a box in the image around any area where a person is found; the source of the original image is a digital camera pointed outside at an area where people frequently walk. When the webpage is monitored and the image displays a false positive, a script is executed which includes the offending original camera image in the training database of negative images. Another script saves an image in the database of positive training images in the instance of a false negative. The detector may then be retrained at the user's convenience. Bootstrapping the system to achieve better performance is used in many of the techniques that are examined [2, 6, 7, 8, 9]. Figure 6-4 gives a selection of images from the group 2 database.

Figure 6-4. Selections of images from the second test group.

6.4 Sensitivity Analysis

The ruggedness of the system is tested by introducing blurred versions of test images from the first testing group to the classifier. Three levels of a Gaussian blur convolution mask are produced by varying the exponential decay constant of the 20x20 mask. Each level varies from the previous by an order of magnitude in the exponent. Figure 6-5 shows the blurring levels.

Figure 6-5. Different blurring levels of a test image. A) Original test image. B) Same test image put through a 20x20 Gaussian blur filter with an exponential decay constant of 0.35. C) Exponential decay constant of 0.035. D) Exponential decay constant of 0.0035.
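The three blur levels can be reproduced with a short sketch of the mask construction. The exact parameterization used in the thesis is not given, so this assumes a kernel of the form exp(-decay * r^2), normalized to sum to one, with the decay constant set to 0.35, 0.035, and 0.0035.

```python
import numpy as np

def gaussian_mask(size=20, decay=0.35):
    """size x size blur mask with weights exp(-decay * r^2), where r is the
    distance from the mask center; normalized so the weights sum to one."""
    c = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size]
    r2 = (x - c) ** 2 + (y - c) ** 2
    mask = np.exp(-decay * r2)
    return mask / mask.sum()

# Each factor-of-ten drop in the decay constant spreads the mask's weight
# further from the center, producing a progressively stronger blur.
for d in (0.35, 0.035, 0.0035):
    print(d, gaussian_mask(decay=d).max())
```

Convolving a test image with each mask in turn yields the three sensitivity levels described above.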
CHAPTER 7
RESULTS

Tables 7-1 and 7-2 display the classification results of the trained detector schemes on the first and second groups of test data. The "ell." entry specifies the number of ellipsoids produced during the training of each detector scheme. Surprisingly, scheme 1 performs the best for both positive and negative test images in the group 1 data set, with a positive detection rate of 84.5% and a negative detection rate of 7.9%. Preprocessing scheme 1 executes neither brightness leveling nor histogram equalization, but performs only the horizontal differencing algorithm along with image segmentation and dimensionality reduction. In general, horizontal differencing of neighboring pixels in an input image produces better results than vertical differencing of pixel values. This observation seems reasonable considering the higher degree of vertical than horizontal symmetry in humans. Because it performs the best of all the detector schemes on group 1 test data, the first detector scheme is used to analyze the test images of group 2 after the bootstrapping procedure explained in Chapter 6 is executed. Figure 7-1 displays the ability of detector scheme 1 to pick out people from selections of the group 2 database.

Table 7-1. Results of detector schemes 1-3 on test data

Scheme (ell.)   Test set               P. detects  N. detects  P. det. rate  N. det. rate
Scheme 1 (4)    Group 1 images         120/142     10/127      84.5%         7.9%
                Group 2 images         7/27        13/6448     25.9%         0.20%
                Gaussian blur 0.35     74/142                  52.1%
                Gaussian blur 0.035    70/142                  49.3%
                Gaussian blur 0.0035   9/142                   6.3%
Scheme 2 (13)   Group 1 images         98/142      11/127      69.0%         8.7%
                Gaussian blur 0.35     66/142                  46.5%
                Gaussian blur 0.035    38/142                  26.8%
                Gaussian blur 0.0035   8/142                   5.6%
Scheme 3 (3)    Group 1 images         109/142     21/127      76.8%         16.5%
                Gaussian blur 0.35     73/142                  51.4%
                Gaussian blur 0.035    61/142                  43.0%
                Gaussian blur 0.0035   15/142                  10.6%

Table 7-2. Results of detector schemes 4-6 on test data

Scheme (ell.)   Test set               P. detects  N. detects  P. det. rate  N. det. rate
Scheme 4 (19)   Group 1 images         75/142      10/127      52.8%         7.9%
                Gaussian blur 0.35     61/142                  43.0%
                Gaussian blur 0.035    30/142                  21.1%
                Gaussian blur 0.0035   3/142                   2.1%
Scheme 5 (1)    Group 1 images         107/142     32/127      75.4%         25.2%
                Gaussian blur 0.35     73/142                  51.4%
                Gaussian blur 0.035    102/142                 71.8%
                Gaussian blur 0.0035   75/142                  52.8%
Scheme 6 (2)    Group 1 images         71/142      13/127      50.0%         10.2%
                Gaussian blur 0.35     53/142                  37.3%
                Gaussian blur 0.035    84/142                  59.1%
                Gaussian blur 0.0035   66/142                  46.5%

Figure 7-1. Some analyzed selections from the second group of test images. Boxes are drawn around positive detections by the algorithm.

CHAPTER 8
CONCLUSIONS

In this thesis we determine whether people who fit a specific size profile and who pose in everyday situations may be used as a viable input class for a simple binary detector, using methods adapted and simplified from several noted techniques. We maximize the detection results over all of the preprocessing techniques used during training by selecting the image preparation algorithms that give the best results when the detector schemes are tested on one group of test data. Dimensionality reduction is an important aspect of classifier formation, and we choose a procedure with a basis in wavelet coefficient formation; it is one of the low frequency representations of the input data. Unlike most wavelet techniques, which use many more than two levels of wavelet transforms, we use one low frequency transform and one high frequency transform to reduce the dimensionality of the input vectors and to extract human features that exhibit good clustering behavior when depicted as feature vectors. We assume that the feature representations assemble into ellipsoidal shapes with varying major and minor axis lengths, and that through contraction and dilation of the ellipsoidal distributions, a large majority of feature vectors representing negative input examples remain outside of the ellipsoidal boundaries.
Bootstrapping is used to improve the performance of the final detector by allowing the retraining of the classifier scheme with images that previously produced false positives and false negatives. The results are encouraging because the detectors are trained with a small number of positive and negative examples, yet the detection rates are comparable to current techniques. However, several variables involved in training the detector schemes were not taken into consideration, so it is unclear whether the results could be improved, for example by increasing the number of positive and negative training images. We assume that the order of the images presented to the trainer influences the ability of the detector to detect humans, because ellipsoids are created, contracted, and dilated in the order that the input images are introduced. We do not use the order of the training data as a factor in the training of the detectors, nor do we try different orderings of the examples to achieve higher detection rates. Also, based on the dilation and contraction methodology in the ellipsoidal algorithm, it would seem that introducing more negative points could expel more positive examples not equal to the LCP. Different training methodologies would have to be examined to minimize this consequence. Our current detection rates reflect a training epoch which examines all of the negative examples between each positive example; other types of training methodologies should be studied.

More work should be done to provide a theoretical basis for the ideas presented in this thesis. Much work has been done by others to create a link between high dimensional vector spaces and low dimensional kernel classifiers. We believe that there is a connection between representing image data with second order manifolds and the principles related in kernel classification techniques; however, such concepts are left as future analysis on the topic.

REFERENCES

[1] M. Kositsky and S. Ullman, "Learning class regions by the union of ellipsoids," Proceedings of the 13th International Conference on Pattern Recognition, vol. 4, pp. 750-757, 1996.

[2] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio, "Pedestrian detection using wavelet templates," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 193-199, 1997.

[3] C. Curio, J. Edelbrunner, T. Kalinke, C. Tzomakas, and W. von Seelen, "Walking pedestrian recognition," IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 3, pp. 155-163, 2000.

[4] H. Nanda and L. Davis, "Probabilistic template based pedestrian detection in infrared videos," Proceedings of the IEEE Intelligent Vehicles Symposium, pp. 504-515, 2002.

[5] A. Broggi, M. Bertozzi, A. Fascioli, and M. Sechi, "Shape-based pedestrian detection," Proceedings of the IEEE Intelligent Vehicles Symposium, pp. 215-220, 2000.

[6] K. Sung and T. Poggio, "Finding human faces with a Gaussian mixture distribution-based face model," Proceedings of the Second Asian Conference on Computer Vision, pp. 139-155, Dec. 1995.

[7] L. Zhao and C. Thorpe, "Stereo- and neural network-based pedestrian detection," IEEE Transactions on Intelligent Transportation Systems, vol. 1, pp. 148-154, Sept. 2000.

[8] H. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23-38, 1998.

[9] H. Schneiderman and T. Kanade, "Probabilistic modeling of local appearance and spatial relationships for object recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 324-339, June 1998.

[10] P. S. Penev and L. Sirovich, "The global dimensionality of face space," Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 264-270, 2000.
BIOGRAPHICAL SKETCH

Jennifer Lea Laine was born in Vero Beach, Florida, in 1975. She received a Bachelor of Science degree with honors in electrical engineering from the University of Florida in the summer of 1998 and, later in 2000, a Bachelor of Science degree in mathematics. Besides working towards a Master of Science degree in electrical engineering, she is a member of the Machine Intelligence Laboratory in the Electrical and Computer Engineering Department and works part-time as a design engineer at Neurotronics in Gainesville, Florida.