IRREGULAR-STRUCTURE TREE MODELS FOR IMAGE INTERPRETATION

By

SINISA TODOROVIC

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to Dr. Michael Nechyba for his wise and patient guidance of my research for this dissertation. As my advisor, Dr. Nechyba has been directing, but on no account confining, my interests. I especially appreciate his readiness and expertise to help me solve numerous research issues. Most importantly, I am grateful for the friendship that we have developed working on this dissertation. Also, I thank my current advisor, Dr. Dapeng Wu, for his extra effort to help me finalize my PhD studies. I am grateful for his invaluable advice in choosing my future research goals, as well as for the practical, concrete steps that he undertook to help me. My thanks also go to Dr. Jian Li, who helped me a lot in the transition period in which I had to change my advisor. Her research group provided a stimulating environment for me to endeavor investigating areas that are related to the work in this dissertation. Also, I thank Dr. Antonio Arroyo, whose inspiring lectures on machine intelligence encouraged me to do research in the field of machine learning. As the director of the Machine Intelligence Lab (MIL), Dr. Arroyo has created a warm, friendly, and hard-working atmosphere among the "MILers." Thanks to him, I decided to join the MIL, which has proved on numerous occasions to be the right decision. I thank all the members of the MIL for their friendship and support. I thank Dr. Takeo Kanade and Dr. Andrew Kurdila for sharing their research insights on the micro air vehicle (MAV) project with me. The multidisciplinary environment of this project, in which I had a chance to collaborate with various researchers with diverse educational backgrounds, was a great experience for me.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS
KEY TO SYMBOLS
ABSTRACT

CHAPTER

1 INTRODUCTION
   1.1 Part-Based Object Recognition
   1.2 Probabilistic Framework
   1.3 Tree-Structured Generative Models
   1.4 Learning Tree Structure from Data is an NP-hard Problem
   1.5 Our Approach to Image Interpretation
   1.6 Contributions
   1.7 Overview

2 IRREGULAR TREES WITH RANDOM NODE POSITIONS
   2.1 Model Specification
   2.2 Probabilistic Inference
   2.3 Structured Variational Approximation
      2.3.1 Optimization of Q(X|Z)
      2.3.2 Optimization of Q(R'|Z)
      2.3.3 Optimization of Q(Z)
   2.4 Inference Algorithm and Bayesian Estimation
   2.5 Learning Parameters of the Irregular Tree with Random Node Positions
   2.6 Implementation Issues

3 IRREGULAR TREES WITH FIXED NODE POSITIONS
   3.1 Model Specification
   3.2 Inference of the Irregular Tree with Fixed Node Positions
   3.3 Learning Parameters of the Irregular Tree with Fixed Node Positions

4 COGNITIVE ANALYSIS OF OBJECT PARTS
   4.1 Measuring Significance of Object Parts
   4.2 Combining Object-Part Recognition Results

5 FEATURE EXTRACTION
   5.1 Texture
      5.1.1 Wavelet Transform
      5.1.2 Wavelet Properties
      5.1.3 Complex Wavelet Transform
      5.1.4 Difference-of-Gaussian Texture Extraction
   5.2 Color

6 EXPERIMENTS AND DISCUSSION
   6.1 Unsupervised Image Segmentation Tests
   6.2 Tests of Convergence
   6.3 Image Classification Tests
   6.4 Object-Part Recognition Strategy

7 CONCLUSION
   7.1 Summary of Contributions
   7.2 Opportunities for Future Work

APPENDIX

A DERIVATION OF VARIATIONAL APPROXIMATION
B INFERENCE ON THE FIXED-STRUCTURE TREE

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

5-1 Coefficients of the filters used in the Q-shift DT-CWT
6-1 Root-node distance error
6-2 Pixel segmentation error
6-3 Object detection error
6-4 Object recognition error
6-5 Pixel labeling error
6-6 Object recognition error for IQT_V0
6-7 Pixel labeling error for IQT_V0

LIST OF FIGURES

1-1 Variants of TSBNs
1-2 An irregular tree consists of a forest of subtrees
1-3 Bayesian estimation of the irregular tree
2-1 Two types of irregular trees
2-2 Pixel clustering using irregular trees
2-3 Irregular tree learned for the 4x4 image in (a)
2-4 Inference of the irregular tree given Y, R0, and Theta
3-1 Choice of candidate parents
3-2 Inference of the irregular tree with fixed node positions
3-3 Algorithm for learning the parameters of the irregular tree
4-1 For each subtree of IT_V, representing an object in the 128 x 128 image
4-2 For each subtree of IT_V, representing an object in the 256 x 256 image
5-1 Two levels of the DWT of a two-dimensional signal
5-2 The original image (left) and its two-scale dyadic DWT (right)
5-3 The Q-shift Dual-Tree CWT
5-4 The CWT is strongly oriented at angles 15, 45, 75 degrees
6-1 20 image classes in type I and II datasets
6-2 Image segmentation using IT_V0
6-3 Image segmentation using IT_V0: (top) dataset I images
6-4 Image segmentation by irregular trees learned using SVA
6-5 Image segmentation by irregular trees learned using SVA: (a) IT_V0
6-6 Image segmentation using IT_V
6-7 Comparison of inference algorithms
6-8 Typical convergence rate of the inference algorithm for IT_V0 on the 128 x 128 image
6-9 Typical convergence rate of the inference algorithm for IT_V0 on the 256 x 256 image
6-10 Percentage increase in log-likelihood
6-11 Comparison of classification results for various statistical models
6-12 MAP pixel labeling using different statistical models
6-13 ROC curves for the image in Fig. 6-12a with IT_V0, TSBN, DRF and MRF
6-14 ROC curves for the image in Fig. 6-12a with IT_V, IT_V0, TSBN, and TSBN-T
6-15 Comparison of two recognition strategies
6-16 Recognition results over dataset IV for IQT_V0
6-17 Recognition results over dataset V for IQT_V0
6-18 Classification using the part-object recognition strategy
B-1 Steps 2 and 5 in Fig. 3-2
KEY TO ABBREVIATIONS

The list shown below gives a description of the frequently used acronyms and abbreviations in this work.

B: blue channel of the RGB color space
G: green channel of the RGB color space
R: red channel of the RGB color space
IQT_V: irregular tree with fixed node positions, and with observables present at all levels
IQT_V0: irregular tree with fixed node positions, and with observables present only at the leaf level
IT_V0: irregular tree where observables are present only at the leaf level
IT_V: irregular tree where observables are present at all levels
g: normalized green channel
r: normalized red channel
CWT: Complex Wavelet Transform
DRF: Discriminative Random Field
DT-CWT: Dual-Tree Complex Wavelet Transform
DWT: Discrete Wavelet Transform
EM: Expectation-Maximization algorithm
KL: Kullback-Leibler divergence
MAP: Maximum A Posteriori
MCMC: Markov Chain Monte Carlo method
ML: Maximum Likelihood
MPM: Maximum Posterior Marginal
MRF: Markov Random Field
NP: nondeterministic polynomial time
RGB: the color space that consists of red, green and blue color values
ROC: receiver operating characteristic
SVA: structured variational approximation inference algorithm
TSBN: tree-structured belief network
VA: variational approximation inference algorithm

KEY TO SYMBOLS

The list shown below gives a brief description of the principal mathematical symbols defined in this work.

A_ij: influence of the observables Y on z_ij
B_ij: influence of the geometric properties of the network on z_ij
G: number of components in a Gaussian mixture
H_i: Shannon's entropy of node i
F(Q, P): free energy
L: maximum number of levels in the irregular tree
M: set of image classes (i.e., object appearances)
P: conditional probability tables
Q: approximate conditional probability tables, given Y and R0
R: positions of all nodes in the irregular tree
R': positions of non-leaf nodes in the irregular tree
R0: positions of leaf nodes in the irregular tree
V: set of all nodes in the irregular tree
V': set of all non-leaf nodes in the irregular tree
V0: set of all leaf nodes in the irregular tree
X: random vector of all image classes
Y: all observables
Z: connectivity random matrix
C: cost function
D: the set of candidate parents in the irregular tree with fixed node positions
Sigma_ij: covariance matrix of a relative child-parent displacement (r_i - r_j)
Theta: set of parameters that characterize an irregular tree
S_ij: approximate covariance of r_i, given that j is the parent of i, and given Y and R0
mu_ij: approximate mean of r_i, given that j is the parent of i, and given Y and R0
p(i): position of an observable random vector in the image plane
l: index of levels in the irregular tree
P(z_ij): probability of a node i being the child of j
h: normalization constant
theta: set of parameters that characterize a Gaussian mixture
xi_ij: approximate probability of i being the child of j, given Y and R0
Q(x_i^k): approximate posterior that node i is labeled as image class k, given Y and R0
x_i: image class of node i
x_i^k: image-class indicator if class k is assigned to node i
z_ij: connectivity indicator random variable between nodes i and j
d_ij: the mean of relative displacement r_i - r_j
r_i: position of node i in the image plane
Y_p(i): observable random vector

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

IRREGULAR-STRUCTURE TREE MODELS FOR IMAGE INTERPRETATION

By

Sinisa Todorovic

Chair: Dapeng Wu
Major Department: Electrical and Computer Engineering

In this dissertation, we seek to accomplish the following related goals: (1) to find a unifying framework to address localization, detection, and recognition of objects as three subtasks of image interpretation, and (2) to find a computationally efficient and reliable solution to recognition of multiple, partially occluded, alike objects in a given single image. The second problem is to date an open problem in computer vision, eluding a satisfactory solution. For this purpose, we formulate object recognition as Bayesian estimation, whereby class labels with the maximum posterior distribution are assigned to each pixel. To efficiently estimate the posterior distribution of image classes, we propose to model images with graphical models known as irregular trees. The irregular tree specifies probability distributions over both its structure and image classes. This means that, for each image, it is necessary to infer the optimal model structure, as well as the posterior distribution of image classes. We propose several inference algorithms as a solution to this NP-hard problem (nondeterministic polynomial time), which can be viewed as variants of the Expectation-Maximization (EM) algorithm. After inference, the model represents a forest of subtrees, each of which segments the image.
That is, inference of model structure provides a solution to object localization and detection. With respect to our second goal, we hypothesize that, for successful occluded-object recognition, it is critical to explicitly analyze visible object parts. Irregular trees are convenient for such analysis, because the treatment of object parts represents merely a particular interpretation of the tree/subtree structure. We analyze the significance of irregular-tree nodes, representing object parts, with respect to recognition of an object as a whole. This significance is then exploited toward the ultimate object recognition. Experimental results demonstrate that irregular trees model images more accurately than their fixed-structure counterparts, quadtrees. Also, the experiments reported herein show that our explicit treatment of object parts results in improved recognition performance, as compared to strategies in which object parts are not explicitly accounted for.

CHAPTER 1
INTRODUCTION

Image interpretation is a difficult challenge that has long confronted the computer vision community. A number of factors contribute to the complexity of this problem. The most critical is the inherent uncertainty in how the observed visual evidence in images should be attributed to infer object types and their relationships. In addition to video noise, there are various sources of this uncertainty, including variations in camera quality and position, wide-ranging illumination conditions, extreme scene diversity, and the randomness of object appearances, clutter, and locations in scenes. One of the critical hindrances to successful image interpretation is that objects may occlude each other in a complex scene. In the literature, the initial research on the interpretation of scenes with occlusions appeared in the early nineties. In the last decade, however, a relatively small volume of related literature has been published.
In fact, the majority of recently proposed vision systems are not directly aimed at solving the problem of occluded-object recognition; experiments on images with occlusions are reported only as a side result, to illustrate the versatility of those systems. This suggests that recognition of partially occluded objects is an open problem in computer vision, which motivates us to seek its solution in this dissertation. In the initial work, local features (e.g., points, line and curve segments) are used to represent objects, allowing the unoccluded features to be matched with object features by computing a scalar measure of model fit [1,2,3]. The unmatched scene features are modeled as spurious features, and the unmatched object features indicate the occluded part of the object. The matching score is either the number of matched object features or the sum of a Gaussian-weighted matching error. The main limitation of these approaches is that they do not account for the spatial correlation among occlusions. Statistical approaches to occluded-object recognition have also been reported in the literature. For instance, Wells [4], and Ying and Castanon [5], propose probabilistic models to characterize scene features and the correspondence between scene and object features. The authors model both object-feature uncertainty and the probability that the object features are occluded in the scene. They introduce two statistical models for occlusion. One model assumes that each feature can be occluded independently of whether any other features are occluded, whereas the second model accounts for the spatial correlation to represent the extent of occlusion. The spatial correlation is computed using a Markov Random Field (MRF) model with a Gibbs distribution [6]. The main drawback of these systems is a prohibitive computational load; the runtime of these algorithms is exponential in the number of objects to be recognized.
Other related work exploits auxiliary information provided, for example, by image sequences or stereo views of the same scene [7,8,9,10,11,5], where occlusions are transitory. Since this information in general may not be available, and/or occlusions may remain permanent, our approach does not use the strategies of these systems. A review of the related literature also suggests that the majority of vision systems are designed to deal with only one constrained vision task, such as, for example, image segmentation [10,11,5]. However, to conduct image interpretation, as is our goal, it is necessary to perform three related tasks: (1) localization, (2) detection (also called image segmentation), and (3) ultimate recognition of object appearances (also called image classification). Further, in many systems in which the three subtasks are addressed, this is not done in a unified manner. Here, as a drawback, the system's architecture comprises a serial connection of separate modules, without any feedback on the accuracy of the ultimate recognition. Moreover, vision systems are typically designed to recognize only a specific instance of object classes appearing in the image (e.g., a face), which, in turn, is assumed dissimilar to other objects in the image. However, the assumption of uniqueness of the target class may not be appropriate in many settings. Also, the success of these systems usually depends on ad hoc fine-tuning of the feature-extraction methods and system parameters, optimized for that unique target class. With current demands to design systems capable of handling thousands of image classes simultaneously, it would be difficult to generalize the outlined approaches. The small volume of published research addressing occlusions in images suggests that the problem is not fully examined.
Also, the drawbacks of the above systems (namely, constrained goals and settings of operation, poor spatial modeling of occlusion, and prohibitive computational load) motivated us to conduct the research reported herein. Our motivation is that most object classes seem to be naturally described by a few characteristic parts or components and their geometrical relation. We hypothesize that it is not the percentage of occlusion that is critical for object recognition, but rather which object parts are occluded. Not all components of an object are equally important for its recognition, especially when that object is partially occluded. Given two similar objects in the image, the visible parts of one object may mislead the algorithm into recognizing it as its counterpart. Therefore, careful consideration should be given to the analysis of detected visible object parts. One of the benefits of such analysis is the flexibility to develop various recognition strategies that weigh the information obtained from the detected object parts more judiciously. In the following section, we review some of the reported part-based object-recognition strategies.

1.1 Part-Based Object Recognition

Recently, there has been a flurry of research related to part-based object recognition. For example, Mohan et al. [12] use separate classifiers to detect heads, arms, and legs of people in an image, and a final classifier to decide whether a person is present. However, the approach requires object parts to be manually defined and separated for training the individual part classifiers. To build a system that is easily extensible to deal with different objects, it is important that the part-selection procedure be automated. One approach in this direction is developed by Weber et al. [13,14].
The authors assume that an object is composed of parts and shape, where parts are image patches, which may be detected and characterized by appropriate detectors, and shape describes the geometry of the mutual position of the parts in a way that is invariant with respect to rigid and, possibly, affine transformations. The authors propose a joint probability density over part appearances and shape that models the object class. This framework is appealing in that it naturally allows for parts of different sizes and resolutions. However, due to computational issues, to learn the joint probability density, the authors heuristically choose a small number of parts per object class, rendering the density unreliable in the case of large variations across images. Probabilistic detection of object parts has also been reported. For instance, Heisele et al. [15] propose to learn object components from a set of examples based on their discriminative power, and their robustness against pose and illumination changes. For this purpose, they use Support Vector Machines. Also, Felzenszwalb and Huttenlocher [16] represent an object by a collection of parts arranged in a deformable configuration. In their approach, the appearance of each part is modeled separately by Gaussian-mixture distributions, and the deformable configuration is represented by spring-like connections between pairs of parts. The main problem of the mentioned approaches is that they lack the analysis of object parts through scales. It is assumed that parts cannot contain other subparts, and that objects are unions of mutually exclusive components, which is hard to justify for more complex object classes. To address the analysis of object parts through scales, Schneiderman and Kanade [17] propose a trainable multi-stage object detector composed of classifiers, each making a decision about whether to cease evaluation, labeling the input as non-object, or to continue further evaluation.
The detector orders these stages of evaluation from a low-resolution to a high-resolution search of the image. The aforementioned approaches are not suitable for recognition of a large number of object classes. As the number of classes increases, there is a combinatorial explosion of the number of their parts (i.e., image patches) that need to be evaluated by appropriate detectors. In this dissertation, we seek a solution to the outlined problems. Our goal is to design a vision system that would handle multiple object classes through their constituent, "meaningful" parts at a number of different resolutions. To this end, we resort to a probabilistic framework, as discussed in the following section.

1.2 Probabilistic Framework

We formulate image interpretation as inference of a posterior distribution over pixel random fields for a given image. Once the posterior distribution of image classes is inferred, each pixel can be labeled through Bayesian estimation (e.g., maximum a posteriori, MAP). Within this framework, it is necessary to specify the following:

1. The probability distribution of image classes over pixel random fields,
2. The inference algorithms for computing the posterior distribution of image classes,
3. Bayesian estimation for ultimate pixel labeling, that is, object recognition.

Our principal challenge lies in choosing a statistical model for specifying the probability distribution of image classes, since this choice conditions the formulation of inference and Bayesian estimation. A suitable model should be computationally manageable, and sufficiently expressive to represent a wide range of patterns in images. A review of the literature offers four broad classes of models [18]. The descriptive models are constructed based on statistical descriptions of image ensembles, with variables only at one level (e.g., [19,20]).
The pseudo-descriptive models reduce the computational cost of descriptive models by imposing partial (or even linear) order among random variables (e.g., [21,22]). The generative models consist of observable and hidden variables, where hidden variables represent a finite number of bases generating an image (e.g., [23,24]). The discriminative models directly encode the posterior distribution of hidden variables given observables (e.g., [25,26]). The available models differ in structural complexity and difficulty of inference. At one end lie descriptive models, which build statistical descriptions of image ensembles only at the observable (i.e., pixel) level. Other modeling paradigms (i.e., generative, discriminative) impose varying levels of structure through the introduction of hidden variables. However, no principled formulation exists, as of yet, to suggest one approach superior to the others. Therefore, our choice of model is guided by the goal to interpret scenes with partially occluded, alike objects. We seek a model that offers a viable means of recognizing partially occluded objects through recognition of their visible constituent parts. Thus, a prospective model should allow for analysis of object parts towards recognition of objects as a whole. To alleviate the computational complexity arising from the treatment of multiple object parts of multiple objects in images, we seek a model that is capable of modeling both whole objects and their subparts in a unified manner. That is, a candidate model must be expressive enough to capture component-subcomponent relationships among regions in an image. To accomplish this, it is necessary to analyze pixel neighborhoods of varying size. The literature abounds with reports on successful applications of multiscale statistical models for this purpose [27,28,29,30,31,32]. Following these trends, we choose the irregular tree-structured belief network, or, for short, the irregular tree.
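As a concrete illustration of step 3 of the framework in Section 1.2: once a model has produced a per-pixel posterior over image classes, MAP labeling reduces to a per-pixel argmax. The array shapes and the helper name below are illustrative assumptions, a minimal sketch rather than the dissertation's implementation.

```python
import numpy as np

def map_label_pixels(posterior):
    """MAP pixel labeling: assign each pixel the image class with the
    highest posterior probability.

    posterior: (H, W, K) array, a hypothetical per-pixel posterior
    over K image classes (each pixel's K values sum to 1).
    Returns an (H, W) array of class indices.
    """
    return np.argmax(posterior, axis=-1)

# Toy example: a 2x2 image and K = 3 image classes.
post = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
                 [[0.3, 0.3, 0.4], [0.5, 0.25, 0.25]]])
labels = map_label_pixels(post)
# labels == [[0, 1], [2, 0]]
```

Other Bayesian estimators (e.g., minimizing a different cost function, as in Section 1.5) change only the per-pixel decision rule, not this overall structure.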
Our choice is directly driven by our image-interpretation strategy and goals, and appears better suited than alternative statistical approaches. Descriptive models lack the necessary structure for the component-subcomponent representation we seek to exploit. Discriminative approaches directly model the posterior distribution of hidden variables given observables. Consequently, they lose the convenience of assigning physical meaning to the statistical parameters of the model. In contrast, irregular trees can detect objects and their parts simultaneously, as discussed in the following chapters. Before we continue to present our approach to image interpretation, we give a brief overview of tree-structured generative models in the following section.

1.3 Tree-Structured Generative Models

Recently, there has been a flurry of research in the field of tree-structured generative models, also known as tree-structured belief networks (TSBNs) [27,33,28,29,30,31,32]. The models provide a systematic way to describe random processes/fields and have extremely efficient and statistically optimal inference algorithms. Tree-structured belief networks are characterized by a fixed balanced tree structure of nodes representing hidden (latent) and observable random variables. We focus on TSBNs whose hidden variables take discrete values, though TSBNs can model even continuously valued Gaussian processes [34,35]. The edges of TSBNs represent parent-child (Markovian) dependencies between neighboring layers of hidden variables, while hidden variables belonging to the same layer are conditionally independent, as depicted in Figure 1-1. Note that observables depend solely on their corresponding hidden variables. Observables are either present at the finest level only, or could be propagated upward the tree, as dictated by the design choices related to image processing.
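The fixed balanced quadtree wiring of a TSBN reduces to index arithmetic: the node at (row, col) of level l+1 has parent (row // 2, col // 2) at level l, and every non-leaf node has exactly four children. The helper names below are assumptions made for illustration, sketching only the parent-child wiring, not the probabilistic machinery:

```python
def quadtree_parent(level, row, col):
    """Parent of node (level, row, col) in a balanced quadtree;
    the root (level 0) has no parent."""
    if level == 0:
        return None
    return (level - 1, row // 2, col // 2)

def quadtree_children(level, row, col):
    """The four children of a node, one level finer."""
    return [(level + 1, 2 * row + dr, 2 * col + dc)
            for dr in (0, 1) for dc in (0, 1)]

# In a three-level TSBN over a 4x4 leaf grid, leaf (3, 1) at level 2
# descends from node (1, 0) at level 1, which descends from the root.
assert quadtree_parent(2, 3, 1) == (1, 1, 0)
assert quadtree_parent(1, 1, 0) == (0, 0, 0)
```

It is exactly this rigid, position-determined wiring that the irregular tree relaxes by making connectivity itself random.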
TSBNs have efficient linear-time inference algorithms, of which, in the graphical-models literature, the best known is belief propagation [36,37,38]. Cheng and Bouman [29] have used TSBNs for multiscale document segmentation; Kumar and Hebert [39] have employed TSBNs for segmentation of man-made structures in natural scene images; and Schneider et al. [40] have used TSBNs for simultaneous image denoising and segmentation. All the aforementioned examples demonstrate the powerful expressiveness of TSBNs and the efficiency of their inference algorithms, which is critically important for our purposes. In spite of these attractive properties, the fixed regular structure of nodes in the TSBN gives rise to "blocky" estimates. The predefined tree structure fails to adequately represent the immense variability in size and location of different objects and their subcomponents in images. In the literature, there are several approaches to alleviate this problem. Irving et al. [28] have proposed an overlapping tree model, where distinct nodes correspond to overlapping parts in the image. Li et al. [41] have discussed two-dimensional hierarchical models where nodes are dependent both at any particular layer, through a Markov mesh, and across resolutions. In both approaches, segmentation results are superior to those obtained with standard TSBNs, because the descriptive component of the models is improved at increased computational cost. Ultimately, however, these approaches do not deal with the source of the "blockiness," namely, the orderly structure of TSBNs. Not until recently has research on irregular structures been initiated. Konen et al. [42] have proposed a flexible neural mechanism for invariant pattern recognition based on correlated neuronal activity and the self-organization of dynamic links in neural networks. Also, Montanvert et al. [43], and Bertolino and Montanvert [44], have explored irregular multiscale tessellations that adapt to image content.
We join these research efforts, building on the work of Adams et al. [45], Adams [46], Storkey [47], and Storkey and Williams [48], by considering the irregular-structured tree belief network.

Figure 1-1: Variants of TSBNs: (a) observables (black) at the lowest layer only; (b) observables (black) at all layers; white nodes represent hidden random variables, connected in a balanced quadtree structure.

Figure 1-2: An irregular tree consists of a forest of subtrees, each of which segments the image into regions, marked by distinct shading; round and square-shaped nodes indicate hidden and observable variables, respectively; triangles indicate roots.

In the irregular tree, as in TSBNs, nodes represent random variables, and arcs between them model causal (Markovian) dependence assumptions, as illustrated in Figure 1-2. The irregular tree specifies probability distributions over both its structure and image classes. It is this distribution over tree structures that mitigates the above-cited problems with TSBNs.

1.4 Learning Tree Structure from Data is an NP-hard Problem

In order to fully characterize the irregular tree (and any graphical model, for that matter), it is necessary to learn both the graph topology (structure) and the parameters of transition probabilities between connected nodes from training data. Usually, for this purpose, one maximizes the likelihood of the model over training data, while at the same time minimizing the complexity of the model structure. Current methods are successful at learning both the structure and parameters from complete data. Unfortunately, when the data are incomplete (i.e., some random variables are hidden), optimizing both the structure and parameters becomes NP-hard (nondeterministic polynomial time) [49,50]. The principal contribution of this dissertation is that we propose a solution to the NP-hard problem of model-structure estimation.
In our approach, we use a variant of the Expectation-Maximization (EM) algorithm [51, 52] to facilitate an efficient search over a large number of candidate structures. In particular, the EM procedure iteratively improves its current choice of parameters by using the following two steps. In the Expectation step, the current parameters are used to compute the expected values of all the statistics needed to evaluate the current structure. That is, the missing data (hidden variables) are completed by their expected values. In the Maximization step, we replace the current parameters with those that maximize the likelihood over the complete data. This second step is essentially equivalent to learning model structure and parameters from complete data and, hence, can be done efficiently [50, 38, 49]. In the incomplete-data case, a local change in the structure of one part of the tree may lead to a structure change in another part of the model. Thus, the available methods for structure estimation evaluate all the neighbors (e.g., networks that differ by a few local changes) of each candidate they visit [53]. The novel idea of our approach is to perform the search for the best structure within EM. In each iteration step, our procedure attempts to find a better network structure by computing the expected statistics needed for the evaluation of alternative structures. In contrast to the available approaches, the EM-based structure search makes significant progress in each iteration. As we show through experimental validation, our procedure requires relatively few EM iterations to learn nontrivial tree structures. The outlined image modeling constitutes the core of our approach to image interpretation, which is discussed in the following section.
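The interleaving of structure search and parameter estimation inside EM can be illustrated on a deliberately tiny analogue: hard-assignment EM for a one-dimensional two-component model, where the "structure" is each node's parent assignment and the "parameters" are the parent means. All names here are hypothetical, and this is only a sketch of the idea, not the dissertation's actual algorithm:

```python
import numpy as np

def em_with_structure_search(y, means, n_iters=20):
    """Toy EM that searches structure and parameters together.

    y:     1-D array of observations (stand-ins for leaf evidence).
    means: initial means of the candidate "parents".
    Returns the final parent assignment per observation (the "structure")
    and the re-estimated means (the "parameters")."""
    y = np.asarray(y, dtype=float)
    means = np.asarray(means, dtype=float)
    assign = np.zeros(len(y), dtype=int)
    for _ in range(n_iters):
        # E-step with embedded structure search: evaluate every candidate
        # parent for every node, and move each node to its best parent.
        assign = np.argmin((y[:, None] - means[None, :]) ** 2, axis=1)
        # M-step: re-fit the parameters as if the data were complete.
        for j in range(len(means)):
            if np.any(assign == j):
                means[j] = y[assign == j].mean()
    return assign, means
```

With y = [0.0, 0.1, 5.0, 5.1] and initial means [1.0, 4.0], the loop converges to assignments [0, 0, 1, 1] and means [0.05, 5.05]: each iteration both re-assigns structure and re-fits parameters, mirroring the structure-search-within-EM idea.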
1.5 Our Approach to Image Interpretation We seek to accomplish the following related goals: (1) to find a unifying framework that addresses localization, detection, and recognition of objects as three subtasks of image interpretation, and (2) to find a computationally efficient and reliable solution to the recognition of multiple, partially occluded, alike objects in a given single image. For this purpose, we formulate object recognition as a Bayesian estimation problem, where class labels are assigned to pixels by minimizing the expected value of a suitably specified cost function. This formulation requires efficient estimation of the posterior distribution of image classes (i.e., objects), given an image. To this end, we resort to directed graphical models, known as irregular trees [54, 55, 46, 47, 48, 45]. As discussed in Section 1.3, the irregular tree specifies probability distributions over both its structure and image classes. This means that, for each image, it is necessary to infer the optimal model structure, as well as the posterior distribution of image classes. By utilizing the Markov property of the irregular tree, we are in a position to reduce the computational complexity of the inference algorithm and, thereby, to efficiently solve our Bayesian estimation problem. After inference, the model represents a forest of subtrees, each of which segments the image. More precisely, the leaf nodes that are descendants down the subtree of a given root form the image region characterized by that root, as depicted in Fig. 1-2. These segmented image regions can be interpreted as distinct object appearances in the image. That is, inference of the irregular-tree structure provides a solution to localization and detection. Moreover, in inference, we also derive the posterior distribution of image classes over leaf nodes. In order to classify each segmented image region as a whole, we perform majority voting over the maximum a posteriori (MAP) classes of its leaf nodes.
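The majority-voting step can be sketched as follows; the function name and posterior array are hypothetical illustrations, assuming the per-leaf posteriors for one segmented region are already available:

```python
import numpy as np

def classify_region(leaf_posteriors):
    """Label a segmented region by majority vote over its leaves.

    leaf_posteriors: (n_leaves, n_classes) array, where row i holds the
    posterior distribution over image classes for leaf i."""
    map_labels = np.argmax(leaf_posteriors, axis=1)  # MAP class per leaf
    votes = np.bincount(map_labels, minlength=leaf_posteriors.shape[1])
    return int(np.argmax(votes))                     # most-voted class wins
```

For posteriors [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]], the per-leaf MAP labels are [0, 0, 1], so the region as a whole is labeled class 0.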
In this fashion, we accomplish our first goal. With respect to our second goal, we hypothesize that the critical factor in successful occluded-object recognition is the analysis of visible object parts, which, as discussed before, usually induces prohibitive computational cost. To account explicitly for object parts at various scales, we utilize the Markovian property of irregular trees, which lends itself as a natural solution. Since each root determines a subtree whose leaf nodes form a detected object, we can assign semantic meaning to roots as representing whole objects. Also, each descendant of the root down the subtree can be interpreted as the root of another subtree whose leaf nodes cover only a part of the object. Thus, roots' descendants can be viewed as object parts at various scales. Therefore, within the irregular-tree framework, the treatment of object parts represents merely a particular interpretation of the tree/subtree structure. To reduce the complexity of interpreting all detected object subparts, we propose to analyze the significance of object components (i.e., irregular-tree nodes) with respect to the recognition of objects as wholes. After Bayesian estimation of the irregular-tree structure for a given image, we first find the set of most significant irregular-tree nodes. Then, these selected significant nodes are treated as new roots of subtrees. Finally, we conduct MAP classification and majority voting over the selected image regions descending from the selected significant nodes, as illustrated in Fig. 1-3. 1.6 Contributions Below, we outline the main contributions of this dissertation.
Figure 1-3: Bayesian estimation of the irregular tree, along with the analysis of significant tree nodes, constitutes our approach to recognition of partially occluded, alike objects (panels: optimize structure; find "significant" nodes; classify selected regions); shading indicates the two distinct subtrees under the two "significant" nodes. We propose an EM-like algorithm for learning a graphical model, where both the model structure and its distributions are learned from given data simultaneously. The algorithm represents a stagewise solution to a learning problem known to be NP-hard. While we use the algorithm for learning irregular trees, its generalization to any generative model is straightforward. A critical part of this learning algorithm is inference of the posterior distribution of image classes on given data. As is the case for many complex-structure models, exact inference for irregular trees is intractable. To overcome this problem, we resort to a variational-approximation approach. We assume that there are averaging phenomena in irregular trees that may render a given set of variables in the model approximately independent of the rest of the network. Thereby, we derive the Structured Variational Approximation algorithm, which advances existing methods for inference. In order to avoid variational approximation in inference, we propose two novel architectures and their inference algorithms within the irregular-tree framework. Being simpler, these models allow for exact inference. Moreover, empirically, they exhibit higher accuracy in modeling images than the irregular-tree-like models proposed in prior work [45, 46, 47, 48]. Along with these architectural novelties, we also introduce multilayered data into the model, an approach that has been extensively investigated for fixed-structure quad-trees [29, 33]. The proposed quad-trees have proved rather successful for various applications, including image denoising, classification, and segmentation.
Hence, it is important to develop a similar formulation for irregular trees. We develop a novel approach to object recognition, in which object parts are explicitly analyzed in a computationally efficient manner. As a major theoretical contribution, we define a measure of the cognitive significance of object details. The measure provides for a principled algorithm that combines detected object parts toward recognition of an object as a whole. Finally, we report results of experiments conducted on a wide variety of image datasets, which characterize the proposed models and inference algorithms, and validate our approach to image interpretation. 1.7 Overview The remainder of the dissertation is organized as follows. In Chapter 2, we specify two architectures of the irregular-tree model and derive inference algorithms for them. The architectures differ in their treatment of observable random variables. We also discuss learning of the model parameters. A detailed derivation of the inference algorithm is given in Appendix A. Next, in Chapter 3, we specify yet another two architectures of the irregular-tree model, for which it is possible to simplify the inference algorithm, as compared to that discussed in Chapter 2. We deliberate the probabilistic inference and learning algorithms for these models. Further, in Chapter 4, we propose a measure of the significance of object parts. This measure ranks object components with respect to the entropy over all image classes (i.e., objects). To incorporate the information of this analysis into the MAP classification, we devise a greedy algorithm, which we refer to as object-part recognition. The extraction of image features, which we use in our experiments, is thoroughly discussed in Chapter 5. Then, in Chapter 6, we report performance results of the different irregular-tree architectures on a large number of challenging images with partially occluded, alike objects.
Finally, in Chapter 7, we summarize the major contributions of the dissertation, and conclude with remarks on future research. CHAPTER 2 IRREGULAR TREES WITH RANDOM NODE POSITIONS 2.1 Model Specification Irregular trees are directed acyclic graphs with two disjoint sets of nodes representing hidden and observable random vectors. Graphically, we represent all hidden variables as round-shaped nodes, connected via directed edges indicating Markovian dependencies, while observables are denoted as rectangular-shaped nodes, connected only to their corresponding hidden variables, as depicted in Fig. 2-1. Below, we first introduce the nodes characterized by hidden variables. There are |V| round-shaped nodes, organized in hierarchical levels V^ℓ, ℓ ∈ {0, 1, ..., L−1}, where V^0 denotes the leaf level, and V' ≜ V\V^0. The number of round-shaped nodes is identical to that of the corresponding quad-tree with L levels, such that |V^ℓ| = |V^{ℓ−1}|/4 = ... = |V^0|/4^ℓ. Connections are established under the constraint that a node at level ℓ can become a root, or it can connect only to the nodes at the next level, ℓ+1. The network connectivity is represented by a random matrix Z, whose entry z_ij is an indicator random variable, such that z_ij=1 if i∈V^ℓ and j∈{0, V^{ℓ+1}} are connected. Z contains an additional zero ("root") column, where entry z_i0=1 if i is a root. Since each node can have only one parent, a realization of Z can have at most one entry equal to 1 in each row. We define the distribution over connectivity as

P(Z) ≜ ∏_{ℓ=0}^{L−1} ∏_{i∈V^ℓ} ∏_{j∈{0, V^{ℓ+1}}} [γ_ij]^{z_ij} ,   (2.1)

where γ_ij is the probability of i being the child of j, subject to ∑_{j∈{0, V^{ℓ+1}}} γ_ij = 1. Figure 2-1: Two types of irregular trees: (a) observable variables present at the leaf level only; (b) observable variables present at all levels; round and square-shaped nodes indicate hidden and observable random variables; triangles indicate roots; unconnected nodes in this example belong to other subtrees; each subtree segments the image into regions marked by distinct shading.
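A realization of Z under Eq. (2.1) can be drawn level by level, picking for each node either the root option or one parent at the next level up; the data layout and names below are hypothetical, and this is only a minimal sketch:

```python
import random

def sample_connectivity(levels, gamma, seed=None):
    """Draw one realization of Z from the distribution of Eq. (2.1).

    levels: list of lists of node ids, finest level first.
    gamma:  gamma[i] maps each candidate parent id (0 meaning "root")
            to the probability gamma_ij; the values sum to 1 over
            {0} and the nodes one level up.
    Returns parent[i], i.e. the single nonzero entry of row i of Z."""
    rng = random.Random(seed)
    parent = {}
    for level_nodes in levels:
        for i in level_nodes:
            candidates = list(gamma[i])
            weights = [gamma[i][j] for j in candidates]
            # one parent per node: at most one 1 per row of Z
            parent[i] = rng.choices(candidates, weights=weights, k=1)[0]
    return parent
```

With degenerate probabilities (each node giving mass 1 to a single candidate), the sampler reproduces that fixed structure, which is a convenient sanity check.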
Further, each round-shaped node i (see Fig. 2-1) is characterized by a random position r_i in the image plane. The distribution of r_i is conditioned on the position of its parent, r_j, as

P(r_i|r_j, z_ij=1) ≜ (1/(2π|Σ_ij|^{1/2})) exp(−(1/2)(r_i−r_j−d_ij)^T Σ_ij^{−1} (r_i−r_j−d_ij)) ,   (2.2)

where Σ_ij is a diagonal matrix that represents the order of magnitude of object size, and the parameter d_ij is the mean of the relative displacement (r_i−r_j). Storkey and Williams [48] set d_ij to zero, which favors undesirable positioning of children and parent nodes at the same locations. From our experiments, this may seriously degrade the image-modeling capabilities of irregular trees, and as such some nonzero relative displacement d_ij needs to be accounted for. For roots i, we have P(r_i|z_i0=1) ≜ (1/(2π|Σ_i|^{1/2})) exp(−(1/2)(r_i−d_i)^T Σ_i^{−1} (r_i−d_i)). The joint probability of R ≜ {r_i, ∀i∈V} is given by

P(R|Z) ≜ ∏_{i,j∈V} [P(r_i|r_j, z_ij)]^{z_ij} .   (2.3)

At the leaf level, V^0, we fix the node positions R^0 to the locations of the finest-scale observables, and then use P(Z, R'|R^0) as the prior over positions and connectivity, where R^0 ≜ {r_i, ∀i∈V^0}, and R' ≜ {r_i, ∀i∈V\V^0}. Next, each node i is characterized by an image-class label x_i and image-class indicator random variables x_i^k, such that x_i^k=1 if x_i=k, where k is a label taking values in the finite set M. Thus, we assume that the set M of unknown image classes is finite. The label k of node i is conditioned on the image class l of its parent j through the conditional probability tables P_ij^{kl}. For roots i, we have P(x_i^k|z_i0=1) ≜ P(x_i^k). Thus, the joint probability of X ≜ {x_i^k, ∀i∈V, ∀k∈M} is given by

P(X|Z) ≜ ∏_{i,j∈V} ∏_{k,l∈M} [P_ij^{kl}]^{x_i^k x_j^l z_ij} .   (2.4)

Finally, we introduce the nodes that are characterized by observable random vectors representing image texture and color cues. Here, we make a distinction between two types of irregular trees. The model where observables are present only at the leaf level is referred to as IT_V0; the model where observables are present at all levels is referred to as IT_V.
To clarify the difference between the two types of nodes in irregular trees, we index observables with respect to their locations in the data structure (e.g., wavelet dyadic squares), while hidden variables are indexed with respect to a node index in the graph. This generalizes the correspondence between hidden and observable random variables of the position-encoding dynamic trees [48]. We define the position of an observable, p(i), to be equal to the center of mass of the i-th dyadic square at level ℓ in the corresponding quad-tree with L levels:

p(i) ≜ [(n+0.5)2^ℓ, (m+0.5)2^ℓ]^T , ∀i ∈ V^ℓ, ℓ ∈ {0, ..., L−1}, n, m = 1, 2, ... ,   (2.5)

where n and m denote the row and column of the dyadic square at scale ℓ (e.g., for wavelet coefficients). Clearly, other application-dependent definitions of p(i) are possible. Note that while the r's are random vectors, the p's are deterministic values fixed at the locations where the corresponding observables are recorded in the image. Also, after fixing R^0 to the locations of the finest-scale observables, we have ∀i∈V^0, r_i=p(i). The definition given by Eq. (2.5) holds for IT_V0 as well, for ℓ=0. For both types of irregular trees, we assume that the observables Y ≜ {y_p(i), ∀i∈V} at locations p ≜ {p(i), ∀i∈V} are conditionally independent given the corresponding x_i^k:

P(Y|X, p) ≜ ∏_{i∈V} ∏_{k∈M} [P(y_p(i)|x_i^k=1, p(i))]^{x_i^k} ,   (2.6)

where for IT_V0, V^0 should be substituted for V. The likelihoods P(y_p(i)|x_i^k=1, p(i)) are modeled as mixtures of Gaussians: P(y_p(i)|x_i^k=1, p(i)) ≜ ∑_{g=1}^{G_k} π_k(g) N(y_p(i); ν_k(g), Σ_k(g)). For large G_k, a Gaussian-mixture density can approximate any probability density [56]. In order to avoid the risk of overfitting the model, we assume that the parameters of the Gaussian mixture are equal for all nodes. The Gaussian-mixture parameters can be grouped in the set θ ≜ {G_k, {π_k(g), ν_k(g), Σ_k(g)}_{g=1}^{G_k}, ∀k∈M}. Speaking in generative terms, for a given set of |V| nodes, first P(Z) is defined using Eq. (2.1) and P(R|Z) using Eq. (2.3), to give us P(Z, R).
We then impose the condition of fixing the leaf-level node positions to the locations of the finest-scale observables, p^0 ⊂ p, to obtain P(Z, R'|R^0=p^0). Combining Eq. (2.4) and Eq. (2.6) with P(Z, R'|R^0=p^0) results in the joint prior

P(Z, X, R', Y|R^0=p^0) = P(Y|X, p) P(X|Z) P(Z, R'|R^0=p^0) ,   (2.7)

which fully specifies the irregular tree. All the parameters of the joint prior can be grouped in the set Θ ≜ {γ_ij, d_ij, Σ_ij, P_ij^{kl}, θ}, ∀i,j∈V, ∀k,l∈M. As depicted in Figure 2-1, an irregular tree is a directed graph. The formalism of the graph-theoretic representation of irregular trees provides general algorithms for computing marginal and conditional probabilities of interest, which is discussed in the following section. 2.2 Probabilistic Inference Image interpretation, as discussed in Chapter 1, requires computation of the posterior probabilities of the hidden random variables Z, X, and R', given the observables Y and the leaf-node positions R^0. However, due to the complexity of irregular trees, exact probabilistic inference of P(Z, X, R'|Y, R^0) is infeasible. Therefore, we resort to approximate inference methods, which are divided into two broad classes: deterministic approximations and Monte Carlo methods [57, 58, 59, 60, 61]. Markov chain Monte Carlo (MCMC) methods allow for sampling of the posterior P(Z, X, R'|Y, R^0) through the construction of a Markov chain whose equilibrium distribution is the desired P(Z, X, R'|Y, R^0). Below, we report an experiment on two datasets of 4x4 and 8x8 binary images, samples of which are depicted in Fig. 2-2a, where we learned P(Z, X, R'|Y, R^0) for IT_V0 models through Gibbs sampling [62]. The observables y_p(i) were set to binary pixel values; the number of image classes was set to |M|=2; the number of components in the Gaussian mixture was set to G_k=1; and the maximum number of levels in the model was set to L=3 and L=4 for the 4x4 and 8x8 images, respectively.
The initial irregular-tree structure is a balanced quad-tree (TSBN), where the number of leaf-level nodes is equal to the number of pixels. One iteration of Gibbs sampling consists of sampling each variable, conditioned on the other variables in the irregular tree, until all the variables are sampled. We iterated this procedure until our convergence criterion was met, namely, until |P_{t+1}(Z, X, R'|Y, R^0) − P_t(Z, X, R'|Y, R^0)| / P_t(Z, X, R'|Y, R^0) < ε over consecutive iteration steps t, where ε=0.1 and ε=1 for the 4x4 and 8x8 images, respectively. For the dataset of 50 binary 4x4 images, on average more than 20,000 iteration steps were required for convergence, while for the 50 binary 8x8 images, more than 100,000 iterations were required. In Figs. 2-2b-c, we also illustrate the grouping of pixels in the learned irregular trees, while in Fig. 2-3, we depict the irregular tree learned for the 4x4 image in Fig. 2-2a. Figure 2-2: Pixel clustering using irregular trees learned by Gibbs sampling: (a) sample 4x4 and 8x8 binary images; (b) clustered leaf-level pixels that have the same parent at level 1; (c) clustered leaf-level pixels that have the same grandparent at level 2; clusters are indicated by different shades of gray; the point in each group marks the position of the parent node. Figure 2-3: Irregular tree learned for the 4x4 image in Fig. 2-2a, after 20,032 iterations of Gibbs sampling; nodes are depicted in line, representing 4, 2 and 1 actual rows of levels 0, 1 and 2, respectively; nodes are drawn as pie-charts representing P(x_i^k = 1), k ∈ {0, 1}; note that there are two root nodes for the two distinct objects in the image. From the experimental results, we infer that irregular trees learned through Gibbs sampling are capable of capturing important structural information about image regions at various scales.
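The sweep structure of such a sampler, each variable drawn in turn conditioned on the current values of the others, can be sketched on a toy model of two coupled binary variables; the joint table is hypothetical, and this is not the irregular-tree sampler itself:

```python
import random

def gibbs_pair(p_joint, n_sweeps=20000, seed=0):
    """Gibbs sampling for two binary variables a, b with joint p_joint[a][b].

    One sweep resamples a | b and then b | a; the empirical visit
    frequencies approach the joint distribution as sweeps accumulate."""
    rng = random.Random(seed)
    a = b = 0
    counts = [[0, 0], [0, 0]]
    for _ in range(n_sweeps):
        p_a1 = p_joint[1][b] / (p_joint[0][b] + p_joint[1][b])  # P(a=1 | b)
        a = 1 if rng.random() < p_a1 else 0
        p_b1 = p_joint[a][1] / (p_joint[a][0] + p_joint[a][1])  # P(b=1 | a)
        b = 1 if rng.random() < p_b1 else 0
        counts[a][b] += 1
    return [[c / n_sweeps for c in row] for row in counts]
```

With p_joint = [[0.4, 0.1], [0.1, 0.4]], the empirical frequencies converge toward the joint table; the stronger the coupling between the variables, the more slowly the chain mixes, which is the slow-convergence effect observed above on larger images.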
Generally, however, in MCMC approaches, with increasing model complexity the choice of proposals in the Markov chain becomes hard, so that the equilibrium distribution is reached very slowly [63, 57]. Hence, in order to achieve faster inference, we resort to variational approximation, a specific type of deterministic approximation [59, 64]. Variational-approximation methods have been demonstrated to give good and significantly faster results when compared to Gibbs sampling [46]. The proposed approaches range from a fully factorized approximating distribution over hidden variables [45] (a.k.a. mean-field variational approximation) to more structured solutions [48], where dependencies among hidden variables are enforced. The underlying assumption in those methods is that there are averaging phenomena in irregular trees that may render a given set of variables approximately independent of the rest of the network. Therefore, the resulting variational optimization of irregular trees provides for principled solutions, while reducing computational complexity. In the following section, we derive a novel Structured Variational Approximation (SVA) algorithm for the irregular-tree model defined in Section 2.1. 2.3 Structured Variational Approximation In variational approximation, the intractable distribution P(Z, X, R'|Y, R^0) is approximated by a simpler distribution Q(Z, X, R'|Y, R^0) closest to P(Z, X, R'|Y, R^0). To simplify notation, below we omit the conditioning on Y and R^0, and write Q(Z, X, R'). The novelty of our approach is that we constrain the variational distribution to the form

Q(Z, X, R') ≜ Q(Z) Q(X|Z) Q(R'|Z) ,   (2.8)

which enforces that both the class-indicator variables X and the position variables R' are statistically dependent on the tree connectivity Z. Since these dependencies are significant in the prior, one should expect them to remain so in the posterior.
Therefore, our formulation appears to be more appropriate for approximating the true posterior than the mean-field variational approximation Q(Z, X, R')=Q(Z)Q(X)Q(R') discussed by Adams et al. [45], and the form Q(Z, X, R')=Q(Z)Q(X|Z)Q(R') proposed by Storkey and Williams [48]. We define the approximating distributions as follows:

Q(Z) ≜ ∏_{ℓ=0}^{L−1} ∏_{(i,j)∈V^ℓ×{0, V^{ℓ+1}}} [ξ_ij]^{z_ij} ,   (2.9)

Q(X|Z) ≜ ∏_{i,j∈V} ∏_{k,l∈M} [Q_ij^{kl}]^{x_i^k x_j^l z_ij} ,   (2.10)

Q(R'|Z) ≜ ∏_{i,j∈V} [Q(r_i|z_ij=1)]^{z_ij} = ∏_{i,j∈V} [(1/(2π|Ω_ij|^{1/2})) exp(−(1/2)(r_i−μ_ij)^T Ω_ij^{−1} (r_i−μ_ij))]^{z_ij} ,   (2.11)

where the parameters ξ_ij correspond to the connection probabilities, and the Q_ij^{kl} are analogous to the P_ij^{kl} conditional probability tables. For the parameters of Q(R'|Z), note that the covariances Ω_ij and mean values μ_ij form the set of Gaussian parameters for a given node i∈V^ℓ over its candidate parents j∈V^{ℓ+1}. Which pair of parameters (μ_ij, Ω_ij) is used to generate r_i is conditioned on the given connection between i and j, that is, on the current realization of Z. Furthermore, we assume that the Ω's are diagonal matrices, such that node positions along the "x" and "y" image axes are uncorrelated. Also, for roots, suitable forms of the Q functions are used, similar to the specifications given in Section 2.1. To find the Q(Z, X, R') closest to P(Z, X, R'|Y, R^0), we resort to a standard optimization method, where the Kullback-Leibler (KL) divergence between Q(Z, X, R') and P(Z, X, R'|Y, R^0) is minimized ( [65], ch. 2, pp. 12-49, and ch. 16, pp. 482-509). The KL divergence is given by

KL(Q||P) ≜ ∑_{Z,X} ∫ dR' Q(Z, X, R') log [Q(Z, X, R') / P(Z, X, R'|Y, R^0)] .   (2.12)

It is well known that KL(Q||P) is non-negative for any two distributions Q and P, and KL(Q||P)=0 if and only if Q=P; these properties are a direct corollary of Jensen's inequality ( [65], ch. 2, pp. 12-49). As such, minimizing KL(Q||P) guarantees a global minimum, that is, a unique solution for Q(Z, X, R'). By minimizing the KL divergence, we derive the update equations for estimating the parameters of the variational distribution Q(Z, X, R'). Below, we summarize the final derivation results.
Detailed derivation steps are reported in Appendix A, where we also provide a list of nomenclature. In the following equations, we use κ to denote an arbitrary normalization constant, the definition of which may change from equation to equation. Parameters on the right-hand side of the update equations are assumed known, as learned in the previous iteration step. 2.3.1 Optimization of Q(X|Z) Q(X|Z) is fully characterized by the parameters Q_ij^{kl}, which are updated as

Q_ij^{kl} = κ P_ij^{kl} λ_i^k , ∀i,j∈V, ∀k,l∈M ,   (2.13)

where the auxiliary parameters λ_i^k are computed as

λ_i^k ≜ P(y_p(i)|x_i^k, p(i)), ∀i∈V^0 ; λ_i^k ≜ ∏_{c∈V} [∑_{a∈M} P_ci^{ak} λ_c^a]^{ξ_ci}, ∀i∈V', ∀k∈M ,   (2.14a)

λ_i^k ≜ P(y_p(i)|x_i^k, p(i)) ∏_{c∈V} [∑_{a∈M} P_ci^{ak} λ_c^a]^{ξ_ci} , ∀i∈V, ∀k∈M ,   (2.14b)

where Eq. (2.14a) is derived for IT_V0, and Eq. (2.14b) for IT_V. Since the ξ_ci are nonzero only for child-parent pairs, from Eq. (2.14) we note that the λ's are computed for both models by propagating the λ messages of the corresponding children nodes upward. Thus, the Q's, given by Eq. (2.13), can be updated by making a single pass up the tree. Also, note that for leaf nodes, i∈V^0, the ξ_ci parameters are equal to 0 by definition, yielding λ_i^k = P(y_p(i)|x_i^k, p(i)) in Eq. (2.14b). Further, from Eqs. (2.9) and (2.10), we derive the update equation for the approximate posterior probability m_i^k that node i is assigned to image class k, given Y and R^0, as

m_i^k ≜ ∑_{Z,X} ∫ dR' x_i^k Q(Z, X, R') = ∑_{j∈V} ξ_ij ∑_{l∈M} Q_ij^{kl} m_j^l , ∀i∈V, ∀k∈M .   (2.15)

Note that the m_i^k can be computed by propagating image-class probabilities in a single pass downward. This upward-downward propagation, specified by Eqs. (2.14) and (2.15), is very reminiscent of belief propagation for TSBNs [36, 31]. For the special case when ξ_ij=1 for only one parent j, we obtain the standard λπ rules of Pearl's message-passing scheme for TSBNs. 2.3.2 Optimization of Q(R'|Z) Q(R'|Z) is fully characterized by the parameters μ_ij and Ω_ij.
The update equation for μ_ij, ∀(i,j)∈V^ℓ×{0, V^{ℓ+1}}, ℓ>0, is given by

μ_ij = Ω_ij [Σ_ij^{−1} (μ̄_j + d_ij) + ∑_{c∈V^{ℓ−1}} ξ_ci Σ_ci^{−1} (μ̄_c − d_ci)] ,   (2.16)

where μ̄_j ≜ ∑_p ξ_jp μ_jp denotes the current expected position of node j, and c and p denote the children and grandparents of node i, respectively. Further, for all node pairs (i,j)∈V^ℓ×{0, V^{ℓ+1}}, ℓ>0, where Ω_ij≠0, Ω_ij is updated as

Ω_ij = [Σ_ij^{−1} + ∑_{c∈V^{ℓ−1}} ξ_ci Σ_ci^{−1}]^{−1} ,   (2.17)

up to trace-correction terms of the form Tr{Σ^{−1}Ω}, which account for the positional uncertainty of i's children and grandparents; the exact elementwise expressions are given in Appendix A. Here, once again, c and p denote the children and grandparents of node i, respectively. Since the Ω's and Σ's are assumed diagonal, it is straightforward to derive the expressions for the diagonal elements of the Ω's from Eq. (2.17). Note that both μ_ij and Ω_ij are updated by summing over the children and grandparents of i, and, therefore, must be iterated until convergence. 2.3.3 Optimization of Q(Z) Q(Z) is fully characterized by the connectivity probabilities ξ_ij, which are computed as

ξ_ij = κ exp(A_ij + B_ij) , ∀ℓ, ∀(i,j)∈V^ℓ×{0, V^{ℓ+1}} ,   (2.18)

where A_ij represents the influence of the observables Y, while B_ij represents the contribution of the geometric properties of the network to the connectivity distribution. Both terms are defined in Appendix A. 2.4 Inference Algorithm and Bayesian Estimation For the given set of parameters characterizing the joint prior, the observables Y, and the leaf-level node positions R^0, standard Bayesian estimation of the optimal Z, X, and R' requires minimizing the expectation of a cost function C:

(Ẑ, X̂, R̂') = argmin_{Z,X,R'} E{C((Z, X, R'), (Z*, X*, R'*)) | Y, R^0, Θ} ,   (2.19)

where C(·) penalizes the discrepancy between the estimated configuration (Z, X, R') and the true one, (Z*, X*, R'*). We propose the following cost function:

C((Z, X, R'), (Z*, X*, R'*)) ≜ ∑_{i,j∈V} [1−δ(z_ij−z*_ij)] + ∑_{i∈V} ∑_{k∈M} [1−δ(x_i^k−x_i^{k*})] + ∑_{i∈V'} [1−δ(r_i−r_i*)] ,   (2.20)

where * indicates true values, and δ(·) is the Kronecker delta function. Using the variational approximation P(Z, X, R'|Y, R^0) ≈ Q(Z)Q(X|Z)Q(R'|Z), from Eqs.
(2.19) and (2.20), we derive:

Ẑ = argmin_Z ∑_{Z*} Q(Z*) ∑_{(i,j)∈V^ℓ×{0,V^{ℓ+1}}} [1−δ(z_ij−z*_ij)] ,   (2.21)

X̂ = argmin_X ∑_{Z*,X*} Q(Z*)Q(X*|Z*) ∑_{i∈V} ∑_{k∈M} [1−δ(x_i^k−x_i^{k*})] ,   (2.22)

R̂' = argmin_{R'} ∑_{Z*} ∫ dR'* Q(Z*)Q(R'*|Z*) ∑_{i∈V'} [1−δ(r_i−r_i*)] .   (2.23)

Given the constraints on connections discussed in Section 2.1, the minimization in Eq. (2.21) is equivalent to finding parents:

(∀ℓ)(∀i∈V^ℓ)(Ẑ_{·i}≠0) ĵ = argmax_{j∈{0,V^{ℓ+1}}} ξ_ij , for IT_V0 ,   (2.24a)

(∀ℓ)(∀i∈V^ℓ) ĵ = argmax_{j∈{0,V^{ℓ+1}}} ξ_ij , for IT_V ,   (2.24b)

where ξ_ij is given by Eq. (2.18); Z_{·i} denotes the i-th column of Z, and Z_{·i}≠0 indicates that there is at least one nonzero element in column Z_{·i}, that is, i has children, and is thereby included in the tree structure. Note that, due to the distribution over connections, after estimation of Z for a given image, some nodes may remain without children. To preserve the generative property in IT_V0, we impose an additional constraint on Z: nodes above the leaf level must have children in order to be able to connect to upper levels. On the other hand, in IT_V, due to the multilayered observables, all nodes V must be included in the tree structure, even if they do not have children. The global solution to Eq. (2.24a) is an open problem in many research areas. Therefore, for IT_V0, we propose a stagewise optimization, where, as we move upwards, starting from the leaf level, ℓ = 0, 1, ..., L−1, we include in the tree structure the optimal parents at V^{ℓ+1} according to

(∀i∈V^ℓ)(Ẑ_{·i}≠0) ĵ = argmax_{j∈{0,V^{ℓ+1}}} ξ_ij ,   (2.25)

where Ẑ_{·i} denotes the i-th column of the estimated Z, and Ẑ_{·i}≠0 indicates that i has already been included in the tree structure when optimizing the previous level, V^{ℓ−1}. Next, from Eq. (2.22), the resulting Bayesian estimator of image-class labels, denoted as x̂_i, is

(∀i∈V) x̂_i = argmax_{k∈M} m_i^k ,   (2.26)

where the approximate posterior probability m_i^k that image class k is assigned to node i is given by Eq. (2.15). Finally, from Eq.
(2.23), the optimal node positions are estimated as

(∀ℓ>0)(∀i∈V^ℓ) r̂_i = argmax_{r_i} ∑_Z Q(r_i|Z)Q(Z) = ∑_{j∈{0,V^{ℓ+1}}} ξ_ij μ_ij ,   (2.27)

where μ_ij and ξ_ij are given by Eqs. (2.16) and (2.18), respectively. The inference algorithm for irregular trees is summarized in Fig. 2-4. The specified ordering of the parameter updates for Q(Z), Q(X|Z), and Q(R'|Z) in Fig. 2-4, steps (4)-(10), is arbitrary; theoretically, other orderings are equally valid. 2.5 Learning Parameters of the Irregular Tree with Random Node Positions Variational inference presumes that the model parameters Θ = {γ_ij, d_ij, Σ_ij, P_ij^{kl}, θ}, ∀i,j∈V, ∀k,l∈M, and V, L, M, are available. These parameters can be learned offline through standard Maximum-Likelihood (ML) optimization. Usually, for ML optimization, it is assumed that N independently generated training images, with observables {Y^n}_{n=1}^N and latent variables {(Z^n, X^n, R'^n)}_{n=1}^N, are given. However, for multiscale generative models, in general, neither the true image-class labels for nodes at higher levels nor their dynamic connections are given. Therefore, the configurations {(Z^n, X^n, R'^n)} must be estimated from the training images. To this end, we propose an iterative learning procedure. In initialization, we first set L = log_4|V^0| + 1, where |V^0| is equal to the size of a given image. The number of image classes M is also assumed known. Next, due to the huge diversity of possible configurations of objects in images, for each node i∈V^ℓ, we initialize γ_ij to be uniform over i's candidate parents, ∀j∈{0, V^{ℓ+1}}. Then, for all pairs (i,j)∈V^ℓ×V^{ℓ+1} at level ℓ, we set d_ij = p(i)−p(j); namely, the d_ij are initialized to the relative displacement of the centers of mass of the i-th and j-th dyadic squares in the corresponding quad-tree with L levels, specified in Eq. (2.5). For roots i, we have d_i = p(i). Also, we set the diagonal elements of Σ_ij to the diagonal elements Inference Algorithm. Assume that V, L, M, Θ, N_s, ε, and ε_r are given.
(1) Initialization: t=0; t_in=0; (∀i,j∈V)(∀k,l∈M) ξ_ij(0)=γ_ij; Q_ij^{kl}(0)=P_ij^{kl}; μ_ij(0) = node locations in the corresponding quad-tree; the diagonal elements of Ω_ij(0) are set to the area of the dyadic squares in the corresponding quad-tree;
(2) repeat {outer loop}
(3) t=t+1;
(4) Compute, in a bottom-up pass for ℓ=0, 1, ..., L−1, ∀i∈V^ℓ, ∀k∈M: λ_i^k(t), given by Eq. (2.14); Q_ij^{kl}(t), given by Eq. (2.13);
(5) Compute, in a top-down pass for ℓ=L−1, L−2, ..., 0, ∀i∈V^ℓ, ∀k∈M: m_i^k(t), given by Eq. (2.15);
(6) repeat {inner loop}
(7) t_in = t_in + 1;
(8) Compute, ∀i,j∈V': μ_ij(t_in), given by Eq. (2.16); Ω_ij(t_in), given by Eq. (2.17);
(9) until |μ_ij(t_in)−μ_ij(t_in−1)|/|μ_ij(t_in−1)| < ε_r;
(10) Compute, ∀i,j∈V': ξ_ij(t), given by Eq. (2.18);
(11) until |Q(Z, X, R'; t)−Q(Z, X, R'; t−1)|/Q(Z, X, R'; t−1) < ε for N_s consecutive iteration steps;
(12) Estimation of Z: compute, in a bottom-up pass for ℓ=0, 1, ..., L−1, for IT_V0: (∀i∈V^ℓ)(Ẑ_{·i}≠0) ĵ = argmax_{j∈{0,V^{ℓ+1}}} ξ_ij(t); for IT_V: (∀i∈V^ℓ) ĵ = argmax_{j∈{0,V^{ℓ+1}}} ξ_ij(t);
(13) Estimation of X: compute (∀i∈V) x̂_i = argmax_{k∈M} m_i^k(t);
(14) Estimation of R': compute (∀ℓ>0)(∀i∈V^ℓ) r̂_i = ∑_{j∈{0,V^{ℓ+1}}} ξ_ij(t) μ_ij(t);
Figure 2-4: Inference of the irregular tree given Y, R^0, and Θ; t and t_in are counters of the outer and inner loops, respectively; N_s, ε, and ε_r control the convergence criteria for the two loops.
of the matrix d_ij d_ij^T. The number of components G_k in the Gaussian mixture for each class k is set to G_k=3, which is empirically validated to be appropriate. The other parameters of the Gaussian mixture, θ, are estimated by using the EM algorithm [52, 56] on the hand-labeled training images. Finally, the conditional probability tables P_ij^{kl} are initialized to be uniform over the possible image classes. After the initialization of Θ, we run an iterative learning procedure, where in step t we conduct SVA inference of the irregular tree on the training images, as explained in the previous section. After inference of the posterior probability m_i^{k,n} that class k is assigned to node i, given by Eq. (2.15), and the posterior connectivity probability ξ_ij^n, given by Eq.
(2.18), on all training images, n 1, ..., N, we update only P' and as N P l (t+1l)  k";n(t) ,(2.28) n=1 N ,(t+1) = (E(t). (2.29) n=l Other parameters in O(t+l)= { .(t+l), dij, Eij, PiS(t+l), 0}, are fixed to their initial val ues. In the next iteration step, we use O(t+1) for SVA inference of the irregular tree on the training images. We assume that the learning algorithm converged when P (t+1) PM () where e > 0 is a prespecified parameter. 2.6 Implementation Issues In this section, we list algorithmrelated details that are necessary for the experimental results, presented in Chapter 6, to be reproducible. First, direct implementation of Eq. (2.13) would result in numerical underflow. There fore, we introduce the following scaling procedure: k A , VicV, VkeM, (2.30) Si Si A (2.31) keM Substituting the scaled A's into Eq. (2.13), we obtain vkl \k vkl \k kl pki k ki p k p aEM Pl aM P"Ia In other words, computation of Qfk does not change when the scaled A's are used. Second, to reduce computational complexity, we consider, for each node i, only the 7x7 box encompassing parent nodes j that neighbor the parent of the corresponding quadtree. Consequently, the number of possible children nodes c of i is also limited. Our experiments show that the omitted nodes, either children or parents, contribute negligibly to the update equations. Thus, we limit overall computational cost as the number of nodes increases. 26 Finally, the convergence criterion of the inner loop, where tij and Qij are computed, is controlled by parameter e,. When e =0.01, the average number of iteration steps, tin, in the inner loop, is from 3 to 5, depending on the image size, where the latter is obtained for 128x 128 images. The convergence criterion of the outer loop is controlled by parameters N, and e. Simplifications that we use in practice may lead to suboptimal solutions of SVA. From our experience, though, the algorithm recovers from unstable stationary points for sufficiently large N. 
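The scaling procedure of Eqs. (2.30)-(2.31) can be sketched as follows. This is a toy illustration, not the dissertation's implementation; the array names and values are hypothetical, chosen only to show why unscaled products would underflow and why the scaling leaves the ratios entering Eq. (2.13) unchanged.

```python
import numpy as np

def scale_lambdas(lam):
    """Normalize each node's lambda vector to sum to one (Eqs. 2.30-2.31).

    lam: array of shape (num_nodes, num_classes) with hypothetical values.
    Returns the scaled lambdas and the per-node scale factors s_i.
    """
    s = lam.sum(axis=1, keepdims=True)  # s_i = sum_k lambda_i^k
    return lam / s, s

# Toy example: two nodes, three classes, with magnitudes small enough
# that repeated products across levels would underflow without scaling.
lam = np.array([[1e-20, 2e-20, 1e-20],
                [4e-21, 1e-21, 5e-21]])
lam_hat, s = scale_lambdas(lam)

# Ratios lambda_i^k / sum_a lambda_i^a are invariant under the scaling,
# so quantities such as Q_ij^{kl} in Eq. (2.13) are unchanged.
ratio_raw = lam[0] / lam[0].sum()
ratio_scaled = lam_hat[0] / lam_hat[0].sum()
assert np.allclose(ratio_raw, ratio_scaled)
assert np.allclose(lam_hat.sum(axis=1), 1.0)
```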
In our experiments, we set N_s=10 and ε=0.01. After the inference algorithm (Fig. 2-4) has converged, we estimate the values of the hidden variables (Z, X, R') for a given image, thereby conducting image interpretation.

CHAPTER 3
IRREGULAR TREES WITH FIXED NODE POSITIONS

In the previous chapter, two architectures of the irregular tree are presented, which are fully characterized by the following joint prior:

P(Z, X, R', Y | R^0) = P(Y|X) P(X|Z) P(Z, R'|R^0).

As discussed in Section 2.2, the inference of the posterior distribution P(Z, X, R'|Y, R^0) is intractable, due to the complexity of the model. The node-position variables, R', are the main culprit necessitating approximate inference. On the other hand, the R' are very useful, because they constrain possible network configurations. In order to avoid approximate inference, in this chapter we introduce yet another architecture of the irregular tree, where the R' are eliminated, and where the constraints on the tree structure are modeled directly in the distribution of the connectivity Z.

3.1 Model Specification

Similar to the model specification in the previous chapter, we introduce two architectures: one with observables only at the leaf level, and the other with observables propagated to higher levels. The main difference from the architectures IT_V and IT_V0 is that node positions are identical to those of the quadtree. Therefore, we refer to the architectures presented in this chapter as irregular quadtrees, IQT_V and IQT_V0. The irregular quadtree is a directed acyclic graph with nodes in a set V, organized in hierarchical levels, V^ℓ, ℓ={0,1,...,L}, where V^0 denotes the leaf level. The layout of nodes is identical to that of the quadtree, modeling, for example, the dyadic pyramid of wavelet coefficients, such that the number of nodes at level ℓ can be computed as |V^ℓ| = |V^{ℓ−1}|/4 = ... = |V^0|/4^ℓ. Unlike for position-encoding dynamic trees [48], we assume that nodes are fixed at the locations of the corresponding quadtree.
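The quadtree layout just described can be sketched numerically: the counts |V^ℓ| = |V^0|/4^ℓ follow directly from halving the image side at each level. A minimal sketch (the function name is hypothetical):

```python
import math

def quadtree_level_sizes(image_side):
    """Node counts |V^l| = |V^0| / 4^l for a dyadic quadtree over an
    image_side x image_side image (image_side a power of two).
    Level 0 is the leaf (pixel) level; the top level has a single node."""
    num_leaves = image_side * image_side
    num_levels = int(math.log2(image_side)) + 1
    return [num_leaves // 4**lvl for lvl in range(num_levels)]

sizes = quadtree_level_sizes(128)
# A 128x128 image yields 16384 leaves, and the node count is quartered
# at each level up to a single root node.
assert sizes[0] == 128 * 128
assert sizes[1] == sizes[0] // 4
assert sizes[-1] == 1
```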
Consequently, irregular model structure is achieved only through establishing arbitrary connections between nodes. Connections are established under the constraint that a node at level ℓ can become a root, or it can connect only to the nodes at the next level, ℓ+1. The network connectivity is represented by a random matrix, Z, where entry z_ij is an indicator random variable, such that z_ij=1 if i∈V^ℓ and j∈V^{ℓ+1} are connected. Z contains an additional zero ("root") column, where entries z_i0=1 if i is a root node. Each node can have only one parent, or can be a root. Note that, due to the distribution over connections, after estimation of Z for a given image, in IQT_V, some nodes may remain without children.

Each node i is characterized by an image-class random variable, x_i, which can take values in a finite class set M. Given Z, the label x_i of node i is conditioned on x_j of its parent j as P(x_i|x_j, z_ij=1). The joint probability of the image-class variables X={x_i}, ∀i∈V, is given by

P(X|Z) = Π_ℓ Π_{i∈V^ℓ} P(x_i | x_j, z_ij=1),   (3.1)

where for roots we use the priors P(x_i). We assume that the conditional probability tables P(x_i|x_j, z_ij=1) are equal for all the nodes at all levels, as in [33]. Such a unique conditional probability table is denoted as Φ. Next, we assume that the observables y_i are conditionally independent given the corresponding x_i:

P(Y|X) = Π_{i∈V} P(y_i|x_i),   (3.2)
P(y_i | x_i=k) = Σ_{g=1}^{G_k} π_k(g) N(y_i; ν_k(g), Σ_k(g)),   (3.3)

where for IQT_V0 we write V^0 instead of V in Eq. (3.2). P(y_i|x_i=k), k∈M, is modeled as a mixture of Gaussians. The Gaussian-mixture parameters can be grouped in θ = {π_k(g), ν_k(g), Σ_k(g), G_k}, ∀k∈M.

Finally, we specify the connectivity distribution. In the previous chapter, it is defined as the prior P(Z) = Π_{i,j} P(z_ij=1), and then the constraint on possible tree structures is imposed through introducing an additional set of random variables, namely, the random node positions R.
The main purpose of the R's is to provide a mechanism by which connections between close nodes are favored. That approach has two major disadvantages. First, the additional R variables render exact inference of the dynamic tree intractable, enforcing the use of approximate inference methods (variational approximation). Second, the decision whether nodes i and j should be connected is not informed by the actual values of x_i and x_j. To improve upon the model formulation of the previous chapter, we seek to eliminate the R's, and to incorporate the information on image-class labels and node positions in the connectivity distribution. We reason that connections between parents and children whose relative distance is small should be favored over those that are far apart. At the same time, we seek to establish a mechanism that groups nodes belonging to the same image class, and separates those assigned to different classes.

Let us first examine relative distances between nodes. Due to the symmetry of the node layout (equal to that of the quadtree), we divide the set of all candidate parents j into classes of equidistance from child i, as depicted in Fig. 3-1. We specify that relative distances can take integer values d_ij ∈ {0,1,2,...,d_i^max}, where d_i0 = 0 if i is a root. Note that the d_i^max values vary for different positions of i at one level, as well as for different levels to which i belongs. Given X, we specify the conditional connectivity distribution as

P(Z|X) = Π_{ℓ=0}^{L} Π_{(i,j)∈V^ℓ×{0,V^{ℓ+1}}} P(z_ij=1 | x_i, x_j),   (3.4)

P(z_ij=1 | x_i, x_j) = (1/K) · { p_i, if i is a root; p_i (1−p_i)^{d_ij}, if x_i=x_j; p_i (1−p_i)^{d_i^max − d_ij}, if x_i≠x_j },   (3.5)

subject to Σ_{j∈{0,V^{ℓ+1}}} P(z_ij=1 | x_i, x_j) = 1,   (3.6)

where K is a normalizing constant, and p_i is the parameter of the geometric distribution. From Eq. (3.5), we observe that when x_i=x_j, P(z_ij=1|x_i,x_j) decreases as d_ij becomes larger, while when x_i≠x_j, P(z_ij=1|x_i,x_j) increases for greater distances d_ij. Hence, the form of P(z_ij=1|x_i,x_j), given by Eq. (3.5), satisfies the aforementioned desirable properties.
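The connectivity distribution of Eqs. (3.5)-(3.6) can be sketched as follows. This is a hedged toy version: the function and its inputs are hypothetical, and the distinct-class branch uses the (d^max − d_ij) exponent consistent with the monotonicity stated above.

```python
import numpy as np

def connection_probs(x_i, x_parents, d, p, d_max):
    """Sketch of Eq. (3.5): unnormalized geometric weights that favor
    close parents of the same class and distant parents of a different
    class, normalized per child as in Eq. (3.6).

    x_i: child class label; x_parents: candidate-parent labels;
    d: relative distances d_ij; p: geometric parameter of the level."""
    d = np.asarray(d)
    same = (np.asarray(x_parents) == x_i)
    w = np.where(same, p * (1 - p)**d, p * (1 - p)**(d_max - d))
    return w / w.sum()  # normalizing constant K from Eq. (3.6)

probs = connection_probs(x_i=1, x_parents=[1, 1, 2], d=[0, 2, 1],
                         p=0.5, d_max=2)
assert np.isclose(probs.sum(), 1.0)
# A same-class parent at distance 0 is favored over one at distance 2.
assert probs[0] > probs[1]
```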
To avoid overfitting, we assume that p_i is equal for all nodes i at the same level. The parameters of P(Z|X) can be grouped in the parameter set Ψ = {p_i}, ∀i∈V.

Figure 3-1: Classes of candidate parents j that are characterized by a unique relative distance d_ij from child i.

All the introduced parameters of the model can be grouped in the parameter set Θ = {Φ, θ, Ψ}. In the next section we explain how to infer the optimal configuration of Z and X from the observed image data Y, provided that Θ is known.

3.2 Inference of the Irregular Tree with Fixed Node Positions

The standard Bayesian formulation of the inference problem consists in minimizing the expectation of some cost function C, given the data:

(Ẑ, X̂) = argmin_{Z,X} E{C((Z,X), (Z',X')) | Y, Θ},   (3.7)

where C penalizes the discrepancy between the estimated configuration (Z, X) and the true one (Z', X'). We propose the following cost function:

C((Z,X), (Z',X')) = C(X,X') + C(Z,Z')   (3.8)
  = Σ_{ℓ=0}^{L−1} Σ_{i∈V^ℓ} [1 − δ(x_i, x_i')] + Σ_{ℓ=0}^{L−1} Σ_{(i,j)∈V^ℓ×{0,V^{ℓ+1}}} [1 − δ(z_ij, z_ij')],   (3.9)

where ' stands for the true values, and δ(·,·) is the Kronecker delta function. From Eq. (3.9), the resulting Bayesian estimator of X is

∀i∈V,  x̂_i = argmax_{x_i∈M} P(x_i | Z, Y).   (3.10)

Next, given the constraints on connections in the irregular tree, we derive that minimizing E{C(Z,Z')|Y,Θ} is equivalent to finding a set of optimal parents ĵ such that

(∀ℓ)(∀i∈V^ℓ)(Z_.i≠0)  ĵ = argmax_{j∈{0,V^{ℓ+1}}} P(z_ij=1 | x_i, x_j), for IQT_V0,   (3.11a)
(∀ℓ)(∀i∈V^ℓ)  ĵ = argmax_{j∈{0,V^{ℓ+1}}} P(z_ij=1 | x_i, x_j), for IQT_V,   (3.11b)

where Z_.i is the i-th column of Z, and Z_.i≠0 represents the event "node i has children," that is, "node i is included in the irregular-tree structure." The global solution to Eq. (3.11a) is an open problem in many research areas.
We propose a stagewise optimization, where, as we move upwards, starting from the leaf level, ℓ = {0,1,...,L}, we include in the tree structure optimal parents at V^{ℓ+1} according to

(∀i∈V^ℓ)(Ẑ_.i≠0)  ĵ = argmax_{j∈{0,V^{ℓ+1}}} P(z_ij=1 | x̂_i, x_j),   (3.12)

where Ẑ_.i≠0 denotes an estimate that i has already been included in the tree structure when optimizing the previous level. By using the results in Eqs. (3.10) and (3.12), we specify the inference algorithm for the irregular quadtree, which is summarized in Fig. 3-2. In a recursive step t, we first assume that the estimate Ẑ(t−1) of the previous step t−1 is known, and then derive the estimate X̂(t) using Eq. (3.10); then, substituting X̂(t) into Eq. (3.12), we derive the estimate Ẑ(t). We consider the algorithm converged if P(Y, X̂|Ẑ) does not vary more than some threshold ε for N_s consecutive iteration steps t. In our experiments, we set ε = 0.01 and N_s = 10.

Steps 2 and 6 in the algorithm can be interpreted as inference of X given Y for a fixed-structure tree. In particular, for Step 2, where the initial structure is the quadtree, we can use the standard inference on quadtrees, where, essentially, belief messages are propagated in only two sweeps up and down the tree [33,29,31]. For Step 6, the irregular tree represents a forest of subtrees, which also have fixed, though irregular, structure; therefore, we can use the very same tree-inference algorithm for each of the subtrees. For completeness, in Appendix B, we present the two-pass maximum posterior marginal estimation of X proposed by Laferte et al. [33].

3.3 Learning Parameters of the Irregular Tree with Fixed Node Positions

Analogous to the learning algorithm discussed in the previous chapter, the parameters of the irregular tree with fixed node positions can be learned by using standard ML optimization. Here, we assume that N independently generated training images, with observables {Y^n}, n=1,...,N, are given.
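The stagewise parent selection of Eq. (3.12) can be sketched as follows. This is a simplified, hypothetical version: node ids, the `prob` lookup of P(z_ij=1 | x_i, x_j), and the "0 = become a root" convention are illustrative, not the dissertation's data structures.

```python
def stagewise_parents(levels, prob):
    """Bottom-up sketch of Eq. (3.12): each node kept in the structure
    picks the candidate parent (or 0, i.e., becomes a root) with the
    highest connection probability. `levels` lists node ids per level;
    `prob[(i, j)]` holds hypothetical values of P(z_ij = 1 | x_i, x_j)."""
    parent = {}
    kept = set(levels[0])            # all leaves are in the structure
    for lvl in range(len(levels) - 1):
        for i in levels[lvl]:
            if i not in kept:
                continue             # node left out of the tree (IQT_V0)
            candidates = [0] + levels[lvl + 1]
            j = max(candidates, key=lambda c: prob.get((i, c), 0.0))
            parent[i] = j
            if j != 0:
                kept.add(j)          # the chosen parent now has a child
    return parent

levels = [["a", "b"], ["p"]]
prob = {("a", "p"): 0.9, ("a", 0): 0.1, ("b", "p"): 0.2, ("b", 0): 0.8}
parents = stagewise_parents(levels, prob)
assert parents["a"] == "p"   # the likely parent wins
assert parents["b"] == 0     # "b" becomes a root
```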
As explained before, the configurations of the latent variables {(Z^n, X^n)} must be estimated.

Inference Algorithm
(1) t = 0; initialize the irregular-tree structure Z(0) to the quadtree;
(2) Compute ∀i∈V, x̂_i(0) = argmax_{x_i∈M} P(x_i | Z(0), Y);
(3) repeat
(4)   t = t + 1;
(5)   Compute in bottom-up pass, for ℓ=0,1,...,L:
        for IQT_V0: (∀i∈V^ℓ)(Ẑ_.i≠0)  ĵ = argmax_{j∈{0,V^{ℓ+1}}} P(z_ij=1 | x̂_i, x̂_j);
        for IQT_V: (∀i∈V^ℓ)  ĵ = argmax_{j∈{0,V^{ℓ+1}}} P(z_ij=1 | x̂_i, x̂_j);
(6)   Compute ∀i∈V, x̂_i(t) = argmax_{x_i∈M} P(x_i | Z(t), Y);
(7)   X̂ = X̂(t); Ẑ = Ẑ(t);
(8) until |P(Y, X̂(t)|Ẑ(t)) − P(Y, X̂(t−1)|Ẑ(t−1))| / P(Y, X̂(t−1)|Ẑ(t−1)) < ε for N_s consecutive iteration steps.

Figure 3-2: Inference of the irregular tree with fixed node positions, given the observables Y and the model parameters Θ.

To this end, we propose an iterative learning procedure, where in step t we first assume that Θ(t) = {Φ(t), θ(t), Ψ(t)} is given, and then conduct inference for each training image, n=1,...,N,

(Ẑ^n, X̂^n) = argmin_{Z,X} E{C((Z,X), (Z',X')) | Y^n, Θ(t)},

as explained in Section 3.2. Once the estimates {(Ẑ^n, X̂^n)} are found, we apply standard ML optimization to compute Θ(t+1). More specifically, suppose that, in learning step t, realizations of the random variables (Y^n, X^n, Z^n) are given for n=1,...,N. Then the parameters of the Gaussian-mixture distributions, in step t+1, can be computed using the standard EM algorithm [56]:

P(ω_c(g) | y_i, x_i=c) = π_c(g) N(y_i; ν_c(g), Σ_c(g)) / Σ_{g'} π_c(g') N(y_i; ν_c(g'), Σ_c(g')),   (3.13)
π_c(g) = (1/n_c) Σ_{i=1}^{n_c} P(ω_c(g) | y_i, x_i=c),   (3.14)
ν_c(g) = Σ_{i=1}^{n_c} y_i P(ω_c(g) | y_i, x_i=c) / Σ_{i=1}^{n_c} P(ω_c(g) | y_i, x_i=c),   (3.15)
Σ_c(g) = Σ_{i=1}^{n_c} (y_i − ν_c(g))(y_i − ν_c(g))^T P(ω_c(g) | y_i, x_i=c) / Σ_{i=1}^{n_c} P(ω_c(g) | y_i, x_i=c),   (3.16)

where n_c is the total number of nodes over the N training images that are classified as class c. To compute P(ω_c(g)|y_i, x_i=c) in Eq. (3.13), we use the Gaussian-mixture parameters from the previous learning step t. For all classes we set G_c=3.

Next, we explain how to learn the parameters of the connectivity distribution, Ψ(t+1) = {p_i(t+1)}, ∀i∈V, by using the ML principle:

Ψ(t+1) = argmax_Ψ Π_{n=1}^N P(Ẑ^n | X̂^n, Ψ).   (3.17)

Here, we consider two cases, for the IQT_V and IQT_V0 models. Recall that the parameters p_i are equal for all nodes i at the same level ℓ.
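The Gaussian-mixture updates of Eqs. (3.13)-(3.15) can be sketched as follows. This is a hedged one-dimensional toy version of the vector-valued case (the covariance update of Eq. (3.16) is analogous); all names and values are hypothetical.

```python
import numpy as np

def em_step(y, pi, mu, var):
    """One EM step for a 1-D Gaussian mixture, sketching Eqs. (3.13)-(3.15):
    responsibilities from the previous parameters, then weight and mean
    updates."""
    y = np.asarray(y, dtype=float)[:, None]          # shape (n, 1)
    dens = np.exp(-0.5 * (y - mu)**2 / var) / np.sqrt(2 * np.pi * var)
    resp = pi * dens
    resp /= resp.sum(axis=1, keepdims=True)          # Eq. (3.13)
    pi_new = resp.mean(axis=0)                       # Eq. (3.14)
    mu_new = (resp * y).sum(axis=0) / resp.sum(axis=0)  # Eq. (3.15)
    return pi_new, mu_new, resp

# Two well-separated toy clusters; one step already tracks both means.
y = [0.0, 0.1, 5.0, 5.1]
pi_new, mu_new, resp = em_step(y, pi=np.array([0.5, 0.5]),
                               mu=np.array([0.0, 5.0]),
                               var=np.array([1.0, 1.0]))
assert np.isclose(pi_new.sum(), 1.0)
assert mu_new[0] < 1.0 and mu_new[1] > 4.0
```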
Given the estimates {(Ẑ^n, X̂^n)}, for each training image, n=1,...,N, from Eqs. (3.5) and (3.17), we derive for IQT_V:

p_ℓ(t+1) = Σ_{n=1}^N |V^ℓ| / Σ_{n=1}^N Σ_{i∈V^ℓ} [1 + I(x̂_i = x̂_ĵ) d̂_iĵ + I(x̂_i ≠ x̂_ĵ)(d_i^max − d̂_iĵ)],   (3.18)

where I(·) is an indicator function, ĵ is the estimated parent of node i, and d̂_iĵ denotes the relative distance assigned to the estimated connection ẑ_iĵ=1. For IQT_V0, given the estimates {(Ẑ^n, X̂^n)}, for each training image, n=1,...,N, we consider only the set of nodes i∈V^ℓ included in the corresponding irregular tree, i.e., those with Ẑ^n_.i≠0. Thus, from Eqs. (3.5) and (3.17), we derive:

p_ℓ(t+1) = Σ_{n=1}^N Σ_{i∈V^ℓ} I(Ẑ^n_.i≠0) / Σ_{n=1}^N Σ_{i∈V^ℓ} I(Ẑ^n_.i≠0) [1 + I(x̂_i = x̂_ĵ) d̂_iĵ + I(x̂_i ≠ x̂_ĵ)(d_i^max − d̂_iĵ)],   (3.19)

where, again, I(·) is an indicator function, ĵ is the estimated parent of node i, and d̂_iĵ denotes the relative distance assigned to the estimated connection ẑ_iĵ=1.

Finally, to learn the conditional probability table Φ, we use the standard EM algorithm on fixed-structure trees, thoroughly discussed in [33]. Note that to obtain the estimates {(Ẑ^n, X̂^n)}, for each training image, n=1,...,N, in learning step t, we in fact have to conduct the MPM estimation, given in Appendix B in Fig. B-1. By using the already available P(x_i, x_j | Y^n, ẑ_ij=1) and P(x_j | Y^n), obtained for each image n as in Fig. B-1, we derive

Φ(t+1) = Σ_{n=1}^N Σ_{i∈V} P(x_i, x_j | Y^n, ẑ_ij=1) / Σ_{n=1}^N Σ_{i∈V} P(x_j | Y^n).   (3.20)

The overall learning procedure is summarized in Fig. 3-3.

Learning Algorithm
(1) t = 0; initialize Θ(0) = {Φ(0), θ(0), Ψ(0)};
(2) Estimate, for n=1,...,N, (Ẑ^n, X̂^n) = argmin_{Z,X} E{C((Z,X),(Z',X')) | Y^n, Θ(0)};
(3) repeat
(4)   t = t+1;
(5)   Compute: θ(t) as in Eqs. (3.13)-(3.16); p_ℓ(t), for IQT_V as in Eq. (3.18) and for IQT_V0 as in Eq. (3.19); Φ(t), as in Eq. (3.20);
(6)   Estimate, for n=1,...,N, (Ẑ^n, X̂^n) = argmin_{Z,X} E{C((Z,X),(Z',X')) | Y^n, Θ(t)}, using the inference algorithm in Fig. 3-2;
(7)   Θ* = Θ(t);
(8) until (∀n) |P(Y^n, X̂^n|Ẑ^n, Θ*) − P(Y^n, X̂^n|Ẑ^n, Θ(t−1))| / P(Y^n, X̂^n|Ẑ^n, Θ(t−1)) < ε.

Figure 3-3: Algorithm for learning the parameters of the irregular tree; for notational simplicity, in Step (8) we do not indicate the different estimates of (Ẑ^n, X̂^n) for Θ* and Θ(t−1).

Once Θ* is learned, we can localize, detect, and recognize objects in an image by conducting the inference algorithm presented in Fig. 3-2.

CHAPTER 4
COGNITIVE ANALYSIS OF OBJECT PARTS

Inference of the hidden variables (Z, X) can be viewed as building a forest of subtrees, each segmenting an image into arbitrary (not necessarily contiguous) regions, which we interpret as objects. Since each root determines a subtree whose leaf nodes form a detected object, we assign physical meaning to roots by assuming they represent whole objects. Moreover, each descendant of a root can be viewed as the root of another subtree, whose leaf nodes cover only a part of the object. Hence, we say that roots' descendants represent object parts at various scales.

Strategies for recognizing detected objects naturally arise from a particular interpretation of the tree/subtree structure. Below, we make a distinction between two such strategies. The analysis of image regions under the roots leads to the whole-object recognition strategy, while the analysis of image regions determined by roots' descendants constitutes the object-part recognition strategy. For both approaches, final recognition is conducted by majority voting over the MAP labels, x̂_i, of leaf nodes.¹ The reason for analyzing smaller image regions than those under the roots stems from our hypothesis that the information of fine-scale object details may prove critical for the recognition of an object as a whole in scenes with occlusions. To reduce the complexity of interpreting all detected object subparts, we propose to analyze the significance of object components (i.e., irregular-tree nodes) with respect to recognition of objects as a whole.
¹ The literature offers various strategies that outperform majority-voting classification (e.g., multiscale Bayesian classification [29], and multiscale Viterbi classification [32]); however, they do not account explicitly for occlusions, and, as such, do not significantly outperform majority voting for scenes with occluded objects.

4.1 Measuring Significance of Object Parts

We hypothesize that the significance of object parts with respect to object recognition depends on both local, innate object properties and global scene properties. While innate properties represent characteristic object features, which differentiate one object from another, global scene properties describe interdependencies of object parts in the overall image composition. It is necessary to account for both local and global cues, as the most conspicuous object component need not necessarily be the most significant for that object's recognition in the presence of alike objects. The analysis of innate object properties is handled through inference of the irregular tree, where, for a given image, we compute P(x_i|Z,Y), ∀i∈V, as explained in Chapters 2 and 3. To account for the influence of global scene properties, for each node i, we compute Shannon's entropy over the set of image classes, M, as

(∀i∈V)(Z_.i≠0)  H_i = − Σ_{x_i∈M} P(x_i|Z,Y) log P(x_i|Z,Y).   (4.1)

Since node i represents an object part, we define H_i as a measure of significance of that object part. Note that a node with small entropy is characterized by a "peaky" distribution P(x_i|Z,Y) with the maximum at, say, x_i = k ∈ M. This indicates that the error of classification will be small when i is labeled as class k. Recall that during inference, the belief message of i is propagated down the subtree in belief propagation [33], which is likely to render i's descendants with small entropies, as well.
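The entropy measure of Eq. (4.1) can be sketched directly; a minimal sketch with a hypothetical function name:

```python
import numpy as np

def part_entropy(posterior):
    """Shannon entropy of a node's posterior class distribution,
    Eq. (4.1); lower entropy marks a more significant object part."""
    p = np.asarray(posterior, dtype=float)
    p = p[p > 0]                      # 0 * log 0 is taken as 0
    return -(p * np.log(p)).sum()

peaky = [0.9, 0.05, 0.05]   # nearly certain classification
flat = [1/3, 1/3, 1/3]      # maximally uncertain
assert part_entropy(peaky) < part_entropy(flat)
assert np.isclose(part_entropy(flat), np.log(3))
```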
Thus, the classification error of the whole region of leaf nodes under i is likely to be small, when compared to some other image region under, say, node j such that H_j > H_i. Consequently, i is more "significant" for recognition of class k than node j. In brief, the most significant object part has the smallest entropy over all nodes in a given subtree T:

i* = argmin_{i∈T} H_i.   (4.2)

In Figs. 4-1 and 4-2, we illustrate the most significant object part under each root, where the entropy is computed over seven and six image classes, shown in Figs. 4-1(top) and 4-2(top), respectively. The experiment is conducted as explained in Chapter 2, using the irregular tree with random node positions and observables at all levels (IT_V).

Figure 4-1: For each subtree of IT_V, representing an object in the 128×128 image, a node i* is found with the smallest entropy for M = 6+1 = 7 possible image classes (top row). Bright pixels are descendants of i* at the leaf level and indicate the object part represented by i*.

Details on computing the observables Y in this experiment are explained in Chapter 5. Note that for different scenes, different object parts are established as the most significant with respect to the entropy measure.

4.2 Combining Object-Part Recognition Results

Once nodes are ranked with respect to the entropy measure, we are in a position to devise a criterion to optimally combine this information toward ultimate object recognition. Herewith, we propose a simple greedy algorithm, which, nonetheless, shows remarkable improvements in performance over the whole-object recognition approach. Under each root, we first select the descendant node with the smallest entropy. Each selected node determines a subtree, whose leaf nodes form an object part. Then, we conduct majority voting over these selected image regions. In the second round, we select under each root the descendant node with the smallest entropy, such that it does not belong to any of the subtrees selected in the first round.
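The round-by-round selection of non-overlapping, lowest-entropy parts can be condensed into a single greedy sweep. This is a simplified sketch (it flattens the per-root rounds into one pass over a hypothetical `parts` map), not the dissertation's exact procedure:

```python
def greedy_part_selection(parts):
    """Greedily pick the lowest-entropy parts whose pixel regions do not
    overlap anything already chosen; the chosen regions are then used
    for majority voting. `parts` maps a part id to a pair
    (entropy, set of leaf-pixel ids); all values are hypothetical."""
    selected, covered = [], set()
    for pid, (h, pixels) in sorted(parts.items(), key=lambda kv: kv[1][0]):
        if pixels & covered:
            continue                  # overlaps an earlier selection
        selected.append(pid)
        covered |= pixels
    return selected, covered

parts = {"arm": (0.2, {1, 2}), "torso": (0.5, {2, 3, 4}), "leg": (0.9, {5})}
chosen, covered = greedy_part_selection(parts)
assert chosen == ["arm", "leg"]   # "torso" overlaps the chosen "arm"
assert covered == {1, 2, 5}
```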
Now, these nodes determine new subtrees, whose leaf nodes form object parts that do not overlap with the image regions selected in the first round. Then, we conduct majority voting over the newly selected image regions. This procedure is repeated until we exhaustively cover all the pixels in the image. This stagewise majority voting over non-overlapping image regions constitutes the final step in the object-part recognition strategy (see Fig. 1-3).

Figure 4-2: For each subtree of IT_V, representing an object in the 256×256 image, a node i* is found with the smallest entropy for M = 5+1 = 6 possible image classes (top row). Bright pixels are descendants of i* at the leaf level and indicate the object part represented by i*; the images represent the same scene viewed from three different angles; the most significant object parts differ over the various scenes.

CHAPTER 5
FEATURE EXTRACTION

In Chapters 2 and 3, we have introduced four architectures of the irregular tree, referred to as IT_V, IT_V0, IQT_V, and IQT_V0. To compute the observable (feature) random vectors Y for these models, we account for both color and texture cues.

5.1 Texture

For the choice of texture-based features, we have considered several filtering, model-based, and statistical methods for texture feature extraction. Our conclusion complies with the comparative study of Randen and Husoy [66]: for problems with many textures with subtle spectral differences, as in the case of our complex classes, it is reasonable to assume that spectral decomposition by a filter bank yields consistently superior results over other texture-analysis methods. Our experimental results also suggest that it is crucial to analyze both local and regional properties of texture. As such, we employ the wavelet transform, due to its inherent representation of texture at different scales and locations.
5.1.1 Wavelet Transform

Wavelet atom functions, being well localized both in space and frequency, retrieve texture information quite successfully [67]. The conventional discrete wavelet transform (DWT) may be regarded as equivalent to filtering the input signal with a bank of bandpass filters, whose impulse responses are all given by scaled versions of a mother wavelet. The scaling factor between adjacent filters is 2:1, leading to octave bandwidths and center frequencies that are one octave apart. The octave-band DWT is most efficiently implemented by the dyadic wavelet decomposition tree of Mallat [68], where the wavelet coefficients of an image are obtained by convolving every row and column with the impulse responses of lowpass and highpass filters, as shown in Figure 5-1. Practically, the coefficients of one scale are obtained by convolving every second row and column from the previous, finer scale. Thus, the filter output is a wavelet subimage that has four times fewer coefficients than the one at the previous scale.

Figure 5-1: Two levels of the DWT of a two-dimensional signal.

Figure 5-2: The original image (left) and its two-scale dyadic DWT (right).

The lowpass filter is denoted H₀ and the highpass filter H₁. The wavelet coefficients W carry the index L for lowpass output and H for highpass output. Separable filtering of rows and columns produces four subimages at each level, which can be arranged as shown in Figure 5-2. The same figure also illustrates the directional selectivity of the DWT, because the W_LH, W_HL, and W_HH bandpass subimages can select horizontal, vertical, and diagonal edges, respectively.

5.1.2 Wavelet Properties

The following properties of the DWT have made wavelet-based image processing very attractive in recent years [67,30,69]:
1. locality: each wavelet coefficient represents local image content in space and frequency, because wavelets are well localized simultaneously in space and frequency;
2. multiresolution: the DWT represents an image at different scales of resolution in the space domain (i.e., in the frequency domain); regions of analysis at one scale are divided up into four smaller regions at the next finer scale (Fig. 5-2);
3. edge detection: edges of an image are represented by large wavelet coefficients at the corresponding locations;
4. energy compression: wavelet coefficients are large only if edges are present within the support of the wavelet, which means that the majority of wavelet coefficients have small values;
5. decorrelation: wavelet coefficients are approximately decorrelated, since the scaled and shifted wavelets form an orthonormal basis; dependencies among wavelet coefficients are predominantly local;
6. clustering: if a particular wavelet coefficient is large/small, then the adjacent coefficients are very likely to also be large/small;
7. persistence: large/small values of wavelet coefficients tend to propagate through scales;
8. non-Gaussian marginals: wavelet coefficients have peaky, long-tailed marginal distributions; due to the energy-compression property, only a few wavelet coefficients have large values; therefore, a Gaussian distribution for an individual coefficient is a poor statistical model.

It is also important to note the shortcomings of the DWT. Discrete wavelet decompositions suffer from two main problems, which hamper their use for many applications, as follows [70]:

1. lack of shift invariance: small shifts in the input signal can cause major variations in the energy distribution of wavelet coefficients;
2. poor directional selectivity: for some applications, horizontal, vertical, and diagonal selectivity is insufficient.

When we analyze the Fourier spectrum of a signal, we expect the energy in each frequency bin to be invariant to any shifts of the input.
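The lack of shift invariance is easy to demonstrate with a one-level decomposition. The sketch below uses orthonormal Haar filters for brevity (any real DWT behaves similarly): shifting a small feature by one sample changes how its energy splits between the subbands, even though the total energy is preserved.

```python
import numpy as np

def haar_level(x):
    """One level of the orthonormal Haar DWT of a 1-D signal:
    lowpass and highpass outputs, downsampled by two."""
    a, b = x[0::2], x[1::2]
    return (a + b) / np.sqrt(2), (a - b) / np.sqrt(2)

x = np.array([0, 0, 1, 1, 0, 0, 0, 0], dtype=float)  # a small step feature
lo1, hi1 = haar_level(x)
lo2, hi2 = haar_level(np.roll(x, 1))                 # same feature, shifted

# Total energy is preserved (orthonormal filters) ...
assert np.isclose((lo1**2).sum() + (hi1**2).sum(), (x**2).sum())
# ... but its split between the subbands depends on the shift:
assert not np.isclose((hi1**2).sum(), (hi2**2).sum())
```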
Unfortunately, the DWT has the significant drawback that the energy distribution between the various wavelet scales depends critically on the position of key features of the input signal, whereas ideally it would depend on just the features themselves. Therefore, the real DWT is unlikely to give consistent results when used in texture analysis.

Figure 5-3: The Q-shift Dual-Tree CWT.

In the literature, there are several approaches proposed to overcome this problem (e.g., Discrete Wavelet Frames [67,71]), all increasing the computational load with inevitable redundancy in the wavelet domain. In our opinion, the Complex Wavelet Transform (CWT) offers the best solution, providing additional advantages, described in the following subsection.

5.1.3 Complex Wavelet Transform

The structure of the CWT is the same as in Figure 5-1, except that the CWT filters have complex coefficients and generate complex output. The output sampling rates are unchanged from the DWT, but each wavelet coefficient contains a real and an imaginary part; thus, a redundancy of 2:1 is introduced for one-dimensional signals. In our case, for two-dimensional signals, the redundancy becomes 4:1, because two adjacent quadrants of the spectrum are required to fully represent a real two-dimensional signal, adding an extra 2:1 factor. This is achieved by additional filtering with complex conjugates of either the row or the column filters [70].

Despite its higher computational cost, we prefer the CWT over the DWT because of the CWT's following attractive properties. The CWT is shown to possess almost shift and rotational invariance, given suitably designed biorthogonal or orthogonal wavelet filters.

Table 5-1: Coefficients of the filters H13 (13 taps), H19 (19 taps), and H6 (6 taps) used in the Q-shift DT-CWT. (The coefficient values are not reproduced here; see [72].)
Figure 5-4: The CWT is strongly oriented at angles ±15°, ±45°, ±75°.

We implement the Q-shift Dual-Tree CWT scheme, proposed by Kingsbury [72], as depicted in Figure 5-3. For clarity, the figure shows the CWT of a one-dimensional signal x only. The outputs of trees a and b can be viewed as the real and imaginary parts of the complex wavelet coefficients, respectively. Thus, to compute the CWT, we implement two real DWTs (see Fig. 5-1), obtaining a wavelet frame with redundancy two. As for the DWT, lowpass and highpass filters are denoted with 0 and 1 in the index, respectively. Level 0 comprises the odd-length filters H_0a(z) = H_0b(z) = H13(z) (13 taps) and H_1a(z) = H_1b(z) = H19(z) (19 taps). The levels above level 0 consist of the even-length filters H_00a(z) = z⁻¹H6(z⁻¹), H_01a(z) = H6(z), H_00b(z) = H6(z), H_01b(z) = z⁻¹H6(z⁻¹), where the impulse responses of the filters H13, H19, and H6 are given in Table 5-1.

Aside from being shift invariant, the CWT is superior to the DWT in terms of directional selectivity, too. A two-dimensional CWT produces six bandpass subimages (analogous to the three subimages of the DWT) of complex coefficients at each level, which are strongly oriented at angles of ±15°, ±45°, ±75°, as illustrated in Figure 5-4.

Another advantageous property of the CWT emerges in the presence of noise. The phase and magnitude of the complex wavelet coefficients collaborate in a nontrivial way to describe the data [70]. The phase encodes the coherent (in space and scale) structure of an image, which is resilient to noise, while the magnitude captures the strength of local information, which can be very susceptible to noise corruption. Hence, the phase of complex wavelet coefficients might be used as a principal clue for image denoising.
However, our experimental results have shown that phase is not a good feature choice for sky/ground modeling. Therefore, we consider only magnitudes. In summary, for texture analysis in IT_V and IQT_V, we choose the complex wavelet transform (CWT) applied to the intensity (grayscale) image, due to its shift-invariant representation of texture at different scales, orientations, and locations.

5.1.4 Difference-of-Gaussian Texture Extraction

In IT_V0 and IQT_V0, observables are present only at the leaf level. Therefore, for these models, multiscale texture extraction is superfluous. Here, we compute the difference-of-Gaussian function convolved with the image as

D(x, y, k, σ) = (G(x, y, kσ) − G(x, y, σ)) * I(x, y),   (5.1)

where x and y represent pixel coordinates, G(x, y, σ) = exp(−(x² + y²)/2σ²)/2πσ², and I(x, y) is the intensity image. In addition to reduced computational complexity, as compared to the CWT, the function D provides a close approximation to the scale-normalized Laplacian of Gaussian, σ²∇²G, which has been shown to produce the most stable image features across scales when compared to a range of other possible image functions, such as the gradient and the Hessian [73,74]. We compute D(x, y, k, σ) for three scales, k = √2, 2, √8, and σ = 2.

5.2 Color

The color information in a video signal is usually encoded in the RGB color space. For color features, in all models, we choose the generalized RGB color space: r = R/(R+G+B) and g = G/(R+G+B), which effectively suppresses variations in brightness. For IT_V and IQT_V, the y's of higher-level nodes are computed as the mean of the r's and g's of their children nodes in the initial quadtree structure. Each color observable is normalized to have zero mean and unit variance over the dataset. In summary, the y's are 8-dimensional vectors for IT_V and IQT_V, and 5-dimensional vectors for IT_V0 and IQT_V0.

CHAPTER 6
EXPERIMENTS AND DISCUSSION

We report experiments on image segmentation and classification for six sets of images.
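Before turning to the experiments, the leaf-level features of Sections 5.1.4 and 5.2 (difference-of-Gaussian and generalized RGB) can be sketched as follows. This is a toy numpy-only sketch under stated assumptions: the kernel truncation radius and the helper names are hypothetical, and the separable blur is a plain truncated convolution rather than an optimized implementation.

```python
import numpy as np

def gauss1d(sigma):
    """Sampled 1-D Gaussian kernel, normalized to sum to one; the
    truncation radius of ~3*sigma is an illustrative choice."""
    r = int(3 * sigma + 0.5)
    t = np.arange(-r, r + 1)
    k = np.exp(-t**2 / (2 * sigma**2))
    return k / k.sum()

def gauss_blur(img, sigma):
    """Separable Gaussian smoothing, G(x, y, sigma) * I."""
    k = gauss1d(sigma)
    rows = np.apply_along_axis(np.convolve, 1, img, k, 'same')
    return np.apply_along_axis(np.convolve, 0, rows, k, 'same')

def dog(img, k, sigma=2.0):
    """Difference-of-Gaussian feature of Eq. (5.1)."""
    return gauss_blur(img, k * sigma) - gauss_blur(img, sigma)

def chromaticity(rgb):
    """Generalized RGB: r = R/(R+G+B), g = G/(R+G+B)."""
    s = rgb.sum(axis=-1, keepdims=True)
    return rgb[..., 0:1] / s, rgb[..., 1:2] / s

img = np.zeros((64, 64)); img[32, 32] = 1.0     # toy intensity image
d = dog(img, k=np.sqrt(2))
assert d.shape == img.shape                     # D has zero mean response
rgb = np.array([[[60.0, 120.0, 20.0]]])
r, g = chromaticity(rgb)
assert np.isclose(r[0, 0, 0], 0.3) and np.isclose(g[0, 0, 0], 0.6)
```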
Dataset I comprises fifty, 64×64, simple-scene images with object appearances of 20 distinct objects shown in Fig. 6-1. Samples of dataset I are given in Figs. 6-2, 6-3, and 6-4. Dataset II contains 120, 128×128, complex-scene images with partially occluded appearances of the same 20 distinct objects as in dataset I. Examples of dataset II are shown in Figs. 6-11, 6-12, and 6-15. Note that the objects appearing in datasets I and II are carefully chosen to test whether irregular trees are expressive enough to capture very small variations in the appearances of some classes (e.g., two different types of cans in Fig. 6-1), as well as to encode large differences among other classes (e.g., the wiry-featured robot and the books in Fig. 6-1). Next, dataset III contains fifty, 128×128, natural-scene images, samples of which are shown in Figs. 6-5 and 6-6. For dataset IV we choose sixty, 128×128 images from a database that is publicly available at the Computer Vision Home Page. Dataset IV contains a video sequence of two people approaching each other, who wear alike shirts but different pants, as illustrated in Fig. 6-16. The sequence is challenging because the "object" parts most significant for differentiating between the two persons (i.e., the pants) get occluded. Moreover, the images represent scenes with clutter, where recognition of partially occluded, similar-in-appearance people becomes harder. Together with the two persons, there are 12 possible image classes appearing in dataset IV, as depicted in Fig. 6-16a. Here, each image is treated separately, without making use of the fact that the background scene does not change in the video sequence. Further, dataset V consists of sixty, 256×256 images, typical samples of which are shown in Fig. 6-17b. The images in dataset V represent the video sequence of a complex scene, which is observed from different viewpoints by moving a camera horizontally clockwise.
Together with the background, there are 6 possible image classes, as depicted in Fig. 6-17a. Finally, dataset VI consists of sixty, 256×256 natural-scene images, samples of which are shown in Fig. 6-18. The images in dataset VI represent the video sequence of a row of houses, which is observed from different viewpoints. The houses are very similar in appearance, so the recognition task becomes very difficult when the details differentiating one house from another are occluded. There are 8 possible image classes: 4 different houses, sky, road, grass, and tree, as marked with different colors in Fig. 6-18. All datasets are divided into training and test sets by random selection of images, such that 2/3 are used for training and 1/3 for testing. Ground truth for each image is determined through hand-labeling of pixels.

6.1 Unsupervised Image Segmentation Tests

We first report experiments on unsupervised image segmentation using ITvo and ITv. Irregular-tree based image segmentation is tested on datasets I and III, and conducted by the algorithm given in Fig. 2-4. Since in unsupervised settings the parameters of the model are not known, we initialize them as discussed in the initialization step of the learning algorithm in Section 2.5. After Bayesian estimation of the irregular tree, each node defines one image region composed of those leaf nodes (pixels) that are that node's descendants. The results presented in Figs. 6-2, 6-3, 6-4, 6-5, and 6-6 suggest that irregular trees are able to parse images into "meaningful" parts by assigning one subtree per "object" in the image. Moreover, from Figs. 6-2 and 6-3, we also observe that irregular trees, inferred through SVA, preserve structure for objects across images subject to translation, rotation and scaling. In Fig. 6-2, note that the level-4 clustering for the larger object scale in Fig. 6-2 (top right) corresponds to the level-3 clustering for the smaller object scale in Fig. 6-2 (bottom center).
In other words, as the object transitions through scales, the tree structure changes by eliminating the lowest-level layer, while the higher-order structure remains intact. We also note that the estimated positions of higher-level hidden variables in ITvo and ITv are very close to the centers of mass of object parts, as well as of whole objects. We compute the error of an estimated root-node position r as its distance from the actual center of mass r_CM of the hand-labeled object, d_err = ||r − r_CM||. Also, we compare our SVA inference algorithm with the variational approximation (VA)1 proposed by Storkey and Williams [48].

Figure 6-1: 20 image classes in type I and II datasets.

Figure 6-2: Image segmentation using ITvo: (left) dataset I images; (center) pixel clusters with the same parent at level ℓ=3; (right) pixel clusters with the same parent at level ℓ=4; points mark the position of parent nodes. Irregular-tree structure is preserved through scales.

Figure 6-3: Image segmentation using ITvo: (top) dataset I images; (bottom) pixel clusters with the same parent at level ℓ=3. Irregular-tree structure is preserved over rotations.

The error values averaged over the given test images for VA and SVA are reported in Table 6-1. We observe that the error decreases significantly as the image size increases, because in summing node positions over parent and children nodes, as in Eq. (2.16) and Eq. (2.17), more statistically significant information contributes to the position estimates. For example, d = 6.18 for SVA is only 4.8% of the dataset-III image size (128 pixels), whereas d = 4.23 for SVA is 6.6% of the dataset-I image size (64 pixels). In Table 6-2, we report the percentage of erroneously grouped pixels, and, in Table 6-3, we report the object detection error, when compared to ground truth, averaged over each dataset.
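The root-node position error just defined, d_err = ||r − r_CM||, is straightforward to compute. The sketch below uses a hypothetical hand-labeled square region; `center_of_mass` and `position_error` are illustrative helper names, not functions from the text.

```python
import math

def center_of_mass(pixels):
    """Center of mass of a hand-labeled object region, as a (row, col) pair."""
    n = len(pixels)
    return (sum(p[0] for p in pixels) / n, sum(p[1] for p in pixels) / n)

def position_error(r_hat, r_cm):
    """d_err = ||r_hat - r_CM||: Euclidean distance between the estimated
    root-node position and the hand-labeled center of mass."""
    return math.dist(r_hat, r_cm)

# Hypothetical 64x64 dataset-I object occupying a 10x10 square region.
pixels = [(y, x) for y in range(10, 20) for x in range(30, 40)]
r_cm = center_of_mass(pixels)              # (14.5, 34.5)
d_err = position_error((15.0, 35.0), r_cm) # distance of an estimated root node
```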
1 Although the algorithm proposed by Storkey and Williams [48] is also a structured variational approximation, to differentiate that method from ours, we slightly abuse the notation.

Figure 6-4: Image segmentation by irregular trees learned using SVA: (a)-(c) ITvo for dataset I images; all pixels labeled with the same color are descendants of a unique root.

Figure 6-5: Image segmentation by irregular trees learned using SVA: (a) ITvo for a dataset III image; (b)-(d) ITv for dataset III images; all pixels labeled with the same color are descendants of a unique root.

Figure 6-6: Image segmentation using ITv: (a) a dataset III image; (b)-(d) pixel clusters with the same parent at levels ℓ=3, 4, 5, respectively; white regions represent pixels already grouped by roots at the previous scale; points mark the position of parent nodes.

Table 6-1: Root-node distance error

               dataset I        dataset III
               VA      SVA      VA      SVA
        ITvo   6.32    4.61     9.15    6.87
        ITv    6.14    4.23     8.99    6.18

Table 6-2: Pixel segmentation error

               dataset I   dataset III
  ITvo   VA    7%          10%
         SVA   -           9%
  ITv    VA    7%          11%
         SVA   -           7%

Table 6-3: Object detection error

               dataset I   dataset III
  ITvo   VA    -           -
         SVA   -           10%
  ITv    VA    -           10%
         SVA   -           -

For estimating the object detection error, the following instances are counted as errors: (1) merging two distinct objects into one (i.e., failure to detect an object), and (2) segmenting an object into subregions that are not actual object parts. On the other hand, if an object is segmented into several "meaningful" subregions, verified by visual inspection, this type of error is not counted. Overall, we observe that SVA outperforms VA for image segmentation using ITvo and ITv. Interestingly, the segmentation results for ITv models are only slightly better than for ITvo models. It should be emphasized that our experiments are carried out in an unsupervised manner and, as such, cannot be equitably evaluated against supervised object-recognition results reported in the literature.
Take, for instance, the segmentation in Fig. 6-5d, where two boys dressed in white clothes (i.e., two similar-looking objects) are merged into one subtree. Given the absence of prior knowledge, the ground-truth segmentation for this image is arbitrary, and the resulting segmentation ambiguous; nevertheless, we still count it towards the object-detection error percentages in Table 6-3. Our claim that nodes at different levels of irregular trees represent object parts at various scales is supported by experimental evidence that the nodes segment the image into "meaningful" object subcomponents and position themselves at the centers of mass of these subparts.

6.2 Tests of Convergence

In this section, we report on the convergence properties of the inference algorithms for ITvo, ITv, IQTvo, and IQTv. First, we compare our SVA inference algorithm with variational approximation (VA) [48]. In Figs. 6-7a-b, we illustrate the convergence rate of computing P(Z, X, R'|Y, R0) ∝ Q(Z, X, R') for SVA and VA, averaged over the given datasets. Numbers above the bars represent the mean number of iteration steps it takes for the algorithm to converge. We consider the algorithm converged when |Q(Z, X, R'; t+1) − Q(Z, X, R'; t)|/Q(Z, X, R'; t) ≤ ε = 0.01 (see Fig. 2-4, Step (11)).

Figure 6-7: Comparison of inference algorithms: (a)-(b) convergence rate averaged over the given datasets; (c)-(d) percentage increase in log Q(Z, X, R') computed in SVA over log Q(Z, X, R') computed in VA.

Overall, SVA converges in the fewest number of iterations.
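The relative-change stopping rule used throughout this section, |Q(t+1) − Q(t)|/Q(t) ≤ ε with ε = 0.01, can be sketched as follows. The geometric improvement sequence is hypothetical, chosen only to exercise the rule.

```python
def converged(q_prev, q_curr, eps=0.01):
    """Relative-change stopping rule: |Q(t+1) - Q(t)| / |Q(t)| <= eps."""
    return abs(q_curr - q_prev) / abs(q_prev) <= eps

# Simulated sequence of posterior values approaching a fixed point:
# each step adds half of the previous improvement (a hypothetical schedule).
history = [100.0]
t = 0
while True:
    nxt = history[-1] + 50.0 * (0.5 ** t)
    if converged(history[-1], nxt):
        break
    history.append(nxt)
    t += 1
# The loop stops once an iteration improves Q by less than 1% relative change.
```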
For example, the average number of iterations for SVA on dataset III is 25 and 23 for ITvo and ITv, respectively, which takes approximately 6 s and 5 s on a Dual 2 GHz PowerPC G5. Here, the processing time also includes image-feature extraction. For the same experiments, in Figs. 6-7c-d, we report the percentage increase in log Q(Z, X, R') computed using our SVA over log Q(Z, X, R') obtained by VA. We note that SVA results in larger approximate posteriors than VA. The larger log Q(Z, X, R') means that the assumed form of the approximate posterior distribution Q(Z, X, R') = Q(Z)Q(X|Z)Q(R'|Z) more accurately represents the underlying stochastic processes in the image than VA.

Now, we compare the convergence of the inference algorithm for IQTvo with SVA and VA for ITvo. For simplicity, we refer to the inference algorithm for the model IQTvo also as IQTvo, slightly abusing the notation. The parameters that control the convergence criterion for the inference algorithms of the three models are N = 10 and ε = 0.01.

Figure 6-8: Typical convergence rate of the inference algorithm for ITvo on the 128×128 dataset IV image in Fig. 6-16b; the SVA and VA inference algorithms are conducted for the ITvo model.

Figure 6-9: Typical convergence rate of the inference algorithm for ITvo on the 256×256 dataset V image in Fig. 6-17b; the SVA and VA inference algorithms are conducted for the ITvo model.

Figure 6-10: Percentage increase in log-likelihood log P(Y|X) of IQTvo over log P(Y|X) of ITvo, after 500 and 200 iteration steps for datasets IV and V, respectively.

Figs. 6-8 and 6-9 illustrate typical examples of the convergence rate. We observe that the inference algorithm for IQTvo converges slightly slower than SVA and VA for ITvo.
The average number of iteration steps for IQTvo is approximately 160 and 230, which takes 6 s and 17 s on a Dual 2 GHz PowerPC G5, for datasets IV and V, respectively. The bar chart in Fig. 6-10 shows the percentage increase of log P2 over log P1, where P1 = P(Y|X) is the likelihood of ITvo and P2 = P(Y|X) that of IQTvo. We observe that P(Y|X) of IQTvo, after the algorithm has converged, is larger than P(Y|X) of ITvo. The larger likelihood means that the model structure and inferred distributions more accurately represent the underlying stochastic processes in the image.

6.3 Image Classification Tests

We compare the classification performance of ITvo with that of the following statistical models: (1) Markov Random Field (MRF) [6], (2) Discriminative Random Field (DRF) [25], and (3) Tree-Structured Belief Network (TSBN) [33, 29]. These models are representatives of descriptive, discriminative, and fixed-structure generative models, respectively. Below, we briefly explain the models.

For MRFs, we assume that the label field P(X) is a homogeneous and isotropic MRF, given by the generalized Ising model with only pairwise nonzero potentials [6]. The likelihoods P(yi|xi) are assumed conditionally independent given the labels. Thus, the posterior energy function is given by

U(X|Y) = Σ_{i∈V0} −log P(yi|xi) + Σ_{i∈V0} Σ_{j∈Ni} V2(xi, xj),

V2(xi, xj) = −β_MRF if xi = xj, and +β_MRF if xi ≠ xj,

where Ni denotes the neighborhood of i, P(yi|xi) is a G-component mixture of Gaussians given by Eq. (2.6), and V2 is the interaction potential with parameter β_MRF. Details on learning the model parameters, as well as on inference for a given image, can be found in Stan Li's book [6].

Next, the posterior energy function of the DRF is given by

U(X|Y) = Σ_{i∈V0} A_i(xi, Y) + Σ_{i∈V0} Σ_{j∈Ni} I_ij(xi, xj, Y),

where A_i = log σ(xi W^T yi) and I_ij = β_DRF (K xi xj + (1−K)(2σ(xi xj V^T y_ij) − 1)) are the unary and pairwise potentials, respectively, and σ(·) denotes the logistic sigmoid. Since the above formulation deals only with binary classification (i.e.,
xi ∈ {−1, 1}), when estimating the parameters {W, V, β_DRF, K} for an object, we treat that object as a positive example, and all other objects as negative examples (the "one against all" strategy). For details on how to learn the model parameters, and how to conduct inference for a given image, we refer the reader to the paper of Kumar and Hebert [25].

Further, TSBNs, or quad-trees, are defined to have the same number of nodes V and levels L as the irregular trees. For both ITvo and TSBNs, we use the same image features. When we operate on wavelets, which are a multiscale image feature, we in fact propagate observables to higher levels. In this case, we refer to the counterpart of ITv as TSBNT. To learn the parameters of TSBN or TSBNT, and to perform inference on a given image, we use the algorithms thoroughly discussed by Laferte et al. [33].

Finally, irregular-tree based image classification is conducted by deploying the inference algorithms in Fig. 2-4 for ITvo and ITv, and the inference algorithms in Fig. 3-2 for IQTvo and IQTv. Since image classification represents a supervised machine-learning problem, it is necessary to first learn the model parameters on training images. For this purpose, we employ the learning algorithms discussed in Section 2.5 for ITvo and ITv, and those discussed in Section 3.3 for IQTvo and IQTv.

After inference of MRF, DRF, TSBN, and the irregular tree on a given image, for each model we conduct pixel labeling by using the MAP classifier. In Fig. 6-11, we illustrate an example of pixel labeling for a dataset-II image. Here, we say that an image region is correctly recognized as an object if the majority of MAP-classified pixel labels in that region are equal to the true labeling of the object. For estimating the object-recognition error, the following instances are counted as errors: (1) merging two distinct objects into one, and (2) swapping the identity of objects.
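The generalized Ising energy used for the MRF baseline above can be sketched on a small 4-connected grid. The sign convention (−β for agreeing neighbor labels, +β otherwise) is one common choice and an assumption of this sketch, as is the uniform likelihood term.

```python
def ising_energy(labels, neg_log_lik, beta):
    """U(X|Y) = sum_i -log P(y_i|x_i) + sum over neighbor pairs of V2(x_i, x_j),
    with V2 = -beta if the labels agree and +beta otherwise (assumed convention),
    on a 4-connected pixel grid; each pair is counted once."""
    h, w = len(labels), len(labels[0])
    u = sum(neg_log_lik[y][x] for y in range(h) for x in range(w))
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0)):   # right and down neighbors only
                yy, xx = y + dy, x + dx
                if yy < h and xx < w:
                    u += -beta if labels[y][x] == labels[yy][xx] else beta
    return u

# With uniform likelihoods and beta > 0, a homogeneous labeling has lower
# energy (is preferred) than a checkerboard labeling.
nll = [[1.0] * 4 for _ in range(4)]
same = [[0] * 4 for _ in range(4)]
checker = [[(x + y) % 2 for x in range(4)] for y in range(4)]
```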
The object-recognition error over all objects in 40 test images in dataset II is summarized in Table 6-4. In each cell of Table 6-4, the first number indicates the overall recognition error, while the number in parentheses indicates the ratio of swapped-identity errors. For instance, for ITvo the overall recognition error is approximately 9%, a large share of which was caused by swapped-identity errors. Moreover, Table 6-5 shows the average pixel-labeling error.

Table 6-4: Object recognition error

  image type   MRF        DRF        TSBN        ITvo
  dataset II   21.2% (-)  12.5% (-)  ≈14% (72%)  ≈9% (-)

Table 6-5: Pixel labeling error

  image type   MRF   DRF    TSBN    ITvo
  dataset II   -     ≈12%   16.1%   9.9%

Next, we examine the receiver operating characteristic (ROC) of MRF, DRF, TSBN, and ITvo for a two-class recognition problem. From the set of image classes given in Fig. 6-1, we choose "toy-snail" and one other, similar-looking object as the two possible classes in the following set of experiments. The task is to label two-class-problem images containing these two objects, a typical example of which is shown in Fig. 6-12. Here, pixels labeled as "toy-snail" are considered true positives, while pixels labeled as the other object are considered true negatives. In Fig. 6-13, we plot ROC curves for the two-class problem, where we compare the performance of ITvo with those of MRF, DRF and TSBN. From Fig. 6-13, we observe that image classification with ITvo is the most accurate, since its ROC curve is the closest to the left-hand and top borders of the ROC space, as compared to the ROC curves of the other models. Further, in Fig. 6-14, we plot ROC curves for the same two-class problem, where we compare the performance of ITv with those of ITvo, TSBN, and TSBNT. From Fig. 6-14, we observe that image classification with ITv is the most accurate, and that both ITvo and ITv outperform their fixed-structure counterparts TSBN and TSBNT. From the results reported in Tables 6-4 and 6-5, as well as from Figs.
6-13 and 6-14, we note that irregular trees outperform the other three models. However, the recognition performance of all the models suffers substantially when an image contains occlusions. While for some applications the literature reports vision systems with impressively small classification errors (e.g., a 2.5% handwritten-digit recognition error [75]), in the case of complex scenes this error is much higher [76, 77, 11, 5, 4].

Figure 6-11: Comparison of classification results for various statistical models: (a) 256×256 image; (b) MRF; (c) DRF; (d) TSBN; (e) ITvo; pixels are labeled with a color specific to each object; non-colored pixels are classified as background.

Figure 6-12: MAP pixel labeling using different statistical models: (a) 256×256 image; (b) MRF; (c) DRF; (d) TSBN; (e) ITvo.

Figure 6-13: ROC curves for the image in Fig. 6-12a with ITvo, TSBN, DRF and MRF.

Figure 6-14: ROC curves for the image in Fig. 6-12a with ITv, ITvo, TSBN, and TSBNT.

To some extent, our results could have been improved had we employed more discriminative image features and/or more sophisticated classification algorithms than the majority rule. However, none of this would alleviate the fundamental problem of "traditional" recognition approaches: the lack of explicit analysis of visible object parts. Thus, the poor classification performance of MRF, DRF, and TSBN, reported in Tables 6-4 and 6-5, can be interpreted as follows. Accounting for only pairwise potentials between adjacent nodes in MRF and DRF is not sufficient to model complex configurations of objects in the scene. Also, the analysis of fixed-size pixel neighborhoods at various scales in TSBN leads to "blocky" estimates, and consequently to poor classification performance.
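The ROC curves compared above are built from per-pixel true-positive and false-positive rates swept over a decision threshold. The sketch below computes such a curve for hypothetical integer confidence scores; the scores and labels are made up for illustration.

```python
def roc_point(scores, truths, threshold):
    """One (FPR, TPR) point: items scoring at or above the threshold are
    called positive; truths holds the ground-truth positive flags."""
    tp = fp = fn = tn = 0
    for s, t in zip(scores, truths):
        pred = s >= threshold
        if pred and t:
            tp += 1
        elif pred and not t:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return fp / max(fp + tn, 1), tp / max(tp + fn, 1)

def roc_curve(scores, truths, thresholds):
    """Sweep the threshold to trace the ROC curve from (1,1) toward (0,0)."""
    return [roc_point(scores, truths, th) for th in thresholds]

# Hypothetical per-pixel confidence ranks for the positive class.
scores = [9, 8, 7, 6, 4, 3, 2, 1]
truths = [True, True, True, False, True, False, False, False]
curve = roc_curve(scores, truths, range(1, 11))
```

A curve hugging the left and top borders of ROC space, as ITvo's does in Fig. 6-13, means high TPR at low FPR across thresholds.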
Therefore, we hypothesize that the main reason why irregular trees outperform the other models is their capability to represent object details at various scales, which in turn provides for explicit analysis of visible object parts. In other words, we speculate that in the face of the occlusion problem, recognition of object parts is critical and should condition recognition of the object as a whole. To support our hypothesis, instead of employing more sophisticated image-feature extraction tools and better classification procedures than majority vote, we introduce a more radical change to our recognition strategy.

6.4 Object-Part Recognition Strategy

Recall from Section 6.1 that irregular trees are capable of capturing component-subcomponent structures at various scales, such that root nodes represent the centers of mass of distinct objects, while children nodes down the subtrees represent object parts. As such, irregular trees provide a natural and seamless framework for identifying candidate image regions as object parts, requiring no additional training for such identification. To utilize this convenient property, we conduct the object-part recognition strategy presented in Section 4.2. We compare the performance of the whole-object and object-part recognition strategies. The whole-object approach can be viewed as a benchmark strategy, in the sense that a majority of existing vision systems do not explicitly analyze visible object parts at various scales. In these systems, once the object is detected, the whole image region is identified through MAP classification, as is done in the previous section.

In Fig. 6-15, we present classification results for ITvo, using the whole-object and object-part recognition strategies on dataset-II images. In Fig. 6-15a, both strategies succeed in recognizing two different "Fluke" voltage-measuring instruments (see Fig. 6-1). However, in Fig.
6-15b, the whole-object recognition strategy fails to make a distinction between the objects, since the part that most differentiates one object from the other is occluded, making it a difficult case for recognition even for a human interpreter. In the other two images, we observe that the object-part recognition strategy is more successful than the whole-object approach.

Figure 6-15: Comparison of two recognition strategies on dataset II for ITvo: (top) 128×128 challenging images containing objects that are very similar in appearance; (middle) classification using the whole-object recognition strategy; (bottom) classification using the object-part recognition strategy; each recognized object in the image is marked with a different color.

For estimating the object-recognition error of ITvo on dataset-II images, the following instances are counted as errors: (1) merging two distinct objects into one (i.e., object not detected), and (2) swapping the identity of objects (i.e., object correctly detected but misclassified as one of the objects in the class of known objects). The recognition error averaged over all objects in 40 test images in dataset II is only a few percent, a substantial improvement over the error of about 9% reported in the previous section.

We also recorded the object-recognition error of IQTvo over all objects in 20 test images of datasets IV, V, and VI, respectively. The results are summarized in Table 6-6. In each cell of Table 6-6, the first number indicates the overall recognition error, while the number in parentheses indicates the ratio of merged-object errors. For instance, for dataset V and the whole-object strategy, the overall recognition error is 21.2%, of which slightly more than half were caused by merged-object errors. The results in Table 6-6 clearly demonstrate significantly improved recognition performance, as well as a reduction in the false-alarm and swapped-identity types of error for the object-part strategy, as compared with the whole-object approach. Also, Table 6-7 shows that the object-part strategy reduces the pixel-labeling error. These results support our hypothesis that for successful recognition of partially occluded objects it is critical to analyze visible object details at various scales.

Table 6-6: Object recognition error for IQTvo

  strategy        IV           V           VI
  whole-object    ≈11% (-)     21.2% (-)   ≈26% (-)
  object-part     ≈3% (100%)   8.7% (92%)  12.5% (81%)

Table 6-7: Pixel labeling error for IQTvo

  strategy        IV     V       VI
  whole-object    ≈9%    17.9%   ≈16%
  object-part     ≈4%    6.7%    ≈8%

Figure 6-16: Recognition results over dataset IV for IQTvo: (a) cluttered scene containing 10 objects, each marked with a different color, and images of two alike persons; (b) dataset IV: video sequence of two alike people walking in a cluttered scene; (c) classification using the whole-object recognition strategy; (d) classification using the object-part recognition strategy.

Figure 6-17: Recognition results over dataset V for IQTvo: (a) 6 image classes: 5 similar objects and background; (b) 4 images of the same scene viewed from 4 different angles with the objects shown in (a); (c) the most significant object parts differ over the various scenes; the majority-voting classification result is indicated by the colored regions; (d) classification using the whole-object recognition strategy; (e) classification using the object-part recognition strategy.

Figure 6-18: Recognition results for dataset VI; classification using the object-part recognition strategy.

CHAPTER 7
CONCLUSION

7.1 Summary of Contributions

In this dissertation, we have addressed the detection and recognition of partially occluded, alike objects in complex scenes, a problem that has, as of yet, eluded a satisfactory solution. The experiments reported herein show that "traditional" approaches to object recognition, where objects are first detected and then identified as a whole, yield poor performance in complex settings.
Therefore, we speculate that a careful analysis of visible, fine-scale object details may prove critical for recognition. However, in general, the analysis of multiple subparts of multiple objects gives rise to prohibitive computational complexity. To overcome this problem, we have proposed to model images with irregular trees, which provide a suitable framework for developing novel object-recognition strategies, in particular, object-part recognition. Here, object details at various scales are first detected through tree-structure estimation; then, these object parts are analyzed as to which component of an object is the most significant for recognition of that object; finally, information on the cognitive significance of each object part is combined toward the ultimate image classification. Empirical evidence demonstrates that this explicit treatment of object parts results in improved recognition performance, as compared to strategies where object components are not explicitly accounted for.

In Chapter 2, we have proposed two architectures within the irregular-tree framework, referred to as ITvo and ITv. For each architecture, we have developed an inference algorithm. Gibbs sampling has been shown to be successful at finding trees that have high posterior probability, however at a great computational price, which renders the algorithm impractical. Therefore, we have proposed Structured Variational Approximation (SVA) for inference of ITvo and ITv, which relaxes poorly justified independence assumptions in prior work. We have shown that SVA converges to larger posterior distributions an order of magnitude faster than competing algorithms. We have also demonstrated that ITvo and ITv overcome the blocky segmentation problem of TSBNs, and that they possess a certain invariance to translation, rotation, and scaling transformations. In Chapter 3, we have proposed another two architectures, referred to as IQTvo and IQTv.
In these models, we have constrained the node positions to be fixed, such that only the connections can control the irregular-tree structure. At the same time, we have made the distribution of connections dependent on image classes. This formulation has allowed us to avoid variational-approximation inference and to develop an exact inference algorithm for IQTvo and IQTv. We have shown that it converges more slowly than SVA; however, it yields a larger likelihood, which in general means that IQTvo represents the underlying stochastic processes in the image more accurately than ITvo.

In experiments on unsupervised image segmentation, we have shown the capability of irregular trees to capture important component-subcomponent structures in images. Empirical evidence demonstrates that root nodes represent the centers of mass of distinct objects, while children nodes down the subtrees represent object parts. As such, irregular trees provide a natural and seamless framework for identifying candidate image regions as object parts, requiring no additional training for such identification. In Chapter 4, we have proposed to explicitly analyze the significance of object parts (i.e., tree nodes) with respect to recognition of an object as a whole. We have defined entropy as a measure of such cognitive significance. To avoid the costly approach of analyzing every detected object part, we have devised a greedy algorithm, referred to as object-part recognition. The comparison of the whole-object and object-part approaches indicates that the latter method generates significantly better recognition performance and a reduced pixel-labeling error.

Ultimately, what allows us to overcome the obstacles in analyzing scenes with occlusions in a computationally efficient and intuitively appealing manner is the generative-model framework we have proposed.
This framework provides an explicit representation of objects and their subparts at various scales, which, in turn, constitutes the key factor for improved interpretation of scenes with partially occluded, alike objects.

7.2 Opportunities for Future Work

The analysis in the previous chapters suggests the following opportunities for future work. One promising thrust of research would be to investigate the relationships among descriptive, generative, and discriminative statistical models. We anticipate that these studies will lead to a greater integration of the modeling paradigms, yielding richer and more advanced classes of models. Here, the most critical issue is that of computationally manageable inference. With recent advances in the area of belief propagation (e.g., Generalized Belief Propagation [78]), new algorithms may make it possible to solve real-world problems that were previously computationally intractable.

Within the irregular-tree framework, it is possible to continue further investigation toward replacing the current discrete-valued node variables with real-valued ones. Thereby, a real-valued version of the irregular tree can be specified. Gaussians could be used as the probability distribution governing the continuous random variables represented by nodes, due to their tractable properties. Such a model could then operate directly on real-valued pixel data, improving state-of-the-art techniques for solving various image-processing problems, including super-resolution, image enhancement, and compression.

Further, with respect to the measure of significance of irregular-tree nodes, one can pursue investigation of more complex information-theoretic concepts than Shannon's entropy. For example, we anticipate that joint entropy and mutual information may yield a more efficient cognitive analysis, which in turn could eliminate the need for the greedy algorithm discussed in Section 4.2.
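The entropy-based significance measure referred to above (Shannon entropy of a node's class posterior, with low entropy marking a confidently classified, discriminative part) can be sketched as follows; the part names and posterior values are hypothetical.

```python
import math

def shannon_entropy(p):
    """H(p) = -sum_k p_k log2 p_k; low entropy means a confidently
    classified object part, i.e., one with high cognitive significance."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Hypothetical class posteriors over 4 object classes for two detected parts:
peaked = [0.97, 0.01, 0.01, 0.01]   # highly discriminative part
flat = [0.25, 0.25, 0.25, 0.25]     # uninformative part (maximum entropy, 2 bits)

# Rank parts by increasing entropy: the most significant part comes first.
ranked = sorted([("part_a", peaked), ("part_b", flat)],
                key=lambda kv: shannon_entropy(kv[1]))
```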
The analysis of object parts can be interpreted as the integration of information from multiple complementary and/or competitive sensors, each of which has only limited accuracy. As such, further research could be conducted on formulating the optimal strategy for combining the pieces of information about object parts toward ultimate object recognition. We anticipate that algorithms such as adaptive boosting (AdaBoost) [79] and the Support Vector Machine [80] may prove useful for this purpose.

Another promising research topic is to incorporate available prior knowledge into the proposed Bayesian estimation framework, where we have assumed that all classification errors are equally costly. However, in many applications, some errors are more serious than others. Cost-sensitive learning methods are needed to address this problem [81].

On a broader scale, the research reported in this dissertation can be viewed as solving a more general machine-learning problem, with experimental validation on images as data. This problem concerns supervised learning from examples, where the goal is to learn a function X = f(Y) from N training examples of the form {(Y_n, f(Y_n))}_{n=1}^N. Here, X_n and Y_n contain subcomponents, the meaning of which differs across applications. For example, in computer vision, each Y_n might be a vector of image pixel values, and each X_n might be a partition of that image into segments and an assignment of labels to each segment. Most importantly, the components of Y_n form a sequence (e.g., a sequence on the 2-D image lattice). Therefore, learning a classifier function X = f(Y) represents the sequential supervised learning problem [82]. Thus, in this dissertation, we have addressed sequential supervised learning, the solutions of which can be readily applied to a wide range of problems beyond computer vision, such as, for example, speech processing, where the components of Y form a sequence in time.
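The data shape of the sequential supervised learning problem just described can be made concrete with a toy example: each Y_n is a sequence of observations and each X_n the corresponding sequence of labels, so the learned f must predict whole label sequences. The baseline below is deliberately trivial (a per-dataset majority label) and purely illustrative; all names and data are hypothetical.

```python
from collections import Counter

def majority_baseline(train):
    """A deliberately trivial f: ignore Y entirely and predict, at every
    sequence position, the single most frequent label seen in training."""
    counts = Counter(lab for _, labels in train for lab in labels)
    most_common = counts.most_common(1)[0][0]
    return lambda y: [most_common] * len(y)

# Toy training pairs (Y_n, X_n): observation sequences with label sequences.
train = [([0.1, 0.9, 0.8], ["bg", "obj", "obj"]),
         ([0.2, 0.7, 0.3], ["obj", "obj", "bg"])]
f = majority_baseline(train)
pred = f([0.5, 0.5, 0.5, 0.5])   # one label per input position
```

A real sequential learner would, of course, exploit the dependence between neighboring positions, which is exactly what the tree-structured models of this dissertation do on the 2-D image lattice.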
APPENDIX A
DERIVATION OF VARIATIONAL APPROXIMATION

Preliminaries. Computation of KL(Q||P), given by Eq. (2.12), is intractable, because it depends on P(Z, X, R'|Y, R0). Note, though, that Q(Z, X, R') does not depend on P(Y|R0) and P(R0). Consequently, by subtracting log P(Y|R0) and log P(R0) from KL(Q||P), we obtain a tractable criterion J(Q, P), whose minimization with respect to Q(Z, X, R') yields the same solution as minimization of KL(Q||P):

J(Q,P) ≜ KL(Q||P) − log P(Y|R0) − log P(R0) = Σ_{Z,X} ∫ dR' Q(Z,X,R') log[ Q(Z,X,R') / P(Z,X,R',Y|R0) ] .  (A.1)

J(Q,P) is known alternatively as Helmholtz free energy, Gibbs free energy, or simply free energy [59]. By minimizing J(Q,P), we seek to compute the parameters of the approximate distributions Q(Z), Q(X|Z) and Q(R'|Z). It is convenient, first, to reformulate Eq. (A.1) as J(Q,P) = L_Z + L_X + L_R. We define the auxiliary terms L_Z, L_X and L_R as

L_Z ≜ Σ_Z Q(Z) log[ Q(Z) / P(Z) ] ,
L_X ≜ Σ_{Z,X} Q(Z) Q(X|Z) log[ Q(X|Z) / P(X,Y|Z) ] ,
L_R ≜ Σ_Z ∫ dR' Q(Z) Q(R'|Z) log[ Q(R'|Z) / P(R'|Z) ] .

To derive expressions for L_Z, L_X and L_R, we first observe:

⟨z_ij⟩ = ξ_ij ,   ⟨x_i^k⟩ = m_i^k = Σ_{j∈V} ξ_ij Σ_{l∈M} Q_ij^kl m_j^l ,   ∀i∈V, ∀k∈M ,  (A.2)

where ⟨·⟩ denotes expectation with respect to Q(Z, X, R'). Consequently, from Eqs. (2.1), (2.9) and (A.2), we have

L_Z = Σ_{i,j∈V} ξ_ij log[ ξ_ij / P(z_ij=1) ] .  (A.3)

Next, from Eqs. (2.4), (2.10) and (A.2), we derive

L_X = Σ_{i,j∈V} Σ_{k,l∈M} ξ_ij Q_ij^kl m_j^l log[ Q_ij^kl / P_ij^kl ] − Σ_{i∈V} Σ_{k∈M} m_i^k log P(y_p(i) | x_i^k, p(i)) .  (A.4)

Note that for DT^V0, V in the second term is substituted with V0. Finally, from Eqs. (2.3), (2.11) and (A.2), we get

L_R = Σ_{i,j∈V'} (ξ_ij/2) [ log( |Σ_ij| / |Ω_ij| ) − Tr{I} + Tr{ Σ_ij^-1 ⟨(r_i − r_j − d_ij)(r_i − r_j − d_ij)^T⟩ } ] .  (A.5)

Let us now consider the expectation in the last term:

⟨(r_i − r_j − d_ij)(r_i − r_j − d_ij)^T⟩
 = ⟨(r_i − μ_ij + μ_ij − r_j − d_ij)(r_i − μ_ij + μ_ij − r_j − d_ij)^T⟩
 = Ω_ij + 2⟨(r_i − μ_ij)(μ_ij − r_j − d_ij)^T⟩ + ⟨(μ_ij − r_j − d_ij)(μ_ij − r_j − d_ij)^T⟩
 = Ω_ij + Σ_{p∈V'} ξ_jp ( 2Φ_ijp + Ω_jp + M_ijp ) ,  (A.6)

where the auxiliary matrices Φ_ijp ≜ ⟨(r_i − μ_ij)(μ_jp − r_j)^T⟩ and M_ijp ≜ (μ_ij − μ_jp − d_ij)(μ_ij − μ_jp − d_ij)^T are defined in the last derivation step above, and ijp denotes a child-parent-grandparent triad. It follows from Eqs.
(A.5) and (A.6) that

L_R = Σ_{i,j∈V'} (ξ_ij/2) [ log( |Σ_ij| / |Ω_ij| ) − Tr{I} + Tr{Σ_ij^-1 Ω_ij} + Σ_{p∈V'} ξ_jp Tr{ Σ_ij^-1 (2Φ_ijp + Ω_jp + M_ijp) } ] .  (A.7)

In Eq. (A.7), the last expression left to compute is Tr{Σ_ij^-1 Φ_ijp}. For this purpose, we apply the Cauchy-Schwartz inequality as follows:

| Tr{Σ_ij^-1 Φ_ijp} | = | Tr{ Σ_ij^-1 ⟨(r_i − μ_ij)(μ_jp − r_j)^T⟩ } |
 = | Tr{ ⟨( Σ_ij^-1/2 (r_i − μ_ij) )( Σ_ij^-1/2 (μ_jp − r_j) )^T⟩ } |
 ≤ sqrt( Tr{Σ_ij^-1 Ω_ij} Tr{Σ_ij^-1 Ω_jp} ) ,  (A.8)

where we used the fact that the Σ's and Ω's are diagonal matrices. Although the Cauchy-Schwartz inequality in general does not yield a tight upper bound, in our case it appears reasonable to assume that the variables r_i and r_j (i.e., positions of object parts at different scales) are uncorrelated. Substituting Eq. (A.8) into Eq. (A.7), we finally derive the upper bound on L_R as

L_R ≤ Σ_{i,j∈V'} (ξ_ij/2) [ log( |Σ_ij| / |Ω_ij| ) − Tr{I} + Tr{Σ_ij^-1 Ω_ij} + Σ_{p∈V'} ξ_jp Tr{ Σ_ij^-1 (Ω_jp + M_ijp) } + 2 Σ_{p∈V'} ξ_jp sqrt( Tr{Σ_ij^-1 Ω_ij} Tr{Σ_ij^-1 Ω_jp} ) ] .  (A.9)

Optimization of Q(X|Z). Q(X|Z) is fully characterized by the parameters Q_ij^kl. From the definition of L_X, we have ∂J(Q,P)/∂Q_ij^kl = ∂L_X/∂Q_ij^kl. Due to the parent-child dependencies in Eq. (A.2), it is necessary to iteratively differentiate L_X with respect to Q_ij^kl down the subtree of node i. For this purpose, we introduce three auxiliary terms F_ij, G_i, and λ_i^k, which facilitate the computation, as shown below:

F_ij ≜ Σ_{k,l∈M} ξ_ij Q_ij^kl m_j^l log[ Q_ij^kl / P_ij^kl ] ,
G_i ≜ Σ_{(c,d)⊆d(i)} F_cd − Σ_{k∈M} m_i^k {log P(y_p(i) | x_i^k, p(i))}_V0 ,
λ_i^k ≜ exp( −∂G_i/∂m_i^k ) ,  (A.10)

where {·}_V0 denotes that the term is included in the expression for G_i only if i is a leaf node for DT^V0; for DT^V, the term in braces {·} is always included. This allows us to derive update equations for both models simultaneously. After finding the derivatives ∂F_ij/∂Q_ij^kl = ξ_ij m_j^l ( log[Q_ij^kl/P_ij^kl] + 1 ) and ∂m_i^k/∂Q_ij^kl = ξ_ij m_j^l, and substituting these expressions in Eq. (A.10), we arrive at

∂L_X/∂Q_ij^kl = ξ_ij m_j^l ( log[Q_ij^kl/P_ij^kl] + 1 − log λ_i^k ) .  (A.11)

Finally, optimizing Eq. (A.11) with the Lagrange multiplier that accounts for the constraint Σ_{k∈M} Q_ij^kl = 1 yields the desired update equation Q_ij^kl = (1/κ_ij^l) P_ij^kl λ_i^k, introduced in Eq. (2.13). To compute λ_i^k, we first find

∂G_i/∂m_i^k = Σ_{c∈c(i)} ( ∂F_ci/∂m_i^k + Σ_{a∈M} (∂G_c/∂m_c^a)(∂m_c^a/∂m_i^k) ) − {log P(y_p(i) | x_i^k, p(i))}_V0
 = Σ_{c∈c(i)} Σ_{a∈M} ξ_ci Q_ci^ak ( log[Q_ci^ak/P_ci^ak] + ∂G_c/∂m_c^a ) − {log P(y_p(i) | x_i^k, p(i))}_V0 ,  (A.12)

and then substitute Q_ci^ak, given by Eq. (2.13), into Eq. (A.12), which results in

λ_i^k = {P(y_p(i) | x_i^k, p(i))}_V0 Π_{c∈c(i)} [ Σ_{a∈M} P_ci^ak λ_c^a ]^ξ_ci ,

as introduced in Eq. (2.14).

Optimization of Q(R'|Z). Q(R'|Z) is fully characterized by the parameters μ_ij and Ω_ij. From the definition of L_R, we observe that ∂J(Q)/∂Ω_ij = ∂L_R/∂Ω_ij and ∂J(Q)/∂μ_ij = ∂L_R/∂μ_ij. Since the Ω's are positive definite, from Eq. (A.9) it follows that

∂L_R/∂Ω_ij = (ξ_ij/2) [ −Ω_ij^-1 + Σ_ij^-1 + Σ_{p∈V'} ξ_jp sqrt( Tr{Σ_ij^-1 Ω_jp} / Tr{Σ_ij^-1 Ω_ij} ) Σ_ij^-1
 + Σ_{c∈V} ξ_ci ( Σ_ci^-1 + sqrt( Tr{Σ_ci^-1 Ω_ci} / Tr{Σ_ci^-1 Ω_ij} ) Σ_ci^-1 ) ] .  (A.13)

From ∂L_R/∂Ω_ij = 0, it is straightforward to derive the update equation for Ω_ij given by Eq. (2.17). Next, to optimize the μ_ij parameters, from Eq. (A.9) we compute

∂L_R/∂μ_ij = ξ_ij [ Σ_{p∈V'} ξ_jp Σ_ij^-1 ( μ_ij − μ_jp − d_ij ) − Σ_{c∈V} ξ_ci Σ_ci^-1 ( μ_ci − μ_ij − d_ci ) ] .  (A.14)

Then, from ∂L_R/∂μ_ij = 0, it is straightforward to compute the update equation for μ_ij given by Eq. (2.16).

Optimization of Q(Z). Q(Z) is fully characterized by the parameters ξ_ij. From the definitions of L_Z, L_X, and L_R, we see that ∂J(Q)/∂ξ_ij = ∂(L_X + L_R + L_Z)/∂ξ_ij. Similar to the optimization of Q_ij^kl, we need to iteratively differentiate L_X as follows:

∂L_X/∂ξ_ij = ∂F_ij/∂ξ_ij + Σ_{k∈M} (∂G_i/∂m_i^k)(∂m_i^k/∂ξ_ij) ,  (A.15)

where F_ij and G_i are defined as in Eq. (A.10). Substituting the derivatives ∂G_i/∂m_i^k = −log λ_i^k, ∂F_ij/∂ξ_ij = Σ_{k,l∈M} Q_ij^kl m_j^l log[Q_ij^kl/P_ij^kl], and ∂m_i^k/∂ξ_ij = Σ_{l∈M} Q_ij^kl m_j^l into Eq. (A.15), we obtain

∂L_X/∂ξ_ij = Σ_{k,l∈M} Q_ij^kl m_j^l log[ Q_ij^kl / ( P_ij^kl λ_i^k ) ] .  (A.16)

Next, we differentiate L_R, given by Eq. (A.9), with respect to ξ_ij:

∂L_R/∂ξ_ij = (1/2) [ log( |Σ_ij| / |Ω_ij| ) − Tr{I} + Tr{Σ_ij^-1 Ω_ij} + Σ_{p∈V'} ξ_jp ( Tr{Σ_ij^-1 (Ω_jp + M_ijp)} + 2 sqrt( Tr{Σ_ij^-1 Ω_ij} Tr{Σ_ij^-1 Ω_jp} ) ) ]
 + (1/2) Σ_{c∈V} ξ_ci ( Tr{Σ_ci^-1 (Ω_ij + M_cij)} + 2 sqrt( Tr{Σ_ci^-1 Ω_ci} Tr{Σ_ci^-1 Ω_ij} ) )  (A.17)
 ≜ B_ij ,  (A.18)

where the indexes c, j and p denote children, parents and grandparents of node i, respectively. Further, from Eq. (A.3), we get

∂L_Z/∂ξ_ij = 1 + log[ ξ_ij / P(z_ij=1) ] .  (A.19)

Finally, substituting Eqs.
(A.16), (A.18) and (A.19) into ∂J(Q)/∂ξ_ij = 0, and adding the Lagrange multiplier that accounts for the constraint Σ_{j∈V} ξ_ij = 1, we solve for the update equation of ξ_ij given by Eq. (2.18).

APPENDIX B
INFERENCE ON THE FIXED-STRUCTURE TREE

The inference algorithm for Maximum Posterior Marginal (MPM) estimation on the quadtree is known to alleviate implementation issues related to underflow numerical error [33]. The whole procedure is summarized in Fig. B-1. The algorithm assumes that the tree structure is fixed and known. Therefore, in Fig. B-1, we simplify notation as P(x_i|Z,Y) = P(x_i|Y) and P(x_i|x_j,Z) = P(x_i|x_j). Also, we denote with c(i) the children of i, and with d(i) the set of all the descendants of node i down the tree, including i itself. Thus, Y_d(i) denotes the set of all observables down the subtree whose root is i. Also, for computing P(x_i|Y_d(i)) in the bottom-up pass, ∝ means that equality holds up to a multiplicative constant that does not depend on x_i.

Two-pass MPM estimation on the tree:

1. Preliminary downward pass: ∀i ∈ V^L-1, V^L-2, ..., V0,
   P(x_i) = Σ_{x_j} P(x_i|x_j) P(x_j) , where j denotes the parent of i.

2. Bottom-up pass:
   Initialize the leaf nodes: ∀i ∈ V0,
   P(x_i|Y_d(i)) ∝ P(y_i|x_i) P(x_i) ,
   P(x_i, x_j|Y_d(i)) = P(x_i|x_j) P(x_j) P(x_i|Y_d(i)) / P(x_i) ;
   compute upward: ∀i ∈ V1, V2, ..., VL,
   P(x_i|Y_d(i)) ∝ P(x_i) Π_{c∈c(i)} Σ_{x_c} P(x_c|Y_d(c)) P(x_c|x_i) / P(x_c) ,
   P(x_i, x_j|Y_d(i)) = P(x_i|x_j) P(x_j) P(x_i|Y_d(i)) / P(x_i) .

3. Top-down pass:
   Initialize the root: i ∈ VL,
   P(x_i|Y) = P(x_i|Y_d(i)) ,  x̂_i = arg max_{x_i} P(x_i|Y) ;
   compute downward: ∀i ∈ V^L-1, V^L-2, ..., V0,
   P(x_i|Y) = Σ_{x_j} [ P(x_i, x_j|Y_d(i)) / P(x_j|Y_d(i)) ] P(x_j|Y) ,
   x̂_i = arg max_{x_i} P(x_i|Y) .

Figure B-1: Steps 2 and 5 in Fig. 3-2: MPM estimation on the fixed-structure tree. Distributions P(y_i|x_i) and P(x_i|x_j) are assumed known.

REFERENCES

[1] W. E. L. Grimson and T. Lozano-Perez, "Localizing overlapping parts by searching the interpretation tree," IEEE Trans. Pattern Anal. Machine Intell., vol. 9, no. 4, 1987.

[2] S. Z. Der and R. Chellappa, "Probe-based automatic target recognition in infrared imagery," IEEE Trans. Image Processing, vol. 6, no. 1, 1997.

[3] P. C. ..., E. L. ..., and J. B.
Wu, "A spatio-temporal neural network for recognizing partially occluded objects," vol. ..., no. 7.

[4] W. M. Wells, "Statistical approaches to feature-based object recognition," Intl. J. Computer Vision, vol. 21, no. 1, 1997.

[5] Z. Ying and D. Castanon, "Partially occluded object recognition using statistical models," Intl. J. Computer Vision, vol. 49, no. 1, pp. 57-78, 2002.

[6] S. Z. Li, Markov random field modeling in image analysis, Springer-Verlag, Tokyo, Japan, 2nd edition, 2001.

[7] M. H. Lin and C. Tomasi, "Surfaces with occlusions from layered stereo," IEEE Trans. Pattern Anal. Machine Intell., vol. 26, no. 8, 2004.

[8] A. Mittal and L. S. Davis, "M2Tracker: a multi-view approach to segmenting and tracking people in a cluttered scene," Intl. J. Computer Vision, vol. 51, no. 3, 2003.

[9] B. J. Frey, N. Jojic, and A. Kannan, "Learning appearance and transparency manifolds of occluded objects in layers," in Proc. IEEE Conf. Computer Vision Pattern Rec., 2003, vol. 1, pp. 45-52, IEEE, Inc.

[10] F. Dell'Acqua and R. Fisher, "Reconstruction of planar surfaces behind occlusions in range images," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 4, pp. 569-575, 2002.

[11] R. Fergus, P. Perona, and A. Zisserman, "Object class recognition by unsupervised scale-invariant learning," in Proc. IEEE Conf. Computer Vision Pattern Rec., Madison, WI, 2003, vol. 2, pp. 264-271, IEEE, Inc.

[12] A. Mohan, C. Papageorgiou, and T. Poggio, "Example-based object detection in images by components," IEEE Trans. Pattern Anal. Machine Intell., vol. 23, no. 4, 2001.

[13] M. Weber, M. Welling, and P. Perona, "Towards automatic discovery of object categories," in Proc. IEEE Conf. Computer Vision Pattern Rec., Hilton Head Island, SC, 2000, vol. 2, IEEE, Inc.

[14] M. Weber, M. Welling, and P. Perona, "Unsupervised learning of models for recognition," in Proc. 6th European Conf. Computer Vision, Dublin, Ireland, 2000, vol. 1, pp. 18-32.

[15] B. Heisele, T. Serre, M. Pontil, T. Vetter, and T. Poggio, "Categorization by learning and combining object parts," in Advances in neural information processing systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds., vol. 2, MIT Press, Cambridge, MA, 2002.

[16] P. F. Felzenszwalb and D. P. Huttenlocher, "Pictorial structures for object recognition," Intl. J. Computer Vision, vol. 61, no. 1, pp. 55-79, 2005.

[17] H. Schneiderman and T. Kanade, "Object detection using the statistics of parts," Intl. J. Computer Vision, vol. 56, no. 3, pp. 151-177, 2004.

[18] S. C. Zhu, "Statistical modeling and conceptualization of visual patterns," IEEE Trans. Pattern Anal. Machine Intell., vol. 25, no. 6, pp. 691-712, 2003.

[19] S. C. Zhu, Y. N. Wu, and D. B. Mumford, "Minimax entropy principle and its applications to texture modeling," Neural Computation, vol. 9, no. 8, 1997.

[20] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. Pattern Anal. Machine Intell., vol. 6, no. 6, pp. 721-741, 1984.

[21] A. Efros and T. Leung, "Texture synthesis by non-parametric sampling," in Proc. Intl. Conf. Computer Vision, Kerkyra, Greece, 1999, vol. 2, IEEE, Inc.

[22] J. S. De Bonet and P. Viola, "Texture recognition using a non-parametric multi-scale statistical model," in Proc. IEEE Conf. Computer Vision Pattern Rec., Santa Barbara, CA, 1998, pp. 641-647, IEEE, Inc.

[23] M. J. Beal, N. Jojic, and H. Attias, "A graphical model for audio-visual object tracking," IEEE Trans. Pattern Anal. Machine Intell., vol. 25, no. 7, 2003.

[24] J. Coughlan and A. Yuille, "Algorithms from statistical physics for generative models of images," Image and Vision Computing, vol. 21, no. 1, pp. 29-36, 2003.

[25] S. Kumar and M. Hebert, "Discriminative random fields: a discriminative framework for contextual interaction in classification," in Proc. IEEE Intl. Conf. Computer Vision, Nice, France, 2003, vol. 2, pp. 1150-1157, IEEE, Inc.

[26] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: probabilistic models for segmenting and labeling sequence data," in Intl. Conf. Machine Learning, Williamstown, MA, 2001, pp. 282-289.

[27] C. A. Bouman and M. Shapiro, "A multiscale random field model for Bayesian image segmentation," IEEE Trans. Image Processing, vol. 3, no. 2, pp. 162-177, 1994.

[28] W. W. Irving, P. W. Fieguth, and A. S. Willsky, "An overlapping tree approach to multiscale stochastic modeling and estimation," IEEE Trans. Image Processing, vol. 6, no. 11, pp. 1517-1529, 1997.

[29] H. Cheng and C. A. Bouman, "Multiscale Bayesian segmentation using a trainable context model," IEEE Trans. Image Processing, vol. 10, no. 4, pp. 511-525, 2001.

[30] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk, "Wavelet-based statistical signal processing using hidden Markov models," IEEE Trans. Signal Processing, vol. 46, no. 4, 1998.

[31] X. Feng, C. K. I. Williams, and S. N. Felderhof, "Combining belief networks and neural networks for scene segmentation," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 4, pp. 467-483, 2002.

[32] S. Todorovic and M. C. Nechyba, "Towards intelligent mission profiles of micro air vehicles: multiscale Viterbi classification," in Proc. European Conf. Computer Vision, Prague, Czech Republic, 2004, vol. 2.

[33] J.-M. Laferte, P. Perez, and F. Heitz, "Discrete Markov image modeling and inference on the quadtree," IEEE Trans. Image Processing, vol. 9, no. 3, 2000.

[34] M. R. Luettgen and A. S. Willsky, "Likelihood calculation for a class of multiscale stochastic models, with application to texture discrimination," IEEE Trans. Image Processing, vol. 4, no. 2, pp. 194-207, 1995.

[35] P. L. Ainsleigh, N. Kehtarnavaz, and R. L. Streit, "Hidden Gauss-Markov models for signal classification," IEEE Trans. Signal Processing, vol. 50, no. 6, pp. 1355-1367, 2002.

[36] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference, Morgan Kaufmann, San Mateo, CA, 1988.

[37] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, "Tree-based reparameterization framework for analysis of sum-product and related algorithms," IEEE Trans. Inform. Theory, vol. 49, no. 5, 2003.

[38] B. J. Frey, Graphical models for machine learning and digital communication, The MIT Press, Cambridge, MA, 1998.

[39] S. Kumar and M. Hebert, "Man-made structure detection in natural images using a causal multiscale random field," in Proc. IEEE Conf. Computer Vision Pattern Rec., Madison, WI, 2003, vol. 1, pp. 119-126, IEEE, Inc.

[40] M. K. Schneider, P. W. Fieguth, W. C. Karl, and A. S. Willsky, "Multiscale methods for the segmentation and reconstruction of signals and images," IEEE Trans. Image Processing, vol. 9, no. 3, 2000.

[41] J. Li, R. M. Gray, and R. A. Olshen, "Multiresolution image classification by hierarchical modeling with two-dimensional hidden Markov models," IEEE Trans. Inform. Theory, vol. 46, no. 5, pp. 1826-1841, 2000.

[42] W. K. Konen, T. Maurer, and C. von der Malsburg, "A fast dynamic link matching algorithm for invariant pattern recognition," Neural Networks, vol. 7, no. 7, pp. 1019-1030, 1994.

[43] A. Montanvert, P. Meer, and A. Rosenfeld, "Hierarchical image analysis using irregular tessellations," IEEE Trans. Pattern Anal. Machine Intell., vol. 13, no. 4, pp. 307-316, 1991.

[44] P. Bertolino and A. Montanvert, "Multiresolution segmentation using the irregular pyramid," in Proc. Intl. Conf. Image Processing, Lausanne, Switzerland, 1996, vol. 1, IEEE, Inc.

[45] N. J. Adams, A. J. Storkey, Z. Ghahramani, and C. K. I. Williams, "MFDTs: mean field dynamic trees," in Proc. 15th Intl. Conf. Pattern Rec., Barcelona, Spain, 2000, vol. 3, pp. 147-150, Intl. Assoc. Pattern Rec.

[46] N. J. Adams, Dynamic trees: a hierarchical probabilistic approach to image modelling, Ph.D. dissertation, Division of Informatics, Univ. of Edinburgh, Edinburgh, UK, 2001.

[47] A. J. Storkey, "Dynamic trees: a structured variational method giving efficient propagation rules," in Uncertainty in Artificial Intelligence, C. Boutilier and M. Goldszmidt, Eds., pp. 566-573, Morgan Kaufmann, San Francisco, CA, 2000.

[48] A. J. Storkey and C. K. I. Williams, "Image modeling with position-encoding dynamic trees," IEEE Trans. Pattern Anal. Machine Intell., vol. 25, no. 7, pp. 859-871, 2003.

[49] M. I. Jordan, Ed., Learning in graphical models (adaptive computation and machine learning), MIT Press, Cambridge, MA, 1999.

[50] M. I. Jordan, "Graphical models," Statistical Science (Special issue on Bayesian statistics), vol. 19, pp. 140-155, 2004.

[51] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society B, vol. 39, pp. 1-38, 1977.

[52] G. J. McLachlan and T. Krishnan, The EM algorithm and extensions, John Wiley & Sons, New York, NY, 1997.

[53] D. M. Chickering and D. Heckerman, "Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network," in Proc. Conf. Uncertainty in Artificial Intelligence, Portland, OR, 1996, pp. 158-168, Assoc. Uncertainty Artificial Intelligence.

[54] S. Todorovic and M. C. Nechyba, "Interpretation of complex scenes using generative dynamic-structure models," in CD-ROM Proc. IEEE CVPR 2004, Workshop on Generative-Model Based Vision (GMBV), Washington, DC, 2004, IEEE, Inc.

[55] S. Todorovic and M. C. Nechyba, "Detection of artificial structures in natural-scene images using dynamic trees," in Proc. 17th Intl. Conf. Pattern Rec., Cambridge, UK, 2004, Intl. Assoc. Pattern Rec.

[56] M. Aitkin and D. B. Rubin, "Estimation and hypothesis testing in finite mixture models," J. Royal Statistical Soc., vol. B-47, no. 1, 1985.

[57] R. M. Neal, "Probabilistic inference using Markov chain Monte Carlo methods," Tech. Rep. CRG-TR-93-1, Connectionist Research Group, Univ. of Toronto, 1993.

[58] D. A. Forsyth, J. Haddon, and S. Ioffe, "The joy of sampling," Intl. J. Computer Vision, vol. 41, no. 1-2, pp. 109-134, 2001.

[59] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, 1999.

[60] D. J. C. MacKay, Information theory, inference, and learning algorithms, Cambridge Univ. Press, Cambridge, UK, 2003.

[61] D. Barber and P. van de Laar, "Variational cumulant expansions for intractable distributions," J. Artificial Intell. Research, vol. 10, 1999.

[62] D. J. C. MacKay, Information theory, inference, and learning algorithms, chapter ..., Cambridge University Press, Cambridge, UK, 2003.

[63] D. J. C. MacKay, "Introduction to Monte Carlo methods," in Learning in graphical models (adaptive computation and machine learning), M. I. Jordan, Ed., pp. 175-204, MIT Press, Cambridge, MA, 1999.

[64] T. S. Jaakkola, "Tutorial on variational approximation methods," in Advanced Mean Field Methods, M. Opper and D. Saad, Eds., MIT Press, Cambridge, MA, 2001.

[65] T. M. Cover and J. A. Thomas, Elements of information theory, Wiley-Interscience Press, New York, NY, 1991.

[66] T. Randen and J. H. Husoy, "Filtering for texture classification: a comparative study," IEEE Trans. Pattern Anal. Machine Intell., vol. 21, no. 4, pp. 291-310, 1999.

[67] S. Mallat, A wavelet tour of signal processing, Academic Press, San Diego, CA, 2nd edition, 1999.

[68] S. G. Mallat, "A theory for multiresolution signal decomposition: the wavelet representation," IEEE Trans. Pattern Anal. Machine Intell., vol. 11, no. 7, pp. 674-693, 1989.

[69] J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3445-3462, 1993.

[70] N. G. Kingsbury, "Complex wavelets for shift invariant analysis and filtering of signals," J. Applied Computational Harmonic Analysis, vol. 10, no. 3, pp. 234-253, 2001.

[71] M. Unser, "Texture classification and segmentation using wavelet frames," IEEE Trans. Image Processing, vol. 4, no. 11, 1995.

[72] N. Kingsbury, "Complex wavelets for shift invariant analysis and filtering of signals," Journal of Applied and Computational Harmonic Analysis, vol. 10, no. 3, pp. 234-253, 2001.

[73] T. Lindeberg, "Scale-space theory: a basic tool for analysing structures at different scales," J. Applied Statistics, vol. 21, no. 2, pp. 224-270, 1994.

[74] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Intl. J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

[75] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 4, pp. 509-522, 2002.

[76] B. J. Frey, N. Jojic, and A. Kannan, "Learning appearance and transparency manifolds of occluded objects in layers," in Proc. IEEE Conf. Computer Vision Pattern Rec., Madison, WI, 2003, vol. 1, pp. 45-52, IEEE, Inc.

[77] G. Jones III and B. Bhanu, "Recognition of articulated and occluded objects," IEEE Trans. Pattern Anal. Machine Intell., vol. 21, no. 7, 1999.

[78] J. S. Yedidia, W. T. Freeman, and Y. Weiss, "Generalized belief propagation," in Advances in neural information processing systems 13, T. K. Leen, T. G. Dietterich, and V. Tresp, Eds., pp. 689-695, MIT Press, Cambridge, MA, 2001.

[79] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Computer System Sciences, vol. 55, no. 1, pp. 119-139, 1997.

[80] V. N. Vapnik, Statistical learning theory, John Wiley & Sons, Inc., New York, NY, 1998.

[81] P. Domingos, "MetaCost: a general method for making classifiers cost-sensitive," in Proc. 5th Intl. Conf. Knowledge Discovery Data Mining, San Diego, CA, 1999, pp. 155-164, ACM Press.

[82] T. G. Dietterich, "Machine learning for sequential data: a review," in Lecture notes in computer science, T. Caelli, Ed., vol. 2396, pp. 15-30, Springer-Verlag, Berlin, Germany, 2002.

BIOGRAPHICAL SKETCH

Sinisa Todorovic was born in Belgrade, Serbia, in 1968. He graduated from Mathematical High School-Belgrade in 1987. He received his B.S. degree in electrical and computer engineering at the University of Belgrade, Serbia, in 1994. From 1994 until 2001, he worked as a software engineer in the communications industry. In fall 2001, Sinisa Todorovic enrolled in the master's degree program at the Department of Electrical and Computer Engineering, University of Florida, Gainesville, where he became a member of the Center for Micro Air Vehicle Research and conducted research in statistical image modeling and multiresolution signal processing. Mr. Todorovic earned his master's degree (with thesis option) in December 2002, after which he continued his studies toward a Ph.D. degree in the same department. He received two certificates for outstanding academic achievement. He expects to graduate in May 2005.