IMPROVEMENT OF THE CAMERA CALIBRATION
THROUGH THE USE OF MACHINE
LEARNING TECHNIQUES
BY
SCOTT A. NICHOLS
A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
UNIVERSITY OF FLORIDA
2001
ACKNOWLEDGMENTS
I wish to thank Dr. Antonio Arroyo for taking a long shot on an average undergrad. You have
my gratitude, and have helped me make much more of myself. I also wish to thank Dr. Michael
Nechyba for more things than I can enumerate here, but will make a halfhearted effort to. Thank
you for your patience; your service as an idea blackboard that corrects mistakes; your recommendation of Danela's Ristorante; oh yeah, and your patience. I also wish to thank the members of the
Machine Intelligence Lab that I have shared workbench space with over the years for ideas and
motivation.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS ................................................ ii
ABSTRACT ....................................................... vii

CHAPTERS

1 INTRODUCTION .................................................. 1
   1.1 Computer Vision ........................................... 1
   1.2 Camera Calibration ........................................ 2
   1.3 This Work ................................................. 2
2 THE CAMERA MODEL .............................................. 4
   2.1 Introduction .............................................. 4
   2.2 Definition ................................................ 5
   2.3 Training the Model ........................................ 7
3 A SINGLE CAMERA ............................................... 9
   3.1 Introduction .............................................. 9
   3.2 Training Edges ............................................ 10
   3.3 Optimization Criterion .................................... 11
   3.4 Initial Models ............................................ 12
   3.5 Gradient Descent .......................................... 13
   3.6 Model Perturbation ........................................ 15
   3.7 Gradient-Perturbation Hybrid .............................. 18
   3.8 Performance Comparisons ................................... 19
4 STEREO CAMERAS ................................................ 23
   4.1 Introduction .............................................. 23
   4.2 Related Work .............................................. 24
   4.3 Our Approach .............................................. 25
   4.4 Training Data ............................................. 25
   4.5 Model Improvement ......................................... 25
5 FURTHER RESULTS AND DISCUSSION ................................ 30
   5.1 Single Camera ............................................. 30
   5.2 Stereo Cameras ............................................ 30

APPENDICES

A VISUAL CALIBRATION EVALUATION ................................. 39
B GRAPHICAL CALIBRATION TOOL .................................... 41

REFERENCES ...................................................... 43
BIOGRAPHICAL SKETCH ............................................. 45
LIST OF FIGURES
figure page
2-1  The Pinhole Model of Perspective Projection ................................ 5
3-1  Example Training Edges .................................................... 11
3-2  Error Types of Initial Models ............................................. 13
3-3  Gradient Descent vs. Different Types of Error ............................. 15
3-4  Stochastic Perturbation vs. Different Types of Error ...................... 16
3-5  Stochastic Perturbation with Adaptive Delta vs. Different Types of Error .. 17
3-6  Gradient-Perturbation Hybrid vs. Different Types of Error ................. 18
3-7  Performance Comparison for the Translational and Scale Initial Models ..... 20
3-8  Performance Comparison for the Close and Rotational Initial Models ........ 21
3-9  Final Models Using the Gradient-Perturbation Hybrid Technique ............. 22
4-1  Results for a Non-Weighted Model Improvement .............................. 27
4-2  Error Over Time for Various Gains on the Training Model ................... 28
4-3  The Long-Term Performance Using Various Gains ............................. 29
5-1  Example Single Camera Calibrations (1 & 2) ................................ 31
5-2  Example Single Camera Calibrations (3 & 4) ................................ 32
5-3  Example Single Camera Calibrations (5 & 6) ................................ 33
5-4  Example Single Camera Calibrations (7 & 8) ................................ 34
5-5  Poor Initial Calibrations (1 & 2) ......................................... 35
5-6  Poor Initial Calibrations (3 & 4) ......................................... 36
5-7  Stereo Calibration Examples (1 & 2) ....................................... 37
5-8  Stereo Calibration Examples (3 & 4) ....................................... 38
A-1  The Calibration Grid ...................................................... 40
A-2  The Experimental Area ..................................................... 40
B-1  The Graphical Calibration Tool ............................................ 42
LIST OF TABLES
table page
3-1  Model Perturbation: Average Error Per Pixel ............................... 19
3-2  Model Perturbation with Adaptive Delta: Average Error Per Pixel ........... 19
3-3  Gradient-Perturbation Hybrid: Average Error Per Pixel ..................... 19
Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
IMPROVEMENT OF THE CAMERA CALIBRATION
THROUGH THE USE OF MACHINE
LEARNING TECHNIQUES
By
Scott A. Nichols
August 2001
Chairman: Dr. Michael C. Nechyba
Major Department: Electrical and Computer Engineering
In computer vision, we are frequently interested not only in automatically recognizing what is present in a particular image or video, but also where it is located in the real world. That is, we want to relate two-dimensional image coordinates and three-dimensional world coordinates. Camera calibration refers to the process through which we can derive this mapping from real-world coordinates to image pixels. We propose to reduce the amount of effort required to compute accurate camera calibration models by automating much of the calibration process through machine learning techniques. The techniques developed are intended to simplify the calibration process such that a minimally trained person can perform single, stereo, or multiple fixed-camera calibrations with precision and ease. In this thesis, we first develop a learning algorithm for improving a single calibration model through a combination of gradient descent and stochastic model perturbation. We then develop a second algorithm that applies specifically to simultaneous calibration of multiple fixed cameras. Finally, we illustrate our algorithms through a series of examples and discuss avenues for further research.
CHAPTER 1
INTRODUCTION
1.1 Computer Vision
For decades, researchers have been attempting to duplicate in machine-centered systems what we as humans do on a daily basis with our eyes: to recognize and understand the world around us through visual input. To date, however, our imagination has outpaced the reality of state-of-the-art computer vision systems. While science fiction has often depicted robots and machines with human-like capabilities, current computer vision systems do not come close to matching those imagined capabilities. In fact, we are still many years away from computers that can rival humans and other animals in image processing and recognition capabilities. Why? Computer vision, it appears, is a much more difficult problem than was first believed by early researchers. In the late sixties, with the spread of general-purpose computers, researchers felt that a solution to the general vision problem (the near-instantaneous recognition of any visual input) was achievable within a short number of years. Since then, we have come to understand the enormous computational resources our own brains devote to the vision processing task, and the consequent challenges that the general computer vision problem poses.
Therefore, rather than develop one computer vision system with very general capabilities, researchers have begun to focus on implementing practical computer-vision applications that are limited in scope but can successfully carry out specific tasks. Some examples of recent work include face and car recognition [14], people detection and tracking [6, 13], computer interaction through gestures [15], handwriting recognition [9], traffic monitoring and accident reporting [8], detection of lesions in retinal images [17], and even automated aircraft landing [5]. In many of these computer vision projects, researchers are not only interested in recognizing what is present in an image or video; they would also like to infer 3D geometric information about the world from the image itself. If a system is to interact with and/or draw conclusions about the 3D position of objects in the image, it needs to be calibrated; that is, we need to derive a relationship between the 3D geometry of the real world and corresponding image pixels.
1.2 Camera Calibration
Developing accurate calibration or camera models in computer vision has been the focus of
much research over the years. Many researchers have opted for the following simple approach:
first, the 3D location of certain known image pixels is identified; then, these correspondence pairs are exploited to estimate the parameters of the calibration model. From this model, the intrinsic camera parameters, which define the properties of the camera sensor and lens, and the extrinsic parameters, which define the pose of the camera with respect to the world, can be extracted. This process can be burdensome, as it requires many precise position measurements. Often, someone doing computer vision research spends as much time fiddling with and worrying about camera calibration as they do on their actual application.
1.3 This Work
In this thesis, we propose to reduce the amount of effort required to compute accurate camera calibration models by automating much of the calibration process through machine learning techniques. The techniques developed are intended to simplify the calibration process such that a minimally trained person can perform single, stereo, or multiple fixed-camera calibrations with precision and ease. In Chapter 2, we review the basics of camera calibration. Then, in Chapter 3, we develop a training algorithm for improving a single-camera calibration model from constrained features in the image. Next, in Chapter 4, we build on the previous chapter by developing an algorithm for improving multiple fixed-camera calibration models. Finally, in Chapter 5, we present further results and discuss possible extensions of this work.
CHAPTER 2
THE CAMERA MODEL
2.1 Introduction
Camera calibration in computer vision refers to the process of determining a camera's internal geometric and optical characteristics (intrinsic parameters) and/or the 3D position and orientation of the camera relative to a specified world coordinate system (extrinsic parameters) [16]. We do this in order to extract 3D information from image pixel coordinates and to project 3D information onto 2D image coordinates. Cameras in computer vision can be mounted with a fixed view or a panning and tilting view, or can be integrated onto a mobile system. Panning, tilting, and mobile cameras do not have a fixed relationship to any world coordinate system (extrinsic parameters). For such systems, information is often available only in the camera coordinate system, which is defined by the direction of the camera at the moment the image was captured. External sensors and encoders can alleviate this problem, but typically only at a significantly higher cost. In fixed-camera systems, on the other hand, both the intrinsic and extrinsic parameters combine to provide a transformation between 3D world and 2D image coordinates.
In recent years, many different techniques have been developed for camera calibration. The differences in these techniques primarily reflect the wide array of applications that researchers have pursued. One of the most popular, and the one that forms the basis of this work, is direct linear transformation (DLT), introduced by Abdel-Aziz and Karara [1]. The DLT method does not consider radial lens distortion, but is conceptually and computationally simple. In 1987, Tsai [16] proposed a method that is likely one of the most referenced works on the topic of camera calibration. It outlines a two-stage technique using the "radial alignment constraint" to model lens distortion. Tsai's method involves a direct solution for most of the camera parameters and some iterative solutions for the remaining parameters. The cameras used in this work appear to have little or no lens distortion. Since it has been shown by Weng, Cohen, and Herniou [18] that Tsai's method can be worse than DLT if lens distortion is relatively small, we chose the DLT method.
2.2 Definition
In this thesis, we apply the pinhole lens model of perspective projection, whose basic geometry is shown in Figure 2-1. This model constructs a transformation from 3D world coordinates to pixels in an image in three steps. First, 3D world coordinates are converted to 3D camera coordinates through a homogeneous transformation. Let us denote $P_w = (x_w, y_w, z_w)$ as a coordinate in the world, and $P_c = (x_c, y_c, z_c)$ as the corresponding 3D camera coordinate. Then, the homogeneous transform $T$ is defined by,

[Figure 2-1: The Pinhole Model of Perspective Projection]

$$T = \begin{bmatrix} R & t \\ 0\;\;0\;\;0 & 1 \end{bmatrix} \quad (2\text{-}1)$$

such that,

$$\begin{bmatrix} P_c \\ 1 \end{bmatrix} = T \begin{bmatrix} P_w \\ 1 \end{bmatrix} \quad (2\text{-}2)$$

where $R$ denotes a $3 \times 3$ rotation matrix,

$$R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \quad (2\text{-}3)$$

and $t$ denotes a $3 \times 1$ translation vector,

$$t = \begin{bmatrix} t_1 \\ t_2 \\ t_3 \end{bmatrix} \quad (2\text{-}4)$$

Second, the 3D camera coordinate $P_c$ is transformed to a 3D sensor coordinate $P_s = (u, v, w)$:

$$\begin{bmatrix} P_s \\ 1 \end{bmatrix} = K \begin{bmatrix} P_c \\ 1 \end{bmatrix} \quad (2\text{-}5)$$

where the camera's intrinsic matrix $K$ is defined as,

$$K = \begin{bmatrix} fa & fb & u_0 & 0 \\ 0 & fc & v_0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad (2\text{-}6)$$

In equation (2-6), $f$ is the effective focal length of the camera; $a$, $b$, and $c$ describe the scaling in $x$ and $y$ and the angle between the optical axis and the image sensor plane, respectively; and $u_0$ and $v_0$ are the offset from the image origin to the imaging sensor origin. Finally, the 3D sensor coordinate $P_s$ is converted to the 2D image coordinate $(x_i, y_i)$:

$$x_i = \frac{u}{w}, \qquad y_i = \frac{v}{w} \quad (2\text{-}7)$$

The complete projection equation is therefore given by,

$$\begin{bmatrix} P_s \\ 1 \end{bmatrix} = K\,T \begin{bmatrix} P_w \\ 1 \end{bmatrix} \quad (2\text{-}8)$$

The transformation $KT$ on the world coordinate in equation (2-8) can be combined into a single matrix $S$. Since we are only interested in the overall transform between world coordinates and image coordinates, and not an explicit solution of the extrinsic and/or intrinsic camera parameters, we therefore write,

$$P_s = S\,P_w \quad (2\text{-}9)$$

or alternatively,

$$\begin{bmatrix} u \\ v \\ w \end{bmatrix} = \begin{bmatrix} s_{11} & s_{12} & s_{13} & s_{14} \\ s_{21} & s_{22} & s_{23} & s_{24} \\ s_{31} & s_{32} & s_{33} & s_{34} \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \quad (2\text{-}10)$$
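As a concrete illustration, the three projection steps collapse into a single matrix product followed by a perspective divide. The sketch below assumes a NumPy environment; the matrix values are invented for illustration and do not come from a real calibration.

```python
import numpy as np

# A hypothetical 3x4 perspective projection matrix S (illustrative values only).
S = np.array([
    [0.8, 0.1, 0.0,   120.0],
    [0.0, 0.9, 0.2,    80.0],
    [0.0, 0.0, 0.001,   1.0],
])

def project(S, P_w):
    """Map a 3D world point to a 2D pixel: P_s = S [P_w; 1], then divide by w."""
    u, v, w = S @ np.append(P_w, 1.0)  # 3D sensor coordinate (u, v, w)
    return u / w, v / w                # perspective divide

x_i, y_i = project(S, np.array([1.0, 2.0, 3.0]))
```

Note that the perspective divide is what makes the mapping nonlinear in the world coordinates, even though S itself acts linearly on the homogeneous point.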
2.3 Training the Model

From equation (2-10), we now have a linear transformation with 12 unknowns that relates world coordinates to image pixels. Each correspondence between a 3D world coordinate and a 2D image point corresponds to a set of two equations,

$$u(s_{31}x + s_{32}y + s_{33}z + s_{34}) = w(s_{11}x + s_{12}y + s_{13}z + s_{14})$$
$$v(s_{31}x + s_{32}y + s_{33}z + s_{34}) = w(s_{21}x + s_{22}y + s_{23}z + s_{24}) \quad (2\text{-}11)$$

or, in terms of the actual image coordinates,

$$x_i(s_{31}x + s_{32}y + s_{33}z + s_{34}) = s_{11}x + s_{12}y + s_{13}z + s_{14}$$
$$y_i(s_{31}x + s_{32}y + s_{33}z + s_{34}) = s_{21}x + s_{22}y + s_{23}z + s_{24} \quad (2\text{-}12)$$

Given a set of $n$ pairs of world and image coordinates, equation (2-12) can be written in matrix form as,

$$\begin{bmatrix} x & y & z & 1 & 0 & 0 & 0 & 0 & -x_i x & -x_i y & -x_i z & -x_i \\ 0 & 0 & 0 & 0 & x & y & z & 1 & -y_i x & -y_i y & -y_i z & -y_i \\ & & & & & \vdots & & & & & & \end{bmatrix} \begin{bmatrix} s_{11} \\ \vdots \\ s_{34} \end{bmatrix} = 0 \quad (2\text{-}13)$$

Arbitrarily setting $s_{34} = 1$ leaves 11 unknown parameters, which can be solved for using linear regression. In general, the more correspondence pairs that are defined, the less susceptible the model is to noise. To a large extent, this thesis aims to reduce or eliminate the need to collect many such precise correspondence pairs, as that can be labor intensive and/or prone to human operator error.
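The linear regression can be sketched as below. The function name `fit_dlt` is a hypothetical helper, and the least-squares setup assumes the s34 = 1 normalization described above: each correspondence contributes one row per image coordinate, with the 11 remaining parameters as unknowns.

```python
import numpy as np

def fit_dlt(world_pts, image_pts):
    """Estimate the 3x4 matrix S (with s34 fixed to 1) from 3D-2D
    correspondences by linear least squares."""
    A, b = [], []
    for (x, y, z), (xi, yi) in zip(world_pts, image_pts):
        # Two rows per correspondence, one for each image coordinate.
        A.append([x, y, z, 1, 0, 0, 0, 0, -xi * x, -xi * y, -xi * z])
        b.append(xi)
        A.append([0, 0, 0, 0, x, y, z, 1, -yi * x, -yi * y, -yi * z])
        b.append(yi)
    s, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    # Reassemble the 11 estimated parameters plus s34 = 1 into S.
    return np.vstack([s[0:4], s[4:8], np.append(s[8:11], 1.0)])
```

At least six well-distributed, non-coplanar points are needed to constrain the 11 unknowns; as noted above, more correspondence pairs make the fit less susceptible to noise.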
CHAPTER 3
A SINGLE CAMERA
3.1 Introduction
A common method of calibration is to place a special object in the field of view of the camera [10, 11, 16, 21]. Since the 3D shape of the object is known a priori, the 3D coordinates of specified reference points on the object can be defined in an object-relative coordinate system. One of the more popular calibration objects described in the literature consists of a flat surface with a regular pattern marked on it. This pattern is typically chosen such that the image coordinates of the projected reference points (corners, for example) can be measured with great accuracy. Given a number of these points, each one yielding an equation of the form (2-9), the perspective transformation matrix S can be estimated with good results. One drawback of such techniques is that sometimes a calibration object might not be available. Even when a calibration object is available, the world coordinate system is defined by the placement of that object with respect to the camera, and is not necessarily optimized to take advantage of the geometry of the scene.

In another popular method, called structure from motion, the camera is moved relative to the world, and points of correspondence between consecutive images define the camera model [7, 21, 22]. In this approach, however, only the intrinsic parameters of the camera can be estimated (e.g., the K matrix); as such, this method is used primarily for stereo vision and will be addressed in the subsequent chapter.
In our work, we propose to divide the calibration process into two stages. First, we propose to generate an initial "close" camera model, which then gets optimized to runtime quality through standard machine learning techniques. Since our system will improve its model over time, all we require initially is a method for generating reasonably close calibrations without undue effort or complexity. This type of problem was addressed by Worrall et al. [19] through a graphical calibration tool, where the user can rotate and translate a known grid to the desired position on the image. The system then calculates a perspective projection matrix S that will place the grid at that location in the image. Our group has implemented a similar, intuitive GUI interface, which allows us to generate fast and easy initial calibration estimates; this interface is described in further detail in Appendix B.
3.2 Training Edges
Now, in order to improve the calibration model from the GUI interface through machine learning, we need something for the machine learning algorithm to train on. Since we do not want to require human operators to meticulously and precisely select numerous known correspondence pairs in an image, we should select features in the image that can be easily isolated or extracted through simple image processing techniques. In man-made environments, constrained edges with known dimensions are frequent and stand out visually; some examples of these might be the intersection of the floor and a wall, the corner of a room, window sills, etc. These features can provide a wealth of training data without requiring explicit and precise image-to-world correspondence. As such, we choose to rely on such constrained data for improving our initial camera calibration model.

Exploiting the geometry of a given scene, a set of lines is chosen in 3D space, where each of the lines is constrained along two dimensions; for example, the vertical intersection of two perpendicular walls is constrained by $x = C_0$ and $y = C_1$, where $C_0$ and $C_1$ are known constants. The pixels corresponding to each of the lines and the constraints that define each line are the basis for our model improvement algorithm. In order not to bias the training algorithm, the lines (edges) should be chosen so as to provide training data that is balanced throughout the region of interest, both in area and scale, as shown, for example, in Figure 3-1.
3.3 Optimization Criterion
Given our constrained-edges training data, we must define a reasonably well-behaved optimization criterion that lets us know how the training algorithm is progressing. Care should be used when choosing the optimization criterion, as it is the only mechanism a system has to evaluate a potential model. After some experimentation, our final optimization criterion was designed to reflect the error between the actual and projected pixel locations of the constrained edges and is defined by,

$$E = \frac{1}{m} \sum_{e=1}^{m} \frac{1}{n_e} \sum_{i=1}^{n_e} E_{e,i} \quad (3\text{-}1)$$

where $e$ denotes a training edge, $i$ denotes a pixel along that edge, $n_e$ denotes the number of pixels along a particular edge, and $m$ denotes the total number of edges. In equation (3-1), $E_{e,i}$ is defined by,

[Figure 3-1: Example Training Edges. (a), (b)]

$$E_{e,i} = \sqrt{(x_s - x_p)^2 + (y_s - y_p)^2} \quad (3\text{-}2)$$

where $(x_s, y_s)$ is the actual pixel location in the training set and $(x_p, y_p)$ is the projected pixel location, which is computed as follows. First, given the two constraints of a single edge (e.g., $x = C_0$, $y = C_1$), a pixel from the corresponding image line $(x_s, y_s)$, and the current perspective projection model $S$, apply equation (2-12) to generate two equations and one unknown. For example, for the constraints specified above, the two equations would become,

$$(s_{31}x_s - s_{11})C_0 + (s_{32}x_s - s_{12})C_1 + (s_{33}x_s - s_{13})z = s_{14} - s_{34}x_s$$
$$(s_{31}y_s - s_{21})C_0 + (s_{32}y_s - s_{22})C_1 + (s_{33}y_s - s_{23})z = s_{24} - s_{34}y_s \quad (3\text{-}3)$$

Equations (3-3) are easily generalized to arbitrary lines in space, and can be solved for the unknown coordinate (or parameter, in the general case) through linear regression. Then, we can project the resulting 3D coordinate onto the image, using equations (2-8) and (2-7), to get $(x_p, y_p)$.

The camera perspective will have some effect on the training data. Given two equal-length edges, the one with a larger image cross-section will have more sample points and therefore will contribute more to the error. To compensate for this, the error generated by each edge is averaged to an error per pixel along that edge. The average error of each edge is then averaged to obtain the final error measure $E$.
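Putting these pieces together, the criterion can be sketched as follows. The function names are hypothetical, each edge is assumed vertical (x = C0, y = C1), and the single unknown z is recovered by one-dimensional least squares over the two equations.

```python
import numpy as np

def edge_error(S, C0, C1, pixels):
    """Average reprojection error (in pixels) along one vertical edge
    constrained by x = C0, y = C1; `pixels` is a list of (x_s, y_s)."""
    total = 0.0
    for xs, ys in pixels:
        # Two linear equations in the single unknown z.
        a = np.array([S[2, 2] * xs - S[0, 2],
                      S[2, 2] * ys - S[1, 2]])
        b = np.array([
            S[0, 3] - S[2, 3] * xs - (S[2, 0] * xs - S[0, 0]) * C0 - (S[2, 1] * xs - S[0, 1]) * C1,
            S[1, 3] - S[2, 3] * ys - (S[2, 0] * ys - S[1, 0]) * C0 - (S[2, 1] * ys - S[1, 1]) * C1,
        ])
        z = a.dot(b) / a.dot(a)                 # 1-D linear least squares
        u, v, w = S @ np.array([C0, C1, z, 1.0])  # reproject the recovered 3D point
        total += np.hypot(xs - u / w, ys - v / w)
    return total / len(pixels)

def total_error(S, edges):
    """Mean over edges of the per-pixel edge error; `edges` holds (C0, C1, pixels)."""
    return sum(edge_error(S, C0, C1, px) for C0, C1, px in edges) / len(edges)
```

The per-edge averaging implements the perspective compensation described above: a long image cross-section contributes many pixels but still counts as one edge.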
3.4 Initial Models
Error in a calibration model can be decomposed into three basic types: rotation, translation, and scale. Any training algorithm may handle these different sources of error with varying degrees of success. Therefore, we investigate how well our algorithm (defined below) performs on improving initial models in four general classes, as defined and illustrated in Figure 3-2. Each of the first three initial models in Figure 3-2 is labeled by the dominant type of error displayed. The fourth model exhibits a combination of errors, but was generated to be fairly close to an acceptable final model. Given a graphical calibration tool, as described in Appendix B, the close model represents an easily achievable and therefore most likely starting configuration. The other three cases are presented to establish the limits of our training approach and to determine the types of error that present the most difficulty.
3.5 Gradient Descent
The error measure defined in equation (3-2) can be expanded to,

[Figure 3-2: Error Types of Initial Models. (a) Close, (b) Rotation, (c) Scale, (d) Translation]

$$E_{e,i} = \sqrt{\left(x_s - \frac{s_{11}x + s_{12}y + s_{13}z + s_{14}}{s_{31}x + s_{32}y + s_{33}z + s_{34}}\right)^2 + \left(y_s - \frac{s_{21}x + s_{22}y + s_{23}z + s_{24}}{s_{31}x + s_{32}y + s_{33}z + s_{34}}\right)^2} \quad (3\text{-}4)$$

This is a differentiable function in $S$ for which we can compute the gradient $\nabla E_{e,i}$ with respect to the parameters in $S$ such that,

$$\nabla E_{e,i} = \begin{bmatrix} \dfrac{\partial E_{e,i}}{\partial s_{11}} & \cdots & \dfrac{\partial E_{e,i}}{\partial s_{33}} \end{bmatrix}^T \quad (3\text{-}5)$$

Note that since $s_{34}$ is assigned to be equal to one, it is not part of the gradient in equation (3-5). From equation (3-1), the overall gradient $\nabla E$ is then given by,

$$\nabla E = \frac{1}{m} \sum_{e=1}^{m} \frac{1}{n_e} \sum_{i=1}^{n_e} \nabla E_{e,i} \quad (3\text{-}6)$$

Given this gradient, the current model $S$ can now be modified by a small positive constant $\delta_g$ in the direction of the negative gradient,

$$S_{new} = (1 - \delta_g \nabla E)\,S \quad (3\text{-}7)$$

For error surfaces that can be roughly approximated as quadratic, we would expect the model recursion in equation (3-7) to converge to a good near-optimal solution. Figure 3-3 below illustrates the performance of pure gradient descent for the four types of initial models. From these plots, it is apparent that the gradient descent recursion very quickly gets stuck in a local minimum that is far from optimal; all four types of initial models caused gradient descent to fail within 2.75 seconds.1 In other words, the error surface is decidedly non-quadratic in a global sense, and gradient descent represents at best only a partial training algorithm for this problem.

1. All experiments in this thesis were run on a 700 MHz Pentium III running Linux.
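One descent step can be sketched as below. The thesis differentiates the error analytically; here a finite-difference gradient stands in for that derivation, and the elementwise multiplicative update is one plausible reading of the (1 − δg∇E)S notation, so both choices are assumptions rather than the thesis's exact implementation.

```python
import numpy as np

def numeric_grad(error_fn, S, h=1e-6):
    """Finite-difference stand-in for the analytic gradient;
    s34 is pinned to 1 and therefore excluded from the gradient."""
    g = np.zeros_like(S)
    for i in range(3):
        for j in range(4):
            if (i, j) == (2, 3):
                continue                       # s34 stays fixed at 1
            Sp, Sm = S.copy(), S.copy()
            Sp[i, j] += h
            Sm[i, j] -= h
            g[i, j] = (error_fn(Sp) - error_fn(Sm)) / (2 * h)
    return g

def gradient_step(error_fn, S, delta_g=1e-3):
    """One multiplicative descent step: S_new = (1 - delta_g * grad) * S elementwise."""
    return (1.0 - delta_g * numeric_grad(error_fn, S)) * S
```

With a well-behaved error surface, repeating this step shrinks the error; as the text shows, on the real calibration error surface it stalls quickly in a poor local minimum.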
3.6 Model Perturbation
Another method for improving the model is stochastic model perturbation. In this approach, the current model $S$ is first perturbed by a small delta $\delta_p$ along a random direction,

$$S_{new} = (1 + \delta_p \nabla D)\,S \quad (3\text{-}8)$$

The error for the new model $S_{new}$ is computed; if it represents an improvement, the current model becomes the new model; otherwise, the new model is discarded, and we simply try another random perturbation of the current model. This approach is much less sensitive to the local minima problem, since the random perturbations can effectively "jump out" of local minima; that is, the direction $\nabla D$ is not constrained to be in the negative gradient direction.

[Figure 3-3: Gradient Descent vs. Different Types of Error. (a) Close initial model, (b) Rotation Error, (c) Scale Error, (d) Translation Error]

Figure 3-4 illustrates training results for the four initial model types. Note that there is an immediate and significant improvement for all of the initial models, particularly for the cases of translation error, where training results in an order-of-magnitude improvement. In each of the cases, the bulk of the model improvement occurs in less than one minute. The models improve further over time, although at some point there is a trade-off between model quality and computing time.
One possible approach to speeding up improvement over time is to introduce an adaptive perturbation delta that grows when the current model is stuck for some period of time in a relatively isolated local minimum. Examples of convergence with an adaptive delta are shown in Figure 3-5.

[Figure 3-4: Stochastic Perturbation vs. Different Types of Error. (a) Close initial model, (b) Rotation Error, (c) Scale Error, (d) Translation Error]

Not surprisingly, the results are very similar, but the adaptive delta does improve performance slightly, and, we expect, would increasingly improve performance if used over a longer period of time.

[Figure 3-5: Stochastic Perturbation with Adaptive Delta vs. Different Types of Error. (a) Close initial model, (b) Rotation Error, (c) Scale Error, (d) Translation Error]
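A minimal sketch of the perturbation loop with an adaptive delta follows; `error_fn` is assumed to be the averaged pixel-error criterion defined earlier, and the growth factor and patience threshold are invented defaults, not the thesis's settings.

```python
import numpy as np

def perturb(error_fn, S, steps=1000, delta0=1e-3, grow=2.0, patience=50, rng=None):
    """Stochastic model perturbation: try S_new = (1 + delta * D) * S for a
    random direction D, keep it only if the error improves; delta grows when
    no improvement is seen for `patience` tries (the adaptive delta)."""
    rng = rng or np.random.default_rng()
    best_err, delta, stuck = error_fn(S), delta0, 0
    for _ in range(steps):
        D = rng.standard_normal(S.shape)        # random direction
        D[2, 3] = 0.0                           # keep s34 fixed at 1
        S_new = (1.0 + delta * D) * S
        err = error_fn(S_new)
        if err < best_err:                      # keep improvements only
            S, best_err, delta, stuck = S_new, err, delta0, 0
        else:
            stuck += 1
            if stuck >= patience:               # stuck: widen the search
                delta *= grow
                stuck = 0
    return S, best_err
```

Because a failed perturbation is simply discarded, the error is monotonically non-increasing, which matches the training curves described above.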
3.7 Gradient-Perturbation Hybrid

Given the relative speed of convergence of gradient descent in localized near-quadratic neighborhoods, and the insensitivity of stochastic perturbation to local minima, a combination of gradient descent and model perturbation may well train faster than either approach by itself. In this hybrid approach, gradient descent will quickly reach local minima, at which point adaptive-delta model perturbation takes over to search for better regions in model parameter space. In other words, gradient descent and stochastic perturbation alternate in optimizing the camera model over time. Sample results for this approach are shown in Figure 3-6.
[Figure 3-6: Gradient-Perturbation Hybrid vs. Different Types of Error. (a) Close initial model, (b) Rotation Error, (c) Scale Error, (d) Translation Error]
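The alternation can be sketched in one self-contained loop; the step sizes, round counts, and stall test below are illustrative choices, not those used in the thesis, and the finite-difference gradient again stands in for the analytic one.

```python
import numpy as np

def hybrid_calibrate(error_fn, S, rounds=10, delta_g=1e-3, delta_p=1e-3, rng=None):
    """Sketch of the gradient/perturbation hybrid: run multiplicative gradient
    descent until it stalls, then random perturbations until one improves,
    and repeat."""
    rng = rng or np.random.default_rng()

    def grad(S, h=1e-6):                        # finite-difference gradient
        g = np.zeros_like(S)
        for idx in np.ndindex(S.shape):
            if idx == (2, 3):
                continue                        # s34 stays fixed at 1
            Sp, Sm = S.copy(), S.copy()
            Sp[idx] += h
            Sm[idx] -= h
            g[idx] = (error_fn(Sp) - error_fn(Sm)) / (2 * h)
        return g

    err = error_fn(S)
    for _ in range(rounds):
        for _ in range(300):                    # descend until stalled
            S_try = (1.0 - delta_g * grad(S)) * S
            e = error_fn(S_try)
            if e >= err:
                break
            S, err = S_try, e
        for _ in range(200):                    # random jumps out of the minimum
            D = rng.standard_normal(S.shape)
            D[2, 3] = 0.0
            S_try = (1.0 + delta_p * D) * S
            e = error_fn(S_try)
            if e < err:
                S, err = S_try, e
                break
    return S, err
```

Each phase only ever accepts improvements, so the hybrid inherits the monotone error decrease of pure perturbation while spending most of its steps in the faster gradient phase.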
3.8 Performance Comparisons
Which approach is best for our purposes depends on a number of different criteria. Ideally, we would like a method that responds quickly, improves in the face of difficult models, and is able to continue to improve if given an open-ended amount of time. As we have already seen, and as we further detail in Tables 3-1, 3-2, and 3-3, each method improves the camera model the most within
Table 3-1: Model Perturbation: Average Error Per Pixel

              Initial   20 Sec.   1 Min.   5 Min.  10 Min.  20 Min.
Close          3.9336    2.0053   1.9618   1.6327   1.5182   1.4246
Rotation      21.2163    8.8545   6.5318   6.2082   6.1391   5.8464
Scale         28.9091    6.9955   6.9518   6.8318   5.4664   5.2064
Translation   23.0482    1.7082   1.6143   1.3182   1.1909   1.1618

Table 3-2: Model Perturbation With Adaptive Delta: Average Error Per Pixel

              Initial   20 Sec.   1 Min.   5 Min.  10 Min.  20 Min.
Close          3.9336    2.0053   1.9691   1.64     1.5972   1.4127
Rotation      21.2163    9.138    6.8472   6.22     5.9091   5.2491
Scale         28.9091    7.0035   6.9515   5.8091   5.2945   4.5236
Translation   23.0482    1.7082   1.6145   1.3255   1.2227   1.19

Table 3-3: Gradient-Perturbation Hybrid: Average Error Per Pixel

              Initial   20 Sec.   1 Min.   5 Min.  10 Min.  20 Min.
Close          3.9336    2.0364   1.9173   1.5664   1.4736   1.3355
Rotation      21.2164   12.4909   8.0218   5.3855   5.1991   5.0564
Scale         28.9091   14.13     6.7918   5.7545   5.5655   4.6891
Translation   23.0482    1.6909   1.5627   1.4      1.3118   1.1964
the first minute of training. For each approach, the error then continues to decline, but the two techniques that make use of an adaptive delta show a continuing ability to improve over time, with the hybrid approach outperforming the others. A direct comparison is shown in Figures 3-7 and 3-8,
[Figure 3-7: Performance Comparison for the Translational and Scale Initial Models]
which plot the difference in error between the different techniques over time. The final camera models trained by the hybrid method are depicted in Figure 3-9.
[Figure 3-8: Performance Comparison for the Close and Rotational Initial Models]
[Figure 3-9: Final Models Using the Gradient-Perturbation Hybrid Technique. (a) Close, (b) Rotation, (c) Scale, (d) Translation]
Why does training virtually eliminate translation error, while rotation and scale errors prove more difficult to correct? The answer lies in the camera model itself [see equation (2-8)]. Translation is parameterized by 3 of the 12 parameters in S, while rotation is parameterized by 9 parameters, which are additionally constrained by the orthonormality requirement for rotation matrices. Finally, scale affects all 12 parameters in S. Thus, to correct rotational errors, 9 model parameters must be changed simultaneously, while for scale errors, all 12 parameters must be changed. This adds complexity and size to the search space, resulting in slower improvement and a greater likelihood of getting stuck in a remote local minimum.
CHAPTER 4
STEREO CAMERAS
4.1 Introduction
We are interested in abstracting 3D information about the world through computer vision
techniques. Therefore, we will, in general, require more than one non-coincident camera view of
an area of interest, since pixels in a single-camera image do not correspond to exact 3D world
coordinates, but rather to rays in space. If we can establish a feature correspondence between at
least two camera views calibrated to the same world coordinate system, then we can extract the
underlying 3D information for that feature by intersecting the rays corresponding to its pixel
location in each image. Hence, in this chapter, we consider the problem of simultaneously calibrating
multiple (stereo) fixed cameras to a unique world coordinate system. More specifically, we build
on the results of the previous chapter by assuming that one camera in a multi-camera setup has
already been trained towards a good model. The task that remains is to calibrate the additional
cameras. In the remainder of this chapter, we focus on the two-camera case, although our results
generalize in a straightforward manner to more than two cameras.
Given a set of two cameras where one has already been trained, and the second has a rough
initial calibration, we propose to improve the second calibration through an iterative process that
involves using a "virtual calibration object." This calibration object is created by moving a physical
object throughout the area of interest and tracking it from each camera view. The path that the
object follows in the image (in pixels) and the corresponding 3D estimates constitute the training
data at each step of the algorithm.
4.2 Related Work
Previous work in stereo camera calibration is relatively extensive, but varies based on the
particular application of interest. Many of the previous methods use a known calibration pattern
whose features are extracted from each image. Rander and Kanade [12] have a system of approximately
50 cameras arranged in a dome to observe a room-sized space. In order to calibrate these
cameras, a large calibration object is built and then moved to several precise locations, in effect
building a virtual calibration object that covers most of the room. Do [4] applied a neural network
to acquire stereo calibration models and determined the approach to be adequate; however, he went
on to show that high accuracy did not appear possible in a reasonable amount of training time.
Horaud et al. [7] use rigid and constrained motions of a stereo rig to allow camera calibration. This
approach, like most in the field, addresses small-baseline stereo rigs, which usually have a fixed, rigid
geometry. Azarbayejani and Pentland [2] propose an approach that tracks an item of interest
through images to obtain usable training data. This approach is similar to ours except that they
use object tracking to completely establish a calibration rather than using it to modify an already
existing calibration. The drawback of this technique is that scale is not recoverable, so absolute
distance has no meaning. Chen et al. [3] offer a similar technique that uses structure-from-motion
to obtain initial camera calibration models, and then tracks a virtual calibration object to iterate to
better models. The initial calibrations are obtained sequentially, where each new calibration model
depends on those previously derived. As such, error accumulates with each successive calibration,
placing newer calibrations further away from an optimal solution. Our work has shown (e.g.,
Figures 3-4, 3-5 and 3-6) that this can have a significant impact on both the final calibration obtained
and the amount of time required to obtain it.
4.3 Our Approach
In this research, our primary motivation is to make the calibration process as quick and easy
as possible without sacrificing precision. We wish for a non-expert to be able to completely
calibrate a system for operation by making part of the calibration process visual and automating the
rest. Each camera obtains its initial calibration via a graphical calibration tool (as described in
Appendix B). The calibration is then improved by having the cameras track an object of interest
throughout the viewing area. Ideally, image capture and processing should be synchronized
between cameras; however, in practice, such synchronization is difficult and expensive to achieve
in hardware. Even for an asynchronous system, however, we can approximate synchronous operation
through Kalman filtering and interpolation of time-stamped images, as long as the system
clocks are synchronized between the image processing computers. This Kalman-filtered and
interpolated trajectory then becomes the training data for improving our stereo camera models.
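To make the interpolation step concrete, here is a minimal sketch of resampling one camera's time-stamped track onto a uniform timeline. It is a simplification of the actual pipeline: the Kalman filter is omitted and plain linear interpolation stands in for it, and the function name and signature are illustrative rather than taken from our implementation.

```python
def resample_track(times, points, rate_hz=30.0):
    """Linearly interpolate a time-stamped 2D pixel track onto a
    uniform timeline, approximating synchronous capture for an
    asynchronous camera.  `times` must be sorted ascending."""
    step = 1.0 / rate_hz
    n = int(round((times[-1] - times[0]) / step)) + 1
    out_t, out_p = [], []
    i = 0
    for k in range(n):
        # clamp so rounding error never steps past the last sample
        t = min(times[0] + k * step, times[-1])
        while times[i + 1] < t:  # advance to the bracketing samples
            i += 1
        a = (t - times[i]) / (times[i + 1] - times[i])
        x = points[i][0] + a * (points[i + 1][0] - points[i][0])
        y = points[i][1] + a * (points[i + 1][1] - points[i][1])
        out_t.append(t)
        out_p.append((x, y))
    return out_t, out_p
```

Running the same resampler on every camera's track yields samples on a shared timeline, which is what the data fusor needs.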
4.4 Training Data
In order to collect training data, we apply a modified version of Zapata's color-model-based
tracking algorithm [20] to track an object of interest from multiple camera views in real time. The
time-stamped pixel data of the object's centroid are passed through a first-order Kalman filter and
then sent to a multi-camera data fusor. The data fusor accepts the time-stamped data streams,
interpolating and synchronizing them at 30 Hz. Prior to training, the synchronized
tracking data are balanced so that no single region of the image contributes a disproportionately
large amount of data. Although the training examples reported in this thesis use fixed data
sets, there is no algorithmic obstacle to training on streaming data in real time.
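One simple way to implement this balancing, sketched below under our own assumptions (the grid size and per-cell cap are illustrative parameters, not values from our system), is to bucket the samples by coarse image region and cap the count per bucket:

```python
import random

def balance_samples(samples, width, height, grid=8, cap=50, seed=0):
    """Cap the number of training samples per coarse image region so
    that no region dominates.  `samples` are tuples whose first two
    fields are pixel coordinates; `grid` x `grid` cells tile the image."""
    rng = random.Random(seed)
    cells = {}
    for s in samples:
        cx = min(int(s[0] * grid / width), grid - 1)
        cy = min(int(s[1] * grid / height), grid - 1)
        cells.setdefault((cx, cy), []).append(s)
    balanced = []
    for bucket in cells.values():
        if len(bucket) > cap:
            bucket = rng.sample(bucket, cap)  # random subsample per cell
        balanced.extend(bucket)
    return balanced
```

A denser grid or a smaller cap trades coverage uniformity against the total amount of training data retained.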
4.5 Model Improvement
The training data now consists of n synchronized sets of pixel values from the m multiple
cameras. Let us consider the set of pixel values for the object at time t:

{(x_t^1, y_t^1), (x_t^2, y_t^2), ..., (x_t^m, y_t^m)}    (4-1)

where (x_t^j, y_t^j) denotes the pixel location of the object for camera j. Given our current estimate of
each camera model S^j, we can estimate the 3D world coordinate of the object at time t by regressing
the equations,

(s_{31,j} x_t^j - s_{11,j}) x_t + (s_{32,j} x_t^j - s_{12,j}) y_t + (s_{33,j} x_t^j - s_{13,j}) z_t = s_{14,j} - s_{34,j} x_t^j    (4-2)
(s_{31,j} y_t^j - s_{21,j}) x_t + (s_{32,j} y_t^j - s_{22,j}) y_t + (s_{33,j} y_t^j - s_{23,j}) z_t = s_{24,j} - s_{34,j} y_t^j

j ∈ {1, 2, ..., m}, for (x_t, y_t, z_t), which denotes the estimated 3D world coordinate at time t.
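In practice, the stacked regression equations above form an overdetermined linear system in the 3D coordinate that can be solved by ordinary least squares. A minimal numpy-based sketch (the function name and calling convention are our own illustration, not the thesis implementation):

```python
import numpy as np

def triangulate(pixels, models):
    """Estimate the 3D point seen at `pixels` by the cameras `models`.

    pixels: list of (x_j, y_j) image coordinates, one per camera
    models: list of 3x4 perspective projection matrices S^j
    Stacks the two linear equations per camera and solves the
    overdetermined system by least squares.
    """
    A, b = [], []
    for (px, py), S in zip(pixels, models):
        # row from the x-pixel equation
        A.append(S[2, :3] * px - S[0, :3])
        b.append(S[0, 3] - S[2, 3] * px)
        # row from the y-pixel equation
        A.append(S[2, :3] * py - S[1, :3])
        b.append(S[1, 3] - S[2, 3] * py)
    X, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return X  # estimated (x, y, z)
```

With m cameras this gives 2m equations in three unknowns, so any m >= 2 yields a solvable system.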
For each training camera, we now have n estimated correspondence pairs between the
synchronized pixel values for that camera and their corresponding estimated 3D world coordinates.
Given these data, we can now apply equation (2-13) to generate a new perspective projection matrix
S_new^j based on the estimated 3D tracking data. This process is repeated until the calibration
reaches acceptable precision. Figure 4-1 shows the error over time for the above approach. There is
initial improvement followed by rapid and consistent model degradation. This is not what we
expected, but it can be explained by looking a little closer at how the procedure operates.
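Each re-estimation step is itself a linear least-squares fit of the 12 entries of the projection matrix to the estimated 2D-3D correspondences. Below is a sketch of such a fit (a standard DLT-style solve; the numpy-based function and the normalization s34 = 1 are our illustrative assumptions and may differ in detail from the thesis formulation):

```python
import numpy as np

def fit_projection(world, pixels):
    """Fit a 3x4 perspective projection matrix S to correspondences.

    world:  (n, 3) array of 3D world coordinates
    pixels: (n, 2) array of corresponding pixel coordinates
    Solves the linear system with the normalization s34 = 1; needs at
    least six non-coplanar points.
    """
    A, b = [], []
    for (X, Y, Z), (x, y) in zip(world, pixels):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -x * X, -x * Y, -x * Z])
        b.append(x)
        A.append([0, 0, 0, 0, X, Y, Z, 1, -y * X, -y * Y, -y * Z])
        b.append(y)
    s, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return np.append(s, 1.0).reshape(3, 4)  # s34 fixed at 1
```

Because the solve is unconstrained, every entry of S, including those determined by the intrinsic matrix K, is free to move; this is exactly the freedom that causes the degradation discussed below.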
The resulting calibration in Figure 4-1 gives an indication of the source of the problem. Recall
from equation (2-8) that part of the projection matrix is the camera matrix K, which contains
intrinsic parameters of the camera such as scale, skew and offset. These parameters are fixed in
reality, but are obviously being changed by this training process. This happens due to the
unconstrained nature of the training process: known incorrect (x, y, z) values are used to train a
model which is simply a linear least-squares solution, and K, as a component of this model, is also
being changed. We are seeking to use a good model to train a bad model by modifying its rotation
and translation, not its intrinsic camera parameters. A property of the linear least-squares estimate
[Figure 4-1: Results for a Non-Weighted Model Improvement: (a) error in mm vs. time in seconds, (b) initial calibration, (c) best calibration, (d) final calibration]
is that it distributes the error as evenly as possible between the two models. This even distribution
of the error is not appropriate for us, since we know that the source of the error is the bad model.
Weighting the regression towards the trained model, treating it as the more reliable of the two,
might help. Figure 4-2 shows the error vs. time as a function of different
weights applied to the training model. It shows that we can mitigate the effect on the K matrix of
the training model with a weighted regression. A gain of ten shows better performance, but is still
unstable. A gain of 100 or 1000 shows much better stability and performance. Looking at the error
over time for the different gains, it appears that a higher gain exhibits a similar response when
viewed on a longer time scale.
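The weighted regression can be sketched as a small modification of the least-squares solve for the 3D point: the equations contributed by the trained camera are simply scaled by its gain before solving. As before, numpy and the function shape are our illustrative choices, not the thesis implementation.

```python
import numpy as np

def triangulate_weighted(pixels, models, gains):
    """Weighted least-squares estimate of a 3D point.  Rows contributed
    by a well-trained camera are scaled by its gain, pulling the
    solution toward that camera's model instead of splitting the
    error evenly between the models."""
    A, b = [], []
    for (px, py), S, g in zip(pixels, models, gains):
        A.append(g * (S[2, :3] * px - S[0, :3]))
        b.append(g * (S[0, 3] - S[2, 3] * px))
        A.append(g * (S[2, :3] * py - S[1, :3]))
        b.append(g * (S[1, 3] - S[2, 3] * py))
    X, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return X
```

With a large gain, the estimated 3D points lie nearly on the trained camera's rays, so the subsequent model re-fit disturbs that camera's calibration far less.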
[Figure 4-2: Error Over Time for Various Gains on the Training Model: (a) gain of 10, (b) gain of 100, (c) gain of 1000, (d) gain of 10000 (error in mm vs. time in seconds)]
[Figure 4-3: The Long-Term Performance Using Various Gains: (a) gain of 1, (b) gain of 10, (c) gain of 100, (d) gain of 1000 (error in mm vs. time in seconds)]
CHAPTER 5
FURTHER RESULTS AND DISCUSSION
5.1 Single Camera
Here, we present some further results for the algorithms developed in the previous two
chapters. Below, we solve eight sample calibration problems, where the initial calibrations are obtained
using our previously referenced graphical calibration tool. These are shown in Figures 5-1, 5-2, 5-3
and 5-4. Given these decent initial calibrations, our algorithms develop run-time-quality models
within two minutes from the start of training. Figures 5-5 and 5-6 show the system's performance
in the face of poor initial calibrations. These figures show that even a poor initial
calibration can be improved significantly, but not always to run-time quality. Overall, these
results show that our single-camera calibration approach meets our goal, namely, making
camera calibration a simpler, faster and easier process.
5.2 Stereo Cameras
Figures 5-7 and 5-8 show four example calibrations obtained from two different camera
angles. The initial calibrations are obtained using our graphical calibration tool. These calibrations
and the associated graphs reveal the capabilities and shortcomings of our proposed stereo
technique. While in its present form it may not yet be sufficiently robust for real-world application,
it can certainly be made so with minor changes. As we discussed in Chapter 4, if we
constrained modifications of the camera model to extrinsic parameters only, keeping the intrinsic
parameters of the model fixed, we expect that the observed model drift would no longer occur.
[Figure 5-1: Example Single Camera Calibrations (1 & 2): initial and final calibrations with error vs. time plots]
[Figure 5-2: Example Single Camera Calibrations (3 & 4): initial and final calibrations with error vs. time plots]
[Figure 5-3: Example Single Camera Calibrations (5 & 6): initial and final calibrations with error vs. time plots]
[Figure 5-4: Example Single Camera Calibrations (7 & 8): initial and final calibrations with error vs. time plots]
[Figure 5-5: Poor Initial Calibrations (1 & 2): initial and final calibrations with error vs. time plots]
[Figure 5-6: Poor Initial Calibrations (3 & 4): initial and final calibrations with error vs. time plots]
[Figure 5-7: Stereo Calibration Examples (1 & 2): initial and final calibrations with error vs. time plots]
[Figure 5-8: Stereo Calibration Examples (3 & 4): initial and final calibrations with error vs. time plots]
APPENDIX A
VISUAL CALIBRATION EVALUATION
When applying machine learning to improve camera models, a well-behaved error measure
is critical. With exact correspondence data (2D image coordinates paired with 3D world coordinates),
such an error measure is easily specified. It is precisely this type of data that we want to
avoid having to collect, however. Yet, without such data, it is difficult, if not impossible, to define a
globally well-behaved error measure. As such, we choose the human eye as an appropriate means
of evaluating different calibration models. To make this intuitive, we draw a grid defined in space
onto the image using the current camera model, and decide visually whether the system's error measure
is accurately reflecting the quality of a model. This grid is chosen to match some feature(s) in the
scene, allowing a person to visually determine the quality of a calibration. Figure A-2 shows the
experimental area. In Figure A-2, the red arrows indicate the corner points of a one-meter cube
placed in the far corner of the room, while the green arrows show the world coordinate system,
defined after considering the experimental area and how to keep initial data collection as simple as
possible. Figure A-1 shows a sample grid drawn on an image. The grid uses 20 cm squares and
should ideally be aligned with the floor and walls. The outside corner of the fifth grid square out
(or up) should lie on each point marked with a red arrow in Figure A-2. The windowsill provides a
straight edge that a human can use as a guide for the top edge of the grid.
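Drawing such a grid reduces to projecting each 3D grid vertex through the current camera model S and connecting neighboring image points. A minimal sketch of the projection step (numpy-based; the function names and grid extent are our illustrative choices):

```python
import numpy as np

def project(S, point):
    """Project a 3D world point through a 3x4 projection matrix S,
    returning pixel coordinates."""
    h = S @ np.append(point, 1.0)   # homogeneous image point
    return h[:2] / h[2]

def grid_points(extent_m=2.0, spacing_m=0.2):
    """3D vertices of a floor grid of `spacing_m` squares (the 20 cm
    squares of the calibration grid), lying in the z = 0 plane."""
    n = int(round(extent_m / spacing_m)) + 1
    return [(i * spacing_m, j * spacing_m, 0.0)
            for i in range(n) for j in range(n)]
```

Projecting every vertex and drawing line segments between grid neighbors overlays the grid on the live image for visual inspection.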
Figure A-2: The Experimental Area
Figure A-1: The Calibration Grid
APPENDIX B
GRAPHICAL CALIBRATION TOOL
In order to generate initial rough calibration models quickly and easily, we developed an
intuitive graphical interface through which a user can manipulate and align a grid model of a scene
for different rotations and translations of the camera with respect to the world. This interface, as
shown in Figure B-1, was written in C for the X/Motif window system.

Given approximate intrinsic parameters for a camera, we can project a three-dimensional
model of the world onto an image for any given rotation, translation and scale (effective focal
length). Therefore, as the user adjusts parameters that control scale, rotation and translation, the
grid model of the world is continuously redrawn to reflect the new effective pose of the camera.
Once the user is happy with the alignment of the grid model to the world, he/she can then save the
effective perspective projection matrix and use that as the initial camera calibration model.
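Saving the effective perspective projection matrix amounts to composing S = K[R | t] from the user's current settings. A sketch of that composition, with rotation restricted to yaw about the world z-axis purely for illustration (the tool itself handles general rotations):

```python
import numpy as np

def projection_matrix(f, cx, cy, yaw, t):
    """Compose a 3x4 projection matrix S = K [R | t] from an effective
    focal length f, principal point (cx, cy), a yaw rotation in
    radians about the world z-axis, and a translation vector t."""
    K = np.array([[f,   0.0, cx],
                  [0.0, f,   cy],
                  [0.0, 0.0, 1.0]])
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c,  -s,  0.0],
                  [s,   c,  0.0],
                  [0.0, 0.0, 1.0]])
    return K @ np.hstack([R, np.reshape(t, (3, 1))])
```

Redrawing the grid after each slider change is then just re-projecting the grid vertices through the recomposed S.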
Figure B-1: The Graphical Calibration Tool
REFERENCES
[1] Y. I. Abdel-Aziz and H. M. Karara, "Direct Linear Transformation from Comparator Coordinates into Object-Space Coordinates," Proc. American Society of Photogrammetry Symposium on Close-Range Photogrammetry, pp. 1-18, 1971.
[2] A. Azarbayejani and A. Pentland, "Camera Self-Calibration from One Point Correspondence," Perceptual Computing Technical Report #341, Massachusetts Institute of Technology Media Laboratory, 1995.
[3] X. Chen, J. Davis and P. Slusallek, "Wide Area Calibration Using Virtual Calibration Objects," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 520-7, 2000.
[4] Y. Do, "Application of Neural Networks for Stereo-Camera Calibration," Proc. Int. Joint Conf. on Neural Networks, vol. 4, pp. 2719-22, 1999.
[5] R. Frezza and C. Altafini, "Autonomous Landing by Computer Vision: An Application of Path Following in SE(3)," Proc. IEEE Conf. on Decision and Control, vol. 3, pp. 2527-32, 2000.
[6] I. Haritaoglu, D. Harwood and L. Davis, "W4: Who? When? Where? What? A Real Time System for Detecting and Tracking People," Proc. IEEE Int. Conf. on Face and Gesture Recognition, pp. 222-7, 1998.
[7] R. Horaud, G. Csurka and D. Demirdjian, "Stereo Calibration from Rigid Motions," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1446-52, 2000.
[8] S. Kamijo, Y. Matsushita, K. Ikeuchi and M. Sakauchi, "Traffic Monitoring and Accident Detection at Intersections," IEEE Trans. on Intelligent Transportation Systems, vol. 1, no. 2, pp. 108-18, 2000.
[9] F. Karbou and F. Karbou, "An Interval Approach to Handwriting Recognition," Proc. Conf. of the North American Fuzzy Information Processing Society, pp. 153-7, 2000.
[10] J.M. Lee, B.H. Kim, M.H. Lee, K. Son, M.C. Lee, J.W. Choi and S.H. Han, "Fine Active Calibration of Camera Position/Orientation Through Pattern Recognition," Proc. IEEE Int. Symposium on Industrial Electronics, vol. 2, pp. 657-62, 1998.
[11] P. Mendonca and R. Cipolla, "A Simple Technique for Self-Calibration," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 500-5, 1999.
[12] P. Rander, "A Multi-Camera Method for 3D Digitization of Dynamic, Real-World Events," CMU-RI-TR-98-12, Technical Report, The Robotics Institute, Carnegie Mellon University, 1998.
[13] G. Rigoll, S. Eickeler and S. Muller, "Person Tracking in Real-World Scenarios Using Statistical Methods," Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition, pp. 342-7, 2000.
[14] H. Schneiderman, "A Statistical Approach to 3D Object Detection Applied to Faces and Cars," CMU-RI-TR-00-06, Ph.D. Thesis, The Robotics Institute, Carnegie Mellon University, 2000.
[15] R. Sharma, M. Zeller, V.I. Pavlovic, T.S. Huang, Z. Lo, S. Chu, Y. Zhao, J.C. Phillips and K. Schulten, "Speech/Gesture Interface to a Visual-Computing Environment," IEEE Computer Graphics and Applications, vol. 20, no. 2, pp. 29-37, 2000.
[16] R. Tsai, "A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses," IEEE Journal of Robotics and Automation, vol. RA-3, no. 4, pp. 323-44, 1987.
[17] H. Wang, W. Hsu, K. Guan and M. Lee, "An Effective Approach to Detect Lesions in Color Retinal Images," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 181-6, 2000.
[18] J. Weng, P. Cohen and M. Herniou, "Camera Calibration with Distortion Models and Accuracy Evaluation," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 14, no. 10, pp. 965-80, 1992.
[19] A.D. Worrall, G.D. Sullivan and K.D. Baker, "A Simple, Intuitive Camera Calibration Tool for Natural Images," Proc. of the 5th British Machine Vision Conference, vol. 2, pp. 781-90, 1994.
[20] I. Zapata, "Detecting Humans in Video Sequences Using Color and Shape Information," M.S. Thesis, Dept. of Electrical and Computer Engineering, University of Florida, 2001.
[21] Z. Zhang, "A Flexible Technique for Camera Calibration," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp. 1330-4, 2000.
[22] Z. Zhang, "Motion and Structure of Four Points from One Motion of a Stereo Rig with Unknown Extrinsic Parameters," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 17, no. 12, pp. 1222-5, 1995.
BIOGRAPHICAL SKETCH
Scott Nichols was born in Miami, Florida, in 1969. A high school dropout, Scott decided to
pursue an education and started community college full-time in 1994. He transferred as a junior to
the University of Florida in 1996 and received both a Bachelor of Science degree in electrical
engineering and a Bachelor of Science degree in computer engineering in August 1999. He has since
worked as a research assistant at the Machine Intelligence Laboratory, working toward a Master of
Science degree in electrical engineering.