Title: Continuously adaptive M-estimation in the linear model
Full Citation
Permanent Link: http://ufdc.ufl.edu/UF00097429/00001
 Material Information
Title: Continuously adaptive M-estimation in the linear model
Alternate Title: M-estimation in the linear model
Physical Description: x, 171 leaves : ill. ; 28 cm.
Language: English
Creator: Conlon, Michael, 1953-
Copyright Date: 1982
Subject: Linear models (Statistics)   ( lcsh )
Estimation theory   ( lcsh )
Statistics thesis Ph. D
Dissertations, Academic -- Statistics -- UF
Genre: bibliography   ( marcgt )
non-fiction   ( marcgt )
Thesis: Thesis (Ph. D.)--University of Florida, 1982.
Bibliography: Bibliography: leaves 164-170.
Additional Physical Form: Also available on World Wide Web
General Note: Typescript.
General Note: Vita.
Statement of Responsibility: by Michael Conlon.
 Record Information
Bibliographic ID: UF00097429
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: alephbibnum - 000317575
oclc - 08835244
notis - ABU4400


This item has the following downloads:

PDF ( 5 MBs ) ( PDF )

Full Text







Copyright 1932


Michael Conlon

To my father,

Norman Francis Conlon

10/'31,',2 9/24/, U


I am very grateful to Dr. A. I. Khuri for having directed this work,

and to Dr. James B. Conklin, Jr., for giving me the opportunity to pursue

my research interests. In addition, I acknowledge the patience, faith,

and understanding of my family and especially my wife.










4 CHOOSING A -t FUNCTION . . . . . .

Empirical Studies . . . . .
Asymptotic Considerations . . .
Minimax Estimators . . . . .
Adaptive Estimators . . . . .


The Generalized Lambda Famil . .
IMa,.irmum Likelihood Estimation .
Evaluating the Likelihood . . .
Solving for Maximum Likelihood . .
Examples . . . . . . . .







. . . . 16

. . . 30

. . . 43

. . . . 43
. . . . 47
. . . . 50

. . . . 69
. . . . 79
. . . . 5




IP.l-LS I l SAS . . . . . . . .





vi i




F DATA FOP E:X:AFPLES ........ .. .......... 151

BIBLIOGRAPHY . . .. . . . . . . . . . . 164

BIOGRAPHICAL SKETCH . . . . . . .......... 171


i, functions and their fILE densities



3. 1














5. 13



AVF in the Power Family . . .

Relative AVF inflation in the Power

AVF in GLF. Columns represent true
represent ;I-estimate values . .

Classes of distributions and their m

GLF approximation to densities .

GLF approximations to i' functions .

Range of GLF distributions . .

Comparison of Algorithms 5.1 and 5.2

Calculating the likelihood . .

Contents of e.'amples . . . .

OLS, CAM for E:ample . . .

OLS, CAM for Example 2 . . .

OLS. CAM for Example 3 . . .

OLS. CAM for Eample 4 . . .

OLS. CAM for Example 5 . . .

OLS. CAM for Example 6 . . .

OLS, CAM for Example 7 . .

OLS, CA1 for E.xample 8 . . .

OLS. CAMI for Example . . .

CAll. Nonlinear LS for Ex'.ample 10






values .



. . 38

. . 41

. 51

S. 64

. . 64

. . 67



. . 86



S 93

. . 97

. . 100


. 107

S. 09

. 113


GLF (-.886,.1333,.0193,.1588) . . . . . .

50 OLS residuals compared to 50 true error values .

500 OLS residuals compared to 500 true error values

i functions in the Power Family, p = 1.0, 1.4, 2.0


. . 22

. . 24

. . 25

. . 37

3.2 functions in GLF, '2 = .1,.1974; X3 = .1341, .5 . . 40

5.1 Coverage of (c3,c4) by GLF distributions . . . ... 59

5.2 Skewness of GLF for (X3'Y4) in quadrant 1 . . ... 60

5.3 Skewness of GLF for (X3, 4) in quadrant 3 . . ... 61

5.4 Kurtosis of GLF for (X3,X4) in quadrant 1 . . ... 62

5.5 Kurtosis of GLF for (X3,X4) in quadrant 3 . . ... 63

5.6 Moments of GLF distributions . . . . . . . 66

5.7 Effect on g(p) of altering r . . . . . .... 74

5.8 Two cubic approximations of g(p) . . . . .... 81

5.9 Density Estimate for Example 1 . ... .. ..... 89

5.10 Density Estimate for Example 2 . . . . . .... 91

5.11 Density Estimate and True Error Density for Example 3 94

5.12 CAM and OLS residuals for Example 3 . . . ... 96

5.13 Density Estimate and True Error Density for Example 4 . 98

5.14 CAM and OLS residuals for Example 4 . . . .... 99

5.15 Density Estimate and True Error Density for Example 5 . 101

5.16 Density Estimate and True Error Density for Example 6 . 104

5.17 Regression lines for Example 7 . . . . . .... 105


F i re






5. 13








C.1 AVF in Power Family c = .5

AVF in Power Family c

AVF in Power Family c

Relative AVF in Power

AVF in Power Family c

AVF in Power Family c

AVF in Power Family c

Pelative AVF in Power

go(p) '3 =: 01 4 :

go(p) "3 = .01 4 =

g,(F 3 = .13 =
g (p) "3 = .O =
g (p) '3 = ." J

go(p) = .13 =

9o(p) \3 = .40 4 =

g9 (p) -- .40 =
o 4

- .75

= 1.0


- .5

- .75

= 1.0

Fami ly


.13 .

.40 .


.13 .

.40 .

.01 .

.13 .

D.9 g (p) = .40

Density Estimate for Example 7.

Pegression lines for Example .

Density Estimate for Example .

Pegression lines for Example 9

CAM and OLS residuals for Example

Density Estimate for Example 9 .

Pegression curves for Example 10

Density Estimate for Example 10


. . . . . . . 106

. . . . . . 110

. . . . . . . 111

. . . . . . . 115

. . . . . . . 1

. . . . . . . 12

. . . . . 120

. . . . . . . 12

. . . . . . 127

. . . . . . . 1 30

. . . . . . . 13 1

. . . . . . . 130
. . . . . . . 131

. . . . . . . 132

. . . . . . . 133

. . . . . . . 35

. . . . . . . 13

. . . . . . . 13


. . . . . . . 140

. . . . . . . 141
















D. 3

= .40

Abstract of Dissertation Presented to the Graduate Council
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



Michael Conlon

Ilay 1982

Chairman: Andre I. Khuri
Major Department: Statistics

H-estimates of regression parameters are found by minimizing the sum

of a function of the difference between observed and predicted values of

a dependent variable. The choice of a particular function before the data

have been examined is shown to have serious consequences for the asymptotic

variance of the parameter estimates. Previous adaptive M-estimates used

one of a small number of functions selected after preliminary examination

of the data.

Continuously adaptive M-estimation (CAM) is introduced to choose a

function according to maximum likelihood criteria from a continuous class

of functions, thereby simultaneously estimating the regression parameters

and the underlying error density. Algorithms for calculating the estimates

are derived and numerical examples demonstrate the method's performance in

a variety of regression problems, including symmetric and asymmetric errors.


Consider the multiple linear regression model

(1.1) yi 60 + ex .i + " k ik i 1,...,n

where the are independent and identically distributed with mean zero.

The distribution of the L. will be referred to as the error distribution.

To estimate i = (e0, al'...,k )' as a linear transformation of

Y = (y ,.... j) one wculd use the method of least squares. The Gauss-

Markov Theorem (Rao, 1973) insures that the least squares estimator of

6 is the best linear unbiased estimator of a for the model (I.1) when

.. are iid. A least squares estimator of is a vector which minimizes

the sum of squares of the residuals, that is,

k n
(1.2) s { ,e P. f x '.)2 is minimum

where = (1, X il .... ) Let be an n ,: (k + 1) matrix whose

ith row is :i (i = 1,...,n). If : is of full column rank, then 6 is

When s. are iid normal, the least squares estimator of e is also

the maximum likelihood estimator. The maximum likelihood estimator of
r is a vector which maximizes the likelihood function L = n f( -
,.t), where f is the density of .. Ho'-. maximizing L is equivalent to
1 1
minimizing -In L, since the natural logarithrm is a monotonic function.

(1.3) -In L = In f( .' ,)
i- 1

or, since f(U.) = (1/7o-') expI- "/202) for \. normal with mean 0, mini-
.1 1
mizing (1.3) is equivalent to minimizing

kl.4) 7 (y. x i)2

If a is known, and 3i are iid normal, then maximizing L is equiva-

lent to minimizing the sums of squares of the residuals.

But if 1i are iid and not normally distributed, then least squares

will no longer be the maximum likelihood solution. In general, a maximum

likelihood estimator of will be a vector V* which minimizes

(1.5) -In L = ; O(vY xK)

with respect to B, where a = -In f. If a is strictly con''ex, and / is

full rank, then R* is unique.

For continuous and differentiable p, this minimum can be found bv

differentiating -In L with respect to R and setting the result equal to

zero. We will denote the derivative, p', of p, by 0'. A maximum likeli-

hood estimate can be found as a solution of

(1.6) L .,(y x.e.)x = 0 j = , k
i=1 j

where xi0 = 1, i = I,....n, and where = ,.

Since P = -In f, and JI = p', we can express in terms of f as

9 = -f'/f. Thus, when f is known, the maximum likelihood estimator of 0

can be found by solving the k + 1 equations in (1.6) using I = -f'/f.

Note that in the case of the normal density, i(A) = A/o2, and the

resulting equations are known as the normal equations.

If the error distribution is known to be double exponential, then

p(A) = IAI, i(A) = sign(A) and the resulting method is known as least

absolute value (LAV) regression (Gentle, 1978).

Equation (1.6) applies only when Ij is homogeneous of degree one,

that is, when I(c.-) = c-.'(). In this case, I, will be unaffected by the

scale of the error distribution. Both OLS and LAV estimation have homo-

geneous 9 functions. Many 'I functions proposed are not scale invariant.

For these techniques, Equation (1.6) must be modified to incorporate an

estimate of scale, say s, in which case, one must solve

n y. -
(1.7) ,p(' ) .. = 0 j = 0,1,...,k
i s 1A

Estimates of scale are not independent of estimates of 6 in this

case. In addition, since a goal of this form of estimation is to allow

departures from normality for the error process, a scale estimate based

on a sum of squares may not be appropriate. Estimators of scale used

in Andrews et al. (1972), where r. = y. x:6*, include
1 1 1-

(1.8) s = { interquartile range of r. }'1 .35

(1.9) s = median { Jr. median t{r. } /.6754

(1.10) s- l 1 r
n-1 l 1
Table 1.1 gives ,! functions and references for each. These are

presented without reference to the scale estimate which may be necessary

in their use.

Use of scale estimates introduces a problem in numerical solution

of Equation (1.7). Techniques for the solution of these equations often

involve iteration through a succession of parameter vector estimates. At

each iteration, the current estimate can be used to get a current estimate

of ri, which can then be used in one of the Equations (1.8). (1.9), or

(1.10) to produce a new estimate of s. Iterating on both f* and s can

lead to serious convergence problems. See Birch (1980b) for convergence


If f is known, then estimation of and s can be done jointly by


n y.-x.
(1.11) 1 : 1( -i ~) = na

for s simultaneously with (1.7), where ;':() is

(1.12) \.( ) ) ( )

Huber (1981) shows that in order to obtain consistency of the scale

estimate when f is the normal distribution, and to reproduce the standard

OLS estimate of scale when r(L) = L!/2, the constant a in Equation

(1.11) must be

(1.13) a = n + ) E )

Convergence problems and choice of scale estimators are serious problems

in using a 4, function. In Chapter 5, a method is presented which esti-

mates a variance of the error process jointly with regression parameters,

but avoids both these problems.

Uhen fitting a linear model to a set of data, the error distribution

is generally unknown. Maximum likelihood estimates cannot be obtained

since the method requires the specification of f. In this case one would

assume a form for f, and use the corresponding p. It would be desirable

to use a function which will produce estimates @* which will not be

sensitive to small changes in '. Such a method would be called qualita-

tively robust (Hampel, 1971). Practitioners have chosen least squares

because of its computational simplicity, the Central Limit Theorem, and

the Theory of Errors (Pao, 1973). In many cases, the error distribution

cannot be assumed to be normal, either because of knowledge of the under-

lying error process, examination of the residuals, or the presence of

outliers (Barnett and Lewis, 1978). In particular, if the error distri-

bution is known to be asymmetric, or bounded, then least squares cannot be

a reasonable estimation method.

Estimates of the regression parameter vector which are obtained using

a 'p function and solving Equations (1.6) are known as M-estimates (for

maximum likelihood, although it must be noted that M-estimates are maximum

likelihood estimates only for the density f, such that = -f'/f). When

the error distribution is unknown, one wishes to select so that the

estimates obtained will be close to estimates which would have been obtained

had f been known.

Other approaches to the problem of fitting parameters when the error

distribution is unknown have been proposed.

L-estimates involve a linear combination of order statistics and can

be used in location problems. They ha'.e not been generalized to regression

problems. Similarly R-estimates are linear combinations of functions of

the ranks of a sample, and are not applicable in tne regression setting.

Denby and Larsen (1977) study seven alternatives to least squares

regression via Monte Carlo simulation. Two of their alternatives are

M-estimators. Andrews et al. (1972) study robust estimates of location.

Of 68 estimators studied, 27 are M-estimates. Huber (1981) concludes

that of all the techniques for obtaining estimates when little information

concerning the error density is available, M-estimates are preferred

because they are more easily generalizable, are simpler to study analyti-

cally, and have performed well in all empirical work done to date.

Several reviews of robust estimation and H-estimators for location

and regression have appeared. Hampel (1973) reviews estimators of location

within the framewor-k of qualitative robustness. Huber (1973) gives basic

results on M-estimators for regression involving asymptotic design pro-

perties and methods of calculation. Hogg (1979) gives a short review of

the use of M-estimators'in applications. Narula and Wellington (1979)

discuss the extension of M-estimators from their use in estimating location

parameters to their use in the regression setting. The most comprehensive

review available is the one by Ershov (1979) in which the history of robust

estimation is outlined, and M-estimators are compared with other robust

regression estimators. Ershov cites 294 references in the field of robust

estimation. Bickel (1976) presents a good review of recent developments.

Box (1978) discusses the process of constructing models which incorporate

various departures from classical assumptions. He includes M-estimates as

a natural extension to incorporate non-normality in regression. Hogg (1978)

provides a good introduction to the theory and application of M-estimates.

Stigler (1973) provides a review of the early attempts at robust estimation.

Asymptotic distributional properties have been studied by Huber for

the location problem (1964) and the regression problem (1973). Yohai and

Maronna (1979) have proven under stringent regularity conditions that B*

will be asymptotically normally distributed for 9 and f not related by

the maximum likelihood criteria. Polyak and Tsypkin (1978) have stated

simple conditions for asymptotic normality without proof. Carrol and

Ruppert (1979) prove almost sure convergence under mild regularity condi-

tions. Asymptotic results are presented in Chapter 2.

Aside from the distributional questions partially answered by these

results, the central question in using M-estimators is the selection of

p when f is unknown. The effects of misspecification, choosing a i function

which does not correspond to the true error distribution, are illustrated

using two examples in Chapter 3.

Existing methods for proceeding to use an rM-estimator when the error

distribution is unknown are presented in Chapter 4. These methods fall

into two groups: those that attempt to find a 4, function which will work

reasonably well whatever f may be (robust methods), and those that choose

.' after looking at the data (adaptive methods). The robust methods use

three basic approaches: empirical, where and f choices are run and

compared on sets of data to measure performance; asymptotic, where p is

chosen on the basis of an asymptotic optimality criterion; and minima:;:

where I.' is chosen to protect against the 'worst" error distribution in

a specified class. Adaptive methods use a statistic calculated from a

preliminary fit in a decision rule to select the ip function.

Once a p function is chosen, Equations (1.6) can be solved to yield

estimates. Whatever the method of selection employed to choose ip, the

resulting fit is a maximum likelihood fit for the corresponding f. We

can express f in terms of ,p by solving

(1.7) i. = -f/f

for f, yielding

(1.8) f = ke-J'

where k is a constant which can be derived so that f(z)d- = 1. It

seems reasonable to expect that if f0 is the true error distribution and

one uses i1 to fit the data, then fl corresponding to .,1 via (1.3) should

closely resemble f0. If f1 is unreasonable, then the method of fitting,

represented by p' is unreasonable. For least squares, i () =- and fi ()
1 1 o2 1

is the normal distribution. Experimental designs may reduce the effect of

misspecification of 'I.

Other i functions proposed in robustness studies can be solved for

their corresponding error densities. A brief table of the most popular

'P functions in the literature is given in Table 1.1.

Once a 'P function has been chosen, usually from the list in Table 1.1,

one can solve Equations (1.6) using an appropriate numerical technique.

Holland and Welsch (1977) compare three basic techniques for solving the

equations. Newton's method is the most straightforward, but requires &p

to exist. Huber (1973) gives a simple algorithm, but one which cannot be

used with existing regression software. Dutter (1977a) refines Huber's

algorithm for speed, but implementing this algorithm would require careful

and extensive programming. The third algorithm is iteratively reweighted

least squares (IRULS) (Beaton and Tukey, 1974). They conclude that for

most 1 functions IRIILS is the easiest to use, most reliable, and often

the fastest. Dutter (1977b) presents a comprehensive review of 14 algo-

rithms for solving Equations (1.6), applied to 80 data sets, some con-

structed, some "real life" data. He finds no striking differences

between the algorithms in their accuracy or computer time usage. For

this reason, IRWLS is preferred since its computations are quite simple,

and can be performed using existing regression programs.

To perform IRWLS an initial estimate 8* of the parameter vector S

is required. Initial estimates may be obtained using least squares or

LAV estimation. Birch (1980a) has compared the dependence of IRWLS on

the starting estimate. To iterate to a new estimate at the ith step,


(1.14) Bt = Stl + (X'WX)-' X'WA*.
~1 I -1 \ \-I0 1

'I1 functions and their HLE densities.


Rao, 1973



,(A) = sign(_)

'(-) = sign( l) i -'

*-) 5 -j L i 'gn( ) k |L I s ,

4!,(L) = L.tanh(L)

,( ) = a /(h+(--c)")

Gentle, 1973

Forsythe, 1972

Huber, 1973
"Proposal two"

Bickel, 1978

Andrews et a1.,1972

Double Exponential

Power Family, see
Chapter 3

No rma 1
Ex:ponen t ial

I| i ,l>k


Ca u c hy

a-sign(s) a '-

Andre;is et al.,1972 Ex:ponential


,( ) -

,p(-) -=

, ( ") -- .e -a |


I 1>0

Andrews, 1974

Denby and Hallows,

f(is)=kec cos( sc

k, a function of c

f(i ) = (h)--(-h) .

Ramsey, 1977

=1) i/sa

:( ) = J


No rma 1

I |


b< I

Table 1 .1 .

Table 1.1. Extended

Reference Density

Dennis and
Welsch, 1976

Fair, 1974

i(A) = A3

I bS
rp(.1 = b
0 I0I a

tan( ) A| 2a
(A) = an(
[ o A|>a

4O(A) = A(1 + )P

sign(A) = ( ) = a I AI a

a |A >a

Ershov, 1979

Ershov, 1979

Moberg et al.,

2 v/Z 4/4
f(A) =- ir/4 e

Gross, 1977


f triangular on

p(() = A
1 + (V)

Holland and
Welsch, 1977

t distribution with
v degrees of freedom

(.) = .-e 2-

( 1) =
1 +

where 'X is an nx(k+l) matrix of fixed points, Vi-I is an nxl column vector

of estimates of the residuals,

(1.15) Li- ''

and W\ is an nxn matrix (w ab), where

(1.16) 0 if ab

ab -1,(( l )a)

It can be used to generate a new L+ and new w, and these in turn used to

generate ~T These iterations are repeated until successive iterations

produce little change in '.. Convergence properties of IPWLS are examined

in Birch (1980b).

IPWLS is easily programmed. It is available in SAS, the Statistical

Analysis System (SAS Institute, 1979), in PROC NLIII. A short program

listing is provided in the Appendix. The POSEPAK subroutine library

(Klema, 1976) is a commercially available IPULS subroutine library. IRWLS

is not recommended when i, is not continuous. In particular, when LAV

estimation is desired, special algorithms exist. Earrodale and Poberts

(1974) solve the LAV estimation problem using a modified simplex algorithm.

Sposito et al. (1977) present a special algorithm for Q(L) = sign(s) l-'I 1.

Direct optimization of (1.5) using Price's controlled random search

procedure (Price, 1977) is inefficient, but can be applied to nonlinear

and constrained models, as well as the model of (1.1).

Iterating until convergence in IRWLS may not be required in particu-

lar cases. Bickel (1975) argues that estimates derived from one Newton

iteration have the same asymptotic properties as fully iterated estimates.

and perform comparably on small samples. These "one step" estimators are

very quick to compute.

Many of the o' functions presented in Table 1.1 have constants in

them which must be specified by the practitioner. Optimal values for

each constant exist in terms of the true error distribution, but before

analysis, these values are unknown. Holland and Welsch (1977) give

constants for several of these t! functions to achieve 951. asymptotic

efficiency when f is the normal distribution.

Once estimates have been obtained, several questions relating to

any regression problem can be stated. One class of problems involves

optimal design of regression experiments when the data come from an

unknown distribution. Bo; and Watson (1962) propose a criteria for first

order designs to minimize the effects of non-normality on F-tests. These

designs are called zero kurtosis designs. The design kurtosis C, is given

in Khuri and lyers (1931) as

(1.17) C (n-l)[n(n+l)h k(k+2)(n-l)]
1 1' k(n k 1)(n 3)

(1.13) h hn .

and h.. is the ith diagonal element of H = D(D'D)-'D, where D is such

that X = [1:D]. Khuri and flyers give necessary and sufficient conditions

for these designs to exist and describe a method for construction of such

designs for a given number of experimental runs. Vuchkov and Solakov (1980)

demonstrate that the kurtosis is related to the distribution of the repeated

points among the support points of the design.

Eox and Draper (1975) indicate that the quantity h in (1.18) is a

measure of design sensitivity to outliers, and present 14 general proper-

ties of good response surface designs. Herzberg and Andrews (1976) intro-

duce two additional criteria. Andrews and Herzberg (1979) compare the

criteria and calculate values for several designs. Huber (1975) considers

distributional robustness properties of designs and model robustness. He

recommends each h.. be much less than 1, and that lack of distributional

robustness produces far more disastrous results than failures of model


Variable selection when using estimators other than OLS have been

studied only for two alternatives. Gentle and Hanson (1977) propose

techniques when using LAV estimators. Ilarula and Wellington (1977)

investigate variable selection when the criteria is minimization of maxi-

mum weighted absolute errors. General variable selection using M-estimators

has not been studied.

Other questions related to M-estimators involve the study of residuals,

subsequent tests of hypotheses, and calculation of multiple comparisons.

Bickel (1976) introduces the concept of "pseudo observations" in M-esti-

mation, as an attempt to answer these questions. Let

(1.19) + = + \*-


n i=l

r. = y. 4. *
ri = i -i-

The pseudo observations are defined so that the least squares estimate

based on the pseudo observations is equal to the robust estimate .*

obtained from the original data. The pseudo observations can be used for

further analysis such as successive testing, multiple comparisons, C plots,

AHOVA tables, and residual plots. No rigorous results on the performance

of pseudo observations ha'.e appeared in the literature. Their existence

does, however, provide the data analyst with a useful tool for performing

a "standard" regression analysis when using an i-estimator.

M-estimators are considered in relation to outlier processing techni-

ques by Barnett and Lewis (1978). i-estimates accommodate outliers by

estimating from all the data. M-estimators which have 'I functions which

redescend to zero effectively reject outliers by giving them weight O'in

Equations (1.6). Andrews (1974) demonstrates the ability of the SINE ,,

function to down weight outliers, using the data of Daniel and Wood (1971)

as an exa:rrple. M-estimators which descend to zero asymptotically down

weight outliers, and thereby accommodate them with small weight. The

Bisquare estimator (Gross, 1977) is a good example of an asymptotically

redescending 1-estimator.

M-estimators which redescend to zero trim the data at specific

values, controlled by tuning constants. These techniques can accommodate

outliers generated by a process other than the true error process, if the

contaminating values are outside the range of values to be kept after

trimming. Trimming techniques will fail if no data fall inside the

acceptable range. In this case, the tuning constants may be adjusted

and the procedure repeated.

H-estimators, whose '' functions are always nonzero,give weight to

all points in the sample and are not used to model truly contaminated

data. These estimators model data which inherently contain outliers due

to the heavy tailed characteristics of the presumed error density.

fl-estimators may perform well in situations which they do not model.

This is a measure of their robustness, not their validity for use in

handling outliers.

No single choice of I. is likely to be found which will perform well

in the wide variety of situations encountered with experimental data.

Current adaptive techniques, which calculate a statistic of the data

before choosing a .,are limited to a small catalogue of alternatives.

In Chapter 2 the basic asymptotic foundations of I-estimators, where l

is chosen a priori, are reviewed. Chapter 3 details the problems which

arise when the wrong qi function is chosen. Chapter 4 presents the known

techniques for choosing a Q function, both a priori and adaptively.

All these techniques must eventually perform poorly in situations

where the error distribution currently under study was not considered in

simulations for verifying the performance of fixed estimators. In

Chapter 5, a continuously adaptive H-estimator is introduced, which uses

the data to determine the appropriate i function. Unlike current adaptive

techniques, however, the choices of are presented in a continuum of

several dimensions rather than as a finite set of choices. This continuum

includes i functions which correspond to asymmetric densities as well as

symmetric possibilities.

The technique is shown to be able to recover the information from a

sample regarding the shape of the error distribution as well as estimates

of the regression parameters. Numerical algorithms and examples of their

use are presented, as well as performance comparisons for least squares.

Chapter 6 contains a review and some suggestions for future research.


A central question regarding H-estimators for robust regression is

the asymptotic distribution of the parameter estimate vector. If

, = -f'/f, then the M-estimate is a maximum likelihood estimate and is

known to have an asymptotically normal distribution (Rao, 1973). When

p is chosen independently of f, as in an application where the true error

distribution is unknown, the asymptotic distribution of E*, the vector of

parameter estimates can also, under certain regularity conditions, be

shown to be asymptotically multivariate normal. This result is most

clearly stated in Theorem 2.3 by Polyak and Tsypkin (1978).

Huber (1964) introduced the concept of an I-estimator in the location

problem and presented basic asymptotic results. Conditions for almost

sure convergence of an estimator based on tp when the underlying distribu-

tion is given by f, with cdf F, are

(2.1) must be strictly con'.ex

(2.2) 30 such that X( ) exists and is finite, where

(2.3) X( ) = f (A 0) F(dA)

(2.4) 3c 3 > c > X() < 0 and < c =>X() > 0

If in addition, the conditions below are satisfied, then the estimator

is asymptotically normal. If p* is the estimate, then

(2.5) ci (p* c) -- n(0,V)

where I'2 ( c) F(ds)

(2.6) V

The necessary conditions are

(2.7) k(c) = 0

(2.8) Q(;) is differentiable at = c and \.'(c) < 0

(2.9) ,,2 (L 5) F(ds) is finite and continuous at 5 = c.

Andrews et al. (1972) present asymptotic variance expressions for many

M-estimators of location.

A useful concept for comparing M-estimators of location is the

influence curve (Hampel, 1968). The influence curve measures the change

in the estimate obtained by making an infinitesimal change in the error

distribution. Estimators can be considered functionals of the empirical

distribution function, and the quantity to be estimated is a functional

of the error distribution, represented by its cumulative distribution

function. For example, the mean of a distribution is a functional as

shown below

(2.10) T(F) = j F(ds)

The influence curve IC(L;T,F) is defined below.

(2.11) IC(L;T,F) = lim T((1 c)F + ) T(F)

where -.. is the cumulative distribution of a random variable e having point

mass at L. It can be shown that the asymptotic variance A(F) in the

location case is related to the influence curve as shown below (Hampel,


(2.12) A(F) = IC(A;T,F)2 F(dA)

H-estimates of location and scale can be expressed as functional
(Andrews et al., 1972) T(F) and S(F) satisfy

(2.13) A SaT(F F(dA) = 0

(2.14) fX SA (F) F(dA) = 0

where x(A) = Ai(A) p(A). The influence curves are given by

(2.15) IC(A;T,F) = S(F) A

f -SlF F (dy)

(2.16) IC(A;S,F) = S(F) X [S r
|xj F(dy)

Hampel (1974) defines several measures derived from the influence
curve which can be used to compare estimators. Huber (1981) reviews
these measures as do Barnett and Lewis (1978). Empirical work has not
included references to these measures, or compared them to results
obtained in Monte Carlo studies.
In regression, influence of new data on OLS with normal error is
discussed by Belsley, Kuh and Welsch (1980). They show that the influence
of the ith data point on the regression parameter estimates j can be
measured by differentiating the fitted regression coefficients with respect
to the weight, w., of the ith observation, and evaluating the result at
w. = 1. This differentiation result is given below:


(2.17) w- i
1 j.=1

Similar results for M-estimators in regression are not available. Asymp-

totic results in regression are derived not from influence curves, but

from first principles.

Almost sure convergence of e, to e. in the regression problem was

first proved by Hijber (1973). This proof requires

(2.12) _sp- 0

.*h ere

(2.1) l.ma_ .. 'i(>'" )i nxp
1 nxp

i2. 20) p., must be continuous and bounded

(2.21) Ef ( (L)) = 0

This last condition w.ill be true when both f and p are symmetric functions,

but if either o f o r r both are not symmetric, this last condition will

not hold. Huber's proof does not include the special cases of least squares

(I, is unbounded, violating (2.20)) or LAV (p, is discontinuous, also violating


Carrol and Puppert (197?) present a new. proof of the almost sure con-

'.ergence of l-estimators in regression, which does not require sy,mmetry,

but does require for some a,K1,K >. 0 that

(2.22) '' ) = s= ign. K ,,

2. 23) ...s ) b a. I c i

This condition is satisfied b, fe-w. of the pI, functions in Table 1.1.

Ershov (1979) states conditions to insure that e,* will converge

almost surely to i. Let F be the cumulative distribution function of .

Theorem 2.1 (Ershov, 1979) Almost sure convergence of H-estimates.

Let o satisfy the following conditions:

(i) P is symmetric.

(ii) is non-negative.

(iii) o satisfies a Lipshitz condition of order 1.

(iv) either

(a) p is convex and 3: 3V6 ;: 0, F(z + 6) ;: F(: 6), and

0(: + 6) + (= 5) > 2o(=)


(b) pi is monotone for z: 0 and f = F" 3 f(:1) f(2,

where z: 2 0 and ]z ;:,0 : p(0 + 6) (: 5)

and f(: + 6) < f(: 6) for sufficiently small 65 0.

Then .* i almost surely, when 0* is any solution of (1.6).

No proof of the theorem was given.

These conditions on p, and on the underlying distribution represented

by its cumulative distribution function F, and its probability density

function f can readily be assumed to hold under data analytic conditions.

Thus M-estimates will converge almost surely to the true parameter vector

regardless of the relationships of p and f. An immediate corollary involves

the distribution of the residuals r. = y. xB*.

Corollary 2.2 Convergence in distribution of the residuals

The residuals converge in distribution to the distribution of f,

under the conditions of Theorem 2.1.

Proof: r. = y. x.n" i = 1,.. n or, in matrix notation.
1 1 -1*-

R = Y Xt*

R = XS + L .'* from Model (1.1)

R = ( + .

But '* converges to & almost surely, so '(8 6*) converges

to zero almost surely. -. is distributed as f, so r. converges

in distribution to .', distributed as f.

This result insures that as the sample size increases, the distri-

bution of the residuals will approach the true error distribution, regard-

less of the P function used. Anscombe and Tuke, (1963) present formulas

for estimating the moments of an error distribution from residuals obtained

using OLS. Andrews (1973) considers the use of residual displays when

methods other than least squares are used and the deficiencies of examining

elementary residual displays when OLS is used on non-normal data. For

small samples OLS produces residuals which are more nearly mound shaped

than the true error process.

As an example of the inadequacy of OLS residual analysis with small

samples, and of the convergence of OLS residuals to the true underlying

distribution in large samples, consider data as described by Equation (2.24).

(2.24) y = 2 + .4x + S

where x is distributed uniformly on [0,1] and is distributed as

GLF(-.386, .1333, .0193, .15 8). The GLF distributions are described in

Chapter 5. The error distribution used here has mean 0, variance 1,

skewness 1, and kurtosis 4. The true error density is shown in Figure 2.1.

I -



i ~






r~l r
I -



Figure 2.2 illustrates the problem of investigating simple OLS residual

displays. Fifty observations were generated according to (2.24) and fitted

using OLS. A histogram of the true errors is drawn ne;t to a histogram

of the least squares residuals. Figure 2.3 shows the same histograms

generated from 500 data points. In Figure 2.2 the OLS residuals are more

nearly mound-shaped than the 50 points generated from the density of

Figure 2.1. In Figure 2.3, the large sample size indicates that the

asymptotic result of Corollary 2.2 is approached for a sample of 500 for

this distribution. Only with a large quantity of data can such a display

indicate the shape of the underlying distribution.

Asymptotic multivariate normality of * was first considered by

Huber (1973). As in the proof of almost sure convergence, ep2-O was

required. lo formal proof was given. Yohai and Maronna (1979) present

a complete proof of asymptotic normality using involved conditions on p

and f. The statement of their tlieorenis given below.

Theorem 2.2 (Yohai and Naronna, 1979) Asymptotic normality of M-estimators.

The assumptions below are required by j and f.

Al) 1, is nondecreasing.

A2) 3b,c,d :> 0 0 D(u,z) > d if jul c and |zJ b where c is such

that F(c) F(-c) :- 0 and D(u,z) = (j,(u + z) (u))'z

A3) 'i dF = <

A4) J ,dF = 0

A5) For a sample of size n, let XIn be the nx:(k+l) matri:., then
3n a Vn >n, n ,'' is nonsingular.
) h) F = o(n) as h and

A6) [ (:< + h) i(.< h)]" dF = o(1) as h 0 and

' I

.D .

., . _. :

r *'

S, . .1

S, ,Fhm

' -' I-



I i -



I 0

*=I c


i m
I i


I. .... .. .. .. . -

~ 2




. O..

1 .






S [(x + q + h) P(x + q)]dF <
|lh J.-

for some E: 0.

[,(x + h)

A7) 3A(,;.,F) 3

A8) lim max
n-- li n

x = ( ') and I. is

M' = X'.,

- I.(x)]dF = hA(,,,F) + o(h)


any (k+l):(k+l) matrix, such that


1B* M. MiVNi(,T.2 1)


f J 2dF
A(, ,F)2

Uhile a rigc rous proof of this theorem exists in the literature, its use

is severely limited by the difficult to verify regularity conditions.

Ershov (1979) states a theorem presented by Polyak and Tsypkin (1978)

of the asymptotic normality of il-estimators, but no proof has been given.

The statement is given below.

Theorem 2.3 (Polyak and Tsypkin, 1978) Asymptotic normality of M-estimators.

Suppose that p is convex, there exists f = F", f symmetric, and

3z 3 f(z) > 0 and V6>0, p(z + 6) + p(z 5) > 2p(z), then

(2.25) 7i (B* P) -* MVN (0, (X'X)-' AVF(p,f))


jqj , |hi E

I ,; fd.
(2.26) AVF(p,f) =

This theorem is the fundamental result of the study of M-estimators

in regression. The only restrictions on f are that f must be symmetric

and exist as a derivative of F. These assumptions can easily be accepted

in data analytic situations. This theorem has several immediate conse-


Corollary 2.4 Under the regularity conditions of Theorem 2.3, e* is

asymptotically unbiased and squared error consistent.

Proof: obvious from (2.25)

Harvey (197S) shows that for 6* calculated using IRP.LS, if f is

symmetric and D is even, then a symmetric estimator of the starting point,

such as OLS, will not, in general, lead to a symmetrically distributed E:.

An asymmetric estimator, such as LAV is necessary as a starting point to

assure symmetry of P,' when using FI.WLS.

Corollary 2.5 Asymptotic design criteria depend on (.:":')-' only, not on

l or f.

Proof: obvious from (2.25)

Designs constructed to be robust against non-normality of the errors

protect against small sample properties of B'. Asymptotically, design

results for OLS 'which depend on criteria relating to properties of ('.:"')-1

for example D-optim.ality, apply to M-estirators as well.

Thus, misidentification of f may cause small sample bias problems,

and inflation of variance, but asymptotically, 6* is unbiased and converges

almost surel5 to ,. In addition, if we know ,I, f, and X, we can explicitly

state what the asymptotic covariance matrix of 6* will be. This enables

us to determine how much data will be needed to develop estimates stated

,iith the same variance when the error distribution changes. Some distri-

butions are easier to estimate than others, that is, they require less

data to achieve estimates whose variances are equivalent to estimates

developed using more data from a more difficult to estimate distribution.

Because of its special importance, the ratio of integrals in Equation (2.26)

will be called the asymptotic variance factor (AVF), that is,

j .,2 fd.
(2.27) AVF(,, f) = _

r f dA

:ote that if i,^ = -f'/f, then AVF(,r,f) reduces to

(2.28) AVF f- f 1 _1
f 2 1 f

which is the Cramer-Rao lower bound for the variance, where I(f) is the

Fisher information of f. This result suggests a minimax approach to

choosing I. using a least favorable distribution f, that is, a distribution

with minimum Fisher information. The result is stated by Ershov (1979)

Theorem 2.6 (Ershov, 1979) Existence of minima:x estimator.

Let F be a convex class of smmetric distributions for which G =

{f: 1(f) :- is dense in F, and such that 3f F such that I(f ) l(f)


Vf E F. Let = -f,'fo, = In f If p is convex, satisfies the

Lipshitz condition of order 1, and the conditions of Theorem 2.3, then

V E F,

(2.29) AVF(o ,f) AVF(, ,f ) .0 AVF(,,f0)

No proof of this theorem was given. The theorem provides a minimal

approach to choosing a ,' function. Examples of the use of the approach

are given in Chapter 4. Huber (1931) considers minimax M-estimators of

scale and location.


Under regularity conditions stated in Chapter 2, 8* will converge

almost surely to E, even in cases where i't has no relation to the maximum

likelihood ,., function for the true error distribution. Similarly

>,T (* 6) converges to a multivariate normal distribution. The asymp-

totic variance covariance matrix of .' depends on the relationship between

the 'P function used in the estimation of B* and f, the true error distri-

bution. The effects of misspecifying p are illustrated in this chapter

in terms of the size of the AVF for two families of error distributions.

The power family of distributions can be derived as that family for

which ,(-) = k-sign(.) I[ p produces the maximum likelihood estimates.

A general form for the p function is

(3.1) p(A) = c =

so that

(3.2) f(-) = k e (

To find k so that f(-) integrates to 1 over the real line, let

(3.3) z = c( )
-1 1
(3.4) d = ( P dz

We require

(3.5) jk e-(A) dA = 1

2 J k(/c)1' -

r(l/p)2.-kc lp o

1 z, 'p)-1 1 dz 1

z(1/p)-1 /1
(l /p) 1 p dz = 1
r(1,"p) 1i,"p

The integral in k3.7) is that of a Gamma (1,p,1), so

k = PCl1/p
2r( ip)'

This family includes the normal (c=1/2.p=2), and the double exponential

(c=l,p=l). Forsythe (1972) introduced the p function. Box (1973) pre-

sented an alternative parameterization. Regression based on minimizing
the pth power of the residuals is appropriate when 7 "-JP can
i =
be assumed to be 7, since if

'z .-D
\ I .>1 1

-_c 1," P
f(2) = 4 (,"p),

and if z = 2cI|,,/c then the density of z is

3.10) g(z) = e-z'

(3.11) g(z) 2= 1p r

That is, 2, If are iid f,

is :'. 2n If p, c, and a are known,
is give by the rejection region
is given by the rejection region

( 3.12)

Si 'p)-I
p(2c .i P

z( /Fp)-1 e- 2'

then, z. are iid ", so z.
1 it 2so 1e
then a test of whether are iid f

" U 2np

For the power family, the AVF can be calculated as shown below:




I I p
p-cl.\,, lI

(3.13) fdA =

ke-p() cp(pl) p-2
0- 0d


k-c-p- (p-1)

a exp{-z/2z}zl-2/p

o(2c) -i p

-1/p -7z/2
z 1/p e z'
r(1/p) 21-1/p


S pc1 (p- 1)P (-1/p)
Sr(1/p)20 -O 1

Sr(1- /p)


(3.19) fd k exp{-p()) c 2p-2
0 a o

2k c p
F o-

(2c)2-2p (2c)'P


j /2

,2-2/p ,/p)-

z1-1/ e- /-2
F(2-1/p) 22-1/p

S2p P /p
a 2F(l/p)o


(3.24) AVF(q,f)=

r(2-1/p) l/p

Jf 2 fdA

Lff' db1

r(2-1/p) c2/p


p(p- 2/p 2
G2a j





(3. 17)






(1-1/p) 21-1/p

r(2-1/p) cl/p

. ,(1/P)-1

_ (1-1/p)
r- (1-1,'p)

F(1/p) ,2
c /P(p-1)2

which reduces to u2 in the normal case, as it should.

Suppose that p is unknown, so that we use a i function from the power

family with power q, rather than the true power p. The AVF('i,f) can be

calculated using the steps below.

(3.26) jfd,


S k exp{-p(-)} cq(gq-) q dq

S 2k c q(q-1) 1 exp{-z,2}.
c; (2c)(q-2)/ p (2c)!P

zq/p-2/p l/p-1

S2kcq(q-l )a 2(q- )'' r((q-1 ).p)

J (q-1)/p e-zd
( r(q-1),p z(-l d

= 2cq(q-1) r((q-1)'p)-p.-c/P
p c(q- 1) '/r / )2
poc -1 f(l/p)2c

_ r -1)!p) q(q- c(p-q+2)/.'p
r(ip) c

(3.31) 2J ffd,

= k exp{-o(S)} c- 2q d

S2kc~ q (2c) (2q-2)/p 1
-; P (2c)

i..' p






Szt2q-2)/p (1/p)-1 dz

_ 2kc2 1 2g-) 2(2q-l)/p
op (2c)(2q-1)/p P

exp{-z/2} zj(2q-1)/p)-l
2r(-1) 2(2q-1)/p



2r( )o

- P ) 9 -2((p-q+1)/'p)

(3.36) AVF(i,f) =

r(g ) q 2 2((p-q+1)p)
r (1 ,/p ) ,2
r( ) -1)
(- (-q2)/p
p- C

. (2-I) q2c2((p-q+l)'p)

F( ) ,-



rF(-) 'j

P( ) q2(q-1)2 c27 q2)p

2( r(1
- ( p ) 1 2 ,
r2 (1 ) (q-1) c2p

When p=q this reduces to (3.25). When p=q=2, c='/2 this reduces to just

a', the least squares, normal AVF.

AVF(,i,f) can easily be calculated for various combinations of q, p,

and c. Table 3.1 contains values of AVF for the power family. From the

table we can see that if the error distribution is normal, and LAV estima-

tion is used (i.e., c=1/2, p=2,q=l), the asymptotic variance of 8* is




Table 3.1 AVF in the Power Family

c = .5

Q 1.0 1.1 1.2 1.3 1.4

1.6 1.7 1.3 1.9 2.0

1.00 4.00
1.10 4.06
1.20 4.21
1.30 4.44
1.40 4.73
1.50 5.09
1.60 5.52
1.70 6.02
1.30 6.59
1.90 7.25
2.00 3.00




2.31 2.43 2.24
2.71 2.35 2.09
2.63 2.23 2.00
2.70 2.27 1.96
2.77 2.23 1.95
2.36 2.32 1.96
2.98 2.39 1.99
3.13 2.47 2.03
3.31 2.57 2.09
3.51 2.70 2.17
3.74 2.84 2.26

1 .74
1 .70
1 .73
1 .381

1 57
1 51
1 .50
1 .52
1 .55
1 .53

1 .44
1 .39
1 .36
1 .34
1 .34

1 .35

1.71 1.63 1.57
1.53 1.46 1.39
1.42 1.34 1.27
1.34 1.25 1.19
1.23 1.19 1.12
1.24 1.15 1.08
1.22 1.12 1.05
1.21 1.11 1.02
1.20 1.10 1.01
1.21 1.09 1.00
1.22 1.10 1.00

c = .75

0 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 .3 1.9 2.0
1.00 1.73 1.57 1.43 1.33 1.25 1.20 1.15 1.12 1.09 1.07 1.05
1.10 1.30 1.55 1.38 1.26 1.17 1.10 1.05 1.01 0.93 0.95 0.93
1.20 1.37 1.57 1.36 1.22 1.12 1.05 0.99 0.94 0.90 0.37 0.85
1.30 1.97 1.61 1 .33 1.21 1.10 1.01 0.95 0.39 0.35 0.32 0.79
1.40 2.10 1.63 1.41 1.22 1.09 0.99 0.92 0.86 0.32 0.73 0.75
1.50 2.26 1.77 1.45 1.24 1.10 0.99 0.91 0.34 0.79 0.75 0.72
1.60 2.45 1.37 1.52 1.23 1.11 0.99 0.90 0.83 0.73 0.73 0.70
1.70 2.67 2.00 1.59 1.32 1.14 1.01 0.91 0.03 0.77 0.72 0.63
1.0S 2.93 2.15 1 .6 1.33 1.17 1.03 0.92 0.33 0.77 0.72 0.67
1.90 3.22 2.32 .7 1 .44 1.21 1.05 0.93 0.34 0.77 0.71 0.67
2.00 3.56 2.51 1.90 1.52 1.26 1.03 0.95 0.85 0.73 0.72 0.67

c = 1.0
0 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
1.00 1.00 0.93 0.8, 0.85 0.33 0. 0. 0.30 00. 0.79 0.79 0.79
1.10 1.01 0.92 0.35 0.31 0.73 0. 0. 0. 0. 0. 0.7 0.71 0.70 0.70
1.20 1.05 0.93 0.34 0.79 0.74 0.71 0.69 0.6 0.66 0.64 0.64
1.30 1.11 0.95 0.35 0.73 0.73 0.69 0.66 0.64 0.62 0.60 0.59
1.40 1.13 0.99 0.37 0.79 0.72 0.63 0.64 0.61 0.59 0.53 0.56
1.50 1.27 1.05 0.90 0.30 0.73 0.67 0.0.6 0.60 0.53 0.56 0.54
1.60 1.30 1.11 0.94 0.32 0.74 0.63 0.63 0.59 0.57 0.54 0.52
1.70 1.50 1.19 0.99 0.35 0.76 0.69 0.63 0.59 0.56 0.53 0.51
1.30 1.65 1.27 1.04 0.39 0.73 0.70 0.64 0.59 0.56 0.53 0.51
1.90 1.31 1.37 1.10 0.93 0.31 0.72 0.65 0.60 0.56 0.53 0.50
2.00 2.00 1.49 1.13 0.98 0.34 0.74 0.66 0.61 0.56 0.53 0.50

inflated by a factor of 1.57. If the error distribution is actually

double exponential, and least squares error is used (i.e., c=l, p=l, q=2),

then the factor is 2.00. If the error distribution is equally likely to

be normal or double exponential, the best compromise value for q is 1.35.

This q gives the smallest average AVF value over p=l and p=2.

Table 3.2 contains values of AVF relative to the best alternative

value for q in the range 1-2, that is, the minimum AVF will be obtained

when q=p by the maximum likelihood properties of 4'. If each value of AVF

for a given value of p is divided by AVF for p=q, we obtain a measure of

the inflation in the asymptotic variance caused by misspecification of the

power during estimation. For both the normal and double exponential dis-

tributions,when p=q, AVF=1 so this standardization does not affect the

results presented in the previous paragraph. The value of c drops out of

the expression for the ratio expressed in Table 3.2. Note that the largest

variance inflation occurs when using least squares, and the true error

distribution is double exponential. The best average inflation over the

range of p values from 1 to 2 occurs when q=1.4, the worst when q=l, or

q=2. These results are presented in the form of contour plots in the


For the power family, asymptotic variance considerations indicate a

choice for p in fitting of 1.4 or 1.5. The i functions for various choices

of p are chosen in Figure 3.1.

A second family of distributions which can be studied for the

effects of misspecification of 9 is the generalized Lambda Family (GLF),

introduced by Ramberg et al. (1979). The density function has four para-

meters and is represented as a function of a percentile from 0 to 1.

Given a percentile, say p, then the value of f and A can be found by using


'I I.


'*77\ : --

I' '

.. t
.. I ;


i, I!
\ i

"\* I

" "..'1 !- j.. .







Table 3.2

q 1.0 1.1 1.2

Relative AVF inflation in the Power Family

1.3 1.4 1.5 1.6 1.7 1.3 1.9

1.05 1.09 1.15
1.01 1.04 1.07
1.00 1.01 1.03
1.01 1.00 1.01
1.03 1.01 1.00
1.07 1.03 1.01
1.11 1.05 1.02
1.17 1.09 1.04
1.23 1.14 1.08
1.31 1.19 1.11
1.39 1.25 1.16

1 .02
1 .10



1.42 1.49 1.57
1.27 1.33 1.39
1.18 1.22 1.27
1.11 1.15 1.19
1.06 1.09 1.12
1.03 1.05 1.08
1.01 1.03 1.05
1.00 1.01 1.02
1.00 1.00 1.01
1.00 1.00 1.00
1.01 1.00 1.00

1 .70


1 .14
1 .39
1 .50

the equations shown below:

(3.39) f(p) --= 3-1 +
[\3P '3-1 + 4(1 p)'4-1]

(3.0) (p) 1 p)4]
3a.4o0 \l +.(

For densities in the GLF to have a zero expectation,

(3.41) 1 +
1 1 +\- 1+\4

Properties of the GLF, and its relation to other densities are presented

in Chapter 5.

To examine misspecification, attention must be restricted to the

densities in the GLF which satisfy the asymptotic properties of Chapter 2.

Equation (3.41) specifies 1 and \4 must equal \3 to insure symmetry. This

leaves \2 and >3 unspecified, and in practice, unknown. To fit an

M-estimate using a 'p function corresponding to an HLE estimator for a

density in the GLF, one needs to specify \2 and 13. Figure 3.2 shows

functions for various choices of \2 and \3. For \2 = .1974 and 1 .1341,

the GLF distribution closely approximates the standard normal distribution,

and '(pQ) .

Table 3.3 gives values of AVF for combinations of values of ;2 and N.

in a chosen / function (rows of Table 3.3) versus values of x2 and .3

in the true error desnity. Symmetry is achieved by setting '4 equal to 13

to correspond to a class of densities for which the results of Chapter 2

will apply.



\ \ ,\!
\ "- ;

"-'-.. \ 1; [

--- '\-

,_______- ^_________











-- I-r--rr-r










1 I

Ln Lf Lfl
C0 0 0
C 0 CD
0 0 0

C)J Cj :J
0 O 0
O 0 0
0C0 0
CD ) C

- CO O
O 0 0'

- oC O

00 0
O 0 O0
C) C) CD
C) C) C

-- C CD
C0 0
0* Q' 0

T CO \J C0
Ln .0 %.

-- O

'- Lfl .0 u

O r-- C)r

*n r--
O 0

O 0
0 0
0 0

- r) o C

-C) -<

^- ("" m

- LC)

O- C0
0 o
O o

C: c)
S c,
0 O

O 0
0 0

0 0
O 0

* m

m m
*4 )


The tables indicate that misspecification variance inflation is more

highly dependent on ., than on No one value of (2 3) will protect

against high variance throughout the parameter space. Least squares esti-

mation corresponds to (?.2' 3) = (.19,.13).


Empirical Studies

The largest amount of research in the use of M-estimators has been

empirical in nature. Recommendations for l, functions are based on an

evaluation of the performance of the 11 function against other proposals

for a variety of error distributions and performance criteria. Several

studies have appeared in which a 11 function was proposed and then simulated

against various alternatives. The largest such study was done for Ml-esti-

mators of location. The Princeton Robustness Study (Andrews et al., 1972)

involved sixty-five estimators (of which eighteen were il-estimators),

fourteen error distributions, and eighteen performance criteria. No con-

census .as reached by the authors as to the superiority of a single

M-estimator, but this may be due to the stated goal of the study to find

estimators which were 95', efficient in the normal case relative to the

sample mean and performed well against all other alternative error distri-

butions. No such estimator was found. Stigler (1977) compared seven robust

estimators of location (of which five were M-estimators) on historical data

sets. All estimators performed well, but the data sets used did not contain

outliers, or other indications of non-normality.

Andrews (1974) gives the first empirically reasoned presentation of a

'' function for regression parameter estimation. Andrew's ip, the so-called

SINE function:

sin ( )

SI& < CT,


removes from consideration any residual which exceeds cv. In this way,

outliers are rejected automatically. Andrews demonstrates how this '1'

function finds the four outliers found by Daniel and Wood (1971) in the

stack loss data (Brownlee, 1965). The constant c is chosen by the prac-

titioner to control the amount of trimming. Justification for the method

is presented in terms of the stability of the moments of 1' under five

alternative distributions, its ability to locate outliers and remove them,

and its functional simplicity relative to the Hampel estimators (Andrews

et al., 1972).

A second estimator, the bi-square estimator, is proposed by Gross (1976,

1977). Its ', function is given below:



1| ,1 1

This estimator was used in an application by Beaton and Tukey (1974).

Gross (1976) compares this J' in the location problem with twenty-four other

estimators and seven underlying distributions in a simulation study to

determine which yields confidence intervals which are true.

Gross (1977) considers the bi-square estimator for regression under

different design conditions, sample size, and error distribution. lo

comparisons were made with other H-estimators.

ILl j c7T

Huber (1964) proposes a 9 function which is the HLE for a normal

distribution with exponential tails. The estimator, known as Huber Pro-

posal 2, has the following i function:

k > k

(4.3) 'i,) = | | k

-k A < -k

The value of k is chosen by the user. Many k values were tried in the

Princeton robustness study in evaluating the effectiveness of the function

for estimating location. Denby and Mallows (1977) have developed graphical

techniques to aid in the choice of k in the regression problem.

Two large Monte Carlo studies comparing ,' functions in regression have

appeared. Denby and Larsen (1977) compare twelve estimators (of which four

are M-estimators) under twenty-four conditions (three underlying error

distributions and five other factors either present or absent, using a

quarter replicate of the 3x2` combinations of factors) measuring effective-

ness by comparing mean square error. Andrew's SINE estimator was found

to be superior to the other M-estimators studied.

Pamsey (1977) studied the L(p) estimators introduced by Forsythe (1972).

Their p function is

(4.4) ) = sign(s).IAP-I

This family of estimators, indexed by p, includes ordinary least squares

(OLS), when p=2, and LAV when p=l. Asymptotic variance effects when p is

misspecified in fitting relative to a true value of p for the underlying

error distribution are studied in Chapter 3. Ramsey compared these esti-

mators to SINE and E, estimators. The p function for an E. estimator is
a Id

(4.5) ) = .e

Contaminated normal distributions of various degrees of contamination were

used as the error distributions. Several other factors were studied. SINE

and Ea estimators were found to be superior in efficiency compared to

L(p) estimators over a 'wide range of these conditions.

Several smaller studies have measured the performance of M-estimators

empirically. Coffmann and Whiteside (1980) compare OLS and median regression

(Sen, 1968). Dodge and Lindstrom (1981) study convex combinations of LAV

and OLS. Hill and Holland (1977) compare LAV and SINE when the errors are

contaminated normal and prefer SINE. Hughes (1980) compares LAV and OLS

on experimental data systematically subdivided into replicates. LAV had

smaller variance in its estimates and its estimates were more normal than


A few other studies have used M-estimates on data produced by real

life phenomena. Beaton and Tukey (1974) use a bi-square estimator on band

spectroscopic data. Agee and Turner (1978) use a Hampel function to

detect outliers in trajectory data at White Sands missile range. Rocke and

Downs (1980) compare trimmed means, Huber p, and bi-square estimators of

location on chemistry data expanding the work of Stigler (1973) who compared

robust estimates of location as mentioned previously. Lenth (1980a, 1980b)

has used M-estimators of location on directional data in problems of air-

craft location.

Empirical studies of regression estimators have not matched the Prince-

ton Robustness Study of estimators of location for thoroughness either in

the estimators, the alternative error distributions, or the number of

criteria with which the estimators can be compared. Many other M-estimators

exist which have never been tested empirically, such as several of the

minimax estimators described later in this chapter.

Questions regarding the performance of M-estimators have also been

handled empirically. Carrol (1979) used the jackknife technique (Schucany,

1979) to estimate the variances of regression parameter estimators when

the underlying distribution is asymmetric. Hill (1979) followed Huber (1973)

in comparing numerically alternative estimators for the variance-covariance

matrix of the regression parameter estimators. Uelsch (1975) used a Monte

Carlo study to find the lengths of confidence intervals for M-estimators.

Ypelar and Velleman (1977) compare the effects of high leverage points on

four M-estimators: Bi-square, Fair, Dennis and Wlelsch, and Huber.

Empirical studies provide much needed experience in the use of M.-esti-

mators, but have not provided general conclusions that would indicate the

use of a particular function. A practitioner would need to assume that

the data used in these studies, whether it was historical or generated

using pseudo random numbers, was similar in distribution to the data to

be fitted so that conclusions derived from the empirical work could be

applied. This assumption is still quite large given the currently available

empirical work.

Asymptotic Considerations

A second approach to choosing a ,, function would be to use an asymp-

totic argument. In Chapter 2 proofs were referenced which indicate that

for a given error distribution, all 'I' functions will produce estimates

which converge almost surely to the true regression parameter values.

Differences between the I'p function; involve the asymptotic variance of

the estimates, which is proportional to the AVF for the ip in use and the

true error density. Recall that

I 2 fd.

(4.6) AVF(ip,f) --=
[ i' fdL ]2

A 7p function could be chosen to provide low values of AVF for a class of

possible error distributions. Specifically, might be chosen to minimize

the average relative AVF, or the maximum relative AVF.

In Chapter 3, the AVF of the power family was presented for various

alternatives of v and f within that family. If the error distribution

can be assumed to be from the power family, and one wishes to use an L(p)

estimator, the best choice of power for -is .4, since that value minimizes

the average relative AVF across the range of powers for the error density.

The value .4 minimizes the maximum relative AVF at 1.18 (see Table 3.2).

The AVF of the symmetric GLF was also presented in Chapter 3. This

is a much broader class of distributions than for the power family. No

one value of (,2,J) could be found which would protect well against all

alternative pairs.

AVF values can be computed analytically for many combinations of ,

and f, not just combinations within a parametric family of distributions.

The AVF values indicate the sample sizes needed to get estimates with

equal asymptotic precision. The bound on an estimate of Bk, the kth element

of 6 is

C(X'X)-1 -AVF
(4.7) B =

To achieve equal bounds using two different i functions would require


(4.) AVF1 AVF2
1 F1v9

(4.9) -
2 AVF2

For example, in the power family, if the true power is .2 and 100 points

are used to estimate a with power of .4, then 182 points would be needed

to estimate 6 with the same precision using least squares.

Comparing AVF values will only be feasible for those distributions

and .' functions satisfying the regularity conditions for the existence of

the asymptotic normality of B*, that is

1. f symmetric.

2. p must be convex

3. f must be a continuous derivative of F.

4. 30 3 f(z ) : 0 and p(z0 + 6) + p(z" ) > 2p(z ) V, > 0.

Condition 1 above is a serious limitation in the application of asymp-

totic variance techniques to experimental data situations which may be

known to have asymmetric behavior. Asymptotic behavior for these 1p

functions has not been established.

For some functions, establishing AVF can be done analytically. For

least squares,

(4.10) ,2fd = f = Var(.)

(4.11) j fd, = fd = 1

so AVF(OLS,f) = Var(s) for any distribution f.

For LAV.

(4.12) J 2fda = ffd5 = 1

(4.13) f fd, = 2f(o)

so AVF(LAV,f) = 1/(4f(o)2) for any distribution f. Thus LAV will outper-

form OLS in asymptotic variance for any error distribution for which

(4.14) 1 < Var(A)

Asymptotic variance expressions provide an analytic means for comparing

.p functions, but are severely limited in the choices of p and f which can

meet the regularity conditions sufficient to insure an AVF of the form in


Other asymptotic behavior measures can be computed from the influence

curve of an estimator. Influence curves for M-estimators of location are

given by Equation (2.13). Hampel (1974) defines six additional optimality

criteria for estimators in terms of the influence curve. In addition, a

table is presented in which seventeen estimators are compared on eight

criteria at the normal distribution.

Minimax Estimators

For distributions in classes which satisfy the regularity conditions

of Theorem 2.7, minimax estimators exist. Huber (1964) introduced the

concept of a minimax M-estimator of location. Polyak and Tsypkin (1978)

extend the notion to M-estimators of regression parameters. When F is a

class of distributions satisfying the conditions of Theorem 2.7, fo E F is

the distribution with minimum Fisher information, and io = -f'/f then
00 0

(4.15) AVF( o,f) 5 AVF(i ,f ) Vf E F

Thus Qo will provide a minimax estimator of F. So for certain families, F

protection is afforded by using the Q. which is MLE for the least infor-

mative distribution in F. Ershov (1979) gives several examples of families

and their resulting minimax estimators (see Table 4.1 below).

Little work has been done in applying these estimators except in cases

where they correspond to estimators developed using other criteria. Classes

F with more specific restrictions should yield interesting choices for t.

Minimal estimators need to be refined. Classes of distributions can be

more completely specified than has been done in the past. Using calculus

of variation for finding new t functions should yield better estimators

the more completely one specifies the class. Error distributions which

are feasible for experimental data can be well specified as a class. These

new estimators need to be compared to previously proposed and numerically

tested estimators. Conservative confidence bounds imposed using the mini-

max principle need to be studied. Small sample properties of many minimax

estimators are unknown.

Table 4.1 Classes of distributions and their
minimal estimator

Class Estimator

All distributions with f(o) > 0 LAV
All distributions with Var(l) o: OLS
All contaminated normal distributions Huber Proposal 2
All distributions on [-a,a] i.(l) = tan(nL/2a)

Minimal considerations can only be applied within the classes of distri-

butions which satisfy the conditions of Theorem 2.8, the most serious

limitation of which is symmetry.

Adaptive Estimators

In the three approaches to choosing a '' function presented in this

chapter so far, the selection of '1' is made a priori to data analysis. To

avoid such a possibly arbitrary decision, one may wish to use a procedure

which leaves ip, and thus f, partially unspecified at the beginning of

fitting. This would allow the data to adapt the 1I function to one which

fits the data best. Two possible methods for achieving this exist. The

first method is to use a preliminary fit of the data to indicate which I'p

function is to be used in subsequent fitting. A study using such a tech-

nique is described below.

A second technique involves using a I' function which contains unknown

parameters. Andrew's SINE function contains a parameter c, which could be

fitted using a maximum likelihood procedure along .,ith regression parameters.

This technique was used in the Princeton Robustness Study of location esti-

mators. Chapter 5 contains the results of applying the technique in


Using a preliminary fit in order to choose a 1, function is described

by Moberg et al. (1980), who use LAV to develop residuals, then calculate

Q3 and Q4 as defined below.

(4.16) Q3 = U(.05) M(.50)
M(.50) L(.50)

(4.17) U(.05) [(.05)
U(.50) L(.50)

where L(y), U(y), M(y) are the arithmetic means of the smallest, largest,

and middle n of the ordered residuals, respectively. The location of

(Q3,Q4) in a plane divided into five regions is then determined in order
to select the i function which corresponds to the region in which (Q3,Q4)


The performance of this adaptive procedure was studied by comparison

to OLS, LAV, SINE, and a one-step SINE estimator. Ten error distributions

were used, and the measure of performance was a median efficiency ratio

(HER) based on total squared error (TSE), where

(4.18) TSE = 7 (st )2

(4.19) HER = median (TSE of OLS)
median (TSE of adaptive procedure)

The adaptive procedure outperformed all competitors except the case of

least squares with normal errors. The median efficiency ratio of the

adaptive procedure ranged from .35 to 47.197.


In Chapter 4 several techniques for choosing a p function were pre-

sented. Each technique included choosing a i function either from a

limited set of functions based on a calculation from a preliminary fit,

or a priori according to a previously established performance criteria.

In this chapter a new technique is introduced which does not involve

choosing a function.

Continuously adaptive M-estimation is a method of simultaneously

fitting a regression model, as in Equation (1.1) and estimating parameters

of the error distribution. In OLS, one parameter, c2, of the error distri-

bution is estimated. The basic shape of the error distribution is fixed,

as in any M-estimation. Table 1.1 gives the correspondence between i

functions and the distributional assumptions they represent. One would

prefer to use a regression fitting method which did not presuppose a shape

for the error density. Existing adaptive methods (Hogg, 1974) fix a finite

set of alternative error distributions. The new technique will allow the

error distribution to be any member of a family of distributions which can

be written using one functional form.

A family of distributions which would be suitable for a continuously

adaptive M-estimator must have the properties listed below.

1. It has a single functional form.

2. It covers a wide range of skewness and kurtosis values.

3. It has known distributional properties.

The first requirement guarantees that an algorithm can be produced

to fit the parameters, and is a natural assumption for the experimenter.

Distributional families such as the Pearson curves (Solomon and Stephens,

1973), Johnson curves (Johnson, 1949), and Burr distributions (Burr, 1942)

are not appropriate, as a member of the curve family must be chosen prior

to estimation.

The second requirement insures that the error process can be adequately

modeled through its first four moments. Pearson, Johnson, and Burr (1979)

compare distributions having the same first four moments. They are encour-

aged about the similarity of the percentiles of such distributions, but

warn about tail behavior.

The third requirement is an obvious prerequisite to drawing conclusions

from such a fit. To make confidence intervals and detect outliers using

probability statements, we need to have knowledge of distributional properties.

Tadikamalla (1930) reviews six available techniques for generating non-

normal variates using a computer. The generalized lambda family of distri-

butions (P.amberg et al.. 1979) and the Fleishman proposal (Fleishman, 1973)

have a single functional form and cover a wide range of skewness and kurtosis

values. Fleishinan suggests generating non-normal variates using

(5.1) = a + bz + cz2 + dz3

where z is a normal random variate with mean 0 and variance 1. Tables are

presented relating the parameters a, b, c. and d to the first four moments

of the distribution. Nlo analytic results are available. Io tail character-

istics are known.

Ramberg et al. (1979) present a distribution parameterized by its per-

centiles. This four parameter family is given by

(5.2)= 1 p)

(5.3) f() 2
\3P 3-1 + \ (1 p)\4-1

where p is between 0 and 1. Random variables are generated by first

generating p as uniform random variate, and then using (5.2). Density

curves, such as Figure 2.1, can be drawn by varying p from 0 to 1 and for

each p, calculating and f(') from (5.2) and (5.3) and plotting the pair.

The properties of this family are covered in depth later in this chapter.

Having chosen a suitable parametric representation for the error dis-

tribution, a method of estimating the parameters must be selected. .i-esti-

mation involves estimating the regression parameters from Equations (1.6).

For given A1, X2, 2 3, and X4, 4 (A) could be calculated from (5.3) and used

in (1.6). When 11, X2, 2 3, and 14 are unknown, as in a data fitting prob-

lem, simultaneous estimation of B and X = (1, '2'X ,A'4) is necessary.

Difficulties in estimating B and a2 for nonhomogeneous i were referenced

in Chapter 1. One method of estimating the parameters is maximum likelihood

estimation. One needs to maximize the likelihood function

(5.4) L = I f(Y x.6;X)
i=1 1 ~ ~

over B and X. Using the maximum likelihood theory, one can obtain confidence

intervals for B and A, and covariance estimates. Maximum likelihood esti-

mation is known to be sensitive to departures from the assumption that the

likelihood is of the form in (5.4) (Huber, 1965). In data analytic situa-

tions, the assumptions implicit in (5.4) are quite reasonable, including

many possible error distribution alternatives. The assumption of a linear

model is not needed in the development which follows. If y = m(x .6) + x

is the assumed model, the likelihood can be written

(5.5) L = 1 f(y m(x. );.')

and maximized over 6 and .. An example of a nonlinear fit is given in this


Properties of the generalized lambda family are described in the next

section. General applicability of maximum likelihood estimation and conse-

quences follow. Algorithms for calculating (5.4) are presented followed by

examples of the procedure being used to fit a variety of different problems.

The Generalized Lambda Family

Ramberg et al. (1979) introduced the generalized lambda family (GLF)

as a generalization of the lambda family proposed by Tukey (1960). Equations

(5.2) and (5.3) define the density which does not exist in closed form.

Expressions for the first four central moments are reproduced below.

(5.6) u = 1 + A/,.

(5.7) 2 = (B A )/.1

(5.8) p3 = (C 3AE + 24A3)/,

(5.9) U4 = (D 4AC + 6A2B 3A')/.!


(5.10) A = 1/( + ,) 1/(1 + .)

B = 1/(1 + 2.,) 2 EETA(1 + .3,1 + 4) + 1/(1 + 21)


(5.12) C = 1/(1 + 3.3) 3 BETA(1 + 2-3,1 + 4 )

+ 3 BETA(1 + .3,1 + 2-\) 1/(1 + 3.1 )

(5.13) D = 1/(1 + 4.,) 4 BETA(1 + 3\3,1 + .4)

+ 6 BETA(1 + 2.'3,1 + 2)4) 4 BETA(1 + I3,1+ 3\ )

+ 1/(1 + 4.1)

and BETA(a,b) = r(a)r(b)/r(a + b).

These equations imply that to use GLF distributions with mean zero, we

must have
-1/12 I/ 2
(5.14) -/2 + 2
1 + .13 1 + A4

The skewness -a, = ul,/,. and kurtosis a = /o" are functions of \3 and 4


Lam, Bowman, and Shenton (1980) present contour plots of 3, and a
for axis of -j, and 6, where y = .' and 6 = .3/3 When 5 = 1, the distri-

bution is symmetric. For the original parameterization, the skewness and

kurtosis contours are given in Figures 5.2 through 5.5. Figure 5.1 shows

the area of the (ac3',4) plane covered by GLF distributions.

By matching the first four moments of a distribution using the tables

of Ramberg et al. (1979), one can get a A vector which will produce a GLF

distribution which will be close to the desired distribution. In Table 5.1

several distributions are given with their GLF approximations.

In a similar manner, various p functions can be approximated by matching

the first four moments of their corresponding maximum likelihood densities

from Table 1.1 to a density from the GLF. The i function for a density from

the GLF can be evaluated numerically using Equation (1.7). Table 5.2 gives

several GLF approximations to p functions from Table 1.1.

0 +2

Figure 5.1 Coverage of (cT3,,4) by GLF distributions.






LIi U-


I- ,-I


I -

1' ''

j r




1 I


J -



I i i I .-l


I '

I -










:I i

I I"
[ I-'- I,



L i

i I

I :-








I 1-J



; i L

L i


I _




I-'' -

-. en


I -


'"1- -*


'' '

Table 5.1


Normal (0,1)

Shifted Exponential


Gamma (2,2)

Uniform (-1,1)

Table 5.2

li Function



l/(I + (L/5)5)

GLF approximation to densities.

GLF Approximation

"1 "2 "3

0 .1914 .1341

-.993 -.1081 -.0407

0 -.0659 -.0363

-.782 .0379 .5603

-1 1 1

GLF approximations to p functions.

GLF Approximation

1 '2 '3 4

















Only certain values of .\ will produce a density function which is

greater than or equal to zero over the real line, and integrates to one.

The acceptable values of .1 depend only on "3 and .. The regions of

acceptable values are shown in Figure 5.6. The formula for the kth non-

central moment is given by Pamberg et al. (1979) as

kk Ik3 ik i)+1 ) r( i+1)
(5.15) E(ak) (- )i ( 3(k-i l 4
i 0 r(., k + ( ,-,3)i + 2)

when -1/k < min(3,'.4) all positive moments of order less than or equal

to k will exist. For this reason, the contour plots of skewness and

kurtosis were drawn in quadrant I of the ( 3,'4) plane, and for -.25 <
"3' 4 0 in quadrant III of the (''3, 4) plane, where skewness and kurtosis


The range of the distribution is also a function of .. Table 5.3

summarizes the ranges. In quadrant I of the (\,' 4) plane (row 1 of Table

5.3), GLF distributions have finite range determined by 1 and .2. The

normal distribution is closely approximated by a GLF distribution in quad-

rant I, with .1, = 0 and '2 = .1941 (see Table 5.1). These parameters

determine the range for the GLF approximation of the normal distribution

to be (-5.152,5.152). The normal distribution has .9999996 of its prob-

ability mass within this interval. Range dependencies on are given in

Table 5.3.

Figure 5.6 and Table 5.3 indicate a shortcoming of the GLF for its use

with maximum likelihood estimation. To maximize Equation (5.4) over E and

% involves maximizing the likelihood over the regions shown in Figure 5.6.

No restrictions on 6 will be needed. Restriction of 12 will be-as shoun in

Table 5.3. Each of the four regions in Figure 5.6 could be used individually

Figure 5.6

Moments of GLF distributions.

Range of GLF distributions.


" 1 /.2



> 0

> 0

= 0

0: 0


- 0:.

> 0

> 0

> 0


< 0

*: 0


~ + 1/2




> 0



*-, 0

Table 5.3

to develop maximum likelihood estimates. The overall maximum of the four

regions is the GLF maximum likelihood estimator.

Table 5.3 presents an additional difficulty in that the range of the

error distribution may depend on the parameter values. This will violate

the assumptions sufficient to insure that the maximum likelihood estimators

6* and \* will be asymptotically normally distributed and efficient. In

the third quadrant of the ( 3''4) plane, the range is (-x.,+o) independent

of \.

Maximum Likelihood Estimation

The use of maximum likelihood estimation will allow general results

for these estimators to be applied to 6* and '.. Let Q = (6,O6i ... ,

',2'',3,' 4), 0, the parameter space, and let 6* be the maximum likelihood
estimate from Equation (5.4). Then if

(5.16) E log L = 0 i = 1,2,...,k+4

(5.17) -E log L = E fa log L log L i,j=1,2,...,k+4
ae. ~e.6e. 9e. J
1 J 1 J

e* will be asymptotically normally distributed with mean 0 and variance

covariance matrix whose elements are given by (5.17). 9* is also efficient

(Kendall and Stuart, 1973).

For the estimation problem of (5.4), these regularity conditions hold

for A3 < 0, X4 < 0. When either X3 or A4 is positive, the range of the

distribution is a function of A and such a relationship will violate (5.16).

In the third quadrant of the (X3,X4) plane, the GLF will have a bounded

continuous density which is twice differentiable. This will imply (5.16)

and (5.17). One can use (5.17) to obtain estimates of the asymptotic

variance covariance matrix elements by evaluating 3 log L/5.i numerically

as needed (Kennedy and Gentle, 1980).

The asymptotic results of Chapter 2 do not apply to continuously

adaptive M-estimators, since the ip function for these estimators is not

fixed, but rather depends on \.

Evaluating the Likelihood

Evaluating Equation (5.4) in a given regression problem is not straight-

forward. For a specific set of regression data, v. and x. are known. The

likelihood function depends on 6 as defined previously. At a given 6,

calculation of the likelihood involves several steps.

(5.1S) Evaluate r. = v. x'C

(5.19) Evaluate f. = f(r ;.)
1 i 1
(5.20) L = n f.

Calculation of f. from r. and can be done using three different
1 1 ~
algorithms, recalling that the relationship in (5.19) is defined parametri-

cally in terms of a percentile pi by Equations (5.2) and (5.3). The problem

reduces to finding the root of a nonlinear function, g(pi), where

(5.21) g(Pi) = p 3 (1 pi) + 1' r.

We wish to find p. so that

(5.22) g(pi) 0

subject to

0 r p. i 1


where :\ is given by (5.14) so that E(L) = 0. Once pi is found, (5.3) can

be evaluated for f.. Functions given by (5.21) are shown in the Appendix

for various combinations of :.. Without loss of generality, the curves are

presented for .1 and ri both set to zero.

A straightforward algorithm for computing pi so that g(pi) = 0 given

.2, 3 ,4, and ri is presented below

Algorithm 5.1 (Solve (5.22) using binary search)

(5.24) Set plo to 0. Set phi to 1.

(5.25) Set p to (plo + hi)/2

(5.26) If g(p) > 0 then set phi to p

If g(p) = 0 then stop

If g(p) < 0 then set pl to p

(5.27) Go to (5.25)

This algorithm will converge, but requires many time consuming

evaluations of (5.21). A faster, but less reliable algorithm is an

adjusted -. Newton algorithm.

Algorithm 5.2 (Solve (5.22) using an adjusted flewton method)

(5.28) Set p to .5

(5.29) If g(p) = 0 then stop

(5.30) Set p to p g(p)/g'(p)
P 1
(5.31) If P > 1 then set p to p+(l-p) new
new new Pnew p

(5.32) If p < 0 then set p to p+p.
new new P-Pt
p Pnew
(5.33) Set p to pnew Goto (5.29).

The adjustments in (5.31) and (5.32) are necessary to meet the

conditions of (5.23). The Newton step in (5.30) may produce a pnew outside

the permissable range of values for the solution. If pnew is too high,

the adjustment in (5.31) moves p close to one if the Newton value for

new was much larger than one. If pnew is as far past one as p is less

than one, then pnew becomes half the distance from p to one. Proportional

adjustments are made in an analogous manner in (5.32) when pnew is less

than zero.

This algorithm is faster than Algorithm 5.1 if the root is near p=.5.

As the root moves closer to 0 or 1, Algorithm 5.1 is faster. The number

of evaluations of (5.29) required by each algorithm under various conditions

are given in Table 5.4. Without loss of generality, i2 is considered fixed

at 0.1. While Algorithm 5.2 is often faster than Algorithm 5.1, its

advantage diminishes whenever the root is near the boundary. As a general

algorithm, average time over the range of possible roots is lower with

Algorithm 5.1.

For calculation of the likelihood at a particular value of I and 1,

(5.21) must be solved repeatedly for each point (yi,xJi in the regression

data. This process can be greatly abbreviated by noting that when E and .

are fixed, the value of g(pi) depends only on ri, and ri effects g(pi) in

a simple linear manner. This suggests that if an approximation to g(pi)

can be found which can easily be solved for a particular r., then altering

the approximation for a subsequent r. can be achieved by a simple linear

translation. The highest degree polynomial which can be solved using

explicit formulas is a quartic. Examinations of g(p) for various -, and

initial approximations involving cubic polynomials indicate that a cubic

polynomial approximation is sufficient. Comparisons of improved accuracy

Table 5.4 Comparison of Algorithms 5.1 and 5.2

Evaluations of (5.21)


































> 100






















of a quartic approximation versus the faster cubic approximation are a

subject for future research.

Suppose h(p) is a cubic polynomial which closely approximates g(p) for

a particular r., that is

(5.34) g(p) h(p) = a0 + alP + a22 + ap3


(5.35) gi(p) = p (1 p)4 + ,2 (1 ri)


(5.36) gi(p) gj(p) = .2( ri)

So if h.i(p) gi(p), then

(5.37) g (p) hj(p) = a0 (r ri) + alp + ap + a

that is, only the constant tern in the cubic approximation is to be altered.

This effect is shown for a particular gi and gj in Figure 5.7. While this

simple linear translation is a trivial relationship between gi and gj, it

does not suggest a simple relationship between the root of g. and the root

of gj. If hi and h. are suitable cubic approximations, then the roots of

h. and h. are easily found and will approximate the roots of g. and gj

respectively. Thus, we can, without loss of generality, find a cubic

approximation to g(p) for ri = 0. Call this function to be approximated g0,

that is

(5.38) gO(p) = p 3 (1 p)'4

Two choices for finding approximating cubic polynomials have been

considered. The first is a Taylor series of g0 expanded around p = 1/2,


E '


' :, i

f -'



i "
,4 \

L r


\ r

which is the center of the neighborhood in which g0(p) is to be approxi-
mated. The derivation of the cubic Taylor series is given below.

g0(p) go(1/2) + go(1/2)(p 1/2) +

go (1/2)(p 1/2)3

g (1/2) = (1,'2)3 (1/2) + 1 "

g0(p) = ,p3 + 1(1 p)-

g(p) = 3(3 1) p 3-2 + 4(;-)(1

g0'(1/2)(p 1/2)

(p) = 13(1 3 1)(.3-2) p 3 3 + 4( -4 1)( 2)(1 p)'4

(p 1/2)2 = p2 p + 1/4

(p 1,2) = p3 (3/2)p2 + (3/4)p (1/8)


g(0) := (1/2)

(5.46) g0(p) g(0) +

g(1) = )(2) g(2) = (/2)
9 0 .1 0 =~1'2

(1)(p 1/2) +


, g(3) = /2)
0 g

[p2 p + (1/4)]

+ g-

(5.47) g0(p) z

[p3 (3/2)p2 + (3/4)p (1/')]

+ p 1g(1)

S (0) 1 (1) 1 (2) 1 (3
0 T81 I



(5. 1)







- p)'4-2

+ p -(2) 1 (3
+ p _ -^ 9 1

The second method of choosing an approximating cubic polynomial for gO(p)

is to evaluate 9g(p) at some points throughout p i [0,1], and then estimate

the coefficients of the polynomial using least squares. Least squares wias

chosen on the basis of ease of computation. Enough points must be chosen

to insure a good fit, but not so many as to slow the estimation of the

parameters. After trial and error, seven points were chosen, which under

a variety of conditions, produced good approximating cubic polynomials.

The choice of the seven points involved several considerations.

1. p = 1/2 must be included to get a good approximation in the center
of the range.

2. The seven points must be symmetric about p = 1/2 since no a priori
knowledge is available about Pr(r > 0).

3. The seven have points close to zero and one to get a good approxi-
mation in the tail of the error density.

The seven points chosen to form the design are given as vector p below.

(5.4 ) p = .001

This produces the P matrix, P'P, and (P'P)-'

(5.49) P = 1 .001 10-6 10-9

1 .10 102 10-

1 .25 .0625 .015625

1 .50 .25 .125

1 .75 .5625 .421875

1 .90 .81 .729

1 .999 .998001 .997003




below can

P'P =

(P"P)- = 83901



















matrix is constant so that to evaluate the likelihood, the steps

be used.

Algorithm 5.3 (Calculation of Likelihood)

(5.52) Evaluate gO(p) for each point in P. Call result g.

(5.53) Estimate the coefficients of h (p) using (5.57) or (5.47).

(5.54) For each r., evaluate p. so that h.(pi) = 0.

(5.55) For pi, calculate fi using (5.3)

(5.56) L n f..
i- 1
i=l 1

Let a be the coefficients of h('p). Then step (5.53) can be accomplished

via least squares as

(5.57) a = (P'P)-1 P'g

Since (P'P)-' and P' are fixed, finding an approximating cubic polynomial

can be done by a simple linear transformation of the vector g.

The solution of h. for p. such that h.(pi) = 0 can be done explicitly

using Cardan's formulas from Selby (1971). If

a a a
h.(p) 0 + p + p2 3 = 0
a3 3 a3

Rewrite h.(p) as

hi(p) = b0 + blq + q3

p = q 3a

1 a13
b 2
0 7

1 al
b -
1 3 aa

Further, let



VA =

al2 "O
- 9 + 27 -


-/ + b
bo I 3b1
+ 27
4 27


V 7 27

The only real root of the cubic equation (5.59) is

q = A + B, if

b0 b'
S 27
4 27 > 0

If (5.66) is not met, then either two real roots exist, or three

imaginary roots exist. If either of these conditions exist, the algorithms

under consideration will return 0 for fi to indicate that X2, 3, and X4

do not admit a solution with positive likelihood for the particular residual.










Cardan's technique can be used with a Taylor series cubic polynomial,

or with a fitted cubic polynomial. In either case, only a0 depends on r..

For fixed 8 and ., h. varies only in a Calculation of terms in (5.63)
1 0
and (5.64) is reduced by recalculating only those factors which change

when ri changes.

In Table 5.5 the seven point cubic approximation method is compared

to the binary search method of Algorithm 5.1. Data is from Example 7,

later in this chapter. Timings are real elapsed clock time in seconds.

See Appendix A for a description of the computing facilities used.

As can be seen from Table 5.5, the approximate method is five to fifteen

times faster than the exact method. In the rest of this work, the seven

point cubic approximation is used to estimate the likelihood.

In terms of closeness of approximation to the exact method, the seven

point cubic approximation is far superior to the third order Taylor series

expansion. Figure 5.8 shows both a Taylor series cubic polynomial approxi-

mation to a g function for the normal distribution as approximated by

GLF(0,.1974,.1341,.1341) for a residual of 1.645. The true root is, there-

fore, at p = .05. For this example, the Taylor series fails to give a

root in the interval [0,1].

Solving for Mlaximum Likelihood

In the preceding section a seven point cubic approximation algorithm

was presented which can quickly evaluate (5.4) for a given point 6 in the

parameter space. In this section methods will be presented for finding 6

so that (5.4) is maximized.

Solution of (5.4) using numerical methods can be accomplished using

gradient techniques, such as Fletcher and Powell (1963), or derivative-free

techniques. Closed form expressions for the partial derivatives of L with

Table 5.5 Calculating the likelihood.

Seven Point

-1n L




















Algorithm 5.1

-1n L




















































.2 (7


N %

\ \\













respect to each parameter do not exist. Numerical derivatives could be

used, but investigation of likelihood contours indicated that a gradient

technique may have severe difficulty with the number of turns in the

surface. Derivative-free techniques are, in general, slower than gradient

techniques, but are more reliable when the function being optimized is

not smooth (Chambers, 1979).

Derivative-free techniques can be classified as surface following or

direct search. A surface following algorithm uses a series of function

evaluations to move along the surface of the function being optimized. An

example of a surface following algorithm is the Simplex algorithm of

fielder and Mead (1964). While not using derivatives, a surface following

algorithm is subject to a similar disadvantage--susceptibility to local

minima. A direct search algorithm involves evaluating the function at a

continuingly evolving set of parameter points throughout the parameter

space. A simple algorithm might involve a Euclidean grid over the parameter

space and the evaluation of the function at each point on the grid. Such

a technique is not susceptible to local minima, but has difficulty with

large changes in function values which occur between grid points.

A direct search technique suitable for arbitrary functions is the

controlled random search technique (Price, 1977). The algorithm is

presented below.

Algorithm 5.4 (Controlled Random Search)

(5.66) Select k points O from ,'_. Call them S.

(5.67) Select j points from S. Call the centroid d.

(5.68) Select 1 point from S. Call it P). Calculate a new point via

(5.69) j = 2.- If M :0, goto (5.67)

(5.70) Evaluate L(6). If L(a) > min L(), then replace y with 0.

Otherwise goto (5.67).

/ miln max -
(5.71) If min L(a) max L(0) < C stop.

Otherwise goto (5.67).

The number of points in S (called the storage), is arbitrary. Values

of 20-j, where j is the dimension of 9, have worked well in practice.

This algorithm has been applied to a wide range of optimization problems

(Khuri and flyers, 1981; Khuri and Conlon, 1981) and is well-suited for

problems with difficult boundary conditions and multiple optima.

For solving the maximum likelihood problem, Price's (1977) procedure

is quite slow, requiring many evaluations of L(a), but the procedure

invariably converges. Simplex algorithms were implemented and tested, but

could not handle the lack of smoothness in the surface of L(6).

An additional advantage of Price's procedure over other optimization

algorithms is its lack of a single starting point. Choosing a starting

point for a gradient technique or a Simplex algorithm may predetermine

the result if local minima exist. Andrews (1974) developed a median

regression estimator to use as a starting point for iteratively reweighted

least squares in finding estimates using a SINE ',' function. Holland and

Welsch (1977) recommend using a LAV estimate as a starting point. Birch

(1980a) reviews the effect of starting points on IPWLS. Price's procedure

requires only that the parameter space Q, be defined so that (5.66) can

be accomplished, and the test of (5.69) can be performed. In most cases,

0 can be a j-dimensional rectangle. The size of the rectangle will control

the rate of convergence.

In continuously adaptive M-estimation using the GLF as an error

family, 0 need only be defined so that the true values of 5 and \ are

sure to lie within 0. This can be done by using initial least squares

estimates for 6, and centering the limits around e using a very small

confidence coefficient, a. Even under serious departures from normality,

if ai is chosen small enough we can be reasonably sure that the true 6 lies

in the confidence interval. For ., constant bounds given below will suffice.

(5.72) 1/,10o < I 2 < 10

(5.73) 0 < .3,.A 4

The assumption of (5.72) is that the least squares estimate of the

standard deviation of the error process, a, is within a factor of ten of

being correct. The assumptions that A3 and 'N are less than one serve

only to restrict fitting to densities which are not U-shaped. If during

fitting, the estimated values of the parameters are close to a boundary

of 0, the boundary could be relaxed. The bounds in (5.72) and (5.73)

are given purely for computational convenience. The true bounds are

given for k. by Figure 5.6, and for 6, the value may lie anywhere within

A listing of a FORTRAN program to perform the continuously adaptive

M-estimation fitting of arbitrary linear models using GLF densities, a

seven point cubic approximation to g0(p), and Price's procedure is given

in the Appendix.

The algorithms of this section can easily be extended to cover non-

linear models. One can write

r. = y m(x5 8)


and replace (5.18) in the estimation of the likelihood. The remainder of

the algorithms are unchanged.


Continuously adaptive M-estimation is demonstrated in the examples

which follow. Problems involving experimental data and problems involving

generated data are shown. For each problem, the fit is giien along with

a comparison to least squares estimates. Data for each example are given

in the Appendix. Symmetric and asymmetric error distributions are used

in the generated data problems. Both designed and undesigned situations

are considered. The examples are summarized in Table 5.6.

Because of the large amount of computer time required to calculate

a continuously adaptive M-estimate (CAM), no simulations have been done

to date. The examples used in this chapter were chosen to demonstrate

the ability of CAM to perform under a wide variety of modeling situations.

For each example, a description of the source of the data is given.

Estimates of the regression parameters are given for OLS fits and for CAM

fits. In addition, for CAM fits, the estimates of are given with a plot

of the corresponding error density. Since CAM admits asymmetric solutions,

no closed form expressions for standard errors of the regression parameters

from the literature of rl-estimation apply. Bootstrap estimates (Carrol,

1979) would, for CAM, be far too costly. Standard errors for the CAM

estimates remain unknown.

The first example consists of fifty points, each representing a

country. Four variables measuring economic attributes are used in a first

order multiple linear regression model to predict the country's average

aggregate personal savings rate. The fits are given in Table 5.7. For

the least squares fit, the regression parameter estimates and their standard

Contents of examples.


Belsley (1980)

Brownlee (1965)



Design Model

none mult. linear

none mult. linear



3r full quadratic

23 full quadratic

none simple linear

none simple linear

none simple linear

none nonlinear




sym GLF

asym GLF

sym GLF

asym GLF


contaminated normal














Table 5.6

OLS, CAN for Example 1.


24.589 .47358

-.36548 .10018

-1.7135 .49368

-.003698 .39483


4000 iterations

In L = 138.2











Table 5.7

errors are given. For continuousl, adaptive M-estimation (CAM), the

regression parameter estimates are given, as well as estimates of the

parameters of the error density. The graph of the estimated error density

corresponding to these parameters is shown in Figure 5.9.

The iteration count given in Table 5.7 is the number of times a new

parameter point succeeded in producing a negative log likelihood smaller

than the previously largest held value. The convergence criteria for CAM

in all examples was

(5.75) LI LJ

(5.76) M = ax -In L( ) L = m -In L()

The OLS and CAM regression parameter estimates in this example are

quite close to each other, indicating that CAI can give estimates similar

to OLS when the data are symmetric and mound-shaped.

Figure 5.9 indicates that little information is available about the

tails of the error distribution. There are no turning points in the

estimated density. In this example, (..3,".4) = (.49363,.39433) which gives

a density with a skewness close to zero, and a kurtosis close to 2.2,

according to tables in Ramberg et al. (1979). CAM estimates the true

error in this example to be lighter tailed than the normal distribution.

Example 2 is from Brownlee (1965), and was analyzed by Andrews (1974).

Twenty-one points on three variables are analyzed using a first order

multiple linear regression model. Table 5.8 gives the results of the OLS

and CAM fits.

The graph of the error density corresponding to the X parameter

estimates is given in Figure 5.10. This density is very close to a uniform

distribution on [-6,6]. This is quite reasonable given the lack of data.


L _





r a








:-' '-c-

Table 5.8 OLS, CAI for Example 2


8 s.e. 8

-39.9197 11.896 -37.225 -.080009

.7156 .1349 .55336 .20396

1.293 .3680 1.8710 .92584

-.1521 .1563 -.20964 .98833

5673 iterations

In L = 43.06

University of Florida Home Page
© 2004 - 2010 University of Florida George A. Smathers Libraries.
All rights reserved.

Acceptable Use, Copyright, and Disclaimer Statement
Last updated October 10, 2010 - - mvs